Are localized voices the next frontier for text-to-speech?

According to the 2020 Spoken Audio Report, published by NPR and Edison Research earlier this month, we are listening to more spoken audio than ever before. Spoken audio's share of listening has grown 30% over the past 6 years, and 8% in the last year alone.

Text-to-speech technology has come a long way in the last 6 years. Gone are the days when interacting with automated voices was confined to navigation, service centres and buggy early AI assistants. Text-to-speech now covers 50+ languages, and many more voices, tones and styles.

SpeechKit works primarily with news publishers who, until now, have been limited to off-the-shelf voices provided by Amazon Polly and others. These voices have worked well to increase engagement with news articles and have given readers the option to listen to articles as audio.

However, there’s room for improvement.

Cape Town, home of News24's local voice. Image credit Zoë Reeve.

Resonance keeps us engaged

All brands, whether in news media, consumer goods or any other sector, want to talk to their audience in a voice that resonates. This is the sweet spot of accent, intonation and style that satisfies listeners. It’s why great voice artists can charge what they do and, as we’re finding at SpeechKit, it’s what keeps listeners engaged with audio articles.

Studies show that listening comprehension improves when a sentence is spoken in the listener’s native accent. We’re more receptive to voices that carry meaning for us. I spent much of my childhood in Australia and now live in the UK full-time, yet my Siri is set to speak to me in an Australian accent. For one reason or another, it makes that experience personal: it gives it context and meaning. I find myself enjoying the responses to my (less than meaningful) requests.

Custom localized synthetic voices

We’ve been helping publishers adopt audio articles for the past 3 years and we’re making good progress. Most recently we’ve started to develop custom synthetic voices for customers who want a voice that better resonates with their audience. Our first publisher, News24, South Africa’s largest news platform, needed a South African-accented voice that could pronounce local names in Zulu, Xhosa and Afrikaans.

We developed the voice using a new machine learning technique for modelling realistic voices, which requires less training data than conventional methods. The voice was launched as a premium feature on a new digital subscription product at the beginning of August. During that month we observed a 407% increase in audio engagement on News24.com when compared to publishers using conventional text-to-speech voices. Furthermore, over the past two months the average listen length per article has increased to 2 minutes and 12 seconds.

We’re confident that custom voices, tuned to the ear of the audience, are going to bring new growth to text-to-speech audio articles. Advances in voice training are delivering the efficiencies needed to allow for specialization and giving publishers the quality they need. Technology is allowing us to address the subtleties of voice, encouraging the further adoption of audio articles.

More about SpeechKit

At SpeechKit we’re helping hundreds of news publishers automate audio versions of their news articles. Sign up for a free trial and instantly start engaging the audio generation.


James Macleod

Co-Founder @ SpeechKit