TikTok's text-to-speech feature lets creators add virtual voiceovers at the touch of a button. It makes videos more accessible and eliminates the need to read.
So why do so many people hate on it?
Well, to start with, it's pretty common for TikTok's text-to-speech to get pronunciations wrong.
'Drop the mic' comes out 'drop the mick'.
It can't work out whether 'read' should be pronounced like 'red' or 'reed'.
And it definitely can't handle 'quinoa'.
The lack of voice options is a problem, too.
Creators currently have access to just one (region-dependent) voice. This can create a jarring effect for the listener, where the voice doesn't match up with the message.
Take, for example, Scottish TikTok. As on Scottish Twitter, lots of captions are written in regional dialect. They simply wouldn't work with the British-English text-to-speech voice.
The North American text-to-speech voice has a "Valley Girl" speech pattern. Not only is "Valspeak" highly stigmatized in the US, but its chirpy tone simply feels out of place in a lot of TikTok content.
And it doesn't help that this voice replaced a better-liked one, which was removed because the voice artist said she never gave permission for her voice clone to be used.
TikTok's synthetic voices sound robotic, too. Even if you ignore mispronunciations, their tone and inflections often sound very unnatural.
All of this is especially problematic when you consider that TikTok is a social media network. It's personal by nature. When creators use a synthetic voice rather than their own, it reduces the human connection.
Plus, the read-aloud feature offers limited benefits to TikTok viewers. Only small amounts of text appear on videos and the platform is highly visual anyway, so it isn't empowering users to escape their screen or multitask. Nor is it having a huge impact on accessibility.
Users can't really opt out of listening, either, as text-to-speech is played automatically if enabled by the creator.
Is this the best that TTS has to offer?
In a word, no.
TikTok's text-to-speech feature is no doubt impressive. And with 1.4 billion views on #texttospeech alone, it's almost certainly raised awareness of the technology's benefits. It's updated perceptions, too (for many, synthetic speech still brings Stephen Hawking's voice to mind).
But it doesn't showcase the best of what modern text-to-speech has to offer. Which is fair enough: it's not a major feature, and the current version is fit for purpose.
The most advanced text-to-speech can interpret and read text like a human. Often better. And while that might not be necessary on a social media network, it's crucial when converting articles, guides, and other written content into audio.
At SpeechKit, we use natural language processing (NLP) to apply Speech Synthesis Markup Language (SSML) to text inputs. This acts like a virtual voice director, telling the AI voice how to pronounce words like 'mic', 'quinoa', and 'read' properly.
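To make the idea concrete, here is a minimal Python sketch of how SSML tagging might work. It is not SpeechKit's actual pipeline: the `LEXICON` entries, the `PAST_CUES` word list, and the `to_ssml` function are all hypothetical, and a real system would use a proper NLP model rather than a simple look-behind rule. The `<speak>` and `<phoneme>` elements themselves come from the W3C SSML standard.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical pronunciation lexicon (IPA), for illustration only.
LEXICON = {
    "mic": "maɪk",
    "quinoa": "ˈkiːnwɑː",
}

# Crude, assumed cue words hinting that "read" is past tense ("red").
PAST_CUES = {"have", "has", "had", "already", "was", "were"}


def phoneme(word, ipa):
    """Wrap a word in an SSML <phoneme> tag carrying its IPA pronunciation."""
    return f'<phoneme alphabet="ipa" ph="{ipa}">{escape(word)}</phoneme>'


def to_ssml(text):
    """Return the text as SSML, with pronunciation hints for known words."""
    # Split into alternating runs of word characters and everything else,
    # so spacing and punctuation are preserved exactly.
    tokens = re.findall(r"\w+|\W+", text)
    out = []
    prev_word = ""
    for tok in tokens:
        key = tok.lower()
        if key in LEXICON:
            out.append(phoneme(tok, LEXICON[key]))
        elif key == "read":
            # Disambiguate the homograph using the previous word only.
            ipa = "rɛd" if prev_word in PAST_CUES else "riːd"
            out.append(phoneme(tok, ipa))
        else:
            out.append(escape(tok))
        if re.match(r"\w", tok):
            prev_word = key
    return "<speak>" + "".join(out) + "</speak>"


print(to_ssml("Drop the mic"))
print(to_ssml("I have read it, and I will read it again."))
```

In this sketch, 'mic' and 'quinoa' are fixed by a dictionary lookup, while 'read' needs context: the one-word look-behind stands in for the part-of-speech analysis a real NLP step would perform.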
We also use advanced synthetic voices, developed using deep learning algorithms. These AI voices are trained on real human speech, ensuring more naturalistic inflection and intonation. We even offer voice cloning, allowing creators to "speak" in a voice that truly resonates with their audience.