Timo P Kunz is Co-Founder and CEO of Aflorithmic – the AI-backed startup that enables automated and scalable audio production using synthetic media, voice cloning, and audio mastering.

Audio is becoming an increasingly core part of a brand’s marketing arsenal. Consumers already find voice ads less intrusive and more engaging than those they see on TV, in print and online. When the ad is related to the activity that the listener is carrying out at the time, the brand’s message is even further reinforced.

Companies are planning to spend almost $7 billion on digital audio advertising this year. The most progressive brands know that they can make their audio ad dollars go so much further by using text-to-speech.

This powerful tool, also known as synthetic voice, lets companies build scalable audio directly from a written script, without the need for a voice actor or studio. It’s much faster and cheaper to use than traditional voiceover, and can launch brands to a global audience – all while maintaining their authentic voice. 


With Google’s recent upgrades to its Speech Services, which have made its voice models even more human-like, it’s now easier than ever for brands to create high-quality, personalized synthetic voice content. But just like any other tool, it takes time and practice to master.

Having developed hundreds of synthetic voice solutions for dozens of brands, we’ve discovered that most people tend to make similar mistakes when using text-to-speech for the first time. Unsatisfied with the initial results, some even abandon it before they’ve unlocked its full, amazing potential.

When done right, synthetic voice is practically indistinguishable from the real thing. But to maximize its benefits, there are certain best practices to bear in mind. By following these five simple steps, anyone can create engaging text-to-speech content that sounds natural.

Keep your script short

It’s pretty amazing what synthetic voice is already capable of in 2022. That said, even the most advanced models still lack the full dynamic range and nuances of natural human speech. Our research shows that listeners tend to disengage when audio exceeds four or five minutes.

Anyone who has listened to an audio article reader on a website knows that it lacks the peaks, troughs, and overall flow of a human speaker. A well-written article isn’t automatically interesting or engaging when it’s converted to audio. It’s an entirely different medium and creators should treat it as such. 

The easiest and most potent way to do so? With short bursts of content. Bite-sized clips are easier for listeners to digest, enjoy and, more importantly, remember. Less is truly more when it comes to text-to-speech.
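One way to put this into practice is to break a long script into clips before sending it to a text-to-speech engine. The Python sketch below is only an illustration: the speaking rate of 150 words per minute and the 60-second default limit are assumptions rather than measured values, and the function names are hypothetical.

```python
import re

WORDS_PER_MINUTE = 150  # assumed speaking rate; measure your own voice model


def estimate_seconds(text: str) -> float:
    """Rough spoken duration of `text` at the assumed speaking rate."""
    return len(text.split()) / WORDS_PER_MINUTE * 60


def split_into_clips(script: str, max_seconds: float = 60.0) -> list[str]:
    """Group whole sentences into clips that stay under `max_seconds`.

    A single sentence longer than the limit is kept whole as its own clip.
    """
    sentences = re.split(r"(?<=[.!?])\s+", script.strip())
    clips, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and estimate_seconds(candidate) > max_seconds:
            # Adding this sentence would overshoot, so close the clip here.
            clips.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        clips.append(current)
    return clips
```

Splitting on sentence boundaries keeps each clip self-contained, so no clip starts or ends mid-thought.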

Have multiple, alternating speakers

Human voices are extremely complex. It’s not just the words and sentences that a person uses when speaking, but also how they use them. Rhythm, stress, intonation, and so on, convey meaning and help us detect emphasis, emotion, and even sarcasm.

Synthetic voice is much more limited in these properties, which makes it harder for listeners to stay focused on what’s being said. The solution is surprisingly simple: have two or more different voices read out the script, taking turns part by part. This is especially handy for formats where longer scripts can’t be avoided.

Using multiple voices instantly infuses your content with greater vocal diversity, which will pique – and hold – listeners’ interest. And this isn’t unique to text-to-speech. Take the evening news, for example. It’s already common to have two anchors deliver alternate lines to increase viewer engagement. 

One of the strengths of synthetic speech is having access to a diverse library of voices. We work with hundreds of voice actors, all of whom receive royalties each time a clone of their voice is used. Using multiple speakers therefore supports the industry and its workers as a whole, all while making your content more dynamic.
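Alternating speakers can be planned before any audio is generated. The Python sketch below assigns voices to paragraphs in rotation. The voice names are placeholders, not identifiers from any real voice library, and it assumes paragraphs are separated by blank lines.

```python
from itertools import cycle

# Hypothetical voice names; substitute the identifiers your TTS provider uses.
VOICES = ["voice_a", "voice_b"]


def assign_voices(script: str, voices=VOICES):
    """Pair each non-empty paragraph with the next voice, round-robin."""
    paragraphs = [p.strip() for p in script.split("\n\n") if p.strip()]
    # zip stops when the paragraphs run out, so the endless cycle is safe.
    return list(zip(cycle(voices), paragraphs))
```

Each (voice, paragraph) pair can then be sent to the text-to-speech engine separately, and the resulting clips joined in order.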

Include more pauses, commas, and short sentences

You’ve just created your first piece of text-to-speech content. You’ve kept the script brief, and used multiple speakers. So why does it still sound a bit clunky?

Synthetic speech doesn’t automatically put emphasis on certain words the way human speakers do. In written text, we can rely on grammatical and stylistic forms (like italics and capitals) to add some rhythm, and the reader’s imagination to add the rest. When converting that text into audio, we have to ramp up the punctuation to “show” the synthetic model what needs to be emphasized.

Commas help place stress on the words that come before them (e.g. “the task will be completed, eventually”). Periods break up monotonous speech (e.g. “Their concerns were real. And had to be taken seriously. After all, everybody wanted to get along. For better, or worse.”).

As a general rule, shorter sentences are more impactful than longer ones with lots of commas (e.g. “The world is changing. Because technology and society are advancing. Sometimes, this can be scary. But on the other hand, the opportunities, are huge.”). Just like in natural speech, these breaks and pauses should vary in length. 
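Where a text-to-speech engine accepts SSML (the W3C Speech Synthesis Markup Language), pause lengths can also be set explicitly with its break element instead of relying on punctuation alone. The Python sketch below adds a shorter pause after commas and a longer one after sentence endings. The pause lengths are assumptions to be tuned by ear, and whether your engine honors SSML should be checked in its documentation.

```python
import re
from xml.sax.saxutils import escape

# Assumed pause lengths; tune by ear for your voice.
PAUSES = {",": "200ms", ".": "500ms", "?": "500ms", "!": "500ms"}


def to_ssml(text: str) -> str:
    """Wrap text in <speak> and add a <break> after each punctuation mark
    that is followed by whitespace."""
    def add_break(match):
        mark = match.group(1)
        return f'{mark}<break time="{PAUSES[mark]}"/> '

    # Escape &, < and > first so the script cannot break the markup.
    body = re.sub(r"([,.?!])\s+", add_break, escape(text.strip()))
    return f"<speak>{body}</speak>"
```

Note that this simple rule also adds a pause after abbreviations such as “e.g.”, so those are best written out in the script first.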

Use phonetic spelling for words that are pronounced incorrectly

No matter which text-to-speech model you use, some fine-tuning will always be required before the output is tip-top and ready to go. There are certain words that can be challenging for synthetic voice. Tweaking your text to include phonetic spelling can easily rectify this.

For example, foreign words are likely to be pronounced by the rules of the voice’s native language, resulting in a strong accent that can be hard to understand. A British English voice model might pronounce the French term “coup de grâce” more like “coop de grace”. To force the correct pronunciation here, we’d recommend changing the spelling to “koo-de-grahs” in your script.

The same rule can be applied to any mispronounced word. We’ve found that some voices pronounce the word “controller” as “comptroller”. This can again be fixed by typing it out phonetically as “conn-troller”. 
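Rather than editing each script by hand, these fixes can be kept in a small lookup table and applied automatically. The Python sketch below uses the two respellings from the examples above. The table and function names are hypothetical, and the right respelling will differ from voice to voice.

```python
import re

# Respellings taken from the examples above; extend per voice as you find issues.
RESPELLINGS = {
    "coup de grâce": "koo-de-grahs",
    "controller": "conn-troller",
}


def apply_respellings(script: str, respellings=RESPELLINGS) -> str:
    """Replace whole-word matches, ignoring case, longest entries first."""
    for term in sorted(respellings, key=len, reverse=True):
        # \b keeps the match to whole words, so "controllers" is left alone.
        pattern = re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)
        script = pattern.sub(respellings[term], script)
    return script
```

The original script stays readable for human editors, and the respelled version is generated only at the point where it is sent to the voice model.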

Opt for sound to enrich the experience

Of course, speech is just one element of audio content. To produce a truly captivating audio experience, get creative with sound effects and music. Not only do they add depth, emotion and appeal, they also conveniently cover up some of the artificial sounds of the synthetic voice. Easy-to-use sound effects like bumpers and risers are particularly effective for this, and they also help brands tell their story and build excitement.

If you haven’t already, consider having a sound logo designed – a clip lasting a second or a few seconds that identifies your brand. Think of the instantly recognizable sonic branding used by PlayStation, Xbox, Intel, T-Mobile, McDonald’s, and Netflix. Just like text-to-speech, getting this right can set your brand apart from competitors.

The beauty of synthetic voice is that it lets you create endless hours of consistent audio content, personalized to just about any audience. Want to harness its full potential? Tweak your script, apply breaks, sound effects, and visuals, and your content will truly come to life.

Disclaimer: This article features a client of an Espacio portfolio company.