If you were caught out, don’t worry. Deep fakes, both audio and video, are now worryingly accurate. Even when the intention is good (recreating a stage performance that was never recorded), the opportunities for commercial fraud and mass deception are huge.
In fact, Jay-Z has just filed copyright complaints against a series of fake videos that show him performing other artists’ hits, including We Didn’t Start the Fire.
A brief history of voice synthesis
The Shakespeare recitals were created by a version of WaveNet, Google’s deep learning TTS (text to speech) technology which also generates the voices used by Google Assistant.
WaveNet is a major advance on previous attempts at voice synthesis. These include concatenation, where actor recordings are broken into fragments and then reassembled into set phrases. Think of the disjointed voices used by call centres to read out addresses or bank statements in the early 2010s.
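The idea is simple enough to sketch in a few lines. Here is a toy illustration, assuming a hypothetical library of pre-recorded clips (the fragment names and filenames are invented for the example, with strings standing in for audio data):

```python
# Toy sketch of concatenative synthesis: each word maps to a
# pre-recorded clip, and a phrase is just the clips played in order.
# Fragment names and filenames are hypothetical.
FRAGMENTS = {
    "balance": "balance.wav",
    "is": "is.wav",
    "42": "42.wav",
    "pounds": "pounds.wav",
}

def assemble(words):
    """Look up the recorded clip for each word; return the playback order."""
    return [FRAGMENTS[w] for w in words]

print(assemble(["balance", "is", "42", "pounds"]))
```

Because every clip is recorded in isolation, with its own pitch and pacing, the joins are audible, which is exactly why those call-centre voices sounded so choppy.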
Other approaches include parametric speech synthesis, which uses a variety of electronic playback methods to imitate the acoustics of the human voice. One of the earliest versions is the Voder, a keyboard-like device demonstrated at the 1939 World’s Fair.
Here, the output is synthesised, rather than assembled from a recording. That’s why the voice sounds so robotic. Other famous examples include the Speak & Spell, an educational gadget from Texas Instruments that cluttered up the toy boxes of the 1970s and 80s.
When Google first announced WaveNet in 2016, it required too much computational power for practical use in commercial applications. So, what’s the big difference in 2020? First, Google has greatly simplified the training model. Previously, WaveNet required complex linguistic inputs and the supervision of an expert with the assistance of elaborate text-analysis systems.
Since then, Google has developed an architecture called Tacotron that replaces these complex elements with a single neural network trained by data alone, in the form of speech examples and corresponding text transcripts.
The latest version of this model, Tacotron 2, uses a neural network to generate spectrograms that represent not only the pronunciation of words, but also subtleties of human speech, including volume, speed and intonation. A modified version of WaveNet uses these spectrograms to synthesise audio that imitates the human voice.
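A spectrogram is just a time-frequency picture of sound: the signal is sliced into short overlapping frames and each frame’s frequency content is measured. The sketch below computes a plain magnitude spectrogram with NumPy (Tacotron 2 actually predicts mel-scaled spectrograms, so this is a deliberate simplification; the frame length and hop size are arbitrary illustrative values):

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Magnitude spectrogram via a short-time Fourier transform:
    slice the signal into overlapping frames, apply a Hann window
    to each, and take the magnitude of its FFT."""
    window = np.hanning(frame_len)
    frames = [signal[i:i + frame_len] * window
              for i in range(0, len(signal) - frame_len + 1, hop)]
    # Each row describes the frequency content of one short slice of time.
    return np.abs(np.fft.rfft(frames, axis=1))

# One second of a 440 Hz tone sampled at 8 kHz: the energy should
# concentrate in a single frequency bin across all frames.
sr = 8000
t = np.arange(sr) / sr
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (number of frames, frame_len // 2 + 1 frequency bins)
```

Tacotron 2’s contribution is the other direction: instead of computing such a picture from recorded audio, its network learns to predict one directly from text, and the WaveNet vocoder then turns that prediction back into a waveform.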
The system even understands the nuances of written punctuation, adding pauses and emphasis to the generated speech. If you’re not convinced, have a listen to these examples. Even side by side, it’s difficult to tell the difference between the human voice and its synthetic equivalent.
Aside from deep fake anxieties, there are plenty of legitimate commercial opportunities for voice synthesis. Chatbots proliferate, while synthesis for individuals with speech impairments has been around for decades.
Dubbing of TV shows and films is another area for growth. Voice synthesis will also play a central role in our post-COVID world as robotics and other forms of automation gain a foothold in the workplace.
But the deep fake crisis won’t go away. Most recently, researchers using the Tacotron 2 model were able to recreate a voice based on just five seconds of recorded audio. Five seconds! There used to be a time when the only people who had to worry about being impersonated were those who had shared many hours of voice recordings online.
But the latest advances put all of us in peril. We now live in a world where a tiny snippet of your vocal DNA is enough to reproduce your presence across the entirety of social media. Don’t bother looking for the cork. The voice genie is well and truly out of the bottle.