Machine learning is getting really good at copying the human voice

Synthetic voices have become ubiquitous. They feed us directions in the morning, shepherd us through phone calls by day, and broadcast the news on smart speakers at night. And as the technology used to make them improves, these voices are becoming more and more human-sounding. This is the final frontier in synthetic speech: replicating not just what we say, but how we say it.

Rupal Patel heads a research group at Northeastern University that studies speech prosody: the changes in pitch, loudness and duration that we use to convey intent and emotion through voice. "Sometimes people think of it as the icing on the cake," she explains. "You have the message, and now it's how you modulate that message, but I really think it's the scaffolding that gives meaning to the message itself."

Patel says she grew interested in prosody after finding it was the only element of vocal communication that seemed to be available to people with some kinds of severe speech disorders. These patients were able to make expressive sounds, even if they could not speak clearly. In 2014, Patel founded a company to build custom synthetic voices for non-speaking individuals. VocaliD has since expanded to commercial brands and influencers.

Synthetic speech has come a long way over the years. At age nine, Siri is the oldest virtual assistant, but in the world of speaking machines, she's a baby. People have been trying to synthesize speech since at least the 18th century, when an Austro-Hungarian inventor built a crude replica of the human vocal tract that could articulate entire phrases (albeit in a monotone).

Current machine-learning techniques can model human speech complete with awkward pauses and lip smacks. Still, training on thousands of samples per second is prohibitively expensive for most real-world systems; researchers, including those at VocaliD, are continually implementing newer and more efficient methods.

But even as the remaining gaps between human and synthetic speech steadily close, truly lifelike prosody continues to elude even the most sophisticated systems. Maybe what's still missing requires machines not only to mimic humans, but also to feel like us.

