Leveraging WaveNet to reunite speech-disabled users with their voices
As a teen, Tim Shaw gave everything he had to football; his dream was to play in the NFL. After playing for Penn State in college, that dream became a reality: the Carolina Panthers drafted him at 23, and he went on to play for the Chicago Bears and the Tennessee Titans, where he set records as a linebacker. After six years in the NFL, his performance began to wane. He could no longer tackle like before, and his arms slid off the pull-up bar. At home, he struggled with bags of groceries, and his legs began to buckle. In 2013, Tim was cut from the Titans, but he vowed to make it onto another NFL roster. He trained harder than ever, yet his performance continued to decline sharply. Five months later, he finally learned the reason: he was diagnosed with amyotrophic lateral sclerosis (ALS, often called Lou Gehrig’s disease). In ALS, the neurons that control voluntary muscles die off, ultimately causing a complete loss of control over one’s own body. ALS has no known cause, and to date no cure or treatment.
Today, Tim is a powerful advocate for ALS research. Earlier this year, he published a letter to his younger self urging acceptance; without it, he believed, he would have grieved himself to his deathbed. Now a wheelchair user, he lives under the constant care of his parents. ALS impairs movement, and it makes speech, swallowing, and eventually even breathing difficult and ultimately impossible. The inability to communicate can be one of the hardest aspects of the disease, both for people with ALS and for their families. As Tim puts it: “It’s beyond frustrating not to be able to express what’s going on in my mind. I’m smarter than ever, but I just can’t get it out.”
The loss of speech can have devastating social consequences. Currently, the primary option for preserving one’s voice is message banking, in which people diagnosed with ALS digitally record personally meaningful phrases in their natural inflection and intonation. Message banking is a source of great comfort for people with ALS and their families, helping to preserve a fundamental part of their identity, their voice, through a very challenging time. But message banking is inflexible, yielding only a static set of phrases. Imagine being told that you will never be able to speak again, and then being offered the chance to preserve your speech by recording as much of it as possible. How would you decide what to record? How would you know what you will most want to say in your future? A story that means a lot to you, a favourite phrase or saying, or a simple “I love you”? The process can be time-consuming and emotionally draining, especially as a person’s voice degrades. And those who were unable to record phrases in time are left with a generic computer-synthesized voice that lacks the power of connection of their own.
Google and DeepMind have begun a collaboration, together with patients like Tim Shaw, to help build technologies that make it easier for people with speech impairments to communicate and interact with others. The challenge here is two-fold. First, we need technology that can recognize the speech of people with non-standard pronunciation, something Google AI has been researching through Project Euphonia. Second, ideally, people should be able to communicate in their original voice. Stephen Hawking, who also had ALS, spoke through a famously unnatural-sounding text-to-speech synthesizer. The second challenge, then, is personalizing text-to-speech technology to the user’s natural voice.
Producing natural-sounding speech has long been considered a “grand challenge” in artificial intelligence. With Tacotron and WaveNet, we have seen breakthrough improvements in the quality of text-to-speech systems. However, while it is possible to build authentic-sounding voices for particular people in specific contexts, as DeepMind’s collaboration with John Legend demonstrated last year, creating such synthetic voices normally requires many hours of studio recording time with a very particular script, a luxury that many people with ALS simply don’t have. Building machine learning models that need minimal training data is an active area of research at DeepMind, and it is critical for use cases like this, where a voice must be recreated from only a handful of audio recordings. DeepMind addressed this by building on their WaveNet work and the new techniques in their research paper, Sample Efficient Adaptive Text-to-Speech, which showed that a high-quality voice can be generated from small amounts of speech data.
Tim and his family were instrumental in the research collaboration with Google. The goal was to give Tim and his family the chance to hear his authentic speaking voice again. Thanks to his time in the NFL and the media appearances that came with it, about half an hour of high-quality audio recordings of Tim were available, and the team was able to apply the WaveNet-based text-to-speech techniques to recreate his original voice from them.
After six months of work, the Google AI team visited Tim and his family to show them the results of their research. The meeting was recorded for “The Age of A.I.”, a new YouTube Originals learning series hosted by Robert Downey Jr. Tim and his family heard his original voice for the first time in years as the model, trained on his NFL media recordings, read out the letter he had just written to his younger self.
“I don’t recall that voice,” Tim said. “We do,” his father replied. Tim later recounted that he was elated; it had been so long since he had sounded like that. He felt like a new person, as if a missing piece of his soul had been restored, and he was thankful that there are people in the world who push the envelope to help others.
To understand how the technology works, it helps to first understand WaveNet. WaveNet is a generative model trained on many hours of speech and text data from a diverse set of speakers. Given arbitrary new text, it can then synthesize a natural-sounding spoken sentence.
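The core building block of WaveNet is the causal dilated convolution: each output sample depends only on current and past input samples, and stacking layers with growing dilation rates lets the network see far back in time. The following is a minimal illustrative sketch of that idea, not DeepMind's implementation; the function names and the toy inputs are my own.

```python
def causal_dilated_conv(x, w, dilation):
    """1-D causal dilated convolution: out[t] = sum_i w[i] * x[t - i*dilation],
    treating samples before the start as zero, so no output ever depends on
    future input."""
    out = [0.0] * len(x)
    for t in range(len(x)):
        for i, wi in enumerate(w):
            j = t - i * dilation
            if j >= 0:
                out[t] += wi * x[j]
    return out

def receptive_field(kernel_size, dilations):
    """Context (in samples) seen by a stack of dilated layers: doubling the
    dilation each layer (1, 2, 4, ...) grows the reach exponentially."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# An impulse reveals the taps: with dilation 2, the filter reaches back
# two samples, not one.
response = causal_dilated_conv([1, 0, 0, 0], [1.0, 1.0], 2)
```

With kernel size 2 and dilations 1, 2, 4, 8, four layers already cover 16 samples of context, which is why deep dilation stacks can model long-range audio structure cheaply.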
Last year, in the Sample Efficient Adaptive Text-to-Speech paper, DeepMind demonstrated that a new voice can be trained from minutes, rather than hours, of voice recordings through a process called fine-tuning. This involves first training a large WaveNet model on data from up to thousands of speakers, which takes a few days, until it can produce the fundamentals of realistic-sounding speech. Then they take the small corpus of data for the target speaker and carefully adapt the model, adjusting its weights to produce a single model that matches the target speaker. Fine-tuning works much like human learning: to learn calculus, you first master the basics of algebra, then apply those simpler concepts to solve more complicated equations.
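The two-stage recipe above can be sketched in miniature: pretrain shared parameters on pooled multi-speaker data, then adapt only a small set of speaker-specific parameters from a few samples. This toy uses a linear model and a single scalar offset per speaker purely for illustration; the real method adapts a neural network, and all names here are my own.

```python
import numpy as np

rng = np.random.default_rng(0)

def pretrain(X, Y):
    """Fit the shared weights on pooled multi-speaker data (closed-form
    least squares stands in for days of large-scale training)."""
    w, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return w

def finetune_speaker(w_shared, X_new, Y_new, steps=200, lr=0.1):
    """Adapt only a scalar per-speaker offset while the shared weights stay
    frozen: a toy analogue of tuning a few speaker-specific parameters
    from a handful of samples."""
    offset = 0.0
    for _ in range(steps):
        residual = X_new @ w_shared + offset - Y_new
        offset -= lr * 2 * residual.mean()  # gradient of mean squared error
    return offset

# Pretraining corpus: many "speakers" sharing the same underlying mapping.
w_true = np.array([2.0, -1.0])
X = rng.normal(size=(200, 2))
w_shared = pretrain(X, X @ w_true)

# New speaker: same mapping plus a personal offset, only 5 samples to learn from.
X_new = rng.normal(size=(5, 2))
offset = finetune_speaker(w_shared, X_new, X_new @ w_true + 3.5)
```

Because the shared structure was learned beforehand, five samples are enough to recover the speaker-specific part, which is the essence of sample-efficient adaptation.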
Following this publication, DeepMind continued to iterate on the models. First, they moved from WaveNet to WaveRNN, a more efficient text-to-speech model co-developed by Google AI and DeepMind. WaveNet requires a second distillation step before it can serve requests in real time, which makes fine-tuning harder. WaveRNN, by contrast, needs no second training step and can synthesize speech much faster than an undistilled WaveNet.
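The speed issue comes from autoregressive sampling: audio is generated one sample at a time, with each sample fed back as input, so total cost is one network evaluation per sample. The sketch below shows only that generation loop with a toy step function standing in for the model; the function names are my own, and the point is simply that per-step cost dominates.

```python
def sample_autoregressive(step_fn, n_samples, state=None):
    """Generate audio one sample at a time: each new sample is fed back as
    input to the next step. Total cost is n_samples calls to step_fn, so
    real-time synthesis hinges on how cheap one step is. An undistilled
    WaveNet runs a deep convolutional stack per step; WaveRNN replaces it
    with a single, much lighter recurrent cell."""
    prev, out = 0.0, []
    for _ in range(n_samples):
        prev, state = step_fn(prev, state)
        out.append(prev)
    return out, state

# Toy step function: a decaying feedback rule standing in for the network.
def toy_step(prev, state):
    return 0.5 * prev + 1.0, state

samples, _ = sample_autoregressive(toy_step, 3)
```

At 16,000 or more samples per second of audio, even a modest per-step saving is the difference between real-time synthesis and a long wait, which is why WaveRNN's lighter step (and distilled WaveNet's parallel sampling) matter so much.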
Beyond speeding up the models by moving to WaveRNN, DeepMind worked with Google AI to improve their quality. Google AI researchers showed that a similar fine-tuning approach could be applied to the related Google Tacotron model, which is used together with WaveRNN to synthesize natural voices. By combining these technologies, trained on audio clips of Tim Shaw from his time in the NFL, they were able to produce a natural-sounding voice that resembled how Tim used to speak before his speech degraded. The voice is not yet perfect; it lacks the expressiveness, quirks, and controllability of a real voice. But the exciting part is that the combination of WaveRNN and Tacotron may help people like Tim hold onto a crucial part of their identity, and the hope is that it will one day be integrated into speech-generation devices.
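The division of labour between the two models is a two-stage pipeline: Tacotron maps text to a sequence of acoustic feature frames (a mel spectrogram), and WaveRNN acts as the vocoder that turns those frames into a waveform. The sketch below shows only the pipeline shape with deterministic toy stand-ins, not real neural models; every function here is hypothetical.

```python
import numpy as np

def text_to_mel(text):
    """Stand-in for the Tacotron stage: maps text to a sequence of acoustic
    feature frames (here a toy encoding of one 4-bin frame per character,
    not a real spectrogram)."""
    return np.array([[float(ord(c) % 32)] * 4 for c in text])

def mel_to_waveform(mel, samples_per_frame=8):
    """Stand-in for the WaveRNN vocoder stage: upsamples feature frames to
    audio samples (simple repetition here, a neural vocoder in reality)."""
    return np.repeat(mel.mean(axis=1), samples_per_frame)

def synthesize(text):
    # Stage 1: text -> acoustic features.  Stage 2: features -> waveform.
    return mel_to_waveform(text_to_mel(text))
```

Splitting synthesis this way lets each stage be fine-tuned to a speaker independently, which is why the same few-shot adaptation trick could be applied to both Tacotron and WaveRNN.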