WaveNet: a generative model for raw audio
This blog post by AICoreSpot introduces WaveNet, a deep generative model of raw audio waveforms. WaveNets can generate speech that mimics any human voice and sounds more natural than the best existing text-to-speech systems, reducing the gap with human performance by over 50%.
The same network can also be used to synthesize other audio signals, such as music, and the post presents some striking examples of automatically generated piano pieces.
Allowing people to converse with machines is a long-standing dream of human-computer interaction. The ability of computers to understand natural speech has been revolutionized over the past few years by the application of deep neural networks, for instance in Google Voice Search. However, generating speech with computers – a process usually referred to as speech synthesis or text-to-speech (TTS) – is still largely based on so-called concatenative TTS, in which a very large database of short speech fragments is recorded from a single speaker and then recombined to form complete utterances. This makes it difficult to modify the voice (for instance, switching to a different speaker, or altering the emphasis or emotion of the speech) without recording an entirely new database.
This has created a strong demand for parametric TTS, in which all the information required to generate the speech is stored in the parameters of the model, and the content and characteristics of the speech can be controlled through the model's inputs. So far, however, parametric TTS has tended to sound less natural than concatenative TTS. Existing parametric models typically generate audio signals by passing their outputs through signal-processing algorithms known as vocoders.
WaveNet changes this paradigm by directly modelling the raw waveform of the audio signal, one sample at a time. In addition to yielding more natural-sounding speech, working with raw waveforms means that WaveNet can model any type of audio, including music.
Researchers usually avoid modelling raw audio because it ticks so quickly: typically 16,000 samples per second or more, with important structure at many time scales. Building a fully autoregressive model, in which the prediction for each of these samples is influenced by all previous ones (in statistics-speak, every predictive distribution is conditioned on all prior observations), is clearly a challenging task.
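Concretely, an autoregressive model factorises the joint probability of a waveform into a product of per-sample conditionals: p(x) = p(x_1) · p(x_2 | x_1) · p(x_3 | x_1, x_2) · … Here is a minimal sketch of that scoring rule, where `toy_conditional` is a hypothetical stand-in for the network (the real model predicts a distribution over quantised amplitude values conditioned on everything before):

```python
import numpy as np

def toy_conditional(history, levels=256):
    # Hypothetical stand-in for the network: a probability distribution
    # over quantised amplitude values, conditioned on all previous
    # samples (here it simply peaks near the most recent sample).
    centre = history[-1] if len(history) else levels // 2
    logits = -0.05 * (np.arange(levels) - centre) ** 2
    p = np.exp(logits - logits.max())
    return p / p.sum()

def log_likelihood(waveform):
    # Autoregressive factorisation: log p(x) = sum_t log p(x_t | x_<t).
    total = 0.0
    for t, x_t in enumerate(waveform):
        total += np.log(toy_conditional(waveform[:t])[x_t])
    return total

# Under this toy model, a smoothly varying waveform is more probable
# than one that jumps around.
print(log_likelihood([128, 129, 130, 131]))
print(log_likelihood([128, 100, 160, 120]))
```

The real network replaces `toy_conditional` with a deep stack of convolutions, but the chain-rule structure of the likelihood is the same.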
However, the PixelRNN and PixelCNN models, published earlier this year, showed that it was possible to generate complex natural images not just one pixel at a time, but one colour-channel at a time, requiring thousands of predictions per image. This inspired the adaptation of the two-dimensional PixelNets to a one-dimensional WaveNet.
WaveNets are fully convolutional neural networks in which the convolutional layers have various dilation factors, allowing the receptive field to grow exponentially with depth and cover thousands of timesteps.
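To see why dilation helps, consider the arithmetic: with kernel size 2, each layer with dilation d extends the receptive field by d samples, so dilations that double at each layer (1, 2, 4, …, 512, the pattern used in the WaveNet paper) give exponential coverage for only linear depth:

```python
# Receptive field of a stack of dilated causal convolutions.
# Each layer with dilation d adds (kernel_size - 1) * d samples
# of context, so doubling dilations give exponential growth in
# coverage for a linear increase in depth and parameters.
kernel_size = 2
dilations = [2 ** i for i in range(10)]  # 1, 2, 4, ..., 512

receptive_field = 1 + (kernel_size - 1) * sum(dilations)
print(receptive_field)          # 1024 samples
print(receptive_field / 16000)  # 0.064 s of audio at 16 kHz
```

Ten ordinary (undilated) convolutional layers of the same kernel size would cover only 11 samples, which is why dilation is essential at audio sample rates.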
During training, the input sequences are real waveforms recorded from human speakers. After training, we can sample from the network to generate synthetic utterances. At each step of sampling, a value is drawn from the probability distribution computed by the network. This value is then fed back into the input, and a new prediction is made for the next step. Building up samples one step at a time like this is computationally expensive, but the researchers found it essential for generating complex, realistic-sounding audio.
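The sampling loop described above can be sketched as follows, with `predict_distribution` as a hypothetical placeholder for the trained network; the essential point is that each drawn sample is appended to the history that conditions the next prediction:

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_distribution(history, levels=256):
    # Hypothetical stand-in for a trained WaveNet: a probability
    # distribution over the quantised amplitude of the next sample,
    # given the samples generated so far.
    centre = history[-1] if history else levels // 2
    logits = -0.01 * (np.arange(levels) - centre) ** 2
    p = np.exp(logits - logits.max())
    return p / p.sum()

history = []
for _ in range(1600):                   # 0.1 s of audio at 16 kHz
    p = predict_distribution(history)
    sample = rng.choice(len(p), p=p)    # draw from the distribution...
    history.append(int(sample))         # ...and feed it back as input

print(len(history))
```

The cost is easy to read off this loop: one full forward pass per audio sample, i.e. 16,000 network evaluations per second of generated speech.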
Enhancing the State of the Art
WaveNet was trained on Google's TTS datasets so the researchers could assess its performance. The figure below shows the quality of WaveNets on a scale from 1 to 5, compared with Google's current best TTS systems (parametric and concatenative) and with human speech, using Mean Opinion Scores (MOS). MOS is a standard measure for subjective sound-quality tests; the scores were obtained in blind tests with human subjects (from over 500 ratings on 100 test sentences). As we can see, WaveNets reduce the gap between the state of the art and human-level performance by over half for both US English and Mandarin Chinese.
For both English and Chinese, Google's current TTS systems are regarded as among the best worldwide, so improving on both with a single model is a major achievement.
To use WaveNet to convert text into speech, we have to tell it what the text is. We do this by transforming the text into a sequence of linguistic and phonetic features (which contain information about the current phoneme, syllable, word, and so on) and feeding it into WaveNet. This means the network's predictions are conditioned not only on the previous audio samples, but also on the text we want it to say.
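One way to picture this conditioning, loosely following the gated activation unit described in the WaveNet paper: the feature vector for the current timestep is added inside the unit alongside the audio-derived input, so it influences both the filter and the gate. The dimensions and weight matrices below are hypothetical placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def conditioned_gated_unit(x, h, Wf, Wg, Vf, Vg):
    # Gated activation with conditioning: the linguistic features h
    # enter both the filter (tanh) and the gate (sigmoid), so every
    # prediction depends on the text as well as on past audio.
    return np.tanh(Wf @ x + Vf @ h) * sigmoid(Wg @ x + Vg @ h)

dim_x, dim_h, dim_out = 32, 8, 32
x = rng.normal(size=dim_x)   # activations derived from past audio
h = rng.normal(size=dim_h)   # linguistic/phonetic features for this step
Wf, Wg = (rng.normal(size=(dim_out, dim_x)) for _ in range(2))
Vf, Vg = (rng.normal(size=(dim_out, dim_h)) for _ in range(2))

z = conditioned_gated_unit(x, h, Wf, Wg, Vf, Vg)
print(z.shape)
```

Because tanh outputs lie in (-1, 1) and the sigmoid gate in (0, 1), the unit's output stays bounded regardless of the conditioning input.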
If we train the network without the text sequence, it still generates speech, but now it has to make up what to say. The result is a kind of babbling, in which real words are mixed with made-up word-like sounds:
WaveNet also sometimes produces non-speech sounds, such as breathing and mouth movements, reflecting the greater flexibility of a raw-audio model.
A single WaveNet can learn the characteristics of many different voices, male and female. To make sure it knew which voice to use for any given utterance, the network was conditioned on the identity of the speaker. Interestingly, the researchers found that training on many speakers made it better at modelling a single speaker than training on that speaker alone, suggesting a form of transfer learning.
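A minimal sketch of this kind of speaker conditioning, assuming a hypothetical learned embedding table with one vector per voice: the chosen vector is broadcast across every timestep of the utterance, so each prediction knows whose voice to use (this per-utterance broadcast is often called global conditioning):

```python
import numpy as np

rng = np.random.default_rng(0)

n_speakers, embed_dim = 10, 16
# Hypothetical learned table: one embedding vector per speaker,
# trained jointly with the rest of the network.
speaker_table = rng.normal(size=(n_speakers, embed_dim))

def global_conditioning(speaker_id, n_timesteps):
    # Broadcast the same speaker vector over the whole utterance.
    v = speaker_table[speaker_id]
    return np.tile(v, (n_timesteps, 1))

cond = global_conditioning(speaker_id=3, n_timesteps=1600)
print(cond.shape)
```

Swapping the voice then amounts to nothing more than selecting a different row of the table, which is why a single trained network can speak in many voices.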
By changing the speaker identity, we can use WaveNet to say the same thing in different voices.
Similarly, we could provide additional inputs to the model, such as emotions or accents, to make the speech even more diverse and interesting.
Since WaveNets can be used to model any audio signal, the researchers thought it would be fun to try generating music as well. Unlike the TTS experiments, the networks were not conditioned on an input sequence telling them what to play (such as a musical score); instead, they were simply allowed to generate whatever they wanted. When trained on a dataset of classical piano music, the network produced fascinating samples.
WaveNets open up many avenues for TTS, music generation, and audio modelling in general. The fact that directly generating audio timestep by timestep with deep neural networks works at all for 16kHz audio is surprising in itself, let alone that it outperforms state-of-the-art TTS systems.