Parallel WaveNet: Fast High-Fidelity Speech Synthesis

11/28/2017 ∙ by Aaron van den Oord, et al.

The recently-developed WaveNet architecture is the current state of the art in realistic speech synthesis, consistently rated as more natural sounding for many different languages than any previous system. However, because WaveNet relies on sequential generation of one audio sample at a time, it is poorly suited to today's massively parallel computers, and therefore hard to deploy in a real-time production setting. This paper introduces Probability Density Distillation, a new method for training a parallel feed-forward network from a trained WaveNet with no significant difference in quality. The resulting system is capable of generating high-fidelity speech samples more than 20 times faster than real-time, and is deployed online by Google Assistant, including serving multiple English and Japanese voices.

1 Introduction

Recent successes of deep learning go beyond achieving state-of-the-art results in research benchmarks, and push the frontiers in some of the most challenging real world applications such as speech recognition hinton2012deep , image recognition krizhevsky2012imagenet ; szegedy2015going , and machine translation wu2016google . The recently published WaveNet wavenet2016 model achieves state-of-the-art results in speech synthesis, and significantly closes the gap with natural human speech. However, it is not well suited for real world deployment due to its prohibitive generation speed. In this paper, we present a new algorithm for distilling WaveNet into a feed-forward neural network which can synthesise equally high quality speech much more efficiently, and is deployed to millions of users.

WaveNet is one of a family of autoregressive deep generative models that have been applied with great success to data as diverse as text mikolov2010recurrent , images larochelle2011neural ; theis2015generative ; oord2016pixel ; van2016conditional , video kalchbrenner2016video , handwriting graves2013generating as well as human speech and music. Modelling raw audio signals, as WaveNet does, represents a particularly extreme form of autoregression, with up to 24,000 samples predicted per second. Operating at such a high temporal resolution is not problematic during network training, where the complete sequence of input samples is already available and—thanks to the convolutional structure of the network—can be processed in parallel. When generating samples, however, each input sample must be drawn from the output distribution before it can be passed in as input at the next time step, making parallel processing impossible.

Inverse autoregressive flows (IAFs) kingma2016improving represent a kind of dual formulation of deep autoregressive modelling, in which sampling can be performed in parallel, while the inference procedure required for likelihood estimation is sequential and slow. The goal of this paper is to marry the best features of both models: the efficient training of WaveNet and the efficient sampling of IAF networks. The bridge between them is a new form of neural network distillation hinton2015distilling , which we refer to as Probability Density Distillation, where a trained WaveNet model is used as a teacher for a feedforward IAF model.

The next section describes the original WaveNet model, while Sections 3 and 4 define in detail the new, parallel version of WaveNet and the distillation process used to transfer knowledge between them. Section 5 then presents experimental results showing no loss in perceived quality for parallel versus original WaveNet, and continued superiority over previous benchmarks. We also present timings for sample generation, demonstrating a speed-up of more than 1000× relative to the original WaveNet.

2 WaveNet

Autoregressive networks model the joint distribution of high-dimensional data as a product of conditional distributions using the probabilistic chain-rule:

$$p(\mathbf{x}) = \prod_{t} p(x_t \mid x_{<t}, \theta)$$

where $x_t$ is the $t$-th variable of $\mathbf{x}$ and $\theta$ are the parameters of the autoregressive model. The conditional distributions are usually modelled with a neural network that receives $x_{<t}$ as input and outputs a distribution over possible $x_t$.

WaveNet wavenet2016 is a convolutional autoregressive model which produces all $p(x_t \mid x_{<t})$ in one forward pass, by making use of causal (or masked) convolutions oord2016pixel ; germain2015made . Every causal convolutional layer can process its input in parallel, making these architectures very fast to train compared to RNNs van2016conditional , which can only be updated sequentially. At generation time, however, the waveform has to be synthesised in a sequential fashion as $x_t$ must be sampled first in order to obtain $x_{>t}$. Because of this, real-time (or faster) synthesis with a fully autoregressive system is challenging. While sampling speed is not a significant issue for offline generation, it is essential for real-world applications. A version of WaveNet that generates in real-time has been developed paineFastWaveNet , but it required the use of a much smaller network, resulting in severely degraded quality.
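The sequential dependency at generation time can be made concrete with a toy sketch. The conditional below (`toy_conditional_mean`, a fixed linear function of the two previous samples with Gaussian noise) is a hypothetical stand-in for the network, not WaveNet itself; the only point is that step t cannot start before step t-1 has been sampled.

```python
import random

def toy_conditional_mean(history):
    """Hypothetical stand-in for the network: mean of x_t given x_{<t}."""
    prev1 = history[-1] if len(history) >= 1 else 0.0
    prev2 = history[-2] if len(history) >= 2 else 0.0
    return 0.6 * prev1 - 0.2 * prev2

def ancestral_sample(num_steps, noise_scale=0.1, seed=0):
    """Draw x_1..x_T one at a time; each draw needs all earlier draws."""
    rng = random.Random(seed)
    samples = []
    for _ in range(num_steps):
        mean = toy_conditional_mean(samples)   # depends on x_{<t}
        samples.append(rng.gauss(mean, noise_scale))
    return samples

waveform = ancestral_sample(num_steps=8)
```

During training the whole of `samples` would already be known, so every `toy_conditional_mean` call could be evaluated at once; at generation time the loop cannot be unrolled in this way.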

Figure 1: Visualisation of a WaveNet stack and its receptive field wavenet2016 .

Raw audio data is typically very high-dimensional (e.g. 16,000 samples per second for 16kHz audio), and contains complex, hierarchical structures spanning many thousands of time steps, such as words in speech or melodies in music. Modelling such long-term dependencies with standard causal convolution layers would require a very deep network to ensure a sufficiently broad receptive field. WaveNet avoids this constraint by using dilated causal convolutions, which allow the receptive field to grow exponentially with depth.

WaveNet uses gated activation functions, together with a simple mechanism introduced in oord2016pixel to condition on extra information such as class labels or linguistic features:

$$h_i = \sigma\left(W_{g,i} * x_i + V_{g,i}^{T} c\right) \odot \tanh\left(W_{f,i} * x_i + V_{f,i}^{T} c\right) \tag{1}$$

where $*$ denotes a convolution operator, and $\odot$ denotes an element-wise multiplication operator. $\sigma(\cdot)$ is a logistic sigmoid function. $c$ represents extra conditioning data. $i$ is the layer index. $f$ and $g$ denote filter and gate, respectively. $W$ and $V$ are learnable weights. In cases where $c$ encodes spatial or sequential information (such as a sequence of linguistic features), the matrix products ($V_{f,i}^{T} c$ and $V_{g,i}^{T} c$) are replaced by convolutions ($V_{f,i} * c$ and $V_{g,i} * c$).
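As a minimal sketch of the gating itself, the snippet below applies a tanh filter multiplied element-wise by a sigmoid gate to pre-activations that are assumed to have been computed already (the convolution and conditioning terms are not modelled here). The input values are hypothetical.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def gated_activation(filter_pre, gate_pre):
    """Elementwise tanh(filter) * sigmoid(gate) over two equal-length lists."""
    return [math.tanh(f) * sigmoid(g) for f, g in zip(filter_pre, gate_pre)]

# Hypothetical pre-activations, i.e. the convolution plus conditioning
# term for each unit, already evaluated.
filter_pre = [0.5, -1.0, 2.0]
gate_pre = [10.0, 0.0, -10.0]
hidden = gated_activation(filter_pre, gate_pre)
```

A strongly positive gate passes the filter value almost unchanged, a zero gate halves it, and a strongly negative gate suppresses it.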

2.1 Higher Fidelity WaveNet

For this work we made two improvements to the basic WaveNet model to enhance its audio quality for production use. Unlike previous versions of WaveNet wavenet2016 , where 8-bit (µ-law or PCM) audio was modelled with a 256-way categorical distribution, we increased the fidelity by modelling 16-bit audio. Since training a 65,536-way categorical distribution would be prohibitively costly, we instead modelled the samples with the discretized mixture of logistics distribution introduced in salimans2017pixelcnn . We further improved fidelity by increasing the audio sampling rate from 16kHz to 24kHz. This required a WaveNet with a wider receptive field, which we achieved by increasing the dilated convolution filter size from 2 to 3. An alternative strategy would be to increase the number of layers or add more dilation stages.
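The effect of the filter size on the receptive field can be checked with a short calculation. The sketch below assumes the layer layout reported in Section 5 (3 stacks of 10 layers with dilations 1, 2, 4, ..., 512) and counts only the dilated causal convolutions; any other layers in the real network are ignored.

```python
def receptive_field(filter_size, dilations):
    """Receptive field (in samples) of stacked dilated causal convolutions."""
    return 1 + (filter_size - 1) * sum(dilations)

# Assumed layout: 3 stacks of 10 layers with dilations 1, 2, 4, ..., 512.
dilations = [2 ** i for i in range(10)] * 3

for filter_size in (2, 3):
    samples = receptive_field(filter_size, dilations)
    print(filter_size, samples, round(1000 * samples / 24000, 1), "ms at 24kHz")
```

Under these assumptions, moving from a filter size of 2 to 3 roughly doubles the receptive field without adding layers, which compensates for the higher sampling rate.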

3 Parallel WaveNet

While the convolutional structure of WaveNet allows for rapid parallel training, sample generation remains inherently sequential and therefore slow, as it is for all autoregressive models which use ancestral sampling. We therefore seek an alternative architecture that will allow for rapid, parallel generation.

Inverse-autoregressive flows (IAFs) kingma2016improving are stochastic generative models whose latent variables are arranged so that all elements of a high dimensional observable sample can be generated in parallel. IAFs are a special type of normalising flow dinh2014nice ; rezende2015variational ; dinh2016density which model a multivariate distribution $p_X(\mathbf{x})$ as an explicit invertible non-linear transformation $f$ of a simple tractable distribution $p_Z(\mathbf{z})$ (such as an isotropic Gaussian distribution). The resulting random variable $\mathbf{x} = f(\mathbf{z})$ has a log probability:

$$\log p_X(\mathbf{x}) = \log p_Z(\mathbf{z}) - \log \left| \frac{d\mathbf{x}}{d\mathbf{z}} \right|$$

where $\left| \frac{d\mathbf{x}}{d\mathbf{z}} \right|$ is the determinant of the Jacobian of $f$. The transformation $f$ is typically chosen so that it is invertible and its Jacobian determinant is easy to compute. In the case of an IAF, $x_t$ is modelled by $p(x_t \mid \mathbf{z}_{\le t})$ so that $x_t = f(\mathbf{z}_{\le t})$. The transformation has a triangular Jacobian matrix which makes the determinant simply the product of the diagonal entries:

$$\log \left| \frac{d\mathbf{x}}{d\mathbf{z}} \right| = \sum_{t} \log \frac{\partial f(\mathbf{z}_{\le t})}{\partial z_t}$$

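Initially, a random sample is drawn from $\mathbf{z} \sim \mathrm{Logistic}(\mathbf{0}, \mathbf{I})$. The following transformation is applied to $\mathbf{z}$:

$$x_t = z_t \cdot s(\mathbf{z}_{<t}, \theta) + \mu(\mathbf{z}_{<t}, \theta) \tag{2}$$

The network outputs a sample $\mathbf{x}$, as well as $\boldsymbol{\mu}$ and $\mathbf{s}$. Therefore, $p(x_t \mid \mathbf{z}_{<t})$ follows a logistic distribution parameterised by $\mu_t$ and $s_t$.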

While $\mu(\mathbf{z}_{<t}, \theta)$ and $s(\mathbf{z}_{<t}, \theta)$ can be any autoregressive model, we use the same convolutional autoregressive network structure as the original WaveNet wavenet2016 . If an IAF and an autoregressive model share the same output distribution class (e.g., mixture of logistics or categorical) then mathematically they should be able to model the same multivariate distributions. However, in practice there are some differences (see Appendix section A.2). To output the correct distribution for timestep $x_t$, the inverse autoregressive flow can implicitly infer what it would have output at previous timesteps $x_1, \dots, x_{t-1}$ based on the noise inputs $z_1, \dots, z_{t-1}$, which allows it to output all $x_t$ in parallel given $z_t$.
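The asymmetry between sampling and inference in an IAF can be sketched with a toy affine flow. Here `toy_shift_and_scale` is a hypothetical stand-in for the network that produces a shift and a scale from past noise only; it is not the WaveNet-style network used in the paper. Sampling touches only the noise, so every timestep could be computed at once, whereas recovering the noise from a waveform has to proceed one step at a time.

```python
import math
import random

def sample_logistic(rng):
    """Draw from Logistic(0, 1) by inverting its CDF."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return math.log(u) - math.log(1.0 - u)

def toy_shift_and_scale(z_prev):
    """Hypothetical stand-in for the network: (mu_t, s_t) from z_{<t} only."""
    context = sum(z_prev[-2:])            # causal: looks at past noise only
    return 0.5 * math.tanh(context), math.exp(0.1 * math.tanh(context))

def iaf_forward(z):
    """x_t = z_t * s_t + mu_t; each t depends only on z, so steps are independent."""
    out = []
    for t in range(len(z)):
        mu, s = toy_shift_and_scale(z[:t])
        out.append(z[t] * s + mu)
    return out

def iaf_inverse(x):
    """Recover z from x; z_t needs z_{<t} first, so this direction is sequential."""
    z = []
    for t in range(len(x)):
        mu, s = toy_shift_and_scale(z)
        z.append((x[t] - mu) / s)
    return z

rng = random.Random(0)
z = [sample_logistic(rng) for _ in range(16)]
x = iaf_forward(z)
z_recovered = iaf_inverse(x)
```

The loop in `iaf_forward` is written serially only for readability; no iteration reads the output of another, which is what makes parallel generation possible.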

In general, normalising flows might require repeated iterations to transform uncorrelated noise into structured samples, with the output generated by the flow at each iteration passed in as input at the next one rezende2015variational . This is less crucial for IAFs, as the autoregressive latents can induce significant structure in a single pass. Nonetheless we observed that having up to 4 flow iterations (which we implemented by simply stacking 4 such networks on top of each other) did improve the quality. Note that in the final parallel WaveNet architecture, the weights were not shared between the flows.

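The first (bottom) network takes as input the white unconditional logistic noise: $\mathbf{x}^0 = \mathbf{z}$. Thereafter the output of each network $i$ is passed as input to the next network $i+1$, which again transforms it.

$$\mathbf{x}^{i} = \mathbf{x}^{i-1} \cdot \mathbf{s}^{i} + \boldsymbol{\mu}^{i} \tag{3}$$

Because we use the same ordering in all the flows, the final distribution $p(x_t \mid \mathbf{z}_{<t}, \theta)$ is logistic with location $\mu_{tot}$ and scale $s_{tot}$:

$$\mu_{tot} = \sum_{i}^{N} \mu^{i} \left( \prod_{j>i}^{N} s^{j} \right) \tag{4}$$

$$s_{tot} = \prod_{i}^{N} s^{i} \tag{5}$$

where $N$ is the number of flows and the dependencies on $t$ and $\mathbf{z}$ are omitted for simplicity.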
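A quick numerical check shows how a stack of affine flows collapses into a single location and scale. The per-flow shifts and scales below are hypothetical values for a single timestep; in the actual model they would be produced by each flow's network.

```python
def compose_flows(shifts, scales):
    """Collapse N affine flows x -> x * s_i + mu_i into one (mu_tot, s_tot)."""
    mu_tot, s_tot = 0.0, 1.0
    for mu, s in zip(shifts, scales):
        mu_tot = mu_tot * s + mu   # earlier shifts are rescaled by later scales
        s_tot = s_tot * s
    return mu_tot, s_tot

# Hypothetical per-flow outputs for one timestep, N = 4 flows.
shifts = [0.1, -0.3, 0.05, 0.2]
scales = [0.9, 1.2, 0.7, 1.1]
z = 0.42

x = z
for mu, s in zip(shifts, scales):      # apply the flows one after another
    x = x * s + mu

mu_tot, s_tot = compose_flows(shifts, scales)
```

Applying the four flows in sequence gives the same value as the single affine map `z * s_tot + mu_tot`, which is why the stacked output remains logistic.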

4 Probability Density Distillation

Training the parallel WaveNet model directly with maximum likelihood would be impractical, as the inference procedure required to estimate the log-likelihoods is sequential and slow. (Footnote 1: In this sense the two architectures are dual to one another: slow training and fast generation with parallel WaveNet versus fast training and slow generation with WaveNet.) We therefore introduce a novel form of neural network distillation hinton2015distilling that uses an already trained WaveNet as a ‘teacher’ from which a parallel WaveNet ‘student’ can efficiently learn. To stress the fact that we are dealing with normalised density models, we refer to this process as Probability Density Distillation (in contrast to Probability Density Estimation). The basic idea is for the student to attempt to match the probability of its own samples under the distribution learned by the teacher.

Given a parallel WaveNet student $p_S(\mathbf{x})$ and WaveNet teacher $p_T(\mathbf{x})$ which has been trained on a dataset of audio, we define the Probability Density Distillation loss as follows:

$$D_{KL}\left(P_S \,\|\, P_T\right) = H(P_S, P_T) - H(P_S) \tag{6}$$

where $D_{KL}$ is the Kullback–Leibler divergence, $H(P_S, P_T)$ is the cross-entropy between the student $P_S$ and teacher $P_T$, and $H(P_S)$ is the entropy of the student distribution. When the KL divergence becomes zero, the student distribution has fully recovered the teacher’s distribution. The entropy term (which is not present in previous distillation objectives hinton2015distilling ) is vital in that it prevents the student’s distribution from collapsing to the mode of the teacher (which, counter-intuitively, does not yield a good sample; see Appendix section A.1). Crucially, all the operations required to estimate derivatives for this loss (sampling from $p_S(\mathbf{x})$, evaluating $p_T(\mathbf{x})$, and evaluating $H(P_S)$) can be performed efficiently, as we will see.

It is worth noting the parallels to Generative Adversarial Networks (GANs goodfellow2014generative ), with the student playing the role of generator, and the teacher playing the role of discriminator. As opposed to GANs, however, the student is not attempting to fool the teacher in an adversarial manner; rather it co-operates by attempting to match the teacher’s probabilities. Furthermore the teacher is held constant, rather than being trained in tandem with the student, and both models yield tractable normalised distributions.

Recently gu2017natmt has presented a related idea to train feed-forward networks for neural machine translation. Their method is based on conditioning the feedforward decoder on fertility values, which require supervision by an external alignment system. The training procedure also involves the creation of an additional dataset as well as fine-tuning. During inference, their model relies on re-scoring by an auto-regressive model.

Figure 2: Overview of Probability Density Distillation. A pre-trained WaveNet teacher is used to score the samples output by the student. The student is trained to minimise the KL-divergence between its distribution and that of the teacher by maximising the log-likelihood of its samples under the teacher and maximising its own entropy at the same time.

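First, observe that the entropy term in Equation 6 can be rewritten as follows:

$$H(P_S) = \mathbb{E}_{z \sim L(0,1)} \left[ \sum_{t=1}^{T} -\ln p_S(x_t \mid \mathbf{z}_{<t}) \right] \tag{7}$$

$$= \mathbb{E}_{z \sim L(0,1)} \left[ \sum_{t=1}^{T} \ln s(\mathbf{z}_{<t}, \theta) \right] + 2T \tag{8}$$

where $\mathbf{x} = g(\mathbf{z})$ and $z_t$ are independent samples drawn from the logistic distribution. The second equality in Equation 8 follows because the entropy of a logistic distribution $L(\mu, s)$ is $\ln s + 2$. We can therefore compute this term without having to explicitly generate $\mathbf{x}$.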
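The closed form for the entropy of a logistic distribution with scale s, namely ln s + 2 (in nats), can be verified numerically. The sketch below compares a Monte Carlo estimate of the expected negative log-density with the closed form; the location and scale values are arbitrary illustrative choices.

```python
import math
import random

def logistic_neg_log_density(x, mu, s):
    """-ln p(x) for Logistic(mu, s)."""
    y = (x - mu) / s
    return y + math.log(s) + 2.0 * math.log1p(math.exp(-y))

def monte_carlo_entropy(mu, s, num_samples=100_000, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        x = mu + s * (math.log(u) - math.log(1.0 - u))   # inverse-CDF sample
        total += logistic_neg_log_density(x, mu, s)
    return total / num_samples

mu, s = 0.3, 0.5
estimate = monte_carlo_entropy(mu, s)
closed_form = math.log(s) + 2.0
```

Because the closed form depends only on the scale, the entropy term of the loss can be obtained directly from the scales the student outputs, with no sampling of the waveform.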

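The cross-entropy term however explicitly depends on $\mathbf{x} = g(\mathbf{z})$, and therefore requires sampling from the student to estimate.

$$H(P_S, P_T) = \int_{\mathbf{x}} -p_S(\mathbf{x}) \ln p_T(\mathbf{x}) \tag{9}$$

$$= \sum_{t=1}^{T} \int_{\mathbf{x}} -p_S(\mathbf{x}) \ln p_T(x_t \mid \mathbf{x}_{<t}) \tag{10}$$

$$= \sum_{t=1}^{T} \int_{\mathbf{x}} -p_S(\mathbf{x}_{<t})\, p_S(x_t \mid \mathbf{x}_{<t})\, p_S(\mathbf{x}_{>t} \mid \mathbf{x}_{\le t}) \ln p_T(x_t \mid \mathbf{x}_{<t}) \tag{11}$$

$$= \sum_{t=1}^{T} \mathbb{E}_{p_S(\mathbf{x}_{<t})} \left[ \int_{x_t} -p_S(x_t \mid \mathbf{x}_{<t}) \ln p_T(x_t \mid \mathbf{x}_{<t}) \int_{\mathbf{x}_{>t}} p_S(\mathbf{x}_{>t} \mid \mathbf{x}_{\le t}) \right] \tag{12}$$

$$= \sum_{t=1}^{T} \mathbb{E}_{p_S(\mathbf{x}_{<t})} H\left( p_S(x_t \mid \mathbf{x}_{<t}),\, p_T(x_t \mid \mathbf{x}_{<t}) \right) \tag{13}$$

For every sample $\mathbf{x}$ we draw from the student $p_S$ we can compute all $p_T(x_t \mid \mathbf{x}_{<t})$ in parallel with the teacher and then evaluate $H\left(p_S(x_t \mid \mathbf{x}_{<t}),\, p_T(x_t \mid \mathbf{x}_{<t})\right)$ very efficiently by drawing multiple different samples $x_t$ from $p_S(x_t \mid \mathbf{x}_{<t})$ for each timestep. This unbiased estimator has a much lower variance than naively evaluating the sample under the teacher with Equation 9.

Because the teacher’s output distribution $p_T(x_t \mid \mathbf{x}_{<t})$ is parameterised as a mixture of logistics distribution, the loss term $\ln p_T(x_t \mid \mathbf{x}_{<t})$ is differentiable with respect to both $x_t$ and $\mathbf{x}_{<t}$. A categorical distribution, on the other hand, would only be differentiable w.r.t. $\mathbf{x}_{<t}$.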
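The per-timestep cross-entropy can be estimated by drawing several samples from the student's logistic distribution and scoring each under the teacher. The sketch below does this for a single timestep with a continuous mixture of logistics as the teacher; all parameter values are hypothetical, and the discretisation used in the actual model is ignored.

```python
import math
import random

def logistic_log_density(x, mu, s):
    y = (x - mu) / s
    return -y - math.log(s) - 2.0 * math.log1p(math.exp(-y))

def mixture_log_density(x, weights, mus, scales):
    """ln p_T(x) for a (continuous) mixture of logistics, via log-sum-exp."""
    terms = [math.log(w) + logistic_log_density(x, m, s)
             for w, m, s in zip(weights, mus, scales)]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

def cross_entropy_estimate(student, teacher, num_samples=50_000, seed=0):
    """Monte Carlo estimate of E_{x ~ p_S}[-ln p_T(x)] for one timestep."""
    mu_s, s_s = student
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        x = mu_s + s_s * (math.log(u) - math.log(1.0 - u))
        total -= mixture_log_density(x, *teacher)
    return total / num_samples

student = (0.0, 0.5)                                   # hypothetical (mu, s)
matched_teacher = ([1.0], [0.0], [0.5])                # same distribution
other_teacher = ([0.5, 0.5], [-1.0, 1.0], [0.3, 0.3])  # hypothetical mixture

h_matched = cross_entropy_estimate(student, matched_teacher)
h_other = cross_entropy_estimate(student, other_teacher)
```

When the teacher matches the student, the estimate approaches the student's entropy (ln 0.5 + 2 here), so the KL term vanishes; a mismatched teacher gives a strictly larger cross-entropy.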

4.1 Additional loss terms

Training with Probability Density Distillation alone might not sufficiently constrain the student to generate high quality audio streams. Therefore, we also introduce additional loss functions to guide the student distribution towards the desired output space.

Power loss

The first additional loss we propose is the power loss, which ensures that the power in different frequency bands of the generated speech matches, on average, that of human speech. The power loss helps prevent the student from collapsing to a high-entropy WaveNet-mode, such as whispering.

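The power-loss is defined as:

$$\left\| \phi(g(\mathbf{z}, \mathbf{c})) - \phi(\mathbf{y}) \right\|^2 \tag{14}$$

where $(\mathbf{y}, \mathbf{c})$ is an example with conditioning from the training set, $\phi(\mathbf{x}) = |\mathrm{STFT}(\mathbf{x})|^2$ and STFT stands for the Short-Term Fourier Transform. We found that $\phi(g(\mathbf{z}, \mathbf{c}))$ can be averaged over time before taking the Euclidean distance with little difference in effect, which means it is the average power for various frequencies that is important.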
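A minimal sketch of the time-averaged variant of the power loss is given below. It uses a naive DFT with a Hann window for clarity rather than speed; the frame length, hop size and test signals are hypothetical choices, not the settings used in the experiments.

```python
import cmath
import math

def power_spectrogram(signal, frame_len=16, hop=8):
    """|STFT(x)|^2 with a Hann window, computed by a naive DFT."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        bins = []
        for k in range(frame_len // 2 + 1):
            coeff = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                        for n in range(frame_len))
            bins.append(abs(coeff) ** 2)
        frames.append(bins)
    return frames

def power_loss(generated, reference):
    """Squared Euclidean distance between time-averaged power spectra."""
    def time_average(spec):
        return [sum(col) / len(spec) for col in zip(*spec)]
    avg_gen = time_average(power_spectrogram(generated))
    avg_ref = time_average(power_spectrogram(reference))
    return sum((a - b) ** 2 for a, b in zip(avg_gen, avg_ref))

# Hypothetical signals: a sine "reference" and a much quieter copy ("whisper").
reference = [math.sin(2 * math.pi * 4 * n / 64) for n in range(64)]
quiet = [0.1 * v for v in reference]
```

`power_loss(reference, reference)` is zero, while `power_loss(quiet, reference)` is positive, which is the kind of low-volume output the loss is meant to penalise.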

Perceptual loss

In the power loss formulation given in Equation 14, one can also use a neural network instead of the STFT to conserve a perceptual property of the signal rather than total energy. In our case we have used a WaveNet-like classifier trained to predict the phones from raw audio. Because such a classifier naturally extracts high-level features that are relevant for recognising the phones, this loss term penalises bad pronunciations. A similar principle has been used in computer vision for artistic style transfer gatys2015neural , or to get better perceptual reconstruction losses, e.g., in super-resolution johnson2016perceptual .

We have experimented with two different ways of using the perceptual loss, the feature reconstruction loss (the Euclidean distance between feature maps in the classifier) and the style loss (the Euclidean distance between the Gram matrices johnson2016perceptual ). The latter produced better results in our experiments.
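The difference between the two perceptual variants can be illustrated on toy feature maps. In the sketch below the features are hypothetical 2-channel sequences rather than outputs of a phone classifier, the distances are squared, and the Gram matrix is normalised by the number of timesteps; these are assumptions made for the illustration.

```python
def gram_matrix(features):
    """features: list of channels, each a list over time. Returns C x C Gram matrix."""
    num_steps = len(features[0])
    return [[sum(a * b for a, b in zip(ch_i, ch_j)) / num_steps
             for ch_j in features] for ch_i in features]

def feature_loss(feat_a, feat_b):
    """Squared Euclidean distance between the feature maps themselves."""
    return sum((a - b) ** 2
               for ch_a, ch_b in zip(feat_a, feat_b)
               for a, b in zip(ch_a, ch_b))

def style_loss(feat_a, feat_b):
    """Squared Euclidean distance between the two Gram matrices."""
    gram_a, gram_b = gram_matrix(feat_a), gram_matrix(feat_b)
    return sum((a - b) ** 2
               for row_a, row_b in zip(gram_a, gram_b)
               for a, b in zip(row_a, row_b))

# Hypothetical 2-channel feature maps; the second is the first rotated in time.
features = [[1.0, 0.0, 2.0, 0.0], [0.0, 3.0, 0.0, 1.0]]
rotated = [ch[1:] + ch[:1] for ch in features]
```

The feature reconstruction loss between `features` and `rotated` is positive, whereas the style loss is zero: the Gram matrix keeps channel co-activation statistics and discards exact timing.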

Contrastive loss

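Finally, we also introduce a contrastive distillation loss as follows:

$$D_{KL}\left(P_S(\mathbf{c} = \mathbf{c}_1) \,\|\, P_T(\mathbf{c} = \mathbf{c}_1)\right) - \gamma\, D_{KL}\left(P_S(\mathbf{c} = \mathbf{c}_1) \,\|\, P_T(\mathbf{c} = \mathbf{c}_2)\right) \tag{15}$$

which minimises the KL-divergence between the teacher and student when both are conditioned on the same information (e.g., linguistic features, speaker ID, …), but also maximises it for different conditioning pairs $\mathbf{c}_1 \neq \mathbf{c}_2$. In order to implement this loss, we use the output of the student $g(\mathbf{z}, \mathbf{c}_1)$ and evaluate the waveform twice under the teacher: once with the same conditioning $P_T(\mathbf{x} \mid \mathbf{c}_1)$ and once with a randomly sampled conditioning input: $P_T(\mathbf{x} \mid \mathbf{c}_2)$. The weight for the contrastive term $\gamma$ was set to 0.3 in our experiments. The contrastive loss penalises waveforms that have high likelihood regardless of the conditioning vector.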

5 Experiments

In all our experiments we used text-to-speech models that were conditioned on linguistic features (similar to wavenet2016 ), providing phonetic and duration information to the network. We also conditioned the models on pitch information (logarithm of $f_0$, the fundamental frequency) predicted by a different model. We never used ground-truth information (such as pitch or duration) extracted from human speech for generating audio samples and the test sentences were not present (or similar to those) in the training set.

The teacher WaveNet network was trained for 1,000,000 steps with the ADAM optimiser kingma2014adam with a minibatch size of 32 audio clips, each containing 7,680 timesteps (roughly 320ms). Remarkably, a relatively short snippet of time is sufficient to train the parallel WaveNet to produce long term coherent waveforms. The learning rate was held constant at $2 \times 10^{-4}$, and Polyak averaging polyak1992acceleration was applied over the parameters. The model consists of 30 layers, grouped into 3 dilated residual block stacks of 10 layers. In every stack, the dilation rate increases by a factor of 2 in every layer, starting with rate 1 (no dilation) and reaching the maximum dilation of 512 in the last layer. The filter size of causal dilated convolutions is 3. The number of hidden units in the gating layers is 512 (split into two groups of 256 for the two parts of the activation function (1)). The number of hidden units in the residual connection is 512, and in the skip connection and the $1 \times 1$ convolutions before the output layer is also 256. We used 10 mixture components for the mixture of logistics output distribution.

The student network consisted of the same WaveNet architecture layout, except with different inputs and outputs and no skip connections. The student was also trained for 1,000,000 steps with the same optimisation settings. The student typically consisted of 4 flows with 10, 10, 10, 30 layers respectively, with 64 hidden units for the residual and gating layers.

Audio Generation Speed

We have benchmarked the sampling speed of autoregressive and distilled WaveNets on an NVIDIA P100 GPU. Both models were implemented in Tensorflow abadi2016tensorflow and compiled with XLA. The hidden layer activations from previous timesteps in the autoregressive model were cached with circular buffers paineFastWaveNet . The resulting sampling speed with this implementation is 172 timesteps/second for a minibatch of size 1. The distilled model, which is more parallelizable, achieves over 500,000 timesteps/second with the same batch size of 1, resulting in a speed-up of three orders of magnitude.
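The reported throughput figures can be turned into speed-up and real-time factors with a few lines of arithmetic. The sketch below assumes the benchmark was run on 24kHz audio, as used elsewhere in this work, and treats 500,000 timesteps/second as a lower bound.

```python
SAMPLE_RATE_HZ = 24_000          # assumed audio sampling rate of the benchmark
autoregressive_rate = 172        # timesteps/second, as reported
distilled_rate = 500_000         # timesteps/second (reported as a lower bound)

speed_up = distilled_rate / autoregressive_rate
real_time_factor = distilled_rate / SAMPLE_RATE_HZ
seconds_per_audio_second = SAMPLE_RATE_HZ / autoregressive_rate

print(f"speed-up over autoregressive sampling: {speed_up:.0f}x")
print(f"distilled model vs real time: {real_time_factor:.1f}x")
print(f"autoregressive: {seconds_per_audio_second:.0f} s per second of audio")
```

Under that assumption the distilled model runs at roughly 20 times real time, while the autoregressive model needs more than two minutes of compute per second of audio.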

| Method | Subjective 5-scale MOS |
|---|---|
| 16kHz, 8-bit µ-law, 25h data: | |
| LSTM-RNN parametric wavenet2016 | 3.67 ± 0.098 |
| HMM-driven concatenative wavenet2016 | 3.86 ± 0.137 |
| WaveNet wavenet2016 | 4.21 ± 0.081 |
| 24kHz, 16-bit linear PCM, 65h data: | |
| HMM-driven concatenative | 4.19 ± 0.097 |
| Autoregressive WaveNet | 4.41 ± 0.069 |
| Distilled WaveNet | 4.41 ± 0.078 |

Table 1: Comparison of WaveNet distillation with the autoregressive teacher WaveNet, unit-selection (concatenative), and previous results from wavenet2016 . MOS stands for Mean Opinion Score.

Audio Fidelity

In our first set of experiments, we looked at the quality of WaveNet distillation compared to the autoregressive WaveNet teacher and other baselines on data from a professional female speaker wavenet2016 . Table 1 gives a comparison of autoregressive WaveNet, distilled WaveNet and current production systems in terms of mean opinion score (MOS). There is no difference between MOS scores of the distilled WaveNet (4.41 ± 0.078) and autoregressive WaveNet (4.41 ± 0.069), and both are significantly better than the concatenative unit-selection baseline (4.19 ± 0.097).

It is also important to note that the difference in MOS scores of our WaveNet baseline result compared to the previous reported result wavenet2016 is due to the improvement in audio fidelity as explained in Section 2.1: modelling a sample rate of 24kHz instead of 16kHz and bit-depth of 16-bit PCM instead of 8-bit µ-law.

| Speaker | Parametric | Concatenative | Distilled WaveNet |
|---|---|---|---|
| English speaker 1 (female - 65h data) | 3.88 | 4.19 | 4.41 |
| English speaker 2 (male - 21h data) | 3.96 | 4.09 | 4.34 |
| English speaker 3 (male - 10h data) | 3.77 | 3.65 | 4.47 |
| English speaker 4 (female - 9h data) | 3.42 | 3.40 | 3.97 |
| Japanese speaker (female - 28h data) | 4.07 | 3.47 | 4.23 |

Table 2: Comparison of MOS scores on English and Japanese with multi-speaker distilled WaveNets. Note that some speakers sounded less appealing to people and always get lower MOS, however distilled parallel WaveNet always achieved significantly better results.

Multi-speaker Generation

By conditioning on the speaker-ids we can construct a single parallel WaveNet model that is able to generate multiple speakers’ voices and their accents. These networks require slightly more capacity than single speaker models and thus had 30 layers in each flow. In Table 2 we show a comparison of such a distilled parallel WaveNet model with two main baselines: a parametric and a concatenative system. In the comparison, we use a number of English speakers from a single model (one of them, English speaker 1, is the same speaker as in Table 1) and a Japanese speaker from another model. For some speakers, the concatenative system gets better results than the parametric system, while for other speakers it is the opposite. The parallel WaveNet model, on the other hand, significantly outperforms both baselines for all the speakers.

Ablation Studies

Preference scores versus baseline concatenative system

| Losses used | Win | Lose | Neutral |
|---|---|---|---|
| KL + Power | 60% | 15% | 25% |
| KL + Power + Perceptual | 66% | 10% | 24% |
| KL + Power + Perceptual + Contrastive (= default) | 65% | 9% | 26% |

Table 3: Performance with respect to different combinations of loss terms. We report preference comparison scores since their mean opinion scores tend to be very close and inconclusive.

To analyse the importance of the loss functions introduced in Section 4.1 we show how the quality of the distilled WaveNet changes with different loss functions in Table 3 (top). We found that MOS scores of these models tend to be very similar to each other (and similar to the result in Table 1). Therefore, we report subjective preference scores from a paired comparison test (“A/B test”), which we found to be more reliable for noticing small (sometimes qualitative) differences. In these tests, the subjects were asked to listen to a pair of samples and choose which they preferred, though they could choose “neutral” if they did not have any preference.

As mentioned before, the KL loss alone does not constrain the distillation process enough to obtain natural sounding speech (e.g., low-volume audio suffices for the KL); we therefore do not report preference scores with only this term. The KL loss (Section 4) combined with the power loss is enough to generate quite natural speech. Adding the perceptual loss gives a small but noticeable improvement. Adding the contrastive loss does not improve the preference scores any further, but makes the generated speech less noisy, which is something most raters do not pay attention to, but is important for production quality speech.

As explained in Section 3, we use multiple inverse-autoregressive flows in the parallel WaveNet architecture: A model with a single flow gets a MOS score of 4.21, compared to a MOS score of 4.41 for models with multiple flows.

6 Conclusion

In this paper we have introduced a novel method for high-fidelity speech synthesis based on WaveNet wavenet2016 using Probability Density Distillation. The proposed model achieved several orders of magnitude speed-up compared to the original WaveNet with no significant difference in quality. Moreover, we have successfully transferred this algorithm to new languages and multiple speakers.

The resulting system has been deployed in production at Google, and is currently being used to serve Google Assistant queries in real time to millions of users (Footnote 2: https://deepmind.com/blog/wavenet-launches-google-assistant/). We believe that the same method presented here can be used in many different domains to achieve similar speed improvements whilst maintaining output accuracy.

7 Acknowledgements

In this paper, we have described the research advances that made it possible for WaveNet to meet the speed and quality requirements for being used in production at Google. At the same time, an equivalently significant effort has gone into integrating this new end-to-end deep learning model into the production pipeline, satisfying requirements of not only speed, but latency and reliability, among others. We would like to thank Ben Coppin, Edgar Duéñez-Guzmán, Akihiro Matsukawa, Lizhao Liu, Mahalia Miller, Trevor Strohman and Eddie Kessler for very useful discussions. We also would like to thank the entire DeepMind Applied and Google Speech teams for their foundational contributions to the project and developing the production pipeline.

References

Appendix A

A.1 Argument against MAP estimation

In this section we make an argument against maximum a posteriori (MAP) estimation for distillation; similar arguments have been made by previous authors in a different setting [24].

The distillation loss defined in Section 4 minimises the KL divergence between the teacher and generator. We could instead have minimised only the cross-entropy between the teacher and generator (the standard distillation loss term [11]), so that the samples by the generator are as likely as possible according to the teacher. Doing so would give rise to MAP estimation. Counter-intuitively, audio samples obtained through MAP estimation do not sound as good as typical examples from the teacher: in fact they are almost completely silent, even if using conditional information such as linguistic features. This effect is not due to adversarial behaviour on the part of the teacher, but rather is a fundamental property of the data distribution which the teacher has approximated.

As an example consider the simple case where we have audio from a white random noise source: the distribution at every timestep is $\mathcal{N}(0, 1)$, regardless of the samples at previous timesteps. White noise has a very specific and perceptually recognizable sound: a continual hiss. The MAP estimate of this data distribution, and thus of any generative model that matches it well, recovers the distribution mode, which is 0 at every timestep: i.e. complete silence. More generally, any highly stochastic process is liable to have a ‘noiseless’ and therefore atypical mode. For the KL divergence the optimum is to recover the full teacher distribution. This is clearly different from any random sample from the distribution. Furthermore, if one changes the representation of the data (e.g., by nonlinearly pre-processing the audio signal), then the MAP estimate changes, unlike the KL-divergence in Equation 6, which is invariant to the coordinate system.
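The white-noise argument can be checked numerically. The sketch below assumes a unit-variance Gaussian at every timestep and compares a typical sample with the mode (all zeros): the mode has the higher likelihood, yet it carries no power at all.

```python
import math
import random

def gaussian_log_likelihood(samples):
    """Sum over timesteps of ln N(x_t; 0, 1)."""
    return sum(-0.5 * x * x - 0.5 * math.log(2 * math.pi) for x in samples)

def mean_power(samples):
    return sum(x * x for x in samples) / len(samples)

rng = random.Random(0)
num_steps = 1000
typical = [rng.gauss(0.0, 1.0) for _ in range(num_steps)]   # sounds like hiss
mode = [0.0] * num_steps                                     # MAP estimate: silence

ll_typical = gaussian_log_likelihood(typical)
ll_mode = gaussian_log_likelihood(mode)
```

Here `ll_mode` exceeds `ll_typical`, while `mean_power(mode)` is zero and `mean_power(typical)` is close to one, so maximising likelihood alone yields a waveform unlike anything the source would produce.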

A.2 Autoregressive Models and Inverse-autoregressive Flows

Although inverse-autoregressive flows (IAFs) and autoregressive models can in principle model the same distributions [2], they have different inductive biases and may vary greatly in their capacity to model certain processes. As a simple example consider the Fibonacci series $(1, 1, 2, 3, 5, 8, 13, \dots)$. For an autoregressive model this is easy to model with a receptive field of two: $f(k+2) = f(k) + f(k+1)$. For an IAF, however, the receptive field needs to be at least size $k$ to correctly model $k$ terms, leading to a larger model that is less able to generalise.
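The autoregressive side of this example fits in a few lines: each new term is computed from the previous two outputs only. This is an illustrative sketch of the recurrence, not a trained model.

```python
def fibonacci_autoregressive(num_terms):
    """Each new term needs only the previous two: a receptive field of two."""
    terms = [1, 1]
    while len(terms) < num_terms:
        terms.append(terms[-2] + terms[-1])
    return terms[:num_terms]

series = fibonacci_autoregressive(10)
```

An IAF has no access to its own previous outputs, only to its inputs, so to produce the k-th term it would have to look back over all k positions.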