The widespread adoption of personal assistants has been powered, in large part, by advances in the field of Text-to-Speech (TTS). Many state-of-the-art TTS systems contain a model, referred to as a vocoder, that takes as input audio features derived from a piece of text and outputs synthesized speech. WaveNet oord2016wavenet is a state-of-the-art vocoder that is capable of producing synthesized speech with near-human-level quality shen2018natural. The key to the model’s quality is its autoregressive loop, but this property makes the model exceptionally challenging to deploy in applications that require real-time output or need to scale efficiently to millions of users: naïve implementations may take tens of minutes to generate ten seconds of speech.
As a result, TTS research has focused on finding alternative vocoder architectures such as Parallel-WaveNet oord2018parallel, WaveRNN kalchbrenner2018efficient, ClariNet ping2018clarinet and WaveGlow prenger2019waveglow that achieve higher performance when deployed on existing hardware. There is a degree of ambiguity as to which vocoder has the highest quality, as audio quality evaluation is subjective, but all authors agree that WaveNet produces audio of at least as high a quality as the more recent approaches kim2018flowavenet; oord2018parallel; prenger2019waveglow; tian2020featherwave; hsu2020wg.
In this paper, rather than changing the WaveNet architecture to improve inference performance we instead keep it fixed and explore a range of model compression techniques that can yield greater inference performance. Crucially, the techniques we explore are all available in existing deep learning frameworks and are deployable to a wide range of current and future CPUs and neural network accelerators. Finally, motivated by the desire to maintain WaveNet’s quality, we evaluate the impact that these compression techniques have on the perceived fidelity of the synthesized speech.
We examine two main categories of model compression—sparsity and quantization—and explore both their independent and combined impact on model quality. For sparsity we consider iterative and one-shot magnitude-based neural network pruning; and for quantization we explore the INT8, bfloat16, half-precision floating-point with both 16-bit and 32-bit accumulation (FP16.16, FP16.32), 16-bit block floating-point (BFP16), TensorFloat32 (TF32) and single-precision floating-point (FP32) formats. While there has been some work in vocoder pruning of WaveRNN and its variants kalchbrenner2018efficient; valin2019lpcnet; tian2020featherwave, to our knowledge, this is the first paper in which WaveNet pruning results are provided. Additionally, to our knowledge, no other authors compare as wide a range of precisions, nor look at the interactions between pruning and quantization for the WaveNet model.
We include samples generated by the models presented in this work and will release our code at myrtlesoftware.github.io/wavenet-paper/.
2 Related Work
Numerous attempts have been made to improve vocoder inference performance. The WaveRNN kalchbrenner2018efficient authors note that the time for vocoder inference, $T(u)$, for a target audio sequence $u$ can be decomposed into computation time $c(\mathrm{op}_i)$ and kernel launch overhead $d(\mathrm{op}_i)$ for each of the $N$ operations (layers) of the model:
$$T(u) = |u| \sum_{i=1}^{N} \left( c(\mathrm{op}_i) + d(\mathrm{op}_i) \right) \quad (1)$$
Attempts to optimize a vocoder for deployment aim to reduce at least one of $|u|$, $N$, $c(\mathrm{op}_i)$ and $d(\mathrm{op}_i)$. Many approaches including Parallel-WaveNet, ClariNet, WaveGlow and WaveFlow ping2019waveflow remove the autoregressive component and hence reduce $|u|$ so that many or all of the samples can be generated in parallel. Others, including WaveRNN and its variants LPCNet valin2019lpcnet and FeatherWave tian2020featherwave, keep the autoregressive component but make alterations to the architecture to decrease the product of $N$ and $c(\mathrm{op}_i)$. There have also been efforts that focus on reducing $d(\mathrm{op}_i)$ by exploiting techniques such as persistent kernels that only launch the kernel once per sequence pharris_2018. In this work we explore reducing $c(\mathrm{op}_i)$ without altering the WaveNet architecture by utilising model compression to give greater possibilities in the types of models that can be deployed.
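The decomposition described above (sequence length multiplied by the summed per-operation compute time and launch overhead) can be sketched numerically. This is a minimal illustration only: the function name and all timing values are hypothetical and are not measurements from this work.

```python
def inference_time(num_samples, compute_times, launch_overheads):
    """Total time = samples * sum over operations of (compute + launch overhead).

    All times are in seconds; one entry per operation (layer) of the model.
    """
    if len(compute_times) != len(launch_overheads):
        raise ValueError("one compute time and one overhead per operation")
    per_sample = sum(c + d for c, d in zip(compute_times, launch_overheads))
    return num_samples * per_sample


# Hypothetical model: 10 operations, 2 us compute and 5 us launch overhead each,
# generating one second of 16 kHz audio sample by sample.
baseline = inference_time(16000, [2e-6] * 10, [5e-6] * 10)

# Halving only the compute time (e.g. via compression) leaves the overhead intact.
compressed = inference_time(16000, [1e-6] * 10, [5e-6] * 10)

print(f"baseline: {baseline:.2f} s, compressed: {compressed:.2f} s")
```

With these made-up numbers the launch overhead dominates, which illustrates why reducing compute time and reducing launch overhead are treated as separate levers.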
Sparse models offer two potential benefits over their dense counterparts:
The amount of computation can be reduced since, for example, multiplications by zero need not be performed.
Memory bandwidth requirements can be reduced as it is possible to achieve higher compression ratios with sparse matrices.
The first of these reduces the per-operation computation time, but in order to realise either of these benefits, hardware support is usually required. Depending on the type of support, different hardware platforms are amenable to different types of sparsity. At one end of the spectrum, some authors use channel pruning, in which entire convolutional channels are set to zero he2017channel. It is comparatively easy to realise the inference-time performance benefits of channel sparsity but this approach produces a significant degradation in audio quality for WaveNet hussain2020fastwave. Channel sparsity is a special case of block sparsity where, for a 2D matrix, blocks of a fixed size are constrained to be either all-dense or all-sparse narang2017block. At the other end of the spectrum, sparsity can also be unstructured, meaning there are no constraints on the sparsity pattern; this typically results in the smallest quality degradation but is also the most challenging sparsity pattern to deploy efficiently. A hybrid approach is to employ balanced sparsity yao2019balanced; cao2019efficient, where each block is independently pruned to the target sparsity percentage but within a block the sparsity is unstructured.
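The difference between unstructured and balanced sparsity can be seen on a small example. The sketch below uses magnitude as the pruning criterion, assumes a flat list of weights, and uses hypothetical helper names and values; for the balanced case it keeps 2 of every 4 values.

```python
def magnitude_prune(weights, sparsity):
    """Unstructured: zero the `sparsity` fraction of weights closest to zero."""
    num_prune = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    prune_idx = set(order[:num_prune])
    return [0.0 if i in prune_idx else w for i, w in enumerate(weights)]


def balanced_prune(weights, block_size=4, keep=2):
    """Balanced: within every block, keep only the `keep` largest magnitudes."""
    if len(weights) % block_size != 0:
        raise ValueError("length must be a multiple of block_size")
    pruned = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        ranked = sorted(range(block_size), key=lambda i: abs(block[i]), reverse=True)
        keep_idx = set(ranked[:keep])
        pruned.extend(w if i in keep_idx else 0.0 for i, w in enumerate(block))
    return pruned


row = [0.9, -0.8, 0.7, -0.6, 0.1, 0.05, -0.02, 0.01]
print(magnitude_prune(row, 0.5))  # [0.9, -0.8, 0.7, -0.6, 0.0, 0.0, 0.0, 0.0]
print(balanced_prune(row))        # [0.9, -0.8, 0.0, 0.0, 0.1, 0.05, 0.0, 0.0]
```

Both results are 50% sparse, but the balanced version is forced to discard two large weights in the first block and keep two small ones in the second. This is the constraint that makes the pattern easier to deploy and potentially more harmful to quality.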
The other principal axes on which neural network sparsity approaches can differ relate to the method of obtaining the sparse network. One way to do this is by utilizing magnitude-based pruning, an approach in which the parameters closest to zero are pruned han2015learning. Magnitude-based pruning can be broadly classified as either one-shot, in which a trained network is pruned to the desired sparsity in a single step han2015learning, or iterative, where the level of sparsity increases more gradually over the training process zhu2017prune. However, the division between the two is not always clear-cut. For example, the FeatherWave authors propose a two-stage sparse pruning schedule (TSSP) which combines a one-shot jump to 50% sparsity followed by an iterative pruning stage.
The WaveRNN, LPCNet and FeatherWave models utilise high levels of sparsity to reduce inference time. The WaveRNN authors use iterative pruning to achieve sparsity as high as 96%. They investigate a range of block sparsity patterns including $1 \times 1$ (unstructured), $4 \times 4$ and $16 \times 1$ and find that the last of these is the most performant as it more closely mirrors the layout in physical memory. The LPCNet and FeatherWave authors both use 90% sparse networks with a $16 \times 1$ pattern although the latter uses TSSP as discussed above instead of the purely iterative approach.
Quantized models also reduce the per-operation computation time, as the operations are performed in a numerical format in which they are less computationally expensive. This approach has been applied to a wide variety of models including BERT wu2020integer, ResNet and GNMT zafrir2019q8bert, and sees adoption in widely recognised machine learning benchmarks reddi2019mlperf.
The quantization process includes one or both of:
Reducing the number of bits of the datatype, e.g. using 8 bits instead of 32 bits.
Using a less expensive format, e.g. using integer instead of floating-point.
A simple scheme is to perform all multiplications in the FP16 data format as this is already widely supported on a variety of hardware devices. The results are accumulated in either FP16 or FP32; this distinction matters for the range of representable values and for the precision in which any activation functions are later performed. We denote these choices FP16.16 and FP16.32, meaning FP16 multiplies with FP16 or FP32 accumulation respectively.
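The effect of the accumulator precision can be illustrated with a small dot product. The sketch below emulates the two schemes in pure Python using the `struct` module's IEEE half-precision (`e`) and single-precision (`f`) formats. It is a simplification: operands are rounded to FP16, each product is added to the running sum, and the sum is then rounded to the accumulator format. Real hardware may round at different points.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def dot(a, b, accumulate):
    """Dot product with FP16 operands; `accumulate` rounds the running sum."""
    total = 0.0
    for x, y in zip(a, b):
        product = to_fp16(x) * to_fp16(y)
        total = accumulate(total + product)
    return total


ones = [1.0] * 4096
fp16_16 = dot(ones, ones, to_fp16)  # FP16 accumulation
fp16_32 = dot(ones, ones, to_fp32)  # FP32 accumulation
print(fp16_16, fp16_32)
```

The FP32 accumulator reaches the exact answer of 4096, whereas the FP16 accumulator stops growing once the running sum is large enough that adding 1 no longer changes the rounded value (at 2048 under round-to-nearest-even). Long accumulations such as wide convolutions are where the two schemes are most likely to diverge.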
Quantizing to integers is another popular choice. When quantizing from floating point to integer, it is necessary to use a quantization scheme in which there is no quantization error for the value zero, as zero-valued parameters have an outsized impact on model performance jacob2018quantization, especially when quantizing sparse matrices. Running inference at INTX (most often INT8) is widely used for deployment, including for models from the domains of Machine Translation wu2016google; he2019streaming; wu2018training and NLP embeddings zafrir2019q8bert.
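One way to guarantee that zero is represented without error is a symmetric scheme in which the real value is the integer code multiplied by a single scale. The sketch below assumes per-tensor scaling and the hypothetical helper names shown; an affine scheme with an integer zero-point would satisfy the same requirement.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: real ~= scale * q, q in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to real values."""
    return [scale * q for q in codes]


# A sparse weight vector: the pruned (zero) entries must stay exactly zero.
weights = [0.0, 0.5, -1.27, 0.0, 0.013]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
print(codes)
print(restored)
```

Because zero maps to the integer code 0 and back to exactly 0.0, the sparsity pattern of a pruned matrix survives quantization, while the non-zero entries pick up a rounding error of at most half a quantization step.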
However, formats other than INTX and the basic FP16 are becoming more widely used as hardware vendors accommodate them. For example, BrainFloat16 is supported by Google TPUs (cloud.google.com/tpu/docs/bfloat16) and Intel engineers have demonstrated a 1.5× speedup using this format for both Parallel-WaveNet and FeatherWave on Xeon CPUs (intel.com/content/www/us/en/artificial-intelligence/posts/intel-xeon-text-to-speech.html). BlockFloat16 is supported on Intel’s Stratix 10 NX FPGA (intel.com/content/www/us/en/products/programmable/fpga/stratix-10/nx.html) and NVIDIA’s Ampere series of GPUs will support the TensorFloat32 format. In the FPGA space, there is work using bespoke formats; for example, hussain2020fastwave achieved a 2× speedup for WaveNet inference by quantizing from floating point to fixed point while keeping the range and precision of their 27-bit data type constant.
In many cases, it is possible to perform post-training quantization (PTQ), in which full-precision weights are quantized to the desired precision after training is completed, with minimal loss in model quality. However, when the quality degradation is large, it becomes necessary to perform Quantization Aware Training (QAT) jacob2018quantization. In QAT, the quantization operation for the target precision is simulated during training so that the trained FP32 weights learn to remove the noise injected by the quantization.
Table 1. Columns: Layer | Type | # Parameters | GOP/second audio.
3.1 Setup and Evaluation
We use the WaveNet architecture as detailed in Figure 1. This model uses mel spectrograms with 80 filter banks for the conditional features and upsamples them using a 1D transposed convolution with a kernel size of 800 and a stride of 200. The remainder of the architecture is parameterized by the number of channels in its convolutions (the number of skip, residual and audio channels) as well as by the number of repeated layers and the dilation cycle parameter. For all experiments we use a model with skip channels, residual channels, audio channels, repeated layers and . This model has 7.2 million parameters, see Table 1. Whilst the model produces audio channels the final audio output has a bit depth of 16 as a μ-law companding transform is applied oord2016wavenet.
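To make the upsampling step concrete, the output length of a 1D transposed convolution can be computed from its kernel size and stride. The sketch below uses the standard length formula for dilation 1 and no output padding. The padding value and the assumption that each mel frame corresponds to 200 audio samples (matching the stride) are illustrative and are not stated in the text.

```python
def transposed_conv1d_length(num_frames, kernel_size=800, stride=200, padding=0):
    """Output length of a 1D transposed convolution (dilation 1, no output padding)."""
    return (num_frames - 1) * stride - 2 * padding + kernel_size


# Assumption: one mel frame per 200 samples, so 80 frames per second at 16 kHz.
frames_per_second = 16000 // 200

print(transposed_conv1d_length(frames_per_second))               # 16600
print(transposed_conv1d_length(frames_per_second, padding=300))  # 16000
```

Without padding, the kernel overhang produces 600 extra samples that would need trimming; a padding of 300 on each side is one hypothetical way to obtain exactly one conditioning vector per audio sample.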
The weights of the dilation, conditional, residual and skip convolutions in the repeated layers and the out post-processing convolution weights are pruned as combined these layers contain of the total operations; see Table 1. The feature upsample pre-processing weights are also pruned as this layer contains of the total number of parameters despite only requiring of the total operations. Biases for all pruned layers remain dense. Finally, the embedding pre-processing and end post-processing layers remain dense as, although they have comparatively few parameters and operations, pruning them has a disproportionate effect on the final quality.
We focus on two sparsity granularities as these are supported by the major deep learning frameworks including TensorFlow and PyTorch. The first granularity is layer-wise unstructured sparsity. We use iterative magnitude-based pruning where each layer is pruned independently up to a target compression ratio, defined as per Equation 2.
$$\text{compression ratio} = \frac{\text{original size}}{\text{compressed size}} \quad (2)$$
Each layer is pruned every 500 steps following the cubic pruning schedule found in TensorFlow zhu2017prune. The second granularity is balanced sparsity where 2 out of every block of 4 values are pruned, denoted 2:4. For this technique we use NVIDIA’s automatic sparsity library and follow their recommended approach: train a dense model, perform one-shot pruning to the target sparsity, and then repeat the training process starting from the pruned parameters.
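The cubic schedule raises the target sparsity quickly at first and then more slowly as it approaches the final value. A minimal sketch follows, assuming sparsity is only updated at pruning steps (every 500 steps, as above). The begin and end steps in the example are hypothetical, and a compression ratio of 4 is taken to correspond to a final sparsity of 0.75.

```python
def cubic_sparsity(step, final_sparsity, begin_step, end_step,
                   initial_sparsity=0.0, frequency=500):
    """Target sparsity at a training step under a cubic pruning schedule."""
    if step <= begin_step:
        return initial_sparsity
    if step >= end_step:
        return final_sparsity
    # Sparsity is held constant between pruning steps.
    last_prune = begin_step + ((step - begin_step) // frequency) * frequency
    progress = (last_prune - begin_step) / (end_step - begin_step)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** 3


# Hypothetical run: prune from step 0 to step 10000 towards 75% sparsity.
for step in (0, 2500, 5000, 5250, 10000):
    print(step, round(cubic_sparsity(step, 0.75, 0, 10000), 5))
```

Halfway through this hypothetical schedule the target is already about 66% sparse, leaving the second half of pruning to remove the remaining weights more gradually while the network recovers.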
We use the 7 formats listed in Table 2. For our BFP16 implementation, we use a block size of 10 in the channel dimension. These formats are chosen as they are common amongst deep learning frameworks as well as current and future CPUs and neural network accelerators. PyTorch NEURIPS2019_9015 is used for INT8 and FP32. QPyTorch zhang2019qpytorch is used to simulate bfloat16 intel2020bfloat16, FP16.16, FP16.32, BFP16 song2017computation, and TF32 nvidia2020a100. Simulation is achieved by converting the input activations and parameters of the convolution layers to the target formats. The remaining operations may use a different format as detailed in Table 2.
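Simulating a reduced-precision floating-point format amounts to rounding each FP32 input to the nearest value representable with fewer mantissa bits. The sketch below does this in pure Python for formats that keep the 8-bit FP32 exponent: bfloat16 (7 explicit mantissa bits) and TF32 (10 explicit mantissa bits). It is illustrative only, does not use the QPyTorch API, and ignores NaN, infinity and overflow handling.

```python
import struct


def round_fp32_mantissa(x, mantissa_bits):
    """Round x to FP32, then keep only `mantissa_bits` explicit mantissa bits.

    Uses round-to-nearest-even on the discarded bits. Special values
    (NaN, infinity) are not handled in this sketch.
    """
    drop = 23 - mantissa_bits
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bias = (1 << (drop - 1)) - 1 + ((bits >> drop) & 1)
    bits = ((bits + bias) >> drop) << drop
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def to_bfloat16(x):
    return round_fp32_mantissa(x, 7)


def to_tf32(x):
    return round_fp32_mantissa(x, 10)


value = 1.0 + 2.0 ** -9  # exactly representable in FP32
print(to_bfloat16(value))  # 1.0
print(to_tf32(value))      # 1.001953125
```

The example value survives conversion to TF32 but is rounded away in bfloat16, showing that the two formats share a dynamic range yet differ in how much fine detail of a weight or activation they retain.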
Table 2 (row): Signed 8-bit int | INT8 | 0 | 8 | INT8
We use post-training quantization for all formats. We experimented with using QAT for some formats but found that it did not have a significant impact on the final model quality compared to PTQ despite being much harder to integrate into a training pipeline and increasing the barriers to deployment. Hence, we leave more detailed investigations regarding QAT to future work.
We use the LJSpeech ljspeech17 dataset subsampled to 16 kHz. Prior to starting this research, 100 samples were randomly selected from the dataset for use as a validation set and 100 samples were randomly selected from the dataset for use as a test set. All remaining samples are used as a training set. No data augmentation is applied.
All experiments use a batch size of 16 distributed across 1–4 GPUs as well as mixed precision training micikevicius2017mixed. One element in the batch consists of a randomly selected 1 second segment from an audio clip in the training set, padding with zeros where necessary. We use the Adam optimizer kingma2014adam with a fixed learning rate of , , , .
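The construction of one batch element can be sketched as follows. The function name is hypothetical, the clip is assumed to be a flat sequence of samples at 16 kHz, and padding at the end of short clips is an assumption (the text does not say where the zeros are placed).

```python
import random


def random_segment(audio, segment_len=16000, rng=random):
    """Return a random 1 second window, zero-padded at the end if the clip is short."""
    if len(audio) <= segment_len:
        return list(audio) + [0.0] * (segment_len - len(audio))
    start = rng.randrange(len(audio) - segment_len + 1)
    return list(audio[start:start + segment_len])


short_clip = [0.1] * 100     # shorter than one second: padded with zeros
long_clip = [0.1] * 40000    # longer than one second: randomly cropped

padded = random_segment(short_clip)
cropped = random_segment(long_clip, rng=random.Random(0))
print(len(padded), len(cropped))  # 16000 16000
```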
We report both the teacher-forced cross-entropy validation and test loss, the Mean Opinion Score (MOS) and the compression ratio for all experiments. For the sparsity experiments we also report theoretical speedup. Each experiment is repeated 3 times with 3 different random seeds and the model with median validation loss is selected to compute and report all of these metrics.
For each generated sample in the test set the MOS is computed by asking 30 independent Amazon Mechanical Turk workers to rate the sample’s naturalness on a five point scale. The reported MOS is the mean of these scores and a 95% confidence interval is computed using the t-distribution.
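The MOS and its confidence interval can be computed as sketched below. The ratings are made up, and the critical value is passed in explicitly because the Python standard library has no t-distribution; 2.045 is the two-sided 95% value for 29 degrees of freedom (30 ratings), and a different value would apply for a different number of ratings.

```python
import math
import statistics


def mos_with_ci(ratings, t_critical):
    """Return (mean rating, half-width of the confidence interval)."""
    n = len(ratings)
    mean = statistics.mean(ratings)
    standard_error = statistics.stdev(ratings) / math.sqrt(n)
    return mean, t_critical * standard_error


# Hypothetical ratings on a five point scale from 30 raters.
ratings = [4, 5, 4, 3, 5] * 6
mos, half_width = mos_with_ci(ratings, t_critical=2.045)
print(f"MOS = {mos:.2f} +/- {half_width:.2f}")
```

Two models can then be regarded as differing significantly only when their intervals do not overlap, which is a conservative reading of the comparisons in this section.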
The compression ratio is defined as per Equation 2. When referring to compression ratio or model compression ratio the size is defined to be the total size of the model in bits. We also refer to the sparse layer compression ratio where the size is defined to be the size in bits of only the layers that are being pruned. For sparse models, some layers remain dense so the model compression ratio is lower than the sparse layer compression ratio. The theoretical speedup is defined as the number of multiply-adds in the dense model divided by the number in the sparse model blalock2020state.
Note that both the compression ratio and theoretical speedup only provide an upper bound on that achievable in deployment as, in practice, there will be extra overheads. For example, additional memory and hardware is required to store and use sparse parameters.
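The relationship between the sparse layer compression ratio, the model compression ratio and the theoretical speedup can be sketched as below. The split of the 7.2 million parameters into pruned and dense layers and the multiply-add counts are hypothetical, and the calculation ignores the storage overhead of sparse indices, so it gives an upper bound in the sense described above.

```python
def compression_ratio(pruned_params, dense_params, layer_ratio,
                      bits_before=32, bits_after=32):
    """Model size in bits before compression divided by size after.

    `layer_ratio` is the sparse layer compression ratio applied to the
    pruned layers only; dense layers keep all of their parameters.
    """
    original_bits = (pruned_params + dense_params) * bits_before
    compressed_bits = (pruned_params / layer_ratio + dense_params) * bits_after
    return original_bits / compressed_bits


def theoretical_speedup(dense_multiply_adds, sparse_multiply_adds):
    """Multiply-adds in the dense model divided by those in the sparse model."""
    return dense_multiply_adds / sparse_multiply_adds


# Hypothetical split: 7.0M parameters in pruned layers, 0.2M left dense.
model_cr = compression_ratio(7_000_000, 200_000, layer_ratio=4)
print(round(model_cr, 2))  # below the sparse layer ratio of 4

# Hypothetical operation counts per second of audio.
print(theoretical_speedup(100e9, 30e9))
```

Passing a smaller `bits_after` (for example 8 for INT8) extends the same calculation to a model that is both pruned and quantized.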
Table 3. Columns: Granularity | Structure | Compression Ratio | Val Loss | Test Loss | MOS | Theoretical Speedup. Includes a Ground Truth (Human) reference row.
The results for the sparsity experiments that use the iterative unstructured and one-shot 2:4 magnitude-based pruning techniques are presented in Table 3.
Focusing on the iterative unstructured experiments, we see that the difference between the baseline model and the models that use a sparse layer compression ratio of 2 and 4 respectively is not statistically significant. This means a model compression ratio of up to 3.83 and a theoretical speedup of up to 3.51 can be achieved without a reduction in fidelity of the synthesized audio. The models with a sparse layer compression ratio greater than 4 do exhibit a significant degradation in audio fidelity. However, this reduction in fidelity does lead to a large increase in theoretical speedup. For example, the model that uses a compression ratio of 32 in its sparse layers— the total number of parameters and the total number of operations—has a potential theoretical speedup of 12.95 over the dense baseline model whilst achieving only a 9.1% lower MOS. However, note that MOS is a subjective metric and therefore relative comparisons are difficult to interpret. Figure 2 demonstrates this trade-off between model quality and the theoretical speedup that sparsity offers.
Looking at the one-shot 2:4 magnitude-based pruning result, we see that it has a significantly lower MOS than both the baseline and the iterative unstructured model with a sparse layer compression ratio of 2. Further, its MOS is comparable to that of the iterative unstructured models that use a sparse layer compression ratio of 4 and 8. This suggests that the iterative unstructured magnitude-based pruning technique produces higher quality WaveNet models for a fixed compression ratio, or a greater theoretical speedup for a fixed quality.
Table 4. Columns: Format | Compression Ratio | Val Loss | Test Loss | MOS. Includes a Ground Truth (Human) reference row.
We present the quantization results in Table 4. We are able to obtain an audio fidelity that matches that of the baseline FP32 model by using the TF32 precision. The other formats are all equivalent to each other but they are all significantly worse than TF32 and FP32. This could be due to the higher compression ratio with these formats compared to TF32 and FP32 causing a higher information loss in the model arithmetic that leads to a quality degradation.
Sparsity and Quantization Combined
Table 5. Columns: Format, followed by Val Loss | Test Loss | CR | MOS for the sparse layer compression ratio of 4 and again for the 2:4 pattern.
Finally, we consider applying both quantization and sparsity to our WaveNet model at the same time. Due to time constraints and the large number of runs required to cover every combination of sparsity and precision previously considered, we choose to investigate only two sparsity configurations. We use a per-layer compression ratio of 4 since this produced a model with a high compression ratio that maintains a similar quality to the dense baseline. We also investigate the 2:4 sparsity pattern since we suspect that this will become a popular choice given the tools NVIDIA provides for it.
The results for these experiments are presented in Table 5. When looking at the models using the sparse layer compression ratio of 4, we find that quantization does not cause a significant degradation in quality compared to our FP32 baseline in any case besides INT8. Our best model with this sparsity pattern is the TF32 model, which achieves a MOS comparable with the baseline FP32 dense model whilst having a compression ratio of over 6.
When looking at the models that use the 2:4 sparsity pattern, we find results similar to the ones obtained using the sparse layer compression ratio of 4. The quantized models all have comparable audio fidelity to the FP32 model with no significant change in MOS result, despite having a compression ratio of up to 7.88 in the case of INT8.
4 Conclusions and Future Work
In this work we have shown that model compression can be utilised to create a WaveNet vocoder with a high compression ratio whilst still being capable of near-human quality speech synthesis. Our BFP16 model with a sparse layer compression ratio of 4 synthesizes audio with a quality that is on par with a dense FP32 baseline whilst achieving a compression ratio of 13.84.
We hypothesise that higher compression ratios may be achieved by exploiting more extreme forms of quantization, such as INT4, INT2 and even binary models, although we suspect that the use of QAT will be vital to maintain acceptable audio fidelity in these cases.
We hope that our results encourage the use of model quantization and sparsity to realise the theoretical speedup afforded by them on next generation hardware accelerators to produce high quality text-to-speech systems, although we leave the specifics of such deployments to future work.