In recent years, the quality of text-to-speech (TTS) has been significantly improved by neural vocoders such as WaveNet [van2016wavenet], Parallel WaveNet [oord2017parallel], WaveRNN [kalchbrenner2018efficient], and LPCNet [valin2019lpcnet]. These neural vocoders are usually paired with sequence-to-sequence acoustic models, e.g. Tacotron 2 [shen2018natural] and DurIAN [yu2019durian], to generate human-like speech. The WaveNet vocoder, which is the state-of-the-art model, can generate high-fidelity audio but is hard to deploy for real-time services because of its huge computational complexity. Flow-based neural vocoders, such as Parallel WaveNet [oord2017parallel], ClariNet [ping2018clarinet], and WaveGlow [prenger2019waveglow], are more practical since they can perform parallel generation on GPU devices. However, these models often suffer from phase issues since the causality prior is ignored; therefore the generated speech usually sounds muffled compared with the original auto-regressive WaveNet. Generative adversarial networks (GANs) [goodfellow2014generative] have been adopted to address these issues in Parallel WaveNet [tiangenerative, yamamoto2019probability].
Recently, efficient RNN-based sequential neural vocoders, such as WaveRNN, LPCNet, and Multi-band WaveRNN [yu2019durian], have been proposed to improve the performance of neural TTS systems. LPCNet is currently the most lightweight neural vocoder; it integrates WaveRNN-style neural synthesis with linear prediction. Meanwhile, an improved sampling strategy, as well as pre-emphasis prior to μ-law quantization, is introduced to achieve good quality with a small model size. Different from separately predicting the coarse and fine parts of the discretized speech signal as in WaveRNN, LPCNet replaces the dual softmax output layer with a single softmax output layer on the μ-law quantized signal with pre-emphasis. As a result, LPCNet can produce high-quality speech at very low computational complexity, which significantly improves the speed of the speech synthesis system. On the other hand, Multi-band WaveRNN is a variant of WaveRNN that integrates a multi-band strategy into the WaveRNN-based neural vocoder. Compared with WaveRNN, Multi-band WaveRNN can produce multiple samples per sequential step in parallel.
Neural TTS systems with low computational complexity are very important for practical applications. As reported in [valin2019lpcnet] and [yu2019durian], neither LPCNet nor Multi-band WaveRNN runs many times faster than real-time when producing high-quality speech on a single CPU core, which means that the latency of synthesizing one second of speech remains considerable. Furthermore, many applications require synthesizing speech on edge-devices, such as mobile phones, with very limited computational capacity. For this purpose, we propose the FeatherWave vocoder, which merges multi-band processing into the LPCNet framework. This makes it possible to match the quality of the state-of-the-art WaveNet vocoder with a significantly smaller computational load.
The contributions of this paper are summarized as follows: (1) We propose the multi-band (MB) linear prediction (LP) based FeatherWave vocoder. First, we adopt multi-band signal processing in the LPCNet framework. Then, we combine μ-law quantization with MB-LP to efficiently model the discretized speech signal. Benefiting from the MB-LP process, the complexity of the proposed model is significantly reduced compared with the conventional LPCNet. (2) We demonstrate that the proposed FeatherWave runs many times faster than real-time on two CPU cores using our engineered streaming inference kernel when generating 24 kHz high-fidelity speech, achieving a mean opinion score (MOS) of 4.55 in our subjective listening test.
We organize the rest of the paper as follows: Section 2 briefly reviews lightweight RNN-based neural vocoders, namely Multi-band WaveRNN and LPCNet. The proposed method is given in Section 3. Evaluation results are presented in Section 4. Finally, conclusions and future work are presented in Section 5.
2 Related work
2.1 Multi-band WaveRNN
Compared with Subscale WaveRNN [kalchbrenner2018efficient], which can generate multiple samples per step with a subscale dependency scheme, Multi-band WaveRNN exploits a multi-band generation strategy with the subband technique [okamoto2018improving, okamoto2018investigation] to improve generation speed. It predicts all subband signals simultaneously through multiple softmax output layers in a single recurrent step of WaveRNN. By using this variant of WaveRNN, the length of the generated sequence is down-sampled by a factor of B (the number of frequency bands). As a result, the total computational cost can be reduced accordingly [yu2019durian]. Before model training, the original waveform signal x should be down-sampled by invertible analysis filters into the subband waveforms x^1, ..., x^B. The joint probability of the multi-band signal can then be factorised as a product of conditional probabilities of the subband signals, conditioned on acoustic features c:

p(x) = ∏_t ∏_{b=1}^{B} p(x_t^b | x_{<t}, c),    (1)
where the conditional probability can be modeled by a recurrent neural network (RNN).
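To make the band decomposition concrete, here is a toy Python sketch: a trivial polyphase split stands in for the near-perfect-reconstruction analysis/synthesis filter banks used in practice, but it shows how B subband streams, each B times shorter, are produced and merged back.

```python
import numpy as np

def split_bands(x, n_bands):
    """Split a waveform into n_bands subband streams.

    Illustrative polyphase (decimation) split; real multi-band vocoders
    use near-perfect-reconstruction analysis filters instead.
    """
    T = len(x) - len(x) % n_bands        # trim so length divides evenly
    return x[:T].reshape(-1, n_bands).T  # shape (n_bands, T // n_bands)

def merge_bands(subbands):
    """Inverse of split_bands: interleave subband samples back."""
    return subbands.T.reshape(-1)

x = np.arange(12, dtype=np.float64)
sub = split_bands(x, 4)                  # 4 streams, each 4x shorter
assert sub.shape == (4, 3)
assert np.allclose(merge_bands(sub), x)
```

Because each recurrent step now emits one sample per stream, the number of sequential steps drops by the factor B, which is where the speedup comes from.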
2.2 LPCNet

LPCNet reduces the computational load of each generation step by exploiting the classical technique of linear prediction. Similar to GlotNet [juvela2019glotnet] and ExcitNet [song2019excitnet], which use a WaveNet to model the glottal excitation signal, LPCNet models the discretized excitation signal of the LPC filters with a WaveRNN for efficient generation. Instead of open-loop filtering approaches [juvela2018speaker], LPCNet performs closed-loop synthesis, predicting each sample by conditioning on the previously sampled excitation and the current linear prediction, which improves the quality of the generated speech.
3 The proposed method
In this section, we present the proposed variant of the WaveRNN vocoder, FeatherWave, which further improves the speed of audio generation with multi-band processing while maintaining the advantages of the LP structure of LPCNet. First, we introduce the MB-LP framework, which extends LP coding to the multi-band signal. Then, we propose the FeatherWave vocoder, which applies the MB-LP framework to the conventional neural vocoder.
3.1 Multi-band Linear Prediction
For the purpose of utilizing linear prediction to obtain good quality and multi-band processing to speed up synthesis, we introduce multi-band linear prediction (MB-LP) in the proposed model. By adopting LP analysis on the multi-band waveform signal, the P-order linear prediction coefficients of each frequency band, a_1^b, ..., a_P^b, can be extracted from the corresponding frequency bins of the mel-spectrogram frame. The b-th subband signal x^b is down-sampled from the original signal by invertible analysis filters. Under the LP assumption, the predicted signal p_t^b of the b-th band and the corresponding excitation (prediction residual) e_t^b satisfy:

p_t^b = ∑_{k=1}^{P} a_k^b x_{t-k}^b,    (2)

x_t^b = p_t^b + e_t^b.    (3)
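In closed-loop synthesis, each band's prediction is a weighted sum of its own past output samples, and the sampled excitation is added back to produce the next output. A minimal numpy sketch of one step (array layout and sizes are assumptions for illustration):

```python
import numpy as np

def mb_lp_step(history, lpc, excitation):
    """One MB-LP synthesis step for all B subbands.

    history:    (B, P) previous P output samples per band
    lpc:        (B, P) LP coefficients per band (layout is an assumption)
    excitation: (B,)   excitation values sampled from the network
    Returns the linear prediction p_t and the reconstructed samples
    x_t = p_t + e_t, as in Eq. 2 and Eq. 3.
    """
    p_t = np.sum(lpc * history, axis=1)   # weighted sum of past samples
    x_t = p_t + excitation                # add residual: closed loop
    return p_t, x_t

B, P = 4, 8
rng = np.random.default_rng(0)
p_t, x_t = mb_lp_step(rng.normal(size=(B, P)),
                      0.1 * rng.normal(size=(B, P)),
                      rng.normal(size=B))
assert p_t.shape == (B,) and x_t.shape == (B,)
assert np.allclose(x_t - p_t, x_t - p_t)
```

Since the network only has to model the residual e_t^b rather than the full signal, a much smaller model suffices.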
3.2 FeatherWave

In our proposed FeatherWave vocoder, MB-LP is introduced into the conventional WaveRNN vocoder as illustrated in Fig. 1. It consists of a condition network that operates on input frames of mel-spectrograms and a sample rate network that produces samples with multiple dual softmax output layers. Similar to the original WaveRNN, the sample rate network first predicts the coarse part of the excitation signal and then computes the fine part by conditioning on the predicted coarse signal. As indicated in Eq. 3, the subband signal is predicted from the network output excitation and the linearly predicted signal, which is computed from the previous output signal as shown in Eq. 2. As illustrated in Fig. 1, a merge band operation is applied, using synthesis filters, to reconstruct the original waveform signal from the predicted subband signals. In this paper, only mel-spectrograms, which are widely used in neural TTS systems, are adopted as input conditional features.
3.2.1 Discretized Multi-band Linear Prediction
In LPCNet, a first-order pre-emphasis filter is applied to the training data. This pre-emphasis makes it possible to model the μ-law discretized signal with high quality at a low bit depth.
As a natural extension of this technique, which helps the model learn and generate more efficiently, we also apply the pre-emphasis filter to the training signal first, and then μ-law quantize all subband signals after the MB-LP process. Similar to LPCNet, we can therefore model the μ-law discretized signal using a smaller model and achieve high-fidelity synthesis with the proposed MB-LP framework. To trade off quality against model size, we adopt μ-law quantization for each subband signal in FeatherWave.
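The quantization pipeline above can be sketched as follows; the pre-emphasis coefficient and the 8-bit depth here are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

def preemphasis(x, coef=0.85):
    """First-order pre-emphasis filter (the coefficient is an assumption)."""
    return np.append(x[0], x[1:] - coef * x[:-1])

def mulaw_encode(x, bits=8):
    """mu-law companding of x in [-1, 1] to 2**bits discrete levels."""
    mu = 2 ** bits - 1
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((y + 1) / 2 * mu + 0.5).astype(np.int64)

def mulaw_decode(q, bits=8):
    """Inverse mu-law mapping from discrete levels back to [-1, 1]."""
    mu = 2 ** bits - 1
    y = 2 * q.astype(np.float64) / mu - 1
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(mu)) / mu

x = np.linspace(-1.0, 1.0, 101)
pe = preemphasis(x)            # applied before quantization in training
q = mulaw_encode(x)            # round-trip check on the raw signal
assert pe.shape == x.shape
assert q.min() >= 0 and q.max() <= 255
assert np.max(np.abs(mulaw_decode(q) - x)) < 0.03
```

The companding allocates more levels to small amplitudes, which is why a coarse discretization still yields acceptable quantization noise after the inverse pre-emphasis de-emphasizes high frequencies.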
3.2.2 Condition Network
For a neural vocoder, the intelligibility of the generated speech is quite sensitive to the structure of the condition network. In FeatherWave, instead of using a bi-directional RNN, we adopt a stack of convolutional layers as the condition network for the purpose of streaming inference. Specifically, the local acoustic features are first processed by five convolutional layers so that the sample rate network obtains a sufficiently large receptive field. We adopt the exponential linear unit (ELU) activation after every convolutional layer for more stable training. In order to match the sampling rate of the target signal, the outputs of the condition network are simply repeated before being passed into the sample rate network; with hop size h, the number of repetitions is h/B, since the sample rate network emits B subband samples per step.
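The repetition step amounts to a frame-rate-to-sample-rate upsampling, for example (the hop size of 240 and band count of 4 are illustrative assumptions):

```python
import numpy as np

def upsample_conditions(frames, hop, n_bands):
    """Repeat per-frame condition vectors up to the sample rate network rate.

    frames: (n_frames, n_channels) condition network outputs.
    Each frame is repeated hop // n_bands times, since the sample rate
    network emits n_bands subband samples per step.
    """
    reps = hop // n_bands
    return np.repeat(frames, reps, axis=0)

cond = np.ones((10, 256))                      # 10 frames of 256-dim features
up = upsample_conditions(cond, hop=240, n_bands=4)
assert up.shape == (10 * 60, 256)
```

Simple repetition (rather than learned upsampling) keeps the condition network cheap and causal, which suits streaming inference.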
3.2.3 Sample Rate Network
In the sample rate network, the predictions computed by linear prediction are used as conditioning in the manner of closed-loop synthesis, following the method in LPCNet. The predictions thus serve as a reference signal for computing the excitations, which enhances the performance of the model. Besides, the up-sampled features from the output of the condition network and the previously generated signal are used as well. All discretized signals are passed through a trainable embedding layer before being fed into a GRU cell. Similar to the WaveRNN vocoder, we use a dual softmax layer to predict the coarse and fine parts of the discretized signal sequentially after the GRU and affine layers. A block sparse pruning [narang2017block] strategy is adopted to sparsify the parameters of the GRU layer for the purpose of speeding up inference. The output of the affine layer is passed into multiple softmax output layers to predict all subband excitations simultaneously. The parameters of the model are optimized to minimize the negative log-likelihood (NLL) loss during training.
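As an illustration of the multiple softmax output heads, the following numpy sketch computes one categorical distribution per band from a GRU state; all layer sizes, the random weights, and the single (coarse) head shown are assumptions for illustration, not the paper's configuration:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Toy dimensions: bands, levels per softmax, GRU units (all assumptions).
B, Q, H = 4, 16, 384
rng = np.random.default_rng(0)
W_affine = 0.01 * rng.normal(size=(H, H))
W_out = 0.01 * rng.normal(size=(H, B * Q))   # one softmax head per band

def band_distributions(gru_state):
    """Affine layer + per-band softmax over excitation levels."""
    h = np.tanh(gru_state @ W_affine)            # affine layer
    return softmax((h @ W_out).reshape(B, Q))    # B simultaneous softmaxes

probs = band_distributions(rng.normal(size=H))
assert probs.shape == (B, Q)
assert np.allclose(probs.sum(axis=1), 1.0)
```

Reshaping a single matrix product into B independent softmaxes is what lets all subband excitations be predicted simultaneously in one recurrent step.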
3.2.4 Generation Method
In a typical lightweight neural vocoder where a small model is adopted, it is necessary to adjust the sharpness of the output distributions to avoid noise caused by the random sampling process and achieve better quality. In FFTNet [jin2018fftnet] and iLPCNet [hwang2020improving], lowering the temperature in voiced regions with a constant factor is exploited for this purpose. Rather than using voicing information, LPCNet adopts the pitch correlation to adjust the temperature factor. Furthermore, a constant threshold is subtracted from the distribution to prevent impulse noise caused by low-probability samples.
Since only mel-spectrograms are used in the condition network, we explore the technique of distribution subtraction carefully for better performance. We observed that a moderate temperature produced good results when trading off quality against artifacts in the generated speech. The subtraction is only performed on the distribution of the fine part, which is given as follows:

P'(e_f) = R(max[P(e_f) - T, 0]),    (4)

where P(e_f) denotes the distribution of the fine part, T is the constant threshold, and R denotes the normalizing operator.
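A sketch of this sampling step (the threshold value below is an assumption; the paper's constant was tuned):

```python
import numpy as np

def sample_fine(probs, threshold=0.002, rng=None):
    """Sample the fine part after distribution subtraction.

    Subtract a constant threshold, clip at zero, renormalize, then draw
    from the adjusted distribution, so low-probability tail levels that
    would cause impulse noise can never be sampled.
    """
    rng = rng or np.random.default_rng()
    p = np.maximum(probs - threshold, 0.0)   # subtract constant threshold
    p /= p.sum()                             # normalizing operator R
    return rng.choice(len(p), p=p)

probs = np.full(256, 1.0 / 256)              # uniform toy distribution
idx = sample_fine(probs, rng=np.random.default_rng(0))
assert 0 <= idx < 256
```

Zeroing the tail is a harder intervention than temperature scaling alone: it guarantees that levels below the threshold are never drawn, rather than merely made less likely.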
3.3 Two-stage Sparse Pruning
In [kalchbrenner2018efficient], a GRU with block sparse weights is vital for achieving fast inference in neural vocoders. In this work, in order to improve the performance of the block sparse strategy, we apply a novel two-stage sparse pruning (TSSP) method to reach a high sparsity ratio in the GRU weights.
In conventional block sparsity pruning methods, a high sparsity ratio (above 40%) usually degrades model performance, as mentioned in [yao2019balanced]. In practice, a high sparsity ratio usually hurts the speech quality of neural vocoders, although it speeds up inference. To address this problem, we adopt a two-stage sparse pruning strategy consisting of a warming-up stage and an increasing stage. First, we train a sparse model to a moderate warming-up sparsity ratio, which avoids hurting model performance in the warming-up stage. In the increasing stage, we increase the sparsity ratio progressively over several loops to reach the target sparsity ratio, raising it by a fixed increment in each loop. The sparsity ratio is held constant for a fixed number of iterations after the warming-up ratio, and after the target ratio of each loop in the increasing stage, is reached.
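The schedule can be written as a pure function of the training step; every constant below (ratios, iteration counts, number of loops) is an illustrative assumption, not the paper's configuration:

```python
def tssp_sparsity(step, warmup_ratio=0.4, target_ratio=0.9,
                  warmup_steps=100_000, hold_steps=20_000,
                  loop_steps=20_000, n_loops=5):
    """Two-stage sparse pruning schedule (all constants are assumptions).

    Stage 1 ramps sparsity up to warmup_ratio and holds it; stage 2 raises
    it toward target_ratio in n_loops equal increments, holding after each
    loop. Returns the sparsity ratio to enforce at the given training step.
    """
    if step < warmup_steps:                        # warming-up ramp
        return warmup_ratio * step / warmup_steps
    step -= warmup_steps + hold_steps
    if step < 0:                                   # hold at warm-up ratio
        return warmup_ratio
    inc = (target_ratio - warmup_ratio) / n_loops
    ratio = warmup_ratio
    for _ in range(n_loops):                       # increasing-stage loops
        if step < loop_steps:                      # ramp within the loop
            return min(ratio + inc * step / loop_steps, target_ratio)
        step -= loop_steps
        ratio += inc
        if step < hold_steps:                      # hold after each loop
            return ratio
        step -= hold_steps
    return target_ratio

assert tssp_sparsity(0) == 0.0
assert abs(tssp_sparsity(100_000) - 0.4) < 1e-9
assert tssp_sparsity(10**9) == 0.9
```

The hold periods give the remaining dense weights time to re-adapt before the next pruning increment, which is the intuition behind avoiding the quality drop of pruning straight to a high ratio.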
4 Experiments
4.1 Data Set
In our experiments, we used a Mandarin corpus of 20 hours of recordings from a professional broadcaster. The data was split into a training set and a test set: most of the recordings were used for model training and the rest for testing. All the recordings were down-sampled to the target sampling rate. 80-dimensional mel-spectrograms were extracted as the conditions for all neural vocoders in our experiments, following the method in [shen2018natural].
4.2 Experimental Setup
To demonstrate that the proposed model accelerates speech synthesis without degrading speech quality, we chose LPCNet, which is open-sourced and known as the fastest high-quality neural vocoder, as the baseline. In the LPCNet baseline system, we used the open-sourced implementation (https://github.com/mozilla/LPCNet/) with the configuration exactly the same as the original version, i.e. a sparse GRU layer followed by a smaller dense GRU layer. Since the open-sourced LPCNet implementation only generates 16 kHz audio, we down-sampled the generated speech of the proposed model for a fair comparison. The original 24 kHz speech of our model is included in the comparison as well. In order to observe the gap between the proposed model and the state-of-the-art neural vocoder, a WaveNet with a mixture of logistics (MoL) output layer was also adopted for comparison. For robustness and stability, we chose the MoL WaveNet variant described in [tiantencent], with all configurations the same as mentioned there.
In the proposed FeatherWave vocoder, conv1d layers with kernel size 13 and 256 channels were used in the condition network. In the sample rate network, block sparsity was applied to a GRU with 384 hidden units, pruned to the final target sparsity ratio. We used multiple frequency bands with μ-law quantization for the dual softmax layers, so the last FC layer before the softmax outputs one distribution pair per band. For modeling and reconstructing the subband signals, we followed the design of the analysis and synthesis filters in [nguyen1994near]. Instead of adopting cepstrums [valin2019lpcnet], the LP coefficients were estimated from the mel-spectrograms as in [korostikstc].
In the training phase, the Adam [kingma2014adam] optimizer was adopted. The proposed model was trained on a single GPU with a mini-batch size of 1536 samples. The weights of the neural vocoders were randomly initialized with a fixed random seed, and all the networks were trained for the same number of iterations. In the two-stage sparse pruning of FeatherWave, the model was first pruned to the warming-up sparsity ratio and held there for a fixed number of iterations before continuing to the increasing stage. In every loop of the increasing stage, the sparsity ratio was increased over a fixed number of iterations and then held for the same number of iterations. After four loops of the increasing stage, the final sparsity ratio was reached, which is the same as in the LPCNet. Block sparsity with a fixed block size was adopted in our pruning experiments.
4.3 Synthesis Speed
Table 1: Synthesis speed (over real-time) on a single core and on two cores.
We first estimated the computational complexity of different vocoders to reveal the speedup of the proposed FeatherWave. The main complexity of FeatherWave comes from one sparse GRU and four fully-connected layers. We compute it following the method in [valin2019lpcnet]; the dominant terms depend on N, the size of the sparse GRU; d, the density of the sparse GRU; Q, the root of the number of μ-law levels; A, the width of the affine layer connected with the final fully-connected layer; B, the number of frequency bands; and F_s, the sampling rate. With our model configuration, the total complexity of FeatherWave is substantially smaller than that of the conventional LPCNet.
The synthesis speeds over real-time of different vocoders are listed in Table 1. All the speed tests were performed on an Intel Xeon Platinum 8255C CPU. The results show that merging multi-band processing into the LPCNet framework brings a significant speedup when generating 16 kHz speech. When producing high-fidelity 24 kHz speech, FeatherWave runs faster than real-time using our engineered multi-thread inference kernel on two CPU cores. Additionally, our implementation of Parallel WaveNet [tiangenerative] requires 8 cores to achieve a similar synthesis speed.
First, a subjective evaluation was conducted to measure the MOS of the perceptual quality of the proposed FeatherWave vocoder. In order to perform a fair comparison, we randomly selected 40 utterances from the test set for MOS testing, and 30 native Mandarin speakers participated in the listening test.
The results² of the subjective MOS evaluation are presented in Table 2. The results show that the proposed FeatherWave can generate high-quality 16 kHz speech with a slightly better MOS than LPCNet. When producing high-fidelity speech at a higher sampling rate (24 kHz), the proposed FeatherWave achieves a MOS within a small gap of the powerful MoL WaveNet, which consists of 24 dilated conv1d layers. Since we use mel-spectrograms to estimate the LP filters, the proposed model does not depend on pitch extraction; as a result, it produces fewer artifacts in the generated speech and is easier than LPCNet to build into a neural TTS system. Furthermore, our model produces less quantization noise and fidelity loss than LPCNet, as μ-law quantization with a dual softmax layer is used rather than a single softmax output.

² A subset of generated samples can be found at the following URL: https://wavecoder.github.io/FeatherWave/
|Model||MOS on speech quality|
|FeatherWave (16k)||4.51 ± 0.03|
|FeatherWave (24k)||4.55 ± 0.03|
|MoL WaveNet||4.58 ± 0.02|
We also investigated the effectiveness of the two-stage sparse pruning method through objective NLL results. Lower NLL usually indicates better quality of the speech generated by a neural vocoder [kalchbrenner2018efficient]. The results in Table 3 show that the model achieves a lower NLL than the baseline after applying the proposed two-stage sparse pruning method, which lowers the probability of poor pruning choices compared with conventional pruning methods. Consequently, we obtain better speech quality in FeatherWave with this improvement.
|Model||NLL|
|FeatherWave w/o TSSP||4.14|
|FeatherWave w/ TSSP||4.07|
5 Conclusions and future work
In this work, we proposed the FeatherWave vocoder, which applies the MB-LP method to a conventional RNN-based neural vocoder such as WaveRNN. For faster generation while utilizing the linearity of the LP filters, we merged multi-band processing into the LPCNet framework, conditioning only on mel-spectrograms. Furthermore, we made other contributions, including the discretized multi-band linear prediction and the two-stage sparse pruning. Our experimental results indicate that the proposed FeatherWave further reduces the computational cost of speech generation while achieving higher speech quality than conventional neural vocoders.
In future work, we will investigate FeatherWave with low-bit quantization and the balanced sparsity [yao2019balanced] pruning training method for deployment on edge-devices.
The authors would like to thank Yi Xie and Ciyong Chen in IAGS, Intel Asia-Pacific Research & Development Co., Ltd. They not only provided guidance on achieving good performance on Intel(R) Xeon(R) Scalable Processors, but also helped optimize and validate our algorithm with Intel(R) Deep Learning Boost using the bfloat16 (BF16) format on upcoming hardware.