A Streamwise GAN Vocoder for Wideband Speech Coding at Very Low Bit Rate

08/09/2021 ∙ by Ahmed Mustafa, et al. ∙ GMX Fraunhofer

Recently, GAN vocoders have seen rapid progress in speech synthesis, starting to outperform autoregressive models in perceptual quality with much higher generation speed. However, autoregressive vocoders are still the common choice for neural generation of speech signals coded at very low bit rates. In this paper, we present a GAN vocoder which is able to generate wideband speech waveforms from parameters coded at 1.6 kbit/s. The proposed model is a modified version of the StyleMelGAN vocoder that can run in a frame-by-frame manner, making it suitable for streaming applications. The experimental results show that the proposed model significantly outperforms prior autoregressive vocoders like LPCNet for very low bit rate speech coding, with a computational complexity of about 5 GMACs, providing a new state of the art in this domain. Moreover, this streamwise adversarial vocoder delivers quality competitive with advanced speech codecs such as EVS at 5.9 kbit/s on clean speech, which motivates further usage of feed-forward fully-convolutional models for low bit rate speech coding.




1 Introduction

Despite decades of extensive work, classical speech coders offer very low quality at bit rates under . New techniques based on neural networks have shown breakthrough advancements in this area in recent years, enabling compression factors much higher than conventional approaches while maintaining acceptable quality. Neural speech coders are based on the classical encoder-decoder scheme: the encoder analyzes the input signal and extracts a set of acoustic features, which are then quantized, coded and transmitted; the decoder reconstructs the input signal using the information contained in the received bit stream. In neural speech coders a generative neural network plays the role of the decoder (i.e., neural vocoder), as illustrated in Figure 1. It was demonstrated [11, 24] that conditioning a neural vocoder with coded acoustic parameters could produce natural wideband speech at bit rates lower than .

In recent years, neural vocoders [25, 19, 9, 23] have revolutionized fields such as text-to-speech, voice conversion and speech enhancement, generating speech of unprecedented quality. Most of these solutions, however, are not suitable for speech coding purposes. This is mainly due to their high computational complexity or very slow generation speed, and to the clear quality degradation they show when conditioned on coarsely quantized features.

Figure 1: High-level block-diagram of a neural speech coder.

Neural vocoders based on generative adversarial networks (GANs) [6] were recently shown to be competitive and viable alternatives to autoregressive and flow-based models for speech synthesis applications [15, 14, 17]. However, they are by design not suited for streaming or real-time speech communication, since they take advantage of heavy parallelization to process large blocks of conditioning information at once. This permits efficient generation of speech waveforms in one shot, but relies on acoustic features that encode information about future samples; these are not available in a streaming scenario, since waiting for them would cause high algorithmic delay. Moreover, GAN vocoders work particularly well with homogeneous speech representations such as mel-spectrograms, whereas speech coding applications primarily use non-homogeneous (e.g., parametric) speech representations that may not easily condition GAN vocoders for high-quality signal generation.

To solve the above-mentioned issues, our contributions in this work are twofold:

  • We propose Streamwise StyleMelGAN (SSMGAN), a modified StyleMelGAN vocoder for frame-by-frame generation of wideband speech at low delay, with reasonable computational complexity.

  • We demonstrate that SSMGAN is able to generate high-quality speech even when conditioned with a parametric and highly compressed representation provided by the encoder of LPCNet [24], which delivers a bitstream to our StyleMelGAN-based vocoder.

2 Related Works

The research on neural vocoders is a very active field with new models being presented every few months. For this reason, here we only refer to some of the ones which sparked the most attention. The first family to appear was the one of autoregressive models [25, 9, 23], followed by flow-based models [19], and then GANs [15, 27, 4, 14, 17].

The first work to show the feasibility of low bit rate neural speech coding was [11], using a WaveNet decoder. However, the complexity of the decoder network makes it impossible to deploy in practical applications. The complexity issue was partially tackled with a different approach in [13]. Finally, the LPCNet model [24] introduced optimizations which made neural speech coding possible on edge devices. Moreover, the coding scheme used in LPCNet has a very low bit rate of 1.6 kbit/s. The coding parameters include acoustic features classically used in parametric speech coding, i.e., the Bark-scale cepstrum, the pitch information and the energy. Table 1 describes these parameters in detail, together with the bit budget allocated to code them.

Coding Parameter            Bits/packet
Pitch lag                   6
Pitch modulation            3
Pitch correlation           2
Energy                      7
Cepstrum absolute coding    30
Cepstrum delta coding       13
Cepstrum interpolation      3
Total                       64
Table 1: LPCNet coding parameters and their bit allocation for a packet.
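As a quick consistency check on Table 1, the sketch below (plain Python; all variable names are illustrative and not part of the original work) sums the listed fields and relates the 64-bit packet to the 1.6 kbit/s rate quoted in the abstract. The bits for cepstrum interpolation are inferred here as the remainder of the 64-bit total, which is an assumption of this sketch.

```python
# Bit budget per LPCNet packet, taken from the rows of Table 1.
listed_bits = {
    "pitch_lag": 6,
    "pitch_modulation": 3,
    "pitch_correlation": 2,
    "energy": 7,
    "cepstrum_absolute": 30,
    "cepstrum_delta": 13,
}
total_bits = 64  # "Total" row of Table 1

# Assumption: whatever is not listed above is spent on cepstrum interpolation.
interpolation_bits = total_bits - sum(listed_bits.values())

bit_rate = 1600  # bit/s, i.e. the 1.6 kbit/s quoted in the abstract
packet_duration_ms = 1000 * total_bits / bit_rate
packets_per_second = bit_rate / total_bits

print(interpolation_bits, packet_duration_ms, packets_per_second)
```

Under these two numbers, one packet has to cover 40 ms of speech, so 25 packets are sent per second.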

LPCNet’s decoder is an autoregressive architecture based on WaveRNN generating sample-by-sample wideband speech (). It relies on linear prediction to reduce computational complexity, hence generating the signal in the residual linear prediction domain. The decoding step is divided into two parts: a frame-rate network that computes the conditioning for every frame using the coded parameters, and a sample-rate network that computes the conditional sampling probabilities. LPCNet predicts the new excitation sample using the previously generated excitation and speech samples, as well as the current linear prediction sample from the th-order linear prediction.

More recent work [12] presented a new neural speech decoder (Lyra) compressing speech at . The encoder directly codes stacked mel-spectra and the decoder uses noise suppression and variance regularization to improve the quality of out-of-distribution samples. When compared to the proposed solution, Lyra is conditioned on a substantially different bit stream and works under different conditions (e.g., noisy speech).

To the best of our knowledge, there exists no prior GAN vocoder which allows frame-by-frame generation of speech at low delay or which provides high quality speech synthesis conditioned on a coded bit stream.

3 Streamwise StyleMelGAN Vocoder (SSMGAN)

3.1 Baseline StyleMelGAN

StyleMelGAN [17] is a lightweight neural vocoder allowing synthesis of high-fidelity speech with low computational complexity. It employs Temporal Adaptive DE-normalization (TADE) to style a noise vector with the acoustic features of the target speech (e.g., mel-spectrogram) via instance normalization and elementwise modulation. More precisely, it adaptively learns the modulation parameters $\gamma$ and $\beta$ from the acoustic features, and then applies the transformation $\gamma \odot c + \beta$, where $c$ is the normalized content of the input activation. For efficient training, multiple random-window discriminators adversarially evaluate the speech signal analyzed by a set of Pseudo-Quadrature Mirror Filter (PQMF) [18] filter banks, with the generator regularized by a multi-resolution STFT loss. All convolutions in StyleMelGAN are non-causal and run as a moving average on sliding windows of the input tensors. This results in a significant amount of algorithmic delay due to the deep hierarchical structure of the model. In the following, we describe the major modifications to this baseline model that enable generation at very low delay with different acoustic features for conditioning.
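To make the modulation step concrete, here is a minimal sketch in plain Python. It assumes the usual scale-and-shift form of adaptive de-normalization, with a per-element scale called `gamma` and a shift called `beta` here; the function name and the toy values are illustrative only. In the actual model both parameters would be predicted from the acoustic features by learned layers, which this sketch does not attempt to reproduce.

```python
def tade_modulate(normalized, gamma, beta):
    """Elementwise modulation: gamma * c + beta for every activation c.

    All three arguments are lists of channels, each a list over time,
    and are expected to have identical shapes.
    """
    return [
        [g * c + b for c, g, b in zip(c_row, g_row, b_row)]
        for c_row, g_row, b_row in zip(normalized, gamma, beta)
    ]


# Toy example: 2 channels x 3 time steps (all values are made up).
normalized = [[-1.0, 0.0, 1.0], [0.5, -0.5, 0.0]]
gamma = [[2.0, 2.0, 2.0], [1.0, 1.0, 1.0]]
beta = [[0.5, 0.5, 0.5], [0.0, 0.0, 0.0]]
styled = tade_modulate(normalized, gamma, beta)
print(styled)
```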

3.2 Streamwise Convolution

There are two requirements to operate a convolutional model in a streaming manner with low algorithmic delay. First, the dependency on future inputs to predict the current output should be as low as possible. We achieve this by enforcing all convolutions in StyleMelGAN to be causal so that the model has zero delay. The second requirement is to generate the output frame by frame, as new input information becomes available. This condition is fulfilled in StyleMelGAN by adding an internal memory buffer to the causal convolutions in inference mode, as described in [20] and illustrated in Figure 2. Each causal convolution stores a buffer containing the last input samples used for generating the previous output frame, which are then reused once the new input sample is available. By applying the above modifications to StyleMelGAN, we obtain Streamwise StyleMelGAN (SSMGAN), which is able to generate speech signals frame by frame with no delay between the input conditioning features and the output waveform.

Figure 2: Diagrams for non-streaming convolution (left) and streaming convolution (right)
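The buffering idea can be sketched with a single-channel, one-dimensional convolution in plain Python. This is a toy illustration under stated assumptions, not the authors' implementation: the class and function names are hypothetical, and a real model would use multi-channel tensor convolutions. The point it shows is that a causal convolution fed frame by frame, with the last `k - 1` inputs kept in memory, reproduces the output of the same convolution applied to the whole signal at once.

```python
from collections import deque


def causal_conv_offline(x, kernel):
    """Causal 1-D convolution over a whole signal (zero left-padding)."""
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(x)
    # Output n depends only on inputs n-k+1 .. n, never on future samples.
    return [sum(w * padded[n + i] for i, w in enumerate(kernel))
            for n in range(len(x))]


class StreamingCausalConv:
    """The same convolution, fed frame by frame with an internal buffer."""

    def __init__(self, kernel):
        self.kernel = list(kernel)
        # The buffer holds the last k-1 inputs seen so far. It starts as
        # zeros, which matches the zero left-padding of the offline version.
        size = len(self.kernel) - 1
        self.buffer = deque([0.0] * size, maxlen=size)

    def process(self, frame):
        context = list(self.buffer) + list(frame)
        out = [sum(w * context[n + i] for i, w in enumerate(self.kernel))
               for n in range(len(frame))]
        self.buffer.extend(frame)  # keeps only the newest k-1 samples
        return out


signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
kernel = [0.5, 0.25, 0.25]
conv = StreamingCausalConv(kernel)
streamed = conv.process(signal[:2]) + conv.process(signal[2:4]) + conv.process(signal[4:])
offline = causal_conv_offline(signal, kernel)
print(streamed == offline)
```

Because each output only reads the buffer and the current frame, the frame size can be chosen freely without changing the result.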

3.3 Channel Normalization

It is not feasible to run instance normalization [22] in SSMGAN, as the normalization statistics are estimated along the temporal dimension of the input activations, which in frame-by-frame inference would require input that is not yet available. We replace instance normalization with channel normalization [16], which estimates the statistics along the channel dimension instead. Interestingly, we found that this normalization maintains the model performance and keeps the training fast. It also avoids the creation of subtle clicking artifacts that sometimes occur when training StyleMelGAN with instance normalization on a multi-speaker dataset.
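The difference between the two normalizations can be illustrated with a small sketch in plain Python. Activations are stored as a list of channels, each a list over time; the function names are illustrative, and the learned scale and offset that normalization layers usually carry are omitted here for brevity.

```python
import math


def instance_norm(x, eps=1e-5):
    """Normalize each channel over time; needs the whole sequence."""
    out = []
    for channel in x:
        mean = sum(channel) / len(channel)
        var = sum((v - mean) ** 2 for v in channel) / len(channel)
        out.append([(v - mean) / math.sqrt(var + eps) for v in channel])
    return out


def channel_norm(x, eps=1e-5):
    """Normalize each time step over channels; needs only that step."""
    num_channels, num_steps = len(x), len(x[0])
    out = [[0.0] * num_steps for _ in range(num_channels)]
    for t in range(num_steps):
        column = [x[c][t] for c in range(num_channels)]
        mean = sum(column) / num_channels
        var = sum((v - mean) ** 2 for v in column) / num_channels
        for c in range(num_channels):
            out[c][t] = (column[c] - mean) / math.sqrt(var + eps)
    return out


x = [[1.0, 2.0, 3.0, 10.0],
     [0.0, 1.0, 5.0, -2.0],
     [4.0, 4.0, 1.0, 0.5]]
prefix = [row[:2] for row in x]

# Channel normalization of the first two steps ignores later input ...
channel_is_streamable = channel_norm(prefix) == [row[:2] for row in channel_norm(x)]
# ... whereas instance normalization changes once more samples arrive.
instance_is_streamable = instance_norm(prefix) == [row[:2] for row in instance_norm(x)]
print(channel_is_streamable, instance_is_streamable)
```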

3.4 Modified TADE Residual Block

The TADE residual blocks are slightly modified from the original model, as shown in Figure 3. The complexity of SSMGAN is reduced by using a single TADE conditioning layer and applying the same modulation parameters twice, rather than having two separate TADE layers in the residual block. With this modification, the total number of model parameters reduces from to .

Figure 3: Modified TADE residual block for the SSMGAN.

3.5 Multiband Generation

SSMGAN further reduces the complexity compared to the baseline model by introducing multiband synthesis as in [29, 28]. Rather than synthesizing the whole band of the speech signal in time domain at the output sampling rate , the generator outputs simultaneously different sub-bands sampled at Hz, with and . By design, SSMGAN generates the sub-bands as an -channels output, which is then fed to a PQMF synthesis filter-bank to obtain a frame of synthesized speech. Since the PQMF uses a filter prototype with 50% of overlap, it incurs a delay of frame.

3.6 Conditioning on Coded LPCNet Features

Finally, we condition SSMGAN with coded parameters in real time to run as a speech decoder. Instead of providing the mel-spectrogram as an intermediate representation, the coded parameters obtained by the LPCNet encoder at 1.6 kbit/s are introduced to the generator network. The pitch lag was found to be critical for high-quality synthesis, and hence it is processed separately from the rest of the conditioning information. More precisely, the coded cepstral and energy parameters are passed through a simple causal convolutional layer to obtain an 80-channel representation used for conditioning the generation from the prior signal. This prior is not created from latent random noise, but rather from a learned embedding of the pitch lag which is then multiplied elementwise by the pitch correlation. Figure 4 shows the complete architecture of the proposed SSMGAN conditioned on the LPCNet coded parameters. With this setting, SSMGAN can generate wideband speech frames of length and total delay of , where is introduced by the original extraction of the LPCNet coding packets, while are added by the PQMF synthesis filter-bank.

Figure 4: The SSMGAN generator. Dimensions are given for generating frame at sampling rate. The cepstral coefficients pass through a simple convolutional layer to obtain a representation of 80 channels.
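The construction of the prior from the pitch information can be sketched as follows in plain Python. Everything numeric here is an assumption for illustration: the embedding table is filled with random numbers instead of learned weights, its width `EMBED_DIM` is hypothetical, and the number of entries simply follows from treating the 6-bit pitch lag of Table 1 as an index.

```python
import random

random.seed(0)

NUM_LAGS = 2 ** 6   # assumption: the 6-bit pitch lag is used directly as an index
EMBED_DIM = 8       # hypothetical width; the real embedding size is not given here

# Stand-in for a learned embedding table (random values, not trained weights).
embedding = [[random.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]
             for _ in range(NUM_LAGS)]


def prior_frame(pitch_lag_index, pitch_correlation):
    """Look up the pitch-lag embedding and scale it by the pitch correlation."""
    vector = embedding[pitch_lag_index]
    return [pitch_correlation * v for v in vector]


voiced = prior_frame(10, 1.0)    # full correlation keeps the embedding as is
unvoiced = prior_frame(10, 0.0)  # zero correlation silences the prior
print(len(voiced), unvoiced)
```

One way to read this design is that the pitch correlation acts as a gate on the pitch information: the weaker the correlation, the less the pitch embedding contributes to the prior.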

4 Experiments

4.1 Experimental setup

The training procedure and hyperparameters are very similar to the ones described in [17]. We train SSMGAN using one NVIDIA Tesla V100 GPU on the VCTK corpus [26] at . The conditioning features are calculated as in [23], as described in Section 2. The generator is pretrained for steps using the Adam optimizer [10] with learning rate , . When starting the adversarial training, we set and use the multi-scale discriminator described in [15], trained via the Adam optimizer with , and same . The batch size is and for each sample in the batch we extract a segment of length . The adversarial training lasts for about steps.

4.2 Subjective evaluation

We conducted a subjective listening test following the ITU-R MUSHRA [5] recommendation, comparing classical and neural speech coders. The test set is composed of 12 utterances by 10 different speakers in 4 different languages. All speakers and 3 out of 4 languages are unseen during training. Most of the utterances (10 out of 12) come from unseen proprietary databases. The results obtained with 16 expert listeners are shown in Figure 5.

Figure 5: MUSHRA listening test results using -distribution.

The anchor is generated using the Speex speech decoder employed at a bit rate of . Two state-of-the-art neural decoders were considered: LPCNet at 1.6 kbit/s and Lyra at , as well as two classical but still widely used codecs: AMR-WB [1] at and the recent 3GPP EVS [2] at 5.9 kbit/s. The condition Lyra at was generated using the release v0.0.1 [7] with the default setting. EVS at 5.9 kbit/s works with a variable bit rate (VBR), and 5.9 kbit/s reflects the average bit rate on active frames. During a long inactive phase, EVS switches to a non-transmission mode (DTX), transmitting packets only periodically at a bit rate as low as . Since the test items only contain short pauses between sentences, the DTX mode plays a minor role in this test.

LPCNet was trained on the VCTK dataset. One difference from the original work is that we do not apply a domain adaptation by first training on unquantized and then fine-tuning on quantized features, since this was found to make no difference on VCTK. In addition, since VCTK is noisier and much more diverse than the NTT database used in the original work, we removed the data augmentation, since it was found to be detrimental to the final quality (demo samples are available at https://fhgspco.github.io/ssmgan_spco/). The publicly available version of the Lyra model was not retrained on VCTK, and hence it is not directly comparable with SSMGAN or LPCNet in this case. It was nonetheless taken into consideration as it offers a reproducible benchmark.

4.3 Objective evaluation

Our solution was also compared to the other neural decoders using different objective metrics. Since it is known that objective speech quality models like POLQA [3] are not reliable for non-waveform-preserving coding schemes, and in particular for neural decoders, we also considered the newly introduced objective metric WARP-Q [8], which was designed for this purpose. STOI [21], assessing the speech intelligibility, is also added, and the scores measured on 824 test items of VCTK are reported in Table 2.

Speech decoders    POLQA    STOI     WARP-Q
Speex              2.022    0.720    1.074
AMR-WB             3.202    0.863    0.784
EVS                3.675    0.890    0.805
LPCNet             2.628    0.777    0.915
Lyra               2.649    0.794    0.958
SSMGAN             2.719    0.830    0.826
Table 2: Average objective scores for neural decoders. For POLQA-MOS and STOI higher scores are better, while for WARP-Q lower scores are better (confidence intervals are negligible).

SSMGAN at 1.6 kbit/s scores the best among the neural coding solutions across all three metrics, which is in agreement with the subjective listening test. Moreover, the results of our MUSHRA listening test show that these objective metrics do not fully reflect the perceived quality of the generated speech, disproportionately disfavouring generative models.
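The ranking among the neural decoders can be re-derived directly from the Table 2 entries. The short sketch below (plain Python; the dictionary layout and helper name are illustrative) picks the best decoder per metric, taking into account that lower is better for WARP-Q.

```python
# Scores of the three neural decoders, copied from Table 2.
scores = {
    "LPCNet": {"POLQA": 2.628, "STOI": 0.777, "WARP-Q": 0.915},
    "Lyra":   {"POLQA": 2.649, "STOI": 0.794, "WARP-Q": 0.958},
    "SSMGAN": {"POLQA": 2.719, "STOI": 0.830, "WARP-Q": 0.826},
}
higher_is_better = {"POLQA": True, "STOI": True, "WARP-Q": False}


def best_decoder(metric):
    """Return the decoder with the best score for the given metric."""
    pick = max if higher_is_better[metric] else min
    return pick(scores, key=lambda name: scores[name][metric])


winners = {metric: best_decoder(metric) for metric in higher_is_better}
print(winners)
```

Note that the same table also shows the classical codecs AMR-WB and EVS ahead of SSMGAN on all three objective metrics, which is the gap between objective scores and perceived quality that the MUSHRA results point to.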

4.4 Complexity

The main contribution to SSMGAN’s computational complexity stems from the convolutions in the TADEResBlocks and the upsampling layers. If denotes the channel dimension, the size of the convolutional kernels, and the dimension of the input features, then (ignoring activations and lower order terms) the evaluation of a TADEResBlock takes multiply accumulate operations (MAC) per output sample. Furthermore, an upsampling layer with kernel size and channel dimension takes MAC. With , , and TADEResBlock output sampling rates of and this accumulates to

A comparison with other neural vocoders used for neural speech coding is given in Table 3. It should be noted that the convolutional structure of SSMGAN allows for efficient parallel execution, which gives it a decisive advantage over autoregressive models on GPUs. The current unoptimized PyTorch implementation achieves about real-time frame-by-frame inference using four cores of an Intel(R) Core(TM) i7-6700 3.40GHz CPU. The above complexity calculations show that the next step will be to work on an efficient implementation for mobile devices, which will be the subject of future work.

Model Complexity
SSMGAN (ours)
LPCNet [24]
Multi-band WaveRNN [29]
Table 3: Complexity of common neural vocoders for speech coding.

5 Conclusion

In this paper, we introduce SSMGAN, a neural speech decoder that delivers state-of-the-art quality with low delay and low complexity while operating at a very low bit rate. We assess its quality against existing neural autoregressive models and modern speech codecs at low bit rates, with both objective scores and subjective listening tests. We show for the first time that GAN vocoders can perform fast streaming speech synthesis with low algorithmic delay, and that they can achieve high-quality synthesis when conditioned on compact parametric speech representations.


  • [1] 3GPP (2009-12) Speech codec speech processing functions; Adaptive Multi-Rate - Wideband (AMR-WB) speech codec; Transcoding functions. TS Technical Report 26.190, 3rd Generation Partnership Project (3GPP). Cited by: §4.2.
  • [2] 3GPP (2014-12) TS 26.445, EVS Codec Detailed Algorithmic Description; 3GPP Technical Specification (Release 12). TS Technical Report 26.445, 3rd Generation Partnership Project (3GPP). Cited by: §4.2.
  • [3] J.G. Beerends, C. Schmidmer, J. Berger, M. Obermann, R. Ullmann, J. Pomy, and M. Keyhl (2013-06) Perceptual Objective Listening Quality Assessment (POLQA), the third generation ITU-T standard for end-to-end speech quality measurement part I — temporal alignment. journal of the audio engineering society 61 (6), pp. 366–384. Cited by: §4.3.
  • [4] M. Bińkowski, J. Donahue, et al. (2020) High fidelity speech synthesis with adversarial networks. In International Conference on Learning Representations, Cited by: §2.
  • [5] R. BS.1534 (2003) Method for the subjective assessment of intermediate quality levels of coding systems. Technical report ITU-R. Cited by: §4.2.
  • [6] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, et al. (2014) Generative Adversarial Nets. In Advances in NeurIPS 27, pp. 2672–2680. Cited by: §1.
  • [7] (2021)(Website) External Links: Link Cited by: §4.2.
  • [8] W. A. Jassim, J. Skoglund, M. Chinen, and A. Hines (2021) WARP-Q: Quality Prediction For Generative Neural Speech Codecs. In ICASSP 2021, IEEE International Conference on Acoustics, Speech and Signal Processing, External Links: 2102.10449 Cited by: §4.3.
  • [9] N. Kalchbrenner, E. Elsen, K. Simonyan, Noury, et al. (2018-10–15 Jul) Efficient neural audio synthesis. In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, pp. 2410–2419. Cited by: §1, §2.
  • [10] D. P. Kingma and J. Ba (2015) Adam: A method for stochastic optimization. ICLR. Cited by: §4.1.
  • [11] W. B. Kleijn, F. S. C. Lim, A. Luebs, J. Skoglund, F. Stimberg, Q. Wang, and T. C. Walters (2018) WaveNet Based Low Rate Speech Coding. In ICASSP 2018, IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 676–680. Cited by: §1, §2.
  • [12] W.B. Kleijn, A. Storus, M. Chinen, T. Denton, F.S.C. Lim, A. Luebs, J. Skoglund, and H. Yeh (2021) Generative speech coding with predictive variance regularization. In ICASSP 2021, IEEE International Conference on Acoustics, Speech and Signal Processing, External Links: 2102.09660 Cited by: §2.
  • [13] J. Klejsa, P. Hedelin, C. Zhou, R. Fejgin, and L. Villemoes (2019) High-quality Speech Coding with SampleRNN. In ICASSP 2019, IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 7155–7159. Cited by: §2.
  • [14] J. Kong, J. Kim, and J. Bae (2020) HiFi-gan: generative adversarial networks for efficient and high fidelity speech synthesis. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33, pp. 17022–17033. Cited by: §1, §2.
  • [15] K. Kumar, R. Kumar, de T. Boissiere, L. Gestin, et al. (2019) MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis. In Advances in NeurIPS 32, pp. 14910–14921. Cited by: §1, §2, §4.1.
  • [16] F. Mentzer, G. D. Toderici, M. Tschannen, and E. Agustsson (2020) High-fidelity generative image compression. Advances in Neural Information Processing Systems 33. Cited by: §3.3.
  • [17] A. Mustafa, N. Pia, and G. Fuchs (2021) StyleMelGAN: an efficient high-fidelity adversarial vocoder with temporal adaptive normalization. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6034–6038. Cited by: §1, §2, §3.1, §4.1.
  • [18] T. Q. Nguyen (1994) Near-perfect-reconstruction pseudo-QMF banks. IEEE Transactions on Signal Processing 42 (1), pp. 65–76. Cited by: §3.1.
  • [19] R. Prenger, R. Valle, and B. Catanzaro (2019) WaveGlow: A Flow-based Generative Network for Speech Synthesis. In ICASSP 2019, IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 3617–3621. Cited by: §1, §2.
  • [20] O. Rybakov, N. Kononenko, N. Subrahmanya, M. Visontai, and S. Laurenzo (2020) Streaming Keyword Spotting on Mobile Devices. In Proc. Interspeech 2020, pp. 2277–2281. Cited by: §3.2.
  • [21] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen (2011) Algorithm for intelligibility prediction of time-frequency weighted noisy speech. IEEE Trans. Audio Speech Lang. Process., pp. 2125–2136. Cited by: §4.3.
  • [22] D. Ulyanov, A. Vedaldi, and V. Lempitsky (2016) Instance normalization: The missing ingredient for fast stylization. arXiv:1607.08022. Cited by: §3.3.
  • [23] J. Valin and J. Skoglund (2019) LPCNet: Improving Neural Speech Synthesis through Linear Prediction. In ICASSP 2019, IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 5891–5895. Cited by: §1, §2, §4.1.
  • [24] J.M. Valin and J. Skoglund (2019) A Real-Time Wideband Neural Vocoder at 1.6 kb/s Using LPCNet. In INTERSPEECH 2019, 20th Annual Conference of the International Speech Communication Association, pp. 3406–3410. Cited by: 2nd item, §1, §2, Table 3.
  • [25] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, et al. (2016) WaveNet: A Generative Model for Raw Audio. arXiv:1609.03499. Cited by: §1, §2.
  • [26] J. Yamagishi, C. Veaux, and K. MacDonald (2019) CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit. Cited by: §4.1.
  • [27] R. Yamamoto, E. Song, and J. Kim (2020) Parallel WaveGAN: A Fast Waveform Generation Model Based on Generative Adversarial Networks with Multi-Resolution Spectrogram. In ICASSP 2020, IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 6199–6203. Cited by: §2.
  • [28] G. Yang, S. Yang, K. Liu, et al. (2021) Multi-band MelGAN: faster waveform generation for high-quality text-to-speech. In 2021 IEEE Spoken Language Technology Workshop (SLT), pp. 492–498. Cited by: §3.5.
  • [29] C. Yu, H. Lu, N. Hu, M. Yu, et al. (2020) DurIAN: Duration Informed Attention Network for Speech Synthesis. In Proc. Interspeech 2020, pp. 2027–2031. Cited by: §3.5, Table 3.