Despite decades of extensive work, classical speech coders offer very low quality at low bit rates. New techniques based on neural networks have shown breakthrough advances in this area in recent years, enabling compression factors much higher than conventional approaches while maintaining acceptable quality. Neural speech coders follow the classical encoder-decoder scheme: the encoder analyzes the input signal and extracts a set of acoustic features, which are then quantized, coded, and transmitted; the decoder reconstructs the input signal using the information contained in the received bit stream. In neural speech coders, a generative neural network plays the role of the decoder (i.e., a neural vocoder), as illustrated in Figure 1. It was demonstrated [11, 24] that conditioning a neural vocoder with coded acoustic parameters can produce natural wideband speech at very low bit rates.
In recent years, neural vocoders [25, 19, 9, 23] have revolutionized fields such as text-to-speech, voice conversion, and speech enhancement, generating speech of unprecedented quality. Most of these solutions, however, are not suitable for speech coding purposes, mainly because of their high computational complexity or very slow generation speed, and because their quality clearly degrades when they are conditioned on coarsely quantized features.
Neural vocoders based on generative adversarial networks (GANs) were recently shown to be competitive and viable alternatives to autoregressive and flow-based models for speech synthesis applications [15, 14, 17]. However, they are by design not suited for streaming or real-time speech communication, since they rely on heavy parallelization to process large blocks of conditioning information at once. This permits efficient one-shot generation of speech waveforms, but requires acoustic features that encode information about future samples, which are not available in a streaming scenario because of the high algorithmic delay they would cause. Moreover, GAN vocoders work particularly well with homogeneous speech representations such as mel-spectrograms, whereas speech coding applications primarily use non-homogeneous (e.g., parametric) speech representations that may not easily condition GAN vocoders for high-quality signal generation.
To solve the above-mentioned issues, our contributions in this work are twofold:
We propose Streamwise StyleMelGAN (SSMGAN), a modified StyleMelGAN vocoder for frame-by-frame generation of wideband speech at low delay, with reasonable computational complexity.
We demonstrate that SSMGAN is able to generate high-quality speech even when conditioned on the parametric and highly compressed representation provided by the encoder of LPCNet, which delivers a bitstream to our StyleMelGAN-based vocoder.
2 Related Work
Research on neural vocoders is a very active field, with new models being presented every few months. For this reason, we only refer here to some of the models that sparked the most attention. The first family to appear was that of autoregressive models [25, 9, 23], followed by flow-based models, and then GANs [15, 27, 4, 14, 17].
The first work to show the feasibility of low-bit-rate neural speech coding used a WaveNet decoder. The decoder network's complexity makes it impossible to deploy in practical applications. The complexity issue was partially tackled with a different approach in later work. Finally, the LPCNet model introduced optimizations that made neural speech coding possible on edge devices. Moreover, the coding scheme used in LPCNet has a very low bit rate of 1.6 kb/s. The coded parameters include acoustic features classically used in parametric speech coding, i.e., the Bark-scale cepstrum, the pitch information, and the energy. Table 1 details these parameters and the bit budget allocated to code them.
| Parameter | Bits |
|---|---|
| Cepstrum absolute coding | 30 |
| Cepstrum delta coding | 13 |
LPCNet's decoder is an autoregressive architecture based on WaveRNN, generating wideband speech sample by sample. It relies on linear prediction to reduce computational complexity, hence generating the signal in the linear-prediction residual domain. The decoding step is divided into two parts: a frame-rate network that computes the conditioning for every frame using the coded parameters, and a sample-rate network that computes the conditional sampling probabilities. LPCNet predicts the new excitation sample using the previously generated excitation and speech samples, as well as the current linear-prediction sample.
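The linear-prediction step can be illustrated with a short sketch. This is not LPCNet's implementation (there the excitation is produced sample by sample by the neural sample-rate network); here a given excitation is simply passed through the standard all-pole synthesis filter:

```python
import numpy as np

def lp_synthesis(excitation, lpc):
    """All-pole synthesis: s[n] = e[n] + sum_k lpc[k-1] * s[n-k].

    Hypothetical sketch: in LPCNet the excitation e[n] comes from the
    neural sample-rate network; here it is simply given.
    """
    order = len(lpc)
    s = np.zeros(order + len(excitation))  # leading zeros act as history
    for n, e in enumerate(excitation):
        prediction = sum(lpc[k - 1] * s[order + n - k] for k in range(1, order + 1))
        s[order + n] = e + prediction
    return s[order:]
```

With a unit impulse as excitation and a single coefficient 0.5, the output is the decaying sequence 1, 0.5, 0.25, ..., showing how the predictor shapes the residual into the speech-domain signal.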
More recent work presented a new neural speech decoder (Lyra) compressing speech at a very low bit rate. The encoder directly codes stacked mel-spectra, and the decoder uses noise suppression and variance regularization to improve the quality of out-of-distribution samples. Compared to the proposed solution, Lyra is conditioned on a substantially different bit stream and works under different conditions (e.g., noisy speech).
To the best of our knowledge, there exists no prior GAN vocoder which allows frame-by-frame generation of speech at low delay or which provides high quality speech synthesis conditioned on a coded bit stream.
3 Streamwise StyleMelGAN Vocoder (SSMGAN)
3.1 Baseline StyleMelGAN
StyleMelGAN is a lightweight neural vocoder allowing synthesis of high-fidelity speech with low computational complexity. It employs Temporal Adaptive DE-normalization (TADE) to style a noise vector with the acoustic features of the target speech (e.g., mel-spectrogram) via instance normalization and elementwise modulation. More precisely, it adaptively learns the modulation parameters γ and β from the acoustic features, and then applies the transformation γ ⊙ x̃ + β, where x̃ is the normalized content of the input activation. For efficient training, multiple random-window discriminators adversarially evaluate the speech signal analyzed by a set of Pseudo-Quadrature Mirror Filter (PQMF) filter banks, with the generator regularized by a multi-resolution STFT loss. All convolutions in StyleMelGAN are non-causal and run as a moving average on sliding windows of the input tensors. This results in a significant amount of algorithmic delay due to the deep hierarchical structure of the model. In the following, we describe major modifications to this baseline model that enable generation at very low delay with different acoustic features for conditioning.
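The TADE styling operation can be sketched as follows; the plain linear maps `w_gamma` and `w_beta` are hypothetical stand-ins for the learned modulation layers:

```python
import numpy as np

def tade_modulate(activation, features, w_gamma, w_beta):
    """TADE-style modulation sketch: gamma * normalize(x) + beta.

    w_gamma and w_beta are hypothetical linear maps standing in for the
    learned convolutional layers that compute the modulation parameters.
    """
    # normalize the content of the input activation
    normalized = (activation - activation.mean()) / (activation.std() + 1e-8)
    gamma = features @ w_gamma  # modulation scale, derived from the features
    beta = features @ w_beta    # modulation shift
    return gamma * normalized + beta
```

The key point is that the acoustic features never enter the signal path directly: they only steer the scale and shift applied to the normalized activation.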
3.2 Streamwise Convolution
There are two requirements for operating a convolutional model in a streaming manner with low algorithmic delay. First, the dependency on future inputs for predicting the current output should be as low as possible. We achieve this by making all convolutions in StyleMelGAN causal, so that the model requires no look-ahead. The second requirement is to generate the output frame by frame, as new input information becomes available. This condition is fulfilled in StyleMelGAN by adding an internal memory buffer to the causal convolutions in inference mode, as described in  and illustrated in Figure 2. Each causal convolution stores a buffer containing the last input samples used for generating the previous output frame, which are reused once new input samples are available. By applying the above modifications to StyleMelGAN, we obtain Streamwise StyleMelGAN (SSMGAN), which generates speech signals frame by frame with no delay between the input conditioning features and the output waveform.
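A minimal sketch of such a buffered causal convolution, assuming a single channel and a plain FIR kernel (the actual model uses multi-channel learned convolutions):

```python
import numpy as np

class StreamingCausalConv1d:
    """Causal 1-D convolution with an internal buffer for frame-by-frame use.

    The buffer stores the last (kernel_size - 1) input samples so that each
    new frame can be filtered without access to future samples, with results
    identical to running the causal convolution offline.
    """

    def __init__(self, kernel):
        self.kernel = np.asarray(kernel, dtype=float)
        # history of the last (kernel_size - 1) inputs, zero-initialized
        self.buffer = np.zeros(len(self.kernel) - 1)

    def process_frame(self, frame):
        x = np.concatenate([self.buffer, np.asarray(frame, dtype=float)])
        y = np.convolve(x, self.kernel, mode="valid")  # one output per input
        if self.buffer.size:
            self.buffer = x[-self.buffer.size:]  # keep history for next frame
        return y
```

Feeding the signal frame by frame produces exactly the same output as running the causal convolution offline on the whole signal, which is the property that makes streamwise generation possible.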
3.3 Channel Normalization
Running instance normalization in SSMGAN is not feasible, as its normalization statistics are estimated along the temporal dimension of the input activations. We therefore replace instance normalization with channel normalization, which estimates the statistics along the channel dimension instead. Interestingly, we found that this normalization maintains the model's performance and keeps training fast. It also avoids the subtle clicking artifacts that sometimes occur when training StyleMelGAN with instance normalization on a multi-speaker dataset.
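A sketch of the difference, assuming activations stored as a (channels, time) array: channel normalization needs only the current time step, which is what makes it streaming-friendly:

```python
import numpy as np

def channel_norm(x, eps=1e-8):
    """Normalize a (channels, time) activation along the channel axis.

    Unlike instance normalization, which averages over time and therefore
    needs the whole utterance, each time step here is normalized using only
    its own channel statistics, so it works unchanged in streaming mode.
    """
    mu = x.mean(axis=0, keepdims=True)     # per-time-step mean over channels
    sigma = x.std(axis=0, keepdims=True)   # per-time-step std over channels
    return (x - mu) / (sigma + eps)
```

Because each column is normalized independently, processing a single time step in isolation gives the same result as processing the full activation, unlike instance normalization.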
3.4 Modified TADE Residual Block
The TADE residual blocks are slightly modified with respect to the original model, as shown in Figure 3. The complexity of SSMGAN is reduced by using a single TADE conditioning layer and applying the same modulation parameters γ and β twice, rather than having two separate TADE layers in the residual block. With this modification, the total number of model parameters is significantly reduced.
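The shared-modulation idea can be sketched as below; the gating and convolutions of the real block are simplified to a tanh nonlinearity and caller-supplied `conv1`/`conv2`, so this is only a structural illustration:

```python
import numpy as np

def _cnorm(x, eps=1e-8):
    """Channel-style normalization of a 1-D activation (sketch)."""
    return (x - x.mean()) / (x.std() + eps)

def tade_resblock(x, gamma, beta, conv1, conv2):
    """Structural sketch of the modified block: one (gamma, beta) pair,
    computed once by a single TADE layer outside this function, is applied
    twice. The real block's gated activations are simplified to tanh, and
    conv1/conv2 are stand-ins for the learned convolutions.
    """
    h = conv1(np.tanh(gamma * _cnorm(x) + beta))
    h = conv2(np.tanh(gamma * _cnorm(h) + beta))  # same gamma/beta reused
    return x + h  # residual connection
```

Computing the modulation pair once instead of twice removes one full TADE layer per block, which is where the parameter saving comes from.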
3.5 Multiband Generation
SSMGAN further reduces the complexity compared to the baseline model by introducing multiband synthesis as in [29, 28]. Rather than synthesizing the full-band speech signal in the time domain at the output sampling rate, the generator simultaneously outputs N different sub-bands, each critically sampled at 1/N of the output rate. By design, SSMGAN generates the sub-bands as an N-channel output, which is then fed to a PQMF synthesis filter bank to obtain a frame of synthesized speech. Since the PQMF uses a filter prototype with 50% overlap, it incurs a short additional delay.
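A minimal cosine-modulated synthesis bank illustrating the recombination step (the prototype here is an arbitrary lowpass window, not the model's actual PQMF prototype):

```python
import numpy as np

def pqmf_synthesis(subbands, prototype):
    """Recombine N critically sampled sub-bands into one full-band frame.

    Illustrative cosine-modulated synthesis bank: each sub-band (sampled at
    1/N of the output rate) is upsampled by N, filtered by a cosine-modulated
    copy of the lowpass prototype, and the filtered bands are summed.
    """
    num_bands, frame_len = subbands.shape
    taps = len(prototype)
    n = np.arange(taps)
    out = np.zeros(frame_len * num_bands + taps - 1)
    for k in range(num_bands):
        phase = (-1) ** k * np.pi / 4  # standard pseudo-QMF phase term
        g_k = prototype * np.cos(
            (2 * k + 1) * np.pi / (2 * num_bands) * (n - (taps - 1) / 2) + phase
        )
        upsampled = np.zeros(frame_len * num_bands)
        upsampled[::num_bands] = subbands[k] * num_bands  # zero-stuff by N
        out += np.convolve(upsampled, g_k)  # 'full' keeps the filter tail
    return out
```

The prototype's overlap across frame boundaries is what introduces the extra delay of the filter bank.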
3.6 Conditioning on Coded LPCNet Features
Finally, we condition SSMGAN with the coded parameters in real time to run it as a speech decoder. Instead of providing a mel-spectrogram as intermediate representation, the coded parameters obtained by the LPCNet encoder are fed to the generator network. The pitch lag was found to be critical for high-quality synthesis, and hence it is processed separately from the rest of the conditioning information. More precisely, the coded cepstral and energy parameters are passed through a simple causal convolutional layer to obtain a multi-channel representation used for conditioning the generation from the prior signal. This prior is not created from latent random noise, but rather from a learned embedding of the pitch lag, which is then multiplied elementwise by the pitch correlation. Figure 4 shows the complete architecture of the proposed SSMGAN conditioned on the LPCNet coded parameters. With this setting, SSMGAN generates wideband speech frame by frame; the total delay consists of the delay introduced by the original extraction of the LPCNet coding packets plus the delay added by the PQMF synthesis filter bank.
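The construction of the prior can be sketched as follows; `embedding_table` stands in for the learned pitch-lag embedding:

```python
import numpy as np

def pitch_prior(pitch_lags, pitch_corr, embedding_table):
    """Build the generator prior from coded pitch parameters (sketch).

    embedding_table is a hypothetical learned embedding of the integer pitch
    lag; the looked-up vectors are scaled elementwise by the per-frame pitch
    correlation, replacing the latent-noise prior of the original model.
    """
    emb = embedding_table[pitch_lags]   # (num_frames, emb_dim) lookup
    return emb * pitch_corr[:, None]    # scale each frame's embedding
```

Scaling by the pitch correlation lets strongly voiced frames inject a strong periodic prior while nearly unvoiced frames contribute little, which is consistent with the role the pitch parameters play in the conditioning.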
4.1 Experimental setup
The training procedure and hyperparameters are very similar to those of the baseline StyleMelGAN. We train SSMGAN using one NVIDIA Tesla V100 GPU on the VCTK corpus. The conditioning features are calculated as in LPCNet, as described in Section 2. The generator is first pretrained using the Adam optimizer. When starting the adversarial training, we add the multi-scale discriminator described in , also trained via Adam with the same learning rate. For each sample in the batch we extract a segment of fixed length, and the adversarial training then runs until convergence.
4.2 Subjective evaluation
We conducted a subjective listening test following the ITU-R MUSHRA recommendation, comparing classical and neural speech coders. The test set is composed of 12 utterances by 10 different speakers in 4 different languages. All speakers and 3 out of the 4 languages are unseen during training. Most of the utterances (10 out of 12) come from unseen proprietary databases. The results obtained with 16 expert listeners are shown in Figure 5.
The anchor is generated using the Speex speech decoder at a low bit rate. Two state-of-the-art neural decoders were considered, LPCNet at 1.6 kb/s and Lyra, as well as two classical but still widely used codecs, AMR-WB and the recent 3GPP EVS. The Lyra condition was generated using release v0.0.1 with the default settings. EVS operates with a variable bit rate (VBR), and the quoted rate reflects the average bit rate on active frames. During long inactive phases, EVS switches to a discontinuous transmission mode (DTX), transmitting packets only periodically at a very low bit rate. Since the test items contain only short pauses between sentences, the DTX mode plays a minor role in this test.
LPCNet was trained on the VCTK dataset. One difference from the original work is that we do not apply domain adaptation by first training on unquantized and then fine-tuning on quantized features, since this was found to make no difference on VCTK. In addition, since VCTK is noisier and much more diverse than the NTT database used in the original work, we removed the data augmentation, as it was found to be detrimental to the final quality (demo samples are available at https://fhgspco.github.io/ssmgan_spco/). The publicly available version of the Lyra model was not retrained on VCTK, and hence it is not directly comparable with SSMGAN or LPCNet in this case. It was nonetheless taken into consideration as it offers a reproducible benchmark.
4.3 Objective evaluation
Our solution was also compared to the other neural decoders using different objective metrics. Since objective speech quality models like POLQA are known to be unreliable for non-waveform-preserving coding schemes, and in particular for neural decoders, we also considered the newly introduced WARP-Q metric, which was designed for this purpose. STOI, which assesses speech intelligibility, is also included, and the scores measured on 824 test items of VCTK are reported in Table 2.
Average objective scores for neural decoders. For POLQA-MOS and STOI higher scores are better, while for WARP-Q lower scores are better (confidence intervals are negligible).
SSMGAN scores best among the neural coding solutions across all three metrics, in agreement with the subjective listening test. Moreover, the results of our MUSHRA listening test show that these objective metrics do not fully reflect the perceived quality of the generated speech, disproportionately disfavouring generative models.
The main contribution to SSMGAN's computational complexity stems from the convolutions in the TADEResBlocks and the upsampling layers. If C denotes the channel dimension, K the size of the convolutional kernels, and F the dimension of the input features, then (ignoring activations and lower-order terms) the evaluation of a TADEResBlock takes on the order of K·C·(C+F) multiply-accumulate operations (MAC) per output sample, and an upsampling layer with kernel size K and channel dimension C similarly takes on the order of K·C² MAC per output sample. Summing these per-sample costs over all TADEResBlocks and upsampling layers, weighted by their respective output sampling rates, gives the total complexity of the model.
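Assuming the usual cost of a 1-D convolution (kernel size × input channels × output channels MACs per output time step), this accounting can be scripted; the example values below are illustrative, not the model's actual dimensions:

```python
def conv_macs_per_sample(kernel_size, c_in, c_out):
    """MAC operations per output time step of a 1-D convolution layer."""
    return kernel_size * c_in * c_out

def gmacs_per_second(macs_per_sample, output_rate_hz):
    """Total complexity in GMAC/s for a layer running at a given output rate."""
    return macs_per_sample * output_rate_hz / 1e9

# Illustrative values (not the model's actual dimensions): kernel size 9,
# 64 input and output channels, 16 kHz output rate.
layer_macs = conv_macs_per_sample(9, 64, 64)  # 36864 MAC per output sample
total = gmacs_per_second(layer_macs, 16000)   # ~0.59 GMAC/s for this layer
```

Summing such terms over every layer, each at its own output rate, reproduces the per-model totals reported in the complexity comparison.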
A comparison with other neural vocoders used for neural speech coding is given in Table 3. It should be noted that the convolutional structure of SSMGAN allows for efficient parallel execution, which gives it a decisive advantage over autoregressive models on GPUs. The current unoptimized PyTorch implementation achieves real-time frame-by-frame inference using four cores of an Intel(R) Core(TM) i7-6700 3.40GHz CPU. The above complexity calculations show that the next step is an efficient implementation for mobile devices, which will be the object of future work.
In this paper we introduce SSMGAN, a neural speech decoder generating state-of-the-art quality with low delay and low complexity at a very low bit rate. We assess its quality against existing neural autoregressive models and modern low-bit-rate speech codecs, with both objective scores and subjective listening tests. We show for the first time that GAN vocoders can perform fast streaming speech synthesis with low algorithmic delay, and that they can achieve high-quality synthesis when conditioned on compact parametric speech representations.
-  (2009-12) Speech codec speech processing functions; Adaptive Multi-Rate - Wideband (AMR-WB) speech codec; Transcoding functions. Technical Report TS 26.190, 3rd Generation Partnership Project (3GPP).
-  (2014-12) TS 26.445, EVS Codec Detailed Algorithmic Description; 3GPP Technical Specification (Release 12). Technical Report TS 26.445, 3rd Generation Partnership Project (3GPP).
-  (2013-06) Perceptual Objective Listening Quality Assessment (POLQA), the third generation ITU-T standard for end-to-end speech quality measurement part I — temporal alignment. Journal of the Audio Engineering Society 61 (6), pp. 366–384.
-  (2020) High Fidelity Speech Synthesis with Adversarial Networks. In International Conference on Learning Representations.
-  (2003) Method for the subjective assessment of intermediate quality levels of coding systems. Technical Report, ITU-R.
-  (2014) Generative Adversarial Nets. In Advances in NeurIPS 27, pp. 2672–2680.
-  (2021) (Website).
-  (2021) WARP-Q: Quality Prediction for Generative Neural Speech Codecs. In ICASSP 2021, IEEE International Conference on Acoustics, Speech and Signal Processing.
-  (2018) Efficient Neural Audio Synthesis. In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, pp. 2410–2419.
-  (2015) Adam: A Method for Stochastic Optimization. ICLR.
-  (2018) WaveNet Based Low Rate Speech Coding. In ICASSP 2018, IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 676–680.
-  (2021) Generative Speech Coding with Predictive Variance Regularization. In ICASSP 2021, IEEE International Conference on Acoustics, Speech and Signal Processing.
-  (2019) High-Quality Speech Coding with SampleRNN. In ICASSP 2019, IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 7155–7159.
-  (2020) HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis. In Advances in Neural Information Processing Systems, Vol. 33, pp. 17022–17033.
-  (2019) MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis. In Advances in NeurIPS 32, pp. 14910–14921.
-  (2020) High-Fidelity Generative Image Compression. Advances in Neural Information Processing Systems 33.
-  (2021) StyleMelGAN: An Efficient High-Fidelity Adversarial Vocoder with Temporal Adaptive Normalization. In ICASSP 2021, IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 6034–6038.
-  (1994) Near-Perfect-Reconstruction Pseudo-QMF Banks. IEEE Transactions on Signal Processing 42 (1), pp. 65–76.
-  (2019) WaveGlow: A Flow-based Generative Network for Speech Synthesis. In ICASSP 2019, IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 3617–3621.
-  (2020) Streaming Keyword Spotting on Mobile Devices. In Proc. Interspeech 2020, pp. 2277–2281.
-  (2011) Algorithm for Intelligibility Prediction of Time-Frequency Weighted Noisy Speech. IEEE Transactions on Audio, Speech, and Language Processing, pp. 2125–2136.
-  (2016) Instance Normalization: The Missing Ingredient for Fast Stylization. arXiv:1607.08022.
-  (2019) LPCNet: Improving Neural Speech Synthesis through Linear Prediction. In ICASSP 2019, IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 5891–5895.
-  (2019) A Real-Time Wideband Neural Vocoder at 1.6 kb/s Using LPCNet. In INTERSPEECH 2019, 20th Annual Conference of the International Speech Communication Association, pp. 3406–3410.
-  (2016) WaveNet: A Generative Model for Raw Audio. arXiv:1609.03499.
-  (2019) CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit.
-  (2020) Parallel WaveGAN: A Fast Waveform Generation Model Based on Generative Adversarial Networks with Multi-Resolution Spectrogram. In ICASSP 2020, IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 6199–6203.
-  (2021) Multi-Band MelGAN: Faster Waveform Generation for High-Quality Text-to-Speech. In 2021 IEEE Spoken Language Technology Workshop (SLT), pp. 492–498.
-  (2020) DurIAN: Duration Informed Attention Network for Speech Synthesis. In Proc. Interspeech 2020, pp. 2027–2031.