Cascaded Cross-Module Residual Learning towards Lightweight End-to-End Speech Coding

06/18/2019 ∙ by Kai Zhen, et al. ∙ Indiana University Bloomington, Indiana University, ETRI

Speech codecs learn compact representations of speech signals to facilitate data transmission. Many recent deep neural network (DNN) based end-to-end speech codecs achieve low bitrates and high perceptual quality at the cost of model complexity. We propose a cross-module residual learning (CMRL) pipeline as a module carrier, with each module reconstructing the residual from its preceding modules. CMRL differs from other DNN-based speech codecs in that, rather than modeling the speech compression problem in a single large neural network, it optimizes a series of less-complicated modules in a two-phase training scheme. The proposed method shows better objective performance than AMR-WB and the state-of-the-art DNN-based speech codec with a similar network architecture. As an end-to-end model, it takes raw PCM signals as an input, but is also compatible with linear predictive coding (LPC), showing better subjective quality at high bitrates than AMR-WB and OPUS. The gain is achieved by using only 0.9 million trainable parameters, a significantly less complex architecture than the other DNN-based codecs in the literature.




1 Introduction

Speech coding, where the encoder converts the speech signal into bitstreams and the decoder synthesizes reconstructed signal from received bitstreams, serves an important role for various purposes: to secure a voice communication [1][2], to facilitate data transmission [3], etc. There have been various conventional speech coding methodologies, including linear predictive coding (LPC) [4], adaptive encoding [5], and perceptual weighting [6] among other domain specific knowledge about the speech signals, that are used to construct classic codecs, such as AMR-WB [7] and OPUS [8] with high perceptual quality.

Over the last decade, data-driven approaches have vitalized the use of deep neural networks (DNN) for speech coding. A speech coding system can be formulated by a DNN as an autoencoder (AE) with a code layer discretized by vector quantization (VQ) [9] or bitwise network techniques [10], etc. Many DNN methods [11][12] take inputs in the time-frequency (T-F) domain from the short-time Fourier transform (STFT) or modified discrete cosine transform (MDCT), etc. Recent DNN-based codecs [13][14][15][16] model speech signals in the time domain directly, without a T-F transformation. They are referred to as end-to-end methods, and yield performance competitive with current speech coding standards, such as AMR-WB [7].

While DNNs serve as a powerful parameter estimation paradigm, they are computationally expensive to run on smart devices. Many DNN-based codecs achieve both low bitrates and high perceptual quality, the two main targets for speech codecs [17][18][19], but at the cost of high model complexity. A WaveNet based variational autoencoder (VAE) [16] outperforms other low bitrate codecs in the listening test, but its 20 million parameters make the model too big for real-time processing on a resource-constrained device. Similarly, codecs built on SampleRNN [20][21] can also be energy-intensive.

Motivated by DNN based end-to-end codecs [14] and residual cascading [22][23], this paper proposes a “cross-module” residual learning (CMRL) pipeline, which can lower the model complexity while maintaining a high perceptual quality and compression ratio. CMRL hosts a list of less-complicated end-to-end speech coding modules. Each module learns to recover what its preceding modules failed to reconstruct. CMRL differs from other residual learning networks, e.g. ResNet [24], in that rather than adding identity shortcuts between layers, CMRL cascades residuals across a series of DNN modules. We introduce a two-round training scheme to train CMRL models. In addition, we show that CMRL is compatible with LPC by having it as one of the modules. With the LPC coefficients being predicted, CMRL recovers the LPC residuals which, along with the LPC coefficients, synthesize the decoded speech signal at the receiver side.

The evaluation of the proposed method is threefold: objective measures, subjective assessment, and model complexity. Compared with AMR-WB, OPUS, and the recently proposed end-to-end system [14], CMRL showed promising performance in both objective and subjective quality assessments. As for complexity, CMRL contains only 0.9 million model parameters, making it significantly less complicated than the WaveNet based speech codec [16] and the end-to-end baseline [14].

Figure 1: A schematic diagram for the end-to-end speech coding component module: some channel change steps are omitted.

2 Model description

Before introducing CMRL as a module carrier, we describe the component module to be hosted by CMRL.

2.1 The component module

Recently, an end-to-end DNN speech codec (referred to as Kankanahalli-Net) has shown competitive performance comparable to one of the standards (AMR-WB) [14]. We describe our component model derived from Kankanahalli-Net that consists of bottleneck residual learning [24], soft-to-hard quantization [25], and sub-pixel convolutional neural networks for upsampling [26]. Figure 1 depicts the component module.

2.1.1 Four non-linear mapping types

In the end-to-end speech codec, we take $S = 512$ time domain samples per frame, 32 of which are windowed by either the left or right half of a Hann window and then overlapped with the adjacent frames. This forms the input to the first 1-D convolutional layer of $C = 100$ kernels, whose output is a tensor of size $S \times C$, i.e., (512, 100) in Table 1.
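The framing step can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: it assumes that the 32 windowed samples sit at each end of a frame, that a periodic Hann window of length 64 supplies the two halves, and that the hop size is 512 - 32 = 480 samples. The function names are hypothetical.

```python
import numpy as np

def frame_window(frame_size=512, overlap=32):
    """Flat window whose first/last `overlap` samples follow the rising/falling Hann halves."""
    n = np.arange(2 * overlap)
    hann = 0.5 * (1.0 - np.cos(2.0 * np.pi * n / (2 * overlap)))  # periodic Hann (assumption)
    window = np.ones(frame_size)
    window[:overlap] = hann[:overlap]    # left (rising) half
    window[-overlap:] = hann[overlap:]   # right (falling) half
    return window

def split_frames(signal, frame_size=512, overlap=32):
    """Cut the signal into windowed frames with a hop of frame_size - overlap samples."""
    hop = frame_size - overlap
    window = frame_window(frame_size, overlap)
    starts = range(0, len(signal) - frame_size + 1, hop)
    return np.stack([signal[s:s + frame_size] * window for s in starts])

def overlap_add(frames, overlap=32):
    """Sum the frames back together; the two Hann halves add up to one in each overlap."""
    n_frames, frame_size = frames.shape
    hop = frame_size - overlap
    out = np.zeros(hop * (n_frames - 1) + frame_size)
    for i, frame in enumerate(frames):
        out[i * hop:i * hop + frame_size] += frame
    return out

signal = np.sin(0.01 * np.arange(480 * 4 + 512))
frames = split_frames(signal)
restored = overlap_add(frames)
print(frames.shape, np.max(np.abs(restored[32:-32] - signal[32:-32])))
```

Because the rising and falling halves sum to one, overlap-add restores every sample except the first and last 32 of the whole signal, which have no neighbouring frame to complete them.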


There are four types of non-linear transformations involved in this fully convolutional network: downsampling, upsampling, channel changing, and residual learning. The downsampling operation reduces the frame width $S = 512$ down to $S/2 = 256$ by setting the stride $d$ of the convolutional layer to be 2, which turns a (512, 100) input example into (256, 100). The original dimension is recovered in the decoder with the recently proposed sub-pixel convolution [26], which forms the upsampling operation. The sub-pixel convolution is done by interlacing multiple feature maps to expand the size of the window (Figure 2). In our case, we interlace a pair of feature maps, which is why in Table 1 the upsampling layer reduces the channels from 100 to 50 while recovering the original 512 dimensions from 256.

In this work, to simplify the model architecture we use identity shortcuts only for cross-layer residual learning, while Kankanahalli-Net employs them more frequently. Furthermore, inspired by recent work in source separation with dilated convolutional neural networks [27], we use a “bottleneck” residual learning block to further reduce the number of parameters: the reduced number of channels within the bottleneck block decreases the depth of the kernels. See Table 1 for the size of our kernels. As listed there, the (512, 1) input tensor is first converted to a (512, 100) feature map, and then downsampled to (256, 100). Eventually, the code vector shrinks down to (256, 1). The decoding process reverses these steps and recovers a signal of size (512, 1).

Figure 2: The interlacing-based upsampling process.
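The interlacing step can be illustrated with a short NumPy sketch. It is only a shape-level illustration under stated assumptions: tensors are laid out as (width, channel) as in Table 1, and channel k is paired with channel k + 50. The actual pairing used by the authors is not given in the text, and the function name is hypothetical.

```python
import numpy as np

def subpixel_upsample_1d(x, factor=2):
    """Interlace groups of `factor` feature maps along the width axis.

    x has shape (width, channels); the result has shape
    (width * factor, channels // factor).
    """
    width, channels = x.shape
    if channels % factor != 0:
        raise ValueError("channels must be divisible by the upsampling factor")
    out_channels = channels // factor
    # View the channels as `factor` groups, then merge the group axis into the width axis,
    # so out[w * factor + r, k] == x[w, r * out_channels + k].
    grouped = x.reshape(width, factor, out_channels)
    return grouped.reshape(width * factor, out_channels)

x = np.arange(256 * 100, dtype=np.float64).reshape(256, 100)
y = subpixel_upsample_1d(x)
print(x.shape, "->", y.shape)
```

No values are created or discarded here: the operation only rearranges the 256 × 100 entries into 512 × 50, which matches the shapes of the upsampling row in Table 1.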

2.1.2 Softmax quantization

The coded output from each encoder is still a real-valued vector of size $S/d$ (256 in Table 1). Softmax quantization [25] performs scalar quantization by assigning each real value to the nearest representative (Figure 1 (c)). In the proposed system, softmax quantization maps the input scalar to one of 32 clusters, or quantization levels, which requires $\log_2 32 = 5$ bits per dimension. Huffman coding further reduces the bitrate [28].
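A minimal NumPy sketch of this kind of scalar quantization follows. It assumes fixed, evenly spaced representatives and a hand-picked softmax scale `alpha`; in a trained codec the representatives would be learned and the scale chosen by the training procedure, so both are placeholders. The soft output is the differentiable surrogate used in soft-to-hard schemes [25], and the hard output is the nearest representative.

```python
import numpy as np

def softmax_quantize(h, centroids, alpha=300.0):
    """Return (soft values, hard values, hard indices) for a vector of scalars h."""
    dist = np.abs(h[:, None] - centroids[None, :])     # distance to every representative
    logits = -alpha * dist
    logits -= logits.max(axis=1, keepdims=True)         # numerical stability
    prob = np.exp(logits)
    prob /= prob.sum(axis=1, keepdims=True)              # soft assignment per scalar
    soft = prob @ centroids                               # differentiable approximation
    index = dist.argmin(axis=1)                           # nearest representative
    hard = centroids[index]
    return soft, hard, index

centroids = np.linspace(-1.0, 1.0, 32)   # 32 quantization levels (placement is an assumption)
code = np.linspace(-1.2, 1.2, 256)       # stand-in for one 256-dimensional encoder output
soft, hard, index = softmax_quantize(code, centroids)
bits_per_dimension = np.log2(len(centroids))
print(bits_per_dimension, index.min(), index.max())
```

With 32 levels each symbol costs 5 bits before entropy coding, so one 256-symbol frame costs 1280 bits per module without Huffman coding.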

Figure 3: Cross-module residual learning pipeline

2.2 The module carrier: CMRL

Figure 3 shows the proposed cascaded cross-module residual learning (CMRL) process. In CMRL, each module does its best to reconstruct its input. The procedure in the $i$-th module is denoted as $\mathcal{F}(x^{(i)}; W^{(i)})$, which estimates the input as $\hat{x}^{(i)}$. The input for the $i$-th module is defined as

$$x^{(i)} = x - \sum_{j=1}^{i-1} \hat{x}^{(j)}, \qquad (1)$$

where the first module takes the input speech signal, i.e., $x^{(1)} = x$. The meaning is that each module learns to reconstruct the residual which is not recovered by its preceding modules. Note that module homogeneity is not required for CMRL: for example, the first module can be very shallow to just estimate the envelope of MDCT spectral structure while the following modules may need more parameters to estimate the residuals.

Each AE decomposes into the encoder and decoder parts:

$$h^{(i)} = \mathcal{F}_{\text{enc}}(x^{(i)}; W^{(i)}_{\text{enc}}), \qquad \hat{x}^{(i)} = \mathcal{F}_{\text{dec}}(h^{(i)}; W^{(i)}_{\text{dec}}), \qquad (2)$$

where $h^{(i)}$ denotes the part of code generated by the $i$-th encoder, and $W^{(i)}_{\text{enc}} \cup W^{(i)}_{\text{dec}} = W^{(i)}$.

The encoding process: For a given input signal $x$, the encoding process runs all $N$ AE modules in a sequential order. Then, the bitstring is generated by taking the encoder outputs and concatenating them: $h = [h^{(1)\top}, h^{(2)\top}, \ldots, h^{(N)\top}]^{\top}$.

The decoding process: Once the bitstring is available on the receiver side, all the decoder parts of the modules, $\mathcal{F}_{\text{dec}}(h^{(i)}; W^{(i)}_{\text{dec}})$ for all $i$, run to produce the reconstructions, which are added up to approximate the initial input signal with the global error defined as

$$\mathcal{E}\Big(x \,\Big\|\, \sum_{i=1}^{N} \hat{x}^{(i)}\Big). \qquad (3)$$
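The cascade can be sketched in a few lines of NumPy. To keep the example self-contained, each trained autoencoder is replaced by a hypothetical stand-in (`ToyModule`, a uniform scalar quantizer with its own step size); only the residual bookkeeping between modules mirrors the description above.

```python
import numpy as np

class ToyModule:
    """Hypothetical stand-in for one autoencoder module: a uniform scalar quantizer."""

    def __init__(self, step):
        self.step = step

    def encode(self, x):
        return np.round(x / self.step).astype(np.int64)   # integer code symbols

    def decode(self, h):
        return h * self.step                               # reconstruction of the module input

def cmrl_encode(x, modules):
    """Run the modules in order; each one codes what the previous ones left unrecovered."""
    codes = []
    reconstructed = np.zeros_like(x)
    for module in modules:
        residual = x - reconstructed          # input of the current module
        h = module.encode(residual)
        codes.append(h)                       # the codes are concatenated into the bitstring
        reconstructed = reconstructed + module.decode(h)
    return codes

def cmrl_decode(codes, modules):
    """Decode every module's code and add the reconstructions up."""
    return sum(module.decode(h) for module, h in zip(modules, codes))

x = np.sin(np.linspace(0.0, 20.0, 512))
modules = [ToyModule(0.25), ToyModule(0.02)]
codes = cmrl_encode(x, modules)
x_hat = cmrl_decode(codes, modules)
print(np.max(np.abs(x - modules[0].decode(codes[0]))), np.max(np.abs(x - x_hat)))
```

The second printed error is smaller than the first because the second module only has to describe the coarse module's leftover error, which is the point of cascading residuals.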


2.2.1 The two-round training scheme

Intra-module greedy training: We provide a two-round training scheme to make CMRL optimization tractable. The first round adopts a greedy training scheme, where each AE, with its module-specific parameter set $W^{(i)}$, tries its best to minimize the error between its input $x^{(i)}$ and its own reconstruction: $\arg\min_{W^{(i)}} \mathcal{E}\big(x^{(i)} \,\big\|\, \mathcal{F}(x^{(i)}; W^{(i)})\big)$. The greedy training scheme follows a divide-and-conquer strategy, leading to an easier optimization for each module. The thick gray arrows in Figure 3 show the flow of the backpropagation error to minimize the individual module error with respect to the module-specific parameter set $W^{(i)}$.


Cross-module finetuning: The greedy training scheme accumulates module-specific error, which the earlier modules do not have a chance to reduce, thus leading to a suboptimal result. Hence, the second-round cross-module finetuning follows to further improve the performance by reducing the total error:

$$\arg\min_{W^{(1)}, \ldots, W^{(N)}} \mathcal{E}\Big(x \,\Big\|\, \sum_{i=1}^{N} \mathcal{F}(x^{(i)}; W^{(i)})\Big).$$

During the finetuning step, we (a) initialize the parameters of each module with those estimated from the greedy training step, (b) perform cascaded feedforward on all the modules sequentially to calculate the total estimation error in (3), and (c) backpropagate the error to update the parameters in all modules altogether (thin black arrows in Figure 3). Aside from the total reconstruction error (3), we inherit Kankanahalli-Net’s other regularization terms, i.e., perceptual loss, quantization penalty, and entropy regularizer.

2.3 Bitrate and entropy coding

The bitrate is calculated from the concatenated bitstrings from all modules in CMRL. Each encoder module produces $S/d$ quantized symbols from the softmax quantization process (Figure 1 (e)), where the stride size $d$ divides the input dimensionality $S$. Let $c^{(i)}$ be the average bit length per symbol after Huffman coding in the $i$-th module. Then, $c^{(i)} S/d$ stands for the bits per frame. Dividing this by the frame hop duration, $(S - o)/f$, where $o$ and $f$ denote the overlap size in samples and the sampling rate, respectively, gives the bitrate of one module, and the bitrates per module add up to the total bitrate: $\xi_{\text{LPC}} + \sum_{i=1}^{N} \frac{f\, c^{(i)} S}{(S - o)\, d}$, where the overhead to transmit LPC coefficients is $\xi_{\text{LPC}} = 2.4$ kbps, which is 0 for the case with raw PCM signals as the input.
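The bitrate arithmetic can be checked with a small Python helper. It is a sketch under stated assumptions: the sampling rate is taken to be 16 kHz (the text does not state it), the frame size and overlap are the 512 and 32 samples given in the experimental setup, and the average bit lengths in the usage lines are made-up values, not measurements from the paper.

```python
def cmrl_bitrate_bps(modules, frame_size=512, overlap=32,
                     sample_rate=16000, lpc_overhead_bps=0.0):
    """Total bitrate in bits per second.

    `modules` is a list of (average bits per symbol, stride) pairs, one per module.
    `sample_rate` defaults to 16 kHz, which is an assumption of this sketch.
    """
    hop = frame_size - overlap                      # new samples consumed per frame
    total = lpc_overhead_bps
    for avg_bits, stride in modules:
        symbols_per_frame = frame_size // stride    # quantized symbols per frame
        total += sample_rate * avg_bits * symbols_per_frame / hop
    return total

# Without entropy coding, one module with stride 2 spends 5 bits on each of 256 symbols.
print(cmrl_bitrate_bps([(5.0, 2)]))
# Hypothetical entropy-coded case: two modules plus the 2.4 kbps LPC overhead.
print(cmrl_bitrate_bps([(1.5, 2), (1.5, 4)], lpc_overhead_bps=2400.0))
```

Passing the stride per module makes it possible to model a second encoder that downsamples each window further, as in the lowest bitrate configuration.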

With the entropy control scheme proposed in Kankanahalli-Net as the baseline to keep a specific bitrate, we further enhance the coding efficiency by employing Huffman coding on the code vectors. Aside from encoding each symbol (i.e., the softmax result) separately, encoding short sequences can further leverage the temporal correlation in the series of quantized symbols, especially when the entropy is already low [29] [30]. We found that encoding a short sequence of adjacent symbols, i.e., two symbols, can lower the average bit length further at low bitrates.
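The benefit of coding symbol pairs can be reproduced on a toy example. The sketch below builds a Huffman code from empirical counts with the standard library only and compares the average bits per symbol when coding single symbols against coding non-overlapping pairs. The symbol sequence is synthetic and the helper names are hypothetical; nothing here comes from the paper's data.

```python
import heapq
from collections import Counter

def huffman_code_lengths(counts):
    """Map each symbol to its Huffman code length, given a dict of symbol counts."""
    if len(counts) == 1:
        return {symbol: 1 for symbol in counts}
    # Heap entries: (count, unique tie-breaker, symbols in this subtree).
    heap = [(count, i, [symbol]) for i, (symbol, count) in enumerate(counts.items())]
    heapq.heapify(heap)
    lengths = {symbol: 0 for symbol in counts}
    next_id = len(heap)
    while len(heap) > 1:
        c1, _, s1 = heapq.heappop(heap)
        c2, _, s2 = heapq.heappop(heap)
        for symbol in s1 + s2:
            lengths[symbol] += 1          # every merge adds one bit to the merged symbols
        heapq.heappush(heap, (c1 + c2, next_id, s1 + s2))
        next_id += 1
    return lengths

def avg_bits_per_symbol(symbols, group=1):
    """Average Huffman bits per original symbol when coding groups of `group` symbols."""
    usable = len(symbols) - len(symbols) % group
    groups = [tuple(symbols[i:i + group]) for i in range(0, usable, group)]
    counts = Counter(groups)
    lengths = huffman_code_lengths(counts)
    total_bits = sum(lengths[g] * n for g, n in counts.items())
    return total_bits / usable

# Synthetic low-entropy sequence: 90% of the symbols are 0.
symbols = [0, 0] * 81 + [0, 1] * 9 + [1, 0] * 9 + [1, 1] * 1
print(avg_bits_per_symbol(symbols, group=1))  # 1.0
print(avg_bits_per_symbol(symbols, group=2))  # 0.645
```

A per-symbol Huffman code can never spend less than one bit per symbol, so once the entropy falls below one bit, grouping symbols is the only way for Huffman coding to approach it.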

| Layer | Input shape | Kernel shape | Output shape |
|---|---|---|---|
| Change channel | (512, 1) | (9, 1, 100) | (512, 100) |
| 1st bottleneck | (512, 100) | [(9, 100, 20), (9, 20, 20), (9, 20, 100)] ×2 | (512, 100) |
| Downsampling | (512, 100) | (9, 100, 100) | (256, 100) |
| 2nd bottleneck | (256, 100) | [(9, 100, 20), (9, 20, 20), (9, 20, 100)] ×2 | (256, 100) |
| Change channel | (256, 100) | (9, 100, 1) | (256, 1) |
| Change channel | (256, 1) | (9, 1, 100) | (256, 100) |
| 1st bottleneck | (256, 100) | [(9, 100, 20), (9, 20, 20), (9, 20, 100)] ×2 | (256, 100) |
| Upsampling | (256, 100) | (9, 100, 100) | (512, 50) |
| 2nd bottleneck | (512, 50) | [(9, 50, 20), (9, 20, 20), (9, 20, 50)] ×2 | (512, 50) |
| Change channel | (512, 50) | (9, 50, 1) | (512, 1) |

Table 1: Architecture of the component module as in Figure 1. Input and output tensor sizes are represented by (width, channel), while the kernel shape is (width, in channel, out channel).
Figure 4: MUSHRA test results. Panels: (a) 8.85 kbps, (b) 15.85 kbps, (c) 19.85 kbps, (d) 23.85 kbps, (e) 23.85 kbps. From (a) to (d): the performance of CMRL on raw and LPC residual input signals compared against AMR-WB at different bitrates. (e) An additional test shows that the performance of CMRL with the LPC input competes with OPUS, which is known to outperform AMR-WB at 23.85 kbps.

3 Experiments

We first show that, for the raw PCM input, CMRL outperforms AMR-WB and Kankanahalli-Net in terms of objective metrics in the experimental setup proposed in [14], where the use of LPC was not tested. Then, for subjective quality, we perform MUSHRA tests [31] to show that CMRL with an LPC residual input works better than AMR-WB and OPUS at high bitrates.

3.1 Experimental setup

300 and 50 speakers are randomly selected from the TIMIT [32] training and test datasets, respectively. We consider two types of inputs in the time domain: raw PCM and LPC residuals. For the raw PCM input, the data is normalized to have unit variance, and then directly fed to the model. For the LPC residual input, we conduct a spectral envelope estimation on the raw signals to get LPC residuals and the corresponding coefficients. The LPC residuals are modeled by the proposed end-to-end CMRL pipeline, while the LPC coefficients are quantized and sent directly to the receiver side at 2.4 kbps. The decoding process recovers the speech signal based on the LPC synthesis procedure using the LPC coefficients and the decoded residual signals.
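The LPC analysis and synthesis round trip can be sketched as follows. This is a generic autocorrelation-method illustration, not the authors' pipeline: the prediction order of 16, the small diagonal regularization, and the synthetic test signal are assumptions, and the coefficient quantization mentioned above is left out.

```python
import numpy as np

def lpc_coefficients(x, order=16):
    """Autocorrelation-method LPC: solve the Toeplitz normal equations for a_1..a_order."""
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    R += 1e-9 * r[0] * np.eye(order)       # small regularization (assumption)
    return np.linalg.solve(R, r[1:])

def lpc_residual(x, a):
    """Analysis filter: e[n] = x[n] - sum_k a_k * x[n - k]."""
    e = x.copy()
    for k, a_k in enumerate(a, start=1):
        e[k:] -= a_k * x[:-k]
    return e

def lpc_synthesis(e, a):
    """Synthesis filter: x[n] = e[n] + sum_k a_k * x[n - k]."""
    x = np.zeros_like(e)
    for n in range(len(e)):
        past = x[max(0, n - len(a)):n][::-1]     # x[n-1], x[n-2], ...
        x[n] = e[n] + np.dot(a[:len(past)], past)
    return x

rng = np.random.default_rng(0)
n = np.arange(2000)
signal = np.sin(0.1 * n) + 0.5 * np.sin(0.37 * n) + 0.01 * rng.standard_normal(len(n))
a = lpc_coefficients(signal)
residual = lpc_residual(signal, a)          # what the codec would model instead of raw PCM
restored = lpc_synthesis(residual, a)
print(np.sum(residual ** 2) / np.sum(signal ** 2), np.max(np.abs(restored - signal)))
```

With an unquantized residual the synthesis filter inverts the analysis filter, so any loss in the full codec comes from coding the residual and quantizing the coefficients.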

We consider four bitrate cases: 8.85 kbps, 15.85 kbps, 19.85 kbps and 23.85 kbps. All convolutional layers in CMRL use a 1-D kernel of size 9 and the leaky ReLU activation. CMRL hosts two modules, each with the topology in Table 1. Each residual learning block contains two bottleneck structures with dilation rates of 1 and 2. Note that for the lowest bitrate case, the second encoder downsamples each window to 128 symbols. The learning rate is 0.0001 to train the first module, and 0.00002 for the second module. Finetuning also uses 0.00002 as the learning rate. Each window contains 512 samples with an overlap size of 32. We use the Adam optimizer [33] with a batch size of 128 frames. Each module is trained for 30 epochs, followed by finetuning until the entropy is within the target range.

Figure 5: (a) SNR and PESQ per epoch (b) model complexity

3.2 Objective test

We evaluate 500 decoded utterances in terms of SNR and PESQ with the wideband extension (P.862.2) [34]. Figure 5 (a) shows the effectiveness of CMRL against a system with a single module in terms of SNR and PESQ values per epoch. For a fair comparison, the single module has three more bottleneck blocks and twice as many codes. It is trained for 90 epochs with the other hyperparameters unaltered. For both SNR and PESQ, the plot shows a noticeable performance jump when the second module is included, followed by another jump from finetuning.

Table 2 compares CMRL with AMR-WB and Kankanahalli-Net at four bitrates for the raw PCM input case. CMRL achieves both higher SNR and PESQ at all four bitrate cases. Note that the SNR for CMRL at 8.85 kbps is greater than AMR-WB at 23.85 kbps. CMRL also gives a better PESQ score at 15.85 kbps than AMR-WB at 23.85 kbps.

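| Codec | SNR (dB) at 8.85 kbps | SNR (dB) at 15.85 kbps | SNR (dB) at 19.85 kbps | SNR (dB) at 23.85 kbps | PESQ at 8.85 kbps | PESQ at 15.85 kbps | PESQ at 19.85 kbps | PESQ at 23.85 kbps |
|---|---|---|---|---|---|---|---|---|
| AMR-WB | 9.82 | 11.93 | 12.46 | 12.73 | 3.41 | 3.99 | 4.09 | 4.13 |
| K-Net | - | - | - | - | 3.63 | 4.13 | 4.22 | 4.30 |
| CMRL | 13.45 | 16.35 | 17.18 | 17.33 | 3.69 | 4.21 | 4.34 | 4.42 |

Table 2: SNR and PESQ scores on raw PCM test signals.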
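For reference, the SNR figures above follow the usual definition of signal power over error power in decibels. The helper below is a generic sketch of that definition, not the authors' evaluation script; details such as time alignment and per-utterance averaging are not specified in the text and are left out.

```python
import numpy as np

def snr_db(reference, decoded):
    """Signal-to-noise ratio in dB between a reference signal and its decoded version."""
    noise = reference - decoded
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2))

reference = np.sin(0.05 * np.arange(1600))
decoded = 0.9 * reference             # toy "codec" that attenuates the signal by 10%
print(snr_db(reference, decoded))     # about 20 dB: the error amplitude is 10% of the signal
```

On this scale, the 3.6 dB gap between CMRL and AMR-WB at 8.85 kbps in Table 2 corresponds to an error power that is lower by a factor of about 2.3.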

3.3 Subjective test

Figure 4 shows the results of MUSHRA tests performed by six audio experts on 10 decoded test samples randomly selected with gender equity. At 19.85 kbps and 23.85 kbps, CMRL with LPC residual inputs outperforms AMR-WB. At lower bitrates, though, AMR-WB works better. CMRL on raw PCM is less favored by listeners. We also compare CMRL with OPUS at the high bitrate where OPUS is known to perform well, and find that CMRL slightly outperforms OPUS [footnote 1].

Footnote 1: More decoded utterance samples are available at

3.4 Model complexity

The cross-module residual learning simplifies the topology of each component module. Hence, CMRL has less than 5% of the model parameters of the WaveNet based codec [16] (0.9 million versus 20 million), and outperforms Kankanahalli-Net with fewer model parameters. Figure 5 (b) summarizes the comparison.

4 Conclusion

In this work, we demonstrated that CMRL, as a lightweight model carrier for DNN based speech codecs, can compete with the industrial standards. By cascading two end-to-end modules, CMRL achieved a higher PESQ score at 15.85 kbps than AMR-WB at 23.85 kbps. We also showed that CMRL can consistently outperform a state-of-the-art DNN codec in terms of PESQ. CMRL is compatible with LPC, by having it as the first pre-processing module and by using its residual signals as the input. CMRL, coupled with LPC, outperformed AMR-WB at 19.85 kbps and 23.85 kbps, and worked better than OPUS at 23.85 kbps in the MUSHRA test. More work is required to examine other module structures to further improve the performance at low bitrates.

5 Acknowledgements

This work was supported by Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (2017-0-00072, Development of Audio/Video Coding and Light Field Media Fundamental Technologies for Ultra Realistic Tera-media). The authors also appreciate Srihari Kankanahalli for valuable discussion and sharing audio samples.


  • [1] P. Noll, “MPEG digital audio coding,” IEEE signal processing magazine, vol. 14, no. 5, pp. 59–81, 1997.
  • [2] K. Brandenburg and G. Stoll, “ISO/MPEG-1 audio: A generic standard for coding of high-quality digital audio,” Journal of the Audio Engineering Society, vol. 42, no. 10, pp. 780–792, 1994.
  • [3] K. R. Rao and J. J. Hwang, Techniques and standards for image, video, and audio coding.   Prentice Hall New Jersey, 1996, vol. 70.
  • [4] D. O’Shaughnessy, “Linear predictive coding,” IEEE potentials, vol. 7, no. 1, pp. 29–32, 1988.
  • [5] B. S. Atal and M. R. Schroeder, “Adaptive predictive coding of speech signals,” Bell System Technical Journal, vol. 49, no. 8, pp. 1973–1986, 1970.
  • [6] M. Schroeder and B. Atal, “Code-excited linear prediction (CELP): High-quality speech at very low bit rates,” in Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP’85., vol. 10.   IEEE, 1985, pp. 937–940.
  • [7] B. Bessette, R. Salami, R. Lefebvre, M. Jelinek, J. Rotola-Pukkila, J. Vainio, H. Mikkola, and K. Jarvinen, “The adaptive multirate wideband speech codec (amr-wb),” IEEE transactions on speech and audio processing, vol. 10, no. 8, pp. 620–636, 2002.
  • [8] J.-M. Valin, G. Maxwell, T. B. Terriberry, and K. Vos, “High-quality, low-delay music coding in the opus codec,” arXiv preprint arXiv:1602.04845, 2016.
  • [9] J. Makhoul, S. Roucos, and H. Gish, “Vector quantization in speech coding,” Proceedings of the IEEE, vol. 73, no. 11, pp. 1551–1588, 1985.
  • [10] M. Kim and P. Smaragdis, “Bitwise neural networks,” in International Conference on Machine Learning (ICML) Workshop on Resource-Efficient Machine Learning, Jul 2015.
  • [11] L. Deng, M. Seltzer, D. Yu, A. Acero, A. Mohamed, and G. Hinton, “Binary coding of speech spectrograms using a deep auto-encoder,” in Eleventh Annual Conference of the International Speech Communication Association, 2010.
  • [12] M. Cernak, A. Lazaridis, A. Asaei, and P. Garner, “Composition of deep and spiking neural networks for very low bit rate speech coding,” arXiv preprint arXiv:1604.04383, 2016.
  • [13] W. B. Kleijn, F. S. Lim, A. Luebs, J. Skoglund, F. Stimberg, Q. Wang, and T. C. Walters, “Wavenet based low rate speech coding,” arXiv preprint arXiv:1712.01120, 2017.
  • [14] S. Kankanahalli, “End-to-end optimized speech coding with deep neural networks,” arXiv preprint arXiv:1710.09064, 2017.
  • [15] L.-J. Liu, Z.-H. Ling, Y. Jiang, M. Zhou, and L.-R. Dai, “Wavenet vocoder with limited training data for voice conversion,” in Proc. Interspeech, 2018, pp. 1983–1987.
  • [16] Y. L. Cristina Garbacea, Aaron van den Oord, “Low bit-rate speech coding with vq-vae and a wavenet decoder,” in Proc. ICASSP, 2019.
  • [17] I. T. U. T. S. Sector, Coding of Speech at 8 Kbit/s Using Conjugate-structure Algebraic-code-excited Linear Prediction (CS-ACELP): Series G: Transmission Systems and Media, Digital Systems and Networks: Digital Terminal Equipments–Coding of Voice and Audio Signals.   ITU-T, 2013.
  • [18] G. Recommendation, “722.2:“wideband coding of speech at around 16 kbit/s using adaptive multi-rate wideband (amr-wb)”,” 2003.
  • [19] M. Neuendorf, M. Multrus, N. Rettelbach, G. Fuchs, J. Robilliard, J. Lecomte, S. Wilde, S. Bayer, S. Disch, C. Helmrich et al., “MPEG unified speech and audio coding-the iso/mpeg standard for high-efficiency audio coding of all content types,” in Audio Engineering Society Convention 132.   Audio Engineering Society, 2012.
  • [20] S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. Courville, and Y. Bengio, “Samplernn: An unconditional end-to-end neural audio generation model,” arXiv preprint arXiv:1612.07837, 2016.
  • [21] Y. L. Cristina Garbacea, Aaron van den Oord, “High-quality speech coding with samplernn,” in Proc. ICASSP, 2019.
  • [22] N. Johnston, D. Vincent, D. Minnen, M. Covell, S. Singh, T. Chinen, S. Jin Hwang, J. Shor, and G. Toderici, “Improved lossy image compression with priming and spatially adaptive bit rates for recurrent networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4385–4393.
  • [23] G. Schuller, B. Yu, and D. Huang, “Lossless coding of audio signals using cascaded prediction,” in 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 01CH37221), vol. 5.   IEEE, 2001, pp. 3273–3276.
  • [24] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [25] E. Agustsson, F. Mentzer, M. Tschannen, L. Cavigelli, R. Timofte, L. Benini, and L. V. Gool, “Soft-to-hard vector quantization for end-to-end learning compressible representations,” in Advances in Neural Information Processing Systems, 2017, pp. 1141–1151.
  • [26] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, “Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 1874–1883.
  • [27] K. Tan, J. Chen, and D. Wang, “Gated residual networks with dilated convolutions for supervised speech separation,” in Proc. ICASSP, 2018.
  • [28] D. A. Huffman, “A method for the construction of minimum-redundancy codes,” Proceedings of the IRE, vol. 40, no. 9, pp. 1098–1101, 1952.
  • [29] I. H. Witten, R. M. Neal, and J. G. Cleary, “Arithmetic coding for data compression,” Communications of the ACM, vol. 30, no. 6, pp. 520–541, 1987.
  • [30] L. R. Welch and E. R. Berlekamp, “Error correction for algebraic block codes,” Dec. 30 1986, US Patent 4,633,470.
  • [31] R. B. ITU-R, “1534-1,“method for the subjective assessment of intermediate quality levels of coding systems (mushra)”,” International Telecommunication Union, 2003.
  • [32] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, N. L. Dahlgren, and V. Zue, “TIMIT acoustic-phonetic continuous speech corpus,” Linguistic Data Consortium, Philadelphia, 1993.
  • [33] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [34] A. Rix, J. Beerends, M. Hollier, and A. Hekstra, “Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs,” in Acoustics, Speech, and Signal Processing, 2001. Proceedings.(ICASSP’01). 2001 IEEE International Conference on, vol. 2.   IEEE, 2001, pp. 749–752.