To build a comfortable voice communication system in a noisy environment, it is necessary to include speech enhancement or noise reduction techniques [1, 2, 3, 4]. However, the core module of the coding system, i.e., the vocoder, and speech enhancement techniques have been developed independently of each other. Thus, the entire speech communication system is implemented by simply concatenating the two systems: the enhancement module runs first, and the vocoder runs afterwards. In this arrangement, the characteristics of the coding process are not adequately considered during enhancement, and it is difficult to integrate conventional enhancement systems into the speech coding system at their full performance, especially because of the additional but necessary processing that the enhancement system requires.
Although the cost of modern communication devices has dropped significantly thanks to advances in semiconductor technologies, it remains an important issue in many underdeveloped countries, where low-cost processors operating in a limited communication bandwidth environment are common. Thus, the need for an effective low bit-rate speech coding technique is still very high. Based on the assumption that we can modify the core algorithm of a conventional low bit-rate speech coding standard, we chose the 2.4 kbit/s mixed excitation linear prediction (MELP) speech coder as our target system [5, 6]. Several statistical speech enhancement techniques have been combined with the MELP codec to improve its robustness in background noise conditions [2, 3], but the improvement was limited, especially in non-stationary noise environments.
To improve the quality of speech enhancement systems, recently developed deep learning (DL) techniques have been actively utilized [7, 8, 9]. Typical examples are time-frequency (T-F) masking-based algorithms [7, 8]. These algorithms first determine target T-F masking values using various signal-to-noise ratio (SNR) criteria, then train a DL network that takes the noisy spectrum as input and outputs the target masking value in each T-F bin.
However, these approaches operate in the frequency domain and require Fourier transform and overlap-add (OLA) processes; thus, additional delay and computational complexity are unavoidable, which makes them unsuitable for a low bit-rate coding system. Moreover, the approach is not a good choice in terms of memory usage, because the large input/output dimensions of the DL network, which correspond to the frequency-domain resolution, require a large DL network. Although several algorithms that operate directly on the time-domain speech signal have been proposed [10, 11, 12], they are unsuitable for voice communication systems, because they are typically designed with a non-causal structure and assume rich resources.
In this paper, we propose a DL-based vocoder parameter enhancement system that can be directly applied to the 2.4 kbit/s MELP-based speech communication system. Given the coding parameters produced by the MELP encoder in a noisy environment, e.g., line spectral frequencies (LSFs), gain, pitch, and other excitation-related parameters, we use them as the input to the DL network and directly estimate the parameters as they would be obtained from the clean speech input. As the proposed enhancement works within the vocoder parameterization/waveform reconstruction processes and does not require any T-F analysis/synthesis process, its complexity is very low, and there is no additional delay. In addition, as the dimension of the input and output features is small, it is possible to build a very small DL network; we needed only 182.11 KB of memory, even with a 32-bit single-precision floating-point format. The objective and subjective test results confirm that the proposed vocoder parameter enhancement algorithm provides a much simpler and faster enhancement architecture than conventional speech enhancement systems while retaining high quality.
2 MELP coder with speech enhancement
2.1 MELP vocoder
The main characteristic of the MELP codec [5, 6] is that it models the excitation signal by mixing voiced pulse and noise components in the frequency domain, where bandpass voicing flags represent the voicing information of the frequency subbands. In this system, a total of six parameters, consisting of excitation and spectral parameters, are transmitted to the decoder at a frame interval of 22.5 ms. Spectral parameters are represented by 10-dimensional line spectral frequencies (LSFs) and two-dimensional gain terms, whereas excitation parameters consist of 5-dimensional bandpass voicing flags, a pitch value, an aperiodicity flag, and 10-dimensional Fourier magnitudes.
To obtain the voiced excitation component, the first 10 harmonics of an impulse train whose interval equals the pitch period are shaped with the decoded Fourier magnitudes. If the aperiodicity flag is on, a jitter effect is imposed on the pulse excitation by randomly shifting the position of each impulse within a 25% range of the pitch value. Then, the mixed excitation signal is obtained by summing the pulse and noise excitation components after passing them through bandpass filters controlled by the bandpass voicing flags. Finally, the speech signal is reconstructed by applying a linear prediction synthesis process. To make the reconstructed speech signal sound natural, an adaptive spectral enhancement filter and a pulse dispersion filter are applied.
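As a rough illustration of the jitter step, the following Python sketch places pitch pulses and, when the aperiodicity flag is set, perturbs each pulse interval by up to 25% of the pitch period. The function name, the uniform distribution, and the choice to perturb intervals rather than absolute positions are assumptions made for illustration; they are not taken from the MELP specification.

```python
import random

def pulse_positions(pitch_period, n_samples, aperiodic, rng):
    """Return sample indices of excitation pulses within one synthesis span.

    pitch_period: pitch period in samples.
    aperiodic: if True, each interval is scaled by a random factor in [0.75, 1.25].
    rng: a random.Random instance, passed in so that results are reproducible.
    """
    positions = []
    t = 0.0
    while t < n_samples:
        positions.append(int(round(t)))
        interval = float(pitch_period)
        if aperiodic:
            # Jitter within 25% of the pitch period (uniform draw is an assumption).
            interval *= 1.0 + rng.uniform(-0.25, 0.25)
        t += interval
    return positions

periodic = pulse_positions(40, 180, aperiodic=False, rng=random.Random(0))
jittered = pulse_positions(40, 180, aperiodic=True, rng=random.Random(0))
```

With the flag off the pulses are strictly periodic; with it on, consecutive pulse spacings vary around the nominal period, which is what breaks the strict periodicity of the pulse train.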
2.2 Speech enhancement as pre-processing algorithm
To improve the quality of the MELP coding system, speech enhancement algorithms have been introduced as a pre-processing module of the MELP encoder, as illustrated in Fig. 1-(a). For instance, Martin et al. [4] introduced a minimum mean square error log spectral amplitude estimator (MMSE-LSA) speech enhancement system [13] as a pre-processing module. However, its performance was not satisfactory, especially in non-stationary noise environments.
By applying DL-based enhancement techniques, the quality of the coding system can be further improved. In general, DL techniques are used to model features obtained from the power spectra of clean and noisy speech. For instance, a widely used ideal ratio mask (IRM)-based DL network generates IRM components, which are T-F masking values computed as a function of the SNR between the clean and noise power spectra. Then, the enhanced magnitude spectrum is obtained by multiplying the noisy input spectrum by the IRM components. Finally, a speech signal is reconstructed by applying an inverse Fourier transform and an OLA process.
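The masking step can be sketched for a single frame as follows. The text only states that the IRM is a function of the SNR between clean and noise power; the square-root power-ratio form used here is one common definition and is an assumption, as are the function names and toy values.

```python
import math

def ideal_ratio_mask(clean_power, noise_power, eps=1e-12):
    """IRM per frequency bin, assumed here to be sqrt(S / (S + N))."""
    return [math.sqrt(s / (s + n + eps)) for s, n in zip(clean_power, noise_power)]

def apply_mask(noisy_magnitude, mask):
    """Scale each noisy magnitude bin by its mask value."""
    return [m * x for m, x in zip(mask, noisy_magnitude)]

# One frame with three frequency bins (toy values).
clean_power = [4.0, 1.0, 0.0]
noise_power = [0.0, 1.0, 2.0]
mask = ideal_ratio_mask(clean_power, noise_power)

noisy_magnitude = [2.0, 1.4, 1.5]
enhanced = apply_mask(noisy_magnitude, mask)
```

Bins dominated by speech keep a mask value near 1, bins dominated by noise are driven toward 0, and bins with equal speech and noise power are attenuated by roughly 3 dB.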
Despite their high performance, conventional DL-based speech enhancement algorithms have several limitations when used as a pre-processing module for a low bit-rate speech codec, namely complexity, memory, and delay. For instance, the forward and inverse Fourier transforms and the OLA process required for enhancement introduce unavoidable delay and computational bottlenecks. Moreover, the conventional enhancement algorithm requires a large DL network, because the dimension of the input and output vectors equals the resolution of the Fourier analysis, which must be large to achieve high performance. Thus, it is difficult to train a network with a small number of parameters.
In the next section, we propose a parameter enhancement method for the 2.4 kbit/s MELP communication system, which is implementable by a very simple system architecture with low complexity, low additional memory usage, and no delay.
3 Vocoder parameter enhancement method
In the proposed system, the noise-corrupted speech signal is first parameterized into MELP parameters without any pre-processing. Then, the noisy MELP parameters are directly enhanced by a DL network so that they resemble those obtained from a clean speech signal. To train the network, both noisy and clean MELP parameters are first extracted from noisy-clean speech pairs through the MELP vocoder analysis described in Section 2.1. Then, the DL network is trained to estimate clean MELP parameters from noisy MELP parameters by minimizing the mean squared error (MSE).
To model the MELP parameters more accurately, some of them, namely gain, pitch, and Fourier magnitudes, are refined before training the network. First, the mean and difference of the two-dimensional gain features are used instead of the original gain features. The gain features represent the energy of two adjacent subframes, so generating them independently results in a discontinuous energy contour with a shimmer-like sound artifact. Coupling the two gain terms as a mean-difference pair resolves this problem. Moreover, because pitch values in unvoiced regions are initially set to zero, the discontinuity at each voiced/unvoiced boundary decreases the regression accuracy of the pitch contour. We therefore linearly interpolate the pitch values across unvoiced regions to prevent discontinuities at the voiced/unvoiced boundaries. Finally, we use logarithmic Fourier magnitudes to estimate their trajectory accurately.
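The three refinements can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the difference term is taken as half the gain difference (so that the gains are recovered by summing and subtracting), leading and trailing unvoiced frames hold the nearest voiced pitch value, and the logarithm uses a small floor. All function names are hypothetical.

```python
import math

def gain_to_mean_diff(g1, g2):
    """Couple two subframe gains as (mean, half-difference); half-difference is an assumption."""
    return (g1 + g2) / 2.0, (g1 - g2) / 2.0

def interpolate_unvoiced_pitch(pitch):
    """Replace zero (unvoiced) pitch values by linear interpolation between voiced frames."""
    voiced = [i for i, p in enumerate(pitch) if p > 0]
    out = list(pitch)
    if not voiced:
        return out
    # Edges: hold the nearest voiced value (assumption; the text does not specify this).
    for i in range(voiced[0]):
        out[i] = pitch[voiced[0]]
    for i in range(voiced[-1] + 1, len(pitch)):
        out[i] = pitch[voiced[-1]]
    # Interior gaps: straight line between the surrounding voiced frames.
    for a, b in zip(voiced, voiced[1:]):
        for i in range(a + 1, b):
            t = (i - a) / (b - a)
            out[i] = (1.0 - t) * pitch[a] + t * pitch[b]
    return out

def log_fourier_magnitudes(mags, floor=1e-6):
    """Natural log of the Fourier magnitudes, floored to avoid log(0)."""
    return [math.log(max(m, floor)) for m in mags]

mean_gain, diff_gain = gain_to_mean_diff(60.0, 50.0)
contour = interpolate_unvoiced_pitch([0.0, 100.0, 0.0, 0.0, 130.0, 0.0])
log_mags = log_fourier_magnitudes([1.0, math.e])
```

In the toy contour, the two unvoiced frames between 100 and 130 become roughly 110 and 120, so the regression target no longer jumps to zero at the voicing boundaries.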
Since the MELP parameters are available at both the encoder and the decoder, we distinguish two variants of the proposed parameter enhancement method according to where the enhancement is performed, which gives the system greater flexibility: the encoder-side and decoder-side approaches, illustrated in Fig. 1-(b) and (c), respectively. In the encoder-side approach, noisy MELP parameters are first extracted from the noisy speech signal without any pre-enhancement processing, and the parameters are enhanced by the pre-trained DL network. Then, the enhanced MELP parameters are transmitted to the decoder after quantization and bitstream formatting. In the decoder-side approach, on the other hand, the MELP parameters reconstructed at the decoder are enhanced by the pre-trained DL network. Then, the MELP synthesis module reconstructs the speech signal from the enhanced MELP parameters.
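The two placements differ only in where the enhancement network sits relative to quantization. The following schematic sketch passes every stage in as a hypothetical callable; none of these names come from an actual MELP implementation, and the toy stages merely record the order in which they run.

```python
def encoder_side(noisy_speech, analyze, enhance, quantize, dequantize, synthesize):
    """Enhance before quantization: the bitstream carries enhanced parameters."""
    bitstream = quantize(enhance(analyze(noisy_speech)))
    return synthesize(dequantize(bitstream))

def decoder_side(noisy_speech, analyze, enhance, quantize, dequantize, synthesize):
    """Enhance after dequantization: the network sees quantized noisy parameters."""
    bitstream = quantize(analyze(noisy_speech))
    return synthesize(enhance(dequantize(bitstream)))

def make_stage(name):
    """Toy stage that appends its own name to a running trace."""
    return lambda trace: trace + [name]

stage_names = ("analyze", "enhance", "quantize", "dequantize", "synthesize")
stages = {name: make_stage(name) for name in stage_names}

enc_order = encoder_side([], **stages)
dec_order = decoder_side([], **stages)
```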
Note that the parameter enhancement approaches do not require any framing process, Fourier transform, or heuristic selection of hyper-parameters such as the type of analysis window, window length, frame rate, or resolution of the Fourier transform. Thus, we are able to build a simple and compact speech enhancement system that is fully specialized to the codec specifications. Its computational complexity is very low, and there is no additional delay in the enhancement process. In addition, the DL network's input and output vectors are only 29-dimensional; thus, their behavior is very simple in comparison with the features used in conventional speech enhancement, which enables us to model them successfully with a small DL network.
4.1 Database generation
The TIMIT [14] and NOISEX-92 [15] corpora were used as speech and noise databases, respectively. To match the sampling rate of the 2.4 kbit/s MELP codec, all samples were down-sampled to 8 kHz. In the TIMIT database, the sentences “SA1” and “SA2”, which are recorded by all speakers, were excluded from the experiments; thus, a total of 3,696 utterances were used for the training set, 1,152 utterances for the validation set, and 48 utterances for the test set. In the noise database, four types of noise (babble, factory1, volvo, and white noise) were used in the seen-condition test, and two other types (destroyer engine and pink noise) were used in the unseen-condition test. To construct a noisy speech database for training and validation, each speech signal was replicated and mixed with each noise signal under six SNR conditions from -5 dB to 20 dB with a step size of 5 dB. As a result, a total of 88,704 utterances (about 75 hours) and 27,648 utterances (about 24 hours) were used as the training and validation sets, respectively. For testing, the speech signal was mixed with both seen and unseen noises under four SNR conditions from 0 dB to 15 dB with a step size of 5 dB, so that 768 utterances for the seen condition and 384 utterances for the unseen condition were tested.
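Mixing at a prescribed SNR amounts to scaling the noise before adding it to the speech. The sketch below assumes that the SNR is defined on mean power over the whole utterance (the text does not state whether an active-speech level was used) and substitutes a toy tone and Gaussian noise for the actual corpora; the function names are hypothetical.

```python
import math
import random

def mean_power(x):
    """Average power of a sample sequence."""
    return sum(v * v for v in x) / len(x)

def mix_at_snr(speech, noise, snr_db):
    """Scale noise so the speech-to-noise power ratio equals snr_db, then add it."""
    target_noise_power = mean_power(speech) / (10.0 ** (snr_db / 10.0))
    scale = math.sqrt(target_noise_power / mean_power(noise))
    return [s + scale * n for s, n in zip(speech, noise)]

rng = random.Random(0)
# Toy stand-ins: a 1-second 200 Hz tone at 8 kHz and white Gaussian noise.
speech = [math.sin(2.0 * math.pi * 200.0 * t / 8000.0) for t in range(8000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(8000)]

train_snrs = list(range(-5, 21, 5))  # six conditions: -5, 0, 5, 10, 15, 20 dB
noisy = {snr: mix_at_snr(speech, noise, snr) for snr in train_snrs}
```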
4.2 Network architectures
To show the performance of the proposed systems, we included the MELP codec system with an IRM-based speech enhancement algorithm [7] as a baseline system. For each system, we tested gated recurrent unit (GRU)-based [16] DL networks of two sizes: a large network and a small network were used to simulate enhancement scenarios with rich and limited resources, respectively. The detailed settings of each system are described below.
[Table columns: Noise type | SNR | No enhance. | Large network | Small network]
4.2.1 Parameter enhancement system
The MELP parameters extracted from the noisy-clean speech pairs and refined by the method described in Section 3 formed the 29-dimensional input and output vectors of the parameter enhancement network. Before training, both the input and output features were normalized to have zero mean and unit variance.
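Zero-mean, unit-variance normalization can be sketched as below. The sketch assumes that the statistics are computed per feature dimension on the training set and reused at inference time, which is common practice but is not stated in the text; the function names are hypothetical.

```python
import math

def fit_normalizer(frames):
    """Per-dimension mean and standard deviation over a list of feature frames."""
    n, dim = len(frames), len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    std = []
    for d in range(dim):
        var = sum((f[d] - mean[d]) ** 2 for f in frames) / n
        std.append(math.sqrt(var) if var > 0 else 1.0)  # guard constant dimensions
    return mean, std

def normalize(frame, mean, std):
    """Map raw features to zero mean and unit variance."""
    return [(x - m) / s for x, m, s in zip(frame, mean, std)]

def denormalize(frame, mean, std):
    """Map network outputs back to the original feature scale."""
    return [x * s + m for x, m, s in zip(frame, mean, std)]

train = [[1.0, 10.0], [3.0, 10.0]]  # two toy 2-dimensional frames
mean, std = fit_normalizer(train)
z = [normalize(f, mean, std) for f in train]
restored = [denormalize(f, mean, std) for f in z]
```

The inverse mapping is needed on the output side, since the network predicts normalized clean parameters that must be rescaled before synthesis.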
In the large network, the hidden layers consisted of two GRU layers with 512 units at the input side and two feed-forward (FF) layers with 1,024 hidden nodes at the output side. In the small network, the hidden layers consisted of one GRU layer with 64 units at the input side and two FF layers with 128 hidden nodes at the output side. In total, the large network had around 4.01 million parameters, corresponding to 15.30 MByte, and the small network had 46.62 thousand parameters, corresponding to 0.18 MByte, both in 32-bit single-precision floating point. The rectified linear unit (ReLU) and linear activation functions were used in the hidden and output layers, respectively. The weights were initialized using the Xavier initializer [17] and trained using the back-propagation through time procedure with the Adam optimizer [18, 19] to minimize the MSE criterion.
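The reported sizes can be checked with a short parameter count. The sketch assumes a GRU with three gate blocks and a single bias vector per gate, plus a final 29-dimensional linear output layer after the two FF layers; under these assumptions the totals agree with the figures quoted above when MByte is read as 2^20 bytes. The helper names are hypothetical.

```python
def gru_params(n_in, n_hidden):
    """Three gate blocks, each with input weights, recurrent weights, and one bias vector."""
    return 3 * (n_in * n_hidden + n_hidden * n_hidden + n_hidden)

def dense_params(n_in, n_out):
    """Fully connected layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out

def network_params(n_in, gru_sizes, ff_sizes, n_out):
    """GRU stack, then FF stack, then a linear output layer."""
    total, prev = 0, n_in
    for h in gru_sizes:
        total += gru_params(prev, h)
        prev = h
    for h in ff_sizes:
        total += dense_params(prev, h)
        prev = h
    return total + dense_params(prev, n_out)

large = network_params(29, [512, 512], [1024, 1024], 29)
small = network_params(29, [64], [128, 128], 29)

BYTES_PER_PARAM = 4  # 32-bit single precision
large_mib = large * BYTES_PER_PARAM / 2**20
small_kib = small * BYTES_PER_PARAM / 2**10

print(large, round(large_mib, 2))  # 4011549 15.3
print(small, round(small_kib, 2))  # 46621 182.11
```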
In the enhancement step, the inverse of each feature refinement was performed. First, the two gain features were reconstructed by summing and subtracting the mean and difference gain terms, respectively. Then, the pitch in unvoiced regions was set to zero, and the exponential function was applied to the Fourier magnitudes. Finally, the speech waveform was reconstructed using the MELP synthesizer.
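A minimal sketch of these inverse steps follows. It assumes that the difference term is half the gain difference, that the voiced/unvoiced decision used to zero the pitch is supplied per frame (for example from the enhanced first-band voicing flag, which the text does not specify), and that natural logarithms were used for the Fourier magnitudes. The function names are hypothetical.

```python
import math

def mean_diff_to_gain(mean, diff):
    """Recover the two subframe gains by summing and subtracting."""
    return mean + diff, mean - diff

def zero_unvoiced_pitch(pitch, voiced_flags):
    """Keep pitch in voiced frames and set it to zero elsewhere."""
    return [p if v else 0.0 for p, v in zip(pitch, voiced_flags)]

def exp_fourier_magnitudes(log_mags):
    """Undo the logarithm applied to the Fourier magnitudes."""
    return [math.exp(x) for x in log_mags]

g1, g2 = mean_diff_to_gain(55.0, 5.0)
pitch = zero_unvoiced_pitch([100.0, 110.0, 120.0], [True, False, True])
mags = exp_fourier_magnitudes([0.0, 1.0])
```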
4.2.2 IRM-based speech enhancement system
To compare the systems under fair conditions, the size of the IRM-based DL network was set to be similar to that of the parameter enhancement system. To compose the input and output vectors, 129-dimensional log-power spectrum and IRM components were extracted from 32-ms speech frames with a 22.5-ms frame shift and a 9.5-ms overlap. In the large network, the hidden layers consisted of two GRU layers with 512 units at the input side and two FF layers with 1,024 hidden nodes at the output side. In the small network, the hidden layers consisted of one GRU layer with 64 units at the input side and two FF layers with 64 hidden nodes at the output side. In total, the networks occupied around 16.28 MByte and 0.21 MByte, respectively, in 32-bit single-precision floating point. The ReLU and sigmoid activation functions were used in the hidden and output layers, respectively. The weight initialization and training methods were the same as in the parameter enhancement system.
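The frame geometry follows directly from the 8-kHz sampling rate, as the short calculation below shows. It assumes that the Fourier transform size equals the frame length, which is consistent with the 129-dimensional features.

```python
SAMPLE_RATE = 8000  # Hz

frame_len = 32 * SAMPLE_RATE // 1000       # 32 ms frame    -> 256 samples
hop_len = 225 * SAMPLE_RATE // 10000       # 22.5 ms shift  -> 180 samples
overlap_len = frame_len - hop_len          # samples shared by adjacent frames
overlap_ms = 1000.0 * overlap_len / SAMPLE_RATE
n_bins = frame_len // 2 + 1                # one-sided spectrum of a real signal

print(frame_len, hop_len, overlap_len, overlap_ms, n_bins)  # 256 180 76 9.5 129
```

The 22.5-ms shift matches the MELP frame interval, so the mask-based system produces one spectral frame per codec frame.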
4.3 Objective and subjective evaluation results
In the objective test, we evaluated the distortion between the MELP parameters obtained from clean and enhanced speech signals. The distortion metrics were the error rate of the voicing flag in the first frequency band (VUV error; %), the root mean square error (RMSE) of the gain features (Gain-RMSE), the RMSE of F0 (F0-RMSE; Hz), and the log-spectral distance of the LSFs (LSD; dB). In addition, the short-time objective intelligibility (STOI) [20] was measured to evaluate the intelligibility of the reconstructed speech. For F0-RMSE, only voiced regions were evaluated. Moreover, we included the codec outputs of clean and noisy speech signals to simulate the performance of communication systems with no background noise and with no enhancement module, respectively. Note that these systems represent the upper and lower bounds of performance, respectively.
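Two of the parameter-domain metrics can be sketched as follows. The sketch assumes that the voicing flags are compared frame by frame and that "voiced regions" for F0-RMSE means frames in which both the reference and the estimate are voiced; the text does not specify the latter, and the function names are hypothetical.

```python
import math

def vuv_error_rate(ref_voiced, est_voiced):
    """Percentage of frames whose first-band voicing decision differs."""
    errors = sum(1 for r, e in zip(ref_voiced, est_voiced) if bool(r) != bool(e))
    return 100.0 * errors / len(ref_voiced)

def rmse(ref, est):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(sum((r - e) ** 2 for r, e in zip(ref, est)) / len(ref))

def f0_rmse_voiced(ref_f0, est_f0):
    """RMSE in Hz over frames where both contours are voiced (non-zero)."""
    pairs = [(r, e) for r, e in zip(ref_f0, est_f0) if r > 0 and e > 0]
    return rmse([r for r, _ in pairs], [e for _, e in pairs])

vuv = vuv_error_rate([1, 1, 0, 0], [1, 0, 0, 1])
f0_err = f0_rmse_voiced([100.0, 120.0, 0.0, 0.0], [110.0, 0.0, 0.0, 90.0])
gain_err = rmse([1.0, 2.0], [1.0, 4.0])
```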
The objective results are summarized in Tables 1 and 2. The main findings are as follows. (1) All enhancement systems showed significant improvement in all metrics (noisy vs. all enhancement systems). (2) In modeling the statistical characteristics of speech, such as prosody and voice color, the proposed parameter enhancement showed much higher accuracy than the conventional IRM-based speech enhancement (IRM vs. Param in VUV error, Gain-RMSE, F0-RMSE, and LSD). This is because the parameter enhancement is designed to estimate the statistical characteristics of clean speech directly. (3) The intelligibility of the proposed parameter enhancement was slightly worse than that of the IRM-based enhancement (STOI). Note that the parameter enhancement operates in the vocoder parameter domain, so the cross-correlation between clean and enhanced speech, on which STOI is based, is more easily weakened than in IRM-based enhancement, which operates directly in the frequency domain. (4) The decoder-side parameter enhancement performed slightly worse than the encoder-side parameter enhancement (Param-Enc vs. Param-Dec). This implies that quantization applied before the DL network degrades performance more than quantization applied after it. However, the difference was negligible, and the trend with respect to the IRM-based enhancement was the same. (5) Although the smaller network performed worse than the larger one (large network vs. small network), the degradation in the proposed parametric approach was much smaller than in the IRM-based approach.
To evaluate the perceptual quality of the proposed system, an A-B preference test was performed. Specifically, 20 randomly selected utterances from the test set were mixed with babble and volvo noise at 7 dB SNR, then enhanced using the small network setup to simulate a real communication environment. In the evaluation, a total of 12 listeners were asked to indicate their quality preference and were instructed to pay attention to both signal distortion and noise intrusiveness. The preference test results summarized in Table 3 first confirm that parameter enhancement, like IRM enhancement, significantly improves the quality of the MELP coding system (Tests 1 and 2). Moreover, the results showed that the perceptual qualities of all enhancement systems were indistinguishable (Tests 3 and 4).
4.4 Computational efficiency of enhancement systems
To evaluate the computational efficiency of the systems, we computed the number of floating-point operations per second (FLOPs) required by each system. For the DL networks, the operations for biases and activations were neglected; for the Fourier transform, we assumed the real split-radix fast Fourier transform algorithm [21], which requires 3,078 floating-point operations for a 256-point transform. As a result, we obtained a total computational cost of around 5.06 MFLOPs and 4.11 MFLOPs for the small IRM enhancement and parameter enhancement systems, respectively. Given the comparable performance of the IRM and parameter enhancement systems, the proposed parameter enhancement system is computationally cheaper than the conventional IRM enhancement system by about 1 MFLOPs.
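A back-of-the-envelope version of this calculation is sketched below. It assumes two floating-point operations per weight (one multiply and one add), one forward and one inverse 256-point transform per frame for the IRM system, a 22.5-ms frame interval, and the small network layouts described in Section 4.2 with a linear output layer matching the feature dimension. Under these assumptions the estimates land within about 1% of the reported figures; the remaining gap presumably comes from operations that this sketch does not model.

```python
FRAME_INTERVAL_S = 0.0225   # one MELP frame every 22.5 ms
FFT_256_FLOP = 3078         # real split-radix FFT cost quoted in the text

def gru_macs(n_in, n_hidden):
    """Multiply-accumulates in the three gate blocks (biases ignored)."""
    return 3 * (n_in * n_hidden + n_hidden * n_hidden)

def network_flop_per_frame(n_in, gru_sizes, ff_sizes, n_out):
    """Two operations per weight: one multiply and one add (assumption)."""
    macs, prev = 0, n_in
    for h in gru_sizes:
        macs += gru_macs(prev, h)
        prev = h
    for h in list(ff_sizes) + [n_out]:
        macs += prev * h
        prev = h
    return 2 * macs

param_flop = network_flop_per_frame(29, [64], [128, 128], 29)
irm_flop = network_flop_per_frame(129, [64], [64, 64], 129) + 2 * FFT_256_FLOP

param_mflops = param_flop / FRAME_INTERVAL_S / 1e6
irm_mflops = irm_flop / FRAME_INTERVAL_S / 1e6
saving_mflops = irm_mflops - param_mflops
```

The saving comes from two sources: the parameter network has smaller input and output layers, and it avoids the forward and inverse transforms altogether.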
In this paper, we introduced a DL-based parameter enhancement method for a MELP speech codec in noisy communication environments. By directly enhancing the MELP parameters, the proposed algorithm was successfully combined with the MELP-based speech communication system. Experimental results showed that the proposed method had higher statistical modeling accuracy in terms of prosody and voice characteristics and a lower computational cost than the conventional speech enhancement system, while the perceptual quality was similar. In summary, by removing additional processing pipelines, the proposed approach yields a simple and compact speech enhancement system for a low-profile speech codec in noisy environments.
This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science and ICT (2019-11-0124).
[1] W. B. Kleijn and K. K. Paliwal, Eds., Speech Coding and Synthesis. New York, NY, USA: Elsevier Science Inc., 1995.
[2] J. S. Collura, D. F. Brandt, and D. J. Rahikka, “The 1.2 kbps/2.4 kbps MELP speech coding suite with integrated noise pre-processing,” in MILCOM 1999. IEEE Military Communications. Conference Proceedings (Cat. No.99CH36341), vol. 2, Oct. 1999, pp. 1449–1453.
[3] T. Agarwal and P. Kabal, “Preprocessing of noisy speech for voice coders,” in Speech Coding, 2002, IEEE Workshop Proceedings, Oct. 2002, pp. 169–171.
[4] R. Martin and R. V. Cox, “New speech enhancement techniques for low bit rate speech coding,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1999, pp. 614–617.
[5] A. McCree and T. P. Barnwell, “A mixed excitation LPC vocoder model for low bit rate speech coding,” IEEE Trans. Speech and Audio Processing, vol. 3, pp. 242–250, 1995.
[6] L. M. Supplee, R. P. Cohn, J. S. Collura, and A. V. McCree, “MELP: the new federal standard at 2400 bps,” in 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, April 1997, pp. 1591–1594.
[7] A. Narayanan and D. Wang, “Ideal ratio mask estimation using deep neural networks for robust speech recognition,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, May 2013, pp. 7092–7096.
[8] H. Erdogan, J. Hershey, S. Watanabe, and J. Le Roux, “Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks,” in 2015 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2015 - Proceedings, vol. 2015-August, Aug. 2015, pp. 708–712.
[9] X. Lu, Y. Tsao, S. Matsuda, and C. Hori, “Speech enhancement based on deep denoising autoencoder,” in INTERSPEECH. ISCA, 2013, pp. 436–440.
[10] D. Stoller, S. Ewert, and S. Dixon, “Wave-U-Net: A multi-scale neural network for end-to-end audio source separation,” in Proceedings of the 19th International Society for Music Information Retrieval Conference, ISMIR 2018, Paris, France, September 23-27, 2018, pp. 334–340. [Online]. Available: http://ismir2018.ircam.fr/doc/pdfs/205_Paper.pdf
[11] S. Pascual, A. Bonafonte, and J. Serrà, “SEGAN: Speech enhancement generative adversarial network,” in INTERSPEECH, 2017.
[12] D. Rethage, J. Pons, and X. Serra, “A WaveNet for speech denoising,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), April 2018, pp. 5069–5073.
[13] Y. Ephraim and D. Malah, “Speech enhancement using a minimum mean-square error log-spectral amplitude estimator,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 33, no. 2, pp. 443–445, April 1985.
[14] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, and N. L. Dahlgren, “DARPA TIMIT acoustic phonetic continuous speech corpus CDROM,” 1993.
[15] A. Varga and H. J. Steeneken, “Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems,” Speech Communication, vol. 12, no. 3, pp. 247–251, 1993.
[16] K. Cho, B. van Merriënboer, Ç. Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using RNN encoder–decoder for statistical machine translation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar: Association for Computational Linguistics, Oct. 2014, pp. 1724–1734. [Online]. Available: http://www.aclweb.org/anthology/D14-1179
[17] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proc. AISTATS, 2010, pp. 249–256.
[18] R. J. Williams and J. Peng, “An efficient gradient-based algorithm for on-line training of recurrent network trajectories,” Neural Computation, vol. 2, no. 4, pp. 490–501, 1990.
[19] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR, vol. abs/1412.6980, 2014. [Online]. Available: http://arxiv.org/abs/1412.6980
[20] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, “A short-time objective intelligibility measure for time-frequency weighted noisy speech,” in 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, March 2010, pp. 4214–4217.
[21] H. Sorensen, D. Jones, M. Heideman, and C. Burrus, “Real-valued fast Fourier transform algorithms,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 35, no. 6, pp. 849–863, June 1987.