Weighted Speech Distortion Losses for Neural-network-based Real-time Speech Enhancement

01/28/2020 ∙ by Yangyang Xia, et al. ∙ Microsoft ∙ Carnegie Mellon University

This paper investigates several aspects of training an RNN (recurrent neural network) that impact the objective and subjective quality of enhanced speech for real-time single-channel speech enhancement. Specifically, we focus on an RNN that enhances short-time speech spectra on a single-frame-in, single-frame-out basis, a framework adopted by most classical signal processing methods. We propose two novel mean-squared-error-based learning objectives that enable separate control over the importance of speech distortion versus noise reduction. The proposed loss functions are evaluated by widely accepted objective quality and intelligibility measures and compared to other competitive online methods. In addition, we study the impact of feature normalization and varying batch sequence lengths on the objective quality of enhanced speech. Finally, we show subjective ratings for the proposed approach and a state-of-the-art real-time RNN-based method.




1 Introduction

Speech enhancement (SE) algorithms aim at improving speech quality and intelligibility of speech signals degraded by additive noise [16], in order to improve human or machine interpretation of speech. Typical SE applications are hearing aids, automatic speech recognition, and audio/video communications in noisy environments. Most SE methods apply a spectral suppression gain or filter to the noisy speech signal in a time-frequency domain. In recent supervised learning methods using deep neural networks (DNNs), a DNN is typically set up to estimate this time-varying gain function [26] from one or more sets of features derived from noisy speech.

Online processing capability is an attractive feature of an SE algorithm and is required for real-time communication applications. Although most classical SE methods have to adapt their approaches [6, 5, 7, 2] to fulfill causality, many DNN-based methods in the literature [26, 9, 18] do not enforce this constraint. Several DNN-based approaches report high-quality enhancement using generous look-ahead [9, 18], but their performance with decreasing look-ahead is not well investigated. Nevertheless, DNN-based systems are preferred over classical methods for their ability to accurately suppress transient noise. In this work, we study real-time speech enhancement with a recurrent neural network (RNN). Recent works involving RNNs demonstrated promising results [25], even in very low signal-to-noise ratio (SNR) scenarios [23, 27].

A key challenge in designing an SE algorithm for audio/video communication is to preserve perceived (subjective) speech quality to the best extent possible while suppressing the noise. In the classical literature, optimizing such a compound global objective can be done by solving a constrained objective function [3]. Alternatively, one can optimize a simpler objective such as the (log) mean-squared error (MSE) [7, 8] and employ post-processing modules such as residual noise removal [2] and gain limiting [10]. By contrast, one major benefit of the deep learning framework is the relative ease of incorporating complex learning objectives that one believes would drive the enhanced speech towards better quality and intelligibility. Methods along this line of thought include learning multiple objectives from heterogeneous features [21, 28, 11], jointly optimizing the final goal and its sub-targets (e.g., speech-presence probability) [25, 27], and directly optimizing towards an objective measure of speech quality or intelligibility [17, 30]. The last approach seems a promising way to improve objective quality, although both models have to incorporate the standard MSE due to the band limitation of each objective measure. It was reported in [13] that a simple perceptually weighted wide-band MSE alone does not improve objective speech quality or intelligibility, suggesting that the MSE is still a reliable learning objective for wide-band speech enhancement.

In this paper, we propose a DNN-based online speech enhancement system for real-time applications. First, we discuss features and normalization techniques that facilitate pattern learning with an RNN. We then describe a compact RNN that produces a gain function from a single noisy frame. Next, we introduce two simple MSE-based loss functions with separate control of speech distortion and noise reduction. In the evaluation, we thoroughly examine the effect of error weighting on subjective and objective speech quality and intelligibility measures. Furthermore, we discuss how the objective metrics are affected by different feature normalization techniques and training strategies.

2 Problem formulation

We assume the microphone signals to be described in the short-time Fourier transform (STFT) domain by

$$X[t,f] = S[t,f] + N[t,f], \tag{1}$$

where $X[t,f]$, $S[t,f]$, and $N[t,f]$ denote the STFT at time frame $t$ and frequency bin $f$ of the observed noisy speech, clean speech, and noise, respectively. Our system seeks a time-varying gain function in the short-time Fourier transform magnitude (STFTM) domain, $G[t,f]$, that recovers the clean speech magnitude $|S[t,f]|$ to the best extent possible:

$$|\hat{S}[t,f]| = G[t,f]\,|X[t,f]|. \tag{2}$$

In real-time processing, the gain $G[t,f]$ shall depend only on the past and present information of the input, and is given by

$$G[t,f] = h_\theta\big(\nu(\phi(|X[t,f]|))\big), \tag{3}$$

where $\phi$ is a transform function applied on the STFTM of the noisy signal, $\nu$ is a normalization function, and $h_\theta$ is a DNN whose adaptable parameters are together denoted by $\theta$. Finally, the noisy phase of $X[t,f]$ is applied to $|\hat{S}[t,f]|$ to obtain the enhanced signal.

In the following sections, we review the state-of-the-art methods before discussing our choices for $\phi$ and $\nu$, the architecture of $h_\theta$, two learning objectives for $\theta$, and further considerations in training that we believe will impact the quality of enhanced speech.
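The enhancement step described above (scale the noisy magnitude by a gain, then reuse the noisy phase) can be sketched for a single frame as follows. This is a minimal illustration in plain Python rather than the authors' implementation; the function name `enhance_frame` and the toy values are assumptions made for the example.

```python
import cmath

def enhance_frame(noisy_frame, gain):
    """Scale each noisy STFT bin's magnitude by its gain and keep the noisy phase."""
    enhanced = []
    for x, g in zip(noisy_frame, gain):
        magnitude = g * abs(x)       # enhanced STFTM for this bin
        phase = cmath.phase(x)       # noisy phase is reused unchanged
        enhanced.append(cmath.rect(magnitude, phase))
    return enhanced

# Toy 3-bin frame and hypothetical gains in [0, 1].
noisy = [3 + 4j, -2j, 1 + 0j]
gain = [0.5, 1.0, 0.0]
enhanced = enhance_frame(noisy, gain)
```

Because the phase is untouched, the operation is equivalent to multiplying each complex bin by its real-valued gain; the explicit magnitude and phase split is kept only to mirror the description in the text.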

3 State-of-the-art online noise reduction

Classical online SE methods typically seek the optimal gain function by optimizing some objective function in a statistical sense. One of the most effective methods in this category assumes that the clean speech and noise STFT coefficients are uncorrelated and follow complex Gaussian distributions, and solves for the gain function by minimizing the MSE between clean and enhanced STFTM [7] or log-STFTM [8]. Although more advanced noise and speech-presence probability models could be incorporated to improve speech quality and prevent musical noise [6, 5], retaining speech quality while removing highly non-stationary noise is still a challenging task.

In recent DNN-based methods, statistical assumptions about the distribution of noisy and clean STFTM are typically dropped, while the minimum MSE (MMSE) objective becomes a loss function that a DNN optimizes by stochastic gradient descent. One of the most popular loss functions has been the MSE between clean and enhanced STFTM,

$$\mathcal{L}_{\text{MSE}} = \operatorname{mean}\big(\lVert\, |\mathbf{s}| - \mathbf{g} \odot |\mathbf{x}| \,\rVert_2^2\big), \tag{4}$$

where $\mathbf{s}$, $\mathbf{x}$, and $\mathbf{g}$ denote the clean STFT, the noisy STFT, and the gain function in vector form, and $\odot$ is the element-wise product. A competitive method proposed recently [25] estimates the optimal gain function of a smoothed energy contour using an RNN and interpolates the spectral details by pitch filtering. Experiments [25, 19] report strong objective and subjective speech quality from enhanced speech produced by this RNNoise system.

4 Proposed Methods

4.1 Feature representation

Selecting appropriate features and normalization is important for successfully training a DNN. We consider two basic features, STFTM and log-power spectra (LPS), and apply global, frequency-dependent (FD), or frequency-independent (FI) normalization to each of them to train our network.

The STFTM used in all our systems is computed based on a 32 ms Hamming window with 75% overlap between frames and a 512-point discrete Fourier transform. The LPS is taken with the natural logarithm and floored at -120 dB, i.e.,

$$X^{\text{LPS}}[t,f] = \ln \max\big(|X[t,f]|^2,\ 10^{-12}\big). \tag{5}$$
We explore three types of normalization, each of which is individually combined with either the STFTM or the LPS feature mentioned above. First, we consider global normalization, in which case each frequency bin of the feature $Y[t,f]$ is standardized by its mean $\mu[f]$ and standard deviation $\sigma[f]$ accumulated from a training set:

$$\bar{Y}[t,f] = \frac{Y[t,f] - \mu[f]}{\sigma[f]}. \tag{6}$$
Second, we consider online FD mean and variance normalization, in which case the running mean and variance of a feature $Y[t,f]$ are smoothed by a decaying exponential:

$$\mu[t,f] = c\,\mu[t-1,f] + (1-c)\,Y[t,f], \tag{7}$$

$$\sigma^2[t,f] = c\,\sigma^2[t-1,f] + (1-c)\,\big(Y[t,f] - \mu[t,f]\big)^2, \tag{8}$$

$$\bar{Y}[t,f] = \frac{Y[t,f] - \mu[t,f]}{\sigma[t,f]}, \tag{9}$$

where $c = \exp(-\Delta t / \tau)$, $\Delta t$ is the frame shift in seconds (8 milliseconds in our setting), and $\tau$ is a time constant that controls the adaptation speed. The idea is that the normalized spectra will help the recurrent neural network learn long-term patterns. Finally, we also consider FI online normalization, in which case the mean and variance of each frequency are averaged and applied to all frequencies. This method retains the relative dynamics across frequency bins, but might pose a more challenging learning task to the learning machine. In all our experiments apart from the feature experiments, we use FD online normalization.
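The framing arithmetic, the floored log-power feature, and the online frequency-dependent normalization can be sketched as below. This is an illustrative sketch, not the authors' code: the exact form of the variance recursion, the initial state (zero mean, unit variance), the small `eps` guard, and the example time constant `tau=1.0` are all assumptions, and the 16 kHz sampling rate is taken from the experimental setup.

```python
import math

SAMPLE_RATE = 16000                    # Hz, as stated in the experimental setup
WIN_LEN = SAMPLE_RATE * 32 // 1000     # 32 ms window -> 512 samples
HOP = WIN_LEN // 4                     # 75% overlap -> 128 samples (8 ms)
N_BINS = 512 // 2 + 1                  # one-sided 512-point DFT -> 257 bins

def log_power(magnitude, floor_db=-120.0):
    """Natural-log power of one STFT bin, floored at floor_db."""
    floor = 10.0 ** (floor_db / 10.0)  # -120 dB corresponds to 1e-12 in power
    return math.log(max(magnitude ** 2, floor))

class OnlineNormalizer:
    """Per-bin running mean/variance normalization with exponential decay."""

    def __init__(self, n_bins, tau, frame_shift=HOP / SAMPLE_RATE, eps=1e-8):
        self.c = math.exp(-frame_shift / tau)  # smoothing coefficient
        self.mean = [0.0] * n_bins             # assumed initial state
        self.var = [1.0] * n_bins              # assumed initial state
        self.eps = eps                         # guards against division by zero

    def __call__(self, frame):
        out = []
        for f, y in enumerate(frame):
            self.mean[f] = self.c * self.mean[f] + (1 - self.c) * y
            dev = y - self.mean[f]
            self.var[f] = self.c * self.var[f] + (1 - self.c) * dev ** 2
            out.append(dev / math.sqrt(self.var[f] + self.eps))
        return out

# Usage: normalize a stream of single-bin frames with a hypothetical tau of 1 s.
normalizer = OnlineNormalizer(n_bins=1, tau=1.0)
for _ in range(2000):
    last = normalizer([5.0])
```

With a constant input the running mean converges to that constant, so the normalized output decays towards zero, which illustrates why this normalization is insensitive to the absolute signal level.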

4.2 Learning machine

Our learning machine, which takes in one frame of noisy speech spectra and outputs one frame of the magnitude gain function, is based on the gated recurrent unit (GRU) [4]. GRUs are preferred over long short-term memory (LSTM) [12] given their computational efficiency and superior performance in real-time SE tasks [19]. We stack three GRU layers followed by a fully-connected (FC) output layer with sigmoid activation to predict the gain function.
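To make the single-frame-in, single-frame-out behaviour concrete, the following toy sketch stacks three GRU cells and a sigmoid output layer in plain Python. It is only an illustration under several assumptions: biases are omitted, weights are random rather than trained, the layer sizes are tiny, and the class names `GRUCell` and `GainEstimator` are hypothetical. The cell update follows the GRU formulation of [4].

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

class GRUCell:
    def __init__(self, n_in, n_hid, rng):
        # One input matrix and one recurrent matrix per gate:
        # update (z), reset (r), and candidate state (n).
        self.W = {k: rand_matrix(n_hid, n_in, rng) for k in "zrn"}
        self.U = {k: rand_matrix(n_hid, n_hid, rng) for k in "zrn"}

    def step(self, x, h):
        z = [sigmoid(a + b) for a, b in
             zip(matvec(self.W["z"], x), matvec(self.U["z"], h))]
        r = [sigmoid(a + b) for a, b in
             zip(matvec(self.W["r"], x), matvec(self.U["r"], h))]
        rh = [ri * hi for ri, hi in zip(r, h)]
        n = [math.tanh(a + b) for a, b in
             zip(matvec(self.W["n"], x), matvec(self.U["n"], rh))]
        # New state interpolates between the old state and the candidate.
        return [zi * hi + (1 - zi) * ni for zi, ni, hi in zip(z, n, h)]

class GainEstimator:
    """Three stacked GRU layers followed by a sigmoid output layer."""

    def __init__(self, n_bins, n_hid, seed=0):
        rng = random.Random(seed)
        sizes = [n_bins, n_hid, n_hid, n_hid]
        self.layers = [GRUCell(sizes[i], sizes[i + 1], rng) for i in range(3)]
        self.W_out = rand_matrix(n_bins, n_hid, rng)
        self.states = [[0.0] * n_hid for _ in range(3)]

    def process_frame(self, features):
        x = features
        for i, layer in enumerate(self.layers):
            self.states[i] = layer.step(x, self.states[i])  # state carries the past
            x = self.states[i]
        return [sigmoid(v) for v in matvec(self.W_out, x)]  # gains in (0, 1)

model = GainEstimator(n_bins=8, n_hid=4)
gain = model.process_frame([0.1] * 8)
```

The recurrent states are the only memory between calls, so each call consumes exactly one feature frame and emits exactly one gain frame without look-ahead.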

It is worth mentioning that we do not apply convolution layers as often done in other related work [23, 29] because of the relatively arbitrary process involved in choosing the amount of frequency span and filter taps. Previous studies [15] have shown that a naïve convolution layer applied on past and present input noisy frames did not improve objective quality of enhanced speech. Instead, we explore the temporal modeling capability of the network by training with sequences of different lengths, features and loss functions.

4.3 Loss functions

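We use three loss functions to train our system. First, we use the regular MSE between clean and enhanced STFTM in Eq. (4). To obtain better control of the loss, we propose to separate the error into speech distortion and noise reduction terms,

$$\mathcal{L}_{\text{speech}} = \operatorname{mean}\big(\lVert\, |\mathbf{s}_{\text{SA}}| - \mathbf{g}_{\text{SA}} \odot |\mathbf{s}_{\text{SA}}| \,\rVert_2^2\big), \tag{10}$$

$$\mathcal{L}_{\text{noise}} = \operatorname{mean}\big(\lVert\, \mathbf{g} \odot |\mathbf{n}| \,\rVert_2^2\big), \tag{11}$$

where $\mathbf{s}$, $\mathbf{n}$, and $\mathbf{g}$ denote the clean speech STFT, the noise STFT, and the gain function in vector form, and the subscript SA denotes a subset of frames where speech is active. In our experiments, we adopted a simple energy-based frame-level voice activity detector operating on the power spectra of clean utterances. The short-time speech energy is accumulated between 300 Hz and 5000 Hz and smoothed across 3 frames by a moving-average filter. Finally, a frame is classified as speech-active if its smoothed energy exceeds a threshold set 30 dB below the peak energy of the whole utterance.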
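The voice activity detector described above can be sketched as follows. The band limits, the 3-frame smoothing, and the 30 dB threshold come from the text; the causal alignment of the moving average, the rounding of band edges to DFT bins, and the function name `speech_active_frames` are assumptions made for this illustration.

```python
import math

def speech_active_frames(power_spec, sample_rate=16000, n_fft=512,
                         f_lo=300.0, f_hi=5000.0, smooth=3, threshold_db=30.0):
    """Return one boolean per frame; power_spec[t][k] is clean-speech power."""
    bin_hz = sample_rate / n_fft
    k_lo = math.ceil(f_lo / bin_hz)    # first bin at or above 300 Hz
    k_hi = math.floor(f_hi / bin_hz)   # last bin at or below 5000 Hz
    energy = [sum(frame[k_lo:k_hi + 1]) for frame in power_spec]

    # Moving average over the current frame and up to (smooth - 1) past frames.
    smoothed = []
    for t in range(len(energy)):
        window = energy[max(0, t - smooth + 1):t + 1]
        smoothed.append(sum(window) / len(window))

    # Threshold sits threshold_db below the peak smoothed energy.
    threshold = max(smoothed) * 10.0 ** (-threshold_db / 10.0)
    return [e > threshold for e in smoothed]

# Toy utterance: 5 near-silent frames, 5 loud frames, 5 near-silent frames.
quiet = [1e-9] * 257
loud = [1.0] * 257
activity = speech_active_frames([quiet] * 5 + [loud] * 5 + [quiet] * 5)
```

In this toy case the smoothing extends the active region by two frames after the loud segment ends, a side effect of the causal window chosen here.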

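Note that as the estimated gain approaches one, the speech distortion error is minimized and the noise error is maximized, and vice versa. Therefore, we can control the relative importance of speech distortion to noise reduction with a fixed-weighted loss,

$$\mathcal{L} = \alpha\,\mathcal{L}_{\text{speech}} + (1-\alpha)\,\mathcal{L}_{\text{noise}}, \tag{12}$$

where $\mathcal{L}_{\text{speech}}$ and $\mathcal{L}_{\text{noise}}$ are the speech distortion and noise reduction terms, and $\alpha$ is a constant in the range [0, 1].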
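A minimal sketch of the fixed-weighted loss is given below, assuming magnitudes are stored as lists of frames and that each term is averaged over all of its time-frequency elements. The averaging convention, the function names, and the toy values are assumptions for illustration, not the authors' training code.

```python
def mean_sq(values):
    """Mean of squared values."""
    return sum(v * v for v in values) / len(values)

def weighted_loss(gain, clean_mag, noise_mag, speech_active, alpha):
    """Fixed-weighted sum of speech distortion and noise reduction errors.

    gain, clean_mag, noise_mag: lists of frames (each a list of per-bin values).
    speech_active: one boolean per frame, e.g. from a voice activity detector.
    alpha: speech distortion weight in [0, 1].
    """
    # Speech distortion: error on clean speech, speech-active frames only.
    speech_err = [s - g * s
                  for G, S, active in zip(gain, clean_mag, speech_active) if active
                  for g, s in zip(G, S)]
    # Noise reduction: residual noise after applying the gain, all frames.
    noise_err = [g * n for G, N in zip(gain, noise_mag) for g, n in zip(G, N)]
    l_speech = mean_sq(speech_err) if speech_err else 0.0
    l_noise = mean_sq(noise_err)
    return alpha * l_speech + (1 - alpha) * l_noise

clean = [[1.0, 2.0], [0.0, 0.0]]
noise = [[1.0, 1.0], [1.0, 1.0]]
active = [True, False]
loss_unity_gain = weighted_loss([[1.0, 1.0]] * 2, clean, noise, active, alpha=0.5)
loss_zero_gain = weighted_loss([[0.0, 0.0]] * 2, clean, noise, active, alpha=0.5)
```

The two extreme cases show the trade-off: a gain of one leaves only the noise term, while a gain of zero leaves only the speech distortion term.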

In the classical speech enhancement literature, the suppression rule is often adapted based on the SNR [10, 3]. Specifically, suppression should be limited at high SNR to avoid artifacts, and be aggressive at low SNR. Motivated by this principle, our second, SNR-weighted loss adjusts the speech distortion weight in (12) using the global SNR of each utterance and a constant. This constant sets the global SNR at which a fixed amount of deviation causes the maximum drift in the speech distortion weighting; it is also the global SNR at which the two loss terms are equally weighted. We illustrate this in Fig. 1.

Figure 1: Selected SNR-weighted speech distortion weighting. The horizontal line marks equal weighting of the speech distortion and noise reduction terms.

The proposed method is depicted in a flow diagram shown in Fig. 2. During training, both clean speech and noise are required for computing the weighted loss. The trained model enhances the noisy STFTM one frame at a time and utilizes the noisy phase for reconstructing the enhanced speech waveform.

Figure 2: Flow diagram of the proposed system.

5 Experimental Results and Discussions

5.1 Corpora & Experimental setup

We train and evaluate all DNN-based systems using a large noisy speech dataset synthesized from publicly available speech and noise corpora using the MS-SNSD dataset [19] and toolkit (https://github.com/microsoft/MS-SNSD). 14 diverse noise types are selected for training, while samples from 9 noise types not included in the training set are used for evaluation. Our test set includes challenging and highly non-stationary noise types such as munching, multi-talker babble, and keyboard typing. All audio clips are resampled to 16 kHz. The training set consists of 84 hours each of clean speech and noise, while 18 hours (5500 clips) of noisy speech constitute the evaluation set. All speech clips are level-normalized on a per-utterance basis, while each noise clip is scaled to produce one of five global SNRs from {40, 30, 20, 10, 0} dB. During the training of all DNN-based systems described below, we randomly select an excerpt of clean speech and an excerpt of noise before mixing them to create the noisy utterance.
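Scaling a noise clip to reach a target global SNR can be sketched as below. This is a simplified stand-in written for illustration and not the MS-SNSD toolkit code; the function names and the synthetic toy signals are assumptions, and level normalization of the speech clip is left out.

```python
import math
import random

SNR_CHOICES_DB = (40, 30, 20, 10, 0)   # global SNRs listed in the text

def energy(signal):
    return sum(v * v for v in signal)

def mix_at_snr(clean, noise, snr_db):
    """Scale noise so the clean-to-noise energy ratio equals snr_db, then add."""
    target_noise_energy = energy(clean) / (10.0 ** (snr_db / 10.0))
    scale = math.sqrt(target_noise_energy / energy(noise))
    scaled_noise = [scale * n for n in noise]
    noisy = [s + n for s, n in zip(clean, scaled_noise)]
    return noisy, scaled_noise

rng = random.Random(0)
clean = [math.sin(0.05 * i) for i in range(1600)]    # toy stand-in for speech
noise = [rng.gauss(0.0, 1.0) for _ in range(1600)]   # toy stand-in for noise
snr_db = rng.choice(SNR_CHOICES_DB)
noisy, scaled_noise = mix_at_snr(clean, noise, snr_db)
achieved_db = 10.0 * math.log10(energy(clean) / energy(scaled_noise))
```

Returning the scaled noise alongside the mixture is convenient because the weighted losses need the clean speech and the noise separately during training.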

We performed a comparative study of the proposed methods with three baselines based on several objective speech quality and intelligibility measures and subjective tests. Specifically, we include perceptual evaluation of speech quality (PESQ) [20], short-time objective intelligibility (STOI) [22], cepstral distance (CD), and scale-invariant signal-to-distortion ratio (SI-SDR) [14] for objective evaluation of enhanced speech in the time, spectral, and cepstral domains. We conducted the subjective listening test using the web-based subjective framework presented in [19]. Each clip is rated with a discrete rating between 1 (very poor speech quality) and 5 (excellent speech quality) by 20 crowd-sourced listeners. Training and qualification are ensured before presenting test clips to these listeners. The mean of all 20 ratings is the mean opinion score (MOS) for that clip. We also removed obvious spammers who consistently selected the same rating throughout the MOS test. Our subjective test complements the objective assessments, thus providing a balanced benchmark for the evaluation of the studied noise reduction algorithms.
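Of the objective measures listed, SI-SDR has a compact closed form [14] and can be sketched directly. The snippet below is a plain-Python illustration on toy signals, assuming time-aligned, equal-length inputs; it is not the evaluation code used for the reported results.

```python
import math

def si_sdr(estimate, reference):
    """Scale-invariant signal-to-distortion ratio in dB.

    The reference is rescaled by the optimal projection coefficient so that
    the score does not depend on the overall gain of the estimate.
    Raises ZeroDivisionError if the estimate is a scaled copy of the reference.
    """
    dot = sum(e * r for e, r in zip(estimate, reference))
    ref_energy = sum(r * r for r in reference)
    scale = dot / ref_energy
    target = [scale * r for r in reference]
    residual = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(v * v for v in residual)
    return 10.0 * math.log10(target_energy / residual_energy)

reference = [1.0, 0.0, 1.0, 0.0]
estimate = [1.0, 0.1, 1.0, -0.1]
score = si_sdr(estimate, reference)
score_scaled = si_sdr([2.0 * e for e in estimate], reference)
```

Doubling the estimate leaves the score unchanged, which is the scale invariance the measure is named for.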

We compare our proposed methods with three baseline methods. The classical enhancer is a slightly optimized implementation of the MMSE log-spectral amplitude (LSA) estimator [8] described in [24]. The DNN-based baselines are the improved RNNoise [19] and an RNN that replicates the network architecture of RNNoise [25] but operates on 257-point spectra, is trained with Eq. (4), and does not have the originally proposed post-processing component. The latter realizes a system with a number of parameters comparable to that of the proposed methods.

In the next section, we discuss the impact of feature normalization and training on various sequence lengths on the objective quality of enhanced speech. Next, we explore the optimal weighting for the proposed fixed-weighted and SNR-weighted loss functions. Finally, we compare the subjective and objective quality of enhanced speech produced by our systems to several competitive online methods.

| Length (s) | SI-SDR (dB) | CD | STOI (%) | PESQ (MOS) |
|---|---|---|---|---|
| 1 | 13.7 | 3.78 | 90.1 | 2.58 |
| 2 | 13.7 | 3.80 | 90.3 | 2.57 |
| 5 | 14.1 | 3.72 | 90.5 | 2.59 |
| 10 | 14.1 | 3.73 | 90.7 | 2.64 |
| 20 | 14.0 | 3.73 | 90.6 | 2.64 |

Table 1: Effect of sequence lengths in a one-minute minibatch.

| Method | MOS (mean ± std.) |
|---|---|
| Noisy | 2.63 ± 0.03 |
| Improved RNNoise [19] | 3.26 ± 0.03 |
| Proposed () | 3.93 ± 0.03 |
| Proposed () | 3.92 ± 0.03 |
| Proposed () | 3.74 ± 0.03 |
| Proposed () | 3.65 ± 0.03 |

Table 2: Subjective MOS from 5500 clips and 20 ratings per clip.

| Method | # Param. | SI-SDR (dB) | CD | STOI (%) | PESQ (MOS) |
|---|---|---|---|---|---|
| Noisy | - | 9.81 | 4.56 | 88.0 | 2.22 |
| LSA [8, 24] | - | 6.10 | 4.64 | 84.7 | 2.33 |
| Improved RNNoise [19] | 61.2 K | 10.4 | 3.83 | 88.0 | 2.55 |
| RNN baseline (RNNoise architecture, 257-point spectra) | 2.64 M | 13.0 | 3.88 | 89.3 | 2.56 |
| Proposed | 1.26 M | 14.3 | 3.83 | 90.7 | 2.65 |
| Wiener Oracle | - | 20.5 | 2.13 | 98.1 | 3.82 |

Table 3: Comparison of objective metrics with baseline online SE systems. Refer to text for details about each setup.

Figure 3: Effect of fixed weighting and SNR weighting on objective speech quality and intelligibility measures. Black dashed vertical lines indicate the optimal coefficient for each metric. Note that the optimal points coincide for STOI and CD.

5.2 Results & Discussions

We want to evaluate how training with long or short sequences affects temporal modeling in the RNN. Although long sequences are expected to help deal with long-term noise patterns, they might also degrade speech that is only short-term stationary. Table 1 summarizes the impact of sequence length on objective speech quality. For each setting, we adjust the number of sequences in a minibatch so that one batch always contains one minute of noisy speech. We observe a noticeable improvement in performance as the segment length increases to 5 seconds, beyond which the improvement starts to diminish. We do not show the results of the feature test due to space limitations, but overall there is little difference between all normalized variants of STFTM and LPS features, while no normalization results in degradation. In general, we recommend FD online normalization due to its invariance to varying signal levels. We also suggest using segments that are at least 5 seconds long during training.
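The batch layout implied by the one-minute constraint can be worked out with a few lines of arithmetic. The sketch below assumes the 16 kHz sampling rate, 512-sample window, and 128-sample hop stated earlier, and additionally assumes that segments are framed without padding; the function name `batch_layout` is hypothetical.

```python
BATCH_SECONDS = 60          # one minute of noisy speech per minibatch
SAMPLE_RATE = 16000         # Hz
WIN_LEN, HOP = 512, 128     # 32 ms window, 8 ms frame shift

def batch_layout(segment_seconds):
    """Return (sequences per minibatch, STFT frames per sequence)."""
    n_sequences = BATCH_SECONDS // segment_seconds
    n_samples = segment_seconds * SAMPLE_RATE
    n_frames = 1 + (n_samples - WIN_LEN) // HOP   # no padding assumed
    return n_sequences, n_frames

for length in (1, 2, 5, 10, 20):
    print(length, batch_layout(length))
```

For example, 5-second segments give 12 sequences of 622 frames each, so longer segments trade batch diversity for longer temporal context per sequence.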

The effect of speech distortion weighting is shown in Fig. 3, where the fixed weight or the SNR-weighting constant is varied from 0 to 1 to search for the optimal point for each objective measure. Curiously, only STOI and CD agree on the same coefficient in both cases, while both PESQ and SI-SDR suggest a small weight on speech distortion. The optimal SNR weights for all metrics are concentrated around 0.01, meaning that the speech distortion weight should only increase rapidly when the noisy signal is relatively clean (around 20 dB). Overall, fixed weighting is slightly better than SNR weighting in all metrics.

During the experiments, we noticed that even though our systems trained on MSE (e.g., row 4 in Table 1) could achieve objective measures similar to those of systems trained on the proposed weighted losses, the subjective quality of systems trained on the weighted losses is considerably better. The most noticeable improvement of systems trained on our loss functions, especially with a small speech distortion weight, is that the estimated gain function is much more frequency-selective than that of systems trained on the regular MSE, resulting in very clean noise suppression, especially at high SNRs. To verify this, we present the results of the online subjective listening test in Table 2. Not only did all our selected systems significantly outperform the improved RNNoise trained on MSE presented in [19], but we were also surprised that the listening test subjects preferred a rather low setting for the speech distortion weight. This trend is mispredicted by all objective measures as well as by the authors' own subjective preference. We observed noticeable speech distortion as the weight goes below 0.35, while the noise became more suppressed. More detailed investigations are required in future work to shed light on speech distortion and noise reduction preferences for different groups of listeners.

Finally, we report the objective evaluation of each baseline method, the noisy reference, and the optimal Wiener filtering method using oracle information in Table 3. The selected system from our method is trained using linear speech distortion weighting with a setting that we believe strikes a good balance between speech distortion and noise reduction. Although this setup might not be the one most preferred by human listeners, it can be easily tuned for different applications. It is nevertheless important to show that it outperforms the other classical and DNN-based methods in all objective metrics.

6 Conclusions

In this paper, we proposed and evaluated a real-time speech enhancement approach based on a compact recurrent neural network trained with a simple MSE-based speech distortion weighted loss function. We show the impact of various feature normalization techniques and sequence lengths on the objective quality of enhanced speech. We also demonstrate how to control the amount of speech distortion with fixed-weighted and SNR-weighted coefficients in the loss function. Both objective and subjective tests show that our method outperforms other competitive online methods. In the future, we will explore time-varying speech distortion weighting and its influence on subjective and objective speech quality.


References

  • [1] J. Benesty, S. Makino, and J. Chen (Eds.) (2005) Speech enhancement. Springer.
  • [2] S. Boll (1979) Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans. on acoustics, speech, and signal processing 27 (2), pp. 113–120.
  • [3] S. Braun, K. Kowalczyk, and E. Habets (2015) Residual noise control using a parametric multichannel wiener filter. In IEEE ICASSP, pp. 1–5.
  • [4] K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1724–1734.
  • [5] I. Cohen and B. Berdugo (2001) Speech enhancement for non-stationary noise environments. Signal processing 81 (11), pp. 2403–2418.
  • [6] I. Cohen and B. Berdugo (2002) Noise estimation by minima controlled recursive averaging for robust speech enhancement. IEEE signal processing letters 9 (1), pp. 12–15.
  • [7] Y. Ephraim and D. Malah (1984) Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator. IEEE Trans. on acoustics, speech, and signal processing 32 (6), pp. 1109–1121.
  • [8] Y. Ephraim and D. Malah (1985) Speech enhancement using a minimum mean-square error log-spectral amplitude estimator. IEEE Trans. on acoustics, speech, and signal processing 33 (2), pp. 443–445.
  • [9] A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. T. Freeman, and M. Rubinstein (2018) Looking to listen at the cocktail party: a speaker-independent audio-visual model for speech separation. ACM Trans. on Graphics (TOG) 37 (4), pp. 112.
  • [10] T. Esch and P. Vary (2009) Efficient musical noise suppression for speech enhancement system. In IEEE ICASSP, pp. 1–5.
  • [11] F. G. Germain, Q. Chen, and V. Koltun (2019) Speech denoising with deep feature losses. In Proc. Interspeech 2019, pp. 2723–2727.
  • [12] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780.
  • [13] A. Kumar and D. Florencio (2016) Speech enhancement in multiple-noise conditions using deep neural networks. In ISCA INTERSPEECH 2016, pp. 3738–3742.
  • [14] J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey (2019) SDR–half-baked or well done?. In IEEE ICASSP, pp. 626–630.
  • [15] D. Liu, P. Smaragdis, and M. Kim (2014) Experiments on deep learning for speech denoising. In ISCA INTERSPEECH.
  • [16] P. C. Loizou (2013) Speech enhancement: theory and practice. CRC press.
  • [17] J. M. Martín-Doñas, A. M. Gomez, J. A. Gonzalez, and A. M. Peinado (2018) A deep learning loss function based on the perceptual evaluation of the speech quality. IEEE Signal processing letters 25 (11), pp. 1680–1684.
  • [18] S. Pascual, A. Bonafonte, and J. Serrà (2017) SEGAN: speech enhancement generative adversarial network. In ISCA INTERSPEECH 2017, pp. 3642–3646.
  • [19] C. K.A. Reddy, E. Beyrami, J. Pool, R. Cutler, S. Srinivasan, and J. Gehrke (2019) A scalable noisy speech dataset and online subjective test framework. In ISCA INTERSPEECH 2019, pp. 1816–1820.
  • [20] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra (2001) Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs. In 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 01CH37221), Vol. 2, pp. 749–752.
  • [21] L. Sun, J. Du, L. Dai, and C. Lee (2017) Multiple-target deep learning for LSTM-RNN based speech enhancement. In IEEE Hands-free Speech Communications and Microphone Arrays (HSCMA), pp. 136–140.
  • [22] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen (2010) A short-time objective intelligibility measure for time-frequency weighted noisy speech. In IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 4214–4217.
  • [23] K. Tan and D. Wang (2018) A convolutional recurrent neural network for real-time speech enhancement. In ISCA INTERSPEECH, pp. 3229–3233.
  • [24] I. J. Tashev (2009) Sound capture and processing: practical approaches. John Wiley & Sons.
  • [25] J. Valin (2018) A hybrid DSP/deep learning approach to real-time full-band speech enhancement. In 2018 IEEE 20th International Workshop on Multimedia Signal Processing (MMSP), pp. 1–5.
  • [26] Y. Wang, A. Narayanan, and D. Wang (2014) On training targets for supervised speech separation. IEEE/ACM Trans. on audio, speech, and language processing 22 (12), pp. 1849–1858.
  • [27] Y. Xia and R. Stern (2018) A priori SNR estimation based on a recurrent neural network for robust speech enhancement. In ISCA INTERSPEECH, pp. 3274–3278.
  • [28] Y. Xu, J. Du, Z. Huang, L. Dai, and C. Lee (2015) Multi-objective learning and mask-based post-processing for deep neural network based speech enhancement. In ISCA INTERSPEECH 2015, pp. 1508–1512.
  • [29] H. Zhao, S. Zarar, I. Tashev, and C. Lee (2018) Convolutional-recurrent neural networks for speech enhancement. In IEEE ICASSP, pp. 2401–2405.
  • [30] Y. Zhao, B. Xu, R. Giri, and T. Zhang (2018) Perceptually guided speech enhancement using deep neural networks. In IEEE ICASSP, pp. 5074–5078.