
A Supervised Speech Enhancement Approach with Residual Noise Control for Voice Communication

by Andong Li, et al.

For voice communication, it is important to extract the speech from its noisy version without introducing unnatural artificial noise. By studying the subband mean-squared error (MSE) of the speech for unsupervised speech enhancement approaches and revealing its relationship with the existing loss function for supervised approaches, this paper derives a generalized loss function for supervised approaches that takes residual noise control into account. Our generalized loss function contains the well-known MSE loss function and many other often-used loss functions as special cases. Compared with traditional loss functions, it offers more flexibility in trading off speech distortion against noise reduction, because a group of well-studied noise shaping schemes can be introduced to control residual noise for practical applications. Objective and subjective test results verify the importance of residual noise control for the supervised speech enhancement approach.


I Introduction

Speech enhancement plays an important role in noisy environments for many applications, such as speech communication, speech interaction, and speech translation. Researchers have devoted considerable effort to separating speech from its noisy version, and various approaches have been proposed over the last five decades. Conventional approaches include spectral subtraction [1], statistical methods [2, 3], and subspace-based methods [4], which have proved effective when the additive noise is stationary or quasi-stationary. However, their performance often degrades heavily under non-stationary and low signal-to-noise ratio (SNR) conditions.

With the advance of deep learning, supervised approaches have gradually shown a powerful capability for suppressing both stationary and highly non-stationary noise, mainly owing to the highly nonlinear mapping ability of deep neural networks (DNNs) [5][6]. In DNN-based algorithms, the mean-squared error (MSE) is often adopted as the loss criterion to update the weights of the network. Nevertheless, using this criterion directly has some problems. First, although MSE is the most often used criterion, it is not closely related to speech perception [7, 8]. Second, global MSE optimization usually yields an over-smoothed estimate that omits important detail. To solve these problems, many new criteria that consider speech perception have been proposed in recent years [9, 10, 11, 12]. The first type uses perceptually weighted MSE functions, which weight the loss in different time-frequency (T-F) regions [13, 10]. The second type uses objective metrics as loss functions; for example, perceptual evaluation of speech quality (PESQ) [14], short-time objective intelligibility (STOI) [15], and scale-invariant signal-to-distortion ratio (SI-SDR) [16] have been adopted as loss functions. In [17], speech distortion and residual noise are considered separately in the loss function, which is called the components loss (CL).

Note that all the above-mentioned loss functions aim at suppressing noise as much as possible in noise-only segments. In other words, in noise-only segments, the amount of noise reduction is expected to be infinite. This aim cannot be achieved in most cases, for several reasons. First, the noise is often stochastic, so the estimation accuracy is inevitably constrained by the limited number of available observations [18, 19]. Second, there is a great variety of noise signals, so a DNN model cannot be expected to distinguish all of them correctly from the speech in each T-F unit. Therefore, when the noise cannot be suppressed completely, unnatural residual noise may substantially degrade speech quality [20], which needs to be considered carefully. In this paper, we derive a generalized loss function by introducing multiple manually set parameters to flexibly balance speech distortion and noise attenuation. More specifically, residual noise control is introduced for voice communication [21, 22]. Theoretical derivations show that MSE and other often-used loss functions are included in the proposed generalized loss function as special cases.

The remainder of the paper is structured as follows. Section II formulates the problem. Section III derives the generalized loss in detail. Section IV describes the experimental setup, including the network architecture. Results and analysis are given in Section V. Section VI presents some conclusions.

II Problem Formulation

In the time domain, the noisy signal can be modelled as

$$x(n) = s(n) + d(n), \tag{1}$$

where $s(n)$ is the clean speech and $d(n)$ is the additive noise. In the frequency domain, (1) can be written as

$$X(k,l) = S(k,l) + D(k,l), \tag{2}$$

where $X(k,l)$, $S(k,l)$, and $D(k,l)$ are, respectively, the discrete Fourier transforms (DFTs) of $x(n)$, $s(n)$, and $d(n)$, with frame index $l$ and frequency bin $k$.

For practical applications, we only have the time-domain noisy signal $x(n)$ or its frequency-domain version $X(k,l)$, so the problem becomes how to estimate $s(n)$ or $S(k,l)$ from the noisy signal. It is common to use the minimum MSE (MMSE) as a criterion in unsupervised speech enhancement approaches. Before introducing MMSE, we first define the square error as


where is a nonlinear spectral gain function, is a function with a variable , and is a function with three variables , , and . When and , results in MMSE spectral amplitude estimator in [2], where is the expectation operator. When and , leads to MMSE log-spectral amplitude estimator in [3]. More complicated forms of and can be chosen, for example, many perceptually-weighted error criteria can be included, which can be referred to [7].

For supervised approaches, the square error in the subband is often defined as the loss function in the fullband, which is


One can get that, when and , is to minimize the MSE of log-spectral amplitude between the clean speech and the estimated speech, which is the training target in [6].

Note that (3) and (4) are quite similar. The most obvious difference between them is that (3) is the subband square error, while (4) is the fullband square error. The other difference is that the nonlinear spectral gain can be derived theoretically by minimizing (3) when the probability density functions (p.d.f.s) of the speech and the noise are both given, whereas it is difficult to derive the gain analytically by minimizing (4); in that case, the gain is instead mapped from the input noisy features by a trained supervised machine learning model. In all, it seems that all subband square error functions can be generalized to fullband ones as supervised training targets.
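As a rough illustration of a fullband training loss, the sketch below averages the squared error between the clean and the enhanced magnitude spectra over all T-F bins, either on the linear magnitude or on the log magnitude (the latter corresponding to a log-spectral target such as the one in [6]). The function name `fullband_square_error`, the `eps` floor, the use of a mean rather than a sum, and the (frames, bins) array layout are assumptions made for this example, not details taken from the paper.

```python
import numpy as np

def fullband_square_error(clean_mag, noisy_mag, gain, log_domain=False, eps=1e-8):
    """Average squared error over all T-F bins between the clean magnitude
    and the enhanced magnitude gain * noisy_mag."""
    est_mag = gain * noisy_mag
    if log_domain:
        # Compare log-spectral amplitudes; eps avoids log(0).
        clean_mag = np.log(clean_mag + eps)
        est_mag = np.log(est_mag + eps)
    return float(np.mean((clean_mag - est_mag) ** 2))

# Toy spectra with 4 frames and 5 frequency bins (hypothetical data).
rng = np.random.default_rng(0)
clean = np.abs(rng.standard_normal((4, 5))) + 0.1
noise = np.abs(rng.standard_normal((4, 5))) + 0.1
noisy = clean + noise  # magnitudes simply added; phase is ignored in this toy case

ideal_gain = clean / noisy          # gain that restores the clean magnitude
unity_gain = np.ones_like(noisy)    # no enhancement at all

loss_ideal = fullband_square_error(clean, noisy, ideal_gain)
loss_unity = fullband_square_error(clean, noisy, unity_gain)
loss_ideal_log = fullband_square_error(clean, noisy, ideal_gain, log_domain=True)
```

In a supervised setting the gain would be the network output rather than a hand-made array, and the loss would be differentiated with respect to the network weights.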

III Proposed Algorithm

When only MMSE is used as the criterion, it is difficult to balance speech distortion against noise reduction. This section derives a more generalized fullband loss function.

III-A Trade-off Criterion in Subband

In traditional speech enhancement approaches, speech distortion and noise reduction in the subband can be considered separately. The subband square error of the speech and the subband residual noise can be, respectively, given by




where is a function with three variables , , and . When , , and , and become the MSE of the speech magnitude and the residual noise power in the subband, respectively, which are identical with [23, (8.31) and (8.32)].

By minimizing the subband MSE of the speech with a residual noise control, an optimization problem can be given to derive the nonlinear spectral gain, which is given by


where is a function of two variables and . could be both a frequency and frame-dependent factor that can be introduced to control the residual noise flexibly.

The optimal spectral gain in (7) can be solved theoretically by the Lagrange multiplier method, which is


where is a Lagrange multiplier. When , , , and , the optimal spectral gain can be derived from (8) and the constraint in (7), which can be given by


where is the a priori SNR. It is not always possible to derive mathematically, especially when , , , and have very complicated expressions. Moreover, it is not easy to accurately estimate the noise power spectral density in non-stationary noise environments [24, 25, 26]. However, it seems that this optimization can be easily solved by supervised approaches. To carry this problem over to the supervised setting, we need to define the fullband square error of the speech and the fullband residual noise power, from which the loss function for supervised approaches is derived.

III-B Trade-off Criterion in Fullband

The fullband MSE of the speech and the fullband residual noise can be, respectively, given by




The loss function without any constraints can be given by


where (12) is the same as the newly proposed components loss function as given in [17].

The loss function with residual noise control is



It is obvious that (13) is a generalization of (12), where (13) reduces to (12) when . One can observe that is both frequency and frame-dependent, so it can control the residual noise in each time-frequency bin.

III-C A Generalized Loss Function

We further generalize the subband square errors in (5) and (6): the square is replaced by a variable exponent, and an additional variable is introduced on the spectra. Then (5) and (6) can be, respectively, given by




Analogously, with the residual noise control, the optimization problem in the subband becomes


By setting , , , and , one can derive a generalized gain function with the Lagrange multiplier method, which is


where and , where (17) is identical to [27, (6)]. Note that [27, (6)] is given intuitively without theoretical derivation. When and , (17) reduces to (9). When , one can get , which has already been derived and presented in ([27, (22)]).

Similarly, the generalized loss function for supervised approaches can be given by


where the first term relates to the fullband speech distortion and the second term is introduced to control the residual noise.

Eq. (18) is a generalized loss function that includes (12) and (13). This is because (18) reduces to (13) when and it further reduces to (12) by setting . It is interesting to see that (3) can also be separated into two components, where one is the MSE of the speech and the other is related to the residual noise. When and , we have


where relates to the power of speech distortion and relates to the power of residual noise. is a combination of speech distortion and residual noise, so the fullband MSE loss function of a complex spectrum is also a special case of the generalized loss function in (18). If and are chosen, the decomposition of is more complicated than (19), which will not be further discussed for limited space.

In this letter, we emphasize the importance of introducing the residual noise control. , , , and are applied, although more complicated expressions can be chosen when taking the perceptual quality into account. Accordingly, we have




where will be set to a constant value and is a constant value over frequency for simplicity, that is to say, and are used in the following. We only study the impact of , , and on supervised approaches.
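The trade-off that such a loss controls can be sketched numerically. The example below assumes one plausible form of a loss with residual noise control: a speech distortion term plus a weighted term that pulls the residual noise toward a preset floor, with a compression exponent on the magnitudes. This form, the parameter names (`mu`, `gamma_db`, `alpha`), the interpretation of the floor in dB as an amplitude factor, and both function names are assumptions made for illustration; they are not taken from (18), (20), or (21).

```python
import numpy as np

def residual_noise_control_loss(speech_mag, noise_mag, gain,
                                mu=2.0, gamma_db=-20.0, alpha=1.0):
    """Assumed loss: speech distortion + mu * deviation of the residual
    noise from a floor of gamma * noise (all magnitudes raised to alpha)."""
    gamma = 10.0 ** (gamma_db / 20.0)  # assumption: dB value is an amplitude ratio
    distortion = (speech_mag ** alpha - (gain * speech_mag) ** alpha) ** 2
    residual = ((gain * noise_mag) ** alpha - (gamma * noise_mag) ** alpha) ** 2
    return float(np.mean(distortion + mu * residual))

def closed_form_gain(speech_mag, noise_mag, mu, gamma_db):
    """Per-bin minimizer of the assumed loss for alpha = 1, obtained by
    setting the derivative with respect to the gain to zero."""
    gamma = 10.0 ** (gamma_db / 20.0)
    s2, d2 = speech_mag ** 2, noise_mag ** 2
    return (s2 + mu * gamma * d2) / (s2 + mu * d2)

def best_gain_on_grid(speech_mag, noise_mag, mu, gamma_db, alpha, steps=10001):
    """Brute-force search for the gain in [0, 1] that minimizes the loss."""
    grid = np.linspace(0.0, 1.0, steps)
    losses = [residual_noise_control_loss(speech_mag, noise_mag, g,
                                          mu, gamma_db, alpha) for g in grid]
    return float(grid[int(np.argmin(losses))])

# One T-F bin with speech magnitude 1 and noise magnitude 2 (hypothetical values).
speech_mag = np.array([1.0])
noise_mag = np.array([2.0])

gain_mu2 = best_gain_on_grid(speech_mag, noise_mag, 2.0, -20.0, 1.0)
gain_mu10 = best_gain_on_grid(speech_mag, noise_mag, 10.0, -20.0, 1.0)
```

Under this assumed form, a larger weight on the residual noise term pushes the minimizing gain down toward the floor rather than toward zero, which is the qualitative behaviour that residual noise control is meant to provide.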

IV Experimental Setup

IV-A Dataset

Experiments are conducted on the TIMIT corpus, from which 1000 and 200 utterances are randomly chosen as the training and evaluation sets, respectively. 125 types of environmental noise [6, 28] are used to generate noisy utterances at SNR levels ranging from -5dB to 15dB in 5dB steps. For model testing, an additional 10 male and 10 female utterances are mixed with unseen noise signals taken from NOISEX-92 [29] at SNRs ranging from -5dB to 10dB in 5dB steps.
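Generating noisy utterances at a prescribed SNR can be sketched as follows. The helper names `mix_at_snr` and `measured_snr_db` are hypothetical, random signals stand in for the TIMIT speech and the noise recordings, and the noise segment is assumed to have already been cut to the length of the utterance.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that the speech-to-noise power ratio equals
    snr_db, then add it to the speech. Returns the mixture and the scale."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise, scale

def measured_snr_db(speech, scaled_noise):
    """SNR in dB of a speech signal against an already scaled noise signal."""
    return 10.0 * np.log10(np.mean(speech ** 2) / np.mean(scaled_noise ** 2))

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)   # stand-in for one clean utterance
noise = rng.standard_normal(16000)    # stand-in for one noise segment

# Training SNR levels described in the text: -5dB to 15dB in 5dB steps.
achieved = {}
for snr_db in range(-5, 20, 5):
    noisy, scale = mix_at_snr(speech, noise, snr_db)
    achieved[snr_db] = measured_snr_db(speech, scale * noise)
```

Each entry of `achieved` should match its requested SNR up to floating-point error, which is a quick sanity check before building a full training set.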

IV-B Network Architecture

U-Net, which has been widely adopted for speech separation tasks [30], is chosen as the network in this letter. As shown in Fig. 1, the network consists of a convolutional encoder and decoder, each comprising five convolutional blocks in which a 2-D convolution layer is followed by batch normalization (BN) and an exponential linear unit (ELU). Skip connections are introduced to compensate for the information loss during feature compression. Note that the mapping target is the gain function, and a sigmoid function is adopted to ensure that the output ranges from 0 to 1. A causal mechanism is introduced to achieve real-time processing, where only past frames are involved in the convolution. The output tensor size of each layer is shown in Fig. 1.
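The causal mechanism can be illustrated with a simplified convolution along the time axis. The sketch below is a one-dimensional stand-in for the 2-D convolutions of the network: it pads only on the past side, so the output at frame t depends on frames up to t and never on later ones. The function name `causal_conv_time` and the toy kernel are assumptions for illustration and do not reproduce the actual layers.

```python
import numpy as np

def causal_conv_time(features, kernel):
    """Convolve each frequency bin along time using only the current and
    past frames. features: (frames, bins); kernel: (taps,), where the last
    tap weights the current frame."""
    taps = len(kernel)
    # Zero-pad on the past side only, so no future frame is ever touched.
    past_pad = np.zeros((taps - 1, features.shape[1]))
    padded = np.concatenate([past_pad, features], axis=0)
    out = np.zeros(features.shape, dtype=float)
    for t in range(features.shape[0]):
        window = padded[t:t + taps]   # frames t-taps+1 .. t
        out[t] = kernel @ window
    return out

rng = np.random.default_rng(0)
frames = rng.standard_normal((10, 4))     # 10 frames, 4 frequency bins
kernel = np.array([0.2, 0.3, 0.5])

output = causal_conv_time(frames, kernel)

# Changing frames 6 and later must leave outputs 0..5 untouched.
modified = frames.copy()
modified[6:] += 1.0
output_modified = causal_conv_time(modified, kernel)
```

Comparing `output[:6]` with `output_modified[:6]` confirms that no future information leaks into earlier frames, which is the property that real-time processing relies on.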

IV-C Loss Functions and Training Models

This letter chooses three loss functions as baselines: MSE in (4), the time-domain MSE-based loss (TMSE) [30], and the recently proposed SI-SDR-based loss [30]. As a T-F domain-based network is used, an additional fixed iSTFT-like layer is needed to transform the estimated T-F spectrum back into the time domain for the TMSE- and SI-SDR-based losses [31]. These baselines are compared with the proposed generalized loss function given in (18) with (20) and (21). All the models are trained with stochastic gradient descent (SGD) using the Adam optimizer [32].
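For reference, the SI-SDR measure behind the SI-SDR-based baseline can be sketched on time-domain signals as below, following the usual definition in [16]: the estimate is projected onto the target, and the ratio of the projected energy to the remaining error energy is reported in dB. The function names, the mean removal, and the `eps` term are conventions assumed for this example; the sketch takes waveforms that have already been transformed back to the time domain.

```python
import numpy as np

def si_sdr_db(estimate, target, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB."""
    target = target - np.mean(target)
    estimate = estimate - np.mean(estimate)
    # Optimal scaling of the target that best explains the estimate.
    scale = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = scale * target
    error = estimate - projection
    ratio = (np.sum(projection ** 2) + eps) / (np.sum(error ** 2) + eps)
    return 10.0 * np.log10(ratio)

def si_sdr_loss(estimate, target):
    """Training loss: negative SI-SDR, so that lower is better."""
    return -si_sdr_db(estimate, target)

rng = np.random.default_rng(0)
target = rng.standard_normal(16000)                     # stand-in clean waveform
estimate = target + 0.1 * rng.standard_normal(16000)    # mildly corrupted estimate
worse = target + 0.5 * rng.standard_normal(16000)       # more corrupted estimate

score = si_sdr_db(estimate, target)
score_rescaled = si_sdr_db(3.0 * estimate, target)      # same estimate, louder
score_worse = si_sdr_db(worse, target)
```

Rescaling the estimate leaves the score essentially unchanged, which is the scale invariance that distinguishes SI-SDR from a plain time-domain MSE.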


Fig. 1: The network architecture adopted in this study. Input is the noisy magnitude spectra and output is the estimated gain functions.

V Results and Analysis

V-A Objective Evaluation

This letter uses four objective measures: noise attenuation (NA) [21], speech attenuation (SA) [21], PESQ [14], and SDR [33]. The testing results w.r.t. and are shown in Fig. 2, where , and are considered. The test results of the three baselines are also presented for comparison. From this figure, one can make the following observations. First, the increase of will decrease NA. This is because the residual noise control mechanism is introduced into the optimization, which means that, during training, the residual noise in the estimated spectra gradually approaches the preset residual noise threshold. As a consequence, the characteristics of the residual noise are expected to be effectively preserved, which is further confirmed by the subjective listening tests below. Second, the increase of benefits noise suppression while introducing more speech distortion. As the generalized loss can be viewed as a joint optimization of speech distortion and noise reduction, a larger leads to smaller gain values, as (17) states: more interference is suppressed, but more speech components are inevitably discarded. Third, the increase of has a negative influence on NA and SD. Finally, among the various parameter configurations, (2, -30dB, 0.5), (2, -30dB, 1) and (2, -20dB, 1) can be chosen, because they give relatively good performance on all four objective metrics. One can observe that the three competing loss functions achieve better performance on some objective metrics but much worse performance on others. For example, SI-SDR and TMSE have larger SDR values, while their PESQ scores are even lower than those of MSE, which is consistent with the study in [16].

Fig. 2: Test results in terms of NA, SA, PESQ and SDR, where the averaged PESQ score of the noisy signals is 1.80 and its averaged SDR is 2.51dB.
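A possible way to compute attenuation measures such as NA and SA is sketched below. It assumes that each measure is the input-to-output power ratio, in dB, of one signal component (noise or speech) after that component alone is filtered by the estimated gain. This reading, the function name `attenuation_db`, and the toy spectra are assumptions made for illustration; the exact definitions are those of [21].

```python
import numpy as np

def attenuation_db(component_mag, gain, eps=1e-12):
    """Power ratio in dB between a component before and after the gain."""
    power_in = np.sum(component_mag ** 2)
    power_out = np.sum((gain * component_mag) ** 2)
    return 10.0 * np.log10((power_in + eps) / (power_out + eps))

# One frame with 4 frequency bins: speech occupies the first two bins,
# noise is spread over all four (hypothetical values).
speech_mag = np.array([[2.0, 2.0, 0.0, 0.0]])
noise_mag = np.array([[1.0, 1.0, 1.0, 1.0]])

# A gain that keeps the speech-dominated bins and attenuates the others.
gain = np.array([[1.0, 1.0, 0.1, 0.1]])

noise_attenuation = attenuation_db(noise_mag, gain)
speech_attenuation = attenuation_db(speech_mag, gain)

# A uniform amplitude gain of 0.1 attenuates any component by about 20 dB.
uniform_attenuation = attenuation_db(noise_mag, np.full_like(noise_mag, 0.1))
```

In this toy case the speech passes unattenuated while part of the noise is reduced, which is the situation where a high NA comes together with a low SA.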

V-B Subjective Evaluation

To evaluate the speech quality of the proposed generalized loss (GL) function, a subjective listening test is conducted between GL and each baseline, following the subjective testing procedure of [34]. In this comparison, we choose the parameter configuration (2, -20dB, 1) for the proposed GL function. The experiment is conducted in a standard listening room with 10 listeners. The listening material consists of 20 utterances, each of which includes one male and one female utterance selected from the TIMIT corpus and is mixed with one of five noises: aircraft, babble, bus, cafeteria, and car. Four SNR conditions are selected for mixing, i.e., -5dB, 0dB, 5dB, and 10dB. A speech pause of 3s is inserted before each utterance, so that each listening utterance lasts about 13s. Each listener writes down the index of the utterance that they prefer, considering both noise naturalness and speech quality. As in [34], an "Equal" option is provided if no subjective preference can be given. To avoid inertia, the utterance order in each pair is shuffled. The averaged subjective results are presented in Table I. From this table, one can observe that the proposed GL function with residual noise control is preferred in subjective testing, which can be explained by the fact that, compared with all the baselines, the proposed GL method effectively recovers speech components while preserving the characteristics of the background noise to some extent.

Comparison      GL      Baseline   Equal
GL vs. MSE      70.0%   22.0%      8.0%
GL vs. TMSE     66.5%   22.0%      12.5%
GL vs. SI-SDR   70.5%   23.5%      6.0%
TABLE I: Results of the subjective listening test. The numbers indicate the percentage of votes in favor of each approach. The choice "Equal" means no subjective difference.

VI Conclusion

This letter derives a generalized loss function that can easily balance noise attenuation and speech distortion through multiple manually set parameters. In addition, MSE and other typical loss functions are shown to be special cases. Both objective and subjective tests show that it is important to control the residual noise in supervised speech enhancement approaches, and that doing so makes the residual noise sound much more natural. Future work could study the combination of the residual noise control scheme with objective metric-based loss functions to further improve the naturalness of the residual noise.


  • [1] S. Boll, “Suppression of acoustic noise in speech using spectral subtraction,” IEEE Trans. Acoust. Speech Signal Process., vol. 27, no. 2, pp. 113–120, 1979.
  • [2] Y. Ephraim and D. Malah, “Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator,” IEEE Trans. Acoust. Speech Signal Process., vol. 32, no. 6, pp. 1109–1121, 1984.
  • [3] Y. Ephraim and D. Malah, “Speech enhancement using a minimum mean-square error log-spectral amplitude estimator,” IEEE Trans. Acoust. Speech Signal Process., vol. 33, no. 2, pp. 443–445, 1985.
  • [4] Y. Ephraim and H. L. Van Trees, “A signal subspace approach for speech enhancement,” IEEE Trans. Acoust. Speech Signal Process., vol. 3, no. 4, pp. 251–266, 1995.
  • [5] Y. Wang, A. Narayanan, and D. Wang, “On training targets for supervised speech separation,” IEEE Trans. Audio, Speech, Lang. Process., vol. 22, no. 12, pp. 1849–1858, 2014.
  • [6] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, “An experimental study on speech enhancement based on deep neural networks,” IEEE Signal Process. Lett., vol. 21, no. 1, pp. 65–68, 2013.
  • [7] Y. Hu and P. C. Loizou, “A perceptually motivated approach for speech enhancement,” IEEE Trans. Audio, Speech, Lang. Process., vol. 11, no. 5, pp. 457–465, 2003.
  • [8] P. C. Loizou and G. Kim, “Reasons why current speech-enhancement algorithms do not improve speech intelligibility and suggested solutions,” IEEE Trans. Audio, Speech, Lang. Process., vol. 19, no. 1, pp. 47–56, 2010.
  • [9] J. M. Martín-Doñas, A. M. Gomez, J. A. Gonzalez, and A. M. Peinado, “A deep learning loss function based on the perceptual evaluation of the speech quality,” IEEE Signal Process. Lett., vol. 25, no. 11, pp. 1680–1684, 2018.
  • [10] Q. Liu, W. Wang, P. J. Jackson, and Y. Tang, “A perceptually-weighted deep neural network for monaural speech enhancement in various background noise conditions,” in Proc. Eur. Signal Process. Conf, pp. 1270–1274, IEEE, 2017.
  • [11] M. Kolbæk, Z.-H. Tan, and J. Jensen, “Monaural speech enhancement using deep neural networks by maximizing a short-time objective intelligibility measure,” in Proc. Int. Conf. Acoust., Speech, Signal Process., pp. 5059–5063, 2018.
  • [12] S. Venkataramani, J. Casebeer, and P. Smaragdis, “Adaptive front-ends for end-to-end source separation,” in Proc. NIPS, 2017.
  • [13] P. G. Shivakumar and P. G. Georgiou, “Perception optimized deep denoising autoencoders for speech enhancement,” in Proc. INTERSPEECH, pp. 3743–3747, 2016.
  • [14] ITU-T, “Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs,” Rec. ITU-T P.862, 2001.
  • [15] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, “A short-time objective intelligibility measure for time-frequency weighted noisy speech,” in Proc. Int. Conf. Acoust., Speech, Signal Process., pp. 4214–4217, 2010.
  • [16] J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, “SDR–half-baked or well done?,” in Proc. Int. Conf. Acoust., Speech, Signal Process., pp. 626–630, IEEE, 2019.
  • [17] Z. Xu, S. Elshamy, Z. Zhao, and T. Fingscheidt, “Components loss for neural networks in mask-based speech enhancement,” arXiv preprint arXiv:1908.05087, 2019.
  • [18] C. Zheng, Y. Zhou, X. Hu, and X. Li, “Two-channel post-filtering based on adaptive smoothing and noise properties,” in Proc. Int. Conf. Acoust., Speech, Signal Process., pp. 1745–1748, 2011.
  • [19] C. Zheng, H. Liu, R. Peng, and X. Li, “A statistical analysis of two-channel post-filter estimators in isotropic noise fields,” IEEE Trans. Audio, Speech, Lang. Process., vol. 21, no. 2, pp. 336–342, 2012.
  • [20] F. B. Gelderblom, T. V. Tronstad, and E. M. Viggen, “Subjective evaluation of a noise-reduced training target for deep neural network-based speech enhancement,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 27, no. 3, pp. 583–594, 2018.
  • [21] S. Gustafsson, P. Jax, and P. Vary, “A novel psychoacoustically motivated audio enhancement algorithm preserving background noise characteristics,” in Proc. Int. Conf. Acoustics, Speech, Signal Process., vol. 1, pp. 397–400, 1998.
  • [22] S. Braun, K. Kowalczyk, and E. A. Habets, “Residual noise control using a parametric multichannel Wiener filter,” in Proc. Int. Conf. Acoust., Speech, Signal Process., pp. 360–364, 2015.
  • [23] J. Benesty, J. Chen, Y. Huang, and I. Cohen, Noise reduction in speech processing, vol. 2. Springer Science & Business Media, 2009.
  • [24] R. Martin, “Noise power spectral density estimation based on optimal smoothing and minimum statistics,” IEEE Trans. Speech Audio Process., vol. 9, no. 5, pp. 504–512, 2001.
  • [25] I. Cohen, “Noise spectrum estimation in adverse environments: Improved minima controlled recursive averaging,” IEEE Trans. Speech Audio Process., vol. 11, no. 5, pp. 466–475, 2003.
  • [26] T. Gerkmann and R. C. Hendriks, “Unbiased MMSE-based noise power estimation with low complexity and low tracking delay,” IEEE Trans. Audio, Speech, Lang. Process., vol. 20, no. 4, pp. 1383–1393, 2011.
  • [27] T. Inoue, H. Saruwatari, K. Shikano, and K. Kondo, “Theoretical analysis of musical noise in Wiener filtering family via higher-order statistics,” in Proc. Int. Conf. Acoust., Speech, Signal Process., pp. 5076–5079, 2011.
  • [28] Z. Duan, G. J. Mysore, and P. Smaragdis, “Speech enhancement by online non-negative spectrogram decomposition in nonstationary noise environments,” in Proc. INTERSPEECH, pp. 1–4, 2012.
  • [29] A. Varga and H. J. Steeneken, “Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems,” Speech Commun., vol. 12, no. 3, pp. 247–251, 1993.
  • [30] M. Kolbæk, Z.-H. Tan, S. H. Jensen, and J. Jensen, “On loss functions for supervised monaural time-domain speech enhancement,” arXiv preprint arXiv:1909.01019, 2019.
  • [31] G. Wichern and J. Le Roux, “Phase reconstruction with learned time-frequency representations for single-channel speech separation,” in IEEE Int. Workshop Acoust. Signal Enhancement (IWAENC), pp. 396–400, IEEE, 2018.
  • [32] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [33] E. Vincent, R. Gribonval, and C. Févotte, “Performance measurement in blind audio source separation,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 14, no. 4, pp. 1462–1469, 2006.
  • [34] C. Breithaupt, T. Gerkmann, and R. Martin, “Cepstral smoothing of spectral filter gains for speech enhancement without musical noise,” IEEE Signal Process. Lett., vol. 14, no. 12, pp. 1036–1039, 2007.