I Introduction
Speech enhancement plays an important role in noisy environments for many applications, such as speech communication, speech interaction, and speech translation. Considerable research effort has been devoted to separating speech from its noisy version, and various approaches have been proposed over the last five decades. Conventional approaches include spectral subtraction [1], statistical methods [2, 3], and subspace-based methods [4], which have proved to be valid when the additive noise is stationary or quasi-stationary. However, their performance often degrades heavily under non-stationary and low signal-to-noise ratio (SNR) conditions.
With the advance of deep learning, supervised approaches have gradually shown their powerful capability in suppressing both stationary and highly non-stationary noise signals, mainly because of the highly nonlinear mapping ability of deep neural networks (DNN) [5], [6]. In DNN-based algorithms, the minimum mean-squared error (MSE) is often adopted as the loss criterion to update the weights of the network. Nevertheless, using this criterion directly may cause some problems. First, although MSE is the most commonly used criterion, it is not closely related to speech perception [7, 8]. Second, global MSE optimization usually yields an over-smoothed estimate, which omits some important details. To solve these problems, many new criteria that consider speech perception have been proposed in recent years [9, 10, 11, 12]. The first type uses perceptually weighted MSE functions, which weight the loss in different time-frequency (T-F) regions [13, 10]. The second type uses objective metrics as loss functions; for example, perceptual evaluation of speech quality (PESQ) [14], short-time objective intelligibility (STOI) [15], and scale-invariant signal-to-distortion ratio (SI-SDR) [16] have been adopted as loss functions. In [17], speech distortion and residual noise are considered separately in the loss function, which is called the components loss (CL).
Note that all the above-mentioned loss functions aim at suppressing noise as much as possible in noise-only segments. In other words, in noise-only segments, the amount of noise reduction is expected to be infinite. This aim cannot be achieved in most cases, for several reasons. First, the noise is often stochastic, and thus the estimation accuracy is inevitably constrained by the limited number of available observations [18, 19]. Second, there is a great variety of noise signals, so a DNN model cannot be expected to distinguish all of them correctly from the speech in each T-F unit. Therefore, when the noise cannot be suppressed completely as expected, some unnatural residual noise may degrade speech quality considerably [20], which needs to be considered carefully.
In this paper, we derive a generalized loss function by introducing multiple manual parameters to flexibly balance speech distortion and noise attenuation. More specifically, residual noise control is introduced for voice communication [21, 22]. Theoretical derivations show that MSE and other often-used loss functions are included in the proposed generalized loss function.
The remainder of the paper is structured as follows. Section II formulates the problem. Section III derives the generalized loss in detail. Section IV describes the experimental setup, including the network architecture used. Results and analysis are given in Section V. Section VI presents some conclusions.
II Problem Formulation
In the time domain, the noisy signal can be modelled as
$$x(n) = s(n) + d(n), \tag{1}$$
where $s(n)$ is the clean speech and $d(n)$ is the additive noise. In the frequency domain, (1) can be written as
$$X(k,l) = S(k,l) + D(k,l), \tag{2}$$
where $X(k,l)$, $S(k,l)$, and $D(k,l)$ are, respectively, the discrete Fourier transforms (DFT) of $x(n)$, $s(n)$, and $d(n)$, with the frame index $l$ and the frequency bin $k$.
For practical applications, only the time-domain noisy signal $x(n)$ or its frequency-domain version $X(k,l)$ is available, so the problem becomes how to estimate $s(n)$ or $S(k,l)$ from the noisy signal. It is common to use minimum MSE (MMSE) as a criterion in unsupervised speech enhancement approaches. Before introducing MMSE, we first define the square error as
(3) 
where is a nonlinear spectral gain function, is a function with a variable , and is a function with three variables , , and . When and , results in MMSE spectral amplitude estimator in [2], where is the expectation operator. When and , leads to MMSE logspectral amplitude estimator in [3]. More complicated forms of and can be chosen, for example, many perceptuallyweighted error criteria can be included, which can be referred to [7].
For supervised approaches, the subband square error is often extended to the fullband and used as the loss function, which is
(4) 
One can get that, when and , is to minimize the MSE of logspectral amplitude between the clean speech and the estimated speech, which is the training target in [6].
Note that (3) and (4) are quite similar and the most obvious difference between them is that is the subband square error, while is the fullband square error. The other difference is that the nonlinear spectral gain can be derived theoretically by minimizing
when the probability density function (p.d.f.) of the speech and that of the noise are both given, while it is difficult to derive the nonlinear spectral gain by minimizing
, where this gain can often be mapped from the input noisy features after training the supervised machine learning model. In all, it seems that all subband square error functions can be generalized to the fullband ones as supervised training targets.
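To make the fullband criterion more concrete, the following minimal sketch computes the MSE of the log-spectral amplitude between a clean and an estimated magnitude spectrogram, which is the special case of (4) described above as the training target in [6]. The function name `fullband_log_mse`, the nested-list layout (frames by frequency bins), the averaging over all T-F units, and the small `eps` floor are assumptions made for illustration only.

```python
import math


def fullband_log_mse(clean_mag, est_mag, eps=1e-8):
    """Mean squared error between log-magnitudes over all frames and bins.

    clean_mag, est_mag: lists of frames, each frame a list of non-negative
    magnitudes (hypothetical layout). eps avoids log(0); it is an assumption.
    """
    total = 0.0
    count = 0
    for clean_frame, est_frame in zip(clean_mag, est_mag):
        for s, s_hat in zip(clean_frame, est_frame):
            diff = math.log(s + eps) - math.log(s_hat + eps)
            total += diff * diff
            count += 1
    # Averaging instead of summing only rescales the loss by a constant.
    return total / count


if __name__ == "__main__":
    clean = [[1.0, 2.0], [0.5, 4.0]]
    estimate = [[0.9, 2.2], [0.6, 3.5]]
    print(fullband_log_mse(clean, estimate))
```

Because every T-F unit contributes to a single scalar, this kind of fullband loss can be minimized by training a network, whereas the subband criterion is minimized analytically per frequency bin.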
III Proposed Algorithm
With MMSE as the only criterion, it is difficult to balance speech distortion and noise reduction. This section derives a more generalized fullband loss function.
III-A Trade-off Criterion in Subband
In traditional speech enhancement approaches, speech distortion and noise reduction in the subband can be considered separately. The subband square error of the speech and the subband residual noise can be, respectively, given by
(5) 
and
(6) 
where is a function with three variables , , and . When , , and , and become the MSE of the speech magnitude and the residual noise power in the subband, respectively, which are identical with [23, (8.31) and (8.32)].
By minimizing the subband MSE of the speech with a residual noise control, an optimization problem can be given to derive the nonlinear spectral gain, which is given by
(7) 
where is a function of two variables and . could be both a frequency and framedependent factor that can be introduced to control the residual noise flexibly.
The optimal spectral gain in (7) can be solved theoretically by the Lagrange multiplier method, which is
(8) 
where is a Lagrange multiplier. When , , , and , the optimal spectral gain can be derived from (8) and the constraint in (7), which can be given by
(9) 
where is the a priori SNR. It is not always possible to derive mathematically, especially when , , , and have very complicated expressions. Moreover, it is not easy to accurately estimate the noise power spectral density in non-stationary noise environments [24, 25, 26]. However, it seems that this optimization can be easily solved by supervised approaches. To transfer this problem to the supervised setting, we need to define the fullband square error of the speech and the fullband residual noise power, from which the loss function for supervised approaches is derived.
III-B Trade-off Criterion in Fullband
The fullband MSE of the speech and the fullband residual noise can be, respectively, given by
(10) 
and
(11) 
The loss function without any constraints can be given by
(12) 
Note that (12) is the same as the recently proposed components loss function given in [17].
The loss function with residual noise control is
(13) 
where
III-C A Generalized Loss Function
We further generalize the subband square error in (5) and (6), the square is substituted by a variable and an additional variable is also introduced on the spectra, then (5) and (6) can be, respectively, given by
(14) 
and
(15) 
Analogously, with the residual noise control, the optimization problem in the subband becomes
(16) 
By setting , , , and , one can derive a generalized gain function with the Lagrange multiplier method, which is
(17) 
where and , where (17) is identical to [27, (6)]. Note that [27, (6)] is given intuitively without theoretical derivation. When and , (17) reduces to (9). When , one can get , which has already been derived and presented in ([27, (22)]).
Similarly, the generalized loss function for supervised approaches can be given by
(18) 
where the first term relates to the fullband speech distortion and the second term is introduced to control the residual noise.
Eq. (18) is a generalized loss function that includes (12) and (13). This is because (18) reduces to (13) when and it can further reduce to (12) by setting . It is interesting to see that (3) can also be separated into two components, where one is the MSE of the speech and the other is related to the residual noise. When and , we have
(19) 
where relates to the power of speech distortion and relates to the power of residual noise. is a combination of speech distortion and residual noise, so the fullband MSE loss function of a complex spectrum is also a special case of the generalized loss function in (18). If and are chosen, the decomposition of is more complicated than (19), which will not be further discussed for limited space.
In this letter, we emphasize the importance of introducing the residual noise control. , , , and are applied, although more complicated expressions can be chosen when taking the perceptual quality into account. Accordingly, we have
(20) 
and
(21) 
where will be set to a constant value and is a constant value over frequency for simplicity, that is to say, and are used in the following. We only study the impact of , , and on supervised approaches.
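As a rough illustration of the two-term structure described for the generalized loss, the sketch below sums a speech-distortion term and a weighted residual-noise-control term over all T-F units. The exact expression is an assumption rather than a transcription of (18): the names `p` (the exponent replacing the square), `beta` (the exponent applied to the spectra), `mu` (the weight of the noise term), and `gamma` (a linear-amplitude residual noise threshold) are illustrative, and the clean speech and noise magnitudes are assumed to be available separately during training.

```python
def generalized_loss(gain, speech_mag, noise_mag,
                     p=2.0, beta=1.0, mu=1.0, gamma=0.1):
    """Assumed two-term loss: speech distortion + mu * residual noise control.

    gain, speech_mag, noise_mag: lists of frames, each a list of values per
    frequency bin (hypothetical layout). gamma is a linear amplitude factor,
    e.g. 10 ** (-20 / 20) for a residual noise level 20 dB below the input.
    """
    total = 0.0
    for g_frame, s_frame, d_frame in zip(gain, speech_mag, noise_mag):
        for g, s, d in zip(g_frame, s_frame, d_frame):
            # Distortion between the clean and the filtered speech magnitude.
            speech_term = abs(s ** beta - (g * s) ** beta) ** p
            # Deviation of the filtered noise from the target residual noise.
            noise_term = abs((g * d) ** beta - (gamma * d) ** beta) ** p
            total += speech_term + mu * noise_term
    return total


if __name__ == "__main__":
    gain = [[0.8, 0.3], [0.5, 0.1]]
    speech = [[1.0, 0.2], [0.6, 0.0]]
    noise = [[0.1, 0.5], [0.4, 0.7]]
    print(generalized_loss(gain, speech, noise, mu=0.5, gamma=10 ** (-20 / 20)))
```

Under these assumptions, setting `gamma=0`, `p=2`, and `beta=1` leaves the sum of the speech distortion power and `mu` times the residual noise power, which mirrors the unconstrained components-style loss mentioned in the text. A nonzero `gamma` instead rewards gains that leave a controlled amount of noise.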
IV Experimental Setup
IV-A Dataset
Experiments are conducted with the TIMIT corpus, where 1000 and 200 utterances are randomly chosen as the training and the evaluation datasets, respectively. 125 types of environmental noise [6, 28] are used to generate noisy utterances under different SNR levels ranging from 5dB to 15dB with an interval of 5dB. For model testing, an additional 10 male and 10 female utterances are mixed with unseen noise signals taken from NOISEX-92 [29], with SNR ranging from 5dB to 10dB with an interval of 5dB.
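Noisy utterances at a prescribed SNR are commonly generated by rescaling the noise before adding it to the clean speech. The sketch below shows one such procedure; it is not necessarily the exact pipeline used here. The function name `mix_at_snr` is hypothetical, and the code assumes equal-length signals, a noise segment with nonzero power, and SNR measured over the whole utterance.

```python
import math


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that the speech-to-noise power ratio equals snr_db.

    speech, noise: equal-length lists of samples (assumption).
    Returns the noisy mixture as a list of samples.
    """
    speech_power = sum(x * x for x in speech) / len(speech)
    noise_power = sum(x * x for x in noise) / len(noise)
    # Solve speech_power / (scale^2 * noise_power) = 10^(snr_db / 10).
    scale = math.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)]


if __name__ == "__main__":
    speech = [0.5, -0.4, 0.3, -0.2]
    noise = [0.05, 0.02, -0.04, 0.01]
    print(mix_at_snr(speech, noise, snr_db=5.0))
```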
IV-B Network Architecture
U-Net is chosen as the network in this letter, which has been widely adopted for speech separation tasks [30]. As shown in Fig. 1, the network consists of a convolutional encoder and decoder, each comprising five convolutional blocks in which a 2-D convolution layer is followed by batch normalization (BN) and an exponential linear unit (ELU). Skip connections are introduced to compensate for the information loss during feature compression. Note that the mapping target is the gain function, and a sigmoid function is adopted to ensure that the output ranges from 0 to 1. A causal mechanism is introduced to achieve real-time processing, where only the past frames are involved in the convolution calculation. The tensor output size of each layer is shown in Fig. 1.
IV-C Loss Functions and Training Models
This letter chooses three loss functions as baselines: the MSE in (4), the time-domain MSE-based loss (TMSE) [30], and the recently proposed SI-SDR-based loss [30]. As a T-F domain-based network is used, an additional fixed iSTFT-like layer is needed to transform the estimated T-F spectrum back into the time domain for the TMSE and SI-SDR-based losses [31]. These baselines are compared with the proposed generalized loss function given in (18) with (20) and (21). All the models are trained by stochastic gradient-based optimization with Adam [32].
V Results and Analysis
V-A Objective Evaluation
This letter uses four objective measurements, including noise attenuation (NA) [21], speech attenuation (SA) [21], PESQ [14], and SDR [33]. The testing results w.r.t. and are shown in Fig. 2, where , and are considered. The test results of the three baselines are also presented for comparison. From this figure, one can observe the following phenomena. First, the increase of will decrease NA. This is because the residual noise control mechanism is introduced for optimization, which means that, during the training process, the residual noise in the estimated spectra will gradually get close to the preset residual noise threshold. As a consequence, the characteristic of the residual noise is expected to be effectively preserved, which will be further confirmed by the subjective listening tests below. Second, the increase of is beneficial to noise suppression but introduces more speech distortion. As the generalized loss can be viewed as the joint optimization of both speech distortion and noise reduction, a larger leads to smaller gain values, as (17) states, where on the one hand more interference is suppressed and on the other hand more speech components are inevitably discarded. Third, the increase of has a negative influence on NA and SD. Finally, among the various parameter configurations, (2, 30dB, 0.5), (2, 30dB, 1) and (2, 20dB, 1) can be chosen, because they obtain relatively better performance for all four objective metrics. One can observe that the three competing loss functions achieve better performance in some objective metrics, while they may suffer much worse performance in others. For example, SI-SDR and TMSE have larger values of SDR, while their PESQ scores are even lower than those of the MSE, which is consistent with the study in [16].
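The attenuation measures can be illustrated with a short sketch. It assumes the common definition in which the estimated gain is applied separately to the clean speech and to the noise, and the attenuation is the ratio of input to output power in dB. The exact definitions in [21] may differ in detail, for instance in how frames are averaged. The function name `attenuation_db` and the nested-list layout are hypothetical.

```python
import math


def attenuation_db(gain, mag):
    """Input-to-output power ratio in dB after applying `gain` to `mag`.

    gain, mag: lists of frames, each a list of values per frequency bin.
    Passing the noise magnitudes gives NA, the speech magnitudes give SA
    (under the assumed definition described in the lead-in).
    """
    in_power = sum(m * m for frame in mag for m in frame)
    out_power = sum((g * m) ** 2
                    for g_frame, m_frame in zip(gain, mag)
                    for g, m in zip(g_frame, m_frame))
    return 10 * math.log10(in_power / out_power)


if __name__ == "__main__":
    gain = [[0.9, 0.2], [0.8, 0.1]]
    speech = [[1.0, 0.1], [0.7, 0.05]]
    noise = [[0.1, 0.6], [0.1, 0.5]]
    print("SA (dB):", attenuation_db(gain, speech))
    print("NA (dB):", attenuation_db(gain, noise))
```

A desirable operating point has a large NA together with an SA close to 0 dB, which is the trade-off the parameters of the generalized loss are meant to steer.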
V-B Subjective Evaluation
To evaluate the speech quality of the proposed generalized loss (GL) function, a subjective evaluation test is conducted between GL and the baselines, following the subjective testing procedures of [34]. In this comparison, we choose the parameter configuration (2, 20dB, 1) for the proposed GL function. The experiment is conducted in a standard listening room with 10 listeners. The listening material consists of 20 utterances, each of which includes one male and one female utterance selected from the TIMIT corpus and is mixed with one of five noises: aircraft, babble, bus, cafeteria, and car. Four SNR conditions are selected for mixing, i.e., -5dB, 0dB, 5dB, and 10dB. A speech pause of 3s duration is inserted before each utterance, so the duration of each listening utterance is about 13s. Each listener writes down the index of the utterance that they prefer, considering both noise naturalness and speech quality. As in [34], an “Equal” option is also provided if no subjective preference can be given. To avoid listener inertia, the utterance index in each pair is shuffled. The averaged subjective results are presented in Table I. From this table, one can observe that the proposed GL function with residual noise control achieves better performance in subjective testing. Compared with all the baselines, the proposed GL method can effectively recover speech components while preserving the characteristic of the background noise to some extent.
Table I: Averaged subjective preference results.

| Comparison | GL | Baseline | Equal |
|---|---|---|---|
| GL vs. MSE | 70.0% | 22.0% | 8.0% |
| GL vs. TMSE | 66.5% | 22.0% | 12.5% |
| GL vs. SI-SDR | 70.5% | 23.5% | 6.0% |
VI Conclusion
This letter derives a generalized loss function that can easily balance noise attenuation and speech distortion with multiple manual parameters. In addition, MSE and other typical loss functions are shown to be special cases. Both objective and subjective tests show that it is important to control the residual noise in supervised speech enhancement approaches, since the residual noise becomes much more natural. Further work could concentrate on combining the residual noise control scheme with objective metrics-based loss functions to improve the naturalness of the residual noise.
References
[1] S. Boll, “Suppression of acoustic noise in speech using spectral subtraction,” IEEE Trans. Acoust. Speech Signal Process., vol. 27, no. 2, pp. 113–120, 1979.
[2] Y. Ephraim and D. Malah, “Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator,” IEEE Trans. Acoust. Speech Signal Process., vol. 32, no. 6, pp. 1109–1121, 1984.
[3] Y. Ephraim and D. Malah, “Speech enhancement using a minimum mean-square error log-spectral amplitude estimator,” IEEE Trans. Acoust. Speech Signal Process., vol. 33, no. 2, pp. 443–445, 1985.
[4] Y. Ephraim and H. L. Van Trees, “A signal subspace approach for speech enhancement,” IEEE Trans. Speech Audio Process., vol. 3, no. 4, pp. 251–266, 1995.
[5] Y. Wang, A. Narayanan, and D. Wang, “On training targets for supervised speech separation,” IEEE Trans. Audio, Speech, Lang. Process., vol. 22, no. 12, pp. 1849–1858, 2014.
[6] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, “An experimental study on speech enhancement based on deep neural networks,” IEEE Signal Process. Lett., vol. 21, no. 1, pp. 65–68, 2013.
[7] Y. Hu and P. C. Loizou, “A perceptually motivated approach for speech enhancement,” IEEE Trans. Audio, Speech, Lang. Process., vol. 11, no. 5, pp. 457–465, 2003.
[8] P. C. Loizou and G. Kim, “Reasons why current speech-enhancement algorithms do not improve speech intelligibility and suggested solutions,” IEEE Trans. Audio, Speech, Lang. Process., vol. 19, no. 1, pp. 47–56, 2010.
[9] J. M. Martín-Doñas, A. M. Gomez, J. A. Gonzalez, and A. M. Peinado, “A deep learning loss function based on the perceptual evaluation of the speech quality,” IEEE Signal Process. Lett., vol. 25, no. 11, pp. 1680–1684, 2018.
[10] Q. Liu, W. Wang, P. J. Jackson, and Y. Tang, “A perceptually-weighted deep neural network for monaural speech enhancement in various background noise conditions,” in Proc. Eur. Signal Process. Conf., pp. 1270–1274, IEEE, 2017.
[11] M. Kolbæk, Z.-H. Tan, and J. Jensen, “Monaural speech enhancement using deep neural networks by maximizing a short-time objective intelligibility measure,” in Proc. Int. Conf. Acoust., Speech, Signal Process., pp. 5059–5063, 2018.
[12] S. Venkataramani, J. Casebeer, and P. Smaragdis, “Adaptive front-ends for end-to-end source separation,” in Proc. NIPS, 2017.
[13] P. G. Shivakumar and P. G. Georgiou, “Perception optimized deep denoising autoencoders for speech enhancement,” in Proc. INTERSPEECH, pp. 3743–3747, 2016.
[14] ITU-T Recommendation, “Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs,” Rec. ITU-T P.862, 2001.
[15] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, “A short-time objective intelligibility measure for time-frequency weighted noisy speech,” in Proc. Int. Conf. Acoust., Speech, Signal Process., pp. 4214–4217, 2010.
[16] J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, “SDR – half-baked or well done?,” in Proc. Int. Conf. Acoust., Speech, Signal Process., pp. 626–630, IEEE, 2019.
[17] Z. Xu, S. Elshamy, Z. Zhao, and T. Fingscheidt, “Components loss for neural networks in mask-based speech enhancement,” arXiv preprint arXiv:1908.05087, 2019.
[18] C. Zheng, Y. Zhou, X. Hu, and X. Li, “Two-channel post-filtering based on adaptive smoothing and noise properties,” in Proc. Int. Conf. Acoust., Speech, Signal Process., pp. 1745–1748, 2011.
[19] C. Zheng, H. Liu, R. Peng, and X. Li, “A statistical analysis of two-channel post-filter estimators in isotropic noise fields,” IEEE Trans. Audio, Speech, Lang. Process., vol. 21, no. 2, pp. 336–342, 2012.
[20] F. B. Gelderblom, T. V. Tronstad, and E. M. Viggen, “Subjective evaluation of a noise-reduced training target for deep neural network-based speech enhancement,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 27, no. 3, pp. 583–594, 2018.
[21] S. Gustafsson, P. Jax, and P. Vary, “A novel psychoacoustically motivated audio enhancement algorithm preserving background noise characteristics,” in Proc. Int. Conf. Acoust., Speech, Signal Process., vol. 1, pp. 397–400, 1998.
[22] S. Braun, K. Kowalczyk, and E. A. Habets, “Residual noise control using a parametric multichannel Wiener filter,” in Proc. Int. Conf. Acoust., Speech, Signal Process., pp. 360–364, 2015.
[23] J. Benesty, J. Chen, Y. Huang, and I. Cohen, Noise reduction in speech processing, vol. 2. Springer Science & Business Media, 2009.
[24] R. Martin, “Noise power spectral density estimation based on optimal smoothing and minimum statistics,” IEEE Trans. Speech Audio Process., vol. 9, no. 5, pp. 504–512, 2001.
[25] I. Cohen, “Noise spectrum estimation in adverse environments: Improved minima controlled recursive averaging,” IEEE Trans. Speech Audio Process., vol. 11, no. 5, pp. 466–475, 2003.
[26] T. Gerkmann and R. C. Hendriks, “Unbiased MMSE-based noise power estimation with low complexity and low tracking delay,” IEEE Trans. Audio, Speech, Lang. Process., vol. 20, no. 4, pp. 1383–1393, 2011.
[27] T. Inoue, H. Saruwatari, K. Shikano, and K. Kondo, “Theoretical analysis of musical noise in Wiener filtering family via higher-order statistics,” in Proc. Int. Conf. Acoust., Speech, Signal Process., pp. 5076–5079, 2011.
[28] Z. Duan, G. J. Mysore, and P. Smaragdis, “Speech enhancement by online non-negative spectrogram decomposition in non-stationary noise environments,” in Proc. INTERSPEECH, pp. 1–4, 2012.
[29] A. Varga and H. J. Steeneken, “Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems,” Speech Commun., vol. 12, no. 3, pp. 247–251, 1993.
[30] M. Kolbæk, Z.-H. Tan, S. H. Jensen, and J. Jensen, “On loss functions for supervised monaural time-domain speech enhancement,” arXiv preprint arXiv:1909.01019, 2019.
[31] G. Wichern and J. Le Roux, “Phase reconstruction with learned time-frequency representations for single-channel speech separation,” in IEEE Int. Workshop Acoust. Signal Enhancement (IWAENC), pp. 396–400, IEEE, 2018.
[32] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[33] E. Vincent, R. Gribonval, and C. Févotte, “Performance measurement in blind audio source separation,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 14, no. 4, pp. 1462–1469, 2006.
[34] C. Breithaupt, T. Gerkmann, and R. Martin, “Cepstral smoothing of spectral filter gains for speech enhancement without musical noise,” IEEE Signal Process. Lett., vol. 14, no. 12, pp. 1036–1039, 2007.