1 Introduction
Speech enhancement technology aims to improve the quality and comfort of hearing while improving speech intelligibility [11]. In recent years, deep learning has been widely used in speech enhancement. In addition to exploring various deep network structures to build effective enhancement models, related research also involves the construction and optimization of the loss functions used to guide model training. In general, a loss function is built on some distance measure between the predicted speech and the reference. Common choices are the mean square error (MSE) [25] and the mean absolute error (MAE) [15], which are very convenient to compute. [10] proposed the scale-invariant signal-to-noise ratio (SI-SNR), and [24] proposed the scale-dependent signal-to-distortion ratio (SD-SDR). This kind of loss function evaluates the enhanced speech quality directly from the distance between the enhanced and reference waveforms, and it has many successful applications [9, 17]. However, such objective distances differ from subjective auditory ones, so it is necessary to introduce auditory effects into the distance measure. The basic idea is to simulate the perceptual characteristics of the human ear with respect to frequency and loudness [29], implementing a nonlinear warping of frequency and a compression of intensity [18]. [12] proposed the extended short-time objective intelligibility (ESTOI) based on STOI [27, 26], which is obtained by calculating subband spectral envelope correlation coefficients. [13] introduced the perceptual evaluation of speech quality (PESQ) [28] and combined it with an MSE on the logarithmic power spectra to form the perceptual metric for speech quality evaluation (PMSQE) loss function (to distinguish the MSE in [13] from the ordinary MSE, we abbreviate it as LogMSE). [30, 32] use power exponents to compress the value of each component in their loss functions.
In addition, [7, 23] trained dedicated networks for speech quality evaluation, and in [6], an evaluation network is attached to the tail of the speech enhancement model to guide its training.
Because subjective evaluation requires testing large amounts of data, it consumes too many resources. Several objective indexes, such as PESQ provided by ITU-T P.862 [28], STOI, and SI-SNR, have therefore become the main schemes for measuring speech quality. By analyzing loss functions such as MSE, PMSQE, SI-SNR, and STOI, we find that the correlation coefficients [19] between these loss functions and PESQ are generally not high. During training, PMSQE and STOI cannot attend to phase information because they are measured on magnitude spectra, and too much empirical auditory processing biases the model toward the corresponding index (PESQ or STOI). Since SI-SNR is measured on the time-domain waveform, magnitude and phase are considered indirectly, and it introduces a reasonable expression of the signal-to-noise ratio (SNR), making its effect better and more stable. However, as the performance of the model improves, the guiding effect of the SI-SNR loss function deteriorates. Therefore, a signal-to-noise ratio loss function based on auditory power compression (APC-SNR) is proposed. Firstly, the auditory power exponents used in PESQ are referenced and mapped back to the power spectrum. Secondly, the exponent operation on the power spectrum is converted into a proportional factor on the time-frequency spectrum, and the proportional factor is controlled by hyperparameters before it is applied to the signal. Finally, we follow the SNR formulation of the time-domain SI-SNR to calculate the loss function in the compressed time-frequency domain (tf-domain). The experimental results show that this method correlates well with most speech quality evaluation indexes and improves the comprehensive performance of the model.
The rest of this paper is arranged as follows: Section 2 introduces the related algorithms, Section 3 describes the auditory power compression loss function, Section 4 presents the experiments and result analysis, and Section 5 summarizes the work.
2 Related work
2.1 Scale-invariant signal-to-noise ratio
The scale-invariant signal-to-noise ratio (SI-SNR) [10] is a reasonable expression of SNR. Measuring in the time domain takes both magnitude and phase into account, as shown in Fig. 1(a). SI-SNR is therefore a commonly used loss function and is defined as,

$$s_{\text{target}}=\frac{\langle \hat{s}, s\rangle\, s}{\lVert s\rVert^{2}},\qquad e_{\text{noise}}=\hat{s}-s_{\text{target}},\qquad \mathcal{L}_{\text{SI-SNR}}=-10\log_{10}\frac{\lVert s_{\text{target}}\rVert^{2}}{\lVert e_{\text{noise}}\rVert^{2}} \tag{1}$$

where $s$ and $\hat{s}$ represent the referenced and the enhanced signal respectively, and $\lVert\cdot\rVert^{2}$ represents the energy of the signal.
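As a concrete illustration, formula (1) can be sketched in a few lines of NumPy; the function and variable names here are ours, not the paper's.

```python
import numpy as np

def si_snr_loss(s, s_hat, eps=1e-8):
    """Negative scale-invariant SNR between a reference s and an estimate s_hat.

    Both inputs are 1-D waveforms. The projection removes any global gain
    difference, so the value is unchanged when the estimate is rescaled.
    """
    # Project the estimate onto the reference to obtain the "target" component.
    s_target = (np.dot(s_hat, s) / (np.dot(s, s) + eps)) * s
    e_noise = s_hat - s_target
    ratio = np.dot(s_target, s_target) / (np.dot(e_noise, e_noise) + eps)
    return -10.0 * np.log10(ratio)  # negate so minimising the loss maximises SI-SNR
```

The scale invariance is what distinguishes this loss from a plain SNR: multiplying the estimate by any positive constant leaves the value (essentially) unchanged.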
2.2 Perceptual metric for speech quality evaluation
When calculating PMSQE [13] or PESQ [28], the first step is level alignment. PESQ aligns the enhanced signal to the reference, but since we cannot artificially introduce the influence of the reference during training, PMSQE aligns the reference to the enhanced signal:

(2)

where $\hat{P}_t$ is the $t$-th frame of the power spectrum, $w$ is a spectral weighting mask which replicates the band-pass filtering, and $c$ is a power correction factor which accounts for the frame length, overlapping, and windowing applied during the spectral computation (via STFT). Then the power spectrum is mapped to the Bark spectrum by the transformation matrix $A$,
$$B_t = A\,P_t \tag{3}$$

where $B_t\in\mathbb{R}^{Q}$, $P_t\in\mathbb{R}^{F}$, $A\in\mathbb{R}^{Q\times F}$, and $Q$ and $F$ are the dimensions of the Bark spectrum and the power spectrum. Then the signals are transferred to loudness spectra as,
$$L_t(q)=S_l\left(\frac{P_0(q)}{0.5}\right)^{\gamma(q)}\left[\left(0.5+0.5\,\frac{B_t(q)}{P_0(q)}\right)^{\gamma(q)}-1\right] \tag{4}$$

where $S_l$ is the scaling factor and $P_0(q)$ is the absolute auditory threshold for the $q$-th Bark band. Each Bark band has a corresponding invariable Zwicker exponent $\gamma(q)$ [33]. Then, PESQ and PMSQE measure the difference between the enhanced and referenced signals in the loudness spectra. PMSQE simplifies the computation of the symmetrical disturbance vector proposed in PESQ by applying a center-clipping operator over the absolute difference between the loudness spectra as,
$$d_t=\max\!\big(|\hat{L}_t-L_t|-0.25\min(\hat{L}_t,L_t),\,0\big) \tag{5}$$

where $L_t$ and $\hat{L}_t$ represent the referenced and enhanced loudness spectra respectively, and $|\cdot|$, $\min$, and $\max$ are element-wise operations. Then the asymmetric disturbance vector can be obtained as,
$$\tilde{d}_t=d_t\odot r_t \tag{6}$$

where $\odot$ represents element-wise multiplication, and the elements of the asymmetry ratio $r_t$ are computed from the Bark spectra of the referenced and enhanced signals as follows,

$$r_t(q)=\left(\frac{\hat{B}_t(q)+c_1}{B_t(q)+c_1}\right)^{c_2} \tag{7}$$

where $c_1$ and $c_2$ are 50 and 1.2. The symmetric and asymmetric disturbance measurements are calculated as follows,
(8)

(9)

where $w$ is a vector filled with weights proportional to the widths of the Bark bands, obtained from [28].
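The center-clipping step of formula (5) can be sketched as below; the dead-zone fraction of 0.25 follows the usual PESQ convention and should be treated as an assumption rather than the paper's exact constant.

```python
import numpy as np

def symmetric_disturbance(l_ref, l_enh, deadzone=0.25):
    """Center-clipped absolute loudness difference (sketch of formula (5)).

    Differences smaller than `deadzone` times the smaller loudness value are
    treated as inaudible and clipped to zero, element-wise.
    """
    clip = deadzone * np.minimum(l_ref, l_enh)
    return np.maximum(np.abs(l_ref - l_enh) - clip, 0.0)
```

The clipping creates exactly the masking behaviour discussed later in Section 3: small loudness differences produce zero disturbance and hence zero gradient.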
Finally, PMSQE introduces LogMSE, because the resulting PESQ includes highly nonlinear and not fully differentiable operators which can lead to gradient misguidance if applied alone [13]. The loss is defined as,
(10) 
where $P_t$ and $\hat{P}_t$ are the reference and enhanced power spectra respectively, $\sigma$ is the standard deviation of the log-power spectrum, $\alpha$ and $\beta$ are weighting factors, and $T$ is the number of frames.

3 Proposed method
3.1 Scale-invariant signal-to-noise ratio in the time-frequency spectrum
The signal-to-noise ratio expression of the time-domain SI-SNR is migrated to the time-frequency spectrum:

$$X_{\text{target}}=\frac{\langle \hat{X}, X\rangle\, X}{\lVert X\rVert^{2}},\qquad E_{\text{noise}}=\hat{X}-X_{\text{target}},\qquad \mathcal{L}_{\text{TF-SISNR}}=-10\log_{10}\frac{\lVert X_{\text{target}}\rVert^{2}}{\lVert E_{\text{noise}}\rVert^{2}} \tag{11}$$

where $\hat{X}$ and $X$ represent the time-frequency spectra of the enhanced and referenced signals respectively. As shown in Fig. 1(b), different from the time domain, this measure can only focus on the phase, so its effect is far worse than that of the time-domain SI-SNR. We therefore introduce a scaling calculated from the magnitude to compensate for the magnitude insensitivity. In the time-frequency spectrum, we can make the loss function's attention to magnitude and phase more controllable, which is also more conducive to gradient transfer for tf-domain models.
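A minimal sketch of formula (11), treating the complex spectra as real vectors (real and imaginary parts stacked); the flattening convention is our assumption.

```python
import numpy as np

def tf_si_snr_loss(X, X_hat, eps=1e-8):
    """SI-SNR measured on complex time-frequency spectra (formula (11)).

    The complex spectra are flattened into real vectors before the usual
    SI-SNR projection, so real and imaginary parts are weighted equally.
    """
    x = np.concatenate([X.real.ravel(), X.imag.ravel()])
    x_hat = np.concatenate([X_hat.real.ravel(), X_hat.imag.ravel()])
    target = (np.dot(x_hat, x) / (np.dot(x, x) + eps)) * x
    noise = x_hat - target
    return -10.0 * np.log10(np.dot(target, target) / (np.dot(noise, noise) + eps))
```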
3.2 Auditory power compression
The auditory loudness spectrum (4) is the core of PMSQE and PESQ, so we conducted an in-depth analysis of it. Expanding formula (4) gives,

$$L_t(q)=\frac{S_l}{0.5^{\gamma(q)}}\left[\big(0.5\,P_0(q)+0.5\,B_t(q)\big)^{\gamma(q)}-P_0(q)^{\gamma(q)}\right] \tag{12}$$

In this form, the abstract formula becomes easier to explain: the loudness conversion mainly calculates the distance between the power of each band and the absolute auditory threshold under the action of $\gamma(q)$ in the Bark spectra.
Since we need to use it in the loss function, we need to consider its gradient,

$$\frac{\partial L_t(q)}{\partial B_t(q)}=\frac{\gamma(q)\,S_l}{2\cdot 0.5^{\gamma(q)}}\big(0.5\,P_0(q)+0.5\,B_t(q)\big)^{\gamma(q)-1} \tag{13}$$

where $S_l$ is the constant in (12) and $P_0(q)$ is the threshold constant in (12). It can be observed that the maximum ratio between different $P_0(q)$ can reach 200 million. With an exponent of about $\gamma(q)-1$, it is easy to produce a large deviation of gradients between different subbands. We speculate that this gradient difference, together with the masking effect in (5), can easily unbalance the model during training, making the optimizer focus only on the parts beneficial to the PESQ score. Using the Zwicker exponents without the absolute auditory thresholds can still endow different subbands with auditory difference characteristics, so we simplify the loudness expression (12) to,

$$L'_t(q)=\big(B_t(q)+\varepsilon\big)^{\gamma(q)} \tag{14}$$

where $\varepsilon$ is set to prevent the base from being zero.
The second problem is gradient blur. Returning to formula (3), the matrix $A$ is a sparse zero-one matrix that sums the specified subbands to obtain the corresponding subband values of the Bark spectrum. During back-propagation, this causes the gradients of some (up to 25) power subbands to become identical, resulting in gradient blur.
So we map the Zwicker auditory exponents to the power spectrum according to the correspondence between the Bark bands and the power bands [33]. Then we change the Bark spectrum in formula (14) to the power spectrum $P_t$, and the auditory expression is changed to,

$$L'_t(f)=\big(X_r^2(t,f)+X_i^2(t,f)+\varepsilon\big)^{\gamma(f)} \tag{15}$$

where $X_r$ and $X_i$ represent the real and imaginary parts of the time-frequency spectrum respectively.
In order to avoid the absence of phase in the power spectrum, we convert the exponential operation on the power spectrum into a proportional scaling of the real and imaginary parts of the time-frequency spectrum, and the auditory expression is changed to,

$$\tilde{X}_r(t,f)=g_t(f)\,X_r(t,f),\qquad \tilde{X}_i(t,f)=g_t(f)\,X_i(t,f) \tag{16}$$

where $\tilde{X}_r$ and $\tilde{X}_i$ represent the real and imaginary parts of the auditory expression $\tilde{X}$, and $g_t(f)$ is a scaling factor computed by,

$$g_t(f)=\big(P_t(f)+\varepsilon\big)^{\frac{\gamma(f)-1}{2}} \tag{17}$$

In order to prevent the difference of scaling between different bins from becoming too large, a limiting threshold $\tau$ is set to change power values less than $\tau$ to $\tau$,

$$P_t(f)\leftarrow\max\big(P_t(f),\,\tau\big) \tag{18}$$
Finally, we measure the loss by formula (11) on the compressed spectra $\tilde{X}$, and a signal-to-noise ratio loss function based on auditory power compression (APC-SNR) is obtained,

$$\mathcal{L}_{\text{APC-SNR}}=\mathcal{L}_{\text{TF-SISNR}}\big(\tilde{X},\hat{\tilde{X}}\big) \tag{19}$$

where $\hat{\tilde{X}}$ and $\tilde{X}$ represent the compressed enhanced and referenced spectra, and their scaling factors $g$ are calculated from their respective energy spectra.
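Putting the pieces together, the whole loss can be sketched as follows. The exponent value `gamma`, the power floor `tau`, and the flattening convention are illustrative assumptions, not the paper's tuned hyperparameters.

```python
import numpy as np

def apc_compress(X, gamma=0.23, tau=1e-4):
    """Auditory power compression of a complex spectrum (formulas (16)-(18), sketch).

    Each bin is scaled by g = P^((gamma-1)/2), where P is the bin power floored
    at tau, so the compressed magnitude behaves like |X|^gamma while the phase
    of X is untouched.
    """
    power = np.maximum(X.real ** 2 + X.imag ** 2, tau)  # floor small powers (18)
    g = power ** ((gamma - 1.0) / 2.0)                  # compression factor (17)
    return g * X

def apc_snr_loss(S, S_hat, eps=1e-8):
    """APC-SNR (formula (19)): tf-domain SI-SNR computed on compressed spectra.

    Each spectrum is compressed with its own scaling factor, following the
    description after formula (19).
    """
    C, C_hat = apc_compress(S), apc_compress(S_hat)
    x = np.concatenate([C.real.ravel(), C.imag.ravel()])
    y = np.concatenate([C_hat.real.ravel(), C_hat.imag.ravel()])
    target = (np.dot(y, x) / (np.dot(x, x) + eps)) * x
    noise = y - target
    return -10.0 * np.log10(np.dot(target, target) / (np.dot(noise, noise) + eps))
```

Because the scaling factor is real and positive, the compression changes only bin magnitudes, never phases, which is the point of moving from the power spectrum back to the complex spectrum.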
4 Experiments
4.1 Model design and training setup
The model used in our experiment is similar to those in [30, 3]. The overall framework is shown in Fig. 2.
We used the short-time Fourier transform (STFT) and its inverse (iSTFT) with a 512-sample window length, a 256-sample frame shift, and a Hann window. The magnitude spectrum was taken as the input. The model is mainly composed of fully-connected layers (FC) and gated recurrent units (GRUs) [5]. The rectified linear unit (ReLU) [20] is used as the activation except for the last layer, where a sigmoid activation [8] is used to predict the gain mask. The result is obtained by multiplying the noisy time-frequency spectrum by the gain mask. The calculation of the loss function was divided into cases A and B: if the loss function is calculated on the time-frequency spectrum, operation A is performed; if it is calculated in the time domain, operation B is performed. The parameters of the model were optimized by the Adam method [14]. The learning rate was decayed when the loss on the validation set plateaued for 5 epochs, and training was stopped if the loss plateaued for 20 epochs.
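The plateau-based schedule above can be sketched as a small controller; the initial learning rate and the decay factor here are illustrative placeholders, not the paper's values.

```python
class PlateauController:
    """Decay the learning rate after `decay_patience` epochs without validation
    improvement, and stop training after `stop_patience` epochs without one."""

    def __init__(self, lr=1e-3, decay=0.1, decay_patience=5, stop_patience=20):
        self.lr = lr
        self.decay = decay
        self.decay_patience = decay_patience
        self.stop_patience = stop_patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return False when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs == self.decay_patience:
                self.lr *= self.decay  # one decay step once the loss plateaus
        return self.bad_epochs < self.stop_patience
```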
4.2 Data
We used the speech and noise data from the DNS-Challenge [22] to generate the noisy speech used for training. We shifted the pitch of the speech with pysox [2], and the data was then divided into a training set and a validation set.

The speech data in the test set did not participate in the training or validation sets, the test noises were taken from the ESC-50 [21] dataset, and the test SNRs and pitch shifts were drawn from the same ranges as in training. A total of 14,000 10-second audio clips were generated for testing.
Table 1. Absolute values of the Pearson correlation coefficients between indexes.

          PESQ   STOI   PMSQE1  LogMSE  MSE    SI-SNR
PESQ      1      0.71   0.87    0.65    0.40   0.88
STOI      0.71   1      0.86    0.71    0.47   0.80
PMSQE1    0.87   0.86   1       0.79    0.49   0.87
LogMSE    0.65   0.71   0.79    1       0.43   0.68
MSE       0.40   0.47   0.49    0.43    1      0.51
APC-MSE   0.50   0.60   0.61    0.60    0.94   0.59
SI-SNR    0.88   0.80   0.87    0.68    0.51   1
APC-SNR   0.91   0.83   0.91    0.74    0.52   0.99
4.3 Correlation analysis
The Pearson correlation coefficient is often used to measure the correlation between indexes [1]. We calculated PESQ, STOI, LogMSE (the log-power-spectrum term in formula (10)), PMSQE1 (formula (10) without LogMSE), MSE, SI-SNR, and APC-SNR on the test data. We decomposed PMSQE into PMSQE1 and LogMSE because their value ranges are quite different; if combined, the correlation would be dominated by LogMSE. Table 1 shows the absolute values of the correlation coefficients between the indexes. It can be observed that the overall correlation between our proposed method and the other indexes is better than that of the other methods. We additionally tested the MSE after compression (APC-MSE). From the perspective of correlation, APC-MSE is better than MSE, which also demonstrates the effectiveness of the auditory perception compression.
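The construction of Table 1 amounts to a pairwise correlation matrix over per-utterance metric scores, which can be sketched as follows (the metric names are placeholders).

```python
import numpy as np

def abs_correlation_matrix(scores):
    """Absolute pairwise Pearson correlation between metric score vectors.

    `scores` maps a metric name to a 1-D array of per-utterance values; the
    result mirrors the construction of Table 1.
    """
    names = list(scores)
    mat = np.abs(np.corrcoef(np.stack([scores[n] for n in names])))
    return names, mat
```

Taking the absolute value matters because losses are anti-correlated with quality scores: a strongly negative Pearson coefficient is just as useful as a strongly positive one.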
We also plotted the distribution of each loss against PESQ in Fig. 3. The larger the value of a loss function, the smaller the PESQ. It can be seen that STOI, PMSQE1, and LogMSE become insensitive as PESQ decreases, while MSE and APC-MSE are insensitive on both sides. We speculate that such insensitivity misleads the model at the corresponding speech qualities. For SI-SNR, although the variance is large, it has a more stable downward trend as PESQ increases, and the tail of the insensitive area is smaller. This also explains why SI-SNR can achieve good results although it converges slowly. The proposed APC-SNR has a smaller overall variance, and the insensitivity at the head and tail is also weakened. The proposed loss function is thus more sensitive to speech quality, and its small variance can make the model converge faster, making it more suitable for model training.
4.4 Evaluation scores for experiments
Although many indexes are listed in Table 1, only PESQ, STOI, and SI-SNR are widely used in model performance evaluation [16, 4, 31]. PESQ introduces non-differentiable auditory effects and considers numerical differences in the loudness spectra, so it can accurately measure energy values within the range of human hearing. PMSQE is only a differentiable, suboptimal mixture approximating PESQ together with LogMSE. STOI measures correlation on octave spectra, so it is more sensitive to delay between the spectra. SI-SNR is a reasonable measure of SNR and can measure both magnitude and phase. MSE and LogMSE are the most traditional distance metrics; they neither accord with human hearing nor provide a reasonable expression of SNR. So MSE and LogMSE cannot reasonably measure the performance of the model, but a comprehensive consideration of PESQ, STOI, and SI-SNR can.
4.5 Hyperparameter analysis
Based on the model in Section 4.1 and the data in Section 4.2, we analyzed the hyperparameters $\varepsilon$ and $\tau$ in formulas (17) and (18).

As shown in Fig. 4, we first fixed $\tau$ and tested several values of $\varepsilon$. It can be seen from the graph that PESQ and STOI are best at an intermediate value of $\varepsilon$, with SI-SNR only slightly below its best. On the whole, the three indexes show a concave distribution. The reason is that $\varepsilon$ controls the lower bound of the compression in formula (17): the maximum difference of the compression ratio increases as $\varepsilon$ decreases. If the difference is too large, the model easily overfits; if it is too small, the compression effect is weakened, so an intermediate value of $\varepsilon$ is suitable. Similarly, we fixed $\varepsilon$ and conducted experiments on $\tau$, and found an intermediate setting of $\tau$ appropriate. The reason is that the exponent in formula (17), about $(\gamma-1)/2$, is negative: if $\tau$ is set too small, $g$ becomes very large in regions with energy less than $\tau$, which distracts the model. A suitable $\tau$ controls the scaling while preserving the exponential scaling effect.

It can be seen from the above experiments that $\varepsilon$ and $\tau$ control the lower and upper bounds of the scaling respectively, and the two can be adjusted according to the situation.
4.6 Experimental results and discussion
We trained models with STOI, PMSQE1 (without LogMSE), PMSQE (10), MSE, SI-SNR, and APC-SNR respectively. The results are shown in Table 2. Because PESQ, STOI, and SI-SNR have different value ranges, we take the average of the standardized values of each index as the comprehensive index (CI), defined as,

$$\mathrm{CI}=\frac{1}{|\mathcal{M}|}\sum_{m\in\mathcal{M}}\frac{v_m-\mu_m}{\sigma_m},\qquad \mathcal{M}=\{\mathrm{PESQ},\ \mathrm{STOI},\ \mathrm{SI\text{-}SNR}\} \tag{20}$$

where $\mu_m$ and $\sigma_m$ represent the mean and standard deviation of the corresponding index (computed over the column elements in Table 2), and $v_m$ represents the value of the corresponding index for each method.
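Formula (20) is a per-column z-score averaged per row, which the following sketch makes explicit (variable names are ours).

```python
import numpy as np

def comprehensive_index(table):
    """Comprehensive index of formula (20).

    `table` is an (n_methods, n_metrics) array of PESQ/STOI/SI-SNR scores.
    Each column is standardized over the methods (mean and population
    standard deviation taken column-wise, as in Table 2), and the z-scores
    are averaged per row.
    """
    z = (table - table.mean(axis=0)) / table.std(axis=0)
    return z.mean(axis=1)
```

A useful sanity check follows directly from the definition: since z-scores sum to zero within each column, the CI values of all methods must also sum to (approximately) zero, which matches Table 2 up to rounding.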
As shown in Table 2, PMSQE1 can greatly improve the performance on PESQ, but the performance on STOI and SI-SNR is weakened. We speculate that the gradient inhomogeneity and blur problems mentioned in Section 3 make the model tend to optimize the parts favorable to PESQ and ignore the structural characteristics of the signal itself. STOI converges very slowly, and the results show that the model trained with it only inclines toward STOI. Because PMSQE1 and STOI only focus on magnitude, they perform poorly on SI-SNR. After adding LogMSE to PMSQE1, the effect on the PESQ score is weakened, while the other scores improve. The MSE and SI-SNR methods perform well, which suggests that the conversion to the auditory domain and the introduction of a large number of auditory constants are not conducive to model training. Our method shows better comprehensive performance (CI = 0.570) than the other referenced loss functions. The proposed loss function properly introduces the auditory compression effect without auditory spectral mapping, and endows the measurement of Section 3.1, which only focuses on the phase, with magnitude differences. It can improve the auditory quality (PESQ and STOI) while preserving the original signal structure (SI-SNR), so that the comprehensive performance is improved.

Besides, combining APC-SNR with PMSQE1 achieves even better results. We speculate that APC-SNR, like LogMSE, acts as a linear loss function while PMSQE1 is a nonlinear one; since our method is much better than LogMSE, it makes up for the shortcomings of PMSQE1, so the combination obtains very good results (CI = 0.838).
Table 2. Evaluation scores of models trained with different loss functions.

                 PESQ    STOI    SI-SNR   CI
PMSQE1           2.819   0.915   6.793    -1.067
STOI             2.422   0.942   14.932   -0.337
PMSQE            2.609   0.928   15.538   -0.310
MSE              2.593   0.934   17.098   -0.010
SI-SNR           2.638   0.937   17.482   0.295
APC-SNR          2.718   0.939   17.532   0.570
PMSQE1+APC-SNR   2.794   0.940   17.638   0.838
5 Conclusions
In this paper, we proposed a signal-to-noise ratio loss function based on auditory power compression (APC-SNR) for model training, which can improve the overall performance of the model across various indexes. The experimental results show that the comprehensive quality score of the model trained with our method is better than those of the referenced methods. The method also maintains a good correlation with the evaluation indexes, so it can be used not only to train models but also to evaluate speech quality. The method still needs improvement, such as adapting the auditory power exponents to different frequency resolutions and to the different mask application methods of models; more effort is needed to delve into these solutions.
References

[1] (2019) Non-intrusive speech quality assessment using neural networks. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[2] (2016) Pysox: leveraging the audio signal processing power of SoX in Python. In Proceedings of the International Society for Music Information Retrieval Conference, Late Breaking and Demo Papers.
[3] (2020) Data augmentation and loss normalization for deep noise suppression. In International Conference on Speech and Computer, pp. 79–86.
[4] (2021) Real-time denoising and dereverberation with tiny recurrent U-Net. arXiv preprint arXiv:2102.03207.
[5] (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.
[6] (2019) Learning with learned loss function: speech enhancement with Quality-Net to improve perceptual evaluation of speech quality. IEEE Signal Processing Letters.
[7] (2019) Intrusive and non-intrusive perceptual speech quality assessment using a convolutional neural network. In 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 85–89.
[8] (1995) The influence of the sigmoid function parameters on the speed of backpropagation learning. In Proceedings of the International Workshop on Artificial Neural Networks: From Natural to Artificial Neural Computation, pp. 195–201.
[9] (2020) DCCRN: deep complex convolution recurrent network for phase-aware speech enhancement. arXiv preprint arXiv:2008.00264.
[10] (2016) Single-channel multi-speaker separation using deep clustering. In Interspeech 2016.
[11] (1986) Speech enhancement. In ICASSP '86. IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 11, pp. 3135–3142.
[12] (2016) An algorithm for predicting the intelligibility of speech masked by modulated noise maskers. IEEE/ACM Transactions on Audio, Speech, and Language Processing 24 (11), pp. 2009–2022.
[13] (2018) A deep learning loss function based on the perceptual evaluation of the speech quality. IEEE Signal Processing Letters.
[14] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[15] (2020) Monaural speech enhancement with recursive learning in the time domain. arXiv preprint arXiv:2003.09815.
[16] (2021) Glance and gaze: a collaborative learning framework for single-channel speech enhancement. arXiv preprint arXiv:2106.11789.
[17] (2021) DCCRN+: channel-wise subband DCCRN with SNR estimation for speech enhancement. arXiv preprint arXiv:2106.08672.
[18] (2012) State-space frequency-domain adaptive filtering for nonlinear acoustic echo cancellation. IEEE Transactions on Audio, Speech, and Language Processing 20 (7), pp. 2065–2079.
[19] (2009) Pearson correlation coefficient. Springer Vienna, pp. 132–132.
[20] (2010) Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning, pp. 807–814.
[21] (2015) ESC: dataset for environmental sound classification. In Proceedings of the 23rd Annual ACM Conference on Multimedia, pp. 1015–1018.
[22] (2021) Interspeech 2021 deep noise suppression challenge. arXiv preprint arXiv:2101.01902.
[23] (2020) DNSMOS: a non-intrusive perceptual objective speech quality metric to evaluate noise suppressors. arXiv preprint arXiv:2010.15258.
[24] (2019) SDR – half-baked or well done? In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[25] (2017) Multiple-target deep learning for LSTM-RNN based speech enhancement. In Hands-free Speech Communications and Microphone Arrays.
[26] (2010) A short-time objective intelligibility measure for time-frequency weighted noisy speech. In 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 4214–4217.
[27] (2011) An algorithm for intelligibility prediction of time-frequency weighted noisy speech. IEEE Transactions on Audio, Speech, and Language Processing 19 (7), pp. 2125–2136.
[28] (2001) Perceptual evaluation of speech quality (PESQ): an objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs. ITU-T Recommendation P.862.
[29] (2001) Single channel speech enhancement using MDL-based subspace approach in Bark domain. In 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 1, pp. 641–644.
[30] (2019) Differentiable consistency constraints for improved deep speech enhancement. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 900–904.
[31] (2021) Monaural speech enhancement with complex convolutional block attention module and joint time frequency losses. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[32] (2021) Complex spectral mapping with attention based convolution recurrent neural network for speech enhancement. arXiv preprint arXiv:2104.05267.
[33] (1967) Das Ohr als Nachrichtenempfänger: Monographien der elektrischen Nachrichtentechnik.