
A Deep Learning Loss Function based on Auditory Power Compression for Speech Enhancement

08/26/2021
by   Tianrui Wang, et al.

Deep learning technology has been widely applied to speech enhancement. While testing the effectiveness of various network structures, researchers are also exploring improvements to the loss function used in network training. Although existing methods have considered the auditory characteristics of speech or a reasonable expression of the signal-to-noise ratio, their correlation with auditory evaluation scores and their suitability for gradient optimization still need to be improved. In this paper, a signal-to-noise ratio loss function based on auditory power compression is proposed. The experimental results show that the overall correlation between the proposed function and the objective speech intelligibility indexes is better than that of other loss functions. For the same speech enhancement model, training with this loss function also gives better results than the other methods compared.


1 Introduction

Speech enhancement aims to improve the quality and comfort of listening while improving speech intelligibility [11]. In recent years, deep learning has been widely used in speech enhancement. In addition to exploring various deep network structures to build effective enhancement models, related research also involves the construction and optimization of the loss functions used to guide model training. In general, a loss function is constructed from a distance measure between the predicted speech and the reference. Common choices are the mean square error (MSE) [25] and the mean absolute error (MAE) [15], which are very convenient to calculate. [10] proposed the scale-invariant signal-to-noise ratio (SI-SNR), and [24] proposed the scale-dependent signal-to-distortion ratio (SD-SDR). Loss functions of this kind directly use the distance between the enhanced waveform and the reference waveform to evaluate the enhanced speech quality, and there are many successful applications [9, 17]. However, such objective distances differ from the subjective auditory distance, so it is necessary to introduce auditory effects into the distance measure. The basic idea is to simulate the perceptual characteristics of the human ear with respect to frequency and loudness [29], and to implement nonlinear warping of frequency and compression of intensity [18]. Jensen and Taal proposed the extended short-time objective intelligibility (ESTOI) [12] based on STOI [27, 26]; it is obtained by calculating subband spectral envelope correlation coefficients. [13] introduced the perceptual evaluation of speech quality (PESQ) [28] and combined it with the MSE of the logarithmic power spectra to form the perceptual metric for speech quality evaluation (PMSQE) loss function (to distinguish the MSE in [13] from the ordinary MSE, we abbreviate it as LogMSE). [30, 32] use a power exponent to compress the value of each component in their loss functions.
In addition, [7, 23] trained an exclusive network for speech quality evaluation, and in [6], the evaluation network is connected to the tail of the speech enhancement model to guide the training of the model.

Because large amounts of data must be tested, subjective evaluation consumes too many resources. Several objective indexes, such as PESQ provided by ITU-T P.862 [28], STOI, and SI-SNR, have become the main means of measuring speech quality. By analyzing loss functions such as MSE, PMSQE, SI-SNR, and STOI, we find that the correlation coefficients [19] between these loss functions and PESQ are generally lower than

. During training, PMSQE and STOI cannot account for phase information because they are measured on the magnitude spectra, and too much empirical auditory processing biases the model toward the corresponding index (PESQ or STOI). Since SI-SNR is measured on the time-domain waveform, magnitude and phase are both taken into account indirectly, and it introduces a reasonable expression of the signal-to-noise ratio (SNR), which makes its effect better and more stable. However, as the performance of the model improves, the guiding effect of the SI-SNR loss function deteriorates. Therefore, a signal-to-noise ratio loss function based on auditory power compression (APC-SNR) is proposed. Firstly, the auditory power exponents used in PESQ are referenced and mapped back to the power spectrum. Secondly, the exponent operation on the power spectrum is converted into a proportional factor in the time-frequency spectrum, and the proportional factor is controlled by hyper-parameters before it is applied to the signal. Finally, we follow the SNR representation of the time-domain SI-SNR to calculate the loss function in the compressed time-frequency domain (tf-domain). The experimental results show that this method correlates well with most speech quality evaluation indexes and improves the comprehensive performance of the model.

The rest of this paper is organized as follows: Section 2 introduces the related algorithms. Section 3 describes the auditory power compression loss function. Section 4 presents the experiments and result analysis. Section 5 summarizes the work.

(a) SI-SNR in the time-domain (b) SI-SNR in the tf-domain
Figure 1: SI-SNR in the different domains. and represent noisy and noise waveform respectively. and represent real and imaginary axis of tf-domain respectively.

2 Related work

2.1 Scale-invariant signal-to-noise ratio

The scale-invariant signal-to-noise ratio (SI-SNR) [10] is a reasonable expression of SNR. Because it is measured in the time-domain, it takes both magnitude and phase into account, as shown in Fig. 1(a). SI-SNR is therefore a commonly used loss function, defined as,

(1)

where and represent the referenced and the enhanced signal respectively. represents the energy of the signal.
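As a rough illustration of the time-domain measure, the sketch below implements the commonly used zero-mean projection form of SI-SNR with NumPy. The function name `si_snr`, the stabilizer `eps`, and the synthetic signals are assumptions made for this example, and the exact notation of formula (1) may differ.

```python
import numpy as np

def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB between two 1-D waveforms (common zero-mean form)."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to obtain the target component.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    noise = estimate - target
    # Ratio of target energy to residual energy, expressed in dB.
    return 10.0 * np.log10((np.dot(target, target) + eps) / (np.dot(noise, noise) + eps))

# Synthetic stand-ins for a reference signal and an enhanced signal.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)
enhanced = clean + 0.1 * rng.standard_normal(16000)

print(si_snr(enhanced, clean))        # roughly 20 dB for this noise level
print(si_snr(3.0 * enhanced, clean))  # rescaling the estimate leaves the value unchanged
```

When used as a training loss, the negative of this value is typically minimized.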

2.2 Perceptual metric for speech quality evaluation

When calculating PMSQE [13] or PESQ [28], the first step is level alignment. PESQ aligns the enhanced signal to the label (reference). However, since the influence of the label cannot be artificially introduced during training, PMSQE aligns the label to the enhanced signal:

(2)

where is the t-th frame of the power spectrum. is a spectral weighting mask which replicates the band-pass filtering and is a power correction factor which accounts for the frame length, overlapping and windowing applied during the spectral computation (via STFT). Then the power spectrum is mapped to the Bark spectrum by transformation matrix ,

(3)

where , , , and are the dimensions of the Bark spectrum and power spectrum. Then signals are transferred to loudness spectra as,

(4)

where is the scaling factor, and is the absolute auditory threshold for the q-th Bark band. Each Bark band has a corresponding invariable [33]. Then, PESQ and PMSQE measure the difference between the enhanced and referenced signals in the loudness spectra. PMSQE simplifies the computation of the symmetrical disturbance vector proposed in PESQ by applying a center-clipping operator over the absolute difference between the loudness spectra as,

(5)

where and represent the referenced and enhanced loudness spectra respectively. , , and are element-wise operation. Then the asymmetric disturbance vector can be obtained as,

(6)

where represents element-wise multiplication, elements of are computed from the Bark spectra of referenced and enhanced as follows,

(7)

and are 50 and 1.2. Symmetric disturbance and asymmetric disturbance measurement are calculated as follows,

(8)
(9)

where is a vector filled with weights proportional to the width of the Bark band, obtained from [28].

Finally, PMSQE introduces LogMSE, because the resulting PESQ includes highly nonlinear and non-fully differentiable operators which can lead to gradient misguidance if applied alone [13]. The loss is defined as,

(10)

where and are the reference and enhanced power spectra respectively. is the standard deviation of the log-power spectrum. and are weighting factors. is the number of frames.

3 Proposed method

3.1 Scale-invariant signal-to-noise ratio in time-frequency spectrum

The signal-to-noise ratio expression of the time-domain SI-SNR is migrated to the time-frequency spectrum:

(11)

where and represent the time-frequency spectra of the enhanced and referenced signals respectively. As shown in Fig. 1(b), unlike in the time-domain, the tf-domain SI-SNR can only focus on the phase, so its effect is far worse than that of SI-SNR in the time-domain. We therefore introduce a scaling calculated from the magnitude to compensate for this magnitude insensitivity. In the time-frequency spectrum, the degree to which the loss function focuses on magnitude and phase is more controllable. It is also more conducive to gradient transfer for tf-domain models.
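The magnitude insensitivity described above can be illustrated with a small NumPy sketch. It assumes that the scale-invariant projection is applied to each time-frequency bin separately, with the bin treated as a two-dimensional vector of its real and imaginary parts; the exact form of formula (11) may differ, and the function name `si_snr_per_bin` and the test values are hypothetical.

```python
import numpy as np

def si_snr_per_bin(est_spec, ref_spec, eps=1e-12):
    """Per-bin scale-invariant SNR (dB), treating each complex bin as a 2-D (real, imag) vector."""
    # Real inner product of the two 2-D vectors in every bin.
    inner = np.real(est_spec * np.conj(ref_spec))
    target = (inner / (np.abs(ref_spec) ** 2 + eps)) * ref_spec
    noise = est_spec - target
    return 10.0 * np.log10((np.abs(target) ** 2 + eps) / (np.abs(noise) ** 2 + eps))

rng = np.random.default_rng(0)
ref = rng.standard_normal((4, 257)) + 1j * rng.standard_normal((4, 257))
phase_error = np.exp(1j * 0.3)  # the same 0.3 rad phase error in every bin

# Two estimates with identical phase error but very different magnitudes.
loud = si_snr_per_bin(2.5 * ref * phase_error, ref)
quiet = si_snr_per_bin(0.4 * ref * phase_error, ref)
print(np.allclose(loud, quiet))  # compares the two cases bin by bin
```

Under this assumption the per-bin value reduces to 10·log10(cos²Δ / sin²Δ) for a phase error Δ, so a magnitude error alone does not change it, which is why an additional magnitude-dependent scaling is needed.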

3.2 Auditory power compression

The auditory loudness spectrum (4) is the core of PMSQE and PESQ, so we conducted an in-depth analysis of it.

(12)

With formula (4) expanded, the abstract formula becomes easier to explain. The loudness conversion mainly calculates the distance between the power of each band and the absolute auditory threshold under the action of in the Bark spectra.

Since we need to use it in the loss function, we need to consider its gradient,

(13)

where is a constant in . is a constant in . It can be observed that the maximum ratio between different can reach 200 million. With an exponent of about , this easily produces a large deviation of gradients between different subbands. We speculate that this difference in gradients, together with the masking effect (5), can easily make the model unbalanced in training, so that the optimizer focuses only on the parts beneficial to the PESQ score. Using the Zwicker exponents without the absolute auditory thresholds can still endow different subbands with distinct auditory characteristics. So we simplify the loudness expression (12) to,

(14)

where is set to prevent the base from being zero.

The second problem is the gradient blur. Let’s return to the formula (3), where the matrix is a sparse zero-one matrix. It aims to calculate the sum of the specified subbands and obtain the corresponding subband values of the Bark spectrum. It will cause the gradient of some (up to 25) subbands to become the same when the gradient is propagated back, resulting in gradient blur.

So we map the Zwicker auditory effect exponentials to the power spectrum according to the correspondence between the Bark bands and power bands [33]. Then we change the Bark spectrum in formula (14) to the power spectrum , and the auditory expression is changed to,

(15)

where and represent the real and imaginary parts of the time-frequency spectrum respectively.

To retain the phase, which is absent from the power spectrum, we convert the exponential operation on the power spectrum into a proportional scaling of the real and imaginary parts of the time-frequency spectrum. The auditory expression is changed to ,

(16)

where and represent the real and imaginary part of auditory expression . is a scaling factor computed by,

(17)

To prevent the scaling difference between bins from becoming too large, a limiting threshold is set to change values less than to ,

(18)

Finally, we measure the loss by formula (11) in compressed spectra . A signal-to-noise ratio loss function based on auditory power compression (APC-SNR) is obtained,

(19)

where and represent enhanced and referenced spectrum, and their is calculated from their energy spectra respectively.
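The idea of turning a power-spectrum exponent into a scaling of the real and imaginary parts can be sketched as follows. This is only an illustration under stated assumptions: a single exponent `gamma = 0.23` is used for all bins (the text maps a per-band Zwicker exponent to each bin), `eps` is a hypothetical stabilizer, and the hyper-parameter control of formula (17) and the limiting threshold of formula (18) are omitted.

```python
import numpy as np

def compress_spectrum(spec, gamma=0.23, eps=1e-8):
    """Scale real and imaginary parts so the power is compressed to about power ** gamma."""
    power = spec.real ** 2 + spec.imag ** 2
    # (scale*R)^2 + (scale*I)^2 = scale^2 * power, hence scale = power ** ((gamma - 1) / 2).
    scale = (power + eps) ** ((gamma - 1.0) / 2.0)
    # Multiplying by a positive real factor changes the magnitude and keeps the phase.
    return scale * spec

rng = np.random.default_rng(0)
spec = rng.standard_normal((4, 257)) + 1j * rng.standard_normal((4, 257))
compressed = compress_spectrum(spec)

print(np.allclose(np.abs(compressed) ** 2, (np.abs(spec) ** 2) ** 0.23, rtol=1e-4))
print(np.allclose(np.angle(compressed), np.angle(spec)))
```

A scale-invariant SNR such as the one in Section 3.1 could then be evaluated on the compressed enhanced and referenced spectra, which is the structure the APC-SNR description follows.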

4 Experiments

4.1 Model design and training setup

The model used in our experiment is similar to those in [30, 3]. The overall framework is shown in Fig. 2.

We used the short-time Fourier transform (STFT) and its inverse (iSTFT) with a window length of 512, a frame shift of 256, and a Hanning window. The magnitude spectrum was taken as the input. The model is mainly composed of fully-connected (FC) layers and gated recurrent units (GRUs) [5]. The rectified linear unit (ReLU) [20] is used as the activation except for the last layer, where a sigmoid activation [8] is used to predict the gain mask. The result is obtained by multiplying the noisy time-frequency spectrum by the gain mask.
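The analysis and masking steps can be sketched with NumPy as below. The window length (512), frame shift (256), and Hann window follow the text; the absence of padding, the helper name `stft`, the synthetic input, and the random logits standing in for the network output are assumptions for illustration only.

```python
import numpy as np

def stft(signal, n_fft=512, hop=256):
    """Frame a 1-D signal with a Hann window and return the one-sided spectrum per frame."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=-1)  # shape: (n_frames, n_fft // 2 + 1)

rng = np.random.default_rng(0)
noisy = rng.standard_normal(16000)  # synthetic stand-in for a noisy waveform
noisy_spec = stft(noisy)

# Stand-in for the model output: a sigmoid maps arbitrary logits to a gain in (0, 1).
logits = rng.standard_normal(noisy_spec.shape)
gain_mask = 1.0 / (1.0 + np.exp(-logits))

# A real-valued gain changes the magnitude of each bin and keeps the noisy phase.
enhanced_spec = gain_mask * noisy_spec
print(noisy_spec.shape)  # (61, 257)
```

Because the mask is real-valued and bounded by the sigmoid, it can only attenuate each bin; the phase of the enhanced spectrum is inherited from the noisy input.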

Figure 2: The framework of model and loss function.

The calculation of the loss function is divided into cases A and B. If the loss function is calculated in the time-frequency spectrum, operation A is performed. If the loss function is calculated in the time-domain, operation B is performed. The parameters of the model are optimized by the Adam method [14]. The learning rate was initialized to , which was decayed to when the loss on the validation set plateaued for 5 epochs. The training was stopped if the loss plateaued for 20 epochs.

4.2 Data

We used the speech and noise data in DNS-Challenge [22] to generate a total of hours noisy speech. The SNR was between dB and dB. We shifted pitch of speech by pysox [2], and the shifted value was between and semitones. Data was then divided into training and validation set at .

The SNR of the test data was between dB and dB, and the pitch shift was also between and semitones. The speech data in the test set did not appear in the training or validation set, and the noises were from the ESC-50 [21] dataset. A total of 14,000 10-second audio clips were generated for testing.

           PESQ   STOI   PMSQE1   LogMSE   MSE    SI-SNR
PESQ       1      0.71   0.87     0.65     0.40   0.88
STOI       0.71   1      0.86     0.71     0.47   0.80
PMSQE1     0.87   0.86   1        0.79     0.49   0.87
LogMSE     0.65   0.71   0.79     1        0.43   0.68
MSE        0.40   0.47   0.49     0.43     1      0.51
APC-MSE    0.50   0.60   0.61     0.60     0.94   0.59
SI-SNR     0.88   0.80   0.87     0.68     0.51   1
APC-SNR    0.91   0.83   0.91     0.74     0.52   0.99
Table 1: The correlation coefficient between evaluation indexes.

Figure 3: Distribution of each loss with PESQ.

4.3 Correlation analysis

The Pearson correlation coefficient is often used to measure the correlation between indexes [1]. We calculated PESQ, STOI, LogMSE ( in formula (10) ), PMSQE1 (formula (10) without LogMSE), MSE, SI-SNR, and APC-SNR on the test data. We decomposed PMSQE into PMSQE1 and LogMSE because their value ranges are quite different; if combined, the correlation would be dominated by LogMSE. Table 1 shows the absolute values of the correlation coefficients between indexes. It can be observed that the overall correlation between our proposed method and the other indexes is better than that of the other methods. We also tested the MSE after compression (APC-MSE). From the perspective of correlation, APC-MSE is better than MSE, which also supports the effectiveness of auditory power compression.
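For reference, the absolute Pearson correlation between two indexes can be computed as in the short sketch below. The helper name `abs_pearson` and the five score pairs are synthetic placeholders for illustration, not values from the experiments.

```python
import numpy as np

def abs_pearson(x, y):
    """Absolute value of the Pearson correlation coefficient between two 1-D arrays."""
    return abs(np.corrcoef(x, y)[0, 1])

# Synthetic per-utterance values: a loss that falls as the PESQ score rises.
pesq_scores = np.array([1.8, 2.1, 2.6, 3.0, 3.4])
loss_values = np.array([-4.0, -7.5, -11.0, -15.2, -18.9])

print(abs_pearson(pesq_scores, loss_values))
```

Taking the absolute value makes a loss (lower is better) and a quality score (higher is better) directly comparable, which matches how Table 1 is reported.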

We also plotted the distribution of each loss against PESQ in Fig. 3. The larger the value of the loss function, the smaller the value of PESQ. It can be seen that STOI, PMSQE1, and LogMSE become insensitive as PESQ decreases. MSE and APC-MSE are insensitive at both ends. We speculate that this insensitivity will mislead the model at the corresponding levels of speech quality. For SI-SNR, although the variance is large, it has a more stable downward trend as PESQ increases, and the insensitive region at the tail is smaller. This also explains why SI-SNR can achieve good results although it converges slowly. For the proposed APC-SNR, the overall variance is smaller, and the insensitivity at the head and tail is also weakened. It can be seen that the proposed loss function is more sensitive to speech quality, and the small variance can make the model converge faster. It is therefore more suitable for model training.

Figure 4: Model results with different values of and .

4.4 Evaluation scores for experiments

Although many indexes are listed in Table 1, only PESQ, STOI, and SI-SNR are widely used in model performance evaluation [16, 4, 31]. PESQ introduces non-differentiable auditory effects and considers numerical differences in the loudness spectra, so it can accurately measure the energy within the range of human hearing. PMSQE is only a differentiable, suboptimal mixture of PESQ and LogMSE. STOI measures correlation on octave spectra, so it is more sensitive to the delay between the spectra. SI-SNR is a reasonable measure of SNR and can measure both magnitude and phase. MSE and LogMSE are the most traditional distance metrics; they neither accord with human hearing nor provide a reasonable expression of SNR. Therefore MSE and LogMSE cannot reasonably measure the performance of the model, but a comprehensive consideration of PESQ, STOI, and SI-SNR can.

4.5 Hyper-parameters analysis

Based on the model and data in 4.1, we analyzed the hyper-parameters and in formula (17) and formula (18).

As shown in Fig. 4, we first set to and tested set to , , , , , , and respectively. It can be seen from the graph that PESQ and STOI with value of are the best, and SI-SNR is slightly lower than . On the whole, the three indexes follow a concave trend. The reason is that controls the lower bound of compression in formula (17), and is equivalent to 100 times the maximum compression difference when is set to . The maximum difference of compression ratio increases as decreases. If the difference is too large, it easily leads to over-fitting of the model. If it is too small, it weakens the compression effect, so is a suitable value. Similarly, we set to and conducted experiments on . The experiment found that setting was appropriate. The reason is that the exponent of formula (17) is about . If is set to a small number, the in the region with energy less than will be very large, which will distract the model. So setting to can control scaling below and preserve the exponential scaling effect.

It can be seen from the above experiments that and are used to control the upper and lower bounds of respectively. The two can be regulated according to different situations.

4.6 Experimental results and discussion

We trained models with STOI, PMSQE1 (without LogMSE), PMSQE (10), MSE, SI-SNR, and APC-SNR respectively. The results are shown in Table 2. Because the value ranges of PESQ, STOI, and SI-SNR differ, we take the average of the standardized values of each index as the comprehensive index (CI), which is defined as,

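\[ \mathrm{CI} = \frac{1}{3} \sum_{m \in \{\mathrm{PESQ},\, \mathrm{STOI},\, \mathrm{SI\text{-}SNR}\}} \frac{x_m - \mu_m}{\sigma_m} \tag{20} \]

where $m \in \{\mathrm{PESQ}, \mathrm{STOI}, \mathrm{SI\text{-}SNR}\}$. $\mu_m$ and $\sigma_m$ represent the mean and standard deviation of the corresponding index respectively (computed over the column elements in Table 2). $x_m$ represents the value of the corresponding index for each method.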

As shown in Table 2, PMSQE1 can greatly improve the performance on PESQ, but the performance on STOI and SI-SNR is weakened. We speculate that the gradient inhomogeneity and blur problems mentioned in Section 3 make the model tend to optimize the parts favorable to PESQ and ignore the structural characteristics of the signal itself. STOI converges very slowly, and the results show that the model trained with it leans only toward STOI. Because PMSQE1 and STOI only focus on magnitude, they perform poorly in SI-SNR. After adding LogMSE to PMSQE1, the effect on the PESQ score is weakened, while the other scores improve. The MSE and SI-SNR methods perform well. This suggests that conversion to the auditory domain and the introduction of a large number of auditory constants are not conducive to model training. Our method shows better comprehensive performance (CI=0.570) than the other referenced loss functions. The proposed loss function introduces the auditory compression effect without auditory spectral mapping, and gives the measure in Section 3.1, which only focuses on phase, sensitivity to magnitude differences. It can improve the auditory quality (PESQ and STOI) while preserving the original signal structure (SI-SNR), so that the comprehensive performance is improved.

Besides, we also combined APC-SNR with PMSQE1 and achieved better results. We speculate that APC-SNR is a linear loss function similar to LogMSE, while PMSQE1 is a nonlinear loss function. Because our method performs much better than LogMSE, it makes up for the shortcomings of PMSQE1, so the combination achieves very good results (CI=0.838).

                  PESQ    STOI    SI-SNR   CI
PMSQE1            2.819   0.915   6.793    -1.067
STOI              2.422   0.942   14.932   -0.337
PMSQE             2.609   0.928   15.538   -0.310
MSE               2.593   0.934   17.098   0.010
SI-SNR            2.638   0.937   17.482   0.295
APC-SNR           2.718   0.939   17.532   0.570
PMSQE1+APC-SNR    2.794   0.940   17.638   0.838
Table 2: Test results of models trained by each loss function
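The comprehensive index of formula (20) can be recomputed from the PESQ, STOI, and SI-SNR columns of Table 2 with the sketch below. It assumes that each column is standardized with its own mean and population standard deviation (`ddof=0`) before averaging; which standard-deviation convention was actually used is not stated, and the function name is hypothetical.

```python
import numpy as np

methods = ["PMSQE1", "STOI", "PMSQE", "MSE", "SI-SNR", "APC-SNR", "PMSQE1+APC-SNR"]

# Columns: PESQ, STOI, SI-SNR (values copied from Table 2).
scores = np.array([
    [2.819, 0.915, 6.793],
    [2.422, 0.942, 14.932],
    [2.609, 0.928, 15.538],
    [2.593, 0.934, 17.098],
    [2.638, 0.937, 17.482],
    [2.718, 0.939, 17.532],
    [2.794, 0.940, 17.638],
])
reported_ci = np.array([-1.067, -0.337, -0.310, 0.010, 0.295, 0.570, 0.838])

def comprehensive_index(table):
    """Average of the column-wise standardized scores for each method (row)."""
    mean = table.mean(axis=0)
    std = table.std(axis=0)  # population standard deviation: an assumption
    return ((table - mean) / std).mean(axis=1)

ci = comprehensive_index(scores)
for name, value, reported in zip(methods, ci, reported_ci):
    print(f"{name:15s} computed={value:+.3f} reported={reported:+.3f}")
```

Under this assumption the recomputed values agree with the CI column to within roughly 0.02, with the small differences presumably due to rounding of the tabulated scores.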

5 Conclusions

In this paper, we proposed a signal-to-noise ratio loss function based on auditory power compression (APC-SNR) for model training, which can improve the overall performance of the model under various indexes. The experimental results show that the comprehensive quality score of the model trained with our method is better than those of the other referenced loss functions. This method also maintains a good correlation with the evaluation indexes, so it can be used not only to train models but also to evaluate speech quality. However, the method still needs improvement, such as adapting the auditory power exponents to different frequency resolutions and to the different ways a model applies its mask. More effort is needed to explore solutions.

References

  • [1] A. R. Avila, H. Gamper, C. Reddy, R. Cutler, I. Tashev, and J. Gehrke (2019) Non-intrusive speech quality assessment using neural networks. ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Cited by: §4.3.
  • [2] R. Bittner, E. Humphrey, and J. Bello (2016) Pysox: leveraging the audio signal processing power of sox in python. In Proceedings of the International Society for Music Information Retrieval Conference Late Breaking and Demo Papers, Cited by: §4.2.
  • [3] S. Braun and I. Tashev (2020) Data augmentation and loss normalization for deep noise suppression. In International Conference on Speech and Computer, pp. 79–86. Cited by: §4.1.
  • [4] H. Choi, S. Park, J. H. Lee, H. Heo, D. Jeon, and K. Lee (2021) Real-time denoising and dereverberation with tiny recurrent u-net. arXiv preprint arXiv:2102.03207. Cited by: §4.4.
  • [5] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555. Cited by: §4.1.
  • [6] S. W. Fu, C. F. Liao, and Y. Tsao (2019) Learning with learned loss function: speech enhancement with quality-net to improve perceptual evaluation of speech quality. IEEE Signal Processing Letters PP (99), pp. 1–1. Cited by: §1.
  • [7] H. Gamper, C. K. Reddy, R. Cutler, I. J. Tashev, and J. Gehrke (2019) Intrusive and non-intrusive perceptual speech quality assessment using a convolutional neural network. In 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 85–89. Cited by: §1.
  • [8] J. Han and C. Moraga (1995) The influence of the sigmoid function parameters on the speed of backpropagation learning. In IWANN ’96 Proceedings of the International Workshop on Artificial Neural Networks: From Natural to Artificial Neural Computation, pp. 195–201. Cited by: §4.1.
  • [9] Y. Hu, Y. Liu, S. Lv, M. Xing, S. Zhang, Y. Fu, J. Wu, B. Zhang, and L. Xie (2020) Dccrn: deep complex convolution recurrent network for phase-aware speech enhancement. arXiv preprint arXiv:2008.00264. Cited by: §1.
  • [10] Y. Isik, J. L. Roux, Z. Chen, S. Watanabe, and J. R. Hershey (2016) Single-channel multi-speaker separation using deep clustering. In Interspeech 2016, Cited by: §1, §2.1.
  • [11] Jae Lim (1986) Speech enhancement. In ICASSP ’86. IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 11, pp. 3135–3142. External Links: Document Cited by: §1.
  • [12] J. Jensen and C. H. Taal (2016) An algorithm for predicting the intelligibility of speech masked by modulated noise maskers. IEEE/ACM Transactions on Audio, Speech, and Language Processing 24 (11), pp. 2009–2022. Cited by: §1.
  • [13] J. M. Martín-Doñas, A. M. Gomez, J. A. Gonzalez, and A. M. Peinado (2018) A deep learning loss function based on the perceptual evaluation of the speech quality. IEEE Signal Processing Letters. Cited by: §1, §2.2, §2.2.
  • [14] D. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. Computer Science. Cited by: §4.1.
  • [15] A. Li, C. Zheng, L. Cheng, R. Peng, and X. Li (2020) Monaural speech enhancement with recursive learning in the time domain. arXiv preprint arXiv:2003.09815. Cited by: §1.
  • [16] A. Li, C. Zheng, L. Zhang, and X. Li (2021) Glance and gaze: a collaborative learning framework for single-channel speech enhancement.. arXiv preprint arXiv:2106.11789. Cited by: §4.4.
  • [17] S. Lv, Y. Hu, S. Zhang, and L. Xie (2021) DCCRN+: channel-wise subband dccrn with snr estimation for speech enhancement. arXiv preprint arXiv:2106.08672. Cited by: §1.
  • [18] S. Malik and G. Enzner (2012) State-space frequency-domain adaptive filtering for nonlinear acoustic echo cancellation. IEEE Transactions on Audio, Speech, and Language Processing 20 (7), pp. 2065–2079. Cited by: §1.
  • [19] Nahler and Gerhard (2009) Pearson correlation coefficient. Springer Vienna 10.1007/978-3-211-89836-9 (Chapter 1025), pp. 132–132. Cited by: §1.
  • [20] V. Nair and G. E. Hinton (2010) Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning, pp. 807–814. Cited by: §4.1.
  • [21] K. J. Piczak (2015-10-13) ESC: Dataset for Environmental Sound Classification. In Proceedings of the 23rd Annual ACM Conference on Multimedia, pp. 1015–1018. External Links: Link, Document, ISBN 978-1-4503-3459-4 Cited by: §4.2.
  • [22] C. K. Reddy, H. Dubey, K. Koishida, A. Nair, V. Gopal, R. Cutler, S. Braun, H. Gamper, R. Aichner, and S. Srinivasan Interspeech 2021 deep noise suppression challenge. arXiv preprint arXiv:2101.01902. Cited by: §4.2.
  • [23] C. K. Reddy, V. Gopal, and R. Cutler (2020) DNSMOS: a non-intrusive perceptual objective speech quality metric to evaluate noise suppressors. arXiv preprint arXiv:2010.15258. Cited by: §1.
  • [24] J. L. Roux, S. Wisdom, H. Erdogan, and J. R. Hershey (2019) SDR – half-baked or well done?. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Cited by: §1.
  • [25] L. Sun, J. Du, L. R. Dai, and C. H. Lee (2017) Multiple-target deep learning for lstm-rnn based speech enhancement. In Hands-free Speech Communications & Microphone Arrays, Cited by: §1.
  • [26] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen (2010) A short-time objective intelligibility measure for time-frequency weighted noisy speech. In 2010 IEEE international conference on acoustics, speech and signal processing, pp. 4214–4217. Cited by: §1.
  • [27] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen (2011) An algorithm for intelligibility prediction of time–frequency weighted noisy speech. IEEE Transactions on Audio, Speech, and Language Processing 19 (7), pp. 2125–2136. Cited by: §1.
  • [28] I. T. Union (2001) Perceptual evaluation of speech quality (pesq) : an objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs. ITU-T Recommendation P. 862. Cited by: §1, §1, §2.2, §2.2.
  • [29] R. Vetter (2001) Single channel speech enhancement using mdl-based subspace approach in bark domain. In 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221), Vol. 1, pp. 641–644. Cited by: §1.
  • [30] S. Wisdom, J. R. Hershey, K. Wilson, J. Thorpe, M. Chinen, B. Patton, and R. A. Saurous (2019) Differentiable consistency constraints for improved deep speech enhancement. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 900–904. Cited by: §1, §4.1.
  • [31] S. Zhao, T. H. Nguyen, and B. Ma (2021) Monaural speech enhancement with complex convolutional block attention module and joint time frequency losses. ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Cited by: §4.4.
  • [32] L. Zhou, Y. Gao, Z. Wang, J. Li, and W. Zhang (2021) Complex spectral mapping with attention based convolution recurrent neural network for speech enhancement. arXiv preprint arXiv:2104.05267. Cited by: §1.
  • [33] E. Zwicker and R. Feldtkeller (1967) Das ohr als nachrichtenempfänger: monographien der elektrischen nachrichtentechnik. Cited by: §2.2, §3.2.