
End-to-End Multi-Task Denoising for the Joint Optimization of Perceptual Speech Metrics

by   Jaeyoung Kim, et al.

Although supervised learning based on deep neural networks has recently achieved substantial improvements in speech enhancement, existing schemes suffer from at least one of two critical issues: the spectrum mismatch or the metric mismatch. The spectrum mismatch is a well-known issue that any spectrum modification after the short-time Fourier transform (STFT), in general, cannot be fully recovered after the inverse short-time Fourier transform (ISTFT). The metric mismatch is that a conventional mean square error (MSE) loss function is typically sub-optimal for maximizing perceptual speech measures such as signal-to-distortion ratio (SDR), perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI). This paper presents a new end-to-end denoising framework. First, the network optimization is performed on the time-domain signals after ISTFT to avoid the spectrum mismatch. Second, three loss functions based on SDR, PESQ and STOI are proposed to minimize the metric mismatch. Experimental results show that the proposed denoising scheme significantly improves SDR, PESQ and STOI performance over existing methods.





1 Introduction

In recent years, deep neural networks have shown great success in speech enhancement compared with traditional statistical approaches. A neural network directly learns the complicated nonlinear mapping from noisy speech to clean speech purely from data, without any prior statistical assumptions.

Spectral mask estimation is a popular supervised denoising method that predicts a time-frequency mask used to scale the noisy spectrum into an estimate of clean speech. Numerous types of spectral mask estimation techniques exist, depending on how the mask labels are defined. For example, the authors in [5] proposed the ideal binary mask (IBM) as a training label, where each time-frequency bin is set to zero or one depending on the signal-to-noise ratio (SNR) of the noisy spectrum. The ideal ratio mask (IRM) [18] and the ideal amplitude mask (IAM) [2] provide non-binary soft mask labels to overcome the coarse label mapping of the IBM. The phase-sensitive mask (PSM) [2] additionally considers the phase difference between the clean and noisy spectra, in order to correctly maximize the SNR.

Generative models, such as generative adversarial networks (GANs), offer an alternative to supervised learning. In the speech enhancement GAN (SEGAN) [6], a generator network is trained to output a time-domain denoised signal that a discriminator cannot distinguish from a true clean signal. TF-GAN [12] extended SEGAN with a time-frequency mask.

However, all the schemes described above suffer from at least one of two critical issues: the metric mismatch or the spectrum mismatch. Signal-to-distortion ratio (SDR), perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI) are the most widely used perceptual speech metrics, and the mean square error (MSE) criterion popularly used for spectral mask estimation is not optimal for maximizing them. For example, decreasing the mean square error of a noisy speech signal can degrade SDR, PESQ or STOI because of the different frequency weightings and non-linear transforms involved in those metrics. The spectrum mismatch is a well-known issue: any modification of the spectrum after the short-time Fourier transform (STFT), in general, cannot be fully recovered after the inverse short-time Fourier transform (ISTFT). Therefore, spectral mask estimation and other schemes optimized in the spectrum domain always carry a potential risk of performance loss.
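The spectrum mismatch can be illustrated numerically. Below is a minimal numpy sketch (not the paper's implementation; the window, hop and random mask are arbitrary choices for illustration): a time-frequency mask is applied to an STFT, the result is inverted by overlap-add, and the STFT of the resynthesized signal no longer equals the masked spectrum, because the masked spectrogram is not a "consistent" STFT of any time-domain signal.

```python
import numpy as np

def stft(x, win, hop):
    """Frame the signal, window each frame, and take an rFFT per frame."""
    n = len(win)
    frames = [x[i:i + n] * win for i in range(0, len(x) - n + 1, hop)]
    return np.fft.rfft(np.array(frames), axis=1)

def istft(spec, win, hop, length):
    """Overlap-add synthesis with window-power normalization."""
    n = len(win)
    out = np.zeros(length)
    norm = np.zeros(length)
    for f, frame_spec in enumerate(spec):
        i = f * hop
        out[i:i + n] += np.fft.irfft(frame_spec, n=n) * win
        norm[i:i + n] += win ** 2
    return out / np.maximum(norm, 1e-12)

rng = np.random.default_rng(0)
x = rng.standard_normal(1024)
win = np.hanning(128)
hop = 64

spec = stft(x, win, hop)
# Per-bin spectral masking, keeping the noisy phase.
mask = rng.uniform(0.0, 1.0, size=spec.shape)
masked = mask * spec
y = istft(masked, win, hop, len(x))

# The STFT of the resynthesized signal differs from the masked spectrum:
# this residual is the spectrum mismatch.
err = np.abs(stft(y, win, hop) - masked).max()
```

Evaluating the loss on the time-domain signal after the ISTFT, as proposed here, sidesteps this gap.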

This paper presents a new end-to-end multi-task denoising scheme with the following contributions. First, the proposed framework presents three loss functions:

  • SDR loss function: The SDR metric is used as a loss function. The scale-invariant term in the SDR metric is incorporated as a part of training, which provided a significant SDR boost.

  • PESQ loss function: The PESQ metric is redesigned to be usable as a loss function.

  • STOI loss function: The proposed STOI loss function modifies the original STOI metric so that 16 kHz sampled acoustic signals can be used without resampling.

The proposed multi-task denoising scheme combines the loss functions above for the joint optimization of the SDR, PESQ and STOI metrics. Second, the denoising network still predicts a time-frequency mask, but the network optimization is performed after ISTFT in order to avoid the spectrum mismatch. The evaluation results show that the proposed framework provides large improvements in SDR, PESQ and STOI. Moreover, the proposed scheme also generalizes well to unseen metrics, as shown in Table 3.

2 The Proposed Framework

Figure 1: Illustration of End-to-End Multi-Task Denoising based on CNN-BLSTM

Figure 1 describes the proposed end-to-end multi-task denoising framework. The underlying model is composed of convolutional layers followed by a bi-directional LSTM. The input to the convolutional layers is a spectrogram formed by 11 frames of the noisy amplitude spectrum, processed with a kernel size of 5x5; here $b$ denotes an utterance index, $m$ a frame index and $k$ a frequency index. Dilated convolution with rates of 2 and 4 is applied to the second and third layers, respectively, in order to increase the kernel's coverage of frequency bins. Dilation is applied only along the frequency dimension because time correlation is learned by the bi-directional LSTM. Griffin-Lim ISTFT is applied to the synthesized complex spectrum to obtain the time-domain denoised output $\hat{x}$. The three proposed loss functions are evaluated on $\hat{x}$ and are therefore free from the spectrum mismatch.
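As a quick sanity check on the dilation choice, the frequency coverage of the stacked kernels can be computed in closed form (the kernel size 5 and dilation rates 1, 2 and 4 follow the text; stride 1 is an assumption):

```python
def receptive_field(kernel_sizes, dilations):
    """Receptive field, in frequency bins, of stacked stride-1 convolutions:
    each layer adds (kernel - 1) * dilation bins of coverage."""
    rf = 1
    for k, d in zip(kernel_sizes, dilations):
        rf += (k - 1) * d
    return rf

# Three 5-tap layers with frequency dilations 1, 2, 4: 1 + 4 + 8 + 16 = 29 bins.
rf = receptive_field([5, 5, 5], [1, 2, 4])
```

Three dilated layers therefore span 29 frequency bins, versus 13 bins for the same stack without dilation.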

2.1 SDR Loss Function

Unlike SNR, the definition of SDR is not unique; there are at least two popularly used definitions. The most well-known, Vincent's definition [17], is given by

$$\mathrm{SDR} = 10\log_{10}\frac{\|s_{\text{target}}\|^2}{\|e_{\text{noise}} + e_{\text{artif}}\|^2} \tag{1}$$

where $s_{\text{target}}$ and $e_{\text{noise}}$ can be found by projecting the denoised signal $\hat{x}$ onto the clean signal $x$ and the noise signal $n$, respectively, and $e_{\text{artif}}$ is a residual term. They can be formulated as follows:

$$s_{\text{target}} = \frac{\langle \hat{x}, x \rangle}{\|x\|^2}\,x \tag{2}$$

$$e_{\text{noise}} = \frac{\langle \hat{x}, n \rangle}{\|n\|^2}\,n \tag{3}$$

$$e_{\text{artif}} = \hat{x} - s_{\text{target}} - e_{\text{noise}} \tag{4}$$
Substituting Eq. (2), (3) and (4) into Eq. (1), the rearranged SDR is given by

$$\mathrm{SDR} = 10\log_{10}\frac{\|\alpha x\|^2}{\|\alpha x - \hat{x}\|^2} \tag{5}$$

where $\alpha = \langle \hat{x}, x \rangle / \|x\|^2$. Eq. (5) coincides with SI-SDR, another popularly used SDR definition [11]. In general multiple-source denoising problems, SDR and SI-SDR do not match; for the single-source denoising problem, however, they can be used interchangeably. The SDR loss function is defined as the negative mini-batch average of Eq. (5):

$$L_{\text{SDR}} = -\frac{1}{B}\sum_{b=1}^{B} 10\log_{10}\frac{\|\alpha_b x_b\|^2}{\|\alpha_b x_b - \hat{x}_b\|^2} \tag{6}$$

where $B$ is the number of utterances in a mini-batch.
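A minimal numpy sketch of the SI-SDR computation of Eq. (5) and its negative mini-batch average used as the loss (illustrative only; an actual training setup would use an autodiff framework, and the `eps` guard is an added numerical safeguard, not part of the metric):

```python
import numpy as np

def si_sdr(x, x_hat, eps=1e-8):
    """SI-SDR of Eq. (5): project x_hat onto the clean signal x to get the
    scaled target alpha*x, then compare it with the projection residual."""
    alpha = np.dot(x_hat, x) / (np.dot(x, x) + eps)
    target = alpha * x
    return 10.0 * np.log10(np.sum(target ** 2) /
                           (np.sum((target - x_hat) ** 2) + eps))

def sdr_loss(batch_x, batch_x_hat):
    """Negative mini-batch average of SI-SDR, i.e. the SDR loss function."""
    return -np.mean([si_sdr(x, xh) for x, xh in zip(batch_x, batch_x_hat)])
```

Because of the scale-invariant projection, rescaling the estimate leaves the metric unchanged, which is exactly the property the paper incorporates into training.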
2.2 PESQ Loss Function

Figure 2: Block Diagram for the PESQ loss function: $x$ and $\hat{x}$ are the clean and denoised time-domain signals.

PESQ [10] is defined by ITU-T recommendation P.862 for objective speech quality evaluation. Its value ranges from -0.5 to 4.5, and a higher PESQ means better perceptual quality. Although SDR and PESQ are highly correlated, it is frequently observed that a signal with lower SDR has higher PESQ than one with higher SDR.

Figure 2 shows the block diagram for computing the PESQ loss function $L_{\text{PESQ}}$. The overall procedure is similar to the PESQ system in [10], with three major modifications. First, the IIR filter in the PESQ calculation is removed, because back-propagation through its long time recursion is impractical. Second, the delay adjustment routine is unnecessary because the training data pairs are already time-aligned. Third, the bad-interval iteration is removed; as long as the clean and noisy data pairs are time-aligned, removing it has no substantial impact on PESQ. Each block in Figure 2 is briefly explained below:

Level Alignment: The average power of the clean and noisy speech in the 300 Hz to 3 kHz band is set to a predefined power level.

Bark Spectrum: The Bark spectrum block converts linear-scale frequency bins into the Bark scale, which is roughly logarithmic.

Time-Frequency Equalization: Each Bark spectrum bin of the clean speech is compensated by the average power ratio between the clean and noisy Bark spectra.

Loudness Mapping: The power densities are transformed to a Sone loudness scale using Zwicker's law [20]:

$$S = S_l \left(\frac{P_0}{0.5}\right)^{0.23}\left[\left(0.5 + 0.5\,\frac{P}{P_0}\right)^{0.23} - 1\right]$$

where $P$ is the Bark power density, $P_0$ is the absolute hearing threshold of the band and $S_l$ is a loudness scaling factor.
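A sketch of the loudness mapping (the exponent 0.23 follows the usual statement of Zwicker's power law; the scaling factor and per-band thresholds here are placeholders rather than the P.862 calibration constants):

```python
import numpy as np

def zwicker_loudness(P, P0, Sl=1.0, gamma=0.23):
    """Map Bark-spectrum power densities P to a Sone-like loudness scale
    via Zwicker's power law. P0 is the absolute hearing threshold per
    Bark band; Sl is a loudness scaling factor (placeholder value)."""
    ratio = (P0 / 0.5) ** gamma
    return Sl * ratio * ((0.5 + 0.5 * P / P0) ** gamma - 1.0)
```

Loudness is zero at the hearing threshold ($P = P_0$) and grows compressively above it, which is what makes the mapping non-linear and thus mismatched with a plain MSE objective.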

Disturbance Processing: The raw disturbance is the absolute difference between the clean and noisy loudness densities; symmetric and asymmetric disturbance metrics are generated from it by applying two different scaling factors.

Aggregation: The PESQ loss function is computed from the disturbance metrics as follows:

$$L_{\text{PESQ}} = \frac{1}{B}\sum_{b=1}^{B}\left(\alpha_1\, d^{\text{sym}}_b + \alpha_2\, d^{\text{asym}}_b\right)$$

where $d^{\text{sym}}_b$ and $d^{\text{asym}}_b$ are the symmetric and asymmetric disturbance metrics, respectively, and $\alpha_1$, $\alpha_2$ weight the two terms. The computation of $d^{\text{sym}}_b$ and $d^{\text{asym}}_b$ can be found in the PESQ C code [7].

2.3 STOI Loss Function

Figure 3: Block Diagram for the STOI loss function

STOI [13, 19] is a segment-correlation-based metric; its distinguishing design choice is a segment length optimized to provide the highest correlation with speech intelligibility. The experimental result in Figure 10 of [13] showed that STOI achieves the highest correlation for segment lengths of several hundred milliseconds; the analysis window length of 384 ms is chosen based on this criterion.

The proposed STOI loss function is depicted in Figure 3. The clean and denoised signal pairs are converted into complex spectra via STFT. After the linear frequency scale is converted into the 1/3-octave frequency scale, the noisy spectrum is normalized and clipped by the norm of the clean spectrum. The intelligibility measure $d_{b,m,j}$ is calculated as the correlation between the normalized clean and noisy spectra. Finally, the STOI loss is the negative average of $d_{b,m,j}$ over 1/3-octave bands, frames and mini-batch utterances:

$$L_{\text{STOI}} = -\frac{1}{BMJ}\sum_{b=1}^{B}\sum_{m=1}^{M}\sum_{j=1}^{J} d_{b,m,j}$$

where $B$ is the number of mini-batch utterances, $M$ is the number of frames, $J$ is the number of 1/3-octave bands and $d_{b,m,j}$ is the intelligibility measure at utterance $b$, frame $m$ and 1/3-octave band $j$, described in detail in Eq. (5) of [13].
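The per-band intelligibility measure can be sketched as follows, mirroring the structure of Eq. (5) in [13] (the clipping constant uses STOI's standard beta = -15 dB bound; the variable names and the eps guard are ours):

```python
import numpy as np

def stoi_band_measure(X, Y, beta_db=-15.0, eps=1e-8):
    """Intelligibility measure for one 1/3-octave band over one analysis
    window: the degraded envelope Y is normalized to the energy of the
    clean envelope X, clipped to bound the impact of severely degraded
    frames, then correlated with X."""
    alpha = np.sqrt(np.sum(X ** 2) / (np.sum(Y ** 2) + eps))
    clip = 1.0 + 10.0 ** (-beta_db / 20.0)
    Yp = np.minimum(alpha * Y, clip * X)   # normalization and clipping
    Xc = X - X.mean()
    Yc = Yp - Yp.mean()
    return np.sum(Xc * Yc) / (np.linalg.norm(Xc) * np.linalg.norm(Yc) + eps)
```

The energy normalization makes the measure insensitive to level differences, while the clipping keeps a few badly distorted frames from dominating the correlation.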

2.4 Joint optimization of perceptual speech metrics

In this section, several new loss functions are defined by combining the SDR, PESQ and STOI loss functions. The first jointly maximizes SDR and PESQ by combining $L_{\text{SDR}}$ and $L_{\text{PESQ}}$:

$$L_{\text{SDR-PESQ}} = L_{\text{SDR}} + \lambda_1 L_{\text{PESQ}}$$

where $\lambda_1$ is a hyper-parameter that adjusts the relative weighting between the SDR and PESQ loss functions.

Similarly, SDR and STOI can be jointly maximized by combining $L_{\text{SDR}}$ and $L_{\text{STOI}}$ as follows:

$$L_{\text{SDR-STOI}} = L_{\text{SDR}} + \lambda_2 L_{\text{STOI}}$$

where $\lambda_2$ is a hyper-parameter that adjusts the relative weighting between the SDR and STOI loss functions.

Finally, all three loss functions can be combined to jointly optimize the SDR, PESQ and STOI metrics as follows:

$$L_{\text{SDR-PESQ-STOI}} = L_{\text{SDR}} + \lambda_1 L_{\text{PESQ}} + \lambda_2 L_{\text{STOI}}$$
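The three combined objectives differ only in which weights are non-zero, which a small helper makes explicit (the weight values themselves are hyper-parameters not specified here):

```python
def multitask_loss(l_sdr, l_pesq, l_stoi, lam_pesq=0.0, lam_stoi=0.0):
    """Weighted combination of the three loss terms. With lam_stoi=0 this
    reduces to the SDR-PESQ loss; with lam_pesq=0, to the SDR-STOI loss;
    with both non-zero, to the SDR-PESQ-STOI loss."""
    return l_sdr + lam_pesq * l_pesq + lam_stoi * l_stoi
```

In training, the SDR term anchors the objective while the PESQ and STOI terms act as perceptually motivated regularizers.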
3 Experimental Results

Loss Type | SDR (dB): -10dB -5dB 0dB 5dB 10dB 15dB | PESQ: -10dB -5dB 0dB 5dB 10dB 15dB
Noisy Input -11.82 -7.33 -3.27 0.21 2.55 5.03 1.07 1.08 1.13 1.26 1.44 1.72
IAM -3.23 0.49 2.79 4.63 5.74 7.52 1.29 1.47 1.66 1.88 2.07 2.30
PSM -2.95 0.92 3.37 5.40 6.64 8.50 1.30 1.49 1.71 1.94 2.15 2.37
SDR -2.66 1.55 4.13 6.25 7.53 9.39 1.26 1.42 1.65 1.92 2.16 2.41
SDR-STOI -2.45 1.67 4.22 6.38 7.62 9.56 1.30 1.49 1.72 1.99 2.21 2.49
SDR-PESQ -2.31 1.80 4.36 6.51 7.79 9.65 1.43 1.65 1.89 2.16 2.35 2.54
SDR-PESQ-STOI -2.34 1.78 4.22 6.37 7.54 9.36 1.37 1.59 1.83 2.13 2.37 2.61
Table 1: SDR (left six columns) and PESQ (right six columns) results on QUT-NOISE-TIMIT. The test set consists of six SNR conditions: -10, -5, 0, 5, 10 and 15 dB.
Loss Type | STOI (%): -10dB -5dB 0dB 5dB 10dB 15dB
Noisy Input 51.9 60.4 68.6 76.9 82.4 86.9
SDR 58.5 70.8 79.0 84.6 87.3 89.6
SDR-PESQ 60.6 72.6 80.4 85.6 88.2 90.3
SDR-STOI 60.6 72.5 80.2 85.5 88.1 90.3
SDR-PESQ-STOI 60.8 73.0 80.6 85.7 88.3 90.3
Table 2: STOI results on QUT-NOISE-TIMIT

3.1 Experimental Settings

Two datasets were used to evaluate the proposed denoising framework. QUT-NOISE-TIMIT [1] is synthesized by mixing 5 different background noise sources with the TIMIT corpus [3]. For the training set, -5 and 5 dB SNR data were used, but the evaluation set contains all SNR ranges. The total lengths of the train and test data are 25 hours and 12 hours, respectively. For VoiceBank-DEMAND [15], 30 speakers selected from the Voice Bank corpus [16] were mixed with 10 noise types: 8 from the DEMAND dataset [14] and 2 artificially generated ones. The test set was generated with 5 noise types from DEMAND.

3.2 Main Result

Table 1 compares SDR and PESQ performance of different denoising methods on the QUT-NOISE-TIMIT corpus. All schemes are based on the same CNN-BLSTM model trained with -5 and +5 dB SNR data. IAM and PSM are existing spectral mask estimation schemes. SDR refers to $L_{\text{SDR}}$ in Section 2.1, and SDR-PESQ, SDR-STOI and SDR-PESQ-STOI correspond to $L_{\text{SDR-PESQ}}$, $L_{\text{SDR-STOI}}$ and $L_{\text{SDR-PESQ-STOI}}$ in Section 2.4, respectively.

$L_{\text{SDR}}$ outperformed IAM and PSM in SDR for all SNR ranges. However, it did not show a similar improvement in PESQ, due to the metric mismatch. The $L_{\text{SDR-PESQ}}$ loss function improved both the SDR and PESQ metrics: the PESQ loss term acted as a regularizer, improving not only PESQ but also SDR. The STOI loss function was not as effective as the PESQ loss function at improving the PESQ and SDR metrics; SDR-STOI and SDR-PESQ-STOI showed degradation on the SDR metric compared with SDR-PESQ.

Table 2 compares STOI performance among the different loss functions. The $L_{\text{SDR-STOI}}$ loss function showed a relative STOI gain over the SDR loss function. Notably, $L_{\text{SDR-PESQ}}$ showed a similar improvement on STOI, which is surprising: the model trained with $L_{\text{SDR-PESQ}}$ was not directly optimized for the STOI function, yet its generalization on the STOI metric is as good as that of the model trained with $L_{\text{SDR-STOI}}$. We further evaluated STOI performance by combining all three loss functions; however, $L_{\text{SDR-PESQ-STOI}}$ showed only marginal STOI improvements. Considering the results on the SDR and PESQ metrics, the $L_{\text{SDR-PESQ}}$ loss function showed the most consistent performance across all three metrics.

3.3 Comparison with Generative Models

Method CSIG CBAK COVL PESQ SSNR SDR
Noisy Input 3.37 2.49 2.66 1.99 2.17 8.68
SEGAN 3.48 2.94 2.80 2.16 7.73 -
WAVENET 3.62 3.23 2.98 - - -
TF-GAN 3.80 3.12 3.14 2.53 - -
SDR-PESQ (ours) 4.09 3.54 3.55 3.01 10.44 19.14
Table 3: Evaluation on VoiceBank-DEMAND corpus

Table 3 shows a comparison with generative models on VoiceBank-DEMAND. All results except our end-to-end model come from the original papers: SEGAN [6], WAVENET [9] and TF-GAN [12]. CSIG, CBAK and COVL are objective measures where a higher value means better speech quality [4]: CSIG is the mean opinion score (MOS) of signal distortion, CBAK is the MOS of background noise intrusiveness and COVL is the MOS of the overall effect. SSNR is the segmental SNR defined in [8].

The proposed SDR and PESQ joint optimization scheme outperformed all the generative models on all the perceptual speech metrics listed above. Notably, none of the metrics in Table 3 was used as a loss function, yet the proposed combined SDR and PESQ loss is highly effective on them, which suggests good generalization performance.

4 Conclusion

In this paper, a new end-to-end multi-task denoising scheme was proposed. The proposed scheme resolves the two issues described earlier: the spectrum and metric mismatches. The experimental results showed that the proposed joint optimization scheme significantly improves SDR, PESQ and STOI performance over both spectral mask estimation schemes and generative models. Moreover, the proposed scheme provides good generalization by showing substantial improvement on unseen perceptual speech metrics.


  • [1] D. B. Dean, S. Sridharan, R. J. Vogt, and M. W. Mason (2010) The QUT-NOISE-TIMIT corpus for the evaluation of voice activity detection algorithms. Proceedings of Interspeech 2010. Cited by: §3.1.
  • [2] H. Erdogan, J. R. Hershey, S. Watanabe, and J. Le Roux (2015) Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, pp. 708–712. Cited by: §1.
  • [3] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and D. S. Pallett (1993) DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST Speech Disc 1-1.1. NASA STI/Recon Technical Report N 93. Cited by: §3.1.
  • [4] Y. Hu and P. C. Loizou (2008) Evaluation of objective quality measures for speech enhancement. IEEE Transactions on audio, speech, and language processing 16 (1), pp. 229–238. Cited by: §3.3.
  • [5] A. Narayanan and D. Wang (2013) Ideal ratio mask estimation using deep neural networks for robust speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pp. 7092–7096. Cited by: §1.
  • [6] S. Pascual, A. Bonafonte, and J. Serra (2017) SEGAN: speech enhancement generative adversarial network. arXiv preprint arXiv:1703.09452. Cited by: §1, §3.3.
  • [7] (2001) Perceptual evaluation of speech quality (pesq): an objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs. ITU-T Recommendation P.862. Cited by: §2.2.
  • [8] S. R. Quackenbush (1986) Objective measures of speech quality. Cited by: §3.3.
  • [9] D. Rethage, J. Pons, and X. Serra (2018) A wavenet for speech denoising. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5069–5073. Cited by: §3.3.
  • [10] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra (2001) Perceptual evaluation of speech quality (pesq)-a new method for speech quality assessment of telephone networks and codecs. In Acoustics, Speech, and Signal Processing, 2001. Proceedings.(ICASSP’01). 2001 IEEE International Conference on, Vol. 2, pp. 749–752. Cited by: §2.2, §2.2.
  • [11] J. L. Roux, S. Wisdom, H. Erdogan, and J. R. Hershey (2018) SDR-half-baked or well done?. arXiv preprint arXiv:1811.02508. Cited by: §2.1.
  • [12] M. H. Soni, N. Shah, and H. A. Patil (2018) Time-frequency masking-based speech enhancement using generative adversarial network. Cited by: §1, §3.3.
  • [13] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen (2011) An algorithm for intelligibility prediction of time–frequency weighted noisy speech. IEEE Transactions on Audio, Speech, and Language Processing 19 (7), pp. 2125–2136. Cited by: §2.3, §2.3.
  • [14] J. Thiemann, N. Ito, and E. Vincent (2013) The diverse environments multi-channel acoustic noise database: a database of multichannel environmental noise recordings. The Journal of the Acoustical Society of America 133 (5), pp. 3591–3591. Cited by: §3.1.
  • [15] C. Valentini, X. Wang, S. Takaki, and J. Yamagishi (2016) Investigating rnn-based speech enhancement methods for noise-robust text-to-speech. In 9th ISCA Speech Synthesis Workshop, pp. 146–152. Cited by: §3.1.
  • [16] C. Veaux, J. Yamagishi, and S. King (2013) The voice bank corpus: design, collection and data analysis of a large regional accent speech database. In Oriental COCOSDA held jointly with 2013 Conference on Asian Spoken Language Research and Evaluation (O-COCOSDA/CASLRE), 2013 International Conference, pp. 1–4. Cited by: §3.1.
  • [17] E. Vincent, R. Gribonval, and C. Févotte (2006) Performance measurement in blind audio source separation. IEEE transactions on audio, speech, and language processing 14 (4), pp. 1462–1469. Cited by: §2.1.
  • [18] Y. Wang, A. Narayanan, and D. Wang (2014) On training targets for supervised speech separation. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP) 22 (12), pp. 1849–1858. Cited by: §1.
  • [19] Y. Zhao, B. Xu, R. Giri, and T. Zhang (2018) Perceptually guided speech enhancement using deep neural networks. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5074–5078. Cited by: §2.3.
  • [20] E. Zwicker and R. Feldtkeller (1967) Das ohr als nachrichtenempfänger. Hirzel. Cited by: §2.2.