1 Introduction
In recent years, deep neural networks have shown great success in speech enhancement compared with traditional statistical approaches. Statistical enhancement schemes such as MMSE-STSA (Ephraim & Malah, 1984) and OM-LSA (Ephraim & Malah, 1985; Cohen & Berdugo, 2001) do not learn a model from data; instead, they provide blind signal estimation based on predefined speech and noise models. However, their model assumptions, in general, do not match real-world complex non-stationary noise, which often leads to failures of noise estimation and tracking. In contrast, a neural network directly learns the complex nonlinear mapping from noisy speech to clean speech purely from data, without any prior assumption. With more data, a neural network can learn a better underlying mapping.
Spectrum mask estimation is a popular supervised denoising method that predicts a time-frequency mask and obtains an estimate of clean speech by multiplication with the noisy spectrum. There are numerous types of spectrum mask estimation, depending on how the mask labels are defined. For example, the authors in (Narayanan & Wang, 2013) used the ideal binary mask (IBM) as a training label, where the mask is set to zero or one depending on the signal-to-noise ratio (SNR) of a noisy spectrum bin. The ideal ratio mask (IRM) (Wang et al., 2014) and ideal amplitude mask (IAM) (Erdogan et al., 2015) provide non-binary soft mask labels to overcome the coarse label mapping of IBM. The phase-sensitive mask (PSM) (Erdogan et al., 2015) considers the phase spectrum difference between the clean and noisy signals in order to correctly maximize SNR.
Generative models such as generative adversarial networks (GANs) and variational autoencoders (VAEs) suggest an alternative to supervised learning. In the speech enhancement GAN (SEGAN) (Pascual et al., 2017), a generator network is trained to output a time-domain denoised signal that a discriminator cannot distinguish from a true clean signal. TF-SEGAN (Soni et al., 2018) extended SEGAN to a time-frequency mask. (Bando et al., 2018), on the other hand, combined a VAE-based speech model with non-negative matrix factorization (NMF) for the noise spectrum and showed good performance on unseen data.

However, all the schemes described above suffer from one of two critical issues: metric mismatch or spectrum mismatch. SDR and PESQ are the two most widely used metrics for measuring speech signal quality. The typical mean square error (MSE) criterion popularly used for spectrum mask estimation is not optimal for maximizing our target metrics of SDR and PESQ: decreasing the mean square error of a noisy speech signal often degrades SDR or PESQ, due to the different weighting of frequency components or the nonlinear transforms involved in each metric. Furthermore, GAN-based generative models do not even have an explicit loss function to minimize; although they are robust to unseen data, they typically perform much worse on test data with small data mismatch. The spectrum mismatch is a well-known issue: any spectrum modification after the short-time Fourier transform (STFT) cannot, in general, be fully recovered after ISTFT. Therefore, spectrum mask estimation and other alternatives optimized in the spectrum domain always carry a potential risk of performance loss.
This paper presents a new end-to-end multi-task denoising scheme with the following contributions. First, the proposed framework presents two loss functions:

SDR loss function: instead of the typical MSE, the SDR metric itself is used as a loss function. The scale-invariant term in the SDR metric is incorporated as part of training, which provides a significant SDR boost.

PESQ loss function: the PESQ metric is redesigned to be usable as a loss function. The two key variables of PESQ, the symmetric and asymmetric disturbances, are approximated so that they can be optimized during training.
The proposed multi-task denoising scheme combines the two loss functions for joint SDR and PESQ optimization. Second, the denoising network still predicts a time-frequency mask, but the network optimization is performed after ISTFT in order to avoid spectrum mismatch. The SDR loss function is naturally calculated from the time-domain reconstructed signal. The PESQ loss function needs the clean and denoised power spectra as input for the disturbance calculation; therefore, a second STFT is applied to the time-domain signal for spectrum consistency. The evaluation results showed that the proposed framework provides a large improvement in both SDR and PESQ.
2 Background and Related Work
Figure 1 illustrates the overall speech denoising flow. The input noisy signal is given by

$y_i[t] = x_i[t] + n_i[t]$  (1)

where $i$ is an utterance index, $t$ is a time index, $x_i[t]$ is the clean speech and $n_i[t]$ is the noise signal. $Y_i(m,f)$ is the STFT output, where $m$ and $f$ are frame and frequency indices, respectively. It is fed into two separate paths. In the upper path, the magnitude $|Y_i(m,f)|$ is passed to the Denoiser block. In contrast, the phase of $Y_i(m,f)$ is bypassed without compensation. The denoised spectrum $\hat X_i(m,f)$ is synthesized from the noisy input phase and the denoised amplitude spectrum. After ISTFT, we recover the time-domain denoised output $\hat x_i[t]$.
2.1 Spectrum Mask Estimation
The neural network in Figure 1 predicts a time-frequency mask $\hat M_i(m,f)$ to obtain an estimate of the clean amplitude spectrum. The estimated mask is multiplied by the input amplitude spectrum as follows:

$|\hat X_i(m,f)| = \hat M_i(m,f)\,|Y_i(m,f)|$  (2)
Given clean and noisy amplitude spectrum pairs, mask estimation minimizes a predefined distortion metric over all utterances and time-frequency bins:

$\min_{\theta} \sum_{i} \sum_{m=1}^{T_i} \sum_{f=1}^{F} D_i(m,f)$  (3)

where $\theta$ is the denoiser parameter, $T_i$ is the number of frames for utterance $i$, $F$ is the number of frequency bins, $D_i(m,f)$ is a distortion metric, $|X_i(m,f)|$ is the amplitude spectrum of clean speech and $|Y_i(m,f)|$ is the amplitude spectrum of noisy speech. Ideal binary mask (IBM) and ideal ratio mask (IRM) are two well-known mask labels given by
$M^{\mathrm{IBM}}_i(m,f) = 1$ if $10\log_{10}\big(|X_i(m,f)|^2 / |N_i(m,f)|^2\big) > \mathrm{LC}(f)$, and $0$ otherwise  (4)

where $\mathrm{LC}(f)$ is a log-scale threshold for frequency index $f$, and

$M^{\mathrm{IRM}}_i(m,f) = \sqrt{|X_i(m,f)|^2 \,/\, \big(|X_i(m,f)|^2 + |N_i(m,f)|^2\big)}$  (5)
The distortion metric for IBM and IRM is given by

$D_i(m,f) = \big(M_i(m,f) - \hat M_i(m,f)\big)^2$  (6)

where $M_i(m,f)$ is either the IBM or IRM label. One issue with these two mask labels is that they do not correctly recover clean speech: for example, $M^{\mathrm{IRM}}_i(m,f)\,|Y_i(m,f)|$ is generally not equal to $|X_i(m,f)|$. The ideal amplitude mask (IAM) is defined to represent the exact magnitude ratio:

$M^{\mathrm{IAM}}_i(m,f) = \frac{|X_i(m,f)|}{|Y_i(m,f)|}$  (7)
(7) 
For IAM, instead of the direct mask optimization of Eq. (6) used for IBM or IRM, (Weninger et al., 2014) suggested magnitude spectrum minimization, which was shown to give a significant improvement:

$D_i(m,f) = \big(|X_i(m,f)| - \hat M_i(m,f)\,|Y_i(m,f)|\big)^2$  (8)
One drawback of Eq. (8) is that it cannot correctly maximize SNR when the phase difference between the clean and noisy spectra is not zero. The optimal distortion measure for maximizing SNR is the mean square error between the complex spectra:

$D_i(m,f) = \big|X_i(m,f) - \hat M_i(m,f)\,Y_i(m,f)\big|^2$  (9)

Eq. (9) can be rearranged after removing terms unnecessary for optimization:

$D_i(m,f) = \big(\hat M_i(m,f)\,|Y_i(m,f)| - |X_i(m,f)|\cos\theta_i(m,f)\big)^2$  (10)

where $\theta_i(m,f)$ is the phase difference between the clean and noisy spectra. The equivalent mask label, also called the phase-sensitive mask (PSM) label, is given by

$M^{\mathrm{PSM}}_i(m,f) = \frac{|X_i(m,f)|}{|Y_i(m,f)|}\cos\theta_i(m,f)$  (11)
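The mask labels of Eqs. (4), (5), (7) and (11) can be computed directly from the clean and noise spectrograms. A minimal numpy sketch (the IBM threshold default here is a hypothetical choice):

```python
import numpy as np

def mask_labels(X, N, lc_db=0.0):
    """Compute common time-frequency mask labels from clean (X) and
    noise (N) complex spectrograms. Y = X + N is the noisy spectrogram.
    lc_db is the IBM log-scale SNR threshold (a hypothetical default)."""
    Y = X + N
    eps = 1e-12
    snr_db = 10 * np.log10((np.abs(X) ** 2 + eps) / (np.abs(N) ** 2 + eps))
    ibm = (snr_db > lc_db).astype(float)                                      # Eq. (4)
    irm = np.sqrt(np.abs(X) ** 2 / (np.abs(X) ** 2 + np.abs(N) ** 2 + eps))   # Eq. (5)
    iam = np.abs(X) / (np.abs(Y) + eps)                                       # Eq. (7)
    # PSM: magnitude ratio scaled by the cosine of the clean-noisy phase gap
    psm = iam * np.cos(np.angle(X) - np.angle(Y))                             # Eq. (11)
    return ibm, irm, iam, psm
```

Note that the PSM can exceed 1 or go negative when the phase difference is large, which is why it maximizes SNR where the IAM cannot.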
In this paper, PSM (Erdogan et al., 2015) was chosen as the baseline denoising scheme because PSM showed the best performance on the target metrics of SDR and PESQ among the spectrum mask estimation schemes.
2.2 Griffin-Lim Algorithm
The ISTFT operation in Figure 1 consists of IFFT, windowing and overlap-addition:

$\hat x_i[t] = \sum_{m} w[t - mH]\,\hat x_{i,m}[t - mH]$  (12)

where $H$ is the frame shift, $w[t]$ is the STFT window function and $\hat x_{i,m}[t]$ is the IFFT of $\hat X_i(m,\cdot)$. Due to the overlapped frame generation, any arbitrary spectrum modification can cause spectrum mismatch. For example, the STFT of $\hat x_i[t]$ is, in general, not equal to $\hat X_i(m,f)$ because of the modification of the amplitude spectrum. The Griffin-Lim algorithm (Griffin & Lim, 1984) finds a legitimate signal $x[t]$ whose spectrum is closest to $\hat X_i(m,f)$; a legitimate signal is one with no spectrum mismatch. The Griffin-Lim formulation is given by

$x^\star = \arg\min_{x} \sum_{m}\sum_{f} \big|X(m,f) - \hat X_i(m,f)\big|^2$  (13)

where $X(m,f)$ is the STFT of $x[t]$. The solution can be derived by finding the $x[t]$ that makes the gradient of Eq. (13) zero:

$x^\star[t] = \frac{\sum_m w[t-mH]\,\hat x_{i,m}[t-mH]}{\sum_m w^2[t-mH]}$  (14)

Although the Griffin-Lim algorithm is guaranteed to find the legitimate signal with minimum MSE, its spectrum is still not guaranteed to coincide with $\hat X_i(m,f)$. The iterative Griffin-Lim algorithm further decreases the magnitude spectrum mismatch at every iteration by allowing phase distortion; it is evaluated in conjunction with the proposed framework below.
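The iterative procedure can be sketched with `scipy.signal`'s STFT/ISTFT pair; the window length, overlap and iteration count below are illustrative choices, not the paper's settings:

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(target_mag, n_iter=30, nperseg=256, noverlap=192, fs=16000):
    """Iteratively estimate a time-domain signal whose STFT magnitude is
    close to target_mag, updating only the phase at each step."""
    rng = np.random.default_rng(0)
    phase = np.exp(1j * rng.uniform(0, 2 * np.pi, target_mag.shape))
    for _ in range(n_iter):
        # ISTFT followed by STFT projects onto the set of legitimate spectra
        _, x = istft(target_mag * phase, fs=fs, nperseg=nperseg, noverlap=noverlap)
        _, _, Z = stft(x, fs=fs, nperseg=nperseg, noverlap=noverlap)
        phase = np.exp(1j * np.angle(Z))  # keep new phase, re-impose target magnitude
    _, x = istft(target_mag * phase, fs=fs, nperseg=nperseg, noverlap=noverlap)
    return x
```

Each pass through ISTFT and STFT is one projection onto the set of legitimate spectra, so the magnitude mismatch is non-increasing over iterations even though the phase keeps changing.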
3 The Proposed Framework
Figure 2 describes the proposed end-to-end multi-task denoising framework. The underlying model architecture is composed of convolution layers and bidirectional LSTMs (BLSTMs). The spectrogram formed by 11 frames of the noisy amplitude spectrum is the input to the convolutional layers with a kernel size of 5×5. Dilated convolution with rates of 2 and 4 is applied to the second and third layers, respectively, in order to increase the kernel's coverage of frequency bins. Dilation is only applied to the frequency dimension because time correlation will be learned by the bidirectional LSTMs. Griffin-Lim ISTFT is applied to the synthesized complex spectrum to obtain the time-domain denoised output $\hat x_i[t]$. The two proposed loss functions are evaluated on $\hat x_i[t]$ and are therefore free from spectrum mismatch.
3.1 SDR Loss Function
Unlike SNR, the definition of SDR is not unique; there are at least two popularly used definitions. The best-known definition (Vincent et al., 2006) is given by

$\mathrm{SDR} = 10\log_{10}\dfrac{\|x_{\mathrm{target}}\|^2}{\|e_{\mathrm{noise}} + e_{\mathrm{artif}}\|^2}$  (15)

There was an $e_{\mathrm{interf}}$ term in the original SDR, but it is removed because there is no interference error in the single-source denoising problem. $x_{\mathrm{target}}$ and $e_{\mathrm{noise}}$ can be found by projecting the denoised signal $\hat x$ onto the clean and noise signals, respectively, and $e_{\mathrm{artif}}$ is the residual term:

$x_{\mathrm{target}} = \dfrac{\langle \hat x, x\rangle}{\|x\|^2}\, x$  (16)

$e_{\mathrm{noise}} = \dfrac{\langle \hat x, n\rangle}{\|n\|^2}\, n$  (17)

$e_{\mathrm{artif}} = \hat x - x_{\mathrm{target}} - e_{\mathrm{noise}}$  (18)

Substituting Eqs. (16), (17) and (18) into Eq. (15), the rearranged SDR is given by

$\mathrm{SDR} = 10\log_{10}\dfrac{\|\alpha x\|^2}{\|\alpha x - \hat x\|^2}$  (19)

where $\alpha = \langle \hat x, x\rangle / \|x\|^2$. Eq. (19) coincides with SI-SDR, another popularly used SDR definition (Roux et al., 2018). In general multiple-source problems the two definitions do not match, but for the single-source denoising problem we can use them interchangeably.
The SDR loss function is defined as the negative minibatch average of Eq. (19):

$L_{\mathrm{SDR}} = -\dfrac{1}{B}\sum_{i=1}^{B} 10\log_{10}\dfrac{\|\alpha_i x_i\|^2}{\|\alpha_i x_i - \hat x_i\|^2}$  (20)

where $B$ is the minibatch size. Compared with the conventional MSE loss function, the scale-invariant factor $\alpha$ is included as a training variable in Eq. (20). Figure 3 illustrates why training with $\alpha$ is important for maximizing the SDR metric. Two noisy signals $y_1$ and $y_2$ have the same SNR because they lie on the same circle centered at the clean signal $x$. However, their SDRs are different: for $y_1$, projecting onto $x$ gives $\mathrm{SDR} = 10\log_{10}\cot^2\theta_1$ by geometry, where $\theta_1$ is the angle between $y_1$ and $x$, and likewise the SDR of $y_2$ is $10\log_{10}\cot^2\theta_2$. Clearly, the signal with the smaller angle is better in terms of the SDR metric, but the MSE criterion cannot distinguish between them because they have the same SNR.
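A minimal numpy sketch of the SDR loss for a single minibatch, following the projection and sign conventions above:

```python
import numpy as np

def si_sdr(clean, denoised, eps=1e-12):
    """Scale-invariant SDR of Eq. (19): project the denoised signal onto
    the clean signal, then compare target and error energies."""
    alpha = np.dot(denoised, clean) / (np.dot(clean, clean) + eps)
    target = alpha * clean
    error = target - denoised
    return 10 * np.log10((np.dot(target, target) + eps) / (np.dot(error, error) + eps))

def sdr_loss(clean_batch, denoised_batch):
    """Negative minibatch average of SI-SDR, as in Eq. (20)."""
    return -np.mean([si_sdr(c, d) for c, d in zip(clean_batch, denoised_batch)])
```

Because of the projection by alpha, rescaling the denoised signal leaves the value unchanged, which is exactly the scale invariance the text exploits.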
3.2 PESQ Loss Function
Perceptual evaluation of speech quality (PESQ) (Rix et al., 2001) is an ITU-T standard that provides objective speech quality evaluation. Its value ranges from -0.5 to 4.5, and a higher PESQ means better perceptual quality. Although SDR and PESQ are highly correlated, it is frequently observed that a signal with a smaller SDR has a higher PESQ than one with a larger SDR. For example, an acoustic signal with high reverberation would have low SNR or SDR, but its PESQ can be much better because the time-frequency equalization in PESQ compensates for most channel fading effects. This is just one example; many operations make PESQ behave very differently from SDR. Therefore, the SDR loss function cannot effectively optimize PESQ.
In this section, a new PESQ loss function is designed based on the PESQ metric. Figure 4 shows the block diagram used to compute the PESQ loss function $L_{\mathrm{PESQ}}$. The overall procedure is similar to the PESQ system in (Rix et al., 2001), with three major modifications. First, the IIR filter in PESQ is removed: the time evolution of the IIR filter is too deep for backpropagation. Second, the delay adjustment routine is unnecessary because the training data pairs are already time-aligned. Third, the bad-interval iteration is removed. PESQ refines its metric calculation by detecting bad intervals of frames and updating the metrics over those periods; as long as the training clean and noisy data pairs are perfectly time-aligned, removing this operation has no significant impact on PESQ.
Level Alignment: The average powers of the clean and noisy speech in the band from 300 Hz to 3 kHz are aligned to a predefined power value. The IIR filter gain of 2.47 is also compensated in this block.
Bark Spectrum: The Bark spectrum block averages linear-scale frequency bins according to the Bark scale mapping. Higher frequency bins are averaged over more bins, which effectively gives them lower weighting. The mapped Bark spectrum power can be formulated as follows:

(21) 

where the sum for each Bark band starts at its corresponding linear frequency bin. All positive linear frequency bins were mapped to 49 Bark spectrum bins. The Bark spectrum power of noisy speech is found in the same manner.
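The band-averaging step can be sketched as below; the band edges here are illustrative placeholders, not the 49-band PESQ mapping table:

```python
import numpy as np

def bark_band_power(power_spec, band_starts):
    """Average linear-frequency power bins into bands, where band_starts
    lists the first linear bin of each band. Wider bands at high
    frequencies are averaged over more bins, i.e. get lower weighting."""
    edges = list(band_starts) + [power_spec.shape[-1]]
    return np.array([power_spec[..., a:b].mean(axis=-1)
                     for a, b in zip(edges[:-1], edges[1:])]).T
```

The same mapping is applied to the clean and noisy power spectra so that later disturbance terms compare like with like.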
Time-Frequency Equalization: Each Bark spectrum bin of clean speech is compensated by the average power ratio between the clean and noisy Bark spectra as follows:

(22) 

where the ratio is bounded by constants and the silence masks become 1 only when the corresponding Bark spectrum power exceeds a predefined threshold. After frequency equalization, the short-term gain variation of the noisy Bark spectrum is also compensated for each frame:

(23)  
(24)  
(25) 

where the sums are taken over the Bark spectrum bins and the gain is bounded by a constant.
Loudness Mapping: The power densities are transformed to a Sone loudness scale using Zwicker's law (Zwicker & Feldtkeller, 1967):

(26) 

where the absolute hearing threshold and the loudness scaling factor are per-band constants, $r$ is the Zwicker power, and the subscript can be $c$ (clean) or $n$ (noisy).
Disturbance Processing: The raw disturbance is the difference between the clean and noisy loudness densities, with the following masking operation:

(27) 

If the absolute difference between the clean and noisy loudness densities is less than 0.25 times the minimum of the two densities, the raw disturbance is set to zero. From the raw disturbance, the symmetric frame disturbance is given by

(28) 

where the Bark spectrum bins are combined with predefined weights. The asymmetric frame disturbance applies additional scaling and thresholding steps:
(29)  
(30) 
(31) 
Aggregation: The PESQ loss function $L_{\mathrm{PESQ}}$ is found as follows:

(32) 

The total symmetric disturbance can be found from two-stage averaging:

(33)  
(34) 

The total asymmetric disturbance can be found with similar averaging.
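The deadzone rule and band-weighted frame aggregation described above can be sketched as follows; the weights and the norm choice are illustrative placeholders, not the exact PESQ constants:

```python
import numpy as np

def raw_disturbance(s_clean, s_noisy):
    """Per-frame, per-Bark-bin raw disturbance with the deadzone rule:
    differences smaller than 0.25 * min(clean, noisy) loudness are zeroed
    (and larger differences are reduced by that margin)."""
    diff = s_noisy - s_clean
    deadzone = 0.25 * np.minimum(s_clean, s_noisy)
    return np.sign(diff) * np.maximum(np.abs(diff) - deadzone, 0.0)

def frame_disturbance(d, weights):
    """Combine Bark bins into one value per frame using a simple weighted
    L2 norm (the real PESQ weighting and norms differ)."""
    return np.sqrt(np.sum(weights * d ** 2, axis=-1))
```

The asymmetric variant would additionally scale each bin by a clean-vs-noisy power ratio before aggregation, penalizing additive distortions more than attenuations.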
3.3 Joint SDR and PESQ Optimization
To jointly maximize SDR and PESQ, a new loss function $L_{\mathrm{SDR\text{-}PESQ}}$ is defined by combining $L_{\mathrm{SDR}}$ and $L_{\mathrm{PESQ}}$:

$L_{\mathrm{SDR\text{-}PESQ}} = L_{\mathrm{SDR}} + \lambda\,L_{\mathrm{PESQ}}$  (35)

where $\lambda$ is a weighting factor. Another combined loss function, $L_{\mathrm{SDR\text{-}MSE}}$, is defined using the MSE criterion instead of $L_{\mathrm{PESQ}}$:

$L_{\mathrm{SDR\text{-}MSE}} = L_{\mathrm{SDR}} + \lambda\,L_{\mathrm{MSE}}$  (36)

By comparing $L_{\mathrm{SDR\text{-}MSE}}$ with $L_{\mathrm{SDR\text{-}PESQ}}$, we can evaluate the PESQ improvement of the proposed PESQ loss function over the MSE criterion.
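The combination itself is a weighted sum; a tiny sketch with a hypothetical stand-in for the second loss term and an illustrative weight:

```python
import numpy as np

def mse_loss(clean, denoised):
    """Plain MSE between time-domain signals, the baseline second term."""
    return float(np.mean((np.asarray(clean) - np.asarray(denoised)) ** 2))

def combined_loss(l_sdr, l_other, weight=1.0):
    """Weighted sum of the SDR loss and a second loss term (PESQ or MSE).
    The weight is an illustrative placeholder, not the paper's value."""
    return l_sdr + weight * l_other
```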
4 Experiments and Results
4.1 Experimental Settings
Table 1: Noise and location information for the QUT-NOISE-TIMIT train and test sets.

Train Set | CAFE-FOODCOURTB1, CAFE-FOODCOURTB2, CAR-WINDOWNB1, CAR-WINDOWNM2, HOME-KITCHEN1, HOME-KITCHEN2, REVERB-POOL1, REVERB-POOL2, STREET-CITY1, STREET-CITY2
Test Set  | CAFE-CAFE1, CAR-WINUPB1, HOME-LIVING1, REVERB-CARPARK1, STREET-KG1
In order to evaluate the proposed denoising framework, we used the following three datasets:

CHiME-4 (Vincent et al., 2017): The simulated dataset is used for training and evaluation. It is synthesized by mixing the Wall Street Journal (WSJ) corpus with noise sources from four different environments: bus, cafe, pedestrian area and street junction. The four noise conditions are applied to both train and development data. Moreover, the simulated dataset has a fixed SNR and is therefore relatively easy data to improve.

QUT-NOISE-TIMIT (Dean et al., 2010): The QUT-NOISE-TIMIT corpus is created by mixing 5 different background noise sources with TIMIT clean speech (Garofolo et al., 1993). Unlike CHiME-4, the synthesized speech sequences cover a wide range of SNR categories, from -10 dB to 15 dB. For the training set, only -5 and +5 dB SNR data were used, but the test set contains all SNR ranges. The noise and location information used for the train and test sets is summarized in Table 1.

VoiceBank-DEMAND (Valentini et al., 2016): 30 speakers selected from the Voice Bank corpus (Veaux et al., 2013) were mixed with 10 noise types: 8 from the DEMAND dataset (Thiemann et al., 2013) and 2 artificially generated ones. The test set is generated with 5 noise types from DEMAND that do not coincide with those used for the training data. The VoiceBank-DEMAND corpus has been used to evaluate generative models such as SEGAN (Pascual et al., 2017), TFGAN (Soni et al., 2018) and WAVENET (Rethage et al., 2018).
4.2 Model Comparison
Table 2: SDR and PESQ for various denoising networks on the CHiME-4 corpus.

Model        | GL Iter. | Loss Type | SDR   | PESQ
Noisy Input  |          |           | 5.80  | 1.267
OMLSA        |          |           | 8.50  | 1.514
DNN*         |          | IRM       | 10.93 |
CNN-DAE      | 1        | IAM       | 11.30 | 1.507
CNN-DAE      | 10       | IAM       | 11.21 | 1.491
CNN-BLSTM    | 1        | IAM       | 11.91 | 1.822
CNN-BLSTM    | 10       | IAM       | 12.06 | 1.829
Table 3: SDR and PESQ on the QUT-NOISE-TIMIT corpus for each input SNR.

             |                    SDR                     |                    PESQ
Loss Type    | -10 dB  -5 dB   0 dB   5 dB  10 dB  15 dB  | -10 dB  -5 dB   0 dB   5 dB  10 dB  15 dB
Noisy Input  | -11.82  -7.33  -3.27   0.21   2.55   5.03  |   1.07   1.08   1.13   1.26   1.44   1.72
IAM          |  -3.23   0.49   2.79   4.63   5.74   7.52  |   1.29   1.47   1.66   1.88   2.07   2.30
PSM          |  -2.95   0.92   3.37   5.40   6.64   8.50  |   1.30   1.49   1.71   1.94   2.15   2.37
SNR          |  -2.79   1.36   3.70   5.68   6.18   8.44  |   1.29   1.46   1.68   1.93   2.14   2.38
SDR          |  -2.66   1.55   4.13   6.25   7.53   9.39  |   1.26   1.42   1.65   1.92   2.16   2.41
SDR-MSE      |  -2.53   1.57   4.10   6.31   7.60   9.43  |   1.29   1.47   1.68   1.93   2.15   2.39
SDR-PESQ     |  -2.31   1.80   4.36   6.51   7.79   9.65  |   1.43   1.65   1.89   2.16   2.35   2.54
Table 2 showed SDR and PESQ performance for various denoising networks on the CHiME-4 corpus. OMLSA is the baseline statistical scheme; it presented a 2.7 dB SDR gain and a 0.25 PESQ improvement. Among neural network-based approaches, the authors of (Bando et al., 2018) reported 10.93 dB SDR for a 5-layer DNN. In this paper, two models were trained: a CNN-based denoising autoencoder (DAE) and a CNN-BLSTM. The generator architecture of (Pascual et al., 2017) was used for the CNN-DAE. It presented an SDR improvement over the DNN model, but its PESQ performance was worse than OMLSA. For the CNN-BLSTM, both SDR and PESQ significantly improved, which is why we chose it as our model architecture.

The iterative Griffin-Lim algorithm (Griffin & Lim, 1984) was also evaluated for the CNN-DAE and CNN-BLSTM at inference time. SDR and PESQ for both networks converged after 10 iterations. The CNN-BLSTM showed a 0.15 dB SDR gain from 10 iterations; on the other hand, the CNN-DAE showed a 0.1 dB loss. Although the iterative Griffin-Lim algorithm reduces the amplitude spectrum mismatch at each iteration, the phase spectrum also changes with each update, and this phase update is not predictable: as shown in this experiment, it is sometimes beneficial and sometimes not. Due to this unstable SDR performance, the iterative Griffin-Lim algorithm was not used at inference time. Iterative Griffin-Lim in the training stage is evaluated in a later section.
4.3 Main Results
Table 3 compared SDR and PESQ performance between different denoising methods on the QUT-NOISE-TIMIT corpus. All schemes were based on the same CNN-BLSTM model and trained with -5 and +5 dB SNR data, as explained in Section 4.1. IAM and PSM are the two spectrum mask estimation schemes explained in Section 2.1, SDR refers to the loss function of Section 3.1, and SDR-MSE and SDR-PESQ correspond to the two combined loss functions of Section 3.3, respectively.

The proposed end-to-end scheme based on the SDR loss showed significant SDR gain over the spectrum mask schemes for all SNR ranges. However, it did not show a similar improvement in PESQ due to metric mismatch. For example, PESQ performance degraded relative to PSM at the low SNR ranges of -10, -5 and 0 dB.

The two proposed joint optimization schemes were then evaluated. First, SDR-MSE showed improvement on both SDR and PESQ metrics for most SNR ranges. However, the improvement was marginal, and it still suffered a PESQ loss relative to PSM at low SNR. Second, the SDR-PESQ loss function improved both SDR and PESQ metrics by a large margin over all other schemes: by combining the PESQ loss with the SDR loss, both metrics were significantly improved.

In Section 3.1, we claimed that the MSE loss function cannot correctly maximize SDR. To evaluate this claim, we trained the CNN-BLSTM with the MSE criterion, which is the SNR entry in Table 3. Compared with the SDR loss, SDR performance for the SNR loss function degraded for all SNR ranges. One thing to note is that PESQ performance did not degrade; this also shows that neither the SDR nor the SNR loss function can effectively optimize the PESQ metric.
Table 4: SDR and PESQ on the CHiME-4 corpus for each loss type.

Loss Type | SDR   | PESQ
IAM       | 11.91 | 1.822
PSM       | 12.08 | 1.857
SDR       | 12.43 | 1.699
SDR-MSE   | 12.44 | 1.758
SDR-PESQ  | 12.59 | 1.953
Table 4 showed the SDR and PESQ results on CHiME-4. The result is similar to QUT-NOISE-TIMIT: the SDR loss presented a significant SDR improvement but degraded PESQ performance, while the proposed joint SDR and PESQ optimization presented the best performance on both metrics.

The proposed joint optimization loss SDR-PESQ improved both SDR and PESQ even compared with the pure SDR loss. It would be logical to guess that the pure SDR loss should present the best SDR performance; however, for both the CHiME-4 and QUT-NOISE-TIMIT corpora, the PESQ loss term in SDR-PESQ was also helpful for improving the SDR metric. The reason can be explained by the train and test curves in Figure 5. The training SDR curve for the pure SDR loss reached the highest value, as expected, but it showed lower SDR on the test set than SDR-PESQ. Clearly, the PESQ loss also acted as a regularizer to avoid overfitting. We tried other regularization terms such as $L_1$ and $L_2$ norms, but the PESQ loss was more effective at improving generalization to unseen data than these general regularization methods.
4.4 Comparison with Generative Models
Table 5: Comparison with generative models on the VoiceBank-DEMAND corpus.

Models      | CSIG | CBAK | COVL | PESQ | SSNR
Noisy Input | 3.37 | 2.49 | 2.66 | 1.99 | 2.17
SEGAN       | 3.48 | 2.94 | 2.80 | 2.16 | 7.73
WAVENET     | 3.62 | 3.23 | 2.98 |      |
TFGAN       | 3.80 | 3.12 | 3.14 | 2.53 |
DCUnet-10   | 3.70 | 3.22 | 3.10 | 2.52 | 9.40
DCUnet-20   | 4.12 | 3.47 | 3.51 | 2.87 | 9.96
WSDR        | 3.54 | 3.24 | 3.01 | 2.51 | 9.93
SDR-PESQ    | 4.09 | 3.54 | 3.55 | 3.01 | 10.44
Table 5 showed the comparison with other generative models. All results except our end-to-end model came from the original papers: SEGAN (Pascual et al., 2017), WAVENET (Rethage et al., 2018) and TFGAN (Soni et al., 2018). CSIG, CBAK and COVL are objective measures where a higher value means better speech quality (Hu & Loizou, 2008): CSIG is the mean opinion score (MOS) of signal distortion, CBAK is the MOS of background noise intrusiveness and COVL is the MOS of the overall effect. SSNR is segmental SNR as defined in (Quackenbush, 1986).

The proposed joint SDR and PESQ optimization scheme outperformed all the generative models in all objective measures. Unfortunately, the original papers did not provide the SDR metric. The SSNR metric is similar to SDR, but SSNR is scale-sensitive and therefore the SDR loss function would not be optimal for maximizing it. Nevertheless, SSNR performance for the SDR-PESQ loss function showed a significant gain over the other generative models.

The results for DCUnet-10 and DCUnet-20 in Table 5 came from the recent paper (Choi et al., 2019), which suggested several novel ideas. One of its main contributions was a phase compensation scheme, which showed significant improvement on the VoiceBank-DEMAND corpus. Our proposed joint optimization scheme was based on amplitude spectrum estimation; however, it is not limited to real mask estimation and could also enjoy the phase compensation gain if applied. Therefore, the gain from the complex mask should be removed for a fair comparison. Fortunately, (Choi et al., 2019) also provided a variant of its scheme based on a real-valued network (RMRn), which is shown in Table 5. Our joint optimization scheme presented better performance on almost all the objective measures. In order to remove performance variation due to model differences, a CNN-BLSTM model was also trained with the weighted SDR loss function proposed in (Choi et al., 2019), shown as WSDR in Table 5. It showed comparable performance to DCUnet-10 but, compared with SDR-PESQ, a large loss especially in PESQ and SSNR.
4.5 Training with Iterative Griffin-Lim Algorithm
Table 6: SDR and PESQ for each number of Griffin-Lim iterations applied during training.

Griffin-Lim Iterations | SDR   | PESQ
1                      | 12.59 | 1.953
2                      | 11.59 | 1.679
3                      | 12.35 | 1.949
4                      | 11.61 | 1.638
The KMISI scheme for blind source separation (Wang et al., 2018) applied multiple STFT-ISTFT operations similar to the iterative Griffin-Lim algorithm and showed SDR improvement as the number of iterations increased. We also applied the iterative Griffin-Lim algorithm to our end-to-end joint optimization at the training stage. Table 6 showed SDR and PESQ performance for each number of Griffin-Lim iterations. Unlike KMISI, multiple Griffin-Lim iterations only hurt SDR and PESQ performance. The key difference in KMISI is that the reconstructed time-domain signal is redistributed among the multiple sources at each iteration, which can provide substantial enhancement for source separation. For the single-source denoising problem, a single Griffin-Lim iteration presented the best performance.
5 Conclusion
In this paper, a new end-to-end multi-task denoising scheme was proposed. The proposed scheme resolves the two issues described earlier: spectrum and metric mismatches. First, two metric-based loss functions were defined: the SDR and PESQ loss functions. Second, the two newly defined loss functions were combined for joint SDR and PESQ optimization. Finally, the combined loss function was optimized on the reconstructed time-domain signal after Griffin-Lim ISTFT in order to avoid spectrum mismatch. The experimental results showed that the proposed joint optimization scheme significantly improved SDR and PESQ performance over both spectrum mask estimation schemes and generative models.
References
 Bando et al. (2018) Bando, Y., Mimura, M., Itoyama, K., Yoshii, K., and Kawahara, T. Statistical speech enhancement based on probabilistic integration of variational autoencoder and non-negative matrix factorization. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 716–720. IEEE, 2018.
 Choi et al. (2019) Choi, H.-S., Kim, J., Huh, J., Kim, A., Ha, J.-W., and Lee, K. Phase-aware speech enhancement with deep complex U-Net. ICLR, 2019.
 Cohen & Berdugo (2001) Cohen, I. and Berdugo, B. Speech enhancement for nonstationary noise environments. Signal Processing, 81(11):2403–2418, 2001.
 Dean et al. (2010) Dean, D. B., Sridharan, S., Vogt, R. J., and Mason, M. W. The QUT-NOISE-TIMIT corpus for the evaluation of voice activity detection algorithms. Proceedings of Interspeech 2010, 2010.
 Ephraim & Malah (1984) Ephraim, Y. and Malah, D. Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(6):1109–1121, 1984.
 Ephraim & Malah (1985) Ephraim, Y. and Malah, D. Speech enhancement using a minimum mean-square error log-spectral amplitude estimator. IEEE Transactions on Acoustics, Speech, and Signal Processing, 33(2):443–445, 1985.

 Erdogan et al. (2015) Erdogan, H., Hershey, J. R., Watanabe, S., and Le Roux, J. Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, pp. 708–712. IEEE, 2015.
 Garofolo et al. (1993) Garofolo, J. S., Lamel, L. F., Fisher, W. M., Fiscus, J. G., and Pallett, D. S. DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1. NASA STI/Recon technical report n, 93, 1993.
 Griffin & Lim (1984) Griffin, D. and Lim, J. Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243, 1984.
 Hu & Loizou (2008) Hu, Y. and Loizou, P. C. Evaluation of objective quality measures for speech enhancement. IEEE Transactions on audio, speech, and language processing, 16(1):229–238, 2008.
 Narayanan & Wang (2013) Narayanan, A. and Wang, D. Ideal ratio mask estimation using deep neural networks for robust speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pp. 7092–7096. IEEE, 2013.
 Pascual et al. (2017) Pascual, S., Bonafonte, A., and Serra, J. Segan: Speech enhancement generative adversarial network. arXiv preprint arXiv:1703.09452, 2017.
 Quackenbush (1986) Quackenbush, S. R. Objective measures of speech quality (subjective). 1986.
 Rethage et al. (2018) Rethage, D., Pons, J., and Serra, X. A wavenet for speech denoising. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5069–5073. IEEE, 2018.
 Rix et al. (2001) Rix, A. W., Beerends, J. G., Hollier, M. P., and Hekstra, A. P. Perceptual evaluation of speech quality (PESQ): a new method for speech quality assessment of telephone networks and codecs. In Acoustics, Speech, and Signal Processing, 2001. Proceedings (ICASSP'01). 2001 IEEE International Conference on, volume 2, pp. 749–752. IEEE, 2001.
 Roux et al. (2018) Roux, J. L., Wisdom, S., Erdogan, H., and Hershey, J. R. SDR: half-baked or well done? arXiv preprint arXiv:1811.02508, 2018.
 Soni et al. (2018) Soni, M. H., Shah, N., and Patil, H. A. Time-frequency masking-based speech enhancement using generative adversarial network. 2018.
 Thiemann et al. (2013) Thiemann, J., Ito, N., and Vincent, E. The diverse environments multichannel acoustic noise database: A database of multichannel environmental noise recordings. The Journal of the Acoustical Society of America, 133(5):3591–3591, 2013.
 Valentini et al. (2016) Valentini, C., Wang, X., Takaki, S., and Yamagishi, J. Investigating RNN-based speech enhancement methods for noise-robust text-to-speech. In 9th ISCA Speech Synthesis Workshop, pp. 146–152, 2016.
 Veaux et al. (2013) Veaux, C., Yamagishi, J., and King, S. The voice bank corpus: Design, collection and data analysis of a large regional accent speech database. In Oriental COCOSDA held jointly with 2013 Conference on Asian Spoken Language Research and Evaluation (OCOCOSDA/CASLRE), 2013 International Conference, pp. 1–4. IEEE, 2013.
 Vincent et al. (2006) Vincent, E., Gribonval, R., and Févotte, C. Performance measurement in blind audio source separation. IEEE transactions on audio, speech, and language processing, 14(4):1462–1469, 2006.
 Vincent et al. (2017) Vincent, E., Watanabe, S., Nugraha, A. A., Barker, J., and Marxer, R. An analysis of environment, microphone and data simulation mismatches in robust speech recognition. Computer Speech & Language, 46:535–557, 2017.
 Wang et al. (2014) Wang, Y., Narayanan, A., and Wang, D. On training targets for supervised speech separation. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), 22(12):1849–1858, 2014.
 Wang et al. (2018) Wang, Z.-Q., Roux, J. L., Wang, D., and Hershey, J. R. End-to-end speech separation with unfolded iterative phase reconstruction. arXiv preprint arXiv:1804.10204, 2018.

Weninger et al. (2014) Weninger, F., Hershey, J. R., Le Roux, J., and Schuller, B. Discriminatively trained recurrent neural networks for single-channel speech separation. In Proceedings 2nd IEEE Global Conference on Signal and Information Processing, GlobalSIP, Machine Learning Applications in Speech Processing Symposium, Atlanta, GA, USA, 2014.
 Zwicker & Feldtkeller (1967) Zwicker, E. and Feldtkeller, R. Das Ohr als Nachrichtenempfänger. Hirzel, 1967.