1 Introduction
Speech enhancement, which aims to recover the target speech from a noisy observed signal, is a fundamental task in a wide range of speech applications including automatic speech recognition (ASR)
[1] and telecommunication [2]. In these applications, the objectives of enhancement differ: the former helps machine listening, while the latter helps human listening. This study focuses on the latter, where the subjective sound quality of the enhanced speech signal is the target of improvement. Over the last decade, the use of deep neural networks (DNNs) for speech enhancement has substantially advanced the state-of-the-art performance [3, 4, 5, 9, 6, 7, 8, 10, 11, 12, 13, 14, 15].
The popular strategy is to estimate a time-frequency (TF) mask by a DNN and apply it in the short-time Fourier transform (STFT) domain
[3], where the enhanced signal is obtained by the inverse STFT. Ordinarily, DNNs are trained by backpropagation to minimize a mathematically defined differentiable cost function such as the mean squared/absolute error [4] or the signal-to-distortion ratio (SDR) [6]. Unfortunately, it has been shown that such mathematically defined cost functions do not guarantee an improvement in subjective sound quality [10]. To improve the sound quality of enhanced speech signals, human-perception-based measures for objective sound quality assessment (OSQA), such as PESQ (perceptual evaluation of speech quality) [16], have been applied to the training of DNNs.

The difficulty in training a DNN based on an OSQA score is its non-differentiable nature, which prevents the use of backpropagation. In previous studies, two types of strategies have been proposed to circumvent this difficulty [9, 10, 11]. Koizumi et al. formulated the training as a black-box optimization problem and adopted techniques from reinforcement learning (RL) that approximate the gradient using a sampling algorithm [9, 10]. MetricGAN, proposed by Fu et al. [11], is the other approach; it utilizes an auxiliary DNN to approximate the OSQA score in the manner of a generative adversarial network (GAN) [17]. This function-approximation-based strategy allows the OSQA information to be backpropagated to the primary DNN that enhances the signals. While these methods effectively improved sound quality, they suffer from unstable training: the targeted OSQA score on the test dataset does not increase stably (see Fig. 5 of [10] and Fig. 2 of [11]), which can cause failure in some situations.
In this study, we propose the use of stabilization techniques for the function-approximation-based method, as shown in Fig. 1. To stably train the auxiliary DNN that approximates the OSQA score, we design a new cost function and adopt training techniques from RL and other machine-learning areas. We conducted experiments training a DNN based on PESQ as an example, and the results show that the proposed method (i) can stably train a DNN to increase PESQ, (ii) achieved the state-of-the-art PESQ score on a public dataset, and (iii) obtained better sound quality than conventional methods in a subjective evaluation.
2 Conventional Methods
2.1 DNN-based Speech Enhancement using a TF Mask
Let the $T$-points-long time-domain observation $x \in \mathbb{R}^T$ be a mixture of a target signal $s$ and noise $n$ as $x = s + n$. The goal of speech enhancement is to recover $s$ from $x$, where the use of DNNs has substantially advanced the state-of-the-art performance. A popular strategy is to use a DNN for estimating a TF mask in the STFT domain. Let $X \in \mathbb{C}^{F \times K}$ be the STFT of $x$, where $F$ and $K$ are the number of frequency and time bins, respectively. A general form of DNN-based speech enhancement using a TF mask can be written as
$\hat{s} = \mathrm{STFT}^{-1}\!\left( \mathcal{M}_\theta(X) \odot X \right)$   (1)
where $\hat{s}$ is the estimate of $s$, $\mathrm{STFT}^{-1}$ is the inverse STFT, $\odot$ is the element-wise product, and $\mathcal{M}_\theta$ is a DNN for estimating the TF mask. The set of parameters $\theta$ of the DNN is trained to minimize a cost function $\mathcal{J}(\theta)$ by iterating the gradient-descent procedure:
$\theta \leftarrow \theta - \lambda \nabla_\theta \mathcal{J}(\theta)$   (2)
where $\lambda > 0$ is the step size, and $\nabla_\theta$ is the differential operator w.r.t. $\theta$.
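The masking pipeline of (1) can be sketched with SciPy's STFT utilities. This is a minimal illustration, not the authors' implementation: the function name `enhance` and the callable `mask_fn` (a stand-in for the DNN mask estimator) are assumptions, and the 512-sample Hann window with 128-sample shift follows the setup described later in Sec. 4.1.

```python
import numpy as np
from scipy.signal import stft, istft

def enhance(x, mask_fn, fs=16000, nperseg=512, noverlap=384):
    """Eq. (1): estimate a TF mask, apply it in the STFT domain,
    and return to the time domain via the inverse STFT."""
    _, _, X = stft(x, fs=fs, nperseg=nperseg, noverlap=noverlap)
    M = mask_fn(X)                     # TF mask estimated from the spectrogram
    _, y = istft(M * X, fs=fs, nperseg=nperseg, noverlap=noverlap)
    return y[: len(x)]                 # trim any boundary padding
```

With an all-ones mask the pipeline reduces to STFT followed by inverse STFT, so the input is recovered up to numerical error; a trained mask estimator would attenuate noise-dominated TF bins instead.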
To apply a gradient-descent-type algorithm as in (2), derivatives of both $\mathcal{J}$ and $\mathcal{M}_\theta$ are required for computing $\nabla_\theta \mathcal{J}(\theta)$. Since $\nabla_\theta \mathcal{J}(\theta)$ is usually computed by backpropagation, differentiability of the cost function matters for the algorithm. The mean squared/absolute error [4] and SDR [6] are examples of differentiable cost functions popular in speech enhancement. However, these mathematically defined cost functions do not guarantee an improvement in the subjective sound quality of the enhanced signals [10] because they do not take perceptual concepts into account.
2.2 OSQA-based Cost Function for Speech Enhancement
Instead of such mathematically defined differentiable cost functions, some perceptually motivated functions such as PESQ have been considered in DNN-based speech enhancement. Let $\mathcal{Q}(\hat{s}, s)$ be an OSQA score evaluated between $\hat{s}$ and $s$. To incorporate it into the training of a DNN, two approaches have been proposed [9, 10, 11].
The first approach [10] considered the expectation of the OSQA score,
$\mathcal{J}_{\mathrm{OSQA}}(\theta) = -\mathbb{E}\!\left[ \mathcal{Q}(\hat{s}, s) \right]$   (3)
as the cost function. To calculate $\nabla_\theta \mathcal{J}_{\mathrm{OSQA}}(\theta)$, Koizumi et al. [10] considered a sampling algorithm used in RL. They rewrote (3) as
$\mathcal{J}_{\mathrm{OSQA}}(\theta) = -\int \mathcal{Q}(\hat{s}, s)\, p_\theta(\hat{s} \mid x)\, \mathrm{d}\hat{s}$   (4)
where $p_\theta(\hat{s} \mid x)$ is a conditional distribution of $\hat{s}$ given $x$. As the goal is to train a DNN for recovering $s$, $p_\theta(\hat{s} \mid x)$ is considered to be composed of a DNN. Then, by using the log-derivative trick, $\nabla_\theta \mathcal{J}_{\mathrm{OSQA}}(\theta)$ is given by
$\nabla_\theta \mathcal{J}_{\mathrm{OSQA}}(\theta) = -\mathbb{E}_{\hat{s} \sim p_\theta(\hat{s} \mid x)}\!\left[ \mathcal{Q}(\hat{s}, s)\, \nabla_\theta \ln p_\theta(\hat{s} \mid x) \right]$   (5)
When $\ln p_\theta(\hat{s} \mid x)$ is differentiable w.r.t. $\theta$ and $\hat{s}$ can be drawn from $p_\theta(\hat{s} \mid x)$, (5) can be calculated approximately. The problem with this approach is that it takes a long time for the training to stabilize. This is because the expectation in (5) is computed numerically by the Monte Carlo method, and stabilizing this random-sampling-based expectation requires a huge number of samples.
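The log-derivative (score-function) estimator in (5) can be illustrated on a toy one-dimensional problem. This is a sketch under stated assumptions, not the method of [10]: the Gaussian sampling distribution, the quadratic `quality` function, and the name `score_function_grad` are all illustrative, chosen only to show why a large sample count is needed.

```python
import numpy as np

def score_function_grad(theta, quality, n_samples, sigma=1.0, rng=None):
    """Monte Carlo estimate of d/dtheta E_{z ~ N(theta, sigma^2)}[quality(z)]
    via the log-derivative trick: E[quality(z) * d log p(z|theta)/dtheta]."""
    rng = np.random.default_rng(rng)
    z = rng.normal(theta, sigma, size=n_samples)
    dlogp = (z - theta) / sigma ** 2   # gradient of log N(z; theta, sigma^2)
    return np.mean(quality(z) * dlogp)
```

For `quality(z) = -(z - 2)**2` and `theta = 0`, the true gradient is 4, but single-sample estimates scatter widely (the estimator's variance here is about 87), which mirrors why the sampling-based approach of [10] needs many samples to stabilize.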
The second approach, MetricGAN proposed by Fu et al. [11], is based on function approximation of $\mathcal{Q}$. In this method, the OSQA score is approximated by an auxiliary DNN $\mathcal{D}_\phi$ as
$\mathcal{Q}(\hat{s}, s) \approx \mathcal{D}_\phi(\hat{s}, s)$   (6)
where $\phi$ is the set of parameters of $\mathcal{D}$. Since $\mathcal{D}_\phi$ is differentiable, $\nabla_\theta \mathcal{D}_\phi(\hat{s}, s)$ can be calculated via backpropagation. In this method, $\mathcal{M}_\theta$ and $\mathcal{D}_\phi$ are trained alternately. First, $\phi$ is updated to decrease the following cost function:
$\mathcal{J}_{\mathcal{D}}(\phi) = \mathbb{E}\!\left[ \left( \mathcal{D}_\phi(s, s) - \mathcal{Q}(s, s) \right)^2 + \left( \mathcal{D}_\phi(\hat{s}, s) - \mathcal{Q}(\hat{s}, s) \right)^2 \right]$   (7)
which is the mean-squared error (MSE) between the true and estimated OSQA scores. Then, to update $\theta$ so that $\hat{s}$ obtains the best score, the cost function
$\mathcal{J}_{\mathcal{M}}(\theta) = \mathbb{E}\!\left[ \left( \mathcal{D}_\phi(\hat{s}, s) - 1 \right)^2 \right]$   (8)
is minimized, where the OSQA score is assumed to be normalized as $0 \leq \mathcal{Q} \leq 1$, and therefore $\mathcal{Q}(s, s) = 1$.
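The two alternating MetricGAN objectives, (7) and (8), amount to a pair of MSE losses. The sketch below is a plain NumPy illustration under the normalization assumption $\mathcal{Q}(s, s) = 1$; the function names and the batched-array interface are assumptions, not the authors' code.

```python
import numpy as np

def discriminator_loss(d_clean, d_enh, q_enh):
    """Eq. (7): fit the auxiliary DNN's outputs to the true normalized OSQA
    scores; the clean reference is assumed to score Q(s, s) = 1."""
    return np.mean((d_clean - 1.0) ** 2 + (d_enh - q_enh) ** 2)

def generator_loss(d_enh):
    """Eq. (8): push the auxiliary DNN's score for the enhanced signal
    toward the best normalized score, 1."""
    return np.mean((d_enh - 1.0) ** 2)
```

Both losses vanish exactly when the auxiliary DNN predicts the true scores and the enhanced signal is judged perfect, which is the fixed point the alternating training aims for.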
It is known that the training of GANs is difficult and unstable. Thus, Fu et al. introduced several techniques, including spectral normalization [18], to stabilize the training of MetricGAN. However, as can be seen in Fig. 2 of their paper [11], the training is still unstable in its early stage (at around 20–50 iterations).
3 Proposed Method
Recently, several pieces of literature have reported that, based on the connection between GANs and RL, the training of GANs can be stabilized by adopting RL techniques [19]. In addition, several techniques for stabilizing DNN training have been proposed in other areas [19, 20, 21, 22, 23]. Therefore, we consider that the training of MetricGAN, or the function-approximation-based approach in general, can be stabilized by adopting such techniques. In this section, we describe the techniques that succeeded in stabilizing the training.
3.1 Techniques for Stabilizing DNN Training
3.1.1 Cost function for OSQA score approximation
First, we consider a cost function better than (7). Since (7) consists only of terms for $s$ and $\hat{s}$, $\mathcal{D}_\phi$ can know the OSQA scores of the clean and current-output signals only; i.e., it cannot know the score of the noisy signal $x$. It is then difficult to approximate the score for noisier signals, as illustrated in Fig. 2(a). Such a lack of information on $x$ may be a cause of the training instability. Therefore, we additionally supervise the OSQA score of the noisy signal as
$\mathcal{J}_{\mathcal{D}}(\phi) = \frac{1}{B_{\mathcal{D}}} \sum_{b=1}^{B_{\mathcal{D}}} \left[ \left( \mathcal{D}_\phi(s_b, s_b) - \mathcal{Q}(s_b, s_b) \right)^2 + \left( \mathcal{D}_\phi(\hat{s}_b, s_b) - \mathcal{Q}(\hat{s}_b, s_b) \right)^2 + \left( \mathcal{D}_\phi(x_b, s_b) - \mathcal{Q}(x_b, s_b) \right)^2 \right]$   (9)
where $B_{\mathcal{D}}$ is the mini-batch size for $\mathcal{D}_\phi$'s training, and $s_b$, $x_b$, and $\hat{s}_b$ are the $b$-th samples of the clean, noisy, and output signals in the mini-batch, respectively. Since three points of the score (clean, current output, and noisy) are given to $\mathcal{D}_\phi$, we can expect $\mathcal{D}_\phi$ to learn the score surface better, as in Fig. 2(b), and to know which signals are of lower quality.
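The proposed three-point supervision of (9) extends the two-term loss of (7) with a term for the noisy input. As before, this is a NumPy sketch with illustrative names, not the paper's implementation; it assumes the clean reference scores 1 under the normalized OSQA.

```python
import numpy as np

def proposed_discriminator_loss(d_clean, d_enh, d_noisy, q_enh, q_noisy):
    """Eq. (9): supervise the auxiliary DNN at three points per utterance --
    clean, current output, and noisy -- so it also learns where the
    lower-quality signals lie on the score surface."""
    return np.mean((d_clean - 1.0) ** 2
                   + (d_enh - q_enh) ** 2
                   + (d_noisy - q_noisy) ** 2)
```

Compared with (7), the extra `(d_noisy - q_noisy)**2` term anchors the low-quality end of the approximated score surface, which is the stabilization argument made above.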
3.1.2 Training techniques
We also adopted some techniques in the training procedure. Here, we provide a recipe covering a pre-training method, the optimizer, and the mini-batch-size selection.
(i) Pre-training: Training $\mathcal{M}_\theta$ with a differentiable cost function is easier than training it with OSQA scores. Although SDR does not reflect subjective sound quality, signals with higher SDR tend to have higher OSQA scores. Therefore, pre-training $\mathcal{M}_\theta$ using SDR should be effective. Thus, before the training by the OSQA score, we train $\mathcal{M}_\theta$ using the SDR-based cost function [6] defined as
$\mathcal{J}_{\mathrm{SDR}}(\theta) = -\frac{1}{B_{\mathcal{M}}} \sum_{b=1}^{B_{\mathcal{M}}} 10 \log_{10}\!\left( \frac{\| s_b \|^2}{\| s_b - \hat{s}_b \|^2 + \tau \| s_b \|^2} \right)$   (10)
where $\tau = 10^{-\alpha/10}$, $\alpha$ is a clipping parameter, $B_{\mathcal{M}}$ is the mini-batch size for $\mathcal{M}_\theta$'s training, and $s_b$ and $\hat{s}_b$ are the $b$-th samples of the clean and output signals in the mini-batch, respectively. After the pre-training of $\mathcal{M}_\theta$ using (10), $\mathcal{D}_\phi$ is also pre-trained using the cost function (9) with $\theta$ fixed. In the pre-training stage, the mini-batch size was set to 5 for both networks.
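The soft-clipped SDR used for pre-training can be sketched as follows. This is an illustrative NumPy rendering of the loss style described in [6], under the assumption that the clipping constant enters as $\tau = 10^{-\alpha/10}$; the function name and default $\alpha$ are ours, not the paper's.

```python
import numpy as np

def clipped_sdr_loss(s, s_hat, alpha=30.0):
    """Soft-clipped negative SDR in the style of Eq. (10): the constant
    tau = 10**(-alpha/10) caps the achievable SDR at alpha dB, so
    already-easy examples stop dominating the gradient."""
    tau = 10.0 ** (-alpha / 10.0)
    num = np.sum(s ** 2)
    den = np.sum((s - s_hat) ** 2) + tau * num
    return -10.0 * np.log10(num / den)
```

With a perfect estimate (`s_hat == s`) the loss saturates at exactly `-alpha` dB instead of diverging to minus infinity, which is the clipping behavior the text refers to.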
(ii) Optimizer: In MetricGAN, the adaptive moment estimation (Adam) optimizer [20] was used for both $\mathcal{M}_\theta$ and $\mathcal{D}_\phi$. However, it is known that the Adam optimizer may not improve generalization performance compared to stochastic gradient descent (SGD) [21]. Therefore, we use the SGD optimizer instead of Adam in the training stage.

(iii) Mini-batch size: Since the calculation of OSQA takes a long time, using a large mini-batch size is not practical. However, a small mini-batch size may degrade the approximation accuracy of the OSQA scores because the gradients become less accurate, which may result in unstable training. We experimentally found the smallest mini-batch size $B_{\mathcal{D}}$ that stabilizes the training of $\mathcal{D}_\phi$ in our experiment, and we use it for $\mathcal{D}_\phi$'s training. The choice of $B_{\mathcal{M}}$ is based on the literature of the actor–critic algorithm in RL, which is an optimization method for function-approximation-based cost functions. In that literature, it is known that the learning rate of the actor (here, $\mathcal{M}_\theta$) should not exceed that of the critic (here, $\mathcal{D}_\phi$) [22]. In addition, recent research has reported that decreasing the learning rate has the same effect as increasing the mini-batch size in some situations [23]; i.e., the learning rate and mini-batch size are inversely related in this context. Thus, we set the mini-batch sizes so that $B_{\mathcal{M}} \geq B_{\mathcal{D}}$ based on this inverse relation.
3.2 Implementation
The proposed training procedure is summarized in Fig. 1. First, $\mathcal{M}_\theta$ is pre-trained using the SDR-based cost function (10), and then $\mathcal{D}_\phi$ is also pre-trained with $\theta$ fixed, as in the left-hand side of Fig. 1. Next, we train $\mathcal{D}_\phi$ and $\mathcal{M}_\theta$ alternately as follows. The parameter set $\phi$ of $\mathcal{D}_\phi$ is updated ten times to decrease (9). Then, the parameter set $\theta$ of $\mathcal{M}_\theta$ is updated twenty times to decrease the following cost function:
$\mathcal{J}_{\mathcal{M}}(\theta) = \frac{1}{B_{\mathcal{M}}} \sum_{b=1}^{B_{\mathcal{M}}} \left( \mathcal{D}_\phi(\hat{s}_b, s_b) - 1 \right)^2$   (11)
Note that while updating $\theta$, $\phi$ is fixed, and vice versa.
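The alternating schedule above (ten auxiliary-network updates, then twenty mask-estimator updates, each with the other network frozen) can be sketched as a generic driver loop. The function name and callback interface are illustrative assumptions; the actual update steps would minimize (9) and (11), respectively.

```python
def alternating_training(update_d, update_m, n_outer, d_steps=10, m_steps=20):
    """Alternating schedule from Sec. 3.2: per outer iteration, run d_steps
    updates of the auxiliary score network, then m_steps updates of the
    mask estimator, each while the other network's parameters stay fixed."""
    for _ in range(n_outer):
        for _ in range(d_steps):
            update_d()        # minimize Eq. (9); mask estimator frozen
        for _ in range(m_steps):
            update_m()        # minimize Eq. (11); score network frozen
```

Freezing one network per phase is what makes each inner loop an ordinary supervised update rather than a simultaneous two-player step, which is part of the stabilization argument.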
4 Experiments
We conducted three experiments: (i) a verification experiment investigating the stabilization effect of the proposed method, (ii) an objective experiment on a public dataset, and (iii) a subjective experiment. In all experiments, we utilized the VoiceBank-DEMAND dataset constructed by Valentini et al. [24], which is openly available and frequently used in the literature on DNN-based speech enhancement [5, 7, 8, 12, 11]. The training and test sets consist of 28 and 2 speakers (11,572 and 824 utterances), respectively, and all signals were downsampled to 16 kHz. As the example OSQA to be approximated by the DNN, PESQ was considered in this section.
4.1 Experimental Setups
The DNN for estimating the TF mask, $\mathcal{M}_\theta$, consisted of two 2-D convolutional neural networks (CNNs) followed by a 1×1 CNN, two linear layers, and two bidirectional long short-term memory (BLSTM) layers. This is a standard architecture in DNN-based speech enhancement [6]. The input of the DNN was the log-amplitude spectrogram of the observed signal, of size $F \times K$. The kernel size, stride, and padding of both 2-D CNNs were (5, 15), (1, 1), and (2, 7), respectively. The numbers of output channels of the first and second 2-D CNNs were 30 and 60, respectively. Then, the number of channels was reduced to 1 by the 1×1 CNN. The output of the 1×1 CNN was reshaped by the first linear layer and passed to the BLSTM layers, whose forward and backward outputs were concatenated. The last linear layer converted this concatenated output into the final representation, which was split into two matrices used as the real and imaginary parts of the complex-valued TF mask. The spectrogram was multiplied by the estimated complex TF mask and transformed back to the time domain as in (1), where the STFT parameters, frame shift and window size, were set to 128 and 512 samples, respectively, with the Hann window. The DNN for approximating PESQ, $\mathcal{D}_\phi$, was the same network as used in MetricGAN.

In the pre-training stage, $\mathcal{M}_\theta$ and $\mathcal{D}_\phi$ were trained for 200 and 290 epochs, respectively, where each epoch included 1,000 randomly selected utterances. We used the Adam optimizer with an initial learning rate of 0.001, which was fixed for the first 100 epochs and then decreased linearly by a factor of 100. In the training stage, SGD was used as the optimizer with a learning rate of 0.001, and the training was concluded after 2,000 updates.
4.2 Objective Experiments
For the objective evaluation, we conducted two experiments. First, we verified whether the PESQ score on the test dataset improved stably as the number of iterations increased. Figure 3 shows the relationship between the number of iterations and the PESQ score on the test dataset, where each mini-batch in the training stage was randomly chosen using a seed value. As Fig. 3 shows, the PESQ score improved stably with the number of iterations for all seed values used for choosing the mini-batches in the training stage. This result indicates that the proposed method succeeded in stabilizing the training of DNN-based speech enhancement for increasing the OSQA score.
Method          PESQ  CSIG  CBAK  COVL
--------------------------------------
Noisy           1.97  3.35  2.44  2.63
SEGAN [5]       2.16  3.48  2.94  2.80
MMSE-GAN [7]    2.53  3.80  3.12  3.14
DFLoss [8]      --    3.86  3.33  3.22
SERGAN [12]     2.62  --    --    --
MetricGAN [11]  2.86  3.99  3.18  3.42
Pre-train       2.73  3.73  2.55  3.20
Ours            2.93  3.72  2.64  3.29
Next, we compared the proposed method with conventional methods on the same dataset and metrics, where CSIG, CBAK, and COVL are popular predictors of the mean opinion score (MOS) of the target-signal distortion, background-noise interference, and overall speech quality, respectively [25]. In this evaluation, we considered SEGAN [5], MMSE-GAN [7], Deep Feature Loss (DFLoss) [8], SERGAN [12], and MetricGAN [11] as reference conventional methods because they have been evaluated on the same dataset [24]. In addition to the proposed method, the pre-trained network without the subsequent OSQA-based training (Pre-train) was also evaluated to isolate the performance improvement brought by the proposed training for the OSQA score.

Table 1 summarizes the evaluated scores, where the proposed method achieved the state-of-the-art PESQ score compared to the conventional methods. As the proposed method can be considered an improved version of MetricGAN, the higher PESQ score indicates its effectiveness. Although the proposed method did not outperform the conventional methods on the other metrics, this is an expected result because the proposed method in this experiment was specialized to PESQ and did not take the other scores into account. To improve these scores simultaneously, a mixed OSQA as in [10] should be designed. However, since no OSQA perfectly correlates with sound quality, improving every OSQA score is not the essential goal for improving actual subjective quality, at least under the current standards.
4.3 Subjective Evaluation
To confirm whether the proposed method improves actual subjective quality, we conducted a subjective experiment. The proposed method was compared with SEGAN [5] and DFLoss [8] because speech samples of these methods are openly available on the webpage [26]. We selected 20 samples from the Tranche 1–4 data (low-SNR conditions) on the webpage. The speech samples of the proposed method used in this test are also openly available¹. Nine participants evaluated the sound quality of the output signals according to ITU-T P.835 [27]. The participants listened to each test sample three times and rated the quality of the speech only (SMOS), the noise only (NMOS), and the overall signal (GMOS). By evaluating SMOS and NMOS before GMOS, we avoided the situation where only one of the speech or noise affects the GMOS score.

¹https://miyazakilab.github.io/icassp2020_demo/
Figure 4 shows the results of the subjective evaluation. The proposed method outperformed SEGAN in all factors and outperformed DFLoss in all factors except NMOS. In addition, statistically significant differences were observed in SMOS and GMOS between the proposed method and the others according to a paired one-sided test. This result suggests that the proposed method improves not only objective metrics but also subjective quality.
5 Conclusion
In this study, we proposed the use of stabilization techniques with the function-approximation-based method for stably improving OSQA scores. To stably train the auxiliary DNN approximating the OSQA, we designed a new cost function (9) and adopted the training techniques described in Sec. 3.1.2. Experiments showed that the proposed method (i) was able to stably train the DNN, (ii) achieved the state-of-the-art PESQ score on a public dataset, and (iii) obtained better subjective quality than the conventional methods. Thus, we conclude that the proposed method is effective for (i) stabilizing the training of DNN-based speech enhancement for increasing an OSQA score and (ii) improving the subjective quality of the enhanced signal.
References
 [1] T. Yoshioka, N. Ito, M. Delcroix, A. Ogawa, K. Kinoshita, M. Fujimoto, C. Yu, W. J. Fabian, M. Espi, T. Higuchi, S. Araki, and T. Nakatani, “The NTT CHiME-3 System: Advances in Speech Enhancement and Recognition for Mobile Multi-Microphone Devices,” in Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2015.
 [2] K. Kobayashi, Y. Haneda, K. Furuya, and A. Kataoka, “A Hands-Free Unit with Noise Reduction by using Adaptive Beamformer,” IEEE Trans. on Consum. Electron., 2008.
 [3] D. L. Wang and J. Chen, “Supervised Speech Separation Based on Deep Learning: An Overview,” IEEE/ACM Trans. on Audio, Speech, and Lang. Process., 2018.
 [4] H. Erdogan, J. R. Hershey, S. Watanabe, and J. Le Roux, “Phase-Sensitive and Recognition-Boosted Speech Separation using Deep Recurrent Neural Networks,” Proc. of Int. Conf. on Acoust., Speech, and Signal Process. (ICASSP), 2015.
 [5] S. Pascual, A. Bonafonte, and J. Serra, “SEGAN: Speech Enhancement Generative Adversarial Network,” Proc. of Interspeech, 2017.
 [6] H. Erdogan and T. Yoshioka, “Investigations on Data Augmentation and Loss Functions for Deep Learning Based Speech-Background Separation,” Proc. of Interspeech, 2018.
 [7] M. H. Soni, N. Shah, and H. A. Patil, “Time-Frequency Masking-Based Speech Enhancement Using Generative Adversarial Network,” Proc. of Int. Conf. on Acoust., Speech, and Signal Process. (ICASSP), 2018.
 [8] F. G. Germain, Q. Chen, and V. Koltun, “Speech Denoising with Deep Feature Losses,” Proc. of Interspeech, 2019.
 [9] Y. Koizumi, K. Niwa, Y. Hioka, K. Kobayashi, and Y. Haneda, “DNN-based Source Enhancement Self-Optimized by Reinforcement Learning using Sound Quality Measurements,” Proc. of Int. Conf. on Acoust., Speech, and Signal Process. (ICASSP), 2017.
 [10] Y. Koizumi, K. Niwa, Y. Hioka, K. Kobayashi, and Y. Haneda, “DNN-based Source Enhancement to Increase Objective Sound Quality Assessment,” IEEE/ACM Trans. on Audio, Speech, and Lang. Process., 2018.
 [11] S. W. Fu, C. F. Liao, Y. Tsao, and S. D. Lin, “MetricGAN: Generative Adversarial Networks based Black-box Metric Scores Optimization for Speech Enhancement,” Proc. of Int. Conf. on Machine Learning (ICML), 2019.
 [12] D. Baby and S. Verhulst, “SERGAN: Speech Enhancement Using Relativistic Generative Adversarial Networks with Gradient Penalty,” Proc. of Int. Conf. on Acoust., Speech, and Signal Process. (ICASSP), 2019.
 [13] Y. Koizumi, K. Yatabe, M. Delcroix, Y. Masuyama, and D. Takeuchi, “Speech Enhancement using Self-Adaptation and Multi-Head Self-Attention,” Proc. of Int. Conf. on Acoust., Speech, and Signal Process. (ICASSP), 2020.
 [14] D. Takeuchi, K. Yatabe, Y. Koizumi, Y. Oikawa, and N. Harada, “Real-Time Speech Enhancement using Equilibriated RNN,” Proc. of Int. Conf. on Acoust., Speech, and Signal Process. (ICASSP), 2020.
 [15] D. Takeuchi, K. Yatabe, Y. Koizumi, Y. Oikawa, and N. Harada, “Invertible DNN-based Nonlinear Time-Frequency Transform for Speech Enhancement,” Proc. of Int. Conf. on Acoust., Speech, and Signal Process. (ICASSP), 2020.
 [16] International Telecommunication Union, “Perceptual Evaluation of Speech Quality (PESQ): An Objective Method for End-to-End Speech Quality Assessment of Narrowband Telephone Networks and Speech Codecs,” ITU-T Recommendation P.862, 2001.
 [17] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative Adversarial Nets,” Proc. of Neural Information Processing Systems (NIPS), 2014.
 [18] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida, “Spectral Normalization for Generative Adversarial Networks,” Proc. of Int. Conf. on Learning Representations (ICLR), 2018.
 [19] D. Pfau and O. Vinyals, “Connecting Generative Adversarial Networks and Actor-Critic Methods,” NIPS Workshop on Adversarial Training, 2016.
 [20] D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” Proc. of Int. Conf. on Learning Representations (ICLR), 2015.
 [21] A. C. Wilson, R. Roelofs, M. Stern, N. Srebro, and B. Recht, “The Marginal Value of Adaptive Gradient Methods in Machine Learning,” Proc. of Neural Information Processing Systems (NIPS), 2017.
 [22] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous Methods for Deep Reinforcement Learning,” Proc. of Int. Conf. on Machine Learning (ICML), 2016.
 [23] S. L. Smith, P. J. Kindermans, C. Ying, and Q. V. Le, “Don’t Decay the Learning Rate, Increase the Batch Size,” Proc. of Int. Conf. on Learning Representations (ICLR), 2018.
 [24] C. Valentini-Botinhao, X. Wang, S. Takaki, and J. Yamagishi, “Investigating RNN-based Speech Enhancement Methods for Noise-Robust Text-to-Speech,” Proc. of 9th ISCA Speech Synth. Workshop (SSW), 2016.
 [25] Y. Hu and P. C. Loizou, “Evaluation of Objective Quality Measures for Speech Enhancement,” IEEE Trans. on Audio, Speech, and Lang. Process., 2008.
 [26] https://ccrma.stanford.edu/~francois/SpeechDenoisingWithDeepFeatureLosses/
 [27] International Telecommunication Union, “Subjective test methodology for evaluating speech communication systems that include noise suppression algorithm,” ITUT Recommendation P.835, 2003.