1 Introduction
Speech enhancement aims at improving the quality of speech contaminated by additive noise. It has many applications, including noise cancelling, audio editing, and preprocessing for speech recognition, to name a few. Denoting the noisy speech as $y$, we have

(1) $y[t] = x[t] + n[t]$

where $x$ and $n$ are respectively the clean speech and the noise, with $t$ being the time index. Speech enhancement tries to recover the clean speech $x$ (for simplicity, we omit the time index for the audio signal hereafter) from the noisy speech $y$. Traditionally, Spectral Subtraction[specsub] and Wiener Filtering[wiener]
are two popular speech enhancement algorithms. The Spectral Subtraction approach estimates the noise spectrum and subtracts it from the noisy speech; however, the characteristics of the noise are not trivial to approximate, and the noise spectrum might not be separable from the clean speech in the frequency domain. Wiener Filtering instead tries to recover the clean speech by estimating the ratio between the spectra of the clean speech and the noisy speech. The inaccurate approximation of this ratio limits the wide application of Wiener Filtering in practice. Recently, deep learning based algorithms have shown promising results in speech enhancement
[segan, deepfeat, mgan, rnnse, phase, waveform_utter]. These approaches can be further categorized into generative models[segan, mgan, add1, add2] and discriminative models[reg, multinoise, lstmEN, autoencoder]. In a discriminative model, the deep neural network (DNN) takes the noisy speech $y$ as the input and tries to predict the clean speech $x$, i.e., the DNN directly models the conditional distribution $p(x|y)$. In a generative model, the conditional GAN [gan, cgan] is the prevalent method for speech enhancement, which models the distribution $p(x|y, z)$, with the additional variable $z$ being latent. There are two key components in a GAN, respectively called the generator $G$ and the discriminator $D$. During training, the parameters of the generator are tuned so that its prediction can fool the discriminator. Meanwhile, the discriminator also evolves to become more capable of differentiating between the synthetic data from the generator and the real data. The training process can be written as follows:

(2) $\min_D \; \frac{1}{2}\,\mathbb{E}_{x,y}\big[(D(x, y) - 1)^2\big] + \frac{1}{2}\,\mathbb{E}_{z,y}\big[D(G(z, y), y)^2\big]$

(3) $\min_G \; \frac{1}{2}\,\mathbb{E}_{z,y}\big[(D(G(z, y), y) - 1)^2\big] + \lambda\, \mathcal{L}_{reg}(G(z, y), x)$
where $\mathcal{L}_{reg}$, usually called the regularization function, is a traditional loss function such as the $\ell_1$ loss or the cosine similarity loss. Both [mgan] and [segan] have shown that the GAN approach works well only if the traditional loss term is added in Eq. 3. This observation has also been confirmed in image synthesis[pixel2pixel] using conditional GAN, where the training example is a pair of images instead of audios. Note that $\mathcal{L}_{reg}$ can by itself be used as the loss function in discriminative models. The training of a discriminative model can be written as follows:

(4) $\min_{\theta} \; \mathcal{L}_{reg}(f_{\theta}(y), x)$
where $f_{\theta}$ is the function parametrized by $\theta$ that maps the noisy speech $y$ to the enhanced speech $\hat{x}$. Comparing Eq. 4 with Eq. 3, there is an additional term in Eq. 3, which is the adversarial loss. In this regard, learning the generator in GAN can be seen as the training of a discriminative model with a dynamic loss, as the parameter of the adversarial loss keeps changing during training. Here the effect of $z$ is neglected, as the one-to-one mapping from $y$ to $x$ can still be learnt without $z$ in GAN, producing deterministic outputs. The stochasticity of the prediction from the generator is still an area of research, particularly in the one-to-one mapping problem[pixel2pixel, z1, z2].
In most previous work, the loss function is either computed by aggregating the $\ell_1$/$\ell_2$ distance[segan, cnnse, rnnse] for each component of the audio, or computed by evaluating the entire predicted audio as a whole in high dimensions, such as the cosine distance loss[cosineSim, phase]. However, as far as we are aware, there is no work in speech enhancement that optimizes the loss from coarse to fine during training. The coarse-to-fine strategy first optimizes the loss on the high dimensional audio sequences and then gradually reduces the granularity of the sequences used to compute the loss. It not only takes advantage of the fast convergence at coarse granularity but also reduces saturation and fine-tunes the details as the granularity gets finer[c2f1, c2f2]. Thus the contributions of this paper are as follows:
- We propose a general coarse-to-fine optimization for speech enhancement which can be applied to both generative and discriminative models.

- We extend the idea of coarse-to-fine to the adversarial loss in the training of the generator of GAN and propose the dynamic perceptual loss.

- Experimental results show that the coarse-to-fine optimization outperforms a single granularity in quantitative metrics. Meanwhile, the proposed coarse-to-fine optimization and dynamic perceptual loss achieve a new state of the art for both discriminative and generative models.
2 Basics
2.1 Cosine Similarity Loss
Cosine similarity loss[cosineSim] is widely used to measure the similarity between two vectors. For a prediction $\hat{x}$ and ground truth $x$, it is defined as:

(5) $\mathcal{L}_{cos}(\hat{x}, x) = -\dfrac{\langle \hat{x}, x \rangle}{\|\hat{x}\|_2 \, \|x\|_2}$

It is chosen in our algorithm because 1) it shows better accuracy than the $\ell_2$ loss even with a single granularity; 2) it can be evaluated at different granularities, as opposed to the $\ell_1$/$\ell_2$ loss, which is essentially computed by aggregating the loss in every single dimension of $x$ (the finest granularity). As shown in Eq. 5, the cosine similarity loss actually computes the cosine value of the angle between two vectors. As the dimensionality of the vector increases, with the same cosine similarity loss between the prediction $\hat{x}$ and the ground truth $x$, the number of feasible solutions of $\hat{x}$ also increases, which adds uncertainty to the prediction. Therefore, if we optimize Eq. 5 at different granularities from high dimension (coarse) to low dimension (fine), the resulting prediction will be more constrained and better resemble the true audio sequence.
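A minimal sketch of the loss of Eq. 5, written here with a negative sign so that minimization aligns the vectors (the sign convention and the epsilon guard are our assumptions; the paper only states that the cosine of the angle is computed):

```python
import numpy as np

def cosine_loss(pred, target, eps=1e-8):
    """Negative cosine of the angle between prediction and ground truth.

    Returns -1 for parallel vectors, 0 for orthogonal ones, +1 for
    anti-parallel ones, so minimizing it pulls pred toward target.
    """
    num = float(np.dot(pred, target))
    den = float(np.linalg.norm(pred) * np.linalg.norm(target)) + eps
    return -num / den

v = np.array([1.0, 2.0, 3.0])
same = cosine_loss(v, v)        # close to -1: identical direction
opposite = cosine_loss(v, -v)   # close to +1: opposite direction
```

Note that, unlike an elementwise $\ell_2$ loss, this value depends on the whole vector at once, which is exactly why the granularity of slicing matters.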
2.2 Time Domain vs Frequency Domain
Either the waveform or the spectrum of the audio can serve as the input and the output of the neural network. They are in essence the same, as the Fourier Transform and the inverse Fourier Transform can be represented by convolutional layers with fixed weights. However, in practice we observed that training directly from the raw waveform takes longer than training from the spectrum to converge to a reasonably good result. We also tried to encode the Fourier Transform as a convolutional layer in the DNN, initializing the corresponding layer with the Fourier Transform coefficients but allowing them to be further fine-tuned, with the input and output both being raw waveforms. We found the resulting accuracy is no better than the case where we just use fixed Fourier Transform coefficients. Furthermore, the training is observed to be more stable in the frequency domain than in the time domain; occasionally, training in the time domain could end with unacceptably poor results. Note that even if the spectrum is used as the input/output, the loss can still be computed in the time domain and backpropagation can be used to train the network, because all operations in the inverse Fourier Transform are differentiable. Thus in this paper, the DNN which maps the noisy speech $y$ to the enhanced speech $\hat{x}$ is learnt in the frequency domain. More precisely, we first apply the Short Time Fourier Transform (STFT)[stft] to the noisy speech and obtain its spectrum. Then, instead of directly predicting the ground truth spectrum from the input spectrum, the network predicts a complex-valued ratio mask constrained by a bounding function[cmask]. The spectrum of the predicted audio can be simply derived by multiplying the input spectrum with the mask. By using the complex-valued mask as the output of the DNN, not only the magnitude but also the phase information is restored from the noisy input audio. In inference, the spectrum is converted to the waveform via the inverse STFT. In training, the cosine similarity loss is computed on the waveform, following the inverse STFT. In all our experiments, the time window of the STFT is 1024 samples with stride 256. The number of frequency bins is effectively 513, as the Fourier Transform is applied on real numbers.
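A NumPy sketch of this masking pipeline with the stated settings (window 1024, stride 256, 513 bins): STFT analysis, a complex ratio mask applied to the noisy spectrum, and overlap-add inverse STFT. The constant stand-in mask and the tanh bounding of its magnitude are illustrative assumptions; in the paper the mask is predicted by the DNN.

```python
import numpy as np

WIN, HOP = 1024, 256
window = np.hanning(WIN)  # analysis/synthesis window

def stft(x):
    """Frame the signal (window 1024, stride 256) and take the real FFT."""
    frames = [x[s:s + WIN] * window for s in range(0, len(x) - WIN + 1, HOP)]
    return np.stack([np.fft.rfft(f) for f in frames], axis=1)  # (513, T)

def istft(S, length):
    """Overlap-add inverse with window-squared normalization."""
    out, norm = np.zeros(length), np.zeros(length)
    for j in range(S.shape[1]):
        s = j * HOP
        out[s:s + WIN] += np.fft.irfft(S[:, j]) * window
        norm[s:s + WIN] += window ** 2
    return out / np.maximum(norm, 1e-8)

rng = np.random.default_rng(1)
y = rng.standard_normal(16000)  # 1 s of synthetic "noisy" audio at 16 kHz
Y = stft(y)

# Complex ratio mask of the spectrum's shape; a large constant value here
# demonstrates the bounding: tanh compresses the magnitude toward 1.
raw = np.full_like(Y, 10.0 + 0.0j)
mask = np.tanh(np.abs(raw)) * np.exp(1j * np.angle(raw))
x_hat = istft(Y * mask, len(y))  # near-identity mask -> x_hat ~ y
```

Because the inverse STFT above is just FFTs and sums, each step is differentiable, which is what allows the waveform-domain loss to backpropagate through it.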
2.3 Network Architecture
Encoder-decoder style networks are popularly used in both generative models[segan] and discriminative models[cnnse, phase]. The network architecture in our approach also follows the encoder-decoder style, enabling our proposed approach to be directly compared with other methods. For the discriminative model, we use the same network architecture as cRMn with 20 layers in [phase], a state-of-the-art discriminative approach, to produce the enhanced speech. Fig. 1 visualizes the structure of this network. For our generative model (conditional GAN), the generator is the same as the network in our discriminative model, while the discriminator takes the encoder part of the generator as the backbone, followed by 2 additional convolution layers and one fully connected layer. Since the discriminator in a conditional GAN takes both the data to be classified and the conditional data as input, the backbone of the discriminator acts as a siamese network [deepface], and the additional convolutional layers and fully connected layer in our discriminator further combine them and classify whether the input data is fake or real. With the same notation as in Fig. 1, the additional convolution layers are used without batch norm; their kernel shapes and stride are shown in Fig. 2. The output of the fully connected layer is a one-dimensional scalar. One benefit of the siamese network is that it allows efficiently computing the so-called dynamic perceptual loss, which will be discussed in the next section. Fig. 2 shows the architecture of the discriminator network in our generative model.

3 Coarse to Fine Optimization
The speech signal $x$ can be seen either as a single high dimensional vector or as the concatenation of multiple low dimensional vectors. The size of the vector determines the granularity at which we divide the signal $x$. As mentioned earlier, computing the $\ell_1$/$\ell_2$ loss at different granularities makes no difference, as that loss is evaluated on every component (finest granularity) of $x$ and then averaged. However, the granularity matters when the cosine similarity loss is computed! Denote the dimension of the vectors of the $i$-th granularity as $d_i$, with $d_i > d_j$ if $i < j$, where $n_i$ is the number of vectors of the $i$-th granularity and $n_i d_i$ equals the dimension of the original signal. When $i = 1$, the granularity is the coarsest and the dimension of that granularity is just the dimension of the original signal, if we set $n_1 = 1$. When $i = K$, the original vector is divided into $n_K$ low dimensional vectors of the finest granularity.
In the training, we optimize the loss function from the coarsest granularity to the finest granularity. At each granularity, the loss function can be written as

(6) $\mathcal{L}_i = \dfrac{1}{n_i} \sum_{j=1}^{n_i} \mathcal{L}_{cos}\big(s(\hat{x}, j, d_i),\; s(x, j, d_i)\big)$

where $\mathcal{L}_{cos}$ is defined in Eq. 5 and $s(\cdot, j, d_i)$ is the slicing operation that extracts the $j$-th slice of dimension $d_i$. The optimization for a particular granularity completes when the change of the loss is small or the number of iterations reaches a maximum. In this way, we reduce the variance incurred when the loss is computed only in high dimension and make the predicted audio better resemble the ground truth. In discriminative models, the coarse-to-fine optimization can be directly applied in the training. In generative models, this strategy can also be deployed if we use the cosine similarity loss as the regularization term in Eq. 3.
The coarse-to-fine strategy is not limited to optimizing the cosine similarity loss. It can be extended to problems where the objective to be optimized is not at a fine granularity, but the prediction requires finer granularity. Following this philosophy, we propose the dynamic perceptual loss. In a GAN, the discriminator only gives a confidence score indicating whether the input audio is fake or real. In practice, when we optimize the generator by minimizing Eq. 3, the first term enforces the prediction from the generator to look real, i.e., the confidence score from the discriminator to be 1. However, a single score might not be enough to supervise the generator to generate a 'real' audio, as the spatial information is lost. Instead, the dynamic perceptual loss enforces not only the confidence score of the discriminator, but also the deep features at different resolutions in the intermediate layers to be similar to those of the real audio. Denote the deep feature map in layer $l$ as $\phi_l(\cdot)$ and the loss as $\mathcal{L}$. The dynamic perceptual loss ($\mathcal{L}_{dp}$) in layer $l$ can be written as

(7) $\mathcal{L}_{dp}^{(l)} = \mathcal{L}\big(\phi_l(\hat{x}),\, \phi_l(x)\big)$

Since the siamese network is used in the discriminator of our generative model, it is straightforward to compute both the deep features and the confidence score within the same network. Note that if we use other networks instead of a siamese network, like the one used in SEGAN[segan], the conditional data and the data to be classified are concatenated at the beginning of the network, which prevents us from computing the deep features for either the real audio or the fake audio without concatenating the conditional data.
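A toy sketch of Eq. 7: intermediate activations of a (here random, untrained) two-layer "discriminator backbone" are matched between the enhanced and the real audio. The layer sizes, the weights, and the use of a mean-squared distance for $\mathcal{L}$ are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((64, 128)) * 0.1  # toy layer-1 weights
W2 = rng.standard_normal((32, 64)) * 0.1   # toy layer-2 weights

def deep_features(x):
    """Return the intermediate feature maps phi_l(x) for each layer l."""
    h1 = np.maximum(0.0, W1 @ x)   # ReLU feature map of layer 0
    h2 = np.maximum(0.0, W2 @ h1)  # ReLU feature map of layer 1
    return [h1, h2]

def dynamic_perceptual_loss(x_hat, x, layer):
    """Eq. (7) with an assumed mean-squared distance between deep features."""
    f_hat = deep_features(x_hat)[layer]
    f_real = deep_features(x)[layer]
    return float(np.mean((f_hat - f_real) ** 2))
```

Because both inputs pass through the same weights, this is exactly what the siamese backbone makes cheap: one forward pass per input yields every $\phi_l$ needed for the schedule over layers.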
Method              | CSIG        | CBAK        | COVL        | PESQ   | SSNR
--------------------|-------------|-------------|-------------|--------|-------
Wiener[wiener]      | 3.23        | 2.68        | 2.67        | 2.22   | 5.07
SEGAN[segan]        | 3.48        | 2.94        | 2.80        | 2.16   | 7.73
WaveNet[wavenetSE]  | 3.62        | 3.23        | 2.98        | N/A    | N/A
MMSE-GAN[cmask]     | 3.80        | 3.12        | 3.14        | 2.53   | N/A
Deep Loss[deepfeat] | 3.86 (3.79) | 3.33 (3.27) | 3.22 (3.14) | (2.51) | (9.86)
D[phase]            | 3.79        | 3.32        | 3.20        | 2.62   | 9.90
D+M                 | 3.94        | 3.35        | 3.33        | 2.73   | 9.40
G                   | 3.83        | 3.27        | 3.20        | 2.57   | 9.36
G+M                 | 3.94        | 3.33        | 3.31        | 2.67   | 9.50
G+M+P               | 4.00        | 3.34        | 3.34        | 2.69   | 9.40
4 Experiments
In this section, experimental results show that for both discriminative and generative models, the coarse-to-fine optimization improves on the current state-of-the-art algorithms. In particular, for generative models, the proposed dynamic perceptual loss further improves the accuracy obtained from optimizing the cosine similarity loss from coarse to fine.
4.1 Dataset and Metrics
4.1.1 Dataset
We evaluate our proposed algorithm on the speech enhancement dataset by Valentini et al.[dataset], which is widely used by other popular speech enhancement methods[segan, mgan, deepfeat, phase]. This dataset consists of 11572 mono audio samples for training and 824 mono audio samples for testing. The duration of the audio ranges from 1 second to 15 seconds, with the average being around 3 seconds. The speech is recorded at 48 kHz. In the training dataset, there are 10 different types of noise [noise] added to the clean speech at 4 signal-to-noise ratio (SNR) values: 15 dB, 10 dB, 5 dB and 0 dB. Thus the training dataset has 40 noisy conditions in total. In the testing dataset, there are 5 types of noise added to the speech at 4 SNR values: 17.5 dB, 12.5 dB, 7.5 dB and 2.5 dB. The 28 speakers [speakers] in the training dataset are different from the 2 speakers [speakers] in the testing dataset, and all of them are native English speakers.
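Noisy conditions like those above are produced by scaling noise to a target SNR before adding it to the clean signal. A minimal sketch (the sine "speech" and white noise are synthetic stand-ins, not dataset audio):

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so that clean + noise has the requested SNR in dB."""
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    # SNR(dB) = 10 * log10(p_clean / p_noise_scaled); solve for the scale.
    scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise

rng = np.random.default_rng(0)
fs = 16000
clean = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)  # 1 s synthetic "speech"
noise = rng.standard_normal(fs)
noisy = mix_at_snr(clean, noise, snr_db=0.0)          # a 0 dB condition
```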
4.1.2 Metrics
We use five objective metrics to evaluate and compare the quality of the speech enhanced by the proposed coarse-to-fine optimization. SSNR computes the segmental SNR in dB, with each segment typically clamped to the range from -10 dB to 35 dB. CSIG and CBAK [measure] respectively predict the Mean Opinion Score (MOS) of the signal distortion attending to the speech signal alone and of the background intrusiveness attending to the background noise alone. COVL [measure] computes the MOS of the overall signal quality. CSIG, CBAK and COVL are all measured on a scale from 1 to 5. PESQ [measure2], standing for perceptual evaluation of speech quality, is measured from -0.5 to 4.5. For all these metrics, the higher the measure, the better the quality of the enhanced speech. As is known, there is no single objective measure that correlates perfectly with subjective evaluations across different speech distortions[loizou]. Therefore, we take all the above metrics into account when evaluating the speech quality (SSNR is included because it is widely used, though its correlation with overall speech quality is low[measure]).
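Of the five metrics, SSNR is the simplest to state directly. A sketch with the per-frame clamp commonly used in the speech enhancement literature (the frame size and clamp bounds are conventional choices, not values from the paper):

```python
import numpy as np

def ssnr(clean, enhanced, frame=256, lo=-10.0, hi=35.0, eps=1e-10):
    """Mean per-frame SNR in dB, with each frame clamped to [lo, hi] dB."""
    vals = []
    for s in range(0, len(clean) - frame + 1, frame):
        x = clean[s:s + frame]
        e = x - enhanced[s:s + frame]  # per-frame residual error
        snr = 10.0 * np.log10((np.sum(x ** 2) + eps) / (np.sum(e ** 2) + eps))
        vals.append(float(np.clip(snr, lo, hi)))
    return float(np.mean(vals))
```

The clamp is what keeps near-silent frames (where the ratio explodes in either direction) from dominating the average.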
4.2 Discriminative Model
In this experiment, we compare our coarse-to-fine optimization with the state-of-the-art discriminative model [phase] using single granularity optimization. Without loss of generality (the author of [phase] confirmed there is a technical error in the accuracy reported in that paper; the actual best accuracy of their method is similar to previous state-of-the-art methods, e.g. [deepfeat]), we choose the network called cRMn in [phase] to be the network in our discriminative model, as shown in Fig. 1. The model is trained for 180 epochs with batch size 96 using the Adam [adam] optimizer. The initial learning rate is set to 0.0004 and is multiplied by 0.5 at epochs 40, 80 and 120. The weight decay is 0.0005. We downsample the input audio from 48 kHz to 16 kHz. During training, similar to [segan], we divide the original audio into overlapped slices, each approximately 1 second long. During testing, as in [segan], we divide the test utterance into non-overlapped slices and concatenate the results as the final enhanced speech for the whole duration. In the training, we compute the cosine similarity loss for both the signal and the background noise [phase, wavenetSE] so that Eq. 5 is sensitive to the scale change of the signal. The granularity on which we compute our loss starts from the entire duration of the speech and is halved every 20 epochs until the finest granularity is reached. The vertical black dashed lines in Fig. 3 indicate the moments we halve the granularity. From the result shown in Fig. 3, we can see our coarse-to-fine optimization steadily outperforms the single-granularity optimization.

4.3 Generative Model (GAN)
In the GAN experiment, the batch size is reduced to 64 due to the memory limit, and the number of training epochs is increased to 360, so that the effective number of epochs for generator training is 180, as the discriminator and the generator are trained alternately, one after the other [segan]. The learning rate for the discriminator is fixed at 0.0002. All other setups are the same as in the discriminative model training. In the training of the generator, the scalar coefficient for the regularization term is 40, while the coefficient for the adversarial loss is always 1. If the perceptual loss is considered, its balancing coefficient is 100. The coefficients are chosen in this way so that the scales of all the terms are almost the same. The regularization term for the generator is the cosine similarity loss instead of the $\ell_1$ loss widely used in other GAN methods[segan, cmask]. We add Gaussian noise with mean 0.0 and variance 0.01 between the encoder and the decoder of the generator.
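The weighting above can be summarized as a small helper (a sketch with the coefficients reported in this section; the function and argument names are ours):

```python
def generator_objective(adv_loss, cos_loss, perceptual_loss=None):
    """Weighted sum of the generator's loss terms.

    Weights follow the text: 1 for the adversarial term, 40 for the cosine
    regularization, and 100 for the dynamic perceptual loss when it is used.
    """
    total = 1.0 * adv_loss + 40.0 * cos_loss
    if perceptual_loss is not None:
        total += 100.0 * perceptual_loss
    return total
```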
As shown in Fig. 4, the plain GAN (G) means that the adversarial loss is the least-squares loss used in [segan] and the regularization is the cosine similarity loss. In the coarse-to-fine optimization (G+M), we optimize the regularization in a coarse-to-fine way, as in the discriminative model. Furthermore, we apply the coarse-to-fine strategy to the adversarial loss term, i.e., the dynamic perceptual loss (G+M+P). We start with the original loss computed in the last layer of the discriminator, as in G+M, and every 80 epochs we compute the loss on deep features of a different resolution between the real and the fake utterance. The particular layers from which the deep features are extracted are layers 9, 7, 5 and 3. Fig. 4 clearly shows that G+M+P outperforms G+M, which in turn is better than G. In Tab. 1, we compare our results with other popular speech enhancement algorithms. The accuracies of our models are always computed from the last iteration of training, without picking the best over the training history. Tab. 1 further shows the effectiveness of our proposed coarse-to-fine optimization and dynamic perceptual loss. The bold numbers highlight the best accuracy within either the discriminative models or the generative models, but not across both. As shown in Tab. 1, our discriminative model (D+M) and generative model (G+M+P) both outperform the corresponding state-of-the-art methods in most metrics. We also notice that the single granularity in the discriminative model outperforms coarse-to-fine in SSNR. This can be explained by the fact that SSNR and the single-granularity loss share the goal of measuring the difference over the entire speech, while SSNR does not correlate well with overall speech quality compared with the other metrics[measure].
5 Conclusion and Discussion
In this paper we proposed the coarse-to-fine strategy for optimizing the cosine similarity loss in both discriminative and generative models. Inspired by the coarse-to-fine idea, we further proposed the dynamic perceptual loss as the adversarial loss term in the generator training of GAN. Our experiments show the effectiveness of the proposed methods. In the future, we will look into the question of the better model choice for speech enhancement: discriminative or generative? How to combine the advantages of both in one model is still open. Dynamic perceptual loss might provide a direction, and it would also be interesting to see whether it can be generalized to other applications besides speech enhancement.