1 Introduction
Single-channel speech enhancement (SE) strives to reduce the noise component of noisy speech in order to increase the intelligibility and perceived quality of the speech component [12]. It has been widely used as a pre-processing step in speech-related applications such as automatic speech recognition (ASR). In recent years, deep learning has been employed in single-channel SE with great success. Some works proposed deep-learning-based masks to filter out the noise from the noisy input features
[17, 20, 21]. However, the mask approach rests on an unrealistic assumption: that the noise is strictly additive and that the scale of the masked signal matches the clean target. The feature mapping approach was therefore introduced to address this problem by training a mapping network that directly transforms the noisy features into clean ones [23, 13, 14, 4, 22, 2]. Generative Adversarial Networks (GAN) [7], Adversarial Training [5] and Cycle-Consistent Adversarial Networks (CycleGAN) [25] have drawn attention in the deep learning community since they demonstrate better generalization: a discriminator network encourages the model to produce noise-invariant features, which eases the mismatch between source and target domains. These approaches have also been applied to SE [11, 15, 16]. Meng et al. [16] proposed the cycle-consistent speech enhancement (CSE) and adversarial cycle-consistent speech enhancement (ACSE) models, in which an additional inverse mapping network reconstructs the noisy features from the enhanced ones and a cycle-consistency constraint minimizes the reconstruction loss. CSE and ACSE are designed for training with both parallel and non-parallel data. The main architectural difference is that ACSE uses two discriminator networks to distinguish the enhanced and noised features from the clean and noisy features [16]. Evaluated on the CHiME-3 dataset, the CSE model achieves a reasonable relative word error rate (WER) improvement, while the ACSE model, which is trained with non-parallel data, is less effective.
The generative multi-adversarial network (GMAN), a framework that extends GAN to multiple discriminators, was proposed in computer vision [3]. GMAN was shown to produce higher quality samples in a fraction of the iterations when measured by a pairwise GMAM metric [3]. Hosseini-Asl et al. [9] proposed a Multi-Discriminator CycleGAN (MD-CycleGAN), similar to GMAN, for unsupervised non-parallel speech domain adaptation. The MD-CycleGAN model employs multiple independent discriminators on the power spectrogram, each in charge of a different frequency band [9]. The effectiveness of MD-CycleGAN was demonstrated on a CTC end-to-end ASR system for gender adaptation [24]. However, its input and generated features are power spectrograms, which are not a common input representation for state-of-the-art ASR systems (hybrid or end-to-end). Besides, MD-CycleGAN does not take into account the identity loss, which is meaningful in non-stationary noise scenarios; we explain this in more detail in Section 2.

In this paper, we contribute to previous work in the following aspects: 1) to the best of our knowledge, we are the first to propose a framework based on multi-discriminator generative models [3, 9] for feature mapping in speech enhancement whose feature mapping model uses the same input as the ASR system (log-Mel filterbank); 2) we propose to train multiple generators and multiple discriminators on homogeneous data subsets to improve the WER; 3) we show that our models outperform strong baselines with up to 10.03% and 14.09% relative WER improvement on the CHiME-3 development and evaluation sets, respectively, without retraining the ASR system, which was trained on clean WSJ data.
2 Method
2.1 CycleGAN
The goal of CycleGAN is to solve the image-to-image translation task by learning a mapping $G_A$ between an input image from a source domain $A$ and an output image from a target domain $B$ without using paired training data [25]. The mapping is learnt such that the distribution of images from $G_A(A)$ is indistinguishable from the distribution of $B$ using an adversarial loss. Because this mapping is highly under-constrained, [25] coupled it with an inverse mapping $G_B: B \to A$ and introduced a cycle-consistency loss to push $G_B(G_A(a)) \approx a$ (and vice versa). The full objective is as follows:

$\mathcal{L}_{full} = \mathcal{L}_{G_A} + \mathcal{L}_{G_B} + \lambda_{idt}(\mathcal{L}_{idt_A} + \mathcal{L}_{idt_B}) + \lambda_{cyc}(\mathcal{L}_{cyc_A} + \mathcal{L}_{cyc_B})$   (1)
where $\lambda_{idt}$ and $\lambda_{cyc}$ are tunable parameters. The individual losses are defined as follows:
$\mathcal{L}_{G_A} = \mathbb{E}_{a \sim A}\big[(D_A(G_A(a)) - 1)^2\big]$   (2)

$\mathcal{L}_{G_B} = \mathbb{E}_{b \sim B}\big[(D_B(G_B(b)) - 1)^2\big]$   (3)

$\mathcal{L}_{idt_A} = \mathbb{E}_{b \sim B}\big[\lVert G_A(b) - b \rVert_1\big]$   (4)

$\mathcal{L}_{idt_B} = \mathbb{E}_{a \sim A}\big[\lVert G_B(a) - a \rVert_1\big]$   (5)

$\mathcal{L}_{cyc_A} = \mathbb{E}_{a \sim A}\big[\lVert G_B(G_A(a)) - a \rVert_1\big]$   (6)

$\mathcal{L}_{cyc_B} = \mathbb{E}_{b \sim B}\big[\lVert G_A(G_B(b)) - b \rVert_1\big]$   (7)
The loss functions for the two discriminators $D_A$ and $D_B$ are

$\mathcal{L}_{D_A} = \mathbb{E}_{b \sim B}\big[(D_A(b) - 1)^2\big] + \mathbb{E}_{a \sim A}\big[D_A(G_A(a))^2\big]$   (8)

$\mathcal{L}_{D_B} = \mathbb{E}_{a \sim A}\big[(D_B(a) - 1)^2\big] + \mathbb{E}_{b \sim B}\big[D_B(G_B(b))^2\big]$   (9)
Note that the outputs of $G_A$ and $G_B$ are also called fake_B and fake_A, respectively. Compared with the objective of the ACSE model [16], the main difference is that ACSE does not take into account the identity losses in equations (4) and (5). However, in a non-stationary noise scenario the noise does not occur in every frame. Adding the identity losses to the objective can therefore help the model reproduce the original frame when the input feature already lies in the output domain, that is, $G_A(b) \approx b$ and $G_B(a) \approx a$.
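For concreteness, the following is a minimal PyTorch sketch of how the generator objective in equations (1)-(7) can be assembled. It assumes least-squares adversarial losses and L1 cycle/identity losses, as in the reference CycleGAN implementation; the module objects `G_A`, `G_B`, `D_A`, `D_B` and the batch tensors are placeholders.

```python
import torch
import torch.nn.functional as F

def generator_objective(G_A, G_B, D_A, D_B, real_a, real_b,
                        lambda_idt=0.5, lambda_cyc=10.0):
    """Sketch of the full CycleGAN generator objective (Eq. 1).

    real_a: batch of noisy features (domain A); real_b: clean features (domain B).
    G_A: A -> B (noisy -> clean); G_B: B -> A; D_A judges fake_B; D_B judges fake_A.
    """
    fake_b = G_A(real_a)          # enhanced features ("fake_B")
    fake_a = G_B(real_b)          # re-noised features ("fake_A")

    # Least-squares adversarial losses for the generators (Eq. 2, 3)
    pred_b = D_A(fake_b)
    pred_a = D_B(fake_a)
    loss_g_a = F.mse_loss(pred_b, torch.ones_like(pred_b))
    loss_g_b = F.mse_loss(pred_a, torch.ones_like(pred_a))

    # Identity losses (Eq. 4, 5): a clean frame fed to G_A should stay unchanged
    loss_idt_a = F.l1_loss(G_A(real_b), real_b)
    loss_idt_b = F.l1_loss(G_B(real_a), real_a)

    # Cycle-consistency losses (Eq. 6, 7)
    loss_cyc_a = F.l1_loss(G_B(fake_b), real_a)
    loss_cyc_b = F.l1_loss(G_A(fake_a), real_b)

    return (loss_g_a + loss_g_b
            + lambda_idt * (loss_idt_a + loss_idt_b)
            + lambda_cyc * (loss_cyc_a + loss_cyc_b))
```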
Figure 1 shows the network architecture of CycleGAN for converting a noisy signal to a clean signal. $A$ and $B$ are the domains of the noisy and the clean signal, respectively. The two generators are $G_A: A \to B$ and $G_B: B \to A$. The goal of the two discriminators ($D_A$ and $D_B$) is to predict whether a sample comes from the actual distribution ('real') or was produced by a generator ('fake'), given the input features.

[Figure 1: CycleGAN architecture for converting noisy features (domain A) into clean features (domain B).]
2.2 Multi-Discriminators CycleGAN
The model is based on multi-discriminator generative models [3, 9] and CycleGAN [25]. According to equation (2), the loss function for generator $G_A$ is formed from the prediction of discriminator $D_A$. That is to say, if the discriminator is imperfect, it cannot guide the generator towards the optimal region during training. Therefore, we propose to strengthen the discriminator by introducing multiple discriminators, e.g. $D_{A_1}$ and $D_{A_2}$, to judge $G_A(a)$ (a.k.a. fake_B), as shown in Figure 2. Each discriminator is responsible for a subregion of the log-Mel filterbank feature. The full objective is the same as in equation (1), but equation (2) is adapted to
$\mathcal{L}_{G_A} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{E}_{a \sim A}\big[(D_{A_i}(m_i \odot G_A(a)) - 1)^2\big]$   (10)
where $N$ is the total number of discriminators that judge the output of $G_A$ and $m_i$ is the mask assigned to discriminator $D_{A_i}$. The loss function for each discriminator in equation (8) is adapted to
$\mathcal{L}_{D_{A_i}} = \mathbb{E}_{b \sim B}\big[(D_{A_i}(m_i \odot b) - 1)^2\big] + \mathbb{E}_{a \sim A}\big[D_{A_i}(m_i \odot G_A(a))^2\big]$   (11)
The mask $m_i$ marks out a subset of the bins in the log-Mel filterbank feature; e.g., with two discriminators, one mask covers the 0th to 20th bins and the other covers the 21st to 40th bins:
$m_1[k] = 1 \text{ if } 0 \le k \le 20, \text{ else } 0$   (12)

$m_2[k] = 1 \text{ if } 21 \le k \le 40, \text{ else } 0$   (13)

$\tilde{b}_i = m_i \odot b$   (14)

$\widetilde{fake\_B}_i = m_i \odot G_A(a)$   (15)

where $k$ indexes the Mel bins, and $\tilde{b}_i$ and $\widetilde{fake\_B}_i$ are the masked real and generated features seen by discriminator $D_{A_i}$.
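A minimal PyTorch sketch of equations (10)-(15), assuming binary masks over the 40 Mel bins and least-squares losses; the band boundaries below follow the two-discriminator example above and are illustrative only.

```python
import torch
import torch.nn.functional as F

def band_masks(num_bins=40, bands=((0, 21), (21, 40))):
    """One binary mask m_i over the Mel axis per discriminator D_{A_i} (Eq. 12-13)."""
    masks = []
    for lo, hi in bands:
        m = torch.zeros(num_bins)
        m[lo:hi] = 1.0            # keep only this frequency band
        masks.append(m)           # broadcasts over (batch, channel, time, freq)
    return masks

def multi_disc_generator_loss(discs, masks, fake_b):
    """Eq. (10): average the adversarial feedback of N masked discriminators."""
    losses = []
    for D_i, m_i in zip(discs, masks):
        pred = D_i(fake_b * m_i)  # D_{A_i} only sees its own band of fake_B
        losses.append(F.mse_loss(pred, torch.ones_like(pred)))
    return sum(losses) / len(losses)

def masked_discriminator_loss(D_i, m_i, real_b, fake_b):
    """Eq. (11): D_{A_i} judges its band of the clean features against fake_B."""
    pred_real = D_i(real_b * m_i)
    pred_fake = D_i(fake_b.detach() * m_i)
    return (F.mse_loss(pred_real, torch.ones_like(pred_real))
            + F.mse_loss(pred_fake, torch.zeros_like(pred_fake)))
```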
Furthermore, we can increase the number of generators, and thereby the number of discriminators, by dividing the training data into different homogeneous subsets, e.g. by splitting the data by gender and by type of noise, as sketched below.

[Figure 2: Multi-Discriminators CycleGAN, in which multiple discriminators $D_{A_i}$ judge different frequency bands of the enhanced features (fake_B).]
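A minimal sketch of such a split, assuming each utterance carries gender and noise-type labels as in the CHiME-3 metadata; the field names are hypothetical.

```python
from collections import defaultdict

NOISE_TYPES = ("BUS", "CAF", "PED", "STR")

def split_by_gender_and_noise(utterances):
    """Group utterances into homogeneous subsets, one per (gender, noise type).

    Each subset is later used to train its own Multi-Discriminators CycleGAN
    (eight subsets for architecture A3; splitting by gender alone gives the two
    subsets of architecture A2). `utt["gender"]` and `utt["noise"]` are assumed
    metadata fields.
    """
    subsets = defaultdict(list)
    for utt in utterances:
        if utt["noise"] in NOISE_TYPES:
            subsets[(utt["gender"], utt["noise"])].append(utt)
    return subsets  # e.g. 2 genders x 4 noise types = 8 subsets
```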
3 Dataset
CHiME-3 was developed as part of the 3rd CHiME Speech Separation and Recognition Challenge. It contains approximately 342 hours of English speech and transcripts from noisy environments and 50 hours of noisy environment audio. The CHiME challenges focus on distant-microphone ASR in real-world environments [1]. Table 1 shows that the training set contains 8738 noisy utterances: 1600 real noisy utterances from two male and two female speakers in the four noisy environments, and 7138 simulated noisy utterances based on the WSJ0 SI-84 training set in the four noisy environments [1]. The development set is composed of 410 (real noisy speech) × 4 (environments) + 410 (simulated noisy speech) × 4 (environments) = 3280 utterances. The test set consists of 330 (real noisy speech) × 4 (environments) + 330 (simulated noisy speech) × 4 (environments) = 2640 utterances [1].
Table 1: CHiME-3 training, development and test sets used in this work.

| Train set | Noisy type | # of utterances | hr. |
|---|---|---|---|
| tr05_orig_clean | N/A | 7138 | 15 |
| tr05_real_noisy | BUS, CAF, PED, STR | 1600 | 2 |
| tr05_simu_noisy | BUS, CAF, PED, STR | 7138 | 15 |

| Dev. / Test set | Noisy type | # of utterances | hr. |
|---|---|---|---|
| dt05_real_noisy | BUS, CAF, PED, STR | 1640 | 2.74 |
| dt05_simu_noisy | BUS, CAF, PED, STR | 1640 | 2.89 |
| et05_real_noisy | BUS, CAF, PED, STR | 1640 | 2.17 |
| et05_simu_noisy | BUS, CAF, PED, STR | 1640 | 2.27 |
4 Experimental Setup
4.1 Baseline ASR
The ASR system is trained on 80 hours of Wall Street Journal data [6] using the default Kaldi [19] recipe: an HMM-GMM model is first trained to obtain alignments, and a time delay neural network (TDNN) [18] is then trained on augmented (speed- and volume-perturbed) features. The input features consist of a 40-bin log-Mel filterbank and 100-dimensional i-vectors. In the decoding stage, we use a fairly large dictionary and a pruned trigram language model, and re-score with a four-gram language model.

4.2 Multi-Discriminators CycleGAN
This section describes the network structure of the Multi-Discriminators CycleGAN, the experimental setup, and how we train it on the CHiME-3 dataset. The experimental setup and code are available at https://github.com/chiayuli/SEWork.
The generators and discriminators are trained with 3200 hours of data containing the real noisy training set (tr05_real_noisy) and the clean data (tr05_orig_clean). The features are 40-bin log-Mel filterbanks computed with a Hamming window. The context window of each frame is 5 frames on each side, so the input size for the generator and the discriminator is (1, 11, 40). The generator network is a 9-block ResNet [8] whose last convolutional layer has 64 filters. The discriminator network contains three convolutional layers with normalization layers; the first convolutional layer has 64 filters. The batch size is 512 and the learning rate is 0.0002 with learning-rate decay every 50 epochs. The optimizer is Adam [10] and the model is trained for 200 epochs. $\lambda_{idt}$ and $\lambda_{cyc}$ are set to 0.5 and 10 in all experiments.

We design three architectures in this experiment. In the first architecture (A1), we train one single generator using the Multi-Discriminators CycleGAN with the whole training set as shown in Figure 3, i.e. 3200 hours of unpaired speech data covering both genders and all types of noise. For this generator, we train up to three discriminators; e.g., CycleGAN-1G+2DA denotes one generator and two discriminators.
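As a minimal sketch of this input pipeline, the following assumes torchaudio's Kaldi-compatible filterbank front end; parameters not stated above (frame length, dither, etc.) are left at library defaults and are assumptions.

```python
import torch
import torch.nn.functional as F
import torchaudio

def logmel_with_context(wav_path, num_mel_bins=40, context=5):
    """Return (num_frames, 1, 2*context+1, num_mel_bins) CycleGAN inputs, i.e. (T, 1, 11, 40)."""
    waveform, sample_rate = torchaudio.load(wav_path)
    # 40-bin log-Mel filterbank with a Hamming window (Kaldi-style front end)
    fbank = torchaudio.compliance.kaldi.fbank(
        waveform, sample_frequency=sample_rate,
        num_mel_bins=num_mel_bins, window_type="hamming")       # (T, 40)
    # Replicate-pad the edges, then stack +-5 neighbouring frames around each frame
    padded = F.pad(fbank.unsqueeze(0).unsqueeze(0),
                   (0, 0, context, context), mode="replicate")  # (1, 1, T+10, 40)
    padded = padded.squeeze(0).squeeze(0)
    frames = padded.unfold(0, 2 * context + 1, 1)               # (T, 40, 11)
    return frames.permute(0, 2, 1).unsqueeze(1)                 # (T, 1, 11, 40)
```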
In the second architecture (A2), we split the data by gender and train two Multi-Discriminators CycleGANs, one on each subset, simultaneously, as shown in Figure 4. An architecture name with the prefix "CycleGAN-2G" means that there are two CycleGANs. CycleGAN-2G+2DA means that there are two discriminators ($D_A$), one per $G_A$; CycleGAN-2G+4DA means that there are four discriminators, two per $G_A$.
In the last architecture (A3), we split the data not only by gender but also by type of noise. We thereby obtain eight subsets of training data: female & BUS, female & CAF, female & PED, female & STR, male & BUS, male & CAF, male & PED and male & STR, as shown in Figure 5. We train one Multi-Discriminators CycleGAN on each of these subsets simultaneously. The prefix "CycleGAN-8G" means that there are eight Multi-Discriminators CycleGANs in the structure. Hence, CycleGAN-8G+8DA means that there are eight discriminators ($D_A$), one per $G_A$, and CycleGAN-8G+24DA means that there are 24 discriminators, three per $G_A$.

[Figure 3: Architecture A1, one Multi-Discriminators CycleGAN trained on the whole training set.]

[Figure 4: Architecture A2, two Multi-Discriminators CycleGANs, one per gender.]

[Figure 5: Architecture A3, eight Multi-Discriminators CycleGANs, one per gender and noise-type subset.]
5 Results
Tables 2 and 3 show the average WERs of the ASR system, broken down by type of noise, on the CHiME-3 development and evaluation sets. Note that the ASR system is trained on clean WSJ data and remains unchanged, while the input features are enhanced with our proposed Multi-Discriminators CycleGAN. In both tables, the highlighted WERs are those that are better than the WERs without SE, and the WERs in bold are the best among the models with the same number of generators.
5.1 One Discriminator vs. Multi-Discriminators
On the development and evaluation sets, the A1 model with two discriminators for the single generator $G_A$ performs better than the one with only one discriminator, and the model with three discriminators performs better than the ones with one or two. The same observation holds for the A2:CycleGAN-2G and A3:CycleGAN-8G models. These results suggest the importance of having multiple discriminators.
However, for both the A1:CycleGAN-1G and A2:CycleGAN-2G models, the WERs are not better than the baseline (no SE) for all types of noise: they are only better on the BUS and STR noisy data, but not on the CAF and PED noisy data. The best A2:CycleGAN-2G model on the development set is A2:CycleGAN-2G+4DA with a WER of 31.44%, and the best on the evaluation set is A2:CycleGAN-2G+6DA with a WER of 57.53%. The best WERs overall are obtained with the A3:CycleGAN-8G+24DA model, i.e. eight generators with three discriminators each. It achieves WERs of 28.66% and 52.80% on the development and evaluation sets, respectively, and outperforms the baseline (no SE) for all types of noisy data. These results suggest the importance of having multiple generators.
Table 2: WER (%) on the CHiME-3 development set, averaged and per noise type.

| Arch. | Method | AVG | BUS | CAF | PED | STR |
|---|---|---|---|---|---|---|
|  | No SE | 32.14 | 38.88 | 37.79 | 18.93 | 32.94 |
| A1 | CycleGAN-1G+1DA | 35.25 | 37.82 | 43.79 | 22.23 | 31.97 |
| A1 | CycleGAN-1G+2DA | 33.53 | 35.39 | 39.47 | 22.84 | 31.03 |
| A1 | CycleGAN-1G+3DA | 29.45 | 34.12 | 33.94 | 17.92 | 28.45 |
| A2 | CycleGAN-2G+2DA | 34.05 | 37.11 | 45.22 | 23.40 | 30.47 |
| A2 | CycleGAN-2G+4DA | 31.44 | 36.14 | 39.20 | 20.22 | 30.17 |
| A2 | CycleGAN-2G+6DA | 32.64 | 37.17 | 39.57 | 21.76 | 32.06 |
| A3 | CycleGAN-8G+8DA | 35.12 | 38.16 | 37.17 | 23.06 | 38.06 |
| A3 | CycleGAN-8G+16DA | 30.19 | 33.26 | 35.91 | 19.68 | 30.48 |
| A3 | CycleGAN-8G+24DA | 28.66 | 32.00 | 32.54 | 18.04 | 28.52 |
| A3 | CycleGAN-8G+32DA | 29.51 | 33.60 | 33.44 | 18.69 | 28.79 |
Table 3: WER (%) on the CHiME-3 evaluation set, averaged and per noise type.

| Arch. | Method | AVG | BUS | CAF | PED | STR |
|---|---|---|---|---|---|---|
|  | No SE | 61.46 | 82.91 | 62.33 | 57.21 | 43.41 |
| A1 | CycleGAN-1G+1DA | 60.33 | 76.48 | 67.39 | 57.01 | 40.42 |
| A1 | CycleGAN-1G+2DA | 57.64 | 71.49 | 64.14 | 54.28 | 40.64 |
| A1 | CycleGAN-1G+3DA | 53.29 | 68.61 | 58.46 | 49.48 | 36.61 |
| A2 | CycleGAN-2G+2DA | 61.54 | 74.78 | 69.87 | 62.14 | 39.39 |
| A2 | CycleGAN-2G+4DA | 60.88 | 73.73 | 67.54 | 59.98 | 42.27 |
| A2 | CycleGAN-2G+6DA | 57.53 | 70.48 | 65.17 | 54.97 | 39.50 |
| A3 | CycleGAN-8G+8DA | 60.62 | 72.87 | 62.42 | 58.71 | 44.79 |
| A3 | CycleGAN-8G+16DA | 55.30 | 66.37 | 60.29 | 51.03 | 37.80 |
| A3 | CycleGAN-8G+24DA | 52.80 | 65.84 | 56.54 | 46.49 | 36.91 |
| A3 | CycleGAN-8G+32DA | 54.01 | 67.00 | 56.29 | 49.44 | 36.70 |
5.2 One Generator vs. Multi-Generators
Figure 6 compares the WERs of SE systems with different numbers of generators, given the same number of discriminators per generator. The first panel of Figure 6 shows the WER comparison between A1:CycleGAN-1G+1DA, A2:CycleGAN-2G+2DA and A3:CycleGAN-8G+8DA; these models all have one discriminator per $G_A$. The second panel compares A1:CycleGAN-1G+2DA, A2:CycleGAN-2G+4DA and A3:CycleGAN-8G+16DA, which all have two discriminators per $G_A$. The comparison between the models having three discriminators per $G_A$, namely A1:CycleGAN-1G+3DA, A2:CycleGAN-2G+6DA and A3:CycleGAN-8G+24DA, is shown in the third panel.
For the one-discriminator (1DA) and two-discriminator (2DA) models in the first two panels of Figure 6, the model with two generators (A2:CycleGAN-2G) performs better than the one with only one generator (A1:CycleGAN-1G) in the BUS, CAF and PED noisy scenarios, and the model with eight generators (A3:CycleGAN-8G) performs slightly better than A2:CycleGAN-2G. For the three-discriminator (3DA) models in the last panel, A3:CycleGAN-8G performs slightly better than A2:CycleGAN-2G, and both the A2:CycleGAN-2G and A3:CycleGAN-8G models perform much better than A1:CycleGAN-1G. In sum, our results show that adding more generators and training them on well-split data, e.g. by gender and noise type, plays an important role in improving the overall performance.

[Figure 6: WER comparison between models with one, two, and three discriminators per generator (one panel per setting).]
6 Analysis & Discussion
6.1 WER, Insertion, Deletion and Substitution
Table 4 compares WER, insertion, deletion, substitution and correction counts on the CHiME-3 development set with and without Multi-Discriminators CycleGAN SE.
For all of the A1:CycleGAN-1G, A2:CycleGAN-2G and A3:CycleGAN-8G models, the model with only one discriminator per generator has two to three times more insertions than the one with multiple discriminators. This implies that such models do not remove the noisy signal very well, so the clean-trained ASR system mis-recognizes noise as speech. The number of deletions from the multi-discriminator models is also lower than that of the single-discriminator models, which means that more words are recognized by the ASR system with Multi-Discriminators CycleGAN SE.
Table 4: WER (%), insertions (INS), deletions (DEL), substitutions (SUB) and COR on the CHiME-3 development set with and without SE.

| Arch. | Method | WER | INS | DEL | SUB | COR |
|---|---|---|---|---|---|---|
|  | No SE | 32.14 | 170 | 6054 | 2491 | 8715 |
| A1 | CycleGAN-1G+1DA | 35.25 | 723 | 5894 | 2942 | 9559 |
| A1 | CycleGAN-1G+3DA | 29.45 | 240 | 5309 | 2437 | 7986 |
| A2 | CycleGAN-2G+2DA | 34.05 | 779 | 5372 | 3083 | 9234 |
| A2 | CycleGAN-2G+4DA | 31.44 | 303 | 5611 | 2611 | 8525 |
| A3 | CycleGAN-8G+8DA | 35.12 | 718 | 5940 | 2867 | 9525 |
| A3 | CycleGAN-8G+24DA | 28.66 | 398 | 4255 | 3120 | 7773 |
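For reference, insertion, deletion and substitution counts of this kind come from a Levenshtein alignment between the reference and the hypothesis word sequences. The short sketch below reproduces that bookkeeping; it mirrors standard ASR scoring but is not the exact Kaldi scoring script.

```python
def error_counts(ref, hyp):
    """Return (substitutions, deletions, insertions) from a Levenshtein alignment."""
    n, m = len(ref), len(hyp)
    # dp[i][j] = (cost, subs, dels, ins) for aligning ref[:i] with hyp[:j]
    dp = [[(0, 0, 0, 0)] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = (i, 0, i, 0)                 # delete all reference words
    for j in range(1, m + 1):
        dp[0][j] = (j, 0, 0, j)                 # insert all hypothesis words
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if ref[i - 1] == hyp[j - 1]:
                cand = [dp[i - 1][j - 1]]       # match, no extra cost
            else:
                c, s, d, k = dp[i - 1][j - 1]
                cand = [(c + 1, s + 1, d, k)]   # substitution
            c, s, d, k = dp[i - 1][j]
            cand.append((c + 1, s, d + 1, k))   # deletion
            c, s, d, k = dp[i][j - 1]
            cand.append((c + 1, s, d, k + 1))   # insertion
            dp[i][j] = min(cand)
    _, subs, dels, ins = dp[n][m]
    return subs, dels, ins

# WER = (#subs + #dels + #ins) / #reference words. For the Table 5 example:
print(error_counts("OUR CUSTOMERS WANT THEM".split(),
                   "OUR CUSTOMERS".split()))    # -> (0, 2, 0): two deletions
```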
6.2 ASR outputs & Spectrograms
In this section, we compare cherry-picked ASR outputs on the development set with and without Multi-Discriminators CycleGAN SE, together with the corresponding log-Mel spectrograms. Table 5 shows the ASR output for a BUS real noisy utterance without SE, with A1:CycleGAN-1G SE, and with our best SE model (A3:CycleGAN-8G+24DA). As indicated by Table 4, adding more CycleGANs and more discriminators ($D_{A_i}$) per generator helps to ease the high-deletion problem.
Figures 7, 8 and 9 compare the log-Mel spectrograms of the original clean utterance, the noisy utterance without SE, and the noisy utterance enhanced with A1:CycleGAN-1G+1DA, A1:CycleGAN-1G+3DA and A3:CycleGAN-8G+24DA SE. Between 0 and 1000 milliseconds, our SE models reduce the noise and help the ASR system recognize the words successfully. However, from 1000 to 1500 milliseconds, the utterance with Multi-Discriminators CycleGAN SE may still contain some noisy signal, so the ASR could not recognize the two words "WANT THEM" well. Besides, A1:CycleGAN-1G+1DA may not transform the noisy signal into a clean signal well, so the ASR system mis-recognizes /A/ as /O/, and the subsequent predictions ("GIVE UP") have nothing in common with the two words "WANT THEM" with respect to pronunciation.
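A minimal matplotlib sketch of how such stacked log-Mel comparisons can be drawn from 40-bin feature matrices (clean, noisy, and the enhanced outputs of different models); the variable names in the commented usage are placeholders.

```python
import matplotlib.pyplot as plt

def plot_logmel_comparison(feature_list, titles):
    """Stack several (frames x 40) log-Mel feature matrices for visual comparison."""
    fig, axes = plt.subplots(len(feature_list), 1, sharex=True,
                             figsize=(8, 2 * len(feature_list)))
    for ax, feats, title in zip(axes, feature_list, titles):
        ax.imshow(feats.T, origin="lower", aspect="auto")  # frequency on y, time on x
        ax.set_title(title)
        ax.set_ylabel("Mel bin")
    axes[-1].set_xlabel("Frame index")
    plt.tight_layout()
    plt.show()

# Hypothetical usage matching the panels of Figures 7-9:
# plot_logmel_comparison(
#     [clean_feats, noisy_feats, enhanced_1g1da, enhanced_1g3da, enhanced_8g24da],
#     ["clean", "noisy (no SE)", "A1: 1G+1DA", "A1: 1G+3DA", "A3: 8G+24DA"])
```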
Table 6 shows another example of ASR output on a CAF (SNR = 10) noisy development utterance with and without Multi-Discriminators CycleGAN SE. The A3:CycleGAN-8G model performs much better than the A1:CycleGAN-1G models, which produce many substitutions and insertions, although the A3:CycleGAN-8G model has more deletions than the A1:CycleGAN-1G models in this example.
Table 5: ASR output for a BUS real noisy utterance with and without SE.

| Arch. | Method | ASR output |
|---|---|---|
|  | Time | 0-1000 ms: OUR CUSTOMERS; 1000-1500 ms: WANT THEM |
|  | Reference | OUR CUSTOMERS WANT THEM |
|  | No SE |  |
| A1 | CycleGAN-1G+1DA | OUR CUSTOMERS WON'T GIVE UP |
| A1 | CycleGAN-1G+3DA | OUR CUSTOMERS |
| A3 | CycleGAN-8G+24DA | OUR CUSTOMERS |
[Figures 7, 8 and 9: log-Mel spectrograms of the original clean utterance, the noisy utterance without SE, and the noisy utterance enhanced with A1:CycleGAN-1G+1DA, A1:CycleGAN-1G+3DA and A3:CycleGAN-8G+24DA.]
Table 6: ASR output for a CAF (SNR = 10) noisy utterance with and without SE.

| Arch. | Method | ASR output |
|---|---|---|
|  | Reference | WHAT ABOUT IN SOUTH AFRICA ITSELF |
|  | No SE | WHAT ABOUT A SATISFACTORY |
| A1 | CycleGAN-1G+1DA | BUT WHAT ABOUT A SOUTH AFRICAN GETS OUT |
| A1 | CycleGAN-1G+3DA | WHAT ABOUT A SAD FACT GETS OUT |
| A3 | CycleGAN-8G+24DA | WHAT ABOUT SOUTH AFRICANS |
7 Conclusion
In this work, we investigate the performance of ASR in terms of WER using CycleGAN-based feature mapping and our novel extension with multiple generators and multiple discriminators. Our experimental results show that multiple generators trained on well-split subsets are better than one generator trained on all the data. Moreover, models with multiple discriminators ($D_{A_i}$) per generator ($G_A$) improve the average WER. The best model, A3:CycleGAN-8G+24DA, improves the WER in all four noisy scenarios and achieves relative WER improvements of 10.03% on the development set and 14.09% on the evaluation set.
References
- [1] (2017) CHiME3 LDC2017S24. USB flash drive. Philadelphia: Linguistic Data Consortium.
- [2] (2017) Improving mask learning based speech enhancement system with restoration layers and residual connection. In Proceedings of Interspeech.
- [3] (2017) Generative multi-adversarial networks. In Proceedings of ICLR.
- [4] (2014) Speech feature denoising and dereverberation via deep autoencoders for noisy reverberant speech recognition. In Proceedings of ICASSP.
- [5] (2016) Domain-adversarial training of neural networks. In Journal of Machine Learning Research.
- [6] (1993) CSR-I (WSJ0) complete LDC93S6A. Web download. Philadelphia: Linguistic Data Consortium.
- [7] (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672-2680.
- [8] (2015) Deep residual learning for image recognition. In Proceedings of CVPR.
- [9] (2018) A multi-discriminator CycleGAN for unsupervised non-parallel speech domain adaptation. In Proceedings of Interspeech.
- [10] (2015) Adam: a method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations.
- [11] (2019) Noise adaptive speech enhancement using domain adversarial training. In Proceedings of Interspeech.
- [12] (2013) Speech enhancement: theory and practice. CRC Press.
- [13] (2013) Speech enhancement based on deep denoising autoencoder. In Proceedings of Interspeech.
- [14] (2012) Recurrent neural networks for noise reduction in robust ASR. In Proceedings of Interspeech.
- [15] (2018) Adversarial feature-mapping for speech enhancement. In Proceedings of Interspeech.
- [16] (2018) Cycle-consistent speech enhancement. In Proceedings of Interspeech.
- [17] (2013) Ideal ratio mask estimation using deep neural networks for robust speech recognition. In Proceedings of ICASSP.
- [18] (2015) A time delay neural network architecture for efficient modeling of long temporal contexts. In Proceedings of Interspeech.
- [19] (2011) The Kaldi speech recognition toolkit. In Proceedings of ASRU.
- [20] (2014) On training targets for supervised speech separation. In IEEE/ACM Transactions on Audio, Speech, and Language Processing.
- [21] (2015) Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR. In Proceedings of LVA/ICA.
- [22] (2014) Single-channel speech separation with memory-enhanced recurrent neural networks. In Proceedings of ICASSP.
- [23] (2015) A regression approach to speech enhancement based on deep neural networks. In IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23.
- [24] (2018) Improving end-to-end speech recognition with policy learning. In Proceedings of ICASSP (arXiv abs/1712.07101).
- [25] (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2242-2251.