Improving Speech Recognition on Noisy Speech via Speech Enhancement with Multi-Discriminators CycleGAN

12/12/2021
by   Chia Yu Li, et al.

This paper presents our latest investigations on improving automatic speech recognition for noisy speech via speech enhancement. We propose a novel method named Multi-discriminators CycleGAN to reduce the noise of the input speech and thereby improve the automatic speech recognition performance. Our proposed method leverages the CycleGAN framework for speech enhancement without any parallel data and improves it by introducing multiple discriminators that each check a different frequency area. Furthermore, we show that training multiple generators on homogeneous subsets of the training data is better than training one generator on all the training data. We evaluate our method on the CHiME-3 data set and observe up to 10.03% and up to 14.09% relative WER improvement on the development and evaluation sets, respectively.


1 Introduction

Single-channel speech enhancement (SE) strives to reduce the noise component of noisy speech in order to increase the intelligibility and perceived quality of the speech component [12]. It has been widely used as a pre-processing step in speech-related applications such as automatic speech recognition (ASR). In the past years, deep learning has been employed in single-channel SE and has achieved great success. Some works proposed deep learning based masks to filter out the noise from the noisy input features [17, 20, 21]. However, the mask approach is based on an unrealistic presumption: the noise is strictly additive and the scale of the masked signal is the same as the clean target. Therefore, the feature mapping approach was introduced to deal with that problem by training a mapping network that directly transforms the noisy features into clean ones [23, 13, 14, 4, 22, 2].

Generative Adversarial Networks (GAN) [7], Adversarial Training [5] and Cycle-Consistent Adversarial Networks (CycleGAN) [25] have drawn attention in the deep learning community since they demonstrate better generalization by using a discriminator network that encourages the model to produce noise-invariant features, thereby easing the mismatch between source and target domain. These approaches have also been applied to SE [11, 15, 16]. Meng et al. [16] proposed the cycle-consistent speech enhancement (CSE) and adversarial cycle-consistent speech enhancement (ACSE) models, in which an additional inverse mapping network is introduced to reconstruct the noisy features from the enhanced ones. Furthermore, a cycle-consistent constraint is enforced to minimize the reconstruction loss. The CSE and ACSE models are designed for training with both parallel and non-parallel data. The main architectural difference is that ACSE uses two discriminator networks to distinguish the enhanced and the generated noisy features from the clean and the noisy features [16]. Evaluated on the CHiME-3 dataset, the CSE model achieves a reasonable relative word error rate (WER) improvement, while the ACSE model, which is trained with non-parallel data, is less effective.

The generative multi-adversarial network (GMAN), a framework that extends GAN to multiple discriminators, was proposed in computer vision [3]. GMAN produces higher quality samples in a fraction of the iterations when measured by a pairwise GMAN-type metric [3]. Hosseini-Asl et al. [9] proposed a Multi-Discriminators CycleGAN (MD-CycleGAN), similar to GMAN, for unsupervised non-parallel speech domain adaptation. The MD-CycleGAN model employs multiple independent discriminators on the power spectrogram, each in charge of a different frequency band [9]. The authors demonstrate the effectiveness of the MD-CycleGAN method on CTC end-to-end ASR with gender adaptation [24]. However, the input and the generated features are power spectrograms, which is not a common input for state-of-the-art ASR systems (hybrid or end-to-end). Moreover, MD-CycleGAN does not take into account the identity loss, which is meaningful in the non-stationary noise scenario; we explain this in more detail in Section 2.

In this paper, we contribute to previous work in the following aspects: 1) to the best of our knowledge, we are the first to propose a novel framework based on the Multi-Discriminators generative models [3, 9] for feature mapping in speech enhancement, where the feature mapping model uses the same input as the ASR system (log-Mel filterbank); 2) we propose to train multiple generators and multiple discriminators on homogeneous subsets of the data to further improve the WER; 3) we show that our models outperform strong baselines with up to 10.03% and up to 14.09% relative WER improvement on the CHiME-3 development and evaluation sets, respectively, without retraining the ASR system, which was trained on clean WSJ data.

2 Method

2.1 CycleGAN

The goal of CycleGAN is to solve the image-to-image translation task by learning a mapping G_{A→B} between an input image from a source domain A and an output image from a target domain B without using paired training data [25]. The mapping is learnt such that the distribution of the translated images G_{A→B}(x_A) is indistinguishable from the distribution of images in B using an adversarial loss. Because this mapping is highly under-constrained, [25] coupled it with an inverse mapping G_{B→A} and introduced a cycle-consistency loss to push G_{B→A}(G_{A→B}(x_A)) ≈ x_A (and vice versa). The full objective is as follows:

\mathcal{L}(G_{A\to B}, G_{B\to A}, D_A, D_B) = \mathcal{L}_{adv}(G_{A\to B}, D_A) + \mathcal{L}_{adv}(G_{B\to A}, D_B) + \lambda_{id}\big(\mathcal{L}_{id}(G_{A\to B}) + \mathcal{L}_{id}(G_{B\to A})\big) + \lambda_{cyc}\big(\mathcal{L}_{cyc}(G_{A\to B}, G_{B\to A}) + \mathcal{L}_{cyc}(G_{B\to A}, G_{A\to B})\big)   (1)

where λ_id and λ_cyc are tunable parameters. The other losses are defined as follows:

\mathcal{L}_{adv}(G_{A\to B}, D_A) = \mathbb{E}_{x_A \sim A}\big[(D_A(G_{A\to B}(x_A)) - 1)^2\big]   (2)
\mathcal{L}_{adv}(G_{B\to A}, D_B) = \mathbb{E}_{x_B \sim B}\big[(D_B(G_{B\to A}(x_B)) - 1)^2\big]   (3)
\mathcal{L}_{id}(G_{A\to B}) = \mathbb{E}_{x_B \sim B}\big[\lVert G_{A\to B}(x_B) - x_B \rVert_1\big]   (4)
\mathcal{L}_{id}(G_{B\to A}) = \mathbb{E}_{x_A \sim A}\big[\lVert G_{B\to A}(x_A) - x_A \rVert_1\big]   (5)
\mathcal{L}_{cyc}(G_{A\to B}, G_{B\to A}) = \mathbb{E}_{x_A \sim A}\big[\lVert G_{B\to A}(G_{A\to B}(x_A)) - x_A \rVert_1\big]   (6)
\mathcal{L}_{cyc}(G_{B\to A}, G_{A\to B}) = \mathbb{E}_{x_B \sim B}\big[\lVert G_{A\to B}(G_{B\to A}(x_B)) - x_B \rVert_1\big]   (7)

The loss functions for the two discriminators D_A (which judges the enhanced features fake_B against real clean features) and D_B (which judges fake_A against real noisy features) are

\mathcal{L}_{D_A} = \mathbb{E}_{x_B \sim B}\big[(D_A(x_B) - 1)^2\big] + \mathbb{E}_{x_A \sim A}\big[D_A(G_{A\to B}(x_A))^2\big]   (8)
\mathcal{L}_{D_B} = \mathbb{E}_{x_A \sim A}\big[(D_B(x_A) - 1)^2\big] + \mathbb{E}_{x_B \sim B}\big[D_B(G_{B\to A}(x_B))^2\big]   (9)

Note that the outputs of G_{A→B} and G_{B→A} are also called fake_B and fake_A, respectively. Compared with the objective of the ACSE model [16], the main difference is that ACSE does not take into account the identity losses shown in equations (4) and (5). However, in the non-stationary noise scenario, the noise does not occur in every frame. Therefore, adding the identity loss to the objective encourages the model to reproduce the original frame when the input feature is already in the output domain, that is, G_{A→B}(x_B) ≈ x_B and G_{B→A}(x_A) ≈ x_A.
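To make the role of each term concrete, the following PyTorch sketch computes the generator-side objective of equation (1) for one mini-batch. It is a minimal illustration under our assumptions (LSGAN-style adversarial terms, L1 identity and cycle terms, and the names real_A for noisy and real_B for clean log-Mel features), not the released implementation.

import torch
import torch.nn.functional as F

def generator_objective(G_A2B, G_B2A, D_A, D_B, real_A, real_B,
                        lambda_id=0.5, lambda_cyc=10.0):
    """Generator-side CycleGAN objective of equation (1) (sketch).

    real_A: noisy log-Mel features, real_B: clean log-Mel features.
    D_A judges enhanced features (fake_B), D_B judges generated noisy features (fake_A).
    """
    fake_B = G_A2B(real_A)   # enhanced features
    fake_A = G_B2A(real_B)   # generated noisy features

    # Adversarial terms, equations (2)-(3): make the discriminators output 'real'.
    pred_B = D_A(fake_B)
    pred_A = D_B(fake_A)
    loss_adv = (F.mse_loss(pred_B, torch.ones_like(pred_B))
                + F.mse_loss(pred_A, torch.ones_like(pred_A)))

    # Identity terms, equations (4)-(5): leave in-domain inputs unchanged,
    # which matters for frames that carry little or no noise.
    loss_id = F.l1_loss(G_A2B(real_B), real_B) + F.l1_loss(G_B2A(real_A), real_A)

    # Cycle-consistency terms, equations (6)-(7).
    loss_cyc = F.l1_loss(G_B2A(fake_B), real_A) + F.l1_loss(G_A2B(fake_A), real_B)

    return loss_adv + lambda_id * loss_id + lambda_cyc * loss_cyc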

Figure 1 shows the network architecture of CycleGAN for converting a noisy signal into a clean signal. A and B denote the domains of the noisy and the clean signal, respectively. The two generators are G_{A→B} and G_{B→A}. The goal of the two discriminators (D_A and D_B) is to predict whether a sample is from the actual distribution ('real') or produced by a generator ('fake') given the input features.

Figure 1: The architecture of CycleGAN

2.2 Multi-Discriminators CycleGAN

The model is based on the Multi-Discriminators generative models [3, 9] and CycleGAN [25]. Based on equation (2), the loss function for the generator G_{A→B} is formed from the prediction of the discriminator D_A. That is to say, if the discriminator is imperfect, it cannot guide the generator towards the optimal region during training. Therefore, we propose to strengthen the discriminator by introducing multiple discriminators, e.g. D_{A_1} and D_{A_2}, to judge G_{A→B}(x_A) (a.k.a. fake_B), as shown in Figure 2. Each discriminator is responsible for a subregion of the log Mel-filterbank feature. The full objective is the same as equation (1), but equation (2) is adapted to

\mathcal{L}_{adv}(G_{A\to B}, D_{A_1}, \dots, D_{A_n}) = \frac{1}{n} \sum_{k=1}^{n} \mathbb{E}_{x_A \sim A}\big[(D_{A_k}(M_k \odot G_{A\to B}(x_A)) - 1)^2\big]   (10)

where n is the total number of discriminators that judge the G_{A→B} output and M_k is the frequency mask of the k-th discriminator. The loss function for each discriminator D_{A_k} in equation (8) is adapted to

\mathcal{L}_{D_{A_k}} = \mathbb{E}_{x_B \sim B}\big[(D_{A_k}(M_k \odot x_B) - 1)^2\big] + \mathbb{E}_{x_A \sim A}\big[D_{A_k}(M_k \odot G_{A\to B}(x_A))^2\big]   (11)

The mask M_k marks out the bins of the log Mel-filterbank feature that the k-th discriminator is not responsible for; e.g., for n = 2, the mask for D_{A_1} masks off the 21st to 40th bins and the mask for D_{A_2} masks off the 0th to 20th bins:

M_1 \odot x = (x_0, \dots, x_{20}, 0, \dots, 0)   (12)
M_2 \odot x = (0, \dots, 0, x_{21}, \dots, x_{40})   (13)
\mathcal{L}_{D_{A_1}} = \mathbb{E}_{x_B \sim B}\big[(D_{A_1}(M_1 \odot x_B) - 1)^2\big] + \mathbb{E}_{x_A \sim A}\big[D_{A_1}(M_1 \odot G_{A\to B}(x_A))^2\big]   (14)
\mathcal{L}_{D_{A_2}} = \mathbb{E}_{x_B \sim B}\big[(D_{A_2}(M_2 \odot x_B) - 1)^2\big] + \mathbb{E}_{x_A \sim A}\big[D_{A_2}(M_2 \odot G_{A\to B}(x_A))^2\big]   (15)
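A minimal sketch of how equations (10) and (11) can be realized in PyTorch: each discriminator only sees its own frequency band of the 40-bin log-Mel feature, and the generator's adversarial loss averages over all of them. The even band split and the zero-masking of the remaining bins are our assumptions (the text above splits the bins into a lower and an upper band for n = 2).

import torch
import torch.nn.functional as F

def band_masks(n_disc, n_bins=40):
    """One binary mask M_k per discriminator, each keeping one frequency band."""
    masks = torch.zeros(n_disc, n_bins)
    band = n_bins // n_disc
    for k in range(n_disc):
        hi = n_bins if k == n_disc - 1 else (k + 1) * band
        masks[k, k * band:hi] = 1.0
    return masks  # e.g. n_disc=2 keeps bins 0-19 and 20-39

def multi_disc_adv_loss(fake_B, discs, masks):
    """Generator-side loss of equation (10), averaged over masked discriminators."""
    # fake_B: enhanced log-Mel features of shape (batch, 1, frames, n_bins).
    losses = []
    for D_k, m in zip(discs, masks):
        pred = D_k(fake_B * m)  # bins outside the band are zeroed out
        losses.append(F.mse_loss(pred, torch.ones_like(pred)))
    return torch.stack(losses).mean()

def masked_disc_loss(D_k, m, real_B, fake_B):
    """Per-discriminator loss of equation (11) on its own frequency band."""
    pred_real = D_k(real_B * m)
    pred_fake = D_k(fake_B.detach() * m)
    return (F.mse_loss(pred_real, torch.ones_like(pred_real))
            + F.mse_loss(pred_fake, torch.zeros_like(pred_fake)))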

Furthermore, we can increase the number of generators, and therefore the number of discriminators, by dividing the training data into different homogeneous subsets, e.g. by splitting the data by gender and by type of noise.
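For illustration, a small helper that partitions a CHiME-3 utterance list into the eight gender-and-noise subsets used in architecture A3. The metadata fields (gender, noise) and the dictionary format are assumptions for this sketch, not taken from the released recipe.

from collections import defaultdict

NOISE_TYPES = ("BUS", "CAF", "PED", "STR")

def split_by_gender_and_noise(utterances):
    """Group utterances into homogeneous subsets such as ('female', 'BUS').

    Each utterance is assumed to be a dict like
    {"id": "F01_...", "gender": "female", "noise": "BUS", "feats": ...}.
    """
    subsets = defaultdict(list)
    for utt in utterances:
        assert utt["noise"] in NOISE_TYPES
        subsets[(utt["gender"], utt["noise"])].append(utt)
    # Up to 2 genders x 4 noise types = 8 subsets; one Multi-Discriminators
    # CycleGAN is then trained on each subset.
    return subsets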

Figure 2: The architecture of Multi-Discriminators CycleGAN (e.g. n=2)

3 Dataset

CHiME-3 was developed as part of the 3rd CHiME Speech Separation and Recognition Challenge. It contains approximately 342 hours of English speech and transcripts from noisy environments and 50 hours of noisy environment audio. The CHiME challenges focus on distant-microphone ASR in real-world environments [1]. Table 1 shows that the training set contains 8738 noisy utterances: 1600 real noisy utterances from two male and two female speakers in the 4 noisy environments and 7138 simulated noisy utterances based on the WSJ0 SI-84 training set in the 4 noisy environments [1]. The development set is composed of 410 (real noisy speech) x 4 (environments) + 410 (simulated noisy speech) x 4 (environments) = 3280 utterances. The test set consists of 330 (real noisy speech) x 4 (environments) + 330 (simulated noisy speech) x 4 (environments) = 2640 utterances [1].

Train set Noise type # of utterances hr.
tr05_orig_clean N/A 7138 15
tr05_real_noisy BUS,CAF,PED,STR 1600 2
tr05_simu_noisy BUS,CAF,PED,STR 7138 15
Dev/Eval set Noise type # of utterances hr.
dt05_real_noisy BUS,CAF,PED,STR 1640 2.74
dt05_simu_noisy BUS,CAF,PED,STR 1640 2.89
et05_real_noisy BUS,CAF,PED,STR 1320 2.17
et05_simu_noisy BUS,CAF,PED,STR 1320 2.27
Table 1: CHiME-3 dataset

4 Experimental Setup

4.1 Baseline ASR

The ASR system is trained on 80 hours of Wall Street Journal data [6] using the Kaldi [19] default recipe: we first train an HMM-GMM model to obtain alignments and then train a time delay neural network (TDNN) [18] on augmented features (speed and volume perturbation). The input features consist of a 40-bin log-Mel filterbank and 100-dimensional i-vectors. In the decoding stage, we use a fairly large dictionary and a pruned trigram language model, and rescore with a four-gram language model.
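Both the speech enhancement model and the ASR acoustic model consume 40-bin log-Mel filterbank features. A sketch of how such Kaldi-compatible features could be computed with torchaudio is shown below; the frame length, frame shift and window settings are assumptions, since the exact Kaldi configuration is not spelled out here.

import torchaudio
import torchaudio.compliance.kaldi as kaldi

def logmel_40(wav_path):
    """40-bin log-Mel filterbank features (Kaldi-style, settings assumed)."""
    waveform, sample_rate = torchaudio.load(wav_path)  # (channels, samples)
    feats = kaldi.fbank(
        waveform,
        num_mel_bins=40,
        frame_length=25.0,        # ms
        frame_shift=10.0,         # ms
        window_type="hamming",
        sample_frequency=sample_rate,
    )
    return feats                  # (num_frames, 40) log-Mel energies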

4.2 Multi-Discriminators CycleGAN

This section describes the network structure of the Multi-Discriminators CycleGAN, the experimental setup, and how we train it on the CHiME-3 dataset. The experimental setup and code can be obtained on GitHub: https://github.com/chiayuli/SEWork.

The generators and discriminators are trained with 3200 hours of data containing the real noisy training set (tr05_real_noisy) and clean data (tr05_orig_clean). The features are 40-bin log-Mel filterbanks using a Hamming window. The context window of each frame is 5, so the input size for the generator and the discriminator is (1, 11, 40). The generator network is a ResNet [8] with 9 blocks, and the number of filters in the last convolutional layer of the ResNet is 64. The discriminator network contains 3 convolutional layers with normalization layers; the number of filters in the first convolutional layer is 64. The batch size is 512 and the learning rate is 0.0002 with learning rate decay every 50 epochs. The optimizer is Adam [10] and the model is trained for 200 epochs. λ_id and λ_cyc are 0.5 and 10 in all experiments.
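The discriminator and the optimization setup described above can be summarized in a few lines of PyTorch. Only the hyper-parameters stated in the text (3 convolutional layers with normalization, 64 filters in the first layer, batch size 512, Adam with learning rate 0.0002 and decay every 50 epochs, 200 epochs, λ_id = 0.5, λ_cyc = 10, input shape (1, 11, 40)) are taken from the paper; the kernel sizes, strides and the decay factor are assumptions, and the 9-block ResNet generator is omitted for brevity.

import torch
import torch.nn as nn

class ConvDiscriminator(nn.Module):
    """3 convolutional layers with normalization, 64 filters in the first layer
    (a rough stand-in for the discriminator of Section 4.2)."""
    def __init__(self, base_filters=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, base_filters, kernel_size=3, stride=1, padding=1),
            nn.InstanceNorm2d(base_filters), nn.LeakyReLU(0.2),
            nn.Conv2d(base_filters, base_filters * 2, kernel_size=3, stride=2, padding=1),
            nn.InstanceNorm2d(base_filters * 2), nn.LeakyReLU(0.2),
            nn.Conv2d(base_filters * 2, 1, kernel_size=3, stride=1, padding=1),
        )

    def forward(self, x):   # x: (batch, 1, 11, 40) spliced log-Mel windows
        return self.net(x)  # patch-wise real/fake scores

LAMBDA_ID, LAMBDA_CYC = 0.5, 10.0   # weights in equation (1)
BATCH_SIZE, EPOCHS = 512, 200

D_A = ConvDiscriminator()
opt_D = torch.optim.Adam(D_A.parameters(), lr=2e-4)
# Learning-rate decay every 50 epochs; the decay factor is an assumption.
sched_D = torch.optim.lr_scheduler.StepLR(opt_D, step_size=50, gamma=0.5)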

We design three architectures in this experiment. In the first architecture (A1), we train one single generator using the Multi-Discriminators CycleGAN on the whole training set as shown in Figure 3, i.e. 3200 hours of unpaired speech data covering all genders and all types of noise. For this generator, we train up to three discriminators; e.g., CycleGAN-1G+2DA means one generator and two discriminators.

In the second architecture (A2), we split the data by gender and train two Multi-Discriminators CycleGANs, one on each subset, simultaneously, as shown in Figure 4. An architecture name with the prefix "CycleGAN-2G" means that there are two CycleGANs. CycleGAN-2G+2DA means that there are two discriminators A (D_A), one per generator G_{A→B}. CycleGAN-2G+4DA means that there are four discriminators A (D_A), two per generator G_{A→B}.

In the last architecture (A3), we split the data not only by gender but also by type of noise. By doing so, we obtain eight subsets of training data: female & BUS, female & CAF, female & PED, female & STR, male & BUS, male & CAF, male & PED and male & STR, as shown in Figure 5. We train one Multi-Discriminators CycleGAN on each of the subsets simultaneously. The prefix "CycleGAN-8G" means that there are eight Multi-Discriminators CycleGANs in the structure. Hence, CycleGAN-8G+8DA means that there are eight discriminators A (D_A), one per generator G_{A→B}, in the structure, and CycleGAN-8G+24DA means that there are 24 discriminators A (D_A), three per generator G_{A→B}, in the structure.
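The naming scheme can be parsed mechanically; the following helper (ours, not part of the released code) recovers the number of CycleGANs, the total number of discriminators A and the number of discriminators A per generator from a model name.

import re

def parse_arch(name):
    """Parse names like 'CycleGAN-8G+24DA' into
    (num_generators, num_discriminators_A, discriminators_A_per_generator)."""
    m = re.fullmatch(r"CycleGAN-(\d+)G\+(\d+)DA", name)
    n_gen, n_disc = int(m.group(1)), int(m.group(2))
    assert n_disc % n_gen == 0, "every generator gets the same number of discriminators A"
    return n_gen, n_disc, n_disc // n_gen

print(parse_arch("CycleGAN-8G+24DA"))  # (8, 24, 3)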

Figure 3: A1: The architecture of CycleGAN-1G+2DA
Figure 4: A2: The architecture of CycleGAN-2G+4DA
Figure 5: A3: The architecture of CycleGAN-8G+16DA

5 Results

Tables 2 and 3 show the WERs of the ASR system on average and broken down by noise type on the CHiME-3 development and evaluation sets. Note that the ASR system is trained on clean WSJ data and stays unchanged, while the input features are enhanced with our proposed Multi-Discriminators CycleGAN. In both tables, the highlighted WERs are those that are better than the ones without SE, and the WERs in bold are the best WERs among the models with the same number of generators.

5.1 One Discriminator vs. Multi-Discriminators

On the development and evaluation sets, the A1 model with two discriminators per generator performs better than the one with only one discriminator, and the model with three discriminators performs better than the ones with one or two discriminators. The same observation holds for the A2:CycleGAN-2G and A3:CycleGAN-8G models. These results suggest the importance of having multiple discriminators.

However, for both the A1:CycleGAN-1G and A2:CycleGAN-2G models, the WERs are not better than those of the baseline (no SE) for all types of noise: the WERs are only better on the BUS and the STR noisy data, but not on the CAF and the PED noisy data. The best model of the A2:CycleGAN-2G architecture on the development set is A2:CycleGAN-2G+4DA with a WER of 31.44%, and the best one on the evaluation set is A2:CycleGAN-2G+6DA, which achieves 57.53% WER. We observe the best WERs with the A3:CycleGAN-8G+24DA model, i.e. eight generators with three discriminators each. It achieves a WER of 28.66% and 52.80% on the development set and the evaluation set, respectively, and outperforms the baseline (no SE) on all types of noisy data. These results suggest the importance of having multiple generators.

Arch. Method AVG BUS CAF PED STR
No SE 32.14 38.88 37.79 18.93 32.94
A1 CycleGAN-1G+1DA 35.25 37.82 43.79 22.23 31.97
A1 CycleGAN-1G+2DA 33.53 35.39 39.47 22.84 31.03
A1 CycleGAN-1G+3DA 29.45 34.12 33.94 17.92 28.45
A2 CycleGAN-2G+2DA 34.05 37.11 45.22 23.4 30.47
A2 CycleGAN-2G+4DA 31.44 36.14 39.2 20.22 30.17
A2 CycleGAN-2G+6DA 32.64 37.17 39.57 21.76 32.06
A3 CycleGAN-8G+8DA 35.12 38.16 37.17 23.06 38.06
A3 CycleGAN-8G+16DA 30.19 33.26 35.91 19.68 30.48
A3 CycleGAN-8G+24DA 28.66 32.00 32.54 18.04 28.52
A3 CycleGAN-8G+32DA 29.51 33.6 33.44 18.69 28.79
Table 2: The WERs of baseline ASR w/o and w/ Multi-Discriminators CycleGAN SE on dt05_real_noisy.
Arch. Method AVG BUS CAF PED STR
No SE 61.46 82.91 62.33 57.21 43.41
A1 CycleGAN-1G+1DA 60.33 76.48 67.39 57.01 40.42
A1 CycleGAN-1G+2DA 57.64 71.49 64.14 54.28 40.64
A1 CycleGAN-1G+3DA 53.29 68.61 58.46 49.48 36.61
A2 CycleGAN-2G+2DA 61.54 74.78 69.87 62.14 39.39
A2 CycleGAN-2G+4DA 60.88 73.73 67.54 59.98 42.27
A2 CycleGAN-2G+6DA 57.53 70.48 65.17 54.97 39.50
A3 CycleGAN-8G+8DA 60.62 72.87 62.42 58.71 44.79
A3 CycleGAN-8G+16DA 55.30 66.37 60.29 51.03 37.80
A3 CycleGAN-8G+24DA 52.80 65.84 56.54 46.49 36.91
A3 CycleGAN-8G+32DA 54.01 67.00 56.29 49.44 36.70
Table 3: The WERs of baseline ASR w/o and w/ Multi-Discriminators CycleGAN SE on et05_real_noisy.

5.2 One Generator vs. Multi-Generators

Figure 6 shows the WER comparisons between SE systems with different numbers of generators given the same number of discriminators per generator. Figure 6(a) compares A1:CycleGAN-1G+1DA, A2:CycleGAN-2G+2DA and A3:CycleGAN-8G+8DA, which all have one discriminator per G_{A→B}. Figure 6(b) compares A1:CycleGAN-1G+2DA, A2:CycleGAN-2G+4DA and A3:CycleGAN-8G+16DA, which all have two discriminators per G_{A→B}. The comparison between the models having three discriminators per G_{A→B}, A1:CycleGAN-1G+3DA, A2:CycleGAN-2G+6DA and A3:CycleGAN-8G+24DA, is shown in Figure 6(c).

For the one-discriminator-A (1DA) and two-discriminators-A (2DA) models in Figures 6(a) and 6(b), the model with two generators (A2:CycleGAN-2G) performs better than the one with only one generator (A1:CycleGAN-1G) in the BUS, CAF and PED noisy scenarios, and the model with eight generators (A3:CycleGAN-8G) performs slightly better than A2:CycleGAN-2G. For the three-discriminators-A (3DA) models in Figure 6(c), A3:CycleGAN-8G performs slightly better than A2:CycleGAN-2G, and both the A2:CycleGAN-2G and A3:CycleGAN-8G models perform much better than A1:CycleGAN-1G. In sum, our results show that adding more generators and training them on well-split data, e.g. by gender and noise type, plays an important role in improving the overall performance.

(a) The WERs of ASR w/ 1DA models SE (b) w/ 2DA models SE (c) w/ 3DA models SE
Figure 6: The performance comparisons on the data of female speakers in the development set among the models with one, two and eight generators.

6 Analysis & Discussion

6.1 WER, Insertion, Deletion and Substitution

Table 4 shows the comparison (WER, insertions, deletions, substitutions and total errors) on the CHiME-3 development set with and without Multi-Discriminators CycleGAN SE.

For all the A1:CycleGAN-1G, A2:CycleGAN-2G and A3:CycleGAN-8G models, the model with only one discriminator A per generator (1DA models) has two or three times more insertions than the one with multiple discriminators. This implies that the 1DA models do not remove the noise very well, so the clean-trained ASR system mis-recognizes the residual noise as speech. The number of deletions of the multi-discriminator models is also lower than that of the baseline without SE, which means that more words are recognized by the ASR system with Multi-Discriminators CycleGAN SE.

Arch. Method WER INS DEL SUB ERR
No SE 32.14 170 6054 2491 8715
A1 CycleGAN-1G+1DA 35.25 723 5894 2942 9559
A1 CycleGAN-1G+3DA 29.45 240 5309 2437 7986
A2 CycleGAN-2G+2DA 34.05 779 5372 3083 9234
A2 CycleGAN-2G+4DA 31.44 303 5611 2611 8525
A3 CycleGAN-8G+8DA 35.12 718 5940 2867 9525
A3 CycleGAN-8G+24DA 28.66 398 4255 3120 7773
Table 4: The number of insertions, deletions, substitutions and total errors (ERR = INS + DEL + SUB) on the dt05_real_noisy set (27119 words) with and without Multi-Discriminators CycleGAN SE.
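The breakdown in Table 4 can be reproduced from reference/hypothesis pairs with a standard Levenshtein alignment. The paper scores with Kaldi, so the following sketch is only an illustrative equivalent that counts insertions, deletions and substitutions per utterance.

def error_counts(ref, hyp):
    """WER plus insertion/deletion/substitution counts for one utterance,
    using a standard Levenshtein alignment over word lists."""
    n, m = len(ref), len(hyp)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Backtrace to attribute each error to SUB, DEL or INS.
    i, j, ins, dels, subs = n, m, 0, 0, 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            subs += ref[i - 1] != hyp[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            dels += 1   # a reference word was dropped
            i -= 1
        else:
            ins += 1    # an extra word appears in the hypothesis
            j -= 1
    wer = 100.0 * (subs + dels + ins) / max(n, 1)
    return wer, ins, dels, subs

# Example: an empty hypothesis against the reference "OUR CUSTOMERS WANT THEM"
# yields 4 deletions and 100% WER.
print(error_counts("OUR CUSTOMERS WANT THEM".split(), []))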

6.2 ASR outputs & Spectrograms

In this section, we compare cherry-picked ASR outputs on the development set without and with Multi-Discriminators CycleGAN SE and the corresponding log-Mel spectrograms. Table 5 shows the ASR output on a BUS real noisy utterance without SE, with A1:CycleGAN-1G SE and with our best SE model (A3:CycleGAN-8G+24DA). As shown in Table 4, adding more CycleGANs and more discriminators A (DA) per G_{A→B} helps to ease the high-deletion problem.

Figures 7, 8 and 9 show the comparison of log-Mel spectrograms from the original clean utterance, the noisy utterance without SE, and the noisy utterance with A1:CycleGAN-1G+1DA, A1:CycleGAN-1G+3DA and A3:CycleGAN-8G+24DA SE. Between 0 and 1000 milliseconds, our SE models reduce the noise and help the ASR to recognize the words successfully. However, from 1000 to 1500 milliseconds, the utterance with Multi-Discriminators CycleGAN SE might still contain some noise, so the ASR could not recognize the two words "WANT THEM" well. Besides, A1:CycleGAN-1G+1DA might not transform the noisy signal into a clean signal well, so the ASR system mis-recognizes /A/ as /O/, and its later predictions ("GIVE UP") have nothing in common with the two words "WANT THEM" with respect to pronunciation.

Table 6 shows another example of ASR output from the CAF (SNR=10) noisy development set with and without Multi-Discriminators CycleGAN SE. The A3:CycleGAN-8G model performs much better than the A1:CycleGAN-1G models: there are many substitutions and insertions in the A1:CycleGAN-1G outputs. However, the A3:CycleGAN-8G model has more deletions than the A1:CycleGAN-1G models in this example.

Arch. Method ASR output
Time 0-1000 ms: OUR CUSTOMERS
1000-1500 ms: WANT THEM
Reference OUR CUSTOMERS WANT THEM
No SE
A1 CycleGAN-1G+1DA OUR CUSTOMERS WON’T GIVE UP
A1 CycleGAN-1G+3DA OUR CUSTOMERS
A3 CycleGAN-8G+24DA OUR CUSTOMERS
Table 5: ASR outputs of an example in the BUS real noisy (SNR=7) development set with and without Multi-Discriminators CycleGAN SE.
(a) Clean
(b) No SE
(c) A1:CycleGAN-1G+1DA
(d) A1:CycleGAN-1G+3DA
(e) A3:CycleGAN-8G+24DA
Figure 7: The log-Mel Spectrogram from 0-500 ms from an utterance in BUS development set (the same utterance as in Table 5) without SE and with Multi-Discriminators CycleGAN SE.
(a) Clean
(b) No SE
(c) A1:CycleGAN-1G+1DA
(d) A1:CycleGAN-1G+3DA
(e) A3:CycleGAN-8G+24DA
Figure 8: The log-Mel Spectrogram from 500-1000 ms from an utterance in BUS development set (the same utterance as in Table 5) without SE and with Multi-Discriminators CycleGAN SE.
(a) Clean
(b) No SE
(c) A1:CycleGAN-1G+1DA
(d) A1:CycleGAN-1G+3DA
(e) A3:CycleGAN-8G+24DA
Figure 9: The log-Mel Spectrogram from 1000-1500 ms from an utterance in BUS development set (the same utterance as in Table 5) without SE and with Multi-Discriminators CycleGAN SE.
Arch. Method ASR output
Reference WHAT ABOUT IN SOUTH AFRICA ITSELF
No SE WHAT ABOUT A SATISFACTORY
A1 CycleGAN-1G+1DA BUT WHAT ABOUT A SOUTH AFRICAN GETS OUT
A1 CycleGAN-1G+3DA WHAT ABOUT A SAD FACT GETS OUT
A3 CycleGAN-8G+24DA WHAT ABOUT SOUTH AFRICANS
Table 6: ASR outputs of an example in the CAF (SNR=10) noisy development set with and without Multi-Discriminators CycleGAN SE.

7 Conclusion

In this work, we investigate the performance of ASR in terms of WER using CycleGAN-based feature mapping and our novel extension (multiple generators and multiple discriminators). Our experimental results show that multiple generators trained on well-split subsets of the data are better than one generator trained on all the data. Moreover, models with multiple discriminators A (D_A) per generator (G_{A→B}) improve the average WER. The best model, A3:CycleGAN-8G+24DA, improves the WER in all four noisy scenarios and achieves 10.03% and 14.09% relative WER improvement on the development set and the evaluation set, respectively.

References

  • [1] J. Barker et al. (2017) CHiME3 LDC2017S24. USB Flash Drive. Philadelphia: Linguistic Data Consortium. Cited by: §3.
  • [2] Z. Chen, Y. Huang, J. Li, and Y. Gong (2017) Improving mask learning based speech enhancement system with restoration layers and residual connection. In Proceedings of Interspeech. Cited by: §1.
  • [3] I. Durugkar, I. Gemp, and S. Mahadevan (2017) Generative multi-adversarial networks. in Proceedings of ICLR. Cited by: §1, §1, §2.2.
  • [4] X. Feng, Y. Zhang, and J. Glass (2014) Speech feature denoising and dereverberation via deep autoencoders for noisy reverberant speech recognition. In Proceedings of ICASSP. Cited by: §1.
  • [5] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky (2016) Domain-adversarial training of neural networks. In Journal of Machine Learning Research. Cited by: §1.
  • [6] J. S. Garofolo et al. (1993) CSR-I (WSJ0) complete LDC93S6A. Web Download. Philadelphia: Linguistic Data Consortium. Cited by: §4.1.
  • [7] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. in Advances in neural information processing systems, pp. 2672–2680. Cited by: §1.
  • [8] K. He, X. Zhang, S. Ren, and J. Sun (2015) Deep Residual Learning for Image Recognition. in Proceedings of CVPR. Cited by: §4.2.
  • [9] E. Hosseini-Asl, Y. Zhou, C. Xiong, and R. Socher (2018) A multi-discriminator cyclegan for unsupervised non-parallel speech domain adaptation. in Proceedings of Interspeech. Cited by: §1, §1, §2.2.
  • [10] D. P. Kingma and J. Ba (2015) Adam: A Method for Stochastic Optimization. in the Proceedings of the 3rd International Conference for Learning Representations. Cited by: §4.2.
  • [11] C.-F. Liao, Y. Tsao, H.-Y. Lee, and H.-M. Wang (2019) Noise adaptive speech enhancement using domain adversarial training. in Proceedings of Interspeech. Cited by: §1.
  • [12] P. C. Loizou (2013) Speech enhancement: theory and practice.. CRC press. Cited by: §1.
  • [13] X. Lu, Y. Tsao, S. Matsuda, and C. Hori (2013) Speech enhancement based on deep denoising autoencoder. In Proceedings of Interspeech. Cited by: §1.
  • [14] A. L. Maas, Q. V. Le, T. M. O’Neil, O. Vinyals, P. Nguyen, and A. Y. Ng (2012) Recurrent neural networks for noise reduction in robust asr.. in Proceedings of Interspeech. Cited by: §1.
  • [15] Z. Meng, J. Li, Y. Gong, and B.-H. Juang (2018) Adversarial feature-mapping for speech enhancement. in Proceedings of Interspeech. Cited by: §1.
  • [16] Z. Meng, J. Li, Y. Gong, and B.-H. Juang (2018) Cycle-consistent speech enhancement. in Proceedings of Interspeech. Cited by: §1, §2.1.
  • [17] A. Narayanan and D. Wang (2013) Ideal ratio mask estimation using deep neural networks for robust speech recognition. In Proceedings of ICASSP. Cited by: §1.
  • [18] V. Peddinti, D. Povey, and S. Khudanpur (2015) A Time Delay Neural Network Architecture for Efficient Modeling of Long Temporal Contexts. in Proceedings of Interspeech. Cited by: §4.1.
  • [19] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely (2011-12) The Kaldi Speech Recognition Toolkit. In Proceedings of ASRU, Cited by: §4.1.
  • [20] Y. Wang, A. Narayanan, and D. Wang (2014) On training targets for supervised speech separation.. in IEEE/ACM Trans. Audio, Speech, Language Process.. Cited by: §1.
  • [21] F. Weninger, H. Erdogan, S. Watanabe, et al. (2015) Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR. In Proceedings of LVA/ICA. Cited by: §1.
  • [22] F. Weninger, F. Eyben, and B. Schuller (2014) Single-channel speech separation with memory-enhanced recurrent neural networks. in Proceedings of ICASSP. Cited by: §1.
  • [23] Y. Xu, J. Du, L. R. Dai, and C. H. Lee (2015) A regression approach to speech enhancement based on deep neural networks.. in IEEE/ACM Trans. Audio, Speech, Language Process. 23. Cited by: §1.
  • [24] Y. Zhou, C. Xiong, and R. Socher (2018) Improving End-to-End Speech Recognition with Policy Learning. In Proceedings of ICASSP abs/1712.07101. Cited by: §1.
  • [25] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2242–2251. Cited by: §1, §2.1, §2.2.