In recent years, advances in deep learning have led to remarkable performance improvements in automatic speech recognition (ASR) [1, 2, 3, 4, 5, 6]. However, ASR systems still suffer large performance degradation when an acoustic mismatch exists between the training and test conditions [7, 8]. Many factors contribute to the mismatch, such as variation in environmental noise, channels, and speaker characteristics. Domain adaptation is an effective way to address this limitation, in which the acoustic model parameters or input features are adjusted to compensate for the mismatch.
One difficulty with domain adaptation is that the available data from the target domain is usually limited, in which case the acoustic model can easily overfit. To address this issue, regularization-based approaches are proposed in [9, 10, 11, 12] to regularize the neuron output distributions or the model parameters. In [13, 14], transformation-based approaches are introduced to reduce the number of learnable parameters. In [15, 16, 17], the trainable parameters are further reduced by singular value decomposition of the weight matrices of a neural network. Although these methods utilize the limited data from the target domain, they still require labeling of the adaptation data and can only be used in supervised adaptation.
Unsupervised domain adaptation is necessary when human labeling of the target-domain data is unavailable. It has become an important topic with the rapid growth of untranscribed speech data, for which human annotation is expensive. Swietojanski et al. proposed to learn the contribution of hidden units through additional amplitude parameters [18] and differentiable pooling [19]. In [20], unsupervised speaker adaptation is performed on batch normalized acoustic models. Although these methods improve ASR performance when no labels are available for the adaptation data, they still rely on senone (tri-phone state) alignments against the unlabeled adaptation data obtained through first-pass decoding. The first-pass decoding result is unreliable when the mismatch between the training and test conditions is significant. It is also time-consuming and can hardly be applied to huge amounts of adaptation data. There are even situations where decoding the adaptation data is not allowed because of privacy agreements signed with the speakers. Methods depending on the first-pass decoding of the unlabeled adaptation data are sometimes called "semi-supervised" adaptation in the literature.
The goal of our study is to achieve purely unsupervised domain adaptation without any exposure to the labels or the decoding results of the adaptation data in the target domain. In [21] we show that the source-domain model can be effectively adapted without any transcription by using teacher-student (T/S) learning [22], in which the posterior probabilities generated by the source-domain model are used in lieu of labels to train the target-domain model. However, T/S learning relies on the availability of parallel unlabeled data, which can usually be simulated; when parallel data is not available, T/S learning cannot be used for model adaptation. In this study, we explore domain adaptation without parallel data and without transcription. Recently, adversarial training has attracted great interest in deep learning because of its success in estimating generative models [23]. It was first applied to unsupervised domain adaptation by Ganin et al. [24] in the form of multi-task learning. In their work, unsupervised adaptation is achieved by learning deep intermediate representations that are both discriminative for the main task (image classification) on the source domain and invariant with respect to the mismatch between the source and target domains. The domain invariance is achieved by adversarial training of the domain classification objective. This can be easily implemented by augmenting any feed-forward model with a few standard layers and a gradient reversal layer (GRL). The GRL approach has been applied to acoustic models for unsupervised adaptation in [25] and for increasing noise robustness in [26, 27], with improved ASR performance in both scenarios.
However, the GRL method focuses only on learning a domain-invariant representation, ignoring the unique characteristics of each domain, which could also be informative. Inspired by this, Bousmalis et al. [28] proposed domain separation networks (DSNs) to separate the deep representation of each training sample into two parts: a private component that is unique to its domain and a shared component that is invariant to the domain shift. In this work, we propose to apply a DSN for unsupervised domain adaptation of a DNN-hidden Markov model (HMM) acoustic model, aiming to increase the noise robustness of speech recognition. In the proposed framework, the shared component is learned to be both senone-discriminative and domain-invariant through adversarial multi-task training of a shared component extractor and a domain classifier. The private component is trained to be orthogonal to the shared component to implicitly increase the degree of domain-invariance of the shared component. A reconstructor DNN is used to reconstruct the original speech feature from the private and shared components, serving as a regularizer. The proposed method achieves 11.08% relative WER improvement over the GRL training approach for robust ASR on the CHiME-3 dataset.
2 Domain Separation Networks
In the purely unsupervised domain adaptation task, we only have access to a sequence of speech frames $X^s = \{x_1^s, \ldots, x_{N_s}^s\}$ drawn from the source-domain distribution, a sequence of senone labels $Y^s = \{y_1^s, \ldots, y_{N_s}^s\}$ aligned with the source data, and a sequence of speech frames $X^t = \{x_1^t, \ldots, x_{N_t}^t\}$ drawn from a target-domain distribution. Senone labels or other types of transcription are not available for the target speech sequence $X^t$.
When applying domain separation networks (DSNs) to the unsupervised adaptation task, our goal is to learn a shared (or common) component extractor DNN $E_c$ that maps an input speech frame $x^s$ from the source domain or $x^t$ from the target domain to a domain-invariant shared component $f_c^s$ or $f_c^t$, respectively. At the same time, we learn a senone classifier DNN $C_y$ that maps the shared component $f_c^s$ from the source domain to the correct senone label $y^s$.
To achieve this, we first perform adversarial training of a domain classifier DNN $C_d$ that maps the shared component $f_c^s$ or $f_c^t$ to its domain label $d^s$ or $d^t$, while simultaneously minimizing the senone classification loss of $C_y$ given the shared component $f_c^s$ from the source domain, to ensure the senone-discriminativeness of $f_c^s$.
For the source or the target domain, we extract a source or target private component $f_p^s$ or $f_p^t$ that is unique to that domain through a source or a target private component extractor $E_p^s$ or $E_p^t$. The shared and private components of the same domain are trained to be orthogonal to each other to further enhance the degree of domain-invariance of the shared components. The extracted shared and private components of each speech frame are concatenated and fed as the input of a reconstructor $R$ to reconstruct the input speech frame $x^s$ or $x^t$.
The architecture of DSN is shown in Fig. 1, in which all the sub-networks are jointly optimized using SGD. The optimized shared component extractor and senone classifier form the adapted acoustic model for subsequent robust speech recognition.
2.1 Deep Neural Networks Acoustic Model
The shared component extractor and senone classifier of the DSN are initialized from a DNN-HMM acoustic model. The DNN-HMM acoustic model is trained with labeled speech data from the source domain. The senone-level alignment is generated by a well-trained GMM-HMM system.
Each output unit of the DNN acoustic model corresponds to one senone in the senone set. The output unit for each senone gives its posterior probability, obtained by a softmax function.
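For concreteness, the softmax that turns output-layer activations into senone posteriors can be sketched as follows (toy dimensions; a real acoustic model here has thousands of output units):

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

# Toy final-layer activations for a 4-senone output layer.
logits = np.array([2.0, 1.0, 0.1, -1.0])
posteriors = softmax(logits)

# Posteriors form a valid probability distribution over senones.
assert np.isclose(posteriors.sum(), 1.0)
# The predicted senone is the argmax of the posteriors.
assert posteriors.argmax() == 0
```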
2.2 Shared Component Extraction with Adversarial Training
The well-trained acoustic model DNN in Section 2.1 can be decomposed into two parts: a shared component extractor $E_c$ with parameters $\theta_c$ and a senone classifier $C_y$ with parameters $\theta_y$. An input speech frame $x^s$ from the source domain is first mapped by $E_c$ to a K-dimensional shared component $f_c^s = E_c(x^s)$. $f_c^s$ is then mapped to the senone label posteriors by the senone classifier $C_y$ as follows:
$$p(y \mid x^s; \theta_c, \theta_y) = C_y(E_c(x^s)),$$
where the predicted senone label for the source frame $x^s$ is $\hat{y}^s = \arg\max_y p(y \mid x^s; \theta_c, \theta_y)$.
The domain classifier DNN $C_d$ with parameters $\theta_d$ takes the shared component $f_c^s$ from the source domain or $f_c^t$ from the target domain as input to predict the two-dimensional domain label posteriors as follows (the 1st and 2nd output units stand for the source and target domains, respectively):
$$p(d \mid x; \theta_c, \theta_d) = C_d(E_c(x)),$$
where $\hat{d}^s$ and $\hat{d}^t$, with $\hat{d} = \arg\max_d p(d \mid x; \theta_c, \theta_d)$, denote the predicted domain labels for the source frame $x^s$ and the target frame $x^t$, respectively.
In order to adapt the source-domain acoustic model (i.e., $E_c$ and $C_y$) to the unlabeled data from the target domain, we want to make the distribution of the source-domain shared component $f_c^s$ as close to that of the target-domain shared component $f_c^t$ as possible. In other words, we want to make the shared component domain-invariant. This can be realized by adversarial training, in which we adjust the parameters $\theta_c$ of the shared component extractor to maximize the domain classification loss $\mathcal{L}_{domain}$ below, while adjusting the parameters $\theta_d$ to minimize it:
$$\mathcal{L}_{domain}(\theta_c, \theta_d) = -\sum_{i=1}^{N_s} \log p(d = s \mid x_i^s; \theta_c, \theta_d) - \sum_{j=1}^{N_t} \log p(d = t \mid x_j^t; \theta_c, \theta_d).$$
This minimax competition will first increase the capability of both the shared component extractor and the domain classifier, and will eventually converge to the point where the shared component extractor generates representations so confusing that the domain classifier is unable to distinguish them (i.e., domain-invariant).
Simultaneously, we minimize the senone classification loss $\mathcal{L}_{senone}$ below to ensure that the domain-invariant shared component is also discriminative with respect to senones:
$$\mathcal{L}_{senone}(\theta_c, \theta_y) = -\sum_{i=1}^{N_s} \log p(y_i^s \mid x_i^s; \theta_c, \theta_y).$$
Since the adversarial training of the domain classifier and the shared component extractor has made the distribution of the target-domain shared component $f_c^t$ as close to that of $f_c^s$ as possible, $f_c^t$ is also senone-discriminative and leads to a minimized senone classification error given the optimized $C_y$. Because of this domain-invariant property, good adaptation performance can be achieved when the target-domain data goes through the network.
2.3 Private Components Extraction
To further increase the degree of domain-invariance of the shared components, we explicitly model the private component that is unique to each domain by private component extractor DNNs $E_p^s$ and $E_p^t$, parameterized by $\theta_p^s$ and $\theta_p^t$. $E_p^s$ and $E_p^t$ map the source frame $x^s$ and the target frame $x^t$ to $f_p^s$ and $f_p^t$, which are the private components of the source and target domains, respectively. The private component for each domain is trained to be orthogonal to the shared component by minimizing the difference loss below:
$$\mathcal{L}_{diff} = \left\| {H_c^s}^\top H_p^s \right\|_F^2 + \left\| {H_c^t}^\top H_p^t \right\|_F^2,$$
where $H_c^s$ and $H_p^s$ ($H_c^t$ and $H_p^t$) are matrices whose columns are the shared and private components of the source (target) frames, and $\| \cdot \|_F^2$ is the squared Frobenius norm. All the vectors are assumed to be column vectors.
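A minimal NumPy sketch of the difference loss for one domain may help; `H_c` and `H_p` are toy matrices whose columns are per-frame shared and private components, and the sizes are arbitrary illustrative values:

```python
import numpy as np

rng = np.random.default_rng(0)

K, N = 8, 5                          # component dimension, number of frames
H_c = rng.standard_normal((K, N))    # shared components of one domain
H_p = rng.standard_normal((K, N))    # private components of the same domain

def diff_loss(H_c, H_p):
    # Squared Frobenius norm of the cross-correlation between shared and
    # private components; zero iff every shared component vector is
    # orthogonal to every private component vector.
    return np.linalg.norm(H_c.T @ H_p, ord="fro") ** 2

loss = diff_loss(H_c, H_p)
assert loss > 0.0   # random components are not orthogonal

# If the private components lie in the orthogonal complement of the
# shared components, the loss vanishes.
Q, _ = np.linalg.qr(H_c)             # orthonormal basis of span(H_c)
H_p_orth = H_p - Q @ (Q.T @ H_p)     # project out the shared subspace
assert np.isclose(diff_loss(H_c, H_p_orth), 0.0, atol=1e-18)
```

Minimizing this term pushes the private extractors to capture exactly what the shared extractor does not.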
As a regularization term, the predicted shared and private components are then concatenated and fed into a reconstructor DNN $R$ with parameters $\theta_r$ to recover the input speech frames $x^s$ and $x^t$ from the source and target domains, respectively. The reconstructor is trained to minimize the mean-square-error-based reconstruction loss as follows:
$$\mathcal{L}_{recon} = \sum_{i=1}^{N_s} \left\| R([f_{c,i}^s; f_{p,i}^s]) - x_i^s \right\|_2^2 + \sum_{j=1}^{N_t} \left\| R([f_{c,j}^t; f_{p,j}^t]) - x_j^t \right\|_2^2,$$
where $[\cdot\,;\cdot]$ denotes the concatenation of two vectors.
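As an illustration of the reconstruction loss, here is a toy NumPy sketch in which the reconstructor is a single linear layer (the paper's reconstructor is a multi-layer DNN; the linear map, the weight names `W`/`b`, and all dimensions are assumptions for illustration only):

```python
import numpy as np

rng = np.random.default_rng(1)

D, K = 12, 4                      # input feature dim, component dim (toy values)
x = rng.standard_normal(D)        # one spliced input frame
f_c = rng.standard_normal(K)      # shared component of the frame
f_p = rng.standard_normal(K)      # private component of the frame

# Hypothetical linear reconstructor: maps the concatenated components
# back to the input feature space.
W = rng.standard_normal((D, 2 * K)) * 0.1
b = np.zeros(D)

def reconstruct(f_c, f_p):
    # Concatenate shared and private components, then project to feature space.
    return W @ np.concatenate([f_c, f_p]) + b

def recon_loss(x, x_hat):
    # Mean squared error between the input frame and its reconstruction.
    return np.mean((x - x_hat) ** 2)

x_hat = reconstruct(f_c, f_p)
loss = recon_loss(x, x_hat)
assert x_hat.shape == x.shape
assert loss >= 0.0
```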
The total loss of the DSN is formulated as follows and is jointly optimized with respect to the parameters $\theta_c$, $\theta_y$, $\theta_d$, $\theta_p^s$, $\theta_p^t$, and $\theta_r$:
$$\mathcal{L}_{total} = \mathcal{L}_{senone} - \lambda \mathcal{L}_{domain} + \alpha \mathcal{L}_{diff} + \beta \mathcal{L}_{recon},$$
where $\lambda$ is the reversal gradient coefficient and $\alpha$, $\beta$ are weighting coefficients. All the parameters of the DSN are jointly optimized through backpropagation with stochastic gradient descent (SGD) with learning rate $\mu$ as follows:
$$\theta_c \leftarrow \theta_c - \mu \left( \frac{\partial \mathcal{L}_{senone}}{\partial \theta_c} - \lambda \frac{\partial \mathcal{L}_{domain}}{\partial \theta_c} + \alpha \frac{\partial \mathcal{L}_{diff}}{\partial \theta_c} + \beta \frac{\partial \mathcal{L}_{recon}}{\partial \theta_c} \right),$$
$$\theta_d \leftarrow \theta_d - \mu \frac{\partial \mathcal{L}_{domain}}{\partial \theta_d}, \qquad \theta_y \leftarrow \theta_y - \mu \frac{\partial \mathcal{L}_{senone}}{\partial \theta_y},$$
$$\theta_p^s \leftarrow \theta_p^s - \mu \left( \alpha \frac{\partial \mathcal{L}_{diff}}{\partial \theta_p^s} + \beta \frac{\partial \mathcal{L}_{recon}}{\partial \theta_p^s} \right), \qquad \theta_p^t \leftarrow \theta_p^t - \mu \left( \alpha \frac{\partial \mathcal{L}_{diff}}{\partial \theta_p^t} + \beta \frac{\partial \mathcal{L}_{recon}}{\partial \theta_p^t} \right), \qquad \theta_r \leftarrow \theta_r - \mu \beta \frac{\partial \mathcal{L}_{recon}}{\partial \theta_r}.$$
Note that the negative coefficient $-\lambda$ applied to $\partial \mathcal{L}_{domain} / \partial \theta_c$ induces a reversed gradient that maximizes the domain classification loss and makes the shared components domain-invariant. Without the reversed gradient, SGD would make the representations different across domains in order to minimize the domain classification loss. For easy implementation, the gradient reversal layer (GRL) is introduced in [24]; it acts as an identity transform in the forward pass and multiplies the gradient by $-\lambda$ during the backward pass.
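The GRL mechanics described above can be sketched in a few lines. This toy `GradientReversal` class is a stand-alone illustration (not the CNTK implementation used later): identity in the forward pass, gradient scaled by the negative reversal coefficient in the backward pass.

```python
import numpy as np

class GradientReversal:
    # Identity forward; multiplies incoming gradients by -lambda backward.
    def __init__(self, lam):
        self.lam = lam

    def forward(self, x):
        # Forward pass: pass the activations through unchanged.
        return x

    def backward(self, grad_output):
        # Backward pass: flip and scale the gradient so the layers below
        # ascend (rather than descend) the domain classification loss.
        return -self.lam * grad_output

grl = GradientReversal(lam=0.5)
x = np.array([1.0, -2.0, 3.0])
grad = np.array([0.1, 0.2, -0.3])

assert np.array_equal(grl.forward(x), x)                     # identity forward
assert np.allclose(grl.backward(grad), [-0.05, -0.1, 0.15])  # -lambda * grad
```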
3 Experiments

In this work, we perform purely unsupervised environment adaptation of the DNN-HMM acoustic model with domain separation networks for robust speech recognition on the CHiME-3 dataset.
3.1 CHiME-3 Dataset
The CHiME-3 dataset was released with the 3rd CHiME Speech Separation and Recognition Challenge [29], which incorporates Wall Street Journal corpus sentences spoken in challenging noisy environments, recorded using a 6-channel tablet-based microphone array. The CHiME-3 dataset consists of both real and simulated data. The real speech data was recorded in four real noisy environments: on buses (BUS), in cafés (CAF), in pedestrian areas (PED), and at street junctions (STR). To generate the simulated data, the clean speech is first convolved with the estimated impulse response of the environment and then mixed with background noise separately recorded in that environment [29]. The noisy training data consists of 1600 real noisy utterances from 4 speakers and 7138 simulated noisy utterances from 83 speakers in the WSJ0 SI-84 training set, recorded in the 4 noisy environments. There are 3280 utterances in the development set, including 410 real and 410 simulated utterances for each of the 4 environments, and 2640 utterances in the test set, including 330 real and 330 simulated utterances for each of the 4 environments. The speakers in the training, development, and test sets are mutually different (i.e., 12 different speakers in the CHiME-3 dataset). The training, development, and test sets are all recorded in 6 different channels.
The 8738 clean utterances corresponding to the 8738 noisy training utterances in the CHiME-3 dataset are selected from the WSJ0 SI-84 training set to form the clean training data in our experiments. A WSJ 5K-word 3-gram language model is used for decoding.
3.2 Baseline System
In the baseline system, we first train a DNN-HMM acoustic model with clean speech and then adapt the clean acoustic model to noisy data using the GRL unsupervised adaptation in [25]. Hence, the source domain is clean speech while the target domain is noisy speech.
The 29-dimensional log Mel filterbank features together with 1st- and 2nd-order delta features (87-dimensional in total) for both the clean and noisy utterances are extracted by following the process in [31]. Each frame is spliced together with 5 left and 5 right context frames to form a 957-dimensional feature. The spliced features are fed as the input of the feed-forward DNN after global mean and variance normalization. The DNN has 7 hidden layers with 2048 hidden units each. The output layer of the DNN has 3012 output units corresponding to 3012 senone labels. Senone-level forced alignment of the clean data is generated using a GMM-HMM system. The DNN is first trained with the 8738 clean training utterances in CHiME-3 and the alignment to minimize the cross-entropy loss, and is then tested with the simulated and real development data of CHiME-3.
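The input dimensionality above follows from the splicing arithmetic: 29 static + 29 delta + 29 delta-delta = 87 features per frame, and an 11-frame window (5 left + current + 5 right) gives 87 x 11 = 957. A minimal NumPy sketch of the splicing step, assuming edge frames are repeated for padding (the exact padding convention is an assumption):

```python
import numpy as np

feat_dim = 29 * 3        # static + delta + delta-delta = 87
left, right = 5, 5       # context frames on each side

T = 20                                       # frames in a toy utterance
rng = np.random.default_rng(2)
feats = rng.standard_normal((T, feat_dim))   # (frames, 87)

def splice(feats, left, right):
    # Pad the utterance by repeating edge frames, then stack each frame
    # with its left/right context into one long input vector.
    padded = np.concatenate([np.repeat(feats[:1], left, axis=0),
                             feats,
                             np.repeat(feats[-1:], right, axis=0)])
    return np.stack([padded[t:t + left + 1 + right].ravel()
                     for t in range(feats.shape[0])])

spliced = splice(feats, left, right)
# 87-dim frames with an 11-frame window give 957-dim inputs.
assert spliced.shape == (T, 87 * 11)   # (20, 957)
```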
The DNN well trained with clean data is then adapted to the 8738 noisy utterances from Channel 5 using the GRL method. No senone alignment of the noisy adaptation data is used for the unsupervised adaptation. The feature extractor is initialized with the first 4 hidden layers of the clean DNN, and the senone classifier is initialized with the last 3 hidden layers plus the output layer of the clean DNN. The domain classifier is a feed-forward DNN with two hidden layers of 512 hidden units each. The output layer of the domain classifier has 2 output units representing the source and target domains. The 2048 hidden units of the 4th hidden layer of the DNN acoustic model are fed as the input to the domain classifier. A GRL is inserted between the deep representation and the domain classifier for easy implementation. The GRL-adapted system is tested on the real and simulated noisy development data of the CHiME-3 dataset.
3.3 Domain Separation Networks for Unsupervised Adaptation
We adapt the clean DNN acoustic model trained in Section 3.2 to the 8738 noisy utterances using DSN. No senone alignment of the noisy adaptation data is used for the unsupervised adaptation.
The DSN is implemented with the CNTK 2.0 toolkit [32]. The shared component extractor $E_c$ is initialized with the first $h$ hidden layers of the clean DNN, and the senone classifier $C_y$ is initialized with the remaining $7-h$ hidden layers plus the output layer of the clean DNN. $h$ indicates the position of the shared component in the DNN acoustic model and is varied in our experiments. The domain classifier of the DSN has exactly the same architecture as that of the GRL baseline.
The private component extractors $E_p^s$ and $E_p^t$ for the clean and noisy domains are both feed-forward DNNs with 3 hidden layers of 512 hidden units each. The output layers of both $E_p^s$ and $E_p^t$ have 2048 output units. The reconstructor $R$ is a feed-forward DNN with 3 hidden layers of 512 hidden units each. The output layer of $R$ has 957 output units with no non-linear activation function, so as to reconstruct the spliced input features. The hidden units of all the sub-networks use sigmoid activation functions, and the output units of the senone classifier and the domain classifier use softmax. All the sub-networks except $E_c$ and $C_y$ are randomly initialized. The learning rate is kept fixed throughout the experiments. The adapted DSN is tested on the real and simulated development data of the CHiME-3 dataset.
Table 2: WER (%) of the DSN-adapted acoustic model for different shared component positions and reversal gradient coefficients.
3.4 Result Analysis
Table 1 shows the WER performance of the clean, GRL-adapted, and DSN-adapted DNN acoustic models for ASR. The clean DNN achieves 29.44% and 28.25% WER on the real and simulated development data, respectively. The GRL-adapted acoustic model achieves 27.16% and 27.16% WER on the real and simulated development data. The best WERs for the DSN-adapted acoustic model are 24.15% and 23.82% on the real and simulated development data, which represent 11.08% and 12.30% relative improvement over the GRL baseline system, and 17.97% and 17.69% relative improvement over the unadapted acoustic model. The best WERs are achieved when the shared component is taken at the highest hidden layer with an intermediate reversal gradient coefficient. Comparing the GRL and DSN performance at the same shared component position, we observe that the introduction of the private components and the reconstructor leads to a 5.1% relative improvement in WER.
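As a sanity check on the numbers above, the relative improvements can be recomputed directly; `rel_improvement` is a hypothetical helper, not part of any toolkit:

```python
# Relative WER improvement in percent: (baseline - adapted) / baseline * 100.
def rel_improvement(baseline_wer, adapted_wer):
    return 100.0 * (baseline_wer - adapted_wer) / baseline_wer

# WERs (%) from Table 1: clean (unadapted), GRL-adapted, DSN-adapted.
clean_real, grl_real, dsn_real = 29.44, 27.16, 24.15
grl_sim, dsn_sim = 27.16, 23.82

assert round(rel_improvement(grl_real, dsn_real), 2) == 11.08    # DSN vs GRL, real
assert round(rel_improvement(grl_sim, dsn_sim), 2) == 12.30      # DSN vs GRL, simulated
assert round(rel_improvement(clean_real, dsn_real), 2) == 17.97  # DSN vs unadapted, real
```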
We investigate the impact of the shared component position $h$ and the reversal gradient coefficient $\lambda$ on the WER performance in Table 2. We observe that the WER decreases as $h$ grows, which is reasonable: the higher hidden representations of a well-trained DNN acoustic model are inherently more senone-discriminative and domain-invariant than the lower layers and can serve as a better initialization for the DSN unsupervised adaptation.
In this work, we investigate the domain adaptation of a DNN acoustic model using domain separation networks. Different from the conventional supervised, semi-supervised, and T/S adaptation approaches, the DSN is capable of adapting the acoustic model to the adaptation data without any exposure to its transcription, decoded lattices, or unlabeled data parallel to the source domain. The shared component between the source and target domains extracted by the DSN through adversarial multi-task training is both domain-invariant and senone-discriminative. The extraction of a private component that is unique to each domain significantly improves the degree of domain-invariance and the ASR performance.
When evaluated on the CHiME-3 dataset for the environment adaptation task, the DSN achieves 11.08% and 17.97% relative WER improvement over the GRL baseline system and the unadapted acoustic model, respectively. The WER decreases when higher hidden representations of the DNN acoustic model are used as the initial shared component. The WER first decreases and then increases as the reversal gradient coefficient grows.
-  Frank Seide, Gang Li, and Dong Yu, “Conversational speech transcription using context-dependent deep neural networks,” in Proc. INTERSPEECH, 2011.
-  Tara N. Sainath, Brian Kingsbury, Bhuvana Ramabhadran, Petr Fousek, Petr Novak, and Abdel-rahman Mohamed, “Making deep belief networks effective for large vocabulary continuous speech recognition,” in Automatic Speech Recognition and Understanding (ASRU). IEEE, 2011, pp. 30–35.
-  Navdeep Jaitly, Patrick Nguyen, Andrew Senior, and Vincent Vanhoucke, “Application of pretrained deep neural networks to large vocabulary speech recognition,” in Proc. INTERSPEECH, 2012.
-  Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.
-  Li Deng, Jinyu Li, Jui-Ting Huang, Kaisheng Yao, Dong Yu, Frank Seide, Michael Seltzer, Geoff Zweig, Xiaodong He, Jason Williams, et al., “Recent advances in deep learning for speech research at microsoft,” in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013, pp. 8604–8608.
-  Dong Yu and Jinyu Li, “Recent progresses in deep learning based acoustic models,” IEEE/CAA Journal of Automatica Sinica, vol. 4, no. 3, pp. 396–409, 2017.
-  J. Li, L. Deng, Y. Gong, and R. Haeb-Umbach, “An overview of noise-robust automatic speech recognition,” IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 22, no. 4, pp. 745–777, April 2014.
-  J. Li, L. Deng, R. Haeb-Umbach, and Y. Gong, Robust Automatic Speech Recognition: A Bridge to Practical Applications, Academic Press, 2015.
-  D. Yu, K. Yao, H. Su, G. Li, and F. Seide, “Kl-divergence regularized deep neural network adaptation for improved large vocabulary speech recognition,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, May 2013, pp. 7893–7897.
-  Zhen Huang, Sabato Marco Siniscalchi, I-Fan Chen, Jinyu Li, Jiadong Wu, and Chin-Hui Lee, “Maximum a posteriori adaptation of network parameters in deep models,” in Interspeech, 2015.
-  H. Liao, “Speaker adaptation of context dependent deep neural networks,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, May 2013, pp. 7947–7951.
-  Zhen Huang, Jinyu Li, Sabato Marco Siniscalchi, I-Fan Chen, Ji Wu, and Chin-Hui Lee, “Rapid adaptation for deep neural networks through multi-task learning.,” in Interspeech, 2015, pp. 3625–3629.
-  Roberto Gemello, Franco Mana, Stefano Scanzio, Pietro Laface, and Renato De Mori, “Linear hidden transformations for adaptation of hybrid ann/hmm models,” Speech Communication, vol. 49, no. 10, pp. 827 – 835, 2007, Intrinsic Speech Variations.
-  F. Seide, G. Li, X. Chen, and D. Yu, “Feature engineering in context-dependent deep neural networks for conversational speech transcription,” in 2011 IEEE Workshop on Automatic Speech Recognition Understanding, Dec 2011, pp. 24–29.
-  Jian Xue, Jinyu Li, and Yifan Gong, “Restructuring of deep neural network acoustic models with singular value decomposition,” in Interspeech, 2013, pp. 2365–2369.
-  J. Xue, J. Li, D. Yu, M. Seltzer, and Y. Gong, “Singular value decomposition based low-footprint speaker adaptation and personalization for deep neural network,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2014, pp. 6359–6363.
-  Y. Zhao, J. Li, and Y. Gong, “Low-rank plus diagonal adaptation for deep neural networks,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2016, pp. 5005–5009.
-  P. Swietojanski, J. Li, and S. Renals, “Learning hidden unit contributions for unsupervised acoustic model adaptation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 8, pp. 1450–1463, Aug 2016.
-  P. Swietojanski and S. Renals, “Differentiable pooling for unsupervised speaker adaptation,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), April 2015, pp. 4305–4309.
-  Z. Q. Wang and D. Wang, “Unsupervised speaker adaptation of batch normalized acoustic models for robust ASR,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2017, pp. 4890–4894.
-  Jinyu Li, Michael L Seltzer, Xi Wang, Rui Zhao, and Yifan Gong, “Large-scale domain adaptation via teacher-student learning,” in INTERSPEECH, 2017.
-  Jinyu Li, Rui Zhao, Jui-Ting Huang, and Yifan Gong, “Learning small-size DNN with output-distribution-based criteria.,” in INTERSPEECH, 2014, pp. 1910–1914.
-  Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds., pp. 2672–2680. Curran Associates, Inc., 2014.
-  Yaroslav Ganin and Victor Lempitsky, “Unsupervised domain adaptation by backpropagation,” in Proceedings of the 32nd International Conference on Machine Learning, Francis Bach and David Blei, Eds., Lille, France, 07–09 Jul 2015, vol. 37 of Proceedings of Machine Learning Research, pp. 1180–1189, PMLR.
-  Sining Sun, Binbin Zhang, Lei Xie, and Yanning Zhang, “An unsupervised deep domain adaptation approach for robust speech recognition,” Neurocomputing, vol. 257, pp. 79 – 87, 2017, Machine Learning and Signal Processing for Big Multimedia Analysis.
-  Yusuke Shinohara, “Adversarial multi-task learning of deep neural networks for robust speech recognition.,” in INTERSPEECH, 2016, pp. 2369–2372.
-  Dmitriy Serdyuk, Kartik Audhkhasi, Philémon Brakel, Bhuvana Ramabhadran, Samuel Thomas, and Yoshua Bengio, “Invariant representations for noisy speech recognition,” in NIPS 2016 End-to-end Learning for Speech and Audio Processing Workshop, 2016.
-  Konstantinos Bousmalis, George Trigeorgis, Nathan Silberman, Dilip Krishnan, and Dumitru Erhan, “Domain separation networks,” in Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, Eds., pp. 343–351. Curran Associates, Inc., 2016.
-  J. Barker, R. Marxer, E. Vincent, and S. Watanabe, “The third CHiME speech separation and recognition challenge: Dataset, task and baselines,” in 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), Dec 2015, pp. 504–511.
-  T. Hori, Z. Chen, H. Erdogan, J. R. Hershey, J. Le Roux, V. Mitra, et al., “The MERL/SRI system for the 3rd CHiME challenge using beamforming, robust feature extraction, and advanced speech recognition,” in 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), Dec 2015, pp. 475–481.
-  Jinyu Li, Dong Yu, Jui-Ting Huang, and Yifan Gong, “Improving wideband speech recognition using mixed-bandwidth training data in cd-dnn-hmm,” in Spoken Language Technology Workshop (SLT), 2012 IEEE. IEEE, 2012, pp. 131–136.
-  Dong Yu, Adam Eversole, Mike Seltzer, Kaisheng Yao, Zhiheng Huang, Brian Guenter, Oleksii Kuchaiev, Yu Zhang, Frank Seide, Huaming Wang, et al., “An introduction to computational networks and the computational network toolkit,” Microsoft Technical Report MSR-TR-2014-112, 2014.