have greatly advanced the performance of automatic speech recognition (ASR) with large amounts of training data. However, performance degrades when the test data comes from a new domain. Many DNN adaptation approaches have been proposed to compensate for the acoustic mismatch between training and testing. In [4, 5, 6], regularization-based approaches restrict the neuron output distributions or the model parameters to stay close to those of the source-domain model. In [7, 8], transformation-based approaches reduce the number of learnable parameters by updating only the transform-related parameters. In [9, 10], the trainable parameters are further reduced via singular value decomposition of the weight matrices of a neural network. In addition, i-vectors and speaker codes [12, 13] are used as auxiliary features to a neural network for model adaptation. In [14, 15], these adaptation methods were further investigated for end-to-end ASR [16, 17]. However, all these methods focus on addressing the overfitting issue given very limited adaptation data in the target domain.
Teacher-student (T/S) learning [18, 19, 20, 21] has been shown to be effective for large-scale unsupervised domain adaptation by minimizing the Kullback-Leibler (KL) divergence between the output distributions of the teacher and student models. The inputs to the teacher and student models need to be parallel source- and target-domain adaptation data, respectively, since the output vectors of the teacher network must be frame-by-frame aligned with those of the student network to construct the KL divergence between the two distributions. Compared to one-hot labels, using the frame-level senone (tri-phone state) posteriors from the teacher as the soft targets for training the student model preserves the relationships among the different senones at the output of the teacher network. However, the parallel data constraint of T/S learning restricts its application to scenarios where paired target-domain data can easily be simulated from the source-domain data (e.g., from clean to noisy speech). In many scenarios, however, the generation of parallel data in a new domain is almost impossible, e.g., simulating paired accented or kids’ speech from standard adults’ speech.
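The frame-level T/S objective described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation; the function name and array shapes are assumptions.

```python
import numpy as np

def ts_loss(teacher_posteriors, student_posteriors, eps=1e-12):
    """Frame-level T/S loss: the summed KL divergence KL(teacher || student).

    Both inputs are (num_frames, num_senones) arrays of senone posteriors;
    rows must be aligned frame-by-frame, which is why T/S learning needs
    parallel source- and target-domain data."""
    p = np.clip(teacher_posteriors, eps, 1.0)
    q = np.clip(student_posteriors, eps, 1.0)
    return float(np.sum(p * (np.log(p) - np.log(q))))
```

Since the teacher term of the KL is constant with respect to the student, minimizing this loss is equivalent to cross-entropy training of the student against the teacher's soft targets.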
Recently, adversarial learning [22, 23] was proposed for domain-invariant training [24, 25, 26], speaker adaptation, speech enhancement [28, 29, 30] and speaker verification [31, 32]. It has also been shown to be effective for unsupervised domain adaptation without using parallel data [33, 34]. In adversarial learning, an auxiliary domain classifier is jointly optimized with the source model to mini-maximize an adversarial loss, so that a deep representation is learned that is invariant to domain shifts yet discriminative for senone classification. However, adversarial learning does not make use of the target-domain labels, which carry important class identity information, and is thus only suitable for situations where neither parallel data nor target-domain labels are available.
How can we perform effective domain adaptation using unpaired source- and target-domain data with labels? We propose a neural label embedding (NLE) method: instead of the frame-by-frame knowledge transfer of T/S learning, we distill the knowledge of a source-domain model into a fixed set of label embeddings, or l-vectors, one for each senone class, and then transfer the knowledge to the target-domain model via these senone-specific l-vectors. Each l-vector is a condensed representation of the DNN output distributions given all the features aligned with the same senone at the input. A simple DNN-based method is proposed to learn the l-vectors by minimizing the average L2, Kullback-Leibler (KL) or symmetric KL (SKL) distance to the output vectors with the same senone label. During adaptation, the l-vectors are used in lieu of their corresponding one-hot labels to train the target-domain model with a cross-entropy loss.
NLE can be viewed as a form of knowledge quantization over output-distribution vectors, in which each l-vector is a code vector (centroid) corresponding to a senone codeword. With the NLE method, knowledge is transferred from the source-domain model to the target domain through a fixed codebook of senone-specific l-vectors instead of the variable-length, frame-specific output-distribution vectors of T/S learning. These distilled l-vectors decouple the target-domain model’s output distributions from those of the source-domain model and thus enable more flexible and efficient senone-level knowledge transfer using unpaired data. When parallel data is available, NLE significantly reduces the computational cost during adaptation compared to T/S learning by replacing the forward propagation of each source-domain frame through the source-domain model with a fast look-up in the l-vector codebook. In the experiments, we adapt a multi-conditional acoustic model trained with 6400 hours of US English to each of 9 different English accents (120 to 830 hours each) and to kids’ speech (80 hours). The proposed NLE method achieves 5.4% to 14.1% and 6.0% relative word error rate (WER) reductions over the one-hot label baseline on the 9 English accents and kids’ speech, respectively.
2 Neural Label Embedding (NLE) for Domain Adaptation
In this section, we present the NLE method for domain adaptation without using parallel data. Initially, we have a well-trained source-domain network predicting a set of senones, together with source-domain speech frames and their senone labels. We distill the knowledge of this powerful source-domain model into a dictionary of l-vectors, one for each senone label (class) predicted at the output layer. Each l-vector has the same dimensionality as the number of senone classes. Before training the target-domain model, we query the dictionary with the ground-truth one-hot senone labels of the target-domain speech frames to get their corresponding l-vectors. During adaptation, in place of the one-hot labels, the l-vectors are used as the soft targets to train the target-domain model. For NLE domain adaptation, the source-domain data does not have to be parallel to the target-domain data, i.e., the source- and target-domain frames do not have to be frame-by-frame synchronized, and the numbers of frames in the two domains do not have to be equal.
The key step of the NLE method is to learn the l-vectors from the source-domain model and data. As the carrier of knowledge transferred from the source domain to the target domain, the l-vector of a senone class should be a representation of the output distributions (senone-posterior distributions) of the source-domain DNN given features aligned with that senone at the input, encoding the dependency between that senone and all the other senones. A reasonable candidate is the centroid vector that minimizes the average distance to the output vectors generated from all the frames aligned with the senone. Therefore, we need to learn a dictionary of l-vectors, one for each senone in the complete senone set, with each l-vector $l_s$ being $D$-dimensional, where $D$ is the number of senones. To serve as the training target of the target-domain model, each l-vector needs to be normalized such that its elements satisfy

$0 \le l_{s,d} \le 1, \quad \sum_{d=1}^{D} l_{s,d} = 1,$    (1)

where $l_{s,d}$ denotes the $d$-th element of $l_s$.
2.1 NLE Based on L2 Distance Minimization (NLE-L2)
To compute the senone-specific centroid, the most intuitive solution is to minimize the average L2 distance between the centroid and all the output vectors with the same senone label, which is equivalent to calculating the arithmetic mean of the output vectors aligned with that senone. Let $y(x)$ denote the $D$-dimensional output vector of the source-domain network given the input frame $x$, whose $d$-th element $y_d(x)$ equals the posterior probability of senone $d$ given $x$. For senone $s$, the l-vector based on L2 distance minimization is computed as

$l_s = \frac{1}{N_s} \sum_{x \in X_s} y(x),$

where $X_s$ is the set of source-domain frames aligned with senone $s$ and $N_s$ is the number of such frames. The l-vectors under NLE-L2 are automatically normalized, since each posterior vector in the mean computation satisfies Eq. (1).
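The per-senone averaging above can be sketched in a few lines of NumPy; this is an illustration under assumed array shapes, and the function and variable names are our own.

```python
import numpy as np

def nle_l2_vectors(posteriors, senone_labels, num_senones):
    """NLE-L2: the l-vector of each senone is the arithmetic mean of the
    source-model output (posterior) vectors over all frames aligned with
    that senone, i.e. the centroid under the L2 distance.

    posteriors: (num_frames, num_senones) source-model outputs
    senone_labels: (num_frames,) integer senone alignment per frame"""
    lvectors = np.zeros((num_senones, posteriors.shape[1]))
    for s in range(num_senones):
        frames = posteriors[senone_labels == s]
        if len(frames) > 0:
            # a mean of normalized posterior vectors stays normalized
            lvectors[s] = frames.mean(axis=0)
    return lvectors
```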
2.2 NLE Based on KL Distance Minimization (NLE-KL)
KL divergence is an effective metric for measuring the distance between two distributions. In the NLE framework, the l-vector of a senone can be learned as a centroid with a minimum average KL distance to the output vectors of that senone. Many methods have been proposed to iteratively compute centroids under the KL distance [36, 37, 38].
In this paper, we propose a simple DNN-based solution to compute this KL-based centroid. As shown in Fig. 1, we have an initial embedding matrix consisting of all the l-vectors. For each source-domain sample $(x, s)$, we look up the senone label $s$ in the embedding matrix to get its l-vector $l_s$ and forward-propagate $x$ through the source-domain network to obtain the output vector $y(x)$. The KL distance between $y(x)$ and its corresponding centroid l-vector $l_s$ is

$D_{KL}(l_s \| y(x)) = \sum_{d=1}^{D} l_{s,d} \log \frac{l_{s,d}}{y_d(x)}.$

We sum up the KL distances over all source-domain samples to obtain the KL distance loss

$\mathcal{L}_{KL} = \sum_{(x, s)} D_{KL}(l_s \| y(x)).$
To ensure each l-vector is normalized to satisfy Eq. (1), we obtain $l_s$ by performing a softmax operation over a logit vector $z_s$:

$l_{s,d} = \frac{\exp(z_{s,d})}{\sum_{d'=1}^{D} \exp(z_{s,d'})}.$
For fast convergence, each $z_s$ is initialized with the arithmetic mean of the pre-softmax logit vectors of the source-domain network over the frames aligned with senone $s$. The embedding matrix is trained to minimize the KL distance loss by updating the logit vectors through standard back-propagation while the parameters of the source-domain network are kept fixed.
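For a single senone, the logit-space optimization can be illustrated with plain gradient descent. This is a sketch, not the paper's implementation: it assumes the KL is taken from the l-vector to the output vectors, and the closed-form gradient follows from differentiating the softmax.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def learn_kl_lvector(outputs, init_logits, lr=0.2, steps=5000, eps=1e-12):
    """Learn one senone's l-vector by minimizing the summed KL distance
    KL(l || y(x)) to all output vectors y(x) aligned with that senone.

    A softmax over the logit vector z keeps l normalized (Eq. (1)); z is
    updated by gradient descent while the source model's outputs are fixed.
    outputs: (num_frames, num_senones) posteriors for this senone's frames."""
    z = init_logits.astype(float).copy()
    logy = np.log(np.clip(outputs, eps, 1.0))
    for _ in range(steps):
        l = softmax(z)
        a = np.log(np.clip(l, eps, 1.0)) - logy         # log(l / y), per frame
        kl = (l * a).sum(axis=1, keepdims=True)         # per-frame KL(l || y)
        grad = (l * (a - kl)).sum(axis=0)               # d/dz of the summed KL
        z -= lr * grad
    return softmax(z)
```

With this KL direction the minimizer is the normalized geometric mean of the output vectors, which the sketch converges to; under the reverse direction the centroid would instead be the arithmetic mean, as in NLE-L2.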
2.3 NLE Based on Symmetric KL Distance Minimization (NLE-SKL)
One shortcoming of the KL distance is that it is asymmetric: minimizing the KL distance from the l-vector to an output vector does not guarantee that the reverse KL distance is also minimized. The symmetric KL (SKL) distance compensates for this by adding up the two KL terms and is thus a more robust distance metric for clustering. Therefore, for each senone, we learn a centroid l-vector with a minimum average SKL distance to the output vectors aligned with that senone, following the same DNN-based method as in Section 2.2 except that the KL distance loss is replaced with an SKL one.
The SKL distance between an l-vector $l_s$ and an output vector $y(x)$ is defined as

$D_{SKL}(l_s, y(x)) = D_{KL}(l_s \| y(x)) + D_{KL}(y(x) \| l_s) = \sum_{d=1}^{D} \left(l_{s,d} - y_d(x)\right) \log \frac{l_{s,d}}{y_d(x)},$

and the SKL distance loss is computed by summing the SKL distances between the output vectors and their centroids over all source-domain samples:

$\mathcal{L}_{SKL} = \sum_{(x, s)} D_{SKL}(l_s, y(x)).$
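The SKL distance and its simplification into a single sum can be written directly; the function name below is illustrative.

```python
import numpy as np

def skl_distance(l, y, eps=1e-12):
    """Symmetric KL between an l-vector l and an output vector y:
    KL(l || y) + KL(y || l), which simplifies to
    sum_d (l_d - y_d) * log(l_d / y_d); every term is non-negative."""
    l = np.clip(l, eps, 1.0)
    y = np.clip(y, eps, 1.0)
    return float(np.sum((l - y) * (np.log(l) - np.log(y))))
```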
2.4 Train Target-Domain Model with NLE
As the condensed knowledge distilled from a large amount of source-domain data, the l-vectors serve as the soft targets for training the target-domain model.
As shown in Fig. 2, for each target-domain frame $x$ with senone label $s$, we look up $s$ in the optimized label embedding matrix for its l-vector $l_s$ and forward-propagate $x$ through the target-domain network to get the output vector. We construct a cross-entropy loss using the l-vectors as the soft targets:

$\mathcal{L}_{CE} = -\sum_{(x, s)} \sum_{d=1}^{D} l_{s,d} \log p(d \mid x),$

where $p(d \mid x)$ is the posterior of senone $d$ given the target-domain frame $x$. We train the target-domain model to minimize this loss by updating only its network parameters. The optimized target-domain model is used for decoding.
Compared with traditional one-hot training targets, which convey only class identities, the soft l-vectors transfer additional quantized knowledge that encodes the probabilistic relationships among the different senone classes. Benefiting from this, the NLE-adapted acoustic model is expected to achieve higher ASR performance than one trained with one-hot labels on target-domain test data. The steps of NLE for domain adaptation are summarized in Algorithm 1.
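The adaptation loss, including the fast codebook lookup that replaces a teacher forward pass, might be sketched as follows; shapes and names are our assumptions.

```python
import numpy as np

def soft_target_ce_loss(lvectors, labels, posteriors, eps=1e-12):
    """Cross-entropy adaptation loss with l-vectors as soft targets.

    For each target-domain frame, the l-vector of its ground-truth senone
    label is fetched by a fast dictionary (codebook) lookup -- no
    source-model forward propagation is needed, unlike T/S learning.
    lvectors: (num_senones, num_senones) codebook, one row per senone
    labels: (num_frames,) ground-truth senone label per frame
    posteriors: (num_frames, num_senones) target-model outputs"""
    targets = lvectors[labels]                 # codebook lookup per frame
    logq = np.log(np.clip(posteriors, eps, 1.0))
    return float(-np.sum(targets * logq))
```

With one-hot rows in the codebook this reduces exactly to standard cross-entropy training, which makes the one-hot baseline a special case of the same loss.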
3 Experiments

We perform two domain adaptation tasks in which parallel source- and target-domain data cannot be obtained through data simulation: 1) adapting a US English acoustic model to accented English from 9 areas of the world; 2) adapting the same acoustic model to kids’ speech. In both tasks, the source-domain training data is 6400 hours of multi-conditional Microsoft US English production data, including Cortana, Xbox and conversational data. The data is collected mostly from adults all over the US. It is a mixture of close-talk and far-field utterances from a variety of devices.
For the first task, the adaptation data consists of 9 different types of accented English, A1-A9, in which A1, A2, A3 and A8 are from Europe, A4, A5 and A6 are from Asia, A7 is from Oceania, and A9 is from North America. A7-A9 are native accents, since they come from countries where most people speak English as their first language; in contrast, A1-A6 are non-native accents. Each English accent forms a specific target domain. For the second task, the adaptation data is 80 hours of US English speech collected from kids. The durations of the different adaptation and test sets are listed in Table 1. The training and adaptation data is transcribed. All data is anonymized with personally identifiable information removed.
3.1 Baseline System
The source-domain acoustic model is a bi-directional long short-term memory (BLSTM) network trained with 6400 hours of training data. The model has 6 hidden layers with 600 units in each layer. 80-dimensional log Mel filterbank features are extracted from the training, adaptation and test data. The output layer has 9404 units representing 9404 senone labels. The BLSTM is trained to minimize the frame-level cross-entropy criterion. There is no frame stacking or skipping. A 5-gram LM with around 148M n-grams is used for decoding. Table 2 (Row 1) shows the WERs of the multi-conditional BLSTM on the different accents. This well-trained source-domain model is used as the initialization for all the subsequent re-training and adaptation experiments.
For accent adaptation, we train an accent-dependent BLSTM for each English accent using one-hot labels with a cross-entropy loss. Each accent-dependent model is trained with the speech of only one accent. As shown in Table 2, one-hot re-training achieves 9.71% to 20.37% WERs on the different accents. For kids adaptation, we train a kid-dependent BLSTM on kids’ speech using one-hot labels. In Table 2, we see that one-hot re-training achieves a 26.99% WER on the kids’ test data. We use these results as the baselines.
Note that, in this work, we do not compare NLE with KLD adaptation, since the effectiveness of KLD regularization diminishes as the amount of adaptation data increases; it is normally used when the adaptation data is very limited (10 minutes or less).
3.2 NLE for Accent Adaptation
It is hard to simulate parallel accented speech from US English, so we adapt the 6400-hour BLSTM acoustic model to 9 different English accents using NLE. We learn 9404-dimensional l-vectors using NLE-L2, NLE-KL and NLE-SKL, as described in Sections 2.1 to 2.3, with the source-domain data and acoustic model. These l-vectors are used as the soft targets to train the accent-dependent models with a cross-entropy loss as in Section 2.4.
As shown in Table 2, NLE-L2, NLE-KL and NLE-SKL achieve 9.48% to 18.54%, 9.43% to 18.74%, and 9.19% to 17.97% WERs, respectively, on the different accents. NLE-SKL performs the best among the three NLE adaptation methods, with 11.8%, 14.1%, 11.5%, 10.3%, 8.3%, 7.5%, 13.5%, 12.2% and 5.4% relative WER reductions over the one-hot label baseline on A1 to A9, respectively. NLE-SKL consistently outperforms NLE-L2 and NLE-KL on all the accents, with up to 4.0% and 4.9% relative WER reductions over NLE-L2 and NLE-KL, respectively. The relative reductions for native and non-native accents are similar, except for A9. NLE-KL performs slightly better than NLE-L2 on 6 out of 9 accents, but slightly worse on the other 3. All three NLE methods achieve a much smaller relative WER reduction (about 5%) on A9 than on the other accents (about 10%). This is reasonable, because North American English is much more similar to the source-domain US English than the other accents, so the source-domain model does not need to be adapted much to the accent of the target-domain speech.
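As a sanity check on the reported numbers, relative WER reduction is (baseline − adapted) / baseline. Pairing the largest baseline WER (20.37%) with the largest NLE-SKL WER (17.97%), under the assumption that they belong to the same accent, reproduces the 11.8% figure quoted for A1.

```python
def relative_wer_reduction(baseline_wer, adapted_wer):
    """Relative WER reduction in percent, as used throughout the results."""
    return 100.0 * (baseline_wer - adapted_wer) / baseline_wer

# e.g. relative_wer_reduction(20.37, 17.97) is about 11.8 (percent)
```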
3.3 NLE for Kids Adaptation
Parallel kids’ speech cannot be obtained through data simulation either, so we adapt the 6400-hour BLSTM acoustic model to the collected real kids’ speech using NLE. We use the same l-vectors learned in Section 3.2 as the soft targets to train the kid-dependent BLSTM acoustic model by minimizing the cross-entropy loss. As shown in Table 2, NLE-L2, NLE-KL and NLE-SKL achieve 25.93%, 25.83% and 25.36% WERs on the kids’ test set, respectively. NLE-SKL outperforms the other two NLE methods, with a 6.0% relative WER reduction over the one-hot baseline. We find that NLE is more effective for accent adaptation than for kids adaptation. One possible reason is that a portion of the kids are teenagers, whose speech is very similar to that of the adults in the 6400 hours of source-domain data. Note that all the kids’ speech is collected in the US, so no accent adaptation is involved.
4 Conclusion

We propose a novel neural label embedding method for domain adaptation. Each senone label is represented by an l-vector that minimizes the average L2, KL or SKL distance to all the source-domain output vectors aligned with the same senone. The l-vectors are learned through a simple arithmetic mean or the proposed DNN-based method. During adaptation, the l-vectors serve as the soft targets to train the target-domain model. Without the parallel data constraint of T/S learning, NLE is especially suited for situations where paired target-domain data samples cannot be simulated from the source-domain ones. Given parallel data, NLE has a significantly lower computational cost than T/S learning during adaptation, since it replaces the DNN forward propagation with a fast dictionary lookup.
We adapt a multi-conditional BLSTM acoustic model trained with 6400 hours of US English to 9 different English accents and to kids’ speech. NLE achieves 5.4% to 14.1% and 6.0% relative WER reductions over the one-hot label baseline, respectively. NLE-SKL consistently outperforms NLE-L2 and NLE-KL on all adaptation tasks, by up to 4.0% and 4.9% relative, respectively. As a simple arithmetic mean, NLE-L2 performs similarly to NLE-KL with a dramatically lower computational cost for l-vector learning.
-  G. Hinton, L. Deng, D. Yu, et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.
-  T. Sainath, B. Kingsbury, B. Ramabhadran, et al., “Making deep belief networks effective for large vocabulary continuous speech recognition,” in Proc. ASRU, 2011, pp. 30–35.
-  L. Deng, J. Li, J. Huang, et al., “Recent advances in deep learning for speech research at Microsoft,” in Proc. ICASSP, 2013.
-  D. Yu, K. Yao, H. Su, et al., “KL-divergence regularized deep neural network adaptation for improved large vocabulary speech recognition,” in Proc. ICASSP, May 2013.
-  Z. Huang, S. Siniscalchi, I. Chen, et al., “Maximum a posteriori adaptation of network parameters in deep models,” in Proc. Interspeech, 2015.
-  H. Liao, “Speaker adaptation of context dependent deep neural networks,” in Proc. ICASSP, May 2013.
-  R. Gemello, F. Mana, S. Scanzio, et al., “Linear hidden transformations for adaptation of hybrid ANN/HMM models,” Speech Communication, vol. 49, no. 10, pp. 827–835, 2007.
-  F. Seide, G. Li, X. Chen, and D. Yu, “Feature engineering in context-dependent deep neural networks for conversational speech transcription,” in Proc. ASRU, Dec 2011, pp. 24–29.
-  J. Xue, J. Li, and Y. Gong, “Restructuring of deep neural network acoustic models with singular value decomposition,” in Interspeech, 2013.
-  J. Xue, J. Li, D. Yu, et al., “Singular value decomposition based low-footprint speaker adaptation and personalization for deep neural network,” in Proc. ICASSP, May 2014.
-  G. Saon, H. Soltau, et al., “Speaker adaptation of neural network acoustic models using i-vectors,” in ASRU, 2013.
-  O. Abdel-Hamid and H. Jiang, “Fast speaker adaptation of hybrid nn/hmm model for speech recognition based on discriminative learning of speaker code,” in Proc. ICASSP, May 2013.
-  S. Xue, O. Abdel-Hamid, H. Jiang, et al., “Fast adaptation of deep neural network based on discriminant codes for speech recognition,” in TASLP, vol. 22, no. 12, Dec 2014.
-  F. Weninger, J. Andrés-Ferrer, X. Li, et al., “Listen, attend, spell and adapt: Speaker adapted sequence-to-sequence ASR,” Proc. Interspeech, 2019.
-  Z. Meng, Y. Gaur, J. Li, et al., “Speaker adaptation for attention-based end-to-end speech recognition,” Proc. Interspeech, 2019.
-  J. Chorowski, D. Bahdanau, D. Serdyuk, et al., “Attention-based models for speech recognition,” in NIPS, 2015, pp. 577–585.
-  Z. Meng, Y. Gaur, J. Li, and Y. Gong, “Character-aware attention-based end-to-end speech recognition,” in Proc. ASRU. IEEE, 2019.
-  J. Li, R. Zhao, J.-T. Huang, et al., “Learning small-size DNN with output-distribution-based criteria,” in Proc. INTERSPEECH, 2014, pp. 1910–1914.
-  J. Li, M. L Seltzer, X. Wang, et al., “Large-scale domain adaptation via teacher-student learning,” in INTERSPEECH, 2017.
-  Z. Meng, J. Li, Y. Zhao, et al., “Conditional teacher-student learning,” in Proc. ICASSP, 2019.
-  Z. Meng, J. Li, Y. Gaur, et al., “Domain adaptation via teacher-student learning for end-to-end speech recognition,” in Proc. ASRU. IEEE, 2019.
-  I. Goodfellow, J. Pouget-Abadie, et al., “Generative adversarial nets,” in Proc. NIPS, 2014, pp. 2672–2680.
-  Y. Ganin and V. Lempitsky, “Unsupervised domain adaptation by backpropagation,” in Proc. ICML, Lille, France, 2015, vol. 37, pp. 1180–1189, PMLR.
-  Y. Shinohara, “Adversarial multi-task learning of deep neural networks for robust speech recognition,” in INTERSPEECH, 2016, pp. 2369–2372.
-  Z. Meng, J. Li, Z. Chen, et al., “Speaker-invariant training via adversarial learning,” in Proc. ICASSP, 2018.
-  Z. Meng, J. Li, Y. Gong, et al., “Adversarial teacher-student learning for unsupervised domain adaptation,” in Proc. ICASSP. IEEE, 2018, pp. 5949–5953.
-  Z. Meng, J. Li, and Y. Gong, “Adversarial speaker adaptation,” in Proc. ICASSP, 2019.
-  S. Pascual, A. Bonafonte, et al., “SEGAN: Speech enhancement generative adversarial network,” in Interspeech, 2017.
-  Z. Meng, J. Li, and Y. Gong, “Cycle-consistent speech enhancement,” Interspeech, 2018.
-  Z. Meng, J. Li, and Y. Gong, “Adversarial feature-mapping for speech enhancement,” Interspeech, 2018.
-  Q. Wang, W. Rao, S. Sun, et al., “Unsupervised domain adaptation via domain adversarial training for speaker recognition,” ICASSP, 2018.
-  Z. Meng, Y. Zhao, J. Li, and Y. Gong, “Adversarial speaker verification,” in Proc. ICASSP, 2019.
-  S. Sun, B. Zhang, L. Xie, et al., “An unsupervised deep domain adaptation approach for robust speech recognition,” Neurocomputing, vol. 257, pp. 79 – 87, 2017.
-  Z. Meng, Z. Chen, V. Mazalov, J. Li, and Y. Gong, “Unsupervised adaptation with domain separation networks for robust speech recognition,” in Proc. ASRU, 2017.
-  R. Gray, “Vector quantization,” IEEE ASSP Magazine, vol. 1, no. 2, pp. 4–29, 1984.
-  K. Chaudhuri and A. McGregor, “Finding metric structure in information theoretic clustering,” in COLT, 2008, vol. 8, p. 10.
-  R. Veldhuis, “The centroid of the symmetrical Kullback-Leibler distance,” IEEE Signal Processing Letters, vol. 9, 2002.
-  M. Das Gupta, S. Srinivasa, M. Antony, et al., “KL divergence based agglomerative clustering for automated vitiligo grading,” in Proc. CVPR, 2015, pp. 2700–2709.
-  H. Sak, A. Senior, and F. Beaufays, “Long short-term memory recurrent neural network architectures for large scale acoustic modeling,” in Interspeech, 2014.
-  H. Erdogan, T. Hayashi, J. R. Hershey, et al., “Multi-channel speech recognition: LSTMs all the way through,” in CHiME-4 workshop, 2016, pp. 1–4.
-  Z. Meng, S. Watanabe, J. R. Hershey, et al., “Deep long short-term memory adaptive beamforming networks for multichannel robust speech recognition,” in ICASSP, 2017, pp. 271–275.