1 Introduction
Deep neural networks (DNNs) [1, 2, 3] have greatly advanced the performance of automatic speech recognition (ASR) when a large amount of training data is available. However, the performance degrades when the test data comes from a new domain. Many DNN adaptation approaches have been proposed to compensate for the acoustic mismatch between training and testing. In [4, 5, 6], regularization-based approaches keep the neuron output distributions or the model parameters close to those of the source-domain model. In [7, 8], transformation-based approaches reduce the number of learnable parameters by updating only the transform-related parameters. In [9, 10], the trainable parameters are further reduced by singular value decomposition of the weight matrices of a neural network. In addition, i-vectors [11] and speaker codes [12, 13] are used as auxiliary features to a neural network for model adaptation. In [14, 15], these adaptation methods were further investigated in end-to-end ASR [16, 17]. However, all these methods focus on addressing the overfitting issue given very limited adaptation data in the target domain.
Teacher-student (T/S) learning [18, 19, 20, 21] has been shown to be effective for large-scale unsupervised domain adaptation by minimizing the Kullback-Leibler (KL) divergence between the output distributions of the teacher and student models. The inputs to the teacher and student models need to be parallel source- and target-domain adaptation data, respectively, since the output vectors of the teacher network need to be frame-by-frame aligned with those of the student network to construct the KL divergence between the two distributions. Compared to one-hot labels, the frame-level senone (triphone state) posteriors from the teacher, used as soft targets to train the student model, preserve the relationships among different senones at the output of the teacher network. However, the parallel data constraint of T/S learning restricts its application to scenarios where the paired target-domain data can be easily simulated from the source-domain data (e.g., from clean to noisy speech). In many scenarios, the generation of parallel data in a new domain is almost impossible, e.g., simulating paired accented or kids’ speech from standard adults’ speech.
Recently, adversarial learning [22, 23] was proposed for domain-invariant training [24, 25, 26], speaker adaptation [27], speech enhancement [28, 29, 30] and speaker verification [31, 32]. It was also shown to be effective for unsupervised domain adaptation without using parallel data [33, 34]. In adversarial learning, an auxiliary domain classifier is jointly optimized with the source model to mini-maximize an adversarial loss. A deep representation is learned to be invariant to domain shifts and discriminative to senone classification. However, adversarial learning does not make use of the target-domain labels, which carry important class identity information, and is only suitable for the situation where neither parallel data nor target-domain labels are available.
How can we perform effective domain adaptation using unpaired source- and target-domain data with labels? We propose a neural label embedding (NLE) method: instead of the frame-by-frame knowledge transfer in T/S learning, we distill the knowledge of a source-domain model into a fixed set of label embeddings, or l-vectors, one for each senone class, and then transfer the knowledge to the target-domain model via these senone-specific l-vectors. Each l-vector is a condensed representation of the DNN output distributions given all the features aligned with the same senone at the input. A simple DNN-based method is proposed to learn the l-vectors by minimizing the average $L_2$, Kullback-Leibler (KL) or symmetric KL (SKL) distance to the output vectors with the same senone label. During adaptation, the l-vectors are used in lieu of their corresponding one-hot labels to train the target-domain model with cross-entropy loss.
NLE can be viewed as knowledge quantization [35] in the form of output-distribution vectors, where each l-vector is a codevector (centroid) corresponding to a senone codeword. With the NLE method, knowledge is transferred from the source-domain model to the target-domain model through a fixed codebook of senone-specific l-vectors instead of the variable-length frame-specific output-distribution vectors used in T/S learning. These distilled l-vectors decouple the target-domain model’s output distributions from those of the source-domain model and thus enable a more flexible and efficient senone-level knowledge transfer using unpaired data. When parallel data is available, NLE significantly reduces the computational cost during adaptation compared to T/S learning by replacing the forward-propagation of each source-domain frame through the source-domain model with a fast lookup in the l-vector codebook. In the experiments, we adapt a multi-conditional acoustic model trained with 6400 hours of US English to each of 9 different accented English data sets (120 to 830 hours) and to kids’ speech (80 hours). The proposed NLE method achieves 5.4% to 14.1% and 6.0% relative word error rate (WER) reductions over the one-hot label baseline on the 9 accented English sets and kids’ speech, respectively.
2 Neural Label Embedding (NLE) for Domain Adaptation
In this section, we present the NLE method for domain adaptation without using parallel data. Initially, we have a well-trained source-domain network $\mathcal{M}^S$ with parameters $\theta^S$ predicting a set of senones $\mathbb{C}$, and source-domain speech frames $\mathbf{X}^S = \{\mathbf{x}^S_1, \ldots, \mathbf{x}^S_{N_S}\}$ with senone labels $\mathbf{Y}^S = \{y^S_1, \ldots, y^S_{N_S}\}$. We distill the knowledge of this powerful source-domain model into a dictionary of l-vectors, one for each senone label (class) predicted at the output layer. Each l-vector has the same dimensionality as the number of senone classes. Before training the target-domain model $\mathcal{M}^T$ with parameters $\theta^T$, we query the dictionary with the ground-truth one-hot senone labels $\mathbf{Y}^T = \{y^T_1, \ldots, y^T_{N_T}\}$ of the target-domain speech frames $\mathbf{X}^T = \{\mathbf{x}^T_1, \ldots, \mathbf{x}^T_{N_T}\}$ to get their corresponding l-vectors. During adaptation, in place of the one-hot labels, the l-vectors are used as the soft targets to train the target-domain model. For NLE domain adaptation, the source-domain data $\mathbf{X}^S$ does not have to be parallel to the target-domain speech frames $\mathbf{X}^T$, i.e., $\mathbf{X}^S$ and $\mathbf{X}^T$ do not have to be frame-by-frame synchronized and the number of frames $N_S$ does not have to be equal to $N_T$.
The key step of the NLE method is to learn l-vectors from the source-domain model and data. As the carrier of knowledge transferred from the source domain to the target domain, the l-vector $\mathbf{e}_c$ of a senone class $c$ should be a representation of the output distributions (senone-posterior distributions) of the source-domain DNN given features aligned with senone $c$ at the input, encoding the dependency between senone $c$ and all the other senones $\mathbb{C} \setminus c$. A reasonable candidate is the centroid vector that minimizes the average distance to the output vectors generated from all the frames aligned with senone $c$. Therefore, we need to learn a dictionary of $|\mathbb{C}|$ l-vectors corresponding to the $|\mathbb{C}|$ senones in the complete set $\mathbb{C}$, with each l-vector $\mathbf{e}_c$ being $D$-dimensional, where $D = |\mathbb{C}|$. To serve as the training target of the target-domain model, the l-vector $\mathbf{e}_c$ needs to be normalized such that its elements satisfy
$$\sum_{d=1}^{D} e_{c,d} = 1, \qquad e_{c,d} > 0. \tag{1}$$
2.1 NLE Based on $L_2$ Distance Minimization (NLE-L2)
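To compute the senone-specific centroid, the most intuitive solution is to minimize the average (squared) $L_2$ distance between the centroid and all the output vectors with the same senone label, which is equivalent to calculating the arithmetic mean of the output vectors aligned with the senone. Let $\mathbf{o}^S_n$ denote the $D$-dimensional output vector of the source-domain network $\mathcal{M}^S$ given the input frame $\mathbf{x}^S_n$, where $D$ is the number of senone classes. Its element $o^S_{n,d}$ equals the posterior probability of senone $d$ given $\mathbf{x}^S_n$, i.e., $o^S_{n,d} = P(d \mid \mathbf{x}^S_n; \theta^S)$. For senone $c$, the l-vector $\mathbf{e}_c$ based on $L_2$ distance minimization is computed as
$$\mathbf{e}_c = \frac{1}{N_{S,c}} \sum_{n=1}^{N_S} \mathbf{o}^S_n \, \mathbb{1}[y^S_n = c], \tag{2}$$
where $N_{S,c}$ is the number of source-domain frames aligned with senone $c$ and $\mathbb{1}[\cdot]$ is the indicator function. The l-vectors under NLE-L2 are automatically normalized since each posterior vector in the mean computation satisfies Eq. (1).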
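The per-senone averaging described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `nle_l2`, the toy posteriors and the labels are all hypothetical, and the posteriors are assumed to be already computed by the source-domain model.

```python
import numpy as np

def nle_l2(posteriors, labels, num_senones):
    """Average the source-model posterior vectors of each senone class.

    posteriors: (N, D) array, each row a softmax output that sums to 1.
    labels:     (N,) integer senone labels aligned with the rows.
    Returns a (num_senones, D) array; rows of unseen senones stay zero.
    """
    dim = posteriors.shape[1]
    sums = np.zeros((num_senones, dim))
    np.add.at(sums, labels, posteriors)           # per-class sum of output vectors
    counts = np.bincount(labels, minlength=num_senones)
    seen = counts > 0
    sums[seen] /= counts[seen, None]              # per-class arithmetic mean
    return sums

# Toy data (hypothetical): 3 senones, 6 frames, 2 frames per senone.
posteriors = np.array([
    [0.7, 0.2, 0.1], [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1], [0.1, 0.8, 0.1],
    [0.1, 0.2, 0.7], [0.2, 0.2, 0.6],
])
labels = np.array([0, 0, 1, 1, 2, 2])
l_vectors = nle_l2(posteriors, labels, num_senones=3)
```

Because every input row is a probability vector, each averaged row is one as well, so no extra normalization step is needed in this variant.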
2.2 NLE Based on KL Distance Minimization (NLE-KL)
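KL divergence is an effective metric to measure the distance between two distributions. In the NLE framework, the l-vector $\mathbf{e}_c$ can be learned as a centroid with a minimum average KL distance to the output vectors of senone $c$. Many methods have been proposed to iteratively compute the centroid of KL distance [36, 37, 38].
In this paper, we propose a simple DNN-based solution to compute this KL-based centroid. As shown in Fig. 1, we have an initial embedding matrix $\mathbf{E}$ consisting of all the l-vectors, i.e., $\mathbf{E} = [\mathbf{e}_1, \ldots, \mathbf{e}_{|\mathbb{C}|}]$. For each source-domain sample, we look up the senone label $y^S_n$ in $\mathbf{E}$ to get its l-vector $\mathbf{e}_{y^S_n}$ and forward-propagate $\mathbf{x}^S_n$ through $\mathcal{M}^S$ to obtain the output vector $\mathbf{o}^S_n$. The KL distance between $\mathbf{o}^S_n$ and its corresponding centroid l-vector $\mathbf{e}_{y^S_n}$, taken over the $D$ senone classes, is
$$\mathrm{KL}(\mathbf{o}^S_n \,\|\, \mathbf{e}_{y^S_n}) = \sum_{d=1}^{D} o^S_{n,d} \log \frac{o^S_{n,d}}{e_{y^S_n,d}}. \tag{3}$$
We sum up all the KL distances and get the KL distance loss below
$$\mathcal{L}_{\mathrm{KL}}(\mathbf{E}) = \sum_{n=1}^{N_S} \mathrm{KL}(\mathbf{o}^S_n \,\|\, \mathbf{e}_{y^S_n}). \tag{4}$$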
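To ensure each l-vector $\mathbf{e}_c$ is normalized to satisfy Eq. (1), we perform a softmax operation over a logit vector $\mathbf{z}_c$ to obtain $\mathbf{e}_c$ below
$$e_{c,d} = \frac{\exp(z_{c,d})}{\sum_{d'=1}^{D} \exp(z_{c,d'})}. \tag{5}$$
For fast convergence, $\mathbf{z}_c$ is initialized with the arithmetic mean of the pre-softmax logit vectors of the source-domain network aligned with senone $c$. The embedding matrix $\mathbf{E}$ is trained to minimize the KL distance loss $\mathcal{L}_{\mathrm{KL}}$ of Eq. (4) by updating the logit vectors $\mathbf{z}_c$ through standard backpropagation while the parameters $\theta^S$ of $\mathcal{M}^S$ are fixed.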
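The softmax-parameterized training loop can be illustrated with plain gradient descent in NumPy. This is only a sketch under stated assumptions: it takes the divergence in the direction KL(output vector || l-vector), uses the log of the toy posteriors as a stand-in for the pre-softmax logits, and the names `learn_l_vectors_kl`, `lr` and `steps` are hypothetical rather than taken from the paper.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)     # shift by the max for numerical stability
    p = np.exp(z)
    return p / p.sum(axis=-1, keepdims=True)

def learn_l_vectors_kl(posteriors, labels, init_logits, lr=0.5, steps=500):
    """Minimize sum_n KL(o_n || softmax(z_{y_n})) over the class logits z."""
    z = init_logits.astype(float).copy()
    counts = np.bincount(labels, minlength=z.shape[0])[:, None]
    target_sum = np.zeros_like(z)
    np.add.at(target_sum, labels, posteriors)  # sum of output vectors per class
    for _ in range(steps):
        e = softmax(z)
        # Gradient of the summed KL w.r.t. z_c is N_c * e_c - sum of o_n in class c.
        grad = counts * e - target_sum
        z -= lr * grad / np.maximum(counts, 1)  # step on the per-class average gradient
    return softmax(z)

# Toy data (hypothetical): 3 senones, 2 frames each.
posteriors = np.array([
    [0.7, 0.2, 0.1], [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1], [0.1, 0.8, 0.1],
    [0.1, 0.2, 0.7], [0.2, 0.2, 0.6],
])
labels = np.array([0, 0, 1, 1, 2, 2])
# Stand-in for "mean of pre-softmax logits": per-class mean of log-posteriors.
log_post = np.log(posteriors)
init_logits = np.stack([log_post[labels == c].mean(axis=0) for c in range(3)])
l_vectors = learn_l_vectors_kl(posteriors, labels, init_logits)
```

The softmax keeps every learned row on the probability simplex throughout training. For the direction assumed here, the loss has a closed-form minimizer (the per-class mean of the posteriors), which is useful as a sanity check on the loop; a different direction of the divergence, or training stopped before convergence, would give different vectors.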
2.3 NLE Based on Symmetric KL Distance Minimization (NLE-SKL)
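One shortcoming of KL distance is that it is asymmetric: for an output vector $\mathbf{o}$ and an l-vector $\mathbf{e}$, the minimization of $\mathrm{KL}(\mathbf{o} \,\|\, \mathbf{e})$ does not guarantee that $\mathrm{KL}(\mathbf{e} \,\|\, \mathbf{o})$ is also minimized. SKL compensates for this by adding the two KL terms together and is thus a more robust distance metric for clustering. Therefore, for each senone, we learn a centroid l-vector with a minimum average SKL distance to the output vectors of $\mathcal{M}^S$ aligned with that senone by following the same DNN-based method as in Section 2.2, except that the KL distance loss is replaced with an SKL one.
The SKL distance between an l-vector $\mathbf{e}_{y^S_n}$ and an output vector $\mathbf{o}^S_n$ is defined as
$$\mathrm{SKL}(\mathbf{o}^S_n, \mathbf{e}_{y^S_n}) = \mathrm{KL}(\mathbf{o}^S_n \,\|\, \mathbf{e}_{y^S_n}) + \mathrm{KL}(\mathbf{e}_{y^S_n} \,\|\, \mathbf{o}^S_n) = \sum_{d=1}^{D} \left( o^S_{n,d} - e_{y^S_n,d} \right) \log \frac{o^S_{n,d}}{e_{y^S_n,d}}, \tag{6}$$
and the SKL distance loss is computed by summing up all pairs of SKL distances between output vectors and their centroids as follows
$$\mathcal{L}_{\mathrm{SKL}}(\mathbf{E}) = \sum_{n=1}^{N_S} \mathrm{SKL}(\mathbf{o}^S_n, \mathbf{e}_{y^S_n}). \tag{7}$$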
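The asymmetry of KL and the symmetry of SKL are easy to check numerically. The sketch below uses two hypothetical probability vectors; the helper names `kl` and `skl` are illustrative and assume strictly positive entries so that the logarithms are defined.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for strictly positive probability vectors."""
    return float(np.sum(p * np.log(p / q)))

def skl(p, q):
    """Symmetric KL, written as the single sum of (p - q) * log(p / q)."""
    return float(np.sum((p - q) * np.log(p / q)))

o = np.array([0.7, 0.2, 0.1])   # hypothetical source-model output vector
e = np.array([0.5, 0.3, 0.2])   # hypothetical l-vector

forward = kl(o, e)              # KL(o || e)
backward = kl(e, o)             # KL(e || o), differs from the forward value
symmetric = skl(o, e)           # equals forward + backward
```

Swapping the arguments of `skl` leaves the value unchanged, whereas `kl` gives two different numbers, which is the property the SKL-based centroid relies on.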
2.4 Train Target-Domain Model with NLE
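As the condensed knowledge distilled from a large amount of source-domain data, the l-vectors serve as the soft targets for training the target-domain model $\mathcal{M}^T$.
As shown in Fig. 2, we look up the target-domain label $y^T_n$ in the optimized label embedding matrix $\mathbf{E}$ for its l-vector $\mathbf{e}_{y^T_n}$ and forward-propagate the target-domain frame $\mathbf{x}^T_n$ through $\mathcal{M}^T$ to get the output vector $\mathbf{o}^T_n$. We construct a cross-entropy loss using l-vectors as the soft targets below
$$\mathcal{L}_{\mathrm{CE}}(\theta^T) = -\sum_{n=1}^{N_T} \sum_{d=1}^{D} e_{y^T_n,d} \log P(d \mid \mathbf{x}^T_n; \theta^T), \tag{8}$$
where $P(d \mid \mathbf{x}^T_n; \theta^T)$ is the posterior of senone $d$ given $\mathbf{x}^T_n$, $N_T$ is the number of target-domain frames and $D$ is the number of senone classes. We train $\mathcal{M}^T$ to minimize $\mathcal{L}_{\mathrm{CE}}$ by updating only its parameters $\theta^T$. The optimized $\mathcal{M}^T$ with $\theta^T$ is used for decoding.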
Compared with the traditional one-hot training targets that convey only class identities, the soft l-vectors transfer additional quantized knowledge that encodes the probabilistic relationships among different senone classes. Benefiting from this, the NLE-adapted acoustic model is expected to achieve higher ASR performance on target-domain test data than a model trained with one-hot labels. The steps of NLE for domain adaptation are summarized in Algorithm 1.
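The lookup-then-cross-entropy step can be sketched as follows. This is a minimal NumPy illustration with hypothetical names and toy values; it assumes the target-domain model's log-posteriors are already available and only shows how the soft targets replace the one-hot labels.

```python
import numpy as np

def soft_target_cross_entropy(log_posteriors, labels, l_vectors):
    """Cross-entropy between looked-up l-vectors and model log-posteriors.

    log_posteriors: (N, D) log-softmax outputs of the target-domain model.
    labels:         (N,) ground-truth senone labels as integer indices.
    l_vectors:      (C, D) label embedding matrix learned on source data.
    """
    targets = l_vectors[labels]               # dictionary lookup, one row per frame
    return float(-(targets * log_posteriors).sum())

# Toy example (hypothetical): 2 target-domain frames, 3 senones.
log_probs = np.log(np.array([[0.6, 0.3, 0.1],
                             [0.2, 0.5, 0.3]]))
labels = np.array([0, 1])
l_vectors = np.array([[0.80, 0.15, 0.05],
                      [0.10, 0.80, 0.10],
                      [0.05, 0.15, 0.80]])

loss = soft_target_cross_entropy(log_probs, labels, l_vectors)
# With an identity embedding matrix the loss falls back to one-hot cross-entropy.
one_hot_loss = soft_target_cross_entropy(log_probs, labels, np.eye(3))
```

The only per-frame cost on the label side is an array index, which is why no source-domain forward pass is needed during adaptation.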
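Table 1: Durations of the adaptation and test data for the 9 accented English sets (A1 to A9) and kids’ speech. Adaptation durations are in hours.
Task  | A1  | A2  | A3  | A4  | A5  | A6  | A7  | A8  | A9  | Kids
Adapt | 160 | 140 | 190 | 120 | 150 | 830 | 250 | 330 | 150 | 80
Test  | 11  | 8   | 11  | 7   | 11  | 11  | 11  | 11  | 13  | 3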
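Table 2: WERs (%) of the unadapted source-domain model, one-hot re-training, and the three NLE methods on the 9 accented English sets (A1 to A9) and kids’ speech.
Adapt. Method | A1    | A2    | A3    | A4    | A5    | A6    | A7    | A8    | A9   | Kids
Unadapted     | 26.98 | 15.27 | 25.23 | 34.62 | 22.48 | 14.65 | 13.19 | 14.91 | 9.80 | 27.83
One-Hot       | 20.37 | 14.46 | 20.14 | 19.99 | 15.06 | 12.50 | 11.73 | 13.90 | 9.71 | 26.99
NLE-L2        | 18.39 | 12.91 | 18.54 | 18.39 | 14.14 | 12.04 | 10.54 | 12.55 | 9.48 | 25.93
NLE-KL        | 18.30 | 12.86 | 18.74 | 18.42 | 14.25 | 11.90 | 10.39 | 12.52 | 9.43 | 25.83
NLE-SKL       | 17.97 | 12.42 | 17.82 | 17.94 | 13.81 | 11.56 | 10.15 | 12.21 | 9.19 | 25.36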
3 Experiments
We perform two domain adaptation tasks where parallel source- and target-domain data is not accessible through data simulation: 1) adapt a US English acoustic model to accented English from 9 areas of the world; 2) adapt the same acoustic model to kids’ speech. In both tasks, the source-domain training data is 6400 hours of multi-conditional Microsoft US English production data, including Cortana, Xbox and Conversation data. The data is collected mostly from adults from all over the US. It is a mixture of close-talk and far-field utterances from a variety of devices.
For the first task, the adaptation data consists of 9 different types of accented English, A1 to A9, in which A1, A2, A3 and A8 are from Europe, A4, A5 and A6 are from Asia, A7 is from Oceania, and A9 is from North America. A7 to A9 are native accents because they are from countries where most people use English as their first language. In contrast, A1 to A6 are non-native accents. Each English accent forms a specific target domain. For the second task, the adaptation data is 80 hours of US English speech collected from kids. The durations of the different adaptation and test data are listed in Table 1. The training and adaptation data is transcribed. All data is anonymized with personally identifiable information removed.
3.1 Baseline System
We train a source-domain bi-directional long short-term memory (BLSTM)-hidden Markov model acoustic model [39, 40, 41] with 6400 hours of training data. This teacher model has 6 hidden layers with 600 units in each layer. 80-dimensional log Mel filterbank features are extracted from the training, adaptation and test data. The output layer has 9404 units representing 9404 senone labels. The BLSTM is trained to minimize the frame-level cross-entropy criterion. There is no frame stacking or skipping. A 5-gram LM with around 148M n-grams is used for decoding. Table 2 (Row 1) shows the WERs of the multi-conditional BLSTM on different accents. This well-trained source-domain model is used as the initialization for all the subsequent re-training and adaptation experiments.
For accent adaptation, we train an accent-dependent BLSTM for each accented English using one-hot labels with cross-entropy loss. Each accent-dependent model is trained with the speech of only one accent. As shown in Table 2, the one-hot re-training achieves 9.71% to 20.37% WERs on different accents. For kids adaptation, we train a kids-dependent BLSTM using kids’ speech with one-hot labels. In Table 2, we see that one-hot re-training achieves 26.99% WER on the kids test data. We use these results as the baseline.
Note that, in this work, we do not compare NLE with KLD adaptation [4], since the effectiveness of KLD regularization diminishes as the amount of adaptation data increases and it is normally used when the adaptation data is very small (10 min or less).
3.2 NLE for Accent Adaptation
It is hard to simulate parallel accented speech from US English. We adapt the 6400-hour BLSTM acoustic model to 9 different English accents using NLE. We learn 9404-dimensional l-vectors using NLE-L2, NLE-KL, and NLE-SKL as described in Sections 2.1, 2.2 and 2.3 with the source-domain data and acoustic model. These l-vectors are used as the soft targets to train the accent-dependent models with cross-entropy loss as in Section 2.4.
As shown in Table 2, NLE-L2, NLE-KL, and NLE-SKL achieve 9.48% to 18.54%, 9.43% to 18.74%, and 9.19% to 17.97% WERs, respectively, on different accents. NLE-SKL performs the best among the three NLE adaptation methods, with 11.8%, 14.1%, 11.5%, 10.3%, 8.3%, 7.5%, 13.5%, 12.2%, and 5.4% relative WER reductions over the one-hot label baseline on A1 to A9, respectively. NLE-SKL consistently outperforms NLE-L2 and NLE-KL on all the accents, with up to 4.0% and 4.9% relative WER reductions over NLE-L2 and NLE-KL, respectively. The relative reductions for native and non-native accents are similar except for A9. NLE-KL performs slightly better than NLE-L2 on 6 out of 9 accents, but slightly worse on the other 3. All three NLE methods achieve much smaller relative WER reductions on A9 (at most 5.4%) than on the other accents (about 10%). This is reasonable because North American English is much more similar to the source-domain US English than the other accents are, so the source-domain model changes little when adapted to this target-domain accent.
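The relative WER reductions quoted above follow directly from the Table 2 rows. The short script below recomputes them from the one-hot and NLE-SKL WERs; the helper name `relative_reduction` is illustrative.

```python
# WERs (%) copied from Table 2.
one_hot = {"A1": 20.37, "A2": 14.46, "A3": 20.14, "A4": 19.99, "A5": 15.06,
           "A6": 12.50, "A7": 11.73, "A8": 13.90, "A9": 9.71, "Kids": 26.99}
nle_skl = {"A1": 17.97, "A2": 12.42, "A3": 17.82, "A4": 17.94, "A5": 13.81,
           "A6": 11.56, "A7": 10.15, "A8": 12.21, "A9": 9.19, "Kids": 25.36}

def relative_reduction(baseline, system):
    """Relative WER reduction of `system` over `baseline`, in percent."""
    return 100.0 * (baseline - system) / baseline

rel_skl = {task: round(relative_reduction(one_hot[task], nle_skl[task]), 1)
           for task in one_hot}
for task, value in rel_skl.items():
    print(f"{task}: {value}%")
```

For example, on A1 the reduction is (20.37 - 17.97) / 20.37, which rounds to 11.8%.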
3.3 NLE for Kids Adaptation
Parallel kids’ speech cannot be obtained through data simulation either. We adapt the 6400-hour BLSTM acoustic model to the collected real kids’ speech using NLE. We use the same l-vectors learned in Section 3.2 as the soft targets to train the kids-dependent BLSTM acoustic model by minimizing the cross-entropy loss. As shown in Table 2, NLE-L2, NLE-KL, and NLE-SKL achieve 25.93%, 25.83%, and 25.36% WERs on the kids’ test set, respectively. NLE-SKL outperforms the other two NLE methods with a 6.0% relative WER reduction over the one-hot baseline. We find that NLE is more effective for accent adaptation than for kids adaptation. One possible reason is that a portion of the kids are teenagers whose speech is very similar to that of the adults in the 6400-hour source-domain data. Note that all the kids’ speech is collected in the US and no accent adaptation is involved.
4 Conclusion
We propose a novel neural label embedding method for domain adaptation. Each senone label is represented by an l-vector that minimizes the average $L_2$, KL or SKL distance to all the source-domain output vectors aligned with the same senone. The l-vectors are learned through a simple average or a proposed DNN-based method. During adaptation, l-vectors serve as the soft targets to train the target-domain model. Without the parallel data constraint of T/S learning, NLE is especially suited for the situation where paired target-domain data samples cannot be simulated from the source-domain ones. Given parallel data, NLE has significantly lower computational cost than T/S learning during adaptation since it replaces the DNN forward-propagation with a fast dictionary lookup.
We adapt a multi-conditional BLSTM acoustic model trained with 6400 hours of US English to 9 different accented English sets and kids’ speech. NLE achieves 5.4% to 14.1% and 6.0% relative WER reductions over the one-hot label baseline, respectively. NLE-SKL consistently outperforms NLE-L2 and NLE-KL on all adaptation tasks, by up to 4.0% and 4.9% relative, respectively. As a simple arithmetic mean, NLE-L2 performs similarly to NLE-KL with dramatically reduced computational cost for l-vector learning.
References
[1] G. Hinton, L. Deng, D. Yu, et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.
[2] T. Sainath, B. Kingsbury, B. Ramabhadran, et al., “Making deep belief networks effective for large vocabulary continuous speech recognition,” in Proc. ASRU, 2011, pp. 30–35.
[3] L. Deng, J. Li, J. Huang, et al., “Recent advances in deep learning for speech research at Microsoft,” in Proc. ICASSP, 2013.
[4] D. Yu, K. Yao, H. Su, et al., “KL-divergence regularized deep neural network adaptation for improved large vocabulary speech recognition,” in Proc. ICASSP, May 2013.
[5] Z. Huang, S. Siniscalchi, I. Chen, et al., “Maximum a posteriori adaptation of network parameters in deep models,” in Proc. Interspeech, 2015.
[6] H. Liao, “Speaker adaptation of context dependent deep neural networks,” in Proc. ICASSP, May 2013.
[7] R. Gemello, F. Mana, S. Scanzio, et al., “Linear hidden transformations for adaptation of hybrid ANN/HMM models,” Speech Communication, vol. 49, no. 10, pp. 827–835, 2007.
[8] F. Seide, G. Li, X. Chen, and D. Yu, “Feature engineering in context-dependent deep neural networks for conversational speech transcription,” in Proc. ASRU, Dec 2011, pp. 24–29.
[9] J. Xue, J. Li, and Y. Gong, “Restructuring of deep neural network acoustic models with singular value decomposition,” in Proc. Interspeech, 2013.
[10] J. Xue, J. Li, D. Yu, et al., “Singular value decomposition based low-footprint speaker adaptation and personalization for deep neural network,” in Proc. ICASSP, May 2014.
[11] G. Saon, H. Soltau, et al., “Speaker adaptation of neural network acoustic models using i-vectors,” in Proc. ASRU, 2013.
[12] O. Abdel-Hamid and H. Jiang, “Fast speaker adaptation of hybrid NN/HMM model for speech recognition based on discriminative learning of speaker code,” in Proc. ICASSP, May 2013.
[13] S. Xue, O. Abdel-Hamid, H. Jiang, et al., “Fast adaptation of deep neural network based on discriminant codes for speech recognition,” TASLP, vol. 22, no. 12, Dec 2014.
[14] F. Weninger, J. Andrés-Ferrer, X. Li, et al., “Listen, attend, spell and adapt: Speaker adapted sequence-to-sequence ASR,” in Proc. Interspeech, 2019.
[15] Z. Meng, Y. Gaur, J. Li, et al., “Speaker adaptation for attention-based end-to-end speech recognition,” in Proc. Interspeech, 2019.
[16] J. K. Chorowski, D. Bahdanau, D. Serdyuk, et al., “Attention-based models for speech recognition,” in Proc. NIPS, 2015, pp. 577–585.
[17] Z. Meng, Y. Gaur, J. Li, and Y. Gong, “Character-aware attention-based end-to-end speech recognition,” in Proc. ASRU. IEEE, 2019.
[18] J. Li, R. Zhao, J.-T. Huang, et al., “Learning small-size DNN with output-distribution-based criteria,” in Proc. Interspeech, 2014, pp. 1910–1914.
[19] J. Li, M. L. Seltzer, X. Wang, et al., “Large-scale domain adaptation via teacher-student learning,” in Proc. Interspeech, 2017.
[20] Z. Meng, J. Li, Y. Zhao, et al., “Conditional teacher-student learning,” in Proc. ICASSP, 2019.
[21] Z. Meng, J. Li, Y. Gaur, et al., “Domain adaptation via teacher-student learning for end-to-end speech recognition,” in Proc. ASRU. IEEE, 2019.
[22] I. Goodfellow, J. Pouget-Abadie, et al., “Generative adversarial nets,” in Proc. NIPS, 2014, pp. 2672–2680.
[23] Y. Ganin and V. Lempitsky, “Unsupervised domain adaptation by backpropagation,” in Proc. ICML, Lille, France, 2015, vol. 37, pp. 1180–1189, PMLR.
[24] Y. Shinohara, “Adversarial multi-task learning of deep neural networks for robust speech recognition,” in Proc. Interspeech, 2016, pp. 2369–2372.
[25] Z. Meng, J. Li, Z. Chen, et al., “Speaker-invariant training via adversarial learning,” in Proc. ICASSP, 2018.
[26] Z. Meng, J. Li, Y. Gong, et al., “Adversarial teacher-student learning for unsupervised domain adaptation,” in Proc. ICASSP. IEEE, 2018, pp. 5949–5953.
[27] Z. Meng, J. Li, and Y. Gong, “Adversarial speaker adaptation,” in Proc. ICASSP, 2019.
[28] S. Pascual, A. Bonafonte, et al., “SEGAN: Speech enhancement generative adversarial network,” in Proc. Interspeech, 2017.
[29] Z. Meng, J. Li, and Y. Gong, “Cycle-consistent speech enhancement,” in Proc. Interspeech, 2018.
[30] Z. Meng, J. Li, and Y. Gong, “Adversarial feature-mapping for speech enhancement,” in Proc. Interspeech, 2018.
[31] Q. Wang, W. Rao, S. Sun, et al., “Unsupervised domain adaptation via domain adversarial training for speaker recognition,” in Proc. ICASSP, 2018.
[32] Z. Meng, Y. Zhao, J. Li, and Y. Gong, “Adversarial speaker verification,” in Proc. ICASSP, 2019.
[33] S. Sun, B. Zhang, L. Xie, et al., “An unsupervised deep domain adaptation approach for robust speech recognition,” Neurocomputing, vol. 257, pp. 79–87, 2017.
[34] Z. Meng, Z. Chen, V. Mazalov, J. Li, and Y. Gong, “Unsupervised adaptation with domain separation networks for robust speech recognition,” in Proc. ASRU, 2017.
[35] R. Gray, “Vector quantization,” IEEE ASSP Magazine, vol. 1, no. 2, pp. 4–29, 1984.
[36] K. Chaudhuri and A. McGregor, “Finding metric structure in information theoretic clustering,” in Proc. COLT. Citeseer, 2008, vol. 8, p. 10.
[37] R. Veldhuis, “The centroid of the symmetrical Kullback-Leibler distance,” IEEE Signal Processing Letters, vol. 9, 2002.
[38] M. Das Gupta, S. Srinivasa, M. Antony, et al., “KL divergence based agglomerative clustering for automated vitiligo grading,” in Proc. CVPR, 2015, pp. 2700–2709.
[39] H. Sak, A. Senior, and F. Beaufays, “Long short-term memory recurrent neural network architectures for large scale acoustic modeling,” in Proc. Interspeech, 2014.
[40] H. Erdogan, T. Hayashi, J. R. Hershey, et al., “Multi-channel speech recognition: LSTMs all the way through,” in CHiME-4 workshop, 2016, pp. 1–4.
[41] Z. Meng, S. Watanabe, J. R. Hershey, et al., “Deep long short-term memory adaptive beamforming networks for multichannel robust speech recognition,” in Proc. ICASSP, 2017, pp. 271–275.