Teacher-student (T/S) learning has been widely applied to a variety of deep learning tasks in speech, language and image processing, including model compression [1, 2], domain adaptation [3, 4, 5], small-footprint neural machine translation (NMT) [6], low-resource NMT [7], far-field automatic speech recognition (ASR) [8, 9], low-resource language ASR [10], and neural network pre-training [11]. T/S learning falls into the category of transfer learning, in which the network of interest, as a student, is trained by mimicking the behavior of a well-trained network, as a teacher, in the presence of the same or parallel training samples. Formally, T/S learning works by minimizing the Kullback-Leibler (KL) divergence between the output distributions of the student and teacher models, rather than training on the hard labels derived from the transcriptions.
Compared to the conventional one-hot hard label as the training target, the transfer of soft posteriors preserves the probabilistic relationships among different classes encoded at the output of the teacher model. Because soft labels provide more information than hard labels for model training, T/S learning results in better performance, as reported in [1, 2, 8]. The largest benefit of using pure soft labels is learning without any hard labels, which enables the use of much larger amounts of unlabeled data to improve the student model [1, 8].
One shortcoming of T/S learning is that the teacher model, which is not always perfect, sporadically makes incorrect predictions that mislead the student model towards suboptimal performance. In such cases, it may be beneficial to utilize the hard labels of the training data to alleviate this effect. Hinton et al. [2] later proposed an interpolated T/S learning called knowledge distillation, in which a weighted sum of the soft posteriors and the one-hot hard label is used to train the student model. One issue is that the simple linear combination with one-hot vectors destroys the relationships among different classes embedded naturally in the soft posteriors produced by the teacher model. Moreover, proper setting of the interpolation weight to a fixed value is known to be critical, and it varies with the adaptation scenario and the qualities of the teacher and the ground truth labels.
In this paper, we propose a conditional T/S learning scheme, in which the student model becomes smart enough to criticize the knowledge imparted by the teacher model and thereby make better use of the teacher and the ground truth. At the initial stage, when the student model is weak, it blindly follows whatever knowledge the teacher model infuses and uses the soft posteriors as the sole training targets. As the student model grows stronger, it begins to selectively choose its learning source, either the teacher model or the ground truth labels, conditioned on whether the teacher's prediction coincides with the ground truth. That is, the student learns exclusively from the teacher when the teacher makes a correct prediction on a training sample, and from the ground truth otherwise. With conditional T/S learning, the student makes good use of the rich and correct knowledge encompassed by the teacher while avoiding the inaccurate knowledge the teacher generates. Another advantage of conditional T/S learning over conventional T/S learning is that it forgoes tuning the interpolation weight between the two knowledge sources.
We applied the proposed approach to two tasks: domain adaptation and speaker adaptation. In domain adaptation, the student model is trained using noise-corrupted data in the target domain as the input and the soft targets obtained from the teacher posteriors computed on the corresponding clean data. We demonstrate the effectiveness of the proposed approach on the CHiME-3 dataset. In speaker adaptation, it can be shown that interpolated T/S learning is equivalent to KL divergence (KLD) adaptation [25], where the speaker-independent model acts as the teacher and the speaker-dependent model acts as the student. We apply conditional T/S learning to further boost the performance of KLD adaptation, and demonstrate improvements over KLD adaptation for both supervised and unsupervised adaptation on the Microsoft Windows Phone short message dictation task.
2 Teacher-Student Learning
In T/S learning, a well-trained teacher network takes in a sequence of training samples X^T = {x_1^T, …, x_N^T} and predicts a sequence of class labels. Here, each class is represented by an integer c ∈ {1, …, C}, where C is the total number of classes in the classification task. The goal is to learn a student network that can accurately predict the class label for each of its input samples X^S = {x_1^S, …, x_N^S} by using the knowledge transferred from the teacher network. To ensure effective knowledge transfer, the input sample sequences X^T and X^S need to be parallel to each other, i.e., each pair of training samples x_n^T and x_n^S share the same ground truth class label y_n.
2.1 T/S Learning with Soft Labels
T/S learning minimizes the Kullback-Leibler (KL) divergence between the output distributions of the teacher network and the student network, given that the parallel data X^T and X^S are at the inputs of the two networks. The KL divergence between the teacher and student output distributions is formulated as:

KL(p^T ‖ p^S) = Σ_{n=1}^{N} Σ_{c=1}^{C} p^T(c | x_n^T; θ^T) log [ p^T(c | x_n^T; θ^T) / p^S(c | x_n^S; θ^S) ],   (1)

where n is the sample index, θ^T and θ^S are the parameters of the teacher and student networks, respectively, and p^T(c | x_n^T; θ^T) and p^S(c | x_n^S; θ^S) are the posteriors of class c predicted by the teacher and student networks given the input samples x_n^T and x_n^S, respectively. To learn a student network that approximates the given teacher network, we minimize the KL divergence with respect to only the parameters θ^S of the student network while keeping the parameters θ^T of the teacher network fixed, which is equivalent to minimizing the loss function below:¹

L_TS(θ^S) = − Σ_{n=1}^{N} Σ_{c=1}^{C} p^T(c | x_n^T; θ^T) log p^S(c | x_n^S; θ^S).   (2)

¹In some cases, the senone posteriors generated by the teacher network are flattened by a temperature T before serving as the soft labels [2]. But in the speech area, T is normally fixed at 1 [9, 13, 14]. We obtain the best performance with T = 1, and the same conclusion is also reported in [15, 6].
2.2 T/S Learning with Interpolated Labels
However, in T/S learning, the knowledge from the teacher is inaccurate whenever the teacher's classification decision is incorrect. To deal with this, Hinton et al. [2] suggested an interpolated T/S method that uses a weighted sum of the soft posteriors and the one-hot hard label to train the student model. Assuming that the sequence of one-hot ground truth class labels that both X^T and X^S are aligned with is Y = {y_1, …, y_N}, interpolated T/S learning aims at minimizing the loss function below:

L_IT/S(θ^S) = − Σ_{n=1}^{N} Σ_{c=1}^{C} [ λ p^T(c | x_n^T; θ^T) + (1 − λ) 𝟙[c = y_n] ] log p^S(c | x_n^S; θ^S),   (3)

where λ ∈ [0, 1] is the weight for the soft class posteriors and 𝟙[·] is the indicator function, which equals 1 if the condition in the square brackets is satisfied and 0 otherwise. Note that interpolated T/S learning becomes soft T/S when λ = 1 and becomes standard cross-entropy training with hard labels when λ = 0. Although interpolated T/S compensates for the imperfections in knowledge transfer, the linear combination of soft and hard labels destroys the correct relationships among different classes embedded naturally in the soft class posteriors and deviates the student model parameters from the optimal direction. Moreover, the search for the best student model is subject to the heuristic tuning of λ between 0 and 1.
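The interpolated objective for one sample can be sketched as follows (an illustrative Python sketch under our own naming; `teacher_post` and `student_post` are assumed to be valid probability vectors):

```python
import math

def interpolated_ts_loss(teacher_post, hard_label, student_post, lam):
    """Interpolated T/S loss for one sample: the training target is a weighted
    sum of the teacher's class posteriors and a one-hot hard label.
    lam = 1 recovers soft T/S; lam = 0 recovers standard cross-entropy."""
    target = [lam * pt + (1.0 - lam) * (1.0 if c == hard_label else 0.0)
              for c, pt in enumerate(teacher_post)]
    return -sum(t * math.log(ps) for t, ps in zip(target, student_post))
```

Note how any intermediate lam mixes a one-hot spike into the soft target, which is exactly the distortion of inter-class relationships discussed above.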
3 Conditional Teacher-Student Learning
Instead of blindly combining the soft and hard labels, the student network needs to be critical about the knowledge infused by the teacher network, i.e., to judge whether the class posteriors are accurate before learning from them. One natural judgment is that the teacher's knowledge is deemed accurate when the teacher correctly predicts the ground truth given the input samples, and deemed inaccurate otherwise. Therefore, the training target for the student model should be conditioned on the correctness of the teacher's prediction: the student network exclusively uses the soft posteriors from the teacher as the training target when the teacher is correct, and uses the hard label instead when the teacher is wrong, as shown in Fig. 1.
In other words, assuming D = {d_1, …, d_N} to be the sequence of conditional class label vectors used as the target to train the student network, the c-th element of d_n becomes

d_n(c) = p^T(c | x_n^T; θ^T)   if argmax_{c′} p^T(c′ | x_n^T; θ^T) = y_n,
d_n(c) = 𝟙[c = y_n]           otherwise,   (4)

under conditional T/S learning. That is to say, the conditional class label d_n is a soft vector of class posteriors if the teacher is correct and a hard one-hot vector if the teacher is wrong. The loss function to be minimized is formulated as the cross-entropy between the conditional class labels and the class posteriors generated by the student network:

L_CT/S(θ^S) = − Σ_{n=1}^{N} Σ_{c=1}^{C} d_n(c) log p^S(c | x_n^S; θ^S).   (5)
The student network parameters are optimized through standard back-propagation with stochastic gradient descent. With conditional T/S learning, the student learns only from the selected accurate knowledge generated by the teacher while simultaneously taking advantage of the well-preserved probabilistic relationships among different classes, and is thus expected to achieve improved performance on classification tasks.
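The conditional target selection and the resulting loss for one sample can be sketched as below. This is a minimal sketch with hypothetical names; `teacher_post` and `student_post` stand in for the per-sample posterior vectors produced by the two networks.

```python
import math

def conditional_target(teacher_post, hard_label):
    """Conditional T/S target for one sample: the teacher's posteriors when
    its argmax prediction matches the ground truth label, and the one-hot
    hard label otherwise."""
    teacher_pred = max(range(len(teacher_post)), key=lambda c: teacher_post[c])
    if teacher_pred == hard_label:
        return list(teacher_post)
    return [1.0 if c == hard_label else 0.0 for c in range(len(teacher_post))]

def conditional_ts_loss(teacher_post, hard_label, student_post):
    """Cross-entropy between the conditional target and the student posteriors."""
    d = conditional_target(teacher_post, hard_label)
    return -sum(dc * math.log(ps) for dc, ps in zip(d, student_post))
```

Unlike the interpolated variant, each target here is either fully soft or fully hard, so the inter-class relationships in a correct teacher prediction are never diluted, and no interpolation weight has to be tuned.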
4 Conditional T/S Learning for Acoustic Model Adaptation
With the advent of deep acoustic models, the performance of ASR has been greatly improved [16, 17, 18]. A deep acoustic model takes speech frames as the input and predicts the corresponding senone posteriors at the output layer. To achieve robust ASR over different domains and speakers, we apply conditional T/S learning to the domain and speaker adaptation of deep acoustic models. In these tasks, both the teacher and student networks are deep acoustic models, X^T and X^S are sequences of input speech frames, and each class c denotes one senone in the set of all possible senones predicted by the teacher and student acoustic models.
4.1 Conditional T/S Learning for Domain Adaptation
ASR suffers from performance degradation when a well-trained acoustic model is applied in a new domain [19]. T/S learning [3, 8, 9] and adversarial learning [20, 21, 22, 23, 24] are two effective approaches that can suppress this domain mismatch by adapting a source-domain acoustic model to target-domain speech. T/S learning is better suited to the situation where unlabeled parallel data is available for adaptation,² in which a sequence of source-domain speech features X^T is fed into the source-domain teacher model while a parallel sequence of target-domain features X^S is fed into the target-domain student model, and the student model parameters are optimized by minimizing the T/S loss in Eq. (2).

²The parallel data can be either recorded or simulated.
To further improve T/S learning, we introduce conditional T/S learning, which additionally exploits the ground truth hard labels of the adaptation data, and propose the following steps for domain adaptation.
Use a well-trained source-domain acoustic model as the teacher network and initialize the student network with the parameters of the teacher.
Use paralleled source- and target-domain adaptation data as X^T and X^S, respectively, where all pairs of x_n^T and x_n^S are frame-by-frame synchronized. Train the student network by minimizing the conditional T/S loss, selecting each frame's target from the teacher's senone posteriors or the one-hot hard label according to the correctness of the teacher's prediction.
Use the optimized student network as the adapted acoustic model for decoding test utterances in the target domain.
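The per-frame target selection in the steps above can be sketched as follows. This is an illustrative Python sketch: `teacher_posts` and `alignment` are hypothetical inputs standing in for the teacher's senone posteriors on the clean frames and the forced-alignment senone labels, and a real posterior vector would span thousands of senones rather than two.

```python
def build_conditional_targets(teacher_posts, alignment):
    """For each frame of an utterance, pick the teacher's senone posteriors
    when its argmax agrees with the forced-alignment senone label, and a
    one-hot hard label otherwise."""
    targets = []
    for post, label in zip(teacher_posts, alignment):
        pred = max(range(len(post)), key=lambda c: post[c])
        if pred == label:
            targets.append(list(post))  # teacher correct: keep soft label
        else:
            targets.append([1.0 if c == label else 0.0
                            for c in range(len(post))])  # back off to hard label
    return targets
```

The student (fed the parallel noisy frames) is then trained with cross-entropy against these per-frame targets.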
4.2 Conditional T/S Learning for Speaker Adaptation
Speaker adaptation aims at learning a set of speaker-dependent (SD) acoustic models by adapting a speaker-independent (SI) acoustic model to the speech of target speakers. Different from domain adaptation, speaker adaptation has access to only a limited amount of adaptation data from each target speaker and has no access to the source-domain data.
Popular approaches to speaker adaptation include singular value decomposition-based [30, 31], subspace-based [32, 33] and adversarial learning-based [34, 35] methods, among others. Among these, KL divergence (KLD) regularization [25] is one of the most popular methods for preventing the adapted model from overfitting to the limited speaker data. The regularization is realized by augmenting the training criterion with the KLD between the output distributions of the SD and SI models.
The KLD adaptation is in fact a special case of interpolated T/S learning [2], in which the SI model acts as the teacher, the SD model acts as the student, and both take the adaptation data as input. The teacher network acts more like a regularizer that constrains the student network from straying too far away from the teacher. However, the linear combination of soft posteriors and hard labels does not make full use of the two knowledge sources, and the best regularization weight is subject to heuristic tuning. We therefore apply conditional T/S learning to further improve the KLD adaptation: when the SI model makes the right prediction, the SD model learns exclusively from the SI model; when the SI model is wrong, the adaptation target backs off to the hard label.
Note that since the SD model grows from the SI model, the adaptation can be interpreted as a self-taught learning process. In the step of learning from the SI model, the SD model essentially reviews what it already knows, which may seem uninformative. However, if we remove this step, i.e., adapt the SD model only on the samples where the SI model makes a mistake, the performance degrades. This is because using only part of the training set leads to catastrophic forgetting: it skews the estimated senone distributions for the target speaker towards the samples the teacher model gets wrong, and there is no guarantee that the student model still works well on the samples the teacher model handles correctly.
The conditional T/S learning for speaker adaptation consists of the following steps.
Use a well-trained SI acoustic model as the teacher network and initialize the student network with the parameters of the teacher.
Use the adaptation data from a target speaker as both X^T and X^S, and train the student network by minimizing the conditional T/S loss.
Use the optimized student network as the SD acoustic model for this target speaker.
For unsupervised speaker adaptation, we use the SI model to generate the hard labels that judge the SI model itself. Since the recognition hypotheses are generated through the cooperation of the SI acoustic model and the language model, the derived hard labels are expected to be more accurate than the frame-level senone classification decisions generated by the SI model alone.
5 Experiments
5.1 Domain Adaptation
As a major category of domain adaptation, we first verify conditional T/S learning with environment adaptation experiments. Specifically, we adapt a well-trained clean acoustic model to the noisy training data of CHiME-3 [36] using different methods. The CHiME-3 dataset incorporates Wall Street Journal (WSJ) corpus sentences spoken in challenging noisy environments, recorded using a 6-channel tablet. The real far-field noisy speech from the 5th microphone channel of the CHiME-3 development data set is used for testing. A standard WSJ 5K-word 3-gram language model (LM) is used for decoding.
The features are fed as the input to the LSTM acoustic model after global mean and variance normalization. The LSTM has 4 hidden layers with 1024 hidden units in each layer, and a 512-dimensional projection layer is inserted on top of each hidden layer to reduce the number of parameters. The output layer of the LSTM has 3012 output units corresponding to 3012 senone labels. There is no frame stacking, and the output HMM senone label is delayed by 5 frames. Senone-level forced alignment of the clean data is generated using a Gaussian mixture model-HMM system. The clean CHiME-3 LSTM acoustic model achieves 7.43% and 38.96% WER on the clean and real noisy test data of CHiME-3, respectively, and serves as the teacher network in the subsequent T/S learning methods. Trained on the noisy and clean data with their one-hot hard labels, a multi-style LSTM acoustic model achieves 19.84% WER on the noisy test data.
For domain adaptation, parallel data consisting of 9137 pairs of clean and noisy utterances from the CHiME-3 training set is used as the adaptation data for T/S learning. To make the student model invariant to environments, the training data for the student model should include both clean and noisy data; therefore, we extend the original T/S learning work [3] by also including 9137 pairs of clean and clean utterances in CHiME-3 for adaptation. As shown in Table 1, soft T/S learning achieves an 18.20% average WER after environment adaptation, a 51.3% relative improvement over the clean model. To further improve the student model, we perform conditional T/S learning with the help of hard labels as described in Section 4.1. As a comparison, we conduct interpolated T/S learning [2] with different weights for the soft labels. Conditional T/S learning achieves a 16.42% average WER, i.e., 9.8% and 11.7% relative improvements over soft T/S learning and the best-performing interpolated T/S learning, respectively.
Note that a better teacher model yields a better student model. We therefore conducted a further experiment using a Cortana model trained on 375 hours of data, as used in [3], as the teacher to learn the student model with the same CHiME-3 parallel data. The soft T/S model achieves a 13.56% WER, significantly better than the one in Table 1, and conditional T/S reaches an 11.13% WER, a 17.9% relative improvement over soft T/S.
5.2 Speaker Adaptation
We further perform speaker adaptation on a Microsoft internal Windows Phone short message dictation (SMD) task. The test set consists of 7 speakers with a total of 20,203 words, and a separate adaptation set of 200 sentences per speaker is used for model adaptation. We train an SI LSTM acoustic model with 2600 hours of Microsoft internal live US English data. This SI model has 4 hidden LSTM layers with 1024 units in each layer, and the output size of each LSTM layer is reduced to 512 by linear projection. The acoustic features are 80-dimensional log Mel filterbanks. The output layer has a dimension of 5980. The LSTM is trained to minimize the frame-level cross-entropy criterion. There is no frame stacking, and the output HMM state label is delayed by 5 frames. A trigram LM with around 8M n-grams is used for decoding. This SI LSTM acoustic model achieves 13.95% WER on the SMD test set.
We perform conditional T/S learning as in Section 4.2 to adapt the SI LSTM with the 200 adaptation utterances of each test speaker. For supervised adaptation, the hard labels come from the human transcriptions through forced alignment; for unsupervised adaptation, we use the SI model to generate the hypotheses. As a comparison, standard adaptation with hard labels and KLD adaptation [25] with various regularization weights are also conducted to adapt the SI LSTM. Note that adaptation with hard labels is equivalent to KLD adaptation with a regularization weight of 0. As shown in Table 2, KLD adaptation produces its best WERs of 12.54% and 13.55% for supervised and unsupervised adaptation, respectively, at its best regularization weight. Conditional T/S learning outperforms KLD adaptation: it achieves a 12.17% WER for supervised adaptation, i.e., 12.8% and 3.0% relative gains over the SI model and the best-performing KLD adaptation, respectively. For unsupervised adaptation, conditional T/S learning achieves a 13.21% WER, i.e., 5.3% and 2.5% relative gains over the SI acoustic model and KLD adaptation, respectively.
6 Conclusions
We proposed a conditional T/S learning method, in which the student network selectively learns from either the soft posteriors generated by the teacher network or the one-hot hard label, conditioned on whether the teacher makes a correct decision. Instead of blindly following whatever knowledge the teacher infuses, as in conventional T/S learning, conditional T/S learning pursues the most trustworthy knowledge throughout training and eliminates the burden of tuning interpolation weights. We applied conditional T/S learning to domain adaptation and obtained a 9.8% relative WER improvement over a strong T/S learning baseline on the CHiME-3 dataset. For speaker adaptation, conditional T/S learning outperformed the KLD adaptation, which is equivalent to interpolated T/S learning, achieving 12.8% and 5.3% relative WER gains for supervised and unsupervised adaptation, respectively, over a well-trained SI LSTM model.
-  J. Li, R. Zhao, J.-T. Huang, and Y. Gong, “Learning small-size DNN with output-distribution-based criteria.,” in Proc. INTERSPEECH, 2014, pp. 1910–1914.
-  G. E. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” CoRR, vol. abs/1503.02531, 2015.
-  J. Li, M. L. Seltzer, X. Wang, et al., “Large-scale domain adaptation via teacher-student learning,” in INTERSPEECH, 2017.
-  Z. Meng, J. Li, Y. Gong, and B.-H. Juang, “Adversarial teacher-student learning for unsupervised domain adaptation,” in Proc. ICASSP, 2018.
-  L. Mošner, M. Wu, A. Raju, et al., “Improving noise robustness of automatic speech recognition via parallel data and teacher-student learning,” arXiv preprint arXiv:1901.02348, 2019.
-  Y. Kim and A. M. Rush, “Sequence-level knowledge distillation,” in EMNLP, 2016, pp. 1317–1327.
-  Y. Chen, Y. Liu, Y. Cheng, and V. O. K. Li, “A teacher-student framework for zero-resource neural machine translation,” in Proc. ACL, 2017, pp. 1925–1935.
-  J. Li, R. Zhao, Z. Chen, et al., “Developing far-field speaker system via teacher-student learning,” arXiv preprint arXiv:1804.05166, 2018.
-  S. Watanabe, T. Hori, J. Le Roux, et al., “Student-teacher network learning with enhanced features,” in Proc. ICASSP, 2017.
-  J. Cui, B. Kingsbury, B. Ramabhadran, et al., “Knowledge distillation across ensembles of multilingual models for low-resource languages,” in Proc. ICASSP. IEEE, 2017.
-  Z. Tang, D. Wang, and Z. Zhang, “Recurrent neural network training with dark knowledge transfer,” in Proc. ICASSP, 2016.
-  T. Asami, R. Masumura, Y. Yamaguchi, et al., “Domain adaptation of DNN acoustic models using knowledge distillation,” in Proc. ICASSP, 2017, pp. 5185–5189.
-  W. Chan, N. R. Ke, and I. Lane, “Transferring knowledge from a RNN to a DNN,” in INTERSPEECH, 2015.
-  T. Tan, Y. Qian, and D. Yu, “Knowledge transfer in permutation invariant training for single-channel multi-talker speech recognition,” in ICASSP, 2018.
-  L. Lu, M. Guo, and S. Renals, “Knowledge distillation for small-footprint highway networks,” in ICASSP, 2017.
-  N. Jaitly, P. Nguyen, A. Senior, and V. Vanhoucke, “Application of pretrained deep neural networks to large vocabulary speech recognition,” in Proc. INTERSPEECH, 2012.
-  G. Hinton, L. Deng, D. Yu, et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.
-  L. Deng, J. Li, J.-T. Huang, et al., “Recent advances in deep learning for speech research at Microsoft,” in ICASSP, 2013.
-  J. Li, L. Deng, Y. Gong, et al., “An overview of noise-robust automatic speech recognition,” IEEE/ACM Transactions on Audio, Speech and Language Processing, 2014.
-  S. Sun, B. Zhang, L. Xie, and Y. Zhang, “An unsupervised deep domain adaptation approach for robust speech recognition,” Neurocomputing, 2017.
-  Z. Meng, Z. Chen, V. Mazalov, J. Li, and Y. Gong, “Unsupervised adaptation with domain separation networks for robust speech recognition,” in Proceeding of ASRU, 2017.
-  Z. Meng, J. Li, and Y. Gong, “Attentive adversarial learning for domain-invariant training,” in Proc. ICASSP, 2019.
-  Z. Meng, J. Li, and Y. Gong, “Cycle-consistent speech enhancement,” Interspeech, 2018.
-  Z. Meng, J. Li, and Y. Gong, “Adversarial feature-mapping for speech enhancement,” Interspeech, 2018.
-  D. Yu, K. Yao, H. Su, et al., “KL-divergence regularized deep neural network adaptation for improved large vocabulary speech recognition,” in Proc. ICASSP, 2013.
-  H. Liao, “Speaker adaptation of context dependent deep neural networks,” in Proc. ICASSP, May 2013, pp. 7947–7951.
-  Z. Huang, J. Li, S. Siniscalchi, et al., “Rapid adaptation for deep neural networks through multi-task learning,” in Interspeech, 2015.
-  F. Seide, G. Li, X. Chen, and D. Yu, “Feature engineering in context-dependent deep neural networks for conversational speech transcription,” in Proc. ASRU, Dec 2011, pp. 24–29.
-  P. Swietojanski, J. Li, and S. Renals, “Learning hidden unit contributions for unsupervised acoustic model adaptation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 8, pp. 1450–1463, Aug 2016.
-  J. Xue, J. Li, and Y. Gong, “Restructuring of deep neural network acoustic models with singular value decomposition.,” in Interspeech, 2013, pp. 2365–2369.
-  Y. Zhao, J. Li, and Y. Gong, “Low-rank plus diagonal adaptation for deep neural networks,” in Proc. ICASSP, 2016.
-  S. Xue, O. Abdel-Hamid, H. Jiang, et al., “Fast adaptation of deep neural network based on discriminant codes for speech recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, Dec 2014.
-  L. Samarakoon and K. C. Sim, “Factorized hidden layer adaptation for deep neural network based acoustic modeling,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 12, pp. 2241–2250, Dec 2016.
-  Z. Meng, J. Li, and Y. Gong, “Adversarial speaker adaptation,” in Proc. ICASSP, 2019.
-  Z. Meng, J. Li, Z. Chen, et al., “Speaker-invariant training via adversarial learning,” in Proc. ICASSP, 2018.
-  J. Barker, R. Marxer, E. Vincent, and S. Watanabe, “The third CHiME speech separation and recognition challenge: Dataset, task and baselines,” in Proc. ASRU, 2015, pp. 504–511.
-  H. Sak, A. Senior, and F. Beaufays, “Long short-term memory recurrent neural network architectures for large scale acoustic modeling,” in Interspeech, 2014.
-  Z. Meng, S. Watanabe, J. R. Hershey, and H. Erdogan, “Deep long short-term memory adaptive beamforming networks for multichannel robust speech recognition,” in ICASSP, 2017.
-  H. Erdogan, T. Hayashi, J. R. Hershey, et al., “Multi-channel speech recognition: LSTMs all the way through,” in CHiME-4 workshop, 2016, pp. 1–4.
-  J. Li, D. Yu, J.-T. Huang, and Y. Gong, “Improving wideband speech recognition using mixed-bandwidth training data in CD-DNN-HMM,” in Proc. SLT. IEEE, 2012, pp. 131–136.