Acoustic model adaptation aims to improve automatic speech recognition (ASR) accuracy by reducing the mismatch between training and test conditions. In feature-space adaptation, transformations of acoustic features are estimated to maximise the log-likelihood of the adaptation data [1, 2]. In model-based adaptation, a subset of the weights of a neural network acoustic model is adapted [3, 4, 5, 6]. Hybrid adaptation uses auxiliary features such as i-vectors [7, 8] or speaker codes [9] to inform the acoustic model about speaker identities. Experiments have shown that these approaches are complementary and can be usefully combined [10].
A label sequence is provided for the adaptation data in supervised adaptation, but for unsupervised adaptation only an unlabelled recording is available. Conventionally, the best path from a first pass decoding is used to estimate labels for unsupervised adaptation. In this paper we focus on unsupervised test-time adaptation of neural network acoustic models.
An important challenge for unsupervised model adaptation of neural networks is to avoid overfitting to errors made in the first pass decoding. In the past this challenge was tackled by filtering adaptation data by confidences produced by an ASR system [11, 12, 13, 14, 15] or by using ASR quality estimation [16]. Alternatively, neural network based acoustic models were prevented from overfitting to those errors by limiting the expressivity of the adaptation through a drastic reduction in the number of adapted parameters, for example by adapting only amplitudes of hidden units [17, 18] or by using low rank linear transformations. Furthermore, strong regularisers were used to prevent the outputs or weights of the acoustic model from diverging too far from the original model [20, 6]. In this paper we explore an alternative approach in which we use a lattice obtained from the first pass decoding as the supervision for unsupervised model adaptation, since lattices contain all information about the ASR system’s confidence and possible phone confusions that can be leveraged during adaptation of the acoustic model. Lattices were previously used as supervision for unsupervised adaptation of Gaussian Mixture Models (GMMs) using maximum likelihood linear regression (MLLR) and for unsupervised training of GMMs using maximum likelihood training. However, in this paper we are interested in unsupervised adaptation of much larger discriminative models.
An effective unsupervised adaptation technique for neural network acoustic models requires three components. First, it is necessary to select a suitable subset of model parameters for adaptation in order to allow rapid adaptation using small amounts of adaptation data. Second, the system should filter the possible adaptation data with respect to its suitability for adaptation, as the first pass decoding may produce erroneous transcripts. Third, it needs a reliable adaptation schedule that updates the selected adaptation parameters using data filtered by the second component, while preventing overfitting to the adaptation data.
In this paper we explore an alternative solution to the data filtering component, in which all the adaptation data is used to adapt the whole neural network acoustic model, but the uncertainty in the decoding is captured through the use of complete lattices for supervision. Our approach is inspired by recent work on semi-supervised learning using the sequence level lattice-free maximum mutual information (LF-MMI) objective, in which it is shown that using lattices as supervision is beneficial compared to using only best paths in the semi-supervised learning setting. We acknowledge that semi-supervised training and test-time adaptation are essentially equivalent, but we emphasise that this approach allows us to reliably adapt all weights of neural network models using unsupervised adaptation and a discriminative training criterion, which was problematic in the past. Moreover, test-time speaker adaptation uses much less data (from 5 minutes to 1 hour) than semi-supervised training, and it uses the same data for adaptation and testing.
We compare the lattice approach to using the best path, both obtained from the first pass decoding, for unsupervised adaptation within the LF-MMI framework. This is experimentally explored using three transcription tasks – TED talks [23, 24], multi-genre TV broadcasts [25], and a low-resource language, Somali – which have a wide range of baseline word error rates (WERs) (from 10% to 57%). We demonstrate improvements compared to using only the best path as supervision. Moreover, we show that with this approach adapting all the parameters achieves better results than commonly used methods that adapt only a small subset of the weights, such as LHUC.
2.1 Lattice supervision and LF-MMI
Discriminative training using criteria such as maximum mutual information (MMI) [26] has been shown to be sensitive to the accuracy of the transcripts [12, 27]. In lieu of better transcripts, a range of transcript filtering approaches have previously been explored [12, 13, 14]. In unsupervised or semi-supervised approaches, in which we generate hypothesis transcriptions by decoding with a seed model, we can alternatively use a lattice as supervision. For instance, with the MMI criterion, the numerator lattice can contain multiple hypotheses for the same audio segment, up to some lattice pruning factor. If this pruning factor is set to 0, the numerator lattice is left with only the best path.
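For reference, a standard way of writing the MMI objective is sketched below. The notation is an assumption made for illustration and is not taken from this paper: $\mathbf{x}_r$ is the acoustic observation sequence of utterance $r$, $\mathcal{M}_w$ the HMM for word sequence $w$, $\kappa$ an acoustic scale, $P(w)$ the language model probability, and $\mathcal{L}_r$ (a hypothetical symbol) the numerator lattice for utterance $r$.

```latex
% Standard MMI with a single transcript w_r per utterance
\mathcal{F}_{\mathrm{MMI}}
  = \sum_{r} \log
    \frac{p(\mathbf{x}_r \mid \mathcal{M}_{w_r})^{\kappa} \, P(w_r)}
         {\sum_{w} p(\mathbf{x}_r \mid \mathcal{M}_{w})^{\kappa} \, P(w)}

% Lattice supervision: the numerator sums over the hypotheses
% retained in the numerator lattice \mathcal{L}_r
\mathcal{F}_{\mathrm{MMI}}^{\mathrm{lat}}
  = \sum_{r} \log
    \frac{\sum_{w \in \mathcal{L}_r} p(\mathbf{x}_r \mid \mathcal{M}_{w})^{\kappa} \, P(w)}
         {\sum_{w} p(\mathbf{x}_r \mid \mathcal{M}_{w})^{\kappa} \, P(w)}
```

With a pruning factor of 0, $\mathcal{L}_r$ contains only the best path, so the second expression reduces to the first with $w_r$ set to the first-pass hypothesis.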
Lattice supervision has previously been used in work on unsupervised adaptation [28] and training [21] of GMMs, as well as discriminative [29] and semi-supervised training [22] of neural network models. Following Manohar et al. [22], we explore the use of lattice supervision versus that of only using the best path in the denominator lattice-free version of MMI (LF-MMI) [30]. LF-MMI was introduced by Povey et al. [30] as a method to train neural network acoustic models with a sequence discriminative criterion (MMI) without an initial cross-entropy (CE) stage to generate lattices approximating all possible word sequences (e.g. [31]). The word-level denominator lattice is instead replaced with a phone-level denominator graph encoding all possible sequences given a 3- or 4-gram phone language model. To further reduce complexity, the model produces outputs at one third of the frame rate. In the numerator, a frame-by-frame mask allows phones to appear with some tolerance relative to their original alignment. A mixture of regularisation methods is required to reduce overfitting; for more details we refer to [30]. Povey et al. [30] demonstrated up to 8% relative improvements in WER over previous CE systems followed by sequence discriminative training with the state Minimum Bayes Risk (sMBR) criterion [32]. An extension to the LF-MMI framework was recently proposed that enables flat-start training with neural networks [33].
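The effect of the lattice pruning factor described in this section can be illustrated with a toy example. This is only a sketch under strong simplifying assumptions: a real lattice is a compact graph rather than a flat list of hypotheses, and the function name `prune_and_normalise`, the `beam` parameter and the scores are all hypothetical.

```python
import math


def prune_and_normalise(log_scores, beam):
    """Keep hypotheses whose log score is within `beam` of the best one,
    then renormalise the survivors into posterior weights."""
    best = max(log_scores.values())
    kept = {hyp: s for hyp, s in log_scores.items() if best - s <= beam}
    # Subtract the best score before exponentiating for numerical stability.
    total = sum(math.exp(s - best) for s in kept.values())
    return {hyp: math.exp(s - best) / total for hyp, s in kept.items()}


# Hypothetical first-pass hypotheses with combined log scores.
hypotheses = {"a cat sat": -10.0, "a cat sad": -10.7, "a cap sat": -14.0}

best_path_only = prune_and_normalise(hypotheses, beam=0.0)  # best path supervision
lattice_like = prune_and_normalise(hypotheses, beam=2.0)    # lattice-style supervision
```

With `beam=0.0` only the single best hypothesis survives and receives all of the weight, whereas a wider beam keeps competing hypotheses and spreads the weight according to their scores.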
Traditionally, only a subset of the acoustic model weights is adapted to prevent overfitting to the transcripts obtained from a first pass decode. This is usually done by inserting a linear layer after the input layer, hidden layers or output layers. For example, an adaptation of the activations of a hidden layer with speaker dependent weights can be expressed as follows:
However, even these simple techniques tend to overfit because too many parameters are used for adaptation. Learning Hidden Unit Contributions (LHUC) [17, 3] is a technique in which only elements on the diagonal of the speaker dependent matrix are adapted – i.e. each hidden unit may be viewed as having a speaker adaptive amplitude parameter. Since a much smaller number of weights is adapted, this technique is less prone to overfitting to the adaptation data.
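The relationship between a full speaker dependent linear layer and LHUC can be sketched as follows. The notation is an assumption: a full transform computes h_s = W_s h for a hidden activation vector h, while LHUC restricts W_s to a diagonal matrix, which is the same as an elementwise scaling by a vector of amplitudes r_s. The function names are hypothetical, and the amplitudes are used directly here, whereas practical implementations often re-parameterise them.

```python
def linear_speaker_layer(h, W_s):
    """Full speaker dependent linear transform: h_s = W_s h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W_s]


def lhuc_layer(h, r_s):
    """LHUC: scale each hidden unit by its own speaker dependent amplitude."""
    return [r * x for r, x in zip(r_s, h)]


def diag(r_s):
    """Build the diagonal matrix that corresponds to the amplitudes r_s."""
    n = len(r_s)
    return [[r_s[i] if i == j else 0.0 for j in range(n)] for i in range(n)]


h = [0.5, -1.0, 2.0]    # toy hidden activations
r_s = [1.2, 0.8, 1.0]   # toy speaker dependent amplitudes

full = linear_speaker_layer(h, diag(r_s))
scaled = lhuc_layer(h, r_s)
```

For a layer with 450 units, as in the TED-LIUM models used in this paper, a full matrix has 450 × 450 = 202,500 entries, compared with 450 LHUC amplitudes.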
This was done by maintaining speaker-dependent LHUC parameters for each speaker and training them jointly with the speaker-independent parameters. In order to obtain a good speaker-independent model that can be used for decoding and adaptation to new speakers, we trained speaker-independent LHUC parameters instead of speaker-dependent ones with probability 0.5 during training. This is similar to a setup that was shown to work well in [35]. At the beginning of training all speaker dependent parameters were sampled from a normal distribution with and . It was important to turn off L2 regularisation and parameter shrinkage for the speaker dependent parameters in Kaldi, as otherwise all speaker dependent parameters would converge to zero.
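The random choice between speaker-dependent and speaker-independent LHUC parameters described above can be sketched as below. This is a minimal illustration, not the Kaldi implementation: the function name, the `"speaker_independent"` key and the granularity of the choice (here, one draw per call) are assumptions.

```python
import random


def pick_lhuc_key(speaker_id, p_speaker_independent=0.5, rng=random):
    """Return the key of the LHUC parameter set to train on this example.

    With probability `p_speaker_independent` the shared speaker-independent
    parameters are selected; otherwise the parameters of `speaker_id` are.
    """
    if rng.random() < p_speaker_independent:
        return "speaker_independent"
    return speaker_id


# Example: count how often the shared parameters are chosen.
rng = random.Random(0)
draws = [pick_lhuc_key("spk42", rng=rng) for _ in range(10000)]
si_fraction = draws.count("speaker_independent") / len(draws)
```

Training the shared parameters on roughly half of the examples keeps a usable speaker-independent set available for first pass decoding of unseen speakers.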
We conducted test-time model adaptation experiments on three datasets: the TED-LIUM corpus of TED talks [23, 24], multi-genre TV broadcasts from the MGB 1 Challenge [25] and a corpus of Somali from the IARPA MATERIAL programme. All models were trained and adapted using the Kaldi toolkit [36]. We describe the respective baseline models in sections 3.1-3.3 and the adaptation of the models in 3.4.
We trained three time-delay neural network (TDNN) models [37] with LF-MMI [30] following Kaldi TED-LIUM recipe 1f. All models had the same architecture with 7 hidden layers with 450 units. The first model was trained without i-vector features, the second model was trained with i-vector features and the third model was trained without i-vector features but with SAT-LHUC [35]. All models were trained only on TED talks that were recorded before 2012 in order to conform with the IWSLT [38] evaluation guidelines, which resulted in 130 hours of training data. We performed adaptation on the TED-LIUM dev and test data, which contained 8 speakers with an average speech duration of 11.9 minutes and 11 speakers with an average speech duration of 14.2 minutes respectively.
with LF-MMI following Kaldi Switchboard recipe 7p. The model has 12 layers with 1280 units each (apart from the penultimate layer) and a bottleneck dimension of 256, with batch normalisation and dropout layers interleaved throughout. The model was trained for 8 epochs. We used alignments obtained with the HMM-GMM recipe provided with the MGB challenge, and trained on transcripts from a lightly supervised decode that had a maximum matching error rate (MER) with the original subtitles of 40. This yields roughly 649 hours of data, or 1960 hours after speed perturbation. For adaptation we carry across all training parameters (including dropout which we found particularly important for this data) and we rescore the supervision lattices with a 4-gram language model (LM) as in, estimated on about 640 million words of BBC subtitle text. We adapt and test on the longitudinal eval set which consists of 10 hours across two TV shows and a total of 19 episodes, each between 30 and 45 minutes in length. We do not have speaker clustering and therefore extract i-vectors per utterance and perform episode level adaptation. Finally, we rescore the decoded output with the 4-gram LM.
We carried out experiments on Somali “surprise language” data released to participants on the IARPA-MATERIAL programme (https://www.iarpa.gov/index.php/research-programs/material). Training data comprises 499 narrowband telephone conversation sides, totalling 37 hours of speech. Test data comprises narrowband telephone conversations (NB), and wideband (WB) data from the news and topical broadcast domains that are mismatched to the training material. We trained a TDNN-F model using the neural network architecture from Kaldi TED-LIUM recipe 1g. The model had 14 hidden layers with 1024 units. The weight matrices were factored into two matrices with a bottleneck dimension of 128. The model used filterbank, pitch and probability of voicing [40] features together with multilingual bottleneck features obtained from a neural network that was trained on all Babel languages [41, 42]. We used per utterance cepstral mean and variance normalisation, since there were no speaker clusters for the wideband test data.
The model was trained on narrowband data with speed perturbation and evaluated on both narrowband and wideband data. We used data scraped from the web to build a language model for wideband data. We performed speaker adaptation on narrowband data which consisted of 117 speakers with an average speech duration of 4.7 minutes and file-level adaptation on wideband data which consisted of 119 files with an average speech duration of 5 minutes.
3.4 Adaptation methods
In this paper we were primarily interested in comparing model adaptation methods that use either one best path (called BP in the Results section) or a lattice (called LAT in the Results section) obtained from the first pass decoding for supervision. We adjusted a recipe for semi-supervised training using LF-MMI [22] to instead perform test-time adaptation. Our main hypothesis was that methods using lattices for supervision are much less likely to overfit to incorrectly transcribed segments in the adaptation data. In the past when only the best path was used for model adaptation, several techniques for data selection were required [12, 13, 14]. In this paper we compared adapting using only utterances with top , or average utterance confidence. We conducted model adaptation experiments in two regimes: in the first regime, we adapted all parameters of the acoustic model (called ALL in the Results section); in the second regime, we adapted only LHUC parameters inserted after every hidden layer of the acoustic model (called LHUC in the Results section). When adapting all parameters, we adapted the model for three epochs, starting with the learning rate that was used in the last iteration during training. We gradually decreased the learning rate down to one tenth (one fifth for MGB) of the initial learning rate. This learning schedule was chosen in order to imitate continued learning of the model. When adapting LHUC parameters, we adapted the model for three epochs with a fixed learning rate of , which we found to work well in previous experiments.
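The learning rate schedule for adapting all parameters can be sketched as follows. The text only states the start point and the end point (one tenth of the initial rate, or one fifth for MGB); the geometric interpolation between them, the function name and the example values are assumptions made for illustration.

```python
def adaptation_lr(step, total_steps, initial_lr, final_ratio=0.1):
    """Learning rate at `step` (0-based) of `total_steps` adaptation updates.

    The rate decays geometrically from `initial_lr` at the first step to
    `initial_lr * final_ratio` at the last step.
    """
    if total_steps < 2:
        return initial_lr
    progress = step / (total_steps - 1)
    return initial_lr * final_ratio ** progress


# Example: 12 updates spread over the three adaptation epochs.
initial_lr = 1e-4  # hypothetical value, standing in for the last training learning rate
schedule = [adaptation_lr(s, 12, initial_lr) for s in range(12)]
mgb_schedule = [adaptation_lr(s, 12, initial_lr, final_ratio=0.2) for s in range(12)]
```

Starting from the last training learning rate and decaying smoothly is one way to imitate continued training of the model, as the paragraph above describes.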
We conducted the first set of experiments on the TED-LIUM dataset. Adaptation of the model without i-vectors using lattices achieves relative improvement when adapting LHUC parameters and relative improvement when adapting all parameters, whereas improvements when adapting using best path and all adaptation data were much smaller or even negative (Table 1). We observed a similar trend for the remaining two models. Adaptation of the model using i-vectors (Table 2) using lattices as supervision improves performance of a speaker adaptive baseline. This confirms that model-based adaptation is complementary with i-vectors. Unfortunately, our implementation of SAT-LHUC (Table 3) did not outperform test-only LHUC adaptation. We plan to explore other possibilities of SAT-LHUC training in the future.
For the MGB corpus we adapted to entire episodes in the longitudinal eval set, rather than to speakers, as noted in Section 3.2. This provides more adaptation data (30-45 minutes per episode), but perhaps at the cost of losing finer granularity for adaptation. Adapting all parameters with lattice supervision provided the best results (Table 4). These are, to our knowledge, the best results reported on the longitudinal evaluation set. Using the best path with all parameters yields almost no gains (). When adapting only a subset of the parameters with LHUC the results are more stable, but they are not as good as those obtained by adapting all parameters with lattice supervision.
We also evaluated adaptation using lattices as supervision on the Somali data. As can be seen from the table, Somali data is very challenging – the initial WER of the model is very high on both NB and WB data at and respectively. These results are similar to other experiments conducted on other IARPA-MATERIAL programme languages with the same TDNN-F neural network architecture . Here we show that adapting such a model using a best path as supervision does not reduce the WER, because the best path contains too many errors. Nevertheless, adaptation using lattices as supervision gives absolute improvements. Even though the relative improvement is small, it is interesting to see that using lattices as supervision allows us to improve performance at all. We believe that adapting to entire files is sub-optimal, because the speaker variance in the wide-band data might be too high. Therefore, we plan to perform per utterance adaptation experiments in the future.
One common way to prevent adaptation to erroneous first pass transcripts is to filter the adaptation data by confidences , for example by the average utterance confidence. This filtering can be done by using a hard threshold, or by using only the fraction of utterances with the highest confidences. Either way one extra hyper-parameter is introduced. In Table 6 we compare adaptation using lattices as supervision with adaptation using only best paths on various fractions of the adaptation data when adapting all parameters. We experiment with the TED-LIUM model without i-vectors, and the Somali model. As can be seen from the table, filtering utterances improves results when using best path supervision. The biggest improvement can be achieved when using only of the adaptation data. Even then the TED-LIUM model does not obtain similar performance as when adapted using lattices for supervision. Furthermore, adaptation of the Somali model using best path supervision only barely matches the unadapted baseline. This is probably due to the fact that the WER of the initial Somali model is high and that the lattice provides much more information than a combination of best path supervision and corresponding confidences. We also performed the same filtering experiment with lattices as supervision. We found that using a threshold of 75% – 100% achieves the best results. Overall, adaptation using lattice supervision does not benefit from filtering utterances as much as adaptation using best path supervision.
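The two confidence-based filtering strategies mentioned above can be sketched as follows. The function names and the representation of the adaptation data as `(utterance_id, average_confidence)` pairs are assumptions made for illustration.

```python
import math


def select_top_fraction(utterances, fraction):
    """Keep the given fraction of utterances with the highest confidence.

    `utterances` is a list of (utterance_id, average_confidence) pairs.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    ranked = sorted(utterances, key=lambda u: u[1], reverse=True)
    keep = max(1, math.ceil(fraction * len(ranked)))
    return [utt_id for utt_id, _ in ranked[:keep]]


def select_by_threshold(utterances, threshold):
    """Keep utterances whose average confidence reaches a hard threshold."""
    return [utt_id for utt_id, conf in utterances if conf >= threshold]


# Hypothetical adaptation utterances with average confidences.
utts = [("u1", 0.95), ("u2", 0.40), ("u3", 0.80), ("u4", 0.65)]
top_half = select_top_fraction(utts, 0.5)
confident = select_by_threshold(utts, 0.7)
```

Either variant introduces one extra hyper-parameter (`fraction` or `threshold`) that has to be tuned, which lattice supervision avoids.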
In this paper we compared unsupervised model adaptation using a lattice with the best path obtained from the first pass decoding as supervision. Our experiments show that using the lattice as supervision achieves better results than using the best path, even when confidence-based data selection is used to remove transcripts with many possible errors. This is due to the fact that the lattice from the first pass decoding contains much more information, such as confidence and phonetic confusions, than the best path. We find that the use of a lattice as supervision is particularly important when adapting all parameters, when over-fitting to incorrect first-pass transcriptions is a particular problem: in many cases we outperform a strong baseline that adapts only LHUC parameters. Moreover, we showed that when using lattices as supervision it is possible to adapt a model whose initial WER is higher than , for which adapting with best path supervision often produced worse WERs than the unadapted baseline.
Our finding that use of lattices greatly aids the two adaptation methods we considered motivates further investigation into whether other test-time adaptation techniques – many of which show limited gains in an unsupervised setting – could benefit similarly. This will be the subject of further work.
Acknowledgements: This research is based upon work supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via Air Force Research Laboratory (AFRL) contract #FA8650-17-C-9117. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, AFRL or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein. This work was also partially supported by: the EU H2020 projects SUMMA (grant agreement 688139) and ELG (grant agreement 825627), and a PhD studentship funded by Bloomberg.
[1] C. J. Leggetter and P. C. Woodland, “Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models,” Computer Speech & Language, vol. 9, no. 2, pp. 171–185, 1995.
[2] M. Gales, “Maximum likelihood linear transformations for HMM-based speech recognition,” Computer Speech & Language, vol. 12, no. 2, pp. 75–98, 1998.
[3] P. Swietojanski, J. Li, and S. Renals, “Learning hidden unit contributions for unsupervised acoustic model adaptation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, pp. 1450–1463, 2016.
[4] J. Xue, J. Li, D. Yu, M. Seltzer, and Y. Gong, “Singular value decomposition based low-footprint speaker adaptation and personalization for deep neural network,” in ICASSP, 2014.
[5] H. Liao, “Speaker adaptation of context dependent deep neural networks,” in ICASSP, 2013.
[6] D. Yu, K. Yao, H. Su, G. Li, and F. Seide, “KL-divergence regularized deep neural network adaptation for improved large vocabulary speech recognition,” in ICASSP, 2013.
[7] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-end factor analysis for speaker verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.
[8] G. Saon, H. Soltau, D. Nahamoo, and M. Picheny, “Speaker adaptation of neural network acoustic models using i-vectors,” in ASRU, 2013.
[9] O. Abdel-Hamid and H. Jiang, “Fast speaker adaptation of hybrid NN/HMM model for speech recognition based on discriminative learning of speaker code,” in ICASSP, 2013.
[10] L. Samarakoon and K. C. Sim, “On combining i-vectors and discriminative adaptation methods for unsupervised speaker normalization in DNN acoustic models,” in ICASSP, 2016.
[11] P. C. Woodland, “Speaker adaptation for continuous density HMMs: A review,” in ISCA Workshop on Adaptation Methods for Speech Recognition, 2001.
[12] L. Mathias, G. Yegnanarayanan, and J. Fritsch, “Discriminative training of acoustic models applied to domains with unreliable transcripts [speech recognition applications],” in ICASSP, 2005.
[13] S.-H. Liu, F.-H. Chu, S.-H. Lin, and B. Chen, “Investigating data selection for minimum phone error training of acoustic models,” in IEEE International Conference on Multimedia and Expo, 2007.
[14] S. Walker, M. Pedersen, I. Orife, and J. Flaks, “Semi-supervised model training for unbounded conversational speech recognition,” arXiv preprint arXiv:1705.09724, 2017.
[15] K. Veselý, L. Burget, and J. Černocký, “Semi-supervised DNN training with word selection for ASR,” in Interspeech, 2017.
[16] D. Falavigna, M. Matassoni, S. Jalalvand, M. Negri, and M. Turchi, “DNN adaptation by automatic quality estimation of ASR hypotheses,” Computer Speech & Language, vol. 46, pp. 585–604, 2017.
[17] P. Swietojanski and S. Renals, “Learning hidden unit contributions for unsupervised speaker adaptation of neural network acoustic models,” in SLT, 2014.
[18] L. Samarakoon and K. C. Sim, “Subspace LHUC for fast adaptation of deep neural network acoustic models,” in Interspeech, 2016.
[19] Y. Zhao, J. Li, K. Kumar, and Y. Gong, “Extended low-rank plus diagonal adaptation for deep and recurrent neural networks,” in ICASSP, 2017.
[20] X. Li and J. Bilmes, “Regularized adaptation of discriminative classifiers,” in ICASSP, 2006.
[21] T. Fraga-Silva, J.-L. Gauvain, and L. Lamel, “Lattice-based unsupervised acoustic model training,” in ICASSP, 2011.
[22] V. Manohar, H. Hadian, D. Povey, and S. Khudanpur, “Semi-supervised training of acoustic models using lattice-free MMI,” in ICASSP, 2018.
[23] A. Rousseau, P. Deléglise, and Y. Estève, “TED-LIUM: an automatic speech recognition dedicated corpus,” in LREC, 2012.
[24] A. Rousseau, P. Deléglise, and Y. Estève, “Enhancing the TED-LIUM corpus with selected data for language modeling and more TED talks,” in LREC, 2014.
[25] P. Bell, M. Gales, T. Hain, J. Kilgour, P. Lanchantin, X. Liu, A. McParland, S. Renals, O. Saz, M. Wester, and P. C. Woodland, “The MGB challenge: Evaluating multi-genre broadcast media recognition,” in ASRU, 2015.
[26] L. Bahl, P. Brown, P. De Souza, and R. Mercer, “Maximum mutual information estimation of hidden Markov model parameters for speech recognition,” in ICASSP, vol. 11, 1986, pp. 49–52.
[27] K. Yu, M. Gales, L. Wang, and P. C. Woodland, “Unsupervised training and directed manual transcription for LVCSR,” Speech Communication, vol. 52, no. 7-8, pp. 652–663, 2010.
[28] M. Padmanabhan, G. Saon, and G. Zweig, “Lattice-based unsupervised MLLR for speaker adaptation,” in ASR2000 - Automatic Speech Recognition: Challenges for the New Millenium, ISCA Tutorial and Research Workshop (ITRW), 2000.
[29] D. Povey, “Discriminative training for large vocabulary speech recognition,” Ph.D. dissertation, University of Cambridge, 2005.
[30] D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar, X. Na, Y. Wang, and S. Khudanpur, “Purely sequence-trained neural networks for ASR based on lattice-free MMI,” in Interspeech, 2016.
[31] K. Veselý, A. Ghoshal, L. Burget, and D. Povey, “Sequence-discriminative training of deep neural networks,” in Interspeech, 2013.
[32] J. Kaiser, B. Horvat, and Z. Kacic, “A novel loss function for the overall risk criterion based discriminative training of HMM models,” in Sixth International Conference on Spoken Language Processing, 2000.
[33] H. Hadian, H. Sameti, D. Povey, and S. Khudanpur, “End-to-end speech recognition using lattice-free MMI,” in Interspeech, 2018.
[34] T. Anastasakos, J. McDonough, R. Schwartz, and J. Makhoul, “A compact model for speaker-adaptive training,” in ICSLP, 1996, pp. 1137–1140.
[35] P. Swietojanski and S. Renals, “SAT-LHUC: Speaker adaptive training for learning hidden unit contributions,” in ICASSP, 2016.
[36] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlíček, Y. Qian, P. Schwarz, J. Silovský, G. Stemmer, and K. Veselý, “The Kaldi speech recognition toolkit,” in ASRU, 2011.
[37] V. Peddinti, D. Povey, and S. Khudanpur, “A time delay neural network architecture for efficient modeling of long temporal contexts,” in Interspeech, 2015.
[38] M. Federico, M. Cettolo, L. Bentivogli, M. Paul, and S. Stüker, “Overview of the IWSLT 2012 evaluation campaign,” in IWSLT, 2012.
[39] D. Povey, G. Cheng, Y. Wang, K. Li, H. Xu, M. Yarmohammadi, and S. Khudanpur, “Semi-orthogonal low-rank matrix factorization for deep neural networks,” in Interspeech, 2018.
[40] P. Ghahremani, B. BabaAli, D. Povey, K. Riedhammer, J. Trmal, and S. Khudanpur, “A pitch extraction algorithm tuned for automatic speech recognition,” in ICASSP, 2014.
[41] J. Cui, B. Kingsbury, B. Ramabhadran, A. Sethy, K. Audhkhasi, X. Cui, E. Kislal, L. Mangu, M. Nussbaum-Thom, M. Picheny, Z. Tüske, P. Golik, R. Schlüter, H. Ney, M. J. F. Gales, K. M. Knill, A. Ragni, H. Wang, and P. C. Woodland, “Multilingual representations for low resource speech recognition and keyword search,” in ICASSP, 2016.
[42] M. J. F. Gales, K. M. Knill, and A. Ragni, “Low-resource speech recognition and keyword-spotting,” in SPECOM, 2017.