Learning Spoken Language Representations with Neural Lattice Language Modeling

by Chao-Wei Huang, et al.
National Taiwan University

Pre-trained language models have achieved huge improvements on many NLP tasks. However, these methods are usually designed for written text, so they do not consider the properties of spoken language. Therefore, this paper aims at generalizing the idea of language model pre-training to lattices generated by recognition systems. We propose a framework that trains neural lattice language models to provide contextualized representations for spoken language understanding tasks. The proposed two-stage pre-training approach reduces the demand for speech data and is more efficient. Experiments on intent detection and dialogue act recognition datasets demonstrate that our proposed method consistently outperforms strong baselines when evaluated on spoken inputs. The code is available at https://github.com/MiuLab/Lattice-ELMo.






1 Introduction

The task of spoken language understanding (SLU) aims at extracting useful information from spoken utterances. Typically, SLU is decomposed into two stages: 1) an accurate automatic speech recognition (ASR) system transcribes the input speech into texts, and then 2) language understanding techniques are applied to the transcribed texts. These two modules can be developed separately, so most prior work developed the backend language understanding systems based on manual transcripts Yao et al. (2014); Guo et al. (2014); Mesnil et al. (2014); Goo et al. (2018).

Despite the simplicity of the two-stage method, prior work showed that a tighter integration between the two components can lead to better performance. Researchers have extended the ASR 1-best results to n-best lists or word confusion networks in order to preserve the ambiguity of the transcripts Tur et al. (2002); Hakkani-Tür et al. (2006); Henderson et al. (2012); Tür et al. (2013); Masumura et al. (2018). Another line of research focused on using lattices produced by ASR systems. Lattices are directed acyclic graphs (DAGs) that represent multiple recognition hypotheses. An example of an ASR lattice is shown in Figure 1. Ladhak et al. (2016) introduced LatticeRNN, a variant of recurrent neural networks (RNNs) that generalizes RNNs to lattice-structured inputs in order to improve SLU. Zhang and Yang (2018) proposed a similar idea for Chinese named entity recognition. Sperber et al. (2019); Xiao et al. (2019); Zhang et al. (2019) proposed extensions to enable the transformer model Vaswani et al. (2017) to consume lattice inputs for machine translation. Huang and Chen (2019) proposed to adapt the transformer model originally pre-trained on written texts to consume lattices in order to improve SLU performance. Buckman and Neubig (2018) also found that utilizing lattices that represent multiple granularities of sentences can improve language modeling.

Figure 1: Illustration of a lattice.

With the recent introduction of large pre-trained language models (LMs) such as ELMo Peters et al. (2018), GPT Radford (2018) and BERT Devlin et al. (2019), we have observed huge improvements on natural language understanding tasks. These models are pre-trained on large amounts of written text so that they provide the downstream tasks with high-quality representations. However, applying these models to spoken scenarios introduces several discrepancies between the pre-training task and the target task, such as the domain mismatch between written texts and spoken utterances with ASR errors. It has been shown that fine-tuning the pre-trained language models on the data from the target tasks can mitigate the domain mismatch problem Howard and Ruder (2018); Chronopoulou et al. (2019). Siddhant et al. (2018) focused on pre-training a language model specifically for spoken content with a huge amount of automatic transcripts, which requires a large collection of in-domain speech.

In this paper, we propose a novel spoken language representation learning framework, which focuses on learning contextualized representations of lattices based on our proposed lattice language modeling objective. The proposed framework consists of two stages of LM pre-training to reduce the demand for lattice data. We conduct experiments on benchmark datasets for spoken language understanding, including intent classification and dialogue act recognition. The proposed method consistently achieves superior performance, with relative error reduction ranging from 3% to 42% compared to a pre-trained sequential LM.

Figure 2: Illustration of the proposed framework. The weights of the pre-trained LatticeLSTM LM are fixed when training the target task classifier (shown in white blocks), while the weights of the newly added LatticeLSTM classifier are trained from scratch (shown in colored block).

2 Neural Lattice Language Model

This section details the proposed two-stage framework for learning contextualized representations of spoken language.

2.1 Problem Formulation

In the SLU task, the model input is an utterance $X$ containing a sequence of words $X = [x_1, x_2, \ldots, x_{|X|}]$, and the goal is to map $X$ to its corresponding class $y$. The inputs can also be stored in a lattice form, where we use edge-labeled lattices in this work. A lattice $L = \{N, E\}$ is defined by a set of nodes $N = \{n_1, n_2, \ldots, n_{|N|}\}$ and a set of transitions $E = \{e_1, e_2, \ldots, e_{|E|}\}$. A weighted transition is defined as $e = \{\mathrm{prev}[e], \mathrm{next}[e], w[e], P(e)\}$, where $\mathrm{prev}[e]$ and $\mathrm{next}[e]$ denote the previous node and next node respectively, $w[e]$ denotes the associated word, and $P(e)$ denotes the transition probability. We use $\mathrm{in}[n]$ and $\mathrm{out}[n]$ to denote the sets of incoming and outgoing transitions of a node $n$. $L_{<n} = \{N_{<n}, E_{<n}\}$ denotes the sub-lattice which consists of all paths between the starting node and a node $n$.
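To make the lattice definition concrete, the following is a minimal sketch of an edge-labeled lattice in Python. The class and field names (`Transition`, `Lattice`, `incoming`, `outgoing`, `topological_order`) and the toy words and probabilities are illustrative assumptions, not names or data taken from the released code.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Transition:
    prev: int    # index of the previous (source) node
    next: int    # index of the next (target) node
    word: str    # word label carried by the transition
    prob: float  # transition probability


class Lattice:
    """Edge-labeled lattice: a DAG whose transitions carry words and probabilities."""

    def __init__(self, num_nodes, transitions):
        self.num_nodes = num_nodes
        self.transitions = list(transitions)
        self.incoming = defaultdict(list)  # node -> transitions entering it
        self.outgoing = defaultdict(list)  # node -> transitions leaving it
        for e in self.transitions:
            self.incoming[e.next].append(e)
            self.outgoing[e.prev].append(e)

    def topological_order(self):
        """Order nodes so that every transition goes from an earlier to a later node."""
        remaining = [len(self.incoming[n]) for n in range(self.num_nodes)]
        ready = [n for n in range(self.num_nodes) if remaining[n] == 0]
        order = []
        while ready:
            n = ready.pop()
            order.append(n)
            for e in self.outgoing[n]:
                remaining[e.next] -= 1
                if remaining[e.next] == 0:
                    ready.append(e.next)
        if len(order) != self.num_nodes:
            raise ValueError("not a DAG: the lattice contains a cycle")
        return order


# Hypothetical toy lattice with two ambiguous positions.
toy = Lattice(4, [
    Transition(0, 1, "book", 0.7),
    Transition(0, 1, "look", 0.3),
    Transition(1, 2, "a", 1.0),
    Transition(2, 3, "flight", 0.8),
    Transition(2, 3, "fright", 0.2),
])
```

Storing both the incoming and the outgoing transitions per node is a design choice made here so that a recurrent model can pool over incoming transitions while a language modeling objective can read off the outgoing ones.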

2.2 LatticeRNN

The LatticeRNN Ladhak et al. (2016) model generalizes sequential RNNs to lattice-structured inputs. It traverses the nodes and transitions of a lattice in a topological order. For each transition $e$, LatticeRNN takes the word $w[e]$ as input and the representation of its previous node $h[\mathrm{prev}[e]]$ as the previous hidden state, and then produces a new hidden state of $e$, $h[e] = \mathrm{RNN}(h[\mathrm{prev}[e]], w[e])$. The representation of a node $h[n]$ is obtained by pooling the hidden states of the incoming transitions. In this work, we employ the WeightedPool variant proposed by Ladhak et al. (2016), which computes the node representation as

$$h[n] = \sum_{e \in \mathrm{in}[n]} P(e) \cdot h[e],$$

where $\mathrm{in}[n]$ is the set of incoming transitions of $n$ and $P(e)$ is the transition probability of $e$.

Note that we can represent any sequential text as a linear-chain lattice, so LatticeRNN can be seen as a strict generalization of RNNs to DAG-like structures. This property enables us to initialize the weights in a LatticeRNN with the weights of a RNN as long as they use the same recurrent cell.
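The traversal and the linear-chain property can be illustrated with a small sketch. It assumes that WeightedPool is the probability-weighted sum of the hidden states of the incoming transitions, that nodes are already numbered in topological order with node 0 as the start, and it replaces the LSTM cell with a toy scalar `tanh` cell. The function names and the scalar "embeddings" are hypothetical.

```python
import math


def rnn_cell(h_prev, x, w_h=0.5, w_x=1.0):
    """Toy scalar recurrent cell standing in for an LSTM cell."""
    return math.tanh(w_h * h_prev + w_x * x)


def lattice_rnn(num_nodes, transitions, embed, cell=rnn_cell):
    """Compute node representations for a lattice.

    transitions: list of (prev, next, word, prob) tuples with prev < next,
    i.e. node indices are assumed to follow a topological order.
    """
    incoming = {n: [] for n in range(num_nodes)}
    for prev, nxt, word, prob in transitions:
        incoming[nxt].append((prev, word, prob))
    h_node = [0.0] * num_nodes  # the start node keeps the zero initial state
    for n in range(1, num_nodes):
        # Hidden state of each incoming transition, pooled with its probability.
        h_node[n] = sum(prob * cell(h_node[prev], embed[word])
                        for prev, word, prob in incoming[n])
    return h_node


def sequential_rnn(words, embed, cell=rnn_cell):
    """Ordinary RNN over a word sequence, returning every hidden state."""
    h, states = 0.0, []
    for word in words:
        h = cell(h, embed[word])
        states.append(h)
    return states


embed = {"book": 0.3, "look": -0.2, "a": 0.1, "flight": 0.8}

# A linear-chain lattice reproduces the sequential RNN states.
chain = [(0, 1, "book", 1.0), (1, 2, "a", 1.0), (2, 3, "flight", 1.0)]
chain_states = lattice_rnn(4, chain, embed)[1:]
seq_states = sequential_rnn(["book", "a", "flight"], embed)

# With two competing transitions, the node state is their weighted mixture.
branching = [(0, 1, "book", 0.7), (0, 1, "look", 0.3), (1, 2, "a", 1.0)]
branch_states = lattice_rnn(3, branching, embed)
```

Because the same cell is applied to every transition, the parameters of a sequential RNN can be reused unchanged; only the pooling step is specific to lattices.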

2.3 Lattice Language Modeling

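Language models usually estimate $p(X)$ by factorizing it into

$$p(X) = \prod_{t=1}^{|X|} p(x_t \mid X_{<t}),$$

where $X_{<t} = [x_1, \ldots, x_{t-1}]$ denotes the previous context. Training a LM is essentially asking the model to predict a distribution of the next word given the previous words. We extend the sequential LM analogously to lattice language modeling, where the model is expected to predict the next transitions of a node $n$ given $L_{<n}$, the sub-lattice of all paths from the starting node to $n$. The ground truth distribution is therefore defined as:

$$p(w \mid L_{<n}) = \begin{cases} P(e), & \text{if } \exists\, e \in \mathrm{out}[n] \text{ s.t. } w[e] = w, \\ 0, & \text{otherwise.} \end{cases}$$

LatticeRNN is adopted as the backbone of our lattice language model. Since the node representation $h[n]$ encodes all information of $L_{<n}$, we pass $h[n]$ to a linear decoder to obtain the distribution of next transitions:

$$p_\theta(w \mid h[n]) = \mathrm{softmax}(W^\top h[n])_w,$$

where $\theta = \{\Phi, W\}$; $\Phi$ denotes the parameters of the LatticeRNN and $W$ denotes the trainable parameters of the decoder. We train our lattice language model by minimizing the KL divergence between the ground truth distribution $p(w \mid L_{<n})$ and the predicted distribution $p_\theta(w \mid h[n])$.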

Note that the objective for training a sequential LM is a special case of the lattice language modeling objective defined above, where the inputs are linear-chain lattices. Hence, a sequential LM can be viewed as a lattice LM trained on linear-chain lattices only. This property inspires us to pre-train our lattice LM in the two-stage fashion described below.
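The special-case relationship can be checked numerically with a short sketch of the per-node objective. It assumes a softmax over the decoder outputs and, as a simplification of our own, sums the probabilities of outgoing transitions that share a word. The helper names and the toy logits are hypothetical.

```python
import math


def target_distribution(outgoing, vocab):
    """Ground truth next-word distribution at a node.

    outgoing: list of (word, prob) pairs for the transitions leaving the node.
    Words without an outgoing transition get probability zero.
    """
    p = {word: 0.0 for word in vocab}
    for word, prob in outgoing:
        p[word] += prob  # assumption: merge duplicate words by summing
    return p


def softmax(logits):
    """Numerically stable softmax over a dict of word -> logit."""
    top = max(logits.values())
    exps = {word: math.exp(value - top) for word, value in logits.items()}
    total = sum(exps.values())
    return {word: value / total for word, value in exps.items()}


def kl_divergence(p, q):
    """KL(p || q), skipping words with zero target probability."""
    return sum(pw * math.log(pw / q[word]) for word, pw in p.items() if pw > 0.0)


vocab = ["book", "look", "a"]
predicted = softmax({"book": 2.0, "look": 1.0, "a": 0.0})  # toy decoder outputs

# Lattice node with two competing outgoing transitions.
lattice_target = target_distribution([("book", 0.7), ("look", 0.3)], vocab)
lattice_loss = kl_divergence(lattice_target, predicted)

# Linear-chain node: one outgoing transition with probability 1, so the
# KL divergence reduces to the usual negative log-likelihood of that word.
chain_target = target_distribution([("a", 1.0)], vocab)
chain_loss = kl_divergence(chain_target, predicted)
nll = -math.log(predicted["a"])
```

In the linear-chain case the target is one-hot, so `chain_loss` and `nll` agree up to floating-point error, which is the sense in which sequential LM training is a special case of the lattice objective.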

2.4 Two-Stage Pre-Training

Inspired by ULMFiT Howard and Ruder (2018), we propose a two-stage pre-training method to train our lattice language model. The proposed method is illustrated in Figure 2.

  • Stage 1: Pre-train on sequential texts
    In the first stage, we follow the recent trend of pre-trained LMs by pre-training a bidirectional LSTM Hochreiter and Schmidhuber (1997) LM on a general-domain text corpus. Here the cell architecture is the same as ELMo Peters et al. (2018).

  • Stage 2: Pre-train on lattices
    In this stage, we use a bidirectional LatticeLSTM with the same cell architecture as the LSTM pre-trained in the previous stage. Note that in the backward direction we use reversed lattices as input. We initialize the weights of the LatticeLSTM with the weights of the pre-trained LSTM. The LatticeLSTM is further pre-trained on lattices from the training set of the target task with the lattice language modeling objective described above.

We consider this two-stage method more approachable and efficient than directly pre-training a lattice LM on a large number of lattices because 1) general-domain written data is much easier to collect than lattices, which require spoken data, and 2) LatticeRNNs are less efficient than RNNs because their computation is difficult to parallelize.

2.5 Target Task Classifier Training

After pre-training, our model is capable of providing representations for lattices. Following Peters et al. (2018), the pre-trained lattice LM is used to produce contextualized node embeddings for downstream classification tasks, as illustrated in the right part of Figure 2. We use the same strategy as Peters et al. (2018) to linearly combine the hidden states from different layers into a representation for each node. The classifier is a newly added 2-layer LatticeLSTM, which takes the node representations as input, followed by max-pooling over nodes, a linear layer and finally a softmax layer. We use the cross entropy loss to train the classifier on each target classification task. Note that the parameters of the pre-trained lattice LM are fixed during this stage.
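The layer combination and the pooling head can be sketched as follows. The sketch assumes the ELMo-style combination is a softmax-normalized weighted sum of layers scaled by a scalar, and it omits the 2-layer LatticeLSTM classifier that sits between the mixing and the pooling in the described model. All function names, weights, and toy vectors are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def scalar_mix(layers, mix_logits, gamma=1.0):
    """Combine LM layers into one vector per node.

    layers[j][n] is the hidden vector of node n at layer j; the layer
    weights are the softmax of mix_logits and gamma is a global scale.
    """
    weights = softmax(mix_logits)
    num_nodes, dim = len(layers[0]), len(layers[0][0])
    return [[gamma * sum(weights[j] * layers[j][n][d] for j in range(len(layers)))
             for d in range(dim)]
            for n in range(num_nodes)]


def max_pool(node_vectors):
    """Element-wise maximum over all nodes."""
    return [max(column) for column in zip(*node_vectors)]


def classify(pooled, weight, bias):
    """Linear layer followed by softmax over classes."""
    logits = [sum(w * x for w, x in zip(row, pooled)) + b
              for row, b in zip(weight, bias)]
    return softmax(logits)


# Two LM layers, two nodes, two dimensions (toy numbers).
layers = [
    [[1.0, 0.0], [0.0, 2.0]],
    [[3.0, 2.0], [2.0, 0.0]],
]
mixed = scalar_mix(layers, mix_logits=[0.0, 0.0])  # equal layer weights
pooled = max_pool(mixed)
probs = classify(pooled, weight=[[1.0, 0.0], [0.0, 1.0]], bias=[0.0, 0.0])
```

Max-pooling over nodes gives a fixed-size vector regardless of how many nodes the lattice has, which is what allows a single linear layer to follow it.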

Input           Model                         ATIS    SNIPS   SWDA    MRDA
Manual          (a) biLSTM                    -       97.00   71.19   79.99
Manual          (b) (a) + ELMo                -       96.80   72.18   81.48
Lattice oracle  (c) biLSTM                    92.97   94.02   63.92   70.49
Lattice oracle  (d) (c) + ELMo                96.21   95.14   65.14   73.34
ASR 1-Best      (e) biLSTM                    91.60   91.89   60.54   67.35
ASR 1-Best      (f) (e) + ELMo                94.99   91.98   61.65   68.52
ASR 1-Best      (g) BERT-base                 95.97   93.29   61.23   67.90
Lattices        (h) biLatticeLSTM             91.69   93.43   61.29   69.95
Lattices        (i) Proposed                  95.84   95.37   62.88   72.04
Lattices        (j) (i) w/o Stage 1           94.65   95.19   61.81   71.71
Lattices        (k) (i) w/o Stage 2           95.35   94.58   62.41   71.66
Lattices        (l) (i) evaluated on 1-best   95.05   92.40   61.12   68.04
Table 2: Results of our experiments in terms of accuracy (%). Some audio files in ATIS are missing, so the testing sets of manual transcripts and ASR transcripts are different. Hence, we do not report the results for ATIS using manual transcripts. The best results obtained by using ASR output for each dataset are marked in bold.
Statistic       ATIS     SNIPS    SWDA      MRDA
Train           4,478    13,084   103,326   73,588
Valid           500      700      8,989     15,037
Test            869      700      15,927    14,800
#Classes        22       7        43        5
WER (%)         15.55    45.61    28.41     32.04
Oracle WER (%)  9.19     18.79    17.15     21.53
Table 1: Data statistics.

3 Experiments

To evaluate the quality of the pre-trained lattice LM, we conduct experiments on two common tasks in spoken language understanding.

3.1 Tasks and Datasets

Intent detection and dialogue act recognition are two common tasks in spoken language understanding. The benchmark datasets used for intent detection are ATIS (Airline Travel Information Systems) Hemphill et al. (1990); Dahl et al. (1994); Tur et al. (2010) and SNIPS Coucke et al. (2018). We use the NXT-format of the Switchboard Stolcke et al. (2000) Dialogue Act Corpus (SWDA) Calhoun et al. (2010) and the ICSI Meeting Recorder Dialogue Act Corpus (MRDA) Shriberg et al. (2004) for benchmarking dialogue act recognition. The SNIPS corpus only contains written text, so we synthesize a spoken version of the dataset using a commercial text-to-speech service. We use an ASR system trained on WSJ Paul and Baker (1992) with Kaldi Povey et al. (2011) to transcribe ATIS, and an ASR system released by Kaldi to transcribe the other datasets. The statistics of the datasets are summarized in Table 1. All tasks are evaluated with overall classification accuracy.

3.2 Model and Training Details

In order to conduct a fair comparison with ELMo Peters et al. (2018), we directly adopt their pre-trained model as our pre-trained sequential LM. The hidden size of the LatticeLSTM classifier is set to 300. We use Adam as the optimizer with learning rate 0.0001 for LM pre-training and 0.001 for training the classifier. The checkpoint with the best validation accuracy is used for evaluation.

3.3 Results

The results in terms of classification accuracy are shown in Table 2. All reported numbers are averaged over at least three training runs. Rows (a) and (b) can be considered the performance upper bound, where we use manual transcripts to train and evaluate the models. We also use BERT-base Devlin et al. (2019) as a strong baseline, which takes ASR 1-best as the input (row (g)). Compared with the results on manual transcripts, using ASR results largely degrades the performance due to recognition errors, as shown in rows (e)-(g). In addition, adding pre-trained ELMo embeddings brings consistent improvement over the biLSTM baseline, except for SNIPS when using manual transcripts (row (b)). The baseline models trained on ASR 1-best are also evaluated on lattice oracle paths. We report the results as the performance upper bound for the baseline models (rows (c)-(d)).

In the lattice setting, the baseline bidirectional LatticeLSTM Ladhak et al. (2016) (row (h)) consistently outperforms the biLSTM with 1-best input (row (e)), demonstrating the importance of taking lattices into account. Our proposed method achieves the best results on all datasets except for ATIS (row (i)), with relative error reduction ranging from 3.2% to 42% compared to biLSTM+ELMo (row (f)). The proposed method also achieves performance comparable to BERT-base on ATIS. We perform an ablation study for the proposed two-stage pre-training method and report the results in rows (j) and (k). It is clear that skipping either stage degrades the performance on all datasets, demonstrating that both stages are crucial in the proposed framework. We also evaluate the proposed model on 1-best results (row (l)). The results show that it is still beneficial to use lattices as input after fine-tuning.
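The quoted range of relative error reduction can be reproduced from the accuracies in rows (f) and (i) of Table 2. The short sketch below treats the error rate as 100 minus accuracy; the function and variable names are ours.

```python
def relative_error_reduction(baseline_acc, proposed_acc):
    """Relative reduction in error rate (in %), with error = 100 - accuracy."""
    baseline_err = 100.0 - baseline_acc
    proposed_err = 100.0 - proposed_acc
    return 100.0 * (baseline_err - proposed_err) / baseline_err


# Accuracies (%) copied from Table 2.
row_f = {"ATIS": 94.99, "SNIPS": 91.98, "SWDA": 61.65, "MRDA": 68.52}  # biLSTM + ELMo
row_i = {"ATIS": 95.84, "SNIPS": 95.37, "SWDA": 62.88, "MRDA": 72.04}  # proposed

reduction = {name: relative_error_reduction(row_f[name], row_i[name])
             for name in row_f}
# Approximately: ATIS 17.0, SNIPS 42.3, SWDA 3.2, MRDA 11.2
```

The smallest reduction comes from SWDA and the largest from SNIPS, which matches the reported range of 3.2% to 42%.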

4 Conclusion

In this paper, we propose a spoken language representation learning framework that learns contextualized representations of lattices. We introduce the lattice language modeling objective and a two-stage pre-training method that efficiently trains a neural lattice language model to provide the downstream tasks with contextualized lattice representations. The experiments show that our proposed framework is capable of providing high-quality representations of lattices, yielding consistent improvement on SLU tasks.


Acknowledgements

We thank the reviewers for their insightful comments. This work was financially supported by the Young Scholar Fellowship Program of the Ministry of Science and Technology (MOST) in Taiwan, under Grant 109-2636-E-002-026.


References

  • J. Buckman and G. Neubig (2018) Neural lattice language models. Transactions of the Association for Computational Linguistics 6, pp. 529–541. Cited by: §1.
  • S. Calhoun, J. Carletta, J. M. Brenier, N. Mayo, D. Jurafsky, M. Steedman, and D. Beaver (2010) The nxt-format switchboard corpus: a rich resource for investigating the syntax, semantics, pragmatics and prosody of dialogue. Language resources and evaluation 44 (4), pp. 387–419. Cited by: §3.1.
  • A. Chronopoulou, C. Baziotis, and A. Potamianos (2019) An embarrassingly simple approach for transfer learning from pretrained language models. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 2089–2095. Cited by: §1.
  • A. Coucke, A. Saade, A. Ball, T. Bluche, A. Caulier, D. Leroy, C. Doumouro, T. Gisselbrecht, F. Caltagirone, T. Lavril, et al. (2018) Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces. arXiv preprint arXiv:1805.10190. Cited by: §3.1.
  • D. A. Dahl, M. Bates, M. Brown, W. Fisher, K. Hunicke-Smith, D. Pallett, C. Pao, A. Rudnicky, and E. Shriberg (1994) Expanding the scope of the atis task: the atis-3 corpus. In Proceedings of the workshop on Human Language Technology, pp. 43–48. Cited by: §3.1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Cited by: §1, §3.3.
  • C. Goo, G. Gao, Y. Hsu, C. Huo, T. Chen, K. Hsu, and Y. Chen (2018) Slot-gated modeling for joint slot filling and intent prediction. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 753–757. Cited by: §1.
  • D. Guo, G. Tur, W. Yih, and G. Zweig (2014) Joint semantic utterance classification and slot filling with recursive neural networks. In 2014 IEEE Spoken Language Technology Workshop, pp. 554–559. Cited by: §1.
  • D. Hakkani-Tür, F. Béchet, G. Riccardi, and G. Tur (2006) Beyond ASR 1-best: using word confusion networks in spoken language understanding. Computer Speech & Language 20 (4), pp. 495–514. Cited by: §1.
  • C. T. Hemphill, J. J. Godfrey, and G. R. Doddington (1990) The ATIS spoken language systems pilot corpus. In Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24-27, 1990. Cited by: §3.1.
  • M. Henderson, M. Gašić, B. Thomson, P. Tsiakoulis, K. Yu, and S. Young (2012) Discriminative spoken language understanding using word confusion networks. In 2012 IEEE Spoken Language Technology Workshop, pp. 176–181. Cited by: §1.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation. Cited by: 1st item.
  • J. Howard and S. Ruder (2018) Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 328–339. Cited by: §1, §2.4.
  • C. Huang and Y. Chen (2019) Adapting pretrained transformer to lattices for spoken language understanding. In Proceedings of 2019 IEEE Automatic Speech Recognition and Understanding Workshop, pp. 845–852. Cited by: §1.
  • F. Ladhak, A. Gandhe, M. Dreyer, L. Mathias, A. Rastrow, and B. Hoffmeister (2016) LatticeRNN: recurrent neural networks over lattices. In Proceedings of INTERSPEECH, pp. 695–699. Cited by: §1, §2.2, §3.3.
  • R. Masumura, Y. Ijima, T. Asami, H. Masataki, and R. Higashinaka (2018) Neural confnet classification: fully neural network based spoken utterance classification using word confusion networks. In Proceedings of 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 6039–6043. Cited by: §1.
  • G. Mesnil, Y. Dauphin, K. Yao, Y. Bengio, L. Deng, D. Hakkani-Tur, X. He, L. Heck, G. Tur, D. Yu, and G. Zweig (2014) Using recurrent neural networks for slot filling in spoken language understanding. IEEE/ACM Transactions on Audio, Speech, and Language Processing 23 (3), pp. 530–539. Cited by: §1.
  • D. B. Paul and J. M. Baker (1992) The design for the wall street journal-based csr corpus. In Proceedings of the Workshop on Speech and Natural Language, HLT ’91. External Links: ISBN 1-55860-272-0 Cited by: §3.1.
  • M. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 2227–2237. Cited by: §1, 1st item, §2.5, §3.2.
  • D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, et al. (2011) The Kaldi speech recognition toolkit. Technical report Cited by: §3.1.
  • A. Radford (2018) Improving language understanding by generative pre-training. Cited by: §1.
  • E. Shriberg, R. Dhillon, S. Bhagat, J. Ang, and H. Carvey (2004) The ICSI meeting recorder dialog act (MRDA) corpus. In Proceedings of the 5th SIGdial Workshop on Discourse and Dialogue at HLT-NAACL 2004, Cambridge, Massachusetts, USA, pp. 97–100. Cited by: §3.1.
  • A. Siddhant, A. Goyal, and A. Metallinou (2018) Unsupervised transfer learning for spoken language understanding in intelligent agents. arXiv preprint arXiv:1811.05370. Cited by: §1.
  • M. Sperber, G. Neubig, N. Pham, and A. Waibel (2019) Self-attentional models for lattice inputs. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 1185–1197. Cited by: §1.
  • A. Stolcke, K. Ries, N. Coccaro, E. Shriberg, R. Bates, D. Jurafsky, P. Taylor, R. Martin, C. V. Ess-Dykema, and M. Meteer (2000) Dialogue act modeling for automatic tagging and recognition of conversational speech. Computational Linguistics 26 (3), pp. 339–374. Cited by: §3.1.
  • G. Tür, A. Deoras, and D. Z. Hakkani-Tür (2013) Semantic parsing using word confusion networks with conditional random fields. In Proceedings of INTERSPEECH. Cited by: §1.
  • G. Tur, D. Hakkani-Tür, and L. Heck (2010) What is left to be understood in ATIS?. In Proceedings of 2010 IEEE Spoken Language Technology Workshop (SLT), pp. 19–24. Cited by: §3.1.
  • G. Tur, J. Wright, A. Gorin, G. Riccardi, and D. Hakkani-Tür (2002) Improving spoken language understanding using word confusion networks. In Seventh International Conference on Spoken Language Processing. Cited by: §1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 6000–6010. Cited by: §1.
  • F. Xiao, J. Li, H. Zhao, R. Wang, and K. Chen (2019) Lattice-based transformer encoder for neural machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 3090–3097. Cited by: §1.
  • K. Yao, B. Peng, Y. Zhang, D. Yu, G. Zweig, and Y. Shi (2014) Spoken language understanding using long short-term memory neural networks. In 2014 IEEE Spoken Language Technology Workshop, pp. 189–194. Cited by: §1.
  • P. Zhang, N. Ge, B. Chen, and K. Fan (2019) Lattice transformer for speech translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 6475–6484. Cited by: §1.
  • Y. Zhang and J. Yang (2018) Chinese NER using lattice LSTM. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1554–1564. Cited by: §1.