Improving N-gram Language Models with Pre-trained Deep Transformer

by Yiren Wang, et al.

Although n-gram language models (LMs) have been outperformed by state-of-the-art neural LMs, they are still widely used in speech recognition because of their high inference efficiency. In this paper, we demonstrate that n-gram LMs can be improved by neural LMs through a text generation based data augmentation method. In contrast to previous approaches, we employ large-scale general-domain pre-training followed by in-domain fine-tuning to construct deep Transformer based neural LMs. A large amount of in-domain text data is generated with the well-trained deep Transformer to construct new n-gram LMs, which are then interpolated with baseline n-gram systems. Empirical studies on different speech recognition tasks show that the proposed approach can effectively improve recognition accuracy. In particular, our proposed approach brings significant relative word error rate reduction up to 6.0



1 Introduction

N-gram language models (LMs) are widely used in automatic speech recognition (ASR) systems due to their simplicity and high efficiency in inference. However, n-gram LMs suffer from a performance bottleneck caused by poor generalization to unseen n-grams and an inability to capture long-range dependencies. Neural language models [3, 9, 14] have overcome such deficiencies with distributed representation learning in continuous space and have achieved state-of-the-art language modeling performance. However, their high computational cost increases inference latency and makes it hard to integrate neural LMs directly into the first-pass decoding of an ASR system. Instead, neural LMs are commonly used in second-pass rescoring, where they have been shown to improve recognition accuracy [5, 4, 19]. Still, since the N-best list or lattice for rescoring generally depends on the n-gram LMs used in first-pass decoding, improving the performance of n-gram LMs is of great importance.

Different approaches have been proposed to improve n-gram LMs with neural LMs, including probability based methods that directly convert the probabilities of neural LMs into n-gram LMs [2, 18, 1], and text generation based methods that use shallow recurrent neural networks to generate text for n-gram training [16]. Empirical studies have shown that the latter generally leads to better performance [1]. However, previous work in this line has not fully leveraged state-of-the-art deep neural networks such as the deep Transformer [17, 11], and is less applicable to situations where the in-domain data are too limited to train a good neural LM.

From another perspective, constructing good LMs depends on adequate high-quality training data. Unfortunately, in many cases only limited in-domain data are accessible, which makes the data sparsity problem even more severe for n-gram LMs and also introduces optimization difficulties when training neural LMs. Effectively leveraging rich general-domain corpora could help ease this challenge and make it possible to construct high-capacity neural networks with good generalization.

In this paper, we propose a new text generation based data augmentation approach that fully utilizes large datasets and high-capacity neural networks to improve n-gram LMs. Inspired by recent advances in language model pre-training [11, 6, 12], where the key idea is to pre-train the model on an unlabeled corpus and then fine-tune it on different supervised downstream tasks, we introduce a general-domain pre-training followed by in-domain fine-tuning strategy to construct high-capacity neural LMs for text generation. Specifically, a deep Transformer based neural LM is first pre-trained on a large general-domain corpus and then fine-tuned on the in-domain dataset. The well-trained neural LM is then used to generate a large amount of high-quality text to construct a synthetic corpus. The new n-gram LMs trained on the synthetic datasets are interpolated with the baseline n-gram LMs and used for ASR decoding.

This paper has two main contributions: (1) We are the first to leverage a deep Transformer for text generation to improve ASR systems. In contrast to previous methods using shallow feedforward or recurrent neural networks [16], our choice of a deep Transformer network, which has been shown to excel at capturing long-term dependencies in text [17, 11, 6], results in stronger modeling ability and therefore better generation quality. (2) We propose a general-domain pre-training followed by in-domain fine-tuning strategy, which takes full advantage of the combination of large datasets and high-capacity models. Previous approaches with standard in-domain training are likely to encounter difficulties when there is very limited in-domain training data, either in optimization, which leads to sub-optimal performance, or in generalization, which results in generated text with poor diversity. The pre-training strategy helps overcome such limitations, making our approach more robust and generally applicable to different practical scenarios.

Experiments on two datasets show that our approach can effectively improve recognition accuracy over strong baseline ASR systems and over existing LSTM based neural LM data augmentation methods. In particular, our proposed approach brings a significant relative word error rate reduction of up to 6.0% for domains with limited in-domain data, which shows that our approach is very effective at improving speech recognition systems on new domains without extra effort to collect in-domain data manually.

2 Related Work

Neural LMs

Neural language models were proposed to overcome the curse of dimensionality by learning distributed word representations and the probability function of word sequences [3]. Different neural network architectures have been proposed, including feedforward NNs [3], RNNs [9, 10], LSTMs [14] and Transformers [17, 11], among which the self-attention based Transformer is the state-of-the-art architecture for many sequence modeling tasks due to its superiority in capturing longer-range linguistic structure. Although still rarely used in first-pass ASR decoding due to their high inference latency, neural LMs have shown effectiveness in many other applications such as natural language generation [15, 8].

LM pre-training

Recently emerging language model pre-training approaches, such as OpenAI GPT [11], GPT2 [12] and BERT [6], have shown effectiveness in improving many natural language processing tasks. The key idea is to pre-train a deep Transformer network on unlabeled corpora and then fine-tune the parameters on the downstream tasks. These approaches, however, mainly focus on the semi-supervised learning paradigm that leverages unsupervised pre-training to improve supervised downstream tasks. In contrast, we focus mainly on the language modeling task and leverage general-domain pre-training to promote in-domain modeling.

Improving n-gram LMs

Different methods have been proposed to improve n-gram LMs with neural LMs, including converting a feedforward neural LM into an n-gram LM by directly assigning the probabilities [1], and converting a recurrent neural network (RNN) LM into a backoff LM, with quality further improved by an iterative approach [2]. The closest line of work to ours leverages multiple RNNLMs from different domains to generate text data for improving n-gram LMs [16]. However, their use of shallow RNN models and in-domain training restricts the generation ability. In contrast, our pre-trained deep Transformer is able to generate text of higher quality by capturing longer-range dependencies, and with better diversity through better model generalization.

3 Approach

In this section, we introduce the details of the proposed data augmentation method for improving n-gram LMs.

3.1 Overall Pipeline

The overall pipeline of the proposed approach is depicted in Fig. 1 (left). It consists of four steps: pre-training, fine-tuning, generation and interpolation. Specifically, we first pre-train a deep Transformer on the large general-domain corpus and then fine-tune it on the target-domain dataset. We use the resulting neural LM to generate a large amount of high-quality in-domain text data, which is then used to construct a synthetic dataset for n-gram LM training. The new n-gram LMs are finally interpolated with the baseline n-gram LMs and evaluated in the ASR system.
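The interpolation step can be illustrated with a minimal Python sketch. It assumes plain linear interpolation of the two LMs' probabilities with a single weight tuned on development data; the text above does not specify the interpolation scheme, so `interpolate`, `dev_log_likelihood`, `pick_weight` and the weight grid are hypothetical names and choices made only for illustration.

```python
import math

def interpolate(p_base, p_synth, lam):
    """Linearly mix two LM probabilities for the same word and history.

    lam is the weight on the LM trained on synthetic text (an assumption;
    the paper does not state the interpolation formula).
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    return lam * p_synth + (1.0 - lam) * p_base

def dev_log_likelihood(pairs, lam):
    """pairs holds (p_base, p_synth) for every token of a development set."""
    return sum(math.log(interpolate(pb, ps, lam)) for pb, ps in pairs)

def pick_weight(pairs, grid=None):
    """Choose the weight that maximizes development-set log-likelihood."""
    if grid is None:
        grid = [i / 10 for i in range(1, 10)]  # hypothetical search grid
    return max(grid, key=lambda lam: dev_log_likelihood(pairs, lam))
```

In practice the mixing would be done by an n-gram toolkit over full back-off models; the sketch only shows how a single weight trades off the two sources.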

Figure 1: (left) Overall pipeline of the proposed data augmentation approach. (right) Transformer architecture for neural LM.

3.2 Pre-training and Fine-tuning

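We propose a general-domain pre-training followed by in-domain fine-tuning strategy. Given a large and diverse collection $\mathcal{D}_g = \{x^{(1)}, \dots, x^{(|\mathcal{D}_g|)}\}$, where each $x^{(i)} = (x^{(i)}_1, \dots, x^{(i)}_{T_i})$ is a sequence of word or subword units, we use the standard left-to-right language modeling objective to maximize the likelihood:

$$\mathcal{L}(\mathcal{D}_g) = \sum_{i=1}^{|\mathcal{D}_g|} \sum_{t=1}^{T_i} \log P\left(x^{(i)}_t \mid x^{(i)}_{<t}; \theta\right) \qquad (1)$$

where the conditional probability $P$ is modeled by a neural network with parameters $\theta$ and $x^{(i)}_{<t}$ is the history up to position $t$.

Fine-tuning is then performed for each domain of interest. Given an in-domain collection $\mathcal{D}_d$, the model is trained using the same objective as in Eqn. 1. The model parameters $\theta$ are initialized from the pre-trained model and further optimized on $\mathcal{D}_d$ until convergence.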
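As a small illustration of the left-to-right objective, the sketch below sums the log conditional probability of every token given its history over a corpus. The `cond_prob` callable stands in for the neural LM and `uniform_model` is a toy placeholder; both are assumptions for illustration, not part of the described system.

```python
import math

def sequence_log_likelihood(seq, cond_prob):
    """Sum of log P(x_t | x_<t) over one token sequence."""
    total = 0.0
    for t, token in enumerate(seq):
        history = tuple(seq[:t])  # all tokens before position t
        total += math.log(cond_prob(token, history))
    return total

def corpus_log_likelihood(corpus, cond_prob):
    """Left-to-right LM objective: sum over all sequences in the collection."""
    return sum(sequence_log_likelihood(seq, cond_prob) for seq in corpus)

def uniform_model(vocab_size):
    """Toy stand-in for a neural LM: every token is equally likely."""
    def cond_prob(token, history):
        return 1.0 / vocab_size
    return cond_prob
```

Pre-training and fine-tuning maximize this same quantity; only the collection changes, and fine-tuning starts from the pre-trained parameters rather than from a random initialization.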

In this work, we use the deep Transformer decoder [11], a variant of the Transformer [17], as the architecture for our neural LM. As illustrated in Fig. 1 (right), the model is composed of a stack of Transformer blocks. Each block has two types of basic layers: (1) a multi-head self-attention layer, which generates an adaptive weighted sum of the input hidden representations from the previous layer; (2) a feedforward layer, which applies a non-linear transformation to the hidden vector. Each basic layer is associated with layer normalization, a residual connection and dropout.

This is the first work that leverages a deep Transformer for text generation for ASR. Intuitively, the deep Transformer is superior to previous shallow feedforward or recurrent neural LMs in two ways: (1) the self-attention mechanism eases the challenge of learning long-range dependencies, which is particularly important for high-quality text generation; (2) the high model capacity and depth lead to better modeling and generalization ability. Our proposed pre-training strategy helps overcome the previous limitation of insufficient in-domain training data by making use of widely available general-domain data, and makes it possible to construct such a strong neural LM.

3.3 Text Generation

We generate a large amount of text data with the well-trained neural LM. The generation is performed by sampling from the model distribution given a prefixed context. Specifically, we first construct a prefix corpus $\mathcal{C} = \{c^{(1)}, \dots, c^{(|\mathcal{C}|)}\}$, where each $c^{(j)} = (x_1, \dots, x_k)$ consists of $k$ prefix tokens. During text generation, $c^{(j)}$ is fed into the model as the prefixed context. The neural LM produces probabilities over the vocabulary $V$:

$$P(x_t = w \mid x_{<t}) = \frac{\exp(z_w / \tau)}{\sum_{w' \in V} \exp(z_{w'} / \tau)}$$

where $z$ is the logit vector and $\tau$ is the temperature for sampling. The output tokens are then sampled from this probability distribution.
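The sampling procedure can be sketched in Python as follows. This is a minimal illustration of temperature-scaled softmax sampling from a prefix, assuming a hypothetical `next_logits` callable that returns one logit per vocabulary entry for the current token sequence; the stopping rule (`eos_id`, `max_new_tokens`) is also an assumption.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, temperature, rng):
    """Draw one token id from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate(prefix, next_logits, temperature, max_new_tokens, eos_id, rng):
    """Extend a prefix by sampling until eos_id or the length budget is reached."""
    tokens = list(prefix)
    for _ in range(max_new_tokens):
        token = sample_token(next_logits(tokens), temperature, rng)
        if token == eos_id:
            break
        tokens.append(token)
    return tokens
```

Passing a seeded `random.Random` instance as `rng` makes a generation run reproducible, which is convenient when comparing temperatures.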

Rule-based data filtering is performed on the generated synthetic corpus to ensure data quality for n-gram LM training. We design multiple rules, including filtering by maximum and minimum sequence length, by out-of-vocabulary words, by domain-specific keywords, and by the number of duplicated generations. The thresholds for each filtering rule are selected based on the data distribution of the in-domain training data.
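A minimal sketch of such a filter is given below. The exact rules and thresholds are not specified in the text, so everything here is an assumption: the function name `filter_generated`, whitespace tokenization, treating the keyword rule as "keep only sentences containing at least one required keyword", and capping how many copies of an identical sentence are kept.

```python
from collections import Counter

def filter_generated(sentences, vocab, min_len, max_len, max_duplicates,
                     required_keywords=None):
    """Apply length, OOV, keyword and duplicate-count rules to generated text."""
    seen = Counter()
    kept = []
    for sentence in sentences:
        words = sentence.split()
        # Rule 1: sequence length must fall inside [min_len, max_len].
        if not min_len <= len(words) <= max_len:
            continue
        # Rule 2: drop sentences containing out-of-vocabulary words.
        if any(word not in vocab for word in words):
            continue
        # Rule 3 (assumed semantics): require at least one domain keyword.
        if required_keywords and not any(w in required_keywords for w in words):
            continue
        # Rule 4: keep at most max_duplicates copies of the same sentence.
        seen[sentence] += 1
        if seen[sentence] > max_duplicates:
            continue
        kept.append(sentence)
    return kept
```

Thresholds such as `min_len` and `max_len` would be read off the length distribution of the in-domain training data rather than fixed by hand.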

4 Experiment

We evaluate the effectiveness of the proposed approach on two speech recognition datasets. We compare the performance of the deep Transformer with shallow LSTMs, as well as the pre-training strategy with traditional in-domain training. As we will show, the proposed combination of the pre-training strategy and the deep Transformer leads to substantial improvements.

4.1 Datasets

We conduct experiments on two in-house speech recognition datasets: the speech assistant dataset (denoted as Assistant) and the conversational speech dataset (denoted as Conversation). Both datasets are completely anonymized, and no user-identifiable information (UII) is accessible to either annotators or researchers.

Assistant

The Assistant dataset consists of English utterances that are commands users give to Facebook Portal (a video-enabled smart device) after the wake word “Hey Portal” to carry out certain actions. The utterances can be categorized into various sub-domains by the type of action, such as making phone calls to friends (calling), device control (device), or getting weather information (weather). We use a mixed set of utterances that are randomly sampled from both internal dogfooding and Facebook Portal live traffic. Internal dogfooding is activity from internal employees who have signed agreements to have their activity reviewed and tested. We exclude some domains that contain limited utterance patterns, such as calling, because enriching the training data is not helpful for these domains. All sampled utterances are voice-morphed before being sent to annotators for transcription. In total, we use a collection of k utterances as training data, k as development data, and k as test data.

Conversation The Conversation dataset was collected through crowd-sourcing. It consists of conversations between each pair of crowd-sourcers with more than topics that are commonly mentioned in daily life, including family, travel, etc. We split the data into training (k), development (k), and test (k) sets.

General-domain Pre-training

We use a large in-house English text corpus as general-domain data for neural LM pre-training, which contains a random sample of M public posts and comments users shared on Facebook. We use byte pair encoding (BPE) [13] to segment word tokens into subword units, forming a subword vocabulary used for both the Assistant and Conversation datasets. We directly converted the text data into a machine-readable format for model training and did not manually inspect the actual content.

4.2 Experiment Setups

Baselines

We compare our proposed approach with two baselines: (1) the baseline n-gram LM without augmentation (Baseline) [7], and (2) data augmentation with text generated by an LSTM trained on in-domain data [16] (LSTM (in-domain)).

ASR System

We use a state-of-the-art hybrid ASR system that uses multi-layer Latency Controlled Bidirectional Long Short-Term Memory RNNs (LC-BLSTM) for acoustic modeling with grapheme representations. It uses pruned 4-gram LMs in first-pass decoding with an in-house dynamic decoder, where the final LMs are interpolated from LMs trained on both in-domain and general-domain training data. For each approach on each dataset, we optimize all model hyper-parameters on the development sets.

Model Settings We adopt the GPT configuration following [11], with the dimension of word embeddings, hidden states and non-linear layers set as , and respectively. The numbers of both decoder blocks and attention heads are set as , and the dropout rate is . For the LSTM baseline, we adopt a model with similar model size as Transformer for fair comparison. We use a stack of LSTM layers, where the dimension of word embeddings, hidden states set as and respectively. The dropout rate is . We use the Adam optimization scheme following [11]. The models are trained on

V100 GPUs, and based on the PyTorch implementation of Transformer


Text Generation We extract the prefix sequences with tokens from the in-domain training data, where . We keep the top sampling hypotheses with length penalty set as . The temperature for sampling is set as for the best balance of generation quality and diversity.

Evaluation

We interpolate the new n-gram LM with the baseline n-gram LMs, evaluate each method by the word error rate (WER) of the ASR system, and report relative WER reductions (WERR) over the baseline approach.
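For concreteness, the two metrics can be computed as in the following sketch. It uses the standard definitions (word-level edit distance divided by the number of reference words, and the relative change with respect to the baseline WER); the function names are hypothetical, and the numbers in the usage note are illustrative only, not results from this paper.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

def wer(refs, hyps):
    """Corpus-level WER: total edits divided by total reference words."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    words = sum(len(r) for r in refs)
    return errors / words

def relative_wer_reduction(wer_baseline, wer_new):
    """WERR in percent; positive values mean the new system is better."""
    return 100.0 * (wer_baseline - wer_new) / wer_baseline
```

As an illustrative example with made-up numbers, a system that lowers WER from 10.0 to 9.4 has a WERR of 6.0 percent under this definition.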

4.3 Assistant

LSTM (in-domain)
Transformer (in-domain)
Transformer (pre-trained)
Table 1: The overall relative word error rate reduction (WERR) for each data augmentation approach on Assistant.

Device | Weather | Music
LSTM (in-domain)
Transformer (in-domain)
Transformer (pre-trained)
Table 2: Relative word error rate reduction (WERR) for each data augmentation approach on different Assistant sub-domains, including device, weather and music.

Pre-trained | Fine-tuned
LSTM (in-domain)
Transformer (in-domain)
Transformer (pre-trained)
Table 3: Word-level perplexity of neural LMs on Assistant test set.

Domain  | Examples
Music   | replay the current track
Music   | what album is this track from
Music   | play french playlist on spotify
Weather | what’s the hourly forecast for today
Weather | what’s the weather in youngstown ohio
Weather | what’s the temperature in delray beach florida
Table 4: Examples generated by Transformer pre-trained and fine-tuned on Assistant in-domain data. The examples are excluded from the in-domain training data.

We report the overall relative word error rate reduction over the baseline approach on Assistant in Table 1, and on several sub-domains, including device, weather and music, in Table 2. From these tables we make the following observations:

1. The proposed data augmentation approaches effectively improve the overall quality of n-gram LMs in the ASR systems. In particular, our pre-trained deep Transformer achieves over relative reduction in WER, which significantly outperforms the LSTM-based approach.

2. The proposed approach is particularly beneficial for sub-domains with less training data. Due to the imbalance of the Assistant dataset, some sub-domains like weather and music are important yet have only a small training collection. The proposed approach with the pre-trained Transformer achieves over and relative WER reduction in the weather and music sub-domains, and outperforms augmentation with an LSTM or a Transformer trained without pre-training by a large margin. For large domains such as device, we observe smaller gains from pre-training, as the in-domain data is already sufficient to train a neural LM with good performance. However, we can still see that the Transformer outperforms the LSTM, and that pre-training further improves the performance slightly.

3. The improvements come from both the deep Transformer architecture and the pre-training strategy. A WER reduction is observed when replacing the LSTM with a deep Transformer network for the neural LM, which indicates the superiority of the model architecture. With the pre-training strategy, the model capacity is utilized even better, resulting in the best ASR decoding performance.

We further present a detailed analysis of the different neural LMs. Table 3 shows the perplexity of neural LMs with different architectures and training strategies, which verifies that the deep Transformer has better modeling performance than the previous LSTM/RNN, and demonstrates that the general-domain pre-training and in-domain fine-tuning strategy is an important component of high-quality deep model construction. Table 4 presents multiple generated examples in the music and weather sub-domains that are not included in the in-domain Assistant training collection. The examples illustrate that the pre-trained neural LM can generate high-quality text for data augmentation and enrich the sequence patterns, which helps ease the problem of data sparsity for n-gram LMs.
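Because the neural LMs operate on subword units while the reported perplexity is word-level, the sketch below shows one common way to make the two consistent: sum the log-probabilities of all subword tokens and normalize by the number of words instead of the number of subwords. This normalization, and the question of whether end-of-sentence tokens are counted, are assumptions; the text does not describe the exact computation.

```python
import math

def word_level_perplexity(token_log_probs, num_words):
    """Perplexity normalized by word count.

    token_log_probs: one list of natural-log subword probabilities per sentence.
    num_words: total number of words in the evaluated text (assumed to be
    counted on the original, unsegmented sentences).
    """
    if num_words <= 0:
        raise ValueError("num_words must be positive")
    total_log_prob = sum(sum(sentence) for sentence in token_log_probs)
    return math.exp(-total_log_prob / num_words)
```

Normalizing by words keeps perplexities comparable across models with different subword vocabularies, since the number of words in the test set does not depend on the segmentation.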

4.4 Conversation

LSTM (in-domain) M
Transformer (in-domain) M
Transformer (pre-trained) M
LSTM (in-domain) M
Transformer (in-domain) M
Transformer (pre-trained) M
Table 5: Relative word error rate reduction (WERR) over the baseline approach on Conversation test set. “#Aug” denotes the number of augmented training data.

Pre-trained | Fine-tuned
LSTM (in-domain)
Transformer (in-domain)
Transformer (pre-trained)
Table 6: Word-level perplexity of neural LMs on Conversation test set. “pre-trained” denotes model pre-trained on general background data and then fine-tuned on Conversation dataset. “in-domain” denotes model trained only on Conversation.

We further evaluate the approach on the Conversation dataset, which has a much smaller training collection (k) than Assistant (k), with more complex and diverse patterns. As can be seen from Table 6, the pre-trained deep Transformer is clearly superior to neural LMs trained with the traditional scheme when in-domain training data is scarce.

The performances are presented in Table 5. With million instances of synthetic in-domain training corpus, the proposed approach with pre-trained deep Transformer achieves over relative WER reduction, compared with relative reduction of Transformer and of LSTM with traditional in-domain training strategy. The performance continues to grow when we enlarge the volume of generated data to millions, and achieves over relative WER reduction over the strong baseline system.

These results corroborate our motivation and demonstrate that:

1. The proposed approach is simple yet effective in improving n-gram LMs in ASR. The amount of augmented training data can easily be scaled up for further decoding performance improvements at minimal computational cost.

2. The general-domain pre-training followed by in-domain fine-tuning strategy is the key component of the proposed method. The advantage of pre-training over the traditional in-domain training strategy is twofold: (i) Large-scale general pre-training enables the construction of a state-of-the-art deep Transformer rather than shallow RNNs [16], which leads to strong neural LMs with large model capacity and better generalization that generate text with both high quality and good diversity. (ii) When in-domain training data is scarce, direct in-domain training results in sub-optimal performance of the neural LMs (Table 6). The pre-training strategy overcomes this problem, making the approach more robust and generally applicable to different scenarios.

5 Conclusion

In this paper, we introduce a text generation based data augmentation approach that effectively improves n-gram LMs and achieves better ASR decoding accuracy. Our contributions are twofold: (1) We are the first to leverage a deep Transformer for text generation for ASR systems; (2) We propose a general-domain pre-training followed by in-domain fine-tuning strategy that enables us to fully leverage large corpora and high-capacity neural networks. The approach is general and widely applicable to different data domains to help improve the first-pass decoding accuracy of ASR systems.


  • [1] H. Adel, K. Kirchhoff, N. T. Vu, D. Telaar, and T. Schultz (2014) Comparing approaches to convert recurrent neural networks into backoff language models for efficient decoding. In Fifteenth Annual Conference of the International Speech Communication Association.
  • [2] E. Arısoy, S. F. Chen, B. Ramabhadran, and A. Sethy (2013) Converting neural network language models into back-off language models for efficient decoding in automatic speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 22 (1), pp. 184–192.
  • [3] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin (2003) A neural probabilistic language model. Journal of Machine Learning Research 3 (Feb), pp. 1137–1155.
  • [4] W. Chan, N. Jaitly, Q. Le, and O. Vinyals (2016) Listen, attend and spell: a neural network for large vocabulary conversational speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4960–4964.
  • [5] A. Deoras, T. Mikolov, and K. Church (2011) A fast re-scoring strategy to capture long-distance dependencies. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1116–1127.
  • [6] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186.
  • [7] D. Le, X. Zhang, W. Zheng, C. Fügen, G. Zweig, and M. Seltzer (2019) From senones to chenones: tied context-dependent graphemes for hybrid speech recognition. In IEEE Automatic Speech Recognition and Understanding Workshop.
  • [8] R. Masumura, T. Asami, T. Oba, H. Masataki, S. Sakauchi, and A. Ito (2015) Combinations of various language model technologies including data expansion and adaptation in spontaneous speech recognition. In Sixteenth Annual Conference of the International Speech Communication Association.
  • [9] T. Mikolov, M. Karafiát, L. Burget, J. Černockỳ, and S. Khudanpur (2010) Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association.
  • [10] T. Mikolov, S. Kombrink, L. Burget, J. Černockỳ, and S. Khudanpur (2011) Extensions of recurrent neural network language model. In 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5528–5531.
  • [11] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever (2018) Improving language understanding by generative pre-training. Technical report, OpenAI.
  • [12] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. OpenAI Blog 1 (8).
  • [13] R. Sennrich, B. Haddow, and A. Birch (2016) Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1715–1725.
  • [14] M. Sundermeyer, R. Schlüter, and H. Ney (2012) LSTM neural networks for language modeling. In Thirteenth Annual Conference of the International Speech Communication Association.
  • [15] I. Sutskever, J. Martens, and G. E. Hinton (2011) Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 1017–1024.
  • [16] M. Suzuki, N. Itoh, T. Nagano, G. Kurata, and S. Thomas (2019) Improvements to n-gram language model using text generated from neural language model. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7245–7249.
  • [17] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
  • [18] R. Wang, M. Utiyama, I. Goto, E. Sumita, H. Zhao, and B. Lu (2013) Converting continuous-space language models into n-gram language models for statistical machine translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 845–850.
  • [19] W. Xiong, L. Wu, F. Alleva, J. Droppo, X. Huang, and A. Stolcke (2018) The Microsoft 2017 conversational speech recognition system. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5934–5938.