1 Introduction and problem statement
Today’s state-of-the-art in language modeling for ASR relies on neural Language Models (LM) [1, 2, 3], capable of handling continuous space and thereby outperforming traditional Back-off N-gram LMs (BNLM). BNLMs cannot exploit long context based syntactic dependencies and are also less flexible in terms of generalization for unseen cases, as semantic knowledge (such as embeddings reflecting similarity) is not captured while training them.
Neural LMs, however, have an undesired property: they are computationally very heavy in decoding. They therefore cannot be used effectively in a single decoding pass and are instead exploited by rescoring lattices obtained from a first decoding pass with a BNLM. Information is lost during this first pass, as the pruning of the recognition network is based only on short-context syntax, discarding both longer-context syntactic and nearly all semantic knowledge. A further problem is the increased latency caused by the two decoding passes, which hampers deployment under strict online requirements.
To reduce these limitations in exploiting neural LMs for ASR, several solutions have been proposed [4, 5, 6]. In  it was shown that using the neural LM to generate an augmented training corpus to train an improved BNLM is the best performing strategy. Such a BNLM trained on an augmented corpus can be used in a single pass or in the first pass of decoding. These are sometimes called approximative models, as they try to capture the knowledge of the neural model through their augmented training corpus. Suzuki et al.  use a domain-balanced mixture of the training corpora to train a shallow RNNLM for text generation and improve speech recognition results for Japanese, Korean and English. Wang et al.  report using a general-domain pre-trained Transformer  to augment text corpora used to train LMs. They demonstrate that the pre-trained and fine-tuned Transformer performs significantly better in data augmentation than LSTMs or simple in-domain Transformer models.
Another burden of language modeling for morphologically rich languages is that their syntactic properties differ from those of English. Heavy agglutination results in much larger vocabularies, which is a problem in itself but also causes others: individual word forms occur less often, and hence the training corpus must be enlarged accordingly to maintain the predictive power of the dataset. Moreover, as suffixes express grammatical relations usually provided by word order in English, morphologically rich languages tend to be more permissive in choosing word order, leading to higher variation. This badly impairs BNLM estimation and may also make word embeddings less powerful in terms of syntactic and semantic consistency, even when long context windows are used.
To alleviate these problems linked to the different organization of morphologically rich languages, subword unit modeling is an often used alternative. Subword unit based ASR has been demonstrated to improve WER for several morphologically rich languages [12, 13]. Suzuki et al.  use a subword approach for data augmentation to enrich text corpora for BNLM training, but compose these subwords back into words to prepare the final LM, unlike our approach, which retokenizes words into subword units in the final LM.
In this paper we aim to improve LM for an online call center ASR system in the morphologically rich Hungarian. We use parliamentary text to pre-train a GPT-2 structure Transformer LM , and fine-tune it on the target domain. With this model we generate training text for a BNLM. We demonstrate that such Transformer based data augmentation is efficient in morphologically rich Hungarian, if vocabulary is large enough and a large BNLM is used. Retokenizing the augmented training corpus to subword units, and training a subword-based BNLM on it, we demonstrate that (i) the ASR accuracy further improves compared to the word based baseline augmented BNLM, and (ii) the footprint and complexity of the resulting subword unit augmented BNLM significantly decrease. As subword unit LMs are known to perform better on a wide range of morphologically rich languages, we hypothesize that our approach is transferable to other such languages. We consider as novelties of our paper the following: (i) we propose the retokenization of the Transformer augmented LM training corpus; (ii) we are the first to use the GPT-2 Transformer structure to augment LM training corpora; (iii) we are the first to apply a Transformer based LM for a Hungarian ASR task; (iv) we demonstrate that the subword-based neural text augmentation can be exceptionally efficient in modeling OOV words.
2 Datasets and models
In-domain training data is extracted from the Hungarian Call Center Speech Database (HCCSD) consisting of anonymised telephone customer service calls and the corresponding manual transcripts. We selected 290 hours of recordings for training the acoustic model of our ASR system (see Table 1). The in-domain LMs are built on the transcripts of the training set containing 3.4M word tokens and 100k unique word forms. As the available in-domain training text data is very limited, we also utilize a general text corpus for pre-training the Transformer LM, which was collected from the website of the Hungarian National Assembly (https://www.parlament.hu/) and contains official transcripts of parliamentary speeches.
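| Corpus | Statistic | Training | Validation | Evaluation |
|---|---|---|---|---|
| In-domain | # of word tokens | 3,401,775 | 45,773 | 66,312 |
| In-domain | word OOV rate [%] | – | 2.7 | 2.5 |
| General | # of word tokens | 57,601,277 | – | – |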
For testing purposes, another 20 hours of transcribed conversations were selected from the HCCSD and split into two disjoint sets (see Table 1). The validation set (~7.5 hours) is used for hyperparameter optimization of the tokenizers and the language models. The evaluation set (~12 hours) is reserved for comparing the performance of the different modeling approaches addressed in this paper (see Sec. 4).
2.2 Back-off n-gram language models
Count-based back-off language models (BNLMs) have low computational cost and fit well into the Weighted Finite-State Transducer (WFST) framework; hence, they are still widely used in online, single-pass ASR systems. We carry out training and interpolation of BNLMs with the SRI language modeling toolkit, applying Chen and Goodman’s modified Kneser-Ney discounting . After experimenting with different n-gram orders on the development set, we found 4-gram models to be the optimal choice for both word and subword BNLMs.
2.3 Transformer language model
Recently, Transformer architectures have proven to be particularly successful in generating well-structured, high-quality texts thanks to the self-attention mechanism and the depth of the model [14, 17]. In order to generate augmentation text for our ASR task, we applied one of the most promising Transformer architectures, OpenAI GPT-2 , as implemented in HuggingFace’s Transformers library . The GPT-2 architecture has four variants of different sizes, from which we opted for the medium one with a total of 345M parameters.
GPT-2 medium is a 24-layer decoder-only Transformer with 16 attention heads and 1024-dimensional states. For regularization purposes it applies embedding, attention and residual dropouts with a rate of 0.1. We apply the Adam optimization scheme  with an initial learning rate of 1e-4 and a linear decay schedule. We pre-train the model on the general training corpus for 15 epochs using minibatches of 16 sequences consisting of 512 tokens each. Fine-tuning of the pre-trained model took 4 epochs on the in-domain training set with the same hyperparameters as in pre-training. Tokenization was performed with a byte-level Byte Pair Encoding (BPE) model with 30k vocabulary items (256 bytes + 29744 merges) trained on the in-domain training set.
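To make the model size concrete, the following sketch counts the parameters of a GPT-2 style decoder from the figures given above (24 layers, 1024-dimensional states, 30k vocabulary). It is only a back-of-the-envelope estimate: the function name, the assumption of 1024 learned position embeddings, and the tying of input and output embeddings are taken from the standard GPT-2 design rather than from this paper. With the original 50,257-item GPT-2 vocabulary the count comes out slightly above the nominal 345M figure.

```python
def gpt2_param_count(n_layer, d_model, vocab_size, n_positions):
    """Count trainable parameters of a GPT-2 style decoder (tied input/output embeddings)."""
    attention = 4 * d_model * d_model + 4 * d_model   # QKV and output projections, with biases
    mlp = 8 * d_model * d_model + 5 * d_model         # two linear layers with 4x hidden expansion
    layer_norms = 4 * d_model                         # two LayerNorms (scale + bias) per block
    per_block = attention + mlp + layer_norms
    embeddings = vocab_size * d_model + n_positions * d_model
    final_layer_norm = 2 * d_model
    return n_layer * per_block + embeddings + final_layer_norm


# Vocabulary arithmetic stated in the text: 256 byte symbols plus 29744 merges.
BYTE_SYMBOLS, MERGES = 256, 29744
vocab_size = BYTE_SYMBOLS + MERGES

# Assumption: 1024 position embeddings, as in the standard GPT-2 configuration.
original_medium = gpt2_param_count(24, 1024, 50257, 1024)
hungarian_medium = gpt2_param_count(24, 1024, vocab_size, 1024)

print(f"vocabulary size: {vocab_size}")
print(f"GPT-2 medium, original vocabulary: {original_medium / 1e6:.1f}M parameters")
print(f"GPT-2 medium, 30k vocabulary:      {hungarian_medium / 1e6:.1f}M parameters")
```

Under these assumptions, the smaller vocabulary removes roughly 20M embedding parameters, while the Transformer blocks themselves are unchanged.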
3 Data augmentation
3.1 Neural text generation
Generation of a text sequence is initialized with a prefix prompt, which we sample from the in-domain training set. The number of words in the sampled prefix varies randomly to balance the trade-off between free and constrained text generation. For the same reason, the temperature is also varied randomly between 1.0 and 1.5. We generate two large corpora (1 billion words each) for data augmentation purposes: one with the pre-trained and then fine-tuned Transformer LM (TR) described in Sec. 2.3, and one with a Transformer trained directly on the in-domain corpus without pre-training (TR-noPT).
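The sampling scheme described above can be sketched as follows. This is an illustrative sketch, not the generation code used in the experiments: the function names are hypothetical, and the prefix length range of 1 to 3 words is a placeholder, since the actual range is not available in the text. The second function shows the effect of the temperature: dividing the logits by a value above 1.0 flattens the next-token distribution, which makes the generated text less constrained.

```python
import math
import random


def sample_generation_settings(sentence, rng, min_prefix=1, max_prefix=3,
                               t_low=1.0, t_high=1.5):
    """Pick a random-length word prefix of a training sentence and a random temperature.

    min_prefix and max_prefix are placeholder values, not taken from the paper.
    """
    words = sentence.split()
    upper = max(1, min(max_prefix, len(words)))
    lower = min(min_prefix, upper)
    n_words = rng.randint(lower, upper)
    temperature = rng.uniform(t_low, t_high)
    return " ".join(words[:n_words]), temperature


def softmax_with_temperature(logits, temperature):
    """Turn logits into a sampling distribution; a higher temperature flattens it."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
prefix, temperature = sample_generation_settings("megbeszélem a nejemmel", rng)
print(f"prompt: {prefix!r}, temperature: {temperature:.2f}")

toy_logits = [2.0, 1.0, 0.0]
probs_sharp = softmax_with_temperature(toy_logits, 1.0)
probs_flat = softmax_with_temperature(toy_logits, 1.5)
print([round(p, 3) for p in probs_sharp])
print([round(p, 3) for p in probs_flat])
```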
3.2 Word-based data augmentation
The fact that text corpora generated by RNNLMs can improve the accuracy of n-gram language models has been shown by several studies before [5, 7, 8, 21]. However, Wang and her colleagues  were the first to apply general domain pre-training and in-domain fine-tuning of a Transformer LM to improve the effectiveness of the data augmentation process. For that reason, we summarize their original, word-based data augmentation process in this section. In the next section, we propose an extended version of the augmentation process that is better suited to morphologically rich languages.
The original, word-based version of the neural text based data augmentation process is shown on the left side of Fig. 1 (white boxes). First, a large corpus is generated by the Transformer LM (pre-trained on a general text corpus and fine-tuned on the in-domain text). Based on this generated text, a BNLM (TR-BNLM) is trained, which approximates the short-term dependencies learned by the Transformer. To further improve the model, the TR-BNLM can be interpolated with a BNLM trained on the in-domain text (BNLM + TR-BNLM).
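The interpolation step can be illustrated with a minimal sketch. It assumes plain linear interpolation of the two models' word probabilities and a grid search over the weight on held-out data; the function names and the toy probabilities are invented for illustration, and the actual toolkit operates on whole n-gram models rather than on lists of probabilities.

```python
import math


def interpolate(p_in_domain, p_generated, lam):
    """Linearly interpolate two probabilities assigned to the same word in the same context."""
    return lam * p_in_domain + (1.0 - lam) * p_generated


def perplexity(prob_pairs, lam):
    """Perplexity of the interpolated model on held-out (p_in_domain, p_generated) pairs."""
    log_sum = sum(math.log(interpolate(p1, p2, lam)) for p1, p2 in prob_pairs)
    return math.exp(-log_sum / len(prob_pairs))


def best_weight(prob_pairs, grid=None):
    """Pick the interpolation weight with the lowest held-out perplexity."""
    grid = grid or [i / 10 for i in range(1, 10)]
    return min(grid, key=lambda lam: perplexity(prob_pairs, lam))


# Made-up probabilities that the in-domain BNLM and the TR-BNLM assign to four held-out words.
held_out = [(0.20, 0.05), (0.01, 0.10), (0.30, 0.25), (0.02, 0.08)]

lam = best_weight(held_out)
print(f"best weight for the in-domain model: {lam}")
print(f"in-domain only: {perplexity(held_out, 1.0):.2f}")
print(f"generated only: {perplexity(held_out, 0.0):.2f}")
print(f"interpolated:   {perplexity(held_out, lam):.2f}")
```

In this toy case the interpolated model has lower perplexity than either component, because each model covers words that the other one underestimates.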
3.3 The proposed subword-based data augmentation
Morphologically rich languages have significantly larger vocabularies, as case endings usually reflect grammatical roles. A large vocabulary can be a problem in itself; however, it also increases data sparseness in the training data. A common remedy is to segment words into smaller units and train language models on these subword sequences [13, 22]. Since the subword language model is applied in an ASR system, a byte-level tokenizer is not suitable, as it does not preserve the multi-byte characters that abound in Hungarian. From the available data-driven character-level tokenizers, we chose Morfessor  as our subword tokenizer (i.e., its Python implementation called Morfessor 2.0 ), which is specifically designed for processing morphologically rich languages. In  it was shown that on an ASR task, Morfessor can even outperform the popular BPE tokenizer.
Our proposed extension to the word-based data augmentation process called subword-based neural text augmentation is depicted in Fig. 1 (grey boxes). The revised data augmentation process starts with training the Morfessor tokenizer on the in-domain dataset. The word-based generated text corpus and the in-domain training text are then segmented into subword sequences using the pre-trained tokenizer. In order to preserve word boundary information during the ASR decoding process, non-initial subwords were tagged with the ‘+’ sign. For instance, subword segmentation transforms the Hungarian sentence ‘megbeszélem a nejemmel’ (meaning ‘I will discuss it with my wife’) as follows: ‘meg +beszél +em a nejem +mel’. Based on the segmented text, we train BNLM models (Subword BNLM and subword TR-BNLM in Fig. 1), which can be interpolated for the best performance again (Subword BNLM + TR-BNLM).
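The word boundary tagging can be sketched in a few lines. The segmentation below is hard-coded to match the example in the text instead of being produced by a trained Morfessor model, and the function names are hypothetical. The second function shows why the ‘+’ tag is sufficient to recover word-level transcripts from subword-level ASR output.

```python
def tag_subwords(segmented_words):
    """Mark every non-initial subword of each word with a leading '+'."""
    tokens = []
    for morphs in segmented_words:
        tokens.append(morphs[0])
        tokens.extend("+" + morph for morph in morphs[1:])
    return " ".join(tokens)


def restore_words(subword_text):
    """Glue '+'-tagged subwords back onto the preceding token to recover words."""
    words = []
    for token in subword_text.split():
        if token.startswith("+") and words:
            words[-1] += token[1:]
        else:
            words.append(token)
    return " ".join(words)


# Hypothetical tokenizer output for 'megbeszélem a nejemmel', one list of morphs per word.
segmentation = [["meg", "beszél", "em"], ["a"], ["nejem", "mel"]]

tagged = tag_subwords(segmentation)
restored = restore_words(tagged)
print(tagged)
print(restored)
```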
4 Results and Discussion
4.1 ASR setup
High resolution MFCC vectors were used as input features for an LF-MMI trained Factored Time Delay Neural Network (TDNN-F) acoustic model [25, 26]. The matrix size (hidden-layer dimension) was 768 and the linear bottleneck dimension was 80, resulting in a total of 6M parameters in the 12 hidden layers. Phoneme-based acoustic and language model resources were compiled into WFSTs and decoded with the VoXerver  ASR decoder. The typical latency of the online decoding setup was measured to be around 250 ms.
| Model | WER [%] | WERR [%] |
|---|---|---|
| BNLM + RNN-BNLM | 21.3 | 2.7* |
| BNLM + TR-noPT-BNLM | 21.1 | 3.7* |
| BNLM + TR-BNLM | 20.6 | 5.9* |
| BNLM + TR-BNLM + RNN-BNLM | 20.4 | 6.8* |

An asterisk (*) indicates a significant difference compared to BNLM, tested with the Wilcoxon signed-rank test (p < 0.05).
4.2 Results with word-based augmentation
4.2.1 Comparing language modeling approaches
In our first experiment, we use the augmentation text in its original form without subword segmentation. Our goal is to compare the modeling capabilities of language models on in-vocabulary words, hence we limited the vocabulary of all models to the 100k words occurring in the in-domain training text. In our previous work , we augmented the same dataset with a corpus generated by a 2-layer LSTM RNNLM. The results from this former study are placed in this paper to serve as an advanced baseline (RNN-BNLM). All models were pruned to 1 GB runtime memory usage. The results are summarized in Table 2.
Without LM interpolation, neither the RNN-BNLM nor the TR-noPT-BNLM (Transformer without pre-training) model can outperform the baseline BNLM. Only the pre-trained TR-BNLM can reduce the Word Error Rate (WER) by around 2% relative. In contrast, with LM interpolation, all augmentation methods significantly reduce the WER of the baseline model. Using a recurrent model (BNLM + RNN-BNLM) or the non-pretrained Transformer (BNLM + TR-noPT-BNLM) for data augmentation results in similar Word Error Rate Reduction (WERR), with the Transformer model being slightly better (3.7% vs. 2.7% WERR). The pre-trained Transformer (BNLM + TR-BNLM), however, stands out among all other approaches, since it reduces the error rate by 6% relative. We also tested whether augmentation models can support each other and found that by applying RNN-BNLM and TR-BNLM simultaneously, an additional 1% of WERR can be obtained.
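The relative reductions quoted above follow directly from the WER values. The short sketch below recomputes them, taking the baseline BNLM WER of 21.9% from the concluding section of the paper and the interpolated model WERs from Table 2; the function and variable names are illustrative.

```python
def relative_wer_reduction(baseline_wer, system_wer):
    """Relative WER reduction (WERR) in percent with respect to a baseline."""
    return 100.0 * (baseline_wer - system_wer) / baseline_wer


BASELINE_WER = 21.9  # baseline BNLM, as reported in the conclusions

interpolated_wer = {
    "BNLM + RNN-BNLM": 21.3,
    "BNLM + TR-noPT-BNLM": 21.1,
    "BNLM + TR-BNLM": 20.6,
    "BNLM + TR-BNLM + RNN-BNLM": 20.4,
}

werr = {
    name: round(relative_wer_reduction(BASELINE_WER, wer), 1)
    for name, wer in interpolated_wer.items()
}

for name, value in werr.items():
    print(f"{name}: {value}% WERR")
```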
4.2.2 Extended word-based augmentation
In the previous section, we limited the vocabulary size of language models to 100k and pruned them to a maximum memory footprint of 1 GB for comparability reasons. In the following, we examine the performance of word-based augmented models without these limitations (see Fig. 2).
As can be seen, in a morphologically rich language like Hungarian, a 100k vocabulary is a strict limitation. By increasing the vocabulary size to 300k, we can reduce the WER by a relative 2% (from 20.6% to 20.2%), and by raising it to 1M, by a relative 3% (from 20.6% to 20.0%). If we reduce LM pruning and let the memory footprint increase from 1 GB to 4 GB, the WERR can go up to 4.5% (WER from 20.6% to 19.7%), but such an improvement requires an extremely large vocabulary of 3 million words. Thus, in a morphologically rich language, exploiting the full advantages of neural text generation based data augmentation comes at the cost of footprint: the large vocabulary and high memory consumption severely limit the practical applicability of the approach.
4.3 Results with subword-based augmentation
In order to lower the resource requirements of the augmented language model and utilize the generated text more efficiently, we apply subword LMs (see Sec. 3.3). While in the word-based case the OOV ratio is around 1% even with an extremely large 3-million-word vocabulary, the subword-based augmented language model (Subword BNLM + TR-BNLM) can fully cover the test set (0% OOV ratio) with only 40k vocabulary items.
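The difference in coverage can be illustrated with a toy OOV rate computation. The vocabularies and the test sentence below are invented for illustration, and the subword segmentation is hard-coded rather than produced by the trained tokenizer. An unseen inflected word form counts as OOV at the word level, whereas its subword units may all be known.

```python
def oov_rate(test_tokens, vocabulary):
    """Percentage of running test tokens that are missing from the vocabulary."""
    if not test_tokens:
        return 0.0
    missing = sum(1 for token in test_tokens if token not in vocabulary)
    return 100.0 * missing / len(test_tokens)


# Toy word-level case: the inflected form 'nejemmel' was never seen in training.
word_vocab = {"megbeszélem", "a", "nejem"}
word_tokens = "megbeszélem a nejemmel".split()

# Toy subword-level case: the same sentence after segmentation with '+' tagging.
subword_vocab = {"meg", "+beszél", "+em", "a", "nejem", "+mel"}
subword_tokens = "meg +beszél +em a nejem +mel".split()

word_oov = oov_rate(word_tokens, word_vocab)
subword_oov = oov_rate(subword_tokens, subword_vocab)
print(f"word OOV rate:    {word_oov:.1f}%")
print(f"subword OOV rate: {subword_oov:.1f}%")
```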
As shown in Fig. 3, subword modeling can reduce the WER of the 100k word-based model by up to 5% relative (from 20.6% to 19.6%). Moreover, the subword BNLM + TR-BNLM outperforms the 3-million-word vocabulary word-based model, reducing WER by 2% relative (from 20.0% to 19.6%). Both improvements were found to be statistically significant (p < 0.05). The WER of the subword-based model with 40k vocabulary and 1 GB memory consumption is roughly the same as the WER of the word-based model with 3M vocabulary items and 4 GB memory usage (19.6% vs. 19.7% WER). Thus, we can state that neural text generation based data augmentation with subword tokenization can be significantly more efficient than word-based augmentation in the case of a morphologically rich ASR task.
Just like in the case of word-based models (see Sec. 4.2.1), using the text generated by the recurrent model, we were able to achieve an additional average WER reduction of 1%.
4.4 OOV recognition analysis
The Transformer LM applied in our experiments uses byte-level BPE tokenization, so it can create new word forms when generating text for data augmentation. Thus, augmented language models become, to some extent, capable of recognizing Out-Of-Vocabulary (OOV) words as well. In this section, we compare this OOV recognition capability of the augmentation approaches. We consider OOV words to be those words that did not occur in the original in-domain training text (see Sec. 2.1). We evaluated the ASR outputs of word and subword-based augmentation approaches using information retrieval metrics (Precision, Recall, F1) .
The results are summarized in Fig. 4. The baseline BNLM and the word-based BNLM + TR-BNLM 100k models are not shown in the figure, since they are not capable of recognizing OOV words. As can be seen in Fig. 4, all models recognize OOVs with high precision, so it is not typical that OOV words get inserted or replace other words in the ASR transcript. What differs significantly between the examined systems is recall. As the vocabulary size of word-based models increases, so does the recall of OOV words. The 3-million-word vocabulary word-based augmented LM is capable of recognizing almost 22% of OOVs. However, the subword-based system can capture every 4th OOV word (~25% recall) with only 40k subwords in its vocabulary.
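One possible way to compute such metrics is sketched below. The matching strategy is an assumption, since the text does not specify how recognized OOV words are aligned with the reference: here an OOV word counts as correctly recognized if it occurs in both the reference and the ASR hypothesis of the same utterance. The function name and the toy data are hypothetical.

```python
from collections import Counter


def oov_retrieval_scores(references, hypotheses, training_vocab):
    """Precision, recall and F1 of OOV words, matched per utterance as bags of words."""
    tp = fp = fn = 0
    for ref, hyp in zip(references, hypotheses):
        ref_oov = Counter(w for w in ref.split() if w not in training_vocab)
        hyp_oov = Counter(w for w in hyp.split() if w not in training_vocab)
        matched = sum((ref_oov & hyp_oov).values())  # OOV words present on both sides
        tp += matched
        fp += sum(hyp_oov.values()) - matched        # OOV words produced but not spoken
        fn += sum(ref_oov.values()) - matched        # OOV words spoken but not recognized
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy data: 'a', 'b' and 'c' are in-vocabulary, every other token is an OOV word.
training_vocab = {"a", "b", "c"}
references = ["a x b", "y c z"]
hypotheses = ["a x b", "a c w"]

precision, recall, f1 = oov_retrieval_scores(references, hypotheses, training_vocab)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```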
We introduced an approach called subword-based neural text augmentation, which extends the Transformer based LM augmentation method presented in  to morphologically rich languages. With this new approach, we managed to improve the WER of our online ASR system on Hungarian call center conversations by more than 10% relative (from 21.9% to 19.6%). Our solution also outperforms the original, word-based data augmentation technique in terms of WER and OOV recognition capability while keeping the vocabulary size and memory requirements of the system low. Moreover, to the best of our knowledge, this is the first paper applying a GPT-2 Transformer to generate augmentation data for an ASR language model.
The research was supported by the CAMEP (2018-2.1.3-EUREKA-2018-00014) and NKFIH FK-124413 projects.
[1] K. Irie, A. Zeyer, R. Schlüter, and H. Ney, “Language Modeling with Deep Transformers,” in Interspeech 2019, 2019, pp. 3905–3909.
[2] T. Mikolov, M. Karafiat, L. Burget, J. Cernocky, and S. Khudanpur, “Recurrent Neural Network based Language Model,” in Interspeech 2010, 2010, pp. 1045–1048.
[3] M. Sundermeyer, R. Schlueter, and H. Ney, “LSTM Neural Networks for Language Modeling,” in Interspeech 2012, 2012, pp. 194–197.
[4] M. Singh, S. Virpioja, P. Smit, and M. Kurimo, “Subword RNNLM Approximations for Out-Of-Vocabulary Keyword Search,” in Interspeech 2019, 2019, pp. 4235–4239.
[5] A. Deoras, T. Mikolov, S. Kombrink, M. Karafiat, and S. Khudanpur, “Variational approximation of long-span language models for LVCSR,” in 2011 IEEE Int. Conf. Acoust. Speech Signal Process. IEEE, May 2011, pp. 5532–5535.
[6] E. Arisoy, S. F. Chen, B. Ramabhadran, and A. Sethy, “Converting neural network language models into back-off language models for efficient decoding in automatic speech recognition,” IEEE Trans. Audio, Speech Lang. Process., vol. 22, no. 1, pp. 184–192, 2014.
[7] H. Adel, K. Kirchhoff, N. T. Vu, D. Telaar, and T. Schultz, “Comparing approaches to convert recurrent neural networks into backoff language models for efficient decoding,” in Interspeech 2014, 2014, pp. 651–655.
[8] M. Suzuki, N. Itoh, T. Nagano, G. Kurata, and S. Thomas, “Improvements to N-gram Language Model Using Text Generated from Neural Language Model,” in ICASSP 2019 - 2019 IEEE Int. Conf. Acoust. Speech Signal Process., 2019, pp. 7245–7249.
[9] Y. Wang, H. Huang, Z. Liu, Y. Pang, Y. Wang, C. Zhai, and F. Peng, “Improving N-gram Language Models with Pre-trained Deep Transformer,” Nov. 2019.
[10] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Adv. Neural Inf. Process. Syst., 2017, pp. 5999–6009.
[11] B. Döbrössy, M. Makrai, B. Tarján, and G. Szaszák, “Investigating Sub-Word Embedding Strategies for the Morphologically Rich and Free Phrase-Order Hungarian,” in Proc. 4th Work. Represent. Learn. NLP. Stroudsburg, PA, USA: Association for Computational Linguistics, 2019, pp. 187–193.
[12] M. Creutz and K. Lagus, “Unsupervised discovery of morphemes,” in Proc. ACL-02 Work. Morphol. Phonol. Learn., vol. 6, Morristown, NJ, USA, 2002, pp. 21–30.
[13] M. Kurimo, A. Puurula, E. Arisoy, V. Siivola, T. Hirsimäki, J. Pylkkönen, T. Alumäe, and M. Saraclar, “Unlimited vocabulary speech recognition for agglutinative languages,” in HLT-NAACL ’06, Morristown, NJ, USA, 2007, pp. 487–494.
[14] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language models are unsupervised multitask learners,” OpenAI Blog, vol. 1, no. 8, p. 9, 2019.
[15] A. Stolcke, “SRILM – an extensible language modeling toolkit,” in Proc. Int. Conf. Spok. Lang. Process., Denver, US, 2002, pp. 901–904.
[16] S. F. Chen and J. Goodman, “An empirical study of smoothing techniques for language modeling,” Comput. Speech Lang., vol. 13, no. 4, pp. 359–393, Oct. 1999.
[17] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le, “XLNet: Generalized Autoregressive Pretraining for Language Understanding,” NeurIPS, pp. 1–18, Jun. 2019.
[18] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, and J. Brew, “HuggingFace’s Transformers: State-of-the-art Natural Language Processing,” ArXiv, vol. abs/1910.0, 2019.
[19] D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” 3rd Int. Conf. Learn. Represent. ICLR 2015 - Conf. Track Proc., pp. 1–15, Dec. 2014.
[20] R. Sennrich, B. Haddow, and A. Birch, “Neural Machine Translation of Rare Words with Subword Units,” in Proc. 54th Annu. Meet. Assoc. Comput. Linguist. (Volume 1: Long Pap.). Stroudsburg, PA, USA: Association for Computational Linguistics, 2016, pp. 1715–1725.
[21] B. Tarján, G. Szaszák, T. Fegyó, and P. Mihajlik, “Investigation on N-gram Approximated RNNLMs for Recognition of Morphologically Rich Speech,” in 7th Int. Conf. Stat. Lang. Speech Process. (SLSP 2019), 2019, pp. 223–234.
[22] P. Mihajlik, Z. Tüske, B. Tarján, B. Németh, and T. Fegyó, “Improved recognition of spontaneous Hungarian speech - morphological and acoustic modeling techniques for a less resourced task,” IEEE Trans. Audio, Speech Lang. Process., vol. 18, no. 6, pp. 1588–1600, Aug. 2010.
[23] S. Virpioja, P. Smit, S.-A. Grönroos, and M. Kurimo, “Morfessor 2.0: Python Implementation and Extensions for Morfessor Baseline,” Aalto University, Tech. Rep. September, 2013.
[24] P. Smit, S. Virpioja, and M. Kurimo, “Improved Subword Modeling for WFST-Based Speech Recognition,” in Interspeech 2017. ISCA, Aug. 2017, pp. 2551–2555.
[25] D. Povey, G. Cheng, Y. Wang, K. Li, H. Xu, M. Yarmohammadi, and S. Khudanpur, “Semi-Orthogonal Low-Rank Matrix Factorization for Deep Neural Networks,” in Interspeech 2018. ISCA, Sep. 2018, pp. 3743–3747.
[26] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, “The Kaldi Speech Recognition Toolkit,” in IEEE 2011 Work. Autom. Speech Recognit. Underst. IEEE Signal Processing Society, 2011.
[27] B. Tarján, P. Mihajlik, A. Balog, and T. Fegyó, “Evaluation of lexical models for Hungarian Broadcast speech transcription and spoken term detection,” in 2nd Int. Conf. Cogn. Infocommunications (CogInfoCom 2011), Budapest, Hungary, 2011.
[28] T. Fawcett, “An introduction to ROC analysis,” Pattern Recognit. Lett., vol. 27, no. 8, pp. 861–874, 2006.