Language modeling is a probabilistic description of language phenomena. It provides essential context to distinguish words that sound similar, and it therefore underlies one of the most useful applications of Natural Language Processing (NLP), especially in downstream tasks like Automatic Speech Recognition (ASR). Recurrent Neural Networks (RNNs), especially Long Short-Term Memory (LSTM) networks, have been the typical solution to language modeling, and they do achieve strong results. In spite of these results, their fundamental sequential computation constraint has restricted their use in modeling long-term dependencies in sequential data. To address these issues, the Transformer architecture was introduced. Transformers rely completely on an attention mechanism to form global dependencies between input and output. They also offer more parallelization and have achieved state-of-the-art (SOTA) results in language modeling, outperforming LSTM models.
In recent years, we have seen much development built on this standard Transformer model, particularly on unsupervised pre-training [3, 4, 5, 6, 7, 8], which has set state-of-the-art results on multiple NLP benchmarks. One such architecture is the Bidirectional Encoder Representations from Transformers (BERT) model, which uses a deep bidirectional Transformer.
Another architecture of interest would be the Transformer-XL, which introduces the notion of recurrence in a self-attention model.
The primary research focus, though, has mostly been on the English language, for which abundant data is available. It is interesting to see how these models perform on an agglutinative language like Finnish, which is morphologically richer than English.
In this project, we explore the implementation of Transformer-based models (BERT and Transformer-XL) in language modeling for Finnish. We use the same training data as in , so that we can make fair comparisons with the performance of the LSTM models. Also, as the BERT model is a bidirectional Transformer, we have to approximate the conditional probabilities given a sequence of words. We also experiment with using sub-word units with Transformer-XL to cope with the large-vocabulary problem of the Finnish language. With smaller units the modeled sequences become longer, and we hope that the recurrent XL architecture still allows us to model long-term effects. To the best of our knowledge, this is the first work on the Finnish language to present the following:
Approximation of perplexity using a BERT architecture
Using Transformer-XL architecture with sub-word units.
Comparison of Transformer and LSTM models as language models in the same comparable settings with an agglutinative language.
II Background & Methods
The goal of a language model is to assign meaningful probabilities to a sequence of words. Given a set of tokens $(w_1, w_2, \dots, w_N)$, where $N$ is the length of the sequence, our task is to estimate the joint conditional probability $P(w_1, \dots, w_N)$, which is

$P(w_1, \dots, w_N) = \prod_{i=1}^{N} P(w_i \mid c_i),$

where $c_i = (w_1, \dots, w_{i-1})$ is the context. An intrinsic evaluation of the performance of language models is perplexity (PPL), defined as the inverse probability of the set of tokens, normalized by taking the $N$-th root, where $N$ is the number of tokens:

$\mathrm{PPL} = P(w_1, \dots, w_N)^{-1/N} = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid c_i)}}.$
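For concreteness, the perplexity definition above can be sketched in a few lines of Python (a toy illustration, not part of the experiments):

```python
import math

def perplexity(token_probs):
    """Perplexity of a sequence from the per-token conditional
    probabilities P(w_i | context): the inverse of the joint
    probability, normalized by the N-th root. Computed in log
    space for numerical stability."""
    n = len(token_probs)
    log_prob = sum(math.log(p) for p in token_probs)
    return math.exp(-log_prob / n)

# A model that assigns probability 0.25 to every token has PPL ≈ 4,
# i.e. it is as uncertain as a uniform choice among 4 tokens.
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # ≈ 4.0
```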
In our two approaches we use Transformer-based architectures: BERT and Transformer-XL, as mentioned before. Calculating the auto-regressive joint probability for Transformer-XL is quite straightforward, as the model is unidirectional, but the probability does not factorize the same way for a bidirectional model like BERT.
BERT’s bidirectional context poses a problem for calculating an auto-regressive joint probability. A simple fix would be to mask all the tokens to the right of the current position and calculate the conditional factors as we would for a unidirectional model. By doing so, though, we lose the advantage of the bidirectional context that the BERT model enables. We propose an approximation of the joint probability as

$P(w_1, \dots, w_N) \approx \prod_{i=1}^{N} P(w_i \mid w_1, \dots, w_{i-1}, w_{i+1}, \dots, w_N),$

where each token is conditioned on both its left and right context.
This type of approximation has been previously explored with bidirectional RNN LMs , but not for deep Transformer models. We therefore define a pseudo-perplexity score from the approximated joint probability above.
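As a sketch, this pseudo-perplexity can be computed by masking each token in turn and scoring it with its full bidirectional context; `masked_prob` below is a hypothetical stand-in for a trained masked LM, not the actual model interface:

```python
import math

def pseudo_perplexity(tokens, masked_prob):
    """Pseudo-perplexity from the approximated joint probability:
    each token is masked in turn and scored with its full left AND
    right context. `masked_prob(tokens, i)` is a hypothetical
    stand-in for a masked LM that returns
    P(w_i | w_1..w_{i-1}, w_{i+1}..w_N)."""
    n = len(tokens)
    log_prob = sum(math.log(masked_prob(tokens, i)) for i in range(n))
    return math.exp(-log_prob / n)

# Toy "model": a uniform distribution over a 10-token vocabulary.
uniform = lambda tokens, i: 0.1
print(pseudo_perplexity(["two", "slipp+", "+er+", "+s"], uniform))  # ≈ 10.0
```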
The original BERT has two training objectives. The first is ’masked language modeling’, in which input tokens are masked at random and the model predicts the masked tokens using both left and right context. The second is the ’next sentence prediction’ task, which jointly trains text-pair representations. For training the masked language model, the original BERT used Byte Pair Encoding (BPE)  for subword tokenization: for example, the rare word ”unaffable” is split into more frequent subwords such as [”un”, ”##aff”, ”##able”]. To remain consistent with the experiments performed with LSTMs, we use Morfessor for subword tokenization of Finnish. In addition, we apply boundary markers as in (Table I) and train two separate models using this distinction. We train with the left-marked scheme, as the original BERT was trained with such a scheme, and with the left+right-marked scheme, as it was the previous SOTA for Finnish. For the Transformer-XL experiments, we train only with the left+right-marked scheme.
| Marking scheme | Example |
| left+right-marked (+m+) | two slipp+ +er+ +s |
| left-marked (+m) | two slipp +er +s |
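The two marking schemes of Table I can be sketched as simple transformations of a segmented word (an illustrative sketch, not the actual preprocessing code):

```python
def mark_left_right(morphs):
    """left+right-marked (+m+): non-final morphs get a trailing '+',
    non-initial morphs get a leading '+'."""
    out = []
    for i, m in enumerate(morphs):
        prefix = "+" if i > 0 else ""
        suffix = "+" if i < len(morphs) - 1 else ""
        out.append(prefix + m + suffix)
    return out

def mark_left(morphs):
    """left-marked (+m): non-initial morphs get a leading '+'."""
    return [("+" + m if i > 0 else m) for i, m in enumerate(morphs)]

# The segmented word "slippers" from Table I ("two" stays unsegmented):
print(mark_left_right(["slipp", "er", "s"]))  # ['slipp+', '+er+', '+s']
print(mark_left(["slipp", "er", "s"]))        # ['slipp', '+er', '+s']
```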
Next Sentence Prediction (NSP) is a binary classification task that predicts whether two segments follow each other in the original text. This pre-training task was proposed to further improve performance on downstream tasks such as Natural Language Inference (NLI), but in practice removing the NSP loss matches or slightly improves downstream task performance . In this paper, we have therefore omitted the NSP task from the BERT pre-training procedure and changed the input from a SEGMENT-PAIR input to a SINGLE-SEGMENT input, as seen in (Fig 1).
Transformer-XL introduced the notion of recurrence in self-attention by caching the hidden-state sequence of previous segments to compute the hidden states of a new segment. It also introduces a novel relative positional embedding scheme, and the two combined address the issue of fixed context lengths. As mentioned, Transformer-XL is a unidirectional deep Transformer architecture, so its perplexity can be calculated as in (Eq 2). The only change is in the input format, where we use sub-word units rather than whole-word units, as Finnish is morphologically richer than English.
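The segment-level recurrence can be sketched as follows; `encode` is a trivial placeholder for a transformer layer, so only the memory-caching mechanics are illustrated, not the attention computation itself:

```python
def encode(segment, memory):
    # Placeholder for a transformer layer: a real layer would attend
    # over the concatenation [memory; segment]. Here it returns the
    # segment unchanged so only the caching mechanics are visible.
    return list(segment)

def process(tokens, seg_len=4, mem_len=4):
    """Segment-level recurrence: the hidden states of each segment are
    cached (without gradients) and reused as extra context when the
    next segment is processed."""
    memory, outputs = [], []
    for start in range(0, len(tokens), seg_len):
        segment = tokens[start:start + seg_len]
        hidden = encode(segment, memory)
        outputs.extend(hidden)
        memory = (memory + hidden)[-mem_len:]  # keep the newest states
    return outputs, memory

out, mem = process(list(range(10)), seg_len=4, mem_len=4)
print(mem)  # → [6, 7, 8, 9]: the last four cached states
```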
The Finnish text data used for the language modeling task is provided by . The dataset consists mainly of newspapers and books, with around 144 million word tokens and 4.2 million unique tokens. We use Morfessor 2.0  with the basic unsupervised Morfessor Baseline algorithm  and a corpus weight parameter of 0.001. This gives a vocabulary of 34K subword tokens for the left+right-marked (+m+) scheme and 19K subword tokens for the left-marked (+m) scheme. We also pre-process the data to remove all punctuation marks so that the same data can be used with an ASR system. The input is one sentence per line, and we shuffle the sentences at each epoch. The data is randomly divided into a training dataset and a validation dataset. The test dataset consists of 2850 Finnish news articles obtained from the Finnish national broadcaster YLE.
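A minimal sketch of this pre-processing step (punctuation removal plus per-epoch shuffling); the exact pipeline and normalization rules used in the experiments are not specified here, so this is illustrative only:

```python
import random
import string

def preprocess(sentences, seed=0):
    """Strip ASCII punctuation (so the same data can also feed an
    ASR system) and shuffle the sentences, as done once per epoch."""
    table = str.maketrans("", "", string.punctuation)
    cleaned = [s.translate(table).strip() for s in sentences]
    random.Random(seed).shuffle(cleaned)
    return cleaned

print(preprocess(["Hyvää huomenta!", "Kiitos, näkemiin."]))
```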
IV Experiments & Results
All BERT experiments were trained for 500K steps. The code was written in Python, and we used the TensorFlow libraries to create the models. The experiments were run on a single NVIDIA Tesla V100 32 GB graphics card. The data was first processed into TensorFlow records as input to the model. The set of hyperparameters that we found optimal after experimenting with different configurations is given in (Table II).
| Hyperparameter | Value |
| Number of hidden layers | 20 |
| Hidden size of transformer | 896 |
| Number of attention heads | 16 |
| Intermediate size (size of the feed-forward layer) | 3584 |
| Hidden activation function | Gaussian Error Linear Units (GELU) |
| Max position embeddings | 300 |
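For reference, these hyperparameters can be written out in the bert_config.json format used by the original BERT release; the vocab_size value below (the roughly 19K left-marked subword vocabulary) is filled in for illustration only:

```python
import json

# Table II hyperparameters in the bert_config.json format of the
# original BERT release. vocab_size (≈19K left-marked subwords) is
# an illustrative value, not taken from the released config.
bert_config = {
    "num_hidden_layers": 20,
    "hidden_size": 896,
    "num_attention_heads": 16,
    "intermediate_size": 3584,
    "hidden_act": "gelu",
    "max_position_embeddings": 300,
    "vocab_size": 19000,
}

print(json.dumps(bert_config, indent=2))
```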
This set of parameters was chosen because its training performance on the long sub-word sequences was better than that of smaller models. We use the Adam optimizer , the same as the English BERT. A maximum sequence length of 300 covers 98 percent of the training data and also allows us to fit larger models on the GPU card. Hyper-parameter optimization is very difficult for these models, as they take around 15 days to train with the given resources. The hyperparameter choices were therefore largely based on the original BERT, with small tweaks. We assess the training performance of the models in (Table III).
| Model | Masked LM Loss | Masked LM Accuracy |
When we train the BERT model, we mask some percentage of the input tokens at random and then predict those masked tokens; this is known as masked LM. The masked LM loss refers specifically to the loss when the masked language model predicts the masked tokens, and the masked LM accuracy to the accuracy of those predictions. The losses of both our models are far from the masked LM loss of the English BERT; a key difference is the pre-training data: Google trained their model on 3.3 billion words from BooksCorpus  and the English Wikipedia, whereas our models are trained on 144 million words. Comparing the two Finnish models, the left-marked model has a better training performance than the left+right-marked model.
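The masked LM accuracy metric can be sketched as follows (an illustrative computation over token strings, not the actual evaluation code):

```python
def masked_lm_accuracy(predictions, targets, masked_positions):
    """Accuracy over the masked positions only: the fraction of masked
    tokens whose top prediction matches the reference token."""
    correct = sum(1 for i in masked_positions
                  if predictions[i] == targets[i])
    return correct / len(masked_positions)

preds   = ["two", "boot", "+er+", "+s"]
targets = ["two", "slipp+", "+er+", "+s"]
print(masked_lm_accuracy(preds, targets, masked_positions=[1, 3]))  # → 0.5
```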
The results of evaluating the above models on the test dataset with the pseudo-perplexity described in the previous section are in (Table IV). The test dataset comes from a different context than the training data, and interestingly both models are quite confident on it. The pseudo-perplexity values of the left-marked model are lower than those of the left+right-marked model, signifying that it is more confident.
We cannot directly compare the perplexity scores of the BERT model with those of a unidirectional LSTM model, as the two are calculated in different ways. We could instead compare it with a bidirectional LSTM, or use a downstream task to compare the performance of both. We could also randomly mask tokens and compare the prediction accuracy on the masked tokens.
All Transformer-XL experiments were likewise trained for 500K steps. The code was written in Python, and we used the PyTorch libraries for model creation. The experiments were run on a single NVIDIA Tesla V100 32 GB graphics card. Two sets of hyperparameters were chosen for comparison after some initial optimization; they are given in (Table V).
| Hyperparameter | Model 1 | Model 2 |
| Number of hidden layers | 4 | 4 |
| Hidden size of transformer | 512 | 1024 |
| Number of attention heads | 8 | 8 |
| Size of attention head | 80 | 128 |
| Intermediate size (size of the feed-forward layer) | 2048 | 4096 |
With this parameter choice, we wanted to test whether providing longer segment and memory lengths (a longer context) is more advantageous than a larger model. These parameters were chosen after some hyperparameter optimization. As for BERT, we use the Adam optimizer, but we also use a cosine annealing learning rate scheduler to speed up training . The training performance results are in (Table VI).
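The cosine annealing schedule can be sketched directly from its definition; the learning rate and step counts below are illustrative, not the paper's actual settings:

```python
import math

def cosine_annealing_lr(step, total_steps, lr_max, lr_min=0.0):
    """Cosine annealing (SGDR-style): the learning rate decays from
    lr_max to lr_min along half a cosine over the training run."""
    return lr_min + 0.5 * (lr_max - lr_min) * (
        1 + math.cos(math.pi * step / total_steps))

# Illustrative values, not the paper's actual settings.
print(cosine_annealing_lr(0, 500_000, 2.5e-4))        # start: lr_max
print(cosine_annealing_lr(250_000, 500_000, 2.5e-4))  # midpoint: ≈ lr_max/2
print(cosine_annealing_lr(500_000, 500_000, 2.5e-4))  # end: ≈ 0
```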
As opposed to BERT, the left+right-marked models have a better training performance than their counterparts. Interestingly, the larger model trains much better than the model with the larger context. The same set of parameters for the 32-32 model cannot be replicated for the 150-150 model, as the latter takes much more space on the GPU card. The test set is the same as that used with BERT, and the results are in (Table VII). The test performance mirrors the training performance, with the left+right-marked large model (32-32) performing best. We can directly compare these perplexity scores with the previous best , as both are unidirectional models; the Transformer-XL model outperforms it by 27%.
IV-C Result comparisons for Transformer architectures
The dramatically low scores of BERT indicate that the per-word predicted probability is higher than that of a unidirectional model. The predicted word probability distribution is thus much sharper than the XL model's distribution. At this point, we cannot say which model architecture has performed better, BERT or Transformer-XL, despite both achieving low perplexity scores. We would need to experiment with a downstream task in order to compare the model performances fairly.
The recent migration in language modeling from LSTM models to Transformer-based architectures is justified, as Transformer-XL obtains strong perplexity results. The BERT model also obtains very low pseudo-perplexity scores, but these are not directly comparable to those of unidirectional models. Our major contributions in this project are the use of the Transformer-XL architecture for the Finnish language in a sub-word setting, and the formulation of a pseudo-perplexity for the BERT model. Further comparisons between the Transformer architectures can be made through a downstream ASR task, which will be explored in future work.
-  S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
-  A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. u. Kaiser, and I. Polosukhin, “Attention is all you need,” in NIPS, pp. 5998–6008, Curran Associates, Inc., 2017.
-  A. Radford, “Improving language understanding by generative pre-training,” 2018.
-  J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: pre-training of deep bidirectional transformers for language understanding,” CoRR, vol. abs/1810.04805, 2018.
-  Z. Dai, Z. Yang, Y. Yang, J. G. Carbonell, Q. V. Le, and R. Salakhutdinov, “Transformer-xl: Attentive language models beyond a fixed-length context,” in ACL, 2019.
-  Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le, “Xlnet: Generalized autoregressive pretraining for language understanding,” 2019. arxiv:1906.08237.
-  M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, “Deep contextualized word representations,” CoRR, vol. abs/1802.05365, 2018.
-  J. Howard and S. Ruder, “Fine-tuned language models for text classification,” CoRR, vol. abs/1801.06146, 2018.
-  P. Smit, “Modern subword-based models for automatic speech recognition,” pp. 62 + app. 136, 2019.
-  X. Chen, A. Ragni, X. Liu, and M. Gales, “Investigating bidirectional recurrent neural network language models for speech recognition,” pp. 269–273, 08 2017.
-  P. Gage, “A new algorithm for data compression,” C Users J., vol. 12, p. 23–38, Feb. 1994.
-  R. Sennrich, B. Haddow, and A. Birch, “Neural machine translation of rare words with subword units,” CoRR, vol. abs/1508.07909, 2015.
-  Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “Roberta: A robustly optimized BERT pretraining approach,” CoRR, vol. abs/1907.11692, 2019.
-  CSC - IT Center for Science, “The helsinki korp version of the Finnish text collection, url: http://urn.fi/urn:nbn:fi:lb-2016050207,” 1998.
-  P. Smit, S. Virpioja, S.-A. Grönroos, and M. Kurimo, “Morfessor 2.0: Toolkit for statistical morphological segmentation,” in Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics, pp. 21–24, 2014.
-  M. Creutz and K. Lagus, “Unsupervised models for morpheme segmentation and morphology learning,” ACM Trans. Speech Lang. Process., vol. 4, Feb. 2007.
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR, vol. abs/1412.6980, 2014.
-  Y. Zhu, R. Kiros, R. S. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, and S. Fidler, “Aligning books and movies: Towards story-like visual explanations by watching movies and reading books,” CoRR, vol. abs/1506.06724, 2015.
-  I. Loshchilov and F. Hutter, “SGDR: stochastic gradient descent with restarts,” CoRR, vol. abs/1608.03983, 2016.
-  P. Smit, S. Virpioja, and M. Kurimo, “Advances in subword-based HMM-DNN speech recognition across languages,” submitted to Language Resources and Evaluation, 29 November 2018.