Finnish Language Modeling with Deep Transformer Models

03/14/2020 ∙ by Abhilash Jain, et al. ∙ aalto 0

Transformers have recently taken center stage in language modeling after LSTMs were considered the dominant model architecture for a long time. In this project, we investigate the performance of two Transformer architectures, BERT and Transformer-XL, on the language modeling task. We use a sub-word model setting with the Finnish language and compare it to the previous state-of-the-art (SOTA) LSTM model. BERT achieves a pseudo-perplexity score of 14.5, which is the first such measure achieved as far as we know. Transformer-XL improves the perplexity score to 73.58, which is 27% better than the LSTM model.




I Introduction

Language modeling is a probabilistic description of language phenomena. It provides essential context to distinguish words which sound similar and is therefore one of the most useful tools in Natural Language Processing (NLP), especially in downstream tasks like Automatic Speech Recognition (ASR). Recurrent Neural Networks (RNNs), especially Long Short-Term Memory (LSTM) networks [1], have been the typical solution to language modeling and achieve strong results. In spite of these results, their fundamentally sequential computation has restricted their use in modeling long-term dependencies in sequential data. To address these issues, the Transformer architecture was introduced. Transformers rely completely on an attention mechanism to form global dependencies between input and output. They also offer more parallelization and have achieved SOTA results in language modeling, outperforming LSTM models [2].

In recent years, we have seen a lot of development based on the standard Transformer model, particularly on unsupervised pre-training [3, 4, 5, 6, 7, 8], which has set state-of-the-art results on multiple NLP benchmarks. One such model architecture is the Bidirectional Encoder Representations from Transformers (BERT) model, which uses a deep bidirectional Transformer architecture.

Another architecture of interest would be the Transformer-XL, which introduces the notion of recurrence in a self-attention model.

The primary research focus, though, has been mostly on the English language, for which abundant data is available. It is interesting to see how these models perform for an agglutinative language like Finnish, which is morphologically richer than English.

In this project, we explore the implementation of Transformer-based models (BERT and Transformer-XL) in language modeling for Finnish. We use the same training data as in [9] so that we can make fair comparisons with the performance of the LSTM models. Also, as the BERT model is a bi-directional Transformer, we have to approximate the conditional probabilities given a sequence of words. We also experiment with using sub-word units with Transformer-XL to cope with the large vocabulary problems associated with the Finnish language. With smaller units, the modeled sequences are longer, and we hope that the recurrent XL architecture still allows us to model long-term effects. To the best of our knowledge, this is the first work with the Finnish language to use the following:

  • Approximation of perplexity using a BERT architecture

  • Using Transformer-XL architecture with sub-word units.

  • Comparison of Transformer and LSTM models as language models in comparable settings with an agglutinative language.

II Background & Methods

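The goal of a language model is to assign meaningful probabilities to a sequence of words. Given a set of tokens $\mathbf{X} = (x_1, \dots, x_T)$, where $T$ is the length of a sequence, our task is to estimate the joint conditional probability $P(\mathbf{X})$, which is

$$P(\mathbf{X}) = \prod_{i=1}^{T} p(x_i \mid x_1, \dots, x_{i-1}), \qquad (1)$$

where $(x_1, \dots, x_{i-1})$ is the context. An intrinsic evaluation of the performance of language models is perplexity (PPL), which is defined as the inverse probability of the set of tokens, taking the $T$-th root, where $T$ is the number of tokens:

$$\mathrm{PPL}(\mathbf{X}) = P(\mathbf{X})^{-1/T}. \qquad (2)$$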
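As a small illustration of the perplexity definition, the inverse probability with the root taken over the number of tokens equals the exponential of the negative mean log-probability. The sketch below assumes natural-log conditional probabilities are already available from some model; the function name `perplexity` and the toy inputs are illustrative and not taken from the paper.

```python
import math

def perplexity(token_log_probs):
    """Perplexity from natural-log conditionals log p(x_i | x_1..x_{i-1}).

    P(X)^(-1/T) is computed as exp(-(1/T) * sum_i log p_i) for numerical stability.
    """
    if not token_log_probs:
        raise ValueError("need at least one token")
    avg_neg_log_prob = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_neg_log_prob)

# Toy check: if every token has probability 1/4, the perplexity is 4.
uniform_log_probs = [math.log(0.25)] * 10
ppl_uniform = perplexity(uniform_log_probs)
```

Working in log space avoids the underflow that multiplying many small probabilities would cause for long sub-word sequences.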


In our two approaches we use Transformer-based architectures: BERT and Transformer-XL, as mentioned before. Calculating the auto-regressive factorization for Transformer-XL is quite straightforward as the model is unidirectional, but the joint probability does not factorize the same way for a bi-directional model like BERT.

BERT's bi-directional context poses a problem for calculating an auto-regressive joint probability. A simple fix would be to mask all the tokens to the right of the current position and calculate the conditional factors as we do for a unidirectional model. By doing so, though, we lose the advantage of the bi-directional context that the BERT model enables. We propose an approximation of the joint probability as

$$P(\mathbf{X}) \approx \prod_{i=1}^{T} p(x_i \mid x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_T). \qquad (3)$$

This type of approximation has been previously explored with bi-directional RNN LMs [10] but not for deep Transformer models. We therefore define a pseudo-perplexity score from the above approximated joint probability.
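One way to read the pseudo-perplexity is as the ordinary perplexity formula applied to conditionals in which each token is scored given all the other tokens in the sequence. The sketch below assumes this reading; `masked_log_prob` is a hypothetical callable standing in for a masked language model that masks position `i` and returns the natural-log probability of the true token there. It is not the authors' implementation.

```python
import math

def pseudo_perplexity(tokens, masked_log_prob):
    """exp of the negative mean of log p(x_i | all tokens except x_i)."""
    if not tokens:
        raise ValueError("need at least one token")
    total = 0.0
    for i in range(len(tokens)):
        # One scoring call per position: mask token i, score the true token.
        total += masked_log_prob(tokens, i)
    return math.exp(-total / len(tokens))

# Hypothetical stand-in scorer: uniform over an assumed 5-token vocabulary.
def uniform_scorer(tokens, i):
    return math.log(1.0 / 5)

pppl_uniform = pseudo_perplexity(["two", "slipp+", "+er+", "+s"], uniform_scorer)
```

Note that this requires one masked forward pass per token, so scoring a sequence of length T costs T model evaluations.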

The original BERT has two training objectives. The first is 'masked language modelling', in which input tokens are masked at random and then predicted using the left and right context. The second is the 'next sentence prediction' task, which jointly trains text-pair representations. For training the masked language model, the original BERT used Byte Pair Encoding (BPE) [11] for subword tokenization [12]. For example, the rare word "unaffable" is split up into more frequent subwords such as ["un", "##aff", "##able"]. To remain consistent with the experiments performed with LSTMs, we use Morfessor for subword tokenization of Finnish. In addition, we apply boundary markers as in Table I and train two separate models using this distinction. We train with left-marked markings because the original BERT was trained with such a scheme, and with left+right-marked markings because this was the previous SOTA for Finnish. For the Transformer-XL experiments, we train only with the left+right-marked scheme.

| Subword marking | Example |
|---|---|
| left+right-marked (+m+) | two slipp+ +er+ +s |
| left-marked (+m) | two slipp +er +s |

TABLE I: Two methods of marking subword units such that the original sentence 'two slippers' can be reconstructed
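The two marking schemes in Table I can be sketched as follows. The input format (a list of words, each given as a list of morphs) and the function names are assumptions for illustration; the segmentation itself would come from Morfessor. The sketch also assumes that the text contains no literal "+" characters.

```python
def mark_left_right(words):
    """left+right-marked (+m+): '+' on every side that joins another morph."""
    out = []
    for morphs in words:
        last = len(morphs) - 1
        for i, morph in enumerate(morphs):
            prefix = "+" if i > 0 else ""
            suffix = "+" if i < last else ""
            out.append(prefix + morph + suffix)
    return out

def mark_left(words):
    """left-marked (+m): '+' only in front of non-initial morphs."""
    return [("+" if i > 0 else "") + morph
            for morphs in words
            for i, morph in enumerate(morphs)]

def reconstruct(tokens):
    """Rebuild the sentence: a token starting with '+' continues the previous word."""
    words = []
    for tok in tokens:
        piece = tok.strip("+")
        if tok.startswith("+") and words:
            words[-1] += piece
        else:
            words.append(piece)
    return " ".join(words)

segmented = [["two"], ["slipp", "er", "s"]]
lr_tokens = mark_left_right(segmented)   # ['two', 'slipp+', '+er+', '+s']
l_tokens = mark_left(segmented)          # ['two', 'slipp', '+er', '+s']
```

Both schemes are reversible with the same rule, since a leading "+" is enough to tell word-internal morphs from word-initial ones.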

The Next Sentence Prediction (NSP) task is a binary classification task which predicts whether two segments follow each other in the original text. This pre-training task was proposed to further improve performance on downstream tasks such as Natural Language Inference (NLI), but in practice removing the NSP loss matches or slightly improves downstream task performance [13]. In this paper, we have omitted the NSP task from the BERT pre-training procedure and changed the input from a SEGMENT-PAIR input to a SINGLE SEGMENT input, as seen in Fig. 1.

Fig. 1: BERT-Original sentence ’how are you doing today’

Transformer-XL introduced the notion of recurrence in self-attention by caching the hidden state sequence of the previous segment and reusing it to compute the hidden states of a new segment. It also introduces a novel relative positional embedding scheme; combined, the two address the issue of fixed context lengths. Transformer-XL, as mentioned, is a unidirectional deep Transformer architecture, so the perplexity can be calculated as in Eq. 2. The only change is in the input format, where we use sub-word units rather than whole-word units, as Finnish is morphologically richer than English.
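The segment-level recurrence can be illustrated with a toy loop that shows which earlier positions are visible to each new segment. This is only a sketch of the bookkeeping: in the actual model the cache holds per-layer hidden states rather than token ids, and the function name and parameters here are hypothetical.

```python
def segments_with_memory(tokens, seg_len, mem_len):
    """Yield (memory, segment) pairs; memory is the cached tail of what came before.

    Token ids stand in for the cached hidden states of the real model.
    """
    memory = []
    for start in range(0, len(tokens), seg_len):
        segment = tokens[start:start + seg_len]
        yield list(memory), segment
        # Keep only the most recent mem_len positions for the next segment.
        memory = (memory + segment)[-mem_len:] if mem_len > 0 else []

pairs = list(segments_with_memory(list(range(10)), seg_len=4, mem_len=3))
```

Each segment can attend to its own positions plus the cached memory, so the usable context grows beyond the fixed segment length.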

III Data

The Finnish text data used for the language modeling task is provided by [14]. The dataset consists mainly of newspapers and books, with around 144 million word tokens and 4.2 million unique tokens. We use Morfessor 2.0 [15] with the basic unsupervised Morfessor Baseline algorithm [16] and a corpus weight parameter ($\alpha$) of 0.001. We have a vocabulary of 34K subword tokens for the left+right-marked (+m+) markings and 19K subword tokens for the left-marked (+m) markings. We also pre-process the data to remove any punctuation marks so that we can use the same data with an ASR system. The input is one sentence per line and we shuffle the sentences at each epoch. The data is randomly divided into a training dataset and a validation dataset. The test dataset consists of 2850 Finnish news articles obtained from the Finnish national broadcaster YLE.
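The pre-processing steps described above (punctuation removal and per-epoch sentence shuffling) could look roughly like the following sketch. The exact rules used for the paper are not given, so treating every character in a Unicode punctuation category as removable is an assumption, as are the function names and the example sentences.

```python
import random
import unicodedata

def strip_punctuation(sentence):
    """Drop characters in Unicode punctuation categories, then collapse whitespace."""
    kept = "".join(ch for ch in sentence
                   if not unicodedata.category(ch).startswith("P"))
    return " ".join(kept.split())

def shuffled_epochs(sentences, num_epochs, seed=0):
    """Yield one freshly shuffled copy of the sentence list per epoch."""
    rng = random.Random(seed)
    for _ in range(num_epochs):
        order = list(sentences)
        rng.shuffle(order)
        yield order

raw = ["Hei, maailma!", "Mitä kuuluu?"]
cleaned = [strip_punctuation(s) for s in raw]
epochs = list(shuffled_epochs(cleaned, num_epochs=3))
```

Under this assumption hyphens are removed as well, which joins hyphenated compounds; a real pipeline may prefer to replace them with spaces.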

IV Experiments & Results

IV-A BERT

All BERT models were trained for 500K steps. The code was written in Python and we used the TensorFlow libraries to create the models. The experiments were run on a single NVIDIA Tesla V100 32 GB graphics card. The data was first converted into TensorFlow records as the input to the model. The set of hyperparameters which we found optimal after experimenting with different sets is listed in Table II.


| Hyperparameter | Value |
|---|---|
| Number of hidden layers | 20 |
| Hidden size of transformer | 896 |
| Number of attention heads | 16 |
| Intermediate size (size of the feed-forward layer) | 3584 |
| Hidden activation function | Gaussian Error Linear Units |
| Dropout probability | 0.1 |
| Max position embeddings | 300 |

TABLE II: BERT hyperparameters
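For reference, the values in Table II can be written as a configuration dictionary together with two quantities derived from them. The key names loosely follow those of the public BERT configuration files but are assumptions here, and the vocabulary size is left out because it differs between the two marking schemes.

```python
# Values copied from Table II; key names are illustrative, not the authors' config file.
bert_config = {
    "num_hidden_layers": 20,
    "hidden_size": 896,
    "num_attention_heads": 16,
    "intermediate_size": 3584,
    "hidden_act": "gelu",
    "dropout_prob": 0.1,
    "max_position_embeddings": 300,
}

# In BERT the hidden size is split evenly across the attention heads.
head_size, remainder = divmod(bert_config["hidden_size"],
                              bert_config["num_attention_heads"])

# Ratio of the feed-forward layer width to the hidden size.
ffn_ratio = bert_config["intermediate_size"] / bert_config["hidden_size"]
```

With these values each attention head has 56 dimensions and the feed-forward layer is four times the hidden size, the same ratio as in the original BERT.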

This set of parameters was chosen because its training performance was better than that of smaller models on modelling the long sequences of sub-words. We use the Adam optimizer [17], the same as the English BERT. A maximum sequence length of 300 encompasses 98 percent of the training data and also allows us to fit larger models on the GPU card. Hyperparameter optimization is very difficult for these models as they take around 15 days to train given the resources. The hyperparameter choices were therefore largely based on the original BERT, with small tweaks. We assess the training performance of the model in Table III.

| Model | Masked LM loss | Masked LM accuracy |
|---|---|---|
| left+right-marked (+m+) | 2.24 | 0.56 |
| left-marked (+m) | 2.03 | 0.59 |

TABLE III: BERT training performance

When we train the BERT model we mask some percentage of the input tokens at random and then predict those masked tokens; this is known as masked LM. The masked LM loss refers specifically to the loss of the model's predictions on the masked tokens, and the masked LM accuracy refers to the accuracy of those predictions. The losses of both models are far from the masked LM loss of the English BERT, the key difference being that the pre-training data of the two language models is quite different: Google trained their model on 3.3 billion words from BooksCorpus [18] and the English Wikipedia, whereas our model is trained on 144 million words. Comparing the two Finnish models, the left-marked model has a better training performance than the left+right-marked model.
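The two training metrics can be made concrete with a small sketch. It assumes the loss is the mean negative natural-log probability of the true token over the masked positions and that accuracy counts how often the most probable token is the true one; the paper does not spell out these details, and the function name and toy data are hypothetical.

```python
import math

def masked_lm_metrics(probs, targets, masked_positions):
    """Return (mean cross-entropy, accuracy) computed over masked positions only.

    probs[pos] is a predicted distribution over token ids at position pos,
    targets[pos] is the true token id at that position.
    """
    if not masked_positions:
        raise ValueError("need at least one masked position")
    loss = 0.0
    correct = 0
    for pos in masked_positions:
        dist = probs[pos]
        loss -= math.log(dist[targets[pos]])
        predicted = max(range(len(dist)), key=dist.__getitem__)
        if predicted == targets[pos]:
            correct += 1
    n = len(masked_positions)
    return loss / n, correct / n

# Toy example with a 3-token vocabulary; only positions 1 and 2 are masked.
toy_probs = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3], [0.5, 0.25, 0.25]]
toy_targets = [0, 2, 0]
toy_loss, toy_acc = masked_lm_metrics(toy_probs, toy_targets, [1, 2])
```

Unmasked positions do not contribute to either metric, which is why position 0 is ignored in the toy example.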

The pseudo-perplexity described in the previous section is used to evaluate the above models on the test dataset; the results are in Table IV. The test dataset comes from a different context than the training data, and interestingly both models are quite confident on the test dataset. The pseudo-perplexity values of the left-marked model are lower than those of the left+right-marked model, signifying that it is more confident.

We cannot directly compare the perplexity scores of the BERT model with those of a unidirectional LSTM model, as the two are calculated in different ways. We could compare it with a bi-directional LSTM or use a downstream task to compare the performance of both. We could also randomly mask tokens and then compare the prediction accuracy on the masked tokens.

| Model | Pseudo-perplexity |
|---|---|
| left+right-marked (+m+) | 17.1 |
| left-marked (+m) | 14.5 |

TABLE IV: BERT test performance

IV-B Transformer-XL

All Transformer-XL models were likewise trained for 500K steps. The code was written in Python and we used the PyTorch libraries for model creation. The experiments were run on a single NVIDIA Tesla V100 32 GB graphics card. Two sets of hyperparameters were chosen for comparison after some initial optimization and are listed in Table V.


| Hyperparameter | Model 1 | Model 2 |
|---|---|---|
| Number of hidden layers | 4 | 4 |
| Hidden size of transformer | 512 | 1024 |
| Number of attention heads | 8 | 8 |
| Size of attention head | 80 | 128 |
| Intermediate size (size of the feed-forward layer) | 2048 | 4096 |
| Warmup | 10000 | 40000 |
| Batch size | 64 | 224 |
| Segment length | 150 | 32 |
| Memory length | 150 | 32 |

TABLE V: Tr-XL hyperparameters

With the above parameter choices, we wanted to test whether providing a longer segment and memory length (longer context) is more advantageous than a larger model. These parameters were chosen after some hyperparameter optimization. As for BERT, we use the Adam optimizer, but we also use a cosine annealing learning rate scheduler to speed up training [19]. The training performance results are in Table VI.
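A learning-rate schedule with warmup followed by cosine annealing can be sketched as below. The warmup length and total step count match the values reported for Model 1, but the linear shape of the warmup, the peak learning rate `base_lr`, and the function name are assumptions, since the paper does not state them.

```python
import math

def lr_at_step(step, base_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to base_lr, then cosine annealing down to min_lr."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Hypothetical peak rate; warmup and total steps as reported for Model 1.
BASE_LR = 1e-3
schedule = [lr_at_step(s, BASE_LR, 10_000, 500_000)
            for s in (0, 5_000, 10_000, 255_000, 500_000)]
```

The rate rises to its peak at the end of warmup, falls to half the peak midway through the annealing phase, and reaches the minimum at the final step.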

| Model | Mem-seg len 150-150 | Mem-seg len 32-32 |
|---|---|---|
| left+right-marked (+m+) | 45.22 | 33.86 |
| left-marked (+m) | 47.83 | 35.78 |

TABLE VI: Tr-XL training perplexity scores

As opposed to BERT, the left+right-marked models have a better training performance than their left-marked counterparts. Interestingly, the larger model trains much better than the model given larger contexts. The parameters of the 32-32 model cannot be replicated for the 150-150 model, as the latter takes a lot of space on the GPU card. The test set is the same as that used with BERT and the results are in Table VII. The test performance mirrors the training performance, with the left+right-marked large model (32-32) performing best. We can directly compare the perplexity scores with the previous best [20], as both are unidirectional models; the Transformer-XL model has outperformed the latter by 27%.

| Model | Mem-seg len 150-150 | Mem-seg len 32-32 | (prev best) |
|---|---|---|---|
| left+right-marked (+m+) | 82.3 | 73.58 | 93.2 |
| left-marked (+m) | 84.79 | 74.39 | - |

TABLE VII: Tr-XL test perplexity scores, (-): The experiment models are not available
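The relative improvement over the previous best can be computed from the left+right-marked row of Table VII in two common ways. The short calculation below shows both; which convention underlies the reported 27% is not stated in the paper, so the reading given afterwards is an inference from the numbers.

```python
prev_best = 93.2   # previous best LSTM test perplexity (Table VII)
txl_best = 73.58   # Transformer-XL 32-32, left+right-marked (Table VII)

# Reduction measured relative to the previous best score.
relative_reduction = (prev_best - txl_best) / prev_best   # about 0.21

# How much higher the previous score is relative to the new one.
ratio_minus_one = prev_best / txl_best - 1                # about 0.27
```

The second quantity, roughly 26.7%, is consistent with the reported 27%; measured as a reduction relative to the LSTM score, the same numbers give about 21%.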

IV-C Result comparisons for Transformer architectures

Transformer-XL and BERT both have low perplexity and pseudo-perplexity scores, but the two cannot be directly compared as they are calculated quite differently (Eq. 1, Eq. 3). The dramatically low scores of BERT indicate that the per-word predicted probability is higher than that of a uni-directional model. Thus the predicted word probability distribution is much sharper than the probability distribution of the XL model. At this point, we cannot say which model architecture has performed better, BERT or Transformer-XL, despite both of them achieving low perplexity scores. We would need to experiment with a downstream task in order to fairly compare model performance.

V Conclusion

The recent migration from LSTM models to Transformer-based architectures in language modeling is justified, as Transformer-XL obtains strong perplexity results. The BERT model also obtains very low pseudo-perplexity scores, but these are not directly comparable to the scores of unidirectional models. Our major contributions in this project are the use of the Transformer-XL architecture for the Finnish language in a sub-word setting and the formulation of pseudo-perplexity for the BERT model. Further comparisons between the Transformer architectures can be made by applying them to a downstream ASR task, which will be explored in the future.