With the aim of extending the global reach of NLP technology, much recent research has focused on the development of multilingual models. Due to the lack of annotated data and resources in many languages, these models typically rely on multilingual joint learning or zero-shot transfer learning across languages. Much of this research has utilized multilingual word embeddings [muse, vecmap], that project words in multiple languages into a shared multilingual semantic space, such that translations of words across these languages appear close in the new space. Such general-purpose multilingual word embeddings then serve as a basis for joint or transfer learning focused on a particular task.
On the other hand, current best performing monolingual approaches have moved away from static to contextualized word embeddings. Such contextualisation allows the models to address the long-standing problem of polysemy and dynamically model contextual meaning variation. The first such model, CoVe [cove], used a deep LSTM [lstm]
encoder pretrained on a machine translation task to contextualize word vectors. This work paved the way for contextualized word embeddings by showing their superiority over static ones on downstream tasks such as named entity recognition (NER) and question answering. Shortly after, ELMo [elmo] improved upon CoVe by making the contextualized embeddings deep, combining information from multiple layers of an LSTM encoder trained with a language modelling objective. Finally, the Transformer-based [vaswani2017attention] BERT model [bert] broke performance records across many downstream NLP tasks, achieving up to 7.6% absolute improvement over the previous state of the art.
However, in the multilingual setting, the effects of contextualization are still relatively unexplored. One exception is the work of Schuster et al. [schuster2019cross], which generalizes the work of Lample et al. [muse] to the ELMo model by viewing a contextual word embedding as a context-dependent shift from the (static) mean embedding of a word. The mean embeddings are aligned across languages using adversarial training. This technique, however, does not outperform the non-contextualized embedding baseline in half of the performed experiments when no supervised anchored alignment is given, and it does not allow for large-scale joint training across languages.
Meanwhile, recent multilingual NLP research has moved away from solely word-level representations to training “massively” multilingual sentence encoders. Artetxe and Schwenk [laser] proposed a method for learning language-agnostic sentence embeddings (LASER) for 93 languages with a single shared encoder and limited aligned data. More specifically, they trained a deep LSTM encoder to embed sentences in all 93 languages into a shared space such that semantically similar sentences in different languages appear close to each other in this space. Using this model the researchers advanced the state of the art on zero-shot cross-lingual natural language inference for 13 out of 14 languages in the XNLI [xnli] dataset. Following these advances, Lample and Conneau [xlmberts] incorporated multilingual language pretraining into the BERT model [bert], coining their model XLM (cross-lingual language model), and further improved the performance on all XNLI languages. Such multilingual sentence encoders have so far been applied and evaluated on sentence-level tasks. Yet, they provide a promising starting point for learning the multilingual contextualized word representations needed for word-level classification tasks, and for investigating the effects of contextualization in multilingual word-representation models.
The contributions of this paper are as follows:
We conduct a comprehensive comparison of state-of-the-art multilingual word and sentence encoding models and pretraining methods on the tasks of NER and part-of-speech (POS) tagging, experimenting in both zero-shot transfer learning and joint training settings.
We introduce a new method for learning contextualized multilingual word embeddings based on the LASER encoder and perform an in-depth analysis of its performance against multiple benchmarks in zero-shot transfer and joint training settings. We improve upon the previous state of the art for English-to-German NER by 2.8 F1 points and perform at state-of-the-art level for other languages.
We empirically show and analyze the benefit of contextual word embeddings versus static word embeddings for zero-shot transfer learning.
2 Related work
2.1 Monolingual word representations
After the success of static word embeddings such as word2vec [word2vec] which produced general-purpose semantic encodings of words, a lot of effort has been put into contextualizing these embeddings. While static embeddings at the time improved results when applied to various NLP tasks, polysemy has remained a challenge for these models, as they represent all meanings of a word within one vector. All occurrences of a word are thus treated the same and the resulting vector is a combination of the semantics of each possible meaning of the word.
In order to incorporate context into embeddings, Peters et al. [bilstmLM] introduced Language Model (LM) embeddings and showed that they improve sequence tagging performance, specifically for NER. Subsequently, Peters et al. [elmo] took the LM embeddings a step further by making them deep, which resulted in performance improvements on several downstream benchmarks. Their model, ELMo, incorporates information from all layers of the network by taking a layer-wise weighted average of the embeddings. Interestingly, the authors showed that each layer of their LSTM encoder encodes different properties of a word: the first layer captures more syntactic aspects, whereas the second layer captures more high-level semantic information.
2.2 Multilingual word representations
Much previous research has focused on aligning word embeddings from different languages, either bilingual or multilingual, into a language-independent space [ap2014autoencoder]. These methods either jointly train word embeddings on aligned corpora or align monolingual ones by means of post-processing. An obvious drawback of such methods is their need for aligned corpora. Hence, in more recent work the focus has shifted towards unsupervised alignment of word embeddings. Multilingual Unsupervised or Supervised word Embeddings (MUSE) [lample2017unsupervised] is an example of an unsupervised method capable of aligning embeddings into a shared space, enabling easier knowledge transfer across languages without the need for additional resources. By aligning the embedding spaces of more than 30 languages, it generates high-quality embeddings for use in multilingual semantics tasks. Because of its proven performance we use these embeddings as a baseline against which to compare our models.
To the best of our knowledge, the work of Schuster et al. [schuster2019cross] comes closest to ours. They present a method to align monolingual ELMo embeddings across languages by modelling the contextual embedding $e_{w,c}$ of word $w$ in context $c$ as a context-dependent shift $s_{w,c}$ from the word's mean embedding $\bar{e}_w$, i.e. $e_{w,c} = \bar{e}_w + s_{w,c}$ (Equation 1), and applying the linear alignment technique proposed by Mikolov, Le and Sutskever [mikolov2013exploiting] to the mean embedding of each word.
Whereas the authors show this approach to be a simple yet effective one, it does not allow for joint training across languages or handling code-switching. Our proposed method tries to overcome these deficiencies by sharing one encoder for all languages instead of transferring knowledge from one to another language by means of post-processing.
2.3 Zero-shot cross-lingual transfer learning
Early work on cross-lingual transfer learning used parallel corpora to create cross-lingual word clusters or exploited external knowledge bases as means of feature engineering [tackstrom2012cross]. More recent approaches either exploit bilingual word embeddings to translate a dataset [xie2018neural] or attempt to learn language-invariant features [chen2019multi].
A related promising development is that of Language-Agnostic SEntence Representations (LASER) [laser], where one of the main contributions is an encoder capable of embedding sentences in 93 languages into a shared space such that semantically related sentences are close in this space, regardless of their respective language, language family and script. At the time of release, LASER set a new state of the art on multiple zero-shot transfer learning tasks such as XNLI [xnli], indicating its success in creating language-agnostic embeddings.
The LASER encoder consists of a byte-pair encoded vocabulary (BPE) [bpe]
followed by a 5-layer biLSTM with 512-dimensional hidden states. The final sentence embedding is obtained by applying max-pooling over the hidden states of the final layer. BPE is a method for learning Subword Units (SUs) that encodes frequently occurring character n-grams as symbols. In the case of LASER, 50k such symbols were learned together with their respective embeddings.
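To make the BPE procedure concrete, the following toy sketch learns merge operations from a tiny word-frequency list. The corpus, merge count and the `</w>` end-of-word marker are illustrative choices, not LASER's actual configuration:

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair.
    (LASER learned 50k such symbols on its multilingual corpus.)"""
    vocab = {tuple(w) + ("</w>",): c for w, c in words.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for sym, c in vocab.items():
            for a, b in zip(sym, sym[1:]):
                pairs[(a, b)] += c
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = {}
        for sym, c in vocab.items():
            out, i = [], 0
            while i < len(sym):
                if i < len(sym) - 1 and (sym[i], sym[i + 1]) == best:
                    out.append(sym[i] + sym[i + 1]); i += 2
                else:
                    out.append(sym[i]); i += 1
            new_vocab[tuple(out)] = c
        vocab = new_vocab
    return merges

# "lo" is merged first since it occurs in every word of this toy corpus.
merges = bpe_merges({"low": 5, "lower": 2, "lowest": 3}, num_merges=3)
```

In the real model, each learned symbol receives its own trainable embedding, which the biLSTM then consumes.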
The encoder is trained in an encoder-decoder setup in the task of machine translation, as shown in Figure 1. More specifically, a dataset was gathered by combining the Europarl, United Nations, Opensubtitles2018, Global Voices, Tanzil and Tatoeba corpora [tiedemann2012parallel] comprising sentences in 93 languages translated into English and/or Spanish. The task of the encoder is to encode a sentence in a 1024-dimensional vector such that the decoder can generate the translation of the original sentence in a chosen target language. The decoder receives no information about which language is encoded by the encoder and hence it cannot distinguish between languages. This forces the encoder to create language-agnostic sentence embeddings.
The main contribution of LASER is that the authors show that it is possible to encode numerous languages with one encoder when a shared vocabulary is learned and the training data is aligned with just two target languages. Since the LASER sentence encoder achieves very promising results on zero-shot transfer learning for sentence-level NLP tasks, we use this model as a basis for ours. Our goal is to investigate the possibility of extracting contextualized word embeddings from an encoder trained at the sentence level. We evaluate two versions of our model and compare them to multiple baselines.
2.4 Multilingual joint learning
Multilingual joint learning has been shown to be beneficial when either the target or all languages are resource-lean [khapra2011together], when code-switching is present [adel2013combination], or even in high-resource scenarios [mulcaire2018polyglot]. Often, multilingual joint training is approached through some form of parameter sharing [johnson2017google].
3.1 LASER-based contextualized embeddings
We use a pretrained LASER model to obtain the contextualized word embeddings. For this, we compare different methods, which we explain below.
BPE BOW As the first baseline, we simply create word embeddings by averaging the BPE embeddings per word. This approach can be compared to a continuous Bag-Of-Words (BOW) approach with the BPE embeddings serving as words.
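As an illustration, here is a minimal sketch of the BOW word embedding. The vocabulary, dimensionality and symbols are toy placeholders, not the actual pretrained LASER BPE table:

```python
import numpy as np

# Toy BPE embedding table (in LASER these are 320-dimensional and pretrained).
emb_dim = 4
rng = np.random.default_rng(0)
bpe_table = {sym: rng.normal(size=emb_dim) for sym in ["un@@", "believ@@", "able"]}

def bow_word_embedding(bpe_symbols):
    """Continuous bag-of-words: average the embeddings of a word's subword units."""
    return np.mean([bpe_table[s] for s in bpe_symbols], axis=0)

vec = bow_word_embedding(["un@@", "believ@@", "able"])
```

The resulting vector ignores subword order entirely, which is exactly the limitation the GRU baseline below addresses.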
BPE GRU As the second baseline, we introduce a GRU [gru]
encoder followed by max-pooling over time to encode the BPE embeddings into a word embedding. First, each BPE symbol is embedded using the pretrained embeddings from the LASER encoder. Then, the data is arranged into a [BPE_pad, N, Emb_dim] tensor, where BPE_pad is the length of the longest word in the batch expressed in number of Subword Units, N is the number of words in the batch, and Emb_dim is the embedding dimension of the pretrained BPE symbols, which equals 320. This tensor is then fed into the GRU encoder to compute the joint semantics of the SUs. The final embedding is created by applying max-pooling over time to the output of the GRU.
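The following sketch illustrates the idea with a minimal hand-rolled GRU in NumPy. Dimensions are toy-sized and the weights are random; the actual baseline uses the pretrained 320-dimensional LASER BPE embeddings and a trained GRU:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TinyGRU:
    """Minimal single-layer GRU, an illustrative stand-in for the BPE GRU encoder."""
    def __init__(self, in_dim, hid_dim, seed=0):
        rng = np.random.default_rng(seed)
        s = 1.0 / np.sqrt(hid_dim)
        # Weights for the update (z), reset (r) and candidate (h) gates.
        self.Wz, self.Uz = rng.uniform(-s, s, (in_dim, hid_dim)), rng.uniform(-s, s, (hid_dim, hid_dim))
        self.Wr, self.Ur = rng.uniform(-s, s, (in_dim, hid_dim)), rng.uniform(-s, s, (hid_dim, hid_dim))
        self.Wh, self.Uh = rng.uniform(-s, s, (in_dim, hid_dim)), rng.uniform(-s, s, (hid_dim, hid_dim))

    def __call__(self, x):  # x: [T, in_dim] -> hidden states [T, hid_dim]
        h, outs = np.zeros(self.Uz.shape[0]), []
        for t in range(x.shape[0]):
            z = sigmoid(x[t] @ self.Wz + h @ self.Uz)
            r = sigmoid(x[t] @ self.Wr + h @ self.Ur)
            h_tilde = np.tanh(x[t] @ self.Wh + (r * h) @ self.Uh)
            h = (1 - z) * h + z * h_tilde
            outs.append(h)
        return np.stack(outs)

# A [BPE_pad, Emb_dim] sequence of subword embeddings for one word (toy sizes).
bpe_seq = np.random.default_rng(1).normal(size=(3, 8))  # 3 SUs, 8-dim embeddings
gru = TinyGRU(in_dim=8, hid_dim=6)
word_emb = gru(bpe_seq).max(axis=0)  # max-pooling over time -> one word embedding
```

Unlike the BOW average, the GRU states depend on subword order, so the pooled vector reflects the composition of the SUs.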
MUSE As the third baseline, we consider static crosslingual word embeddings from the MUSE model [lample2017unsupervised] as embeddings for our sequence tagger.
LASER-top Our first proposed method incorporating the LASER LSTM encoder, which we call LASER-top, uses the hidden state of the final layer as the base representation of a BPE symbol. First, we apply max-pooling over the forward and backward hidden states, $\overrightarrow{h}_j$ and $\overleftarrow{h}_j$, to obtain a 512-dimensional vector $h_j$ per BPE symbol (preliminary experiments showed that max-pooling both reduced overfitting and improved computational efficiency). Inspired by ELMo, we then rescale this output with a learnable scale parameter $\gamma$.
As the LASER encoder is fed a sentence split into SUs, its output can be seen as a contextualized representation of the original embeddings, which is why we expect this method to improve over the baselines. Then, similar to the original approach [laser], the final word embedding is created by applying max-pooling over time to all vectors $h_j$ belonging to a word, for each word separately.
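A minimal sketch of the LASER-top pooling steps, with random arrays standing in for the encoder's final-layer hidden states and hypothetical word boundaries:

```python
import numpy as np

# Stand-in for LASER's final-layer biLSTM states over a BPE sequence.
T, H = 5, 4                      # 5 subword units; LASER uses H = 512
rng = np.random.default_rng(2)
h_fwd = rng.normal(size=(T, H))  # forward hidden states
h_bwd = rng.normal(size=(T, H))  # backward hidden states
gamma = 1.0                      # learnable scale in the real model; fixed here

# Element-wise max over the two directions gives one vector per BPE symbol.
h = gamma * np.maximum(h_fwd, h_bwd)  # [T, H]

# Hypothetical word boundaries in BPE units: word 0 spans SUs 0-1, word 1 spans 2-4.
spans = [(0, 2), (2, 5)]
word_embs = np.stack([h[a:b].max(axis=0) for a, b in spans])  # max-pool over time per word
```

Because the pooled $h_j$ vectors come from a bidirectional encoder run over the whole sentence, each word embedding depends on its full sentential context.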
LASER-elmo Inspired by ELMo and hence called LASER-elmo, we make our multilingual contextualized embeddings deep by incorporating multiple layers of the LSTM encoder. A weighted average of the hidden states of all layers is computed using task-specific layer weights, which are softmax-normalized and learned during training:

$e_j = \gamma \sum_l s_l h_{j,l}$

where $e_j$ is the deep contextualized embedding for the SU at index $j$ in the sequence, $h_{j,l}$ is computed for layer $l$ as in Equation 2, and $s_l$ is the softmax-normalized weight of layer $l$. These embeddings are then used as in LASER-top to create word embeddings.
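The layer-weighting step can be sketched as follows. Random arrays stand in for the per-layer encoder states; with untrained zero logits the softmax is uniform, so the result equals the plain layer mean:

```python
import numpy as np

def softmax(w):
    e = np.exp(w - w.max())
    return e / e.sum()

L, T, H = 5, 6, 4                          # LASER has 5 biLSTM layers
rng = np.random.default_rng(3)
layer_states = rng.normal(size=(L, T, H))  # per-layer states after direction pooling
s = softmax(np.zeros(L))                   # task-specific layer weights, learned in training
gamma = 1.0                                # learnable scale

# Deep contextualized embeddings: scaled, softmax-weighted sum over layers.
e = gamma * np.tensordot(s, layer_states, axes=(0, 0))  # [T, H]
```

During training, the gradient flowing into the weight logits lets the tagger emphasize whichever encoder layers are most useful for the task.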
LSTM No Pretraining In order to verify which part of the performance of LASER-top and LASER-elmo can be attributed to the fact that the 5-layer LSTM encoder allows for modelling more complex dependencies, we replace the pretrained encoder with a randomly initialized one and retrain. As overfitting plays a major role in training our models and Peters et al. [elmo] have shown a two-layer LSTM to be sufficiently powerful for sequence tagging tasks, we pick a two-layer LSTM to replace the LASER encoder. Otherwise, this baseline functions in the same way as LASER-elmo.
3.2 Transformer-based contextualized embeddings
BERT Devlin et al. [bert] propose a novel language representation model which uses Transformers to create deep contextual representations for words. These representations are obtained by training the model on unlabeled text to predict the words at randomly chosen masked positions, conditioned on both the left and the right context; the authors call this technique the Masked Language Model (MLM), see Figure 2. BERT uses WordPiece embeddings [wu2016google] with a vocabulary of 30k tokens and is trained on the BookCorpus [zhu2015aligning] together with a Wikipedia dump: a combined corpus of approximately 3300M words. In addition to the MLM, BERT is also trained on a next sentence prediction task in order to capture relationships between sentences. This task is phrased as a binary classification task where either two consecutive or two random sentences are sampled from the corpus. The complete pretraining procedure combines these two tasks by sampling sentences as described for the next sentence prediction task and applying both this task and the MLM.
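A toy sketch of the MLM corruption step. It is simplified: BERT's full scheme additionally keeps or randomizes some of the selected tokens rather than always inserting [MASK]:

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Toy Masked Language Model corruption: replace a random subset of
    positions with [MASK]; the model is trained to recover the originals."""
    rng = random.Random(seed)
    masked, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok       # remember the gold token at this position
            masked[i] = "[MASK]"
    return masked, targets

tokens = "the cat sat on the mat".split()
masked, targets = mask_tokens(tokens, mask_prob=0.3, seed=1)
```

The loss is then computed only over the positions recorded in `targets`, forcing the model to infer a word from its bidirectional context.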
Although BERT is not explicitly pretrained to align semantics across languages, its multilingual version, from which we use the cased base version in our experiments, is trained on 100+ languages, and its monolingual capabilities are (near) state of the art for many NLP tasks without heavily engineered task-specific architectures. (The multilingual version of BERT is not described in the original paper [bert]; instead, it is described on the authors' official GitHub page: https://github.com/google-research/bert/blob/master/multilingual.md.)
XLM After the success of LASER in zero-shot multilingual transfer learning, Lample and Conneau [xlmberts] proposed a similar method based on the architecture of BERT. Their contribution lies in the introduction of several new unsupervised and supervised methods for cross-lingual language model pretraining. In this work we focus on their supervised method, as it is most closely related to LASER and outperforms it on the XNLI benchmark. This method is a multi-task setup combining a slightly adjusted MLM [bert] with their so-called Translation Language Model (TLM); see Figure 2 for a comparison of the MLM and TLM (taken from Lample and Conneau).
The TLM exploits an N-way parallel corpus of sentences to allow the model to explicitly use words from language A to predict the masked words in language B, hence encouraging the model to learn similar representations for semantically similar phrases across languages. The authors use a dataset accompanying the XNLI evaluation set which contains 10k parallel sentences in all 15 languages (English, French, Spanish, German, Greek, Bulgarian, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, Hindi, Swahili and Urdu). The TLM objective is alternated with the MLM objective using Wikipedia dumps of each language.
This approach differs from the one used by Artetxe and Schwenk [laser]: firstly, no encoder-decoder structure is used to explicitly align languages in a shared space. Instead, the model is only implicitly encouraged to align languages, by being allowed to share knowledge across the language boundary while solving the same task independently and simultaneously for both languages. Although the performance on the XNLI dataset improved using this method, there is an obvious drawback in terms of scaling to more languages due to the N-way parallel corpus requirement.
3.3 Task-specific models
All the above methods provide a means to extract word embeddings, which then serve as input to models for downstream tasks. We experiment with two downstream tasks: NER and POS tagging. These tasks are chosen in order to evaluate the performance on both semantic (NER) and syntactic level (POS tagging).
We use the encoder of the Transformer model as described by Vaswani et al. [vaswani2017attention] as our sequence tagging model, instead of a more commonly used RNN model, in the hope of transcending differences in sentence structure across languages. Specifically, we use a two-layer Transformer with 2 attention heads and 300 hidden dimensions for the query, key and value matrices as well as for the feed-forward network (FFN). The model is topped with a Conditional Random Field (CRF) [lafferty2001conditional]. For BERT and XLM, the literature shows that adding a linear classification layer suffices for token-level classification tasks [bert].
4 Experimental setup
4.1 Data and preprocessing
|                      | LASER                | BERT                 | XLM                 |
| Supervised dataset   | OPUS, 223M sentences | -                    | XNLI, 150k sents    |
| Unsupervised dataset | -                    | Wiki dump, 104 langs | Wiki dump, 15 langs |
| Unsupervised task    | -                    | MLM + next sentence  | MLM                 |
We used the datasets from the CoNLL2002 [conll2002] and CoNLL2003 [conll2003] shared tasks, which provide data for NER and POS tagging in English, Spanish, Dutch and German. The data is gathered from local newspapers and is annotated with both named entities and POS tags. All datasets are of approximately the same size.
As the POS tags are given as language-specific tags, we convert them to Universal POS tags [petrov2011universal], leaving us with 12 POS tags. In order to evaluate the ability of our methods to capture both semantic and syntactic information about a word, no additional features are used when training the models. All data is tokenized; beyond that, only punctuation normalization and lowercasing have been applied.
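The tag conversion amounts to a simple lookup. A hypothetical excerpt for English Penn Treebank tags might look like the following (the mapping shown is illustrative, not the full table):

```python
# Illustrative excerpt of a language-specific -> Universal POS mapping,
# in the spirit of Petrov et al.'s 12-tag universal tagset.
PTB_TO_UNIVERSAL = {
    "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN",
    "VB": "VERB", "VBD": "VERB", "VBZ": "VERB",
    "JJ": "ADJ", "RB": "ADV", "DT": "DET", "IN": "ADP",
}

def to_universal(tags):
    """Map fine-grained tags to universal ones; unknown tags fall back to X."""
    return [PTB_TO_UNIVERSAL.get(t, "X") for t in tags]

converted = to_universal(["DT", "NN", "VBD"])
```

Each source language has its own such table; after conversion all four datasets share the same 12-tag inventory, which is what makes cross-lingual POS evaluation comparable.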
BPE BOW, BPE GRU and MUSE are relatively simple models and hence no intensive hyperparameter tuning is performed. Between the embedder and the sequence tagger a dropout of 0.25 is applied, and within the sequence tagger a dropout of 0.15. We used Adam [kingma2014adam] as optimizer with the default learning rate of 0.001 and applied weight regularization. The models were trained for a total of 15 epochs while monitoring performance on a development set and applying early stopping [EarlyStopping] after two consecutive rounds of decreased performance.
LASER-based models During preliminary experiments we found overfitting to be a major challenge for the LASER-based models and LSTM No Pretraining; hence more sophisticated training techniques than those used for the baselines have been applied.
Firstly, instead of using Adam as optimizer, the 1cycle LR policy [smith2018disciplined] is used. This policy uses the much simpler SGD optimizer with momentum and has been shown to improve the generalization capabilities of neural networks while decreasing the number of epochs needed for training, a phenomenon the author calls "super convergence".
Finally, all remaining hyperparameters concerning regularization, such as dropout and weight regularization, are determined using Bayesian Optimization [snoek2012practical].
Transformer-based models As BERT and XLM are practically the same model, differing only in their exact hidden size and pretraining methods, they are trained using the same procedure. The original work [bert] comes with a guide on how to finetune BERT for downstream NLP tasks based on a small grid search over values for the batch size, learning rate and number of epochs. Moreover, it contains optimal settings for NER on the CoNLL2003 dataset, which is also one of our datasets. Beyond these specified optimal hyperparameter settings, the authors note that BERT tends to be robust to the exact hyperparameter settings; hence we use the specified hyperparameters for all experiments. This amounts to training for 4 epochs with a batch size of 16 and a learning rate of 5e-5.
Zero-shot transfer learning The first set of experiments involves zero-shot transfer learning across languages. Each model is trained on the English dataset and subsequently evaluated on all other datasets, including, for NER, two datasets in low-resource languages: Hungarian [szarvas2006highly] and Basque [alegria2004design].
Joint training In order to evaluate the benefit of joint training, we considered two scenarios. In the first scenario (A) a quarter of the training set of each language in the CoNLL2002 and CoNLL2003 shared tasks is taken and combined into a new training set of approximately the same size as the former ones. A validation set is created in a similar fashion and used for monitoring. Each model is trained on the multilingual dataset and then evaluated on the original test sets of each language. In the second scenario (B) one full training set (English) was complemented with a quarter of the training sets of the remaining languages. The difference in performance between scenarios A and B can be used as a way to quantify how well each model shares knowledge across languages.
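The two mixing scenarios can be sketched as follows (toy sentence lists stand in for the real CoNLL training sets):

```python
import random

def mixed_training_set(datasets, full_lang=None, fraction=0.25, seed=0):
    """Build a joint multilingual training set.
    Scenario A: a quarter of every language (full_lang=None).
    Scenario B: all of one language plus a quarter of the rest."""
    rng = random.Random(seed)
    mixed = []
    for lang, sents in datasets.items():
        if lang == full_lang:
            mixed.extend(sents)  # keep the full training set for this language
        else:
            mixed.extend(rng.sample(sents, int(len(sents) * fraction)))
    return mixed

# Toy corpora of 100 "sentences" per language.
data = {lang: [f"{lang}-{i}" for i in range(100)] for lang in ["en", "de", "es", "nl"]}
scenario_a = mixed_training_set(data)                  # 25 sentences per language
scenario_b = mixed_training_set(data, full_lang="en")  # all English + 25 of each other
```

Comparing a model's scores under the two scenarios isolates how much adding more English data helps (or hurts) the other languages, i.e. how well knowledge is shared.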
[Table 2: F1 scores per model (BPE BOW, BPE GRU, MUSE, LASER-top, LASER-elmo, LSTM No Pretraining, BERT, XLM) and per language for NER and POS tagging. Notes: XLM has been pretrained on the 15 XNLI languages, of which only English and German appear among our evaluation languages; no MUSE embeddings are available for Basque; some scores are not available in the literature.]
5.1 Zero-shot transfer learning
Table 2 shows the F1 scores per model and per language for NER and POS tagging, with the highest scores per language shown in bold and, where applicable, the state of the art underlined. Where possible, scores from monolingual training and evaluation are appended for reference, separated by "/". All results have been tested against LASER-top for significance using the sign test and have been found to be significant.
The highest performance scores were achieved by either BERT or LASER-top, with BERT performing best on 6 out of 9 tasks. Overall, BERT appears to be a stronger model for learning the tasks at hand than the LSTM-based LASER-top model, as BERT achieves the highest scores in all monolingual settings except German NER. LASER-top, on the other hand, is less capable of learning the task in the source language, but its drop in performance when evaluating on other languages is smaller: it achieves the highest score on 2 out of 4 languages for POS tagging and advances the state of the art for German NER, indicating the added benefit of the LASER pretraining method for crosslingual knowledge sharing.
Surprisingly, XLM does not outperform BERT in any of the settings. It is worth noting that, of the evaluated languages, only English and German were seen by XLM during pretraining, which explains its poor transfer to the remaining languages. Yet, XLM does not outperform BERT even in the transfer from English to German, indicating no added benefit of the TLM method for zero-shot transfer learning across languages.
For the low-resource languages the added benefit of contextualization is evident. As expected, performance on languages from more distant language families is lower: all models but one score higher on Hungarian than on Basque. Furthermore, pretraining on a larger number of languages appears to positively influence performance on low-resource languages, as BERT outperforms XLM by a large margin, and the same holds for LASER-top compared to LSTM No Pretraining.
Contrary to our expectations, LASER-top consistently outperforms LASER-elmo across all tasks. This is likely due to overfitting: LASER-elmo achieves higher scores on the training set than LASER-top in all experiments. Since the drop in performance across languages is far greater for all baseline models than for LASER-top and LASER-elmo, we attribute this improved performance to the multilingual pretraining. As the scores for LSTM No Pretraining are lower than those of LASER-top and LASER-elmo on the transferred languages, this improved performance cannot be attributed to the added complexity of the extra layers.
5.2 Joint training
[Table 3: joint-training F1 scores per model (BPE BOW, BPE GRU, LASER-top, LASER-elmo, LSTM No Pretraining, BERT, XLM) for NER and POS tagging. All results are given as a base score followed by a deviation: the base score is the F1 score after training on a quarter of each dataset (scenario A), and the deviation is the change in score after training on the full English training set plus a quarter of the remaining languages (scenario B). Note: XLM has been pretrained on the 15 XNLI languages, of which only English and German appear among our evaluation languages.]
Table 3 shows the results of joint training on the four CoNLL2002 and CoNLL2003 languages. For both NER and POS tagging, BERT and XLM clearly outperform all other models, yet it is questionable whether this is because of their ability to share knowledge across languages or because Transformer-based models are better suited to the task. Figures 3 and 4 visualize the added benefit of joint training, expressed as the difference in F1 scores compared to the baseline. For English this baseline is the monolingual baseline, whereas for the other languages the baseline is the mixed setting with a quarter of each language (scenario A) compared to the full English dataset extended with a quarter of the remaining datasets (scenario B). BPE BOW has been omitted from the graphs as its values distort them and it is of less importance than the remaining models.
Whereas LASER-top benefits from joint training in all languages but one (Basque), this differs greatly for BERT and XLM, indicating that the pretraining method used for LASER might allow for better crosslingual knowledge sharing than the MLM and TLM methods used for BERT and XLM respectively.
When comparing BERT and XLM, it appears that XLM shares knowledge better across languages it has been pretrained on while languages from a distant language family do not benefit at all from joint training.
5.3 Discussion and error analysis
In order to further analyze the added benefit of contextualized word embeddings versus static word embeddings in the zero-shot transfer setting, we compare MUSE with LASER-top in more detail. Firstly, we look at how good the models are at identifying whether an entity is present at all, by trimming the labels down to B, I and O and creating a confusion matrix (Figure 5).
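The label trimming used for this analysis can be sketched as:

```python
def trim_to_bio(labels):
    """Collapse typed NER labels (e.g. B-PER, I-LOC) to bare B/I/O,
    so the confusion matrix reflects entity detection only, not typing."""
    return [lab.split("-")[0] for lab in labels]

trimmed = trim_to_bio(["B-PER", "I-PER", "O", "B-LOC"])
```

With only three labels left, the confusion matrix directly shows whether a model detects an entity span at all, independent of which entity type it assigns.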
It can be clearly seen that LASER-top is better at detecting an entity than MUSE.
Furthermore, we find evidence that the difference in performance is partially attributable to LASER-top's contextualized word representations. Below are two examples where MUSE made an error that LASER-top did not; the erroneous words are capitalized, with the predicted entity type appended.
despite winning the asian GAMES/O title two years ago , uzbekistan are in the finals as outsiders .
houston 1996-12-05 ohio state left tackle orlando PACE/O became the first repeat winner of the lombardi award thursday night when the ROTARY/O club of houston again honoured him as college football ’s lineman of the year .
The incorrectly predicted words are all words that on their own would not be considered an entity of interest, but in this respective context they are part of a bigger entity span. These two sentences are examples of an often recurring pattern in the data in all evaluated languages.
In this work we have presented a comprehensive comparison of architectures and pretraining methods for contextualized multilingual word embeddings. We have also shown that it is possible to train a language model solely in an encoder-decoder style on the task of machine translation and subsequently use the encoder to create multilingual contextualized word embeddings. Moreover, we have shown that LASER-top outperforms (non-contextualized) baselines in multiple settings on multiple tasks and sometimes performs on par with or better than BERT in the zero-shot transfer setting.
Although our results indicate that our LSTM-based model is not as well suited for downstream NLP tasks as Transformer-based models, we have empirically shown our method to be superior at sharing knowledge across languages in a joint training setting and to perform at or above the state of the art in the zero-shot transfer setting.
As the results of our models are not yet on par with the current state of the art in monolingual settings, a logical next step is to investigate ways of combining pretraining methods, with the aim of learning higher-quality monolingual word representations while encouraging knowledge sharing across languages. For instance, a multi-task setup combining BERT's MLM with the LASER pretraining method could be explored for this purpose.
This research was supported by Deloitte Risk Advisory B.V., NL. Special thanks to Willem Mobach, Tommie van der Bosch and Marc Verdonk for their involvement and support.