English has an abundance of labeled data that can be used for various Natural Language Processing (NLP) tasks, such as part-of-speech (POS) tagging, named entity recognition (NER), and natural language inference (NLI). This richness of labeled data manifests itself as a boost in accuracy in the current era of data-hungry deep learning algorithms. However, the same is not true for many other languages, where task-specific data is scarce and expensive to acquire. This motivates the need for cross-lingual transfer learning – the ability to leverage knowledge from task-specific data available in one or more languages to solve that task in languages with little or no task-specific data.
Recent progress in NMT has enabled the training of multilingual systems that support translation from multiple source languages into multiple target languages within a single model (firat2016multi; johnson2017google; aharoni2019massively). Such multilingual NMT (mNMT) systems often demonstrate large improvements in translation quality on low-resource languages. This positive transfer originates from the model's ability to learn representations that are transferable across languages. Previous work has shown that these representations can then be used for cross-lingual transfer in other downstream NLP tasks, albeit on only a pair of language pairs (eriguchi2018zero) or by limiting the decoder to a pooled vector representation of the entire sentence from the encoder (artetxe2018massively).
In this paper we scale up the number of translation directions used in the NMT model to 102 languages to and from English. Unlike artetxe2018massively, we apply no restricting operations such as pooling while training mNMT, which allows us to obtain token-level representations and makes it possible to transfer them to sequence tagging tasks as well. We find that mNMT models trained with a plain translation loss can, out of the box, emerge as competitive alternatives to other methods at the forefront of cross-lingual transfer learning (devlin2018bert; artetxe2018massively).
Our contributions in this paper are threefold:
We use representations from a Massively Multilingual Translation Encoder (MMTE) that can handle 103 languages to achieve cross-lingual transfer on 5 classification and sequence tagging tasks spanning more than 50 languages.
We compare MMTE to mBERT in different cross-lingual transfer scenarios including zero-shot, few-shot, fine-tuning, and feature extraction scenarios.
We outperform the state-of-the-art on zero-shot cross-lingual POS tagging (Universal Dependencies v2.3 dataset; nivre2018universal) and intent classification (schuster2018cross), and achieve results comparable to the state-of-the-art on document classification (MLDoc dataset; SCHWENK18.658).
The remainder of this paper is organized as follows. Section 2 describes our MMTE model in detail and points out its differences from mBERT. All experimental details, results and analysis are given in Sections 3 and 4. This is followed by a discussion of related work. In Section 6, we summarize our findings and present directions for future research. We emphasize that the primary motivation of the paper is not to challenge the state-of-the-art but instead to investigate the effectiveness of representations learned from an mNMT model in various transfer-learning settings.
2 Massively Multilingual Neural Machine Translation Model
In this section, we describe our massively multilingual NMT system. Similar to BERT, our transfer learning setup has two distinct steps: pre-training and fine-tuning. During pre-training, the NMT model is trained on large amounts of parallel data to perform translation. During fine-tuning, we initialize our downstream model with the pre-trained parameters from the encoder of the NMT system, and then all of the parameters are fine-tuned using labeled data from the downstream tasks.
2.1 Model Architecture
We train our Massively Multilingual NMT system using the Transformer architecture (vaswani2017attention) in the open-source implementation under the Lingvo framework (lingvo). We use a larger version of Transformer Big (chen-EtAl:2018:Long1) containing 375M parameters (6 layers, 16 heads, 8192 hidden dimensions), and a shared source-target sentence-piece model (SPM; kudo2018sentencepiece; https://github.com/google/sentencepiece) vocabulary with 64k individual tokens. All our models are trained with Adafactor (shazeer2018adafactor) with momentum factorization, a learning rate schedule of (3.0, 40k) – shorthand for a learning rate of 3.0 with 40k warm-up steps, after which the rate decays with the inverse square root of the number of training steps – and a per-parameter norm-clipping threshold of 1.0. The encoder of this NMT model comprises approximately 190M parameters and is subsequently used for fine-tuning.
We train a massively multilingual NMT system which is capable of translating between a large number of language pairs at the same time by optimizing the translation objective between language pairs. To train such a multilingual system within a single model, we use the strategy proposed in johnson2017google which suggests prepending a target language token to every source sequence to be translated. This simple and effective strategy enables us to share the encoder, decoder, and attention mechanisms across all language pairs.
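The target-language-token strategy of johnson2017google can be sketched as follows; the exact "<2xx>" token naming is illustrative, not necessarily the literal tokens used in our vocabulary:

```python
def prepend_target_token(source_tokens, target_lang):
    """Prepend a target-language token (e.g. "<2fr>") to a tokenized source
    sequence, telling the shared encoder-decoder which language to produce."""
    return [f"<2{target_lang}>"] + source_tokens

# The same English sentence routed to two different target languages:
src = ["How", "are", "you", "?"]
to_french = prepend_target_token(src, "fr")  # -> ["<2fr>", "How", "are", "you", "?"]
to_german = prepend_target_token(src, "de")  # -> ["<2de>", "How", "are", "you", "?"]
```

Because only this prefix token changes between language pairs, the encoder, decoder, and attention parameters can all be shared across every direction.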
We train our multilingual NMT system on a massive scale, using an in-house corpus generated by crawling and extracting parallel sentences from the web uszkoreit2010large. This corpus contains parallel documents for 102 languages, to and from English, comprising a total of 25 billion sentence pairs. The number of parallel sentences per language in our corpus ranges from around 35 thousand to almost 2 billion. Figure 1 illustrates the data distribution for all 204 language pairs used to train the NMT model. Language ids for all the languages are also provided in supplementary material.
Data sampling policy
Given the wide distribution of data across language pairs, we used a temperature-based data balancing strategy. For a given language pair l, let D_l be the size of the available parallel corpus. If we adopt a naive strategy and sample from the union of the datasets, the probability of a sample being from language pair l is p_l = D_l / Σ_k D_k. However, this strategy would starve low-resource language pairs. To control the ratio of samples from different language pairs, we instead sample a fixed number of sentences from the training data, with the probability of a sentence belonging to language pair l being proportional to p_l^(1/T), where T is the sampling temperature. As a result, T = 1 corresponds to the true data distribution, and T = 100 yields an (almost) equal number of samples for each language pair (close to a uniform distribution, with over-sampling of low-resource language pairs). We set T = 5 for a balanced sampling strategy. To control the contribution of each language pair when constructing the vocabulary, we use the same temperature-based sampling strategy with T = 5. Our SPM vocabulary is trained with high character coverage.
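The temperature-based sampling above can be sketched as follows; the corpus sizes are toy values chosen to show the effect, not our actual data:

```python
def sampling_probs(corpus_sizes, T):
    """Temperature-based sampling: p_l proportional to (D_l / sum_k D_k) ** (1/T)."""
    total = sum(corpus_sizes.values())
    weights = {l: (d / total) ** (1.0 / T) for l, d in corpus_sizes.items()}
    z = sum(weights.values())
    return {l: w / z for l, w in weights.items()}

# Toy corpus: one high-resource and one low-resource pair (sizes illustrative).
sizes = {"en-fr": 1_000_000, "en-sw": 10_000}
true_dist = sampling_probs(sizes, T=1)     # the raw data distribution
balanced = sampling_probs(sizes, T=5)      # low-resource pair is up-sampled
near_uniform = sampling_probs(sizes, T=100)  # close to uniform
```

Raising T flattens the distribution: at T = 1 the low-resource pair is sampled about 1% of the time, while at T = 100 it is sampled nearly half the time.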
We use the BLEU score (papineni2002bleu) to evaluate the quality of our translation model(s). Our mNMT model performs worse than the bilingual baselines on high-resource language pairs but improves upon them on low-resource language pairs. The average drop in BLEU score across 204 language pairs, compared to bilingual baselines, is just 0.25 BLEU. This is impressive considering we are comparing one multilingual model to 204 different bilingual models. Table 1 compares the BLEU scores achieved by mNMT to those of the bilingual baselines on 10 representative language pairs, chosen as a diverse set with varying language families, scripts, and dataset sizes. These scores are obtained on an internal evaluation set containing around 5k examples per language pair.
2.3 Fine-tuning mNMT Encoder
Fine-tuning involves taking the encoder of our mNMT model, named Massively Multilingual Translation Encoder (MMTE), and adapting it to the downstream task. For tasks which involve single input, the text is directly fed into the encoder. For tasks such as entailment which involve input pairs, we concatenate the two inputs using a separator token and pass this through the encoder. For each downstream task, the inputs and outputs are passed through the encoder and we fine-tune all the parameters end-to-end. The encoder encodes the input through the stack of Transformer layers and produces representations for each token at the output. For sequence tagging tasks, these token level representations are individually fed into a task-specific output layer. For classification or entailment tasks, we apply max-pooling on the token level representations and feed this into the task-specific output layer.
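The two paths above (per-token outputs for tagging, a max-pooled vector for classification) can be illustrated with a minimal sketch; the representations and the dimension are toy values:

```python
def max_pool(token_reps):
    """Element-wise max over the token axis: one vector per sequence."""
    dim = len(token_reps[0])
    return [max(rep[d] for rep in token_reps) for d in range(dim)]

# Three encoder token representations of (toy) dimension 4:
reps = [[0.1, 0.9, 0.0, 0.2],
        [0.5, 0.1, 0.3, 0.2],
        [0.2, 0.4, 0.8, 0.1]]

# Sequence tagging: each of the 3 vectors goes to the output layer as-is.
# Classification/entailment: pool first, then feed a single vector.
pooled = max_pool(reps)  # -> [0.5, 0.9, 0.8, 0.2]
```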
It should be noted that fine-tuning is relatively inexpensive and fast: all of the results can be obtained within a few thousand gradient steps. The individual task-specific modeling details are described in Section 3. It is also important to note that while the encoder, the attention mechanism, and the decoder of the model are trained in the pre-training phase, only the encoder is used during fine-tuning.
2.4 Differences with mBERT
We point out some of the major differences between mBERT and MMTE:
mBERT uses two unsupervised pre-training objectives called masked language modeling (MLM) and next sentence prediction (NSP) which are both trained on monolingual data in 104 languages. MMTE on the other hand uses parallel data in 103 languages (102 languages to and from English) for supervised training with negative log-likelihood as the loss. It should be noted that mBERT uses clean Wikipedia data while MMTE is pre-trained on noisy parallel data from the web.
mBERT uses 12 transformer layers, 12 attention heads, and 768 hidden dimensions, totaling 178M parameters, while MMTE uses 6 transformer layers, 16 attention heads, and 8192 hidden dimensions, totaling 190M parameters. Note that the effective capacities of these two models cannot easily be compared by simply counting parameters, since depth and width contribute differently to model capacity.
MMTE uses SPM to tokenize input with 64k vocabulary size while mBERT uses a Wordpiece model (wu2016google) with 110k vocabulary size.
3 Experiments and Results
As stated earlier, we use MMTE to perform downstream cross-lingual transfer on 5 NLP tasks. These include 3 classification tasks: NLI (XNLI dataset), document classification (MLDoc dataset) and intent classification, and 2 sequence tagging tasks: POS tagging and NER. We detail all of the experiments in this section.
3.1 XNLI: Cross-lingual NLI
XNLI (conneau2018xnli) is a popularly used corpus for evaluating cross-lingual sentence classification; it contains data in 15 languages. Evaluation is based on classification accuracy for pairs of sentences labeled as one of entailment, neutral, or contradiction. We feed the text pair, separated by a special token, into MMTE and add a small network on top of it to build a classifier. This network consists of a pre-pool feed-forward layer with 64 units, a max-pool layer which pools word-level representations into a sentence representation, and a post-pool feed-forward layer with 64 units. The optimizer is Adafactor with a learning rate schedule of (0.2, 90k). The classifier is trained on English only and evaluated on all 15 languages. Results are reported in Table 2. Please refer to Appendix Table 1 for the language names associated with the codes.
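A minimal sketch of this classifier head, with randomly initialised weights standing in for trained ones; the encoder output dimension (1024) and sequence length are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_classes, seq_len = 1024, 64, 3, 12  # d_model/seq_len illustrative

# Random weights standing in for the trained task-specific head.
W_pre = rng.normal(size=(d_model, d_ff));   b_pre = np.zeros(d_ff)
W_post = rng.normal(size=(d_ff, d_ff));     b_post = np.zeros(d_ff)
W_out = rng.normal(size=(d_ff, n_classes)); b_out = np.zeros(n_classes)

def xnli_head(token_reps):
    """Pre-pool FF (64 units) -> max-pool over tokens -> post-pool FF (64 units)
    -> 3-way logits (entailment / neutral / contradiction)."""
    h = np.maximum(token_reps @ W_pre + b_pre, 0.0)  # per-token feed-forward
    pooled = h.max(axis=0)                           # sentence representation
    h2 = np.maximum(pooled @ W_post + b_post, 0.0)
    return h2 @ W_out + b_out

logits = xnli_head(rng.normal(size=(seq_len, d_model)))  # shape (3,)
```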
MMTE outperforms mBERT on 9 out of 15 languages and by 1.2 points on average. BERT achieves excellent results on English, outperforming our system by 2.5 points, but its zero-shot cross-lingual transfer performance is weaker than MMTE's. We see the largest gains on low-resource languages such as ar, hi, ur, and sw. MMTE, however, falls short of the current state-of-the-art (SOTA) on XNLI (lample2019cross). We hypothesize two reasons for this: (1) they pre-train their model on only the 15 languages associated with the XNLI task, and (2) they use both monolingual and parallel data for pre-training, while we use only parallel data. We confirm the first hypothesis in Section 4, where we observe that decreasing the number of languages in mNMT improves performance on XNLI.
3.2 MLDoc: Document Classification
MLDoc is a balanced subset of the Reuters corpus covering 8 languages for document classification (SCHWENK18.658). This is a 4-way classification task of identifying topics among CCAT (Corporate/Industrial), ECAT (Economics), GCAT (Government/Social), and MCAT (Markets). Performance is evaluated by classification accuracy. We split the document using the sentence-piece model and feed the first 200 tokens into the encoder for classification. The task-specific network and the optimizer are the same as those used for XNLI. The learning rate schedule is (0.2, 5k). We perform both in-language and zero-shot evaluation. The in-language setting has training, development, and test sets from the same language. In the zero-shot setting, the train and dev sets contain only English examples, but we test on all the languages. The results of both experiments are reported in Table 3.
MMTE's performance is on par with mBERT's for in-language training on all the languages. It slightly edges out mBERT on zero-shot transfer while lagging behind SOTA by 0.2 points. Interestingly, MMTE beats SOTA on Japanese by more than 8 points. This may be due to the different nature and amount of data used for pre-training by these methods.
3.3 Cross-lingual Intent Classification
schuster2018cross recently presented a dataset for multilingual task-oriented dialog. This dataset contains 57k annotated utterances in English (43k), Spanish (8.6k), and Thai (5k) with 12 different intents across the domains weather, alarm, and reminder. The evaluation metric is classification accuracy. We use this data for both in-language training and zero-shot transfer. The task-specific network and the optimizer are the same as those used for the above two tasks. The learning rate schedule is (0.1, 100k). Results are reported in Table 4. MMTE outperforms both mBERT and the previous SOTA in both the in-language and zero-shot settings on all 3 languages, establishing a new SOTA for this dataset.
3.4 POS Tagging
We use POS tagging data from Universal Dependencies v2.3 (nivre2018universal; zeman2018conll). Gold segmentation is used for training, tuning, and testing. The POS tagging task has 17 labels for all languages. We consider 48 different languages, chosen as the intersection of the languages with POS labels in the Universal Dependencies dataset and the languages supported by our mNMT model. The task-specific network consists of a one-layer feed-forward neural network with 784 units. Since MMTE operates at the subword level, we only consider the representation of the first subword token of each word. The optimizer is Adafactor with a learning rate schedule of (0.1, 40k). The evaluation metric is F1-score, which in our case equals accuracy since we use gold-segmented data. Results for both the in-language and zero-shot settings are reported in Table 5.
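The first-subword selection can be sketched as follows, assuming the SentencePiece convention that word-initial pieces carry the "▁" marker (the exact marker convention is an assumption for illustration):

```python
def first_subword_indices(subword_tokens):
    """Indices of subwords that start a word: the POS tag for each word is
    predicted from the representation at these positions only."""
    return [i for i, tok in enumerate(subword_tokens) if tok.startswith("▁")]

# "unbelievable story" split into subword pieces:
pieces = ["▁un", "believ", "able", "▁story"]
idx = first_subword_indices(pieces)  # -> [0, 3]: one tagged position per word
```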
While mBERT outperforms MMTE on in-language training by a small margin of 0.16 points, MMTE beats mBERT by nearly 0.6 points in the zero-shot setting. As with XNLI, MMTE outperforms mBERT on low-resource languages. Since mBERT is the SOTA for zero-shot cross-lingual transfer on POS tagging (wu2019beto), we establish a new state-of-the-art on this dataset by beating mBERT in this setting.
3.5 Named Entity Recognition
For NER, we use the datasets from the CoNLL 2002 and 2003 NER shared tasks, which together cover 4 languages (DBLP:journals/corr/cs-CL-0209010; sang2003introduction). The labeling scheme is IOB with 4 types of named entities. The task-specific network, optimizer, and learning rate schedule are the same as in the POS tagging setup. The evaluation metric is span-based F1. Table 6 reports the results of both the in-language and zero-shot settings.
MMTE performs significantly worse than mBERT on the NER task in all languages. On average, mBERT beats MMTE by 7 F1 points in the in-language setting and by more than 18 points in the zero-shot setting. We hypothesize two reasons for this: (1) mBERT is trained on clean Wikipedia data, which is entity-rich, while MMTE is trained on noisy web data with fewer entities, and (2) the translation task largely copies entities from the source to the target, so the model may never learn to accurately recognize them. This result points to the importance of the type of pre-training data and objective for downstream task performance. We plan to investigate this further in future work.
4 Analysis
In this section, we consider some additional settings for comparing mBERT and MMTE. We also investigate the impact of the number of languages and of the target language token on MMTE performance.
Feature-based Approach
In this setting, instead of fine-tuning the entire network of mBERT or MMTE, we fine-tune only the task-specific network, which holds a small percentage of the total number of parameters; the rest of the model parameters are frozen. We perform this experiment on the POS tagging task by fine-tuning a single-layer feed-forward neural network stacked on top of mBERT and MMTE. We report the results in Table 7. While the scores of the feature-based approach are significantly lower than those obtained via full fine-tuning (Table 5), MMTE still outperforms mBERT in both the in-language and zero-shot settings, by an even bigger margin. This is particularly interesting because the feature-based approach has its own advantages: 1) it is applicable to downstream tasks which require significant task-specific parameters on top of a transformer encoder, 2) it is computationally cheaper to train and tune the downstream model, and 3) it is compact and scalable, since only a small number of task-specific parameters are needed per task.
Few Shot Transfer
While zero-shot transfer is a good measure of a model's natural cross-lingual effectiveness, the more practical setting is few-shot transfer, since we can almost always access, or cheaply acquire, a small amount of data in the target language. We report the few-shot transfer results of mBERT and MMTE on the POS tagging dataset in Table 8. To simulate the few-shot setting, in addition to using the English data, we use 10 examples from each language (upsampled to 1000). MMTE outperforms mBERT in the few-shot setting by 0.6 points averaged over 48 languages. Once again, the gains are more pronounced on low-resource languages.
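The few-shot data construction can be sketched as follows; the exact upsampling scheme (simple repetition) is our illustration of the "10 examples upsampled to 1000" setup:

```python
def upsample(examples, target_size):
    """Repeat a small target-language set (e.g. 10 examples) until it reaches
    target_size (e.g. 1000), so it is not drowned out by the English data."""
    reps, rem = divmod(target_size, len(examples))
    return examples * reps + examples[:rem]

few = [f"target_lang_example_{i}" for i in range(10)]  # 10 labeled examples
train = ["english_example"] * 5000 + upsample(few, 1000)  # counts illustrative
```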
One Model for all Languages
Another setting of importance is the in-language training where instead of training one model for each language, we concatenate all the data and train one model jointly on all languages. We perform this experiment on the POS tagging dataset with 48 languages and report results in Table 9. We observe that MMTE performance is on par with mBERT. We also find that the 48 language average improves by 0.2 points as compared to the one model per language setting in Table 5.
Number of Languages in mNMT
We perform an ablation where we vary the number of languages used in the pre-training step. Apart from the 103 language setting, we consider 2 additional settings: 1) where we train mNMT on 4 languages to and from English, and 2) where we use 25 languages. The results are presented in Table 10. We see that as we scale up the languages the zero-shot performance goes down on both POS tagging and XNLI tasks. These losses align with the relative BLEU scores of these models suggesting that the regressions are due to interference arising from the large number of languages attenuating the capacity of the NMT model. Scaling up the mNMT model to include more languages without diminishing cross-lingual effectiveness is a direction for future work.
Effect of the Target Language Token
During the pre-training step, when we perform the translation task with the mNMT system, we prepend a 2xx token to the source sentence, where xx indicates the target language. The encoder has therefore always seen a 2en token in front of non-English sentences, and a variety of different tokens, depending on the target language, in front of English sentences. However, when fine-tuning on downstream tasks, we do not use this token. We believe this creates a mismatch between the pre-training and fine-tuning steps. To investigate this further, we perform a small-scale study where we train an mNMT model on 4 languages to and from English in two different settings: 1) where we prepend the 2xx token, and 2) where we do not prepend the 2xx token but instead encode it separately, with the decoder jointly attending over both the source sentence encoder and the 2xx token encoding. The BLEU scores on the translation tasks are comparable under both approaches. The cross-lingual zero-shot transfer results for both settings are provided in Table 11. Removing the 2xx token from the source sentence during mNMT training improves cross-lingual effectiveness on both the POS tagging and XNLI tasks. Training a massively multilingual NMT model that supports translation of 102 languages to and from English without using the 2xx token in the encoder is another direction for future work.
5 Related Work
We briefly review widely used approaches in cross-lingual transfer learning and some of the recent work in learning contextual word representations (CWR).
Multilingual Word Embeddings
For cross-lingual transfer, the most widely studied approach is to use multilingual word embeddings as features in neural network models. Several recent efforts have explored methods that align vector spaces for words in different languages faruqui2014improving; upadhyay2016cross; ruder2017survey.
More recent work has shown that CWRs obtained with unsupervised generative pre-training techniques, such as language modeling or the cloze task (taylor1953cloze), have led to state-of-the-art results beyond what was achieved with traditional word-type representations on many monolingual NLP tasks (peters2018deep; devlin2018bert; howard2018universal; radford2018improving), such as sentence classification, sequence tagging, and question answering. Subsequently, these contextual methods have been extended to produce multilingual representations by training a single model on text from multiple languages, which has proven very effective for cross-lingual transfer (wu2019beto; mulcaire2019polyglot; piresmultilingual). lample2019cross show that adding a translation language modeling (TLM) objective to mBERT's MLM objective exploits both monolingual and parallel data to further improve cross-lingual effectiveness.
Representations from NMT
The encoder from an NMT model has been used as yet another effective way to contextualize word vectors (mccann2017learned). Additionally, recent progress in NMT has enabled the training of multilingual NMT systems that support translation from multiple source languages into multiple target languages within a single model (johnson2017google). Our work is most closely related to two very recent works that explore the encoder of a multilingual NMT model for cross-lingual transfer learning (eriguchi2018zero; artetxe2018massively). While eriguchi2018zero also consider multilingual systems, they do so on a much smaller scale, training on only 2 languages. artetxe2018massively use a large-scale model comparable to ours, covering 93 languages, but they constrain the model by pooling encoder representations and therefore obtain only a single vector per sequence. Neither of these approaches has been applied to token-level sequence tagging tasks. Further, neither concerns itself with the performance of the actual translation task, whereas our mNMT model performs comparably to bilingual baselines in terms of translation quality.
6 Conclusion and Future Work
We train a massively multilingual NMT system using parallel data from 103 languages and exploit representations extracted from the encoder for cross-lingual transfer on various classification and sequence tagging tasks spanning over 50 languages. We find that the positive language transfer visible in improved translation quality for low resource languages is also reflected in the cross-lingual transferability of the extracted representations. The gains observed on various tasks over mBERT suggest that the translation objective is competitive with specialized approaches to learn cross-lingual embeddings.
We find that there is a trade-off between the number of languages in the multilingual model and the quality of the learned representations, due to the model's limited capacity. Scaling up the model to include more languages without diminishing transfer-learning capability is a direction for future work. Finally, one could also consider integrating mBERT's objective with the translation objective when pre-training the mNMT system.
Appendix A Supplementary Material
In this section we provide the list of languages codes used throughout this paper and the statistics of the datasets used for the downstream tasks.
[Tables of dataset statistics for each downstream task: numbers of training, dev, and test samples.]