The longstanding goal of multilingual machine translation firat16; johnson16; aharoni19; gu2018universal has been to develop a universal translation model, capable of providing high-quality translations between any pair of languages. Due to limitations on the data available, however, current approaches rely on first selecting a set of languages for which we have data and training an initial translation model on this data jointly for all languages in a multi-task setup. In an ideal setting, one would continually update the model once data for new language pairs arrives. This setting, dubbed in the literature as continual learning ring1994continual; rebuffi2017icarl; kirkpatrick2017overcoming; lopez2017gradient, introduces new challenges not found in the traditional multi-task setup, most famously catastrophic forgetting mccloskey1989catastrophic, in which the model may lose its previously-learned knowledge as it learns new language pairs. This situation is further complicated by the training procedures of standard tokenizers, such as Byte-Pair Encoding (BPE) sennrich2015neural or Sentencepiece kudo18, which necessitate access to monolingual data for all the languages considered before producing the vocabulary. Failing to comply with these requirements, one risks suboptimal segmentation rules which in the worst case could result in strings of entirely <UNK> tokens for text in a previously-unseen alphabet.
In this work, we investigate how vocabularies derived from BPE transform if they are rebuilt with the same settings but with additional data from a new language. We show in Section 3.1 that there is a large token overlap between the original and updated vocabularies. This large overlap allows us to retain the performance of a translation model after replacing its vocabulary with the updated vocabulary that additionally supports a new language.
Past works have explored adapting translation models to new languages, typically focusing on related languages which share similar scripts gu2018universal; neubig2018rapid; lakew2019adapting; chronopoulou2020reusing. These works usually focus solely on learning the new language pair, with no consideration for catastrophic forgetting. Moreover, these works only examine the setting where the new language pair comes with parallel data, despite the reality that for a variety of low-resource languages, we may only possess high-quality monolingual data with no access to parallel data. Finally, unlike our approach, these approaches do not recover the vocabulary one would have built if one had access to the data for the new language from the very beginning.
Having alleviated the vocabulary issues, we study whether we are able to learn the new language pair rapidly and accurately, matching the performance of a model which had access to this data at the beginning of training. We propose a simple adaptation scheme that allows our translation model to attain competitive performance with strong bilingual and multilingual baselines with a small number of additional gradient steps. Moreover, our model retains most of the translation quality on the original language pairs it was trained on, exhibiting no signs of catastrophic forgetting.
2 Continual learning via vocabulary substitution
Adapting translation models to new languages has been studied in the past. neubig2018rapid showed that a large multilingual translation model trained on a subset of languages of the TED dataset qi2018and could perform translation on the remaining (related) languages. tang2020multilingual extended the multilingual translation model mBART liu2020multilingual from 25 to 50 languages by exploiting the fact that mBART’s vocabulary already supported those additional 25 languages. escolano2021bilingual added new languages to machine translation models by training language-specific encoders and decoders. Other works zoph16; lakew2018transfer; lakew2019adapting; escolano-etal-2019-bilingual have studied repurposing translation models as initializations for bilingual models for a target low-resource language pair. Most recently, chronopoulou2020reusing examined reusing language models for high-resource languages as initializations for unsupervised translation models for a related low-resource language through the following recipe: build a vocabulary and a language model for the high-resource language; once data for the low-resource language arrives, build a joint vocabulary and identify the tokens of the low-resource language that appear in it; substitute the language model’s vocabulary with one that includes these tokens; and use the language model as the initialization for the translation model.
In this work, we are not only interested in the performance of our multilingual translation models on new language pairs; we also require that our models retain their performance on the language pairs that they were initially trained on. We are also interested in how the performance of these models compares with that obtained in the oracle setup, where all the data is available from the start. The approaches discussed above generate vocabularies that likely differ (both in the selection and the number of tokens) from the vocabulary one would obtain with a priori access to the missing data, due to the special attention given to the new language. This architectural divergence will only grow as we continually add new languages, which inhibits comparisons to the oracle setup. We eliminate this mismatch by first building a vocabulary on the languages available; then, once the new language arrives, we build a new vocabulary as we would have if we had possessed the data from the beginning and replace the original vocabulary with it. We then reuse the embeddings for tokens in the intersection and continue training. (Tokens shared between the two vocabularies are also forced to share the same indices. The remaining tokens are rewritten, but we still reuse the outdated embeddings.)
The success of our approach relies on the fact that, when the number of original languages is large (i.e. the multilingual setting), the original and updated vocabularies are mostly equivalent, which allows the model to retain its performance after we substitute vocabularies. We verify this in the following section.
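The index-preserving substitution described above can be sketched as follows. This is a minimal illustration under our own assumptions, not the authors' implementation: the function name `align_vocabulary` and the toy tokens are hypothetical, and we assume both vocabularies contain unique tokens and have the same size because they are built with the same settings.

```python
def align_vocabulary(old_vocab, new_vocab):
    """Reorder new_vocab so that shared tokens keep their old indices.

    Tokens that only exist in the new vocabulary are written into the
    slots of tokens that dropped out. The embedding matrix itself is left
    untouched, so a new token inherits the (outdated) embedding of the
    slot it lands in.
    """
    if len(old_vocab) != len(new_vocab):
        raise ValueError("this sketch assumes equally sized vocabularies")
    old_set, new_set = set(old_vocab), set(new_vocab)
    if len(old_set) != len(old_vocab) or len(new_set) != len(new_vocab):
        raise ValueError("this sketch assumes unique tokens")

    # Tokens introduced by the new language, in their new-vocabulary order.
    incoming = iter(tok for tok in new_vocab if tok not in old_set)

    aligned = []
    for tok in old_vocab:
        if tok in new_set:
            aligned.append(tok)             # shared token: same index as before
        else:
            aligned.append(next(incoming))  # freed slot: rewritten with a new token
    return aligned


# Hypothetical toy vocabularies; "bn_a" and "bn_b" stand in for new-script tokens.
old_vocab = ["<pad>", "the", "ing", "qu", "zz"]
new_vocab = ["<pad>", "the", "bn_a", "ing", "bn_b"]
aligned_vocab = align_vocabulary(old_vocab, new_vocab)
```

In this toy case the shared tokens `<pad>`, `the`, and `ing` stay at indices 0, 1, and 2, while the two new tokens occupy the slots previously held by `qu` and `zz`.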
In this section, we outline the set of experiments we conducted in this work. We first discuss the languages and data sources we use for our experiments. We then provide the details of how we trained our initial translation models. Next, we compute the token overlap between various vocabularies derived from BPE before and after we include data for a new language and empirically verify that this overlap is large if the vocabulary already supports a large number of languages. We then examine the amount of knowledge retained after vocabulary substitution by measuring the degradation of the translation performance on the original language pairs from replacing the original vocabulary with an updated one. Finally, we examine the speed and quality of the adaptation to new languages under various settings.
Our initial model has access to data from 24 languages (in alphabetical order: Bulgarian, Czech, Danish, German, Greek, English, Spanish, Estonian, Finnish, French, Gujarati, Hindi, Croatian, Hungarian, Italian, Lithuanian, Latvian, Portuguese, Romanian, Russian, Slovak, Slovenian, Tamil). Our monolingual data comes primarily from the newscrawl datasets (http://data.statmt.org/news-crawl/) and Wikipedia, while the parallel data comes from the WMT training sets and Paracrawl. We adapt our model to the following four languages: Kazakh, which is not related linguistically to any of the original 24 languages, but does share scripts with Russian and Bulgarian; Bengali, which is related to the other Indo-Aryan languages but possesses a distinct script; Polish, which is related to (and shares scripts with) many of the Slavic languages in our original set; and Pashto, which is not closely related to any of the languages in our original set (the closest languages are in the Indic branch, but the Indic and Iranian branches split over 4000 years ago) and has a distinct script. We provide an in-depth account of the data available for each language in the appendix.
We perform our experiments in JAX jax2018github, using the neural network library FLAX (https://github.com/google/flax). We use Transformers vaswani17 as the basis of our translation models. We use the Transformer Big configuration and a shared BPE model of 64k tokens with byte-level fallback, built with the Sentencepiece library (we use 1.0 character coverage, split by whitespace and digits, and include a special token MASK for the MASS objective). We use a maximum sequence length of 100 and discard all longer sequences during training.
We train our models leveraging both monolingual and parallel datasets, following previous work siddhant-etal-2020-leveraging; garcia20. We sample examples from monolingual and parallel sources with equal probability. Within each source, we use a temperature-based sampling scheme based on the number of samples in the relevant datasets, with a temperature of 5 arivazhagan19a.
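As a rough illustration of temperature-based sampling, the sketch below raises each dataset's share of the data to the power 1/T and renormalizes. The exact formula is our assumption (it is the common formulation of this scheme), and the function name and dataset sizes are hypothetical.

```python
def temperature_sampling_probs(sizes, temperature=5.0):
    """Map dataset sizes to sampling probabilities.

    Assumed formulation: p_i is proportional to (n_i / sum_j n_j) ** (1 / T).
    T = 1 recovers proportional sampling; larger T flattens the distribution
    toward uniform, which upsamples small datasets.
    """
    total = sum(sizes.values())
    weights = {
        name: (count / total) ** (1.0 / temperature)
        for name, count in sizes.items()
    }
    norm = sum(weights.values())
    return {name: weight / norm for name, weight in weights.items()}


# Hypothetical dataset sizes (number of examples), for illustration only.
example_sizes = {"en-de": 4_000_000, "en-ru": 1_000_000, "en-gu": 10_000}
probs_t1 = temperature_sampling_probs(example_sizes, temperature=1.0)
probs_t5 = temperature_sampling_probs(example_sizes, temperature=5.0)
```

With a temperature of 5, the smallest dataset in this toy example receives a much larger share of the samples than its share of the raw data.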
We apply the MASS objective song19 on the monolingual data and cross-entropy on the parallel data. We use the Adam kingma14 optimizer, with an initial learning rate of 4e-4, coupled with a linear warmup followed by a linear decay to 0. The initial warmup takes 1k steps, and the total training time is 500k steps. We also include weight decay with a hyperparameter of 0.2.
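The learning rate schedule can be sketched as a plain function of the step count. We assume here that the warmup starts from zero, that 4e-4 is the peak value reached at the end of the warmup, and that the decay begins immediately afterwards; these details, and the function name, are our assumptions rather than statements from the text.

```python
def learning_rate(step, peak=4e-4, warmup_steps=1_000, total_steps=500_000):
    """Linear warmup to `peak`, then linear decay to 0 at `total_steps`."""
    if step < warmup_steps:
        # Linear warmup from 0 to the peak learning rate.
        return peak * step / warmup_steps
    # Linear decay from the peak down to 0; clamp after training ends.
    remaining = max(total_steps - step, 0)
    return peak * remaining / (total_steps - warmup_steps)


# A few points along the schedule.
schedule_points = {step: learning_rate(step) for step in (0, 500, 1_000, 500_000)}
```

In practice one would likely express the same schedule with the optimizer library's built-in schedule utilities; the explicit function is only meant to make the shape unambiguous.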
We use beam search with a beam size of 4 and a length penalty for decoding. We evaluate the quality of our models using BLEU scores papineni2002bleu. We exclusively use detokenized BLEU, computed through sacreBLEU post-2018-call, for consistency with previous work and future reproducibility (sacreBLEU signature: BLEU + case.mixed + numrefs.1 + smooth.exp + tok.13a + version.1.4.14).
3.1 Transfer learning from vocabulary substitution
|# langs in base||bn||pl||kk||ps||bn+pl+kk+ps|
|Model||PMIndia bn↔en||newsdev2020 pl↔en||newstest2019 kk↔en||FLoRes devset ps↔en|
|xx monolingual & parallel||5.7 / 13.6||20.2 / 26.2||3.9 / 17.2||2.8 / 10.3|
|4xx monolingual & parallel||5.3 / 15.1||18.3 / 25.0||2.7 / 15.8||2.3 / 8.4|
|xx monolingual (+BT)||- / -||21.3 / 24.1||4.7 / 19.5||- / -|
|xx monolingual & parallel||10.0 / 27.2||21.5 / 27.5||5.9 / 20.2||6.6 / 15.1|
|4xx monolingual & parallel||10.5 / 26.4||20.3 / 26.8||5.6 / 20.5||6.7 / 15.2|
|Oracle: xx monolingual & parallel||10.1 / 29.2||19.6 / 26.8||5.4 / 20.5||6.6 / 14.7|
|Oracle: 4xx monolingual & parallel||10.0 / 28.6||18.9 / 26.4||5.4 / 20.3||6.2 / 14.4|
Measuring token overlap
We now examine the impact on the vocabulary derived from a BPE model of including a new language. We first build corpora consisting of text from 1, 5, 10, 15, 20, and 24 of our original languages, using 1 million lines of raw text per language. For each corpus, we make copies and add additional data for either Bengali, Polish, Kazakh, Pashto, or their union, yielding a total of 30 corpora. We build BPE models using the same settings for each corpus, compute the token overlap between the vocabularies with and without the additional language, and report the results in Table 1. In the multilingual setting, we attain large token overlap, more than 90%, even for languages with distinct scripts or when we add multiple languages at once. We extend this analysis to different vocabulary sizes and examine which tokens are “lost” in Appendix A.3.
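The overlap measurement can be sketched as a set computation over two token lists. We assume here that overlap is the fraction of the base vocabulary that survives in the extended one; the text does not state the exact definition, and the function name and toy vocabularies are hypothetical.

```python
def token_overlap(base_vocab, extended_vocab):
    """Fraction of tokens in base_vocab that also appear in extended_vocab."""
    base = set(base_vocab)
    return len(base & set(extended_vocab)) / len(base)


# The experimental grid from the text: 6 base corpora x 5 additions = 30 corpora.
base_sizes = [1, 5, 10, 15, 20, 24]
additions = ["bn", "pl", "kk", "ps", "bn+pl+kk+ps"]
num_corpora = len(base_sizes) * len(additions)

# Toy vocabularies for illustration: three of the four base tokens survive.
toy_overlap = token_overlap(["a", "b", "c", "d"], ["a", "b", "c", "x"])
```

Because both vocabularies are built with the same size budget, this fraction is the same whichever of the two vocabularies is used as the denominator.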
3.2 Evaluating translation quality and catastrophic forgetting
Measuring the deterioration from swapping vocabularies at inference
To measure the amount of knowledge transferred through the vocabulary substitution, we compute the translation performance of our initial translation model with the adapted vocabularies without any additional updates. For each new language, we compute the change in BLEU between the model with its original vocabulary and the model using the adapted one, and plot the results in Figure 1. Notably, we only incur minor degradation in performance from the vocabulary substitution.
We now study the effect of introducing a new language into our translation model. We require an adaptation recipe which enjoys the following properties: fast, in terms of number of additional gradient steps; performant, in terms of BLEU scores on the new language pair; retentive, in terms of minimal regression in the translation performance of the model on the original language pairs.
Our solution: re-compute the probabilities for the temperature-based sampling scheme using the new data, upscale the probabilities of sampling the new datasets by a fixed factor, and then rescale the remaining probabilities so that all probabilities sum to one. To ensure fast adaptation, we limit ourselves to either 15k or 30k additional steps (3% and 6%, respectively, of the training time for the original model), depending on the data available: we use 15k steps if we leverage both monolingual and parallel data for a single language pair, and 30k steps if we only use monolingual data or if we are adapting to all four languages at once. We reset the Adam optimizer’s stored accumulators, reset the learning rate to 5e-5, and keep it fixed. We provide more details in Appendix A.2. Aside from these modifications, we continue training with the same objectives as before unless noted otherwise. We include the results for oracle models trained in the same way as the original model but with access to both the adapted vocabulary and the missing data. We compute the BLEU scores and report them in Table 2.
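The upscale-then-rescale step can be sketched as follows. This is one possible reading of the recipe: the new datasets' probabilities are multiplied by a factor, and the original datasets share the remaining probability mass in their previous proportions. The function name, the dataset names, and the factor used in the example are hypothetical.

```python
def upscale_new_datasets(probs, new_keys, factor):
    """Multiply the probabilities of new datasets by `factor`, then rescale
    the remaining probabilities so that everything sums to one."""
    new_keys = set(new_keys)
    boosted = {key: probs[key] * factor for key in new_keys}
    boosted_mass = sum(boosted.values())
    if boosted_mass >= 1.0:
        raise ValueError("factor too large: no probability mass left")

    old_mass = sum(p for key, p in probs.items() if key not in new_keys)
    scale = (1.0 - boosted_mass) / old_mass
    return {
        key: boosted[key] if key in new_keys else p * scale
        for key, p in probs.items()
    }


# Hypothetical probabilities after re-running temperature-based sampling.
recomputed = {"en-de": 0.5, "en-fr": 0.4, "en-kk": 0.1}
adapted_probs = upscale_new_datasets(recomputed, new_keys=["en-kk"], factor=3.0)
```

The original datasets keep their relative ordering, which is one way to keep replaying the old language pairs while the new one is emphasized.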
Our models adapted with parallel data are competitive with the oracle models, even when we add all four languages at once and despite the restrictions we imposed on our adaptation scheme. For languages that share scripts with the original ones (Kazakh and Polish), we can also attain strong performance leveraging monolingual data alone, although we need to introduce back-translation sennrich2015improving for optimal performance. We can also adapt the translation model using the original vocabulary, but the quality lags behind the models using the adapted vocabularies. This gap is larger for Bengali and Pashto, where the model is forced to rely on byte-level fallback, further reaffirming the value of using the adapted vocabularies.
To examine whether catastrophic forgetting has occurred, we proceed as in Section 3.1 and compare the performance on the original language pairs after adaptation on the new data against the oracle model, which had access to this data from the beginning of training. We present the results for the models adapted to Kazakh in Figure 2. The performance of all models on the original language pairs deviates only slightly from that of the oracle model, mitigating some of the degradation from the vocabulary substitution (compare the kk and bn+pl+kk+ps curves in Figure 1 to the curves in Figure 2).
Lastly, we compare our models with external baselines for Kazakh. We consider the multilingual model mBART liu2020multilingual as well as all the WMT submissions that reported results on English↔Kazakh. Of these baselines, only mBART and kocmi-etal-2018-cuni use sacreBLEU, which inhibits proper comparison with the rest of the models; we include them for completeness. We report the scores in Table 3. Our adapted models outperform mBART in both directions, as well as some of the weaker WMT submissions, despite those models specifically optimizing for that language pair and task.
|xx monolingual (+BT)||4.7||19.5|
|xx monolingual & parallel||5.9||20.2|
|4xx monolingual & parallel||5.6||20.5|
We present an approach for adding new languages to multilingual translation models. Our approach allows for rapid adaptation to new languages with distinct scripts with only a minor degradation in performance on the original language pairs.
Appendix A
A.1 Data statistics and details
A.2 Adaptation schemes
We now explain in detail our configurations:
Monolingual data for a single language
In this case, we compute the probabilities following the temperature-based sampling scheme that we would have obtained had we computed them with this data in the first place. We then set the sampling probability of the new monolingual data to 30% and rescale the remaining probabilities so that they add up to 1.
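Fixing one dataset's sampling probability and renormalizing the rest can be sketched as below. The helper name `pin_probability` and the dataset names are hypothetical, and we assume the remaining datasets keep their relative proportions.

```python
def pin_probability(probs, key, value):
    """Set probs[key] to `value` and rescale all other entries so that the
    full distribution sums to one."""
    if not 0.0 <= value < 1.0:
        raise ValueError("pinned probability must lie in [0, 1)")
    rest_mass = sum(p for name, p in probs.items() if name != key)
    scale = (1.0 - value) / rest_mass
    return {
        name: value if name == key else p * scale
        for name, p in probs.items()
    }


# Hypothetical monolingual sampling probabilities including the new language.
mono_probs = {"en": 0.45, "de": 0.35, "ru": 0.15, "kk": 0.05}
pinned_probs = pin_probability(mono_probs, key="kk", value=0.30)
```

The same helper would cover the other configurations that fix a dataset to 10% or 5%, applied once per pinned dataset only if the pinned values are meant to be renormalized sequentially; the text does not specify this detail.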
Monolingual data for a single language coupled with back-translation
In order to properly utilize back-translation, we first train the model for 10k steps in the same fashion as in the previous paragraph. Then, we use offline back-translation on the new monolingual data to generate pseudo-parallel data. We then treat this data as authentic and include it in training. We set the sampling probability of the pseudo-parallel data to 10%, reset the sampling probability of the monolingual data to 10%, and rescale the rest so that they sum to 1. We then continue training for an additional 20k steps, amounting to a total of 30k steps.
Monolingual & parallel data for a single language
We multiply the probabilities of the new parallel data by a factor of 10, set the sampling probability of the monolingual data to 10%, and then rescale the remaining probabilities so that they are normalized. We then train for 15k steps.
Monolingual & parallel data for all four languages
We do not use the same scaling as before, since this would aggressively undersample the original language pairs. Instead, we first average the total probabilities for the new parallel data, multiply this average by 5, and then assign the resulting probability to each of the new parallel datasets. We then fix the probability of sampling the new monolingual datasets to be 5% each. We then train for 30k steps.
A.3 Token overlap analysis
Next, we examine which tokens are lost during the vocabulary substitution. Since the Sentencepiece library does not provide an easy way to acquire frequency scores for BPE models after training, we instead use the order of the tokens as a proxy for the relative ranking obtained by sorting the tokens by frequency. For each language, we produce violin plots for the indices in the original vocabulary which are not in the adapted vocabulary for that language in Figure 3.
Critically, we observe that most of the tokens lost are towards the end of the spectrum, suggesting that the model is mostly discarding infrequent tokens. Notably, it cannot discard the tail due to our requirement of full character coverage, which introduces a variety of rare Unicode characters as tokens that reside in the tail.
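The index-based analysis can be sketched as follows: we collect the positions in the original vocabulary of tokens missing from the adapted vocabulary, using the position as a proxy for frequency rank. The function names and toy vocabularies are hypothetical, and the normalization to [0, 1] is our own addition for readability.

```python
def lost_token_indices(original_vocab, adapted_vocab):
    """Indices in original_vocab of tokens absent from adapted_vocab."""
    adapted = set(adapted_vocab)
    return [i for i, tok in enumerate(original_vocab) if tok not in adapted]


def relative_positions(indices, vocab_size):
    """Map indices to [0, 1], where 1.0 is the last position in the vocabulary."""
    return [i / (vocab_size - 1) for i in indices]


# Toy vocabularies: the two tokens that drop out sit at the end of the list.
toy_original = ["a", "b", "c", "d", "e"]
toy_adapted = ["a", "b", "c", "x", "y"]
toy_lost = lost_token_indices(toy_original, toy_adapted)
toy_relative = relative_positions(toy_lost, len(toy_original))
```

The resulting lists of indices, one per new language, are the kind of input one would pass to a violin plot.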
|# languages in base model||bn||pl||kk||ps||bn+pl+kk+ps|
|Language||Monolingual data (# of lines)||Parallel data (# of examples)||Domain (Monolingual data)||Domain (Parallel data)||Test set||Language family||BLEU en-xx||BLEU xx-en|
|Language||Monolingual data (# of lines)||Parallel data (# of examples)||Domain (Monolingual data)||Domain (Parallel data)||Test set||Language family|
|kk||4032908||222424||Newscrawl + Wiki Dumps||WMT||WMT||Turkic|
|ps||6969911||1134604||Newscrawl + CommonCrawl||WMT||WMT||Indo-Iranian|