Building Machine Translation Systems for the Next Thousand Languages

by Ankur Bapna et al.

In this paper we share findings from our effort to build practical machine translation (MT) systems capable of translating across over one thousand languages. We describe results in three research domains: (i) Building clean, web-mined datasets for 1500+ languages by leveraging semi-supervised pre-training for language identification and developing data-driven filtering techniques; (ii) Developing practical MT models for under-served languages by leveraging massively multilingual models trained with supervised parallel data for over 100 high-resource languages and monolingual datasets for an additional 1000+ languages; and (iii) Studying the limitations of evaluation metrics for these languages and conducting qualitative analysis of the outputs from our MT models, highlighting several frequent error modes of these types of models. We hope that our work provides useful insights to practitioners working towards building MT systems for currently understudied languages, and highlights research directions that can complement the weaknesses of massively multilingual models in data-sparse settings.



1 An Overview

The past decade has seen tremendous improvements in the quality of academic and commercial machine translation (MT) systems. These improvements have largely been driven by advances in machine learning and the availability of large-scale web-mined datasets (Resnik & Smith, 2003; Uszkoreit et al., 2010; Esplà-Gomis, 2009; Esplà et al., 2019; Bañón et al., 2020; Schwenk et al., 2021). The advent of deep learning and sequence-to-sequence models (Sutskever et al., 2014; Bahdanau et al., 2015; Luong et al., 2015; Vaswani et al., 2017), large parallel (and monolingual) datasets mined from the web, data augmentation approaches like back-translation (Sennrich et al., 2016; Edunov et al., 2018) and self-training (He et al., 2019), and massively multilingual modeling (Firat et al., 2016a; Johnson et al., 2017; Aharoni et al., 2019a; Arivazhagan et al., 2019; Tang et al., 2021; Fan et al., 2021) have enabled high quality machine translation systems that can support over 100 languages.

However, despite tremendous progress in low-resource MT, the number of languages for which widely-available, general-domain MT systems have been built has been limited to around 100, a small fraction of the over 7,000 languages that are spoken in the world today. Apart from the limited number of languages, the distribution of languages supported by current MT systems is highly skewed in favour of European languages. Despite large speaker populations, languages spoken in Africa, South and South-East Asia, and indigenous languages of the Americas are relatively under-served. For example, Google Translate supports Frisian, Maltese, Icelandic, and Corsican, each with fewer than 1M L1 speakers, but not (up until this work) Bhojpuri (~51M speakers), Oromo (~24M speakers), Quechua (~9M speakers), or Tigrinya (~9M speakers) (van Esch et al., 2022). We will refer to these languages as long-tail languages, since data scarcity requires the application of machine learning techniques that can generalize beyond the languages for which ample training data is available.

Web-crawled datasets for 1000 languages: The progress towards building machine translation systems in these languages has largely been limited by the lack of digitized and accessible datasets and NLP tools like language identification (LangID) models; such resources are ubiquitous for higher resource languages. The first stage of this paper describes our approach to building monolingual web text corpora in over 1500 languages, with a particular focus on dealing with common noise, data quality and scale challenges encountered when building a dataset from the web (Section 2).

Towards this goal, we first scale LangID models to 1500+ languages (Section 2.1.1), using both traditional n-gram models and semi-supervised approaches. We next describe several practical approaches and filtering techniques that enable using these LangID models to identify long-tail language data on the web. To minimize recall loss during mining, we cluster languages by error rate (2.1.3). To reduce noise from LangID mis-predictions, we leverage document-level LangID consistency to filter our data (2.1.4), followed by percent-threshold wordlist filtering (2.1.5), Tf-iif filtering (2.1.7), and hand-designed filters to address specific noise issues with certain languages (2.1.8). We describe the resulting dataset in Section 2.2 and perform an audit of 72 language corpora (2.2.2), finding that the data is between 70% and 100% in-language, with a median score of 80%.

Machine Translation for long-tail languages: With monolingual data mined from the web, the next challenge is to build high quality, general-domain MT models from limited amounts of monolingual training data. We follow a practical approach, utilizing all the parallel data that is available for higher resource languages to boost the quality of long-tail languages where only monolingual data is available (preliminary experiments indicated no quality improvements on our evaluation sets when incorporating widely available limited-domain parallel corpora into our models). We will refer to this setting as zero-resource, since no direct supervision is available for our long-tail languages. We want to emphasize that this term is used only in the technical sense, meaning that the model itself does not see any parallel text; in reality, there is a richness of resources for these languages, including tens of millions of native speakers, centuries (or in some cases millennia) of scholarship, and even large segments of text inaccessible to digital methods. See Bird (2020) for further reflections on this term.

We leverage several techniques that have been developed for MT over the last few years in order to boost the quality of zero-resource translation for long-tail languages. These include self-supervised learning from monolingual data, massively multilingual supervised training, large-scale back-translation and self-training, and high-capacity models. We utilize these tools to build MT models capable of translating across over 1000 languages, utilizing our existing parallel corpus spanning around 100 languages and the 1000-language monolingual dataset built from the web (Section 3).


We first highlight the importance of model capacity in highly multilingual models by comparing the performance of 1.5B and 6B parameter Transformers on zero-resource translation (3.2). We then scale up the number of self-supervised languages to 1000, demonstrating that the performance of most long-tail languages improves as more monolingual data from similar languages becomes available (3.3). While our 1000-language models demonstrate reasonable performance, in order to understand the strengths and limitations of the approach we incorporate large-scale data augmentation. For practical purposes we fine-tune the resulting model on a subset of 30 languages with large amounts of synthetic data via self-training and back-translation (3.4). We further describe practical approaches to filter synthetic data to increase the robustness of these fine-tuned models to hallucinations and wrong-language translations (3.5). We also describe our efforts to distill these models into smaller, more inference-friendly architectures using sequence-level distillation, and highlight the performance gaps between the teacher and student models (3.6).

Evaluating MT for 1000 languages: Existing MT systems heavily rely on n-gram overlap-based lexical metrics like Bleu (Papineni et al., 2002), or on newer, emerging model-based metrics like YiSi (Lo, 2019), BLEURT (Sellam et al., 2020), or COMET (Rei et al., 2020), to evaluate translation quality. These metrics are usually computed on static evaluation sets with fixed references obtained from professional translators or crowd-workers.

To evaluate our MT models, we first build an evaluation set for 38 selected languages from the long-tail (4.1), by translating English sentences into these languages. We highlight the limitations of Bleu in the long-tail setting and make the case for evaluating these languages with ChrF (4.2). We also propose an approximate, round-trip translation based, reference-free metric to understand the quality of our models on languages where reference sets were unavailable, and report the quality of our models as measured on this metric (4.3). We conduct and report the findings from human evaluations of our models (on a subset of 28 languages), confirming that it is possible to build functioning MT systems by following the recipe described in this paper (4.4).

To understand the weaknesses of our massively-multilingual zero-resource models, we perform qualitative error analysis on several languages. We find that our models often confuse distributionally similar words and concepts, e.g. “tiger” becomes a “miniature crocodile” (4.5), and that translation quality deteriorates on less frequent tokens in lower-resource settings (4.6). We also find that these models often fail to adequately translate short or single-word inputs (4.7). A study of our distilled models also reveals that all of these models are prone to magnifying biases or noise present in the training data (4.8).

Other miscellaneous findings and experiments: We perform a few additional experiments on these models, demonstrating that they often perform better when translating directly between similar languages — rather than using English as a pivot (5.1) and that they can be utilized for zero-shot transliteration between different scripts (5.2). We describe a practical technique of appending terminal punctuation to any input, called the “period trick”, that improves the quality of translation (5.3). Furthermore, we demonstrate that these models are robust to nonstandard Unicode glyph usage for some but not all languages (5.4), and we explore several non-Unicode fonts (5.5).

Throughout the course of this work, we relied extensively on native speakers of these languages to guide, evaluate, and often contribute technically to the development of these systems. Section 6 highlights the important role they played, a role that researchers unfamiliar with the language and community could not have filled.

As a result of the work outlined in this paper, we add support for 24 new languages to Google Translate. Adding a language to a user-facing product requires considerable effort and individual attention to evaluate each language thoroughly and consult members of affected communities; as a result we limited this effort to 30 languages, of which 24 met our launch bar. This set of languages was chosen to cover languages with large speaker populations in regions that are under-represented in technology, like the Americas, Africa, and South Asia.

2 Building a 1000-Language Web text Dataset

It is difficult to uncover large, clean, and highly multilingual corpora from the web. As explored in Caswell et al. (2020), the task of using LangID models for web-mining is exceedingly challenging for low-resource languages, with noise arising from domain mismatch between LangID training data and web text, idiosyncrasies of n-gram based LangID models, and the massive class imbalance between high- and low-resource languages. This is further borne out by Kreutzer et al. (2022), who demonstrate that a variety of public multilingual corpora, created using techniques that work for high-resource languages, are unusably noisy for many low-resource languages.

This problem is compounded by the size of the internet. For lower-resource languages, it is important to get as much data as possible to have any usable signal for a model to learn from. However, the sheer quantity of data online makes it infeasible to apply computationally intensive methods, like Transformer-based models; and even storing unfiltered data on disk can be challenging.

The following subsections explain in detail the approach we took to crawl a monolingual text dataset for 1500+ languages. Our approach focused on recovering high-precision data (high percentage clean, in-language text), so a large portion of the steps are various filtration approaches. The work in this section is an extension of and improvement over the methods presented in Caswell et al. (2020).

In summary, our approach is as follows:


  1. Omit languages with poor quality training data and poor LangID performance from the LangID model; train both a 1,629-language CLD3 (n-gram) LangID model and a Semi-Supervised LangID (SSLID) model

  2. Cluster languages by their error rate in the CLD3 model

  3. Perform a first-pass webcrawl with the CLD3 model

  4. Filter sentences with document-consistency filtering

  5. Filter all corpora with percent-threshold wordlist filtering

  6. Filter all corpora with Semi-Supervised LangID (SSLID) filtering

  7. Detect outlier languages with the Relative Recall Rate and filter them with Term-Frequency Inverse-Internet-Frequency (Tf-iif) filtering

  8. Detect outliers with the Token-Frequency Anomalousness score and hand-design filters for them

  9. De-duplicate all corpora at the sentence level

2.1 Steps necessary to create the dataset

2.1.1 LangID modeling

As described in Caswell et al. (2020), the task of reliably detecting what language a given portion of text is in, known as Language Identification or LangID, is surprisingly challenging for low-resource languages on the web.

There are two types of LangID models used in this work: CLD3 models (Bakalov et al., 2016) and Semi-Supervised LangID (SSLID) models (Caswell et al., 2020). The CLD3 models are a multi-layer perceptron on top of a bag of character n-grams; they are very efficient, but suffer from various error pathologies. The semi-supervised models use the MASS objective (Song et al., 2019) on noisy web text in addition to the LangID prediction task, and use the Transformer Big (Vaswani et al., 2017) architecture. They are generally more accurate and more robust to web noise, but are much slower to run inference with.

Overall, three LangID models were used for this work: 1) a 1,745-language CLD3 model to measure the quality of the training data (Section 2.1.2); 2) a 1,629-language CLD3 model to cluster languages and label sentences when performing a deep crawl of the web (Sections 2.1.3 and 2.1.4); and 3) a 1,629-language semi-supervised LangID model for filtering (Section 2.1.6).

The training and evaluation data for our LangID models is identical to that described in Caswell et al. (2020); namely, we trained on an aggregation of proprietary and publicly available text corpora, with an average of 800K tokens per language. Some of the data came from sources with language tags like Wikipedia or Corpus Crawler (Brawer, 2017), while another subset was created using a text elicitation task where we prompted native speakers to write sentences in their language (van Esch et al., 2019).

2.1.2 Paring down the set of languages

An important first step is determining what languages to crawl the web for. In an ideal world we would crawl all possible languages, but some of them may have poor-quality training data, or be very close to each other and be hard to distinguish.

We trained a CLD3 LangID model on 1,745 language varieties, and investigated languages that appeared to suffer large quality losses. Our criteria for flagging a language as possibly problematic were: 1) LangID precision under 33%; 2) over 50% confusion (False Negative Rate, FNR, or False Discovery Rate, FDR) with respect to another language; 3) a preliminary web-crawl with this model indicating severe overtriggering on a high-resource language; 4) LangID training data with under 2,000 examples. Following this analysis we removed 116 languages from the model, primarily because of poor-quality training data. Most of these were regional dialects in Europe.

We additionally did not try to crawl data for the highest-resource 83 languages, in part because of storage space limitations, and in part as we already had monolingual data for these languages from a previous web crawl.

2.1.3 False Negative Rate clustering

For the first-pass crawl, we wanted to avoid losing recall across related dialects, of which many existed among the 1,629 languages. For instance, our training data contains many different varieties of Hindustani, e.g. Haryanvi, Garhwali, Magahi, and so on.

To mitigate this, we clustered languages based on their False Negative Rate (FNR), as measured on our LangID evaluation sets. Of the many error rates one could cluster on, we chose FNR to minimize the loss in recall for related languages. We clustered using the scikit-learn (Pedregosa et al., 2011) implementation of Hierarchical Agglomerative Clustering, with distance_threshold=None, affinity="precomputed", and linkage="average". These parameters were chosen largely because they seemed to produce the nicest distribution of cluster sizes. Since some of the clusters were still excessively large, we re-split the largest ones by hand such that no cluster had more than 20 languages. We split these clusters by hand because re-splitting with clustering algorithms still produced very uneven sizes, probably because these clusters have hubs with poor or noisy data.
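The clustering call above can be mimicked with a small pure-Python average-linkage clustering over a precomputed FNR-derived distance matrix; the paper uses scikit-learn's implementation, so this sketch only illustrates the mechanics, and the language codes and distance values below are illustrative rather than taken from the paper:

```python
# Minimal average-linkage agglomerative clustering over a precomputed
# distance matrix, mirroring the sklearn call described in the text.
# All language codes and distances below are illustrative.

def average_linkage_clusters(dist, n_clusters):
    """dist: symmetric distance matrix (list of lists).
    Returns a list of clusters of row indices."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average pairwise distance between the two clusters.
                d = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                d /= len(clusters[a]) * len(clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

langs = ["hi", "bho", "mag", "en"]
# Distances derived from FNR: related Hindustani varieties are "close".
dist = [
    [0.0, 0.2, 0.3, 0.9],
    [0.2, 0.0, 0.25, 0.9],
    [0.3, 0.25, 0.0, 0.95],
    [0.9, 0.9, 0.95, 0.0],
]
clusters = average_linkage_clusters(dist, n_clusters=2)
print([[langs[i] for i in c] for c in clusters])
# → [['hi', 'bho', 'mag'], ['en']]
```

With a real FNR matrix over 1,629 languages, the distance threshold or cluster count would be tuned as described in the text, and oversized clusters re-split by hand.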

To observe the effect of clustering predictions, we compare the size of a dataset for a language if document-consistency filtering (Section 2.1.4) were applied on the language-code level (which would result in lower recall) or on the cluster level. The median ratio between these sizes was 1.006, meaning that for most languages, clustering didn’t significantly improve recall. However, for some languages it improved recall significantly, with 57 languages showing a dataset size increase of greater than 20x. The languages with large wins here were by-and-large to be expected, including Hindustani and Arabic varieties, and a variety of cases like Oromo (om) and Eastern Oromo (hae).

The higher-resource languages omitted in this crawl (Section 2.1.2) were all put in their own clusters.

2.1.4 Document Consistency filtering and First-pass LangID

To start with, we performed LangID prediction on every sentence in every document with the 1,629-language CLD3 n-gram LangID model. Having obtained these predictions, we applied document consistency filtering (Caswell et al., 2020). This is one of the simplest and by far most effective filtering steps. We simply discarded any sentence whose sentence-level LangID cluster prediction did not match the document-level LangID cluster prediction. We defined the document-level LangID cluster prediction as the most-often predicted cluster among all sentences — e.g. if a document had 20 sentences in cluster A, 19 sentences in cluster B, and 18 in cluster C, we gave it a document-level ID of cluster A.
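Document-consistency filtering as described above can be sketched in a few lines; the `toy_langid` function here is a hypothetical stand-in for the sentence-level CLD3 cluster predictions:

```python
from collections import Counter

def document_consistency_filter(doc_sentences, predict_cluster):
    """Discard any sentence whose sentence-level cluster prediction does
    not match the document-level majority cluster."""
    preds = [predict_cluster(s) for s in doc_sentences]
    # Document-level prediction = most frequently predicted cluster.
    doc_cluster = Counter(preds).most_common(1)[0][0]
    return [s for s, p in zip(doc_sentences, preds) if p == doc_cluster]

# Toy stand-in for a LangID model; a real system would call CLD3 here.
toy_langid = lambda s: "cluster_en" if "the" in s.split() else "cluster_other"
doc = ["the cat sat", "the dog ran", "zxqv wvu"]
print(document_consistency_filter(doc, toy_langid))
# → ['the cat sat', 'the dog ran']
```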

Figure 1: Histogram of document consistency scores, using a 1,745-language CLD3 LangID model, on web text. The large majority of sentence-level LangID predictions had a score of under 10%, indicating that they were likely noisy predictions on pages in other languages or non-linguistic content. Of the 0.55% with document consistency score over 90%, over half (0.3%) had a score of 100%.

To investigate the effectiveness of document consistency filtering, we looked at the document consistency score. For a given sentence in a document, the document consistency score is simply the percent of sentences in that document sharing the same LangID prediction as that sentence. (Therefore, if the document consistency score is over 50%, the sentence will never be filtered out by document consistency filtering.) Figure 1 shows the deciles for document consistency score across all languages, for a sample web crawl with the initial, 1,745-language CLD3 LangID model. The distribution is heavily weighted towards the lower values, with fully 83% of sentences having a score under 10%. The large mass of sentences with a low document consistency score indicates that there are many single, random sentences in larger documents, including non-linguistic content, whose language is mis-predicted. This suggests a more refined approach to document-consistency filtering: using the method outlined above, a page with content in multiple languages can only yield data for one language; if instead any sentence were preserved whose document consistency score exceeded a particular threshold (say, 0.3), documents with multilingual content would be handled much better, while preserving most of the benefits of filtering. This is left for future work.
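The thresholded variant proposed for future work might look like the following sketch, where `preds` are per-sentence cluster predictions and the 0.3 threshold is the illustrative value from the text:

```python
from collections import Counter

def consistency_score_filter(doc_sentences, preds, threshold=0.3):
    """Keep a sentence if its document consistency score (the fraction of
    the document's sentences sharing its LangID prediction) meets the
    threshold, so a genuinely multilingual page can contribute data to
    several languages instead of only the majority one."""
    counts = Counter(preds)
    n = len(preds)
    return [s for s, p in zip(doc_sentences, preds) if counts[p] / n >= threshold]

# A document that is 50% cluster A, 33% cluster B, and 17% noise:
sents = ["a1", "a2", "a3", "b1", "b2", "c1"]
preds = ["A", "A", "A", "B", "B", "C"]
print(consistency_score_filter(sents, preds))
# → ['a1', 'a2', 'a3', 'b1', 'b2']
```

Unlike majority-only filtering, both the A and B portions of this document survive, while the singleton C prediction is dropped.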

2.1.5 High-recall Wordlist-filtering

Following Caswell et al. (2020), we apply percent-threshold wordlist filtering to all languages: a sentence was discarded if it did not contain a minimum percentage of in-language words for any of the languages in the cluster, where the wordlists were the most frequent 800 words from the LangID training data.
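A sketch of percent-threshold wordlist filtering follows; the 20% threshold and the toy wordlists are illustrative assumptions, since the exact percentage used in this step is not stated here:

```python
def percent_threshold_filter(sentences, cluster_wordlists, threshold=0.2):
    """Keep a sentence if, for at least one language in the cluster, the
    fraction of its tokens found in that language's wordlist (in the paper,
    the 800 most frequent LangID-training-data words) meets the threshold."""
    kept = []
    for sent in sentences:
        tokens = sent.lower().split()
        if not tokens:
            continue
        in_lang = lambda wl: sum(t in wl for t in tokens) / len(tokens)
        if any(in_lang(wl) >= threshold for wl in cluster_wordlists.values()):
            kept.append(sent)
    return kept

# Tiny illustrative wordlists for a Tok Pisin / English cluster.
wordlists = {"tpi": {"long", "bilong", "em", "i"}, "en": {"the", "of", "and"}}
sents = ["em i go long haus", "xqz vvw kkp zzt"]
print(percent_threshold_filter(sents, wordlists))
# → ['em i go long haus']
```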

2.1.6 Filtering and Declustering with Semi-supervised LangID

Once the first-pass, clustered filtering was completed, we classified each sentence with the more computationally expensive Semi-Supervised LangID model (SSLID), as described in Caswell et al. (2020), resulting in per-language corpora. If the SSLID model predicted any language outside of the cluster, we filtered that sentence out. The CLD3 predictions were ignored.

2.1.7 Tf-iif Filtering

Even after these three rounds of filtering — CLD3, percent-threshold, and SSLID — many languages still had extremely noisy data. The worst case was Tok Pisin (tpi), whose dataset consisted of 1.3B sentences, the vast majority of which were in Standard English (mostly containing the word “long”, which is also a common function word in Tok Pisin). Therefore, as in Caswell et al. (2020), we applied percent-threshold filtering with Tf-iif (Term-Frequency Inverse-Internet-Frequency) lists, which are open-sourced, meaning that we retained any sentence which contained at least 20% of its tokens in our Tf-iif wordlists. However, we optimized the approach used in Caswell et al. (2020), and furthermore developed a heuristic metric to determine whether to apply the filtering on a per-language basis.

In contrast to Caswell et al. (2020), we omitted the IDF term entirely, since this term is influenced by the set of languages one considers, and becomes less helpful as the number of languages scales. We also adjusted the values of the parameters c and N (described below). To understand these changes, we can revisit the formulation for Tf-iif. For a token t in a language l, with a frequency function f and language-specific corpora C_l:

    TF-IIF(t, l) = TF(t, l) · IIF(t) = ( f_{C_l}(t) / Σ_{t′} f_{C_l}(t′) ) · ( 1 / max( f_web(t), f_web(t_c) ) )

where t_c is the c-th most frequent token on the web. Note the clipping parameter c in the IIF term, which is introduced to account for OOV tokens (which are common) and noise near the tail of the distribution. For this work we set c = 80,000 — in other words, we only consider the top c terms in the empirical frequency distribution, and give all less common terms the same weight as the c-th most common token. A higher value of c means that more words are covered by the frequency distribution — which seems good — but it also means that OOV words will have a higher weight, since the IIF term is an inverse. In the worst case this could mean that the resulting Tf-iif wordlist would contain too many rare words. An especially low value of c would essentially guarantee this, pushing only OOV words to the top of the list.

There is another parameter hidden in this formulation, namely N: how many words to include in the wordlist. For instance, N = 1000 means that the wordlist used for filtering includes the top 1000 words by Tf-iif score.
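To make the parameters concrete, here is a toy sketch of building a Tf-iif wordlist under one reading of the clipped formulation (IIF clipped at the frequency of the c-th most common web token); the tiny c and top_n values, the token counts, and the exact clipping arithmetic are illustrative assumptions:

```python
from collections import Counter

def tfiif_wordlist(lang_tokens, web_counts, c=4, top_n=3):
    """Build a Tf-iif wordlist: term frequency in the language corpus times
    inverse internet frequency, clipping the IIF denominator at the
    frequency of the c-th most common web token so rare/OOV web tokens are
    not over-weighted.  c and top_n are tiny here purely for illustration."""
    tf = Counter(lang_tokens)
    total = sum(tf.values())
    ranked = web_counts.most_common()
    clip_freq = ranked[min(c, len(ranked)) - 1][1]  # web frequency at rank c
    scores = {
        tok: (cnt / total) / max(web_counts.get(tok, 0), clip_freq)
        for tok, cnt in tf.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Toy "internet" counts and a toy in-language corpus.
web = Counter({"the": 100, "a": 50, "of": 40, "to": 30, "kia": 1})
tokens = "kia ora kia koe the".split()
print(tfiif_wordlist(tokens, web))
# → ['kia', 'ora', 'koe']
```

The distinctive token "kia" scores highest, while "the", despite appearing in the corpus, is demoted by its enormous internet frequency, which is exactly the behaviour the filtering relies on.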

In order to investigate the effect of these parameters, we define a heuristic measure of distractibility as a proxy for precision, which measures how much a given language’s data is likely to be polluted by a common high-resource language. For a language l and a set of distractor languages D, we define the distractibility d_l as the sum of the False Discovery Rates of the distractors with respect to l:

    d_l = Σ_{l′ ∈ D} FDR(l′, l)

where, for our purposes, we have chosen D = {en, de, es, hi, id, ar, ru}, and the False Discovery Rate (FDR) of a distractor language l′ with regard to a language l is defined in the standard way, as the fraction of sentences predicted as l that are actually in l′ (on a balanced eval set).
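As an illustration, a toy distractibility computation from eval-set confusion counts follows; since the exact formula did not survive extraction, aggregating the per-distractor FDRs by summation is an assumption here:

```python
def distractibility(confusion, lang, distractors):
    """d_l: summed False Discovery Rate of each distractor language with
    respect to `lang`.  `confusion[(true, pred)]` holds counts from a
    balanced LangID eval set; the FDR of l' w.r.t. l is the fraction of
    sentences predicted as l that are truly in l'."""
    predicted_as_lang = sum(n for (_, pred), n in confusion.items() if pred == lang)
    if predicted_as_lang == 0:
        return 0.0
    hits = sum(confusion.get((d, lang), 0) for d in distractors)
    return hits / predicted_as_lang

# Toy eval: 90 true-positive 'tpi' predictions, plus 10 English
# sentences mislabeled as 'tpi'.
conf = {("tpi", "tpi"): 90, ("en", "tpi"): 10}
print(distractibility(conf, "tpi", {"en", "de"}))
# → 0.1
```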

Despite the concern that it would elevate rare and OOV words, we found that increasing the value of c steadily decreased the distractibility with little loss in recall on the target languages. Increasing the size of the wordlists, N, recouped the loss in recall, but increased distractibility to more dangerous levels. These results can be seen in Table 1.

Therefore, for our Tf-iif lists, we use values of c = 80,000 and N = 1,000. For comparison, Caswell et al. (2020) used a much smaller clipping value, with the wordlist size set on a per-language basis from the recall on the dev set. Furthermore, Caswell et al. (2020) use an IIF list with only 980k unique tokens (7M webpages), whereas this work uses 41M unique tokens (250M web pages). The public GitHub repo has been updated to reflect these improvements.

Deciding when to apply Tf-iif filtering:

Many languages do not need extra filtering, and it is important not to overfilter and decrease recall. Therefore we needed an outlier detection metric to determine whether filtering needed to be applied. We wanted to filter those corpora where the loss in recall was small, but the reduction in dataset size was large (indicating that many out-of-language sentences were removed).

We looked at the percent of our LangID eval sets that remained after filtering (r, a.k.a. the filtering recall), and compared it to the percent of our web-crawled corpora remaining after filtering (p). We used these to define the heuristic Relative Recall Rate (RRR) as follows:

    RRR = r^α / p

This quantity measures, approximately, how many true positives are kept by this method for every false positive filtered out. The exponent α is the trade-off between recall and percent filtered. A value of α > 1 means that we weight a loss in recall (undesired outcome) more than an equivalently sized reduction in data size (desired outcome). One way of interpreting α is as how much we trust the recall on our eval set: if α = 1, we trust it perfectly. However, in practice, since we fear that the recall on our eval set overestimates the true recall on natural text on the web, we set α > 1.

For our experiments we chose a threshold on RRR that flagged 895 out of 1503 languages. However, since we were only concerned about datasets that were more severely polluted, we only filtered datasets where the filtering would remove 20% or more of the data (p ≤ 0.8). We also decided not to filter any language where the recall (r) was less than 80%, in case we lost too much usable data. The result was that we applied Tf-iif filtering to 210 out of 1503 language corpora.
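Since the exact formula and thresholds did not survive extraction, the resulting decision rule can only be sketched under assumptions: the form RRR = r**alpha / p, alpha = 2, and the RRR threshold of 2.0 below are all assumed, while the p ≤ 0.8 and r ≥ 0.8 guards are stated in the text:

```python
def should_apply_tfiif(r, p, alpha=2.0, rrr_threshold=2.0):
    """Flag a corpus for Tf-iif filtering when little recall is lost
    (r high) but much data would be removed (p, the fraction of the corpus
    remaining, is low).  The form r**alpha / p, alpha, and rrr_threshold
    are assumptions; the p <= 0.8 and r >= 0.8 guards come from the text."""
    rrr = r ** alpha / p
    return rrr >= rrr_threshold and p <= 0.8 and r >= 0.8

print(should_apply_tfiif(0.95, 0.30))  # heavy pollution, little recall loss → True
print(should_apply_tfiif(0.99, 0.95))  # filtering would barely remove data → False
print(should_apply_tfiif(0.70, 0.20))  # too much recall would be lost → False
```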

  c      N     mean recall  median recall  mean d  median d  max d
  5000   1000     94.3          98.9         8.6     1.2      20
  10000  1000     94.2          98.9         7.3     0.7      16
  20000  1000     94.1          98.7         5.7     0.4      14
  80000  1000     93.4          98.5         2.9     0.1       8
  80000  2000     94.9          99.3         6.8     0.4      21

Table 1: Investigating the effect of varying the IIF-thresholding parameter c on the recall and distractibility (d) of Tf-iif filtering, over 1598 languages. As we increase c, the distractibility steadily trends down, with only small losses in recall. Increasing the wordlist size N recoups lost recall but increases distractibility.

2.1.8 Token distribution anomaly detection and Negative Token Filters

Even after the filtering steps described in the previous sections, a variety of languages still had data-quality issues. One issue that was not easily filtered out by approaches like Tf-iif was templated content, which is technically in the right language, but is not useful for training data. To a lesser extent we also saw issues like the “unlucky n-gram” (Caswell et al., 2020) effect. Examples of the type of content we found were:

  1. Scottish Gaelic (gd) yielded 570M in-language sentences even after Tf-iif filtering. It turned out that this was mostly from one site, and the most common token was “Luchdaich a-nois” (“download”).

  2. Moroccan Arabic (ar-MA) came up with a dataset of over a billion sentences, but 94.9% contained some reference to “casinos”, “gambling”, etc.

  3. Kurukh (kru-Mlym) was mainly A N T S P E A K. (And it later turned out that the training data was spurious.)

  4. The Arabic-script Indo-Aryan languages of Kalami (gwc), Hindko (hnd), and Torwali (trw) had picked up masses of templates from unit conversion websites (“X lbs is Y ounces”, “convert Euro to US American Dollar”, etc.)

  5. Many Latn-script Indic languages (hi-Latn, ml-Latn, etc.) had large amounts of content that were just download links or titles for videos, songs, and so on.

  6. Cree (cr-Latn) was almost 100% “Lorem ipsum” sentences.

While some of these are actually incorrect content, much of it is technically in-language, and therefore can’t be filtered out by Tf-iif filtering or straightforward application of LangID. Therefore, we decided to develop an approach to detect anomalous data and investigate datasets that looked like they had issues.

To detect these extreme domain shifts, we hypothesized that the token distribution would be severely skewed. Therefore, we compared the distribution of the tokens in the LangID train data (the reference distribution) to the token distribution in the crawled data (the empirical distribution). To compare these distributions we looked at several scores for the top N=40 tokens in the empirical distribution:

  • 2n-overlap: This is simply the percentage of the top N tokens that appear in the top 2N tokens of the reference distribution; this metric is very simple and highly interpretable.

  • Euclidean: This is the Euclidean distance between the frequencies of the top N tokens and their corresponding frequencies from the reference distribution.

Different scores, like Jensen-Shannon divergence and Pearson’s R, were initially considered but found to be ineffective.

We then combined these two scores with their harmonic mean, yielding the Harmonic Token Anomalousness Score. This is a very approximate measure, but it still gives a useful signal for outlier detection. Based on some qualitative analysis, we determined that a score beyond a certain threshold is a sign of questionable dataset quality. Interestingly enough, however, datasets with scores at the other extreme also had issues: a token distribution that matched the reference almost exactly often indicated that the web crawl had merely recovered the training data, which in these cases was often religious material.
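The two statistics can be sketched as follows; mapping the Euclidean distance into (0, 1] via 1/(1 + d) so it can be harmonically averaged with the overlap is an assumption, as the paper's exact combination was not preserved here. Higher values indicate a token distribution closer to the reference:

```python
import math

def harmonic_token_score(emp_counts, ref_counts, n=40):
    """Harmonic mean of (i) 2n-overlap: fraction of the top-n empirical
    tokens found among the top-2n reference tokens, and (ii) a similarity
    derived from the Euclidean distance between the top-n empirical token
    frequencies and their reference frequencies.  The 1/(1 + dist) mapping
    is an assumption to put both terms on a common (0, 1] scale."""
    def rel_freqs(counts):
        total = sum(counts.values())
        return {t: c / total for t, c in counts.items()}

    emp, ref = rel_freqs(emp_counts), rel_freqs(ref_counts)
    top_emp = sorted(emp, key=emp.get, reverse=True)[:n]
    top_ref = set(sorted(ref, key=ref.get, reverse=True)[: 2 * n])
    overlap = sum(t in top_ref for t in top_emp) / len(top_emp)
    dist = math.sqrt(sum((emp[t] - ref.get(t, 0.0)) ** 2 for t in top_emp))
    sim = 1.0 / (1.0 + dist)
    if overlap == 0.0:
        return 0.0
    return 2.0 * overlap * sim / (overlap + sim)

# Toy reference distribution from LangID training data.
ref = {"na": 50, "i": 30, "ka": 20}
print(harmonic_token_score({"na": 5, "i": 3, "ka": 2}, ref))  # same distribution → 1.0
print(harmonic_token_score({"casino": 9, "bonus": 1}, ref))   # templated junk → 0.0
```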

After computing this score for all datasets, we manually inspected samples of the data for all languages whose Harmonic Token Anomalousness Score crossed this threshold and which had more than 20,000 sentences. There were 179 languages flagged as suspicious in this way. It was relatively straightforward to make filters for 62 of these, for instance excluding sentences containing “casino” in Arabic dialects. For some of the others, we made notes that they were the wrong language. For many others, there was no clear or obvious solution, so we left them as-is. These filters removed a median of 21% of the data for the 62 filtered languages.

We acknowledge that this measure of quality is very approximate, so it should be used judiciously.

2.1.9 Deduplication

The last step was simply to remove duplicate sentences in all data sources. This reduced the median dataset size by a factor of 1.8x, and the average dataset size by a factor of 1.4x.
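A sketch of this step, assuming exact string matching on full sentences:

```python
def deduplicate(sentences):
    """Remove exact duplicate sentences, keeping first occurrences in order."""
    seen = set()
    unique = []
    for s in sentences:
        if s not in seen:
            seen.add(s)
            unique.append(s)
    return unique
```

The per-language reduction factor reported above is then `len(sentences) / len(deduplicate(sentences))`.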

2.2 Description of Resultant Dataset

2.2.1 Monolingual Data

The result of the process described in Section 2.1 was a dataset with corpora for 1503 low-resource languages, ranging in size from one sentence (Mape) to 83 million sentences (Sabah Malay). For our experiments we chose to use only the 1057 languages for which we recovered more than 25,000 monolingual sentences (before deduplication). We combined this with our existing monolingual data sources for the 83 high-resource languages. Table 2 shows statistics for these three datasets: the full low-resource dataset (“LRL-full”), the portion of the full low-resource dataset used for model training (“LRL-train”), and the full training dataset (“all-train”), which adds in the 83 higher-resource languages.

2.2.2 Monolingual Data Quality Audit

We conducted an audit of our data as in Kreutzer et al. (2022), with a variety of volunteers comprising native speakers and non-speakers willing to do detective work. From the sample of 72 languages we audited, the median data score was 80%. The “score” is a simple heuristic estimating the percentage of usable data from the error codes assigned during the audit; it combines cc, the percent of the sample labeled “Correct”; cb, the percent labeled “Correct, Low quality”; ca, the percent labeled “Correct, ambiguous dialect”; and wd, the percent labeled “Correct, wrong dialect” (see Appendix Section B for details).

The largest error category tended to be Wrong Language, with an average value of 16%, followed by Low Quality/Boilerplate, with an average of 10%. Languages with the poorest quality tended to be close dialects of major languages, especially varieties of Arabic (Sa’idi, Moroccan, Mesopotamian, Latinized, Algerian: aec, ar-MA, acm, ar-Latn, arq), but also varieties of French (Saint Lucian Creole: acf) and English (Nigerian Pidgin: pcm). There were also some close varieties that had higher quality, including varieties of French (Seychellois Creole: crs), Hindi (Bhojpuri: bho) and English (Krio: kri). Of these three, kri and crs use quite different orthographies than their high-resource relatives. In addition to close varieties, some African languages (Ndebele, Anaang, Kwanyama, Wolof: nd, anw, kj, wo) also had very poor quality. Less common languages with extremely high quality, e.g. Northeastern Dinka (dip), Zarma (dje), and Dombe (dov), each with a score of 100, may also be viewed with suspicion, as this may mean that a very narrow domain has been recovered (usually religious text). Details on the per-language performance and the set of error codes we used can be seen in Appendix Section B.

dataset N total median >1M >100K
LRL-full 1503 1.7B 25k 122 322
LRL-train 1057 1.7B 38k 122 321
all-train 1140 28B 43k 205 404
Table 2: Summary statistics about monolingual data for 1) the full dataset of low-resource languages; 2) the portion thereof used to train our model; and 3) the full training set including high resource languages. Columns are the number of languages covered, the total number of sentences across all languages, the median number of sentences per language, the number of languages with more than 1 million sentences, and the number of languages with more than 100,000 sentences.

2.2.3 Parallel Data

In addition to the monolingual data described above, we also utilize the web-crawled parallel corpora available to us. This corpus is a slightly extended version of the corpus described in Arivazhagan et al. (2019), containing billions of sentence pairs spanning 112 languages, to and from English.

3 Building Machine Translation Models for Long-Tail Languages

Next, we utilize the datasets described in Section 2.2 to build our MT models. As we see from the monolingual data statistics in Table 2, while there are over 1M sentences per language present in our corpus for the highest-resource 205 languages, the median number of sentences in our full training corpus is only around 43K per language. This is very little data: traditional MT systems often require a few hundred thousand to a few million parallel sentences to reach good quality, whereas for the median language we have only around 43K monolingual sentences.

Given this highly data-sparse setting, it is clear that the approaches that work for high-resource languages cannot be applied directly in our setting. In order to build high-quality models for long-tail languages, we build on the approach developed previously in Siddhant et al. (2022) to enable zero-resource translation, by leveraging (i) self-supervised training on in-language monolingual data, (ii) massively multilingual supervised translation on out-of-language data, (iii) large-scale data augmentation via back-translation and self-training, and (iv) model scaling. In this section we describe our recipe for training MT models for long-tail languages, starting with massively multilingual models and concluding with smaller, inference-friendly models trained on a selected subset of languages.

We start by highlighting the role of capacity in our highly multilingual setting by comparing the performance of 1.5B and 6B parameter models in Section 3.2. These experiments were conducted on an earlier version of our dataset, spanning both supervised and zero-resource languages. In Section 3.3 we scale up our models to cover over 1000 languages and highlight the effect of increasing multilinguality in the long-tail setting. To further increase performance, we incorporate large-scale back-translation and self-training; in Section 3.4, we evaluate these approaches on a subset of 30 languages for practical considerations. In Section 3.5, we further highlight the role of filtering synthetic data and its effect on model quality for close dialects. Finally, in Section 3.6, we describe our distillation approach and compare the resulting student models against the fine-tuned teachers.

Of the languages we evaluate on in this work, only Sorani Kurdish (ckb) uses any in-language supervised data; all other languages are evaluated under a zero-resource setting. While small amounts of parallel resources were not difficult to obtain for many of these languages, preliminary experiments showed no quality improvements from including limited amounts of (often religious-domain) parallel text.

3.1 Experimental Setup

We utilize the 1000-language monolingual corpus available to us, together with large amounts of massively multilingual parallel data in 112 languages (including English), to build zero-resource MT models for 1000 languages following the approach described in Siddhant et al. (2022). To elaborate, we train Transformer-based MT models on the translation task for 112 languages, simultaneously with the MASS masked denoising task (Song et al., 2019), to first build zero-resource models capable of translating from xx→en with reasonable quality. These models follow a two-stage training procedure: the first stage consists of training on just the MASS and supervised translation tasks, and the second stage also incorporates data generated via online translation from the model’s latest checkpoints (in the xx→en direction). This data is used simultaneously as back-translated and self-training (i.e., forward-translated) data.

All our models are based on the standard Transformer architecture, with 32-layer encoders and decoders. We use two variants of the Transformer: a smaller model with approximately 1.5B parameters, and a larger one with 2x larger model dimension and 2x wider hidden dimension, with approximately 6B parameters. Our models utilize the vocabulary from a 64K-token SentencePiece model (SPM) (Kudo & Richardson, 2018) trained on monolingual data from the entire set of languages covered by the model. Data is upsampled with temperature sampling (Arivazhagan et al., 2019) before creation of the SPM. An analysis of the vocabulary suggests that it is close to character-level for all but the highest-resource languages. We use GPipe pipeline parallelism (Huang et al., 2019) and the Tensorflow-Lingvo framework (Shen et al., 2019) for all our MT experiments.

Different from Siddhant et al. (2022), in addition to the <2xx> token prepended to the source sequence to signify the target language for both the translation and MASS tasks, we add a <2task> token (<2translation> for the translation task, and <2mass> for the MASS task) that specifies the task to be performed by the model. We find this to be critical for zero-resource performance, especially when model sizes are scaled up. In the absence of this task token, our models learnt to ‘infer’ the task from the source language instead of relying on the <2xx> token, resulting in copying rather than translating when zero-resource language sentences were provided with the <2en> token.
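The input construction can be sketched as follows; the relative order of the two control tokens and their integration into the SentencePiece vocabulary are our assumptions, not details stated above.

```python
def build_source(source_tokens, target_lang, task):
    """Prepend the <2task> and <2xx> control tokens to a tokenized source.

    task is "translation" or "mass"; target_lang is a language code such as
    "en" or "mni". The token order here is illustrative.
    """
    assert task in ("translation", "mass")
    return [f"<2{task}>", f"<2{target_lang}>"] + list(source_tokens)
```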

All our models are evaluated with ChrF on standard evaluation sets built by translating 1200 English sentences into the target language, for 38 languages. Details of our evaluation metrics and datasets are provided in Section 4.

en-mfa en-ff en-mni en-kri en-mad
Mono Size 7k 86k 106k 129k 138k
1.5B 57.8 26.0 33.7 31.1 54.8 35.7 48.7 32.7 44.9 29.8
6B 62.1 17.9 34.3 25.0 55.1 31.9 54.8 28.2 52.0 25.9
en-doi en-bm en-ban en-quc en-ady
Mono Size 179k 187k 188k 250k 296k
1.5B 57.4 29.9 27.6 28.7 32.2 31.8 24.4 22.8 53.7 21.8
6B 60.4 24.4 33.4 28.5 35.5 29.4 27.5 22.2 56.0 21.4
en-av en-gom* en-iso* en-yua en-min
Mono Size 301k 311k 409k 419k 533k
1.5B 48.1 26.1 50.7 1.1 17.9 16.5 34.6 30.6 58.8 51.8
6B 50.5 27.3 53.3 0.9 19.3 16.8 39.3 31.3 62.6 54.1
en-bho en-kl* en-qu en-gn en-bbc
Mono Size 734k 741k 842k 861k 923k
1.5B 54.8 38.9 18.4 12.5 29.9 29.8 26.7 24.3 40.9 34.9
6B 56.9 38.8 20.0 13.0 32.7 30.2 35.6 28.8 43.9 35.6
en-skr en-ak en-ts* en-mai en-ce
Mono Size 974k 1.29m 1.3m 1.35m 1.36m
1.5B 44.4 32.8 31.7 32.1 25.5 17.4 57.8 34.1 44.6 23.5
6B 46.1 28.5 34.6 32.8 28.3 17.5 59.9 32.5 48.1 24.4
en-pcm en-nso en-ilo en-cv en-ti
Mono Size 1.59m 1.87m 2.6m 2.9m 3.9m
1.5B 54.9 54.7 45.8 45 50.8 49.7 39.9 28.1 37.9 19.9
6B 52.5 52.4 49.1 46.9 39.9 47.8 44.4 27.5 40.9 20.0
en-om en-sa en-dv en-lus* en-as
Mono Size 5.6m 6.2m 7.9m 8.3m 9.3m
1.5B 32.1 36 43.5 26.9 37 43.2 22.6 16.9 52.9 36.3
6B 37.0 36.7 45.8 24.7 35.8 39.2 33.0 19.1 49.6 36.2
en-mzn en-bew en-ckb* Average
Mono Size 11.6m 33.3m 76.9m
1.5B 50 41.3 49.8 45.7 46.6 40.8
6B 53.3 40.2 51.6 44.0 51.7 41.6 +2.7 -1.0
Table 3: Comparison of our 206-language 1.5B and 6B models trained on supervised data from 112 languages to and from English and monolingual data from 206 languages. Each cell reports ChrF for the xx→en direction followed by the en→xx direction. Languages marked with * weren’t included in the set of 206 languages, so monolingual data was not used for these languages in this experiment.

3.2 Experiments comparing effect of model capacity

We first compare the effect of model capacity on zero-resource performance. In this experiment we train our 1.5B and 6B Transformer variants on supervised data from 112 languages, to and from English, and monolingual data from 206 languages (including the above 112). These models are first trained with MASS and translation, and in a second stage additionally trained on online-translated monolingual sentences as back-translation and self-training data. The results of these experiments are listed in Table 3. We report results for 38 languages, 6 of which had no monolingual data included (they weren’t present in our initial scrape). We find that increasing the capacity of the model from 1.5B to 6B has a significant effect on translation quality, improving by an average of 2.7 ChrF in the xx→en direction. On the other hand, the en→xx direction regresses by an average of 1.0 ChrF. This degradation is associated with the same issue that required us to prepend extra <2task> tokens to the source input (models learning to infer the task from the input language, resulting in copying instead of translation).

Looking at the overall results, we notice that these models reach above 40 ChrF in the xx→en direction for a majority of the languages. Performance is especially high for languages that have similar languages in our parallel corpus (including South Asian languages and pidgins like Bhojpuri (bho), Dogri (doi), and Nigerian Pidgin (pcm)), while ChrF scores are relatively low for lower-resource languages and those with no related languages in the model (including Native American languages like Quechua (qu), K’iche’ (quc), Aymara (ay), and Kalaallisut (kl)). We observe that our model learns to translate Goan Konkani (gom) into English with surprisingly high quality despite the lack of gom monolingual data seen by the model, perhaps owing to its similarity with Marathi (mr), which is present in our supervised data.

en-mfa en-ff en-mni en-kri en-mad
Mono Size 7k 86k 106k 129k 138k
200L 62.1 17.9 34.3 25.0 55.1 31.9 54.8 28.2 52.0 25.9
1000L 65.4 27.5 41.2 32.3 56.4 38.2 56.5 34.9 50.5 30.2
en-doi en-bm en-ban en-quc en-ady
Mono Size 179k 187k 188k 250k 296k
200L 60.4 24.4 33.4 28.5 35.5 29.4 27.5 22.2 56.0 21.4
1000L 63.1 25.5 36.3 34.3 35.4 33.1 26.7 22.9 54.8 28.2
en-av en-gom* en-iso* en-yua en-min
Mono Size 301k 311k 409k 419k 533k
200L 50.5 27.3 53.3 0.9 19.3 16.8 39.3 31.3 62.6 54.1
1000L 48.1 28.1 55.5 39.1 29.4 30.5 40.7 31.6 62.4 56.1
en-bho en-kl* en-qu en-gn en-bbc
Mono Size 734k 741k 842k 861k 923k
200L 56.9 38.8 20.0 13.0 32.7 30.2 35.6 28.8 43.9 35.6
1000L 58.3 40.6 29.7 23.1 32.5 33.1 38.9 32.2 44.0 35.4
en-skr en-ak en-ts* en-mai en-ce
Mono Size 974k 1.29m 1.3m 1.35m 1.36m
200L 46.1 28.5 34.6 32.8 28.3 17.5 59.9 32.5 48.1 24.4
1000L 48.4 31.3 36.3 34.3 43.0 45.5 61.6 37.6 44.9 23.7
en-pcm en-nso en-ilo en-cv en-ti
Mono Size 1.59m 1.87m 2.6m 2.9m 3.9m
200L 52.5 52.4 49.1 46.9 39.9 47.8 44.4 27.5 40.9 20.0
1000L 51.2 53.5 51.3 41.6 43.4 52.4 46.3 32.1 44.2 21.1
en-om en-sa en-dv en-lus en-as
Mono Size 5.6m 6.2m 7.9m 8.3m 9.3m
200L 37.0 36.7 45.8 24.7 35.8 39.2 33.0 19.1 49.6 36.2
1000L 38.1 39.1 46.3 28.4 45.2 43.7 34.5 39.3 58.6 36.7
en-mzn en-bew en-ckb* Average
Mono Size 11.6m 33.3m 25.1m
200L 53.3 40.2 51.6 44.0 51.7 41.6
1000L 55.2 41.4 51.8 46.0 54.4 41.6 +2.5 +5.3
Table 4: Comparison of our 206-language 6B and 1000-language 6B models trained on monolingual data from 206 and 1000 languages respectively, while sharing supervised data from 112 languages. Each cell reports ChrF for the xx→en direction followed by the en→xx direction. Languages marked with * weren’t included in the set of 206 languages, so monolingual data was not used for these languages in the baseline model, and the 1000-language model consequently sees larger performance improvements.

3.3 Continual learning, extending models from 200 to 1000 Languages

We next compare our 6B model trained on monolingual data from 200 languages against one trained on our entire monolingual set, spanning 1000 languages. To train the 1000-language translation model we utilize the continual learning technique described in Garcia et al. (2021a). To elaborate, we replace the vocabulary used by our 200-language MT model with one trained on monolingual data from all 1000 languages. To reuse the model learnt on 200 languages to initialize the newer, 1000-language model, we align the SPM tokens shared across the two vocabularies and assign them the same IDs. Tokens in the new vocabulary that do not map to any token in the original vocabulary are assigned random IDs not taken by the aligned tokens. This allows us to continue training our 1000-language model from the model trained on the 200-language subset.
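The alignment step can be sketched as follows, assuming both vocabularies have the same size (64K here) so that every old ID remains valid in the new model:

```python
import random

def remap_vocab(old_vocab, new_vocab, seed=0):
    """Build {token: id} for the new vocabulary: tokens shared with the old
    vocabulary keep their old IDs; remaining tokens receive the unused IDs
    in random order."""
    old_ids = {tok: i for i, tok in enumerate(old_vocab)}
    mapping = {tok: old_ids[tok] for tok in new_vocab if tok in old_ids}
    used = set(mapping.values())
    free = [i for i in range(len(new_vocab)) if i not in used]
    random.Random(seed).shuffle(free)
    unmapped = (tok for tok in new_vocab if tok not in mapping)
    for tok, idx in zip(unmapped, free):
        mapping[tok] = idx
    return mapping
```

The 200-language model's embedding table can then be re-indexed with this mapping to initialize the 1000-language model, so that shared tokens start from their previously learnt representations.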

We compare the performance of our 200-language model against the 1000-language MT model in Table 4. Unsurprisingly, languages that were not covered by our 200-language model but are now covered by our new monolingual data, including Goan Konkani (gom), Isoko (iso), Kalaallisut (kl) and Tsonga (ts), see large quality improvements. However, we also observe quality improvements for almost all languages in our evaluation set, in both the xx→en and en→xx translation directions, with average improvements of +2.5 and +5.3 ChrF points respectively. This is counter-intuitive, since increasing the number of self-supervised languages should presumably increase the amount of interference and worsen capacity contention within the model, similar to what was observed in Siddhant et al. (2022). What is different from earlier studies is the extent of multilinguality: any newly added language is likely to be very similar to one of the existing languages, resulting in stronger transfer between those languages. This transfer effect is strengthened by the low quantity of monolingual data for long-tail languages; in the massively multilingual and resource-constrained setting, the cross-lingual transfer effect dominates the interference observed in massively multitask models. This is also supported by the fact that the languages benefiting least from increased multilinguality are Native American languages like Quechua (qu), K’iche’ (quc), Aymara (ay), Yucatec Maya (yua), and Kalaallisut (kl), which tend to have few or no similar languages in the training data.

While our evaluation sets are limited to 38 languages, we provide RttLangIDChrF as a highly approximate measure of the quality of the model on all 1000 languages in Appendix Table 24. More details of the metric and the performance of the model on all languages are described in Section 4.3.

3.4 Effects of Large Scale Data Augmentation

en-ff en-mni en-kri en-doi en-bm
Mono Size 86k 106k 129k 179k 187k
1000L 41.2 32.3 56.4 38.2 56.5 34.9 63.1 25.5 36.3 34.3
Finetuned 44.5 35.7 60.5 40.8 62.2 36.8 64.6 37.2 37.9 34.7
en-quc en-gom en-yua en-bho en-kl
Mono Size 250k 311k 419k 734k 741k
1000L 26.7 22.9 55.5 39.1 40.7 31.6 58.3 40.6 29.7 23.1
Finetuned 29.2 23.5 57.5 40.0 43.0 31.8 60.5 41.4 39.1 27.8
en-qu en-gn en-ak en-ts en-mai
Mono Size 842k 861k 1.29m 1.3m 1.35m
1000L 32.5 33.1 38.9 32.2 36.3 34.3 43.0 45.5 61.6 37.6
Finetuned 35.3 36.1 42.7 32.3 38.6 34.1 46.2 46.5 64.3 39.2
en-pcm en-nso en-ilo en-ti en-om
Mono Size 1.59m 1.87m 2.6m 3.9m 5.6m
1000L 51.2 53.5 51.3 41.6 43.4 52.4 44.2 21.1 38.1 39.1
Finetuned 59.3 54.5 52.2 45.4 61.7 53.4 46.0 21.0 41.2 39.9
en-sa en-dv en-lus en-as en-ckb
Mono Size 6.2m 7.9m 8.3m 9.3m 25.1m
1000L 46.3 28.4 45.2 43.7 34.5 39.3 58.6 36.7 54.4 41.6
Finetuned 49.0 30.9 47.3 43.8 40.0 40.3 59.8 39.0 54.7 42.7
Table 5: Comparison of our 1000-language 6B model against the 30-language version fine-tuned from this model. Scores are shown for the languages that were fine-tuned on; each cell reports ChrF for the xx→en direction followed by the en→xx direction.

In order to understand the limits of quality achievable with our zero-resource approach, we select a subset of 30 languages for large-scale data augmentation. We continue training our 1000-language model on this subset of 30 languages with the MASS, translation and online back-translation objectives. To leverage the full power of data augmentation, we translate all the available monolingual data (up to 10 million sentences per language) for these languages into English, and sample around 10 million English web sentences which are translated into each of these languages. The model is then trained further with MASS, translation, and offline back-translation and self-training on this synthetic data. This process is repeated only twice, since quality improvements from subsequent stages were observed to be incremental. This model is then compared against our vanilla 1000-language model that wasn’t trained with large-scale augmented data.
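The data generation step of each round can be sketched as follows; `translate` stands in for inference with the latest model checkpoint (an assumed interface):

```python
def synthesize_pairs(translate, mono_xx, mono_en, lang):
    """Forward-translate monolingual text in both directions. The resulting
    pairs serve simultaneously as self-training data for the generating
    direction and as back-translation data for the reverse direction."""
    xx_en = [(x, translate(x, tgt="en")) for x in mono_xx]    # xx-original pairs
    en_xx = [(e, translate(e, tgt=lang)) for e in mono_en]    # en-original pairs
    return xx_en, en_xx
```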

The results of this comparison are depicted in Table 5. We notice that fine-tuning with augmented data yields substantial quality improvements across the board, with especially large improvements in en→xx translation for several languages. Excepting our Native American subset, most languages reach greater than 40 ChrF on xx→en translation and greater than 35 ChrF in the reverse direction.

3.5 Effect of Filtering Synthetic Data

en-ff en-mni en-kri en-doi en-bm
Mono Size 86k 106k 129k 179k 187k
Finetuned 44.5 35.7 60.5 40.8 62.2 36.8 64.6 37.2 37.9 34.7
Filtered 45.3 35.3 62.1 40.8 64.2 35.4 65.5 36.9 38.4 34.7
en-quc en-gom en-yua en-bho en-kl
Mono Size 250k 311k 419k 734k 741k
Finetuned 29.2 23.5 57.5 40 43 31.8 60.5 41.4 39.1 27.8
Filtered 29.8 23.5 57.9 40.8 43.7 32.0 60.9 41.2 39.0 35.8
en-qu en-gn en-ak en-ts en-mai
Mono Size 842k 861k 1.29m 1.3m 1.35m
Finetuned 35.3 36.1 42.7 32.3 38.6 34.1 46.2 46.5 64.3 39.2
Filtered 35.3 35.7 43.0 32.1 38.6 34.3 47.5 46.5 65.5 39.2
en-pcm en-nso en-ilo en-ti en-om
Mono Size 1.59m 1.87m 2.6m 3.9m 5.6m
Finetuned 59.3 54.5 52.2 45.4 61.7 53.4 46.0 21.0 41.2 39.9
Filtered 64.6 57.2 52.5 47.1 62.7 54.1 46.1 21.5 40.7 40.0
en-sa en-dv en-lus en-as en-ckb
Mono Size 6.2m 7.9m 8.3m 9.3m 25.1m
Finetuned 49.0 30.9 47.3 43.8 40.0 40.3 59.8 39.0 54.7 42.7
Filtered 49.2 31.0 48.4 45.0 42.1 39.7 60.6 39.6 55.3 42.6
Table 6: Comparison of our 30-language fine-tuned model against one fine-tuned with RTT- and LangID-filtered data. Each cell reports ChrF for the xx→en direction followed by the en→xx direction.

One challenge with using the model’s own predictions to improve model quality is the presence of positive feedback loops which can magnify any problems with the model outputs, as elaborated in Section 4.8. To reduce these effects we compare our fine-tuned model against a version fine-tuned on a round-trip-translation (RTT) and LangID filtered version of our synthetic training data. The results of this comparison are depicted in Table 6. For most languages this additional filtering has no major impact on model performance, usually staying within 0.5 ChrF of the non-filtered model. However, for certain languages like Nigerian Pidgin (pcm) and Kalaallisut (kl) we observe large improvements in quality. Closer inspection reveals that our pcm data suffered from mixing with African-American Vernacular English (AAVE), which gets magnified as the model trains on its own outputs. Similarly, our kl monolingual set was polluted with a small fraction of Danish (da) data, which gets magnified by self-training. Filtering with round-trip translation and LangID reduces the instances of data pollution, improving the correctness of the model outputs. Section 4.8 explores this in greater depth.
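A sketch of this filtering step; `langid` and `rtt_score` are assumed callables standing in for the LangID model and the round-trip translation pipeline, and the threshold is illustrative rather than the value used in our pipeline:

```python
def filter_synthetic(pairs, lang, langid, rtt_score, min_rtt=0.4):
    """Keep a synthetic (English, target) pair only if the target side is
    identified as the intended language and round-trip translation back to
    English scores above a ChrF-style threshold."""
    return [
        (en, tgt) for en, tgt in pairs
        if langid(tgt) == lang and rtt_score(en, tgt) >= min_rtt
    ]
```

This is how pollution like Danish sentences in the kl corpus gets dropped: they fail the LangID check even when the round-trip score looks plausible.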

3.6 Distillation

We next describe our approach for distilling our best 6B-parameter Transformer model, fine-tuned on 30 languages, into smaller, more efficient architectures. This process also yields further quality gains.

3.6.1 Data Generation for Distillation

We follow the sequence-level distillation approach (Kim & Rush, 2016) to distill our teacher model into smaller students. To this end, we generate large amounts of synthetic forward- and backward-translated data with the teacher model, which is then used to train a smaller student model. We start with the 30-language fine-tuned model and translate several million English sentences into each of the languages. We also translate our entire monolingual corpus in these languages (up to a per-language maximum) into English. This synthetic data is then filtered using the same round-trip and LangID filtering approach described in Section 3.5. We also applied a few manual regex-based filters for specific languages where we observed particular data pollution and noise issues, as further elaborated in Section 4.8.

3.6.2 Distillation Approach and Hyper-parameters

We considered two candidate student architectures with increasing encoder depth, referred to as shallow encoder (330M parameters) and deep encoder (850M parameters). Both are sequence-to-sequence models with attention (Bahdanau et al., 2015), using Transformer encoders and LSTM (Hochreiter & Schmidhuber, 1997) decoders, as described in Chen et al. (2018). All student models are multilingual, with separate models for xx→en and en→xx translation.

Effect of student model capacity:

Although there are some quality improvements from optimizing hyperparameters, we found that most distilled models performed similarly, having little sensitivity to the hyperparameters we experimented with. Nonetheless, one important take-away is that trends that appeared to hold for the shallow encoder model, for instance the impact of increased amounts of back-translated data, were often erased when experimenting with the deeper model.

In all cases, the shallow encoder model was noticeably worse. For 30-language models, the deep encoder saw consistent ChrF gains in both the en→xx and xx→en directions. With respect to multilinguality, we found that increasing the number of languages per student from 6 to 30 yielded only small median ChrF losses.

Effect of amount of synthetic en→xx data used for distillation: Another important hyperparameter was the amount of English-original synthetic data used for distillation (the non-English datasets were small enough that we could simply translate them in full). In the en→xx direction, where English-original data is forward-translated data, we varied the number of forward-translated sentences from 1M to 8M, but found no significant differences in model performance. In the xx→en direction, where English-original data is back-translated data, we saw consistent but small gains across all languages, with ChrF rising by about +0.6 when increasing from 1M to 2M synthetic sentences. Increasing past 2M back-translated sentences yielded minimal gains. However, these experiments were carried out on the lower-capacity shallow encoder models (with 14 languages each), so larger gains from higher quantities of back-translated data may be possible with a higher-capacity model.

Although the teacher model may be trained with back-translation (BT), sequence-level distillation is typically conducted only with forward-translated (FT) data. In our case, however, some interesting implications arise from a) the very small data sizes, and b) the asymmetrical teacher model quality, where xx→en quality tends to be better than en→xx quality. For en→xx, FT (i.e., English-original) data is more abundant but also lower quality; for xx→en, the FT data is higher quality but rather scarce. Therefore, we experiment with using different proportions of FT and BT data for distillation.

For en→xx, we saw very small differences in performance between different ratios, and settled on using 80% FT and 20% BT data.

In an initial experiment on a 7-language shallow encoder model in the xx→en direction, we saw noticeable losses with under 50% FT data, losing about 1.7 ChrF when going to 33% FT, and a further 3.7 ChrF when going to 20% FT. Values above 50% FT were not significantly different. However, when observing the ChrF curve over time, we saw that the models with more BT data were learning more slowly and probably underfitting. Replicating this experiment with the deep encoder model, performance on all languages increases, and the aggregate differences between different ratios of synthetic data are minimal. However, there is a slight trend on a per-language basis, with higher-resource languages benefiting from more back-translated data: a Kendall tau of 0.28 between the number of monolingual training sentences and the difference between the 70% FT and 20% FT models. Since the differences are very slight, we favor models with a smaller percentage of FT data (20%-50%), motivated by the intuition that the increased amount of natural target-side English, compared to the small number of natural source-side sentences, may have benefits we cannot measure with our evaluation sets and metrics.
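The mixing scheme can be sketched as follows; only the FT/BT proportions are stated above, so sampling with replacement and the fixed set size are our assumptions:

```python
import random

def mix_distillation_data(ft_pairs, bt_pairs, ft_fraction=0.2, size=None, seed=0):
    """Assemble a distillation training set with the given fraction of
    forward-translated (FT) pairs; the remainder is back-translated (BT)."""
    rng = random.Random(seed)
    size = size if size is not None else len(ft_pairs) + len(bt_pairs)
    n_ft = round(size * ft_fraction)
    mixed = [rng.choice(ft_pairs) for _ in range(n_ft)]
    mixed += [rng.choice(bt_pairs) for _ in range(size - n_ft)]
    rng.shuffle(mixed)
    return mixed
```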

3.6.3 Comparison against teacher model

A comparison of our best deep encoder student models against the teacher on all languages can be seen in Table 7. The student outperforms the teacher in the en→xx direction, with an average gain of +1.1 ChrF, and shows minor gains of +0.2 average ChrF on xx→en. These gains are probably in part due to the filtering applied to the RTT data. However, for the most part the differences are likely an artifact of the fact that the eval sets are English-original, meaning that the English sentences are natural sentences while the non-English sentences were produced via translation. Since distilled models tend to produce more translationese than their teachers, reference-based metrics like ChrF will tend to overestimate their performance in the source-original direction (en→xx), and underestimate it in the target-original direction (xx→en). This phenomenon is investigated in depth in Freitag et al. (2019).

en-ff en-kri en-doi en-bm en-ay
Mono Size 86k 129k 179k 187k 267k
Teacher 45.3 35.3 64.2 35.4 65.5 36.9 38.4 34.7 39.8 28.6
Student 45.5 37.2 64.6 36.1 65.7 40.5 38.6 36.4 40.1 34.2
+0.2 +1.9 +0.4 +0.7 +0.2 +3.6 +0.2 +1.7 +0.4 +5.6
en-gom en-bho en-kl en-ee en-qu
Mono Size 311k 734k 741k 796k 842k
Teacher 57.9 40.8 60.9 41.2 39.0 35.8 37.0 40.2 35.3 35.7
Student 57.4 42 61.3 42.7 39.5 41.0 37.5 40.7 35.1 36.1
-0.4 +1.2 +0.4 +1.5 +0.5 +5.2 +0.5 +0.4 -0.1 +0.4
en-gn en-ak en-ts en-mai en-ln
Mono Size 861k 1.3m 1.3m 1.3m 1.4m
Teacher 43.0 32.1 38.6 34.3 47.5 46.5 65.5 39.2 31.7 34.6
Student 43.4 31.9 38.9 34.4 47.7 46.8 65.5 40.0 32.1 34.6
+0.4 -0.2 +0.4 +0.1 +0.2 +0.3 0.0 +0.8 +0.4 0.0
en-nso en-lg en-ilo en-ti en-om
Mono Size 1.9m 2m 2.6m 3.9m 5.6m
Teacher 52.5 47.1 40.5 39.3 62.7 54.1 46.1 21.5 40.7 40.0
Student 52.8 48.2 41.0 39.8 62.9 54.8 46.0 21.9 41.5 40.1
+0.3 +1.1 +0.5 +0.5 +0.2 +0.8 0.0 +0.5 +0.8 +0.2
en-sa en-dv en-lus en-as en-ckb
Mono Size 6.2m 7.9m 8.3m 9.3m 25.1m
Teacher 49.2 31.0 48.4 45.0 42.1 39.7 60.6 39.6 55.3 42.6
Student 49.2 33.3 48.4 45.3 41.9 41.3 61.1 40.3 56.4 44.3
0.0 +2.3 0.0 +0.3 -0.1 +1.6 +0.6 +0.7 +1.1 +1.7
Table 7: Performance of teacher model versus student model in ChrF. Each cell reports the xx→en score followed by the en→xx score, with the student-minus-teacher difference shown beneath each pair.

For the 30 languages in the distilled model, we looked for correlations between the amount of monolingual data, the number of speakers, the percentage of data removed by deduplication, the harmonic data quality score (Section 2.1.8), and the ChrF and human-rated scores in the en→xx and xx→en directions. The correlations were all quite low, with Kendall's tau under 0.3. The only exception was a larger correlation (tau = 0.59) between ChrF in the xx→en direction and a heuristic measure of closeness to supervised languages (e.g. Bhojpuri (bho), being very close to Hindi, gets a score of 0.9; Aymara (ay), with some loanwords from Spanish but otherwise entirely unrelated, gets a score of 0.1).
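For reference, the rank correlation used here (Kendall's tau; this sketch computes the tau-a variant, treating ties as neither concordant nor discordant) can be computed as:

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall rank correlation: (concordant - discordant) / total pairs."""
    assert len(xs) == len(ys) and len(xs) > 1
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        d = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if d > 0:
            concordant += 1
        elif d < 0:
            discordant += 1
    total = len(xs) * (len(xs) - 1) / 2
    return (concordant - discordant) / total
```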

4 Evaluation

Traditional metrics like Bleu, which have enough problems with higher-resource languages (Freitag et al., 2019, 2020), have even more problems with the languages studied in the present work. Reference translations are hard to come by, and tail languages are often less standardized with respect to dialect, orthography, and sometimes even Unicode encoding. Furthermore, the frequent presence of close varieties complicates evaluation: automatic metrics like ChrF can give very high scores to outputs which are entirely in the wrong variety. Finally, as Marchisio et al. (2021) find, outputs of unsupervised NMT are less monotonic and more natural than outputs from supervised NMT, and, like the paraphrased references studied in Freitag et al. (2020), receive much lower Bleu scores, although these scores are sometimes better correlated with human judgments of quality. Model-based metrics like YiSi (Lo, 2019), BLEURT (Sellam et al., 2020) or COMET (Rei et al., 2020) cannot be used for these languages due to the lack of human ratings and pretrained models in these languages.

The following sections analyze performance along a variety of axes. First we describe the evaluation sets we collected (Section 4.1). We analyze the models starting with more quantitative methods, including the behavior of ChrF versus Bleu (Section 4.2) and human evaluations (Section 4.4). We explore RttLangIDChrF, a reference-free metric suited to very low-resource languages, which shows reasonable correlation with ChrF (Section 4.3). We then perform qualitative analysis of our model outputs, highlighting several error patterns, including confusion between distributionally similar words and concepts like “tiger” and “miniature crocodile” (Section 4.5), errors on single-word inputs (Section 4.7), and the magnification of error modes in distilled models (Section 4.8).

4.1 Evaluation Sets

In order to measure translation quality for development and experiments, we collected reference translations for 38 languages. For ease of comparison we collected a multi-way parallel dataset, with the same English side for all languages. The xxen eval sets were made by reversing this dataset. Rather than opting for larger evaluation sets for a small number of languages, we decided to collect relatively small evaluation sets covering 38 linguistically and geographically diverse languages. % of the English sentences were drawn from a corpus of simpler and more colloquial language (average length: 12.0 tokens), and the remaining % from more technical web content (average length: 20.1 tokens). The resulting dataset was shuffled and then split into a 600-sentence development set and a 600-sentence test set. The tables in this paper report the score on the combined set except where otherwise mentioned. The average absolute difference between the ChrF scores on the two sets was about 0.5.

4.2 Evaluation Metrics

For this work we deem token-level metrics like Bleu unsuitable, and rely almost entirely on the character-level ChrF (Popović, 2015), reported on a scale from 0 to 100 (sacrebleu signature: nrefs:1|case:mixed|eff:yes|nc:6|nw:0|space:no|version:2.0.0). This section explains this decision, and gives examples of where the two metrics differ on the languages we studied.

Many of the languages studied in this work have complex morphologies, including the agglutinating Bantu languages and the polysynthetic Native American languages. This means that they can inflect to form very long tokens. An extreme example can be seen in Table 9, which shows how full sentences in English translate to only a few tokens in Kalaallisut (kl). For such highly inflecting languages, it is to be expected that a character-based metric like ChrF would correlate better with quality than a token-based metric like Bleu. This expectation is borne out by our observations, which align with Mirzakhalov et al. (2021b), who observed similar trends for the agglutinating Turkic languages.

It is difficult to make any sort of direct comparison between Bleu and ChrF. Not only do they measure different quantities and have different score distributions (e.g. ChrF = 0 is exceedingly unlikely), but they are also influenced by different artifacts. The Bleu score will be affected by inflection: a slightly wrong inflection will nullify all affected n-grams. On the other hand, ChrF is affected by the writing system – for languages with abugidas or abjads, for instance, it will behave differently than for languages with alphabets, as there will be fewer characters to match. (On languages with writing systems like Chinese or Japanese, it will function very similarly to Bleu.)
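To make the inflection effect concrete, the sketch below compares token-level overlap with character 6-gram overlap on a Kalaallisut reference from Table 9. The hypothesis here is a hypothetical one-character misspelling we constructed for illustration, not real model output, and the overlap functions are simplified stand-ins for Bleu and ChrF:

```python
def token_overlap(hyp, ref):
    """Fraction of reference tokens reproduced exactly in the hypothesis."""
    ref_toks = set(ref.split())
    return len(set(hyp.split()) & ref_toks) / max(len(ref_toks), 1)

def char_ngram_overlap(hyp, ref, n=6):
    """Fraction of reference character 6-grams that appear in the hypothesis."""
    h = {hyp[i:i + n] for i in range(len(hyp) - n + 1)}
    r = {ref[i:i + n] for i in range(len(ref) - n + 1)}
    return len(h & r) / max(len(r), 1)

ref = "Qitsuuteqarpunga eqiasuttuuvoq."   # "I have a cat who is lazy." (Table 9)
hyp = "Qitsuuteqarpunga eqiasuttuuvuq."   # hypothetical: one character off

# One wrong character nullifies the entire second token at the token level
# (overlap drops to 0.5), while most character 6-grams still match.
```

This is the failure mode described above: a single wrong inflection destroys the token match, but the character-level score degrades only slightly.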

Nonetheless, in order to make some attempt to demonstrate how and where the two metrics differ, we have defined a very simplistic conversion metric based on the performance of the less-inflecting low-resourced languages we studied: ScaledChrF = 0.75·ChrF − 0.15 (with both scores on a 0–1 scale). The purpose of this metric is solely to have some way of flagging which languages lead to very different performance under the two metrics. Table 8 shows the language pairs in our distilled models, along with their Bleu score, their ChrF score (unscaled) and the ratio between ScaledChrF and Bleu. The results correspond with intuitions: the languages with the highest ratio tend to be polysynthetic (Quechua/qu, Aymara/ay, Kalaallisut/kl), agglutinative (Luganda/lg, Lingala/ln), or otherwise highly fusional (Sanskrit/sa). Oromo (om) is worth a special mention, as its orthography seems to have higher character usage per morpheme (because of the many doubled letters), which may inflate ChrF. It is not clear why Dhivehi (dv) has such a high ratio. The performance of both metrics on Tigrinya (ti) is also something of a mystery, given that the model translations were rated very highly by humans, and native speakers validated that the reference translations were also high quality.
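The ratio column of Table 8 can be reproduced from the formula above. A small sketch, assuming (as the table's values imply) that the formula operates on scores scaled to 0–1 while the table reports them on 0–100; the function name is ours:

```python
def scaled_chrf_to_bleu_ratio(bleu, chrf):
    """Ratio of ScaledChrF (0.75 * ChrF - 0.15, on a 0-1 scale) to Bleu.
    Inputs are scores on the 0-100 scale used in Table 8."""
    scaled_chrf = 0.75 * (chrf / 100.0) - 0.15
    return scaled_chrf / (bleu / 100.0)

# Spot-checks against Table 8:
print(round(scaled_chrf_to_bleu_ratio(3.8, 40.1), 1))   # en-om: 4.0
print(round(scaled_chrf_to_bleu_ratio(12.8, 46.7), 1))  # en-ts: 1.6
```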

lp Bleu ChrF ratio lp Bleu ChrF ratio
enom 3.8 40.1 4.0 ents 12.8 46.7 1.6
enlg 3.7 39.8 4.0 engom 10.8 42.3 1.6
endv 5.4 45.5 3.5 enee 11.9 40.5 1.3
ensa 2.8 33.1 3.5 enff 9.9 37.2 1.3
enqu 3.6 36.2 3.4 enbm 9.5 36.4 1.3
enln 3.5 34.9 3.2 enlus 13.2 41.5 1.2
enkl 5.3 40.6 2.9 enkri 10.6 36.2 1.1
enay 4.6 34.3 2.4 enilo 24.2 54.9 1.1
enckb 9.4 44.2 1.9 enak 10.4 34.6 1.0
engn 5.2 32.1 1.7 ennso 20.3 47.6 1.0
enmai 8.6 39.8 1.7 enbho 16.7 42.7 1.0
enas 9.6 40.6 1.6 endoi 15.3 40.7 1.0
enmni-Mtei 12.9 47.1 1.6 enti 3.7 22.1 0.4
Table 8: ChrF versus Bleu scores on enxx language pairs in the distilled model. Although it is almost impossible to compare these scores directly, for the variety of reasons explained in the text, it is still clear that they give a very different picture of performance on these languages. In order to make an approximate comparison between the two, we have included the ratio of the ScaledChrF score to the Bleu score. Language pairs in this table are sorted from the highest ratio (where Bleu underestimates performance) at the top to the lowest ratio at the bottom. As expected, polysynthetic and agglutinative languages are misjudged by Bleu.
source translation
I can’t sing so well so I don’t want to be a singer. Erinangippallaannginnama erinarsortartunngorusunngilanga.
Without my car, I wouldn’t be able to get to work. Biileqanngikkuma suliartorsinnaanavianngilanga.
I would like to raise a dog instead of a cat Qimmiuteqarusunnerussangaluarpunga qitsuuteqarnissannit.
I have a cat who is lazy. Qitsuuteqarpunga eqiasuttuuvoq.
Table 9: Example translations from English to Kalaallisut from our evaluation set. A token-based metric like Bleu is unsuitable for such a highly inflecting language, as exact match is very unlikely. Character-based metrics, like ChrF, are more appropriate.

4.3 RTT LangID ChrF

Because of the infeasibility of collecting reference translations for 1000+ languages, some sort of reference-free evaluation is increasingly important. Round-trip translations have been utilized as a metric for MT Quality Evaluation several times over the last few decades (Huang, 1990; Aiken & Park, 2010; Moon et al., 2020). We experimented with a simple modification we call RttLangIDChrF. To compute this score, we simply round-trip-translate a corpus of English sentences through some language, and compute the ChrF score of these translations with respect to the original sentences. However, since this metric is trivially fooled by error modes like copying or translating into the wrong language, we omit any round-trip translation where the intermediate translation is not in the correct language according to our SSLID model. If fewer than 10% of the intermediate translations receive the correct LangID score, we consider the score invalid, as this may also be a result of errors in the LangID model.
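The procedure just described can be sketched as follows. Here `forward`, `backward`, and `langid` are placeholders for the actual translation models and SSLID classifier, and `chrf` is a simplified sentence-level approximation of the corpus-level ChrF used in this work:

```python
from collections import Counter

def chrf(hyp, ref, max_n=6, beta=2.0):
    """Simplified sentence-level ChrF: F-beta over character n-gram matches."""
    precs, recs = [], []
    for n in range(1, max_n + 1):
        h = Counter(hyp[i:i + n] for i in range(len(hyp) - n + 1))
        r = Counter(ref[i:i + n] for i in range(len(ref) - n + 1))
        overlap = sum((h & r).values())
        precs.append(overlap / max(sum(h.values()), 1))
        recs.append(overlap / max(sum(r.values()), 1))
    p, rc = sum(precs) / max_n, sum(recs) / max_n
    return 0.0 if p + rc == 0 else 100 * (1 + beta**2) * p * rc / (beta**2 * p + rc)

def rtt_langid_chrf(english_sents, forward, backward, langid, lang):
    """Loose RttLangIDChrF plus the fraction of correct-language intermediates."""
    kept = []
    for sent in english_sents:
        intermediate = forward(sent)
        if langid(intermediate) != lang:
            continue  # drop round trips whose intermediate is the wrong language
        kept.append((backward(intermediate), sent))
    frac_correct = len(kept) / len(english_sents)
    if frac_correct < 0.10:
        return None, frac_correct  # too few valid intermediates: score invalid
    loose = sum(chrf(h, r) for h, r in kept) / len(kept)
    return loose, frac_correct
```

The strict variant described in Section 4.8's precursor (below) is then simply `loose * frac_correct`.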

We computed correlation between ChrF and RttLangIDChrF over 30 languages, and found it to correlate moderately well both in the enxx direction () and the xxen direction (). When we recalculated only on the scores from the distilled models, the correlations were much better in the xxen direction (), but similar in the enxx direction ().

In addition to this, we computed this score over all languages in the 1000-language model, of which 630 passed the LangID > 10% threshold. Figure 2 shows these scores as a function of log data size. There is a relatively clear trend (), and the large majority of languages with over 100,000 monolingual sentences have relatively high scores. In general, the languages above the trend line are close dialects to high-resource languages, most notably variants of English written in different scripts. Languages below the trend line tend to be unrelated to high-resource languages or have poor-quality data according to our data audit.

Of the 630 languages with valid RttLangIDChrF scores, 268 have a RttLangIDChrF score of over 30.0, which we tentatively deem “hopeful quality”, and 147 have RttLangIDChrF , which is the minimum score from any of our supervised languages. Interestingly, there is fairly low correlation between RttLangIDChrF and the percent of intermediate translations assigned the correct LangID score (). One possible explanation is that the most frequent error modes are wrong-language outputs and copying, phenomena that we observed in Section 4.8.

We include RttLangIDChrF for all 630 languages in Appendix Table 24, as an approximate measure of translation quality of the model. However, since we have done only cursory analysis of how effective this score is at measuring quality, we advise readers to view it with appropriate skepticism. While a low RttLangIDChrF probably means low translation quality, a high value may well mean something other than high quality.

The version of this score described above we call the loose version of RttLangIDChrF, since it does not penalize the model for intermediate sentences in the wrong language. To get the strict RttLangIDChrF, we multiply the loose RttLangIDChrF by the percent of intermediate translations assigned the correct LangID score, thereby penalizing wrong-language translations. This version does not correlate well with ChrF on the 30 languages where we have evaluation sets, and in fact has negative correlation in the xxen direction () and only a weak correlation in the enxx direction (), likely due to noise from the LangID model. It also has weaker correlation with the size of the monolingual dataset over all 630 applicable languages (), and is noticeably confounded by close dialects, e.g. assigning Bosnian (bs) a low score because intermediate translations were frequently LangID'd as Croatian (hr) – a mistake that should not be penalized, as the two languages are frequently indistinguishable. Despite these failings, it has the attractive property that it penalizes the common error mode of wrong-language outputs; furthermore, the supervised and zero-shot languages appear more clearly separated when plotted against monolingual data size. The graph is therefore included in Appendix B.1.

Figure 2: Plot of RttLangIDChrF scores (loose) for languages as a function of log monolingual data size. With over 100,000 sentences, almost any language does reasonably well. Outliers are labeled with their language code. The largest outliers are English in Cyrillic script (en-Cyrl), which has an excellent RttLangIDChrF score but very little monolingual data, and Tibetan (bo), which has plenty of monolingual text but very poor performance. In general, the languages above the trend line are close to high-resource languages (where the metric may also be fooled), and the languages below the trend line are linguistically distant from other languages in the model or have poor-quality data. Languages added to Google Translate as part of this effort (all unsupervised except Sorani Kurdish (ckb)) are marked with stars.

4.4 Human evaluations

Any decision on translation quality ultimately cannot be made with automatic metrics like ChrF alone. In order to understand the quality of our distilled models, we asked human raters to rate the quality of the translations from our test set on a scale from 0 (nonsense or wrong language) to 6 (perfect). Full results may be seen in Appendix Table 20.

Although we made an attempt to calibrate raters and explain each point in the scale very clearly, each rater will naturally have a different understanding of “a good translation”. For this reason, it is very difficult to interpret these results in any sort of holistic way. However, with some diving into the results together with native speakers, a few things stood out.

The biggest takeaway is that automatic metrics overestimate performance on related dialects. Nigerian Pidgin (pcm), a dialect of English, had very high Bleu and ChrF scores, of around 35 and 60 respectively. However, humans rated the translations very harshly, with a full 20% judged as “Nonsense/Wrong Language”, and trusted native speakers confirming that the translations were unusable. Krio (kri, close to English), Maithili (mai, close to Hindi), and Bhojpuri (bho, close to Hindi) were in a similar boat, though trusted native speakers agreed that the translation quality, though borderline, was usable. What is happening here is that the model translates into (a corrupted version of) the wrong dialect, but one close enough on a character n-gram level that the ChrF is still high. In our case, this is the result of a data pollution problem. Since these languages are so close to other, much more common languages on the web – in this case, English and Hindi – the training data is much more likely to be mixed with either corrupted versions of the higher-resource language or other varieties. As a result, many model outputs that were supposed to be in Dogri (doi) were actually in misspelled or ungrammatical Hindi (hi), outputs supposed to be in Nigerian Pidgin (pcm) were sometimes in other English dialects like AAVE, and so on.

4.5 Mistakes on distributionally similar words

sl reference translation
ak I believe a lion is stronger than a tiger. I think the hyena’s hotter than the elephant.
dv I believe a lion is stronger than a tiger. I believe a lion would be stronger than a miniature crocodile.
mni I believe a lion is stronger than a tiger. I believe a snake is stronger than a crocodile.
doi I believe a lion is stronger than a tiger. I believe seizures are more severe than epilepsy.
ff I believe a lion is stronger than a tiger. I think a rabbit is stronger than a squirrel.
mni The first three colors are red, orange,and yellow. The first three colors are red, yellow, and blue.
qu The first three colors are red, orange,and yellow. The first three colors are red, fire red, and yellow.
sa The first three colors are red, orange, and yellow. The first three colors are red, yellow and saffron.
ts The first three colors are red, orange, and yellow. the first three colors are red, purple and pink
yua The first three colors are red, orange, and yellow. the first colors are red, red and yellow.
sa In this I use capsicum, tomatoes, onions, garlic, green chilies, olives, etc. and do not use cheese and butter in it. Here I deal with greater marjoram, blood fruit, plantain, lassi, green marjoram, jujube, etc., but I do not deal with curd and yogurt.
ak I would want to be a dog for a day. I want to be a crocodile just one day.
ak I would ask my cat some questions about what he’s always trying to tell him by meowing. I will ask my friend some questions about why she is crying.
ak my dog keeps me moving and enjoying life. he is man’s best friend. my pet cat teaches me how to live a healthy lifestyle and enjoy being with people.
ak I went to a carnival yesterday that was located in the middle of nowhere under a huge red and white circular striped tent I went to a place of pleasure in a desolate place, unknown to me, where there was a parrot that had a golden-yellow coat
ak John was working in the lighthouse and went for a walk on the beach one night. John was working in the synagogue and he was sound asleep one afternoon.
ak this is why hair turns grey with age. this is the reason why ticks change into worms after a certain period of time.
sa my bad habit is that I eat too much. My bad behaviour is that I eat poison.
sa Dogs are very intelligent animals, they understand very much about humans. Cockroaches are extremely sharp minded animals, they know about humans correctly
sa Susy loved her space book and would ask her parents to read it to her every night. Susie loved her newspaper, and she asked her parents to read it to her in the morning.
Table 10: Examples of correct translations (blue) and mistranslations (orange), illustrating the model's tendency to make mistakes on distributionally similar nouns.

We observe that our zero-resource models exhibit some characteristic error modes. The most common one relates to translating nouns that occur in distributionally similar contexts in the training data. This occurs even for relatively common nouns like “tiger” – which is often translated as another kind of animal, showing that the model learned the distributional context in which this noun occurs, but was unable to acquire the exact mappings from one language to another with enough detail within this category. This may be related to the relatively small amounts of training data, alongside the unsupervised nature of training. Common nouns suffering from these mistakes include animal names, colors, and times of day. This was also an issue with adjectives, but we observed few such errors with verbs. Sometimes, words were translated into what might be considered culturally analogous concepts – for example, translating “cheese and butter” into “curd and yogurt” when translating from Sanskrit (sa). Surprisingly, the model hypotheses were often strings that would probably yield a high perplexity under most language models. Table 10 provides a variety of examples of mistakes from these models. These examples are from the 6B-parameter 1000-language Transformer model, described in Section 3.3.

A good example of words that have different meanings but are distributionally similar given the usage of the language is the string “English Language”. Around 25% of the languages translated this string into the name of their own language, e.g. into Tsonga as ririmi ra xitsonga or into Oromo as Afaan Oromoo.

Another interesting phenomenon evident from Table 10 is that models seem to be good at translating “red”, mediocre at translating “yellow”, and poor at translating “orange”. This observation is consistent with cross-lingual hierarchy of color terms described in Berlin & Kay (1969); Saunders & Brakel (2002), which find that terms for color generally arise in specific orders across the world’s cultures, with words for “red” occurring in stage II of color development, “yellow” in stage III/IV, and “orange” in stage VII.

An interesting parallel can be found between this problem and an instance of “unsupervised” translation in the real world – the case of the words for four and six in Etruscan, a language that went extinct about 2,000 years ago (Freeman, 1999). There does not currently exist any surviving parallel text with Etruscan, excepting a single 37-word tablet, but there do exist some 13,000 monolingual inscriptions. Etruscan does not seem obviously related to any living language, so meanings usually cannot be discovered through similarity to cognates. As a result, modern scholars have to use a sort of “unsupervised translation” called the Combinatorial Method, a first-principles approach to discovering word sense and grammar. And in the case of Etruscan, there are some words whose meaning cannot be teased apart with existing context. As an example, the words for all numerals from one to ten have been established with some certainty – with the exception of four and six. For these two there is no clear contextual clue to separate them, and there is no academic consensus on which of hu and śa means four and which six (Artioli et al., 2011).

4.6 Performance on tokens by token frequency

In order to quantitatively measure the distributional token errors from Section 4.5, we decided to look at the accuracy of our model at translating particular tokens as a function of their frequency in an open-domain English web corpus.

First, we collected a list of the 8000 most common tokens in a large, web-crawled, monolingual English corpus. Then, we separated these into exponentially spaced bins, based on the exponential distribution of token frequencies in natural text (Zipf, 1935). Each bin therefore corresponds to a set of English tokens in a certain frequency band. Then, for each of these bins, we made an evaluation set composed of all sentences in our standard xxen evaluation set containing these target tokens at least once in their references.
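The bin boundaries in Table 11 grow by a factor of four, which can be generated as follows. The growth factor is our reading of the table's lb/ub columns, not something stated explicitly:

```python
# Exponentially spaced frequency-rank bins over the 8000 most common tokens
bounds = [0] + [125 * 4**i for i in range(4)]   # [0, 125, 500, 2000, 8000]
bins = list(zip(bounds, bounds[1:]))
print(bins)  # [(0, 125), (125, 500), (500, 2000), (2000, 8000)]
```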

To score a model on a bin-specific evaluation set, we looked at a simple hit-rate metric. For a given reference sentence containing k tokens from the target set, the model got one point for each such token it produced in its output (a “hit”), for a maximum of k points. The hit-rate score for that eval set is then simply the number of hits the model got divided by the total number of possible hits. Formally, for a given set of tokens T (in our case a frequency bin), a list of reference translations R = (r_1, …, r_n) and a corresponding list of model hypotheses H = (h_1, …, h_n), treating each sentence as a set of tokens, the hit-rate is defined as follows:

hit-rate(T, R, H) = ( Σ_i |T ∩ r_i ∩ h_i| ) / ( Σ_i |T ∩ r_i| )
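A direct implementation of this hit-rate, under the set-of-tokens reading of the definition (our interpretation):

```python
def hit_rate(token_set, references, hypotheses):
    """Fraction of in-bin reference tokens that also appear in the hypothesis."""
    hits = possible = 0
    for ref, hyp in zip(references, hypotheses):
        targets = token_set & set(ref.split())   # bin tokens in this reference
        possible += len(targets)                 # maximum attainable points
        hits += len(targets & set(hyp.split()))  # points actually scored
    return hits / possible if possible else 0.0

# Example: one of two target tokens is reproduced, so the hit-rate is 0.5
hit_rate({"lion", "tiger"},
         ["a lion is stronger than a tiger"],
         ["a lion is stronger than a crocodile"])
```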

Table 11 shows the result of this analysis, sorted by the number of monolingual sentences seen by the model. Hit-rate is reported on four bins, starting with the most frequent tokens (tokens #0 - #125) and ending with the least frequent tokens (tokens #2000 - #8000). For each bin, the number of sentences in its eval set is given in the column labeled sents. The most frequent bin includes 1191 of the 1197 sentences in the full eval set; the least frequent bin includes only 958 of them. For each bin, the total number of possible points (i.e. hits) is also reported in the column labeled pts. Whereas the first bin has a total of 9682 possible points – averaging eight per sentence, and consisting mostly of function words – the least frequent bin had only 2420 possible points, or slightly above two per sentence. It is also worth mentioning that the least frequent bin had 6,000 possible tokens in it, so only about 30% of them actually occurred in our eval set.

sents pts lb ub metric bew mzn as lus dv sa om ti cv ilo
N mono 33M 12M 9M 8M 8M 6M 6M 4M 3M 3M
ChrF 51.5 54.6 59.1 34.5 45.1 46.6 38 43.8 46.6 43.5
1191 9682 0 125 64 65 73 43 58 64 55 60 62 41
994 2537 125 500 56 60 65 32 49 51 42 52 51 39
1034 3102 500 2k 55 61 66 31 48 50 39 45 49 42
958 2420 2k 8k 51 56 56 27 44 42 28 38 43 41
metric nso pcm ce mai ts ak skr bbc gn qu min yua iso gom
N mono 2M 2M 1M 1M 1M 1M 974k 932k 861k 842k 533k 419k 409k 311k
ChrF 51.1 52.3 45.2 61.4 43.6 36.1 47.8 44.2 38.2 32.2 62.2 40.8 29.1 54.9
73 61 55 68 61 57 64 56 57 40 72 58 45 69
57 53 47 66 45 34 53 47 35 28 66 40 23 59
51 54 45 67 42 30 53 46 32 30 67 39 20 59
41 53 43 62 38 24 47 39 28 29 63 31 17 52
metric av ady quc ban bm doi mad kri mni ff mfa
N mono 301k 296k 250k 188k 187k 179k 138k 129k 106k 86k 7k
ChrF 48.4 55.2 27 35.4 35.8 62.8 50.1 56.1 56.2 40.7 65.1
55 56 42 48 57 77 61 65 71 66 77
53 63 20 33 34 68 50 58 61 41 71
57 67 19 33 30 69 49 59 59 33 69
55 65 16 29 22 62 45 50 51 26 62
Table 11: Token hit-rate () for different sets of tokens, binned by frequency. The lower and upper bounds of the token frequency rank per bin are given in the lb and ub columns; thus, the top row is the hit-rate on the 125 most frequent tokens, and the bottom row is on the 6,000 least frequent tokens. Columns sents and pts give the number of sentences in each bin-specific eval set and the number of tokens in that bin occurring in those references, respectively. The ChrF score and the number of monolingual training sentences are also given. The interesting results are on languages, like Fulfulde (ff), that have a high hit-rate on more frequent tokens and a lower hit-rate on rarer tokens. All results are only in the xxen direction.

For many languages, higher ChrF means higher hit-rate across the board, and for others, like K'iche' (quc), low ChrF corresponds with generally lower hit-rate. Things get more interesting for languages that have higher hit-rates on the first several bins of higher-frequency tokens, but lower hit-rates on less common tokens. These are languages with higher ChrF scores, corresponding to the ability to translate the top 500 or so tokens very well, that nonetheless make frequent mistakes on less common tokens. Perhaps the best example is Fulfulde (ff), which had the relatively high ChrF of 40.7, but sees a large drop-off in hit-rate from 41% on bin 2 (tokens #125 - #500) to 26% on bin 4 (tokens #2000 - #8000). This is also the language where, anecdotally, we observed several of these sorts of mistakes. Mizo (lus) and Bambara (bm) exhibit similar patterns.

sl tl source translation
en ff devices kabirde (dispositifs)
en lus fragile hring hring (fragile)
en pcm removes dey remove
en pcm blame blame wetin?
en lus four pali a awm a
en lus freedom zalenna a awm
en quc solved xsol rij ri
en quc dam ri k’o pa ri cho
en kri juvenile pikin we no rich 18 ia yet
en kri notorious pipul den we get badnem
Table 12: Our models were frequently excessively verbose with single-word inputs. In some cases they would append extra definitions for words in parentheses or after commas (top half of table). In other instances they added function words from the target language after the translation (boldface), or gave whole definitions, as with Krio.

4.7 Errors on short inputs

Another category of errors we encountered was with single-word inputs to the model. The translations tended to be overlong, and the model would frequently give alternate translations or append frequent tokens (Table 12). Outputs were also frequently duplicates of other outputs, suggesting hallucination, or copies of the input. A breakdown of error types can be found in Table 13.

This was an issue we observed mainly for lower-resource languages. For higher-resource languages, like Ilocano, the model tended to provide a succinct and correct single-word translation. Ensuring that the MASS training data covers the lower end of the length distribution would likely remedy these issues. However, this is also an inherently difficult problem since we do not provide the model any source language information. For the xxen direction, the translation task needs to solve LangID and translation simultaneously, which could often be ambiguous and quite challenging for short queries. This can potentially be addressed by providing the model source language information along with the input, and is left to future work.

direction model copies multi-word duplicate repeats other
enxx teacher 1.28 66.23 26.48 5.55 17.43
enxx teacher. 0.85 71.71 31.66 2.79 13.87
enxx student 1.15 48.21 37.16 1.43 27.38
enxx student. 8.28 45.53 36.18 2.05 22.73
xxen teacher 61.87 6.19 21.11 0.74 14.74
xxen teacher. 47.3 17.02 24.8 0.47 18.21
xxen student 18.33 28.54 43.23 2.65 22.65
xxen student. 20.1 35.14 40.41 0.27 21.70
Table 13: Statistics for single-token translation on the top ten thousand most common tokens from the monolingual datasets, compared for 48 language pairs, for the fine-tuned teacher model. Metrics shown are 1) percent of outputs that copy the source; 2) percent of input that is multi-word; 3) percent of outputs that are identical to other outputs, suggesting incorrect translation; 4) percent of outputs with low character diversity, suggesting repeats; and 5) percent of outputs with none of those features. The entries with full-stops (.) after them use the “period trick” (Section 5.3). Single word translation is a particularly tricky task as the model needs to solve translation and LangID simultaneously, which can be undefined for short queries.
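The first three columns of Table 13 can be approximated as follows. The exact criteria used in the paper (e.g. how "low character diversity" is thresholded for the repeats column) are not fully specified, so this is a sketch covering only the copy, multi-word, and duplicate statistics, with function and key names of our choosing:

```python
from collections import Counter

def single_word_stats(sources, outputs):
    """Approximate the first three columns of Table 13, as percentages."""
    n = len(outputs)
    output_counts = Counter(outputs)
    copies = sum(out.strip() == src.strip() for src, out in zip(sources, outputs))
    multi_word = sum(len(out.split()) > 1 for out in outputs)
    duplicates = sum(output_counts[out] > 1 for out in outputs)
    return {
        "copies": 100 * copies / n,       # output identical to the source
        "multi_word": 100 * multi_word / n,  # more than one token in the output
        "duplicate": 100 * duplicates / n,   # output shared with another input
    }
```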

4.8 Magnification of error modes in student models

After training distilled models, we noticed a variety of unexpected error modes when analyzing the output translations:

  • The translations to Nigerian Pidgin (pcm) frequently instead translated to (often offensive) US slang. For instance, the English sentence “She said to herself” was translated to the unacceptable string “da b***** say ta da b*****self.”

  • Many of the translations to Kalaallisut (kl) were actually Danish (da)

  • Many translations to Sanskrit (sa) were actually Hindi (hi)

We developed filters to remove this content from the forward-translated data and distilled the models again. We observed that these problems were more prevalent in the synthetic data used for distillation (generated by the teacher model) than in the monolingual data that had originally been used to train these models, and that the issues were more severe for synthetic text produced by translating a noisier source corpus. The changes in noise level are illustrated in Table 14. We hypothesize that this error magnification could either be an artifact of a positive feedback loop arising from training the model on its own predictions (self-training), or due to a difference in domains between the training and distillation datasets.

This table also includes one entry about distillation from Japanese to English. In this case, we found that one particular Amharic string was often hallucinated. This string occurred occasionally in the target side of the original training data, and then occurred much more often in the distilled data.

language % mono % synth (clean) % synth (noisy)
pcm 8% 20% -
kl 14% 18% 30%
sa 13% 16% 37%
ja 0.0004% - 1%
Table 14: Percent of each data set removed after applying specialized filters. mono refers to the monolingual data from the corpus mined in Section 2.2. synth(clean) refers to synthetic data generated by forward translating a clean English monolingual corpus and synth (noisy) refers to synthetic data generated from a noisy web-scraped English corpus. We see that the synthetic text generated by the teacher model exhibited these problems to a greater degree than the monolingual web-sourced data, and that these problems intensified on noisier data.
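A minimal version of the kind of language-based filter described above, with `langid` standing in for the SSLID model and the reported percentage corresponding to the columns of Table 14:

```python
def filter_synthetic(pairs, langid, expected_lang):
    """Drop forward-translated (source, target) pairs whose target side is not
    in the expected language, and report the percentage removed."""
    kept = [(src, tgt) for src, tgt in pairs if langid(tgt) == expected_lang]
    removed_pct = 100.0 * (len(pairs) - len(kept)) / len(pairs) if pairs else 0.0
    return kept, removed_pct
```

The paper's actual filters also targeted specific content (e.g. offensive slang, a hallucinated Amharic string), which would need additional pattern-based rules beyond this language check.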

4.9 Comparison on Flores benchmark

Since these models are not trained on the same data as public benchmarks, a comparison on public benchmarks is not necessarily very meaningful. Nonetheless, in Table 15 we provide a comparison between the spBLEU results from our method (on distilled models) versus the Flores-101 benchmark scores (Goyal et al., 2021) reported for the massively multilingual M2M-124 (Fan et al., 2021) for overlapping languages. Given the higher language coverage in our monolingual dataset, our models yield higher spBLEU for all language pairs.

LP this method M2M-124 LP this method M2M-124
enas 29.1 1.22 asen 34.5 3.76
enckb 28.5 0.23 ckben 37.5 7.65
enff 2.5 0.68 ffen 11.2 2.4
enlg 17.3 0.61 lgen 29.3 4.45
enln 24.7 1.03 lnen 30.2 4.57
ennso 32.5 1.54 nsoen 45.0 6.76
enom 17.2 0.4 omen 30.8 3.33
Table 15: Flores dev-test: comparing spBleu between this method and Goyal et al. (2021).

5 Additional Experiments and Notes

5.1 Non-English-centric bridging

There is nothing inherently English-centric about the approach to zero-resource translation put forth in this paper. Nonetheless, the model has only seen translated text between English and other languages, so it would be a reasonable hypothesis that it is just inherently better at translating to and from English, even in the zero-shot scenario.

We test this hypothesis by evaluating the model on bridged translations. We first translate the English source sentences to other languages using bilingual supervised models on these language pairs. Then we use the model proposed in this paper to translate these translations directly into the desired target language. For each desired target language we pick 1) the closest mid- or high-resource language (HRL) that we expect our bilingual models to do well on; and 2) if applicable, a lower-resource language (LRL) that may be closer to the desired target language, but has lower-quality supervised models. Please note that the definition of “close” is somewhat approximate. For instance, for Native American languages we choose colonial languages as the “close” languages, because they may share a certain amount of vocabulary, even if the grammar is entirely divergent. Furthermore, for some of these languages the “close” languages are in fact not very close at all, as with the Sino-Tibetan languages of North-East India (Bodo (brx), Meiteilon (mni-Mtei), Mizo (lus)), which are only somewhat related to the “closer LRL” of Myanmar/Burmese (my).
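The two-step bridging setup can be sketched as follows; the function arguments are placeholders for the bilingual supervised model and the multilingual zero-resource model described above:

```python
def bridged_translate(english_sent, bilingual_en_to_bridge, multilingual_translate,
                      target_lang):
    """Two-step bridged translation: English -> bridge language with a
    supervised bilingual model, then bridge -> target with the multilingual
    model. Errors in the first step compound into the second."""
    intermediate = bilingual_en_to_bridge(english_sent)
    return multilingual_translate(intermediate, target_lang)
```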

lang. direct close HRL closer LRL
Native American Languages
ay 33.1 es 34.2 - -
gn 31.5 es 28.9 - -
kl 35.5 da 27.5 - -
qu 35.3 es 29.5 - -
quc 24.1 es 22.5 - -
yua 31.5 es 28.4 - -
Indian Languages (Indo-European)
as 39.2 hi 39.8 bn 36.3
bho 42.0 hi 43.4 - -
doi 36.3 hi 39.5 pa 33.3
dv 44.4 hi 42.0 si 39.7
gom 40.2 hi 39.3 mr 39.7
ks 21.9 hi 25.7 ur 28.5
mai 38.1 hi 44.3 - -
sa 30.5 hi 27.3 - -
Indian Languages (Not Indo-European)
brx-Beng 4.6 hi 11.5 my 3.4
lus 38.6 hi 38.2 my 34.1
mni 40.7 hi 35.5 my 29.4
sat-Latn 20.9 hi 20.8 km 18.2
Bantu Languages
lg 38.7 sw 34.1 rw 33.6
ln 34.4 sw 31.9 fr 25.5
nso 45.7 sw 33.3 st 29.1
ts 46.2 sw 40.0 zu 44.0
Other Languages
bm 34.3 fr 26.6 - -
ff 34.7 sw 26.7 - -
ilo 54.0 id 48.4 fil 51.2
om 39.2 - - so 30.2
ti 21.4 - - am 22.6
yue 20.3 zh 21.6 - -
Table 16: Results for bridged translation from English (ChrF) on 1200 sentences/language. Bridging seems to improve quality only when the intermediate language is both higher-resource and close to the target language.

Table 16 shows the results of this investigation, along with which languages were chosen as “close high-resource languages” and “closer low-resource languages”. (For this experiment we considered Swahili (sw) to be a high-resource language; this is not precisely accurate, but it was the closest mid-resource language available.)

We find that in a substantial number of cases, bridged translation scores better on automatic metrics. This is especially true for the languages of India, where bridged translation draws even with or improves on direct translation even for the non-Indo-European languages. Overall, the largest improvements are seen on close dialects, for instance Maithili (mai: +6.2 ChrF), Kashmiri (ks: +6.6 ChrF), Bhojpuri (bho: +1.4 ChrF), and Cantonese (yue: +1.3 ChrF). Aymara (ay) and Dogri (doi) also saw noticeable gains. However, most Native American languages, Bantu languages, and languages without close relatives saw large losses from bridging.

Nonetheless, in 19 out of 28 cases, we find that direct translation from English produces the highest ChrF score. This may be due in part to the model being inherently better at translating from English, and in part to errors compounding in the two-step process. Evidence for the second hypothesis may be seen in the fact that only two languages (Kashmiri (ks) and Tigrinya (ti)) see any gain from bridging through a lower-resource language. Overall, it seems that bridging only improves quality when the intermediate model is already relatively high quality, and additionally when the intermediate language is close to the target language.

The fact that bridging works relatively well is a good sign. In practice, translation to and from English is not the major use case for many low-resource languages, and direct models to and from locally important languages (e.g. Hindi for India, Spanish for Latin America, etc.) would likely provide more utility to local communities.

5.2 Zero-shot transliteration

Many of the world’s languages are written in multiple scripts, whether because of historical and national changes, informal online usage, non-standardized writing systems, or different ethnic or religious populations. It is therefore important to support these languages in all of their writing systems, particularly for under-resourced languages, where the use of multiple scripts is more common.

For our data collection efforts, we crawled data in a variety of different scripts for several languages, e.g. both Malayalam in Malayalam script (ml) and Malayalam in Latin script (ml-Latn). We treated script-variants of the same language as separate languages, with their distinct <2xx> tags. To transliterate, we simply asked the model to provide “translations” from text in one script to the same language in another script.

We applied this approach to transliterate from Latinized variants of Indian languages to their native scripts, and found that it worked very well out of the box, appearing to be more robust than existing transliteration libraries to informal or nonstandard spellings. One example is the common abbreviation kr in Latinized Hindi (hi-Latn), as in kya kr rhe ho, which our model correctly transliterated as “kar” (in Devanagari script), whereas the rule-based model incorrectly rendered it as “ke”. Similarly, the model was able to do some spelling correction, e.g. on the misspelled Konkani swpnatat, which the model correctly transliterated to “swapnaat”. However, the model also had a tendency to change small parts of the input sentence, as well as occasionally hallucinating extra content. This also made it difficult to compare this model to rule-based approaches in a rigorous way, because the lack of guaranteed monotonic alignment made word error rate inapplicable.

One interesting example highlighting issues with this approach is the Konkani sentence Xet-camot ani kneddeam- gauncho vaur vo dondo. This was transliterated mostly correctly to Devanagari, but the dialect was changed: the original text is Goan Roman Catholic Konkani, but the transliteration was in Goan or Maharashtrian Konkani, changing “camot” to “camat” and “vo dondo” to “vaa dando”.

Future work is necessary to coax this technique into a form that does not take any liberties with the input. One promising direction is to separate the <2xx> tokens (Johnson et al., 2016) into a language subtag and a script subtag. For instance, <2ms> becomes <2ms> <2Latn>, and <2ms-Arab> becomes <2ms> <2Arab>. This enables zero-shot transliteration between any scripts supported by the model, simply by using the desired <2Script> tag at inference time. To prevent the model from editorializing, it would also be advisable to add synthetic transliterated parallel text with a <2transliteration> tag, to teach the model that this task only involves substituting letters/sounds.
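The proposed tag split can be sketched in a few lines. This is an illustrative sketch only: the tag format follows the examples above, but the default-script table is an assumption introduced here for the demonstration.

```python
# Sketch: split a combined <2xx> or <2xx-Script> target token into a
# language subtag and a script subtag. DEFAULT_SCRIPT is an assumed lookup
# for this sketch; a real system would need the default script of every
# supported language.
DEFAULT_SCRIPT = {"ms": "Latn", "hi": "Deva", "ml": "Mlym"}

def split_target_tag(tag):
    code = tag.strip("<>")
    assert code.startswith("2"), "expected a <2xx> style target tag"
    code = code[1:]
    if "-" in code:
        lang, script = code.split("-", 1)
    else:
        # No explicit script: fall back to the language's default script.
        lang, script = code, DEFAULT_SCRIPT[code]
    return [f"<2{lang}>", f"<2{script}>"]
```

With tags factored this way, requesting <2ml> <2Latn> at inference time asks for Malayalam in Latin script regardless of the input's script, which is what enables zero-shot transliteration between any script pair the model has seen.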

5.3 The “Period Trick”

Even after having been distilled on a mix of clean and noisy data, some of these languages still had lower performance on inputs that lacked terminal punctuation. To study this, we compared performance on the evaluation sets with and without terminal punctuation. Table 17 illustrates the results of this experiment on the distilled models, though we noticed similar trends in the teacher models. The gain is small but consistent, and in the xx→en direction, 100% of the language pairs benefit. We noticed that sentences without terminal punctuation sometimes triggered common error modes, e.g. decoding into Danish (da) instead of Kalaallisut (kl), or misspelled Hindi (hi) instead of Dogri (doi). We hypothesize that the presence of terminal punctuation may provide a “domain” signal to the model and thereby elicit translations of different quality.

direction no TP TP Δ W/L
en→xx 39.32 39.56 +0.24 0.77
xx→en 48.54 49.23 +0.69 1.00
Table 17: Comparing ChrF on versions of the evaluation sets with and without terminal punctuation (TP) on 26-language distilled models. W/L is the win/loss ratio, i.e. the fraction of language pairs that saw an increase in ChrF when terminal punctuation was applied.
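The preprocessing behind the “period trick” amounts to a one-line check before decoding. Below is a minimal sketch; the set of terminal marks is an illustrative assumption, and a real system would need language-appropriate punctuation.

```python
# Sketch of the "period trick": append terminal punctuation to inputs that
# lack it before sending them to the model. The mark set is illustrative;
# U+0964 is the Devanagari danda, included as an assumed example.
TERMINAL_MARKS = (".", "!", "?", "\u0964")

def add_terminal_punctuation(text, mark="."):
    stripped = text.rstrip()
    if stripped and not stripped.endswith(TERMINAL_MARKS):
        return stripped + mark
    return stripped
```

Applying this to an unpunctuated query like kya kr rhe ho before decoding is exactly the manipulation evaluated in Table 17.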

5.4 Robustness to non-standard glyph usage

There are many different ways to write certain letters, especially those where a Unicode standard was introduced after a population was already active online, or before keyboards using this standard were widely available. Common cases include the many Unicode points for the “open o” (ɔ) and “open e” (ɛ) used in many African languages; the Palochka (resembling the letter I), used in many Caucasian languages; the apostrophe or ʻokina, used around the world but especially in many American languages; and many other examples that can be found in the Unicode Confusables list (Davis & Suignard, 2021). Table 18 gives an example of this phenomenon in the wild, showing the variation of Unicode points used for the Chechen and Chuvash languages in our web-crawled data. We refer the reader to Prasad et al. (2018) for many more examples of this phenomenon.

We conducted a simple experiment to determine how robust our model was to these different usages. Looking at translations into English, we decoded our evaluation sets (with the non-finetuned, 1000-language teacher model) using each of a variety of different ways of representing each letter, and compared the ChrF scores on the output. Results can be seen in Table 19. We found that in most cases this made very little difference in ChrF, even when using rare glyphs like the capital Greek Iota in place of the Palochka in Caucasian languages, indicating that our models were quite robust to these variations. However, we did notice a relatively large change in ChrF for West African languages when using nonstandard glyphs, including the “chatspeak” ASCII characters used when texting or writing very informally – for instance, writing “aho)f3” for “ahoɔfɛ”.

Table 18: Examples of the different Unicode points used to encode the Palochka character in the Chechen language (above) and letters with diacritics in the Chuvash language (below), along with their prevalence on web-crawled data. In both cases, the “correct” Unicode point (bolded) is much less common.
Unicode point name avg. ChrF
gn, luo, quc, yua → en
U+2019 Right quote 25.7
U+0060 Grave accent 25.5
U+0027 Apostrophe 25.5
U+02BB ʻOkina 25.5
U+00B4 Acute accent 25.4
None 24.5
ady, av, ce → en
U+04C0 Palochka 49.6
U+0031 ASCII 1 49.4
U+0399 Greek Iota 49.4
U+0406 Byelorussian/Ukrainian I 49.3
ak, bm, dyu, ee → en
U+025B, U+0254 Latin open e/o 33.6
U+03B5, U+1D10 Greek epsilon; small capital O 25.8
U+0033, U+0029 3 and ) (chatspeak) 25.3
cv → en
U+04D7, U+04D1, U+04AB Cyrillic codepoints 42.3
U+0115, U+0103, U+00E7 Latin codepoints 43.0
Table 19: ChrF scores where the source uses different versions of common Unicode points. The top line of each block represents the “correct” codepoints, whereas the lower lines are other ways of representing the same letters. In many cases there is very little difference in performance, but the African languages are affected by nonstandard ɔ and ɛ. When the apostrophe is removed entirely, performance also drops noticeably.
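The glyph variants in Table 19 suggest a simple normalization pass on input text. The sketch below covers only the example mappings from that table and is not the authors' filtering code; a production normalizer would draw on the full Unicode confusables list.

```python
# Sketch: normalize confusable glyphs to canonical codepoints, using the
# example mappings from Table 19. ASCII stand-ins like "1", "3", and ")"
# are deliberately excluded, since blanket-mapping digits and punctuation
# would corrupt ordinary text.
CONFUSABLE_TO_CANONICAL = {
    "\u0399": "\u04C0",  # Greek capital Iota       -> Palochka
    "\u0406": "\u04C0",  # Byelorussian/Ukrainian I -> Palochka
    "\u03B5": "\u025B",  # Greek small epsilon      -> Latin open e
    "\u1D10": "\u0254",  # Latin small capital O    -> Latin open o
}

def normalize_glyphs(text):
    return text.translate(str.maketrans(CONFUSABLE_TO_CANONICAL))
```

Handling the chatspeak substitutions safely would require context (e.g. only rewriting digits inside alphabetic words), which is why they are left out of this character-level pass.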

5.5 Non-Unicode fonts

Before easy access to keyboards using the correct Unicode points, or before the Unicode standard itself, it was often not clear how to represent alphabets or characters that were not in the ASCII range. We already saw one consequence of this in Section 5.4. A more difficult consequence is the existence of non-Unicode fonts. The way this often works is that one types ASCII characters, but downloads a special font that provides glyphs for those ASCII code points giving the desired visual rendering – for instance, one types “l72is4wo”, and it renders as the Ewe word lãdisowo. Other non-Unicode fonts even assign different values to code points beyond the ASCII block: one well-known example is the Zawgyi encoding for Myanmar (Burmese) (Liao, 2017). In the course of this work, we discovered that a wide variety of languages still use non-Unicode fonts. We ran into such fonts for Ewe (ee), Kashmiri (ks), Meiteilon (mni-Mtei), Mooré (mos), Navajo (nv), and Tamazight (ber-Latn), usually in the case of deliveries from professional translators. It is likely that there exists a large, hidden portion of data for some languages using these or other fonts. Our attempted reconstruction of some of these fonts can be seen in Appendix Section D.

6 Importance of Native Speakers

Since the advent of statistical and then neural machine translation, progress in translation quality has been driven by large datasets and improved modeling techniques, and it has sometimes been difficult for the expertise of native speakers and linguists to find a place in this environment. However, the participation of speakers of these languages, and of members of affected communities, remains vital. In the very-low-resource domain, there are more errors in models and data, and consequently more opportunities for native speakers to help improve quality.

We stress that where possible, it is important to try to build relationships with native speakers and members of these communities, rather than simply interacting with them as crowd-workers at a distance. For this work, we reached out to members of as many communities as we could, having conversations with over 100 members of these communities, many of whom were active in this project (see Acknowledgements).

Here is an incomplete list of ways in which speakers of these languages helped us extend and improve machine translation to their languages:

  1. Understanding Data: As in Kreutzer et al. (2022), we conducted an extensive review of the quality of our datasets (see Appendix Table 22). In addition to simply giving us an idea of which languages had higher or lower quality data, this also gave us valuable insights about other uses and aspects of the corpora that were useful beyond this project – for instance, which corpora had more colloquial (or religious) text, and which dialects were mixed with other languages.

  2. Understanding errors in reference translations: A variety of languages had extremely low automatic metrics (e.g. BLEU under 1.0), despite having large and clean corpora. For two such cases, native speakers helped us identify quality and fluency issues with our reference translations – and found that the model outputs often looked better than the references.

  3. Specialized Filters: Our corpora for a few languages were polluted with related high-resource languages that had somehow passed all the previous rounds of filtering. For two cases, native speakers helped design custom filters to remove the unwanted content, and for a third, they helped remove sensitive content.

  4. Transliteration and political sensitivity around script: For one language, we were initially unaware that we were using the wrong script. We were using a script that was associated with colonial times, and had since been replaced in the entire region, and was a matter of political sensitivity. Native speakers both pointed out this issue and helped us transliterate the text to the appropriate script. (See Appendix Section C)

  5. Understanding crowd-worker annotations: when we sent translations to crowd workers to rate their quality, several languages showed unusual rating patterns, or patterns that did not line up with our expectations. We were fortunate to have native speakers who helped us interpret the ratings and distinguish cases where raters were mis-calibrated from real quality issues. Freitag et al. (2021) explores some of these phenomena further.

  6. Clarifying utility for Community: Even if one can build a translation model for a language, should one? For some groups, community desires may differ from what many in the machine translation community might expect (Long, 2007; Coffey, 2021; Hiraishi, 2021). And if the translation model is of imperfect quality, is that still helpful for the community – or is it perhaps offensive? These are questions that can only be answered by members of the community. In our interviews, we generally found that the native speakers we spoke to were very enthusiastic about even lower-quality translation offerings. That said, no one person can represent an entire community, and there is much to learn about how to handle situations where opinions and desires differ within a community.

  7. Commenting on Dialects: Many “languages” have a wide variety of dialects, sometimes hardly mutually intelligible. Native speakers helped us understand when our models were producing a particular dialect, or mixing and matching them.

  8. The correct name to use for a language: Whereas a language like French has a fairly unambiguous name, many languages have multiple names, some of which may be offensive or exclusive. Frequently there exists a colonial name (e.g. “Oriya” or “Lushai”), which may be more widely known but disliked by members of the community, as well as a native name (e.g. “Odia” and “Mizo”) that is less well known but preferred. Similarly, there may be names which feel exclusive to some subpopulations – for instance using the name “Manipuri” for the language of the Meitei ethnic group. Although the most common way to refer to the Meitei language is indeed “Manipuri”, this usage excludes the many other ethnic groups in the state of Manipur, which have their own languages.

7 Conclusions and Open Problems

7.1 Main Findings

Starting with an initial seed dataset of monolingual sentences, we demonstrate that it is possible to build relatively clean web-mined monolingual text datasets for over 1500 languages. We highlight the importance of incorporating expressive semi-supervised LangID models, document-level consistency signals, and several word-based and custom filtering techniques to identify and filter web text in long-tail languages (Section 2.1). Using this approach we are able to build a multilingual unlabeled text dataset covering monolingual text in more than 1500 languages (Section 2.2).

Training on this dataset and a parallel corpus spanning over 100 languages, we build massively multilingual models capable of translating across more than 1000 languages (Section 3). We highlight the importance of model capacity when training highly multilingual translation models, and the positive effect on zero-resource quality of increasing the number of languages in the model. We also describe the significant quality improvements achievable by incorporating large-scale back-translation and self-training, and share our findings towards developing practical, inference-friendly models for long-tail languages.

We evaluate our models on evaluation sets collected for a subset of these languages, and highlight the importance of choosing the right automatic metric (ChrF) when evaluating long-tail languages (Section 4.2). Apart from automatic metrics on evaluation sets, we additionally release approximate reference-free quality scores from our 1000-language MT model to provide an indicator of web-trained multilingual model quality on hundreds of previously under-studied languages (Section 4.3). We perform human evaluations on a subset of the languages covered by the distilled models, and highlight that it is possible to build high-quality, practical MT models for long-tail languages using the approach described in this work (Section 4.4).

Through qualitative and quantitative analysis of the model outputs, we reveal a few characteristic error modes of our models; including confusing distributionally similar and infrequent tokens, and also producing verbose and inaccurate translations for short or single word queries arising from the extreme data sparsity of the zero-resource setting (Sections 4.5 and 4.7). We furthermore highlight several other observations from our studies, including non-English-centric direct translation, zero-shot transliteration, the effect of terminal punctuation on translation quality, and the robustness of the model to the non-standard glyph usages that are common for many languages (Section 5).

Finally, we highlight several indispensable contributions of native speakers who helped us evaluate, understand, filter and improve our datasets and models; and helped us understand the overall context of how these models should fit in with their communities (Section 6).

7.2 Related Work

There is a considerable wealth of literature on building highly multilingual text corpora, LangID models, and MT models. Our work differs largely in the scale, quality, and number of languages covered, together with the integration of many moving parts in the entire data-to-translation-model pipeline.

Access to multilingual datasets for NLP research has vastly improved over the past years. Since 2006, the Web as Corpus workshops have focused on the challenges around identifying relevant pages, extracting clean text, content de-duplication, and many other relevant topics (Barbaresi et al., 2020; Jakubíček et al., 2020). A variety of web-derived collections for hundreds of languages is available for anyone to download, such as the Corpora Collection at Leipzig University (Goldhahn et al., 2012), the Corpus of Global Language Use (Dunn, 2020), ParaCrawl (Esplà et al., 2019; Bañón et al., 2020), WikiMatrix (Schwenk et al., 2019), CCNET (Wenzek et al., 2020) and CCAligned (El-Kishky et al., 2020), OSCAR (Ortiz Suárez et al., 2019; Ortiz Suárez et al., 2020; Abadji et al., 2022), and several others; all of which have between 100 and 300 languages. The largest language coverage is probably An Crúbadán, which does not leverage LangID, and found (small amounts of) web data in about 2,000 languages (Scannell, 2007). These corpora have in turn enabled a variety of highly multilingual models, like mT5 (Xue et al., 2020), M2M-100 (Fan et al., 2020), and M4 (Arivazhagan et al., 2019; Siddhant et al., 2022).

Curating such datasets relies on the websites giving clues about the language of their contents (e.g. a language identifier in the URL) and on automatic language classification (LangID). It is commonly known that these automatically crawled and filtered datasets tend to have overall lower quality than hand-curated collections (Koehn et al., 2020), but their quality is rarely measured directly, and is rather judged through the improvements they bring to downstream applications (Schwenk et al., 2019). Therefore, many of these multilingual web corpora suffer from serious quality issues, especially for low-resource languages. A recent audit conducted by Kreutzer et al. (2022) on five public, multilingual datasets found pervasive issues. Many corpora claiming to be in one particular language in fact contained zero percent in-language content — and sometimes zero percent linguistic content entirely. Of the many issues contributing to this phenomenon, a fundamental one is the poor efficacy of LangID on low-resource languages.

Several works have investigated LangID at the level of multilinguality studied in this work. One relevant LangID implementation is Dunn (2020), achieving an F1 above 0.95 for 464 languages, and offering a thorough evaluation on different data sources and domains. The only LangID systems with higher coverage that we are aware of are those developed by Brown (2012, 2013, 2014), with the most recent version covering as many as 1,366 language varieties with accuracy above 99%. Finally, Caswell et al. (2020) train LangID models on 1,629 languages, and demonstrate that although these models appear to have very high scores on held-out evaluation sets, in practice, when applied to web text, they produce datasets of almost unusable noisiness. Various error pathologies are detailed, and a few novel filtering techniques are proposed to counteract them, including Tf-iif filtering and semi-supervised LangID (SSLID). The present work can be viewed as an extension of Caswell et al. (2020), and a fusion of it with translation technology.

Our MT modeling approach builds on several previous works on massively multilingual, zero-resource and self-supervised MT, differing primarily in the scale of multilinguality, model capacity, and the extreme data sparsity of our experimental setting. We refer to approaches that combine these different aspects of scale as M4: massively multilingual, massive machine translation.

Multilingual Neural Machine Translation models were first introduced during the last decade (Firat et al., 2016a; Johnson et al., 2017), but the initial versions of these models were limited to a few languages (10–12). Over the last few years, there has been an explosion of work focusing on massively multilingual models that can translate between on the order of 100 languages (Neubig & Hu, 2018; Aharoni et al., 2019b; Arivazhagan et al., 2019; Zhang et al., 2020; Tang et al., 2021; Fan et al., 2021). However, most work on massively multilingual MT has focused on the purely supervised setting. A few works have ventured beyond the limitations of large-scale multilingual corpora and trained MT models spanning over a thousand languages (Mueller et al., 2020), usually limited to narrow-domain (often religious) corpora.

Another stream of research on unsupervised MT developed modeling approaches to train MT models using monolingual datasets only (Lample et al., 2017; Artetxe et al., 2017; Song et al., 2019). With the advent of multilingual pre-training, with models like multilingual BERT (Devlin et al., 2019), XLM (Lample & Conneau, 2019), mBART (Liu et al., 2020) and others, the focus shifted towards fine-tuning pre-trained models with paired data in a sub-set of the pre-training languages to enable zero-resource translation in the remaining languages. These approaches are often complemented with large-scale back-translation (Sennrich et al., 2016; Edunov et al., 2018) to continue improving the model beyond its initial zero-shot performance.

Our work builds on Siddhant et al. (2020); Garcia et al. (2021b); Siddhant et al. (2022), which combine multilingual supervised MT, zero-resource MT (Firat et al., 2016b) and self-supervised learning within a single model. We extend the work in Siddhant et al. (2022) by scaling to larger models and a more multilingual dataset, utilizing self-training, and introducing a novel filtering technique based on round-trip translation consistency and LangID predictions.

Apart from efforts focused on building highly multilingual web-mined corpora and MT models, another line of NLP research has focused on building datasets and NLP technologies for specific languages, not necessarily from web content. Many of these are grassroots, bottom-up efforts from the affected communities, organized through research collectives like Masakhane (∀ et al., 2020), Turkic Interlingua (Mirzakhalov et al., 2021a, b), and GhanaNLP (Azunre et al., 2021a); and conferences and workshops like AfricaNLP, AmericasNLP (Mager et al., 2021) and ArabNLP. These efforts, in addition to providing datasets, frequently provide models and baselines, or even public interfaces, like the Khaya Translator Web App by GhanaNLP for West African languages, and the lesan.ai translation website for Ethiopian languages.

Participation is especially strong from the African continent, including corpora and models for pan-East-African languages (Babirye et al., 2022), languages from the Horn of Africa (Hadgu et al., 2022), Ethiopian languages (Abate et al., 2018; Gezmu et al., 2021), Ugandan languages (Akera et al., 2022), Emakhuwa (Ali et al., 2021), South African languages (Eiselen & Puttkammer, 2014), Setswana and Sepedi (Marivate et al., 2020), Yorùbá (Adelani et al., 2021b, a), Oshiwambo (Nekoto et al., 2022), Igbo (Ezeani et al., 2020), Zulu (Mabuya et al., 2021), Twi (Azunre et al., 2021b), Gbe (Hacheme, 2021), Bambara (Tapo et al., 2021), and Fon (Emezue & Dossou, 2020). Outside of Africa, corpora have been created for languages of the Americas, including four indigenous languages of Peru (Bustamante et al., 2020), the numerous largely South and Central American languages from the first AmericasNLP conference (Mager et al., 2021), and the Inuktitut language of Canada (Joanis et al., 2020). Datasets for lower-resourced languages of India have also sprung up, including the 13-language PMIndia (Haddow & Kirefu, 2020), and datasets focused on languages of the Northeast like Mizo (Thihlum et al., 2020), Khasi (Laskar et al., 2021) and Assamese (Laskar et al., 2020). Finally, a variety of such datasets and models are available for public use on HuggingFace or Zenodo. We believe language-specific efforts to be orthogonal and complementary to massively multilingual approaches for corpora building and modeling, as we elaborate further in Section 7.3.

7.3 Future Work

Barring a dramatic increase in the amount of web text available for long-tail languages, the types of errors produced by our zero-resource models (Section 4.5) are likely to persist. We highlight a few potential directions for future research that could help address the data sparsity that underlies the quality limitations of these models.

Utilizing dictionaries to ground distributionally similar words: One approach to address errors with distributionally similar words could involve grounding the model’s translations using bilingual dictionaries or similar resources. Dictionaries are relatively widely available and have already yielded promising results for individual language pairs (Xia et al., 2019; Duan et al., 2020; Karamanolakis et al., 2020; Reid et al., 2021). For languages where dictionaries do not exist or coverage is low, dictionaries are much cheaper to build than a dataset of parallel sentences. Efforts to develop high-coverage dictionaries, and modeling approaches to incorporate them in massively multilingual MT models, could nicely complement a corpus of monolingual web text and massively multilingual MT.

Complementing massively multilingual modeling with language-specific efforts: The quality of web-mined datasets is unlikely to match that of language-specific, hand-curated datasets; and building hand-curated datasets might be the only way forward to build text datasets for languages with limited presence on the web. However, we believe the two approaches to be orthogonal and complementary. Leveraging highly multilingual web-mined datasets and models significantly reduces the amount of data and research efforts needed to build practical NLP technologies for these languages (Wang et al., 2020; Emezue & Dossou, 2021; Adelani et al., 2022; Alabi et al., 2022; Nekoto et al., 2022), and research efforts could be more efficient by building resources and modeling approaches that complement the weaknesses of web-based massively multilingual models. Furthermore, community-based contributions could yield other useful language-specific tools, like specialized data filters as in Section 4.8, tools to normalize orthography or script (like those described in Appendix C) or pre- and post-processors to correct certain mistakes, improve diacritization, etc.

Leveraging multimodal datasets and models: A large proportion of the languages spoken in the world have no written form or standardized orthographic conventions, and for a large majority only limited amounts of text data are available on the web. Being able to train models that can learn from and represent speech and text jointly (Zheng et al., 2021; Bapna et al., 2022; Chen et al., 2022; Bai et al., 2022; Tang et al., 2022) is essential to alleviating data sparsity and building more robust language technologies for such languages.

In future work we plan to investigate the above-mentioned and related threads of research, hopefully making progress towards building and supporting language and speech technologies for more languages.


Acknowledgements

We would like to extend our deepest gratitude to the following native speakers and members of the affected communities, who helped us in a wide variety of ways:

Yasser Salah Eddine Bouchareb (Algerian Arabic); Mfoniso Ukwak (Anaang); Bhaskar Borthakur, Janu Nelakanti, Kishor Barman, Rasika Saikia, Suraj Bharech (Assamese); Annette David, William Merza (Assyrian Neo Aramaic); Ruben Hilare Quispe (Aymara); Devina Suyanto, Puspa Si Pinguin (Balinese); Allahserix Auguste Tapo, Bakary Diarrassouba, Maimouna Siby, Moussa Doumbouya (Bambara); Mohammad Jahangir (Baluchi); Subhajit Naskar (Bengali); Animesh Pathak, Ankur Bapna, Anup Mohan, Chaitanya Joshi, Chandan Dubey, Kapil Kumar, Manish Katiyar, Mayank Srivastava, Neeharika, Saumya Pathak, Tanya Sinha, Vikas Singh (Bhojpuri); Bowen Liang, Ellie Chio, Eric Dong, Frank Tang, Jeff Pitman, John Wong, Kenneth Chang, Manish Goregaokar, Mingfei Lau, Ryan Li, Yiwen Luo (Cantonese); Monang Setyawan (Caribbean Javanese); Craig Cornelius (Cherokee); Anton Prokopyev (Chuvash); Nash Rafeeq (Dhivehi); Rajat Dogra, Sid Dogra (Dogri); Mohamed Kamagate (Dyula); Chris Assigbe, Dan Ameme, Emeafa Doe, Irene Nyavor, Thierry Gnanih, Yvonne Dumor (Ewe); Abdoulaye Barry, Adama Diallo, Fauzia van der Leeuw, Ibrahima Barry (Fulfulde); Isabel Papadimitriou (Greek); Alex Rudnick (Guarani); Mohammad Khdeir (Gulf Arabic); Paul Remollata (Hiligaynon); Ankur Bapna (Hindi); Mfoniso Ukwak (Ibibio); Nze Lawson (Igbo); D.J. Abuy, Miami Cabansay (Ilocano); Archana Koul, Shashwat Razdan, Sujeet Akula (Kashmiri); Jatin Kulkarni, Salil Rajadhyaksha, Sanjeet Hegde Desai, Sharayu Shenoy, Shashank Shanbhag, Shashi Shenoy (Konkani); Ryan Michael, Terrence Taylor (Krio); Bokan Jaff, Medya Ghazizadeh, Roshna Omer Abdulrahman, Saman Vaisipour, Sarchia Khursheed (Kurdish (Sorani)); Suphian Tweel (Libyan Arabic); Espoir Murhabazi, Doudou Kisabaka (Lingala); Colleen Mallahan, John Quinn (Luganda); Cynthia Mboli (Luyia); Abhishek Kumar, Neeraj Mishra, Priyaranjan Jha, Saket Kumar, Snehal Bhilare (Maithili); Lisa Wang (Mandarin Chinese); Cibu Johny (Malayalam); Viresh Ratnakar (Marathi); Abhi Sanoujam, Gautam Thockchom, Pritam Pebam, Sam Chaomai, Shangkar Mayanglambam, Thangjam Hindustani Devi (Meiteilon (Manipuri)); Hala Ajil (Mesopotamian Arabic); Hamdanil Rasyid (Minangkabau); Elizabeth John, Remi Ralte, S Lallienkawl Gangte, Vaiphei Thatsing, Vanlalzami Vanlalzami (Mizo); Ahmed Kachkach, Hanaa ElAzizi (Moroccan Arabic); George Ouais (MSA); Ujjwal Rajbhandari (Newari); Ebuka Ufere, Gabriel Fynecontry, Onome Ofoman, Titi Akinsanmi (Nigerian Pidgin); Marwa Khost Jarkas (North Levantine Arabic); Abduselam Shaltu, Ace Patterson, Adel Kassem, Mo Ali, Yonas Hambissa (Oromo); Carlos Molina Vital, Helvia Taina, Marisol Necochea (Quechua); Rowena Marin (Romani); AbdelKarim Mardini (Saidi Arabic); Ishank Saxena, Manasa Harish, Manish Godara, Mayank Agrawal, Nitin Kashyap, Ranjani Padmanabhan, Ruchi Lohani, Shilpa Jindal, Shreevatsa Rajagopalan, Vaibhav Agarwal, Vinod Krishnan (Sanskrit); Nabil Shahid (Saraiki); Ayanda Mnyakeni (Sesotho, Sepedi); Landis Baker (Seychellois Creole); Taps Matangira (Shona); Ashraf Elsharif (Sudanese Arabic); Sakhile Dlamini (Swati); Hakim Sidahmed (Tamazight); Melvin Johnson (Tamil); Sneha Kudugunta (Telugu); Alexander Tekle, Bserat Ghebremicael, Nami Russom, Naud Ghebre (Tigrinya); Abigail Annkah, Diana Akron, Maame Ofori, Monica Opoku-Geren, Seth Duodu-baah, Yvonne Dumor (Twi); Ousmane Loum (Wolof); and Daniel Virtheim (Yiddish).

The authors also thank Raiomond Doctor and Cibu Johny for their invaluable help with the Meiteilon (Manipuri) transliteration project, and Colin Cherry, Markus Freitag and Johan Schalkwyk for their invaluable comments that helped improve the paper.


References
  • Abadji et al. (2022) Julien Abadji, Pedro Ortiz Suarez, Laurent Romary, and Benoît Sagot. Towards a Cleaner Document-Oriented Multilingual Crawled Corpus. arXiv e-prints, art. arXiv:2201.06642, January 2022.
  • Abate et al. (2018) Solomon Teferra Abate, Michael Melese, Martha Yifiru Tachbelie, Million Meshesha, Solomon Atinafu, Wondwossen Mulugeta, Yaregal Assabie, Hafte Abera, Binyam Ephrem, Tewodros Abebe, Wondimagegnhue Tsegaye, Amanuel Lemma, Tsegaye Andargie, and Seifedin Shifaw. Parallel corpora for bi-lingual English-Ethiopian languages statistical machine translation. In Proceedings of the 27th International Conference on Computational Linguistics, pp. 3102–3111, Santa Fe, New Mexico, USA, August 2018. Association for Computational Linguistics. URL
  • Achom & Basu (2015) Amika Achom and Anupam Basu. Design and evaluation of Unicode compliance Meitei/Meetei Mayek keyboard layout. In Proceedings of 2015 International Symposium on Advanced Computing and Communication (ISACC), pp. 90–97, Silchar, India, 2015. IEEE. doi: 10.1109/ISACC.2015.7377322.
  • Adelani et al. (2021a) David Adelani, Dana Ruiter, Jesujoba Alabi, Damilola Adebonojo, Adesina Ayeni, Mofe Adeyemi, Ayodele Awokoya, and Cristina España-Bonet. MENYO-20k: A multi-domain English-Yorùbá corpus for machine translation and domain adaptation. CoRR, arXiv:2103.08647v1, 03 2021a. doi: 10.48550/ARXIV.2103.08647.
  • Adelani et al. (2021b) David Adelani, Dana Ruiter, Jesujoba Alabi, Damilola Adebonojo, Adesina Ayeni, Mofe Adeyemi, Ayodele Esther Awokoya, and Cristina España-Bonet. The effect of domain and diacritics in Yoruba–English neural machine translation. In Proceedings of Machine Translation Summit XVIII: Research Track, pp. 61–75, Virtual, August 2021b. Association for Machine Translation in the Americas. URL
  • Adelani et al. (2022) David Ifeoluwa Adelani, Jesujoba Oluwadara Alabi, Angela Fan, Julia Kreutzer, Xiaoyu Shen, Machel Reid, Dana Ruiter, Dietrich Klakow, Peter Nabende, Ernie Chang, Tajuddeen Gwadabe, Freshia Sackey, Bonaventure F. P. Dossou, Chris Chinenye Emezue, Colin Leong, Michael Beukman, Shamsuddeen Hassan Muhammad, Guyo Dub Jarso, Oreen Yousuf, Andre Niyongabo Rubungo, Gilles Hacheme, Eric Peter Wairagala, Muhammad Umair Nasir, Benjamin Ayoade Ajibade, Tunde Oluwaseyi Ajayi, Yvonne Wambui Gitau, Jade Abbott, Mohamed Ahmed, Millicent Ochieng, Anuoluwapo Aremu, Perez Ogayo, Jonathan Mukiibi, Fatoumata Ouoba Kabore, Godson Koffi Kalipe, Derguene Mbaye, Allahsera Auguste Tapo, Victoire Memdjokam Koagne, Edwin Munkoh-Buabeng, Valencia Wagner, Idris Abdulmumin, Ayodele Awokoya, Happy Buzaaba, Blessing Sibanda, Andiswa Bukula, and Sam Manthalu. A few thousand translations go a long way! Leveraging pre-trained models for African news translation. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 2022. doi: 10.48550/ARXIV.2205.02022. URL
  • Aharoni et al. (2019a) Roee Aharoni, Melvin Johnson, and Orhan Firat. Massively multilingual neural machine translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 3874–3884, Minneapolis, Minnesota, June 2019a. Association for Computational Linguistics. doi: 10.18653/v1/N19-1388. URL
  • Aharoni et al. (2019b) Roee Aharoni, Melvin Johnson, and Orhan Firat. Massively multilingual neural machine translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 3874–3884, 2019b.
  • Aiken & Park (2010) Milam Aiken and Mina Park. The efficacy of round-trip translation for MT evaluation. Translation Journal, 14(1):1–10, 2010.
  • Akera et al. (2022) Benjamin Akera, Jonathan Mukiibi, Lydia Sanyu Naggayi, Claire Babirye, Isaac Owomugisha, Solomon Nsumba, Joyce Nakatumba-Nabende, Engineer Bainomugisha, Ernest Mwebaze, and John Quinn. Machine translation for African languages: Community creation of datasets and models in Uganda. In 3rd Workshop on African Natural Language Processing, 2022.
  • Alabi et al. (2022) Jesujoba O. Alabi, David Ifeoluwa Adelani, Marius Mosbach, and Dietrich Klakow. Multilingual language model adaptive fine-tuning: A study on African languages, 2022. URL
  • Ali et al. (2021) Felermino D. M. A. Ali, Andrew Caines, and Jaimito L. A. Malavi. Towards a parallel corpus of Portuguese and the Bantu language Emakhuwa of Mozambique. CoRR, abs/2104.05753, 2021. URL
  • Arivazhagan et al. (2019) Naveen Arivazhagan, Ankur Bapna, Orhan Firat, Dmitry Lepikhin, Melvin Johnson, Maxim Krikun, Mia Xu Chen, Yuan Cao, George Foster, Colin Cherry, et al. Massively multilingual neural machine translation in the wild: Findings and challenges. arXiv preprint arXiv:1907.05019, 2019.
  • Artetxe et al. (2017) Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. Unsupervised neural machine translation. arXiv preprint arXiv:1710.11041, 2017.
  • Artioli et al. (2011) Gilberto Artioli, V Nociti, and Ivana Angelini. Gambling with Etruscan dice: A tale of numbers and letters. Archaeometry, 53(5):1031–1043, 2011. doi: 10.1111/j.1475-4754.2011.00596.x.
  • Azunre et al. (2021a) Paul Azunre, Salomey Osei, Salomey Addo, Lawrence Asamoah Adu-Gyamfi, Stephen Moore, Bernard Adabankah, Bernard Opoku, Clara Asare-Nyarko, Samuel Nyarko, Cynthia Amoaba, Esther Dansoa Appiah, Felix Akwerh, Richard Nii Lante Lawson, Joel Budu, Emmanuel Debrah, Nana Boateng, Wisdom Ofori, Edwin Buabeng-Munkoh, Franklin Adjei, Isaac Kojo Essel Ampomah, Joseph Otoo, Reindorf Borkor, Standylove Birago Mensah, Lucien Mensah, Mark Amoako Marcel, Anokye Acheampong Amponsah, and James Ben Hayfron-Acquah. NLP for Ghanaian languages. CoRR, abs/2103.15475, 2021a. URL
  • Azunre et al. (2021b) Paul Azunre, Salomey Osei, Salomey Addo, Lawrence Asamoah Adu-Gyamfi, Stephen Moore, Bernard Adabankah, Bernard Opoku, Clara Asare-Nyarko, Samuel Nyarko, Cynthia Amoaba, Esther Dansoa Appiah, Felix Akwerh, Richard Nii Lante Lawson, Joel Budu, Emmanuel Debrah, Nana Boateng, Wisdom Ofori, Edwin Buabeng-Munkoh, Franklin Adjei, Isaac Kojo Essel Ampomah, Joseph Otoo, Reindorf Borkor, Standylove Birago Mensah, Lucien Mensah, Mark Amoako Marcel, Anokye Acheampong Amponsah, and James Ben Hayfron-Acquah. English-Twi parallel corpus for machine translation. CoRR, abs/2103.15625, 2021b. URL
  • Babirye et al. (2022) Claire Babirye, Joyce Nakatumba-Nabende, Andrew Katumba, Ronald Ogwang, Jeremy Tusubira Francis, Jonathan Mukiibi, Medadi Ssentanda, Lilian D Wanzare, and Davis David. Building text and speech datasets for low resourced languages: A case of languages in east africa. In 3rd Workshop on African Natural Language Processing, 2022. URL
  • Bahdanau et al. (2015) Dzmitry Bahdanau, Kyung Hyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, 2015.
  • Bai et al. (2022) He Bai, Renjie Zheng, Junkun Chen, Xintong Li, Mingbo Ma, and Liang Huang. A3t: Alignment-aware acoustic and text pretraining for speech synthesis and editing. ArXiv, abs/2203.09690, 2022.
  • Bakalov et al. (2016) Anton Bakalov, Alex Salcianu, Andy Golding, Chris Alberti, Daniel Andor, David Weiss, Emily Pitler, Greg Coppola, Jason Riesa, Kuzman Ganchev, Michael Ringgaard, Nan Hua, Ryan McDonald, Slav Petrov, Stefan Istrate, and Terry Koo. Compact Language Detector v3 (CLD3), October 2016. URL
  • Bañón et al. (2020) Marta Bañón, Pinzhen Chen, Barry Haddow, Kenneth Heafield, Hieu Hoang, Miquel Esplà-Gomis, Mikel L. Forcada, Amir Kamran, Faheem Kirefu, Philipp Koehn, Sergio Ortiz Rojas, Leopoldo Pla Sempere, Gema Ramírez-Sánchez, Elsa Sarrías, Marek Strelec, Brian Thompson, William Waites, Dion Wiggins, and Jaume Zaragoza. ParaCrawl: Web-scale acquisition of parallel corpora. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 4555–4567, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.417. URL
  • Bapna et al. (2022) Ankur Bapna, Colin Cherry, Yu Zhang, Ye Jia, Melvin Johnson, Yong Cheng, Simran Khanuja, Jason Riesa, and Alexis Conneau. mslam: Massively multilingual joint pre-training for speech and text. CoRR, abs/2202.01374, 2022. URL
  • Barbaresi et al. (2020) Adrien Barbaresi, Felix Bildhauer, Roland Schäfer, and Egon Stemle (eds.). Proceedings of the 12th Web as Corpus Workshop, Marseille, France, May 2020. European Language Resources Association. ISBN 979-10-95546-68-9. URL
  • Berlin & Kay (1969) Brent Berlin and Paul Kay. Basic color terms: their universality and evolution. Univ California Press, Berkeley, CA, 1969.
  • Bird (2020) Steven Bird. Decolonising speech and language technology. In Proceedings of the 28th International Conference on Computational Linguistics, pp. 3504–3519, Barcelona, Spain (Online), December 2020. International Committee on Computational Linguistics. doi: 10.18653/v1/2020.coling-main.313. URL
  • Brawer (2017) Sascha Brawer. Corpus Crawler., September 2017. URL
  • Brown (2014) Ralf Brown. Non-linear mapping for improved identification of 1300+ languages. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 627–632, Doha, Qatar, October 2014. Association for Computational Linguistics. doi: 10.3115/v1/D14-1069. URL
  • Brown (2012) Ralf D Brown. Finding and identifying text in 900+ languages. Digital Investigation, 9:S34–S43, 2012.
  • Brown (2013) Ralf D Brown. Selecting and weighting n-grams to identify 1100 languages. In International Conference on Text, Speech and Dialogue, pp. 475–483. Springer, 2013.
  • Bustamante et al. (2020) Gina Bustamante, Arturo Oncevay, and Roberto Zariquiey. No data to crawl? monolingual corpus creation from PDF files of truly low-resource languages in Peru. In Proceedings of the 12th Language Resources and Evaluation Conference, pp. 2914–2923, Marseille, France, May 2020. European Language Resources Association. ISBN 979-10-95546-34-4. URL
  • Caswell et al. (2020) Isaac Caswell, Theresa Breiner, Daan van Esch, and Ankur Bapna. Language ID in the wild: Unexpected challenges on the path to a thousand-language web text corpus. arXiv preprint arXiv:2010.14571, 2020. URL
  • Chelliah (1997) Shobhana Lakshmi Chelliah. A Grammar of Meithei, volume 17 of Mouton Grammar Library [MGL]. Mouton de Gruyter, Berlin, Germany, 1997. doi: 10.1515/9783110801118.
  • Chen et al. (2018) Mia Xu Chen, Orhan Firat, Ankur Bapna, Melvin Johnson, Wolfgang Macherey, George Foster, Llion Jones, Niki Parmar, Mike Schuster, Zhifeng Chen, et al. The best of both worlds: Combining recent advances in neural machine translation. arXiv preprint arXiv:1804.09849, 2018.
  • Chen et al. (2022) Zhehuai Chen, Yu Zhang, Andrew Rosenberg, Bhuvana Ramabhadran, Pedro J. Moreno, Ankur Bapna, and Heiga Zen. Maestro: Matched speech text representations through modality matching. ArXiv, abs/2204.03409, 2022.
  • Coffey (2021) Donavyn Coffey. Māori are trying to save their language from Big Tech., 2021. Accessed: 2022-04-24.
  • Davis & Suignard (2021) Mark Davis and Michel Suignard. Unicode security mechanisms. Technical Report (Technical Standard #39), Unicode Consortium, August 2021. URL Version 14.0, Revision 24.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019.
  • Duan et al. (2020) Xiangyu Duan, Baijun Ji, Hao Jia, Min Tan, Min Zhang, Boxing Chen, Weihua Luo, and Yue Zhang. Bilingual dictionary based neural machine translation without using parallel sentences. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 1570–1579, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.143. URL
  • Dunn (2020) Jonathan Dunn. Mapping languages: the Corpus of Global Language Use. Language Resources and Evaluation, April 2020. doi: 10.1007/s10579-020-09489-2. URL
  • Edunov et al. (2018) Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. Understanding back-translation at scale. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 489–500, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1045. URL
  • Eiselen & Puttkammer (2014) Roald Eiselen and Martin Puttkammer. Developing text resources for ten South African languages. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), pp. 3698–3703, Reykjavik, Iceland, May 2014. European Language Resources Association (ELRA). URL
  • El-Kishky et al. (2020) Ahmed El-Kishky, Vishrav Chaudhary, Francisco Guzmán, and Philipp Koehn. CCAligned: A massive collection of cross-lingual web-document pairs. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), pp. 5960–5969, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.480. URL
  • Emezue & Dossou (2021) Chris Chinenye Emezue and Bonaventure F. P. Dossou. MMTAfrica: Multilingual machine translation for African languages. In Proceedings of the Sixth Conference on Machine Translation, pp. 398–411, Online, November 2021. Association for Computational Linguistics. URL
  • Emezue & Dossou (2020) Chris Chinenye Emezue and Femi Pancrace Bonaventure Dossou. FFR v1.1: Fon-French neural machine translation. In Proceedings of the The Fourth Widening Natural Language Processing Workshop, pp. 83–87, Seattle, USA, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.winlp-1.21. URL
  • Esplà et al. (2019) Miquel Esplà, Mikel Forcada, Gema Ramírez-Sánchez, and Hieu Hoang. ParaCrawl: Web-scale parallel corpora for the languages of the EU. In Proceedings of Machine Translation Summit XVII: Translator, Project and User Tracks, pp. 118–119, Dublin, Ireland, August 2019. European Association for Machine Translation. URL
  • Esplà-Gomis (2009) Miquel Esplà-Gomis. Bitextor: a free/open-source software to harvest translation memories from multilingual websites. In Beyond Translation Memories: New Tools for Translators Workshop, Ottawa, Canada, August 26-30 2009. URL
  • Everson (2007) Michael Everson. Proposal for encoding the Meitei Mayek script in the BMP of the UCS. ISO/IEC JTC1/SC2/WG2 N3206R2, Unicode Consortium, August 2007. URL
  • Ezeani et al. (2020) Ignatius Ezeani, Paul Rayson, Ikechukwu E. Onyenwe, Chinedu Uchechukwu, and Mark Hepple. Igbo-English machine translation: An evaluation benchmark. CoRR, abs/2004.00648, 2020. URL
  • Fan et al. (2020) Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, and Armand Joulin. Beyond English-centric multilingual machine translation. arXiv preprint arXiv:2010.11125, 2020. URL
  • Fan et al. (2021) Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, et al. Beyond English-centric multilingual machine translation. Journal of Machine Learning Research, 22(107):1–48, 2021.
  • Firat et al. (2016a) Orhan Firat, Kyunghyun Cho, and Yoshua Bengio. Multi-way, multilingual neural machine translation with a shared attention mechanism. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 866–875, San Diego, California, June 2016a. Association for Computational Linguistics. doi: 10.18653/v1/N16-1101. URL
  • Firat et al. (2016b) Orhan Firat, Baskaran Sankaran, Yaser Al-onaizan, Fatos T. Yarman Vural, and Kyunghyun Cho. Zero-resource translation with multi-lingual neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 268–277, Austin, Texas, November 2016b. Association for Computational Linguistics. doi: 10.18653/v1/D16-1026. URL
  • Nekoto et al. (2020) Wilhelmina Nekoto, Vukosi Marivate, Tshinondiwa Matsila, Timi Fasubaa, Taiwo Fagbohungbe, Solomon Oluwole Akinola, Shamsuddeen Muhammad, Salomon Kabongo Kabenamualu, Salomey Osei, Freshia Sackey, Rubungo Andre Niyongabo, Ricky Macharm, Perez Ogayo, Orevaoghene Ahia, Musie Meressa Berhe, Mofetoluwa Adeyemi, Masabata Mokgesi-Selinga, Lawrence Okegbemi, Laura Martinus, Kolawole Tajudeen, Kevin Degila, Kelechi Ogueji, Kathleen Siminyu, Julia Kreutzer, Jason Webster, Jamiil Toure Ali, Jade Abbott, Iroro Orife, Ignatius Ezeani, Idris Abdulkadir Dangana, Herman Kamper, Hady Elsahar, Goodness Duru, Ghollah Kioko, Murhabazi Espoir, Elan van Biljon, Daniel Whitenack, Christopher Onyefuluchi, Chris Chinenye Emezue, Bonaventure F. P. Dossou, Blessing Sibanda, Blessing Bassey, Ayodele Olabiyi, Arshath Ramkilowan, Alp Öktem, Adewale Akinfaderin, and Abdallah Bashir. Participatory research for low-resourced machine translation: A case study in African languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online, 2020. URL
  • Freeman (1999) Philip Freeman. The Survival of the Etruscan Language. Etruscan Studies, Vol. 6, Article 2, 1999.
  • Freitag et al. (2019) Markus Freitag, Isaac Caswell, and Scott Roy. APE at scale and its implications on MT evaluation biases. In Proceedings of the Fourth Conference on Machine Translation (Volume 1: Research Papers), pp. 34–44, Florence, Italy, August 2019. Association for Computational Linguistics. doi: 10.18653/v1/W19-5204. URL
  • Freitag et al. (2020) Markus Freitag, David Grangier, and Isaac Caswell. BLEU might be guilty but references are not innocent. CoRR, abs/2004.06063, 2020. URL
  • Freitag et al. (2021) Markus Freitag, George Foster, David Grangier, Viresh Ratnakar, Qijun Tan, and Wolfgang Macherey. Experts, errors, and context: A large-scale study of human evaluation for machine translation. Transactions of the Association for Computational Linguistics, 9:1460–1474, 2021. doi: 10.1162/tacl_a_00437. URL
  • Garcia et al. (2021a) Xavier Garcia, Noah Constant, Ankur Parikh, and Orhan Firat. Towards continual learning for multilingual machine translation via vocabulary substitution. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1184–1192, Online, June 2021a. Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.93. URL
  • Garcia et al. (2021b) Xavier Garcia, Aditya Siddhant, Orhan Firat, and Ankur Parikh. Harnessing multilinguality in unsupervised machine translation for rare languages. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1126–1137, Online, June 2021b. Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.89. URL
  • Gezmu et al. (2021) Andargachew Mekonnen Gezmu, Andreas Nürnberger, and Tesfaye Bayu Bati. Extended parallel corpus for Amharic-English machine translation. CoRR, abs/2104.03543, 2021. URL
  • Goldhahn et al. (2012) Dirk Goldhahn, Thomas Eckart, and Uwe Quasthoff. Building large monolingual dictionaries at the Leipzig corpora collection: From 100 to 200 languages. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), pp. 759–765, Istanbul, Turkey, May 2012. European Language Resources Association (ELRA). URL
  • Gorman (2016) Kyle Gorman. Pynini: A python library for weighted finite-state grammar compilation. In Proceedings of the SIGFSM Workshop on Statistical NLP and Weighted Automata, pp. 75–80, Berlin, Germany, August 2016. Association for Computational Linguistics. doi: 10.18653/v1/W16-2409. URL
  • Gorman & Sproat (2021) Kyle Gorman and Richard Sproat. Finite-State Text Processing, volume 14 of Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers, 2021.
  • Goyal et al. (2021) Naman Goyal, Cynthia Gao, Vishrav Chaudhary, Peng-Jen Chen, Guillaume Wenzek, Da Ju, Sanjana Krishnan, Marc’Aurelio Ranzato, Francisco Guzmán, and Angela Fan. The FLORES-101 evaluation benchmark for low-resource and multilingual machine translation. CoRR, abs/2106.03193, 2021. URL
  • Hacheme (2021) Gilles Hacheme. English2gbe: A multilingual machine translation model for Fon/Ewe gbe. arXiv preprint arXiv:2112.11482, 2021.
  • Haddow & Kirefu (2020) Barry Haddow and Faheem Kirefu. PMIndia – a collection of parallel corpora of languages of India. arXiv preprint arXiv:2001.09907, 2020.
  • Hadgu et al. (2022) Asmelash Teka Hadgu, Gebrekirstos G. Gebremeskel, and Abel Aregawi. HornMT., 2022.
  • He et al. (2019) Junxian He, Jiatao Gu, Jiajun Shen, and Marc’Aurelio Ranzato. Revisiting self-training for neural sequence generation. arXiv preprint arXiv:1909.13788, 2019.
  • Hiraishi (2021) Ku’uwehi Hiraishi. Teaching computers ’ōlelo Hawai’i prompts debate on data sovereignty., 2021. Accessed: 2022-04-24.
  • Hochreiter & Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • Huang (1990) Xiuming Huang. A machine translation system for the target language inexpert. In COLING 1990 Volume 3: Papers presented to the 13th International Conference on Computational Linguistics, 1990.
  • Huang et al. (2019) Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. GPipe: Efficient training of giant neural networks using pipeline parallelism. Advances in Neural Information Processing Systems, 32, 2019.
  • ISO (2001) ISO. ISO 15919: Transliteration of Devanagari and related Indic scripts into Latin characters., 2001. International Organization for Standardization, Geneva, Switzerland.
  • ISO (2004) ISO. ISO 15924: Codes for the representation of names of scripts., 2004. International Organization for Standardization, Geneva, Switzerland.
  • Jakubíček et al. (2020) Miloš Jakubíček, Vojtěch Kovář, Pavel Rychlý, and Vit Suchomel. Current challenges in web corpus building. In Proceedings of the 12th Web as Corpus Workshop, pp. 1–4, Marseille, France, May 2020. European Language Resources Association. ISBN 979-10-95546-68-9. URL
  • Joanis et al. (2020) Eric Joanis, Rebecca Knowles, Roland Kuhn, Samuel Larkin, Patrick Littell, Chi-kiu Lo, Darlene Stewart, and Jeffrey Micher. The Nunavut Hansard Inuktitut–English parallel corpus 3.0 with preliminary machine translation results. In Proceedings of the 12th Language Resources and Evaluation Conference, pp. 2562–2572, Marseille, France, May 2020. European Language Resources Association. ISBN 979-10-95546-34-4. URL
  • Johnson et al. (2016) Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda B. Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. Google’s multilingual neural machine translation system: Enabling zero-shot translation. CoRR, abs/1611.04558, 2016. URL
  • Johnson et al. (2017) Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. Google’s multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351, 2017. doi: 10.1162/tacl_a_00065. URL
  • Johny et al. (2021) Cibu Johny, Lawrence Wolf-Sonkin, Alexander Gutkin, and Brian Roark. Finite-state script normalization and processing utilities: The Nisaba Brahmic library. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, pp. 14–23, Online, April 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.eacl-demos.3. URL
  • Karamanolakis et al. (2020) Giannis Karamanolakis, Daniel Hsu, and Luis Gravano. Cross-lingual text classification with minimal resources by transferring a sparse teacher. In Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 3604–3622, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.323. URL
  • Khanganba & Jha (2014) K. Kabi Khanganba and Girish Nath Jha. Challenges in Indian language transliteration: a case of Devanagari, Bangla and Manipuri. In Proceedings of the 2nd Workshop on Indian Language Data: Resources and Evaluation (WILDRE2), pp. 77–83, Reykjavik, Iceland, May 2014. European Language Resources Association (ELRA). URL
  • Kim & Rush (2016) Yoon Kim and Alexander M Rush. Sequence-level knowledge distillation. arXiv preprint arXiv:1606.07947, 2016.
  • Koehn et al. (2020) Philipp Koehn, Vishrav Chaudhary, Ahmed El-Kishky, Naman Goyal, Peng-Jen Chen, and Francisco Guzmán. Findings of the WMT 2020 shared task on parallel corpus filtering and alignment. In Proceedings of the Fifth Conference on Machine Translation, pp. 726–742, Online, November 2020. Association for Computational Linguistics. URL
  • Kreutzer et al. (2022) Julia Kreutzer, Isaac Caswell, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasanbayar Ulzii-Orshikh, Allahsera Tapo, Nishant Subramani, Artem Sokolov, Claytone Sikasote, Monang Setyawan, Supheakmungkol Sarin, Sokhar Samb, Benoît Sagot, Clara Rivera, Annette Rios, Isabel Papadimitriou, Salomey Osei, Pedro Ortiz Suarez, Iroro Orife, Kelechi Ogueji, Andre Niyongabo Rubungo, Toan Q. Nguyen, Mathias Müller, André Müller, Shamsuddeen Hassan Muhammad, Nanda Muhammad, Ayanda Mnyakeni, Jamshidbek Mirzakhalov, Tapiwanashe Matangira, Colin Leong, Nze Lawson, Sneha Kudugunta, Yacine Jernite, Mathias Jenny, Orhan Firat, Bonaventure F. P. Dossou, Sakhile Dlamini, Nisansa de Silva, Sakine Çabuk Ballı, Stella Biderman, Alessia Battisti, Ahmed Baruwa, Ankur Bapna, Pallavi Baljekar, Israel Abebe Azime, Ayodele Awokoya, Duygu Ataman, Orevaoghene Ahia, Oghenefego Ahia, Sweta Agrawal, and Mofetoluwa Adeyemi. Quality at a glance: An audit of web-crawled multilingual datasets. Transactions of the Association for Computational Linguistics, 10:50–72, 2022. doi: 10.1162/tacl_a_00447. URL
  • Kudo & Richardson (2018) Taku Kudo and John Richardson. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 66–71, Brussels, Belgium, November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-2012. URL
  • Lample & Conneau (2019) Guillaume Lample and Alexis Conneau. Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291, 2019.
  • Lample et al. (2017) Guillaume Lample, Alexis Conneau, Ludovic Denoyer, and Marc’Aurelio Ranzato. Unsupervised machine translation using monolingual corpora only. arXiv preprint arXiv:1711.00043, 2017.
  • Laskar et al. (2020) Sahinur Rahman Laskar, Abdullah Faiz Ur Rahman Khilji, Partha Pakray, and Sivaji Bandyopadhyay. EnAsCorp1.0: English-Assamese corpus. In Proceedings of the 3rd Workshop on Technologies for MT of Low Resource Languages, pp. 62–68, Suzhou, China, December 2020. Association for Computational Linguistics. URL
  • Laskar et al. (2021) Sahinur Rahman Laskar, Abdullah Faiz Ur Rahman Khilji Darsh Kaushik, Partha Pakray, and Sivaji Bandyopadhyay. EnKhCorp1.0: An English–Khasi corpus. In Proceedings of the 4th Workshop on Technologies for MT of Low Resource Languages (LoResMT2021), pp. 89–95, Virtual, August 2021. Association for Machine Translation in the Americas. URL
  • Liao (2017) Han-Teng Liao. Encoding for access: how Zawgyi success impedes full participation in digital Myanmar. ACM SIGCAS Computers and Society, 46(4):18–24, 2017.
  • Liu et al. (2020) Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics, 8:726–742, 2020.
  • Lo (2019) Chi-kiu Lo. YiSi - a unified semantic MT quality evaluation and estimation metric for languages with different levels of available resources. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pp. 507–513, Florence, Italy, August 2019. Association for Computational Linguistics. doi: 10.18653/v1/W19-5358. URL
  • Long (2007) Gideon Long. Chilean Mapuches in language row with Microsoft., January 2007. Reuters Technology News, Accessed: 2022-04-24.
  • Luong et al. (2015) Thang Luong, Hieu Pham, and Christopher D. Manning. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1412–1421, Lisbon, Portugal, September 2015. Association for Computational Linguistics. doi: 10.18653/v1/D15-1166. URL
  • Mabuya et al. (2021) Rooweither Mabuya, Jade Abbott, and Vukosi Marivate. Umsuka English–isiZulu parallel corpus, June 2021. URL Thank you to Facebook Research for funding the creation of this dataset.
  • Mager et al. (2021) Manuel Mager, Arturo Oncevay, Abteen Ebrahimi, John Ortega, Annette Rios, Angela Fan, Ximena Gutierrez-Vasques, Luis Chiruzzo, Gustavo Giménez-Lugo, Ricardo Ramos, Ivan Vladimir Meza Ruiz, Rolando Coto-Solano, Alexis Palmer, Elisabeth Mager-Hois, Vishrav Chaudhary, Graham Neubig, Ngoc Thang Vu, and Katharina Kann. Findings of the AmericasNLP 2021 shared task on open machine translation for indigenous languages of the Americas. In Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas, pp. 202–217, Online, June 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.americasnlp-1.23. URL
  • Marchisio et al. (2021) Kelly Marchisio, Markus Freitag, and David Grangier. On systematic style differences between unsupervised and supervised MT and an application for high-resource machine translation. CoRR, abs/2106.15818, 2021. URL
  • Marivate et al. (2020) Vukosi Marivate, Tshephisho Sefara, Vongani Chabalala, Keamogetswe Makhaya, Tumisho Mokgonyane, Rethabile Mokoena, and Abiodun Modupe. Investigating an approach for low resource language dataset creation, curation and classification: Setswana and sepedi. In Proceedings of the first workshop on Resources for African Indigenous Languages, pp. 15–20, Marseille, France, May 2020. European Language Resources Association (ELRA). ISBN 979-10-95546-60-3. URL
  • Mirzakhalov et al. (2021a) Jamshidbek Mirzakhalov, Anoop Babu, Duygu Ataman, Sherzod Kariev, Francis Tyers, Otabek Abduraufov, Mammad Hajili, Sardana Ivanova, Abror Khaytbaev, Antonio Laverghetta Jr, et al. A large-scale study of machine translation in the Turkic languages. arXiv preprint arXiv:2109.04593, 2021a.
  • Mirzakhalov et al. (2021b) Jamshidbek Mirzakhalov, Anoop Babu, Aigiz Kunafin, Ahsan Wahab, Behzod Moydinboyev, Sardana Ivanova, Mokhiyakhon Uzokova, Shaxnoza Pulatova, Duygu Ataman, Julia Kreutzer, et al. Evaluating multiway multilingual NMT in the Turkic languages. arXiv preprint arXiv:2109.06262, 2021b.
  • Mohri (2009) Mehryar Mohri. Weighted automata algorithms. In Manfred Droste, Werner Kuich, and Heiko Vogler (eds.), Handbook of Weighted Automata, Monographs in Theoretical Computer Science, pp. 213–254. Springer, 2009. doi:
  • Moirangthem & Nongmeikapam (2021) Gourashyam Moirangthem and Kishorjit Nongmeikapam. A back-transliteration based Manipuri Meetei Mayek keyboard IME. In Proceedings of 2021 IEEE 4th International Conference on Computing, Power and Communication Technologies (GUCON), pp. 1–6, Kuala Lumpur, Malaysia, 2021. IEEE. doi: 10.1109/GUCON50781.2021.9573837.
  • Moon et al. (2020) Jihyung Moon, Hyunchang Cho, and Eunjeong L Park. Revisiting round-trip translation for quality estimation. arXiv preprint arXiv:2004.13937, 2020.
  • Mueller et al. (2020) Aaron Mueller, Garrett Nicolai, Arya D. McCarthy, Dylan Lewis, Winston Wu, and David Yarowsky. An analysis of massively multilingual neural machine translation for low-resource languages. In Proceedings of the 12th Language Resources and Evaluation Conference, pp. 3710–3718, Marseille, France, May 2020. European Language Resources Association. ISBN 979-10-95546-34-4. URL
  • Nekoto et al. (2022) Wilhelmina Nekoto, Julia Kreutzer, Jenalea Rajab, Millicent Ochieng, and Jade Abbott. Participatory translations of oshiwambo: Towards sustainable culture preservation with language technology. In 3rd Workshop on African Natural Language Processing, 2022. URL
  • Neubig & Hu (2018) Graham Neubig and Junjie Hu. Rapid adaptation of neural machine translation to new languages. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 875–880, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1103. URL
  • Ortiz Suárez et al. (2019) Pedro Javier Ortiz Suárez, Benoît Sagot, and Laurent Romary. Asynchronous pipelines for processing huge corpora on medium to low resource infrastructures. In Piotr Bański, Adrien Barbaresi, Hanno Biber, Evelyn Breiteneder, Simon Clematide, Marc Kupietz, Harald Lüngen, and Caroline Iliadi (eds.), Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-7) 2019. Cardiff, 22nd July 2019, pp. 9 – 16, Mannheim, 2019. Leibniz-Institut für Deutsche Sprache. doi: 10.14618/ids-pub-9021. URL
  • Ortiz Suárez et al. (2020) Pedro Javier Ortiz Suárez, Laurent Romary, and Benoît Sagot. A monolingual approach to contextualized word embeddings for mid-resource languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 1703–1714, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.156. URL
  • Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318, Philadelphia, Pennsylvania, USA, July 2002. Association for Computational Linguistics. doi: 10.3115/1073083.1073135. URL
  • Pedregosa et al. (2011) F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
  • Popović (2015) Maja Popović. chrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pp. 392–395, Lisbon, Portugal, September 2015. Association for Computational Linguistics. doi: 10.18653/v1/W15-3049. URL
  • Prasad et al. (2018) Manasa Prasad, Theresa Breiner, and Daan van Esch. Mining training data for language modeling across the world’s languages. In Proceedings of the 6th International Workshop on Spoken Language Technologies for Under-resourced Languages (SLTU 2018), 2018. URL
  • Rei et al. (2020) Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon Lavie. COMET: A neural framework for MT evaluation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 2685–2702, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.213. URL
  • Reid et al. (2021) Machel Reid, Junjie Hu, Graham Neubig, and Yutaka Matsuo. AfroMT: Pretraining strategies and reproducible benchmarks for translation of 8 African languages. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 1306–1320, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.99. URL
  • Resnik & Smith (2003) Philip Resnik and Noah A. Smith. The web as a parallel corpus. Computational Linguistics, 29(3):349–380, 2003. doi: 10.1162/089120103322711578. URL
  • Saunders & Brakel (2002) Barbara Saunders and Jaap van Brakel. The trajectory of color. Perspectives on Science, pp. 302–355, 2002.
  • Scannell (2007) K. P. Scannell. The Crúbadán Project: Corpus building for under-resourced languages. In 3rd Web as Corpus Workshop, 2007, Louvain-la-Neuve, Belgium, 2007.
  • Schwenk et al. (2019) Holger Schwenk, Vishrav Chaudhary, Shuo Sun, Hongyu Gong, and Francisco Guzmán. Wikimatrix: Mining 135m parallel sentences in 1620 language pairs from Wikipedia. arXiv preprint arXiv:1907.05791, 2019. URL
  • Schwenk et al. (2021) Holger Schwenk, Guillaume Wenzek, Sergey Edunov, Edouard Grave, Armand Joulin, and Angela Fan. CCMatrix: Mining billions of high-quality parallel sentences on the web. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 6490–6500, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.507. URL
  • Sellam et al. (2020) Thibault Sellam, Dipanjan Das, and Ankur Parikh. BLEURT: Learning robust metrics for text generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 7881–7892, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.704. URL
  • Sennrich et al. (2016) Rico Sennrich, Barry Haddow, and Alexandra Birch. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 86–96, Berlin, Germany, August 2016. Association for Computational Linguistics. doi: 10.18653/v1/P16-1009. URL
  • Shen et al. (2019) Jonathan Shen, Patrick Nguyen, Yonghui Wu, Zhifeng Chen, Mia X Chen, Ye Jia, Anjuli Kannan, Tara Sainath, Yuan Cao, Chung-Cheng Chiu, et al. Lingvo: a modular and scalable framework for sequence-to-sequence modeling. arXiv preprint arXiv:1902.08295, 2019.
  • Siddhant et al. (2020) Aditya Siddhant, Ankur Bapna, Yuan Cao, Orhan Firat, Mia Chen, Sneha Kudugunta, Naveen Arivazhagan, and Yonghui Wu. Leveraging monolingual data with self-supervision for multilingual neural machine translation. arXiv preprint arXiv:2005.04816, 2020.
  • Siddhant et al. (2022) Aditya Siddhant, Ankur Bapna, Orhan Firat, Yuan Cao, Mia Xu Chen, Isaac Caswell, and Xavier Garcia. Towards the next 1000 languages in multilingual machine translation: Exploring the synergy between supervised and self-supervised learning. arXiv preprint arXiv:2201.03110, 2022.
  • Singh (2011) Harimohon Thounaojam Singh. The evolution and recent development of the Meitei Mayek script. In North East Indian Linguistics, volume 3, pp. 24–32. Cambridge University Press India, New Delhi, India, 2011.
  • Singh et al. (2007) Leihaorambam Sarbajit Singh, Kabita Thaoroijam, and Pradip Kumar Das. Written Manipuri (Meiteron) – phoneme to grapheme correspondence. Language in India, 7(6), June 2007. URL
  • Song et al. (2019) Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. MASS: masked sequence to sequence pre-training for language generation. CoRR, abs/1905.02450, 2019. URL
  • Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. Advances in neural information processing systems, 27, 2014.
  • Tang et al. (2022) Yun Tang, Hongyu Gong, Ning Dong, Changhan Wang, Wei-Ning Hsu, Jiatao Gu, Alexei Baevski, Xian Li, Abdelrahman Mohamed, Michael Auli, and Juan Miguel Pino. Unified speech-text pre-training for speech translation and recognition. ArXiv, abs/2204.05409, 2022.
  • Tang et al. (2021) Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, and Angela Fan. Multilingual translation from denoising pre-training. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp. 3450–3466, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.findings-acl.304. URL
  • Tapo et al. (2021) Allahsera Auguste Tapo, Michael Leventhal, Sarah Luger, Christopher M. Homan, and Marcos Zampieri. Domain-specific MT for low-resource languages: The case of Bambara–French. CoRR, abs/2104.00041, 2021. URL
  • Thihlum et al. (2020) Zaitinkhuma Thihlum, Vanlalmuansangi Khenglawt, and Somen Debnath. Machine translation of English language to Mizo language. In 2020 IEEE International Conference on Cloud Computing in Emerging Markets (CCEM), pp. 92–97, 2020. doi: 10.1109/CCEM50674.2020.00028.
  • Uszkoreit et al. (2010) Jakob Uszkoreit, Jay Ponte, Ashok Popat, and Moshe Dubiner. Large scale parallel document mining for machine translation. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pp. 1101–1109, Beijing, China, August 2010. Coling 2010 Organizing Committee. URL
  • van Esch et al. (2019) Daan van Esch, Elnaz Sarbar, Tamar Lucassen, Jeremy O’Brien, Theresa Breiner, Manasa Prasad, Evan Elizabeth Crew, Chieu Nguyen, and Francoise Beaufays. Writing across the world’s languages: Deep internationalization for gboard, the google keyboard. Technical report, Google, 2019. URL
  • van Esch et al. (2022) Daan van Esch, Tamar Lucassen, Sebastian Ruder, Isaac Caswell, and Clara E. Rivera. Writing system and speaker metadata for 2,800+ language varieties. In Proceedings of LREC, 2022.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems (NIPS), 30, 2017.
  • Wang et al. (2020) Zihan Wang, Karthikeyan K, Stephen Mayhew, and Dan Roth. Extending multilingual BERT to low-resource languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 2649–2656, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.240. URL
  • Wenzek et al. (2020) Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, and Edouard Grave. CCNet: Extracting high quality monolingual datasets from web crawl data. In Proceedings of the 12th Language Resources and Evaluation Conference, pp. 4003–4012, Marseille, France, May 2020. European Language Resources Association. ISBN 979-10-95546-34-4. URL
  • Xia et al. (2019) Mengzhou Xia, Xiang Kong, Antonios Anastasopoulos, and Graham Neubig. Generalized data augmentation for low-resource translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 5786–5796, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1579. URL
  • Xue et al. (2020) Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. mT5: A massively multilingual pre-trained text-to-text transformer. arXiv preprint arXiv:2010.11934, 2020.
  • Zhang et al. (2020) Biao Zhang, Philip Williams, Ivan Titov, and Rico Sennrich. Improving massively multilingual neural machine translation and zero-shot translation. arXiv preprint arXiv:2004.11867, 2020.
  • Zheng et al. (2021) Renjie Zheng, Junkun Chen, Mingbo Ma, and Liang Huang. Fused acoustic and text encoding for multimodal bilingual pretraining and speech translation. In ICML, 2021.
  • Zipf (1935) George Zipf. The Psychology of Language. Houghton-Mifflin, 1935.

Appendix A Complete Human Evaluation and ChrF Results for Distilled Models

Table 20 gives the performance of the distilled models, as measured both by human raters on a seven-point scale (from nonsense to perfect) and by ChrF. Note that human evaluation numbers are difficult to compare across languages.

LP avg 0 1 2 3 4 5 6 entropy ChrF
enti 5.4 0 0 1 4 11 18 66 0.45 22.0
enay 5.1 0 0 2 5 17 36 40 0.56 33.3
enbm 5.0 0 1 1 3 22 40 34 0.55 36.6
ents 4.9 1 0 2 6 24 32 37 0.59 46.5
enlus 4.9 3 0 5 0 33 13 47 0.55 41.1
enilo 4.7 0 0 3 2 25 64 7 0.44 54.8
enff 4.6 1 1 4 7 24 45 18 0.61 37.2
enln 4.6 0 0 3 10 28 41 18 0.61 34.9
endoi 4.5 0 0 2 3 38 50 6 0.49 40.9
enkri 4.4 0 2 9 14 26 23 27 0.71 36.6
enom 4.4 0 0 3 10 42 37 8 0.56 40.3
endv 4.1 2 2 8 12 33 29 14 0.71 45.6
ensa 4.1 4 1 3 7 52 30 4 0.57 33.3
enas 4.0 0 0 4 12 60 23 0 0.47 40.8
engom 4.0 4 1 14 11 38 12 21 0.71 41.5
enlg 3.9 0 8 11 18 22 25 16 0.76 39.7
enak 3.8 0 0 19 0 69 0 12 0.39 34.6
enckb 3.6 1 2 14 32 30 13 9 0.70 44.3
ennso 3.4 7 13 13 16 19 19 14 0.83 47.6
enkl 3.2 13 3 19 13 29 10 13 0.79 40.7
enmni-Mtei 3.2 0 0 24 32 43 1 0 0.51 47.7
engn 3.1 0 8 30 18 35 8 2 0.66 32.2
enqu 3.1 4 4 20 30 36 6 0 0.65 37.2
enee 3.0 7 14 18 21 19 12 9 0.82 39.9
enmai 3.0 8 1 30 20 30 8 4 0.71 40.2
enpcm 2.7 20 9 22 4 27 10 9 0.78 57.5
enbho 2.3 0 6 70 14 8 1 0 0.44 42.4
enyua 2.1 0 30 42 17 8 2 1 0.59 32.8
tien 4.6 0 0 3 9 31 37 21 0.62 45.8
ayen 4.8 2 1 5 6 12 48 27 0.61 38.8
lnen 4.8 0 1 1 6 25 46 22 0.58 32.3
lgen 5.6 0 1 2 3 4 8 82 0.34 41.1
mni-Mteien 3.5 0 1 20 14 65 1 0 0.44 62.9
bmen 5.1 0 0 0 4 12 52 32 0.50 38.7
tsen 4.3 1 1 8 14 27 33 17 0.68 47.5
lusen 4.6 4 4 5 4 13 40 31 0.66 41.7
iloen 4.6 0 0 3 3 30 59 6 0.47 62.4
ffen 4.6 7 1 10 4 13 11 54 0.63 45.8
doien 4.6 1 0 1 4 41 33 20 0.56 65.8
krien 5.0 0 1 6 8 13 18 53 0.59 64.8
omen 4.6 0 0 1 7 37 50 7 0.50 41.9
dven 3.4 0 2 19 33 33 14 0 0.61 48.7
saen 4.4 0 1 1 4 55 33 6 0.48 48.9
asen 5.2 0 2 8 2 10 15 64 0.51 60.4
gomen 5.5 0 0 2 1 5 30 62 0.43 57.2
aken 4.8 0 1 4 9 19 31 37 0.64 39.4
ckben 2.9 4 12 27 14 36 7 0 0.69 56.4
nsoen 3.4 2 12 14 18 34 9 12 0.76 52.8
klen 4.6 2 3 11 5 25 9 46 0.65 39.6
gnen 3.5 0 0 12 31 49 7 1 0.54 43.6
quen 2.9 0 4 28 47 18 5 0 0.57 35.6
eeen 4.5 3 3 8 12 13 22 39 0.71 37.3
maien 3.2 0 2 38 15 25 19 0 0.62 65.4
pcmen 4.6 1 0 3 7 31 39 20 0.61 65.4
bhoen 3.4 0 0 15 44 31 10 0 0.56 61.7
yuaen 3.4 0 1 20 36 29 13 2 0.63 42.1
Table 20: Performance of the distilled student model in ChrF and human evaluation, on a scale from 0 (nonsense/wrong language) to 6 (perfect). The value in the “avg” column reports the weighted average score across all sentences. The value under each of the numbers from 0 to 6 is the percent of sentences that were given that rating. The entropy is included to flag suspicious rating patterns: a low entropy may mean that most raters are assigning the same score to all sentences.
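The "avg" and "entropy" columns in Table 20 can be computed directly from the per-rating percentages. A minimal sketch follows; the normalization base for the entropy is our assumption (the table does not specify one), chosen so that a uniform distribution over the seven ratings scores 1.0:

```python
import math

def rating_stats(percent_by_rating):
    """Weighted average rating and normalized entropy of a rating histogram.

    percent_by_rating: list of percentages for ratings 0..6 (sums to ~100).
    """
    probs = [p / 100.0 for p in percent_by_rating]
    # Weighted average: each rating weighted by the fraction of sentences
    # that received it.
    avg = sum(rating * p for rating, p in enumerate(probs))
    # Shannon entropy, normalized so a uniform distribution scores 1.0.
    # Low entropy flags raters who assign the same score to everything.
    ent = -sum(p * math.log(p) for p in probs if p > 0)
    ent /= math.log(len(probs))
    return avg, ent
```

For example, a rater who gives every sentence a 6 produces an entropy of exactly 0, the suspicious pattern the caption describes.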

Appendix B Complete Audit Results

We rated samples of 100 sentences for 72 of the languages present in the dataset we crawled. The error metrics used are described in Table 21. The breakdown can be seen in Table 22.

Code Weight Description
CC 100 Natural in-language sentence. It’s ok if it has a few small issues, like spelling errors or a few words from another language, or if it’s a sentence fragment of reasonable length (about 5 words or more)
CB 50 In-language, but low-quality. This could be ungrammatical text, boilerplate, or very short fragments.
CA 30 Correct but ambiguous whether it’s in the correct language. This code is only applicable for dialects that are closely related to a major language. For instance, many short sentences in Gulf Arabic may also be valid in MSA, and many written Cantonese sentences might also be valid in Mandarin.
WD 20 This sentence is in a related but different dialect to the language it’s supposed to be in. This code is only applicable for dialects that are closely related to a similar dialect. For instance, it’s supposed to be in Sa’idi Arabic but it’s in Egyptian Arabic.
WL 0 Wrong Language, but still linguistic content
NL 0 Not a language – any sort of non-linguistic content. Proper nouns like “Ibuprofin”, “Calvin Klein”, or “Washington DC” also count as NL.
Table 21: Descriptions of the error codes used to rate samples of our datasets, along with the weight each code is given in the combined quality score.
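As a sketch of how the combined quality score works, the weights above turn a per-label percentage breakdown into a single 0–100 number (the function name is ours, for illustration):

```python
# Weights from Table 21: full credit for clean in-language text, partial
# credit for low-quality, ambiguous, or wrong-dialect text, and no credit
# for wrong-language or non-linguistic content.
WEIGHTS = {"cc": 100, "cb": 50, "ca": 30, "wd": 20, "wl": 0, "nl": 0}

def quality_score(label_percentages):
    """Combine per-label percentages (0-100) into one quality score."""
    return sum(WEIGHTS[label] * pct / 100.0
               for label, pct in label_percentages.items())

# e.g. Twi (ak) in Table 22: 94% CC + 6% CB -> 94 + 3 = 97
score = quality_score({"cc": 94, "cb": 6})
```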
Language Name (BCP-47) score cc cb ca wl nl wd
Northeastern Dinka (dip) 100 100 0 0 0 0 0
Zarma (dje) 100 100 0 0 0 0 0
Dombe (dov) 100 100 0 0 0 0 0
Dyula (dyu) 100 100 0 0 0 0 0
Wayuu (guc) 100 100 0 0 0 0 0
Kalenjin (kln) 100 100 0 0 0 0 0
Wolaytta (wal) 100 100 0 0 0 0 0
Assyrian Neo-Aramaic (aii) 100 100 0 0 0 0 0
Igbo (ig) 99 98 2 0 0 0 0
Balinese (ban) 98 96 4 0 0 0 0
Latinized Hindi (hi-Latn) 98 98 0 0 2 0 0
Twi (ak) 97 94 6 0 0 0 0
Luyia (luy) 97 97 0 0 0 0 3
Ewe (ee) 96 91 9 0 0 0 0
Seselwa Creole French (crs) 94 92 4 1 3 0 0
Bhojpuri (bho) 94 94 0 0 6 0 0
Ilocano (ilo) 94 88 12 0 0 0 0
Caribbean Javanese (jvn) 94 87 13 0 0 0 0
Meiteilon (Manipuri) (mni) 93 87 12 0 0 1 0
Luba-Katanga (lu) 92 83 17 0 0 0 0
Krio (kri) 92 89 5 0 6 0 0
Latinized Tamil (ta-Latn) 91 81 19 0 0 0 0
Lingala (ln) 90 80 19 0 1 0 0
Maharashtra Konkani (knn) 89 85 8 1 6 0 0
Northern Sami (se) 89 86 5 0 9 0 0
Cherokee (chr) 88 78 20 0 1 1 0
Latinized Malayalam (ml-Latn) 87 76 21 1 2 0 0
Latinized Bengali (bn-Latn) 86 72 28 0 0 0 0
Maithili (mai) 86 83 5 1 11 0 0
Hiligaynon (hil) 84 80 8 0 9 3 0
Tok Pisin (tpi) 84 83 2 0 15 0 0
Latinized Chinese (zh-Latn) 84 80 7 0 13 0 0
Gulf Arabic (afb) 83 74 17 1 8 0 0
Minangkabau (min) 81 64 34 0 2 0 0
Chuvash (cv) 80 78 4 0 2 2 14
Tamazight (ber-Latn) 79 69 20 0 10 1 0
Latinized Telugu (te-Latn) 79 74 10 0 0 16 0
Libyan Arabic (ayl) 78 68 14 10 7 1 0
Newari (new) 78 76 3 0 21 0 0
Pangasinan (pag) 75 67 17 0 0 17 0
Waray (war) 75 67 17 0 17 0 0
Goan Konkani (gom) 73 67 5 13 13 2 0
Sanskrit (sa) 73 66 14 0 15 5 0
North Levantine Arabic (apc) 73 62 21 0 16 1 0
Sudanese Arabic (apd-SD) 72 60 4 34 2 0 0
Ibibio (ibb) 72 63 17 0 20 0 0
Shona (sn) 72 43 57 0 0 0 0
Sena (seh) 71 71 0 0 29 0 0
Latinized Marathi (mr-Latn) 67 36 62 0 1 1 0
Ancient Greek (grc) 67 67 0 0 29 4 0
Makhuwa-Meetto (mgh) 67 67 0 0 33 0 0
Algerian Arabic (arq) 64 47 6 47 0 0 0
Latinized Goan Konkani (gom-Latn) 64 63 0 2 28 7 0
Ga (gaa) 58 58 0 0 0 0 41
Wolof (wo) 57 16 81 0 1 2 0
Nigerian Pidgin (pcm) 51 43 14 3 40 0 0
Saint Lucian Creole French (acf) 50 50 0 0 50 0 0
Kashmiri (ks-Deva) 48 47 1 2 17 33 0
Latinized Arabic (ar-Latn) 47 33 15 21 6 24 0
Kuanyama (kj) 43 43 0 0 57 0 0
Adangme (ada) 42 42 0 0 0 0 58
Anaang (anw) 36 26 19 0 54 1 0
Mesopotamian Arabic (acm) 34 6 0 93 1 0 0
North Ndebele (nd) 25 25 0 0 50 25 0
Moroccan Arabic (ar-MA) 16 1 24 9 59 7 0
Baluchi (bal) 10 10 0 0 90 0 0
Saidi Arabic (aec) 0 0 0 0 100 0 0
Eastern Baluchi (bgp) 0 0 0 0 100 0 0
Eastern Baluchi (bgp-Arab) 0 0 0 0 100 0 0
mean 74 68 10 3 15 2 2
median 80 74 5 0 3 0 0
Table 22: Results of an audit of the datasets we collected, conducted on samples of 100 sentences each by a mix of native and non-native speakers. The values are the percent of the audited sample that received each label. The “score” metric combines these numbers into an approximate notion of the percent of the data that is usable, and is described in Section 2.2.2.

B.1 Strict RTT LangID ChrF

In the main body of this paper (Section 4.3), we reported the loose version of RttLangIDChrF, which correlates better with ChrF and data size. One shortcoming of this score is that it does not penalize the model for producing content in the wrong language, whereas the strict version does.

Figure 3 shows strict RttLangIDChrF as a function of log data size. Although the correlation is worse than with the loose version, there is a clear trend, and there appears to be some sort of upper bound in quality as a function of data size.

Figure 3: Plot of RttLangIDChrF scores (strict) for languages as a function of log monolingual data size. This score correlates worse with metrics like ChrF than the loose version of RttLangIDChrF, but shows an interesting trend when plotted against data size. Compared to the loose version, languages far below the trend line on the right-hand side are often close to high-resource languages (e.g. Betawi/bew, Sabah Malay/msi, Godwari/gdx, Darija/ar-MA), indicating that their apparently large monolingual datasets are actually a result of over-triggering on a higher-resource language like Indonesian, and that many intermediate translations are in the wrong language.
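The strict/loose distinction can be sketched as follows. The chrF implementation below is a simplified stand-in for the standard metric (Popović, 2015), and the aggregation is our reading of the description above: the loose score averages round-trip chrF over only those sentences whose intermediate translation passes LangID, while the strict score counts wrong-language intermediates as zero.

```python
from collections import Counter

def chrf(hyp, ref, max_n=6, beta=2.0):
    """Simplified character n-gram F-score (whitespace ignored)."""
    hyp, ref = hyp.replace(" ", ""), ref.replace(" ", "")
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        h = Counter(hyp[i:i + n] for i in range(len(hyp) - n + 1))
        r = Counter(ref[i:i + n] for i in range(len(ref) - n + 1))
        if not h or not r:
            continue
        overlap = sum((h & r).values())
        precisions.append(overlap / sum(h.values()))
        recalls.append(overlap / sum(r.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

def rtt_langid_chrf(samples, strict=False):
    """samples: (source, round_trip, intermediate_passed_langid) tuples."""
    scores = []
    for src, rtt, lang_ok in samples:
        if lang_ok:
            scores.append(chrf(rtt, src))
        elif strict:
            scores.append(0.0)  # strict: penalize wrong-language intermediates
        # loose: wrong-language intermediates are simply excluded
    return sum(scores) / len(scores) if scores else 0.0
```

Under this reading, a model that over-triggers into a higher-resource intermediate language loses score only under the strict variant, which is exactly the pattern visible in Figure 3.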

Appendix C Transliteration for Meiteilon

The large majority of the text we found online for Meiteilon (Manipuri) was in the Bengali script. Finding almost no Meiteilon in its native script, Meetei Mayek (mni-Mtei), we initially and erroneously assumed that this script was archaic or used only in rare contexts. However, conversations with Meiteilon speakers quickly disabused us of this notion: not only is the script in use, it is on its way to becoming the primary script of the state. Most likely, our mni-Mtei corpus was so small because the available online text was largely in non-Unicode fonts, and therefore inaccessible to our LangID model (which had not been trained to detect such non-Unicode data). We therefore needed to convert our Bengali-script corpus to Meetei Mayek. This is not a straightforward task, however, because the mapping from Bengali script to Meetei Mayek is many-to-one.

Meiteilon is a tonal Tibeto-Burman language that is one of the scheduled languages of India and a lingua franca of the Manipur state (Chelliah, 1997; Singh, 2011). The Meetei Mayek script is an indigenous script that was used to record Meiteilon until the 18th century, when it was largely superseded by the Bengali script. Despite gradual recent efforts by the Indian government to revitalize Meetei Mayek, literacy in the script remains quite low (Singh et al., 2007), and the available online Meiteilon data is mostly in the Bengali script (Achom & Basu, 2015; Moirangthem & Nongmeikapam, 2021).

Meetei Mayek belongs to the Tibetan family of Brahmic scripts and is well suited to Meiteilon phonology, providing a near-bijective mapping between the graphemes and phonemes of the language (Singh et al., 2007). Unlike the major Brahmic scripts, it uses a special class of explicit silent final consonants (lonsum iyek) in closed-syllable codas; these consonants are represented as full letters rather than combining signs. In modern Meetei Mayek orthography, the falling tone is often unmarked or sometimes marked with full-stop punctuation, whereas the traditional literature used a special lum iyek sign (Everson, 2007).

Unlike Meetei Mayek, the orthographic conventions for Meiteilon in Bengali script are ambiguous due to its larger letter inventory, where more than one Bengali letter or clusters of letters may map to a single Meiteilon sound (Singh et al., 2007; Khanganba & Jha, 2014). This implies that any Bengali to Meeitei Mayek transliteration mechanism needs to implement a many-to-one relation.

Figure 4: Bengali (Beng) to Meetei Mayek (Mtei) transliteration components. The script codes are denoted according to ISO 15924 (ISO, 2004).

Our transliterator uses the open-source Nisaba library of finite-state script normalization and processing utilities (Johny et al., 2021). The script operations in Nisaba are efficiently and succinctly represented as weighted finite-state transducers (WFSTs) using Pynini finite-state grammars (Gorman, 2016; Gorman & Sproat, 2021). The main components of the transliteration workflow are shown in Figure 4. The four component WFSTs are composed into the final transliteration WFST using the FST composition operation (Mohri, 2009). The first component transducer implements visual normalization of the Bengali-script input, consisting of visually invariant normalization transformations including NFC (Johny et al., 2021). This is followed by the Meiteilon-specific Bengali-to-Latin many-to-one mapping, which produces Latin-script output in ISO 15919 format (ISO, 2001), augmented with placeholder markers required by the next processing stage. (This Nisaba operation is distinct from the more general reversible romanization that Nisaba provides for Bengali and Assamese.)

The third stage implements post-processing transformations required to resolve the ambiguities represented by the placeholder markers introduced in the previous stage, based on the orthographic context. One example of such a transformation is the resolution of the Bengali virama sign, whose original purpose in the Bengali script is to mark silent consonants pronounced without an inherent vowel in consonant clusters. Its Meetei Mayek counterpart, the apun iyek mark, functions differently: it only applies to a non-silent subset of consonants (i.e., all consonants excluding the lonsum iyek set mentioned above). Hence, given the virama placeholder in the input, two finite-state context-dependent rewrites are required for its resolution: if the preceding consonant sound is covered by the lonsum iyek set, convert that consonant to its lonsum iyek representation and remove the placeholder; otherwise, simply convert the virama placeholder to apun iyek.
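The virama-resolution rule above can be illustrated outside the WFST framework with plain Python over a romanized token sequence. The token names, the placeholder marker, and the lonsum consonant set below are illustrative assumptions for the sketch, not Nisaba's actual representation:

```python
# Romanized consonants assumed here to have lonsum iyek (final-consonant)
# forms; consult the Meetei Mayek Unicode block for the authoritative set.
LONSUM = {"k", "l", "m", "p", "n", "t", "ng"}

def resolve_virama(tokens):
    """Resolve 'VIRAMA' placeholders in a romanized token stream.

    If the preceding consonant has a lonsum form, rewrite it as a final
    consonant and drop the marker; otherwise emit an apun iyek joiner.
    """
    out = []
    for tok in tokens:
        if tok == "VIRAMA" and out and out[-1] in LONSUM:
            out[-1] = out[-1] + "_lonsum"  # context-dependent rewrite 1
        elif tok == "VIRAMA":
            out.append("APUN")             # context-dependent rewrite 2
        else:
            out.append(tok)
    return out
```

In the actual system, both branches are expressed as finite-state context-dependent rewrite rules rather than imperative code.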

The final transducer implements reverse romanization transliterating unambiguous Latin script input in ISO 15919 format into corresponding representation in Meetei Mayek.

Appendix D Tables of non-Unicode fonts

Table 23 shows the mappings of ASCII characters to Unicode points for a few fonts we ran into. We discovered these mappings by copy-pasting text from an environment where it rendered (e.g. a PDF) to one where the font wasn’t installed (e.g. a text editor), and finding the Unicode character that looked like the way it rendered.

Tamazight (ber-Latn) Ewe (ee) Mooré (mos)
ASCII codepoint(s) ASCII codepoint(s) ASCII codepoint(s)
â U+025B 0 U+025B U+0303 à U+0269
ç U+010D 1 U+025B â U+00E3
é U+1E93 2 U+0256 è U+025B
ê U+1E25 3 U+028B ê U+1EBD
î U+1E6D 4 U+0254 î U+0129
o U+01E7 5 U+0192 Î U+0128
ô U+1E5B 6 U+0263 ô U+00F5
û U+1E63 7 U+00E3 ù U+028B
v U+1E0D 8 U+1EBD û U+0169
Ä U+0190 - U+0254 U+0303 À U+0196
Ç U+010C [ U+0169 Â U+00C3
É U+1E92 ] U+0292 È U+0190
Ë U+1E24 @ U+0189 Ê U+1EBC
Ï U+1E6C & U+00C3 Ô U+00D5
O U+01E6 % U+0191 Ù U+01B2
Ö U+1E5A U+014B Û U+0168
Ü U+1E62 ^ U+0194
V U+1E0C = U+0129
$ U+0263 ~ U+014A
£ U+0194 $ U+0186
Table 23: Three (possibly incomplete) fonts we reconstructed for three African languages that use an extended Latin character set. Table shows the mapping from the ASCII character to the “correct” Unicode codepoint.
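Once a mapping like those in Table 23 has been reconstructed, converting legacy-font text is a single character-translation pass. A sketch using a fragment of the Ewe mapping above (the function name is ours):

```python
# Fragment of the Ewe legacy-font mapping from Table 23: the font
# repurposed ASCII digits and punctuation to render extended-Latin letters.
EWE_FONT_MAP = {
    "0": "\u025B\u0303",  # ɛ with combining tilde
    "1": "\u025B",        # ɛ
    "2": "\u0256",        # ɖ
    "3": "\u028B",        # ʋ
    "4": "\u0254",        # ɔ
    "5": "\u0192",        # ƒ
    "6": "\u0263",        # ɣ
    "]": "\u0292",        # ʒ
}

def fix_legacy_font(text, mapping=EWE_FONT_MAP):
    """Replace legacy-font ASCII stand-ins with their Unicode codepoints."""
    return text.translate(str.maketrans(mapping))
```

Note that some mappings are one-to-many at the codepoint level (e.g. a base letter plus a combining mark), which is why a string-valued translation table is needed rather than a simple character swap.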

Appendix E List of languages with regions, approximate speaker counts and data sizes

Information about the languages and datasets we found on the web, including the name, number of speakers, continent, script (writing system), cluster (see Section 2.1.3), langID F1 score for the SSLID model, and RttLangIDChrF (loose) score. The RttLangIDChrF score provides an approximate measure of quality of the model, but it should not be trusted as a reliable measure of translation quality (Section 4.3). Number of speakers refers to the estimated number of L1 speakers, following the estimates from van Esch et al. (2022).

BCP-47 Mono Language Name Speak. Cont. Script Clust. F1 RTT
en 7388M English 550M Europe Latn en 98.2 NA
es 1751M Spanish 490M Europe Latn es 98.8 77.8
de 1693M German 83M Europe Latn de 98.4 78.2
id 950M Indonesian 200M Asia Latn id 97.3 76.9
hu 724M Hungarian 13M Europe Latn hu 98.8 68.1
pl 687M Polish 38M Europe Latn pl 98.0 72.8
zh 676M Chinese 1000M Asia Hans zh 99.1 64.0
ko 672M Korean 52M