Language-Agnostic SEntence Representations
We show that margin-based bitext mining in a multilingual sentence space can be applied to monolingual corpora of billions of sentences. We use ten snapshots of a curated common crawl corpus (Wenzek et al., 2019) totaling 32.7 billion unique sentences. Using one unified approach for 38 languages, we were able to mine 3.5 billion parallel sentences, out of which 661 million are aligned with English. 17 language pairs have more than 30 million parallel sentences, 82 more than 10 million, and most more than one million, including direct alignments between many European or Asian languages. To evaluate the quality of the mined bitexts, we train NMT systems for most of the language pairs and evaluate them on TED, WMT and WAT test sets. Using our mined bitexts only, with no human-translated parallel data, we achieve a new state of the art for a single system on the WMT'19 test set for translation between English and German, Russian and Chinese, as well as German/French. In particular, our English/German system outperforms the best single one by close to 4 BLEU points and is almost on par with the best WMT'19 evaluation system, which uses system combination and back-translation. We also achieve excellent results for distant language pairs like Russian/Japanese, outperforming the best submission at the 2019 Workshop on Asian Translation (WAT).
Most of the current approaches in Natural Language Processing (NLP) are data-driven. The size of the resources used for training is often the primary concern, but their quality and topical variety may be equally important. Monolingual texts are usually available in huge amounts for many topics and languages. However, multilingual resources, typically sentences in two languages which are mutual translations, are more limited, in particular when neither of the two languages is English. An important source of parallel texts are international organizations like the European Parliament Koehn (2005) or the United Nations Ziemski et al. (2016). These are professional human translations, but they use a more formal register and tend to be limited to political topics. Several projects rely on volunteers to provide translations of public texts, e.g. news commentary Tiedemann (2012), OpenSubtitles Lison and Tiedemann (2016) or the TED corpus Qi et al. (2018).
A first system to systematically mine parallel sentences for many language pairs in Wikipedia, including bitexts without English as one of the languages, was presented in Schwenk et al. (2019). In that work, parallel sentence mining was based on a distance measure in a joint multilingual sentence embedding space Schwenk (2018); Artetxe and Schwenk (2018a), using the freely available LASER toolkit (https://github.com/facebookresearch/LASER), which provides a language-agnostic sentence encoder trained on 93 languages Artetxe and Schwenk (2018b).
In this paper, we use the same underlying mining approach based on LASER and scale to a much larger corpus: ten crawls of a curated common crawl data set Wenzek et al. (2019) instead of Wikipedia (32.7 billion against 550 million unique sentences). On the one hand, we had to redesign the processing pipeline in order to tackle the substantial computational challenge: billions of sentence embeddings have to be compared. On the other hand, it is an interesting research question whether global mining scales to billions of sentences, i.e. systematically comparing each sentence in a source language with all sentences in the target language. To the best of our knowledge, all existing large-scale bitext mining techniques apply a hierarchical approach. First, a subset of all the texts is selected, e.g. documents which are expected to contain parallel sentences. Then, sentences limited to previously aligned documents are compared and the parallel ones are identified. This type of local mining has the advantage of being very fast, since only a few thousand sentences need to be compared for each document. However, sentences which appear in documents that were not preselected cannot be aligned.
In this work, we make no assumption on the structure of the monolingual text corpora - we simply compare all sentences against each other. Our experimental results seem to indicate that such an approach works surprisingly well: we are able to mine billions of parallel sentences which seem to be of high quality: NMT systems trained only on our mined data outperform the currently best single NMT systems in WMT’19 and WAT’19.
The paper is organized as follows. In the next section, we first discuss related work. We then present the corpus used in this work and summarize the underlying mining approach. Section 4.3 describes in detail how we applied this approach to extract parallel sentences. To assess the quality of the extracted bitexts, we train NMT systems for a subset of language pairs and evaluate them on the TED corpus Qi et al. (2018), and on test sets of WMT Barrault et al. (2019) and of the Workshop on Asian Translation (WAT) Nakazawa et al. (2019). These results are presented in Section 6. The paper concludes with a discussion of future research directions.
There is a large body of research on mining parallel sentences in collections of monolingual texts, usually named “comparable corpora”. Initial approaches to bitext mining relied on heavily engineered systems, often based on metadata information, e.g. (Resnik, 1999; Resnik and Smith, 2003). More recent methods explore the textual content of the comparable documents. For instance, it was proposed to rely on cross-lingual document retrieval, e.g. (Utiyama and Isahara, 2003; Munteanu and Marcu, 2005), or machine translation, e.g. (Abdul-Rauf and Schwenk, 2009; Bouamor and Sajjad, 2018), typically to obtain an initial alignment that is then further filtered. In the shared task for bilingual document alignment Buck and Koehn (2016), many participants used techniques based on n-gram or neural language models, neural translation models and bag-of-words lexical translation probabilities for scoring candidate document pairs. The STACC method uses seed lexical translations induced from IBM alignments, which are combined with set expansion operations to score translation candidates through the Jaccard similarity coefficient (Etchegoyhen and Azpeitia, 2016; Azpeitia et al., 2017, 2018). Using multilingual noisy web-crawls such as ParaCrawl (http://www.paracrawl.eu/) for filtering good quality sentence pairs has been explored in the shared tasks for high resource Koehn et al. (2018) and low resource Koehn et al. (2019) languages.
In this work, we rely on massively multilingual sentence embeddings and margin-based mining in the joint embedding space, as described in Schwenk (2018); Artetxe and Schwenk (2018a, b). This approach has also proven to perform best in a low resource scenario Chaudhary et al. (2019); Koehn et al. (2019). Closest to this approach is the research described in España-Bonet et al. (2017); Hassan et al. (2018); Guo et al. (2018); Yang et al. (2019). However, in all these works, only bilingual sentence representations have been trained. Such an approach does not scale to many languages. Finally, related ideas have been also proposed in Bouamor and Sajjad (2018) or Grégoire and Langlais (2017). However, in those works, mining is not solely based on multilingual sentence embeddings, but they are part of a larger system.
Wikipedia is arguably the largest comparable corpus with high-quality, human-verified texts. One of the first attempts to exploit this resource was performed by Adafre and de Rijke (2006). An MT system was used to translate Dutch sentences into English and to compare them with the English texts. This method yielded several hundred Dutch/English parallel sentences. Later, a similar technique was applied to the Persian/English pair Mohammadi and GhasemAghaee (2010). Structural information in Wikipedia, such as the topic categories of documents, was used in the alignment of multilingual corpora Otero and López (2010). In another work, the mining approach of Munteanu and Marcu (2005) was applied to extract large corpora from Wikipedia in sixteen languages Smith et al. (2010). Otero et al. (2011) measured the comparability of Wikipedia corpora by the translation equivalents of three languages: Portuguese, Spanish, and English. Patry and Langlais (2011) came up with a set of features such as Wikipedia entities to recognize parallel documents, but their approach was limited to a bilingual setting. Tufis et al. (2013) proposed an approach to mine parallel sentences from Wikipedia textual content, but they only considered high-resource languages, namely German, Spanish and Romanian paired with English. Tsai and Roth (2016) grounded multilingual mentions to English Wikipedia by training cross-lingual embeddings on twelve languages. Gottschalk and Demidova (2017) searched for parallel text passages in Wikipedia by comparing their named entities and time expressions. Finally, Aghaebrahimian (2018) proposes an approach based on bilingual BiLSTM sentence encoders to mine German, French and Persian parallel texts with English.
Parallel data consisting of aligned Wikipedia titles have been extracted for twenty-three languages (https://linguatools.org/tools/corpora/wikipedia-parallel-titles-corpora/). Since Wikipedia titles are rarely entire sentences with a subject, verb and object, it seems that only modest improvements were observed when adding this resource to the training material of NMT systems.
We are aware of two large-scale mining approaches applied to several language pairs and large collections of texts. The European project ParaCrawl (http://www.paracrawl.eu/) focuses on mining parallel data for all European languages, mainly aligned with English. The underlying alignment engine, called Bitextor (https://github.com/bitextor/bitextor), uses a two-stage approach: first, parallel documents are identified; then, pairs of documents are processed to identify parallel segments. Sentence alignment either uses a seed MT system or bilingual lexicons Esplà-Gomis and Forcada (2010). In another work, parallel sentences are mined in Wikipedia for many language pairs using a margin criterion in a multilingual sentence embedding space Schwenk et al. (2019).
In this work, we propose to mine parallel sentences from the Web, using the data released by the Common Crawl project. Each month, a snapshot of the Web containing terabytes of web pages in various languages is obtained by randomly exploring URLs. We start by applying some preprocessing steps to the raw text data, following the pipeline introduced by Wenzek et al. (2019) and leading to the CCNet dataset. The first step is to deduplicate the data at the paragraph level, as the original crawls contain up to 70% duplicated data. This preprocessing removes low-quality content such as boilerplate, navigation menus or cookie warnings. The second step of the pipeline is to identify the language of each document, using fastText (https://fasttext.cc/docs/en/language-identification.html) Grave et al. (2018). This language identifier uses a linear classifier with character n-gram features and can recognize up to 176 languages. Finally, the last step of the preprocessing is to filter low-quality content by training a language model on Wikipedia and only keeping documents with a low perplexity score. We refer the reader to Wenzek et al. (2019) for more details about this preprocessing pipeline. In Figure 1, we report the number of unique sentences obtained after preprocessing ten snapshots from Common Crawl. We currently process 38 languages. The English Web content is abundant and we used only one snapshot.
The underlying idea of the mining approach used in this work is to first learn a multilingual sentence embedding, i.e. an embedding space in which semantically similar sentences are close, independently of the language they are written in. This means that the distance in that space can be used as an indicator of whether two sentences are mutual translations or not. Using a simple absolute threshold on the cosine distance was shown to achieve competitive results Schwenk (2018). However, it has been observed that an absolute threshold on the cosine distance is globally not consistent, e.g. Guo et al. (2018).
Artetxe and Schwenk (2018a) showed that the alignment quality can be substantially improved by using a margin criterion instead of an absolute threshold. The margin between two candidate sentences $x$ and $y$ is defined as the ratio between their cosine similarity and the average cosine similarity of their nearest neighbors in both directions:

$$\mathrm{margin}(x,y) = \frac{\cos(x,y)}{\displaystyle\sum_{z \in \mathrm{NN}_k(x)} \frac{\cos(x,z)}{2k} + \sum_{z \in \mathrm{NN}_k(y)} \frac{\cos(y,z)}{2k}} \qquad (1)$$

where $\mathrm{NN}_k(x)$ denotes the $k$ unique nearest neighbors of $x$ in the other language, and analogously for $\mathrm{NN}_k(y)$.
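As a concrete illustration, the margin score can be computed brute-force with NumPy on small corpora. This is a sketch under the definitions above: the function names are ours, and a real run would use FAISS approximate nearest-neighbor search instead of materializing the full similarity matrix.

```python
import numpy as np

def cosine_matrix(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def margin_scores(src, tgt, k=4):
    """Ratio-margin scores (Equation 1) between all source/target embeddings.

    src: (n, d) source-language sentence embeddings
    tgt: (m, d) target-language sentence embeddings
    Returns an (n, m) matrix of margin scores.
    """
    sim = cosine_matrix(src, tgt)
    k_fwd = min(k, sim.shape[1])          # forward neighborhood size
    k_bwd = min(k, sim.shape[0])          # backward neighborhood size
    # mean cosine similarity to the k nearest neighbors, in both directions
    nn_src = np.sort(sim, axis=1)[:, -k_fwd:].mean(axis=1)
    nn_tgt = np.sort(sim, axis=0)[-k_bwd:, :].mean(axis=0)
    # denominator of Equation 1: sum cos/(2k) over both neighborhoods
    denom = (nn_src[:, None] + nn_tgt[None, :]) / 2.0
    return sim / denom
```

A mutual translation pair gets a high ratio because its cosine similarity stands out relative to the average similarity of its surrounding neighborhoods, which is exactly what makes the margin more robust than an absolute cosine threshold.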
Artetxe and Schwenk (2018a) describe the “max-strategy” as one of the best performing ones: the margin is first calculated in both directions for all sentences in the two languages. Then, the union of these forward and backward candidates is built, candidates are sorted, and pairs whose source or target sentences were already used are omitted. Finally, a threshold is applied on the margin score to decide whether two sentences are mutual translations or not. The reader is referred to Artetxe and Schwenk (2018a) for a detailed discussion of related work. The “max-strategy” was used in Schwenk et al. (2019) to mine parallel sentences in Wikipedia.
This strategy was initially motivated by an evaluation on the BUCC corpus Zweigenbaum et al. (2018), for which the reference alignments were known to be strictly 1:1. With increasing corpus size, namely billions of sentences in CCNet, the probability of finding several perfect translations increases. This questions the restriction that each source sentence is aligned to exactly one target sentence, and vice versa. The value of $k$ in Equation 1 should also be carefully selected, to avoid the situation where all the $k$ nearest sentences are valid translations, i.e. have similar distances and therefore a small margin. This would result in many valid translations being excluded. Therefore, we increased the neighborhood size $k$ in Equation 1 from 4, which was used in Schwenk et al. (2019), to 16.
Distance-based bitext mining requires a joint sentence embedding for all the considered languages. One may be tempted to train a bilingual embedding for each language pair, e.g. España-Bonet et al. (2017); Hassan et al. (2018); Guo et al. (2018); Yang et al. (2019), but this is difficult to scale to the thousands of language pairs present in CCNet. We follow Schwenk et al. (2019) and use one single massively multilingual sentence embedding for all languages, namely the one provided by the open-source LASER toolkit Artetxe and Schwenk (2018b).
The underlying idea of LASER is to train a sequence-to-sequence system on many language pairs at once, using a shared BPE vocabulary and a shared encoder for all languages. The sentence representation is obtained by max-pooling over all encoder output states. Figure 1 illustrates this approach. The reader is referred to Artetxe and Schwenk (2018b) for a detailed description.
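The max-pooling step can be sketched in a few lines of NumPy. This is an illustration only, not LASER's actual implementation (which pools over BiLSTM encoder outputs in PyTorch); the function name and the padding-mask handling are ours.

```python
import numpy as np

def pool_sentence_embedding(states, mask=None):
    """Collapse encoder output states into one fixed-size sentence vector.

    states: (seq_len, dim) array of encoder outputs for one sentence
    mask:   optional (seq_len,) boolean array, True for real (non-pad) tokens
    Returns a (dim,) embedding obtained by max-pooling over time,
    as in LASER's pooling step.
    """
    if mask is not None:
        states = states[mask]          # ignore padding positions
    return states.max(axis=0)          # elementwise max over the sequence
```

Max-pooling keeps the output dimension fixed (1024 for LASER) regardless of sentence length, which is what makes the embeddings directly comparable across sentences and languages.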
We use the same underlying mining procedure as Schwenk et al. (2019), who extracted 135 million parallel sentences from Wikipedia in 1620 different language pairs. However, our CCNet corpus is more than fifty times larger than Wikipedia: 32.7 billion against 595 million unique sentences. Our largest corpora are English and Russian, with 8.7 and 3 billion unique sentences, respectively. For ten languages, CCNet has more than one billion unique sentences (see Figure 1). This required us to significantly modify the mining pipeline in order to tackle the substantially increased computational complexity. The overall processing pipeline can be structured into three tasks:
text extraction and processing including sentence splitting, language identification (LID) and deduplication;
creation of a compressed index for each language;
mining parallel data for each language pair using the sentence embeddings and indexes.
For each step, we aimed to parallelize the processing as much as possible by splitting the data into several blocks. We used blocks of about fifty million sentences, a size chosen so that the different operations can be performed in a couple of hours. As an example, all the English texts are split into 160 blocks.
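The block-splitting scheme can be sketched as follows. The function name is ours, and this is a simplified in-memory version: the actual pipeline shards files on disk so that each block can be handed to a separate worker.

```python
from itertools import islice

BLOCK_SIZE = 50_000_000  # about fifty million sentences per block, as in the text

def iter_blocks(sentences, block_size=BLOCK_SIZE):
    """Yield successive blocks of at most block_size sentences.

    Each block can then be embedded, deduplicated and indexed
    independently, which is what makes the per-step parallelism possible.
    """
    it = iter(sentences)
    while True:
        block = list(islice(it, block_size))
        if not block:
            return
        yield block
```

With blocks of roughly 54 million sentences, the 8.7 billion English sentences indeed split into about 160 blocks, consistent with the figures above.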
The first task, text extraction and processing, consists in the following steps:
Extract the texts from the JSON data of CCNet (see Wenzek et al. (2019) for details).
Split the “paragraphs” into sentences.
Perform LID and exclude sentences which are not in the expected language.
Mark all sentences which are duplicates within each block.
Each of these four steps is performed in parallel for all blocks and languages. As a final step, we merge all the block-wise deduplicated sentences and create one set of globally unique sentences for each language. We used a freely available Python tool (https://pypi.org/project/sentence-splitter/) to detect sentence boundaries. If specific rules for a language are not available, we fall back to a linguistically similar language, e.g. we use Spanish rules for Galician, and default to English otherwise. Most of the Asian languages are handled by regular expressions. We exclude sentences with more than 500 characters. LID is performed at the sentence level with fastText Joulin et al. (2016); Grave et al. (2018). Once the text preparation task is finished, we have a corpus of unique sentences for each language. These texts are the basis for the index creation and mining tasks. The amount of data for each language is given in Table 3, third column.
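The per-block filtering steps (length filter, LID, block-local deduplication) can be sketched as below. This is our illustration of the general idea: `detect_lang` is a stand-in for the fastText LID model, and the hash-based deduplication is one plausible way to mark duplicates within a block.

```python
import hashlib

MAX_CHARS = 500  # sentences longer than this are excluded, as in the text

def process_block(sentences, expected_lang, detect_lang):
    """Filter one block: drop over-long, wrong-language and duplicate sentences.

    detect_lang(sentence) -> language code; a stand-in for the fastText
    LID model. Deduplication here is block-local; a final merge step
    would unify sentences across blocks into one globally unique set.
    """
    seen = set()
    kept = []
    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence or len(sentence) > MAX_CHARS:
            continue
        if detect_lang(sentence) != expected_lang:
            continue
        digest = hashlib.sha1(sentence.encode("utf-8")).digest()
        if digest in seen:   # duplicate within this block
            continue
        seen.add(digest)
        kept.append(sentence)
    return kept
```

Storing fixed-size digests instead of the sentences themselves keeps the per-block memory footprint small, which matters at fifty million sentences per block.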
We follow Schwenk et al. (2019) and use the highly optimized FAISS toolkit Johnson et al. (2017) (https://github.com/facebookresearch/faiss/wiki/Faiss-indexes) to create compact indexes of the sentence embeddings. LASER’s sentence representations are 1024-dimensional, so storing the embeddings of all sentences in float32 would require roughly 134 TB (32.7 billion sentences × 1024 dimensions × 4 bytes). We therefore use an aggressive vector compression based on a 64-byte product quantizer Jégou et al. (2011). In order to account for the huge number of sentences, we increase the number of cells used to partition the search space from 32k to 64k. This corresponds to the index type OPQ64,IVF65536,PQ64 in FAISS terms.
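The back-of-the-envelope storage arithmetic behind this compression choice is as follows (decimal terabytes; the real indexes also store IDs and coarse-quantizer data on top of the raw PQ codes):

```python
# Storage estimate for the CCMatrix sentence embeddings.
N_SENTENCES = 32_700_000_000  # unique sentences over all 38 languages
DIM = 1024                    # LASER embedding dimension
FP32_BYTES = 4                # bytes per float32 component
PQ_CODE_BYTES = 64            # bytes per vector after PQ64 compression

raw_tb = N_SENTENCES * DIM * FP32_BYTES / 1e12   # uncompressed embeddings
pq_tb = N_SENTENCES * PQ_CODE_BYTES / 1e12       # product-quantized codes
print(f"raw fp32: {raw_tb:.0f} TB, PQ64 codes: {pq_tb:.1f} TB")
# prints: raw fp32: 134 TB, PQ64 codes: 2.1 TB
```

The roughly 64x reduction from product quantization brings the codes down to about 2 TB, which is consistent with the reported total size of all 28 indexes.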
Exhaustive search in huge indexes is only tractable if performed on GPU. FAISS supports sharding of a single index over multiple GPUs; this is most efficient if the GPUs are in the same machine and communicate very quickly. For our index type, eight GPUs with 32GB of memory each allow us to create an index of about three billion sentences. This covers all languages with the exception of English, which has 8.7 billion sentences. Therefore, we created three English indexes of 2.7 billion sentences each.
The processing pipeline to train and create the indexes is summarized in Figure 2. First, we train an index on 40 million sentences sampled from the whole corpus, when available. Once the index is trained, the data in each block is independently added to the common trained index; this can also be processed in parallel. These individual indexes are then merged into one index for each language. The Russian and Japanese indexes, with three billion sentences each, have a file size of about 200GB; all 28 indexes total about 2TB.
Once indexes for all languages are calculated, we can start the mining process for each language pair. Schwenk et al. (2019) pre-calculated the sentence embeddings for all languages and then started the pairwise mining process. The authors report that less than 3.5h on 8 GPUs are needed for the whole “max-mining” process between English and German, i.e. 134M and 51M sentences respectively. This corresponds to about $6.8 \times 10^{15}$ distance calculations.
Let us consider mining Japanese/Russian bitexts in CCNet, with 3.0 and 2.9 billion sentences respectively, i.e. roughly $8.7 \times 10^{18}$ distance calculations. This means that we have to perform about 1300 times more distance calculations than in the English/German example above, which would translate to more than 6 months on a single machine with 8 GPUs. We tackle this computational challenge by decoupling the distance calculations in the forward and backward directions from the margin calculation (see Equation 1), and processing all these steps in parallel. This processing pipeline is illustrated in Figure 3.
In addition, we had to use a special procedure to mine for parallel sentences with English, due to the large amount of English sentences. For the sake of explanation, let us assume that we want to extract German/English bitexts. It is computationally too expensive to perform k-nn search in the German FAISS index for all the 8.7 billion English sentences (backward distances). Therefore, we are constrained to use the forward distances only. Remember that we had to partition all the English sentences into three indexes of about 2.7 billion sentences each. Consequently, for each German sentence, we search in the three different English indexes and calculate the margin with respect to the nearest neighbors. We then combine the alignments and keep those with a margin above a threshold of 1.06. It can happen that the algorithm finds a valid translation in each of the three indexes; we decided to keep those alternative translations.
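The shard-combination logic can be sketched as follows. The data layout and function names are ours: the real pipeline operates on large on-disk arrays of distances rather than per-source dictionaries.

```python
THRESHOLD = 1.06  # margin threshold used for mining, as in the text

def combine_shard_candidates(per_shard, threshold=THRESHOLD):
    """Merge forward-search results from several target-side index shards.

    per_shard: one dict per shard, mapping source_id -> (margin, target_id)
               for that shard's best candidate.
    Every candidate above the margin threshold is kept, so a source
    sentence may legitimately end up with one translation per shard
    (the alternative translations mentioned in the text).
    """
    alignments = []
    for shard in per_shard:
        for src_id, (margin, tgt_id) in shard.items():
            if margin >= threshold:
                alignments.append((src_id, tgt_id, margin))
    return alignments
```

Because only forward distances are available here, the thresholding replaces the bidirectional max-strategy used for the non-English pairs.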
Mining for parallel sentences in more than 32 billion sentences is computationally very expensive. In the current version of the CCMatrix corpus, we have limited the alignment process to 38 languages, chosen to cover several language families and scripts. In the following, we first discuss the amount of extracted sentences. We then turn to a qualitative assessment by training NMT systems for many language pairs (Section 6).
The margin threshold used to mine parallel sentences impacts the quality of the produced bitexts. A higher threshold leads to better aligned sentences, and thus higher quality bitexts, but also to smaller datasets; there is thus a trade-off between the size of the extracted bitexts and their quality. Exploratory experiments showed that a threshold around 1.06 seems to give good results. To confirm this, we trained and evaluated machine translation systems on the Hu-Da pair for different values of the threshold. We report results in Figure 4, showing that a threshold of 1.06 leads to the best performance. Note that this value is different from the margin threshold of 1.04 reported in Schwenk et al. (2019), since we use a neighborhood of $k=16$ instead of 4.
We were able to mine in total 3.5 billion parallel sentences when using a threshold of 1.06 on the margin, out of which 661 million are aligned with English (see Table 2).
Most current MT systems focus on translation from or into English. Other language pairs are usually handled by pivoting through English, since direct parallel texts are much smaller. This can be suboptimal when translating between two morphologically rich languages, e.g. French/German, or very different languages, e.g. Russian/Japanese. We also provide parallel data for many language pairs not involving English. Due to the high computational complexity, we only considered 28 languages (see Table 3). This yielded about three billion parallel sentence pairs. To the best of our knowledge, this makes CCMatrix the largest collection of high-quality mined parallel texts.
The general tendency is of course that mining in larger monolingual corpora leads to larger extracted bitexts. This is however not systematically true. Consider for example Polish and Dutch, which both have about 500 million unique sentences. When aligned with Czech, a Slavic language, there are slightly more bitexts with Polish than with Dutch (13.2M compared to 11.6M). When aligned with German, a Germanic language like Dutch, there are substantially more bitexts for Dutch than for Polish, 33.2M and 20.5M respectively. Finally, both Polish and Dutch have much smaller bitexts with Indonesian, although there are more than 360M sentences for that language.
On the one hand, a possible explanation could be that LASER alignments are more reliable for languages which are very similar, i.e. in the same language family. On the other hand, it may also be that people who live in nearby countries have similar interests, which increases the chance of finding translations on the Web.
In order to assess the quality of the extracted parallel sentences, we trained NMT systems on the extracted parallel sentences and evaluated them on several public test sets. A test set for many languages, based on the TED talks, is provided in Qi et al. (2018). Our results on this test set are given in the next section. The Workshop on Machine Translation (WMT) has a long history of organizing evaluations of machine translation, and many comparative results are published for these tasks Barrault et al. (2019). We provide very competitive BLEU scores for several WMT’19 evaluation tasks in Section 6.2. Finally, we consider the task of translating between Russian and Japanese, as proposed by the 2019 edition of the Workshop on Asian Translation (see Section 6.3).
In this set of experiments, we are interested in the performance of NMT systems trained on our bitexts only. Following Gottschalk and Demidova (2017) and Schwenk et al. (2019), we evaluate on the test sets of the TED dataset Qi et al. (2018), which contains parallel TED talk transcripts in 50 languages (https://github.com/neulab/word-embeddings-for-nmt). The TED datasets are tokenized, and we first detokenize them using Moses, with the exception of pairs involving Korean, for which detokenization creates artifacts. As we do not include the training set provided with the TED dataset, our bitexts are not guaranteed to cover the same domains.
In the current version of CCMatrix, we consider a large set of languages, resulting in many NMT systems to train. Although the size of the bitexts varies across language pairs, we used the same pipeline for each pair. In particular, we limit the bitext size to 15M sentences to avoid very long training times. We tokenize the dataset with Moses, with the exception of Chinese, where we use Jieba, and Japanese, where we use Mecab. We compute a BPE vocabulary Sennrich et al. (2016) on the resulting tokenized training bitext. Then, for all the pairs, we train the same architecture: a Transformer network with the same number of layers for both the encoder and decoder, and fixed word-embedding and FFN dimensions. We train each model for a fixed number of epochs with a fixed initial learning rate, and keep the model with the best BLEU score on the validation set of TED.

In Table 4, we report tokenized BLEU scores on the test set (using Moses, Jieba and Mecab tokenization). In comparison with WikiMatrix Schwenk et al. (2019), a substantially larger fraction of our language pairs reaches high BLEU scores. Their best pair was Brazilian Portuguese into English, while several of our pairs surpass that score, our best pair being Norwegian to English. These results should not be considered as the state of the art on the TED corpus, since we did not attempt to optimize the Transformer architecture for each language pair. We believe that they give a good indication of the quality of the mined parallel sentences, and suggest that our bitext mining approach is robust to the noise and domain differences that exist in large corpora like Common Crawl.
| NT’18, WMT bitext | 46.2 | 45.9 | 33.5 | 33.4 | 25.8 | 39.2 | – | – |
| NT’19, WMT bitext | 41.0 | 40.4 | 31.4 | 38.1 | – | – | – | – |
Newstest’19 best are the best BLEU scores achieved by ensembles of models trained on both parallel and back-translated WMT’19 data as of the moment of writing, according to http://matrix.statmt.org/
We also evaluate our bitexts on the WMT’19 news translation task. We only consider high resource directions for this comparison, as they constitute the biggest challenge: the existing baseline systems perform very strongly, so achieving superior performance with mined data only is difficult. We follow the setup described in (Ng et al., 2019) to train systems on En-De, En-Ru, En-Zh and De-Fr. We used the Transformer Big architecture with an increased FFN size (8192), and trained these models for 500k updates on 8 GPUs with a batch size of 3500 tokens. Given the large amounts of mined bitexts for the considered language pairs (see Table 3), we limit the sentence pairs to those with a score higher than or equal to 1.07, except for En-Zh, where we apply a margin threshold of 1.06. This gives us 40.6M En-De, 39.5M En-Ru, 32.6M De-Fr and 17.6M En-Zh sentence pairs. For each direction, we learned a joint source-target BPE encoding Sennrich et al. (2016) and used shared input/output embeddings. For the En-De and En-Ru models, we increased the model size even further to 9 encoder and decoder layers, used layer dropout Fan et al. (2019), and increased the embedding dimension to 2048. We tuned training parameters on Newstest 2014-2016 when available, and on the WMT’19 dev set for De-Fr. We compare the performance of a single model for each direction with the performance of published single models trained on bitext data only. We found that systems trained on CCMatrix outperform systems trained on WMT bitext data (see Table 5). This can be seen as a clear indicator of the quality of the mined data.
To answer the question of how this data combines with real human-translated data, we trained a system on a combination of CCMatrix and the bitexts provided by WMT’19, taking En-De as an example. We found that this system outperforms the system trained on CCMatrix data only by 0.8 BLEU points on average, achieving a BLEU score of 50.9 on newstest2018 and of 45.1 on newstest2019.
| System | Ja→Ru | Ru→Ja |
| WAT’19 test best | 14.26 (http://lotus.kuee.kyoto-u.ac.jp/WAT/evaluation/list.php?t=67&o=1) | 16.41 (http://lotus.kuee.kyoto-u.ac.jp/WAT/evaluation/list.php?t=66&o=4) |
Finally, we have evaluated the translation between Russian and Japanese, as proposed in the 2019 Workshop on Asian Translation (WAT) Nakazawa et al. (2019) (http://lotus.kuee.kyoto-u.ac.jp/WAT/WAT2019/index.html). According to the organizers of the WAT workshop, this language pair represents “an extremely low resource situation for distant language pairs”. The organizers provide only a tiny amount of parallel data from the Global Voices domain for training (12,356 sentences), together with a development set (486 sentences) and a test set (600 sentences) from the News Commentary domain (https://github.com/aizhanti/JaRuNC). The participants in the WAT’19 Russian/Japanese evaluation were encouraged to use the provided Russian/English and Japanese/English bitexts and train multilingual NMT systems.
We trained an NMT system on CCMatrix Russian/Japanese bitexts only, without using other resources or texts aligned with English, applying a threshold of 1.06 on the margin. We use the same NMT architecture as in Section 6.2, without layer dropout. We report tokenized BLEU scores computed with multi-bleu.perl, using Moses tokenization for Russian and Mecab for Japanese (see Table 6). We were able to outperform the best performing system at the WAT’19 evaluation, in particular when translating into Japanese. The participants in the WAT translation task were constrained to only use the provided resources, which included alignments with English. Therefore, our results are not directly comparable, but we argue that they are still a good indicator of the alignment quality of our mined bitexts.
We have shown that margin-based mining in a joint multilingual sentence embedding space can be scaled to monolingual texts of more than 32 billion unique sentences in 38 languages. Our approach is generic and simply compares all sentences against each other, without requiring any document alignment. We tackled the computational complexity by parallelizing all processing steps. This procedure yielded 3.5 billion parallel sentences across pairwise alignments of 28 languages, out of which 661 million are aligned with English. To the best of our knowledge, this is by far the largest collection of high-quality mined parallel sentences.
We have performed an extensive evaluation of the quality of the mined bitexts by training NMT systems for many language pairs; the mined bitexts appear to be of high quality. Training only on our mined data, we are able to outperform the best reported single NMT system at the WMT’19 evaluations for translation between English and German, Russian and Chinese, as well as between German and French. We also achieve state-of-the-art BLEU scores for the translation between Russian and Japanese on the WAT’19 test set. We provide a script to reproduce our results on the LASER GitHub repository (https://github.com/facebookresearch/LASER).
In the next version of the CCMatrix corpus, we will increase the number of common crawl snapshots and focus on low-resource languages. The mined data can be used to train improved multilingual LASER sentence embeddings. The large amount of parallel data also raises interesting questions, namely how to best use it: for instance, how to efficiently train NMT systems on more than fifty million high-quality parallel sentences?
We would like to thank Matthijs Douze for support with the use of FAISS and Vishrav Chaudhary for helpful comments on this work.