Most of the current approaches in Natural Language Processing (NLP) are data-driven. The size of the resources used for training is often the primary concern, but their quality and topical variety may be equally important. Monolingual texts are usually available in huge amounts for many topics and languages. However, multilingual resources, typically sentences in two languages which are mutual translations, are more limited, in particular when neither of the two languages is English. An important source of parallel texts is international organizations like the European Parliament Koehn (2005) or the United Nations Ziemski et al. (2016). These are professional human translations, but they are in a more formal language and tend to be limited to political topics. There are several projects relying on volunteers to provide translations for public texts, e.g. news commentary Tiedemann (2012), OpenSubtitles Lison and Tiedemann (2016) or the TED corpus Qi et al. (2018).
Wikipedia is probably the largest free multilingual resource on the Internet. The content of Wikipedia is very diverse and covers many topics. Articles exist in more than 300 languages. Some content on Wikipedia was human translated from an existing article into another language, not necessarily from or into English. In many cases, the translated articles were later edited independently and are no longer parallel. Wikipedia strongly discourages the use of unedited machine translation (see https://en.wikipedia.org/wiki/Wikipedia:Translation), but the existence of such articles cannot be totally excluded. Many articles have been written independently, but may nevertheless contain sentences which are mutual translations. This makes Wikipedia a very appropriate resource to mine for parallel texts for a large number of language pairs. To the best of our knowledge, this is the first work to process the entire Wikipedia and systematically mine for parallel sentences in all language pairs. We hope that this resource will be useful for several research areas and enable the development of NLP applications for more languages.
In this work, we build on a recent approach to mine parallel texts based on a distance measure in a joint multilingual sentence embedding space Schwenk (2018); Artetxe and Schwenk (2018a). For this, we use the freely available LASER toolkit (https://github.com/facebookresearch/LASER), which provides a language-agnostic sentence encoder trained on 93 languages Artetxe and Schwenk (2018b). We approach the computational challenge of mining in almost six hundred million sentences by using fast indexing and similarity search algorithms.
The paper is organized as follows. In the next section, we first discuss related work. We then summarize the underlying mining approach. Section 4 describes in detail how we applied this approach to extract parallel sentences from Wikipedia in 1620 language pairs. To assess the quality of the extracted bitexts, we train NMT systems for a subset of language pairs and evaluate them on the TED corpus Qi et al. (2018) for 45 languages. These results are presented in Section 5. The paper concludes with a discussion of future research directions.
2 Related work
There is a large body of research on mining parallel sentences in collections of monolingual texts, usually named “comparable corpora”. Initial approaches to bitext mining relied on heavily engineered systems, often based on metadata information, e.g. (Resnik, 1999; Resnik and Smith, 2003). More recent methods explore the textual content of the comparable documents. For instance, it was proposed to rely on cross-lingual document retrieval, e.g. (Utiyama and Isahara, 2003; Munteanu and Marcu, 2005) or machine translation, e.g. (Abdul-Rauf and Schwenk, 2009; Bouamor and Sajjad, 2018), typically to obtain an initial alignment that is then further filtered. In the shared task for bilingual document alignment Buck and Koehn (2016), many participants used techniques based on n-gram or neural language models, neural translation models and bag-of-words lexical translation probabilities for scoring candidate document pairs. The STACC method uses seed lexical translations induced from IBM alignments, which are combined with set expansion operations to score translation candidates through the Jaccard similarity coefficient (Etchegoyhen and Azpeitia, 2016; Azpeitia et al., 2017, 2018). Filtering good quality sentence pairs out of multilingual noisy web-crawls such as ParaCrawl (http://www.paracrawl.eu/) has been explored in the shared tasks for high resource Koehn et al. (2018) and low resource Koehn et al. (2019) languages.
In this work, we rely on massively multilingual sentence embeddings and margin-based mining in the joint embedding space, as described in Schwenk (2018); Artetxe and Schwenk (2018a); Artetxe and Schwenk (2018b). This approach has also proven to perform best in a low resource scenario Chaudhary et al. (2019); Koehn et al. (2019). Closest to this approach is the research described in España-Bonet et al. (2017); Hassan et al. (2018); Guo et al. (2018); Yang et al. (2019). However, in all these works, only bilingual sentence representations have been trained. Such an approach does not scale to many languages, in particular when considering all possible language pairs in Wikipedia. Finally, related ideas have been also proposed in Bouamor and Sajjad (2018) or Grégoire and Langlais (2017). However, in those works, mining is not solely based on multilingual sentence embeddings, but they are part of a larger system. To the best of our knowledge, this work is the first one that applies the same mining approach to all combinations of many different languages, written in more than twenty different scripts.
Wikipedia is arguably the largest comparable corpus. One of the first attempts to exploit this resource was performed by Adafre and de Rijke (2006). An MT system was used to translate Dutch sentences into English and to compare them with the English texts. This method yielded several hundred Dutch/English parallel sentences. Later, a similar technique was applied to the Persian/English pair Mohammadi and GhasemAghaee (2010). Structural information in Wikipedia such as the topic categories of documents was used in the alignment of multilingual corpora Otero and López (2010). In another work, the mining approach of Munteanu and Marcu (2005) was applied to extract large corpora from Wikipedia in sixteen languages Smith et al. (2010). Otero et al. (2011) measured the comparability of Wikipedia corpora by the translation equivalents on three languages: Portuguese, Spanish, and English. Patry and Langlais (2011) came up with a set of features such as Wikipedia entities to recognize parallel documents, but their approach was limited to a bilingual setting. Tufis et al. (2013) proposed an approach to mine parallel sentences from Wikipedia textual content, but they only considered high-resource languages, namely German, Spanish and Romanian paired with English. Tsai and Roth (2016) grounded multilingual mentions to English Wikipedia by training cross-lingual embeddings on twelve languages. Gottschalk and Demidova (2017) searched for parallel text passages in Wikipedia by comparing their named entities and time expressions. Finally, Aghaebrahimian (2018) proposed an approach based on bilingual BiLSTM sentence encoders to mine German, French and Persian parallel texts with English. Parallel data consisting of aligned Wikipedia titles have been extracted for twenty-three languages (https://linguatools.org/tools/corpora/wikipedia-parallel-titles-corpora/). Since Wikipedia titles are rarely entire sentences with a subject, verb and object, it seems that only modest improvements were observed when adding this resource to the training material of NMT systems.
We are not aware of other attempts to systematically mine for parallel sentences in the textual content of Wikipedia for a large number of languages.
3 Distance-based mining approach
The underlying idea of the mining approach used in this work is to first learn a multilingual sentence embedding, i.e. an embedding space in which semantically similar sentences are close independently of the language they are written in. This means that the distance in that space can be used as an indicator of whether two sentences are mutual translations or not. Using a simple absolute threshold on the cosine distance was shown to achieve competitive results Schwenk (2018). However, it has been observed that an absolute threshold on the cosine distance is globally not consistent, e.g. Guo et al. (2018). The difficulty of selecting one global threshold is even more pronounced in our setting since we are mining parallel sentences for many different language pairs.
3.1 Margin criterion
The alignment quality can be substantially improved by using a margin criterion instead of an absolute threshold Artetxe and Schwenk (2018a). In that work, the margin between two candidate sentences $x$ and $y$ is defined as the ratio between the cosine similarity of the two sentence embeddings and the average cosine similarity of their nearest neighbors in both directions:
$$\mathrm{margin}(x, y) = \frac{\cos(x, y)}{\sum_{z \in NN_k(x)} \frac{\cos(x, z)}{2k} + \sum_{z \in NN_k(y)} \frac{\cos(y, z)}{2k}}$$
where $NN_k(x)$ denotes the $k$ unique nearest neighbors of $x$ in the other language, and analogously for $NN_k(y)$. We used the same value of $k$ in all experiments.
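As an illustration, the ratio margin can be sketched in a few lines of plain Python. This is a minimal sketch and not the LASER implementation: the function names are hypothetical, neighbors are found by exhaustive search rather than with an index, and each side is assumed to contain at least $k$ candidate sentences.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_similarities(query, candidates, k):
    """Cosine similarities of the k candidates closest to `query` (exhaustive search)."""
    sims = sorted((cosine(query, c) for c in candidates), reverse=True)
    return sims[:k]

def margin_score(x, y, targets, sources, k):
    """Ratio margin: cos(x, y) divided by the average similarity of the
    k nearest neighbors of x among `targets` and of y among `sources`."""
    nn_x = nearest_similarities(x, targets, k)
    nn_y = nearest_similarities(y, sources, k)
    denominator = sum(nn_x) / (2 * k) + sum(nn_y) / (2 * k)
    return cosine(x, y) / denominator

# Toy 2-dimensional "embeddings" (illustrative values only).
sources = [[1.0, 0.0], [0.0, 1.0]]
targets = [[1.0, 0.1], [0.1, 1.0]]
good = margin_score(sources[0], targets[0], targets, sources, k=2)
bad = margin_score(sources[0], targets[1], targets, sources, k=2)
```

In this toy setting the matching pair scores well above 1 and the mismatched pair well below 1, which is the property that makes a single relative threshold usable across sentences with very different neighborhood densities.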
We follow the “max” strategy as described in Artetxe and Schwenk (2018a): the margin is first calculated in both directions for all sentences in languages $L_1$ and $L_2$. We then create the union of these forward and backward candidates. Candidates are sorted and pairs with source or target sentences which were already used are omitted. We then apply a threshold on the margin score to decide whether two sentences are mutual translations or not. Note that with this technique, we always get the same aligned sentences, independently of the mining direction, e.g. searching translations of French sentences in a German corpus, or in the opposite direction. The reader is referred to Artetxe and Schwenk (2018a) for a detailed discussion with related work.
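The selection procedure just described can be sketched as follows. This is a simplified illustration under stated assumptions: candidates are given as hypothetical `(margin, source_id, target_id)` tuples that have already been scored, and the function name `mine_max_strategy` is ours, not part of any toolkit.

```python
def mine_max_strategy(forward, backward, threshold):
    """Select aligned pairs from forward (source->target) and backward
    (target->source) candidates, each a (margin, src_id, tgt_id) tuple."""
    # Union of both directions; keep the best score if a pair occurs twice.
    best = {}
    for score, src, tgt in list(forward) + list(backward):
        key = (src, tgt)
        if key not in best or score > best[key]:
            best[key] = score
    # Sort candidates by decreasing margin.
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    used_src, used_tgt, mined = set(), set(), []
    for (src, tgt), score in ranked:
        if score < threshold:
            break  # every remaining candidate has a lower margin
        if src in used_src or tgt in used_tgt:
            continue  # sentence already aligned by a better candidate
        used_src.add(src)
        used_tgt.add(tgt)
        mined.append((src, tgt, score))
    return mined

forward = [(1.20, 0, 0), (1.03, 1, 2)]
backward = [(1.20, 0, 0), (1.10, 1, 0), (1.06, 1, 1)]
pairs = mine_max_strategy(forward, backward, threshold=1.04)
```

Because the union is taken before sorting, swapping `forward` and `backward` yields the same pairs, which mirrors the direction independence noted above.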
The complexity of a distance-based mining approach is $O(N \times M)$, where $N$ and $M$ are the number of sentences in each monolingual corpus. This makes a brute-force approach with exhaustive distance calculations intractable for large corpora. Margin-based mining was shown to significantly outperform the state-of-the-art on the shared task of the workshop on Building and Using Comparable Corpora (BUCC) Artetxe and Schwenk (2018a). The corpora in the BUCC shared task are rather small: at most 567k sentences.
The languages with the largest Wikipedia are English and German with 134M and 51M sentences, respectively, after pre-processing (see Section 4.1 for details). (Strictly speaking, Cebuano and Swedish are larger than German, yet mostly consist of template/machine translated text; see https://en.wikipedia.org/wiki/List_of_Wikipedias.) Mining this pair by brute force would require $134\text{M} \times 51\text{M} \approx 6.8 \times 10^{15}$ distance calculations. We show in Section 3.3 how to tackle this computational challenge.
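To make the scale concrete, the number of pairwise comparisons can be computed directly from the sentence counts quoted above. The BUCC comparison assumes, purely for illustration, that both sides of a BUCC corpus reach the 567k upper bound.

```python
def brute_force_comparisons(n_source, n_target):
    """Number of distance calculations in an exhaustive N x M search."""
    return n_source * n_target

english_sentences = 134_000_000
german_sentences = 51_000_000
bucc_upper_bound = 567_000  # assumed size of both sides, for illustration

wikipedia_cost = brute_force_comparisons(english_sentences, german_sentences)
bucc_cost = brute_force_comparisons(bucc_upper_bound, bucc_upper_bound)

print(f"English/German: {wikipedia_cost:.1e} comparisons")
print(f"Ratio to a BUCC-sized task: {wikipedia_cost / bucc_cost:,.0f}x")
```

Under these assumptions, the English/German search is more than four orders of magnitude larger than the largest BUCC-sized problem.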
3.2 Multilingual sentence embeddings
Distance-based bitext mining requires a joint sentence embedding for all the considered languages. One may be tempted to train a bilingual embedding for each language pair, e.g. España-Bonet et al. (2017); Hassan et al. (2018); Guo et al. (2018); Yang et al. (2019), but this is difficult to scale to the thousands of language pairs present in Wikipedia. Instead, we chose to use one single massively multilingual sentence embedding for all languages, namely the one proposed by the open-source LASER toolkit Artetxe and Schwenk (2018b). Training one joint multilingual embedding on many languages at once also has the advantage that low-resource languages can benefit from the similarity to other languages in the same language family. For example, we were able to mine parallel data for several Romance (minority) languages like Aragonese, Lombard, Mirandese or Sicilian although data in those languages was not used to train the multilingual LASER embeddings.
The underlying idea of LASER is to train a sequence-to-sequence system on many language pairs at once using a shared BPE vocabulary and a shared encoder for all languages. The sentence representation is obtained by max-pooling over all encoder output states. Figure 1 illustrates this approach. The reader is referred to Artetxe and Schwenk (2018b) for a detailed description.
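The pooling step can be illustrated in isolation. The sketch below is not the LASER encoder; it only shows how an element-wise maximum over a variable number of encoder output states (here, made-up 3-dimensional vectors) produces a fixed-size sentence vector.

```python
def max_pool(encoder_states):
    """Element-wise maximum over per-token vectors of equal dimension."""
    return [max(values) for values in zip(*encoder_states)]

# Two "sentences" with different numbers of tokens (illustrative values).
short_sentence = [[0.1, -0.5, 0.3], [0.4, -0.2, 0.0]]
long_sentence = [[0.2, 0.1, -0.3], [0.0, 0.7, 0.1], [-0.1, 0.2, 0.9]]

short_vector = max_pool(short_sentence)
long_vector = max_pool(long_sentence)
```

Both results have the same dimensionality regardless of sentence length, which is what allows sentences to be compared with a single distance measure.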
3.3 Fast similarity search
Fast large-scale similarity search is an area with a large body of research. Traditionally, the application domain is image search, but the algorithms are generic and can be applied to any type of vectors. In this work, we use the open-source FAISS library (https://github.com/facebookresearch/faiss), which implements highly efficient algorithms to perform similarity search on billions of vectors Johnson et al. (2017). An additional advantage is that FAISS supports running on multiple GPUs. Our sentence representations are 1024-dimensional. This means that the uncompressed embeddings of all English sentences require several hundred GB of memory. Therefore, dimensionality reduction and data compression are needed for efficient search. In this work, we chose a rather aggressive compression based on a 64-byte product quantizer Jégou et al. (2011), and partitioning the search space into 32k cells. This corresponds to the index type “OPQ64,IVF32768,PQ64” in FAISS terms (see https://github.com/facebookresearch/faiss/wiki/Faiss-indexes). Another interesting compression method is scalar quantization. A detailed comparison is left for future research. We build and train one FAISS index for each language.
The compressed FAISS index for English requires only 9.2GB, i.e. it is more than fifty times smaller than the original sentence embeddings. This makes it possible to load the whole index on a standard GPU and to run the search in a very efficient way on multiple GPUs in parallel, without the need to shard the index. The overall mining process for German/English requires less than 3.5 hours on 8 GPUs, including the nearest neighbor search in both directions and scoring all candidates.
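The compression ratio can be checked with back-of-the-envelope arithmetic. The sketch below rests on assumptions that are not stated in the text: embeddings are stored as 32-bit floats, a GB is $10^9$ bytes, and a “PQ64” code occupies 64 bytes per vector (64 sub-quantizers with one byte each).

```python
def raw_size_gb(n_vectors, dim, bytes_per_value=4):
    """Size of uncompressed embeddings, assuming 32-bit floats by default."""
    return n_vectors * dim * bytes_per_value / 1e9

def pq_codes_size_gb(n_vectors, code_bytes=64):
    """Size of the product-quantizer codes alone (no index overhead)."""
    return n_vectors * code_bytes / 1e9

english_sentences = 134_000_000
raw_gb = raw_size_gb(english_sentences, dim=1024)
codes_gb = pq_codes_size_gb(english_sentences)
reported_index_gb = 9.2  # figure quoted in the text

print(f"raw embeddings: {raw_gb:.0f} GB")
print(f"PQ codes only:  {codes_gb:.1f} GB")
print(f"raw / reported index: {raw_gb / reported_index_gb:.0f}x")
```

Under these assumptions the codes account for most of the reported 9.2GB; the remainder is presumably index overhead such as vector identifiers and centroids.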
4 Bitext mining in Wikipedia
For each Wikipedia article, it is possible to get the link to the corresponding article in other languages. This could be used to mine sentences limited to the respective articles. On the one hand, this local mining has several advantages: 1) mining is very fast since each article usually has only a few hundred sentences; 2) it seems reasonable to assume that a translation of a sentence is more likely to be found in the same article than anywhere in the whole Wikipedia. On the other hand, we hypothesize that the margin criterion will be less effective since one article usually has few sentences which are similar. This may lead to many sentences in the overall mined corpus of the type “NAME was born on DATE in CITY”, “BUILDING is a monument in CITY built on DATE”, etc. Although those alignments may be correct, we hypothesize that they are of limited use to train an NMT system, in particular when they are too frequent. In general, there is a risk that we will get sentences which are close in structure and content.
The other option is to consider the whole Wikipedia for each language: for each sentence in the source language, we mine in all target sentences. This global mining has several potential advantages: 1) we can try to align two languages even though there are only a few articles in common; 2) many short sentences which only differ by the named entities are likely to be excluded by the margin criterion. A drawback of this global mining is a potentially increased risk of misalignment and a lower recall.
In this work, we chose the global mining option. This will allow us to scale the same approach to other, potentially huge, corpora for which document-level alignments are not easily available, e.g. Common Crawl. An in-depth comparison of local and global mining (on Wikipedia) is left for future research.
4.1 Corpus preparation
Extracting the textual content of Wikipedia articles in all languages is a rather challenging task, as it requires removing all tables, pictures, citations, footnotes and formatting markup. There are several ways to download Wikipedia content. In this study, we use the so-called CirrusSearch dumps since they directly provide the textual content without any meta information (https://dumps.wikimedia.org/other/cirrussearch/). We downloaded this dump in March 2019. A total of about 300 languages are available, but the size obviously varies a lot between languages. We applied the following processing:
extract the textual content;
split the paragraphs into sentences;
remove duplicate sentences;
perform language identification and remove sentences which are not in the expected language (usually, citations or references to texts in another language).
|(French)|Ceci est une très grande maison|
|(German)|Das ist ein sehr großes Haus|
||This is a very big house|
||Ez egy nagyon nagy ház|
||Ini rumah yang sangat besar|
It should be pointed out that sentence segmentation is not a trivial task, with many exceptions and specific rules for the various languages. For instance, it is rather difficult to make an exhaustive list of common abbreviations for all languages. In German, points are used after numbers in enumerations, but numbers may also appear at the end of sentences. Some languages, such as Thai, do not use specific symbols to mark the end of a sentence. We are not aware of a reliable and freely available sentence segmenter for Thai and we had to exclude that language. We used the freely available Python tool SegTok (https://pypi.org/project/segtok/), which has specific rules for 24 languages. Regular expressions were used for most of the Asian languages, falling back to English for the remaining languages. This gives us 879 million sentences in 300 languages. The margin criterion to mine for parallel data requires that the texts do not contain duplicates. Removing them discards about 25% of the sentences. (The Cebuano and Waray Wikipedia were largely created by a bot and contain more than 65% of duplicates.)
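The segmentation and deduplication steps can be illustrated with a deliberately naive sketch. It does not use SegTok: the regular expression below is a hypothetical stand-in that splits after sentence-final punctuation followed by an uppercase letter, and it is included precisely to show why such rules fail on cases like German ordinal numbers.

```python
import re

# Naive rule: split on whitespace that follows ., ! or ? and precedes
# an uppercase letter. Real segmenters need language-specific rules.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-ZÀ-ÖØ-Þ])")

def split_sentences(paragraph):
    """Split a paragraph into sentences using the naive boundary rule."""
    return [s.strip() for s in _SENTENCE_BOUNDARY.split(paragraph) if s.strip()]

def deduplicate(sentences):
    """Remove exact duplicates while keeping the first occurrence order."""
    seen = set()
    unique = []
    for sentence in sentences:
        if sentence not in seen:
            seen.add(sentence)
            unique.append(sentence)
    return unique

text = "Das ist ein Haus. Es ist sehr alt. Das ist ein Haus."
sentences = deduplicate(split_sentences(text))

# The German date "3. Mai" is wrongly treated as a sentence boundary.
wrong_split = split_sentences("Er kam am 3. Mai an.")
```

The second example is split into two fragments even though it is a single sentence, which illustrates the enumeration problem mentioned above.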
LASER’s sentence embeddings are totally language agnostic. This has the side effect that sentences in other languages (e.g. citations or quotes) may be considered closer in the embedding space than a potential translation in the target language. Table 1 illustrates this problem. The algorithm would not select the German sentence although it is a perfect translation. The sentences in the other languages are also valid translations, which would yield a very small margin. To avoid this problem, we perform language identification (LID) on all sentences and remove those which are not in the expected language. LID is performed with fasttext (https://fasttext.cc/docs/en/language-identification.html) Joulin et al. (2016). Fasttext does not support all the 300 languages present in Wikipedia and we disregarded the missing ones (which typically have only a few sentences anyway). After deduplication and LID, we are left with 595M sentences in 182 languages. English accounts for 134M sentences, and German with 51M sentences is the second largest language. The sizes for all languages are given in Tables 3 and 5.
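The filtering step amounts to keeping only sentences whose predicted language matches the Wikipedia edition they come from. In the sketch below, `identify_language` is a hypothetical callable standing in for a real LID model, and `toy_identify` is a keyword lookup written only so that the example is self-contained; neither reflects the fasttext API.

```python
def filter_by_language(sentences, expected_lang, identify_language):
    """Keep sentences whose predicted language code equals `expected_lang`."""
    return [s for s in sentences if identify_language(s) == expected_lang]

def toy_identify(sentence):
    """Hypothetical stand-in for a LID model: looks at the first word only."""
    first_word_to_lang = {"Das": "de", "This": "en", "Ez": "hu", "Ini": "id"}
    return first_word_to_lang.get(sentence.split()[0], "unknown")

# Sentences as they might appear in the German Wikipedia (cf. Table 1).
german_wikipedia = [
    "Das ist ein sehr großes Haus",
    "This is a very big house",
    "Ez egy nagyon nagy ház",
    "Ini rumah yang sangat besar",
]
kept = filter_by_language(german_wikipedia, "de", toy_identify)
```

After filtering, only the German sentence remains as a candidate, so the foreign-language translations can no longer shrink its margin.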
4.2 Threshold optimization
Artetxe and Schwenk (2018a) optimized their mining approach for each language pair on a provided corpus of gold alignments. This is not possible when mining Wikipedia, in particular when considering many language pairs. In this work, we use an evaluation protocol inspired by the WMT shared task on parallel corpus filtering for low-resource conditions Koehn et al. (2019): an NMT system is trained on the extracted bitexts (for different thresholds) and the resulting BLEU scores are compared. We choose newstest2014 of the WMT evaluations since it provides $N$-way parallel test sets for English, French, German and Czech. We favoured the translation between two morphologically rich languages from different families and considered the following language pairs: German/English, German/French, Czech/German and Czech/French. The size of mined bitexts is in the range of 100k to more than 2M (see Table 2 and Figure 2). We did not try to optimize the architecture of the NMT system to the size of the bitexts and used the same architecture for all systems: the encoder and decoder are 5-layer transformer models as implemented in fairseq Ott et al. (2019). The goal of this study is not to develop the best performing NMT system for the considered language pairs, but to compare different mining parameters.
The evolution of the BLEU score as a function of the margin threshold is given in Figure 2. Decreasing the threshold naturally leads to more mined data; we observe an exponential increase of the data size. The performance of the NMT systems trained on the mined data seems to change as expected, in a surprisingly smooth way. The BLEU score first improves with increasing amounts of available training data, reaches a maximum and then decreases since the additional data gets more and more noisy, i.e. contains wrong translations. It is also not surprising that a careful choice of the margin threshold is more important in a low-resource setting, where every additional parallel sentence is important. According to Figure 2, the optimal value of the margin threshold seems to be 1.05 when many sentences can be extracted, in our case German/English and German/French. When less parallel data is available, i.e. Czech/German and Czech/French, a value in the range of 1.03–1.04 seems to be a better choice. Aiming at one threshold for all language pairs, we chose a value of 1.04. It seems to be a good compromise for most language pairs. For the open release of this corpus, we provide all mined sentences with a margin of 1.02 or better. This enables end users to choose an optimal threshold for their particular applications. However, it should be emphasized that we do not expect many sentence pairs with a margin as low as 1.02 to be good translations.
For comparison, we also trained NMT systems on the Europarl corpus V7 Koehn (2005), i.e. professional human translations, first on all available data, and then on the same number of sentences as the mined ones (see Table 2). With the exception of Czech/French, we were able to achieve better BLEU scores with the automatically mined bitexts in Wikipedia than with Europarl of the same size. Adding the mined text to the full Europarl corpus also leads to further improvements of 1.1 to 3.1 BLEU. We argue that this is a good indicator of the quality of the automatically extracted parallel sentences.
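Since the release includes all pairs down to a margin of 1.02, users need to apply their own threshold. The sketch below assumes, and this is an assumption about the file layout rather than something stated here, that each line holds a margin score, a source sentence and a target sentence separated by tabs; the function name is hypothetical.

```python
def filter_by_margin(tsv_lines, threshold=1.04):
    """Yield (score, source, target) for lines whose margin is >= threshold.

    Assumed line layout: "<margin>\t<source sentence>\t<target sentence>".
    """
    for line in tsv_lines:
        score_text, source, target = line.rstrip("\n").split("\t", 2)
        score = float(score_text)
        if score >= threshold:
            yield score, source, target

# Illustrative lines, not taken from the released corpus.
lines = [
    "1.21\tC'est une grande maison.\tThis is a big house.\n",
    "1.03\tIl pleut.\tThe weather is nice.\n",
]
strict = list(filter_by_margin(lines, threshold=1.04))
lenient = list(filter_by_margin(lines, threshold=1.02))
```

Sweeping `threshold` over a grid and training a system on each subset reproduces the kind of curve discussed above, at the cost of one training run per value.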
5 Result analysis
We ran the alignment process for all possible combinations of languages in Wikipedia. This yielded 1620 language pairs for which we were able to mine at least ten thousand sentences. Remember that mining $L_1 \rightarrow L_2$ is identical to $L_2 \rightarrow L_1$, and is counted only once. We propose to analyze and evaluate the extracted bitexts in two ways. First, we discuss the amount of extracted sentences (Section 5.1). We then turn to a qualitative assessment by training NMT systems for all language pairs with more than twenty-five thousand mined sentences (Section 5.2).
5.1 Quantitative analysis
|sh||Serbo-Croatian languages||South Slavic||2069||17||27||46||19||373||46||8||14||13||17||35||9||45||42||36||3337|
Due to space limits, Table 3 summarizes the number of extracted parallel sentences only for languages which have a total of at least five hundred thousand parallel sentences (with all other languages at a margin threshold of 1.04). Additional results are given in Table 5 in the Appendix.
Many factors can influence the number of mined sentences. Obviously, the larger the monolingual texts, the more likely it is to mine many parallel sentences. Not surprisingly, we observe that more sentences could be mined when English is one of the two languages. Let us point out some languages for which it is usually not easy to find parallel data with English, namely Indonesian (1M), Hebrew (545k), Farsi (303k) or Marathi (124k sentences). The largest mined texts not involving English are Russian/Ukrainian (2.5M), Catalan/Spanish (1.6M), between the Romance languages French, Spanish, Italian and Portuguese (480k–923k), and German/French (626k).
It is striking to see that we were able to mine more sentences when Galician and Catalan are paired with Spanish than with English. On one hand, this could be explained by the fact that LASER’s multilingual sentence embeddings may be better since the involved languages are linguistically very similar. On the other, it could be that the Wikipedia articles in both languages share a lot of content, or are obtained by mutual translation.
Services from the European Commission provide human translations of (legal) texts in all the 24 official languages of the European Union. This N-way parallel corpus enables training of MT systems to directly translate between these languages, without the need to pivot through English. This is usually not the case when translating between other major languages, for example in Asia. Let us list some interesting language pairs for which we were able to mine more than one hundred thousand sentences: Korean/Japanese (222k), Russian/Japanese (196k), Indonesian/Vietnamese (146k), or Hebrew/Romance languages (120–150k sentences).
Overall, we were able to extract at least ten thousand parallel sentences for 85 different languages (99 languages have more than 5,000 parallel sentences). For several low-resource languages, we were able to extract more parallel sentences with other languages than English. These include, among others, Aragonese with Spanish, Lombard with Italian, Breton with several Romance languages, Western Frisian with Dutch, Luxembourgish with German or Egyptian Arabic and Wu Chinese with the respective major language.
Finally, Cebuano (ceb) clearly stands apart: it has a rather huge Wikipedia (17.9M filtered sentences), but most of it was generated by a bot, as for the Waray language (see https://en.wikipedia.org/wiki/Lsjbot). This certainly explains why only a very small number of parallel sentences could be extracted. Although the same bot was also used to generate articles in the Swedish Wikipedia, our alignments seem to be better for that language.
5.2 Qualitative evaluation
Aiming to perform a large-scale assessment of the quality of the extracted parallel sentences, we trained NMT systems on the extracted parallel sentences. We identified a publicly available data set which provides test sets for many language pairs: translations of TED talks as proposed in the context of a study on pretrained word embeddings for NMT (https://github.com/neulab/word-embeddings-for-nmt) Qi et al. (2018). We would like to emphasize that we did not use the training data provided by TED; we only trained on the mined sentences from Wikipedia. The goal of this study is not to build state-of-the-art NMT systems for the TED task, but to get an estimate of the quality of our extracted data, for many language pairs. In particular, there may be a mismatch in the topic and language style between Wikipedia texts and the transcribed and translated TED talks.
For training NMT systems, we used a transformer model from fairseq (Ott et al., 2019) with the parameter settings shown in Figure 3 in the appendix. For preprocessing, the text was tokenized using the Moses tokenizer (without true casing) and a 5000 subword vocabulary was learnt using SentencePiece Kudo and Richardson (2018). Decoding was done with beam size 5 and length normalization 1.2.
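To illustrate what a length normalization of 1.2 does during decoding, the sketch below uses one common formulation: the summed token log-probabilities of a hypothesis are divided by its length raised to the normalization exponent. Whether this matches the exact formula used by the toolkit is an assumption, and the log-probabilities are made-up values.

```python
def normalized_score(token_logprobs, length_penalty=1.2):
    """Sum of token log-probabilities divided by length ** length_penalty."""
    return sum(token_logprobs) / (len(token_logprobs) ** length_penalty)

# Two competing hypotheses in a beam (illustrative log-probabilities).
short_hypothesis = [-0.5, -0.5]
long_hypothesis = [-0.4, -0.4, -0.4, -0.4]

raw_prefers_short = sum(short_hypothesis) > sum(long_hypothesis)
normalized_prefers_long = (
    normalized_score(long_hypothesis) > normalized_score(short_hypothesis)
)
```

Without normalization the shorter hypothesis wins simply because it accumulates fewer negative terms; with an exponent above 1, the longer hypothesis with better per-token scores is preferred.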
We evaluate the trained translation systems on the TED dataset Qi et al. (2018). The TED data consists of parallel TED talk transcripts in multiple languages, and it provides development and test sets for 50 languages. Since the development and test sets were already tokenized, we first detokenize them using Moses. We trained NMT systems for all possible language pairs with more than twenty-five thousand mined sentences. This gives us in total 1886 language pairs in 45 languages. We train $L_1 \rightarrow L_2$ and $L_2 \rightarrow L_1$ with the same mined bitexts $L_1$/$L_2$. Scores on the test sets were computed with SacreBLEU (Post, 2018). Table 4 summarizes all the results. Due to space constraints, we are unable to report BLEU scores for all language combinations in that table. Some additional results are reported in Table 6 in the appendix. 23 NMT systems achieve BLEU scores over 30, the best one being 37.3 for Brazilian Portuguese to English. Several results are worth mentioning, like Farsi/English: 16.7, Hebrew/English: 25.7, Indonesian/English: 24.9 or English/Hindi: 25.7. We also achieve interesting results for translation between various non-English language pairs for which it is usually not easy to find parallel data, e.g. Norwegian/Danish 33, Norwegian/Swedish 25, Indonesian/Vietnamese 16 or Japanese/Korean 17.
Our results on the TED set give an indication of the quality of the mined parallel sentences. These BLEU scores should of course be interpreted in the context of the sizes of the mined corpora as given in Table 3. Obviously, we cannot exclude that the provided data contains some wrong alignments even though the margin is large. Finally, we would like to point out that we ran our approach on all available languages in Wikipedia, independently of the quality of LASER’s sentence embeddings for each one.
We have presented an approach to systematically mine for parallel sentences in the textual content of Wikipedia, for all possible language pairs. We use a recently proposed mining approach based on massively multilingual sentence embeddings Artetxe and Schwenk (2018b) and a margin criterion Artetxe and Schwenk (2018a). The same approach is used for all language pairs without the need for language-specific optimization. In total, we make available 135M parallel sentences in 85 languages, out of which only 34M sentences are aligned with English. We were able to mine more than ten thousand sentences for 1620 different language pairs. This corpus of parallel sentences is freely available (https://github.com/facebookresearch/LASER/tree/master/tasks/WikiMatrix). We also performed a large-scale evaluation of the quality of the mined sentences by training 1886 NMT systems and evaluating them on the 45 languages of the TED corpus Qi et al. (2018).
This work opens several directions for future research. The mined texts could be used to first retrain LASER’s multilingual sentence embeddings with the hope of improving the performance on low-resource languages, and then to rerun mining in Wikipedia. This process could be repeated iteratively. We also plan to apply the same methodology to other large multilingual collections. The monolingual texts made available by ParaCrawl or CommonCrawl (http://commoncrawl.org/) are good candidates.
We expect that the WikiMatrix corpus has mostly well-formed sentences and it should not contain social media language. The mined parallel sentences are not limited to specific topics like many of the currently available resources (parliament proceedings, subtitles, software documentation, etc.), but are expected to cover many topics of Wikipedia. The fraction of unedited machine translated text is also expected to be low. We hope that this resource will be useful to support research in multilinguality, in particular machine translation.
We would like to thank Edouard Grave for help with handling the Wikipedia corpus and Matthijs Douze for support with the use of FAISS.
- Abdul-Rauf and Schwenk (2009) Sadaf Abdul-Rauf and Holger Schwenk. 2009. On the Use of Comparable Corpora to Improve SMT performance. In EACL, pages 16–23.
- Adafre and de Rijke (2006) Sisay Fissaha Adafre and Maarten de Rijke. 2006. Finding similar sentences across multiple languages in Wikipedia. In Proceedings of the Workshop on NEW TEXT Wikis and blogs and other dynamic text sources.
- Aghaebrahimian (2018) Ahmad Aghaebrahimian. 2018. Deep neural networks at the service of multilingual parallel sentence extraction. In Coling.
- Artetxe and Schwenk (2018a) Mikel Artetxe and Holger Schwenk. 2018a. Margin-based Parallel Corpus Mining with Multilingual Sentence Embeddings. https://arxiv.org/abs/1811.01136.
- Artetxe and Schwenk (2018b) Mikel Artetxe and Holger Schwenk. 2018b. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. https://arxiv.org/abs/1812.10464.
- Azpeitia et al. (2017) Andoni Azpeitia, Thierry Etchegoyhen, and Eva Martínez Garcia. 2017. Weighted Set-Theoretic Alignment of Comparable Sentences. In BUCC, pages 41–45.
- Azpeitia et al. (2018) Andoni Azpeitia, Thierry Etchegoyhen, and Eva Martínez Garcia. 2018. Extracting Parallel Sentences from Comparable Corpora with STACC Variants. In BUCC.
- Bouamor and Sajjad (2018) Houda Bouamor and Hassan Sajjad. 2018. H2@BUCC18: Parallel Sentence Extraction from Comparable Corpora Using Multilingual Sentence Embeddings. In BUCC.
- Buck and Koehn (2016) Christian Buck and Philipp Koehn. 2016. Findings of the WMT 2016 bilingual document alignment shared task. In Proceedings of the First Conference on Machine Translation, pages 554–563, Berlin, Germany. Association for Computational Linguistics.
- Chaudhary et al. (2019) Vishrav Chaudhary, Yuqing Tang, Francisco Guzmán, Holger Schwenk, and Philipp Koehn. 2019. Low-resource corpus filtering using multilingual sentence embeddings. In Proceedings of the Fourth Conference on Machine Translation (WMT).
- España-Bonet et al. (2017) Cristina España-Bonet, Ádám Csaba Varga, Alberto Barrón-Cedeño, and Josef van Genabith. 2017. An Empirical Analysis of NMT-Derived Interlingual Embeddings and their Use in Parallel Sentence Identification. IEEE Journal of Selected Topics in Signal Processing, pages 1340–1348.
- Etchegoyhen and Azpeitia (2016) Thierry Etchegoyhen and Andoni Azpeitia. 2016. Set-Theoretic Alignment for Comparable Corpora. In ACL, pages 2009–2018.
- Gottschalk and Demidova (2017) Simon Gottschalk and Elena Demidova. 2017. Multiwiki: Interlingual text passage alignment in Wikipedia. ACM Transactions on the Web (TWEB), 11(1):6.
- Grégoire and Langlais (2017) Francis Grégoire and Philippe Langlais. 2017. BUCC 2017 Shared Task: a First Attempt Toward a Deep Learning Framework for Identifying Parallel Sentences in Comparable Corpora. In BUCC, pages 46–50.
- Guo et al. (2018) Mandy Guo, Qinlan Shen, Yinfei Yang, Heming Ge, Daniel Cer, Gustavo Hernandez Abrego, Keith Stevens, Noah Constant, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2018. Effective Parallel Corpus Mining using Bilingual Sentence Embeddings. arXiv:1807.11906.
- Hassan et al. (2018) Hany Hassan, Anthony Aue, Chang Chen, Vishal Chowdhary, Jonathan Clark, Christian Federmann, Xuedong Huang, Marcin Junczys-Dowmunt, William Lewis, Mu Li, Shujie Liu, Tie-Yan Liu, Renqian Luo, Arul Menezes, Tao Qin, Frank Seide, Xu Tan, Fei Tian, Lijun Wu, Shuangzhi Wu, Yingce Xia, Dongdong Zhang, Zhirui Zhang, and Ming Zhou. 2018. Achieving Human Parity on Automatic Chinese to English News Translation. arXiv:1803.05567.
- Johnson et al. (2017) Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2017. Billion-scale similarity search with GPUs. arXiv preprint arXiv:1702.08734.
- Joulin et al. (2016) Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. Bag of tricks for efficient text classification. https://arxiv.org/abs/1607.01759.
- Jégou et al. (2011) H. Jégou, M. Douze, and C. Schmid. 2011. Product quantization for nearest neighbor search. IEEE Trans. PAMI, 33(1):117–128.
- Koehn (2005) Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT summit.
- Koehn et al. (2019) Philipp Koehn, Francisco Guzmán, Vishrav Chaudhary, and Juan M. Pino. 2019. Findings of the WMT 2019 shared task on parallel corpus filtering for low-resource conditions. In Proceedings of the Fourth Conference on Machine Translation, Volume 2: Shared Task Papers, Florence, Italy. Association for Computational Linguistics.
- Koehn et al. (2018) Philipp Koehn, Huda Khayrallah, Kenneth Heafield, and Mikel L. Forcada. 2018. Findings of the WMT 2018 shared task on parallel corpus filtering. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 726–739, Belgium, Brussels. Association for Computational Linguistics.
- Kudo and Richardson (2018) Taku Kudo and John Richardson. 2018. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium. Association for Computational Linguistics.
- Lison and Tiedemann (2016) P. Lison and J. Tiedemann. 2016. OpenSubtitles2016: Extracting large parallel corpora from movie and TV subtitles. In LREC.
- Mohammadi and GhasemAghaee (2010) Mehdi Zadeh Mohammadi and Nasser GhasemAghaee. 2010. Building bilingual parallel corpora based on Wikipedia. In 2010 Second International Conference on Computer Engineering and Applications, pages 264–268.
- Munteanu and Marcu (2005) Dragos Stefan Munteanu and Daniel Marcu. 2005. Improving Machine Translation Performance by Exploiting Non-Parallel Corpora. Computational Linguistics, 31(4):477–504.
- Otero et al. (2011) P Otero, I López, S Cilenis, and Santiago de Compostela. 2011. Measuring comparability of multilingual corpora extracted from Wikipedia. Iberian Cross-Language Natural Language Processings Tasks (ICL), page 8.
- Otero and López (2010) Pablo Gamallo Otero and Isaac González López. 2010. Wikipedia as multilingual source of comparable corpora. In Proceedings of the 3rd Workshop on Building and Using Comparable Corpora, LREC, pages 21–25.
- Ott et al. (2019) Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53, Minneapolis, Minnesota. Association for Computational Linguistics.
- Patry and Langlais (2011) Alexandre Patry and Philippe Langlais. 2011. Identifying parallel documents from a large bilingual collection of texts: Application to parallel article extraction in Wikipedia. In Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web, pages 87–95. Association for Computational Linguistics.
- Post (2018) Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Belgium, Brussels. Association for Computational Linguistics.
- Qi et al. (2018) Ye Qi, Devendra Sachan, Matthieu Felix, Sarguna Padmanabhan, and Graham Neubig. 2018. When and why are pre-trained word embeddings useful for neural machine translation? In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 529–535, New Orleans, Louisiana. Association for Computational Linguistics.
- Resnik (1999) Philip Resnik. 1999. Mining the Web for Bilingual Text. In ACL.
- Resnik and Smith (2003) Philip Resnik and Noah A. Smith. 2003. The Web as a Parallel Corpus. Computational Linguistics, 29(3):349–380.
- Schwenk (2018) Holger Schwenk. 2018. Filtering and mining parallel data in a joint multilingual space. In ACL, pages 228–234.
- Smith et al. (2010) Jason R. Smith, Chris Quirk, and Kristina Toutanova. 2010. Extracting parallel sentences from comparable corpora using document level alignment. In NAACL, pages 403–411.
- Tiedemann (2012) J. Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In LREC.
- Tsai and Roth (2016) Chen-Tse Tsai and Dan Roth. 2016. Cross-lingual wikification using multilingual embeddings. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 589–598.
- Tufis et al. (2013) Dan Tufis, Radu Ion, Ștefan Daniel Dumitrescu, and Dan Ștefănescu. 2013. Wikipedia as an SMT training corpus. In RANLP, pages 702–709.
- Utiyama and Isahara (2003) Masao Utiyama and Hitoshi Isahara. 2003. Reliable Measures for Aligning Japanese-English News Articles and Sentences. In ACL.
- Yang et al. (2019) Yinfei Yang, Gustavo Hernández Ábrego, Steve Yuan, Mandy Guo, Qinlan Shen, Daniel Cer, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2019. Improving multilingual sentence embedding using bi-directional dual encoder with additive margin softmax. https://arxiv.org/abs/1902.08564.
- Ziemski et al. (2016) Michał Ziemski, Marcin Junczys-Dowmunt, and Bruno Pouliquen. 2016. The United Nations Parallel Corpus v1.0. In LREC.
Appendix A Appendix