A Massive Collection of Cross-Lingual Web-Document Pairs

11/10/2019
by Ahmed El-Kishky, et al.

Cross-lingual document alignment aims to identify pairs of documents in two distinct languages that have comparable content or are translations of each other. Small-scale efforts have been made to collect aligned document-level data on a limited set of language pairs, such as English-German, or on limited comparable collections, such as Wikipedia. In this paper, we mine twelve snapshots of the Common Crawl corpus and identify web document pairs that are translations of each other. We release a new web dataset consisting of 54 million URL pairs from Common Crawl, covering documents in 92 languages paired with English. We evaluate the quality of the dataset by measuring the quality of machine translations from models trained on parallel sentence pairs mined from this aligned corpus, and we introduce a simple yet effective baseline for identifying these aligned documents. The objective of this dataset and paper is to foster new research in cross-lingual NLP across a variety of low-, mid-, and high-resource languages.
