CDA: a Cost Efficient Content-based Multilingual Web Document Aligner

02/20/2021
by Thuy Vu, et al.

We introduce Content-based Document Alignment (CDA), an efficient method for aligning multilingual web documents by content to create parallel training data for industrial-scale machine translation (MT) systems. CDA works in two steps: (i) projecting the documents of a web domain into a shared multilingual space, then (ii) aligning them based on the similarity of their representations in that space. We leverage lexical translation models to build vector representations using TF-IDF. CDA achieves performance comparable to state-of-the-art systems on the WMT-16 Bilingual Document Alignment Shared Task benchmark while operating in a multilingual space. We also created two web-scale datasets to examine the robustness of CDA in an industrial setting involving up to 28 languages and millions of documents. The experiments show that CDA is robust and cost-effective, and that it is significantly superior in (i) processing large and noisy web data and (ii) scaling to new and low-resource languages.
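
The two steps described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the toy lexicon, the document names, the smoothed IDF formula, the use of cosine similarity, and the greedy one-to-one matching are all assumptions made for illustration. The abstract states only that lexical translation models and TF-IDF are used to build the vectors and that alignment is based on similarity.

```python
import math
from collections import Counter

# Hypothetical toy lexical translation table: foreign word -> {pivot word: probability}.
# A real system would use a learned lexical translation model instead.
LEXICON = {
    "chat": {"cat": 1.0},
    "chien": {"dog": 1.0},
    "noir": {"black": 1.0},
    "maison": {"house": 0.8, "home": 0.2},
}

def project(tokens, lexicon):
    """Step (i): map tokens into the shared (pivot) vocabulary.

    Each token's count is spread over its translations by probability;
    tokens missing from the lexicon are kept unchanged.
    """
    bag = Counter()
    for tok in tokens:
        for pivot, prob in lexicon.get(tok, {tok: 1.0}).items():
            bag[pivot] += prob
    return bag

def tfidf_vectors(bags):
    """Weight projected term counts by a smoothed IDF (an assumed variant)."""
    n = len(bags)
    df = Counter(term for bag in bags.values() for term in bag)
    return {
        name: {t: tf * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, tf in bag.items()}
        for name, bag in bags.items()
    }

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def align(src_docs, tgt_docs, lexicon, min_score=0.0):
    """Step (ii): greedily pair documents by descending similarity, one-to-one."""
    bags = {name: project(toks, lexicon) for name, toks in {**src_docs, **tgt_docs}.items()}
    vecs = tfidf_vectors(bags)
    scored = sorted(
        ((cosine(vecs[s], vecs[t]), s, t) for s in src_docs for t in tgt_docs),
        reverse=True,
    )
    used_src, used_tgt, pairs = set(), set(), []
    for score, s, t in scored:
        if score > min_score and s not in used_src and t not in used_tgt:
            pairs.append((s, t))
            used_src.add(s)
            used_tgt.add(t)
    return pairs

# Hypothetical documents from one web domain, already tokenized.
EN_DOCS = {
    "en/pets": "the black cat and the dog".split(),
    "en/home": "a house is a home".split(),
}
FR_DOCS = {
    "fr/animaux": "le chat noir et le chien".split(),
    "fr/maison": "une maison".split(),
}

print(align(EN_DOCS, FR_DOCS, LEXICON))
```

In this toy setup, English serves as the pivot vocabulary, so the English documents pass through the projection unchanged, while the French documents are mapped into the same term space before TF-IDF weighting. Pairs with zero similarity are never aligned.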

research
09/21/2021

Multilingual Document-Level Translation Enables Zero-Shot Transfer From Sentences to Documents

Document-level neural machine translation (DocNMT) delivers coherent tra...
research
04/30/2020

Exploiting Sentence Order in Document Alignment

In this work, we exploit the simple idea that a document and its transla...
research
01/11/2022

The GINCO Training Dataset for Web Genre Identification of Documents Out in the Wild

This paper presents a new training dataset for automatic genre identific...
research
12/15/2022

TRIP: Triangular Document-level Pre-training for Multilingual Language Models

Despite the current success of multilingual pre-training, most prior wor...
research
09/02/2020

Identifying Documents In-Scope of a Collection from Web Archives

Web archive data usually contains high-quality documents that are very u...
research
12/06/2019

Document Network Embedding: Coping for Missing Content and Missing Links

Searching through networks of documents is an important task. A promisin...
research
12/20/2022

Perplexed by Quality: A Perplexity-based Method for Adult and Harmful Content Detection in Multilingual Heterogeneous Web Data

As demand for large corpora increases with the size of current state-of-...
