Unsupervised comparable corpora preparation and exploration for bi-lingual translation equivalents

12/05/2015
by Krzysztof Wołk, et al.

The multilingual nature of the world makes translation a crucial requirement today. Human-built parallel dictionaries are widely available, but their coverage is too limited for good-quality translation because of out-of-vocabulary words and neologisms. This motivates the use of statistical translation systems, which unfortunately depend on the quantity and quality of their training data. Such systems have very limited availability, especially for some languages and very narrow text domains. In this research we present improvements to current comparable-corpora mining methodologies: a reimplementation of the comparison algorithms (using the Needleman-Wunsch algorithm), the introduction of a tuning script, and reduced computation time through GPU acceleration. Experiments are carried out on bilingual data extracted from Wikipedia, across various domains. For Wikipedia itself, additional cross-lingual comparison heuristics were introduced. The modifications had a positive impact on the quality and quantity of the mined data and on translation quality.
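
As a rough illustration of the kind of comparison the Needleman-Wunsch algorithm performs, the sketch below computes a global alignment score between two token sequences. It is a minimal textbook version, not the authors' implementation: the function name `needleman_wunsch` and the `match`, `mismatch`, and `gap` values are assumptions chosen for the example, and it runs on the CPU only.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return the global alignment score of sequences a and b.

    The scoring values are illustrative assumptions, not values
    taken from the paper.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against an empty sequence costs one gap per item.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Fill the table: each cell takes the best of a match/mismatch
    # (diagonal move) or a gap in either sequence (vertical/horizontal).
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + step,
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
    return score[-1][-1]


# Word-level comparison of two candidate sentences.
left = "the cat sat".split()
right = "the cat sat down".split()
print(needleman_wunsch(left, right))
```

Here the two sentences share three words and differ by one extra word, so the score is three matches minus one gap. In a mining setting, a score like this could be normalized by sentence length and compared against a tuned threshold; that step is likewise an assumption and is not shown.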

