Building Subject-aligned Comparable Corpora and Mining it for Truly Parallel Sentence Pairs

09/29/2015
by   Krzysztof Wołk, et al.

Parallel sentences are a relatively scarce but extremely useful resource for many applications, including cross-lingual retrieval and statistical machine translation. This research explores a methodology for mining such data from previously obtained comparable corpora. The task is highly practical because non-parallel multilingual data exist in far greater quantities than parallel corpora, yet parallel sentences are the far more useful resource. We propose a web crawling method for building subject-aligned comparable corpora from Wikipedia articles. We also introduce a method for extracting truly parallel sentences by filtering them out of noisy or merely comparable sentence pairs. We describe our implementation of a specialized tool for this task, as well as the training and adaptation of a machine translation system that supplies our filter with additional information about the similarity of comparable sentence pairs.
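As a rough illustration of the filtering idea described above, the following minimal sketch scores each candidate pair by machine-translating the source sentence and comparing the result with the target sentence. It is not the authors' tool. The function names (`similarity`, `filter_parallel`, `toy_translate`), the token-overlap measure, the 0.7 threshold, the Polish-English toy lexicon, and the example sentences are all assumptions made for this example. A real setup would plug an actual machine translation system into the `translate` argument.

```python
from difflib import SequenceMatcher
from typing import Callable, Iterable, List, Tuple


def similarity(a: str, b: str) -> float:
    """Token-level similarity in [0, 1] between two sentences (illustrative measure)."""
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()


def filter_parallel(
    pairs: Iterable[Tuple[str, str]],
    translate: Callable[[str], str],
    threshold: float = 0.7,  # hypothetical cut-off, chosen only for this sketch
) -> List[Tuple[str, str, float]]:
    """Keep (source, target) pairs whose translated source resembles the target."""
    kept = []
    for src, tgt in pairs:
        score = similarity(translate(src), tgt)
        if score >= threshold:
            kept.append((src, tgt, score))
    return kept


# Toy stand-in for a real MT system: a word-for-word lookup (assumption).
TOY_LEXICON = {"kot": "cat", "pije": "drinks", "mleko": "milk",
               "pies": "dog", "śpi": "sleeps"}


def toy_translate(sentence: str) -> str:
    """Translate word by word, leaving unknown words unchanged."""
    return " ".join(TOY_LEXICON.get(word, word) for word in sentence.lower().split())


# Candidate pairs as they might come out of a comparable corpus (made up).
candidates = [
    ("kot pije mleko", "cat drinks milk"),        # truly parallel
    ("pies śpi", "the weather is nice today"),    # merely comparable noise
]
parallel = filter_parallel(candidates, toy_translate)
```

With these toy inputs only the first pair passes the threshold. In practice the threshold and the similarity measure would need tuning on held-out data.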


