Generating Multilingual Parallel Corpus Using Subtitles

04/11/2018
by Farshad Jafari, et al.

Neural machine translation, despite its significant results, still has a major problem: the lack or absence of parallel corpora for many languages. This article proposes a method for generating a considerable amount of parallel corpus data for any language pair, extracted from open-source material available on the Internet. The corpus content is derived from video subtitles. The method requires a set of video titles with attributes such as release date, rating, and duration. A crawler automates finding and downloading subtitle pairs for the desired language pair. Finally, sentence pairs are extracted from synchronous dialogues in the subtitles. The main problem with this method is unsynchronized subtitle pairs, so subtitles are verified before downloading. If two subtitles are not synchronized, another subtitle for that video is processed until a matching one is found. This approach also makes it possible to build context-based parallel corpora by filtering videos by genre. A context-based corpus can be used in complex translators that first determine the subject of the content and then decode sentences with different networks. Languages differ considerably between their formal and informal styles, in both vocabulary and syntax. Another advantage of this method is that it produces a corpus in the informal style of each language: because most movie dialogue is conversational, it is informal in style. This property of the generated corpus can be used in real-time translators to translate conversations more accurately.
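
The abstract does not specify how synchronization is verified or how sentence pairs are extracted, so the following is only a minimal sketch of one plausible approach, not the authors' implementation. It assumes subtitles in the SubRip (.srt) format with cues sorted by time, pairs cues whose time spans overlap strongly, and treats two subtitle files as synchronized when enough cues can be paired. The function names (`parse_srt`, `align`, `is_synchronized`) and the thresholds `min_overlap` and `min_fraction` are hypothetical choices made for illustration.

```python
import re
from typing import List, Tuple

# A cue is (start seconds, end seconds, text). Hypothetical type for this sketch.
Cue = Tuple[float, float, str]

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:03,000".
_TIME = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def parse_srt(content: str) -> List[Cue]:
    """Parse SRT text into a list of cues, skipping blocks without text."""
    cues: List[Cue] = []
    for block in re.split(r"\n\s*\n", content.strip()):
        lines = block.strip().splitlines()
        for i, line in enumerate(lines):
            m = _TIME.search(line)
            if m:
                h1, m1, s1, ms1, h2, m2, s2, ms2 = (int(g) for g in m.groups())
                start = h1 * 3600 + m1 * 60 + s1 + ms1 / 1000
                end = h2 * 3600 + m2 * 60 + s2 + ms2 / 1000
                text = " ".join(l.strip() for l in lines[i + 1:] if l.strip())
                if text:
                    cues.append((start, end, text))
                break
    return cues


def overlap_ratio(a: Cue, b: Cue) -> float:
    """Shared time divided by the total time spanned by both cues."""
    inter = min(a[1], b[1]) - max(a[0], b[0])
    union = max(a[1], b[1]) - min(a[0], b[0])
    return max(inter, 0.0) / union if union > 0 else 0.0


def align(src: List[Cue], tgt: List[Cue], min_overlap: float = 0.5) -> List[Tuple[str, str]]:
    """Pair each source cue with the target cue that overlaps it most in time."""
    pairs: List[Tuple[str, str]] = []
    j = 0
    for s in src:
        # Skip target cues that end before this source cue starts.
        while j < len(tgt) and tgt[j][1] <= s[0]:
            j += 1
        best, best_score = None, 0.0
        k = j
        # Consider every target cue that starts before this source cue ends.
        while k < len(tgt) and tgt[k][0] < s[1]:
            score = overlap_ratio(s, tgt[k])
            if score > best_score:
                best, best_score = tgt[k], score
            k += 1
        if best is not None and best_score >= min_overlap:
            pairs.append((s[2], best[2]))
    return pairs


def is_synchronized(src: List[Cue], tgt: List[Cue],
                    min_overlap: float = 0.5, min_fraction: float = 0.6) -> bool:
    """Assumed check: enough cues of the shorter file can be paired by timing."""
    if not src or not tgt:
        return False
    matched = len(align(src, tgt, min_overlap))
    return matched / min(len(src), len(tgt)) >= min_fraction
```

In a pipeline like the one described, a crawler could call `is_synchronized` on each candidate subtitle pair and move on to another subtitle of the same video when the check fails, calling `align` only on pairs that pass. Timing overlap alone does not guarantee that paired cues are translations of each other, so a real system would likely need further filtering.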


