PJAIT Systems for the IWSLT 2015 Evaluation Campaign Enhanced by Comparable Corpora

12/05/2015
by Krzysztof Wołk, et al.

In this paper, we attempt to improve Statistical Machine Translation (SMT) systems on a very diverse set of language pairs (in both directions): Czech-English, Vietnamese-English, French-English, and German-English. To accomplish this, we performed translation model training, adapted the training settings for each language pair, and obtained comparable corpora for our SMT systems. Innovative tools and data adaptation techniques were employed. The TED parallel text corpora for the IWSLT 2015 evaluation campaign were used to train language models and to develop, tune, and test the system. In addition, we prepared Wikipedia-based comparable corpora for use with our SMT system; these data were specified as permissible for the IWSLT 2015 evaluation. We explored the use of domain adaptation techniques, symmetrized word alignment models, unsupervised transliteration models, and the KenLM language modeling tool. To evaluate the effects of the different data preparations on translation quality, we conducted experiments and scored them with the BLEU, NIST, and TER metrics. Our results indicate that our approach had a positive impact on SMT quality.
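The abstract names two concrete components that are easy to illustrate: an n-gram language model built with KenLM and automatic scoring of translation output. Below is a minimal sketch, assuming the Python kenlm and sacrebleu packages and hypothetical file names (ted.train.en, ted.5gram.arpa); it is illustrative and not the authors' exact pipeline. (NIST is usually computed with the separate mteval scripts and is omitted here.)

```python
# Minimal sketch: query a KenLM n-gram language model and score translations.
# Assumptions: pip install kenlm sacrebleu; the ARPA model was built beforehand
# with KenLM's lmplz binary, e.g.:
#   lmplz -o 5 < ted.train.en > ted.5gram.arpa
# (the 5-gram order and the file names are illustrative, not from the paper)

import kenlm
import sacrebleu

# Load the n-gram language model estimated on the TED training data.
lm = kenlm.Model("ted.5gram.arpa")

# Log10 probability of a candidate sentence, including <s> and </s> markers.
candidate = "this is a test sentence"
print("LM log10 prob:", lm.score(candidate, bos=True, eos=True))

# Corpus-level BLEU and TER for a list of system outputs against references.
hypotheses = ["this is a test sentence"]
references = [["this is the test sentence"]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
ter = sacrebleu.corpus_ter(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}, TER = {ter.score:.2f}")
```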


