Using English as Pivot to Extract Persian-Italian Parallel Sentences from Non-Parallel Corpora

01/29/2017
by Ebrahim Ansari, et al.

The effectiveness of a statistical machine translation (SMT) system depends heavily on the amount of parallel data used in the training phase. For low-resource language pairs, there are not enough parallel corpora to build an accurate SMT system. In this paper, a novel approach is presented to extract Persian-Italian parallel sentences from a non-parallel (comparable) corpus. English is used as the pivot language, both to compute the matching scores between source and target sentences and in the candidate selection phase. Additionally, a new monolingual sentence similarity metric, Normalized Google Distance (NGD), is proposed to improve the matching process. Moreover, some extensions of the baseline system are applied to improve the quality of the extracted sentences, as measured with BLEU. Experimental results show that the new pivot-based extraction significantly increases the quality of the bilingual corpus and consequently improves the performance of the Persian-Italian SMT system.
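
For readers unfamiliar with the NGD mentioned above, it is commonly defined for two terms x and y as (max(log f(x), log f(y)) - log f(x, y)) / (log N - min(log f(x), log f(y))), where f counts occurrences and N is the total number of indexed units. The sketch below computes that standard term-level formula from raw counts over a toy collection. The abstract does not say how counts are obtained or how term distances are combined into a sentence-level score, so the function names, the toy data, and the use of sentences as counting units are illustrative assumptions, not the authors' implementation.

```python
import math


def ngd(fx, fy, fxy, n):
    """Standard Normalized Google Distance from occurrence counts.

    fx, fy: number of units containing x, respectively y
    fxy:    number of units containing both x and y
    n:      total number of units (must exceed min(fx, fy))
    """
    if fx <= 0 or fy <= 0 or fxy <= 0:
        # The terms never co-occur: treat them as maximally distant.
        return float("inf")
    if n <= min(fx, fy):
        raise ValueError("n must be larger than the smaller term count")
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))


def ngd_from_units(units, x, y):
    """Count occurrences over token sets (an assumed counting unit) and apply ngd."""
    fx = sum(1 for u in units if x in u)
    fy = sum(1 for u in units if y in u)
    fxy = sum(1 for u in units if x in u and y in u)
    return ngd(fx, fy, fxy, len(units))


# Hypothetical toy collection: each unit is the token set of one sentence.
toy_units = [
    {"machine", "translation", "system"},
    {"machine", "translation", "corpus"},
    {"parallel", "corpus"},
    {"weather", "report"},
]

distance = ngd_from_units(toy_units, "machine", "translation")
```

Terms that always appear together get a distance of 0, and the distance grows as co-occurrence becomes rarer relative to the individual counts.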

research
04/13/2018

Demo of Sanskrit-Hindi SMT System

The demo proposal presents a Phrase-based Sanskrit-Hindi (SaHiT) Statist...
research
11/02/2017

Extracting an English-Persian Parallel Corpus from Comparable Corpora

Parallel data are an important part of a reliable Statistical Machine Tr...
research
03/22/2021

Monolingual and Parallel Corpora for Kangri Low Resource Language

In this paper we present the dataset of Himachali low resource endangere...
research
11/19/2012

A Rule-Based Approach For Aligning Japanese-Spanish Sentences From A Comparable Corpora

The performance of a Statistical Machine Translation System (SMT) system...
research
05/05/2019

A Parallel Corpus of Theses and Dissertations Abstracts

In Brazil, the governmental body responsible for overseeing and coordina...
research
03/22/2016

Multi-domain machine translation enhancements by parallel data extraction from comparable corpora

Parallel texts are a relatively rare language resource, however, they co...
research
04/15/2021

Bilingual Terminology Extraction from Non-Parallel E-Commerce Corpora

Bilingual terminologies are important resources for natural language pro...
