
Central Kurdish machine translation: First large scale parallel corpus and experiments

06/17/2021
by Zhila Amini, et al.

While the computational processing of Kurdish has received growing attention, the machine translation of this language still lacks a considerable body of scientific work. This is in part due to the lack of resources curated specifically for this task. In this paper, we present the first large-scale parallel corpus of Central Kurdish-English, Awta, containing 229,222 pairs of manually aligned translations. Our corpus is collected from different text genres and domains in an attempt to build more robust, real-world applications of machine translation. We make a portion of this corpus publicly available in order to foster research in this area. Further, we build several neural machine translation models in order to benchmark the task of Kurdish machine translation. Additionally, we perform an extensive experimental analysis of the results in order to identify the major challenges that Central Kurdish machine translation faces. We categorize these challenges as language-dependent or language-independent; the language-dependent ones stem from the linguistic properties of Central Kurdish at the morphological, syntactic and semantic levels. Our best performing systems achieve BLEU scores of 22.72 and 16.81 for Ku→En and En→Ku, respectively.
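For readers unfamiliar with the BLEU metric reported above, the following is a minimal sketch of how a corpus-level BLEU score on a 0-100 scale can be computed. It assumes whitespace tokenization, a single reference per hypothesis, n-grams up to order 4, and no smoothing; the function names and the toy sentences are illustrative only, and the abstract does not say which BLEU implementation or tokenization the authors used.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU (0-100), one reference per hypothesis, no smoothing.

    Assumption: sentences are tokenized by splitting on whitespace.
    """
    matches = [0] * max_n   # clipped n-gram matches, per order
    totals = [0] * max_n    # n-grams produced by the system, per order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_tok, ref_tok = hyp.split(), ref.split()
        hyp_len += len(hyp_tok)
        ref_len += len(ref_tok)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tok, n)
            ref_counts = ngrams(ref_tok, n)
            # Counter intersection keeps the minimum count per n-gram,
            # which is exactly the "clipping" step of BLEU.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())

    # Without smoothing, any order with zero matches gives a score of 0.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0

    # Geometric mean of the modified n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty punishes output shorter than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


if __name__ == "__main__":
    # Toy English example, not taken from the Awta corpus.
    system_output = ["the cat sat on the mat"]
    reference = ["the cat sat on a mat"]
    print(round(corpus_bleu(system_output, reference), 2))
```

Scores from a sketch like this are only comparable to published numbers when tokenization and reference handling match, which is why standardized tools are normally used for reporting.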

01/07/2018

MIZAN: A Large Persian-English Parallel Corpus

One of the most major and essential tasks in natural language processing...
01/18/2022

Syntax-based data augmentation for Hungarian-English machine translation

We train Transformer-based neural machine translation models for Hungari...
10/04/2020

Leveraging Multilingual News Websites for Building a Kurdish Parallel Corpus

Machine translation has been a major motivation of development in natura...
09/09/2020

Central Yup'ik and Machine Translation of Low-Resource Polysynthetic Languages

Machine translation tools do not yet exist for the Yup'ik language, a po...
05/31/2022

Preparing an Endangered Language for the Digital Age: The Case of Judeo-Spanish

We develop machine translation and speech synthesis systems to complemen...
09/06/2019

Self Learning from Large Scale Code Corpus to Infer Structure of Method Invocations

Automatically generating code from a textual description of method invoc...
03/04/2022

EAG: Extract and Generate Multi-way Aligned Corpus for Complete Multi-lingual Neural Machine Translation

Complete Multi-lingual Neural Machine Translation (C-MNMT) achieves supe...