Central Kurdish machine translation: First large scale parallel corpus and experiments

by Zhila Amini, et al.

Although the computational processing of Kurdish has seen a relative increase, machine translation for the language still lacks a substantial body of scientific work. This is partly due to the lack of resources curated specifically for the task. In this paper, we present Awta, the first large-scale Central Kurdish-English parallel corpus, containing 229,222 pairs of manually aligned translations. The corpus is collected from a range of text genres and domains, with the aim of supporting more robust, real-world machine translation applications. We make a portion of the corpus publicly available to foster research in this area. We also build several neural machine translation models to benchmark Kurdish machine translation, and we perform an extensive experimental analysis of the results to identify the major challenges facing Central Kurdish machine translation. We categorize these challenges as language-dependent or language-independent; the language-dependent ones relate to the morphological, syntactic, and semantic properties of Central Kurdish. Our best-performing systems achieve BLEU scores of 22.72 for Ku→En and 16.81 for En→Ku.




