Exploiting Out-of-Domain Data Sources for Dialectal Arabic Statistical Machine Translation

09/07/2015
by Katrin Kirchhoff, et al.

Statistical machine translation for dialectal Arabic suffers from a lack of data, since data acquisition involves the transcription and translation of spoken language. In this study, we develop techniques for extracting parallel data for one particular dialect of Arabic (Iraqi Arabic) from out-of-domain corpora in different dialects of Arabic or in Modern Standard Arabic. We compare two data selection strategies (cross-entropy-based and submodular selection) and demonstrate that a very small but highly targeted amount of found data can improve the performance of a baseline machine translation system. We also report preliminary experiments on using automatically translated speech data as additional training data.
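The abstract does not spell out how the cross-entropy based selection is scored. One common form of this idea ranks each out-of-domain sentence by the difference between its per-token cross-entropy under an in-domain language model and under an out-of-domain language model, and keeps the lowest-scoring sentences. The sketch below illustrates that criterion only. The unigram models with add-one smoothing, the function names, and the whitespace tokenization are all simplifying assumptions for illustration and are not taken from the paper.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Build an add-one-smoothed unigram model (an illustrative stand-in
    for a real n-gram language model) and return its log-prob function."""
    counts = Counter(tok for s in sentences for tok in s.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot shared by all unseen tokens

    def logprob(tok):
        return math.log((counts[tok] + 1) / (total + vocab))

    return logprob


def cross_entropy(logprob, tokens):
    """Per-token cross-entropy (in nats) of a non-empty token list."""
    return -sum(logprob(t) for t in tokens) / len(tokens)


def select_by_cross_entropy_difference(in_domain, out_of_domain, k):
    """Return the k out-of-domain sentences that look most in-domain.

    Score = H_in(s) - H_out(s); lower means the sentence is better
    explained by the in-domain model than by the out-of-domain model.
    Empty sentences are skipped.
    """
    lp_in = train_unigram(in_domain)
    lp_out = train_unigram(out_of_domain)
    scored = []
    for sent in out_of_domain:
        toks = sent.split()
        if toks:
            score = cross_entropy(lp_in, toks) - cross_entropy(lp_out, toks)
            scored.append((score, sent))
    scored.sort(key=lambda pair: pair[0])
    return [sent for _, sent in scored[:k]]
```

In a setting like the one described, `in_domain` would hold the small dialect-specific corpus and `out_of_domain` the larger pool in other dialects or Modern Standard Arabic; `k` controls how much found data is added to the baseline training set.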


research
01/09/2023

Automatic Standardization of Arabic Dialects for Machine Translation

Based on an annotated multimedia corpus, television series Marāyā 2013, ...
research
06/08/2016

First Result on Arabic Neural Machine Translation

Neural machine translation has become a major alternative to widely used...
research
12/18/2017

Low Resourced Machine Translation via Morpho-syntactic Modeling: The Case of Dialectal Arabic

We present the second ever evaluated Arabic dialect-to-dialect machine t...
research
04/09/2019

Data Selection with Cluster-Based Language Difference Models and Cynical Selection

We present and apply two methods for addressing the problem of selecting...
research
09/29/2021

Improving Arabic Diacritization by Learning to Diacritize and Translate

We propose a novel multitask learning method for diacritization which tr...
research
08/19/2017

Arabic Multi-Dialect Segmentation: bi-LSTM-CRF vs. SVM

Arabic word segmentation is essential for a variety of NLP applications ...
research
12/10/2019

Homograph Disambiguation Through Selective Diacritic Restoration

Lexical ambiguity, a challenging phenomenon in all natural languages, is...
