Synthetic Data for Neural Machine Translation of Spoken-Dialects

07/01/2017
by Hany Hassan, et al.

In this paper, we introduce a novel approach for generating synthetic data to train Neural Machine Translation systems. The approach transforms a parallel corpus between a written language and a target language into a parallel corpus between a spoken-dialect variant and the target language. It is language independent and can be used to generate data for any variant of the source language, such as slang or a spoken dialect, or even for a different language that is closely related to the source language. The approach is based on local embedding projection of distributed representations, which uses monolingual embeddings to transform parallel data across language variants. We report experimental results on Levantine-to-English translation using Neural Machine Translation. We show that the generated synthetic spoken data can improve a very large-scale system by more than 2.8 BLEU points, which indicates that it can be used to provide a reliable translation system for a spoken dialect that does not have sufficient parallel data.
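
The abstract gives no algorithmic detail, so the following is only a rough sketch of the general idea of rewriting the source side of a parallel corpus through embedding projection, not the authors' method. It assumes a simplified stand-in for local embedding projection: each standard-language word vector is shifted by the mean offset of its k nearest seed-lexicon pairs, then replaced by the nearest word in the dialect embedding space. All names (`to_dialect`, `local_offset_projection`, `STD_EMB`, `DIA_EMB`, `SEED_LEXICON`) and the toy two-dimensional vectors are hypothetical and chosen purely for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def local_offset_projection(vec, seed_pairs, k=2):
    """Shift `vec` by the mean (dialect - standard) offset of its k nearest seed pairs.

    Assumption: a mean offset is used here as a simple local mapping; the paper's
    actual projection may differ.
    """
    nearest = sorted(seed_pairs, key=lambda p: cosine(vec, p[0]), reverse=True)[:k]
    offset = [
        sum(d[i] - s[i] for s, d in nearest) / len(nearest)
        for i in range(len(vec))
    ]
    return [x + o for x, o in zip(vec, offset)]


def nearest_word(vec, embeddings):
    """Return the vocabulary word whose embedding is most similar to `vec`."""
    return max(embeddings, key=lambda w: cosine(vec, embeddings[w]))


def to_dialect(tokens, std_emb, dia_emb, seed_lexicon, k=2):
    """Rewrite standard-language tokens as dialect tokens; unknown tokens pass through."""
    seed_pairs = [
        (std_emb[s], dia_emb[d])
        for s, d in seed_lexicon
        if s in std_emb and d in dia_emb
    ]
    out = []
    for tok in tokens:
        if tok not in std_emb or not seed_pairs:
            out.append(tok)  # no embedding available: keep the token unchanged
            continue
        projected = local_offset_projection(std_emb[tok], seed_pairs, k)
        out.append(nearest_word(projected, dia_emb))
    return out


# Toy, hypothetical embeddings: placeholder tokens, not real vocabulary.
STD_EMB = {"std_a": [1.0, 0.0], "std_b": [0.0, 1.0], "std_c": [1.0, 1.0]}
DIA_EMB = {"dia_a": [3.0, 0.0], "dia_b": [2.0, 1.0], "dia_c": [3.0, 1.0]}
SEED_LEXICON = [("std_a", "dia_a"), ("std_b", "dia_b")]

# One parallel pair: only the source side is rewritten; the target is reused as-is.
source_tokens, target_sentence = ["std_a", "oov", "std_c"], "target sentence"
synthetic_pair = (
    to_dialect(source_tokens, STD_EMB, DIA_EMB, SEED_LEXICON),
    target_sentence,
)
print(synthetic_pair)
```

In this toy setup the target side of each pair is left untouched, which mirrors the abstract's description of turning a written-language parallel corpus into a dialect-to-target corpus; word-for-word substitution is itself an assumption of the sketch.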


research
04/05/2020

Reference Language based Unsupervised Neural Machine Translation

Exploiting common language as an auxiliary for better translation has a ...
research
05/12/2022

AppTek's Submission to the IWSLT 2022 Isometric Spoken Language Translation Task

To participate in the Isometric Spoken Language Translation Task of the ...
research
06/26/2023

Data-Driven Approach for Formality-Sensitive Machine Translation: Language-Specific Handling and Synthetic Data Generation

In this paper, we introduce a data-driven approach for Formality-Sensiti...
research
02/15/2021

Crowdsourcing Parallel Corpus for English-Oromo Neural Machine Translation using Community Engagement Platform

Even though Afaan Oromo is the most widely spoken language in the Cushit...
research
05/31/2022

Preparing an Endangered Language for the Digital Age: The Case of Judeo-Spanish

We develop machine translation and speech synthesis systems to complemen...
research
02/26/2018

Gender Aware Spoken Language Translation Applied to English-Arabic

Spoken Language Translation (SLT) is becoming more widely used and becom...
research
04/08/2022

PharmMT: A Neural Machine Translation Approach to Simplify Prescription Directions

The language used by physicians and health professionals in prescription...
