MTNT: A Testbed for Machine Translation of Noisy Text

09/02/2018
by Paul Michel, et al.

Noisy or non-standard input text can cause disastrous mistranslations in most modern Machine Translation (MT) systems, and there has been growing research interest in creating noise-robust MT systems. However, as of yet there are no publicly available parallel corpora with naturally occurring noisy inputs and translations, and thus previous work has resorted to evaluating on synthetically created datasets. In this paper, we propose a benchmark dataset for Machine Translation of Noisy Text (MTNT), consisting of noisy comments on Reddit (www.reddit.com) and professionally sourced translations. We commissioned translations of English comments into French and Japanese, as well as French and Japanese comments into English, on the order of 7k-37k sentences per language pair. We qualitatively and quantitatively examine the types of noise included in this dataset, then demonstrate that existing MT models fail badly on a number of noise-related phenomena, even after performing adaptation on a small training set of in-domain data. This indicates that this dataset can provide an attractive testbed for methods tailored to handling noisy text in MT. The data is publicly available at www.cs.cmu.edu/~pmichel1/mtnt/.

research
02/25/2019

Improving Robustness of Machine Translation with Synthetic Noise

Modern Machine Translation (MT) systems perform consistently well on cle...
research
06/27/2019

Findings of the First Shared Task on Machine Translation Robustness

We share the findings of the first shared task on improving robustness o...
research
05/21/2023

VAKTA-SETU: A Speech-to-Speech Machine Translation Service in Select Indic Languages

In this work, we present our deployment-ready Speech-to-Speech Machine T...
research
06/05/2022

Finetuning a Kalaallisut-English machine translation system using web-crawled data

West Greenlandic, known by native speakers as Kalaallisut, is an extreme...
research
02/13/2017

Towards speech-to-text translation without speech recognition

We explore the problem of translating speech to text in low-resource sce...
research
05/24/2022

Principled Paraphrase Generation with Parallel Corpora

Round-trip Machine Translation (MT) is a popular choice for paraphrase g...
research
04/10/2021

Sentiment-based Candidate Selection for NMT

The explosion of user-generated content (UGC)–e.g. social media posts, c...
