An Arabic-Hebrew parallel corpus of TED talks

10/03/2016
by Mauro Cettolo, et al.

We describe an Arabic-Hebrew parallel corpus of TED talks built upon WIT3, the Web inventory that repurposes the original content of the TED website in a form more convenient for MT researchers. The benchmark consists of about 2,000 talks whose Arabic and Hebrew subtitles have been accurately aligned and rearranged into sentences, for a total of about 3.5M tokens per language. The talks have been partitioned into train, development and test sets, following in all respects the partitioning used for the MT tasks of the IWSLT 2016 evaluation campaign. In addition to describing the benchmark, we list the problems encountered in preparing it and the novel methods designed to solve them. Baseline MT results and some measures of sentence length are provided as an extrinsic evaluation of the quality of the benchmark.
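As an illustration of the kind of sentence-length measures mentioned above, the following is a minimal sketch, not the authors' procedure. It assumes the two sides of the corpus are available as equally long lists of sentence strings and that whitespace splitting is an acceptable stand-in for tokenization; the function name `length_stats` and the returned keys are hypothetical.

```python
from statistics import mean


def length_stats(src_sentences, tgt_sentences):
    """Token totals, mean sentence lengths and mean per-pair length ratio.

    Assumption: sentences are aligned one-to-one and tokens are
    whitespace-separated (a simplification of real tokenization).
    """
    if len(src_sentences) != len(tgt_sentences):
        raise ValueError("both sides must contain the same number of sentences")
    if not src_sentences:
        raise ValueError("corpus is empty")

    src_lens = [len(s.split()) for s in src_sentences]
    tgt_lens = [len(t.split()) for t in tgt_sentences]

    # Skip pairs with an empty target side to avoid division by zero.
    ratios = [s / t for s, t in zip(src_lens, tgt_lens) if t > 0]

    return {
        "src_tokens": sum(src_lens),
        "tgt_tokens": sum(tgt_lens),
        "mean_src_len": mean(src_lens),
        "mean_tgt_len": mean(tgt_lens),
        "mean_len_ratio": mean(ratios) if ratios else None,
    }
```

A mean length ratio far from what is typical for the language pair, or a long tail of extreme per-pair ratios, can hint at misaligned sentence pairs; what counts as "far" is a judgment call that depends on the data.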
