scb-mt-en-th-2020: A Large English-Thai Parallel Corpus

07/07/2020
by Lalita Lowphansirikul, et al.

The primary objective of our work is to build a large-scale English-Thai dataset for machine translation. The resulting dataset contains over 1 million segment pairs curated from various sources: news, Wikipedia articles, SMS messages, task-based dialogs, web-crawled data, and government documents. The methodology for gathering data, building parallel texts, and removing noisy sentence pairs is presented in a reproducible manner. We train machine translation models on this dataset. Our models perform comparably to the Google Translation API (as of May 2020) for Thai-English translation, and they outperform it for both Thai-English and English-Thai translation when the Open Parallel Corpus (OPUS) is included in the training data. The dataset, pre-trained models, and source code to reproduce our work are available for public use.

research
02/25/2022

JParaCrawl v3.0: A Large-scale English-Japanese Parallel Corpus

Most current machine translation models are mainly trained with parallel...
research
08/20/2020

Lite Training Strategies for Portuguese-English and English-Portuguese Translation

Despite the widespread adoption of deep learning for machine translation...
research
07/16/2021

Darmok and Jalad at Tanagra: A Dataset and Model for English-to-Tamarian Translation

Tamarian, a fictional language introduced in the Star Trek episode Darmo...
research
10/29/2017

JESC: Japanese-English Subtitle Corpus

In this paper we describe the Japanese-English Subtitle Corpus (JESC). J...
research
09/18/2020

Unsupervised Parallel Corpus Mining on Web Data

With a large amount of parallel data, neural machine translation systems...
research
10/26/2019

Y'all should read this! Identifying Plurality in Second-Person Personal Pronouns in English Texts

Distinguishing between singular and plural "you" in English is a challen...
research
06/06/2021

Itihasa: A large-scale corpus for Sanskrit to English translation

This work introduces Itihasa, a large-scale translation dataset containi...
