Coursera Corpus Mining and Multistage Fine-Tuning for Improving Lectures Translation

12/26/2019
by Haiyue Song, et al.

Lecture translation is a case of spoken language translation, and publicly available parallel corpora for this purpose are lacking. To address this, we examine a language-independent framework for parallel corpus mining, which offers a quick and effective way to mine a parallel corpus from publicly available lectures on Coursera. Our approach determines sentence alignments by relying on machine translation and on cosine similarity over continuous-space sentence representations. We also show how to use the resulting corpora in multistage fine-tuning-based domain adaptation for high-quality lecture translation. For Japanese–English lecture translation, we extracted approximately 40,000 lines of parallel data and created development and test sets through manual filtering for benchmarking translation performance. We demonstrate that the mined corpus greatly enhances translation quality when used in conjunction with out-of-domain parallel corpora via multistage training. This paper also suggests guidelines for gathering and cleaning corpora, mining parallel sentences, addressing noise in the mined data, and creating high-quality evaluation splits. For the sake of reproducibility, we will release our code for parallel data creation.
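The alignment step described above, scoring sentence pairs by cosine similarity over sentence vectors, can be illustrated with a minimal sketch. This is not the authors' released code. It assumes the source sentences have already been machine-translated and embedded by some sentence encoder, so the toy vectors, the greedy one-to-one matching strategy, the function names `cosine` and `align`, and the threshold of 0.8 are all illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def align(src_vecs, tgt_vecs, threshold=0.8):
    """Greedy one-to-one alignment (an assumed strategy, not taken from the paper).

    Scores every source/target pair, then accepts pairs from the highest
    score downward, skipping sentences that are already matched and
    stopping once scores fall below `threshold`.
    """
    candidates = [
        (cosine(s, t), i, j)
        for i, s in enumerate(src_vecs)
        for j, t in enumerate(tgt_vecs)
    ]
    candidates.sort(reverse=True)

    used_src, used_tgt, pairs = set(), set(), []
    for score, i, j in candidates:
        if score < threshold:
            break  # remaining candidates are all weaker
        if i in used_src or j in used_tgt:
            continue
        used_src.add(i)
        used_tgt.add(j)
        pairs.append((i, j, score))
    return sorted(pairs)


# Toy vectors standing in for embeddings of machine-translated source
# sentences (src_vecs) and of target-language sentences (tgt_vecs).
src_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
tgt_vecs = [[0.0, 0.9, 0.1], [0.9, 0.1, 0.0], [1.0, 1.0, 1.0]]

pairs = align(src_vecs, tgt_vecs)
for i, j, score in pairs:
    print(f"source {i} <-> target {j} (cosine = {score:.3f})")
```

In this toy run the third source sentence stays unaligned because no target reaches the threshold. A real mining pipeline would treat the threshold as a tunable trade-off between corpus size and noise.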


