JParaCrawl: A Large Scale Web-Based English-Japanese Parallel Corpus

11/25/2019
by Makoto Morishita, et al.

Recent machine translation systems rely mainly on parallel corpora. However, because the availability of parallel corpora remains limited, only a few resource-rich language pairs can benefit from them. In this paper, we construct a parallel corpus for English-Japanese, a language pair for which publicly available parallel data is still scarce. We built the corpus by broadly crawling the web and automatically aligning parallel sentences. The resulting corpus, called JParaCrawl, contains over 8.7 million sentence pairs. We show that it covers a broad range of domains and that an NMT model trained on it serves as a good pre-trained model for fine-tuning to specific domains. The pre-training and fine-tuning approach surpassed, or achieved performance comparable to, a model trained from scratch, while greatly reducing the training cost. Additionally, we trained a model on an in-domain dataset combined with JParaCrawl and show that this combination achieved the best performance. JParaCrawl and the pre-trained models are freely available online for research purposes.
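A web-crawled corpus pipeline like the one described above (crawl, then automatically align parallel sentences) typically ends with a filtering step that keeps only confidently aligned, deduplicated pairs. The sketch below is purely illustrative, not JParaCrawl's actual tooling: the alignment scores, threshold, and example pairs are hypothetical.

```python
# Illustrative sketch: keep high-scoring aligned sentence pairs and
# drop exact duplicates. Scores are assumed to come from an upstream
# sentence aligner; the 0.7 threshold is an arbitrary example value.

def filter_pairs(candidates, threshold=0.7):
    """candidates: iterable of (english, japanese, score) tuples.
    Returns the (english, japanese) pairs that pass the threshold,
    with exact duplicates removed, preserving input order."""
    seen = set()
    kept = []
    for en, ja, score in candidates:
        key = (en, ja)
        if score >= threshold and key not in seen:
            seen.add(key)
            kept.append((en, ja))
    return kept

candidates = [
    ("Hello.", "こんにちは。", 0.95),
    ("Hello.", "こんにちは。", 0.95),      # exact duplicate, dropped
    ("Good morning.", "おはよう。", 0.85),
    ("Menu | About | Contact", "メニュー", 0.30),  # low score, dropped
]
print(filter_pairs(candidates))
# → [('Hello.', 'こんにちは。'), ('Good morning.', 'おはよう。')]
```

Real pipelines add further cleaning (language identification, length-ratio checks, boilerplate removal), but the score-threshold-plus-dedup pattern is the common core.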

