NICT's Corpus Filtering Systems for the WMT18 Parallel Corpus Filtering Task

09/19/2018
by Rui Wang, et al.

This paper presents NICT's participation in the WMT18 shared parallel corpus filtering task. The organizers provided a 1-billion-word German-English corpus crawled from the web as part of the Paracrawl project. This corpus is too noisy for building an acceptable neural machine translation (NMT) system. Using the clean data of the WMT18 shared news translation task, we designed several features and trained a classifier to score each sentence pair in the noisy data. Finally, we sampled 100 million and 10 million words and built corresponding NMT systems. Empirical results show that our NMT systems trained on sampled data achieve promising performance.
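The abstract does not specify the features, the classifier, or the exact sampling procedure, so the following is only a minimal sketch of the final selection step. It assumes each sentence pair already carries a classifier score, that sampling means keeping the highest-scored pairs until a word budget is reached, and that words are counted as whitespace tokens on the English side. The function name `select_by_word_budget` and the toy data are hypothetical.

```python
from typing import Iterable, List, Tuple

# (classifier score, German sentence, English sentence)
ScoredPair = Tuple[float, str, str]


def select_by_word_budget(scored_pairs: Iterable[ScoredPair],
                          budget: int) -> List[ScoredPair]:
    """Keep the highest-scoring pairs until the word budget is used up."""
    selected: List[ScoredPair] = []
    used = 0
    # Sort by score only, highest first.
    for pair in sorted(scored_pairs, key=lambda p: p[0], reverse=True):
        # Assumption: count whitespace-separated tokens on the English side.
        n_words = len(pair[2].split())
        if used + n_words > budget:
            break
        selected.append(pair)
        used += n_words
    return selected


# Toy data with made-up scores, for illustration only.
pairs = [
    (0.9, "Guten Morgen", "Good morning"),
    (0.1, "Klicken Sie hier", "Buy cheap watches now"),
    (0.7, "Wie geht es dir?", "How are you?"),
]
subset = select_by_word_budget(pairs, budget=5)
```

With a budget of 5 words, the toy call keeps the two highest-scored pairs and drops the low-scored one. In the setting described above, the budget would be 10 million or 100 million words.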
