Volctrans Parallel Corpus Filtering System for WMT 2020

10/27/2020
by Runxin Xu, et al.

In this paper, we describe our submissions to the WMT20 shared task on parallel corpus filtering and alignment for low-resource conditions. The task requires participants to align potential parallel sentence pairs from the given document pairs and to score them so that low-quality pairs can be filtered out. Our system, Volctrans, consists of two modules: a mining module and a scoring module. The mining module builds on a word alignment model and adopts an iterative mining strategy to extract latent parallel sentences. In the scoring module, an XLM-based scorer assigns scores, which are then refined by reranking mechanisms and ensembling. Our submissions outperform the baseline by 3.x/2.x and 2.x/2.x for km-en and ps-en under the From Scratch/Fine-Tune conditions, the highest among all submissions.
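The final stage of such a pipeline, combining several scorers and keeping the best-ranked pairs, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how the ensemble is computed, so min-max normalization followed by simple averaging is an assumption made here, and the function names (`min_max_normalize`, `ensemble_scores`, `filter_top`) are hypothetical.

```python
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def min_max_normalize(scores: Sequence[float]) -> List[float]:
    """Rescale one scorer's outputs to [0, 1] so scorers are comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores identical: this scorer carries no ranking information.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def ensemble_scores(per_scorer: Sequence[Sequence[float]]) -> List[float]:
    """Average the normalized scores of several scorers (assumed scheme)."""
    normalized = [min_max_normalize(s) for s in per_scorer]
    n = len(normalized)
    return [sum(column) / n for column in zip(*normalized)]


def filter_top(pairs: Sequence[Pair], scores: Sequence[float], keep: int) -> List[Pair]:
    """Return the `keep` highest-scoring sentence pairs, best first."""
    ranked = sorted(zip(scores, pairs), key=lambda item: item[0], reverse=True)
    return [pair for _, pair in ranked[:keep]]


# Toy usage with made-up sentence pairs and scores from two scorers.
pairs = [("a", "x"), ("b", "y"), ("c", "z")]
combined = ensemble_scores([[0.0, 5.0, 10.0], [1.0, 2.0, 3.0]])
kept = filter_top(pairs, combined, keep=2)
```

Normalizing before averaging keeps a scorer with a wide numeric range from dominating the combination; a weighted average or rank-based combination would be equally plausible choices.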


