Score Combination for Improved Parallel Corpus Filtering for Low Resource Conditions

11/16/2020
by Muhammad N. ElNokrashy, et al.

This paper describes our submission to the WMT20 sentence filtering task. We combine scores from (1) a custom LASER built for each source language, (2) a classifier built to distinguish positive and negative pairs by semantic alignment, and (3) the original scores included in the task devkit. For the mBART finetuning setup, provided by the organizers, our method shows 7 relative improvement over baseline, in sacreBLEU score on the test set for Pashto and Khmer respectively.
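As an illustration of what combining the three score sources could look like, here is a minimal Python sketch. The abstract does not say how the scores are actually combined, so everything in it is an assumption made for illustration: the min-max normalization, the weighted average, the equal default weights, and the names `min_max` and `combine_scores` are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def min_max(values: Sequence[float]) -> List[float]:
    """Rescale scores to [0, 1] so that different scorers are comparable."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        # All scores identical: this scorer carries no ranking information.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def combine_scores(
    laser: Sequence[float],
    classifier: Sequence[float],
    devkit: Sequence[float],
    weights: Sequence[float] = (1.0, 1.0, 1.0),  # assumed equal weighting
) -> List[float]:
    """Weighted average of three normalized per-sentence-pair scores.

    This is an assumed combination rule, not the one used in the paper.
    """
    if not (len(laser) == len(classifier) == len(devkit)):
        raise ValueError("all score lists must cover the same sentence pairs")
    if len(weights) != 3 or sum(weights) <= 0:
        raise ValueError("expected three weights with a positive sum")

    columns = [min_max(laser), min_max(classifier), min_max(devkit)]
    total = sum(weights)
    return [
        sum(w * col[i] for w, col in zip(weights, columns)) / total
        for i in range(len(laser))
    ]


if __name__ == "__main__":
    # Toy scores for three sentence pairs (made-up values).
    combined = combine_scores([0.2, 0.9, 0.5], [0.1, 0.8, 0.7], [1.0, 3.0, 2.0])
    # Rank pairs from best to worst by combined score.
    ranking = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
    print(ranking)
```

Filtering would then keep the top-ranked sentence pairs. A rank-based or learned combination would fit the same interface. The weighted average is used here only because it is the simplest option to read.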

