xSIM++: An Improved Proxy to Bitext Mining Performance for Low-Resource Languages

06/22/2023
by Mingda Chen, et al.

We introduce xSIM++, a new proxy score for evaluating bitext mining based on similarity in a multilingual embedding space. In comparison to xSIM, this improved proxy leverages rule-based approaches to extend the English sentences in any evaluation set with synthetic, hard-to-distinguish examples that more closely mirror the scenarios encountered during large-scale mining. We validate this proxy by running a large number of bitext mining experiments for a set of low-resource languages and subsequently training NMT systems on the mined data. We show that xSIM++ correlates better than xSIM with the downstream BLEU scores of translation systems trained on mined bitexts, providing a reliable proxy of bitext mining performance without the need to run expensive bitext mining pipelines. xSIM++ also reports performance for different error types, offering more fine-grained feedback for model development.
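To make the idea of a similarity-based proxy concrete, the sketch below shows a minimal xSIM-style error-rate computation. It is not the official xSIM or xSIM++ implementation: it assumes sentence embeddings from some multilingual encoder (random vectors stand in here), uses plain cosine nearest-neighbor retrieval rather than the margin-based scoring used in practice, and the function names are illustrative only.

```python
# Minimal sketch of an xSIM-style error rate (illustrative, not the official code).
import numpy as np

def normalize(emb: np.ndarray) -> np.ndarray:
    """L2-normalize rows so a dot product equals cosine similarity."""
    return emb / np.linalg.norm(emb, axis=1, keepdims=True)

def xsim_error_rate(src_emb: np.ndarray, tgt_emb: np.ndarray) -> float:
    """For each source sentence, retrieve the most similar target sentence;
    an error is any retrieval that is not the aligned (same-index) target.
    Returns the fraction of source sentences retrieved incorrectly."""
    src, tgt = normalize(src_emb), normalize(tgt_emb)
    sim = src @ tgt.T                 # pairwise cosine similarities
    nearest = sim.argmax(axis=1)      # index of best-matching target
    gold = np.arange(len(src))        # aligned pairs share an index
    return float((nearest != gold).mean())

# Placeholder usage: in practice the embeddings would come from a multilingual
# sentence encoder, and the target side would also include the synthetic
# hard-to-distinguish examples that xSIM++ adds to the evaluation set.
rng = np.random.default_rng(0)
src = rng.normal(size=(100, 1024))
tgt = src + 0.1 * rng.normal(size=(100, 1024))   # noisy stand-in "translations"
print(f"xSIM-style error rate: {xsim_error_rate(src, tgt):.3f}")
```

Adding synthetic hard negatives to the target pool makes the retrieval task harder and, as the paper argues, brings the proxy closer to what large-scale mining actually has to discriminate.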


Related research

10/16/2021
Towards Making the Most of Multilingual Pretraining for Zero-Shot Neural Machine Translation
This paper demonstrates that multilingual pretraining, a proper fine-tun...

05/25/2022
Bitext Mining Using Distilled Sentence Representations for Low-Resource Languages
Scaling multilingual representation learning beyond the hundred most fre...

11/30/2021
Low-Resource Machine Translation Training Curriculum Fit for Low-Resource Languages
We conduct an empirical study of neural machine translation (NMT) for tr...

02/09/2019
Multilingual Neural Machine Translation With Soft Decoupled Encoding
Multilingual training of neural machine translation (NMT) systems has le...

09/13/2022
Data-adaptive Transfer Learning for Translation: A Case Study in Haitian and Jamaican
Multilingual transfer techniques often improve low-resource machine tran...

03/07/2021
Translating the Unseen? Yorùbá → English MT in Low-Resource, Morphologically-Unmarked Settings
Translating between languages where certain features are marked morpholo...

06/12/2019
Using Small Proxy Datasets to Accelerate Hyperparameter Search
One of the biggest bottlenecks in a machine learning workflow is waiting...
