El Departamento de Nosotros: How Machine Translated Corpora Affects Language Models in MRC Tasks

07/03/2020
by Maria Khvalchik, et al.

Pre-training large-scale language models (LMs) requires huge text corpora. LMs for English enjoy ever-growing corpora of diverse language resources, whereas less-resourced languages and their mono- and multilingual LMs often struggle to obtain larger datasets. A typical approach in this case is to machine-translate English corpora into the target language. In this work, we study the caveats of using directly translated corpora to fine-tune LMs for downstream natural language processing tasks and demonstrate that careful curation along with post-processing leads to improved performance and overall LM robustness. In the empirical evaluation, we compare directly translated and curated Spanish SQuAD datasets at both the user and system levels. Further experimental results on the XQuAD and MLQA transfer-learning question answering evaluation tasks show that presumably multilingual LMs exhibit more resilience to machine translation artifacts in terms of the exact match score.
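For readers unfamiliar with the exact match score mentioned above, the following is a minimal sketch of how a SQuAD-style exact match is commonly computed. It is not the authors' evaluation code: the function names and the normalization steps (lowercasing, punctuation stripping, whitespace collapsing) are assumptions made for illustration, and the article removal done by the official SQuAD and MLQA scripts is deliberately omitted.

```python
import string
import unicodedata


def normalize_answer(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace (illustrative only)."""
    text = unicodedata.normalize("NFKC", text).lower()
    # ASCII punctuation plus the Spanish inverted marks; an assumption,
    # not the normalization used in the paper.
    drop = set(string.punctuation) | {"¿", "¡"}
    text = "".join(ch for ch in text if ch not in drop)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: list[str]) -> int:
    """Return 1 if the prediction equals any gold answer after normalization."""
    pred = normalize_answer(prediction)
    return int(any(pred == normalize_answer(gold) for gold in gold_answers))


def exact_match_score(predictions: list[str], references: list[list[str]]) -> float:
    """Percentage of predictions that exactly match one of their gold answers."""
    if not predictions:
        return 0.0
    total = sum(exact_match(p, refs) for p, refs in zip(predictions, references))
    return 100.0 * total / len(predictions)
```

Because the metric is all-or-nothing per question, a translation artifact that shifts an answer span by a single token turns a correct prediction into a miss, which is one reason exact match is sensitive to corpus quality.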


