PAXQA: Generating Cross-lingual Question Answering Examples at Training Scale

04/24/2023
by Bryan Li, et al.

Existing question answering (QA) systems owe much of their success to large, high-quality training data. Such annotation efforts are costly, and the difficulty compounds in the cross-lingual setting. As a result, prior cross-lingual QA work has focused on releasing evaluation datasets and then applying zero-shot methods as baselines. In this work, we propose a synthetic data generation method for cross-lingual QA that leverages indirect supervision from existing parallel corpora. Our method, termed PAXQA (Projecting annotations for cross-lingual (x) QA), decomposes cross-lingual QA into two stages. In the first stage, we apply a question generation (QG) model to the English side. In the second stage, we apply annotation projection to translate both the questions and the answers. To better translate questions, we propose a novel use of lexically-constrained machine translation, in which the constrained entities are extracted from the parallel bitexts. We release cross-lingual QA datasets across 4 languages, totaling 662K QA examples. We then show that extractive QA models fine-tuned on these datasets outperform both zero-shot and prior synthetic data generation models, which indicates that the generated examples are of sufficient quality. The largest performance gains are for cross-lingual directions with non-English questions and English contexts. Ablation studies show that our dataset generation method is relatively robust to noise from automatic word alignments.
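
As a rough illustration of the annotation projection step, the sketch below maps an English answer span onto the target side of a parallel sentence using word alignments. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `parse_alignments` and `project_span`, the "i-j" alignment string format, and the English-German toy sentence pair are all hypothetical choices made for this example.

```python
from typing import List, Optional, Tuple


def parse_alignments(pairs_text: str) -> List[Tuple[int, int]]:
    """Parse whitespace-separated 'i-j' pairs (source index, target index).

    The 'i-j' format is an assumption for this sketch.
    """
    pairs = []
    for item in pairs_text.split():
        src, tgt = item.split("-")
        pairs.append((int(src), int(tgt)))
    return pairs


def project_span(
    src_start: int,
    src_end: int,
    alignments: List[Tuple[int, int]],
) -> Optional[Tuple[int, int]]:
    """Project an inclusive source token span onto the target side.

    Returns the smallest contiguous target span covering every target
    token aligned to a source token in the span, or None if no token
    in the span is aligned.
    """
    targets = [t for s, t in alignments if src_start <= s <= src_end]
    if not targets:
        return None
    return min(targets), max(targets)


# Toy parallel sentence pair (hypothetical, not taken from the paper's data).
english = "Marie Curie was born in Warsaw".split()
german = "Marie Curie wurde in Warschau geboren".split()
alignments = parse_alignments("0-0 1-1 2-2 3-5 4-3 5-4")

# The English answer "Warsaw" is the single token at index 5.
span = project_span(5, 5, alignments)
projected_answer = " ".join(german[span[0]: span[1] + 1])
```

Taking the minimum and maximum aligned indices keeps the projected answer contiguous, which is what an extractive QA model needs, at the cost of occasionally including unaligned or misaligned tokens when the automatic alignments are noisy.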


research
10/16/2019

MLQA: Evaluating Cross-lingual Extractive Question Answering

Question answering (QA) models have shown rapid progress enabled by the ...
research
10/23/2020

Synthetic Data Augmentation for Zero-Shot Cross-Lingual Question Answering

Coupled with the availability of large scale datasets, deep learning arc...
research
07/13/2019

Cross-Lingual Transfer Learning for Question Answering

Deep learning based question answering (QA) on English documents has ach...
research
11/28/2022

Frustratingly Easy Label Projection for Cross-lingual Transfer

Translating training data into many languages has emerged as a practical...
research
06/11/2019

HEAD-QA: A Healthcare Dataset for Complex Reasoning

We present HEAD-QA, a multi-choice question answering testbed to encoura...
research
05/28/2021

Towards More Equitable Question Answering Systems: How Much More Data Do You Need?

Question answering (QA) in English has been widely explored, but multili...
research
07/05/2022

Cross-Lingual QA as a Stepping Stone for Monolingual Open QA in Icelandic

It can be challenging to build effective open question answering (open Q...
