GermanQuAD and GermanDPR: Improving Non-English Question Answering and Passage Retrieval

04/26/2021
by Timo Möller, et al.

A major challenge for research on non-English machine reading for question answering (QA) is the lack of annotated datasets. In this paper, we present GermanQuAD, a dataset of 13,722 extractive question/answer pairs. To improve the reproducibility of the dataset creation approach and foster QA research in other languages, we summarize lessons learned and evaluate reformulation of question/answer pairs as a way to speed up the annotation process. An extractive QA model trained on GermanQuAD significantly outperforms multilingual models and also shows that machine-translated training data cannot fully substitute for hand-annotated training data in the target language. Finally, we demonstrate the wide range of applications of GermanQuAD by adapting it to GermanDPR, a training dataset for dense passage retrieval (DPR), and we train and evaluate the first non-English DPR model.

research
02/03/2022

JaQuAD: Japanese Question Answering Dataset for Machine Reading Comprehension

Question Answering (QA) is a task in which a machine understands a given...
research
02/20/2021

Multilingual Answer Sentence Reranking via Automatically Translated Data

We present a study on the design of multilingual Answer Sentence Selecti...
research
12/06/2022

Dataset vs Reality: Understanding Model Performance from the Perspective of Information Need

Deep learning technologies have brought us many models that outperform h...
research
12/17/2022

Improving Question Answering Performance through Manual Annotation: Costs, Benefits and Strategies

Recently proposed systems for open-domain question answering (OpenQA) re...
research
05/09/2023

MAUPQA: Massive Automatically-created Polish Question Answering Dataset

Recently, open-domain question answering systems have begun to rely heav...
research
10/02/2017

Building Chatbots from Forum Data: Model Selection Using Question Answering Metrics

We propose to use question answering (QA) data from Web forums to train ...
research
08/31/2021

When Retriever-Reader Meets Scenario-Based Multiple-Choice Questions

Scenario-based question answering (SQA) requires retrieving and reading ...
