UQuAD1.0: Development of an Urdu Question Answering Training Data for Machine Reading Comprehension

11/02/2021
by   Samreen Kazi, et al.

In recent years, low-resource Machine Reading Comprehension (MRC) has made significant progress, with models achieving remarkable performance on datasets in various languages. However, none of these models has been customized for the Urdu language. This work explores the semi-automated creation of the Urdu Question Answering Dataset (UQuAD1.0) by combining machine-translated SQuAD with human-generated samples derived from Wikipedia articles and Urdu RC worksheets from Cambridge O-level books. UQuAD1.0 is a large-scale Urdu dataset intended for extractive machine reading comprehension tasks, consisting of 49k question-answer pairs in question, passage, and answer format. In UQuAD1.0, 45,000 QA pairs were generated by machine translation of the original SQuAD1.0 and approximately 4,000 pairs via crowdsourcing. In this study, we used two types of MRC models: a rule-based baseline and advanced Transformer-based models. We found that the latter outperform the former, so we concentrate solely on Transformer-based architectures. Using XLM-RoBERTa and multilingual BERT, we obtain F1 scores of 0.66 and 0.63, respectively.
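
To make the question, passage, and answer format and the F1 metric concrete, the following is a minimal sketch only. It assumes a SQuAD-style record layout (`context`, `question`, `answers` with `text` and `answer_start`) and a SQuAD-style token-overlap F1 with plain whitespace tokenization; the abstract does not state the exact field names or the exact F1 variant used. The record below is a hypothetical English illustration, not a sample from UQuAD1.0, and `token_f1` is a name chosen here for illustration.

```python
from collections import Counter

# Hypothetical record in an assumed SQuAD-style layout (not taken from UQuAD1.0).
record = {
    "context": "Lahore is the capital of Punjab.",
    "question": "What is the capital of Punjab?",
    "answers": {"text": ["Lahore"], "answer_start": [0]},
}


def answer_span_is_consistent(rec: dict) -> bool:
    """Check that each answer text appears in the passage at its stated offset."""
    context = rec["context"]
    texts = rec["answers"]["text"]
    starts = rec["answers"]["answer_start"]
    return all(context[s:s + len(t)] == t for t, s in zip(texts, starts))


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer.

    Assumption: whitespace tokenization, with no normalization of
    punctuation or diacritics.
    """
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(answer_span_is_consistent(record))
    print(token_f1("the capital Lahore", "Lahore"))
```

A dataset-level score under this assumption would be the mean of `token_f1` over all questions, taking the maximum over reference answers when a question has more than one.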
