New Vietnamese Corpus for Machine Reading Comprehension of Health News Articles

06/19/2020
by Kiet Van Nguyen, et al.

Although over 95 million people in the world speak Vietnamese, there are no large, high-quality datasets for automatic reading comprehension in the language. In addition, machine reading comprehension for the health domain offers great potential for practical applications; however, very little machine reading comprehension research exists in this domain. In this study, we present ViNewsQA, a new corpus for the low-resource Vietnamese language for evaluating machine reading comprehension models. The corpus comprises 10,138 human-generated question-answer pairs. Crowdworkers created the questions and answers based on a set of over 2,030 online Vietnamese news articles from the VnExpress news website, where the answers are spans extracted from the corresponding articles. In particular, we developed a process for creating a corpus for the Vietnamese language. Comprehensive evaluations demonstrated that our corpus requires abilities beyond simple reasoning such as word matching, and demands more difficult reasoning such as inference over single-sentence or multiple-sentence information. We conducted experiments using state-of-the-art methods for machine reading comprehension to obtain the first baseline performance measures, against which future models can be compared. We measured human performance on the corpus and compared it with several strong neural models. Our experiments showed that the best model was BERT, which achieved an exact match score of 57.57 on our corpus. The significant difference between humans and the best model (an F1-score difference of 15.93 on ViNewsQA) can be explored in future research. Our corpus is freely available on our website to encourage the research community to make these improvements.
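
The exact match and F1 scores mentioned above are commonly computed for span-extraction corpora in the SQuAD style. The following is a minimal sketch of those two metrics, not the authors' evaluation script: the function names are hypothetical, and the lowercasing plus whitespace tokenization is an assumption (the actual ViNewsQA evaluation may normalize or segment Vietnamese text differently).

```python
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Assumption: lowercase and split on whitespace. A real evaluation
    # may apply Vietnamese word segmentation or strip punctuation.
    return text.lower().split()


def exact_match(prediction: str, gold: str) -> float:
    # 1.0 when the normalized token sequences are identical, else 0.0.
    return float(tokenize(prediction) == tokenize(gold))


def f1_score(prediction: str, gold: str) -> float:
    # Token-level F1 between a predicted span and a gold answer span.
    pred_tokens = tokenize(prediction)
    gold_tokens = tokenize(gold)
    if not pred_tokens or not gold_tokens:
        # If either side is empty, score 1.0 only when both are empty.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both spans.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Corpus-level scores are typically the average of these per-question values, usually taking the maximum over the available gold answers for each question and reporting the result as a percentage.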

research
09/30/2020

A Vietnamese Dataset for Evaluating Machine Reading Comprehension

Over 97 million inhabitants speak Vietnamese as the native language in t...
research
01/16/2020

A Pilot Study on Multiple Choice Machine Reading Comprehension for Vietnamese Texts

Machine Reading Comprehension (MRC) is the task of natural language proc...
research
05/04/2021

Conversational Machine Reading Comprehension for Vietnamese Healthcare Texts

Machine reading comprehension (MRC) is a sub-field in natural language p...
research
11/29/2016

NewsQA: A Machine Comprehension Dataset

We present NewsQA, a challenging machine comprehension dataset of over 1...
research
06/16/2016

SQuAD: 100,000+ Questions for Machine Comprehension of Text

We present the Stanford Question Answering Dataset (SQuAD), a new readin...
research
06/09/2016

A Thorough Examination of the CNN/Daily Mail Reading Comprehension Task

Enabling a computer to understand a document so that it can answer compr...
research
12/04/2015

What Makes it Difficult to Understand a Scientific Literature?

In the artificial intelligence area, one of the ultimate goals is to mak...
