DuReaderrobust: A Chinese Dataset Towards Evaluating the Robustness of Machine Reading Comprehension Models

by   Hongxuan Tang, et al.

Machine Reading Comprehension (MRC) is a crucial and challenging task in natural language processing. Although several MRC models obtain human-parity performance on several datasets, we find that these models are still far from robust. To comprehensively evaluate the robustness of MRC models, we create a Chinese dataset, namely DuReader_robust. It is designed to challenge MRC models from the following aspects: (1) over-sensitivity, (2) over-stability and (3) generalization. Most previous work studies these problems by altering the inputs to unnatural texts. By contrast, the advantage of DuReader_robust is that its questions and documents are natural texts. It presents the robustness challenges that arise when applying MRC models to real-world applications. The experimental results show that MRC models based on pre-trained language models perform much worse than humans do on the robustness test set, although they perform as well as humans on the in-domain test set. Additionally, we analyze the behavior of existing models on the robustness test set, which might give suggestions for future model development. The dataset and code are available at <https://github.com/PaddlePaddle/Research/tree/master/NLP/DuReader-Robust-BASELINE>






