DuReader: a Chinese Machine Reading Comprehension Dataset from Real-world Applications

by Wei He, et al.

In this paper, we introduce DuReader, a new large-scale, open-domain Chinese machine reading comprehension (MRC) dataset designed to address real-world MRC problems. In comparison to prior datasets, DuReader has the following characteristics: (a) the questions and the documents are all extracted from real application data, and the answers are human-generated; (b) it provides rich annotations for question types, especially yes-no and opinion questions, which account for a large proportion of real users' questions but have not been well studied before; (c) it provides multiple answers for each question. The first release of DuReader contains 200k questions, 1,000k documents, and 420k answers, which, to the best of our knowledge, makes it the largest Chinese MRC dataset so far. Experimental results show a large gap between the state-of-the-art baseline systems and human performance, which indicates that DuReader is a challenging dataset that deserves future study. The dataset and the code of the baseline systems are publicly available.



DRCD: a Chinese Machine Reading Comprehension Dataset

In this paper, we introduce DRCD (Delta Reading Comprehension Dataset), ...

MS MARCO: A Human Generated MAchine Reading COmprehension Dataset

This paper presents our recent work on the design and development of a n...

Knowledge-Empowered Representation Learning for Chinese Medical Reading Comprehension: Task, Model and Resources

Machine Reading Comprehension (MRC) aims to extract answers to questions...

Native Chinese Reader: A Dataset Towards Native-Level Chinese Machine Reading Comprehension

We present Native Chinese Reader (NCR), a new machine reading comprehens...

QBSUM: a Large-Scale Query-Based Document Summarization Dataset from Real-world Applications

Query-based document summarization aims to extract or generate a summary...

ChID: A Large-scale Chinese IDiom Dataset for Cloze Test

Cloze-style reading comprehension in Chinese is still limited due to the...

SberQuAD – Russian Reading Comprehension Dataset: Description and Analysis

SberQuAD—a large scale analog of Stanford SQuAD in the Russian language—...