DuReader: a Chinese Machine Reading Comprehension Dataset from Real-world Applications

11/14/2017
by Wei He, et al.

In this paper, we introduce DuReader, a new large-scale, open-domain Chinese machine reading comprehension (MRC) dataset designed to address real-world MRC problems. Compared with prior datasets, DuReader has the following characteristics: (a) the questions and documents are all drawn from real application data, and the answers are human-generated; (b) it provides rich annotations for question types, especially yes-no and opinion questions, which account for a large proportion of real users' questions but have not been well studied before; (c) it provides multiple answers for each question. The first release of DuReader contains 200k questions, 1,000k documents, and 420k answers, which, to the best of our knowledge, makes it the largest Chinese MRC dataset so far. Experimental results show a large gap between the state-of-the-art baseline systems and human performance, which indicates that DuReader is a challenging dataset that deserves future study. The dataset and the code of the baseline systems are publicly available.


research
06/04/2018

DRCD: a Chinese Machine Reading Comprehension Dataset

In this paper, we introduce DRCD (Delta Reading Comprehension Dataset), ...
research
11/28/2016

MS MARCO: A Human Generated MAchine Reading COmprehension Dataset

This paper presents our recent work on the design and development of a n...
research
08/24/2020

Knowledge-Empowered Representation Learning for Chinese Medical Reading Comprehension: Task, Model and Resources

Machine Reading Comprehension (MRC) aims to extract answers to questions...
research
12/13/2021

Native Chinese Reader: A Dataset Towards Native-Level Chinese Machine Reading Comprehension

We present Native Chinese Reader (NCR), a new machine reading comprehens...
research
10/27/2020

QBSUM: a Large-Scale Query-Based Document Summarization Dataset from Real-world Applications

Query-based document summarization aims to extract or generate a summary...
research
06/02/2021

Why Machine Reading Comprehension Models Learn Shortcuts?

Recent studies report that many machine reading comprehension (MRC) mode...
research
12/20/2019

SberQuAD – Russian Reading Comprehension Dataset: Description and Analysis

SberQuAD—a large scale analog of Stanford SQuAD in the Russian language—...
