Native Chinese Reader: A Dataset Towards Native-Level Chinese Machine Reading Comprehension

12/13/2021
by Shusheng Xu, et al.

We present Native Chinese Reader (NCR), a new machine reading comprehension (MRC) dataset with particularly long articles in both modern and classical Chinese. NCR is collected from exam questions for the Chinese course in China's high schools, which are designed to evaluate the language proficiency of native Chinese youth. Existing Chinese MRC datasets are either domain-specific or focused on short contexts of a few hundred characters in modern Chinese only. By contrast, NCR contains 8390 documents with an average length of 1024 characters, covering a wide range of Chinese writing styles, including modern articles, classical literature and classical poetry. The 20477 questions on these documents require strong reasoning abilities and common sense to answer correctly. We implemented multiple baseline models using popular Chinese pre-trained models and additionally launched an online competition using our dataset to examine the limit of current methods. The best model achieves 59% test accuracy, while human evaluation shows an average accuracy of 79%, which indicates a significant performance gap between current MRC models and native Chinese speakers. We release the dataset at https://sites.google.com/view/native-chinese-reader/.


