Reading and comprehending natural language is key to achieving advanced artificial intelligence. Machine Reading Comprehension (MRC) aims to comprehend the context of given articles and answer questions based on them. Various types of machine reading comprehension datasets have been proposed, such as cloze-style reading comprehension (Hermann et al., 2015; Hill et al., 2015; Cui et al., 2016), span-extraction reading comprehension (Rajpurkar et al., 2016; Trischler et al., 2016), open-domain reading comprehension (Nguyen et al., 2016; He et al., 2017), and multiple-choice reading comprehension (Richardson et al., 2013; Lai et al., 2017). Along with the development of reading comprehension datasets, various neural network approaches have been proposed and have made significant advances in this area (Kadlec et al., 2016; Cui et al., 2017; Dhingra et al., 2017; Wang and Jiang, 2016; Xiong et al., 2016; Wang et al., 2017; Hu et al., 2018; Wang et al., 2018; Yu et al., 2018).
We have also seen various efforts on the construction of Chinese machine reading comprehension datasets. In cloze-style reading comprehension, Cui et al. (2016) proposed a Chinese cloze-style reading comprehension dataset, People Daily & Children's Fairy Tale, which includes a People Daily news portion and a Children's Fairy Tale portion. To increase the difficulty of the dataset, along with the automatically generated evaluation sets (development and test), they also released a human-annotated evaluation set. Later, Cui et al. (2018) proposed another dataset gathered from children's reading material. To add more diversity and to support further investigation of transfer learning, they also provided an additional human-annotated evaluation set in which the queries are more natural than the cloze type. This dataset was used in the first evaluation workshop on Chinese machine reading comprehension (CMRC 2017). In open-domain reading comprehension, He et al. (2017) proposed a large-scale open-domain Chinese machine reading comprehension dataset containing 200k queries annotated from user query logs of a search engine.
Though current machine learning approaches have surpassed human performance on the SQuAD dataset (Rajpurkar et al., 2016), we wonder whether these state-of-the-art models could achieve similar performance on datasets in other languages. To further accelerate the development of machine reading comprehension research, we propose a span-extraction dataset for Chinese machine reading comprehension.
The main contributions of our work can be summarized as follows.
We propose a Chinese span-extraction reading comprehension dataset to add linguistic diversity to the machine reading comprehension field.
To thoroughly test the ability of machine reading comprehension systems, we annotated a set of 500 questions that require various clues in the context to answer, which is relatively more difficult than the development and test sets.
We hosted the second evaluation workshop on Chinese machine reading comprehension; the top system still shows a large gap to human performance, indicating that much effort is still required.
2 The Proposed Dataset
Figure 1 shows an example of the proposed dataset.
2.1 Task Definition
Reading comprehension can be described as a triple ⟨D, Q, A⟩, where D represents the Document, Q represents the Question, and A represents the Answer. Specifically, for the span-extraction reading comprehension task, the question Q is annotated by humans, which is much more natural than in cloze-style reading comprehension datasets (Hill et al., 2015; Cui et al., 2016). The answer A should be extracted directly from the document D. Following most work on the SQuAD dataset, the task can be simplified to predicting the start and end positions in the document that delimit the answer span (Wang and Jiang, 2016).
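The paper does not prescribe a particular decoding procedure, but the start/end formulation above is commonly decoded by scoring every valid span. A minimal sketch (function and parameter names are ours, not from the original work) might look like:

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Pick the span (i, j) with i <= j that maximizes
    start_scores[i] + end_scores[j], subject to a length limit."""
    best_score = float("-inf")
    best_pair = (0, 0)
    for i, s in enumerate(start_scores):
        # Only consider end positions at or after the start, within the limit.
        for j in range(i, min(i + max_answer_len, len(end_scores))):
            score = s + end_scores[j]
            if score > best_score:
                best_score, best_pair = score, (i, j)
    return best_pair
```

In practice the scores would be the model's start/end logits over document tokens; the answer string is then the document text between the returned positions.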
2.2 Data Pre-processing
We downloaded the Chinese portion of the Wikipedia webpage dump on Jan 22, 2018 (https://dumps.wikimedia.org/zhwiki/latest/) and used the open-source toolkit Wikipedia Extractor (http://medialab.di.unipi.it/wiki/Wikipedia_Extractor) to pre-process the files into plain text. We also converted traditional Chinese characters into simplified Chinese for normalization purposes using the opencc toolkit (https://github.com/BYVoid/OpenCC).
2.3 Human Annotation
The questions in the proposed dataset are entirely annotated by humans, unlike previous works that rely on automatic data generation (Hermann et al., 2015; Hill et al., 2015; Cui et al., 2016). Before annotation, each document is divided into several paragraphs, and each paragraph is limited to no more than 500 Chinese words, where words are counted using LTP (Che et al., 2010). The annotators were instructed to first evaluate the appropriateness of the document, because some documents are extremely difficult for the general public to understand. The following rules are applied when discarding a document:
The paragraph contains many non-Chinese characters, say over 30%.
The paragraph contains many professional words that are hard to understand.
The paragraph contains many special characters (possibly introduced by the pre-processing step).
The paragraph is written in classical Chinese.
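The paper does not release the filtering script, but the first rule above can be approximated by checking the share of CJK characters in a paragraph. The following sketch is our illustration (the function name, regex range, and threshold are assumptions, not the authors' exact criteria):

```python
import re

# Basic CJK Unified Ideographs block; classical vs. modern Chinese
# and "professional words" would still need human judgment.
CHINESE_CHAR = re.compile(r"[\u4e00-\u9fff]")

def keep_paragraph(text, min_chinese_ratio=0.7):
    """Keep a paragraph only if most of its non-space characters are Chinese."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return False
    chinese = sum(1 for c in chars if CHINESE_CHAR.match(c))
    return chinese / len(chars) >= min_chinese_ratio
```

A paragraph dominated by Latin characters or pre-processing debris would fall below the ratio and be routed to manual review or discarded.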
After confirming that a paragraph is appropriate for annotation, the annotator reads the paragraph, asks questions based on it, and annotates a primary answer. During question annotation, the following rules are applied:
No more than 5 questions for each paragraph.
The answer MUST be a span in the paragraph to meet the task definition.
Encourage diversity in question types, such as who/when/where/why/how, etc.
Avoid directly reusing descriptions from the paragraph. Use paraphrasing or syntactic transformation to make answering more difficult.
Long answers (above a character-count threshold) will be discarded.
For the evaluation sets (development, test, and challenge), three answers are available per question for better evaluation. Besides the primary answer annotated by the question proposer, we invited two additional annotators to write the second and third answers for each question. During this phase, the annotators could not see the primary answer, ensuring that answers were not copied and encouraging diversity in the answers.
2.4 Challenge Set
To examine how well reading comprehension models can handle questions that require comprehensive reasoning over various clues in the context, we additionally annotated a small challenge set. The annotation was done by three annotators, in a manner similar to the development and test set annotation. A question must meet the following standards to qualify for this set:
If the answer is only a single word or short phrase, it cannot be inferable from a single sentence in the context.
If the answer is a named entity, it cannot be the only entity of its type in the context; otherwise the machine could easily pick it out by its named entity type. For example, if only one person name appears in the context, it cannot be used for annotating questions; there should be at least two person names that could mislead the machine.
Figure 2 shows an example of the challenge set.
Table 1: Statistics of the proposed dataset.

| | Train | Dev | Test | Challenge |
|---|---|---|---|---|
| Answer # per query | 1 | 3 | 3 | 3 |
| Max doc tokens | 962 | 961 | 980 | 916 |
| Max question tokens | 89 | 56 | 50 | 47 |
| Max answer tokens | 100 | 85 | 92 | 77 |
| Average doc tokens | 452 | 469 | 472 | 464 |
| Average question tokens | 15 | 15 | 15 | 18 |
| Average answer tokens | 17 | 9 | 9 | 19 |
Table 2: Leaderboard of the CMRC 2018 evaluation (EM / F1).

| System | Dev EM | Dev F1 | Test EM | Test F1 | Challenge EM | Challenge F1 |
|---|---|---|---|---|---|---|
| Estimated Human Performance | 91.083 | 97.348 | 92.400 | 97.914 | 90.382 | 95.248 |
| Z-Reader (single model) | 79.776 | 92.696 | 74.178 | 88.145 | 13.889 | 37.422 |
| MCA-Reader (single model) | 63.902 | 82.618 | 68.335 | 85.707 | 13.690 | 33.964 |
| RCEN (single model) | 73.253 | 89.750 | 64.576 | 83.136 | 10.516 | 30.994 |
| OmegaOne (single model) | 64.430 | 82.699 | 64.188 | 81.539 | 10.119 | 29.716 |
| GM-Reader (single model) | 56.322 | 77.412 | 60.470 | 80.035 | 13.690 | 33.990 |
| R-NET (single model) | 45.418 | 69.825 | 50.112 | 73.353 | 9.921 | 29.324 |
| SXU-Reader (single model) | 37.310 | 66.121 | 44.270 | 70.673 | 6.548 | 28.116 |
| T-Reader (single model) | 39.422 | 62.414 | 44.883 | 66.859 | 7.341 | 22.317 |
| Unnamed Sys by usst (single model) | 34.490 | 59.539 | 37.916 | 63.502 | 5.159 | 18.687 |
| Unnamed Sys by whu (single model) | 18.577 | 42.560 | 22.288 | 46.774 | 2.183 | 21.587 |
| Unnamed Sys by LittleBai (single model) | 7.021 | 31.657 | 10.848 | 37.231 | 0.397 | 9.498 |
| Unnamed Sys by jspi (single model) | 13.793 | 39.720 | 0.449 | 34.224 | 2.579 | 20.048 |
3 Evaluation Metrics
In this paper, we adopt two evaluation metrics following Rajpurkar et al. (2016). However, since Chinese is quite different from English, we adapt the original metrics in the following ways. Note that common punctuation and white spaces are ignored.
3.1 Exact Match
Measures the exact match between the prediction and the ground truths, i.e., the score is 1 for an exact match and 0 otherwise.
3.2 F1-Score
Measures the character-level fuzzy match between the prediction and the ground truths. Instead of treating the predictions and ground truths as bags of words, we calculate the length of the longest common sequence (LCS) between them and compute the F1-score accordingly. We take the maximum F1 over all ground truth answers for a given question. Note that non-Chinese words are not segmented.
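The character-level F1 described above can be sketched as follows. This is our illustration, not the official evaluation script: it treats every character individually (the official metric keeps non-Chinese words unsegmented, and punctuation/whitespace stripping is omitted here):

```python
def lcs_len(a, b):
    """Length of the longest common subsequence, via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a):
        for j, cb in enumerate(b):
            if ca == cb:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def char_f1(prediction, ground_truths):
    """Max character-level LCS-based F1 over all ground truth answers."""
    best = 0.0
    for gold in ground_truths:
        lcs = lcs_len(prediction, gold)
        if lcs == 0 or not prediction or not gold:
            continue
        precision = lcs / len(prediction)
        recall = lcs / len(gold)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best
```

For example, predicting "北京" against the gold answer "北京大学" gives precision 1.0 and recall 0.5, hence F1 ≈ 0.667.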
3.3 Human Performance
We also report human performance to measure the difficulty of the proposed dataset. As illustrated in the previous section, there are three answers for each question in the development, test, and challenge sets. Unlike Rajpurkar et al. (2016), we use a cross-validation method to calculate performance: we regard one answer as the human prediction and treat the remaining answers as ground truths. By iteratively treating the first, second, and third answer as the prediction, we obtain three human performance scores, and we report their average as the final human performance on this dataset.
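The cross-validation procedure above can be sketched generically over any per-question metric. This is our paraphrase of the described scheme (names are ours); the metric argument would be exact match or the character-level F1:

```python
def human_performance(answer_sets, metric):
    """Cross-validated human score.

    answer_sets: one list of annotator answers per question, e.g. [[a1, a2, a3], ...].
    metric: callable(prediction, ground_truths) -> float.
    Each answer in turn serves as the 'prediction', scored against the rest;
    the final score averages the per-fold dataset averages.
    """
    n_answers = len(answer_sets[0])
    fold_scores = []
    for k in range(n_answers):
        total = 0.0
        for answers in answer_sets:
            prediction = answers[k]
            ground_truths = answers[:k] + answers[k + 1:]
            total += metric(prediction, ground_truths)
        fold_scores.append(total / len(answer_sets))
    return sum(fold_scores) / n_answers
```

With an exact-match metric such as `lambda p, golds: float(p in golds)`, a question whose three answers are "a", "a", "b" contributes scores 1, 1, 0 across the three folds.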
4.1 CMRC 2018 Evaluation Results
The final leaderboard of the CMRC 2018 evaluation is shown in Table 2. Most of the systems obtain over 80 in test F1. However, compared to the F1 metric, the EM metric is substantially lower than on the SQuAD dataset (where the gap is usually within 10 points). This suggests that determining the exact answer boundary plays a key role in improving system performance on Chinese machine reading comprehension.
4.2 Results on the Challenge Set
Not surprisingly, as shown in the last columns of Table 2, though the top-ranked systems obtain decent scores on the development and test sets, they fail to give satisfactory results on the challenge set. In contrast, the estimated human performance on the development, test, and challenge sets is roughly the same, with the challenge set only slightly lower. We also observe that although Z-Reader obtains the best scores on the test set, it fails to give a consistently strong performance on the EM metric of the challenge set. This suggests that current reading comprehension models are not capable of handling difficult questions that require comprehensive reasoning over several clues in the context.
In this work, we propose a span-extraction dataset for Chinese machine reading comprehension, namely CMRC 2018, which was also used in the second evaluation workshop on Chinese machine reading comprehension. The dataset is human-annotated with nearly 20,000 questions, along with a challenge set composed of questions that require reasoning over multiple clues. The evaluation results show that machines can achieve excellent scores on the development and test sets, only about 10 points below the estimated human performance in F1-score. However, on the challenge set, the scores drop drastically while human performance remains almost the same as on the non-challenge sets, indicating that there is still room for more sophisticated models to improve performance. We hope the release of this dataset brings language diversity to the machine reading comprehension task and accelerates further investigation into solving questions that require comprehensive reasoning over multiple clues.
We invite more researchers to conduct experiments on the CMRC 2018 datasets and to evaluate on the hidden test and challenge sets to further test the generalization of their models. You can follow the instructions on our CodaLab worksheet to submit your model via https://worksheets.codalab.org/worksheets/0x92a80d2fab4b4f79a2b4064f7ddca9ce/.
We would like to thank our resource team for annotating and verifying the evaluation data. We also thank the Seventeenth China National Conference on Computational Linguistics (CCL 2018) (http://www.cips-cl.org/static/CCL2018/index.html) and Changsha University of Science and Technology for providing the venue for the evaluation workshop. This work was supported by the National 863 Leading Technology Research Project via grant 2015AA015409.
- Che et al. (2010) Wanxiang Che, Zhenghua Li, and Ting Liu. 2010. Ltp: A chinese language technology platform. In Proceedings of the 23rd International Conference on Computational Linguistics: Demonstrations, pages 13–16. Association for Computational Linguistics.
- Cui et al. (2017) Yiming Cui, Zhipeng Chen, Si Wei, Shijin Wang, Ting Liu, and Guoping Hu. 2017. Attention-over-attention neural networks for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 593–602. Association for Computational Linguistics.
- Cui et al. (2018) Yiming Cui, Ting Liu, Zhipeng Chen, Wentao Ma, Shijin Wang, and Guoping Hu. 2018. Dataset for the first evaluation on chinese machine reading comprehension. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). European Language Resources Association (ELRA).
- Cui et al. (2016) Yiming Cui, Ting Liu, Zhipeng Chen, Shijin Wang, and Guoping Hu. 2016. Consensus attention-based neural networks for chinese reading comprehension. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1777–1786. The COLING 2016 Organizing Committee.
- Dhingra et al. (2017) Bhuwan Dhingra, Hanxiao Liu, Zhilin Yang, William Cohen, and Ruslan Salakhutdinov. 2017. Gated-attention readers for text comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1832–1846. Association for Computational Linguistics.
- He et al. (2017) Wei He, Kai Liu, Yajuan Lyu, Shiqi Zhao, Xinyan Xiao, Yuan Liu, Yizhong Wang, Hua Wu, Qiaoqiao She, Xuan Liu, et al. 2017. Dureader: a chinese machine reading comprehension dataset from real-world applications. arXiv preprint arXiv:1711.05073.
- Hermann et al. (2015) Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, pages 1684–1692.
- Hill et al. (2015) Felix Hill, Antoine Bordes, Sumit Chopra, and Jason Weston. 2015. The goldilocks principle: Reading children’s books with explicit memory representations. arXiv preprint arXiv:1511.02301.
- Hu et al. (2018) Minghao Hu, Yuxing Peng, Zhen Huang, Xipeng Qiu, Furu Wei, and Ming Zhou. 2018. Reinforced mnemonic reader for machine reading comprehension. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, pages 4099–4106. International Joint Conferences on Artificial Intelligence Organization.
- Kadlec et al. (2016) Rudolf Kadlec, Martin Schmid, Ondřej Bajgar, and Jan Kleindienst. 2016. Text understanding with the attention sum reader network. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 908–918. Association for Computational Linguistics.
- Lai et al. (2017) Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. Race: Large-scale reading comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 785–794. Association for Computational Linguistics.
- Nguyen et al. (2016) Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. Ms marco: A human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268.
- Rajpurkar et al. (2016) Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. Squad: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392. Association for Computational Linguistics.
- Richardson et al. (2013) Matthew Richardson, Christopher JC Burges, and Erin Renshaw. 2013. Mctest: A challenge dataset for the open-domain machine comprehension of text. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 193–203.
- Trischler et al. (2016) Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, and Kaheer Suleman. 2016. Newsqa: A machine comprehension dataset. arXiv preprint arXiv:1611.09830.
- Wang and Jiang (2016) Shuohang Wang and Jing Jiang. 2016. Machine comprehension using match-lstm and answer pointer. arXiv preprint arXiv:1608.07905.
- Wang et al. (2018) Wei Wang, Ming Yan, and Chen Wu. 2018. Multi-granularity hierarchical attention fusion networks for reading comprehension and question answering. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1705–1714. Association for Computational Linguistics.
- Wang et al. (2017) Wenhui Wang, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. 2017. Gated self-matching networks for reading comprehension and question answering. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 189–198. Association for Computational Linguistics.
- Xiong et al. (2016) Caiming Xiong, Victor Zhong, and Richard Socher. 2016. Dynamic coattention networks for question answering. arXiv preprint arXiv:1611.01604.
- Yu et al. (2018) Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, and Quoc V Le. 2018. Qanet: Combining local convolution with global self-attention for reading comprehension. arXiv preprint arXiv:1804.09541.