Dual Co-Matching Network for Multi-choice Reading Comprehension

01/27/2019 · Shuailiang Zhang et al. · Shanghai Jiao Tong University; CloudWalk Technology Co., Ltd.

Multi-choice reading comprehension is a challenging task that requires a complex reasoning procedure. Given a passage and a question, a correct answer needs to be selected from a set of candidate answers. In this paper, we propose the Dual Co-Matching Network (DCMN), which models the relationship among passage, question, and answer bidirectionally. Different from existing approaches that only calculate a question-aware or option-aware passage representation, we also calculate a passage-aware question representation and a passage-aware answer representation. To demonstrate the effectiveness of our model, we evaluate it on a large-scale multiple-choice machine reading comprehension dataset (RACE). Experimental results show that our proposed model achieves new state-of-the-art results.


1 Introduction

Machine reading comprehension and question answering have become crucial application problems in evaluating the progress of AI systems in the realm of natural language processing and understanding Zhang et al. . The computational linguistics community has devoted significant attention to the general problem of machine reading comprehension and question answering.

However, most existing reading comprehension tasks only focus on shallow QA that can be tackled very effectively by existing retrieval-based techniques Zhang et al. (2018a). For example, recently we have seen increased interest in constructing extractive machine reading comprehension datasets such as SQuAD Rajpurkar et al. (2016) and NewsQA Trischler et al. (2017). Given a document and a question, the expected answer is a short span in the document. The question context usually contains sufficient information for identifying evidence sentences that entail question-answer pairs. For example, 90.2% of the questions in SQuAD, as reported by Min et al. (2018), are answerable from the content of a single sentence. Even in some multi-turn conversation tasks, existing models Zhang et al. (2018b) mostly focus on retrieval-based response matching.

In this paper, we focus on multiple-choice reading comprehension datasets such as RACE Lai et al. (2017), in which each question comes with a set of answer options. The correct answer to most questions may not appear in the original passage, which makes the task more challenging and allows a richer range of question types, such as passage summarization and attitude analysis. Answering these questions requires a more in-depth understanding of a single document and the ability to leverage external world knowledge. Besides, compared to the traditional reading comprehension problem, we need to fully consider passage-question-answer triplets instead of passage-question pairwise matching.

In this paper, we propose a new model, the Dual Co-Matching Network, to match a question-answer pair to a given passage bidirectionally. Our network leverages the latest breakthrough in NLP: BERT Devlin et al. (2018) contextual embedding. In the original BERT paper, the final hidden vector corresponding to the first input token ([CLS]) is used as the aggregate representation, and then a standard classification loss is computed with a classification layer. We think this method is too coarse to handle the passage-question-answer triplet, because it only roughly concatenates the passage and question as the first sequence and uses the answer as the second sequence, without considering the relationship between the question and the passage. So we propose a new method to model the relationship among the passage, the question, and the candidate answer.
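For concreteness, this baseline setup can be sketched as follows. This is a minimal illustration only, assuming the HuggingFace transformers library and its BertForMultipleChoice head (our tooling choice for the sketch, not part of the original BERT release or of this work): for each candidate answer, the passage and question form the first sequence, the answer option forms the second, and the [CLS] vector is scored by a single classification layer.

```python
# Sketch of the [CLS]-based BERT baseline for multiple choice
# (assumes the HuggingFace `transformers` library; illustrative only,
# the classification head here is untrained).
import torch
from transformers import BertTokenizer, BertForMultipleChoice

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMultipleChoice.from_pretrained("bert-base-uncased")

passage = "..."        # article text (placeholder)
question = "..."       # question text (placeholder)
options = ["option A", "option B", "option C", "option D"]

# One (passage + question, option) pair per candidate answer.
enc = tokenizer(
    [passage + " " + question] * len(options),
    options,
    padding=True, truncation=True, return_tensors="pt",
)
# BertForMultipleChoice expects tensors of shape (batch, num_choices, seq_len).
inputs = {k: v.unsqueeze(0) for k, v in enc.items()}
logits = model(**inputs).logits        # shape: (1, num_choices)
prediction = logits.argmax(dim=-1)     # index of the selected answer
```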

First, we use BERT as the encoding layer to obtain contextual representations of the passage, the question, and the answer options respectively. Then a matching layer is constructed to obtain the passage-question-answer triplet matching representation, which encodes the locational information of the question and the candidate answer matched to a specific context of the passage. Finally, we apply a hierarchical aggregation method over the matching representation, from word level to sequence level and then from sequence level to document level. Our model improves over the previous state-of-the-art model by 2.6 percentage points on the RACE dataset with the BERT base model and further improves the result by 3 percentage points with the BERT large model.

2 Model

For the task of multi-choice reading comprehension, the machine is given a passage, a question, and a set of candidate answers; the goal is to select the correct answer from the candidates. P, Q, and A are used to represent the passage, the question, and a candidate answer respectively. For each candidate answer, our model constructs an answer-aware passage representation, a passage-aware answer representation, a question-aware passage representation, and a passage-aware question representation. After a max-pooling layer, these representations are concatenated as the final representation of the candidate answer. The representations of all candidate answers are then used for answer selection.

In Section 2.1, we introduce the encoding mechanism. In Section 2.2, we describe how the matching representation between the passage, the question, and the candidate answer is calculated. In Section 2.3, we introduce the aggregation method and the objective function.

2.1 Encoding layer

This layer encodes each token in the passage, the question, and the candidate answer into a fixed-length vector that includes both the word embedding and the contextualized embedding. We utilize BERT Devlin et al. (2018) as our encoder, and the final hidden states of BERT are used as our final embedding. In the original BERT Devlin et al. (2018), the procedure for processing a multi-choice problem is that the final hidden vector corresponding to the first input token ([CLS]) is used as the aggregate representation of the passage, the question, and the candidate answer, which we think is too simple and too coarse. So we encode the passage, the question, and the candidate answer separately as follows:

H^p = BERT(P),  H^q = BERT(Q),  H^a = BERT(A)    (1)

where H^p ∈ R^{|P|×l}, H^q ∈ R^{|Q|×l}, and H^a ∈ R^{|A|×l} are the sequences of hidden states generated by BERT; |P|, |Q|, and |A| are the sequence lengths of the passage, the question, and the candidate answer respectively, and l is the dimension of the BERT hidden state.
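As an illustration of this encoding step (a sketch under the assumption that a HuggingFace transformers BERT encoder is used; the helper name encode is ours), P, Q, and A are passed through BERT separately and the full sequence of final hidden states is kept:

```python
# Sketch of the encoding layer: P, Q, and A are encoded separately by BERT
# (assumes HuggingFace `transformers`; hidden size l = 768 for BERT-base).
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

def encode(text: str) -> torch.Tensor:
    """Return the sequence of final BERT hidden states for `text`."""
    ids = tokenizer(text, return_tensors="pt", truncation=True)
    return bert(**ids).last_hidden_state.squeeze(0)   # shape: (seq_len, l)

H_p = encode("The passage ...")      # H^p, shape (|P|, l)
H_q = encode("The question ...")     # H^q, shape (|Q|, l)
H_a = encode("A candidate answer")   # H^a, shape (|A|, l)
```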

2.2 Matching layer

To fully mine the information in a {P, Q, A} triplet, we make use of the attention mechanism to obtain the bidirectional aggregation representation between the passage and the answer, and the same process is applied between the passage and the question. The attention vectors between the passage and the answer are calculated as follows:

G^{pa} = SoftMax(H^p W_1 (H^a)^T),  G^{ap} = SoftMax(H^a W_2 (H^p)^T)
M^p = G^{pa} H^a,  M^a = G^{ap} H^p    (2)

where W_1, W_2 ∈ R^{l×l} are the parameters to learn. G^{pa} ∈ R^{|P|×|A|} and G^{ap} ∈ R^{|A|×|P|} are the attention weight matrices between the passage and the answer. M^p ∈ R^{|P|×l} represents how each hidden state in the passage can be aligned to the answer, and M^a ∈ R^{|A|×l} represents how the candidate answer can be aligned to each hidden state in the passage. In the same way, we obtain M'^p and M^q for the representation between the passage and the question.
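A minimal PyTorch sketch of this bidirectional attention, following the notation of Eq. (2) (the module name BiAttention and the implementation details are our own illustration, not the authors' released code):

```python
import torch
import torch.nn as nn

class BiAttention(nn.Module):
    """Bidirectional passage-answer attention, a sketch of Eq. (2)."""
    def __init__(self, hidden: int = 768):
        super().__init__()
        self.W1 = nn.Linear(hidden, hidden, bias=False)   # W_1
        self.W2 = nn.Linear(hidden, hidden, bias=False)   # W_2

    def forward(self, H_p: torch.Tensor, H_a: torch.Tensor):
        # Attention weight matrices G^{pa} in R^{|P| x |A|} and G^{ap} in R^{|A| x |P|}.
        G_pa = torch.softmax(self.W1(H_p) @ H_a.T, dim=-1)
        G_ap = torch.softmax(self.W2(H_a) @ H_p.T, dim=-1)
        M_p = G_pa @ H_a   # passage positions aligned to the answer, (|P|, l)
        M_a = G_ap @ H_p   # answer positions aligned to the passage, (|A|, l)
        return M_p, M_a
```

The same module can be reused with H_p and H_q to obtain the passage-question alignments.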

To integrate the original contextual representation, we follow the idea from Wang et al. (2018) and fuse M^p with the original H^p, and likewise M^a with H^a. The final representations of the passage and the candidate answer are calculated as follows:

S^p = F([M^p ⊖ H^p ; M^p ⊗ H^p] W_3),  S^a = F([M^a ⊖ H^a ; M^a ⊗ H^a] W_4)    (3)

where W_3, W_4 ∈ R^{2l×l} are the parameters to learn. [ ; ] is the column-wise concatenation, and ⊖ and ⊗ are the element-wise subtraction and multiplication between two matrices. Previous work Tai et al. (2015); Wang and Jiang (2016) shows that this method can build a better matching representation. F is the activation function, and we choose the ReLU activation function here. S^p and S^a are the final representations of the passage and the candidate answer. On the question side, we obtain S'^p and S^q with the same calculation.
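The fuse step of Eq. (3) can likewise be sketched as a small PyTorch module (again our own illustrative code; Fuse is a hypothetical name):

```python
import torch
import torch.nn as nn

class Fuse(nn.Module):
    """Fuse an aligned representation M with the original H, a sketch of Eq. (3)."""
    def __init__(self, hidden: int = 768):
        super().__init__()
        self.proj = nn.Linear(2 * hidden, hidden, bias=False)  # plays the role of W_3 / W_4

    def forward(self, M: torch.Tensor, H: torch.Tensor) -> torch.Tensor:
        # Column-wise concatenation of element-wise subtraction and multiplication.
        fused = torch.cat([M - H, M * H], dim=-1)   # (seq_len, 2l)
        return torch.relu(self.proj(fused))         # S, (seq_len, l)

# Usage: S_p = Fuse()(M_p, H_p); S_a = Fuse()(M_a, H_a), and likewise on the question side.
```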

Single Model             RACE-M  RACE-H  RACE
DFN Xu et al.             51.5    45.7   47.4
MRU Tay et al.            57.7    47.4   50.4
HCM Wang et al. (2018)    55.8    48.2   50.4
OFT Radford (2018)        62.9    57.4   59.0
RSM Sun et al. (2018)     69.2    61.5   63.8
Our baseline
BERT_base                 67.9    62.8   64.3
BERT_large                71.7    65.7   67.5
Our model
DCMN_base                 70.7    64.8   66.5
DCMN_large                73.4    68.1   69.7
Turkers                   85.1    69.4   73.3
Ceiling                   95.4    94.2   94.5

Table 2: Model and human performance on the RACE test set. Turkers is the performance of Amazon Turkers on a random subset of the RACE test set. Ceiling is the percentage of unambiguous questions in the test set. DCMN_base uses BERT_base as the encoder, and DCMN_large uses BERT_large.

2.3 Aggregation layer

To get the final representation for each candidate answer, a row-wise max-pooling operation is applied to S^p and S^a, yielding C^p ∈ R^l and C^a ∈ R^l respectively. On the question side, C'^p and C^q are calculated in the same way. Finally, we concatenate all of them as the final output C for each {P, Q, A} triplet:

C^p = MaxPooling(S^p),  C^a = MaxPooling(S^a),  C = [C^p ; C^a ; C'^p ; C^q] ∈ R^{4l}    (4)

For each candidate answer choice A_i, its matching representation with the passage and the question can be represented as C_i. Then our loss function is computed as follows:

L(A_i | P, Q) = -log( exp(V^T C_i) / Σ_{j=1}^{N} exp(V^T C_j) )    (5)

where V ∈ R^{4l} is a parameter to learn and N is the number of candidate answers.
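A sketch of the aggregation and the objective, assuming PyTorch and the notation above (aggregate and AnswerScorer are our own illustrative names, not the authors' code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def aggregate(S_p, S_a, S_pq, S_q):
    """Row-wise max pooling and concatenation, a sketch of Eq. (4)."""
    pooled = [t.max(dim=0).values for t in (S_p, S_a, S_pq, S_q)]  # each of shape (l,)
    return torch.cat(pooled, dim=-1)                               # C, shape (4l,)

class AnswerScorer(nn.Module):
    """Score the N candidate answers and compute the loss of Eq. (5)."""
    def __init__(self, hidden: int = 768):
        super().__init__()
        self.v = nn.Linear(4 * hidden, 1, bias=False)   # the parameter V

    def forward(self, C_all: torch.Tensor, gold: torch.Tensor) -> torch.Tensor:
        # C_all: (N, 4l) matching representations; gold: index of the correct answer.
        logits = self.v(C_all).squeeze(-1)              # (N,)
        return F.cross_entropy(logits.unsqueeze(0), gold.view(1))
```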

3 Experiment

We evaluate our model on the RACE dataset Lai et al. (2017), which consists of two subsets: RACE-M and RACE-H. RACE-M comes from middle school examinations, while RACE-H comes from high school examinations. RACE is the combination of the two.

We compare our model with the following baselines: MRU (Multi-range Reasoning) Tay et al., DFN (Dynamic Fusion Networks) Xu et al., HCM (Hierarchical Co-Matching) Wang et al. (2018), OFT (OpenAI Finetuned Transformer LM) Radford (2018), and RSM (Reading Strategies Model) Sun et al. (2018). We also compare our model with a BERT baseline, for which we implement the method described in the original paper Devlin et al. (2018): the final hidden vector corresponding to the first input token ([CLS]) is used as the aggregate representation, followed by a classification layer, and finally a standard classification loss is computed.

Results are shown in Table 2. We can see that the performance of BERT_base is very close to the previous state-of-the-art, and BERT_large even outperforms it by 3.7%. Experimental results show that our model is more powerful still: we further improve the result by 2.2% compared to BERT_base and by 2.2% compared to BERT_large.

4 Conclusions

In this paper, we propose the Dual Co-Matching Network (DCMN) to model the relationship among the passage, the question, and the candidate answer bidirectionally. By incorporating the latest breakthrough, BERT, in an innovative way, our model achieves new state-of-the-art results on the RACE dataset, outperforming the previous state-of-the-art model by 2.2% on the full RACE dataset.

References

  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: pre-training of deep bidirectional transformers for language understanding. CoRR.
  • Lai et al. (2017) Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale reading comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 785–794. Association for Computational Linguistics.
  • Min et al. (2018) Sewon Min, Victor Zhong, Richard Socher, and Caiming Xiong. 2018. Efficient and robust question answering from minimal context over documents. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1725–1735. Association for Computational Linguistics.
  • Radford (2018) Alec Radford. 2018. Improving language understanding by generative pre-training.
  • Rajpurkar et al. (2016) Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392. Association for Computational Linguistics.
  • Sun et al. (2018) Kai Sun, Dian Yu, Dong Yu, and Claire Cardie. 2018. Improving machine reading comprehension with general reading strategies. CoRR.
  • Tai et al. (2015) Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1556–1566. Association for Computational Linguistics.
  • Tay et al. Yi Tay, Luu Anh Tuan, and Siu Cheung Hui. Multi-range reasoning for machine comprehension. CoRR.
  • Trischler et al. (2017) Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, and Kaheer Suleman. 2017. NewsQA: A machine comprehension dataset. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pages 191–200. Association for Computational Linguistics.
  • Wang and Jiang (2016) Shuohang Wang and Jing Jiang. 2016. A compare-aggregate model for matching text sequences. CoRR.
  • Wang et al. (2018) Shuohang Wang, Mo Yu, Jing Jiang, and Shiyu Chang. 2018. A co-matching model for multi-choice reading comprehension. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 746–751. Association for Computational Linguistics.
  • Xu et al. Yichong Xu, Jingjing Liu, Jianfeng Gao, Yelong Shen, and Xiaodong Liu. Towards human-level machine reading comprehension: Reasoning and inference with multiple strategies. CoRR.
  • Zhang et al. (2018a) Zhuosheng Zhang, Yafang Huang, and Hai Zhao. 2018a. Subword-augmented embedding for cloze reading comprehension. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1802–1814. Association for Computational Linguistics.
  • Zhang et al. Zhuosheng Zhang, Yafang Huang, Pengfei Zhu, and Hai Zhao. Effective character-augmented word embedding for machine reading comprehension. CoRR.
  • Zhang et al. (2018b) Zhuosheng Zhang, Jiangtong Li, Pengfei Zhu, Hai Zhao, and Gongshen Liu. 2018b. Modeling multi-turn conversation with deep utterance aggregation. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3740–3752. Association for Computational Linguistics.