
Context-guided Triple Matching for Multiple Choice Question Answering

The task of multiple choice question answering (MCQA) refers to identifying a suitable answer from multiple candidates by estimating the matching score among the triple of passage, question and answer. Despite the general research interest in this regard, existing methods decouple the process into several pair-wise or dual matching steps, which limits the ability to assess cases with multiple evidence sentences. To alleviate this issue, this paper introduces a novel Context-guided Triple Matching algorithm, which integrates a Triple Matching (TM) module and a Contrastive Regularization (CR). The former enumerates one component of the triple as the background context and estimates its semantic matching with the other two. The contrastive term is further proposed to capture the dissimilarity between the correct answer and distractive ones. We validate the proposed algorithm on several benchmark MCQA datasets, where it exhibits competitive performance against state-of-the-art methods.


I Introduction

Question answering is one of the most popular and challenging research topics in machine reading comprehension (MRC). Existing studies of question answering focus on either discovering (extracting) answer spans from the given passage [Seonwoo et al.(2020)Seonwoo, Kim, Ha, and Oh, Joshi et al.(2020)Joshi, Chen, Liu, Weld, Zettlemoyer, and Levy], or identifying (selecting) the most suitable answer to a question from a set of candidates, known as multiple choice question answering (MCQA) [Duan, Huang, and Wu(2021), Li et al.(2021)Li, Jiang, Wang, Lu, Zhao, and Chen, Zhang et al.(2020a)Zhang, Zhao, Wu, Zhang, Zhou, and Zhou]. This paper presents a novel method for MCQA.

Approaches to MCQA usually consist of a two-step process. In the first step, words in the passage, question and candidate answers are encoded into fixed-length vectors using a pre-trained language model. The second step generates a representation by exploring the semantic-relationship matching among the passage, question, and answer.
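This two-step pipeline can be sketched as follows; `encode` and `match` are illustrative placeholders (not names from the paper) for a pre-trained encoder and a matching function:

```python
def score_candidate(passage, question, answer, encode, match):
    """Step 1: encode the texts; Step 2: match the encoded triple."""
    Hp, Hq, Ha = encode(passage), encode(question), encode(answer)
    return match(Hp, Hq, Ha)

def predict(passage, question, candidates, encode, match):
    """Select the candidate answer with the highest matching score."""
    scores = [score_candidate(passage, question, a, encode, match)
              for a in candidates]
    return max(range(len(scores)), key=scores.__getitem__)
```

Concrete MCQA models then differ only in how `match` combines the three encodings.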

In general, improvement of methods for MCQA can be achieved by fine-tuning the pre-trained model to the context of the passage, question and answer, and/or by improving the subsequent matching [Zhang et al.(2020b)Zhang, Wu, Zhou, Duan, Zhao, and Wang, Zhu, Zhao, and Li(2020)]. Typical work on the former includes [Yang et al.(2019)Yang, Dai, Yang, Carbonell, Salakhutdinov, and Le, Shoeybi et al.(2020)Shoeybi, Patwary, Puri, LeGresley, Casper, and Catanzaro, Liu et al.(2019)Liu, Ott, Goyal, Du, Joshi, Chen, Levy, Lewis, Zettlemoyer, and Stoyanov]. A recent work on the latter is DCMN+ [Zhang et al.(2020a)Zhang, Zhao, Wu, Zhang, Zhou, and Zhou], where conventional unidirectional matching is extended to bidirectional matching among the pairs of question-passage, question-answer and passage-answer. Bidirectional matching improves the capability of capturing the semantic relationship among the triple (i.e. passage, question and answer), and hence the performance, compared with unidirectional matching. However, such pairwise matching, though bidirectional, ignores the knowledge from the remaining entity of the triple, which limits its ability to deal with cases where there are multiple evidence sentences in the passage with respect to the question and answer. Table I shows an example from the popular MCQA dataset RACE [Lai et al.(2017)Lai, Xie, Liu, Yang, and Hovy], where answering the cloze question depends on multiple evidence sentences in the passage, as highlighted in the table.

To address this issue, this paper proposes to extend DCMN+ [Zhang et al.(2020a)Zhang, Zhao, Wu, Zhang, Zhou, and Zhou] to a Context-guided Triple Matching algorithm (CTM). Specifically, the entity missing from conventional pair-wise matching is supplied as a background context. In other words, CTM performs matching with respect to a context (an entity from the triple) to exploit the semantic relationship more specifically than DCMN+. In addition, a contrastive regularization (CR) is adopted to strengthen the learning of the semantic differences among answer candidates. This regularization follows a recently proposed self-supervised learning paradigm, i.e., contrastive learning, which helps in differentiating keywords in the candidate answers (such as “creative”), as illustrated in Table I.

The contributions of the paper include:

• A context is introduced into the matching process, and a context-guided triple matching is proposed accordingly to improve the ability to effectively capture the semantic relationship among the passage, question and answers;

• A contrastive regularization is developed to learn distinctive features among similar candidate answers; and

• Extensive experiments are conducted on two widely used MCQA datasets to evaluate the proposed CTM, and state-of-the-art results are achieved in comparison with existing methods.

Our code will be publicly available from Github.

II Related Work

In this section, we provide background information on the study area, focusing on existing work on MCQA and the concept of contrastive learning.

II-A MCQA

Multiple choice question answering (MCQA) is a long-standing research problem in machine reading comprehension, where the key is to determine the correct answer (from all candidates) given the background passage and question. Several models have been proposed that utilize deep neural networks with different matching strategies.

Chaturvedi et al. first concatenate the question and candidate answer, and calculate the matching degree against the passage via attention [Chaturvedi, Pandit, and Garain(2018)]. The work in [Wang et al.(2018)Wang, Yu, Jiang, and Chang] treats the question and a candidate answer as two sequences and matches them individually with the given passage; a hierarchical aggregation structure is then constructed to fuse the co-matching representations to predict answers. Similarly, a hierarchical attention flow is proposed in [Zhu et al.(2018)Zhu, Wei, Qin, and Liu] to estimate the matching relationship based on the attention mechanism at different hierarchical levels. Zhang et al. propose a dual co-matching network in [Zhang et al.(2020a)Zhang, Zhao, Wu, Zhang, Zhou, and Zhou], which formulates the matching among background passages, questions, and answers bidirectionally.

Apart from the aforementioned matching-based work, another line of studies integrates auxiliary knowledge. For instance, a syntax-enhanced network is presented in [Zhang et al.(2020b)Zhang, Wu, Zhou, Duan, Zhao, and Wang] to combine syntactic tree information with the pre-trained encoder for better linguistic matching. Duan et al. utilize semantic role labeling to enhance the contextual representation before modeling the correlation [Duan, Huang, and Wu(2021)]. More recently, off-the-shelf knowledge graphs are leveraged to fine-tune the downstream MCQA task in [Li et al.(2021)Li, Jiang, Wang, Lu, Zhao, and Chen].

Compared to existing matching work [Wang et al.(2018)Wang, Yu, Jiang, and Chang, Zhang et al.(2020a)Zhang, Zhao, Wu, Zhang, Zhou, and Zhou], the proposed algorithm performs matching by introducing a context (an entity from the triple of passage, question and answer). This context serves as background knowledge to exploit the semantic relationship between the remaining two entities.

II-B Contrastive Learning

Contrastive learning (CL) has attracted substantial research attention in the last several years and has produced promising results in many downstream tasks, such as text clustering [Gao, Yao, and Chen(2021)], machine translation [Liang et al.(2021)Liang, Wu, Li, Wang, Meng, Qin, Chen, Zhang, and Liu], and knowledge graph completion [Qin et al.(2020)Qin, Wang, Chen, Zhang, Xu, and Wang], etc.

The main idea is to leverage the input data itself for self-supervised training. In particular, given an anchor sample x_i, an encoder f(·), and a pre-defined similarity function sim(·,·), CL aims to optimize the following objective:

 sim(f(x_i), f(x_i^+)) ≫ sim(f(x_i), f(x_i^−)),  (1)

where x_i^+ and x_i^− are contrastive (positive and negative) samples of x_i. The subsequent training assigns large similarity values to positive samples x_i^+ and small values to negative ones x_i^−. In this paper, our motivation is to integrate CL with MCQA to capture and enhance the representation difference between the correct and distractive answers, which has not been explored before.
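A minimal numpy sketch of this objective, instantiated as the common InfoNCE loss (the function names and the temperature value are illustrative assumptions, not from the paper):

```python
import numpy as np

def sim(u, v):
    """A pre-defined similarity function: cosine similarity."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE loss: minimizing it enforces
    sim(f(x), f(x+)) >> sim(f(x), f(x-))."""
    logits = np.array([sim(anchor, positive)] +
                      [sim(anchor, n) for n in negatives]) / tau
    logits -= logits.max()                        # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                      # positive sits at index 0
```

The loss is small when the anchor is far closer to its positive than to any negative, and grows as that margin shrinks.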

III Proposed Method

The proposed method identifies the best-matching answer by jointly optimizing the losses from a Triple Matching (TM) module and a Contrastive Regularization (CR), as illustrated in Fig. 1.

Given an input triple of passage, question and answer, a pre-trained language model is first utilized to encode the textual contents. The Triple Matching module then enumerates the input triple and selects one entity as the background context; the semantic relationship between the remaining two entities is estimated with regard to this selected context. Finally, the features produced by TM are used for answer selection, while the contrastive regularization maximizes the feature agreement between views of the correct answer, in contrast to the agreement with distractive ones.

III-A Encoding

Let p, q and a be a passage, a question and a candidate answer, respectively. A pre-trained model (e.g. BERT) is adopted to encode each word into a fixed-length vector, yielding

 H^p = Enc(p), H^q = Enc(q), H^a = Enc(a),  (2)

where H^p, H^q and H^a are the representations of p, q and a, respectively, and d is the dimension of the hidden state.

III-B Triple Matching

To model the relationship among the triple {p, q, a}, TM introduces a context-oriented mechanism. That is, we select one component of the triple in turn (as the background context), and estimate its semantic correlation with the remaining two to produce a context-guided representation. Note that this module involves all three entities of the triple simultaneously, while existing methods adopt a pairwise strategy that involves only two entities at a time.

Taking the answer a as an example, below we show how to model the representation for the answer(context)-guided passage-question matching. First, given the encoder outputs of p, q and a, we apply bidirectional attention to calculate the answer-aware passage representation E^p and the answer-aware question representation E^q as follows:

 G^{aq} = SoftMax(H^a W (H^q)^T), G^{ap} = SoftMax(H^a W (H^p)^T),
 E^p = G^{ap} H^p, E^q = G^{aq} H^q,  (3)

where W is a learnable parameter, and G^{aq} and G^{ap} are the attention matrices between the answer-question and the answer-passage, respectively.

Next, we include the third entity by applying bidirectional attention again (to embed the question for E^p and the passage for E^q). As a result, the core of triple matching becomes:

 G^{pq} = SoftMax(E^p W_1 (E^q)^T), G^{qp} = SoftMax(E^q W_1 (E^p)^T),
 E^{pqa} = G^{pq} H^a, E^{qpa} = G^{qp} H^a,
 S^{pqa} = ReLU(E^{pqa} W_2), S^{qpa} = ReLU(E^{qpa} W_2),  (4)

where W_1 and W_2 are learnable parameters, and S^{pqa} and S^{qpa} represent the passage-question-aware answer representation and the question-passage-aware answer representation, respectively. The final representation of answer-guided passage-question matching (i.e. M^a) aggregates the above as follows:

 M^{pqa} = MaxPooling(S^{pqa}), M^{qpa} = MaxPooling(S^{qpa}),
 M^a = [M^{pqa}; M^{qpa}],  (5)

where M^a is the aggregated representation, and M^{pqa} and M^{qpa} are computed via a row-wise max pooling operation over the outputs of Eq. (4).

Similarly, we enumerate the other two entities (the question q and the passage p) to compute the representations for the question-guided answer-passage matching (i.e., M^q) and the passage-guided answer-question matching (i.e., M^p), following the same procedure from Eq. (3) to Eq. (5). In sum, the proposed TM module is illustrated in Figure 2.
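Assuming the shape conventions implied by Eqs. (3)-(5) (row-wise token representations with a shared hidden size d, and learnable matrices W, W1, W2 standing in as random projections here), the answer-guided branch can be sketched in numpy as:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def answer_guided_matching(Hp, Hq, Ha, W, W1, W2):
    """Answer(context)-guided passage-question matching, Eqs. (3)-(5).

    Hp: (|p|, d), Hq: (|q|, d), Ha: (|a|, d); W, W1, W2: (d, d).
    Returns M^a of shape (2d,).
    """
    # Eq. (3): answer-aware passage/question representations
    Gaq = softmax(Ha @ W @ Hq.T)            # (|a|, |q|)
    Gap = softmax(Ha @ W @ Hp.T)            # (|a|, |p|)
    Ep, Eq = Gap @ Hp, Gaq @ Hq             # (|a|, d) each
    # Eq. (4): bring the third entity back in via a second attention layer
    Gpq = softmax(Ep @ W1 @ Eq.T)           # (|a|, |a|)
    Gqp = softmax(Eq @ W1 @ Ep.T)
    Spqa = np.maximum(0, (Gpq @ Ha) @ W2)   # ReLU
    Sqpa = np.maximum(0, (Gqp @ Ha) @ W2)
    # Eq. (5): row-wise max pooling, then concatenation
    Mpqa, Mqpa = Spqa.max(axis=0), Sqpa.max(axis=0)
    return np.concatenate([Mpqa, Mqpa])     # (2d,)
```

The question-guided and passage-guided branches follow by permuting the roles of Hp, Hq and Ha.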

With the triple-matching representations M^p, M^q and M^a, we concatenate them into the final representation C = [M^p; M^q; M^a]. Let C_i be the representation for the triple {p, q, a_i}. The selection loss is then computed as follows:

 L_TM(p, q, a_i) = −log [ exp(C_i V) / Σ_{j=1}^{n} exp(C_j V) ],  (6)

where V is a learnable parameter and n is the number of answer options.
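Eq. (6) is a softmax cross-entropy over the n candidate scores C_i V; a minimal numpy sketch (argument names are illustrative):

```python
import numpy as np

def selection_loss(C, V, correct_idx):
    """Eq. (6): cross-entropy over candidate scores C_i V.

    C: list of n triple representations C_i; V: scoring vector.
    """
    logits = np.array([Ci @ V for Ci in C])
    logits -= logits.max()                  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[correct_idx])
```

The loss is minimized when the correct candidate receives the largest score.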

III-D Contrastive Regularization

The aforementioned TM module extracts a semantic representation for each candidate triple. Yet, there may be only trivial (word-level) differences between the correct and distractive answers (see Table I). To highlight this dissimilarity, we propose a contrastive regularization, whose purpose is to maximize the agreement between views of the correct answer while pushing away the agreement with distractive ones.

To apply CR, we first need to form contrastive (i.e. both positive and negative) samples for an anchor. MCQA naturally provides distractive answers, which play the role of negative samples against the correct answer. For positive samples, we adopt the dropout-based strategy of [Gao, Yao, and Chen(2021), Liang et al.(2021)Liang, Wu, Li, Wang, Meng, Qin, Chen, Zhang, and Liu]. More precisely, given the correct triple {p, q, a_c}, we apply the TM module twice with different dropout masks to produce the anchor and positive representations C_c and C_c^+, respectively. The contrastive-regularized learning objective is then defined as follows:

 L_CR(p, q, a_c) = −log [ exp(cos(C_c, C_c^+)/τ) / Σ_{i=1}^{N} exp(cos(C_c, C_i)/τ) ],  (7)

where N is the size of the mini-batch (including the anchor, positive and negative samples), and τ is a pre-defined temperature.
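A sketch of the sampling scheme and loss in Eq. (7): two dropout passes over the correct triple give the anchor and positive views, while distractor representations in the mini-batch serve as negatives. Here `tm_with_dropout` is a hypothetical stand-in for a TM forward pass, and the temperature value is an assumption:

```python
import numpy as np

def tm_with_dropout(C_clean, rng, rate=0.1):
    """Hypothetical stand-in for one TM forward pass under a dropout mask."""
    mask = (rng.random(C_clean.shape) >= rate) / (1.0 - rate)
    return C_clean * mask

def cr_loss(C_anchor, C_positive, C_distractors, tau=0.07):
    """Eq. (7): pull the two correct-triple views together,
    push distractor representations away."""
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)
    s = np.array([cos(C_anchor, C_positive)] +
                 [cos(C_anchor, Cd) for Cd in C_distractors]) / tau
    s -= s.max()                                  # numerical stability
    p = np.exp(s) / np.exp(s).sum()
    return -np.log(p[0])                          # positive view at index 0
```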

III-E Loss Function

With the two losses from answer selection and contrastive regularization, we propose to train the model using the joint loss:

 L = L_TM + λ_CR L_CR,  (8)

where λ_CR is a penalty term.¹

¹There are two other training strategies: pre-train and alternate. The former updates the model with one loss first before fine-tuning with the joint loss, while the latter alternates between the two losses every fixed number of iterations. However, the experimental results show that joint training outperforms both the pre-train and alternate strategies.

III-F Discussion

Next, we analyze the relationship between the proposed method and existing pairwise algorithms. Previous studies compute the matching representation (i.e. C from Eq. (6)) using the following estimations:

• CNN-Matching [Chaturvedi, Pandit, and Garain(2018)]:

 H^{qa} = Enc([q; a]); H^p = Enc(p);
 M = Att(H^{qa}, H^p);  (9)
 C = Sim(H^{qa}, M).

• Co-Matching [Wang et al.(2018)Wang, Yu, Jiang, and Chang]:

 H^q = Enc(q); H^a = Enc(a); H^p = Enc(p);
 M^{qp} = Att(H^q, H^p); M^{ap} = Att(H^a, H^p);  (10)
 C = [Sim(M^{qp}, H^p); Sim(M^{ap}, H^p)].

• Dual Co-Matching [Zhang et al.(2020a)Zhang, Zhao, Wu, Zhang, Zhou, and Zhou]:

 H^q = Enc(q); H^a = Enc(a); H^p = Enc(p);
 M^{qa} = Att(H^q, H^a); M^{qp} = Att(H^q, H^p); M^{ap} = Att(H^a, H^p);  (11)
 C = [Gat(M^{qa}, M^{ap}); Gat(M^{qp}, M^{ap}); Gat(M^{qa}, M^{qp})].

In the aforementioned methods, Enc(·) denotes the encoder, Att(·,·) the attention operation, Sim(·,·) the similarity calculation, Gat(·,·) a gate function, and [·;·] vector concatenation. Note that existing methods adopt different implementations of Enc, Att and Sim; for instance, Enc has been implemented as a CNN in [Chaturvedi, Pandit, and Garain(2018)] and as BERT in [Wang et al.(2018)Wang, Yu, Jiang, and Chang].

Compared to the aforementioned methods, the proposed algorithm can be cast as their extension, with the additional consideration of triple matching and of contrastively representing the correct answer(s). In particular, the triple matching applies two attention layers to estimate the semantic relationship with regard to the selected context, so that Eq. (3) to Eq. (5) can be equivalently represented as:

 M^{qa} = Att(H^q, H^a); M^{pa} = Att(H^p, H^a);
 M^{pqa} = Att(Att(M^{pa}, M^{qa}), H^a);  (12)
 M^{qpa} = Att(Att(M^{qa}, M^{pa}), H^a).

In addition, our method is distinct from existing ones in further integrating the contrastive loss. That is, we aim to distinguish the correct answer by pushing its representation away from those of distractive ones, which has been neglected by existing pairwise-matching approaches.

IV Experiments

The proposed CTM method is evaluated on two widely used MCQA datasets and compared to state-of-the-art methods.

IV-A Datasets

The two datasets adopted in the experiments are RACE [Lai et al.(2017)Lai, Xie, Liu, Yang, and Hovy] and DREAM [Sun et al.(2019)Sun, Yu, Chen, Yu, Choi, and Cardie]. RACE is a widely used benchmark dataset for developing and evaluating methods for multi-choice reading comprehension. It consists of the subsets RACE-M and RACE-H, corresponding to the reading-difficulty levels of middle and high school, respectively.

DREAM is a dialogue-based examination dataset. It includes dialog passages as the background and three options associated with each individual question.

Table II shows the statistics of the two datasets, including the total number of passages, the number of questions, the average number of candidate answers and the average number of words per candidate answer. In particular, we note that the average answer length over the three subsets RACE-M, RACE-H and DREAM is approximately 5.7 words.

IV-B Implementation and Settings

Two pre-trained language models, BERT-base and BERT-large, are adopted as the encoders for word embedding. BERT-base consists of 12 transformer layers with 12 self-attention heads and a hidden size of 768, whereas BERT-large consists of 24 transformer layers with 16 self-attention heads and a hidden size of 1024; they have 110M and 340M parameters, respectively. The dropout rate for each BERT layer is set to 0.1, and the Adam optimizer is adopted to train the proposed CTM.

During training on RACE, the batch size is 4, the number of training epochs is 3, and the maximum input-sequence length is set to 360. For DREAM, the batch size is 4, the number of training epochs is 6, and the maximum input-sequence length is set to 300. For passages longer than 360/300 words in RACE/DREAM, we follow the sliding-window strategy of [Jin et al.(2020)Jin, Gao, Kao, Chung, and Hakkani-tur] to split the long passage into overlapping sub-passages of length 360/300.

For the contrastive regularization, a dropout rate of 0.1 is used to produce positive samples, with a pre-defined temperature τ. The CTM model is trained on a machine with four Tesla K80 GPUs. Accuracy, i.e. the number of questions for which the model selects the correct answer divided by the total number of questions, is used to measure performance.

IV-C Results

We compared the performance of the proposed CTM with public models from the leaderboard (e.g. BERT) and state-of-the-art methods (e.g. DCMN+ [Zhang et al.(2020a)Zhang, Zhao, Wu, Zhang, Zhou, and Zhou]). For a fair comparison, we are particularly interested in methods implemented with the same BERT encoders. Some published models (such as Megatron-BERT [Shoeybi et al.(2020)Shoeybi, Patwary, Puri, LeGresley, Casper, and Catanzaro] and RoBERTa [Liu et al.(2019)Liu, Ott, Goyal, Du, Joshi, Chen, Levy, Lewis, Zettlemoyer, and Stoyanov]) may produce better performance, but the improvement is likely because they use a much larger model (e.g., 3.9 billion parameters for Megatron-BERT) or more complex pre-training strategies (for RoBERTa). As such, those results are not strictly comparable to the proposed CTM method.²

²Given the availability of numerous pre-trained models, we could simply replace the adopted BERTs with other more powerful encoders, such as RoBERTa [Liu et al.(2019)Liu, Ott, Goyal, Du, Joshi, Chen, Levy, Lewis, Zettlemoyer, and Stoyanov], to improve the performance of CTM. Alternatively, using additional background knowledge, as in [Zhang et al.(2020b)Zhang, Wu, Zhou, Duan, Zhao, and Wang], could also lead to a potential improvement for CTM. We leave these as future work.

Results of the proposed CTM and the compared methods are shown in Table III. The proposed method achieves state-of-the-art performance on both the RACE and DREAM datasets. Not surprisingly, the BERT baselines and methods using BERT-base generally achieve worse performance than their counterparts using BERT-large, which shows the contribution a good pre-trained model brings.

Although the baseline performance (from the BERTs) is further improved by bidirectional matching [Zhang et al.(2020a)Zhang, Zhao, Wu, Zhang, Zhou, and Zhou] or external knowledge [Zhang et al.(2020b)Zhang, Wu, Zhou, Duan, Zhao, and Wang], these strategies perform the pairwise matching among the passage, question and answer independently, without considering the third entity of the triple. By contrast, the proposed CTM uses the third entity to provide a background context during matching, so that the learned features are geared towards this selected context. The contrastive regularization further strengthens the learning to differentiate the correct answer from semantically close but wrong ones. As a result, the proposed CTM substantially outperforms these state-of-the-art methods.

IV-D Ablation Study

Experiments are conducted on the RACE dataset (with the BERT-base encoder) to validate the contributions of the proposed triple-matching (TM) module and the contrastive regularization.

On triple matching. This experiment compares two matching strategies, i.e. the proposed TM against the existing dual matching. The contrastive regularization is disabled in this experiment by setting λ_CR = 0.

The DCMN+ model [Zhang et al.(2020a)Zhang, Zhao, Wu, Zhang, Zhou, and Zhou], which achieves state-of-the-art performance, is adopted as the baseline. It consists of three dual-matching components over the question-answer, question-passage, and answer-passage pairs. By contrast, the proposed CTM includes the three components M^a, M^q and M^p (see Eq. (5)). We carefully ablate these components by enumerating their different combinations, and compare them with DCMN+ on the RACE-H dataset.

Table IV shows the results of the proposed TM and DCMN+ [Zhang et al.(2020a)Zhang, Zhao, Wu, Zhang, Zhou, and Zhou]. As observed, the component M^q contributes most to answer selection, as it achieves the highest accuracy among all proposed components. This result suggests the importance of utilizing the question as the background context, rather than the passage or answers, for MCQA tasks.

On the other hand, the component M^a obtains the worst performance, which reveals the limitation of short answers. Note that M^a takes the answer as the background context and estimates its correlation (using attention) with the passage and question. Yet, this attention-based correlation is weak compared to the others, mainly due to the short sequence length of answers.

Additionally, we notice that the combination of all three components works best with the BERT-base encoder, demonstrating a better matching outcome (65.7%) than DCMN+ (64.2%). This result not only indicates the necessity of utilizing all three proposed matching components, but also shows the superiority of triple matching over the existing dual matching.

On contrastive regularization. The impact of the contrastive regularization is controlled by the penalty term λ_CR from Eq. (8), and different settings of λ_CR influence the algorithm's behavior. A bigger value of λ_CR favors the model rewarding answer differences, while in the extreme case of λ_CR = 0 the model degrades to the simple TM module. Consequently, we evaluate the accuracy of CTM by setting λ_CR to values in [0, 0.5, 1, 1.5].

From the comparison results presented in Table V, we find that the proposed contrastive regularization helps enhance the matching capability: CTM achieves its best result with a moderate non-zero λ_CR, compared to λ_CR = 0, where the latter corresponds to the simple TM module. The comparison clearly shows the advantage of utilizing the candidate differences to improve the model's answering capability.

Yet, further increasing λ_CR results in inferior accuracy. The reason could be a mismatch between the learned features and the final classification: with a larger λ_CR, the model tends to learn distinct features that separate the answers, which are not necessarily useful for selecting the correct one.

Analysis. In this section, the model capability is further analyzed with respect to question complexity. We randomly select 10% of the samples (350 questions) from the RACE-H test set, and manually annotate them with the question types what, which, cloze and other.³ Additionally, we tag each question with the number of sentences required to answer it. The performance of the two models is shown in Table VI.

³The “other” type includes the remaining question types, such as why, who, when, where, and how.

The results clearly indicate the superiority of the proposed algorithm on complex questions, such as cloze tests and questions involving multiple evidence sentences. For instance, the cloze test requires more reasoning capability, as the model needs to scan the entire passage according to the given question and all candidate answers; here the proposed triple matching is more suitable than the conventional dual matching strategy. Additionally, as the cloze test requires filling in missing item(s), the textual differences among candidate answers also play an important role. As expected, the proposed contrastive regularization helps identify and highlight those differences, thereby improving question answering.

Similarly, for complex questions that require inference from three or more sentences, the results clearly reflect an improvement of CTM over DCMN+. As the number of required sentences increases, the prediction accuracy of both models drops; yet CTM degrades much more gracefully than its counterpart, which shows its robustness in handling cases with multiple evidence sentences. In conclusion, it is empirically confirmed that the proposed CTM algorithm outperforms dual-matching methods, in particular on complex question answering.

V Conclusion

The task of multiple choice question answering (MCQA) aims to identify a suitable answer given the background passage and question. Using a dual matching strategy, existing methods decouple the process into several pair-wise steps, which fail to capture the global correlation among the triple of passage, question and answer.

In this paper, a context-guided triple matching algorithm is introduced. Concretely, a triple-matching module enumerates the triple and estimates the semantic matching between one component (the context) and the other two. Additionally, to produce more informative features, a contrastive regularization is introduced to encourage the latent representation of the correct answer(s) to stay away from those of distractive ones. Extensive experiments on two benchmark datasets show that, in comparison to multiple existing approaches, the proposed algorithm achieves state-of-the-art accuracy. To our knowledge, this is the first work that explores context-guided matching and contrastive learning in multiple choice question answering. We will continue exploring inter/cross-sentence matching in future work.

References

• [Chaturvedi, Pandit, and Garain(2018)] Chaturvedi, A.; Pandit, O.; and Garain, U. 2018. CNN for Text-Based Multiple Choice Question Answering. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 272–277.
• [Duan, Huang, and Wu(2021)] Duan, Q.; Huang, J.; and Wu, H. 2021. Contextual and Semantic Fusion Network for Multiple-Choice Reading Comprehension. IEEE Access, 9: 51669–51678.
• [Gao, Yao, and Chen(2021)] Gao, T.; Yao, X.; and Chen, D. 2021. SimCSE: Simple Contrastive Learning of Sentence Embeddings. arXiv preprint arXiv:2104.08821.
• [Jin et al.(2020)Jin, Gao, Kao, Chung, and Hakkani-tur] Jin, D.; Gao, S.; Kao, J.-Y.; Chung, T.; and Hakkani-tur, D. 2020. MMM: Multi-Stage Multi-Task Learning for Multi-Choice Reading Comprehension. Proceedings of the AAAI Conference on Artificial Intelligence, 34(05): 8010–8017.
• [Joshi et al.(2020)Joshi, Chen, Liu, Weld, Zettlemoyer, and Levy] Joshi, M.; Chen, D.; Liu, Y.; Weld, D. S.; Zettlemoyer, L.; and Levy, O. 2020. SpanBERT: Improving Pre-training by Representing and Predicting Spans. Transactions of the Association for Computational Linguistics, 8: 64–77.
• [Lai et al.(2017)Lai, Xie, Liu, Yang, and Hovy] Lai, G.; Xie, Q.; Liu, H.; Yang, Y.; and Hovy, E. 2017. RACE: Large-scale ReAding Comprehension Dataset From Examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 785–794.
• [Li et al.(2021)Li, Jiang, Wang, Lu, Zhao, and Chen] Li, R.; Jiang, Z.; Wang, L.; Lu, X.; Zhao, M.; and Chen, D. 2021. Enhancing Transformer-based language models with commonsense representations for knowledge-driven machine comprehension. Knowledge-Based Systems, 220: 106936.
• [Liang et al.(2021)Liang, Wu, Li, Wang, Meng, Qin, Chen, Zhang, and Liu] Liang, X.; Wu, L.; Li, J.; Wang, Y.; Meng, Q.; Qin, T.; Chen, W.; Zhang, M.; and Liu, T.-Y. 2021. R-Drop: Regularized Dropout for Neural Networks.
• [Liu et al.(2019)Liu, Ott, Goyal, Du, Joshi, Chen, Levy, Lewis, Zettlemoyer, and Stoyanov] Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; and Stoyanov, V. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. CoRR, abs/1907.11692.
• [Qin et al.(2020)Qin, Wang, Chen, Zhang, Xu, and Wang] Qin, P.; Wang, X.; Chen, W.; Zhang, C.; Xu, W.; and Wang, W. Y. 2020. Generative Adversarial Zero-Shot Relational Learning for Knowledge Graphs. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 8673–8680.
• [Seonwoo et al.(2020)Seonwoo, Kim, Ha, and Oh] Seonwoo, Y.; Kim, J.-H.; Ha, J.-W.; and Oh, A. 2020. Context-Aware Answer Extraction in Question Answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2418–2428.
• [Shoeybi et al.(2020)Shoeybi, Patwary, Puri, LeGresley, Casper, and Catanzaro] Shoeybi, M.; Patwary, M.; Puri, R.; LeGresley, P.; Casper, J.; and Catanzaro, B. 2020. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. arXiv:1909.08053.
• [Sun et al.(2019)Sun, Yu, Chen, Yu, Choi, and Cardie] Sun, K.; Yu, D.; Chen, J.; Yu, D.; Choi, Y.; and Cardie, C. 2019. DREAM: A Challenge Dataset and Models for Dialogue-Based Reading Comprehension. Transactions of the Association for Computational Linguistics.
• [Wang et al.(2018)Wang, Yu, Jiang, and Chang] Wang, S.; Yu, M.; Jiang, J.; and Chang, S. 2018. A Co-Matching Model for Multi-choice Reading Comprehension. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 746–751. Association for Computational Linguistics.
• [Yang et al.(2019)Yang, Dai, Yang, Carbonell, Salakhutdinov, and Le] Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R. R.; and Le, Q. V. 2019. XLNet: Generalized Autoregressive Pretraining for Language Understanding. In Wallach, H.; Larochelle, H.; Beygelzimer, A.; d'Alché-Buc, F.; Fox, E.; and Garnett, R., eds., Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.
• [Zhang et al.(2020a)Zhang, Zhao, Wu, Zhang, Zhou, and Zhou] Zhang, S.; Zhao, H.; Wu, Y.; Zhang, Z.; Zhou, X.; and Zhou, X. 2020a. DCMN+: Dual Co-Matching Network for Multi-Choice Reading Comprehension. Proceedings of the AAAI Conference on Artificial Intelligence, 34(05): 9563–9570.
• [Zhang et al.(2020b)Zhang, Wu, Zhou, Duan, Zhao, and Wang] Zhang, Z.; Wu, Y.; Zhou, J.; Duan, S.; Zhao, H.; and Wang, R. 2020b. SG-Net: Syntax-Guided Machine Reading Comprehension. In Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence.
• [Zhu et al.(2018)Zhu, Wei, Qin, and Liu] Zhu, H.; Wei, F.; Qin, B.; and Liu, T. 2018. Hierarchical Attention Flow for Multiple-Choice Reading Comprehension. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1).
• [Zhu, Zhao, and Li(2020)] Zhu, P.; Zhao, H.; and Li, X. 2020. DUMA: Reading Comprehension with Transposition Thinking. arXiv:2001.09415.