Improving Question Answering by Commonsense-Based Pre-Training

by   Wanjun Zhong, et al.

Although neural network approaches achieve remarkable success on a variety of NLP tasks, many of them struggle to answer questions that require commonsense knowledge. We believe the main reason is the lack of commonsense connections between concepts. To remedy this, we provide a simple and effective method that leverages an external commonsense knowledge base such as ConceptNet. We pre-train direct and indirect relational functions between concepts, and show that these pre-trained functions can be easily added to existing neural network models. Results show that incorporating the commonsense-based functions improves the state of the art on two question answering tasks that require commonsense reasoning. Further analysis shows that our system discovers and leverages useful evidence from an external commonsense knowledge base, which is missing in existing neural network models and helps derive the correct answer.





Commonsense reasoning is a major challenge for question answering [Levesque, Davis, and Morgenstern 2011, Clark et al. 2018, Ostermann et al. 2018, Boratko et al. 2018]. Take Figure 1 as an example. Answering both questions requires a natural language understanding system that can reason based on commonsense knowledge about the world.

Figure 1: Examples from ARC [Clark et al. 2018] that require commonsense knowledge and reasoning.

Although neural network approaches have achieved promising performance when supplied with a large amount of supervised training instances, even surpassing human-level exact match accuracy on the Stanford Question Answering Dataset (SQuAD) benchmark [Rajpurkar et al. 2016], it has been shown that existing systems lack true language understanding and reasoning capabilities [Jia and Liang 2017], which are crucial to commonsense reasoning. Moreover, although it is easy for humans to answer the aforementioned questions based on their knowledge about the world, it is a great challenge for machines when there is limited training data.

In this paper, we leverage external commonsense knowledge, such as ConceptNet [Speer and Havasi 2012], to improve the commonsense reasoning capability of a question answering (QA) system. We believe that a desirable way is to pre-train a generic model from external commonsense knowledge about the world, with the following advantages. First, such a model has a larger coverage of concepts/entities and can access rich contexts from the relational knowledge graph. Second, the ability of commonsense reasoning is not limited by the number of training instances or the coverage of reasoning types in the end tasks. Third, it is convenient to build a hybrid system that preserves the semantic matching ability of the existing QA system, which might be a neural network-based model, and further integrates a generic model to improve the model’s capability of commonsense reasoning.

We believe that the main reason why the majority of existing methods lack the commonsense reasoning ability is the absence of connections between concepts (in this work, concepts are words and phrases that can be extracted from natural language text [Speer and Havasi 2012]). These connections can be divided into direct and indirect ones. Below is an example sampled from ConceptNet.

Figure 2: A sampled subgraph from ConceptNet with “driving” as the central word.

In this case, {“driving”, “a license”} forms a direct connection whose relation is “HasPrerequisite”. Similarly, {“driving”, “road”} also forms a direct connection. Moreover, there are also indirect connections here such as {“a car”, “getting to a destination”}, which are connected by the pivot concept “driving”. Based on this, we can learn two functions to measure direct and indirect connections between every pair of concepts. These functions can be easily combined with an existing QA system to make decisions.
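To make the distinction concrete, the following minimal sketch enumerates direct and indirect connections on a toy graph. Only the "HasPrerequisite" relation is named in the text; the "RelatedTo" labels, the function names, and the in-memory triple list are illustrative assumptions and do not reflect ConceptNet's actual storage format or our implementation.

```python
from collections import defaultdict
from itertools import combinations

# Toy triples loosely based on the subgraph around "driving".
# Only "HasPrerequisite" is named in the text; "RelatedTo" is a
# placeholder label chosen for illustration.
TRIPLES = [
    ("driving", "HasPrerequisite", "a license"),
    ("driving", "RelatedTo", "road"),
    ("driving", "RelatedTo", "a car"),
    ("driving", "RelatedTo", "getting to a destination"),
]


def build_neighbors(triples):
    """Map each concept to the set of concepts it shares an edge with."""
    neighbors = defaultdict(set)
    for subj, _rel, obj in triples:
        neighbors[subj].add(obj)
        neighbors[obj].add(subj)
    return neighbors


def direct_pairs(neighbors):
    """Unordered concept pairs joined by a single edge."""
    return {frozenset((a, b)) for a in neighbors for b in neighbors[a]}


def indirect_pairs(neighbors):
    """Unordered pairs that share a pivot neighbor but have no direct edge."""
    direct = direct_pairs(neighbors)
    pairs = set()
    for adjacent in neighbors.values():
        for a, b in combinations(sorted(adjacent), 2):
            pair = frozenset((a, b))
            if pair not in direct:
                pairs.add(pair)
    return pairs


NEIGHBORS = build_neighbors(TRIPLES)
DIRECT = direct_pairs(NEIGHBORS)
INDIRECT = indirect_pairs(NEIGHBORS)
```

Here {"a car", "getting to a destination"} ends up in `INDIRECT` because both concepts are adjacent to the pivot "driving" without being adjacent to each other.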

We take two question answering tasks [Clark et al. 2018, Ostermann et al. 2018] that require commonsense reasoning as the testbeds. These tasks take a question and optionally a context as input (the definitions of contexts in these tasks are slightly different, and we describe the details in the next section), and select an answer from a set of candidate answers. We believe that understanding and answering the question requires knowledge of both words and the world [Hirsch 2003]. Thus, we implement document-based neural network baselines, and use exactly the same way to improve the baseline systems with our commonsense-based pretrained models. Results show that incorporating pretrained models brings improvements on these two tasks and improves the model’s ability to discover useful evidence from an external commonsense knowledge base.

Tasks and Datasets

In this work, we focus on integrating commonsense knowledge as a source of supportive information into the question answering task. To verify the effectiveness of our approach, we use two multiple-choice question answering tasks that require commonsense reasoning as our testbeds. In this section, we describe the task definitions and the datasets associated with the two tasks.

Given a question and, optionally, a supporting passage, both tasks are to predict the correct answer from a set of candidate answers. The difference between these tasks is the definition of the supporting passage, which will be described later in this section. Systems are expected to select the correct answer from multiple candidate answers by reasoning over the question and the supporting passage. Following previous studies, we regard the problem as a ranking task. At test time, the model should return the answer with the highest score as the prediction.

The first task comes from SemEval 2018 Task 11 [Ostermann et al. 2018], which aims to evaluate a system’s ability to perform commonsense reasoning in question answering. The dataset describes events about daily activities. For each question, the supporting passage is a specific document given as a part of the input, and the number of candidate answers is two. Answering a substantial number of questions in this dataset requires inference from commonsense knowledge of diverse scenarios, which goes beyond the facts explicitly mentioned in the document.

The second task we focus on is ARC, short for AI2 Reasoning Challenge, proposed by Clark et al. (2018). The ARC dataset consists of a collection of science questions and a large text corpus containing a large number of science facts. Each question has multiple candidate answers (mostly four). The dataset is separated into an easy set and a challenge set. The challenge set contains only difficult grade-school questions, namely those answered incorrectly by both a retrieval-based algorithm and a word co-occurrence algorithm, which require strong reasoning ability over commonsense knowledge or other reasoning procedures [Boratko et al. 2018]. Figure 1 shows two examples that need to be solved with commonsense knowledge. We only use the challenge set in our experiment.

Commonsense Knowledge

This section describes the commonsense knowledge base we investigate in our experiment. We use ConceptNet [Speer and Havasi 2012], one of the most widely used commonsense knowledge bases. Our approach is generic and could also be applied to other commonsense knowledge bases such as WebChild [Tandon, de Melo, and Weikum 2017], which we leave as future work. ConceptNet is a semantic network that represents large sets of words and phrases and the commonsense relationships between them. It contains 657,637 instances and 39 types of relationships. Each instance in ConceptNet can be generally described as a triple (subject, relation, object). For example, the “IsA” relation (e.g. “car”, “IsA”, “vehicle”) means that “X is a kind of Y”; the “Causes” relation (e.g. “car”, “Causes”, “pollution”) means that “the effect of X is Y”; the “CapableOf” relation (e.g. “car”, “CapableOf”, “go fast”) means that “X can Y”, etc. More relations and explanations can be found in Speer and Havasi (2012).

Approach Overview

In this section, we give an overview of our framework to show our basic idea for solving the commonsense reasoning problem. Details of each component will be described in the following sections.

At the top level of our framework, we select the candidate answer with the highest score as our final prediction. We therefore tackle the problem by designing a scoring function that captures the evidence mentioned in the passage and retrieved from the commonsense knowledge base.

Figure 3: An overview of our system for commonsense based question answering.

An overview of the QA system is given in Figure 3. We define the scoring function $f(a_i)$ to calculate the score of a candidate answer $a_i$, which is the sum of the document-based scoring function $f_{doc}(a_i)$ and the commonsense-based scoring function $f_{cs}(a_i)$:

$$f(a_i) = f_{doc}(a_i) + f_{cs}(a_i) \tag{1}$$

The calculation of the final score would consider the given passage, the given question, and a set of commonsense knowledge related to this instance.
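As a minimal sketch of this top-level decision rule, the snippet below adds a document-based score and a commonsense-based score per candidate and returns the highest-scoring candidate. The function names and the numeric scores are hypothetical and serve only to illustrate the ranking step.

```python
def combined_score(doc_score, cs_score):
    """Final score of one candidate: document-based plus commonsense-based."""
    return doc_score + cs_score


def predict(candidates, doc_scores, cs_scores):
    """Return the candidate whose combined score is highest."""
    totals = [combined_score(d, c) for d, c in zip(doc_scores, cs_scores)]
    best = max(range(len(candidates)), key=lambda i: totals[i])
    return candidates[best]


# Hypothetical scores for two candidates: the document-based model alone
# prefers "A", while the commonsense-based score tips the decision to "B".
PREDICTION = predict(["A", "B"], doc_scores=[0.6, 0.5], cs_scores=[0.1, 0.3])
```

In this toy setting the document-based scores alone would select "A", whereas the summed scores select "B", which mirrors how the commonsense-based function is meant to complement the document-based one.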

In the next section we will detail the design and mathematical formulas of our commonsense knowledge based scoring function. We introduce the document based model in the following section.

Commonsense-based Model

In this section, we first describe how to pre-train commonsense-based functions in order to capture the semantic relationships between two concepts. A graph neural network [Scarselli et al. 2009] is used to integrate context from the graph structure in an external commonsense knowledge base. Afterwards, we present how to use the pre-trained functions to calculate the relevance score between two pieces of text, such as a question sentence and a candidate answer sentence.

We model both direct and indirect relations between two concepts from the commonsense KB, both of which are helpful when the connection between two sources (e.g. a question and a candidate answer) cannot be established from the surface words alone. Take the direct relation involved in Figure 4 as an example.

Figure 4: An example from ARC dataset. The analysis of this example could be improved if it is given the fact {“electrons”, “HasA”, “negative charge”} in ConceptNet.

If a model is given the evidence from ConceptNet that the concept “electrons” and the concept “negative charge” have a direct relation, it would be more confident in distinguishing between (B,D) and (A,C), and thus has a larger probability of obtaining the correct answer (D). Therefore, it is desirable to model the relevance between two concepts. Moreover, ConceptNet cannot cover all the concepts that potentially have direct relations, so we need to model the direct relation for every pair of concepts.

Similarly, the indirect relation also provides strong evidence for making predictions. As shown in the example of Figure 2, the concept “a car” has an indirect relation to the concept “getting to a destination”, both of which have a direct connection to the pivot concept “driving”. With access to this information, a model would give a higher score to the answer containing “car” when asked “how did someone get to the destination”.

Therefore, we model the commonsense-based relation between two concepts $c_1$ and $c_2$ as follows, where $\odot$ means element-wise multiplication and $Enc(\cdot)$ stands for an encoder that represents a concept with a continuous vector.

$$f_{cs}(c_1, c_2) = Enc(c_1) \odot Enc(c_2)$$


Specifically, we represent a concept with two types of information, namely the words it contains and the neighbors connected to it in the structural knowledge graph. From the first aspect, since each concept might consist of a sequence of words, we encode it by a bidirectional LSTM [Hochreiter and Schmidhuber 1997] over GloVe word vectors [Pennington, Socher, and Manning 2014], where the concatenation of hidden states at both ends is used as the representation. We denote it as $h^w(c)$.

$$h^w(c) = \mathrm{BiLSTM}(\mathrm{Emb}(c))$$

From the second aspect, we represent each concept based on the representations of its neighbors and the relations that connect them. We take inspiration from graph neural networks [Scarselli et al. 2009]. We regard a relation that connects two concepts as a compositional modifier that modifies the meaning of the neighboring concept. Matrix-vector multiplication is used as the composition function [Mitchell and Lapata 2010]. We denote the neighbor-based representation of a concept $c$ as $h^n(c)$, which is calculated as follows, where $r(c, c')$ is the specific relation between two concepts, $NBR(c)$ stands for the set of neighbors of the concept $c$, and $W$ and $b$ are model parameters.

$$h^n(c) = \sum_{c' \in NBR(c)} \left( W^{r(c, c')} h^w(c') + b^{r(c, c')} \right)$$

The final representation of a concept $c$ is the concatenation of both representations, namely $Enc(c) = [h^w(c); h^n(c)]$.
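The neighbor-based composition and the final concatenation can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions: the word-based vectors are fixed toy values standing in for the BiLSTM output, and the relation-specific matrices and biases are hand-picked rather than learned. All names and numbers are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def neighbor_representation(edges, word_repr, rel_matrix, rel_bias, dim):
    """Sum W_r * h_w(neighbor) + b_r over all (relation, neighbor) edges."""
    total = [0.0] * dim
    for relation, neighbor in edges:
        composed = matvec(rel_matrix[relation], word_repr[neighbor])
        total = [t + x + b for t, x, b in zip(total, composed, rel_bias[relation])]
    return total


def encode(concept, edges, word_repr, rel_matrix, rel_bias, dim):
    """Concatenate the word-based and neighbor-based representations."""
    h_w = word_repr[concept]
    h_n = neighbor_representation(edges, word_repr, rel_matrix, rel_bias, dim)
    return h_w + h_n


# Toy word-based vectors (stand-ins for BiLSTM outputs).
WORD_REPR = {
    "driving": [1.0, 0.0],
    "a license": [0.0, 2.0],
    "road": [1.0, 1.0],
}
# One matrix and one bias per relation type (hand-picked, not learned).
REL_MATRIX = {
    "HasPrerequisite": [[1.0, 0.0], [0.0, 1.0]],
    "RelatedTo": [[0.0, 1.0], [1.0, 0.0]],
}
REL_BIAS = {
    "HasPrerequisite": [0.0, 0.0],
    "RelatedTo": [0.5, 0.5],
}
EDGES = [("HasPrerequisite", "a license"), ("RelatedTo", "road")]

ENCODED = encode("driving", EDGES, WORD_REPR, REL_MATRIX, REL_BIAS, dim=2)
```

The first two entries of `ENCODED` are the word-based part and the last two are the sum of the relation-transformed neighbor vectors.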

We use a standard ranking-based loss function to train the parameters, which is given in the following equation.

$$l(c, c^+, c^-) = \max\left(0,\; mgn - f_{cs}(c, c^+) + f_{cs}(c, c^-)\right)$$

In this equation, $c$ and $c^+$ form a positive instance, which means that they have a relationship with each other, while $c$ and $c^-$ form a negative instance. $mgn$ is the margin, with a value of 0.1 in the experiment. We can easily learn two functions to model direct and indirect relations between two concepts by having different definitions of what a positive instance is, and accordingly using different strategies to sample the training instances. For the direct relation, we set directly adjacent entity pairs in the knowledge graph as positive examples, and randomly select entity pairs that have no direct relationship as negative examples. For the indirect relation, we select entity pairs that have a common neighbor as positive instances and randomly select an equal number of entity pairs that have no one-hop or two-hop connections as negative instances. We denote the direct relation based function as $f^{dir}_{cs}$, and the indirect relation based function as $f^{ind}_{cs}$. The final commonsense-based score in Equation 1 is calculated by using one of these two functions, or using both of them through a weighted sum. We will show the results under different settings in the experiment section.
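A minimal sketch of the margin-based ranking loss and of negative sampling is given below. The margin of 0.1 comes from the text; the helper names, the rejection-sampling strategy, the retry limit, and the toy concepts are assumptions made for illustration.

```python
import random


def ranking_loss(pos_score, neg_score, margin=0.1):
    """Hinge loss that pushes the positive score above the negative by a margin."""
    return max(0.0, margin - pos_score + neg_score)


def sample_negative(concepts, excluded_pairs, rng, max_tries=1000):
    """Draw a random concept pair that is not listed in excluded_pairs.

    For the direct relation, excluded_pairs would hold all adjacent pairs.
    For the indirect relation, it would hold all one-hop and two-hop pairs.
    """
    for _ in range(max_tries):
        a, b = rng.sample(concepts, 2)
        if frozenset((a, b)) not in excluded_pairs:
            return a, b
    raise ValueError("no negative pair found within max_tries")


CONCEPTS = ["driving", "a license", "fur"]
ADJACENT = {frozenset(("driving", "a license"))}
RNG = random.Random(0)
NEGATIVE = sample_negative(CONCEPTS, ADJACENT, RNG)
```

The loss is zero once the positive pair outscores the negative pair by at least the margin, so well-separated pairs stop contributing to the gradient.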

We have detailed the commonsense-based functions that measure the direct and indirect connections of each pair of concepts. Here, we present how to calculate the commonsense-based score of a question sentence and a candidate answer sentence. In our experiment, we retrieve commonsense facts from ConceptNet [Speer and Havasi 2012]. As described above, each fact from ConceptNet can be represented as a triple, namely (subject, relation, object). For each sentence (or paragraph), we retrieve a set of facts from ConceptNet. Specifically, we first extract a set of n-grams from each sentence and use them in the search process; we then keep the commonsense facts from ConceptNet that contain one of the extracted n-grams. We denote the retrieved facts for a sentence $s$ as $E_s$.
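The retrieval step can be sketched as below. This is an illustrative approximation: whitespace tokenization, lowercasing, exact string matching against the subject or object, and the `max_n` parameter are all assumptions, and the n-gram lengths used in the example are arbitrary.

```python
def extract_ngrams(sentence, max_n):
    """Return all word n-grams of length 1..max_n as lowercase strings."""
    tokens = sentence.lower().split()
    return {
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }


def retrieve_facts(sentence, triples, max_n):
    """Keep triples whose subject or object matches an n-gram of the sentence."""
    ngrams = extract_ngrams(sentence, max_n)
    return [t for t in triples if t[0] in ngrams or t[2] in ngrams]


# Toy fact store; a real system would query ConceptNet instead.
FACTS = [
    ("driving", "HasPrerequisite", "a license"),
    ("car", "IsA", "vehicle"),
]
RETRIEVED = retrieve_facts("She started driving to work", FACTS, max_n=2)
```

Multi-word concepts such as "a license" are only matched when `max_n` is at least as large as the number of words in the concept.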

Suppose we have obtained commonsense facts for a question sentence and a candidate answer, respectively; let us denote the outputs as $E_1$ and $E_2$. We can calculate the final score by the following formula. The intuition is to select the most relevant concept in $E_2$ for each concept in $E_1$, and then aggregate all these scores by averaging.

$$f_{cs}(a_i) = \frac{1}{|E_1|} \sum_{x \in E_1} \max_{y \in E_2} f_{cs}(x, y)$$

In the experiments on ARC and SemEval datasets, we also apply the previous scoring function to a pair of paragraph and candidate answer, where $E_1$ and $E_2$ come from the supporting paragraph and the answer sentence, respectively. Furthermore, we also calculate an additional score for the answer-paragraph pair in the same way. For a paragraph-question pair, in order to guarantee the relevance to the candidate answer sentence, we filter out concepts from $E_1$ or $E_2$ if they are not contained in the extracted concepts from the candidate answer.
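The max-then-average aggregation described in this section can be sketched as follows. The pairwise scores are supplied through a lookup table with made-up values, and returning 0.0 for an empty concept set is an assumption, since the text does not say how that case is handled.

```python
def aggregate_score(concepts_1, concepts_2, pair_score):
    """Average, over concepts_1, of the best pairwise score against concepts_2."""
    if not concepts_1 or not concepts_2:
        return 0.0  # assumed fallback when nothing was retrieved
    best = [max(pair_score(x, y) for y in concepts_2) for x in concepts_1]
    return sum(best) / len(best)


# Hypothetical pairwise scores standing in for a pretrained function.
TOY_SCORES = {
    ("electrons", "negative charge"): 0.8,
    ("electrons", "nucleus"): 0.2,
    ("atom", "negative charge"): 0.1,
    ("atom", "nucleus"): 0.4,
}


def toy_pair_score(x, y):
    return TOY_SCORES[(x, y)]


SCORE = aggregate_score(
    ["electrons", "atom"], ["negative charge", "nucleus"], toy_pair_score
)
```

Each concept on the first side contributes only its strongest match, so a single highly relevant fact is not diluted by many weak ones on the second side.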

Document-based Model

In this section, we describe the document-based models, which are used as baseline methods in the two tasks and further combined with the commonsense-based models described in the previous section to make the final prediction.

We use state-of-the-art document-based models on these two tasks to verify whether a strong baseline model could benefit from our pre-trained commonsense-based models. We use TriAN [Wang et al. 2018], the top-performing system in the SemEval evaluation [Ostermann et al. 2018], as the document-based model for the SemEval dataset. Since the input and output of ARC and SemEval datasets are consistent, we also apply TriAN to the ARC dataset. We find that TriAN performs comparably to a recent state-of-the-art system [Zhang et al. 2018], therefore we use it as our document-based model for ARC as well. To make this paper self-contained, we briefly describe the TriAN model in this part. Please refer to the original articles for more details.

In ARC and SemEval datasets, the task involves a passage as the evidence, a question, and several candidate answers as inputs. To select the correct answer, the model needs to comprehend each element and the interaction between them. The TriAN model [Wang et al. 2018], short for Three-way Attentive Networks, is developed to achieve this goal. The model can be roughly divided into three components, including the encoding layer, the interaction/composition layer, and the output layer.

Specifically, the representation of each word includes not only its internal embedding, which is built from word, POS and NER embeddings, but also its relevance to the words from other input sources.


The question-aware representation of a passage word is calculated with an attention function over the question words, whose attention weights depend on a learned model parameter.
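One common way to write such an attention function is sketched below, purely as an illustration. The ReLU projection, the dot-product scoring, the softmax normalization, and all names are assumptions; the text above only states that an attention function with a model parameter is used.

```python
import math


def relu_project(weight, vector):
    """Apply a linear map followed by ReLU."""
    return [max(0.0, sum(w * v for w, v in zip(row, vector))) for row in weight]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, weight):
    """Weighted sum of keys, weighted by projected query-key similarity."""
    projected_query = relu_project(weight, query)
    scores = [
        sum(a * b for a, b in zip(projected_query, relu_project(weight, key)))
        for key in keys
    ]
    alphas = softmax(scores)
    dim = len(keys[0])
    return [sum(alpha * key[d] for alpha, key in zip(alphas, keys)) for d in range(dim)]


IDENTITY = [[1.0, 0.0], [0.0, 1.0]]
# Toy passage-word embedding attending over two toy question-word embeddings.
QUESTION_AWARE = attend([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], IDENTITY)
```

With these toy inputs the passage word is most similar to the first question word, so the result leans towards that word's embedding.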


The final representation of each passage word is the concatenation of its internal embedding and its question-aware representation. Similarly, the final representation of each word in a candidate answer is the concatenation of its internal embedding, its question-aware representation, and its passage-aware representation.

Afterwards, a bidirectional LSTM is used to get the contextual vector for each word in the question, followed by a self-attention layer to get the final representation of the question, denoted as $q$.


The final representations of the candidate answer and the passage ($a$ and $p$) are obtained in the same way. The ranking score of each candidate answer is then calculated from these representations, with a sigmoid function $\sigma$ producing the final value.
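As an illustration of how a sigmoid ranking score can be computed from the three representations, the sketch below assumes a bilinear form in which the passage and the question each interact with the candidate answer. The bilinear form, the weight names `w_passage` and `w_question`, and the toy vectors are assumptions and are not taken from the text above.

```python
import math


def bilinear(left, matrix, right):
    """Compute left^T * matrix * right for plain Python lists."""
    projected = [sum(m * r for m, r in zip(row, right)) for row in matrix]
    return sum(l * p for l, p in zip(left, projected))


def sigmoid(x):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def document_score(passage, question, answer, w_passage, w_question):
    """Score a candidate answer against the passage and the question."""
    logit = bilinear(passage, w_passage, answer) + bilinear(question, w_question, answer)
    return sigmoid(logit)


IDENTITY = [[1.0, 0.0], [0.0, 1.0]]
DOC_SCORE = document_score(
    passage=[1.0, 0.0],
    question=[0.0, 1.0],
    answer=[1.0, 1.0],
    w_passage=IDENTITY,
    w_question=IDENTITY,
)
```

Because the output lies in (0, 1), it can be ranked across candidates and combined with the commonsense-based score.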



We conduct experiments on two question answering datasets, namely SemEval 2018 Task 11 [Ostermann et al. 2018] and the ARC Challenge dataset [Clark et al. 2018], to evaluate the effectiveness of our system. We report model comparisons and model analysis in this section.

| Dataset | Train | Dev | Test |
| --- | --- | --- | --- |
| SemEval | 7,731 | 1,411 | 2,797 |
| ARC | 1,119 | 299 | 1,172 |

Table 1: Data splits of SemEval and ARC datasets.

Figure 5: Examples that require commonsense-based direct relations between concepts on ARC and SemEval datasets.

Model Comparisons and Analysis

On ARC and SemEval datasets, we follow existing studies and use accuracy as the evaluation metric. Table 1 gives the data statistics of these two datasets. Table 2 and Table 3 show the results on these two datasets, respectively. On the ARC dataset, we compare our model with a list of existing systems. On the SemEval dataset, we only report the results of TriAN, which is the top-performing system in the SemEval evaluation. (During the SemEval evaluation, systems including TriAN reported results based on model pretraining on the RACE dataset [Lai et al. 2017] and system ensembles. In this work, we report numbers on SemEval without pretraining on RACE or ensembling.) $f^{dir}_{cs}$ is our commonsense-based model for direct relations, and $f^{ind}_{cs}$ represents the commonsense-based model for indirect relations. According to the experiment, the neighbor-based representation has no significant effect on the performance of the model for direct relations, so its dimension in the direct-relation based model is set to zero. From the results, we can observe that both commonsense-based scores improve the accuracy of the document-based model TriAN, and that combining both scores achieves further improvements on both datasets. The results show that our commonsense-based models are complementary to standard document-based models.

| Model | Accuracy |
| --- | --- |
| IR [Clark et al. 2018] | 20.26% |
| TupleInference [Clark et al. 2018] | 23.83% |
| DecompAttn [Clark et al. 2018] | 24.34% |
| Guess-all [Clark et al. 2018] | 25.02% |
| DGEM-OpenIE [Clark et al. 2018] | 26.41% |
| BiDAF [Clark et al. 2018] | 26.54% |
| Table ILP [Clark et al. 2018] | 26.97% |
| DGEM [Clark et al. 2018] | 27.11% |
| KG [Zhang et al. 2018] | 31.70% |
| TriAN | 31.25% |
| TriAN + $f^{dir}_{cs}$ | 32.28% |
| TriAN + $f^{ind}_{cs}$ | 32.96% |
| TriAN + $f^{dir}_{cs}$ + $f^{ind}_{cs}$ | 33.39% |

Table 2: Performances of different approaches on the ARC Challenge dataset.

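| Model | Accuracy |
| --- | --- |
| TriAN | 80.33% |
| TriAN + $f^{dir}_{cs}$ | 81.58% |
| TriAN + $f^{ind}_{cs}$ | 81.44% |
| TriAN + $f^{dir}_{cs}$ + $f^{ind}_{cs}$ | 81.80% |

Table 3: Performances of different approaches on the SemEval Challenge dataset.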

To better analyze the impact of incorporating our commonsense-based model, we give examples from ARC and SemEval datasets that are incorrectly predicted by the document-based model, but correctly solved by incorporating the commonsense-based models. Figure 5 shows two examples that require commonsense-based direct relations between concepts. The first example comes from ARC. We can see that the retrieved facts from ConceptNet provide useful evidence to connect the question to candidate answers (B) and (D). By combining with the document-based model, which might favor candidates with the co-occurring word “fur”, the final system might give a higher score to (D). The second example is from SemEval. Similarly, we can see that the retrieved facts from ConceptNet are helpful in making the correct prediction.

Figure 6 shows an example from SemEval that benefits from both direct and indirect relations from commonsense knowledge. Although both the question and candidate (A) mention “drive/driving”, the document-based model fails to make the correct prediction. We can see that the retrieved facts from ConceptNet help from different perspectives. The retrieved fact {“driving”,“HasPrerequisite”,“license”} directly connects the question to candidate (A), and both {“license”,“Synonym”,“permit”} and {“driver”,“RelatedTo”,“care”} directly connect candidate (A) to the passage. In addition, we also calculate the commonsense-based score for the question-passage pair, where the indirect relation between {“driving”,“permit”} can be further used as side information for the prediction.

Figure 6: An example from SemEval 2018 that requires sophisticated reasoning based on commonsense knowledge.

We further make comparisons by implementing different strategies to use the commonsense knowledge from ConceptNet. We implement three baselines as follows.

The first baseline is TransE [Bordes et al. 2013], which is a simple yet effective method for KB completion that learns vector embeddings for both entities and relations on a knowledge base. We re-implement and train a TransE model on ConceptNet. The commonsense-based score is calculated by a dot product between the embeddings of two concepts.

The second baseline is Pointwise Mutual Information (PMI), which has been used for commonsense inference [Lin, Sun, and Han 2017]. Both TransE and PMI can be viewed as pretrained models from ConceptNet. The difference is that PMI scores are computed directly from the co-occurrence frequency between concepts in a knowledge base, without learning an embedding vector for each concept.
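A minimal sketch of computing PMI from co-occurrence counts is shown below. How probabilities are estimated from a knowledge base is an assumption here: each listed pair counts as one co-occurrence, concept probabilities are estimated from how often a concept appears in any pair, and the two concepts of a pair are assumed to be distinct. The toy pairs are made up.

```python
import math
from collections import Counter


def pmi_table(pairs):
    """Compute PMI for every unordered concept pair observed in pairs."""
    pair_counts = Counter(frozenset(p) for p in pairs)
    concept_counts = Counter(c for p in pairs for c in p)
    total_pairs = sum(pair_counts.values())
    total_concepts = sum(concept_counts.values())
    table = {}
    for pair, count in pair_counts.items():
        a, b = tuple(pair)  # assumes the two concepts differ
        p_ab = count / total_pairs
        p_a = concept_counts[a] / total_concepts
        p_b = concept_counts[b] / total_concepts
        table[pair] = math.log(p_ab / (p_a * p_b))
    return table


# Toy co-occurrences; a real system would count them over the knowledge base.
PAIRS = [
    ("driving", "car"),
    ("driving", "road"),
    ("driving", "car"),
    ("cat", "fur"),
]
PMI = pmi_table(PAIRS)
```

In this toy data "cat" and "fur" only ever occur together, so their PMI is higher than that of "driving" and "car", even though the latter pair is seen more often.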

The third baseline is the Key-Value Memory Network (KV-MemNet) [Miller et al. 2016], which has been used in commonsense inference [Mihaylov and Frank 2018]. It first retrieves supporting evidence from an external KB, and then regards the knowledge as a memory and uses it with a key-value memory network strategy. We implement this by encoding a set of commonsense facts into a joint representation with KV-MemNet. Next, we train the document-based model enhanced by the KV-MemNet component.

| Model | ARC | SemEval |
| --- | --- | --- |
| TriAN | 31.25% | 80.33% |
| TriAN + PMI | 31.72% | 80.50% |
| TriAN + TransE | 30.59% | 80.37% |
| TriAN + KV-MemNet | 30.49% | 80.59% |
| TriAN + $f^{dir}_{cs}$ + $f^{ind}_{cs}$ | 33.39% | 81.80% |

Table 4: Performances of approaches with different strategies to use commonsense knowledge on ARC and SemEval 2018 Task 11 datasets.

From Table 4 we can see that learning direct and indirect connections based on contexts from word-level constituents and neighbors from the knowledge graph performs better than TransE, which was originally designed for KB completion. PMI performs well; however, its performance is limited by the information it can take into account, i.e. the word count information. The comparison between KV-MemNet and our approach further reveals the effectiveness of pretraining.


We analyze the wrongly predicted instances from both datasets, and summarize the majority of errors into the following groups.

The first type of error, which is also the dominant one, is caused by failing to highlight the most useful concept among all the retrieved ones. The usefulness of a concept should also be measured by its relevance to the question, its relevance to the document, and whether introducing it could help distinguish between candidate answers. For example, the question “Where was the table set” is asked based on a document talking about dinner, for which the two candidate answers are “On the coffee table” and “At their house”. Although the retrieved concepts for the first candidate answer are also relevant, they are not relevant to the question type “where”. We believe that the problem would be alleviated by incorporating a context-aware module to model the importance of a retrieved concept in a particular instance, and combining it with the pretrained model to make the final prediction.

The second type of error is caused by the ambiguity of the entity/concept to be linked to the external knowledge base. For example, supposing the document talks about computer science and machine learning, the concept “Michael Jordan” in the question should be linked to the machine learning expert rather than the basketball player. However, achieving this requires an entity/concept disambiguation model whose input also considers the question and the passage.

Moreover, the current system fails to handle difficult questions that need logical reasoning, such as “How long do the eggs cook for” and “How many people went to the movie together”. We believe that deep question understanding, such as parsing a question based on a predefined grammar and operators in a semantic parsing manner [Liang 2016], is required to handle these questions. This is a very promising direction, and we leave it to future work.

Related Work

Our work relates to the fields of question answering, the integration of knowledge base in neural network approaches for NLP tasks, and model pre-training. We will describe these directions one by one in this section.

With the revival of interest in neural network approaches, current top-performing methods on MRC datasets are dominated by neural models [Xu et al. 2017, Yu et al. 2018], some of which even achieve human-level performance on several particular datasets. Existing neural architectures typically consist of three components: the encoding layer, the interaction layer, and the output layer. The encoding layer maps tokens of an input (e.g. question or document) into the semantic vector space with word and contextual embeddings [Pennington, Socher, and Manning 2014, Peters et al. 2018]. The interaction layer models the information flow between different input sources, so that the model learns a question-aware document representation, and vice versa. The design of the output layer depends on the desired output format of the task. A pointer-based layer is typically used for detecting the starting and ending indexes in a SQuAD-like dataset, and a ranking-type layer is more suitable for the task of selecting an answer from a set of candidates. Our commonsense-based model, which is pretrained on a commonsense KB, is complementary to this line of work, and has proven effective in two question answering tasks through model combination.

Our work also relates to recent neural network approaches that incorporate side information from external and structured knowledge bases [Annervaz, Chowdhury, and Dukkipati 2018, Weissenborn, Kočiskỳ, and Dyer 2017]. Existing studies roughly fall into two groups, where the first group aims to enhance each basic computational unit (e.g. a word or a noun phrase) and the second group aims to provide external signals at the top layer before the model makes the final decision. The majority of works fall into the first group. For example, Yang and Mitchell (2017) use concepts from WordNet and NELL, and weighted average vectors of the retrieved concepts, to calculate a new LSTM state. Mihaylov and Frank (2018) retrieve relevant concepts from external knowledge for each token, and get an additional vector with a solution similar to the key-value memory network. A similar idea has also been applied to conversation generation [Zhou et al. 2018], answering complex questions [Khot, Sabharwal, and Clark 2017], task-oriented dialog systems [Madotto, Wu, and Fung 2018], text entailment [Chen et al. 2018b], language modeling [Ahn et al. 2016], etc. We believe that this line of work might perform well on a specific dataset; however, the model only learns the overlapping knowledge between the task-specific data and the external knowledge base. Thus, the model may not be easily adapted to another task/dataset where the overlap differs from the current one. Our work belongs to the second group. Lin, Sun, and Han (2017) learn the correlation between concepts with pointwise mutual information. We explore richer contexts from the relational knowledge graph with a graph-based neural network, and empirically show that the approach performs better on two question answering datasets.

Our work also relates to model pretraining in the NLP and computer vision fields [Mahajan et al. 2018]. In the NLP community, works on model pretraining can be divided into unstructured text-based and structured knowledge-based ones. Both word embedding learning algorithms [Pennington, Socher, and Manning 2014] and contextual embedding learning algorithms [Peters et al. 2018, Radford et al. 2018, Yang et al. 2018] belong to the text-based direction. Previous works on knowledge-based pretraining are typically validated on knowledge base completion or link prediction tasks [Bordes et al. 2013, Socher et al. 2013, Chen et al. 2018a]. Our work belongs to the second line. We pretrain models from a commonsense knowledge base and apply the approach to the question answering task. We believe that combining both structured knowledge graphs and unstructured texts to do model pretraining is very attractive, and we leave this for future work.


We work on commonsense-based question answering tasks in this work. We present a simple and effective way to pretrain models to measure relations between concepts. Each concept is represented based on its internal information (i.e. the words it contains) and external context (i.e. neighbors in the knowledge graph). We use ConceptNet as the external commonsense knowledge base, and apply the pretrained models to two question answering tasks (ARC and SemEval) in the same way. Results show that the pretrained models are complementary to standard document-based neural network approaches and can make further improvements through model combination. Model analysis shows that our system can discover useful evidence from an external commonsense knowledge base. In the future, we plan to address the issues raised in the discussion part, including incorporating a context-aware module for concept ranking and considering logical reasoning operations. We also plan to apply the approach to other challenging datasets that require commonsense reasoning [Zellers et al. 2018].