Abductive Reasoning as Self-Supervision for Common Sense Question Answering

09/06/2019 ∙ by Sathyanarayanan N. Aakur, et al. ∙ 0

Question answering has seen significant advances in recent times, especially with the introduction of increasingly large transformer-based models pre-trained on massive amounts of data. While these models achieve impressive results on many benchmarks, their performance appears to be proportional to the amount of training data available in the target domain. In this work, we explore the ability of current question-answering models to generalize, both to other domains and under restricted training data. We find that large amounts of training data are necessary, both for pre-training and for fine-tuning, for the models to perform well on the designated task. We introduce a novel abductive reasoning approach based on Grenander's Pattern Theory framework to provide self-supervised domain adaptation cues or "pseudo-labels," which can be used instead of expensive human annotations. The proposed self-supervised training regimen allows for effective domain adaptation without losing performance compared to fully supervised baselines. Extensive experiments on two publicly available benchmarks show the efficacy of the proposed approach. We show that neural network models trained using self-labeled data can retain up to 75% of the performance of models trained on large amounts of human-annotated training data. Code and evaluation data will be made available publicly upon acceptance.





Language models such as BERT [7] and GPT [24] have shown remarkable progress in many natural language processing tasks through extensive pre-training. They have achieved state-of-the-art performance on several natural language inference (NLI) tasks [32], even surpassing human-level performance on some benchmarks. Despite this success, the underlying task of commonsense natural language inference does not appear to be solved. There is a strong correlation between the quantity and quality of training data available to these models and their ability to achieve high accuracy. Given this dependency on large amounts of expensive, annotated data, the ability of such models to generalize to a new domain, or even to the same domain with adversarial artifacts, remains limited. The absence of “common sense knowledge”, such as knowledge about the world, concepts, and semantic relationships, prevents the models from demonstrating a complete understanding of their world and hence from behaving reasonably in unforeseen situations [13].

Figure 1: Proposed Approach: Given unlabeled, domain-specific data, we run it through a generalist, abductive reasoning framework to create noisy, self-labeled data. A domain-specific network is then trained on these labels to build specialist networks.

Motivated by the desire to address the limitations of current, highly supervised question answering models and by their success in understanding semantic entailment, we propose a novel self-supervised, abductive reasoning approach that can extract knowledge from large-scale, noisy knowledge bases such as ConceptNet [17] and provide supervision for deep learning models to perform commonsense question answering, without any human-labeled training data. In addition to performing open-domain question answering, our model can provide deep contextualized graph representations of the observed evidence for transparent, interpretable decision making.

We build upon the idea of abductive reasoning [20, 21] for knowledge distillation [buciluǎ2006model, 14] to provide an effective framework for combining prior knowledge in knowledge bases with the representation learning capability of current language models. While abductive reasoning has not been explored to a great extent in existing literature, knowledge distillation has been used to obtain significant improvements in model compression [5] and quantization [23], to name a few. Knowledge Distillation (KD) aims to transfer the dark knowledge from large, high-performing models to smaller, more compact networks. However, traditional approaches to KD still involve training the teacher models on large amounts of annotated training data. We instead aim to transfer the general, commonsense knowledge from ConceptNet to domain-specific training data through abductive reasoning.

To address these limitations, we propose a pattern theory-based abductive reasoning framework, which enables open domain question answering without using training annotations specific to any particular domain. A significant departure from current approaches to question answering, we construct a “contextualized interpretation” of the evidence (the question or context) and each of the provided hypotheses (the answer choices), expressed in a graph-like structure using pattern theory. We define an interpretation to be a connected representation that captures the semantic structure of the evidence. Similar to scene graphs [33] in images, an interpretation is a deeper, meaningful representation of observed concepts (actors, actions and actor-object interactions) as well as unobserved concepts (background knowledge of each concept used to express deeper semantics) or contextualization cues. We use such contextualized interpretations to perform “inference to the best explanation” to find the most plausible hypothesis (answer).

Contributions: We make the following contributions. We

  • address the problem of unsupervised commonsense question answering, using no human-annotated training data

  • introduce the notion of abductive reasoning to perform unsupervised commonsense question answering beyond identifying language entailment

  • show that knowledge distillation can be used to transfer knowledge encoded in large scale, general knowledge bases to train neural networks on domain-specific data

  • show that models trained using the noisy, self-labeled data can retain up to 75% of the performance of models trained on large amounts of human-annotated training data.

Further, we evaluate the performance of current, state of the art question answering models under a “resource constrained” environment. Here, we limit the amount of training data available for five (5) strong baselines and analyze the effect of the reduced training data on their performance. We find that while BERT performs well, even given as few as 100 training examples, the initialization of the weights using the pre-training does not help generalize to out of domain questions.

Related Work

Question Answering has been studied to a great extent in literature. Broadly, there are four (4) types of question answering tasks in literature, namely reading comprehension [8, 25], community question answering [27], natural language inference (NLI) [31, 34, 35] and visual question answering [2]. Approaches to each of these question-answering tasks can be classified into two broad categories: semantic similarity matching and relevance matching models. Similarity matching models compute the semantic similarity between the question and answer representations, typically using a neural network model. The answer with the highest similarity is chosen as the predicted answer choice. Some of the common similarity matching methods include BERT [7], OpenAI GPT [24], ESIM [6] and LSTM-based approaches. Other approaches represent some of the early supervised models such as Bag of Words (BoW) and FastText [15] models. The other class of approaches attempts to match answers to the question by quantifying their mutual relevance. The general framework can be described as a compare, attend, and aggregate framework [19]. The framework typically begins with a vector representation of the question and answer, computes the relevance of each part, and finally aggregates the representation for the final prediction.

Knowledge Distillation was introduced by Caruana et al. [buciluǎ2006model] and generalized by Hinton et al. [14] as a method to effectively transfer the learned knowledge from larger, more cumbersome models into smaller, more compact networks. It typically involves training the smaller network (the student) with the labels from the larger model (the teacher) presented as soft targets along with the one-hot ground truth annotations. This allows the soft labels (pseudo labels) to act as a regularizer and help the student learn more effective representations. The knowledge distillation framework has been explored for action recognition [36], quantization [23] and model compression [5], to name a few. We extend this idea by eliminating the use of ground truth targets and training exclusively with the pseudo labels as targets, along with a temperature-based cross-entropy function.

Abductive Reasoning has not been explored to a great extent in literature, especially from a computational viewpoint. Introduced by Peirce [20], abduction refers to “inference to the most plausible explanation for incomplete observations”. While it is deemed to be the source of reasoning used by humans in everyday situations [10, 1], surprisingly few computational models have been introduced. Many of these are logic-based abductive reasoning models [9, 18, 28]. Recently, Bhagavatula et al. [3] introduced the task of abductive NLI, where the abductive reasoning task is framed as question answering.

Abductive Reasoning Framework

Abductive reasoning typically involves the inference to the most plausible hypothesis that completes observed evidence. This reasoning process typically begins with a set of observations (both complete and incomplete) and attempts to find the most likely explanation for the occurrence of these observation(s). At the core of this process is the use of contextual knowledge that allows for evaluating the plausibility of each hypothesis and identifying the hypothesis with maximum evidence to support its validity.

Formally, we define the abductive reasoning process to be an optimization process that aims to find the optimal hypothesis $h^*$ that maximizes the probability of occurrence conditioned upon the observed evidence $E_t$ and contextual knowledge about the evidence, $K$. This can be expressed as the optimization

$$h^* = \arg\max_{h_i \in H} P(h_i \mid E_t, K) \qquad (1)$$

where $E_t$ represents the observed evidence from the input data at time $t$ and $H$ is the set of candidate hypotheses. This optimization involves the empirical computation of the probability of occurrence for each hypothesis $h_i$ given the contextual knowledge $K$.

As opposed to logic-based reasoning, we use natural language to express the data from the evidence and hypotheses. Hence, assigning a likelihood for any given hypothesis requires a complete understanding of the observed evidence, which requires interpreting the semantic structure that links the recognized actors, their actions, and interactions. Such understanding involves the modeling of the underlying pattern such as atomicity, regularity, and an inference methodology for using the knowledge of these fundamental properties of the pattern.

Representing Interpretations using Pattern Theory

We represent the evidence and hypothesis as an interpretation and express it in terms of Grenander’s canonical representation of general pattern theory [11]. An example interpretation is shown in Figure 2. Each of the interpretations of the observed evidence is conditioned by the contextual knowledge encoded in large-scale knowledge bases such as ConceptNet [17, 30]. The pattern theory formalism allows for a flexible, graphical representation of the observed concepts in the evidence and hypotheses. We define concepts as both the observed attributes of the evidence, such as actors, actions, objects, and the actor-object interactions, and the unobserved, contextual knowledge about these observed concepts. In the example in Figure 2, nodes in white represent the observed concepts, and the nodes in red represent the unobserved concepts.

Figure 2: An example of how raw data, in the form of natural language sentences, is expressed as a contextualized interpretation in the pattern theory framework.

Concepts as Generators. In pattern theory, concepts are represented as generators $g_i \in G_S$, where $G_S$ is the collection of all generators required to express the semantics of a given environment. Each generator represents a single, atomic element that expresses the presence of a concept in the evidence. We allow for two different types of generators based on their provenance. Grounded generators ($\underline{g}_i$) are concepts whose presence in the interpretation can be grounded to their presence in the evidence. Ungrounded generators ($\bar{g}_i$), on the other hand, are concepts that represent the essential, contextual knowledge about grounded generators. The term grounding is used to differentiate concepts based on their presence in the evidence. In Figure 2, the concepts person, instruments and music are the ungrounded generators, whereas the other concepts represent the grounded generators. While the ungrounded generators are not present in the input data, they are essential to understanding the semantic relationship between the actor (woman) and the object of interest (piano).

Expressing Associations using Bonds. Each of the concepts shares a semantic relationship with other generators. These associations can represent specific semantics such as spatial, temporal, and social, to name a few. We express these semantics in the pattern theory interpretation through links called bonds. The direction of the bonds signifies the semantics of a concept and the type of relationship shared with its bonded generator. For example, the generators piano and instruments are semantically related through the assertion that “a piano is an instrument”. The energy of a bond is used to quantify the strength of the semantic relationship expressed between two generators. The energy of a bond is given by the function:

$$a(\beta'(g_i), \beta''(g_j)) = \tanh(\phi(g_i, g_j)) \qquad (2)$$

where $\beta'$ and $\beta''$ represent the bonds from the generators $g_i$ and $g_j$, respectively; $\phi(g_i, g_j)$ is the strength of the assertion expressed in the bond. Note that we use $\tanh$ to normalize the assertion strength to range from $-1$ to $1$. The normalization range of $-1$ to $1$ also allows us to express negative assertions between two concepts that are not compatible and hence can reduce the validity of the contextualized evidence. We use the labeled assertions from ConceptNet as the source of these bonds, both for quantification and for labels.
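The bond energy described above can be sketched in a few lines of Python. This is only an illustrative sketch: the toy assertion table, the helper names `assertion_strength` and `bond_energy`, and the choice of `tanh` as the normalizer onto the range -1 to 1 are assumptions made for the example, and a real system would look the weights up in ConceptNet instead.

```python
import math

# Hypothetical toy assertions: (concept_a, concept_b) -> weight.
# These values are invented for illustration and do not come from ConceptNet.
TOY_ASSERTIONS = {
    ("piano", "instrument"): 2.0,
    ("woman", "person"): 1.5,
    ("piano", "silence"): -0.5,
}


def assertion_strength(a, b, assertions=TOY_ASSERTIONS):
    """Return the weight of the assertion linking a and b, or 0.0 if none exists."""
    return assertions.get((a, b), assertions.get((b, a), 0.0))


def bond_energy(a, b, assertions=TOY_ASSERTIONS):
    """Squash the assertion weight into (-1, 1) with tanh (assumed normalizer)."""
    return math.tanh(assertion_strength(a, b, assertions))


if __name__ == "__main__":
    print(bond_energy("piano", "instrument"))  # positive: supporting assertion
    print(bond_energy("piano", "silence"))     # negative: incompatible concepts
```

Treating a missing assertion as a weight of zero is a design choice of this sketch; it makes unrelated concept pairs contribute nothing to a configuration rather than penalizing them.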

Interpretations as Configurations. The semantics of the given evidence can be expressed through complex structures called configurations, $c$. Generators combine through their local bond structures. An example of a configuration is shown in Figure 2. Each configuration has an underlying graph topology, specified by a connector graph $\sigma \in \Sigma$, where $\Sigma$ is the set of all available connector graphs. $\sigma$ is also called the connection type and is allowed to follow the directed connections between elements of a Partially Ordered Set (POSET). A POSET prescribes a hierarchy for the relationships between the concepts in ConceptNet, with an ordering present between concepts at different levels of the hierarchy. We allow for two levels of hierarchy in the generator space: one for the concepts present in the evidence and one for those in the hypotheses. The evidence generators are one level above the hypothesis generators, implying a natural order of connection.

Formally, we define a configuration $c$ to be a connector graph $\sigma$ whose sites are populated by a collection of generators, expressed as

$$c = \sigma(g_1, g_2, \ldots, g_n); \quad g_i \in G_S \qquad (3)$$

The semantic content of the configuration is defined by the choice of the generators $g_i$. The configuration in Figure 2 can be expressed in terms of the evidence (raw data) and vice versa.

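The probability of a given configuration $c$ can be computed from the energy of the configuration, $E(c)$. The energy is defined as the negated sum of the bond energies (Equation 2) formed by the bond connections between generators in the configuration. The energy is given by

$$E(c) = -\sum_{(\beta', \beta'') \in c} a(\beta'(g_i), \beta''(g_j)) \qquad (4)$$

where $a(\beta'(g_i), \beta''(g_j))$ is the energy of the bond between generators $g_i$ and $g_j$, and the probability of the configuration is given by $P(c) \propto e^{-E(c)}$. Hence, the lower the energy of the configuration, the higher its probability.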
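To make the relationship between bond energies, configuration energy, and probability concrete, the following sketch scores two candidate configurations. It assumes that the configuration energy is the negated sum of its bond energies and that the probability is proportional to the exponential of the negative energy, so that well-supported configurations receive low energy; the function names and the numeric bond energies are hypothetical.

```python
import math


def configuration_energy(bond_energies):
    """Negated sum of bond energies: well-supported configurations get low energy."""
    return -sum(bond_energies)


def unnormalized_probability(bond_energies):
    """Probability up to a constant factor, taken as exp(-energy)."""
    return math.exp(-configuration_energy(bond_energies))


def normalized_probabilities(configs):
    """Normalize over a finite set of candidates (dict: name -> list of bond energies)."""
    scores = {name: unnormalized_probability(b) for name, b in configs.items()}
    total = sum(scores.values())
    return {name: score / total for name, score in scores.items()}


if __name__ == "__main__":
    # Hypothetical bond energies, each already normalized to (-1, 1).
    candidates = {
        "plausible": [0.9, 0.8],
        "implausible": [-0.5, 0.1],
    }
    for name, prob in normalized_probabilities(candidates).items():
        print(name, round(prob, 3))
```

Normalizing over only the candidate configurations sidesteps the intractable partition function over all possible configurations, which is sufficient when the goal is to rank a handful of answer choices.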

Knowledge Source: ConceptNet. To model the semantics of the interpretations, we propose the use of a large commonsense knowledge base as the source of knowledge about concepts and their semantic associations. ConceptNet, proposed by Liu and Singh [17] and expanded to ConceptNet5 [30], is one such knowledge base that maps concepts and their semantic associations into a large-scale, traversable semantic network. ConceptNet serves as the source of general human knowledge, encoding cross-domain semantic information in a hypergraph. Each node in the ConceptNet framework represents a concept, and nodes are connected by weighted edges that are labeled as expressed by humans in natural language.

ConceptNet contains more than 3 million concepts, extracted from a variety of sources such as DBPedia, Wiktionary, OpenCyc, and WordNet, to name a few. There are more than 25 assertions (semantic relations) connecting the concepts, with each assertion specifying and quantifying the semantic relationship between the two concepts. The weight of each edge determines the validity of the assertion based on the sources, with positive values indicating positive assertions and negative values indicating the opposite. In this work, we consider all the concepts in ConceptNet to be the generator space $G_S$, as well as the source of knowledge for quantifying the bonds between generators. Hence, the weights of the assertions are used to populate the assertion strength $\phi$ in Equation 2, which is also used to determine the validity of the contextualized evidence.

Building Contextualized Interpretations

At the core of our approach is the notion of “contextualization”. First defined by Gumperz [12], contextualization involves the use of relevant “presuppositions” from prior knowledge to maintain involvement in the current task. Here, it refers to the use of prior knowledge to aid in interpreting the observed evidence. Specifically, “presuppositions” refers to the inherent knowledge of a concept such as its properties, shared semantics with other concepts and background knowledge of concepts, their properties, and semantics. This use of prior knowledge allows us to go beyond what is observed and construct interpretations beyond simple, pairwise relationships. These presuppositions are also called “contextualization cues” and represent the “ungrounded” concept generators in the evidence interpretation.

The use of contextualization cues has two distinct advantages: (1) it allows us to capture semantic relationships among concepts whose co-occurrence has not been observed and (2) it enables us to move towards an open world paradigm and hence bypass the need for annotated training data. Formally, let concepts be represented by $g_i$ for $i = 1, \ldots, N$ and let $R$ represent relations between two concepts; then a contextualization cue, $g_k$, satisfies the expression $(g_i \, R \, g_k) \wedge (g_k \, R \, g_j)$. Hence, two concepts that do not have a direct relationship can be correlated using contextualization cues. For example, in Figure 2, the use of the contextualization cues person, music and instruments allows us to establish a semantic association between the concepts woman and piano.
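The idea of a contextualization cue as an intermediate concept can be illustrated with a small graph search. The sketch below is a simplification under stated assumptions: the edge list is a hypothetical toy graph rather than ConceptNet, relations are treated as undirected and unlabeled, and only single intermediate concepts are considered, whereas the interpretation in Figure 2 chains several cues.

```python
from collections import defaultdict

# Hypothetical toy edges, invented for illustration (not taken from ConceptNet).
TOY_EDGES = [
    ("woman", "person"),
    ("person", "piano"),
    ("piano", "instrument"),
    ("instrument", "music"),
]


def build_graph(edges):
    """Build an undirected adjacency map from a list of concept pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def has_direct_relation(graph, a, b):
    """True if the two concepts share an edge."""
    return b in graph.get(a, set())


def contextualization_cues(graph, a, b):
    """Concepts related to both a and b, which can bridge the two concepts."""
    shared = graph.get(a, set()) & graph.get(b, set())
    return sorted(shared - {a, b})


if __name__ == "__main__":
    graph = build_graph(TOY_EDGES)
    print(has_direct_relation(graph, "woman", "piano"))
    print(contextualization_cues(graph, "woman", "piano"))
```

Extending this to longer chains would amount to a bounded-depth path search between the two grounded concepts, with each intermediate node on the path becoming an ungrounded generator.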

Hence, the task of constructing the contextualized evidence becomes finding an optimal interpretation, $c^*$, given the evidence generators $G_e$, a set of hypothesis generators $G_h$, and the prior knowledge in terms of the ConceptNet graph, $K_{CN}$. We factor this probability into two parts: a likelihood term, $P(G \mid c, K_{CN})$, and a prior, $P(c \mid K_{CN})$, normalized by the distribution over the evidence, $P(G \mid K_{CN})$, where $G = G_e \cup G_h$ is the combined set of both evidence and hypothesis generators. Constructing the contextualized interpretation then becomes finding the configuration that maximizes the probability given by

$$P(c \mid G, K_{CN}) = \frac{P(G \mid c, K_{CN}) \, P(c \mid K_{CN})}{P(G \mid K_{CN})} \qquad (5)$$

This probability can be captured using energy functions:

$$P(c \mid G, K_{CN}) \propto e^{-E_g(c)} \, e^{-E_u(c)} \qquad (6)$$

where $E_g(c)$ represents the energy of the part of the configuration that involves the grounded generators and the detected concepts, while $E_u(c)$ captures the energy of the ungrounded, prior generators. The total energy of a configuration is the sum of these energies, $E(c) = E_g(c) + E_u(c)$, as defined in Equation 4. The terms $E_g(c)$ and $E_u(c)$ are computed by summing the energy of all bonds over only the grounded generators and only the ungrounded generators, respectively.

IBE: Inference to the Best Explanation

Figure 3: The proposed Abductive Reasoning Process. Given observed evidence and putative hypotheses, contextualized interpretations of the evidence are constructed. A pairwise comparison algorithm is then used to perform Inference to the Best Explanation and rank the hypotheses in terms of plausibility.

The final step in the abductive reasoning framework is inference to the best explanation. In our framework, this involves the construction of contextualized interpretations for each of the available hypotheses $h_i$ along with the observed evidence $E_t$. Once the configurations have been constructed, the validity, or rather the “plausibility”, of the hypothesis can be obtained by computing the probability of the configuration as defined in Equation 4. Finding the highest-ranking hypothesis then becomes a matter of pairwise comparison between the available set of hypotheses. We use the premise from the Bradley–Terry model [4] to obtain the outcome of the pairwise comparison between two given configurations, as illustrated in Figure 3. The pairwise comparison between two contextualized configurations $c_i$ and $c_j$ is given by

$$P(c_i > c_j) = \frac{p_i}{p_i + p_j} \qquad (7)$$

where $p_i$ is the probability of the contextualized interpretation of the evidence $E_t$ and a given hypothesis $h_i$. When performed with all available hypotheses $H$, it becomes the optimization for the inference defined in Equation 1. Any case of indifference in the outcome of this test is decided by choosing the hypothesis with the highest energy among grounded concept generators.
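The pairwise comparison step can be sketched as follows. The preference function is the standard Bradley–Terry form; the round-robin win counting in `best_hypothesis`, the hypothesis names, and the probabilities are assumptions made for this example, and the tie-breaking rule based on grounded-generator energy is deliberately left out.

```python
from itertools import combinations


def pairwise_preference(p_i, p_j):
    """Bradley-Terry probability that item i is preferred over item j.

    Assumes both scores are positive.
    """
    return p_i / (p_i + p_j)


def best_hypothesis(probabilities):
    """Round-robin comparison: return the hypothesis with the most pairwise wins.

    probabilities maps a hypothesis name to the (positive) probability of its
    contextualized interpretation. Exact ties award no win to either side.
    """
    wins = {name: 0 for name in probabilities}
    for a, b in combinations(probabilities, 2):
        preference = pairwise_preference(probabilities[a], probabilities[b])
        if preference > 0.5:
            wins[a] += 1
        elif preference < 0.5:
            wins[b] += 1
    return max(wins, key=wins.get)


if __name__ == "__main__":
    # Hypothetical interpretation probabilities for four answer choices.
    scores = {"h1": 0.10, "h2": 0.45, "h3": 0.30, "h4": 0.15}
    print(best_hypothesis(scores))
```

Because the Bradley–Terry preference is monotone in the underlying scores, the round-robin winner coincides with the highest-probability hypothesis whenever there are no ties, which is consistent with the arg max formulation of the inference.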

Knowledge Distillation for Domain Specialization

The knowledge from the abductive reasoning framework is distilled into a specialist neural network by presenting the hypothesis selected from IBE as targets for optimization. The probability of each of the hypotheses produced from the specialist network is given by

$$p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)} \qquad (8)$$

where $z_i$ represents the logits layer for the given hypothesis $h_i$ and its corresponding probability is given by $p_i$. $T$ represents the temperature parameter, which modulates the probability assigned to each of the target hypotheses. When $T \to \infty$, all hypotheses have uniform probability, and $T = 1$ represents the standard softmax function.
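A temperature-scaled softmax and the resulting cross-entropy against a pseudo-label can be written in plain Python as below. This is a minimal sketch, not the training code used in the experiments: the function names, the example logits, and the use of a single hard pseudo-label index as the target are assumptions made for illustration.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits divided by a temperature; T=1 is the standard softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    shift = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pseudo_label_loss(logits, pseudo_label, temperature=1.0):
    """Cross-entropy between the tempered prediction and a hard pseudo-label index."""
    probs = softmax_with_temperature(logits, temperature)
    return -math.log(probs[pseudo_label])


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.1, -1.0]  # hypothetical student logits for four answers
    print(softmax_with_temperature(logits, temperature=1.0))
    print(softmax_with_temperature(logits, temperature=100.0))  # close to uniform
    print(pseudo_label_loss(logits, pseudo_label=0, temperature=2.0))
```

Raising the temperature flattens the distribution, which softens the penalty when the noisy pseudo-label disagrees with the student's current prediction.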

Experimental Evaluation


We evaluate the performance of the proposed abductive reasoning approach on two different commonsense NLI datasets: SWAG [34] and HellaSWAG [35]. The use of adversarial filtering in both of these datasets ensures that the effect of annotation artifacts is reduced and hence allows us to evaluate the robustness of our approach. Additionally, the premise for the construction of both these datasets is the idea of predicting which event is most likely to occur next in a video, given an observation of the current event. This premise offers two significant challenges: (1) answering these questions goes beyond what is observed in natural language and requires reasoning across a variety of themes such as social, temporal and spatial, to name a few; and (2) the language descriptions are grounded in vision, which makes the reasoning over the language concepts more susceptible to variations in the physical world. We use the official train, dev and test splits for both datasets.

The SWAG [34] dataset consists of 113k multiple choice questions derived from captions of consecutive events of videos in the ActivityNet Captions [16] and the Large Scale Movie Description Challenge (LSMDC) [26] datasets. The videos cover a variety of domains and hence require reasoning across domains, temporal scales, and physical interactions to complete the task. Each question is accompanied by four (4) answer choices, with one being a human-verified “gold” ending and three (3) adversarial distractors.

The HellaSWAG [35] dataset is a commonsense question-answering dataset consisting of around 70k multiple-choice questions. It, like SWAG, is also grounded in vision by constructing the question-answer pairs from the captions of consecutive videos in ActivityNet. Additionally, a more challenging domain is introduced by populating question-answer pairs by completing how-to articles from WikiHow, an online how-to manual. The task is to choose the most plausible answer choice from a set of four (4) possible answer choices.

Evaluation Metrics and Baselines

We use several fully supervised and weakly supervised baselines to evaluate our approach. We ensure that we have a balanced mix of neural network-based approaches as well as classic approaches. The fully supervised baselines include GPT [24], BERT [7], fastText [15], ESIM [6] and an LSTM-based approach. For comparison with weakly supervised approaches, we evaluate models trained on the SNLI task, which produce a set of probabilities for entailment, neutral, and contradiction. A bilinear model is then trained only to convert the SNLI probabilities to answer probabilities. We essentially learn only the correlations between the entailment categories of SNLI and the probability space in the QA task. We evaluate all approaches by computing the accuracy of the predictions. We use the official scoring protocols provided by the authors of SWAG and HellaSWAG for a fair comparison with the other methods.

Quantitative Evaluation

We first evaluate our approach on the SWAG dataset and compare against fully supervised and weakly supervised baselines. The results are presented in Table 1. We show the performance of fastText to highlight the importance of commonsense knowledge and the abductive reasoning process. FastText models a given text as a bag of n-grams and predicts the probability of each ending being correct or incorrect. This approach is heavily reliant on word embeddings and does not generalize well to this task. It is also interesting to note that our approach outperforms all weakly supervised baselines, such as the dual bag of words approach and the SNLI-based approaches. These approaches are the most closely related to ours since they are not trained directly on the SWAG training split.

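| Supervision | Approach | Val. Acc. | Test Acc. |
|---|---|---|---|
| Full | fastText | 29.4 | 28.0 |
| Full | LSTM + GloVe | 43.1 | 43.6 |
| Full | DecompAttn. + GloVe | 47.4 | 47.6 |
| Full | ESIM + GloVe | 51.9 | 52.7 |
| Full | ESIM + ELMO | 59.1 | 59.2 |
| Full | OpenAI GPT | 77.6 | 77.9 |
| Full | BERT | 86.6 | 86.3 |
| Weak | DualBoW + GloVe | 34.5 | 34.7 |
| Weak | SNLI + DecompAttn. | 35.8 | 35.8 |
| Weak | SNLI + ESIM | 36.4 | 36.1 |
| None | Ours PT (No Training) | 38.4 | 38.2 |
| None | PT + BERT | 39.7 | 39.5 |

Table 1: Performance on SWAG data set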

BERT currently has the best performance on both the validation and test sets of SWAG. It should be noted that the fully supervised approaches required significantly more training, both in the form of labels and in training epochs. We can successfully transfer the knowledge onto the BERT architecture using our abductive reasoning approach with just one epoch of training, while retaining roughly 46% of the original model’s test accuracy (39.5 vs. 86.3 in Table 1) without any human annotations.

We also evaluate our approach on the tougher HellaSWAG dataset. The adversarial filtering technique on this dataset has been improved to increase the perplexity of BERT-like models on this task. The effect of this approach can be seen in Table 2. Again, we compare against the same baselines as in SWAG and find that our approach offers competitive performance relative to the fully supervised baselines. We find that using our abductive reasoning approach on BERT allows the model to retain roughly 59% of its performance (28.1 vs. 47.3 test accuracy in Table 2) as compared to a model trained directly on the annotations.

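| Supervision | Approach | Val. Acc. | Test Acc. |
|---|---|---|---|
| Full | FastText | 30.9 | 31.6 |
| Full | ESIM + ELMO | 33.6 | 33.3 |
| Full | LSTM + GloVe | 31.9 | 31.7 |
| Full | OpenAI GPT | 41.9 | 41.7 |
| Full | BERT | 46.7 | 47.3 |
| None | PT + BERT | 27.8 | 28.1 |
| None | Ours PT (No Training) | 28.3 | 26.7 |

Table 2: Performance on HellaSWAG data set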

It should be noted that the use of ConceptNet-based contextualization on HellaSWAG has one significant drawback: the answers are too similar in their use of semantics, and hence the potential for indifference goes up. We observe that the number of examples where the second-best hypothesis had the same energy as the top hypothesis was when the correct answer was in the top 2. This indifference could, arguably, be diminished by grounding the concepts in vision or other modalities. This is not an unreasonable assumption since the pre-training data used in BERT contains articles from Wikipedia which have vision-based explanations which constrain the semantics to those observed in vision.

We evaluate the performance of current QA systems under a low-resource environment. We limit the amount of training data available to these models for a fair comparison with our approach. The results are shown in Figure 4. We plot the accuracy of the approaches on the SWAG validation data against the number of training samples available to them. We present the average result over multiple runs with randomly sampled examples from the training data. We can see that most of the current approaches perform well only when presented with increasingly large amounts of data. BERT seemingly performs well when presented with as few as 100 training samples.

Figure 4: Comparison of current QA models’ performance as a function of number of available training questions under a low resource setting on the SWAG dataset.
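The low-resource protocol (train on a random subset of a given size, repeat over several random draws, and average the accuracy) can be sketched as below. Everything here is an assumption made for illustration: the helper names, the use of the run index as the random seed, and in particular the `train_and_eval` callback, which stands in for fine-tuning a model on the subset and returning its validation accuracy.

```python
import random
from statistics import mean


def subsample(examples, n, seed):
    """Draw n training examples without replacement, reproducibly for a given seed."""
    rng = random.Random(seed)
    return rng.sample(examples, min(n, len(examples)))


def average_accuracy(examples, n, num_runs, train_and_eval):
    """Average the accuracy returned by train_and_eval over several random subsets.

    train_and_eval is a hypothetical callback: it receives a list of training
    examples and returns a validation accuracy as a float.
    """
    return mean(train_and_eval(subsample(examples, n, seed)) for seed in range(num_runs))


if __name__ == "__main__":
    # Stand-in data and a dummy "model" whose accuracy grows with subset size.
    training_set = list(range(1000))
    dummy_eval = lambda subset: 0.25 + 0.5 * len(subset) / len(training_set)
    for size in (100, 500, 1000):
        print(size, average_accuracy(training_set, size, num_runs=3, train_and_eval=dummy_eval))
```

Fixing the seed per run keeps the sampled subsets identical across models, so differences in the curve reflect the models rather than the particular examples drawn.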

However, given their ability to be rapid surface learners [35], we evaluate the ability of these models to generalize to new domains by training on the SWAG data and evaluating on the HellaSWAG validation data. As can be seen from Table 3, there is a significant gap in the generalization ability of BERT from SWAG to HellaSWAG, especially in a low-resource setting. We need at least 10,000 training examples on SWAG for it to outperform our model on HellaSWAG, although both datasets are derived from the same source.

| Number of Samples | Approach | Accuracy |
|---|---|---|
| 100 | BERT | 23.4 |
| 100 | ESIM + ELMO | 19.6 |
| 1,000 | BERT | 26.1 |
| 1,000 | ESIM + ELMO | 20.3 |
| 5,000 | BERT | 29.0 |
| 5,000 | ESIM + ELMO | 20.7 |
| 10,000 | BERT | 30.7 |
| 10,000 | ESIM + ELMO | 21.5 |
| 25,000 | BERT | 32.8 |
| 25,000 | ESIM + ELMO | 23.9 |
| All (73,000) | BERT | 34.6 |
| All (73,000) | ESIM + ELMO | 26.7 |
| None | Our Approach | 28.3 |

Table 3: Generalization from SWAG to new domains and vocabulary introduced in HellaSWAG

We also perform ablative studies to evaluate our approach by varying the student model and the source of semantic knowledge. We train two student networks other than BERT. We use two strong models as student networks, namely ESIM and Unary LSTM models, which achieve and respectively on SWAG. We also vary the source of semantic knowledge by using GloVe [22] representations and ConceptNet NumberBatch [29] and using dot-product to compute the likelihood of co-occurrence of two concepts. We also test the effectiveness of using contextualized interpretations by relying only on the direct semantic relationships between two concepts in the ConceptNet semantic network. As can be seen from Table 4, the use of ConceptNet is essential for robust abductive reasoning.

| Approach | Val. Acc. |
|---|---|
| Ours PT (NB only) | 25.9 |
| Ours PT (GloVe only) | 26.3 |
| Ours PT (GloVe + NB) | 28.1 |
| Ours PT (No Contextualization) | 33.6 |
| Ours PT (Full Model) | 38.4 |
| PT + LSTM + GloVe | 32.4 |
| PT + ESIM + GloVe | 39.4 |
| PT + BERT | 39.7 |

Table 4: Ablative studies for our model, where we compare against variations of our approach. We compare different sources of knowledge and different student networks.

Representations learned from pre-computed embeddings such as GloVe or Numberbatch, used without ConceptNet, do not generalize to QA with adversarial filtering. Additionally, the use of contextualization to construct interpretations has a positive effect on the robustness of our approach. We see an improvement of 4.8 absolute percentage points in accuracy (from 33.6 to 38.4 in Table 4) through the use of contextualization. We do not see any increase in performance by combining all three sources of knowledge.

Discussion and Future Work

We present one of the first works on abductive reasoning for commonsense question answering. We show that a global source of knowledge can be used to distill commonsense knowledge and reasoning into neural networks without large amounts of annotations. We demonstrate the use of pattern theory to express the semantics of the evidence as a highly interpretable, contextualized interpretation for validating the plausibility of natural language expressions without training highly expensive models. Extensive experiments demonstrate the applicability of the approach to different domains and its highly competitive performance. In future work, we aim to ground the contextualized interpretations in vision to enable automatic discovery of concepts without the need for relearning or large amounts of training data.


  • [1] A. Aliseda (2006) Abductive reasoning. Vol. 330, Springer. Cited by: Related Work.
  • [2] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh (2015) Vqa: visual question answering. In Proceedings of the IEEE international conference on computer vision, pp. 2425–2433. Cited by: Related Work.
  • [3] C. Bhagavatula, R. L. Bras, C. Malaviya, K. Sakaguchi, A. Holtzman, H. Rashkin, D. Downey, S. W. Yih, and Y. Choi (2019) Abductive commonsense reasoning. arXiv preprint arXiv:1908.05739. Cited by: Related Work.
  • [4] R. A. Bradley and M. E. Terry (1952) Rank analysis of incomplete block designs: i. the method of paired comparisons. Biometrika 39 (3/4), pp. 324–345. Cited by: IBE: Inference to the Best Explanation.
  • [5] G. Chen, W. Choi, X. Yu, T. Han, and M. Chandraker (2017) Learning efficient object detection models with knowledge distillation. In Advances in Neural Information Processing Systems, pp. 742–751. Cited by: Introduction, Related Work.
  • [6] Q. Chen, X. Zhu, Z. Ling, D. Inkpen, and S. Wei (2017) Neural natural language inference models enhanced with external knowledge. arXiv preprint arXiv:1711.04289. Cited by: Related Work, Evaluation Metrics and Baselines.
  • [7] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: Introduction, Related Work, Evaluation Metrics and Baselines.
  • [8] D. Dua, Y. Wang, P. Dasigi, G. Stanovsky, S. Singh, and M. Gardner (2019) DROP: a reading comprehension benchmark requiring discrete reasoning over paragraphs. arXiv preprint arXiv:1903.00161. Cited by: Related Work.
  • [9] C. Elsenbroich, O. Kutz, and U. Sattler (2006) A case for abductive reasoning over ontologies.. In OWLED, Vol. 216. Cited by: Related Work.
  • [10] H. R. Fischer (2001) Abductive reasoning as a way of worldmaking. Foundations of Science 6 (4), pp. 361–383. Cited by: Related Work.
  • [11] U. Grenander (1996) Elements of pattern theory. JHU Press. Cited by: Representing Interpretations using Pattern Theory.
  • [12] J. J. Gumperz (1992) Contextualization and understanding. Rethinking context: Language as an interactive phenomenon 11, pp. 229–252. Cited by: Building Contextualized Interpretations.
  • [13] D. Gunning (2018) Machine common sense concept paper. arXiv preprint arXiv:1810.07528. Cited by: Introduction.
  • [14] G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: Introduction, Related Work.
  • [15] A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov (2016) Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759. Cited by: Related Work, Evaluation Metrics and Baselines.
  • [16] R. Krishna, K. Hata, F. Ren, L. Fei-Fei, and J. Carlos Niebles (2017) Dense-captioning events in videos. In Proceedings of the IEEE international conference on computer vision, pp. 706–715. Cited by: Data.
  • [17] H. Liu and P. Singh (2004) ConceptNet—a practical commonsense reasoning tool-kit. BT technology journal 22 (4), pp. 211–226. Cited by: Introduction, Representing Interpretations using Pattern Theory, Representing Interpretations using Pattern Theory.
  • [18] J. Meheus and D. Batens (2006) A formal logic for abductive reasoning. Logic Journal of IGPL 14 (2), pp. 221–236. Cited by: Related Work.
  • [19] A. P. Parikh, O. Täckström, D. Das, and J. Uszkoreit (2016) A decomposable attention model for natural language inference. arXiv preprint arXiv:1606.01933. Cited by: Related Work.
  • [20] C. S. Peirce (1931) Collected papers of charles sanders peirce. Harvard University Press. Cited by: Introduction, Related Work.
  • [21] C. S. Peirce (1965) Pragmatism and pragmaticism. Vol. 5, Belknap Press of Harvard University Press. Cited by: Introduction.
  • [22] J. Pennington, R. Socher, and C. Manning (2014) Glove: global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543. Cited by: Quantitative Evaluation.
  • [23] A. Polino, R. Pascanu, and D. Alistarh (2018) Model compression via distillation and quantization. arXiv preprint arXiv:1802.05668. Cited by: Introduction, Related Work.
  • [24] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. OpenAI Blog 1 (8). Cited by: Introduction, Related Work, Evaluation Metrics and Baselines.
  • [25] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250. Cited by: Related Work.
  • [26] A. Rohrbach, A. Torabi, M. Rohrbach, N. Tandon, C. Pal, H. Larochelle, A. Courville, and B. Schiele (2017) Movie description. International Journal of Computer Vision 123 (1), pp. 94–120. Cited by: Data.
  • [27] A. Rücklé, N. S. Moosavi, and I. Gurevych (2019) COALA: a neural coverage-based approach for long answer selection with small data. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 6932–6939. Cited by: Related Work.
  • [28] P. Singla and R. J. Mooney (2011) Abductive markov logic for plan recognition. In Twenty-Fifth AAAI Conference on Artificial Intelligence, Cited by: Related Work.
  • [29] R. Speer, J. Chin, and C. Havasi (2017) Conceptnet 5.5: an open multilingual graph of general knowledge. In Thirty-First AAAI Conference on Artificial Intelligence, Cited by: Quantitative Evaluation.
  • [30] R. Speer and C. Havasi (2013) ConceptNet 5: a large semantic network for relational knowledge. In The People’s Web Meets NLP, pp. 161–176. Cited by: Representing Interpretations using Pattern Theory, Representing Interpretations using Pattern Theory.
  • [31] A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman (2019) Superglue: a stickier benchmark for general-purpose language understanding systems. arXiv preprint arXiv:1905.00537. Cited by: Related Work.
  • [32] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman (2018) Glue: a multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461. Cited by: Introduction.
  • [33] D. Xu, Y. Zhu, C. B. Choy, and L. Fei-Fei (2017) Scene graph generation by iterative message passing. arXiv preprint arXiv:1701.02426. Cited by: Introduction.
  • [34] R. Zellers, Y. Bisk, R. Schwartz, and Y. Choi (2018) Swag: a large-scale adversarial dataset for grounded commonsense inference. arXiv preprint arXiv:1808.05326. Cited by: Related Work, Data, Data.
  • [35] R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi (2019) HellaSwag: can a machine really finish your sentence?. arXiv preprint arXiv:1905.07830. Cited by: Related Work, Data, Data.
  • [36] B. Zhang, L. Wang, Z. Wang, Y. Qiao, and H. Wang (2016) Real-time action recognition with enhanced motion vector cnns. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2718–2726. Cited by: Related Work.