
Can Language Models perform Abductive Commonsense Reasoning?

Abductive reasoning is the task of inferring the most plausible hypothesis given a set of observations. In the literature, the community has approached this challenge by classifying or generating a likely hypothesis that does not contradict a past observation and a future observation. Two of the best-known benchmarks for this problem are aNLI and aNLG (pronounced alpha-NLI and alpha-NLG). In this report, I review some of the methodologies proposed for this challenge, re-implement the baseline models, and analyze some of the weaknesses of current approaches. The code and the re-implemented results are available at this link.



1 Introduction

Reasoning can be divided into three categories: (1) abduction is the process of both generating hypotheses and selecting some for further pursuit; (2) deduction draws out their testable consequences; (3) induction evaluates them (Kapitan, 1992). People unconsciously and constantly perform these three types of reasoning in their daily lives. Coupled with abstraction capabilities, reasoning is what allows us to be intelligent (Chollet, 2019). For the grand goal of building Artificial General Intelligence (AGI), reasoning is one of the biggest obstacles and must be fully understood, yet comparatively little progress has been made.

Within the research community, there have been various attempts to build systems that mimic the reasoning capabilities of humans. More specifically, researchers have paid a lot of attention to commonsense reasoning, a field that deals with the knowledge and inferences we encounter in everyday situations. One line of work builds benchmarks that test whether a system possesses a particular type of commonsense reasoning (Qin et al., 2019; Zhang et al., 2020; Bisk et al., 2020; Sap et al., 2019; Zellers et al., 2019; Zhou et al., 2019; Talmor et al., 2019). Another builds methodologies that solve these tasks efficiently (Klein and Nabi, 2019; Wang et al., 2020; Rajani et al., 2019; Majumder et al., 2020).

ART is a large-scale benchmark dataset that specifically tackles abductive commonsense reasoning (Bhagavatula et al., 2019). In this setting, two text inputs are given: (1) an observation made in the past and (2) an observation made in the future. Given two candidate hypotheses, the goal of the aNLI task is to choose the more plausible hypothesis for what occurred between the past and future observations, while the goal of aNLG is to generate this hypothesis. For example, when a person locked all the windows and doors before leaving the house and later found the house open with all the furniture broken, it is more likely that a thief broke into the house than that a bird flew through it and made a mess.

Generating a good hypothesis is considered more challenging than selecting the better of two hypotheses, since the model has to generate a plausible text output based on the commonsense knowledge it has and align it with the given observations. It can be viewed as a more general form of abductive reasoning. Of the two tasks, I focus on the generative task (aNLG) in this report.

2 Related Works and Methods

2.1 Baseline Model

As a baseline, Bhagavatula et al. (2019) proposed a model that generates the hypothesis conditioned on the past and future observations, which are simply concatenated one after the other and delimited with the special tokens <o1>, </o1>, <o2>, </o2>. The loss is the negative log-likelihood, as in a typical text generation setting.
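
Written out, the objective takes the following form. This is a reconstruction under assumed notation: $w^h_1 \dots w^h_N$ are the hypothesis tokens, and $w^{o_1}_1 \dots w^{o_1}_m$ and $w^{o_2}_1 \dots w^{o_2}_n$ are the tokens of the past and future observations. The symbols are illustrative and may differ from those used by Bhagavatula et al. (2019).

```latex
\mathcal{L} = -\sum_{i=1}^{N} \log P\left(w^h_i \mid w^h_{<i},\; w^{o_1}_1 \dots w^{o_1}_m,\; w^{o_2}_1 \dots w^{o_2}_n\right) \tag{1}
```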


In this report, I re-implemented the inference procedure for this model and compare the results with the scores reported by Bhagavatula et al. (2019).
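
The input format described above can be sketched as follows. The helper name `build_input` and the exact spacing around the special tokens are assumptions made for illustration; the actual preprocessing in the original implementation may differ.

```python
# Hypothetical helper: the function name and the whitespace around the
# special tokens are assumptions, not taken from the original code.
def build_input(o1: str, o2: str) -> str:
    """Wrap each observation in its special tokens and concatenate them."""
    return f"<o1> {o1} </o1> <o2> {o2} </o2>"


if __name__ == "__main__":
    # The language model would be asked to continue this string with a hypothesis.
    print(build_input("Nikki wanted candy.", "Nikki was very mad at her mother."))
```

The fine-tuned model then generates the hypothesis as a continuation of this string, and only the hypothesis tokens enter the loss.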

2.2 Using a Commonsense Inference Model for assistance

Bhagavatula et al. (2019) proposed another model that additionally conditions on background knowledge. Specifically, the authors used COMET (Bosselut et al., 2019), a commonsense knowledge inference model. COMET is a transformer-based model that generates a commonsense inference when given a narrative text and a relation (e.g., xWant, xNeed, xIntent). With the additional background knowledge, the loss can be formalized as follows.
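
A reconstruction of this objective under assumed notation: $\mathcal{K}$ stands for the background knowledge, $w^h_1 \dots w^h_N$ for the hypothesis tokens, and $w^{o_1}_1 \dots w^{o_1}_m$ and $w^{o_2}_1 \dots w^{o_2}_n$ for the tokens of the past and future observations. The symbols are illustrative and may differ from those used by Bhagavatula et al. (2019).

```latex
\mathcal{L} = -\sum_{i=1}^{N} \log P\left(w^h_i \mid w^h_{<i},\; w^{o_1}_1 \dots w^{o_1}_m,\; w^{o_2}_1 \dots w^{o_2}_n,\; \mathcal{K}\right) \tag{2}
```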


The additional commonsense knowledge can be provided in two ways. First, the text output of the commonsense inference model can be provided explicitly as additional input, which the text generation model can potentially use to generate a plausible hypothesis. Second, the COMET output embeddings of the commonsense inferences can be given as additional input by concatenating them with the token embeddings of the observations. Bhagavatula et al. (2019) append eighteen additional embeddings (nine relations for each of the two observations) and explain that this allows the model to learn each token’s representation while attending to the COMET embeddings, effectively integrating background commonsense knowledge into a language model.
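
A minimal sketch of where the eighteen knowledge slots come from, and of how the text variant could be assembled. The `comet` argument is a hypothetical stand-in for a real COMET inference call, and the ordering and formatting of the concatenated text are assumptions rather than the original implementation.

```python
from typing import Callable, List, Tuple

# The nine COMET relations listed in Section 3 of this report.
RELATIONS = ["oEffect", "oReact", "oWant", "xAttr", "xEffect",
             "xIntent", "xNeed", "xReact", "xWant"]


def knowledge_slots(o1: str, o2: str) -> List[Tuple[str, str]]:
    """Enumerate one (observation, relation) slot per relation and observation."""
    return [(obs, rel) for obs in (o1, o2) for rel in RELATIONS]


def comet_txt_input(o1: str, o2: str, comet: Callable[[str, str], str]) -> str:
    """Text variant: place the textual inferences before the tagged observations.

    `comet` is a hypothetical callable (observation, relation) -> inference text.
    """
    inferences = [f"{rel}: {comet(obs, rel)}" for obs, rel in knowledge_slots(o1, o2)]
    return " ".join(inferences) + f" <o1> {o1} </o1> <o2> {o2} </o2>"


if __name__ == "__main__":
    stub = lambda obs, rel: "none"  # placeholder inference, not real COMET output
    print(len(knowledge_slots("past", "future")))  # nine relations x two observations
    print(comet_txt_input("past", "future", stub))
```

In the embedding variant, each of these slots would contribute one COMET embedding instead of a text span.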

In this report, I re-implemented the inference procedure for this model and compare the results with the scores reported by Bhagavatula et al. (2019).

2.3 Enhanced Decoding Methods

Fine-tuning a pre-trained language model on a particular task has been successful in many subfields of Natural Language Processing. However, supervised approaches still perform considerably worse than humans and are prone to developing superficial strategies, such as repeatedly generating the same phrase or memorizing surface patterns that are prevalent in the dataset. A number of studies tackle this problem with more advanced text decoding methods.

Qin et al. (2020) proposed DELOREAN, a decoding method suited to abductive and counterfactual reasoning. Instead of simply concatenating the past and future observations as input, as in Equations 1 and 2, DELOREAN first generates the hypothesis based only on the past observation. Then, conditioning on the past observation and the hypothesis candidate, the model generates a plausible future observation. Note that although a ground-truth future observation exists, a new future observation is generated in addition. The cross-entropy loss between the ground-truth future observation and the newly generated one is backpropagated to compute gradients with respect to the hypothesis. Lastly, the model samples tokens step by step using the logits of both the forward pass and the backward pass. DELOREAN is therefore a backprop-based decoding algorithm.
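
The last step, combining forward-pass and backward-pass logits for one token position, can be illustrated with a toy sketch. The linear blend and the parameter name `weight` are assumptions made for illustration; the logits are made-up numbers, and the sketch omits the language model and the gradient computation entirely.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_logits(forward: List[float], backward: List[float], weight: float) -> List[float]:
    """Blend forward and backward logits; `weight` is the share of the forward pass."""
    return [weight * f + (1.0 - weight) * b for f, b in zip(forward, backward)]


forward = [2.0, 0.5, 0.1]   # toy logits: token 0 is fluent given the past observation
backward = [0.0, 3.0, 0.1]  # toy logits: token 1 agrees with the future observation
mixed = mix_logits(forward, backward, weight=0.5)
probs = softmax(mixed)
best = max(range(len(probs)), key=probs.__getitem__)

if __name__ == "__main__":
    print(mixed, best)
```

With equal weighting, the token favoured by the backward pass wins in this toy case, which is the intended effect: the future observation can override a choice that is fluent only with respect to the past.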

More recently, Qin et al. (2022) proposed COLD, a different decoding algorithm designed for abductive reasoning, counterfactual reasoning, and constrained generation. COLD treats text generation as sampling from an energy-based distribution (Hinton, 2002; LeCun et al., 2006), which allows constraints to be composed flexibly for the task at hand. It uses Langevin dynamics to draw continuous samples, which are then mapped into discrete, fluent text. The sampling is guided by the gradient of the energy function, which can be designed on the fly for the constraints of the underlying task, without any fine-tuning.
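
The Langevin update itself is simple: step against the gradient of the energy, then add Gaussian noise. The sketch below applies it to a one-dimensional toy energy. It is not COLD, which operates on sequences of token logits with task-specific energy terms; the function names, step size, and energy are all illustrative assumptions.

```python
import random
from typing import Callable


def langevin_step(y: float, grad_energy: Callable[[float], float],
                  step_size: float, noise_std: float, rng: random.Random) -> float:
    """One Langevin update: move against the energy gradient, then add noise."""
    return y - step_size * grad_energy(y) + rng.gauss(0.0, noise_std)


def sample(grad_energy: Callable[[float], float], y0: float, steps: int,
           step_size: float, noise_std: float, seed: int = 0) -> float:
    """Run a fixed number of Langevin updates from the starting point y0."""
    rng = random.Random(seed)
    y = y0
    for _ in range(steps):
        y = langevin_step(y, grad_energy, step_size, noise_std, rng)
    return y


def grad(y: float) -> float:
    """Gradient of the toy energy E(y) = 0.5 * (y - 3)^2."""
    return y - 3.0


if __name__ == "__main__":
    print(sample(grad, y0=0.0, steps=200, step_size=0.1, noise_std=0.0))
    print(sample(grad, y0=0.0, steps=200, step_size=0.1, noise_std=0.05))
```

With the noise switched off the iterate simply descends to the energy minimum; with noise it fluctuates around it, which is what makes the procedure a sampler rather than an optimizer.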

In this report, I did not re-implement these methods, but compare with the scores reported by Qin et al. (2020, 2022).

| Model | BLEU-4 | METEOR | ROUGE-L | CIDEr | BERTScore |
|---|---|---|---|---|---|
| GPT2 (unsupervised) (Bhagavatula et al., 2019) | 0.00 | 9.29 | 9.99 | 3.34 | 36.69 |
| GPT2 + DELOREAN (unsupervised) (Qin et al., 2020) | 1.60 | / | 19.06 | 7.88 | 41.74 |
| GPT2 + COLD (unsupervised) (Qin et al., 2022) | 1.79 | / | 19.50 | 10.68 | 42.67 |
| GPT2 (supervised) (Bhagavatula et al., 2019) | 2.23 | 16.71 | 22.83 | 33.54 | 48.74 |
| GPT2 (supervised; re-implemented) | 3.06 | 18.61 | 24.49 | 33.42 | 48.70 |
| GPT2 + COMET-txt (Bhagavatula et al., 2019) | 2.29 | 16.73 | 22.51 | 31.99 | 48.46 |
| GPT2 + COMET-txt (re-implemented) | 3.13 | 18.63 | 24.50 | 32.66 | 48.50 |
| GPT2 + COMET-emb (Bhagavatula et al., 2019) | 3.03 | 17.66 | 22.93 | 32.00 | 48.52 |
| GPT2 + COMET-emb (re-implemented) | 4.13 | 19.87 | 23.81 | 30.86 | 48.32 |

Table 1: Automatic evaluation on the aNLG benchmark in terms of different metrics

3 Details of re-Implementation

I have used the implementation provided by AllenAI to reproduce the results and test the fine-tuned models.

I have used GPT-2 XL, a 1.5B-parameter model, as the baseline model for the re-implementation. The inference procedure uses beam-search-based decoding with a temperature of 1.0 and a top_p value of 0.9. For the COMET-txt and COMET-emb models, the COMET relations oEffect, oReact, oWant, xAttr, xEffect, xIntent, xNeed, xReact, and xWant were used. More details of the fine-tuning procedure are presented in Bhagavatula et al. (2019).
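
For readers unfamiliar with the top_p setting, the sketch below shows what a top_p of 0.9 means for a single next-token distribution: only the smallest set of most probable tokens whose cumulative probability reaches 0.9 remains eligible. The function is a simplified illustration written for this report, not the code of the AllenAI implementation, and how that implementation combines top_p with beam search is not shown here. A temperature of 1.0 leaves the distribution unchanged.

```python
from typing import List


def top_p_indices(probs: List[float], top_p: float) -> List[int]:
    """Return the smallest set of token indices whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept: List[int] = []
    cumulative = 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept


if __name__ == "__main__":
    toy_distribution = [0.5, 0.3, 0.15, 0.05]  # made-up next-token probabilities
    print(top_p_indices(toy_distribution, top_p=0.9))
```

In this toy distribution the least likely token is excluded, because the three most likely tokens already cover more than 0.9 of the probability mass.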

4 Evaluation of re-Implementation

The results of the re-implemented versions and the scores reported in the corresponding papers are shown in Table 1. Since it is not fair to compare supervised and unsupervised approaches directly, the two groups are compared separately.

The enhanced decoding approaches perform substantially better than the unsupervised GPT2 baseline, roughly doubling ROUGE-L (from 9.99 to 19.06 and 19.50) and more than doubling CIDEr (from 3.34 to 7.88 and 10.68). However, their performance is still not comparable with that of the fine-tuning methods, implying that improved decoding strategies alone cannot yet substitute for fine-tuning. This leaves open the question of whether a decoding-based strategy could replace the current fine-tuning approaches for abductive reasoning in future work.
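
The size of the gain can be checked directly against the unsupervised rows of Table 1. The snippet below is only a convenience calculation over the reported numbers.

```python
# Unsupervised rows of Table 1, restricted to (ROUGE-L, CIDEr).
baseline = (9.99, 3.34)  # GPT2 (unsupervised)
decoders = {"DELOREAN": (19.06, 7.88), "COLD": (19.50, 10.68)}

# Ratio of each decoding method's score to the unsupervised baseline.
ratios = {
    name: (rouge_l / baseline[0], cider / baseline[1])
    for name, (rouge_l, cider) in decoders.items()
}

if __name__ == "__main__":
    for name, (rouge_ratio, cider_ratio) in ratios.items():
        print(f"{name}: ROUGE-L x{rouge_ratio:.2f}, CIDEr x{cider_ratio:.2f}")
```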

Comparing the re-implemented versions with the reported scores of the supervised methods, the re-implemented versions achieved better scores in 11 of the 15 model-metric pairs. This suggests that my re-implementation of inference with the fine-tuned models was successful. Also, the method that reaches the highest score differs from metric to metric, implying that adding COMET text or embeddings is not very effective compared to the baseline model. This leaves room for future research on how additional knowledge could be injected for abductive reasoning.
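
The count of 11 out of 15 can be reproduced from the supervised rows of Table 1. The snippet below is a convenience check over the reported numbers.

```python
# Supervised rows of Table 1, one list per model in the metric order
# BLEU-4, METEOR, ROUGE-L, CIDEr, BERTScore.
reported = {
    "GPT2": [2.23, 16.71, 22.83, 33.54, 48.74],
    "GPT2 + COMET-txt": [2.29, 16.73, 22.51, 31.99, 48.46],
    "GPT2 + COMET-emb": [3.03, 17.66, 22.93, 32.00, 48.52],
}
reimplemented = {
    "GPT2": [3.06, 18.61, 24.49, 33.42, 48.70],
    "GPT2 + COMET-txt": [3.13, 18.63, 24.50, 32.66, 48.50],
    "GPT2 + COMET-emb": [4.13, 19.87, 23.81, 30.86, 48.32],
}

# Number of metrics on which the re-implementation beats the reported score.
wins = {
    model: sum(new > old for old, new in zip(reported[model], reimplemented[model]))
    for model in reported
}
total_wins = sum(wins.values())
total_pairs = sum(len(scores) for scores in reported.values())

if __name__ == "__main__":
    print(wins, f"{total_wins} of {total_pairs}")
```

The four pairs where the reported score is higher are CIDEr and BERTScore for the plain GPT2 model and for GPT2 + COMET-emb.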

5 Analysis on Failure Cases

In this section, we look at failure cases that the models produced on the aNLG task. Some of the failure cases are presented in Table 2.

Past Observation: I really love to play video games.
Future Observation: I couldn’t play video games until I bought a new one.
Answer Hypothesis: Sadly my playstation broke after a flood.
GPT2 (supervised): I had to quit playing my favorite video games after a week.
GPT2 + COMET-txt: I had to order new games from my local game store.
GPT2 + COMET-emb: I had a cracked computer.

Past Observation: Amy had a roommate named Sue.
Future Observation: They didn’t speak to each other for weeks.
Answer Hypothesis: Sue was very messy.
GPT2 (supervised): Amy asked Sue to come to class on time.
GPT2 + COMET-txt: Amy asked Sue to come to work on time, but she arrived late.
GPT2 + COMET-emb: Sue had to go to work on time.

Past Observation: Sam always wanted to save up and buy a computer.
Future Observation: Sam was devastated when he came home to an empty box.
Answer Hypothesis: The computer was delivered on the front porch.
GPT2 (supervised): Sam didn’t save enough money for the computer.
GPT2 + COMET-txt: Sam didn’t save enough money.
GPT2 + COMET-emb: Sam didn’t save up enough money to but one.

Past Observation: In Fort Worth, we have an event call hit the bricks.
Future Observation: However, thanks to my training the run was easy.
Answer Hypothesis: We had to run several miles.
GPT2 (supervised): I trained myself for the run.
GPT2 + COMET-txt: I trained myself to run very fast.
GPT2 + COMET-emb: I took a deep learning class to teach myself how to run.

Past Observation: Nikki wanted candy.
Future Observation: Nikki was very mad at her mother.
Answer Hypothesis: Her mother wouldn’t buy candy because it’s bad for Nikki’s teeth.
GPT2 (supervised): Nikki got candy from her mom.
GPT2 + COMET-txt: Nikki got out of the deli.
GPT2 + COMET-emb: She went to the store for some candy.

Past Observation: The cat sunned itself where the light passed through the window.
Future Observation: The cat slept there fore the next two hours.
Answer Hypothesis: The cat made it self comfortable in the sunlight.
GPT2 (supervised): The cat was sleeping in a shady spot.
GPT2 + COMET-txt: The cat was sleeping in a hard pile.
GPT2 + COMET-emb: The cat always woke up in the dark.

Table 2: Failure examples of generated hypotheses from GPT2 and its variants

5.1 Language Models do not capture the Causal Relationship between events

In the majority of the failure cases, the language models failed to capture the causal relationship between events. For example, given the past observation ‘Sam always wanted to save up and buy a computer’ and the future observation ‘Sam was devastated when he came home to an empty box’, we could easily suspect that Sam saved money and bought a computer, but was scammed and got an empty box. This reasoning chain requires 4 hops in total, which is very challenging for a language model to learn. Indeed, GPT2, GPT2 + COMET-txt, and GPT2 + COMET-emb all generated a hypothesis along the lines of ‘Sam didn’t save enough money’.

To overcome this issue, more advanced learning methods that incorporate causal reasoning will likely have to be developed. Simply showing the answer hypothesis is not enough to force the model to learn the underlying mapping, because deep learning methods are well known to learn shortcuts (Geirhos et al., 2020). Although increasing model size and data size is the dominant trend in the Natural Language Processing community (Brown et al., 2020), there is room for improvement through reasoning-based learning methods (Luo et al., 2020). Simply adding the text outputs or embeddings of commonsense inference models did not make the models capture the causal relationships.

5.2 Language Models are weak in Negation Logic

Negation is an important property in many language understanding tasks, such as sentiment analysis, question answering, knowledge base completion, and natural language inference (Hosseini et al., 2021). However, a number of examples generated by GPT2 did not seem to capture the crucial negation logic between the past and future observations. For example, given the past observation ‘Nikki wanted candy’ and the future observation ‘Nikki was very mad at her mother’, we could conjecture that Nikki did not get candy from her mother. However, none of the generated hypotheses expresses this, and the supervised GPT2 model even states that Nikki got candy from her mom.

I conjecture there are two possible reasons why GPT2 did not generate a plausible hypothesis. First, it did not understand that for Nikki to be angry at her mother, she must not have got what she wanted. This means that the model was unable to capture the negation logic in a social situation. Second, the model might not have the commonsense knowledge needed to infer the hypothesis. If the model possessed the commonsense knowledge that mothers typically do not give candy to their children, it would have been more likely to generate an acceptable result.

5.3 Language Models generate open-bounded results in Open-Domain Tasks

The ability to control the generation results is crucial in open-domain settings. Various methods have been developed to control the output in response generation (Yang et al.) and text generation (Keskar et al., 2019; Dathathri et al., 2019). In abductive commonsense reasoning, the scope of possible outputs is very large, and I observed many cases where the model generates an awkward output. For example, given the past observation ‘Amy had a roommate named Sue’ and the future observation ‘They didn’t speak to each other for weeks’, it is likely that something about Sue upset Amy or vice versa. However, the generated hypotheses revolve around Amy or Sue getting to class or work on time. Although it is possible that Amy and Sue are also coworkers, a more appropriate answer would be constrained to an event that might have happened inside their room. None of the models generated an output similar to the answer, ‘Sue was very messy’.

6 Conclusion

In this report, I have summarized abductive reasoning, a novel commonsense reasoning task. I explained some of the state-of-the-art methodologies for solving the task, re-implemented the supervised methods, and measured their scores with automatic evaluation metrics. I analyzed some of their weaknesses through failure cases and suggested how the community could develop more advanced learning methods for models that can perform abductive reasoning. The code I used for the re-implementation is included with the report.


This report was submitted as the final project of the BigData course (CSI4121.01) at Yonsei University. All the experiment code and the report were written by the author.


References

  • C. Bhagavatula, R. L. Bras, C. Malaviya, K. Sakaguchi, A. Holtzman, H. Rashkin, D. Downey, S. W. Yih, and Y. Choi (2019) Abductive commonsense reasoning. arXiv preprint arXiv:1908.05739.
  • Y. Bisk, R. Zellers, J. Gao, Y. Choi, et al. (2020) PIQA: reasoning about physical commonsense in natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 7432–7439.
  • A. Bosselut, H. Rashkin, M. Sap, C. Malaviya, A. Celikyilmaz, and Y. Choi (2019) COMET: commonsense transformers for automatic knowledge graph construction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4762–4779.
  • T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. (2020) Language models are few-shot learners. Advances in Neural Information Processing Systems 33, pp. 1877–1901.
  • F. Chollet (2019) On the measure of intelligence. arXiv preprint arXiv:1911.01547.
  • S. Dathathri, A. Madotto, J. Lan, J. Hung, E. Frank, P. Molino, J. Yosinski, and R. Liu (2019) Plug and play language models: a simple approach to controlled text generation. arXiv preprint arXiv:1912.02164.
  • R. Geirhos, J. Jacobsen, C. Michaelis, R. Zemel, W. Brendel, M. Bethge, and F. A. Wichmann (2020) Shortcut learning in deep neural networks. Nature Machine Intelligence 2 (11), pp. 665–673.
  • G. E. Hinton (2002) Training products of experts by minimizing contrastive divergence. Neural Computation 14 (8), pp. 1771–1800.
  • A. Hosseini, S. Reddy, D. Bahdanau, R. D. Hjelm, A. Sordoni, and A. Courville (2021) Understanding by understanding not: modeling negation in language models. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1301–1312.
  • T. Kapitan (1992) Peirce and the autonomy of abductive reasoning. Erkenntnis 37 (1), pp. 1–26.
  • N. S. Keskar, B. McCann, L. R. Varshney, C. Xiong, and R. Socher (2019) CTRL: a conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858.
  • T. Klein and M. Nabi (2019) Attention is (not) all you need for commonsense reasoning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4831–4836.
  • Y. LeCun, S. Chopra, R. Hadsell, M. Ranzato, and F. Huang (2006) A tutorial on energy-based learning. In Predicting Structured Data. MIT Press, Cambridge.
  • Y. Luo, J. Peng, and J. Ma (2020) When causal inference meets deep learning. Nature Machine Intelligence 2 (8), pp. 426–427.
  • B. P. Majumder, H. Jhamtani, T. Berg-Kirkpatrick, and J. McAuley (2020) Like hiking? you probably enjoy nature: persona-grounded dialog with commonsense expansions. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 9194–9206.
  • L. Qin, A. Bosselut, A. Holtzman, C. Bhagavatula, E. Clark, and Y. Choi (2019) Counterfactual story reasoning and generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 5043–5053.
  • L. Qin, V. Shwartz, P. West, C. Bhagavatula, J. D. Hwang, R. Le Bras, A. Bosselut, and Y. Choi (2020) Back to the future: unsupervised backprop-based decoding for counterfactual and abductive commonsense reasoning. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 794–805.
  • L. Qin, S. Welleck, D. Khashabi, and Y. Choi (2022) COLD decoding: energy-based constrained text generation with Langevin dynamics. arXiv preprint arXiv:2202.11705.
  • N. F. Rajani, B. McCann, C. Xiong, and R. Socher (2019) Explain yourself! leveraging language models for commonsense reasoning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4932–4942.
  • M. Sap, H. Rashkin, D. Chen, R. Le Bras, and Y. Choi (2019) Social IQa: commonsense reasoning about social interactions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 4463–4473.
  • A. Talmor, J. Herzig, N. Lourie, and J. Berant (2019) CommonsenseQA: a question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4149–4158.
  • P. Wang, N. Peng, F. Ilievski, P. Szekely, and X. Ren (2020) Connecting the dots: a knowledgeable path generator for commonsense question answering. In Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 4129–4140.
  • H. Yang, X. Yao, Y. Duan, J. Shen, J. Zhong, and K. Zhang. Progressive open-domain response generation with multiple controllable attributes.
  • R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi (2019) HellaSwag: can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4791–4800.
  • H. Zhang, X. Zhao, and Y. Song (2020) WinoWhy: a deep diagnosis of essential commonsense knowledge for answering Winograd schema challenge. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 5736–5745.
  • B. Zhou, D. Khashabi, Q. Ning, and D. Roth (2019) “Going on a vacation” takes longer than “going for a walk”: a study of temporal commonsense understanding. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3363–3369.