Implicit Premise Generation with Discourse-aware Commonsense Knowledge Models

09/11/2021 ∙ by Tuhin Chakrabarty, et al. ∙ Georgia Institute of Technology, Columbia University

Enthymemes are defined as arguments where a premise or conclusion is left implicit. We tackle the task of generating the implicit premise in an enthymeme, which requires not only an understanding of the stated conclusion and premise but also additional inferences that could depend on commonsense knowledge. The largest available dataset for enthymemes (Habernal et al., 2018) consists of 1.7k samples, which is not large enough to train a neural text generation model. To address this issue, we take advantage of a similar task and dataset: Abductive reasoning in narrative text (Bhagavatula et al., 2020). However, we show that simply using a state-of-the-art seq2seq model fine-tuned on this data might not generate meaningful implicit premises associated with the given enthymemes. We demonstrate that encoding discourse-aware commonsense during fine-tuning improves the quality of the generated implicit premises and outperforms all other baselines both in automatic and human evaluations on three different datasets.




1 Introduction

In argumentation theory, an enthymeme is defined as an incomplete argument found in discourse, where some components are explicit, but other propositions are left implicit and need to be filled in as premises or conclusions to fully understand what the argument is Walton and Reed (2005). In many instances the missing proposition is a premise. The well-cited example of the Silver Blaze case from one of Sherlock Holmes’ stories Walton and Reed (2005) presents such an incomplete argument:

A dog was kept in the stable, and yet, though someone had been in and fetched out a horse, he had not barked enough to rouse the two lads in the loft. Obviously, the midnight visitor was someone whom the dog knew well.

The missing premise in this case is the generalization “Dogs generally bark when a person enters an area unless the dog knows the person well.”

While there has been work on identification (i.e., classification) and reconstruction of implicit premises in enthymemes Rajendran et al. (2016); Habernal et al. (2018); Reisert et al. (2015); Boltužić and Šnajder (2016); Razuvayevskaya and Teufel (2017), to our knowledge, automatically generating an implicit premise from a given enthymeme is a new task. Two main challenges need to be addressed: 1) the lack of large-scale data of incomplete arguments together with the annotated missing premises needed to train a sequence-to-sequence model (the largest such set contains 1.7K instances Habernal et al. (2018)); and 2) the inherent need to model commonsense or world knowledge.

We propose an approach for generating an implicit premise given an incomplete argument that aims to address these two challenges. Our contributions are threefold.

Reason                           Vaccinations save lives
Claim                            Vaccination should be mandatory for all children
ZeroShot                         Vaccines save lives, they save money
Fine-tuned on ART                Vaccinations are the best way to protect children.
Fine-tuned on ART + PARA-COMET   Vaccinations are the best way to prevent childhood diseases.
Table 1: Implicit premise generation by BART Lewis et al. (2020) in three different settings for an input enthymeme from the dataset by Habernal et al. (2018).

A new task of generating an implicit premise given an incomplete argument (enthymeme). Given an enthymeme consisting of a stated conclusion and a stated premise, generate the implicit/missing premise. As the backbone sequence-to-sequence architecture we use BART Lewis et al. (2020).

Leverage abductive reasoning as an auxiliary task. To address the first challenge, we rely on an observation from argumentation theory that incomplete arguments in naturally occurring discourse, more often than not, require abductive reasoning (plausible explanations) rather than the more strict form of reasoning based on deductive logic Walton and Reed (2005); Sabre (1990). The Silver Blaze case is such an example. We leverage the Abductive Reasoning in Narrative Text (ART) dataset introduced by Bhagavatula et al. (2020) to fine-tune a BART model. ART consists of pairs of observations together with the plausible explanation to be generated (Section 3).

Encoding discourse-aware commonsense knowledge. To address the second challenge, we rely on PARA-COMET Gabriel et al. (2021), a discourse-aware knowledge model that incorporates paragraph-level information to generate coherent commonsense inferences from narratives. We encode the outputs of PARA-COMET while fine-tuning BART on our auxiliary dataset (ART) (Section 4). We show on three different datasets (Section 3) that this knowledge-enhanced model performs best in both automatic and human evaluations (Section 5).

Table 1 shows an example of an enthymeme consisting of a stated premise and conclusion, together with the implicit premise generated by a BART model (zero-shot), by a BART model fine-tuned on the ART dataset, and by a BART model fine-tuned on ART augmented with discourse-aware commonsense knowledge derived from PARA-COMET. We make the code available at

2 Related Work

Prior work on enthymeme reconstruction has focused primarily on the identification (i.e., classification) of implicit premises in enthymemes Rajendran et al. (2016); Habernal et al. (2018); Reisert et al. (2015); Boltužić and Šnajder (2016); Razuvayevskaya and Teufel (2017). Boltužić and Šnajder (2016) study how to identify enthymemes in online discussions, while Habernal et al. (2018) present the task of identifying the correct warrant given two candidate warrants in order to reconstruct an enthymeme. Rajendran et al. (2016) introduce an approach to classify the stance of a statement as implicit or explicit, as a first step towards the long-term goal of enthymeme reconstruction. Unlike these works, which propose discriminative approaches to identify an enthymeme or the (correct) implicit premises, we focus on generative models that aim to generate an implicit premise given an enthymeme, using abductive reasoning and discourse-aware commonsense knowledge.

Alshomary et al. (2020) introduce a closely related task of generating an argument’s conclusion from its premises. Specifically, they focus on the subtask of inferring the conclusion’s target from the premises. They develop two complementary target inference approaches: one ranks premise targets and selects the top-ranked target as the conclusion target, the other finds a new conclusion target in a learned embedding space using a triplet neural network. Unlike their work, ours focuses on a new task of generating an implicit premise given an enthymeme that consists of a stated conclusion and a stated premise.

3 Datasets

Training dataset.

Based on the theoretical connection between enthymemes and abductive reasoning, we use the Abductive Reasoning in Narrative Text (ART) data developed for the abductive NLG task Bhagavatula et al. (2020) to train our models. The task is framed as: given two observations (O1 and O2) from a narrative, generate the most plausible explanation (hypothesis) (Table 2). The observations O1 and O2 in ART are drawn from the ROCStories dataset Mostafazadeh et al. (2016), a large collection of short, manually curated five-sentence stories. The beginning and ending of each story map to the first (O1) and second (O2) observations in ART, respectively. Bhagavatula et al. (2020) presented O1 and O2 as narrative context to crowdworkers and prompted them to generate plausible and implausible hypotheses (H) to explain the observations. To avoid annotation artifacts, Bhagavatula et al. (2020) applied an adversarial filtering step to retain one challenging pair of plausible and implausible hypotheses that are hard to distinguish. The ART training set consists of 50,481 instances, while the validation and test sets consist of 7,252 and 14,313 instances, respectively. As can be seen in Table 2, the observations O1 and O2 can be mapped to the stated Premise and the stated Claim in an enthymeme, while the hypothesis H is mapped to the implicit premise we try to generate.

O1   Alex had his heart set on an ivy league college
O2   Alex ended up achieving his dream of getting into the school.
H    Alex applied to Harvard
Table 2: An instance from the ART dataset.

Test datasets.

We test our models on three different datasets of incomplete arguments (enthymemes) annotated with human-generated implicit/missing premises. First, we use the Argument Reasoning Comprehension Task dataset released by Habernal et al. (2018) (D1), which contains 1654 {claim, premise, warrant (implicit premise)} triples. Second, we use the dataset introduced by Boltužić and Šnajder (2016) (D2), which contains 494 enthymemes from an online debate forum with human-annotated implicit premises. Third, we use the dataset introduced by Becker et al. (2020) (D3), which contains implicit premises annotated for each argument from the MicroText Corpus Peldszus and Stede. For D3, we focus only on arguments that are in a support relation, since this corresponds to our task. Moreover, we choose the cases where there is only one implicit premise, rather than a chain of linked premises. This results in a total of 112 enthymemes for D3. For all datasets, we apply automatic filtering to keep only fully formed sentences as claims and premises (e.g., we remove cases where the stated premise/claim consists of a noun phrase, a partial clause, or many sentences).

Input                Amy was looking through her mother’s old scrapbooks. [SEP] Amy realized her mother had dated her history professor.
Input + PARA-COMET   Amy was looking through her mother’s old scrapbooks. [SEP] to find something [SEP] Amy realized her mother had dated her history professor.
Output               Amy was looking through her mother’s old scrapbooks. And since Amy found pictures of her history professor and mother together. Amy realized her mother had dated her history professor.
Table 3: Encoder input in two settings: fine-tuning on ART and fine-tuning on ART + PARA-COMET (the green text between [SEP]). In the decoder’s output, every hypothesis is prepended with “And since” (in bolded blue).

4 Method

For our generation model, we use BART Lewis et al. (2020), a pre-trained conditional language model that combines bidirectional and auto-regressive transformers. It is implemented as a sequence-to-sequence model with a bidirectional encoder over corrupted text and a left-to-right auto-regressive decoder.

Fine-tuning BART on ART.

To fine-tune BART on the ART dataset (Section 3), we concatenate O1 and O2 with a special delimiter [SEP] as input to the BART encoder, as shown in Table 3 Row 1. For decoding, we focus on reconstructing the entire argument given an enthymeme. To encourage fluency and coherence in the generated argument, we prepend the plausible hypothesis (implicit premise) with the discourse marker “And since” (Table 3 Row 3) during fine-tuning.
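The formatting described above can be sketched in a few lines of Python. This is a minimal illustration rather than the released implementation: the class `ArtExample` and the helpers `encoder_input` and `decoder_target` are hypothetical names, and the single-space joining is an assumption read off Table 3.

```python
from dataclasses import dataclass

SEP = "[SEP]"         # delimiter named in the text
MARKER = "And since"  # discourse marker prepended to the hypothesis


@dataclass
class ArtExample:
    """One ART instance (hypothetical container, not from the paper)."""
    o1: str          # first observation, plays the role of the stated premise
    o2: str          # second observation, plays the role of the stated claim
    hypothesis: str  # plausible explanation, plays the role of the implicit premise


def encoder_input(ex: ArtExample) -> str:
    """Concatenate O1 and O2 with the delimiter (Table 3, Row 1)."""
    return f"{ex.o1} {SEP} {ex.o2}"


def decoder_target(ex: ArtExample) -> str:
    """Full argument with the marked hypothesis in the middle (Table 3, Row 3)."""
    return f"{ex.o1} {MARKER} {ex.hypothesis} {ex.o2}"
```

Placing the hypothesis between O1 and O2 in the target mirrors the goal of reconstructing the whole argument, so the implicit premise can later be recovered as the middle sentence.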

Fine-tuning BART on PARA-COMET enhanced ART.

Adapted knowledge models such as COMET Bosselut et al. (2019) have been shown to generate implicit commonsense inferences along several dimensions (depending on what knowledge graphs they were pre-trained on). PARA-COMET Gabriel et al. (2021) is an extension of COMET pre-trained on ATOMIC Sap et al. (2019) that is able to generate discourse-aware commonsense knowledge. ATOMIC is a knowledge graph that contains 9 relations related to social commonsense knowledge, including dynamic aspects of events such as causes and effects, if-then conditional statements, and mental states. Given a text with T sentences, PARA-COMET generates, for each sentence, a set of commonsense inferences for the 9 inferential relations from ATOMIC that are consistent with the entire narrative. Following PARA-COMET’s input format, we create a discourse of two sentences containing [O1, O2] from ART. We then feed this as input to the trained PARA-COMET model and obtain 9 commonsense relations for both O1 and O2. Given the causal nature of the implicit premises, in this work we use only the relation xIntent. Given an event (e.g., “X compliments Y”), xIntent states the likely intents of person X (e.g., “X wants to be nice”). We only consider the xIntent returned for O1 (the Premise in our task). We experimented with other relations, as well as with xIntent for both O1 and O2, but the results were not better. After obtaining discourse-aware commonsense, we concatenate {O1, commonsense, O2} in sequential order, as shown in Table 3 Row 2, and pass it to BART’s encoder for fine-tuning. For decoding, we use the same process as before (Table 3 Row 3).
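As a rough sketch of the concatenation step, the snippet below assumes that the commonsense inferences for O1 arrive as a dictionary from relation name to a ranked list of strings. That layout, the helper names, and the fallback to the plain ART input when no xIntent is available are all assumptions made for illustration; the actual PARA-COMET output format is not specified here.

```python
from typing import Dict, List

SEP = "[SEP]"  # delimiter named in the text


def first_xintent(relations: Dict[str, List[str]]) -> str:
    """Return the top-ranked xIntent inference, or '' if none is present."""
    candidates = relations.get("xIntent", [])
    return candidates[0].strip() if candidates else ""


def knowledge_enhanced_input(o1: str, o2: str,
                             o1_relations: Dict[str, List[str]]) -> str:
    """Build {O1, commonsense, O2} as in Table 3, Row 2.

    Only the xIntent inference for O1 is used; all other relations are ignored.
    """
    intent = first_xintent(o1_relations)
    if not intent:
        # Assumed fallback: plain ART-style input when no intent is available.
        return f"{o1} {SEP} {o2}"
    return f"{o1} {SEP} {intent} {SEP} {o2}"
```

With the Table 3 example, an xIntent of "to find something" for O1 yields the input shown in Row 2.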

Inference-time decoding.

For generation on our task and test sets, we concatenate the {Premise, Claim} or {Premise, commonsense, Claim} of a given enthymeme in the same way as shown in Table 3 and pass the result as input to the encoder of the fine-tuned BART. The fine-tuned BART model then generates the entire argument, including the implicit premise, auto-regressively. We use beam search with a beam width of 5 for generation. After decoding, we split the argument into 3 individual sentences and treat the middle sentence, which starts with “And since”, as the implicit premise after removing the artificially added discourse marker.
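The post-processing step can be illustrated with the sketch below. It assumes a naive regular-expression sentence splitter and searches for the marked sentence rather than taking the middle one by position; both choices, and the function names, are assumptions and not necessarily what the released code does.

```python
import re
from typing import List, Optional

MARKER = "And since"  # discourse marker added during fine-tuning


def split_sentences(text: str) -> List[str]:
    """Naive split on sentence-final punctuation followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p.strip() for p in parts if p.strip()]


def extract_implicit_premise(generated: str) -> Optional[str]:
    """Return the sentence that starts with the marker, with the marker removed."""
    for sentence in split_sentences(generated):
        if sentence.startswith(MARKER):
            premise = sentence[len(MARKER):].strip()
            return premise or None
    return None  # the model did not produce a marked sentence
```

A splitter this simple will break on inputs that contain abbreviations or figures such as "$5.07", so a more robust sentence segmenter would be advisable for text like the D2 premises.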

For the zero-shot setting, we use the pre-trained BART (bart-large) model. We use the format {Premise. And since [MASK]. Claim} and let the language model generate an implicit premise.
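A minimal sketch of building the zero-shot prompt follows. The function name is hypothetical, the default `<mask>` string assumes the mask token of the Hugging Face BART tokenizer (the text writes it generically as [MASK]), and appending missing sentence-final punctuation is an added convenience rather than something stated in the paper.

```python
def zero_shot_prompt(premise: str, claim: str, mask_token: str = "<mask>") -> str:
    """Format an enthymeme as 'Premise. And since <mask>. Claim.'"""
    premise = premise.strip()
    claim = claim.strip()
    # Assumption: add a final period if the input lacks sentence-final punctuation.
    if not premise.endswith((".", "!", "?")):
        premise += "."
    if not claim.endswith((".", "!", "?")):
        claim += "."
    return f"{premise} And since {mask_token}. {claim}"
```

The resulting string would then be passed to the pre-trained model, which fills the masked span with a candidate implicit premise.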

5 Evaluation and Results

We evaluate three setups: 1) directly use pre-trained BART (Zero-shot); 2) fine-tune BART on ART; 3) fine-tune BART on ART+PARA-COMET.

Automatic Evaluation Setup.

We use BLEU Papineni et al. (2002), one of the most widely used automatic metrics for generation tasks, to compute BLEU-1 and BLEU-2 scores between the system output and the human-written gold implicit premise. We also report the F1 score of BERTScore, a metric for evaluating text generation using contextualized embeddings.
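For readers unfamiliar with the metric, the sketch below computes a simplified sentence-level BLEU-N against a single reference, using clipped n-gram precision and the brevity penalty. Lowercased whitespace tokenization and the absence of smoothing are assumptions; this is not the evaluation script behind Table 4, and scores are returned in [0, 1] rather than on the 0 to 100 scale used there.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate: str, reference: str, max_n: int) -> float:
    """Simplified single-reference BLEU with uniform weights up to max_n."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # no smoothing: any zero precision gives a zero score
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty for candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)
```

Calling `bleu(output, gold, 1)` and `bleu(output, gold, 2)` gives the BLEU-1 and BLEU-2 variants; multiplying by 100 puts them on the scale reported in the tables.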

Human evaluation setup.

We select 50 enthymemes from each test set (150 enthymemes in total) together with the outputs of our fine-tuned BART models (with or without PARA-COMET). We hired crowdworkers on the Amazon Mechanical Turk platform. Given an enthymeme, they were asked whether the generated implicit premises were plausible or not (agreement: 0.56 based on Krippendorff’s α). Each enthymeme was judged for plausibility by 3 distinct Turkers (50 crowdworkers overall). As it was a binary judgement, we took a majority vote: if at least 2 of the 3 annotators thought a premise was plausible, we marked it as plausible. The plausibility judgement considers whether the generated premise is grammatical, relevant to the argument, coherent with our commonsense, and completes the argument.
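The aggregation of the binary judgements can be sketched as follows. The function names and the representation of votes as booleans are assumptions made for illustration.

```python
from typing import List, Sequence


def majority_plausible(votes: Sequence[bool]) -> bool:
    """True if strictly more than half of the annotators voted 'plausible'."""
    return sum(votes) * 2 > len(votes)


def plausibility_rate(all_votes: List[Sequence[bool]]) -> float:
    """Percentage of items marked plausible by majority vote."""
    if not all_votes:
        return 0.0
    plausible = sum(majority_plausible(v) for v in all_votes)
    return 100.0 * plausible / len(all_votes)
```

With three annotators per item, the strict-majority test is equivalent to requiring at least two "plausible" votes.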


While pre-trained language models often contain structured commonsense Davison et al. (2019); Zhou et al. (2020), Table 4 shows that pre-trained BART cannot generate plausible implicit premises. Fine-tuning on the ART dataset improves the results significantly. Finally, the model that encodes discourse-aware commonsense outperforms all baselines on all test datasets (D1, D2 and D3). Human evaluation further demonstrates that encoding commonsense knowledge leads to better implicit premise generation (Table 5).


We notice that adding commonsense beams from PARA-COMET makes the generated implicit premise more plausible. For instance, for the stated claim and premise from D3 in Table 6, PARA-COMET adds the beam “to feel better”. Similarly, it adds the beam “to learn more” for the stated claim and premise in both D1 examples shown in Table 6. We posit that adding these beams in combination with the stated claim and premise leads our model to infer more plausible implicit premises than the ones generated by BART fine-tuned on ART. Finally, given that D3 has been annotated with argument schemes Musi et al. (2018), we can explore their role in enthymeme reconstruction. We notice that most of the generated plausible implicit premises belong to enthymemes annotated with the Practical Evaluation argument scheme, where “the premise is an evaluation about something being ‘good’ or ‘bad’, while the claim expresses a recommendation/advice about stopping/continuing an action” (Table 6).

Data  System            BLEU1   BLEU2   BS
D1    ZeroShot           6.02    2.17   42.88
      ART                9.16    3.11   48.35
      ART + PARA-COMET  10.56    3.90   50.22
D2    ZeroShot          28.24   15.13   46.96
      ART               37.77   18.76   60.63
      ART + PARA-COMET  44.12   24.14   67.75
D3    ZeroShot          12.58    6.25   44.64
      ART               14.89    6.34   51.78
      ART + PARA-COMET  15.56    7.50   53.38
Table 4: Automatic evaluation of implicit premise generation by BART in 3 settings based on BLEU1, BLEU2 and BERTScore (BS). Difference is significant, via Wilcoxon signed-rank test.
Data System Plausibility
D1 ART 50%
D2 ART 48%
D3 ART 38%
Table 5: Human evaluation results for our fine-tuned BART models in two settings.
D1  St Premise        Deaf students need more specialized
    St Claim          States need special schools for the deaf
    Gold              Their parents can’t always enroll them in a deaf private school
    Zero-shot         We can’t afford it, we shouldn’t
    ART               The deaf students are not getting enough education.
    ART + PARA-COMET  Deaf students are not being served well in the schools

D1  St Premise        Understanding other culture is more important now than ever before.
    St Claim          Colleges need humanities programs
    Gold              More people now fail to understand other cultures
    Zero-shot         It’s the humanities, we need them
    ART               The humanities are the most important subjects in college.
    ART + PARA-COMET  There is a lot of misinformation out there about other cultures

D2  St Premise        Bush new spending in 8 years? $5.07 TRILLION Obama total New Spending (projected out for the next 8 years)? $1.44 TRILLION. And of that total, only $430 billion is non-recession
    St Claim          Fixed the economy
    Gold              Obama spends less money than Bush.
    Zero-shot         We are talking about the economy
    ART               The Obama administration has spent $1 trillion.
    ART + PARA-COMET  The Obama’s spending is much less than Bush’s.

D3  St Premise        The morning-after pill has a number of side effects.
    St Claim          The morning-after pill should only be prescribed after counselling by a physician or pharmacist.
    Gold              Physicians and pharmacists inform about side effects.
    Zero-shot         Morning-after pills are not FDA approved, they should be avoided.
    ART               The morning-after pill can cause depression.
    ART + PARA-COMET  The side effects can be very serious.
Table 6: Enthymeme generation for a given stated Premise and Claim by BART in 3 settings: zero-shot; fine-tuned on ART; and fine-tuned on ART + PARA-COMET. Text bolded in green displays how generations are more plausible due to the incorporation of discourse-aware commonsense.

6 Conclusions

We propose an end-to-end approach for a new task of automatically generating an implicit premise given an enthymeme. We show how leveraging abductive reasoning as an auxiliary task improves over the zero-shot performance of a state-of-the-art generative language model. Finally, we build a knowledge-enhanced model by encoding discourse-aware commonsense that outperforms all existing baselines in terms of automatic metrics as well as plausibility judgements from crowdworkers. Future work includes exploring other sources of commonsense knowledge, experimenting with improved decoding techniques, and studying the role of argument schemes in enthymeme reconstruction.

7 Ethical Considerations

Although we use language models trained on data collected from the Web, which have been shown to have issues with bias and abusive language Sheng et al. (2019); Wallace et al. (2019), the inductive bias of our models should limit inadvertent negative impacts. Unlike model variants such as GPT, BART is a conditional language model, which provides more control over the generated output. Finally, we fine-tune our model on the ART dataset, which is built on five-sentence short stories that are devoid of harmful and toxic text, especially text targeted at marginalized communities.

While dual-use concerns are certainly possible here, we think that open-sourcing this technology will help to facilitate understanding of arguments with more balanced and better reasoning. The technology should be used responsibly, particularly by making sure the generation is controllable by providing the stated premise, claim and any commonsense knowledge pertaining to the enthymeme in textual form. Finally, we pay the Turkers $15/hour, complying with minimum wage standards in the US.


References

  • M. Alshomary, S. Syed, M. Potthast, and H. Wachsmuth (2020) Target inference in argument conclusion generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 4334–4345.
  • M. Becker, K. Korfhage, and A. Frank (2020) Implicit knowledge in argumentative texts: an annotated corpus. In Proceedings of the 12th Language Resources and Evaluation Conference, Marseille, France, pp. 2316–2324. ISBN 979-10-95546-34-4.
  • C. Bhagavatula, R. L. Bras, C. Malaviya, K. Sakaguchi, A. Holtzman, H. Rashkin, D. Downey, W. Yih, and Y. Choi (2020) Abductive commonsense reasoning. In International Conference on Learning Representations.
  • F. Boltužić and J. Šnajder (2016) Fill the gap! analyzing implicit premises between claims from online debates. In Proceedings of the Third Workshop on Argument Mining (ArgMining2016), Berlin, Germany.
  • A. Bosselut, H. Rashkin, M. Sap, C. Malaviya, A. Celikyilmaz, and Y. Choi (2019) COMET: commonsense transformers for automatic knowledge graph construction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 4762–4779.
  • J. Davison, J. Feldman, and A. Rush (2019) Commonsense knowledge mining from pretrained models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 1173–1178.
  • S. Gabriel, C. Bhagavatula, V. Shwartz, R. Le Bras, M. Forbes, and Y. Choi (2021) Paragraph-level commonsense transformers with recurrent memory. In AAAI.
  • I. Habernal, H. Wachsmuth, I. Gurevych, and B. Stein (2018) The argument reasoning comprehension task: identification and reconstruction of implicit warrants. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 1930–1940.
  • M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer (2020) BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 7871–7880.
  • N. Mostafazadeh, N. Chambers, X. He, D. Parikh, D. Batra, L. Vanderwende, P. Kohli, and J. Allen (2016) A corpus and cloze evaluation for deeper understanding of commonsense stories. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 839–849.
  • E. Musi, M. Stede, L. Kriese, S. Muresan, and A. Rocci (2018) A multi-layer annotated corpus of argumentative text: from argument schemes to discourse relations. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318.
  • A. Peldszus and M. Stede. An annotated corpus of argumentative microtexts.
  • P. Rajendran, D. Bollegala, and S. Parsons (2016) Contextual stance classification of opinions: a step towards enthymeme reconstruction in online reviews. In Proceedings of the Third Workshop on Argument Mining (ArgMining2016), Berlin, Germany, pp. 31–39.
  • O. Razuvayevskaya and S. Teufel (2017) Finding enthymemes in real-world texts: a feasibility study. Argument Comput. 8, pp. 113–129.
  • P. Reisert, N. Inoue, N. Okazaki, and K. Inui (2015) A computational approach for generating toulmin model argumentation. In Proceedings of the 2nd Workshop on Argumentation Mining, Denver, CO, pp. 45–55.
  • R. M. Sabre (1990) Peirce’s abductive argument and the enthymeme. Vol. 26, pp. 363–372. ISSN 00091774, 15589587.
  • M. Sap, R. Le Bras, E. Allaway, C. Bhagavatula, N. Lourie, H. Rashkin, B. Roof, N. A. Smith, and Y. Choi (2019) ATOMIC: an atlas of machine commonsense for if-then reasoning. Proceedings of the AAAI Conference on Artificial Intelligence 33 (01), pp. 3027–3035.
  • E. Sheng, K. Chang, P. Natarajan, and N. Peng (2019) The woman worked as a babysitter: on biases in language generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 3407–3412.
  • E. Wallace, S. Feng, N. Kandpal, M. Gardner, and S. Singh (2019) Universal adversarial triggers for attacking and analyzing NLP. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 2153–2162.
  • D. Walton and C.A. Reed (2005) Argumentation schemes and enthymemes. Synthese 145, pp. 339–370.
  • X. Zhou, Y. Zhang, L. Cui, and D. Huang (2020) Evaluating commonsense in pre-trained language models. ArXiv abs/1911.11931.