Guided Generation of Cause and Effect

07/21/2021 ∙ by Zhongyang Li, et al. ∙ Harbin Institute of Technology Johns Hopkins University 0

We present a conditional text generation framework that posits sentential expressions of possible causes and effects. This framework depends on two novel resources we develop in the course of this work: a very large-scale collection of English sentences expressing causal patterns (CausalBank); and a refinement over previous work on constructing large lexical causal knowledge graphs (Cause Effect Graph). Further, we extend prior work in lexically-constrained decoding to support disjunctive positive constraints. Human assessment confirms that our approach gives high-quality and diverse outputs. Finally, we use CausalBank to perform continued training of an encoder supporting a recent state-of-the-art model for causal reasoning, leading to a 3-point improvement on the COPA challenge set, with no change in model architecture.



Code Repositories


CausalBank dataset from our IJCAI 2020 paper "Guided Generation of Cause and Effect"


1 Introduction

Causal knowledge acquisition is crucial for various Artificial Intelligence tasks, such as causal event graph construction, reading comprehension and future event prediction. We propose an approach for acquiring causal knowledge through generating multiple plausible causes (reasons, explanations) and effects (results, consequences) for a provided input sentence. As exemplified in Figure 1, we develop two conditional decoders, one per causal direction. To train such models we mine a large-scale corpus of causal expressions from open domain web text, at a scale greatly surpassing prior work. Our goal is to generate multiple distinct possible causes and effects, where each generated sentence is not intended to be a paraphrase of other candidates. To support this output diversity when conditioned on a single shared input sentence, we turn to lexically-constrained decoding [40, 23], which allows for efficiently forcing a model to produce output containing one or more provided phrases. Our constraints are derived from a resource we construct for this work, replicating a prior effort in lexicalized causal knowledge graph construction [32]. This graph captures causal relations as a mapping across lexical types, lemma-to-lemma, but our goal is to generate naturalistic sentences with appropriately inflected morphology: we therefore develop an approach for disjunctive positive lexical constraints, where a decoder’s output must contain one of a set of provided words or phrases. In our case, these are morphological variants of the same base lemma, but our approach should benefit other applications of lexically-constrained decoding.

While there is recent work in generating story endings conditioned on a context [18, 60, 31], such work does not require generated sentences to be strictly causes or effects. The ability to propose explanations for an input sentence by generating multiple causes and effects complements this emerging line of research. To our knowledge, this is the first work to consider open-ended generation of causal sentences at a large scale.

We evaluate through a carefully designed human evaluation that compares outputs from various baselines with those of our proposed model, finding that our model’s outputs are preferred. We further demonstrate the usefulness of our new resource by taking a recent state-of-the-art causal reasoning system and boosting its results on the COPA test set by 3 points, relying only on continued training of the model’s encoder. Our models and resources are made publicly available.

Figure 1: Possible causes and effects generated by our model, conditioned on the input sentence “babies cry”. Tokens in blue are constraint keywords derived from our Cause Effect Graph, which are forced to be included in the outputs by constrained decoding.
Figure 2: Our approach for generating plausible causes and effects.

In this paper, we make the following contributions:


  • proposing the task of open causal generation: producing possible causes and effects for any free-form textual event;

  • construction of a causal corpus (CausalBank) containing 314 million CE (cause-effect) pairs;

  • an extension to lexically-constrained decoding that supports disjunctive positive constraints (DPC);

  • human and automatic evaluations illustrating our method can generate high-quality and diverse causes and effects.

2 Approach

As shown in Figure 2, our proposed approach for open-ended causal generation includes a data collection module (Section 2.1), a Cause Effect Graph (Section 2.2), and two DPC (disjunctive positive constraint) decoding based Transformer encoder-decoder models (Section 2.3).

2.1 CausalBank: A Sentential Causal Corpus

Existing causal corpora were not built to support our goal for open-ended causal generation given any free-form textual input: as in neural machine translation (NMT), we need a large training set with millions of examples. Thus we harvest a large causal dataset from the preprocessed large-scale English Common Crawl corpus (5.14 TB) [5]. The key guidelines of our dataset are as follows: 1) The causal relation is explicitly expressed in text with a causal pattern e.g. ‘because’; 2) The ‘cause’ and ‘effect’ arguments must both appear in the same sentence; 3) The ‘cause’ and ‘effect’ arguments can be of any length of contiguous text without overlaps between them; 4) Negative causal relations are filtered.

Category   Causal Pattern
EPC        as, as a consequence/result of, as long as, because, because of, caused by, due/owing to, in response to, on account of, result from
CPE        accordingly, consequently, bring on/about, give rise to, induce, in order to, lead to, result in, prevent/stop…from, and for this reason, cause, for the purpose of, if…then, ,_so, so that, thereby, therefore, thus, hence
Table 1: Causal patterns (their morphological variants are ignored) used to get the CausalBank corpus. The first row of patterns belongs to the EPC category, while the second row belongs to the CPE category.

We do not rely on a supervised text extractor to pick out specific sub-spans of a sentence that represent a cause-effect pairing between propositions. (Footnote 3: We found poor annotator agreement on span boundaries in an initial investigation on crowdsourcing data for such a system; we intend to return to this in future work, investigating improvements to our results via trained extraction models for corpus pre-processing.) We instead curate a series of patterns from previous studies [33, 32, 15]. These patterns can be classified into two categories, according to how they are mostly used in language to convey a causal relation: 1. EPC (effect-pattern-cause) category: I am very sad because I lost my phone; 2. CPE (cause-pattern-effect) category: The earthquake resulted in many deaths. For EPC patterns, we simply take the text on the left of the pattern as the effect, and the text on the right of the pattern as the cause. The case is reversed for CPE category patterns. These patterns (shown in Table 1) were applied to the Common Crawl corpus, followed by post-filtering: duplicate removal; filtering explicitly negated relations and verbs in passive voice; and restricting the cause and effect to each contain at least two tokens. This results in our CausalBank corpus, denoted here as $\mathcal{B}$, with 133 M EPC + 181 M CPE = 314 M (c, e) pairs in total, where c refers to cause and e refers to effect. We manually evaluated 1,000 randomly sampled sentences from the corpus and found that 95% conveyed a meaningful causal relation.
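The pattern-based split described above can be illustrated with a small sketch. It is not the authors' extraction pipeline: the pattern lists are a hypothetical, heavily reduced subset of Table 1, the function names are invented for illustration, and the negation, passive-voice and duplicate filters are omitted. Only the left/right assignment and the two-token minimum come from the text.

```python
import re

# Hypothetical, heavily reduced pattern lists; Table 1 in the paper is longer.
EPC_PATTERNS = ["because of", "because", "due to", "as a result of"]
CPE_PATTERNS = ["resulted in", "results in", "led to", "leads to", "therefore"]


def _split(sentence, patterns):
    """Return (left, right) text around a matching pattern, or None."""
    # Try longer patterns first so that 'because of' wins over 'because'.
    for pat in sorted(patterns, key=len, reverse=True):
        m = re.search(r"\b" + re.escape(pat) + r"\b", sentence, flags=re.IGNORECASE)
        if m:
            left = sentence[:m.start()].strip(" ,.;")
            right = sentence[m.end():].strip(" ,.;")
            return left, right
    return None


def extract_pair(sentence, min_tokens=2):
    """Return a (cause, effect) tuple, or None if no usable pattern is found."""
    hit = _split(sentence, EPC_PATTERNS)
    if hit:
        effect, cause = hit          # EPC: effect on the left, cause on the right
    else:
        hit = _split(sentence, CPE_PATTERNS)
        if not hit:
            return None
        cause, effect = hit          # CPE: cause on the left, effect on the right
    # Both arguments must contain at least two tokens.
    if len(cause.split()) < min_tokens or len(effect.split()) < min_tokens:
        return None
    return cause, effect
```

On the two example sentences from the text, this sketch assigns "I lost my phone" as the cause of "I am very sad", and "The earthquake" as the cause of "many deaths".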

2.2 Cause Effect Graph: A Lexical Causal KB

Figure 3: Cause Effect Graph: A lexical causal knowledge base.

Following the method described in Luo et al. [32] for creating a causal lexical knowledge base, we reproduce a variant of their CausalNet using the Common Crawl corpus [5]. Given a sentence such as “The storm caused a tremendous amount of damage on the landing beaches.”, this approach will harvest the lexical pairs (storm, tremendous), (storm, amount), (storm, damage), (storm, landing), and (storm, beach) as causal evidence. Stop words are removed and only pairs involving nouns, verbs, adjectives and adverbs are retained. The extracted lexical pairs form a directed network of posited causal relations, where nodes in the network are lemmatized terms, and a directed edge between two terms indicates a causal relation, weighted by co-occurrence frequency. For comparison, Figure 3 gives a similar illustration to Figure 1 in Luo et al. [32]. We refer to our artifact as a Cause Effect Graph (CEG); Table 5 shows that CEG contains more causal relations than CausalNet, owing to the larger (5.14 TB) and cleaner corpus used for extraction [5]. (Footnote 4: 89.1M in contrast to 13.3M, with relations with a frequency of 5 or lower removed.)
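A minimal sketch of the pair-harvesting step follows. It assumes the cause and effect sides have already been tokenized and lemmatized, uses a tiny hypothetical stop-word list, and skips the part-of-speech filter; `harvest_pairs` and `build_graph` are illustrative names, not part of any released code.

```python
from collections import Counter
from itertools import product

# Hypothetical tiny stop-word list. A real pipeline would also lemmatize and
# keep only nouns, verbs, adjectives and adverbs using a POS tagger.
STOP_WORDS = {"the", "a", "an", "of", "on", "in", "to"}


def harvest_pairs(cause_lemmas, effect_lemmas):
    """Pair every content lemma on the cause side with every one on the effect side."""
    causes = [w for w in cause_lemmas if w not in STOP_WORDS]
    effects = [w for w in effect_lemmas if w not in STOP_WORDS]
    return list(product(causes, effects))


def build_graph(examples, min_freq=1):
    """Count directed (cause_lemma, effect_lemma) edges and drop rare ones."""
    counts = Counter()
    for cause_lemmas, effect_lemmas in examples:
        counts.update(harvest_pairs(cause_lemmas, effect_lemmas))
    return {edge: n for edge, n in counts.items() if n >= min_freq}
```

Calling `harvest_pairs(["the", "storm"], ["a", "tremendous", "amount", "of", "damage", "on", "the", "landing", "beach"])` reproduces the five pairs listed for the storm sentence. Setting `min_freq=6` would mirror the removal of relations with a frequency of 5 or lower.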

2.3 Guided Generation

We use Sockeye [21] to train Transformer-based [59] conditional generation models, one for causes, one for effects. Sockeye supports decoding via N-best (each step greedily chooses the top best N words in beam search based on the generated tokens) and random sampling (each step randomly sampling N words from the softmax distribution based on the generated tokens). The training data (CausalBank) is processed through Byte Pair Encoding [54] to reduce vocabulary size.

2.3.1 Disjunctive Positive Constraints Decoding

Unlike in NMT, our intended outputs for a given input are diverse in meaning: we wish to generate multiple semantically distinct possible causes or effects. We induce diversity through hard lexical requirements during decoding, using causal keywords from our CEG as positive constraints on the output. A positive constraint forces the decoder to produce a sequence of tokens that contains the constrained sequence, which is achieved through a constrained beam search proposed by Post and Vilar [40] and made efficient by Hu et al. [23].

Unfortunately, those prior works are restricted to conjunctive positive constraints: all items provided to the decoder must be present in the output. This is problematic in our case: our CEG maps lemmas to lemmas, and thus lemmas will form our constraints, but at generation time we do not require specific morphological inflections of our constrained terms. We do not wish to constrain the decoder to one particular surface form of a lemma, but to allow it to choose the best morphological form as appropriate in its context. For example, when generating a cause for “I brought an umbrella” with rain as the cause keyword, some valid cause sentences, e.g., “It rained” or “It was a rainy day.”, would not be permitted based on prior work. One may circumvent this limitation by enumerating all morphological variants of a term, then applying each in turn as a positive constraint in distinct decoding passes. However, this approach does not scale, as its run-time grows exponentially in the number of initial constraints, each with multiple morphological variants.

Here we propose a solution of disjunctive positive constraint decoding, where each constraint is represented by a set of token sequences, and the decoder needs to include only one sequence from each set of constraints in the final output. We modify the algorithm from Hu et al. [23] to allow the decoder to explore the disjunctively constrained space in a single forward sequence, without significant computational overhead. In that work, constraints are represented in a trie, where each constraint is represented by a path from the root to a leaf. One or more state pointers are used to track how many tokens have been generated for each constraint, and tokens that induce more progress are prioritized in a modified beam search proposed by Post and Vilar [40]. When a constraint is satisfied, the algorithm prunes the path representing that constraint. The distinguishing property of a disjunctive constraint is that once a sequence in a disjunctive set is satisfied, the others in the set are also removed and are no longer constraints.

For decoding with disjunctive constraints, we represent all constrained sequences, whether they are from the same disjunctive set or not, on a single trie. When a sequence is generated, we prune all sequences in the set as opposed to just the generated sequence. This modification gives us an efficient algorithm for applying disjunctive constraints, as illustrated in Algorithm 1 and Figure 4. While here we use morphological variants in our disjunctive set, our algorithm is broadly applicable for constraining on a set of synonyms or different subword segmentations of the same sequence.

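  input: a set of disjunctive constraint sets S = {s^1, ..., s^N}, where each set s^i in S contains one or more token sequences, and each sequence s^i_n = (w_1, ..., w_m), with w_j the j-th token in s^i_n, is one of the sequences of the disjunctive constraint set s^i
  output: a token sequence u = (u_1, ..., u_k)
  build a single trie from all sequences in all sets of S
  while u_k ≠ EOS and k < k_max do
     generate the next token u_k with constrained beam search over the trie
     if u_k finishes the sequence s^i_n then
        for each sequence s^i_j in s^i do
           prune s^i_j from the trie
        end for
        Remove s^i from S
     end if
  end while
Algorithm 1 Decoding with Disjunctive Positive Constraints. We consider the generation of one sentence with a beam size of 1 for simplicity. Note that while a beam size of 1 reduces the constrained beam search, the handling of DPC is not affected. (Symbols lost in extraction have been restored from the surrounding description and are indicative only.)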
Figure 4: Trie states in positive constraint and disjunctive positive constraint, after generating the token ‘rained’ in beam search.
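The bookkeeping that distinguishes disjunctive from conjunctive constraints can be sketched without a full decoder. The class below is a simplified stand-in, not the trie-based implementation built on Hu et al. [23]: it checks whether the generated suffix completes any alternative of a remaining set and, if so, drops the whole set at once. The class name and its methods are hypothetical, and beam search itself is left out.

```python
class DisjunctiveConstraints:
    """Tracks which disjunctive constraint sets are still unmet.

    Simplified stand-in for the trie: matching is done on the suffix of the
    generated tokens, and alternatives are assumed to be non-empty.
    """

    def __init__(self, constraint_sets):
        # Each set is a list of alternative token sequences.
        self.remaining = {
            i: [tuple(seq) for seq in alternatives]
            for i, alternatives in enumerate(constraint_sets)
        }

    def advance(self, generated):
        """Call after each new token; returns the ids of sets satisfied now."""
        satisfied = [
            i for i, alternatives in self.remaining.items()
            if any(
                len(seq) <= len(generated)
                and tuple(generated[-len(seq):]) == seq
                for seq in alternatives
            )
        ]
        for i in satisfied:
            # Disjunctive pruning: every alternative in the set goes at once.
            del self.remaining[i]
        return satisfied

    def done(self):
        """True when every constraint set has been satisfied."""
        return not self.remaining


# Usage: two sets of morphological variants, one per keyword.
constraints = DisjunctiveConstraints([
    [("rain",), ("rained",), ("rainy",)],
    [("umbrella",), ("umbrellas",)],
])
tokens = []
for token in ["it", "rained", "so", "i", "took", "umbrellas"]:
    tokens.append(token)
    constraints.advance(tokens)
```

With conjunctive constraints, generating "rained" would leave "rain" and "rainy" still pending; here the whole rain set is retired as soon as one variant appears.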
Outputs Reranking

While DPC decoding supports an arbitrary number of disjunctive constraints in one beam search process, in practice only a few preferred constraints under the model will dominate any N-best output. To encourage diversity we first select a set of candidate constraint tokens from CEG, generate outputs per constraint, then merge and rerank the results. For example, if generating causes for the input sentence “babies cry”, we lemmatize each word in the sentence (baby and cry). These terms map to a set of lemmas via CEG, each associated with an observed frequency; we take the N most frequent (highest weighted) such candidates: A = {a_1, ..., a_N}. For each token a_i in A, such as ‘love’, we get a set of its morphological variants S_i = {‘love’, ‘loves’, ‘loved’, ‘loving’} via the Python package pattern.en, and pass S_i as a DPC, keeping the top M outputs. In total we derive N×M (N=300 and M=5) sentences via N beam search decodings. These sentences are ranked by their associated negative log-likelihood scores, and we return the top K.
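The merge-and-rerank step can be sketched as follows, assuming each constrained decoding pass returns a list of (sentence, negative log-likelihood) pairs. The function name is hypothetical, and the deduplication of identical sentences is an added assumption that the text does not mention.

```python
def rerank(outputs_per_constraint, top_k):
    """Merge per-constraint outputs and return the top_k lowest-NLL sentences.

    outputs_per_constraint: iterable of lists of (sentence, nll) pairs,
    one list per disjunctive constraint set.
    """
    best = {}
    for outputs in outputs_per_constraint:
        for sentence, nll in outputs:
            # Assumption: keep one copy of a repeated sentence, with its best score.
            if sentence not in best or nll < best[sentence]:
                best[sentence] = nll
    # Lower negative log-likelihood means the model finds the sentence more likely.
    ranked = sorted(best.items(), key=lambda item: item[1])
    return ranked[:top_k]
```

Because every pass is forced to contain a different keyword, the merged pool is lexically varied before ranking; the final ranking then favors the candidates the model itself scores as most likely.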

Method            Dataset   Cause Per   Cause Acc   Effect Per   Effect Acc
RNN-LSTM          CB_10M    66.0        29.6        55.2         32.2
RNN-GRU           CB_10M    67.6        29.5        48.0         33.7
CNN               CB_10M    37.6        36.1        39.5         35.4
Conv-Transformer  CB_10M    29.5        38.9        31.1         38.2
Transformer       CB_10M    28.3        39.1        29.9         38.4
Transformer       CB_all    31.4        38.0        27.6         39.7
Transformer_BIG   CB_all    29.9        38.5        26.4         39.8
Table 2: Dev-set results: perplexity (Per), word accuracy (Acc (%)).

3 CausalBERT

Previous studies [38, 30] have shown that applying intermediate auxiliary task training to an encoder such as BERT can improve performance on a target task. We designed an intermediate task for BERT using the CausalBank corpus $\mathcal{B}$, employing margin loss [27, 28] in the objective function:

$$L(\Theta) = \sum_{(c,e) \in \mathcal{B}} \max\big(0,\; m - f(c,e) + f(c',e')\big) + \frac{\lambda}{2}\lVert\Theta\rVert^{2},$$

where $f(c,e)$ is the score of the true CE pair given by the BERT model, and $f(c',e')$ is the score of the corrupted CE pair obtained by replacing c or e with a randomly sampled negative cause or effect from other examples in $\mathcal{B}$. $m$ is the margin loss function parameter, which is set to 0.3. $\Theta$ is the set of BERT model parameters. $\lambda$ is the parameter for L2 regularization, which is set to 0.00001.

By training BERT with this intermediate supervised task, we expect the model to acquire enhanced knowledge about the meaning of a causal relation, and to perform better on downstream causal inference tasks.
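A framework-free sketch of a margin objective of this kind is given below. It assumes the standard hinge form with an L2 penalty scaled by one half, which is a common convention rather than something the text confirms; the scores are plain floats standing in for the BERT outputs, and `margin_loss` is an illustrative name.

```python
def margin_loss(true_scores, corrupt_scores, params, margin=0.3, l2=1e-5):
    """Hinge-style margin loss with L2 regularization.

    true_scores:    model scores for true (cause, effect) pairs
    corrupt_scores: scores for the matching corrupted pairs
    params:         flat list of model parameters (floats)
    """
    # A true pair should outscore its corrupted counterpart by at least `margin`.
    hinge = sum(
        max(0.0, margin - f_true + f_corrupt)
        for f_true, f_corrupt in zip(true_scores, corrupt_scores)
    )
    # Assumption: the L2 term uses the common 1/2 scaling.
    reg = 0.5 * l2 * sum(p * p for p in params)
    return hinge + reg
```

A pair contributes nothing once the true score exceeds the corrupted score by the margin, so training effort concentrates on pairs the model still confuses.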

4 Evaluation

We evaluate our proposed causal generation approach by both human and automatic metrics, and evaluate CausalBank by applying CausalBERT to COPA, which requires the model to choose the correct cause or effect from two candidates.

4.0.1 Model Selection

We first experiment on a small subset of our CausalBank corpus (CB_10M), consisting of 10 million CE pairs from the causal pattern ‘because’, considering different NMT encoder and decoder architectures (LSTM, CNN, Conv-Transformer [13], and Transformer). (Footnote 5: Each of these models’ encoder and decoder use the same architecture, e.g. both are 6-layer LSTMs, with a hidden size and embedding size of 512. All models are trained for 10 epochs. The vocabulary size is 10,000.)

For the cause generation model, e is used as the source and c is used as the target, which is reversed in training the effect model. Perplexity (Per) and word accuracy (Acc) are used to evaluate the model’s performance. We find that Transformer consistently achieves the best performance (Table 2).
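For readers unfamiliar with the two automatic metrics, the sketch below uses their usual definitions: perplexity as the exponential of the mean per-token negative log-likelihood, and word accuracy as the share of reference tokens the model predicts correctly at each position. Whether the toolkit computes them in exactly this way is an assumption, and both function names are illustrative.

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative log-probability (natural log) per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def word_accuracy(predicted, reference):
    """Percentage of reference positions where the predicted token matches."""
    correct = sum(p == r for p, r in zip(predicted, reference))
    return 100.0 * correct / len(reference)
```

Lower perplexity and higher word accuracy both indicate that the model assigns more probability mass to the reference continuation.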

                      Cause                      Effect
Test set  Method      P@1   P@3   H     Div      P@1   P@3   H     Div
TrainSub  KNN         89.0  67.3  0.85  0.11     98.0  71.3  0.90  0.02
TrainSub  GPT-2       31.0  22.3  0.39  0.13     8.0   9.3   0.30  0.11
TrainSub  N-Best      59.0  45.3  0.53  0.15     63.0  42.7  0.53  0.11
TrainSub  Random      68.0  59.3  0.66  0.11     74.0  61.7  0.70  0.09
TrainSub  CN-Cons     72.0  71.3  0.79  0.02     66.0  67.0  0.76  0.02
TrainSub  Gold-Cons   78.0  75.3  0.83  0.12     71.0  73.0  0.80  0.10
COPA_Dev  KNN         10.0  8.0   0.53  0.10     4.0   2.7   0.26  0.01
COPA_Dev  GPT-2       40.0  34.0  0.45  0.12     38.0  32.0  0.46  0.10
COPA_Dev  Random      66.0  53.7  0.65  0.09     62.0  46.7  0.57  0.08
COPA_Dev  N-Best      69.0  65.0  0.77  0.08     72.0  68.0  0.82  0.07
COPA_Dev  CN-Cons     74.0  70.0  0.81  0.02     72.0  72.0  0.87  0.02
COPA_Dev  Gold-Cons   73.0  73.0  0.87  0.09     72.0  71.3  0.87  0.09
Table 3: Human evaluation results of cause and effect generation.

Then we train two versions of Transformer on the whole CausalBank corpus (CB_all). The small model’s encoder and decoder both have 6 layers, with a hidden size and embedding size of 512. The big model’s encoder and decoder have 12 layers and 4 layers respectively, with a hidden size and embedding size of 768, leading to 134M parameters in total. The vocabulary size is 15,000. Training is stopped when the validation loss stagnates for 20,000 batches. For the cause generation model, e and c from only the EPC category pairs are used as the source and target. For the effect generation model, c and e from only the CPE category pairs are used as the source and target. This setting always generates the right part of the sentence conditioned on the left part, which we find to give more reasonable outputs than the above architecture exploration experiments. The bottom of Table 2 shows that the large Transformer model consistently achieves the best performance on the development set, which contains 5,000 CE pairs.

4.0.2 Evaluating Generation

We evaluate the large Transformer model via human assessment, on two kinds of test sets. The first test set (TrainSub) contains 100 randomly sampled input examples from the model’s training data. The second test set (COPA_Dev) contains 100 randomly sampled examples from the development set of the COPA [48] dataset, which are manually created gold sentences never seen during the model’s training stage.

The compared methods include a simplified KNN method (when the input is “babies cry”, we match sentences exactly containing the input as the retrieved neighbors, e.g. “those babies cry loudly”, and get the corresponding causes and effects), the GPT-2 124M language model [42], which can generate continuations conditioned on a start sentence (e.g. “babies cry because”), random sampling based decoding, N-best decoding, DPC decoding with constraint tokens from CEG (CN-Cons), and DPC decoding with the gold answer as constraint tokens (Gold-Cons).

Method                   Acc (%)
PMI [25]                 58.8
PMI_EX [16]              65.4
CS [32]                  70.2
CS_MWP [51]              71.2
Google T5-base [45]      71.2
BERT-base [27]           75.4
CausalBERT-base (ours)   78.6
Google T5-11B [45]       94.8
Table 4: Results on COPA-Test, contrasting prior results to a model by Li et al. built atop BERT-base. This model is improved by 3 points through adoption of CausalBERT.

Four graduate students from the NLP field served as annotators. Each was asked to give a score from {0, 1, 2} for the generated {input, cause/effect} pair, where the guidelines are (taking cause generation as an example): if the generated answer does not make sense or can never be a reasonable cause, reason or explanation for the input event, give a score of 0; if the generated answer has grammatical errors but can be a reasonable cause, reason or explanation for the input event under some rare conditions (or beyond commonsense), give a score of 1; if the generated answer is a fluent sentence and can be a reasonable cause, reason or explanation with high probability, give a score of 2. Each pair was labeled by two annotators, and we average the two judgments per pair. The Cohen’s kappa score is 0.53.

Table 3 shows the human evaluation results. Three metrics are adopted: Precision at 1 (P@1; an average score of 1.5 or above is seen as a valid causal answer); Precision at 3 (P@3); and the average human score for each evaluated pair (H). For the TrainSub test set, the KNN method shows strong performance, especially for P@1 and the human scores. However, KNN performs worse for P@3, due to the absence of many possible answers for the same input. Meanwhile, our two versions of DPC decoding strategies (CN-Cons, Gold-Cons) also perform better than the other generation methods (GPT-2, Random and N-best decoding). KNN performs poorly on the COPA dev set, because most of the inputs never appear in the training data. However, CN-Cons and Gold-Cons can still achieve good performance.
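One way to compute precision at k from the two-annotator scores is sketched below. The 1.5 validity threshold comes from the text; treating P@k as the percentage of valid outputs among the top k per input, pooled over all inputs, is an assumption, as is the function name.

```python
def precision_at_k(scores_per_input, k, threshold=1.5):
    """Percentage of top-k outputs whose mean annotator score is >= threshold.

    scores_per_input: one list per input sentence, ordered by model rank;
    each element is a tuple of annotator scores in {0, 1, 2}.
    """
    valid = 0
    total = 0
    for ranked_scores in scores_per_input:
        for annotator_scores in ranked_scores[:k]:
            mean_score = sum(annotator_scores) / len(annotator_scores)
            valid += mean_score >= threshold
            total += 1
    return 100.0 * valid / total
```

Under this reading, an output rated (2, 1) by the two annotators counts as valid, while one rated (1, 1) does not.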

Lexical Diversity

We used a modified BLEU score to evaluate lexical diversity (Div in Table 3), where a lower score means greater lexical diversity. Specifically, we calculate the BLEU-1 score between the gold answers and the generated top 3 outputs without the brevity penalty. This modification ensures that we do not reward shorter outputs. In most cases, CN-Cons gets the lowest Div scores, showing that our DPC decoding together with constraint tokens from CEG allows us to explore more of the space of causes and effects, and to generate more diverse outputs. We also find that all of these BLEU scores are very low compared with the BLEU scores in previous text generation studies [24, 59]. This is because our generation task is open-ended (as illustrated in Figure 1).
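BLEU-1 without the brevity penalty reduces to clipped unigram precision, which the following sketch computes for a single hypothesis. Whitespace tokenization and lowercasing are simplifying assumptions, and how scores are aggregated over the top 3 outputs is not specified here; `bleu1_no_bp` is an illustrative name.

```python
from collections import Counter


def bleu1_no_bp(hypothesis, references):
    """Clipped unigram precision of a hypothesis against one or more references."""
    hyp_counts = Counter(hypothesis.lower().split())
    if not hyp_counts:
        return 0.0
    # For each token, the largest count observed in any single reference.
    max_ref = Counter()
    for ref in references:
        for token, n in Counter(ref.lower().split()).items():
            max_ref[token] = max(max_ref[token], n)
    # Each hypothesis token is credited at most as often as it appears in a reference.
    clipped = sum(min(n, max_ref[token]) for token, n in hyp_counts.items())
    return clipped / sum(hyp_counts.values())
```

A low value means the generated sentence shares few words with the gold answer, which is the sense in which a lower Div score indicates more lexical diversity.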

Evaluating CausalBank

Table 4 shows our CausalBERT results on the COPA test set. Compared with prior strong knowledge-driven baseline methods, a BERT-base model trained with a margin-based loss [27] achieved good performance. Following the experimental settings of Li et al. [27], when training the BERT-base model with additional CE pairs from CausalBank, we get an improvement of 3.2 points, from 75.4% to 78.6%, showing that our corpus successfully augments BERT-base to make it better for causal inference, which is a sign that the corpus contains useful causal knowledge. We find that the number of CE pairs in the intermediate task matters: performance first improves and then decreases as more training data is added. (Footnote 6: This was not observed in related studies [38, 30], where all training examples from the Multi-NLI dataset were used as an intermediate task. Similar behavior was observed in NMT in continued training for domain adaptation [58]. We believe ours to be a similar setting, where the “in-domain” causal data overwhelms the benefits of pretraining; adapting strategies from Thompson et al. is an avenue for future work.) We get the best performance of 78.6% with 40 K training CE pairs. Though our result still has a big gap from the current SOTA performance on COPA (94.8% from the largest Google T5-11B model), the intent of our experiment is just to illustrate the effect when the only difference is altering the pre-training with CausalBank. One could possibly get a SOTA model based on our corpus and the Google T5 model, if publicly available.

5 Related Work

Conditional Text Generation

Such efforts cover a large body of work, including machine translation, response generation and paraphrase generation. Most related is conditional story generation [18, 60, 31, 29], which aims to generate story continuations based on a given context. These works do not require generated sentences to be strictly causes or effects.

For causal generation, Rashkin et al. [46] aimed to generate the likely intents and reactions of the event’s participants, given a short free-form textual event. Sap et al. [50] trained a multi-task model for fine-grained kinds of If-Then commonsense reasoning. However, the causal semantics considered in their work are restricted to a narrow space, and their models are trained on no more than one million examples. Further, their resource was based on crowdsourcing, which carries risks of human bias [49, 39]. We harvest a significantly larger, open coverage causal corpus, related in approach to DisSent [35] but larger, focused on causality, and aimed primarily at generation rather than sentence representation learning. (Footnote 7: While we avoid pitfalls of elicitation, we acknowledge that like any corpus-extracted resource ours may suffer from reporting bias [17]: some types of causes or effects that are known to humans but rarely or never explicitly stated.)

Of various efforts in guided generation [1, 57, 8, 24], lexically-constrained decoding [22] is a modification of beam search originating in neural machine translation which allows the user to specify tokens that must (or must not) appear in the decoder’s output.

Post and Vilar [40] proposed a variant of lexically-constrained decoding that reduced complexity from linear to constant-time, which was made more efficient by Hu et al. [23]. We introduce an extension to lexically-constrained decoding that supports disjunctive positive constraints for multiple optional constraint keywords.

Sentential Causal Resource    # CE Pairs
TCR [36]                      172
SemEval-2007 Task4 [14]       220
Causal-TimeBank [33]          318
CaTeRS [34]                   488
EventCausalityData [9]        580
RED [37]                      1,147
SemEval2010 Task8 [19]        1,331
BECauSE 2.0 [12]              1,803
EventStoryLine [6]            5,519
PDTB 2.0 [41]                 8,042
Altlex [20]                   9,190
PDTB 3.0 [61]                 13 K
DisSent [35]                  167 K
CausalBank (Ours)             314 M

Causal Knowledge Graph        # CE Edges
Event2mind [46]               25 K
ConceptNet 5.7 [55]           473 K
ASER Core [64]                494 K
Atomic [50]                   877 K
CausalNet [32]                13.3 M
Cause Effect Graph (Ours)     89.1 M
Table 5: Contrasting size with example prior works: only the causal portion of these corpora is listed. The top are sentential causal corpora, while the bottom are graph-structured causal knowledge bases.
Sentential Causal Resources

Existing causal corpora differ in their annotation guidelines and how they are constructed: (1) whether they consider only explicit or also implicit causal relations; (2) whether they consider only intra-sentence relations or if relations can cross sentences; (3) whether the annotation unit is word level or sentence level; and (4) whether the corpus is constructed automatically or by human effort. Ours is concerned with explicit only relations, within a single sentence, relating one part of a sentence to another, and employs constructed patterns but not sentence-level human annotation.

Already mentioned are recent crowdsourcing efforts [46, 50]. More related are PDTB [41] and BECauSE [12], although our goal is a much larger corpus, built for the purpose of training a neural text generation model. Most related would be the extractive approach of DisSent [35], although we focus specifically on causality and derive a much larger corpus. Bethard and Martin [3] tagged a small corpus of event pairs conjoined with “and” as causal or not causal. CaTeRS [34] included causal relations from a commonsense reasoning standpoint. Richer Event Description [37] integrates real-world temporal and causal relations between events into a unified framework. Table 5 contrasts the size of the causal portion of prior resources with our own.

Lexical Causal Resources

Lexical semantic resources may encode causal properties on verbs (e.g., [53, 4]) and prepositions (e.g., [52]). Force dynamics theory [56] from cognitive psychology posits three primary kinds of causal semantics [63], namely CAUSE, ENABLE and PREVENT, which were lexicalized as causal verbs [62]. The annotation scheme of Dunietz et al. [12] distinguishes three types of causal semantics: CONSEQUENCE, MOTIVATION, and PURPOSE. In PDTB 2.0 [41], “CONTINGENCY” has two subtypes (“Cause” and “Condition”). FrameNet [2] represents causal relations through a variety of unrelated frames (e.g., CAUSATION and THWARTING) and frame roles (e.g., PURPOSE and EXPLANATION). These efforts motivate our own causal patterns, categorized into: CAUSE (e.g. cause, result in, lead to), EXPLANATION (e.g. because, due to), CONDITION (e.g. if-then, as long as), PURPOSE (e.g. in order to, for the purpose of), and PREVENTION (e.g. stop/prevent-from).

Causal Knowledge Acquisition

Causal knowledge acquisition [43, 44] is crucial for many AI systems, and it is often acquired via text. Hashimoto et al. (2014) and Kruengkrai et al. (2017) applied supervised learning techniques using benchmark training data with over 100K human-annotated CE pairs. Dasgupta et al. (2018) explored general causal extraction using 5,000 labelled sentences. Do et al. [9] is an example of a minimally supervised approach. Recent studies [11, 10] explored new supervised approaches on the BECauSE 2.0 [12] corpus.

Church and Hanks (1990) proposed the use of pointwise mutual information (PMI) for mining patterns via text co-occurrence. Many works have followed this strategy, e.g. [7, 47, 16, 9, 32]. Others have mined patterns via discourse patterns in the form of ‘A led to B’, ‘if A then B’, etc., e.g., [26, 15, 65]. See Asghar (2016) for a review. Such efforts relate most closely to our CEGraph component, rather than our overall framework. Our concern is the generation of diverse potential causes and effects as natural language statements.

6 Conclusion

We investigate open causal generation for free-form textual input, and build a large sentential causal corpus which we used to train a generative model. We introduced a novel extension to lexically-constrained decoding that supports disjunctive positive constraints, where generated output is forced to contain one of a set of candidates. Automatic and human evaluations show that our method can generate high-quality and diverse causes and effects for new inputs.


We acknowledge the support of the National Key Research and Development Program of China (SQ2018AAA0101901), the National Natural Science Foundation of China (NSFC) via Grant 61976073 and 61702137; the China Scholarship Council; and DARPA KAIROS (Hu and Van Durme).


  • [1] P. Ammanabrolu, E. Tien, W. Cheung, Z. Luo, W. Ma, L. Martin, and M. Riedl (2019) Guided neural language generation for automated storytelling. In Storytelling Workshop, Cited by: §5.
  • [2] C. Baker (2014)

    FrameNet: a knowledge base for natural language processing

    In Frame Semantics, Cited by: §5.
  • [3] S. Bethard and J. H. Martin (2008) Learning semantic links from a corpus of parallel temporal and causal relations. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers, pp. 177–180. Cited by: §5.
  • [4] C. Bonial, J. Bonn, K. Conger, J. D. Hwang, and M. Palmer (2014) PropBank: semantics of new predicate types.. In LREC, Cited by: §5.
  • [5] C. Buck, K. Heafield, and B. van Ooyen (2014) N-gram counts and language models from the common crawl. In LREC, External Links: Link Cited by: §2.1, §2.2.
  • [6] T. Caselli and P. Vossen (2017) The event storyline corpus: a new benchmark for causal and temporal relation extraction. Cited by: Table 5.
  • [7] N. Chambers and D. Jurafsky (2008) Unsupervised learning of narrative event chains. In ACL, pp. 789–797. Cited by: §5.
  • [8] E. Clark, Y. Ji, and N. A. Smith (2018) Neural text generation in stories using entity representations as context. In NAACL, pp. 2250–2260. Cited by: §5.
  • [9] Q. Do, Y. Chan, and D. Roth (2011) Minimally supervised event causality identification. In EMNLP, pp. 294–303. Cited by: §5, Table 5.
  • [10] J. Dunietz, J. G. Carbonell, and L. Levin (2018) DeepCx: a transition-based approach for shallow semantic parsing with complex constructional triggers. In EMNLP, Cited by: §5.
  • [11] J. Dunietz, L. Levin, and J. Carbonell (2017) Automatically tagging constructions of causation and their slot-fillers. TACL, pp. 117–133. Cited by: §5.
  • [12] J. Dunietz, L. Levin, and J. Carbonell (2017) The because corpus 2.0: annotating causality and overlapping relations. Cited by: §5, §5, Table 5.
  • [13] J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin (2017) Convolutional sequence to sequence learning. In ICML, Cited by: §4.0.1.
  • [14] R. Girju, P. Nakov, V. Nastase, S. Szpakowicz, P. Turney, and D. Yuret (2007) Semeval-2007 task 04: classification of semantic relations between nominals. Cited by: Table 5.
  • [15] R. Girju (2003) Automatic detection of causal relations for question answering. Cited by: §2.1, §5.
  • [16] A. S. Gordon, C. A. Bejan, and K. Sagae (2011) Commonsense causal reasoning using millions of personal stories. In AAAI, Cited by: Table 4, §5.
  • [17] J. Gordon and B. Van Durme (2013) Reporting bias and knowledge acquisition. In AKBC, Cited by: footnote 7.
  • [18] J. Guan, Y. Wang, and M. Huang (2019) Story ending generation with incremental encoding and commonsense knowledge. In AAAI, Cited by: §1, §5.
  • [19] I. Hendrickx, S. N. Kim, et al. (2009) Semeval-2010 task 8: multi-way classification of semantic relations between pairs of nominals. In SE, Cited by: Table 5.
  • [20] C. Hidey and K. McKeown (2016) Identifying causal relations using parallel wikipedia articles. In ACL, pp. 1424–1433. Cited by: Table 5.
  • [21] F. Hieber, T. Domhan, et al. (2017) Sockeye: a toolkit for neural machine translation. arXiv. Cited by: §2.3.
  • [22] C. Hokamp and Q. Liu (2017) Lexically constrained decoding for sequence generation using grid beam search. In ACL, Cited by: §5.
  • [23] J. E. Hu, H. Khayrallah, et al. (2019) Improved lexically constrained decoding for translation and monolingual rewriting. In NAACL, Cited by: §1.
  • [24] J. E. Hu, R. Rudinger, M. Post, and B. Van Durme (2019) ParaBank: monolingual bitext generation and sentential paraphrasing via lexically-constrained neural machine translation. In AAAI, Cited by: §4.0.2, §5.
  • [25] S. Jabeen, X. Gao, and P. Andreae (2014) Using asymmetric associations for commonsense causality detection. In PRICAI, Cited by: Table 4.
  • [26] C. S. Khoo, S. Chan, and Y. Niu (2000) Extracting causal knowledge from a medical database using graphical patterns. In ACL, Cited by: §5.
  • [27] Z. Li, T. Chen, and B. Van Durme (2019) Learning to rank for plausible plausibility. In ACL, Cited by: §3, §4.0.2, Table 4.
  • [28] Z. Li, X. Ding, and T. Liu (2018) Constructing narrative event evolutionary graph for script event prediction. In IJCAI, pp. 4201–4207. Cited by: §3.
  • [29] Z. Li, X. Ding, and T. Liu (2018) Generating reasonable and diversified story ending using sequence to sequence model with adversarial training. In Coling, Cited by: §5.
  • [30] Z. Li, X. Ding, and T. Liu (2019) Story ending prediction by transferable bert. In IJCAI, Cited by: §3, footnote 6.
  • [31] F. Luo, D. Dai, P. Yang, T. Liu, B. Chang, Z. Sui, and X. Sun (2019) Learning to control the fine-grained sentiment for story ending generation. In ACL, Cited by: §1, §5.
  • [32] Z. Luo, Y. Sha, K. Q. Zhu, S. Hwang, and Z. Wang (2016) Commonsense causal reasoning between short texts. In Knowledge Representation and Reasoning, Cited by: §1, §2.1, Table 4, §5, Table 5.
  • [33] P. Mirza, R. Sprugnoli, et al. (2014) Annotating causality in the tempeval-3 corpus. In CAtoCL, Cited by: §2.1, Table 5.
  • [34] N. Mostafazadeh, A. Grealish, et al. (2016) CaTeRS: causal and temporal relation scheme for semantic annotation of event structures. In Events, Cited by: §5, Table 5.
  • [35] A. Nie, E. Bennett, and N. Goodman (2019) DisSent: learning sentence representations from explicit discourse relations. In ACL, Cited by: §5, §5, Table 5.
  • [36] Q. Ning, Z. Feng, et al. (2018) Joint reasoning for temporal and causal relations. In ACL, Cited by: Table 5.
  • [37] T. O’Gorman, K. Wright-Bettner, and M. Palmer (2016) Richer event description: integrating event coreference with temporal, causal and bridging annotation. In CNS, Cited by: §5, Table 5.
  • [38] J. Phang, T. Févry, and S. R. Bowman (2018) Sentence encoders on stilts: supplementary training on intermediate labeled-data tasks. arXiv:1811.01088. Cited by: §3, footnote 6.
  • [39] A. Poliak, J. Naradowsky, A. Haldar, R. Rudinger, and B. Van Durme (2018) Hypothesis only baselines in natural language inference. In LCS, Cited by: §5.
  • [40] M. Post and D. Vilar (2018) Fast lexically constrained decoding with dynamic beam allocation for neural machine translation. In NAACL, Cited by: §1.
  • [41] R. Prasad, N. Dinesh, et al. (2008) The penn discourse treebank 2.0. In LREC, Cited by: §5, §5, Table 5.
  • [42] A. Radford, J. Wu, et al. (2019) Language models are unsupervised multitask learners. Cited by: §4.0.2.
  • [43] K. Radinsky, S. Davidovich, and S. Markovitch (2012) Learning causality for news events prediction. In WWW, Cited by: §5.
  • [44] K. Radinsky and E. Horvitz (2013) Mining the web to predict future events. In WSDM, Cited by: §5.
  • [45] C. Raffel, N. Shazeer, et al. (2019) Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv:1910.10683. Cited by: Table 4.
  • [46] H. Rashkin, M. Sap, E. Allaway, N. A. Smith, and Y. Choi (2018) Event2Mind: commonsense inference on events, intents, and reactions. In ACL, Cited by: §5, Table 5.
  • [47] M. Riaz and R. Girju (2010) Another look at causality: discovering scenario-specific contingency relationships with no supervision. In Semantic Comp., Cited by: §5.
  • [48] M. Roemmele, C. A. Bejan, and A. S. Gordon (2011) Choice of plausible alternatives: an evaluation of commonsense causal reasoning. In AAAI, Cited by: §4.0.2.
  • [49] R. Rudinger, C. May, and B. Van Durme (2017) Social bias in elicited natural language inferences. In Ethics in NLP, Cited by: §5.
  • [50] M. Sap, R. Le Bras, et al. (2019) Atomic: an atlas of machine commonsense for if-then reasoning. In AAAI, Cited by: §5, Table 5.
  • [51] S. Sasaki, S. Takase, et al. (2017) Handling multiword expressions in causality estimation. Cited by: Table 4.
  • [52] N. Schneider, V. Srikumar, J. D. Hwang, and M. Palmer (2015) A hierarchy with, of, and for preposition supersenses. In Linguistic Annotation, Cited by: §5.
  • [53] K. K. Schuler (2005) VerbNet: a broad-coverage, comprehensive verb lexicon. Cited by: §5.
  • [54] R. Sennrich, B. Haddow, and A. Birch (2016) Neural machine translation of rare words with subword units. In ACL, Cited by: §2.3.
  • [55] R. Speer, J. Chin, and C. Havasi (2017) ConceptNet 5.5: an open multilingual graph of general knowledge. In AAAI, Cited by: Table 5.
  • [56] L. Talmy (1988) Force dynamics in language and cognition. Cognitive science. Cited by: §5.
  • [57] J. Tang, T. Zhao, et al. (2019) Target-guided open-domain conversation. In ACL, Cited by: §5.
  • [58] B. Thompson, J. Gwinnup, et al. (2019) Overcoming catastrophic forgetting during domain adaptation of neural machine translation. In NAACL, Cited by: footnote 6.
  • [59] A. Vaswani, N. Shazeer, et al. (2017) Attention is all you need. In NIPS, Cited by: §2.3, §4.0.2.
  • [60] T. Wang and X. Wan (2019) T-cvae: transformer-based conditioned variational autoencoder for story completion. In IJCAI, Cited by: §1, §5.
  • [61] B. Webber, R. Prasad, A. Lee, and A. Joshi (2019) The pdtb 3.0 annotation manual. Cited by: Table 5.
  • [62] P. Wolff and G. Song (2003) Models of causation and the semantics of causal verbs. CP. Cited by: §5.
  • [63] P. Wolff (2007) Representing causation. Journal of experimental psychology: General. Cited by: §5.
  • [64] H. Zhang, X. Liu, et al. (2019) ASER: a large-scale eventuality knowledge graph. arXiv. Cited by: Table 5.
  • [65] S. Zhao, Q. Wang, S. Massung, et al. (2017) Constructing and embedding abstract event causality networks from text snippets. In WSDM, Cited by: §5.