CausalBERT: Injecting Causal Knowledge Into Pre-trained Models with Minimal Supervision

07/21/2021 ∙ by Zhongyang Li, et al. ∙ Harbin Institute of Technology

Recent work has shown success in incorporating pre-trained models like BERT to improve NLP systems. However, existing pre-trained models lack causal knowledge, which prevents today's NLP systems from thinking like humans. In this paper, we investigate the problem of injecting causal knowledge into pre-trained models. There are two fundamental problems: 1) how to collect causal pairs of various granularities from unstructured texts; 2) how to effectively inject causal knowledge into pre-trained models. To address these issues, we extend the idea of CausalBERT from previous studies, and conduct experiments on various datasets to evaluate its effectiveness. In addition, we adopt a regularization-based method that preserves the already learned knowledge with an extra regularization term while injecting causal knowledge. Extensive experiments on 7 datasets, including four causal pair classification tasks, two causal QA tasks and a causal inference task, demonstrate that CausalBERT captures rich causal knowledge and outperforms all state-of-the-art methods based on pre-trained models, achieving a new causal inference benchmark.




1 Introduction

Figure 1: Different levels of causal inference tasks: red colored options are the correct answers.

Pre-trained language models like GPT Radford et al. (2018), BERT Devlin et al. (2019), XLNet Yang et al. (2019), and RoBERTa Liu et al. (2019) have shown that a two-stage framework (pre-training a language model on large-scale unlabeled corpora, then fine-tuning on target tasks) can bring promising improvements to various natural language understanding tasks, such as reading comprehension Radford et al. (2018) and natural language inference Devlin et al. (2019).

In this paper, we study a series of natural language causal inference (NLCI) problems of the form “Could X cause Y?”, where X and Y can be single words (“explosion”, “damage”), general phrases without any constraints (“weak economic environment”, “rising unemployment rates”), or complete sentences (“The man lost his balance on the ladder.”, “He fell off the ladder.”). The problem can even be formulated as a complicated reading comprehension task with a long passage, a question and several candidate answers (as shown in Figure 1).

Despite the great success of pre-trained language models in NLP systems, recent studies show that models learned in such an unsupervised manner struggle to capture causal knowledge and cannot achieve satisfactory performance on NLCI tasks Hassanzadeh et al. (2019); Li et al. (2019a). Making pre-trained models capable of causal inference by equipping them with rich causal knowledge is a step toward human-like AI. However, to the best of our knowledge, few studies have explored this problem, mainly due to the following two challenges:

1. How to create causal pair datasets of different granularities with reasonable quality and coverage;

2. How to effectively inject causal knowledge into a unified model that solves different levels of causal inference tasks, while alleviating the problem of catastrophic forgetting.

In this paper, we try to solve these two issues to obtain better pre-trained causal inference models. Existing causal inference datasets suffer from small scale Roemmele et al. (2011); Hassanzadeh et al. (2019) or low quality Sharp et al. (2016); Xie and Mu (2019). For the dataset challenge, apart from borrowing causal knowledge from existing high-quality resources like ConceptNet Speer et al. (2016) and CausalBank Li et al. (2020), we also make use of cheap supervision from linguistic patterns Nie et al. (2019); Zhou et al. (2020). Specifically, we collect several causal knowledge resources, including both word level and sentence level cause-effect pairs, based on a series of causal patterns curated from previous studies Mirza et al. (2014); Luo et al. (2016). In addition, we adopt the causal word embedding techniques of Xie and Mu (2019) to automatically learn word pairs with strong causal relations. Together these lead to a large-scale causal resource, which we believe can stimulate future causal inference studies.

Li et al. (2020) illustrated that an encoder trained on a corpus of causal pair constructions (CausalBank) benefited causal inference as measured on COPA Roemmele et al. (2011). This study is a downstream follow-up of the CausalBERT idea of Li et al. (2020) and investigates its effectiveness on additional tasks. Specifically, we devise an additional causal pair classification or ranking task for the pre-trained models, based on the causal resources of various granularities that we collected, equipping them with causal inference abilities. The models can then be directly applied to the causal pair classification or COPA test sets, simulating a zero-shot setting Xian et al. (2019), or further fine-tuned on the downstream causal inference tasks. To alleviate the catastrophic forgetting problem Kirkpatrick et al. (2017) on some complicated causal QA datasets Sharp et al. (2016); Huang et al. (2019), we adopt a regularization-based method that preserves the previous knowledge with an extra regularization term Kirkpatrick et al. (2017).

We conduct extensive experiments on seven benchmark datasets across three causal knowledge driven tasks, i.e., causal pair classification, causal question answering and causal inference (COPA). Experiments show that CausalBERT consistently performs better than all baselines based on pre-trained models. In particular, on the well-known COPA causal inference challenge, we achieve 93.5% accuracy with our causal knowledge enhanced ALBERT model, very close to the performance of the largest T5-11B model (94.8%) Raffel et al. (2019). Surprisingly, our CausalBERT-based ALBERT-xxlarge model achieves 86.4% accuracy on the COPA test set under the zero-shot setting, which outperforms some strong pre-trained models fine-tuned on the COPA dataset Li et al. (2019a).

2 Method

Figure 2: The three-stage CausalBERT framework.

As shown in Figure 2, our CausalBERT is a three-stage sequential transfer learning framework Li et al. (2019b); Phang et al. (2018). The first stage is unsupervised pre-training with the language modeling objective (we don’t pre-train a language model from scratch, but use the publicly available pre-trained BERT Devlin et al. (2019), RoBERTa Liu et al. (2019) and ALBERT Lan et al. (2019)). The second stage is self-supervised pre-training on the causal pair resources of different levels that we collect (Section 2.1), using the proposed causal pair classification or ranking pre-training tasks (Section 2.2) and the regularization technique for overcoming catastrophic forgetting (Section 2.3). Finally, in the third stage the model can be directly applied to the causal pair classification and COPA test sets, or further fine-tuned on the target tasks’ training sets.

2.1 Where is the Causal Knowledge from?

In order to inject causal knowledge into the pre-trained language models, we collect a large-scale and high-quality causal resource, either from previous resources or using precise causal patterns. We consider two categories of causal knowledge: sentence (phrase) level cause-effect pairs, and word level cause-effect pairs.

2.1.1 Sentence Level Causal Knowledge

CausalBank. CausalBank Li et al. (2020) is a large-scale sentential causal pairs dataset extracted from the preprocessed English Common Crawl corpus Buck et al. (2014). It contains 314 M cause-effect pairs in total. Although the corpus is very large, in our experiments we don’t use all of it for causal knowledge injection but only some subsets. Future studies can decide how to use this very large causal knowledge base for tasks like causal generation Rashkin et al. (2018); Sap et al. (2019), or for pre-training a CausalBERT model from scratch.

ConceptNet. ConceptNet Speer et al. (2016) is a knowledge graph that connects words and phrases of natural language with labeled edges. The knowledge was collected from many sources, including expert-created resources, crowd-sourcing, and games with a purpose. It uses a closed class of selected relations such as IsA, UsedFor, and CapableOf. We obtain 22 K phrase level cause-effect pairs from the Causes relation, such as “A big game” causes “watch television”.

2.1.2 Word Level Causal Knowledge

As we also evaluate our method on word level causal pair classification tasks, we need to collect a word level causal pairs training set. We consider the following three approaches.

Cheap Supervision from Precise Template Matching. Inspired by Girju and Moldovan (2002), we propose to use some precise, low-ambiguity causal patterns to extract word level cause-effect pairs from the preprocessed English Common Crawl corpus (5.14 TB) Buck et al. (2014). Specifically, we find that some grammatical structures in English like NP1-verb NP2, where the verb can be ‘caused’, ‘causing’, ‘induced’ and ‘inducing’, explicitly express a causal relation between NP1 and NP2. For example, “virus-caused infection” implies virus causes infection, and “sleep-inducing pills” implies pills cause sleep. In total, we collect 558 K word level causal pairs by using this simple but effective template matching approach.
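The template matching idea can be illustrated with a minimal sketch. The regular expression, the restriction to single-word noun phrases, and the function name `extract_pairs` are assumptions made for illustration; the actual patterns and preprocessing used to obtain the 558 K pairs are not given in this text.

```python
import re

# Hypothetical pattern for the hyphenated forms named in the text
# ("X-caused Y", "X-causing Y", "X-induced Y", "X-inducing Y").
# Real noun phrases can span several words; single words keep the sketch short.
PATTERN = re.compile(
    r"\b([a-z]+)-(caused|causing|induced|inducing)\s+([a-z]+)\b",
    re.IGNORECASE,
)


def extract_pairs(text):
    """Return (cause, effect) word pairs matched by the hyphenated patterns."""
    pairs = []
    for left, verb, right in PATTERN.findall(text):
        left, right = left.lower(), right.lower()
        if verb.lower() in ("caused", "induced"):
            # "virus-caused infection": the left word is the cause.
            pairs.append((left, right))
        else:
            # "sleep-inducing pills": the right word is the cause.
            pairs.append((right, left))
    return pairs


print(extract_pairs("A virus-caused infection and sleep-inducing pills."))
```

Note that the direction of the pair depends on the verb form: the past participle puts the cause on the left, while the present participle puts it on the right, as in the two examples from the text.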

CausalNet. We reproduced a variant of CausalNet Luo et al. (2016) in our CausalBank Li et al. (2020) study; please refer to that work for details. In our experiments we keep the top 1.96 M word pairs with the highest necessity causal strength Luo et al. (2016), computed as $CS_{nec}(i_c, j_e) = \frac{p(i_c \mid j_e)}{p^{\alpha}(i_c)} = \frac{p(i_c, j_e)}{p^{\alpha}(i_c)\, p(j_e)}$, where $\alpha$ is a constant penalty exponent that penalizes high-frequency response terms, and $p(i_c)$, $p(j_e)$ and $p(i_c, j_e)$ are the probabilities that $i$ is a cause word in the corpus, $j$ is an effect word in the corpus, and $i$ has a causal relation with $j$, respectively.
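As a rough illustration of the necessity causal strength, the sketch below scores each pair as the joint probability of the pair divided by the effect word probability and the cause word probability raised to a penalty exponent. Estimating the probabilities as relative frequencies over extracted pair counts, the toy counts, and the names `causal_strengths` and `top_k` are assumptions for illustration; the value of the exponent is not given in this text, so it is left as a required argument.

```python
from collections import Counter


def causal_strengths(pair_counts, alpha):
    """Score each (cause, effect) pair as p(c, e) / (p(c) ** alpha * p(e)).

    pair_counts maps (cause_word, effect_word) to a co-occurrence count.
    Probabilities are estimated as relative frequencies (an assumption).
    """
    total = sum(pair_counts.values())
    cause_counts, effect_counts = Counter(), Counter()
    for (cause, effect), n in pair_counts.items():
        cause_counts[cause] += n
        effect_counts[effect] += n

    scores = {}
    for (cause, effect), n in pair_counts.items():
        p_joint = n / total
        p_cause = cause_counts[cause] / total
        p_effect = effect_counts[effect] / total
        scores[(cause, effect)] = p_joint / (p_cause ** alpha * p_effect)
    return scores


def top_k(scores, k):
    """Keep the k pairs with the highest causal strength."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy counts, invented for illustration only.
counts = {("rain", "flood"): 3, ("rain", "traffic"): 1, ("virus", "infection"): 4}
scores = causal_strengths(counts, alpha=0.5)
print(top_k(scores, k=2))
```

With an exponent below 1, frequent cause words are penalized less than linearly, so very frequent words do not dominate the ranking purely through their frequency.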

Causal Embedding. Apart from the above template matching based methods, we further adopted the causal word embedding techniques proposed by Xie and Mu (2019) to automatically learn word pairs with strong causal relations, by running their Max-Matching model on our sentence level CausalBank corpus. We used an embedding size of 100, with a cause word vocabulary of 113 K and an effect word vocabulary of 92 K, and ran 11 epochs to reach convergence. We then harvested the word pairs whose embedding inner product similarity scores are above 0, obtaining 120 K word level causal pairs.

2.2 Pre-training Tasks For CausalBERT

Previous studies Li et al. (2019b); Phang et al. (2018) have shown that applying intermediate auxiliary task training to an encoder such as BERT can improve performance on a target task. In this paper, we further extend the CausalBERT idea of Li et al. (2020, 2019a) and conduct experiments on more datasets to evaluate its effectiveness. In order to inject causal knowledge into pre-trained language models, we devise the following two pre-training tasks to further train the models:

(1). Causal Pair Classification with Cross Entropy Loss: For each positive cause-effect pair, we randomly sample some false causes or effects from other relations to get the negative training examples. Then we pre-train models such as BERT on a binary classification task with a cross entropy objective.

(2). Causal Pairs Ranking with Margin-based Loss: Like the above classification task, we also use negative sampling to get the negative training examples. Instead of doing binary classification, we rank the positive cause-effect pairs so that their prediction scores are above those of the negative pairs, employing a margin-based loss Li et al. (2019a, 2018) in the objective function: $L(\Theta) = \sum_{(c,e)} \max\big(0,\ \gamma - f(c,e) + f(c',e')\big) + \frac{\lambda}{2}\lVert\Theta\rVert^{2}$, where $f(c,e)$ is the score of the true CE pair given by the BERT model, and $f(c',e')$ is the score of the corrupted CE pair obtained by replacing $c$ or $e$ with a randomly sampled negative cause or effect from other examples. $\gamma$ is the margin loss function parameter, $\Theta$ is the set of the pre-trained model’s parameters, and $\lambda$ is the parameter for L2 regularization.
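The ranking objective can be sketched in plain Python as a hinge on the score gap between each true pair and its corrupted counterpart, plus an L2 term on the parameters. The function name, the flat parameter list, the factor of one half on the L2 term, and the toy scores are assumptions for illustration; in practice the scores would come from the pre-trained encoder.

```python
def margin_ranking_loss(pos_scores, neg_scores, margin, params=(), l2=0.0):
    """Hinge loss on score gaps plus an L2 penalty on the parameters.

    pos_scores[i] is the model score of a true cause-effect pair and
    neg_scores[i] the score of its corrupted counterpart.
    """
    # A pair contributes zero once the true pair outscores the
    # corrupted one by at least `margin`.
    hinge = sum(
        max(0.0, margin - pos + neg)
        for pos, neg in zip(pos_scores, neg_scores)
    )
    # L2 term over a flat list of parameter values (illustrative only).
    penalty = 0.5 * l2 * sum(w * w for w in params)
    return hinge + penalty


# The first pair is already separated by more than the margin,
# so only the second pair contributes to the loss.
print(margin_ranking_loss([2.0, 0.5], [0.0, 0.4], margin=1.0))
```

The hinge makes training focus on pairs that are not yet ranked correctly by a safe margin, which suits multiple-choice evaluation where only the relative order of candidates matters.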

For the causal pair classification test sets (Section 3.1), we use the first pre-training task; for the multiple-choice test sets (Sections 3.2 and 3.3), we use the second, ranking-based pre-training task, following the suggestions of Li et al. (2019a). By training pre-trained models with these tasks, we expect the model to acquire specific domain knowledge about the meaning of a causal relation, and to perform better on downstream causal inference tasks.

In practice, for each positive cause-effect pair we randomly sample negative pairs (two per positive pair; see Section 2.4). The resulting label imbalance in the training data may lead to a biased model. We use weight adjustment to fix this. Specifically, we apply weight adjustment to the total loss with a weight factor calculated from the observed label’s count relative to the number of all instances.
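A minimal sketch of negative sampling and label weighting follows. Corrupting only the effect side, the retry cap, and the inverse-frequency weighting formula are assumptions for illustration: the text only states that the weight factor is derived from a label's count relative to the number of all instances, so `label_weights` shows one common reading rather than the exact scheme used.

```python
import random
from collections import Counter


def build_examples(positive_pairs, num_negatives=2, seed=0):
    """Create (cause, effect, label) examples with sampled negatives."""
    rng = random.Random(seed)
    effects = [effect for _, effect in positive_pairs]
    positives = set(positive_pairs)
    examples = []
    for cause, effect in positive_pairs:
        examples.append((cause, effect, 1))
        added, attempts = 0, 0
        # Borrow effects from other pairs; skip anything that is
        # actually a known positive pair. The cap avoids endless loops.
        while added < num_negatives and attempts < 100:
            attempts += 1
            fake_effect = rng.choice(effects)
            if fake_effect != effect and (cause, fake_effect) not in positives:
                examples.append((cause, fake_effect, 0))
                added += 1
    return examples


def label_weights(labels):
    """Inverse-frequency weights: total / (num_classes * count), assumed."""
    counts = Counter(labels)
    total = len(labels)
    return {label: total / (len(counts) * n) for label, n in counts.items()}


pairs = [("rain", "flood"), ("virus", "infection"), ("smoking", "cancer")]
examples = build_examples(pairs)
print(label_weights([label for _, _, label in examples]))
```

With two negatives per positive, the positive label makes up one third of the data, so under this weighting each positive example counts twice as much as each negative one in the loss.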

2.3 Overcoming Catastrophic Forgetting

In order to alleviate the catastrophic forgetting issue seen in our sequential transfer learning method, we adopt a regularization-based method Chen et al. (2020) that preserves the previous knowledge with an extra regularization term. Specifically, we apply an L2 regularization term on the pre-trained models’ parameters when injecting causal knowledge into them. This simple technique keeps the model from deviating too far from the already learned language modeling parameters while it learns new causal knowledge.
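One way to picture this regularizer is as a quadratic penalty that grows as the parameters move away from their pre-trained values. Anchoring the penalty at the pre-trained values (rather than at zero), the factor of one half, and the flat parameter lists are assumptions for illustration; the text only states that an L2 term is applied to the pre-trained models' parameters. The strength 0.01 is one of the values listed in Table 4.

```python
def l2_penalty(params, anchor, strength):
    """Quadratic penalty on the distance between params and anchor values."""
    return 0.5 * strength * sum(
        (w - w0) ** 2 for w, w0 in zip(params, anchor)
    )


def regularized_loss(task_loss, params, pretrained_params, strength=0.01):
    """Causal-injection loss plus a penalty for drifting from pre-training."""
    return task_loss + l2_penalty(params, pretrained_params, strength)


# Toy values: one parameter has moved by 1.0 away from its pre-trained value.
print(regularized_loss(0.9, [1.0, 2.0], [1.0, 1.0], strength=0.01))
```

A larger strength keeps the model closer to its language modeling solution at the cost of absorbing less of the new causal signal, which is the trade-off the L2 settings in Table 4 explore.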

2.4 Model Details For Reproduction

Our models have the same number of parameters as the original models in Devlin et al. (2019); Liu et al. (2019); Lan et al. (2019). We use the following hyperparameters: a learning rate of 1e-5; a batch size of 8, 40, 80 or 200; a max sequence length of 8, 21, 50, 150 or 192; at most 3 training epochs; and evaluation every 50 or 300 optimization steps. Our code, the running scripts with the hyperparameters used, and sample data from our collected causal resources are uploaded as supplementary materials. The base versions of our models run for about 1 to 3 hours, while the large versions run for 10 to 30 hours, depending on the training data size. Each of our models uses one NVIDIA 2080 Ti GPU with 11 GB memory. For each positive causal pair, we randomly sample two negative examples.

3 Causal Inference Benchmark Datasets

3.1 Causal Pair Classification

In the following we describe the four causal pair classification test sets. These datasets were created by human experts, and the latter three are from real-world risk management applications. We use the four datasets for direct evaluation of our causal knowledge enhanced language models, as they have the same form as the pre-training tasks we use for injecting causal knowledge.

SemEval. This is the same data set used by Sharp et al. (2016) for the sake of comparison with state-of-the-art methods. The data set is derived from the SemEval 2010 Task 8 Hendrickx et al. (2009), originally a classification of semantic relations between nominals (words). The dataset consists of 1,730 word pairs, out of which 865 (half) are marked as causal, and the rest are a random subset of non-causal relations. A positive cause-effect example is ‘vaccine’ and ‘fever’, while a negative pair can be ‘aliens’ and ‘space’.

NATO-SFA. This dataset comes from a report that examines the main trends of global change and the resultant defense and security implications. The report reflects a deep understanding of various trends and conditions throughout the world by a large number of human experts. The title text of each trend was used as a cause and the text of the implication as the effect. Random sampling was used to generate non-causal pairs, leading to 118 cause-effect pairs in total.

Risk Models. This causal pairs dataset was created from the models designed by expert analysts for a decision support system Sohrabi et al. (2018a, b, 2019). The models can be seen as graphs where nodes are short descriptions of conditions or events (e.g. High Inflation Rate or Increase in Corruption) and edges imply a causal relation. These models are based on the experts’ domain knowledge in enterprise risk management. The result is a set of 804 cause-effect pairs.

CE Pairs. This is another collection of cause-effect pairs targeting the risk management use case. It was manually extracted so that either the cause or the effect comes from the node labels in the above risk models, while the other phrase comes from online news or other documents. This dataset contains 160 causal and 160 non-causal pairs.

For comparison, we adopted state-of-the-art baselines from Hassanzadeh et al. (2019): PMI, based on word occurrences; CEA, a modification of PMI that multiplies in other factors; DCC, a search based method on a large corpus; DCC-embed, a word embedding based method that treats each phrase as a word; and NLM-BERT, which uses BERT Devlin et al. (2019) to encode the sentence “X may cause Y” and its top k most similar causal sentences, and computes the average cosine similarity score to make a prediction. We also report precision, recall, F1 and accuracy.
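For reference, the four reported metrics can be computed as in the sketch below for binary causal / non-causal labels. The function name and the toy labels are illustrative assumptions. The example also shows a property of balanced test sets such as SemEval and CE Pairs: a classifier that labels every pair as causal obtains 50% precision, 100% recall and about 66.7% F1, which is the same precision, recall and F1 pattern that appears in several baseline rows of Table 1.

```python
def binary_metrics(gold, pred):
    """Return precision, recall, F1 and accuracy in percent (1 = causal)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    correct = sum(1 for g, p in zip(gold, pred) if g == p)

    precision = 100.0 * tp / (tp + fp) if tp + fp else 0.0
    recall = 100.0 * tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    accuracy = 100.0 * correct / len(gold)
    return precision, recall, f1, accuracy


# Balanced toy set where every pair is predicted to be causal.
gold = [1, 1, 0, 0]
pred = [1, 1, 1, 1]
print(binary_metrics(gold, pred))
```

Because F1 can stay near 66.7% for a degenerate predictor on balanced data, reading F1 together with precision and accuracy gives a fairer picture of how much a method actually discriminates.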

3.2 Causal Inference

COPA. COPA (Choice of Plausible Alternatives, Roemmele et al. (2011)) is a causal inference task in which a system is given a premise sentence and must determine either the cause or the effect of the premise from two possible choices. All examples are handcrafted. It has 500 examples in the training set and 500 in the test set. Following the original work, we evaluate using accuracy. Balanced COPA (BCOPA) Kavumba et al. (2019) extends COPA with one additional mirrored instance for each original training instance, to overcome the superficial cues in COPA that may be exploited by models like BERT. The mirrored instance uses the same alternatives as the corresponding original instance, but introduces a new premise that matches the wrong alternative of the original instance, leading to another 500 training examples.
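The COPA decision rule can be sketched as follows: each alternative is paired with the premise in the direction given by the question type, scored as a (cause, effect) pair, and the higher-scoring alternative is chosen. The function `choose_alternative`, the lookup-table scorer `toy_score`, its score values, and the second alternative are hypothetical stand-ins for illustration; in the paper's setting the score would come from the causal knowledge enhanced model.

```python
def choose_alternative(premise, alternatives, question, score):
    """Pick the index of the most plausible alternative.

    question is "cause" (alternatives are candidate causes of the premise)
    or "effect" (alternatives are candidate effects of the premise).
    score(cause, effect) returns a plausibility score for the pair.
    """
    if question == "cause":
        scores = [score(alt, premise) for alt in alternatives]
    elif question == "effect":
        scores = [score(premise, alt) for alt in alternatives]
    else:
        raise ValueError("question must be 'cause' or 'effect'")
    return max(range(len(alternatives)), key=scores.__getitem__)


# Hypothetical scores standing in for a trained model.
TOY_SCORES = {
    ("The man lost his balance on the ladder.", "He fell off the ladder."): 0.9,
    ("The man lost his balance on the ladder.", "He climbed up the ladder."): 0.2,
}


def toy_score(cause, effect):
    return TOY_SCORES.get((cause, effect), 0.0)


best = choose_alternative(
    "The man lost his balance on the ladder.",
    ["He fell off the ladder.", "He climbed up the ladder."],
    "effect",
    toy_score,
)
print(best)
```

Because the rule only needs a pair scorer, the same function covers both the zero-shot setting (a scorer trained only on causal pairs) and the fine-tuned setting.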

3.3 Causal Question Answering

CausalQA. This dataset contains 3,031 causal questions extracted from the Yahoo! Answers corpus Sharp et al. (2016), split into 60%, 20% and 20% for train, dev and test. The questions follow simple surface patterns such as “What causes …” and “What is the result of …”. All the answers are generated by the community, and one of them is voted the top answer. Each causal question has 7.7 candidate answers on average, ranging from 4 to 93. The task is to identify the top answer among the alternatives.

CosmosQA. CosmosQA Huang et al. (2019) is a large-scale dataset of 35,600 problems that require causal commonsense-based reading comprehension, formulated as multiple-choice questions. It focuses on reading between the lines over a diverse collection of people’s everyday narratives, asking questions like “what might be the possible reason of …?”, or “what would have happened if …” that require causal inference beyond the exact text spans in the context.

Method SemEval (P R F1 Acc) Risk Models (P R F1 Acc) NATO-SFA (P R F1 Acc) CE Pairs (P R F1 Acc)
PMI 50.0 100.0 66.6 52.9 50.0 100.0 66.7 53.1 50.0 100.0 66.7 60.2 50.0 100.0 66.6 50.9
CEA 50.0 100.0 66.6 54.0 50.0 100.0 66.7 54.3 50.0 100.0 66.7 55.1 50.0 100.0 66.7 54.1
DCC 72.1 68.1 70.0 72.0 50.0 100.0 66.7 50.8 50.0 100.0 66.7 66.1 50.0 100.0 66.7 55.9
DCC-embed 66.5 77.9 71.8 73.4 50.0 100.0 66.7 52.0 50.0 100.0 66.7 66.9 50.0 100.0 66.7 57.5
NLM-BERT 50.2 99.8 66.8 61.9 52.0 93.8 66.9 55.7 52.3 98.3 68.2 55.9 50.5 100.0 67.1 56.3
RoBERTa-base +1 50.0 100 66.7 55.4 68.3 84.6 75.6 73.0 66.2 83.1 73.7 72.0 65.1 93.1 76.6 74.1
RoBERTa-base +13 50.1 99.9 66.8 55.5 70.1 83.3 76.1 73.9 68.5 84.7 75.8 72.9 64.2 93.1 76.0 70.6
RoBERTa-base +4 81.9 83.0 82.4 82.6 52.4 92.5 66.9 60.8 50.0 100.0 66.7 61.9 53.6 91.9 67.7 62.5
RoBERTa-base +45 63.9 76.2 69.5 67.0 53.4 92.3 67.6 62.6 52.8 94.9 67.9 63.6 56.4 85.6 68.0 65.0
RoBERTa-base +46 76.5 80.3 78.4 77.8 53.2 91.0 67.2 61.8 52.9 93.2 67.5 65.3 50.0 100.0 66.7 65.0
RoBERTa-base +14 83.0 82.1 82.6 82.7 67.4 88.8 76.6 73.5 63.5 91.5 75.0 72.0 66.4 87.5 75.5 71.6
RoBERTa-base +134 84.8 82.3 83.5 83.8 67.2 86.8 75.8 72.9 70.0 83.1 76.0 73.7 67.8 85.6 75.7 72.5
RoBERTa-base +245 64.2 78.3 70.6 67.4 69.5 86.6 77.1 74.3 70.6 81.4 75.6 73.7 69.9 85.6 77.0 74.4
RoBERTa-base +234 83.3 85.1 84.2 84.0 68.0 87.6 76.5 74.1 67.5 88.1 76.5 74.6 69.8 88.1 77.9 75.0
BERT-large +134 84.6 82.2 83.4 83.7 62.1 88.6 73.0 69.7 67.5 88.1 76.5 74.6 63.8 85.0 72.9 68.4
RoBERTa-large +134 86.9 80.8 83.8 84.3 69.5 88.8 77.9 76.1 65.5 93.2 76.9 72.9 73.7 87.5 80.0 78.4
RoBERTa-large +234 83.1 81.3 82.2 83.0 69.6 87.8 77.7 75.6 71.1 91.5 80.0 77.1 71.2 91.2 80.0 77.2
ALBERT-xxlarge +134 88.8 77.3 82.7 83.8 72.1 86.8 78.8 77.2 76.8 89.8 82.8 81.4 74.9 85.6 79.9 78.4
ALBERT-xxlarge +234 85.8 82.0 83.9 84.4 72.2 88.3 79.4 77.1 76.4 93.2 84.0 82.2 72.4 90.0 80.2 77.8
Table 1: Direct evaluation results (precision, recall, F1 and accuracy, %) on the four causal pair classification datasets created by human experts: the first, SemEval, is a word level test set, while the other three are phrase or sentence level test sets. Top: baseline methods from Hassanzadeh et al. (2019). Bottom: CausalBERT-based approaches with various pre-trained models and knowledge source combinations. The numbers in the method names denote knowledge sources: 1: 0.1 M CausalBank; 2: 1 M CausalBank; 3: 22 K ConceptNet; 4: 558 K Precise Template Matching; 5: 2 M CausalNet; 6: 120 K Causal Embedding. All sizes are the number of positive cause-effect pairs taken from the corresponding resource.
Figure 3: Training curves for F1 score with the number of training steps, from ALBERT-xxlarge + 234 model. Red: max baseline F1 score. Blue: with label weight adjustment. Yellow: no label weight adjustment.

4 Results and Analysis

Results for Causal Pair Classification. Table 1 shows the direct evaluation results on the four human experts created causal pair classification datasets. We conducted extensive experiments with various pre-trained models and knowledge source combinations to study their individual impacts, using the causal pair classification pre-training task. For fair comparison with the unsupervised baselines from Hassanzadeh et al. (2019), we actually use the four datasets as development sets, monitoring the performance on them during training. The meaning of numbers in method description is clarified in the caption of Table 1.

By comparing the results from different methods, we can make the following observations: (1) Our methods consistently outperform the best baseline results on the four test sets, with large F1 and accuracy improvements, ranging from 10.9 to 21.5 absolute points. This demonstrates the power of CausalBERT in causal inference. (2) Word level causal knowledge from Precise Template Matching is very effective for the word level SemEval test set (82.4% F1). Adding more sentence level causal knowledge from CausalBank and ConceptNet further improves the results (83.5% and 84.2% F1). However, causal knowledge from CausalNet and Causal Embedding hurts the performance on SemEval, mainly due to the noise they introduce. (3) Clean causal knowledge from CausalBank, ConceptNet and Precise Template Matching is the most useful for causal pair classification, leading to the best F1 performances (79.4%, 84.0%, 80.2%) on the latter three sentence level tasks with the ALBERT-xxlarge model Lan et al. (2019) (RoBERTa-base Liu et al. (2019) gets the best F1 value (84.2%) on SemEval). (4) We obtain a unified strong classifier (Causal ALBERT-xxlarge) for word, phrase and sentence level causal pair classification.

Pre-trained models like BERT and RoBERTa are generally regarded as sentence encoders that represent the meanings of relatively long sentences. Our results on SemEval show that pre-trained models are also good at representing the meanings of short word pairs, a setting that doesn’t suffer from the superficial cues issue Kavumba et al. (2019); Gururangan et al. (2018); McCoy et al. (2019). Figure 3 shows the training curves for the F1 value with our causal ALBERT-xxlarge model. We find that label weight adjustment is crucial for the word level SemEval test set.

Method COPA BCOPA
BERT-large Sap et al. (2019) 75.0 -
BERT-base Li et al. (2019a) 75.4 -
BERT-large Kavumba et al. (2019) 76.5 74.5
RoBERTa-large Kavumba et al. (2019) 87.7 89.0
RoBERTa-large (Leaderboard) 90.6 -
BERT-base (our imple.) 74.5 76.3
BERT-large (our imple.) 77.8 80.0
RoBERTa-base (our imple.) 80.5 81.3
RoBERTa-large (our imple.) 90.3 90.2
ALBERT-large (our imple.) 80.1 79.7
ALBERT-xxlarge (our imple.) 92.1 92.3
BERT-base + CB (0.1 M) 78.6 78.6
BERT-large + CB (0.1 M) 79.3 80.6
RoBERTa-base + CB (0.1 M) 85.4 83.8
RoBERTa-large + CB (0.1 M) 90.9 90.5
ALBERT-large + CB (0.1 M) 82.1 81.5
ALBERT-xxlarge + CB (0.1 M) 92.6 93.5
Table 2: Accuracy (%) results on COPA and BCOPA test set: fine-tuning the whole model.

Results on COPA and BCOPA. Table 2 and Table 3 show the accuracy results on the COPA test set (BCOPA uses the same test set) under two different settings: (1) fine-tuning the whole model, including the pre-trained model’s and the output layer’s parameters, on the training set; and (2) pre-training the models on 0.1 M CausalBank (CB) pairs without fine-tuning on the COPA training set (zero-shot setting). Compared with the current SOTA models, our causal knowledge enhanced models consistently achieve better results. We achieve 93.5% accuracy with the causal knowledge enhanced ALBERT-xxlarge model, very close to the performance of the largest Google T5-11B model (94.8%) Raffel et al. (2019). Surprisingly, our causal knowledge enhanced ALBERT-xxlarge model achieves 86.4% accuracy on the COPA test set under the zero-shot setting (Table 3), which outperforms some strong pre-trained models fine-tuned on the COPA dataset Li et al. (2019a). This implies that CausalBank contains rich causal knowledge.

Method COPA
BigramPMI Goodwin et al. (2012) 63.4
PMI Gordon et al. (2011) 65.4
CausalNet + PMI Luo et al. (2016) 70.2
Multiword + PMI Sasaki et al. (2017) 71.4
BERT-base + CB (0.1 M) 67.8
BERT-large + CB (0.1 M) 70.2
RoBERTa-base + CB (0.1 M) 74.0
RoBERTa-large + CB 77.8
RoBERTa-large + CB (0.1 M) 82.2
ALBERT-base + CB 62.0
ALBERT-large + CB (0.1 M) 68.2
ALBERT-xlarge + CB 72.4
ALBERT-xxlarge + CB (0.1 M) 86.4
Table 3: Accuracy (%) on COPA: zero-shot setting.

Results on CausalQA. Table 4 shows the results on CausalQA. With the causal knowledge from CausalBank (0.1 M CB), BERT-large Devlin et al. (2019) improves accuracy from 38.5% to 39.1%, while the performance of RoBERTa-large decreases from 39.5% to 37.8%. We don’t observe the large performance improvements on CausalQA that we saw in the previous experiments. We conjecture that this is because the CausalQA dataset comes from an online QA forum, with a lot of irregular user-generated content, and because the task format is more complicated, with 7.7 candidate answers on average (93 at most). In addition, the model may suffer from catastrophic forgetting Kirkpatrick et al. (2017) in our sequential transfer learning process. We adopted two kinds of techniques to alleviate this: an L2 regularization based approach, and a parameter isolation-based method (K-adapter, KA, Wang et al. (2020)). Results show that both are effective in overcoming catastrophic forgetting, leading to consistent performance improvements. Finally, we get the best accuracy of 40.1% with CausalBERT and K-adapter Wang et al. (2020).

Method CausalQA
Causal Embedding Sharp et al. (2016) 37.3
Causal Embedding Xie and Mu (2019) 37.9
BERT-large (our imple.) 38.5
RoBERTa-large (our imple.) 39.5
BERT-large + CB (0.1 M) 39.1
RoBERTa-large + CB (0.1 M) 37.8
BERT-large + L2 + CB (0.1 M, L2=0.01) 39.5
RoBERTa-large +L2 + CB (L2=0.1) 39.1
RoBERTa-large + L2 + CB (0.1 M, L2=0.01) 39.8
BERT-base+concat + CB 37.2
RoBERTa-base +concat + CB 36.8
RoBERTa-large +KA + CB (0.1 M) 36.8
BERT-large +KA + CB (0.1 M) 40.1
Table 4: Accuracy results (%) on CausalQA test set.

Results on CosmosQA. Table 5 (middle) shows the results on the CosmosQA development set (labels are not released for the test set examples). CosmosQA is a more complicated causal question answering task, with a long passage context in each example. As with CausalQA, Table 5 shows that the improvements are rather small after injecting causal knowledge into the pre-trained models. With more causal pairs from CausalBank (1 M CB), CausalBERT does not always get better results. Even with the K-adapter (KA, Wang et al. (2020)) based technique, our methods don’t obtain significant improvements. We conjecture that this is because CosmosQA requires the system to first understand the passage and the question, and then choose the correct answer based on indirect causal inference (not a direct causal relation classification). Thus, the relatively simple causal pair classification task brings very little benefit for this complicated causal inference task. We leave exploring more effective causal knowledge injection approaches for CosmosQA as future work. On the positive side, our implementation of the ALBERT-xxlarge model reaches a high accuracy of 85.0%, which rises to 85.8% with causal knowledge injection. This outperforms the previous SOTA result of 81.8% by a large margin. We further conduct experiments on CosmosQA without the passage information. Results are shown at the bottom of Table 5. We find that our CausalBERT consistently outperforms the original pre-trained models, demonstrating the effectiveness of our method.

Method CosmosQA
BERT-large Huang et al. (2019) 66.2
BERT-large +SWAG Huang et al. (2019) 67.8
BERT-large Multiway Huang et al. (2019) 68.3
RoBERTa-large Wang et al. (2020) 80.6
RoBERTa-large +MT Wang et al. (2020) 81.2
RoBERTa-large +KA Wang et al. (2020) 81.8
BERT-large (our imple.) 66.5
RoBERTa-large (our imple.) 80.7
ALBERT-xxlarge (our imple.) 85.0
BERT-large + CB (0.1 M) 67.6
RoBERTa-large + CB (0.1 M) 81.0
BERT-large + KA + CB (0.1 M) 65.5
BERT-large + KA + CB (1M) 66.0
RoBERTa-large + KA + CB (0.1 M) 80.8
RoBERTa-large + KA + CB (1M) 80.1
ALBERT-xxlarge + CB (0.1 M) 85.8
BERT-large (our imple.) 58.7
RoBERTa-large (our imple.) 63.1
BERT-large + CB (0.1 M) 59.1
RoBERTa-large + CB (0.1 M) 64.9
Table 5: Accuracy results (%) on the CosmosQA dev set. The last four rows are results without the passage information.

5 Related Work

Injecting Domain Knowledge Into Pre-trained Models. This paper relates to studies that inject specific domain knowledge into pre-trained language models such as BERT. TacoLM Zhou et al. (2020) exploits explicit and implicit mentions of temporal common sense extracted from a large corpus to build a temporal common sense language model. GenBERT Geva et al. (2020) injects numerical reasoning skills into pre-trained language models by generating large amounts of numerical and textual data, and training in a multi-task setup. ERNIE Zhang et al. (2019) injects a knowledge graph into BERT by aligning entities from Wikipedia sentences to fact triples in WikiData. SenseBERT Levine et al. (2019) injects word-supersense knowledge by predicting the WordNet supersense of the masked word in the input. KnowBERT Peters et al. (2019) incorporates knowledge bases into BERT using Knowledge attention and recontextualization, where the knowledge comes from synset-synset and lemma-lemma relationships in WordNet. In this paper, we propose to inject causal knowledge into pre-trained models, which, to our knowledge, little previous work has explored.

Causal Reasoning Tasks in NLP. COPA Roemmele et al. (2011) is a causal inference task in which a system is given a premise and must determine either the cause or effect of the premise from two candidate answers. Sharp et al. (2016) was the first work to train causal word embeddings for causal question answering. Hassanzadeh et al. (2019) investigated a series of unsupervised methods for answering binary causal questions. Xie and Mu (2019) proposed three causal word embedding models which can map the labels of sentence level cause-effect pairs to word level causal relations. Then the learned causal word embeddings were used in word pairs classification and causal question answering. Huang et al. (2019) proposed a machine reading comprehension style causal commonsense reasoning dataset. In this work we gather most of the current causal inference datasets to form a novel causal inference benchmark.

6 Conclusion

In this paper, we extend the idea of CausalBERT Li et al. (2020) from previous studies and conduct experiments on various datasets to evaluate its effectiveness. Extensive experiments show that CausalBERT is effective on causal inference tasks at different text levels, achieving new state-of-the-art or comparable results on seven causal inference benchmark datasets.

References
  • C. Buck, K. Heafield, and B. van Ooyen (2014) N-gram counts and language models from the Common Crawl. In LREC, Cited by: §2.1.1, §2.1.2.
  • S. Chen, Y. Hou, Y. Cui, W. Che, T. Liu, and X. Yu (2020) Recall and learn: fine-tuning deep pretrained language models with less forgetting. ArXiv abs/2004.12651. Cited by: §2.3.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186. Cited by: §1, §2.4, §3.1, §4, footnote 1.
  • M. Geva, A. Gupta, and J. Berant (2020) Injecting numerical reasoning skills into language models. ArXiv abs/2004.04487. Cited by: §5.
  • R. Girju and D. I. Moldovan (2002) Text mining for causal relations. In FLAIRS Conference, Cited by: §2.1.2.
  • T. Goodwin, B. Rink, K. Roberts, and S. M. Harabagiu (2012) UTDHLT: copacetic system for choosing plausible alternatives. In SemEval@NAACL-HLT, Cited by: Table 3.
  • A. S. Gordon, C. A. Bejan, and K. Sagae (2011) Commonsense causal reasoning using millions of personal stories. In AAAI, Cited by: Table 3.
  • S. Gururangan, S. Swayamdipta, O. Levy, R. Schwartz, S. Bowman, and N. A. Smith (2018) Annotation artifacts in natural language inference data. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), New Orleans, Louisiana, pp. 107–112. Cited by: §4.
  • O. Hassanzadeh, D. Bhattacharjya, M. Feblowitz, K. Srinivas, M. Perrone, S. Sohrabi, and M. Katz (2019) Answering binary causal questions through large-scale text mining: an evaluation using cause-effect pairs from human experts. IJCAI. Cited by: §1, §1, §3.1, Table 1, §4, §5.
  • I. Hendrickx, S. N. Kim, Z. Kozareva, P. Nakov, D. Ó. Séaghdha, S. Padó, M. Pennacchiotti, L. Romano, and S. Szpakowicz (2009) SemEval-2010 task 8: multi-way classification of semantic relations between pairs of nominals. ArXiv abs/1911.10422. Cited by: §3.1.
  • L. Huang, R. L. Bras, C. Bhagavatula, and Y. Choi (2019) Cosmos QA: machine reading comprehension with contextual commonsense reasoning. In EMNLP/IJCNLP, Cited by: §1, §3.3, Table 5, §5.
  • P. Kavumba, N. Inoue, B. Heinzerling, K. Singh, P. Reisert, and K. Inui (2019) When choosing plausible alternatives, Clever Hans can be clever. In Proceedings of the First Workshop on Commonsense Inference in Natural Language Processing, Hong Kong, China, pp. 33–42. Cited by: §3.2, Table 2, §4.
  • J. N. Kirkpatrick, R. Pascanu, N. C. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, D. Hassabis, C. Clopath, D. Kumaran, and R. Hadsell (2017) Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences of the United States of America 114 (13), pp. 3521–3526. Cited by: §1, §4.
  • Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut (2019) ALBERT: a lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942. Cited by: §2.4, §4, footnote 1.
  • Y. Levine, B. Lenz, O. Dagan, O. Ram, D. Padnos, O. Sharir, S. Shalev-Shwartz, A. Shashua, and Y. Shoham (2019) SenseBERT: driving some sense into BERT. ArXiv abs/1908.05646. Cited by: §5.
  • Z. Li, T. Chen, and B. Van Durme (2019a) Learning to rank for plausible plausibility. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 4818–4823. Cited by: §1, §1, §2.2, §2.2, §2.2, Table 2, §4.
  • Z. Li, X. Ding, T. Liu, J. E. Hu, and B. Van Durme (2020) Guided generation of cause and effect. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20, C. Bessiere (Ed.), pp. 3629–3636. Note: Main track. Cited by: §1, §1, §2.1.1, §2.1.2, §2.2, §6.
  • Z. Li, X. Ding, and T. Liu (2018) Constructing narrative event evolutionary graph for script event prediction. In IJCAI, pp. 4201–4207. Cited by: §2.2.
  • Z. Li, X. Ding, and T. Liu (2019b) Story ending prediction by transferable BERT. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pp. 1800–1806. Cited by: §2.2, §2.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692. Cited by: §1, §2.4, §4, footnote 1.
  • Z. Luo, Y. Sha, K. Q. Zhu, S. Hwang, and Z. Wang (2016) Commonsense causal reasoning between short texts. In KR, Cited by: §1, §2.1.2, Table 3.
  • T. McCoy, E. Pavlick, and T. Linzen (2019) Right for the wrong reasons: diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 3428–3448. Cited by: §4.
  • P. Mirza, R. Sprugnoli, S. Tonelli, and M. Speranza (2014) Annotating causality in the TempEval-3 corpus. In EACL 2014, Cited by: §1.
  • A. Nie, E. Bennett, and N. Goodman (2019) DisSent: learning sentence representations from explicit discourse relations. In ACL, Cited by: §1.
  • M. E. Peters, M. Neumann, R. L. Logan IV, R. Schwartz, V. Joshi, S. Singh, and N. A. Smith (2019) Knowledge enhanced contextual word representations. EMNLP. Cited by: §5.
  • J. Phang, T. Févry, and S. R. Bowman (2018) Sentence encoders on stilts: supplementary training on intermediate labeled-data tasks. arXiv preprint arXiv:1811.01088. Cited by: §2.2, §2.
  • A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever (2018) Improving language understanding by generative pre-training. Cited by: §1.
  • C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2019) Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683. Cited by: §1, §4.
  • H. Rashkin, M. Sap, E. Allaway, N. A. Smith, and Y. Choi (2018) Event2Mind: commonsense inference on events, intents, and reactions. In ACL, Cited by: §2.1.1.
  • M. Roemmele, C. A. Bejan, and A. S. Gordon (2011) Choice of plausible alternatives: an evaluation of commonsense causal reasoning. In AAAI Spring Symposium on Logical Formalizations of Commonsense Reasoning, Cited by: §1, §1, §3.2, §5.
  • M. Sap, R. L. Bras, E. Allaway, C. Bhagavatula, N. Lourie, H. Rashkin, B. Roof, N. A. Smith, and Y. Choi (2019) ATOMIC: an atlas of machine commonsense for if-then reasoning. In AAAI, Cited by: §2.1.1.
  • M. Sap, H. Rashkin, D. Chen, R. Le Bras, and Y. Choi (2019) Social IQa: commonsense reasoning about social interactions. In EMNLP, Hong Kong, China, pp. 4462–4472. Cited by: Table 2.
  • S. Sasaki, S. Takase, N. Inoue, N. Okazaki, and K. Inui (2017) Handling multiword expressions in causality estimation. In IWCS, Cited by: Table 3.
  • R. Sharp, M. Surdeanu, P. Jansen, P. Clark, and M. Hammond (2016) Creating causal embeddings for question answering with minimal supervision. In EMNLP, Cited by: §1, §1, §3.1, §3.3, Table 4, §5.
  • S. Sohrabi, M. Katz, O. Hassanzadeh, O. Udrea, M. D. Feblowitz, and A. Riabov (2019) IBM scenario planning advisor: plan recognition as AI planning in practice. AI Commun. 32, pp. 1–13. Cited by: §3.1.
  • S. Sohrabi, M. Katz, O. Hassanzadeh, O. Udrea, and M. D. Feblowitz (2018a) IBM scenario planning advisor: plan recognition as AI planning in practice. In IJCAI, Cited by: §3.1.
  • S. Sohrabi, A. Riabov, M. Katz, and O. Udrea (2018b) An AI planning solution to scenario generation for enterprise risk management. In AAAI, Cited by: §3.1.
  • R. Speer, J. Chin, and C. Havasi (2016) ConceptNet 5.5: an open multilingual graph of general knowledge. In AAAI, Cited by: §1, §2.1.1.
  • R. Wang, D. Tang, N. Duan, Z. Wei, X. Huang, J. Ji, C. Cao, D. Jiang, and M. Zhou (2020) K-adapter: infusing knowledge into pre-trained models with adapters. ArXiv abs/2002.01808. Cited by: Table 5, §4, §4.
  • Y. Xian, C. H. Lampert, B. Schiele, and Z. Akata (2019) Zero-shot learning—a comprehensive evaluation of the good, the bad and the ugly. IEEE Transactions on Pattern Analysis and Machine Intelligence 41, pp. 2251–2265. Cited by: §1.
  • Z. Xie and F. Mu (2019) Distributed representation of words in cause and effect spaces. In AAAI, Cited by: §1, §2.1.2, Table 4, §5.
  • Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le (2019) XLNet: generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237. Cited by: §1.
  • Z. Zhang, X. Han, Z. Liu, X. Jiang, M. Sun, and Q. Liu (2019) ERNIE: enhanced language representation with informative entities. ACL. Cited by: §5.
  • B. Zhou, Q. Ning, D. Khashabi, and D. Roth (2020) Temporal common sense acquisition with minimal supervision. ArXiv abs/2005.04304. Cited by: §1, §5.