
BUT-FIT at SemEval-2020 Task 5: Automatic detection of counterfactual statements with deep pre-trained language representation models

This paper describes BUT-FIT's submission at SemEval-2020 Task 5: Modelling Causal Reasoning in Language: Detecting Counterfactuals. The challenge focused on detecting whether a given statement contains a counterfactual (Subtask 1) and extracting both antecedent and consequent parts of the counterfactual from the text (Subtask 2). We experimented with various state-of-the-art language representation models (LRMs). We found RoBERTa LRM to perform the best in both subtasks. We achieved the first place in both exact match and F1 for Subtask 2 and ranked second for Subtask 1.



1 Introduction

This work is licensed under a Creative Commons Attribution 4.0 International Licence. Licence details:

One of the concerns of SemEval-2020 Task 5: Modelling Causal Reasoning in Language: Detecting Counterfactuals [16] is to research the extent to which current state-of-the-art systems can detect counterfactual statements. A counterfactual statement, as defined in this competition, is a conditional composed of two parts. The former part is the antecedent – a statement that is contradictory to known facts. The latter is the consequent – a statement that describes what would happen had the antecedent held. (According to several definitions in the literature, e.g. [12], the antecedent of a counterfactual need not counter the facts.) To detect a counterfactual statement, the system often needs to possess commonsense world knowledge to detect whether the antecedent contradicts it. In addition, such a system must be able to reason over the consequences that would arise had the antecedent been true. In some cases, the consequent might not be present at all; instead, a sequence resembling a consequent, but containing no consequential statement, might be present. Figure 1 shows a set of examples drawn from the data.

Figure 1: Three examples from the training data containing counterfactual statements. Antecedents are highlighted with red bold, consequents with blue bold italic. The last example has no consequent.

Counterfactuals have been studied across a wide spectrum of domains. For instance, logicians and philosophers focus on the logical rules between the parts of a counterfactual and its outcome [4]. Political scientists have conducted counterfactual thought experiments for hypothetical tests on historical events, policies or other aspects of society [13]. However, there is only a small amount of work in computational linguistics studying this phenomenon. SemEval-2020 Task 5 aims at filling this gap in the field. The challenge consists of two subtasks:

  1. Detecting counterfactual statements – classify whether the sentence has a counterfactual statement.

  2. Detecting antecedent and consequence – extract boundaries of antecedent and consequent from the input text.

The approaches we adopted follow recent advancements in deep pre-trained language representation models. In particular, we experimented with fine-tuning BERT [3], RoBERTa [7] and ALBERT [6] models. Our implementation is available online.

2 System overview

2.1 Language Representation Models

We experimented with three language representation models (LRMs):

BERT [3] is pre-trained using a multi-task objective consisting of denoising LM and inter-sentence coherence (ISC) sub-objectives. The LM objective aims at predicting the identity of 15% randomly masked tokens present at the input. (This explanation of token masking is simplified; we refer readers to the original paper [3] for details.) Given two sentences from the corpus, the ISC objective is to classify whether the second sentence follows the first sentence in the corpus. The second sentence is replaced randomly in half of the cases. During pre-training, the input consists of two documents, each represented as a sequence of tokens, divided by the special token [SEP] and preceded by the token [CLS] used by the ISC objective, i.e. [CLS] first document [SEP] second document [SEP]. The input tokens are represented via jointly learned token embeddings, segment embeddings capturing whether the token belongs to the first or the second document, and positional embeddings, since self-attention is a position-invariant operation. During fine-tuning, we leave the second segment empty.
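The input layout described above can be illustrated with a small sketch. The function name `build_input` and the choice to emit a single trailing [SEP] when the second segment is empty are assumptions made for illustration; they are not taken from the authors' implementation.

```python
from typing import List, Sequence, Tuple

CLS, SEP = "[CLS]", "[SEP]"


def build_input(first: Sequence[str],
                second: Sequence[str] = ()) -> Tuple[List[str], List[int]]:
    """Lay out tokens as [CLS] first [SEP] second [SEP] with segment ids.

    Hypothetical helper: when `second` is empty (as in fine-tuning),
    only [CLS] first [SEP] is produced, which is an assumption.
    """
    tokens = [CLS] + list(first) + [SEP]
    segments = [0] * len(tokens)          # segment 0 covers [CLS], first, [SEP]
    if second:
        tokens += list(second) + [SEP]
        segments += [1] * (len(second) + 1)  # segment 1 covers second and its [SEP]
    return tokens, segments


if __name__ == "__main__":
    print(build_input(["if", "only"], ["we", "knew"]))
    print(build_input(["if", "only"]))
```

Positions are implied by the list index, which is what the positional embeddings would encode.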

RoBERTa [7] is a BERT-like model with a different training procedure. This includes dropping the ISC sub-objective, tokenizing via byte pair encoding [11] instead of WordPiece, full-length training sequences, more training data, a larger batch size, dynamic token masking instead of token masking done during preprocessing, and more hyperparameter tuning.


ALBERT [6] is a RoBERTa-like model, but with n-gram token masking (consecutive n-grams of random length from the input are masked), cross-layer parameter sharing, a novel ISC objective that aims at detecting whether the order of two consecutive sentences matches the data, input embedding factorization, SentencePiece tokenization [5] and a much larger model dimension. The model is currently at the top of leaderboards for many natural language understanding tasks including GLUE [14] and SQuAD2.0 [10].

2.2 Subtask 1: Detecting counterfactual statements

The first part of the challenge is a binary classification task, where the participating systems determine whether the input sentence is a counterfactual statement.

A baseline system applying an SVM classifier [2] over TF-IDF features was supplied by the organizers. We modified this script to use other simple classifiers over the same features – namely Gaussian Naive Bayes and a 6-layer perceptron network with 64 neurons in each layer.

As a more serious attempt at tackling the task, we compare these baselines with state-of-the-art LRMs – RoBERTa and ALBERT. The input is encoded the same way as in Section 2.3. We trained both models with the cross-entropy objective and used a linear transformation of the CLS-level output, after applying dropout, for classification. After the hyperparameter search, we found that the RoBERTa model performed best on this task. For our final system, we built an ensemble from the best checkpoints of the RoBERTa model.
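As a rough illustration of the classification head described above, the following sketch applies a linear transformation to a CLS-level vector and computes the cross-entropy loss. The helper names, the tiny dimensions and the weights are hypothetical; dropout is omitted because it acts as the identity at inference time.

```python
import math
from typing import List


def linear(x: List[float], weight: List[List[float]], bias: List[float]) -> List[float]:
    """Affine map: one output logit per row of `weight`."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def cross_entropy(logits: List[float], target: int) -> float:
    """Negative log-probability of the target class."""
    return -log_softmax(logits)[target]


if __name__ == "__main__":
    cls_output = [0.5, -1.0, 2.0]                    # made-up CLS-level vector
    weight = [[0.1, 0.2, -0.3], [0.4, -0.1, 0.2]]    # 2 classes x 3 dims (made up)
    logits = linear(cls_output, weight, [0.0, 0.0])
    print(logits, cross_entropy(logits, target=1))
```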

2.3 Subtask 2: Detecting antecedent and consequence

We extended each LRM in the same way Devlin et al. [3] extended BERT for SQuAD. The input representation $E \in \mathbb{R}^{L \times d_i}$ for input $x$ is obtained by summing the input embedding matrices, $E = E_w + E_p + E_s$, representing its word embeddings $E_w$, position embeddings $E_p$ and segment embeddings $E_s$ (RoBERTa does not use segment embeddings), with $L$ being the input length and $d_i$ the input dimensionality. Applying the LRM and dropout $\delta$, an output matrix $H = \delta(\mathrm{LRM}(E)) \in \mathbb{R}^{L \times d_o}$ is obtained, $d_o$ being the LRM's output dimensionality. Finally, a linear transformation is applied to $H$ to obtain the logit vectors for the antecedent start/end, $a_s, a_e \in \mathbb{R}^{L}$, and the consequent start/end, $c_s, c_e \in \mathbb{R}^{L}$.

For the consequent, we do not mask the CLS-level output and use it as a no-consequent option for both $c_s$ and $c_e$. Therefore, we predict that there is no consequent iff the model's predicted consequent start and end are both $i_{CLS}$, where $i_{CLS}$ is the index of the CLS-level output. Finally, the log-softmax is applied and the model is trained by minimizing cross-entropy for each tuple $(x, t)$ of inputs and target indices from the dataset $\mathcal{D}$.
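One possible way to decode a span from start/end logits with a CLS-based "no consequent" option is sketched below. Only the rule that a CLS/CLS prediction means "no consequent" comes from the text; the joint search over start/end pairs, the optional length cap and the name `decode_span` are assumptions made for illustration.

```python
import math
from typing import List, Optional, Tuple


def log_softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def decode_span(start_logits: List[float], end_logits: List[float],
                cls_index: int = 0, max_len: Optional[int] = None,
                allow_empty: bool = False) -> Optional[Tuple[int, int]]:
    """Return the best (start, end) span, or None for "no span".

    Assumed decoding: maximise start + end log-probability over all
    pairs with start <= end, skipping the CLS position. With
    allow_empty=True the (CLS, CLS) pair competes with real spans.
    """
    start_lp, end_lp = log_softmax(start_logits), log_softmax(end_logits)
    best_score, best_span = -math.inf, None
    for s in range(len(start_lp)):
        if s == cls_index:
            continue
        for e in range(s, len(end_lp)):
            if e == cls_index:
                continue
            if max_len is not None and e - s + 1 > max_len:
                break
            score = start_lp[s] + end_lp[e]
            if score > best_score:
                best_score, best_span = score, (s, e)
    if allow_empty and start_lp[cls_index] + end_lp[cls_index] > best_score:
        return None  # both start and end point at the CLS-level output
    return best_span


if __name__ == "__main__":
    print(decode_span([0.0, 1.0, 4.0, 0.0], [0.0, 0.0, 1.0, 5.0]))
    print(decode_span([5.0, 0.0, 0.0, 0.0], [5.0, 0.0, 0.0, 0.0], allow_empty=True))
```

Under this sketch, an antecedent would be decoded with `allow_empty=False` (CLS masked) and a consequent with `allow_empty=True`.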


An ensemble was built using a greedy heuristic seeking the smallest subset of the pool of trained models such that it obtains the best exact match on a validation set. (For more details on how the ensemble was built, see TOP-N fusion in Fajcik et al. (2019).)

3 Experimental setup

3.1 Data

For each subtask, only training data without a predefined split was provided. Therefore, we took the first 3000 examples from the Subtask 1 data and 355 random examples from the Subtask 2 data as validation data. The train/validation/test split was 10000/3000/7000 for Subtask 1 and 3196/355/1950 for Subtask 2. 88.2% of the validation examples in Subtask 1 were labeled 0 (non-counterfactual).
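The two splitting strategies can be sketched as follows. The function names and the fixed random seed are hypothetical; the text does not state how the random examples for Subtask 2 were drawn.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_head(examples: Sequence[T], n_valid: int) -> Tuple[List[T], List[T]]:
    """Use the first n_valid examples for validation (Subtask 1 style)."""
    return list(examples[n_valid:]), list(examples[:n_valid])


def split_random(examples: Sequence[T], n_valid: int,
                 seed: int = 0) -> Tuple[List[T], List[T]]:
    """Hold out n_valid random examples (Subtask 2 style); seed is an assumption."""
    rng = random.Random(seed)
    valid_idx = set(rng.sample(range(len(examples)), n_valid))
    train = [ex for i, ex in enumerate(examples) if i not in valid_idx]
    valid = [ex for i, ex in enumerate(examples) if i in valid_idx]
    return train, valid


if __name__ == "__main__":
    train1, valid1 = split_head(list(range(13000)), 3000)
    train2, valid2 = split_random(list(range(3551)), 355)
    print(len(train1), len(valid1), len(train2), len(valid2))
```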

3.2 Preprocessing & Tools

In the case of Subtask 1, after performing a length analysis of the data, we truncated input sequences to a length of 100 tokens for the LM-based models in order to reduce worst-case memory requirements, since only 0.41% of the training sentences were longer than this limit. A histogram of the example lengths in tokens is presented in Appendix A.2. For Subtask 2, all the input sequences fit the maximum input length of 509 tokens.
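A length analysis and truncation of this kind can be sketched in a few lines. The helpers below operate on already tokenized inputs; their names are hypothetical and the tokenizer itself is outside the sketch.

```python
from typing import List, Sequence


def fraction_longer(lengths: Sequence[int], max_len: int = 100) -> float:
    """Share of examples whose token count exceeds max_len."""
    return sum(1 for n in lengths if n > max_len) / len(lengths)


def truncate(tokens: List[str], max_len: int = 100) -> List[str]:
    """Keep at most the first max_len tokens."""
    return tokens[:max_len]


if __name__ == "__main__":
    lengths = [10, 100, 101, 250]           # made-up token counts
    print(fraction_longer(lengths))
    print(len(truncate(["tok"] * 150)))
```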

For the preliminary experiments with simpler machine learning methods, we adopted the baseline script provided by the organizers, which is based on a Python module. We implemented our neural network models in PyTorch [9] using the transformers [15] library. In particular, we experimented with roberta-large and albert-xxlarge-v2 in Subtask 1 and with bert-base-uncased, bert-large-uncased, roberta-large and albert-xxlarge-v1 models in Subtask 2. We used hyperopt [1] to tune model hyperparameters. See Appendix A.1 for further details on hyperparameters. We used the Adam optimizer with decoupled weight decay [8]. For Subtask 2, we combined this optimizer with lookahead [17]. All models were trained on a 12 GB GPU.

4 Results and analysis

For Subtask 1, we adapted the baseline provided by the task organizers to assess how well more classical machine learning approaches perform on the dataset. After seeing the subpar performance, we turned our attention to pre-trained LRMs, namely RoBERTa and ALBERT. The results of the best run of each model can be found in Table 1. A more comprehensive list of results for different hyperparameters can be found in Appendix A.1.

Our final submission is an ensemble of RoBERTa-large models, since we found that this LRM performs better than ALBERT on this task. We trained a number of models on the train set and computed F1 scores on the validation part. The 10 best single models (in terms of F1) were selected, and the output probabilities were averaged for all possible combinations of these models. The combination with the highest F1 score was selected as the final ensemble. Then we trained new models with the same parameters as the models in the ensemble, but using the whole training data, including the part that was previously used for validation. Finally, for our submitted ensemble, we used checkpoints saved after the same number of updates as the best checkpoints of the systems trained only on part of the training data.
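The exhaustive search over model combinations can be sketched as below. This is an illustration under stated assumptions: each model is represented by its predicted probability of the positive class per validation example, and the 0.5 decision threshold and the name `best_ensemble` are hypothetical rather than taken from the authors' code.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def f1_score(gold: Sequence[int], pred: Sequence[int]) -> float:
    """Binary F1 for the positive class (label 1)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_ensemble(model_probs: Dict[str, List[float]], gold: Sequence[int],
                  threshold: float = 0.5) -> Tuple[Tuple[str, ...], float]:
    """Try every non-empty subset of models, averaging their probabilities."""
    names = sorted(model_probs)
    best_f1, best_combo = -1.0, ()
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            avg = [sum(model_probs[n][i] for n in combo) / size
                   for i in range(len(gold))]
            pred = [1 if p >= threshold else 0 for p in avg]
            score = f1_score(gold, pred)
            if score > best_f1:   # strict: ties keep the smaller, earlier subset
                best_f1, best_combo = score, combo
    return best_combo, best_f1


if __name__ == "__main__":
    gold = [1, 0, 1, 0]
    probs = {"a": [0.9, 0.6, 0.2, 0.1], "b": [0.2, 0.1, 0.9, 0.2]}
    print(best_ensemble(probs, gold))
```

With 10 candidate models this enumerates 2^10 − 1 = 1023 subsets, which is cheap because only cached validation probabilities are combined.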

Model Precision Recall F1
SVM 80.55 8.19 14.87
Naive Bayes 22.81 28.81 25.47
MLP 39.01 29.09 33.33

RoBERTa-large-ens 87.30

Table 1: Results of different models on validation part of Subtask 1 training data (first 3000 sentences). Results for RoBERTa and ALBERT models are averaged over ten training runs with the best found hyperparameters.
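As a quick consistency check, the F1 column can be recomputed from the precision and recall columns as their harmonic mean. The helper name `f1_from_pr` is hypothetical.

```python
def f1_from_pr(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    rows = {"SVM": (80.55, 8.19), "Naive Bayes": (22.81, 28.81), "MLP": (39.01, 29.09)}
    for name, (p, r) in rows.items():
        print(f"{name}: {f1_from_pr(p, r):.2f}")
```

The SVM and MLP rows reproduce the reported 14.87 and 33.33. The Naive Bayes row may differ in the last digit, presumably because the tabulated precision and recall are already rounded.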

We performed an error analysis of the best single RoBERTa and ALBERT models. The RoBERTa model misclassified 52 examples (29 false positives, 23 false negatives), while ALBERT misclassified 60 examples (32 false positives, 23 false negatives). 29 wrongly classified examples were common to both models. Examples of wrongly classified statements are presented in Appendix A.3.

Table 2: Results on the Subtask 2 validation data. For EM/F1, we report means and standard deviations; the statistics were collected from 10 runs. We also report the number of each model's parameters. EM/F1 were additionally measured for the extraction of the antecedent and the consequent separately, denoted as $A_{EM}$, $A_{F1}$ and $C_{EM}$, $C_{F1}$ respectively. Finally, ACC denotes no-consequent classification accuracy.

For Subtask 2, the results are presented in Table 2. The hyperparameters were the same for all LRMs. An ensemble was composed of 11 models drawn from the pool of 60 trained models. We found the ALBERT results to have a high variance. In fact, we recorded our overall best EM/F1 result on validation data with ALBERT. However, in the competition, we submitted only RoBERTa models due to their lower variance and slightly better results on average. (We submitted the best ALBERT model in the post-evaluation challenge phase, obtaining worse test data results than the ensemble.)

5 Related work

Closest to our work, Son et al. (2017) created a counterfactual tweet dataset and built a pipeline classifier to detect counterfactuals. The authors identified 7 distinct categories of counterfactuals and first attempted to classify the examples into one of these categories using a set of rules. Then, for certain categories, they used a linear SVM classifier [2] to filter out tricky false positives.

A large effort in computational linguistics has been devoted to a specific form of counterfactuals – so-called what-if questions. A recent paper by Tandon et al. (2019) presents a new dataset for what-if question answering, including a strong, BERT-based baseline. The task is to choose an answer to a hypothetical question about a cause and an effect, e.g. "Do more wildfires result in more erosion by the ocean?". Each question is accompanied by a paragraph focused on the topic of the question, which may or may not contain enough information to choose the correct option. The authors show that there is still a large performance gap between humans and state-of-the-art models (73.8% accuracy for BERT against 96.3% for a human). This gap is caused mainly by the inability of the BERT model to answer more complicated questions based on indirect effects, which require more reasoning steps. However, the results show that the BERT model was able to answer a large portion of the questions even without the accompanying paragraphs, indicating that LRMs have a notion of commonsense knowledge.

6 Conclusions

We examined the performance of current state-of-the-art language representation models on both subtasks and found that yet another NLP task benefits from unsupervised pre-training. In both cases, we found the RoBERTa model to perform slightly better than the other LRMs, while its results were also more stable. We ended up first in both EM and F1 on Subtask 2 and second in Subtask 1.


This work was supported by the Czech Ministry of Education, Youth and Sports, subprogram INTERCOST, project code: LTC18006.


  • [1] J. Bergstra, D. Yamins, and D. D. Cox (2013) Making a science of model search: hyperparameter optimization in hundreds of dimensions for vision architectures.
  • [2] C. Cortes and V. Vapnik (1995) Support-vector networks. Machine learning 20 (3), pp. 273–297.
  • [3] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186.
  • [4] N. Goodman (1947) The problem of counterfactual conditionals. The Journal of Philosophy 44 (5), pp. 113–128.
  • [5] T. Kudo and J. Richardson (2018) SentencePiece: a simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226.
  • [6] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut (2019) ALBERT: a lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.
  • [7] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
  • [8] I. Loshchilov and F. Hutter (2017) Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
  • [9] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8024–8035.
  • [10] P. Rajpurkar, R. Jia, and P. Liang (2018) Know what you don’t know: unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 784–789.
  • [11] R. Sennrich, B. Haddow, and A. Birch (2016) Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1715–1725.
  • [12] W. Starr (2019) Counterfactuals. In The Stanford Encyclopedia of Philosophy, E. N. Zalta (Ed.).
  • [13] P. E. Tetlock and A. Belkin (1996) Counterfactual thought experiments in world politics: logical, methodological, and psychological perspectives. Princeton University Press.
  • [14] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman (2018) GLUE: a multi-task benchmark and analysis platform for natural language understanding. EMNLP 2018, pp. 353.
  • [15] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, et al. (2019) Transformers: state-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.
  • [16] X. Yang, S. Obadinma, H. Zhao, Q. Zhang, S. Matwin, and X. Zhu (2020) SemEval-2020 task 5: counterfactual recognition. In Proceedings of the 14th International Workshop on Semantic Evaluation (SemEval-2020), Barcelona, Spain.
  • [17] M. Zhang, J. Lucas, J. Ba, and G. E. Hinton (2019) Lookahead optimizer: k steps forward, 1 step back. In Advances in Neural Information Processing Systems, pp. 9593–9604.

Appendix A Supplemental Material

A.1 Hyperparameters

A.1.1 Subtask 1

The results of RoBERTa models with their training hyperparameters are presented in Table 3.

batch size    learning rate    best acc        best F1
48            2.00E-05         0.9829943314    0.9209302326
70            1.00E-05         0.9823274425    0.9180834621
72            2.00E-05         0.9829943314    0.9199372057
90            4.00E-05         0.9829943314    0.9209302326
90            1.00E-05         0.9783261087    0.8992248062
96            3.00E-05         0.9839946649    0.9240506329
120           3.00E-05         0.9809936646    0.9107981221
132           4.00E-05         0.9796598866    0.9060092450
Table 3: Different batch sizes and learning rates used to train RoBERTa-large models, results of the best checkpoint on the validation part of the data.

We kept other RoBERTa model hyperparameters as shown in Table 4 for all training runs.

Hyperparameter          Value
Max gradient norm       1.0
Epochs                  8
Maximum input length    100
Dropout                 0.1
Optimizer               Adam ($\epsilon$ = 1e-8)
Table 4: Hyperparameters for Subtask 1, shared for all runs.

A.1.2 Subtask 2

Our tuned hyperparameters are in Table 5. All other hyperparameters were left the same as PyTorch’s default. We did not use any learning rate scheduler.

Hyperparameter               Value
Dropout rate (last layer)    0.0415
Lookahead K                  1.263e-5
Lookahead $\alpha$           0.470
Max gradient norm            7.739
Batch size                   64
Weight Decay                 0.02
Patience                     5
Max antecedent length        116
Max consequent length        56
Table 5: Hyperparameters for Subtask 2. We tune only the dropout at the last layer (the dropout mentioned in Section 2.3). Patience denotes the number of epochs without EM improvement after which we stop the training. All parameters were tuned using HyperOpt [1].

A.2 Data analysis

The distribution of lengths for examples from Subtask 1 is presented in Figure 2. We truncate sequences in this subtask to a maximum of 100 tokens per example.

Figure 2: Histogram of example lengths in tokens in the training data for Subtask 1.

A.3 Wrongly classified examples

Table 6 shows examples of statements classified wrongly by both ALBERT and RoBERTa models.

Statement Predicted Correct
MAUREEN DOWD VISITS SECRETARY NAPOLITANO - ”New Year’s Resolutions: If only we could put America in Tupperware”: ”Janet Napolitano and I hadn’t planned to spend New Year’s Eve together. 0 1
If the current process fails, however, in hindsight some will say that it might have made more sense to outsource the whole effort to a commercial vendor. 1 0
Table 6: Examples of wrong predictions.

A.4 Ambiguous labels

During the error analysis, we noticed a number of examples where we were not sure whether the labels are correct (see Table 7).

Statement Label
Given that relatively few people have serious, undiagnosed arrhythmias with no symptoms (if people did, we would be screening for this more often), this isn’t the major concern. 0
A flu shot will not always prevent you from getting flu, but most will have a less severe course of flu than if they hadn’t had the shot,” Dr. Morens said. 0
Table 7: Examples of ambiguous annotation.

A.5 Measurement of results

The individual measurements for the Subtask 2 statistics presented in Table 2 are available online. Note that we did not use the same evaluation script as the official baseline. Our evaluation script was SQuAD1.1-like: the ground truth and extracted strings were first normalized the same way as in SQuAD1.1, and then the strings were compared. For details see our implementation of method evaluate_semeval2020_task5 in scripts/common/
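For readers unfamiliar with SQuAD1.1-style scoring, the sketch below shows the usual normalization (lowercasing, removing punctuation and English articles, collapsing whitespace) followed by exact-match and token-level F1. It follows the well-known SQuAD v1.1 recipe and is not the authors' evaluate_semeval2020_task5; the function names here are illustrative.

```python
import re
import string
from collections import Counter

_PUNCT = set(string.punctuation)


def normalize_answer(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCT)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, truth: str) -> float:
    return float(normalize_answer(prediction) == normalize_answer(truth))


def f1(prediction: str, truth: str) -> float:
    """Token-level F1 between the normalized strings."""
    pred_tokens = normalize_answer(prediction).split()
    true_tokens = normalize_answer(truth).split()
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    truth = "guns had not been available"
    print(exact_match("The guns had not been available.", truth))
    print(f1("if guns had not been available", truth))
```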

A.6 Wrong predictions in Subtask 2

Ground Truth Prediction
GLOBAL FOOTPRINT Mylan said in a separate statement that the combination would create ”a vertically and horizontally integrated generics and specialty pharmaceuticals leader with a diversified revenue base and a global footprint.” On a pro forma basis, the combined company would have had revenues of about $4.2 billion and a gross profit, or EBITDA, of about $1.0 billion in 2006, Mylan said. GLOBAL FOOTPRINT Mylan said in a separate statement that the combination would create ”a vertically and horizontally integrated generics and specialty pharmaceuticals leader with a diversified revenue base and a global footprint.” On a pro forma basis, the combined company would have had revenues of about $4.2 billion and a gross profit, or EBITDA, of about $1.0 billion in 2006, Mylan said.
Shortly after the theater shooting in 2012, he told ABC that the gunman was ”diabolical” and would have found another way to carry out his massacre if guns had not been available, a common argument from gun-control opponents. Shortly after the theater shooting in 2012, he told ABC that the gunman was ”diabolical” and would have found another way to carry out his massacre if guns had not been available, a common argument from gun-control opponents.
Now, if the priests in the Vatican had done their job in the first place, a quiet conversation, behind closed doors and much of it would have been prevented. Now, if the priests in the Vatican had done their job in the first place, a quiet conversation, behind closed doors and much of it would have been prevented.
The CPEC may have some advantages for Pakistan’s economy – for one, it has helped address the country’s chronic power shortage – but the costs are worrisome and unless they can be wished away with a wand, it will present significant issues in the future. The CPEC may have some advantages for Pakistan’s economy – for one, it has helped address the country’s chronic power shortage – but the costs are worrisome and unless they can be wished away with a wand, it will present significant issues in the future.
Table 8: Examples of bad predictions from the LRM on the Subtask 2 validation data. Antecedents are highlighted with red bold, consequents with blue bold italic.