
Causality Detection using Multiple Annotation Decision

by   Quynh Anh Nguyen, et al.
ETH Zurich

The paper describes the work that has been submitted to the 5th workshop on Challenges and Applications of Automated Extraction of socio-political events from text (CASE 2022). The work is associated with Subtask 1 of Shared Task 3, which aims to detect causality in a protest news corpus. The authors used different large language models with customized cross-entropy loss functions that exploit annotation information. The experiments showed that bert-base-uncased with the refined cross-entropy loss outperformed the others, achieving an F1 score of 0.8501 on the Causal News Corpus dataset.



1 Introduction

A causal relationship in a sentence implies an underlying semantic dependency between its two main clauses. The clauses in such sentences are generally connected by markers, which can carry different part-of-speech tags. Moreover, the markers can be either implicit or explicit, so one cannot rely on regex- or dictionary-based systems; instead, the context of the sentences has to be taken into account. For the given task, we exploited different large language models, which provide contextual representations of sentences, to tackle causality detection.

Subtask 1 of Shared Task 3 at CASE 2022 tan-etal-2022-event targets causality detection in a news corpus, which can be framed as a text classification problem with binary labels. Pre-trained transformer-based models transformers have shown success in tackling a wide range of NLP tasks, including text generation and text classification. The authors look into inter-annotator agreement and the number of annotators, and how they can be incorporated into the loss to improve the performance of the pre-trained models.

The main contributions of the paper are as follows:

  1. Extensive experimentation with different large language models.

  2. Incorporation of additional annotation information, i.e., inter-annotator agreement and the number of annotators, into the loss.

The remainder of the paper is organized as follows: Section 2 reviews the related work, Section 3 describes the dataset on which the work has been done, Section 4 discusses the methodology used in the paper, Section 5 discusses the results and provides an ablation of the various loss functions introduced, and finally, Section 6 concludes the paper and suggests future work.

2 Related Work

Multiple annotations of a single sample reduce the chance of incorrect labels or of bias being incorporated into the dataset Snow2008CheapAF. However, including multiple annotators also leads to disagreement among the labels they provide. The final or gold annotation is then usually determined by majority voting Sabou2014CorpusAT or by using the label of an "expert" Waseem2016HatefulSO. There are also methodologies that do not use majority voting to select the "ground truth".

The Expectation Maximization algorithm has been used to account for annotator error Dawid1979MaximumLE. Entropy metrics have been developed to assess the performance of the annotators Waterhouse2012PayBT; Hovy2013LearningWT; Gordon2021TheDD. Multi-task learning has also been used to deal with disagreement in the labels Fornaciari2021BeyondB; Liu2019MultiTaskDN; Cohn2013ModellingAB; Davani2022DealingWD. Other methods include the annotation disagreement in the loss function for part-of-speech tagging with SVMs and perceptron models Plank2014LearningPT; Prabhakaran2012StatisticalMT. The present work incorporates both the inter-annotator agreement and the number of annotators into the loss function, in a way that is applicable to any model. The work also compares the performance when the annotators who disagree with the majority vote are ignored.

3 Dataset

The Causal News Corpus dataset tan_paper_dataset consists of 3,559 event sentences extracted from protest event news. Each sample in the dataset contains the text, the corresponding label, the number of experts who annotated the label, and the degree of agreement among the experts. Figure 1 shows a sample from the provided training set. The training data is fairly balanced, containing 1,603 sentences with a causal structure and 1,322 sentences without one. The numbers of causal and non-causal sentences in the validation set also do not differ significantly. Finally, 311 news articles have been used as the test set for evaluation.

Figure 1: A datapoint from the provided training data.

Besides the binary labels, the Causal News Corpus dataset also provides additional information: the number of experts who labeled the sentence (num_votes) and the percentage of agreement between them (agreement). Figure 1 shows that the number of experts who annotated the text "The farmworkers' strike resumed on Tuesday when their demands were not met." is 3. All of the experts labeled the sentence as causal, so the agreement is 1.0 (100% agreement) and the label is 1. If instead only one of the three experts had assigned label 1 to this text, the three fields num_votes, agreement, label would become 3, 0.67, 0, respectively. In this paper, the authors exploit this information to give the model a stronger prior and thus potentially improve its performance, as described in more detail in Section 4.
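The relationship between the expert votes and the three annotation fields can be illustrated with a short sketch. The helper function below is hypothetical (it is not part of the dataset's tooling); it simply derives the fields described above from a list of binary votes:

```python
# Hypothetical helper illustrating how the Causal News Corpus annotation
# fields (num_votes, agreement, label) relate to the individual expert votes.
def gold_annotation(votes):
    """Derive (num_votes, agreement, label) from a list of binary expert votes."""
    num_votes = len(votes)
    ones = sum(votes)
    label = 1 if ones * 2 > num_votes else 0           # majority vote
    majority = ones if label == 1 else num_votes - ones
    agreement = round(majority / num_votes, 2)          # share agreeing with majority
    return num_votes, agreement, label

# All three experts label the sentence as causal:
print(gold_annotation([1, 1, 1]))  # (3, 1.0, 1)
# Only one of three experts assigns label 1:
print(gold_annotation([1, 0, 0]))  # (3, 0.67, 0)
```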

4 Methodology

This section discusses the pipeline, the different loss functions that were implemented, and the experimental details used in the third shared task of CASE 2022 tan-etal-2022-event.

4.1 Pipeline

The authors finetuned large language models with different loss functions to tackle Subtask 1 in Shared Task 3 of CASE@EMNLP-2022, causality detection in a given sentence. The problem can be reformulated as a binary classification where the model predicts whether the sentence is causal or not. Since contextual awareness plays an essential role in handling this specific task, the authors used several transformer-based models, namely, BERT Devlin2019BERTPO, FinBERT Liu2020FinBERTAP, XLNET Yang2019XLNetGA and RoBERTa Liu2019RoBERTaAR.

The given sentence is first tokenized by a tokenizer from the corresponding pretrained model architecture provided by HuggingFace Wolf2019HuggingFacesTS. The vector output from the tokenization stage is then fed as input to the model. The most informative token is the classification token ([CLS]), a special token that can be used as a sentence representation. The [CLS] representation is then passed through a feed-forward network to generate logits. A softmax over the logits gives the probability of whether the sentence is causal or not. For each model, the authors experimented with the standard cross-entropy loss and with two proposed loss functions, described in detail in subsection 4.2.
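The classification-head logic described above can be sketched in plain Python. This is a minimal illustration, not the authors' code: the transformer encoder is stubbed out with a fixed [CLS] vector, and the layer sizes and weights are arbitrary toy values.

```python
import math

def softmax(logits):
    """Convert a list of logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def classify(cls_vector, weights, bias):
    """Feed the [CLS] representation through a linear layer, then softmax.

    In the real pipeline the [CLS] vector comes from a pretrained encoder
    (e.g. BERT); here it is just a plain list of floats.
    """
    logits = [sum(w * x for w, x in zip(row, cls_vector)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)  # [P(non-causal), P(causal)]

# Toy example: a 4-dimensional [CLS] vector and a 2-class head.
cls = [0.2, -0.1, 0.4, 0.05]
W = [[0.1, 0.2, -0.3, 0.0],    # weights for class 0 (non-causal)
     [-0.2, 0.1, 0.5, 0.3]]    # weights for class 1 (causal)
b = [0.0, 0.1]
probs = classify(cls, W, b)
assert abs(sum(probs) - 1.0) < 1e-9  # softmax output sums to 1
```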


4.2 Loss Functions

Cross Entropy Loss

The loss of the classification task can be represented by a simple cross-entropy loss, as shown in Equation 1:

$$\mathcal{L}_{CE} = -\frac{1}{M}\sum_{i=1}^{M}\left[y_i \log p_i + (1 - y_i)\log(1 - p_i)\right] \qquad (1)$$

where $y_i$ and $p_i$ denote the true label and the predicted probability of the causal class for the $i$-th input in a batch of M sentences.
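Equation 1 can be written out in a few lines of plain Python; this is a minimal reference sketch, not the authors' implementation:

```python
import math

def cross_entropy(y_true, y_pred):
    """Binary cross-entropy over a batch (Equation 1).

    y_true: gold labels in {0, 1}; y_pred: predicted probabilities of class 1.
    """
    M = len(y_true)
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, y_pred)) / M

# A confident correct prediction incurs a smaller loss than a hesitant one:
assert cross_entropy([1, 0], [0.9, 0.1]) < cross_entropy([1, 0], [0.6, 0.4])
```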

Noisy Cross Entropy Loss

The dataset not only provides the standard information about {text, label}, but also contains the number of experts who annotated the sentence's label and the proportion of agreement between them. The authors have considered the annotation by each expert to be a true label for the sentence. For a sentence with $n_i$ expert annotations and proportion of agreement $a_i$, the loss for each sentence, averaged over the $n_i$ annotations so that $n_i$ cancels, can be written as shown in Equation 2:

$$\ell_i = -\left[a_i \log p_i(y_i) + (1 - a_i)\log p_i(1 - y_i)\right] \qquad (2)$$

where $p_i(c)$ denotes the predicted probability of class $c$ for the $i$-th sentence. Combining these terms, the loss for a batch of M sentences can be rewritten as:

$$\mathcal{L}_{noisy} = -\frac{1}{M}\sum_{i=1}^{M}\left[a_i \log p_i(y_i) + (1 - a_i)\log p_i(1 - y_i)\right] \qquad (3)$$

The different annotations from all the experts are thus considered, adding more information to the model. Equation 3 takes the votes from the different experts into account: out of the $n_i$ votes, $n_i a_i$ assign the correct label and the remaining $n_i(1 - a_i)$ assign the incorrect one. If the labels from the different experts were taken directly, there would be conflicts in the labels whenever the experts disagree. Considering the loss for one sentence when the true label is 1, and writing $p = p_i(1)$, the derivative of the loss is shown in Equation 4:

$$\frac{\partial \ell_i}{\partial p} = -\frac{a_i}{p} + \frac{1 - a_i}{1 - p} \qquad (4)$$

Figure 2 shows that the loss is minimized when $p$ is equal to $a_i$, and its minimum shifts from 1 to 0 as the level of agreement decreases when the true label is 1. A similar profile is obtained when the true label is 0. The formulation pushes the solution towards a distribution where the ideal output is not a one-hot encoding, which is similar to the label smoothing method. Label smoothing was initially proposed by Szegedy et al. to improve the performance of the Inception architecture on the ImageNet dataset Deng2009ImageNetAL. In label smoothing, the ground truth sent to the model is not encoded as a one-hot representation. Since there are conflicts in the annotations and the loss considers all of the noisy data, it is referred to as the noisy cross-entropy loss.
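A plain-Python sketch of the noisy cross-entropy loss (a minimal illustration under the per-sentence formulation above, not the authors' code). The usage example checks numerically that, for a true label of 1, the per-sentence loss over the predicted probability is minimized when that probability equals the agreement $a$:

```python
import math

def noisy_ce(y_true, y_pred, agreement):
    """Noisy cross-entropy over a batch: each sentence's loss mixes the
    majority label (weight a) and the minority label (weight 1 - a)."""
    M = len(y_true)
    total = 0.0
    for y, p, a in zip(y_true, y_pred, agreement):
        p_correct = p if y == 1 else 1 - p          # prob. of the gold label
        total += -(a * math.log(p_correct) + (1 - a) * math.log(1 - p_correct))
    return total / M

# For true label 1 and agreement a = 0.8, the loss as a function of the
# predicted probability p has its minimum at p = a:
grid = [i / 100 for i in range(1, 100)]
best_p = min(grid, key=lambda p: noisy_ce([1], [p], [0.8]))
print(best_p)  # 0.8
```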


Refined Cross Entropy Loss

The ideal output of the model should be close to the ground-truth label, so the loss function is modified further to improve performance. The noise arises because annotators who did not agree with the final label are also taken into consideration. The number of experts who provided the correct label can also be an important signal to the model: if a label has been assigned by a larger number of experts, the model should be penalized more when the sentence is misclassified. The new loss over a batch of M sentences, keeping only the agreeing annotations and weighting each sentence by the number of correct votes $n_i a_i$, can thus be written as:

$$\mathcal{L}_{refined} = -\frac{1}{M}\sum_{i=1}^{M} n_i a_i \log p_i(y_i) \qquad (5)$$

The number of causal and non-causal sentences is almost the same and there is no significant class imbalance, so the authors have not applied a weight penalty to the class with the larger number of samples.
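The refined loss can be sketched the same way; this assumes the weighted formulation above (cross-entropy on the gold label only, scaled by the number of correct votes $n \cdot a$) and is an illustration rather than the authors' code:

```python
import math

def refined_ce(y_true, y_pred, num_votes, agreement):
    """Refined cross-entropy: cross-entropy on the gold label only,
    weighted by the number of experts who voted for it (n * a)."""
    M = len(y_true)
    total = 0.0
    for y, p, n, a in zip(y_true, y_pred, num_votes, agreement):
        p_correct = p if y == 1 else 1 - p          # prob. of the gold label
        total += -(n * a) * math.log(p_correct)
    return total / M

# A sentence labelled unanimously by 5 experts is penalized more heavily for
# the same misclassification than one labelled by 3 experts at 0.67 agreement:
loss_many = refined_ce([1], [0.4], [5], [1.0])   # 5 correct votes
loss_few = refined_ce([1], [0.4], [3], [0.67])   # ~2 correct votes
assert loss_many > loss_few
```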

Figure 2: Loss for noisy cross-entropy

4.3 Experimental Details

The experiments have been performed in PyTorch pytorch, and the authors used the HuggingFace Wolf2019HuggingFacesTS library to build the pipeline for the different experiments. Each model has been trained for 10 epochs with a fixed learning rate and a seed of 42 for reproducibility. The various models have been trained with the same set of hyperparameters. The code is made publicly available on GitHub.


Model name                        | Cross Entropy | Noisy Cross Entropy | Refined Cross Entropy
bert-base-cased pretrain_bert     | 0.8251*       | 0.8225              | 0.8235
bert-base-uncased pretrain_bert   | 0.8283        | 0.8313              | 0.8501*
bert-large-cased pretrain_bert    | 0.7105        | 0.7549*             | 0.7105
xlnet-base-cased Yang2019XLNetGA  | 0.7953        | 0.8216*             | 0.8199
roberta-base Liu2019RoBERTaAR     | 0.8279        | 0.8279              | 0.8280*
Table 1: Evaluation of models with different loss functions (F1 scores on the validation set). The best F1 score of each model is marked with an asterisk.

5 Results and Discussion

In this section, the results of the different models and the different losses are discussed.

Table 1 shows the evaluation of the different models on the validation set. The performance of four of the five models, all except bert-base-cased, is enhanced by the modified cross-entropy losses: their F1 scores increase significantly when the vanilla cross-entropy loss is replaced with the noisy or refined cross-entropy loss. Specifically, the model fine-tuned from bert-base-uncased with the refined cross-entropy loss yields the best performance of all experimented models, with an F1 score of 0.8501. On the other hand, bert-base-cased is the only pretrained model that does not benefit from the customized cross-entropy losses; with the vanilla cross-entropy loss it reaches its best F1 score of 0.8251.

The models with the noisy and refined cross-entropy losses utilize the annotation information and thus perform better. The noisy cross-entropy loss is similar to restricting the highest probability output that a model can predict. However, in almost all cases the degree of agreement was either 1 or close to it, so the smoothed label generally has a value in the range of 0.9 to 1. Contradicting annotations might make it difficult for the model to learn and to yield an accurate prediction for each sentence. The refined cross-entropy considers only the labels that do not contradict each other, and thus performs best.

Moreover, the experiments show that roberta-base models achieve lower performance than BERT-based models, especially bert-base-uncased. The model pretrained on bert-large-cased has been fine-tuned for only one epoch due to computational limitations, and its F1 scores are worse than those of the bert-base-cased and bert-base-uncased models. bert-base models also outperform the models fine-tuned on roberta-base. The reason could be that RoBERTa-based models were not trained on next sentence prediction (NSP), while BERT-based models were, and causality detection can benefit from NSP: a causal sentence can be seen as two related clauses joined by a causal link, so knowing whether the clauses are related benefits the task.

(a) Vanilla CE
(b) Noisy CE
(c) Refined CE
Figure 3: Confusion matrix for the different losses

Figure 3 shows the confusion matrices of the bert-base-uncased models, which achieved the best F1 scores among all implemented models. The models are generally good at predicting both classes regardless of the loss function used: true negatives and true positives are always higher than false negatives and false positives. There is also a clear trend in the number of true positives when the loss function is shifted from vanilla to noisy and refined cross-entropy: the model yields 145 true positives with vanilla cross-entropy, which improves to 152 and 149 true positives with the noisy and refined cross-entropy losses, respectively.

6 Conclusion

This paper presents our work on detecting causal relationships in a news corpus by fine-tuning Transformer-based models with multiple loss functions. The experiments showed that incorporating annotation information through customized loss functions significantly improved performance in four out of five experimented models. The experiments also show that BERT outperformed RoBERTa, which can be attributed to the fact that RoBERTa is not trained on NSP. Last but not least, bert-base-uncased with the refined cross-entropy loss, which takes account of the annotation information presented in the dataset, obtained the best performance amongst all 15 model and loss configurations, with an F1 score of 0.8501 on the validation set and 84.930 on the test set.

The authors plan to exploit the uncertainty in the annotators' information and to parameterize the loss function to further enhance the model's performance.