Gradient-Based Adversarial Training on Transformer Networks for Detecting Check-Worthy Factual Claims

02/18/2020 · Kevin Meng, et al. · The University of Texas at Arlington

We present a study on the efficacy of adversarial training on transformer neural network models, with respect to the task of detecting check-worthy claims. In this work, we introduce the first adversarially-regularized, transformer-based claim spotter model that achieves state-of-the-art results on multiple challenging benchmarks. We obtain a 4.31 point F1-score improvement and a 1.09 point mAP score improvement over current state-of-the-art models on the ClaimBuster Dataset and CLEF2019 Dataset, respectively. In the process, we propose a method to apply adversarial training to transformer models, which has the potential to be generalized to many similar text classification tasks. Along with our results, we are releasing our codebase and manually labeled datasets. We also showcase our models' real world usage via a live public API.




1. Introduction

The creation and propagation of misinformation has become an increasingly important issue for our society to tackle. Today, many falsehoods are spread via mediums that allow quick dissemination of information, including social media, news outlets, and televised programs. The distribution of objectively incorrect information can negatively impact the operation of our society in many spheres. Especially in the realm of political discourse, misinformation can shake public confidence in government institutions, erroneously inform political judgements (Allcott and Gentzkow, 2017), and reinforce confidence in wrong information (Chan et al., 2017).

In recent years, the number of fact-checking outlets has grown from 44 in 2014 to 226 in 2019 as part of a global movement to suppress misinformation. These outlets, including PolitiFact and Snopes, hire human fact-checkers to perform claim-checking, a process in which they vet factual claims by reviewing relevant source documents and interviewing subject experts. In addition to outlets that directly fact-check claims, there exist many projects that use computing to aid fact-checkers in disseminating information to the general public, including the ClaimReview project, which organizes fact-checks into a unified database; FactStream, which compiles fact-checks into a smartphone application; and Fatima, a bot built by Aos Fatos, a Brazilian fact-checking organization, that scans Twitter for tweets containing already-debunked misinformation and refers readers to relevant fact-checks. These organizations and projects play a central role in fighting misinformation, as fact-checks are effective not only for debunking false claims but also for deterring speakers from making false claims in the future (Nyhan and Reifler, 2015).

However, due to the intense time commitment demanded by fact-checking, combined with the rapid rate at which new content surfaces via modern media channels, many problematic claims go unnoticed and unchecked (Pennycook and Rand, 2019). These challenges present an opportunity for automated fact-checking tools to help fact-checkers perform their duties. There are several prominent fact-checking projects that are currently testing automated systems, including FactChecker, which queries knowledge bases, cross-references known fact-checks, and provides custom ranked Google search results; ClaimPortal, which uses ClaimBuster (Hassan et al., 2017; Jimenez and Li, 2018; Hassan et al., 2017, 2015) to select tweets that are worth fact-checking, as well as various algorithms to retrieve relevant articles and pre-existing fact-checks (Majithia et al., 2019); Squash, which fact-checks live debates by converting speech to text and querying a database of pre-existing fact-checks; Fakta, which checks claims against reliable web sources (Nadeem et al., 2019); and FullFact, which is developing systems to cluster groups of similar claims together.

Claim-spotting is a process that precedes claim-checking, in which check-worthy claims are spotted from the large streams of information available from various sources (e.g., newscasts, news websites, Twitter, Facebook). Claim-spotting is an area that is highly suitable for machine learning algorithms to tackle. The work presented here focuses on the claim-spotting component of ClaimBuster, which scores claims based on their check-worthiness. This is paramount to ensuring that 1) check-worthy factual claims are not missed by fact-checkers and 2) unimportant or non-factual claims do not congest fact-checkers’ intellectual bandwidth. To this day, ClaimBuster’s API is regularly in use not only by internal projects such as ClaimPortal but also by external collaborators such as the Duke Reporters’ Lab. The closest projects to ClaimBuster in this space are QCRI’s ClaimRank project and a component of FullFact’s proposed automated fact-checking system, which they named Hawk in their whitepaper. QCRI’s ClaimRank is very similar to ClaimBuster in that it ranks claims by assigning each a check-worthiness score. As for Hawk and FullFact’s system, not many details have been released.

Currently, no existing claim-spotter (Hassan et al., 2017, 2015; Jimenez and Li, 2018; Hansen et al., 2019; Favano et al., 2019) has attempted to apply transformers (Vaswani et al., 2017) to the claim-spotting task. The transformer is a deep learning architecture that has recently enabled rapid progress in the natural language processing field. In particular, Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019) has achieved state-of-the-art performance on many challenging language understanding and classification benchmarks. We surmise that BERT’s architecture is suited to our claim-spotting task. However, BERT models have upwards of 300 million trainable parameters, making them highly susceptible to overfitting (Caruana et al., 2001), especially on limited amounts of training data. To address this, we propose to incorporate adversarial training into a BERT-based model as a regularization technique. Gradient-based adversarial training (Goodfellow et al., 2014; Miyato et al., 2016, 2018) is a procedure that trains classifiers to be resistant to small, approximately worst-case perturbations to their inputs. It was first applied to computer vision tasks in (Goodfellow et al., 2014) and later brought to the NLP domain by Miyato et al. (2016, 2018), who applied it to Long Short-Term Memory networks (Hochreiter and Schmidhuber, 1997). No prior work has attempted to incorporate this type of adversarial training into transformer networks. We are the first to propose this technique, which is also potentially applicable to many other NLP-related tasks.

Motivated by the above, we introduce the first adversarially-regularized, transformer-based claim-spotting model that achieves state-of-the-art results on challenging claim-spotting benchmarks. Our contributions are summarized as follows:

  • We are the first to apply gradient-based adversarial training to transformer networks.

  • We present the first transformer-based neural network architecture for claim-spotting.

  • Our models are the first claim-spotters to be regularized by gradient-based adversarial training.

  • Our models achieve state-of-the-art performance by a substantial margin on challenging claim-spotting benchmarks.

  • We release a public codebase, dataset, and API for both reproducibility and further development (Section 7).

2. ClaimBuster Overview

In this section, we present a brief history and overview of the ClaimBuster project. We cover its inception and impact in the community, as well as the current status of our fact-checking framework.

2.1. ClaimBuster’s History and Current Status

ClaimBuster’s foundation was first established in (Hassan et al., 2015), where Hassan et al. presented results on different machine learning models trained on an early version of the dataset we are currently using. This work later evolved into what is now known as ClaimBuster and was presented in (Hassan et al., 2017, 2017). Since then, ClaimBuster has partnered with the Duke Reporters’ Lab (DRL) and collaborated with them through ClaimBuster’s API. During this time, ClaimBuster’s API has been called extensively both by internal projects and by the DRL. ClaimPortal is the internal project that has made the most use of the ClaimBuster API. It scores tweets and provides relevant fact-checks for tweets using the claim-matching component seen in Figure 1. Through this project we have seen that we can successfully apply ClaimBuster to different domains, such as Twitter. As for the DRL, they generate and send out a daily e-mail to fact-checkers with the latest top claims that ClaimBuster identified from television and social media. Through our collaboration with the DRL, we have been able to influence which claims are fact-checked by major news outlets. The accessibility of our work has allowed it to have a widespread impact in the fact-checking community.

Since the development of the original SVM model, we have been exploring deep learning approaches (Jimenez and Li, 2018) to improve our claim-spotting model. Recently, this culminated in our adoption of the BERT architecture, owing to the proven track record of BERT and its derivative models on NLP tasks such as SQuAD and GLUE. Since then, we have also re-evaluated the classes used within our datasets and how our dataset is generated, and we have refined our overall process for evaluating models. This change in our approach to the dataset came after a lengthy evaluation of our extraction criteria (i.e., what we consider high-quality labels) and of the ratio of check-worthy to non-check-worthy sentences in the dataset. Through these evaluations, we are confident we have obtained a better-quality dataset than those used in previous works. The work presented here will also begin a thorough test period with our collaborators at the DRL.

2.2. Fact-Checking Framework

Figure 1 showcases the current status of our fact-checking framework. We monitor claims from various sources (e.g., Twitter, political debates, etc.), and we are even able to process live television closed-caption feeds for important events such as presidential debates. ClaimSpotter then handles scoring all of the claims that our claim monitoring system captures. ClaimSpotter is accessible to the public via an API, which requires only a free API key to access. We are deploying the deep learning models for the public and other researchers to test and verify the models presented in this paper. Each deep-learning model runs on a dedicated Nvidia GTX 1080Ti. All resources are running on the same network, so no significant overhead is added by server-to-server communication.

In addition, we also have a repository of fact-checked claims, which we use in conjunction with ElasticSearch in our claim-matcher component to verify the veracity of any claims that have been previously fact-checked by professional fact-checkers. If no previous fact-checks are found, then we can send these claims to our fact-checking component, which is still being developed. Currently, our approach is to convert claims to questions (Heilman, 2011) in order to query knowledge bases (e.g., Wolfram, Google, etc.) using natural language to see if they can generate a clear verdict. This approach is useful for general-knowledge claims, but nuanced claims requiring domain-specific knowledge remain very challenging to handle. We also provide re-ranked Google search results, which are sorted based on the content of the pages the initial search query returns; the analysis is based on the Jaccard similarity of the context surrounding the text in each page that matched the initial query. Finally, we regularly publish presidential debate check-worthiness scores during election cycles on our website, and we also post high-scoring claims on our project’s Twitter account.

Figure 2. Our Custom Adversarially Perturbed Claim Spotting Architecture

3. BERT Claim Spotting Model

In this section, we present our approach to integrating adversarial training into the BERT architecture for the claim spotting task. To the best of our knowledge, our work is the first to apply gradient-sign adversarial training (Goodfellow et al., 2014) to transformer networks.

3.1. Preliminaries

3.1.1. Task Definition

Detecting check-worthy factual claims has been studied as a binary/ternary classification task and a ranking task, as explained below. In this paper, we evaluate the performance of our models on the binary and ranking task definitions.

Binary Classification Task: In this work, a sentence is classified as one of two classes, which deviates from the previous three-class definition used in (Hassan et al., 2017; Jimenez and Li, 2018; Hassan et al., 2017, 2015).

  • Non-Check-Worthy Sentence (NCS): This class includes sentences that contain subjective or opinion-centered statements, questions, and trivial factual claims (e.g., The sky is blue).

  • Check-Worthy Factual Sentence (CFS): This class contains claims that are both factual and salient to the general public. They may touch on statistics or historical information, among other things.

Ranking Task: To capture the importance of prioritizing the most check-worthy claims, a check-worthiness score (Hassan et al., 2017) is defined for each sentence s:

score(s) = P(y = CFS | s)     (1)

score(s) defines a classification model’s predicted probability that a given claim is in the CFS class.
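As a concrete illustration, this ranking score can be read directly off the softmax output of any two-class classifier. Below is a minimal numpy sketch, assuming the convention (ours, for illustration only) that index 1 of the output vector corresponds to the CFS class:

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())          # shift by the max for numerical stability
    return e / e.sum()

def check_worthiness_score(logits):
    # score(s) = the model's predicted probability that sentence s is CFS;
    # treating index 1 as CFS is an assumed convention for this sketch
    return float(softmax(logits)[1])
```

Sentences can then be ranked by sorting on this score in descending order.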

3.1.2. BERT Language Model

Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019) is a transformer-based language modeling architecture that has recently achieved state-of-the-art performance on many language modeling and text classification tasks, including the Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016) and General Language Understanding Evaluation (GLUE) (Wang et al., 2018). We review BERT’s relevant features below.

Input/Output Representations: Consider an arbitrary training sentence x with ground-truth label y. x is first tokenized using the WordPiece Tokenizer (Wu et al., 2016). Next, a [CLS] token is prepended to x to indicate the start of the sentence, a [SEP] token is appended to x to indicate the end of the sentence, and x is padded to a fixed maximum length using whitespace. Each resulting token is then converted into its corresponding index in the WordPiece vocabulary list. This input vector, denoted x_tok, is passed to the embedding layers.

Three-Part Embeddings: x_tok is first transformed from a sparse bag-of-words form to a dense vector representation (Mikolov et al., 2013) through an embedding lookup matrix W_tok ∈ R^(V×d), where V is the size of the WordPiece vocabulary list and d is the embedding dimensionality. The series of operations that W_tok applies to x_tok is called the token embedding layer, and its output is denoted e_tok. Additionally, BERT utilizes a segment embedding layer that signifies which parts of the input contain the input sentence, as the end of x may be padded with empty space. The output of this layer is denoted e_seg. Finally, since vanilla transformers analyze all tokens in parallel and therefore cannot account for the sequential ordering of words, BERT introduces a randomly-initialized real-valued signal via the positional embedding layer to encode the relative order of words. The output of this layer is denoted e_pos. The final input, denoted e, is the element-wise addition of the three separate embedding layers’ outputs: e = e_tok + e_seg + e_pos. We denote the vector representation of the i-th token in e as e_i.
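The three-part embedding sum can be sketched in a few lines of numpy. All sizes and weight matrices below are toy stand-ins, not BERT's actual parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
V, L, d = 100, 8, 16                  # toy vocab size, padded length, embedding dim
W_tok = rng.normal(size=(V, d))       # token embedding lookup matrix
W_seg = rng.normal(size=(2, d))       # segment embeddings (real tokens vs. padding)
W_pos = rng.normal(size=(L, d))       # positional embeddings, one per position

def embed(token_ids, segment_ids):
    e_tok = W_tok[token_ids]                   # token embedding lookup
    e_seg = W_seg[segment_ids]                 # segment embedding lookup
    e_pos = W_pos[np.arange(len(token_ids))]   # position-order signal
    return e_tok + e_seg + e_pos               # element-wise sum of all three
```

The element-wise sum keeps the sequence length and dimensionality unchanged, which is what later allows a single perturbation vector to be added at the same point in the pipeline.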

Transformer Encoder: Using multiple stacked layers of attention heads (Vaswani et al., 2017), the BERT module encodes each input embedding e_i into a hidden vector h_i, a hidden representation that incorporates context from surrounding words bidirectionally, as opposed to the unidirectional encoders used in OpenAI GPT (Radford et al., 2018, 2019).

Pooling Layer: The pooling layer generates a representation for the entire sentence by applying a dense layer on top of the [CLS] token’s hidden representation, resulting in the pooled vector p. This sentence-level encoding vector can be used to perform many downstream tasks, including claim-spotting.

3.2. Model Architecture

In this section, we outline how BERT is integrated with adversarial perturbations to create a claim-spotting model. The resultant model is end-to-end differentiable and trainable by gradient descent (Kingma and Ba, 2015). We refer the reader to Figure 2 for illustrations on each of the architectural components.

3.2.1. Embedding Process

All three embeddings from the BERT architecture are carried over: token (e_tok), segment (e_seg), and positional (e_pos). Each embedding layer still performs its original function, transforming a given input into the combined embedding representation e. The key difference in our architecture is the insertion of an addition gate through which adversarial perturbations are injected into e to create the perturbed embedding e_adv.

3.2.2. BERT Transformer

Our work harnesses the power of the BERT architecture, which supports transfer learning (Tan et al., 2018; Radford et al., 2018; Peters et al., 2018; Howard and Ruder, 2018), a process in which weights are loaded from a BERT language model that was pre-trained on billions of English tokens. Denote the number of transformer encoder layers as L, the hidden size as H, and the number of self-attention heads as A. The version of BERT used is BERT-Base (L = 12, H = 768, A = 12), which has approximately 110 million parameters. Pretrained model weights for BERT can be found in Google Research’s BERT repository.
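The 110-million figure can be sanity-checked with a rough parameter count. The arithmetic below uses BERT-Base's published sizes (30,522-token WordPiece vocabulary, 512 maximum positions, 3,072-unit feed-forward layers) and ignores a few small bias and layer-norm terms, so it is an approximation rather than an official count:

```python
# BERT-Base sizes: vocab, max positions, segments, hidden, layers, FFN width
V, P, S, H, L, I = 30522, 512, 2, 768, 12, 3072

embeddings = V * H + P * H + S * H        # token + positional + segment tables
attention  = 4 * (H * H + H)              # Q, K, V, and output projections (+ biases)
ffn        = (H * I + I) + (I * H + H)    # two dense layers of the feed-forward block
per_layer  = attention + ffn
encoder    = L * per_layer
pooler     = H * H + H
total      = embeddings + encoder + pooler
print(total)   # on the order of 1.09e8, consistent with "approximately 110 million"
```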

3.2.3. Fully-Connected Dense Layer

The dense layer is tasked with considering the information passed by BERT’s hidden outputs and determining a classification. To accomplish this, it is implemented as a fully-connected neural network layer that accepts the pooled input p and returns un-normalized activations z ∈ R^2, which are passed through the softmax normalization function to produce the final output vector ŷ:

ŷ = softmax(z)     (2)

where each output activation in ŷ represents a classification class. ŷ will later be used to compute the check-worthiness score (Equation 1) and the predicted classification label as argmax ŷ.
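A minimal numpy rendition of this classification head, with toy weights standing in for the trained parameters:

```python
import numpy as np

def classify(p, W, b):
    """Dense layer + softmax. p is the pooled sentence vector;
    W and b are the head's (toy, illustrative) weights and biases."""
    z = p @ W + b                        # un-normalized activations, one per class
    z = z - z.max()                      # stabilize the exponentials
    y_hat = np.exp(z) / np.exp(z).sum()  # softmax normalization
    return y_hat, int(np.argmax(y_hat))  # distribution and predicted label
```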

3.3. Standard Optimization Objective Function

In neural networks, an objective function, also known as the cost or loss function, is a differentiable expression that serves two purposes: 1) quantify the disparity between the predicted and ground-truth probability distributions and 2) provide a function for gradient descent to minimize. Negative log-likelihood is a common choice for the cost function because, among other advantageous properties, it has an easily computable derivative for optimization via backpropagation (Janocha and Czarnecki, 2017). Our standard negative log-likelihood loss function is formulated in terms of the probability that the model predicts ground-truth y_i given embedded input e_i, parameterized by the model’s weights θ:

L_std(θ) = −(1/N) Σ_{i=1}^{N} log p(y_i | e_i; θ)     (3)

where N is the total number of training examples in a dataset. L_std is used to compute adversarial perturbations in Section 3.4.
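The batch form of Equation 3 can be sketched in numpy over already-normalized model outputs:

```python
import numpy as np

def nll_loss(probs, labels):
    """Mean negative log-likelihood of the ground-truth classes.
    probs: (N, 2) rows of predicted class probabilities; labels: (N,) class indices."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    picked = probs[np.arange(len(labels)), labels]   # p(y_i | e_i; theta) per example
    return float(-np.log(picked).mean())
```

A confident correct prediction drives the loss toward zero, while an uncertain one pays −log of the probability it assigned to the true class.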

3.4. Computing Adversarial Perturbations

Gradient-based adversarial training is a regularization technique first introduced in (Goodfellow et al., 2014). The procedure trains classifiers to be resistant to small perturbations to their inputs. Rather than passing a regular embedded input e into a processing module such as a transformer or LSTM, adversarial training passes the perturbed input e + r_adv. r_adv is typically a norm-constrained vector that modifies the input slightly, so as to force the classifier to output incorrect predictions. Then, the disparity between the ground truth y and the prediction based on the perturbed input is minimized through backpropagation, hence training the model to be resistant to these adversarial perturbations. We are particularly interested in adversarial training’s potential as a regularization technique (Shafahi et al., 2019; Dalvi et al., 2004; Nguyen et al., 2015; Shaham et al., 2018; Goodfellow et al., 2014; Miyato et al., 2016), as BERT networks are prone to overfitting when fine-tuned on small datasets (Sun et al., 2019). To the best of our knowledge, we contribute the first implementation of this technique on transformer networks.

We denote θ as the parameterization of our neural network and r as a vector perturbation that is added element-wise to e before being passed to the transformer encoder. r can be computed in several ways. First, random noise may be added to disrupt the classifier, typically formalized as sampling r from a Gaussian distribution: r ∼ N(0, σ²I). Alternatively, we can compute perturbations that are adversarial, meaning that they increase the model’s negative-log-likelihood error (Equation 3) by the theoretical maximum margin. This achieves the desired effect of generating a perturbation in the direction in which the model is most sensitive. In this case, r_adv is given as:

r_adv = argmax_{r : ‖r‖ ≤ ε} L_std(θ; e + r)     (4)

where ε is a constraint on the perturbation that limits its magnitude.

In (Miyato et al., 2016), it was shown that random noise is a far weaker regularizer than adversarially-computed perturbations. Therefore, we adopt adversarial perturbations for our model (Equation 4) and propose to apply them on the embeddings of the BERT model.

Equation 4 gives the absolute worst-case adversarial perturbation under the constraint that ‖r‖ ≤ ε. However, this value is impossible to compute with a closed-form analytic solution in neural networks, because functions such as Equation 3 are neither convex nor concave in topology. Therefore, we propose a novel technique for generating approximately worst-case perturbations to the model.

Because BERT embeddings are composed of multiple components (Section 3.1.2), it may not be optimal from a regularization standpoint to compute perturbations w.r.t. e alone. Therefore, to determine the optimal perturbation setting, we propose to experiment with computing r_adv w.r.t. all possible combinations of the 3 embedding components. Letting E = {e_tok, e_seg, e_pos} denote the set of embedding layers, there are 7 different possible configurations in the set of perturbable combinations C:

C = { c : c ⊆ E, c ≠ ∅ }     (5)
Given this list of components that can be perturbed, we denote the sum of the subset of embeddings we will perturb as e_p, where p ∈ C. We then generate approximate worst-case perturbations by linearizing L_std with respect to e_p. To understand what this means, consider the simplified example shown in Figure 3, which graphs an example cost function L with respect to an example embedding space e. For ease of visualization, Figure 3 assumes that e exists in a scalar embedding space; in reality, our embeddings live in a high-dimensional vector space. The gradient at the point e gives us information regarding the direction in which e should be moved to increase the value of L:

r = ε · sign(dL/de)     (6)

However, we must be careful in determining how much e should be perturbed, because the assumption that L is locally linear may not hold in reality. If the perturbation is too large, the adversarial effect will not be achieved, as the value of L may in fact decrease. However, if we introduce a norm constraint ε to limit the perturbations to a reasonable size, linearization can accomplish the task of approximating a worst-case perturbation.

Figure 3. Visualization of Linearization

Given the above insight, we generalize the one-dimensional example (Equation 6) to higher dimensions using the gradient vector g = ∇_{e_p} L_std. The adversarial perturbation is computed with Equation 7, which can be implemented using backpropagation in deep neural networks:

r_adv = ε · g / ‖g‖₂     (7)
Since we desire to train our language classification model to become resistant to the perturbations defined in Equation 7, we create adversarially-perturbed input embeddings as follows:

e_adv = e + r_adv     (8)

After e_adv is passed into the transformer module, perturbed predictions will be generated. These predictions will be used in formulating the adversarial optimization objective function (Section 3.5).

3.5. Compound Optimization Objective

Our model’s final optimization objective contains two components: standard loss and adversarial loss. Standard loss L_std was defined in Equation 3. Adversarial loss optimizes for distributional smoothness and is given by the negative log-likelihood of a model parameterized by θ predicting y_i given the perturbed input e_i + r_adv:

L_adv(θ) = −(1/N) Σ_{i=1}^{N} log p(y_i | e_i + r_adv; θ)     (9)

where N represents the number of training samples in the dataset.

The final optimization objective is given as the sum of L_std and L_adv. By combining the two losses, gradient descent will optimize for both distributional smoothness and model accuracy jointly:

L(θ) = L_std(θ) + λ · L_adv(θ)     (10)

where λ is a balancing factor between standard and adversarial loss.

3.6. Adversarial Training Algorithm

Let θ_tok be the parameters of the token embedding lookup table, θ_seg the parameters of the segment embedding layer, θ_pos the parameters of the positional embedding layer, θ_enc the parameters of the transformer encoder layers, θ_pool the parameterization of the pooling layer, and θ_dense the weights and biases of the fully-connected dense layer. We also define N_f as the number of encoder layers to freeze (i.e., render the weights uneditable during backpropagation to preserve knowledge obtained from pre-trained weights), where 0 ≤ N_f ≤ L.

Input: Training data D
Initialize θ_dense using the Xavier method (Glorot and Bengio, 2010);
Load pretrained weights for θ_tok, θ_seg, θ_pos, θ_enc, and θ_pool;
Set θ_tok, θ_seg, θ_pos, and the first N_f encoder layers of θ_enc to untrainable;
Build the claim-spotting model (Figure 2);
while not converged do
       Sample x, y from data D;
       Tokenize and process x into x_tok;
       Pass x_tok through the embeddings to produce e;
       // Standard forward-propagation
       Compute L_std using e (Equation 3);
       // Generate and apply perturbations
       Compute r_adv using L_std (Equation 7);
       Compute perturbed input e_adv = e + r_adv (Equation 8);
       // Adversarial forward-propagation
       Compute L_adv using e_adv (Equation 9);
       // Adversarial training
       Optimize L_std + λ·L_adv (Equation 10);
end while
Algorithm 1: Adversarial Training Loop

The adversarial training procedure is shown in Algorithm 1. First, the model is used to compute the standard optimization function L_std. Then, L_std is used to compute the adversarial perturbation r_adv (Equation 7), which in turn is used to compute the adversarial optimization objective L_adv (Equation 9). This objective is added to the standard objective (Equation 10), and the sum is minimized using gradient descent.
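End to end, Algorithm 1 can be sketched on a toy stand-in model: a linear classifier over fixed embeddings takes the place of BERT, and all names, sizes, and hyperparameter values below are illustrative. Each loop iteration performs the standard pass, builds the Equation 7 perturbation, and descends the compound objective:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_adversarial(E, y, eps=0.1, lam=1.0, lr=0.3, steps=500, seed=0):
    """E: (N, d) fixed 'embeddings'; y: (N,) binary labels (1 = CFS)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.01, size=E.shape[1])
    for _ in range(steps):
        # standard forward pass (Eq. 3)
        p = sigmoid(E @ w)
        # per-example input gradient, then the Eq. 7 perturbation
        g = (p - y)[:, None] * w[None, :]
        norms = np.linalg.norm(g, axis=1, keepdims=True) + 1e-12
        E_adv = E + eps * g / norms                       # Eq. 8
        # adversarial forward pass (Eq. 9); r_adv is treated as a constant
        # w.r.t. w during the update, as is standard in adversarial training
        p_adv = sigmoid(E_adv @ w)
        # joint gradient step on Eq. 10
        grad_w = ((p - y) @ E + lam * (p_adv - y) @ E_adv) / len(y)
        w = w - lr * grad_w
    return w

# toy separable data: the positive class has a positive first coordinate
E = np.array([[2.0, 0.3], [1.5, -0.5], [-2.0, 0.4], [-1.2, -0.8]])
y = np.array([1.0, 1.0, 0.0, 0.0])
w = train_adversarial(E, y)
```

In the real model, the analytic input gradient is replaced by one backpropagation pass through BERT, and the update touches all unfrozen parameters rather than a single weight vector.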

4. Results and Discussions

Claim                                                     Label   CB-SVM CWS   CB-BBA CWS
The U.S. loses millions of lives each year to homicide.   CFS     0.6000       0.9999
I really think you’re overthinking the situation.         NCS     0.2178
Table 1. Sample Sentences, Labels, and Check-Worthiness Scores

We evaluate our new transformer-based claim-spotting models on both the Classification and Ranking Tasks (Section 3.1.1). We compare against re-trained and refined versions of past ClaimBuster models (Hassan et al., 2017; Jimenez and Li, 2018) and the top-two performing systems from the 2019 CLEF-CheckThat! Challenge. Table 1 shows several example sentences, their ground-truth labels, and our models’ scores.

4.1. Experiment Setup

4.1.1. Datasets

We use two claim-spotting datasets to evaluate model performance.

ClaimBuster Dataset (CBD): The ClaimBuster dataset is our own in-house, manually labeled dataset. A different version of the CBD was used by (Hassan et al., 2017, 2017; Jimenez and Li, 2018). The current CBD consists of the two classes described in Section 3.1.1: NCS and CFS. The switch to this scheme was motivated by our observation that the non-check-worthy factual sentence class in previous versions of the CBD was of little use and possibly negatively impacted models trained with it. The CBD consists of 9,674 sentences (6,910 NCS and 2,764 CFS), manually labeled from all U.S. presidential debates from 1960 to 2016. For validation, we perform 4-fold cross-validation on this same dataset. We describe the details of dataset collection in Section 7.3.2. This dataset is publicly available, as noted in Section 7.3.

CLEF2019-CheckThat! Dataset (C): We also evaluate our model on the 2019 CLEF-CheckThat! claim-spotting dataset. CLEF-CheckThat! is an annual competition that assesses the state of the art in automated computational fact-checking by providing datasets for claim-spotting. The C dataset comprises political debate and interview transcripts. Sentences are labelled as check-worthy only if they were fact-checked by a single fact-checking organization. Note that this labelling strategy introduces significant bias into the dataset, as many problematic claims go unchecked due to the limited resources of fact-checkers from a single organization (Section 1). The training set contains 15,981 non-check-worthy and 440 check-worthy sentences, and the testing set contains 6,943 non-check-worthy and 136 check-worthy sentences. The C dataset also includes speaker information for each sentence, which we did not use in training our models, for two reasons: (1) it may introduce unwanted bias based on the name of the speaker, and (2) it makes the claim-spotting model inapplicable to real-time events, since live transcripts typically lack speaker information.

4.2. Evaluated Models

CB-BBA: This model is trained using our novel claim-spotting framework detailed in Section 3.2. It is trained adversarially using the compound optimization objective defined in Equation 10.

CB-BB: This model is architecturally identical to CB-BBA but is trained using the standard optimization objective (Equation 3). In implementation, the balancing factor λ is simply set to 0, which removes the adversarial term from Equation 10. This model serves as a point of comparison for the adversarial model.

CB-BiL (Jimenez and Li, 2018): This model is a reimplementation of (Jimenez and Li, 2018) in TensorFlow 2.1. It uses normalized GloVe word embeddings and consists of a bi-directional LSTM layer, which allows it to capture forward and reverse sequence relationships. The model’s binary cross-entropy loss function is optimized using RMSProp.

CB-SVM (Hassan et al., 2015, 2017): The SVM classifier uses a linear kernel. The feature vector used to represent each sentence is composed of a tf-idf-weighted bag-of-unigrams vector, a part-of-speech vector, and the sentence length (i.e., number of words). The core SVM model is produced using scikit-learn’s LinearSVC class with the maximum number of iterations set to an arbitrarily high number to ensure model convergence.
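To make the feature construction concrete, here is a dependency-free sketch of the tf-idf bag-of-unigrams plus sentence-length portion of the feature vector. The part-of-speech vector is omitted, and this is an illustrative tf-idf variant, not scikit-learn's exact smoothed formula:

```python
import math
from collections import Counter

def build_features(sentences):
    """Return (feature rows, vocabulary) for a list of sentences."""
    docs = [s.lower().split() for s in sentences]
    vocab = sorted({w for d in docs for w in d})
    df = Counter(w for d in docs for w in set(d))   # document frequency per unigram
    n = len(docs)
    feats = []
    for d in docs:
        tf = Counter(d)
        vec = [tf[w] * math.log(n / df[w]) for w in vocab]  # tf-idf weights
        vec.append(float(len(d)))                            # sentence-length feature
        feats.append(vec)
    return feats, vocab
```

These rows are the kind of dense vectors that would then be fed to a linear-kernel classifier such as LinearSVC.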

4.2.1. 2019 CLEF-CheckThat! Models

Neither of the top two teams in CLEF2019 released their code; therefore, we are only able to retrieve their results on CLEF2019.

Copenhagen (Hansen et al., 2019): Team Copenhagen’s model, the top performer on C, consisted of an LSTM (Hochreiter and Schmidhuber, 1997) over token embeddings fused with syntactic-dependency embeddings. To train their model, Copenhagen did not use the C dataset, instead using an external dataset of Clinton/Trump debates that was weakly labeled using our ClaimBuster API.

TheEarthIsFlat (Favano et al., 2019): TheEarthIsFlat, the second-place performer, used a feed-forward neural network trained on the C dataset. They encoded sentences using the Universal Sentence Encoder (Cer et al., 2018).

4.3. Embedding Perturbation Study Results

Setting   P (NCS)   P (CFS)   R (NCS)   R (CFS)   F1 (NCS)   F1 (CFS)
0         0.9275    0.8270    0.9315    0.8180    0.9295     0.8225
1         0.8475    0.7725    0.9315    0.5810    0.8876     0.6632
2         0.8931    0.7960    0.9259    0.7229    0.9092     0.7577
3         0.9266    0.8245    0.9305    0.8158    0.9286     0.8201
4         0.9303    0.8310    0.9329    0.8253    0.9316     0.8281
5         0.9379    0.8149    0.9230    0.8473    0.9304     0.8308
6         0.9215    0.8473    0.9424    0.7992    0.9318     0.8226
Table 2. Perturbation Combinations Study Results,
Averaged Across Stratified 4-Fold Cross-Validation.
Each setting (0–6) perturbs one of the seven combinations of the pos, seg, and tok embedding layers.

In Table 2, we see the results of perturbing the 3 different embedding layers in BERT. From the results, we conclude that the setting that perturbs only the seg embedding layer produces the best models for our task. In particular, this setting produces the best recall for the CFS class, which is arguably the more important class, and its sacrifice in recall on the NCS class, compared to other settings, is small. While another setting also achieves the best performance in three spots, its drop in recall on the CFS class is not justified by its paltry improvements in other areas. Thus, from here on, any results dealing with adversarial training will employ the setting that perturbs only the seg embedding layer.

4.4. Classification Task, Ranking, and Distribution Results

Model    P (NCS)  P (CFS)  P (Macro)  P (Wtd)  R (NCS)  R (CFS)  R (Macro)  R (Wtd)  F1 (NCS)  F1 (CFS)  F1 (Macro)  F1 (Wtd)
CB-SVM   0.8935   0.7972   0.8454     0.8660   0.9263   0.7240   0.8251     0.8685   0.9096    0.7588    0.8342      0.8665
CB-BiL   0.9067   0.7773   0.8420     0.8697   0.9123   0.7652   0.8387     0.8703   0.9095    0.7712    0.8403      0.8700
CB-BB    0.9290   0.8100   0.8695     0.8950   0.9227   0.8238   0.8733     0.8945   0.9259    0.8169    0.8714      0.8947
CB-BBA   0.9406   0.8167   0.8786     0.9052   0.9233   0.8542   0.8887     0.9036   0.9319    0.8350    0.8834      0.9042
Table 3. Precision, Recall, and F1 (Per-Class, Macro, and Weighted) Averaged Across Stratified 4-Fold Cross Validation
Model   CB-SVM  CB-BiL  CB-BB   CB-BBA
nDCG    0.9765  0.9817  0.9882  0.9887
Table 4. nDCG Averaged Across
Stratified 4-Fold Cross Validation

4.4.1. Classification Results

Our results are encapsulated in Table 3 and Table 4. We assume familiarity with the metrics, which are defined in Section 7.2. In Table 3, we observe that the SVM-based model, CB-SVM, has the lowest performance across many measures. This is expected, as the SVM can only capture the information present in the dataset, while the deep learning models benefit from outside knowledge afforded to them by either pre-trained word embeddings or a pre-trained model (i.e., BERT) that can be fine-tuned. The CB-BiL model shows modest improvements overall, but it does achieve noticeably better CFS recall than the SVM model. With respect to BERT-based architectures, both models outperform CB-SVM and CB-BiL considerably, and CB-BBA edges out CB-BB. Importantly, the CFS recall and NCS precision are highest on CB-BBA. This is desirable, as we want (1) to avoid assigning CFS sentences to the NCS class and (2) to capture as many quality CFS sentences as possible. Ultimately, CB-BBA achieves a 3.42 point F1-score improvement over the past state-of-the-art CB-BiL model, a 3.77 point F1-score improvement over the CB-SVM model, and a 0.95 point F1-score improvement over a regularly-trained BERT model. This demonstrates the effectiveness of our new architecture and training algorithm.

The results on the C dataset are in Table 5. The metrics presented for the CLEF competition teams are taken from (Atanasova et al., 2019), since we could not find the source code to reproduce them. For this reason, we also cannot provide the P, R, F1, and nDCG for these teams. We tested models trained on both the CBD and C training sets and evaluated them on the C testing set. The models trained on CBD and tested on the CLEF test set did not perform as well; this was expected, given that our dataset-labelling methodology differs significantly from CLEF’s. Despite this, when trained on C, CB-BBA obtained state-of-the-art results, outperforming the next best model by 1.09 mAP score units.

4.4.2. nDCG Results

In Table 4, we observe that the best nDCG score is achieved by the CB-BBA model, with the CB-BB and CB-BiL models close behind it. The CB-SVM model has the “worst” nDCG, but is still not far behind the deep learning models. It is noteworthy that all models perform relatively well on this measure, since the CFS class is less represented in the dataset.

4.4.3. Distribution of Scores

Figure 4. Comparison of Check-Worthiness Score Distributions Using the CB-BBA and CB-SVM Models on 100 sentences from the January 14, 2020, Democratic presidential debate.

To analyze the distribution of our models’ outputs on a typical corpus of text, we process 100 sentences from the January 14, 2020 Democratic presidential debate. The sentences were chosen so that there would be roughly equal numbers of check-worthy and non-check-worthy sentences. Figure 4 displays the results, which use Kernel Density Estimation (Rosenblatt, 1956) to estimate the score distribution from discrete data points. Observing the density spikes around 0 and 1 in CB-BBA’s distribution, we conclude that our model more clearly differentiates sentences as check-worthy or not check-worthy. The more well-delineated distribution of the CB-BBA model also makes it more interpretable than CB-SVM.
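For readers who want to reproduce a plot like Figure 4, a small sketch using SciPy's Gaussian kernel density estimator on hypothetical scores (not our actual model outputs) could look like:

```python
import numpy as np
from scipy.stats import gaussian_kde

np.random.seed(0)
# Hypothetical check-worthiness scores in [0, 1]: a bimodal pattern with
# density spikes near 0 and 1, similar in shape to CB-BBA's distribution.
scores = np.concatenate([np.random.beta(2, 8, 50), np.random.beta(8, 2, 50)])

kde = gaussian_kde(scores)          # Gaussian kernel density estimate
grid = np.linspace(0, 1, 101)
density = kde(grid)                 # estimated density over the score range
print(float(density.max()))
```

Plotting `density` against `grid` (e.g., with matplotlib) yields a smooth curve of the kind shown in Figure 4.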

Model           Train  mAP     P@10    P@20    P@50    P (NCS)  P (CFS)  R (NCS)  R (CFS)  F1 (NCS)  F1 (CFS)  nDCG
Copenhagen      C      0.1660  0.2286  0.1571  0.1143  -        -        -        -        -         -         -
TheEarthIsFlat  C      0.1597  0.2143  0.1857  0.1457  -        -        -        -        -         -         -
CB-SVM          C      0.1087  0.1429  0.1429  0.1114  0.9813   0.2105   0.9978   0.0294   0.9895    0.0516    0.4567
CB-BBA          C      0.1769  0.2714  0.2071  0.1400  0.9917   0.0665   0.8220   0.6471   0.8989    0.1205    0.5400
CB-SVM          CBD    0.1134  0.1571  0.1429  0.1143  0.9885   0.0678   0.8694   0.4853   0.9251    0.1190    0.4744
CB-BBA          CBD    0.1235  0.1429  0.1286  0.1143  0.9916   0.0764   0.8502   0.6324   0.9155    0.1363    0.4880
Table 5. CLEF-2019 Test Dataset Classification and Ranking Task Results

5. Related Works

In recent years, there have been several efforts toward building claim spotting models. ClaimBuster (Hassan et al., 2015) is the first of several notable claim spotting models. Another team (Gencheva et al., 2017) extended the CB-SVM feature set (Section 4.2) by including several contextual features, such as the position of the target sentence in its segment, the speaker’s name, and the interaction between opponents. They created a dataset from the 2016 US presidential and vice presidential debates and annotated sentences by taking fact-checking outputs from 9 fact-checking organizations. If a sentence was fact-checked by at least one fact-checking outlet, it was labeled as check-worthy. A follow-up study (Jaradat et al., 2018) built an online system, ClaimRank, for prioritizing sentences for fact-checking based on their check-worthiness scores. ClaimRank is a re-implementation of the aforementioned study, but it also supports Arabic by employing cross-language English-Arabic embeddings. Another study (Patwari et al., 2017) followed the same data annotation strategy on a larger dataset that includes sentences from 15 additional 2016 U.S. primary debates. The authors developed a multi-classifier model called TATHYA that uses multiple SVMs trained on different clusters of the dataset and takes the output of the most confident classifier. Its feature set comprised a tf-idf weighted bag-of-unigrams, sentence topics, part-of-speech tuples, and an entity count.

Another effort, by Konstantinovskiy et al. (2018), utilized the expertise of professional fact-checkers to design an annotation schema and create a benchmark dataset for training a claim spotting model. The authors trained the model using logistic regression on top of universal sentence representations derived from InferSent (Conneau et al., 2017). Their model classifies a sentence as either checkable or non-checkable. The authors also disagreed with ClaimBuster’s and ClaimRank’s idea of a check-worthiness score, believing that the decision of how important a claim is should be left to professional fact-checkers. In the recent CLEF2019 competition on check-worthiness detection, the Copenhagen team developed the winning approach (Hansen et al., 2019), which leveraged the semantic and syntactic representation of each word in a sentence. They generated domain-specific pretrained word embeddings that helped their system achieve better performance in the competition, and they trained an LSTM model on a large weakly-labeled dataset whose labels were assigned by ClaimBuster.

6. Conclusion

We have presented our work on detecting check-worthy factual claims by employing adversarial training on transformer networks. Our results show that our methods achieve state-of-the-art results on two different datasets (i.e., CBD and C). In the process, we also revamped our dataset and our approach to collecting and assigning labels for it. We have also come to realize that the lack of a large standardized dataset holds this field back, and we thus look forward to contributing to efforts that remedy this situation. We plan on releasing new versions of our dataset periodically, in hopes of attracting more significant community contributions toward expanding it.

In the future, we are interested in exploring adversarial training as a defense against malicious adversaries. As a publicly deployed API, ClaimBuster may be susceptible to exploitation without mechanisms that improve its robustness. For example, Jin et al. (2019) have shown that a model’s classification can be strongly influenced when certain words are replaced by their synonyms. We are currently researching methods to combat similar weaknesses.


  • M. Abadi, P. Barham, J. Chen, Z. Chen, et al. (2016) TensorFlow: a system for large-scale machine learning. In OSDI, pp. 265–283. Cited by: §7.6.
  • H. Allcott and M. Gentzkow (2017) Social media and fake news in the 2016 election. Working Paper Technical Report 23089, Working Paper Series, National Bureau of Economic Research. Cited by: §1.
  • P. Atanasova, P. Nakov, G. Karadzhov, M. Mohtarami, and G. Da San Martino (2019) Overview of the clef-2019 checkthat! lab on automatic identification and verification of claims. task 1: check-worthiness. In CEUR Workshop Proceedings, Cited by: §4.4.1.
  • R. Caruana, S. Lawrence, and C. L. Giles (2001) Overfitting in neural nets: backpropagation, conjugate gradient, and early stopping. In NIPS, pp. 402–408. Cited by: §1.
  • D. Cer, Y. Yang, S. Kong, N. Hua, N. Limtiaco, R. S. John, N. Constant, M. Guajardo-Cespedes, S. Yuan, C. Tar, et al. (2018) Universal sentence encoder. External Links: 1803.11175 Cited by: §4.2.1.
  • M. S. Chan, C. R. Jones, K. Hall Jamieson, and D. Albarracín (2017) Debunking: a meta-analysis of the psychological efficacy of messages countering misinformation. Psychological Science 28 (11), pp. 1531–1546. Cited by: §1.
  • A. Conneau, D. Kiela, H. Schwenk, L. Barrault, and A. Bordes (2017) Supervised learning of universal sentence representations from natural language inference data. External Links: 1705.02364 Cited by: §5.
  • N. Dalvi, P. Domingos, S. Sanghai, D. Verma, et al. (2004) Adversarial classification. In SIGKDD, pp. 99–108. Cited by: §3.4.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL, pp. 4171–4186. Cited by: §1, §3.1.2.
  • L. Favano, M. J. Carman, and P. L. Lanzi (2019) TheEarthIsFlat’s submission to clef’19 checkthat! challenge. In CEUR Workshop Proceedings, Cited by: §1, §4.2.1.
  • P. Gencheva, P. Nakov, L. Màrquez, A. Barrón-Cedeño, and I. Koychev (2017) A context-aware approach for detecting worth-checking claims in political debates. In RANLP, pp. 267–276. Cited by: §5.
  • X. Glorot and Y. Bengio (2010) Understanding the difficulty of training deep feedforward neural networks. In AISTATS, pp. 249–256. Cited by: §1.
  • I. J. Goodfellow, J. Shlens, and C. Szegedy (2014) Explaining and harnessing adversarial examples. External Links: 1412.6572 Cited by: §1, §3.4, §3.
  • C. Hansen, C. Hansen, J. G. Simonsen, and C. Lioma (2019) Neural Weakly Supervised Fact Check-Worthiness Detection with Contrastive Sampling-Based Ranking Loss. In CEUR Workshop Proceedings, Cited by: §1, §4.2.1, §5.
  • N. Hassan, F. Arslan, C. Li, and M. Tremayne (2017) Toward automated fact-checking: detecting check-worthy factual claims by claimbuster. In SIGKDD, pp. 1803–1812. Cited by: §1, §1, §2.1, §3.1.1, §3.1.1, §4.1.1, §4.2, §4, §7.3.2.
  • N. Hassan, C. Li, and M. Tremayne (2015) Detecting check-worthy factual claims in presidential debates. In CIKM, pp. 1835–1838. Cited by: §1, §1, §2.1, §3.1.1, §4.2, §5.
  • N. Hassan, G. Zhang, F. Arslan, J. Caraballo, and et al. (2017) ClaimBuster: the first-ever end-to-end fact-checking system. PVLDB 10 (12), pp. 1945–1948. Cited by: §1, §2.1, §3.1.1, §4.1.1.
  • M. Heilman (2011) Automatic factual question generation from text. Ph.D. Thesis, Carnegie Mellon University, USA. External Links: ISBN 9781267582249 Cited by: §2.2.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §1, §4.2.1.
  • J. Howard and S. Ruder (2018) Universal language model fine-tuning for text classification. In ACL, Cited by: §3.2.2.
  • K. Janocha and W. M. Czarnecki (2017) On loss functions for deep neural networks in classification. Schedae Informaticae 1/2016. Cited by: §3.3.
  • I. Jaradat, P. Gencheva, A. Barrón-Cedeño, L. Màrquez, and P. Nakov (2018) ClaimRank: detecting check-worthy claims in Arabic and English. In NAACL, pp. 26–30. Cited by: §5.
  • D. Jimenez and C. Li (2018) An empirical study on identifying sentences with salient factual statements. In IJCNN, pp. 1–8. Cited by: §1, §1, §2.1, §3.1.1, §4.1.1, §4.2, §4.
  • D. Jin, Z. Jin, J. T. Zhou, and P. Szolovits (2019) Is bert really robust? natural language attack on text classification and entailment. arXiv preprint arXiv:1907.11932. Cited by: §6.
  • D. P. Kingma and J. Ba (2015) Adam: A method for stochastic optimization. In ICLR, Cited by: §3.2.
  • L. Konstantinovskiy, O. Price, M. Babakar, and A. Zubiaga (2018) Towards automated factchecking: developing an annotation schema and benchmark for consistent automated claim detection. External Links: 1809.08193 Cited by: §5.
  • S. Majithia, F. Arslan, S. Lubal, D. Jimenez, P. Arora, J. Caraballo, and C. Li (2019) ClaimPortal: integrated monitoring, searching, checking, and analytics of factual claims on twitter. In ACL, pp. 153–158. Cited by: §1.
  • T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. In NIPS, pp. 3111–3119. Cited by: §3.1.2.
  • T. Miyato, A. M. Dai, and I. Goodfellow (2016) Adversarial training methods for semi-supervised text classification. External Links: 1605.07725 Cited by: §1, §3.4, §3.4.
  • T. Miyato, S. Maeda, M. Koyama, and S. Ishii (2018) Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (8), pp. 1979–1993. Cited by: §1.
  • M. Nadeem, W. Fang, B. Xu, M. Mohtarami, and J. Glass (2019) FAKTA: an automatic end-to-end fact checking system. In NAACL, Cited by: §1.
  • A. Nguyen, J. Yosinski, and J. Clune (2015) Deep neural networks are easily fooled: high confidence predictions for unrecognizable images. In CVPR, Cited by: §3.4.
  • B. Nyhan and J. Reifler (2015) Estimating fact-checking’s effects. External Links: Link Cited by: §1.
  • A. Patwari, D. Goldwasser, and S. Bagchi (2017) TATHYA: a multi-classifier system for detecting check-worthy statements in political debates. In CIKM, pp. 2259–2262. Cited by: §5.
  • G. Pennycook and D. G. Rand (2019) Fighting misinformation on social media using crowdsourced judgments of news source quality. Proceedings of the National Academy of Sciences 116 (7), pp. 2521–2526. Cited by: §1.
  • M. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. In NAACL, Cited by: §3.2.2.
  • A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever (2018) Improving language understanding by generative pre-training. Cited by: §3.1.2, §3.2.2.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. OpenAI Blog 1 (8), pp. 9. Cited by: §3.1.2.
  • P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP, Cited by: §3.1.2.
  • M. Rosenblatt (1956) Remarks on some nonparametric estimates of a density function. The Annals of Mathematical Statistics 27 (3), pp. 832–837. Cited by: §4.4.3.
  • A. Shafahi, M. Najibi, A. Ghiasi, Z. Xu, J. Dickerson, C. Studer, L. S. Davis, G. Taylor, and T. Goldstein (2019) Adversarial training for free!. External Links: 1904.12843 Cited by: §3.4.
  • U. Shaham, Y. Yamada, and S. Negahban (2018) Understanding adversarial training: increasing local stability of supervised models through robust optimization. Neurocomputing 307, pp. 195–204. Cited by: §3.4.
  • C. Sun, X. Qiu, Y. Xu, and X. Huang (2019) How to fine-tune bert for text classification?. External Links: 1905.05583 Cited by: §3.4.
  • C. Tan, F. Sun, T. Kong, W. Zhang, C. Yang, and C. Liu (2018) A survey on deep transfer learning. Lecture Notes in Computer Science, pp. 270–279. Cited by: §3.2.2.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. External Links: 1706.03762 Cited by: §1, §3.1.2.
  • A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman (2018) GLUE: a multi-task benchmark and analysis platform for natural language understanding. External Links: 1804.07461 Cited by: §3.1.2.
  • Y. Wu, M. Schuster, Z. Chen, Q. V. Le, et al. (2016) Google’s neural machine translation system: bridging the gap between human and machine translation. External Links: 1609.08144 Cited by: §3.1.2.

7. Reproducibility

7.1. Code Repositories, API, and Related Projects

We provide a public API for the claim-spotting algorithm to showcase its real-world usage. We are also releasing our code, along with detailed instructions on running and training our models, in our code repositories. Along with these, we present several related projects which showcase how ClaimBuster can be and is currently being used.

7.2. Formulas for Performance Measures

For each class c, let TP, FP, and FN denote the numbers of true positives, false positives, and false negatives, and let n_c be the number of sentences in class c. The standard definitions are:

Precision (P): P_c = TP_c / (TP_c + FP_c)

Recall (R): R_c = TP_c / (TP_c + FN_c)

F1: F1_c = 2 · P_c · R_c / (P_c + R_c)

Macro (P, R, F1): the unweighted mean of the per-class values, e.g., P_macro = (P_NCS + P_CFS) / 2.

Weighted (P, R, F1): the mean of the per-class values weighted by class support, e.g., P_weighted = (n_NCS · P_NCS + n_CFS · P_CFS) / (n_NCS + n_CFS).

Mean Average Precision (mAP): the mean, over ranked lists, of the average precision AP = (1 / R) Σ_k rel(k) · P@k, where R is the number of relevant (check-worthy) items and rel(k) indicates whether the item at rank k is relevant.

Normalized Discounted Cumulative Gain (nDCG): nDCG = DCG / IDCG, where DCG = Σ_k rel(k) / log₂(k + 1) and IDCG is the DCG of the ideal ranking.
7.3. Datasets

We provide the CBD dataset, a collection of sentences labelled manually in-house by high-quality coders, in our repository. CBD was curated to maintain a fixed ratio of non-check-worthy sentences (NCS) to check-worthy sentences (CFS); this ratio was chosen after evaluating several candidate values and determining which produced the best results. The C dataset, containing sentences from the first and second presidential debates and the first vice presidential debate from 2016, is also available online.

7.3.1. Contributing

We are always looking for collaborators to contribute to the labelling of more data. Contributions will benefit everyone, as we plan on releasing periodic updates whenever a significant number of new labels are gathered. To contribute, please visit our website and make an account.

7.3.2. Dataset Labeling Criteria

The labels for the dataset are assigned by high-quality coders: participants who have earned a pay-rate and have labeled at least a minimum number of screening sentences. The pay-rate for a user is internally calculated by taking into account their labeling quality, the average length of the sentences they label, and how many sentences they skip. More specifically, we define the quality of a coder with respect to the screening sentences they have labeled as a weighted agreement score, where the weight depends on the label the coder assigned and the label the experts assigned to the same screening sentence; the weights are set empirically. The pay-rate is then defined as a function of this quality score, the average length of the sentences the coder labeled relative to the average length of all sentences, and the sets of sentences the coder labeled and skipped. The numerical values in the pay-rate equation were set such that a participant could earn up to a fixed maximum amount per sentence. Using this scheme, a subset of the users in our system qualify as high-quality coders. A label is then only assigned to a particular sentence if it has unanimously been assigned that label by at least 2 high-quality coders; more precisely, the number of top-quality labels needed is defined as a threshold on labels of that type given by high-quality coders (Hassan et al., 2017).
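The final label-assignment rule can be sketched as follows. This is a simplified illustration of the unanimity requirement described above; the `assign_label` helper and `min_agreement` parameter are our own names and do not reflect the exact threshold formula used in the original system:

```python
from collections import Counter

def assign_label(high_quality_labels, min_agreement=2):
    """Assign a label only if all high-quality coders unanimously agree
    and at least `min_agreement` such labels exist."""
    counts = Counter(high_quality_labels)
    if len(counts) == 1 and len(high_quality_labels) >= min_agreement:
        return high_quality_labels[0]
    return None  # unresolved: the sentence needs more labels

print(assign_label(["CFS", "CFS"]))   # unanimous agreement
print(assign_label(["CFS", "NCS"]))   # disagreement
```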

7.4. Hyperparameters

Parameter BBA BB
cs_train_steps 10 5
cs_lr 5e-5 5e-5
cs_kp_cls 0.7 0.7
cs_batch_size_reg 24 24
cs_batch_size_adv 12 -
cs_perturb_norm_length 2.0 -
cs_lambda 0.1 -
cs_combine_reg_adv_loss True -
cs_perturb_id 5 -
Table 6. Major Parameters for Training

We provide Table 6 with the major parameter settings used in the BBA and BB claim-spotting models (dashes indicate parameters that do not apply). The descriptions of the major parameters are as follows:

  • cs_train_steps: number of epochs to run

  • cs_lr: learning rate during optimization

  • cs_perturb_norm_length: norm length of adversarial perturbation

  • cs_kp_cls: keep probability of dropout in fully connected layer

  • cs_lambda: adversarial loss coefficient (eq. 10)

  • cs_combine_reg_adv_loss: whether to combine the regular and adversarial losses during training

  • cs_batch_size_reg: batch size for regular training

  • cs_batch_size_adv: batch size for adversarial training

  • cs_perturb_id: perturbation setting index in Table 2

7.5. Evaluation and Training Final Models

We perform stratified 4-fold cross validation to evaluate our models, selecting the best model from each fold using the weighted F1-score (Section 7.2) calculated on the validation set. In each iteration, the data is split into test, validation, and training portions. The metrics produced at the end are based on the classifications pooled across all folds. We train the final models (for both CBD and C) on the entire dataset for up to 10 epochs and select the best epoch based on the weighted F1-score calculated on the validation set.
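The evaluation protocol can be sketched on synthetic data as follows; the logistic-regression stand-in and hyperparameter loop replace our actual model and epoch selection, so this only illustrates the fold/validation structure:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Synthetic, nearly separable binary data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)

all_true, all_pred = [], []
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    # Carve a validation split out of the training portion for model selection.
    X_tr, X_val, y_tr, y_val = train_test_split(
        X[train_idx], y[train_idx], test_size=0.2,
        stratify=y[train_idx], random_state=0)
    best_model, best_f1 = None, -1.0
    for C in (0.1, 1.0, 10.0):  # stand-in for selecting the best epoch
        model = LogisticRegression(C=C).fit(X_tr, y_tr)
        score = f1_score(y_val, model.predict(X_val), average="weighted")
        if score > best_f1:
            best_model, best_f1 = model, score
    # Pool classifications across folds, as described above.
    all_true.extend(y[test_idx])
    all_pred.extend(best_model.predict(X[test_idx]))

final_f1 = f1_score(all_true, all_pred, average="weighted")
print(final_f1)
```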

7.6. Hardware and Software Specifications

Our neural network models and training algorithms were written in TensorFlow 2.1 (Abadi et al., 2016) and run on machines with four Nvidia GeForce GTX 1080Ti GPUs. We did not parallelize GPU usage with distributed training; each experiment was run on a single 1080Ti GPU. The machines ran Arch Linux and had an 8-core i7 5960X CPU, 128GB RAM, a 4TB HDD, and a 256GB SSD.