Rating Facts under Coarse-to-fine Regimes

07/13/2021 ∙ by Guojun Wu, et al. ∙ National Taiwan University of Science and Technology

The use of fake news as a political weapon has become a global concern and has highlighted the inability of manual fact checking to keep up with rapidly produced fake news. Statistical approaches are therefore required if we are to address this problem efficiently. The shortage of publicly available datasets is one major bottleneck for automated fact checking. To remedy this, we collected 24K manually rated statements from PolitiFact. The class values exhibit a natural order with respect to truthfulness, as shown in Table 1. Our task therefore differs from standard classification because of the varying degrees of similarity between classes. To investigate this, we defined coarse-to-fine classification regimes, which present a new challenge for classification. To address this, we propose BERT-based models. After training, the effect of class similarity is noticeable on the multi-class datasets, especially the fine-grained one. Under all the regimes, BERT achieves state-of-the-art performance, while the additional layers provide insignificant improvement.




1 Introduction

The manipulation of weaponized fake news has drawn public concern in many political events. One study of fake news (Allcott and Gentzkow, 2017) reported that the average American adult encountered one or several fake news stories during the 2016 election period. Since their database was limited, this estimate may be conservative. Moreover, the steep decline of public trust in mass media in 2016 (Brenan, 2021) also drew attention to the impact of fake news.

To limit the spread of fake news, many fact-checking websites have emerged, such as PolitiFact and First Draft (https://firstdraftnews.org/). At PolitiFact, for example, a reporter researches the statement and provides a detailed analysis, and several editors then rate its truthfulness together. While this fact-checking procedure is responsible, it consumes a lot of effort and time.

Automatic fact checking seems to be an elegant approach to detecting fake news quickly and cutting off its spread. However, verification of novel claims is challenging because relevant information is difficult to retrieve. Thus, we aim to assess the truthfulness of a claim purely through linguistic analysis. Wang (2017) discussed the lack of available datasets on fake news, which limits the potential to combat it. To address this, they introduced a new benchmark dataset, LIAR, which contains 12.8K rated statements from PolitiFact. In our work, we constructed a similar dataset that is larger than LIAR.

True: The statement is accurate and there is nothing significant missing.
Mostly true: The statement is accurate but needs clarification or additional information.
Half true: The statement is partially accurate but leaves out important details or takes things out of context.
Mostly false: The statement contains an element of truth but ignores critical facts that would give a different impression.
False: The statement is not accurate.
Pants on fire: The statement is not accurate and makes a ridiculous claim.
Table 1: Meter of truthfulness rating.

Empirically, our task is one of rating inference, which differs from standard multi-class text classification. The varying degrees of similarity between classes can be confusing for classification. One study (Pang and Lee, 2005) suggests that as class similarity increases, the performance of both humans and models drops noticeably. This problem has also been discussed in prior work on fact checking (Vlachos and Riedel, 2014; Rashkin et al., 2017). To investigate this, we defined three classification regimes: fine-grained (6-class), coarse-grained (3-class), and binary (2-class).

In this paper, we propose BERT-based models to rate the statements. Pre-trained language models have been a significant ingredient in multiple NLP tasks. While the standard BERT has achieved state-of-the-art results (Devlin et al., 2019), we hypothesize that its representation for classification can be enhanced. We propose a structure with recurrent (i.e., BiLSTM (Sachan et al., 2019)) or convolutional (i.e., CNN (Kim, 2014)) layers on top of BERT. Both kinds of additional layers have achieved competitive performance in many text classification tasks based on word vectors. For comparison, we use BERT as our baseline model.

Our experiments suggest that class similarity can influence the judgement of the model. This effect is more obvious in the fine-grained dataset. Moreover, BERT achieves state-of-the-art performance under all three regimes. While the additional BiLSTM fails to advance the results, the CNN consistently provides a slight improvement.

2 Related work


Vlachos and Riedel (2014) provided one of the first fact-checking datasets, which consists of 106 claims from PolitiFact. Later, Wang (2017) introduced a new benchmark dataset, LIAR, which includes 12.8K claims, also from PolitiFact, together with their meta-data (e.g., the context of the claim). Rashkin et al. (2017) collected 10,483 claims from PolitiFact and its spin-off sites. Our paper introduces datasets that are larger than these datasets of similar type, and examines classification regimes in more depth.


There have been various kinds of linguistic analysis, such as separation of fake news types (Rubin et al., 2015), article structure and content of hoaxes (Kumar et al., 2016), linguistic style of clickbait (Biyani et al., 2016), quantitative study of linguistic differences (Rashkin et al., 2017), and bias of language patterns in fake news (O’Brien et al., 2018). Beyond these, claim verification has often been related to stance classification, which requires datasets that pair claims with associated articles, such as Emergent (Ferreira and Vlachos, 2016) and FEVER (Thorne et al., 2018). The claim-level truthfulness assessment can then be reached through article-level stance classification (Mohtarami et al., 2018; Xu et al., 2019; Baly et al., 2018). While stance classification can detect repetitions or paraphrases of existing claims, it cannot deal with novel ones. Thus, our work takes the linguistic approach, utilizing deep learning to assess the claims.

3 Politic Fact Dataset

In this section, we introduce and analyze the new Politic Fact datasets, which consist of labeled statements collected from PolitiFact. As shown in Table 1, there are varying degrees of similarity between classes; for example, “True” is closer to “Mostly true” than to “False”. To investigate this further, we defined coarse-to-fine classification regimes and constructed one dataset for each regime. To balance the classes, we kept all statements in the rarest class and randomly selected an equal number of statements from each more abundant class. Because of this filtering, the size of each dataset varies. We report the specifications of these datasets in Table 2.

Regime          Train   Test   Classes
Fine-grained    11932   2987   6
Coarse-grained  16980   4245   3
Binary          11320   2830   2
Table 2: Specifications of the Politic Fact datasets.
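The balancing step described above can be sketched as follows. This is a minimal illustration under our own assumptions rather than the exact procedure used to build the datasets: the function name, the (text, label) pair format, and the fixed seed are hypothetical.

```python
import random
from collections import defaultdict


def balance_by_downsampling(examples, seed=0):
    """Keep every example of the rarest class and sample the same
    number of examples from each of the other classes.

    `examples` is a list of (text, label) pairs (assumed format).
    """
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))

    # The rarest class determines how many examples each class keeps.
    target = min(len(items) for items in by_label.values())

    rng = random.Random(seed)
    balanced = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], target))
    rng.shuffle(balanced)
    return balanced


# Toy data: five "true" statements and two "false" statements.
toy = [(f"claim {i}", "true") for i in range(5)]
toy += [(f"claim {i}", "false") for i in range(5, 7)]

balanced = balance_by_downsampling(toy)
counts = {
    label: sum(1 for _, lab in balanced if lab == label)
    for label in ("true", "false")
}
print(counts)  # {'true': 2, 'false': 2}
```

Because the size of the rarest class caps every other class, the total size depends on how the labels are grouped, which is consistent with the different dataset sizes in Table 2.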

3.1 FPF: Fine-grained Politic Fact

In this dataset, we keep the original labels (6-class) from PolitiFact, as shown in Table 1. Since it is meticulously classified, we name it Fine-grained Politic Fact (FPF). This regime can capture the main variability of statements, with well-balanced labels from “True” to “Pants on fire”.

When we began to construct this 6-class dataset, we found that the most false class (i.e., “Pants on fire”) contained only about 10% of all the statements. Thus, after filtering out about 40% of the data, this dataset included 14.9K statements. All the statements were then split into a train set (11932) and a test set (2984).

3.2 CPF: Coarse-grained Politic Fact

One problem with FPF is that the high similarity between adjacent classes makes classification challenging. For example, the difference between “True” and “Mostly true” is hard to tell, since both indicate that the statement is accurate and the latter only adds that further information is needed.

To address this, we propose a new dataset, Coarse-grained Politic Fact (CPF). We treat the top two truthfulness ratings as true, the middle two as neutral, and the lowest two as false. This reduces the number of classes to three and lowers the similarity between classes. We then filtered the dataset and split it into a train set (16980) and a test set (4245).

3.3 BPF: Binary Politic Fact

The simplest member of the Politic Fact family is Binary Politic Fact (BPF), which contains only the true and false classes. We constructed it from CPF by dropping the neutral class. This removed one third of CPF, leaving 11320 training and 2830 test statements.
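The label mappings behind CPF and BPF can be sketched as below. This is only an illustration: the lowercase label strings and the function names are our assumptions, and only the grouping (top two, middle two, lowest two, then dropping neutral) follows the description in Sections 3.2 and 3.3.

```python
# Six ratings ordered from most to least truthful (Table 1).
# The exact label strings are an assumption made for this sketch.
FINE_LABELS = [
    "true",
    "mostly true",
    "half true",
    "mostly false",
    "false",
    "pants on fire",
]

COARSE_LABELS = ("true", "neutral", "false")


def to_coarse(fine_label):
    """Map a 6-class rating to the 3-class regime (CPF)."""
    index = FINE_LABELS.index(fine_label)
    # Adjacent pairs of ratings collapse into one coarse class.
    return COARSE_LABELS[index // 2]


def to_binary(fine_label):
    """Map a 6-class rating to the 2-class regime (BPF).

    Returns None for ratings that fall into the neutral class,
    which BPF leaves out.
    """
    coarse = to_coarse(fine_label)
    return None if coarse == "neutral" else coarse


for label in FINE_LABELS:
    print(label, "->", to_coarse(label), "/", to_binary(label))
```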

4 Models

The models in this section are all based on BERT (Devlin et al., 2019). BERT uses the [CLS] token as the sentence representation and has already achieved state-of-the-art results in many text classification tasks, such as SST-2 (Socher et al., 2013). However, we believe that additional layers can help the model capture the sentence representation better.

4.1 BERT

Following standard practice (Devlin et al., 2019), we take the final-layer representation of the [CLS] token as the sentence representation. This model is the baseline for our task.


4.2 BERT-BiLSTM

Since BERT has proven capable of capturing high-level features, we take advantage of it by using the entire final layer as output. We then make use of a BiLSTM (Sachan et al., 2019). We first pass the output to a BiLSTM layer. Next, we concatenate the forward and backward LSTM hidden states at each time step. Finally, we apply a max-over-time operation over the concatenated hidden states to obtain the sentence representation.
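The last two steps, concatenation at each time step followed by max-over-time pooling, can be illustrated with a small sketch. The LSTM computation itself is omitted; the forward and backward hidden states below are toy values, and the function name is hypothetical.

```python
def concat_and_max_over_time(forward_states, backward_states):
    """Concatenate forward and backward hidden states at each time
    step, then take the element-wise maximum over all time steps.

    Both arguments are lists of T vectors (lists of floats).
    """
    if len(forward_states) != len(backward_states):
        raise ValueError("both directions must cover the same time steps")

    # One vector of size 2 * hidden_size per time step.
    concatenated = [f + b for f, b in zip(forward_states, backward_states)]

    # zip(*...) groups the values of one dimension across all time steps.
    return [max(dimension) for dimension in zip(*concatenated)]


# Toy hidden states: 3 time steps, hidden size 2 per direction.
forward = [[0.1, -0.3], [0.5, 0.2], [-0.4, 0.9]]
backward = [[0.7, 0.0], [-0.2, 0.3], [0.1, -0.6]]

sentence_vector = concat_and_max_over_time(forward, backward)
print(sentence_vector)  # [0.5, 0.9, 0.7, 0.3]
```

The pooled vector has a fixed size of twice the hidden size regardless of the number of tokens, which is what allows it to serve as a sentence representation.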

4.3 BERT-CNN

In this model, we use a CNN (Kim, 2014) for the sentence representation. We first build the sentence matrix from the final layer of BERT. Then we perform one-dimensional convolution on it with multiple filters to produce the feature maps. Next, we run max-over-time pooling over each feature map to capture the most significant features. Finally, we concatenate these features to form the sentence representation.
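A minimal sketch of this convolution and pooling pipeline is given below, assuming each filter spans the full token dimension and omitting the bias term and non-linearity for brevity. The function names and the toy values are hypothetical.

```python
def conv1d_feature_map(sentence_matrix, filt):
    """Slide one filter along the token axis.

    sentence_matrix: list of T token vectors, each of length D.
    filt: list of `width` weight vectors, each of length D.
    Returns T - width + 1 values (assumes T >= width).
    """
    width = len(filt)
    feature_map = []
    for start in range(len(sentence_matrix) - width + 1):
        window = sentence_matrix[start:start + width]
        value = sum(
            w * x
            for w_row, x_row in zip(filt, window)
            for w, x in zip(w_row, x_row)
        )
        feature_map.append(value)
    return feature_map


def cnn_sentence_representation(sentence_matrix, filters):
    """Max-over-time pooling per filter, concatenated into one vector."""
    return [max(conv1d_feature_map(sentence_matrix, f)) for f in filters]


# Toy sentence matrix: 4 tokens with 2 dimensions each.
tokens = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0], [1.0, 1.0]]
filters = [
    [[1.0, 0.0], [0.0, 1.0]],              # region size 2
    [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]],  # region size 3
]

representation = cnn_sentence_representation(tokens, filters)
print(representation)  # [4.0, 8.0]
```

Each filter contributes exactly one value after pooling, so the length of the sentence representation equals the number of filters.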

5 Experiment

In this section, we analyze the class relationships in the data, compare the performance of our models, and discuss the need for a new metric.

5.1 Hyperparameters


We used the 12-layer BERT-base. When searching parameters for BERT, we followed the fine-tuning procedure of Devlin et al. (2019). The dropout rate was always 0.1. After an exhaustive search over the recommended parameters, we found that the model performed best with a batch size of 32, a learning rate (Adam) of 5e-5, and 4 epochs.

BERT with additional layers

We kept the parameters in BERT the same. In the BiLSTM, we set the hidden size equal to the dimension of the token representation (768) in BERT. The dropout rate was 0.5, following Sachan et al. (2019). In the CNN, we followed the sensitivity analysis of Zhang and Wallace (2015). After comparing the results of several suggested settings, we found that the model performed best when the region sizes of the filters were (7,7,7,7), the number of feature maps was 768, and the dropout rate was 0.5.
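For reference, the settings reported in this section can be collected in one place. The class and field names below are hypothetical; only the values come from the text, and whether the 768 feature maps are per region size or in total is not specified.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """BERT-base fine-tuning settings reported in Section 5.1."""
    num_bert_layers: int = 12
    batch_size: int = 32
    learning_rate: float = 5e-5  # Adam
    num_epochs: int = 4
    bert_dropout: float = 0.1


@dataclass(frozen=True)
class BiLSTMHeadConfig:
    hidden_size: int = 768  # same as the BERT token dimension
    dropout: float = 0.5


@dataclass(frozen=True)
class CNNHeadConfig:
    region_sizes: tuple = (7, 7, 7, 7)
    num_feature_maps: int = 768  # per region size or total: not stated
    dropout: float = 0.5


fine_tune = FineTuneConfig()
bilstm_head = BiLSTMHeadConfig()
cnn_head = CNNHeadConfig()
print(fine_tune, bilstm_head, cnn_head, sep="\n")
```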

Fine tuning

These parameters were chosen after a search on a more-true/more-false dataset (i.e., with the top three truthfulness ratings as more true and the others as more false). We did not otherwise perform any tuning on the three Politic Fact datasets.

Figure 1: Normalized distribution of predictions over FPF. (Average of three random seeds)

5.2 Result analysis: Class similarity

We present the distribution of predictions by BERT over the FPF (6-class) and CPF (3-class) test sets in Figure 1 and Figure 2, respectively. The proportion of predicted labels decreases as the distance from the ground truth grows. Moreover, the polarized classes show higher accuracy than the neutral ones. These observations indicate that class similarity can mislead our model into choosing classes that are close to the ground truth. Statements in neutral classes are especially likely to be misclassified because they lie in the middle and are closely related to both sides. While the model classifies the statements more accurately in CPF, class similarity can still have a slight impact on its judgement.
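A distribution of this kind can be computed as a row-normalized confusion matrix. The sketch below assumes that the normalization is done per ground-truth class, which is our reading of the figure captions; the function name and the toy labels are hypothetical.

```python
def normalized_prediction_distribution(y_true, y_pred, num_classes):
    """For each ground-truth class, return the fraction of its
    examples assigned to each predicted class.

    Labels are integers in [0, num_classes), ordered by truthfulness.
    """
    counts = [[0] * num_classes for _ in range(num_classes)]
    for truth, prediction in zip(y_true, y_pred):
        counts[truth][prediction] += 1

    distribution = []
    for row in counts:
        total = sum(row)
        # A class with no examples yields a row of zeros.
        distribution.append([c / total if total else 0.0 for c in row])
    return distribution


# Toy 3-class example (0 = true, 1 = neutral, 2 = false).
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 1, 0, 2, 1]

dist = normalized_prediction_distribution(y_true, y_pred, num_classes=3)
for row in dist:
    print(row)
# [0.5, 0.25, 0.25]
# [0.5, 0.5, 0.0]
# [0.0, 0.5, 0.5]
```

Reading a row from the diagonal outward shows how the share of predictions changes with the distance from the ground-truth class.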

5.3 Result analysis: Classifier comparison

Table 3 shows the performance of our models over the three datasets. We chose the weighted average F1 score as the evaluation metric. Since results vary across random seeds, we report the average of five random seeds, with standard deviations as subscripts. Our baseline model (i.e., BERT) already performs well on its own. We had hoped for performance gains from an additional BiLSTM layer. However, it outperforms the baseline only slightly, and only on CPF. In contrast, BERT-CNN consistently outperforms the other two models over all three datasets.
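The evaluation protocol can be sketched as follows: a support-weighted F1 score per run, then the mean and standard deviation over runs. The function name and the toy predictions are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not specify which estimator was used.

```python
from statistics import mean, stdev


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to the
    number of ground-truth examples of each class."""
    total = len(y_true)
    pairs = list(zip(y_true, y_pred))
    score = 0.0
    for label in sorted(set(y_true)):
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        f1 = 2 * tp / (2 * tp + fp + fn)
        support = tp + fn
        score += f1 * support / total
    return score


# Toy example: the same test labels, predictions from three runs.
y_true = [0, 0, 1, 1]
predictions_by_seed = [
    [0, 1, 1, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 0],
]

scores = [weighted_f1(y_true, y_pred) for y_pred in predictions_by_seed]
mean_f1 = mean(scores)
std_f1 = stdev(scores)  # sample standard deviation (assumption)
print(f"weighted F1: {mean_f1:.4f} (std {std_f1:.4f})")
```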

Figure 2: Normalized distribution of predictions over CPF. (Average of three random seeds)

5.4 Metric

In this section, we discuss a new evaluation metric, which we hope can shed some light on future work. Intuitively, predicting a class close to the ground truth should be less wrong than predicting a distant one. However, widely used evaluation metrics (i.e., accuracy and F1 score) only evaluate whether the model hits the target (i.e., the ground truth), so they cannot fully represent rating performance. To address this, we suggest mean absolute error (MAE) as a new metric. It differentiates the scores according to the gaps between the predictions and the target. Further research is needed on how to quantify the gaps appropriately.
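The following sketch illustrates the point with toy predictions. It assumes that the six ratings are encoded as integers 0 to 5 with unit spacing between adjacent classes; as noted above, how to quantify the gaps is an open question, so this spacing is only one possible choice. The function names are hypothetical.

```python
def accuracy(y_true, y_pred):
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average absolute gap between predicted and true ratings.

    Ratings are integers ordered by truthfulness, e.g. 0 = True,
    ..., 5 = Pants on fire (unit spacing is an assumption).
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Four statements whose ground truth is "True" (0).
y_true = [0, 0, 0, 0]
near_misses = [0, 1, 0, 1]  # errors land on "Mostly true"
far_misses = [0, 5, 0, 5]   # errors land on "Pants on fire"

print(accuracy(y_true, near_misses), accuracy(y_true, far_misses))
# 0.5 0.5
print(mean_absolute_error(y_true, near_misses),
      mean_absolute_error(y_true, far_misses))
# 0.5 2.5
```

Both prediction sets have the same accuracy, but MAE separates them because it penalizes errors in proportion to their distance from the ground truth.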

Table 3: F1 scores for the Politic Fact datasets. We report averages over five random seeds. Standard deviations are shown as subscripts.

6 Conclusion

In this paper, we introduce the Politic Fact datasets with coarse-to-fine regimes for fact-checking research. Through analysis of the distribution of predictions, we have shown that the effect of class similarity is noticeable in both the fine-grained and coarse-grained datasets. Though the additional layers do not provide significant improvement, BERT is capable of tackling the task. To address the limits of current metrics, we also suggest a new metric and hope it inspires further research.


References

  • H. Allcott and M. Gentzkow (2017) Social media and fake news in the 2016 election. Journal of Economic Perspectives 31 (2), pp. 211–36.
  • R. Baly, M. Mohtarami, J. Glass, L. Màrquez, A. Moschitti, and P. Nakov (2018) Integrating stance detection and fact checking in a unified corpus. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), New Orleans, Louisiana, pp. 21–27.
  • P. Biyani, K. Tsioutsiouliklis, and J. Blackmer (2016) “8 amazing secrets for getting more clicks”: detecting clickbaits in news streams using article informality. In Thirtieth AAAI Conference on Artificial Intelligence.
  • B.M. Brenan (2021) Americans Remain Distrustful of Mass Media.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186.
  • W. Ferreira and A. Vlachos (2016) Emergent: a novel data-set for stance classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1163–1168.
  • Y. Kim (2014) Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, pp. 1746–1751.
  • S. Kumar, R. West, and J. Leskovec (2016) Disinformation on the web: impact, characteristics, and detection of Wikipedia hoaxes. In Proceedings of the 25th International Conference on World Wide Web, pp. 591–602.
  • M. Mohtarami, R. Baly, J. Glass, P. Nakov, L. Màrquez, and A. Moschitti (2018) Automatic stance detection using end-to-end memory networks. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 767–776.
  • N. O’Brien, S. Latessa, G. Evangelopoulos, and X. Boix (2018) The language of fake news: opening the black-box of deep learning based detectors.
  • B. Pang and L. Lee (2005) Seeing stars: exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), Ann Arbor, Michigan, pp. 115–124.
  • H. Rashkin, E. Choi, J. Y. Jang, S. Volkova, and Y. Choi (2017) Truth of varying shades: analyzing language in fake news and political fact-checking. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2931–2937.
  • V. L. Rubin, Y. Chen, and N. K. Conroy (2015) Deception detection for news: three types of fakes. Proceedings of the Association for Information Science and Technology 52 (1), pp. 1–4.
  • D. S. Sachan, M. Zaheer, and R. Salakhutdinov (2019) Revisiting LSTM networks for semi-supervised text classification via mixed objective function. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 6940–6948.
  • R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Ng, and C. Potts (2013) Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, Washington, USA, pp. 1631–1642.
  • J. Thorne, A. Vlachos, C. Christodoulopoulos, and A. Mittal (2018) FEVER: a large-scale dataset for fact extraction and VERification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 809–819.
  • A. Vlachos and S. Riedel (2014) Fact checking: task definition and dataset construction. In Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science, Baltimore, MD, USA, pp. 18–22.
  • W. Y. Wang (2017) “Liar, liar pants on fire”: a new benchmark dataset for fake news detection. arXiv preprint arXiv:1705.00648.
  • B. Xu, M. Mohtarami, and J. Glass (2019) Adversarial domain adaptation for stance detection. arXiv preprint arXiv:1902.02401.
  • Y. Zhang and B. Wallace (2015) A sensitivity analysis of (and practitioners’ guide to) convolutional neural networks for sentence classification. arXiv preprint arXiv:1510.03820.