
Team QCRI-MIT at SemEval-2019 Task 4: Propaganda Analysis Meets Hyperpartisan News Detection

In this paper, we describe our submission to SemEval-2019 Task 4 on Hyperpartisan News Detection. Our system relies on a variety of engineered features originally used to detect propaganda. This is based on the assumption that biased messages are propagandistic in the sense that they promote a particular political cause or viewpoint. We trained a logistic regression model with features ranging from simple bag-of-words to vocabulary richness and text readability features. Our system achieved 72.9% accuracy on the test set that is annotated manually and 60.8% on the test set that is annotated via distant supervision. Additional experiments showed that significant performance improvements can be achieved with better feature pre-processing.





1 Introduction

The rise of social media has enabled people to easily share information with a large audience without regulations or quality control. This has allowed malicious users to spread disinformation and misinformation (a.k.a. “fake news”) at an unprecedented rate. Fake news is typically characterized as being hyperpartisan (one-sided), emotional and riddled with lies Potthast et al. (2017a). The SemEval-2019 Task 4 on Hyperpartisan News Detection Kiesel et al. (2019) focused on the challenge of automatically identifying whether a text is hyperpartisan or not. While hyperpartisanship is defined as “exhibiting one or more of blind, prejudiced, or unreasoning allegiance to one party, faction, cause, or person”, we model this task as a binary document classification problem.

Scholars have argued that all biased messages can be considered propagandistic, regardless of whether the bias was intentional or not (Ellul, 1965, p. XV). As a result, we approached the task departing from an existing model for propaganda identification Barrón-Cedeño et al. (2019). Our hypothesis is that propaganda is inherent in hyperpartisanship: the two problems are two sides of the same coin, and solving one of them would help solve the other. Our system consists of a logistic regression model trained with a variety of engineered features that range from word and character TFiDF n-grams and lexicon-based features to more sophisticated features that represent different aspects of the article's text, such as the richness of its vocabulary and the complexity of its language.

Our official submission achieved an accuracy of 72.9% (while the winning system achieved 82.2%). This was achieved using word and character n-grams. Additional, post-submission experiments show that further performance improvements can be achieved by careful pre-processing of the engineered features.

2 Related Work

The analysis of bias and disinformation has attracted significant attention, especially after the 2016 US presidential election Brill (2001); Finberg et al. (2002); Castillo et al. (2011); Baly et al. (2018a); Kulkarni et al. (2018); Mihaylov et al. (2018). Most of the proposed approaches have focused on predicting credibility, bias or stance. Popat et al. (2017) assessed the credibility of claims based on the occurrence of assertive and factive verbs, hedges, implicative words, report verbs and discourse markers, which were extracted using manually crafted gazetteers (referred to as stylistic features).

Stance detection was considered as an intermediate step for detecting fake claims, where the veracity of a claim is checked by aggregating the stances of retrieved relevant articles Baly et al. (2018b). Several stance detection models have been proposed as part of the Fake News Challenge (FNC), including deep convolutional neural networks Baird et al. (2017), multi-layer perceptrons Hanselowski et al. (2018), and end-to-end memory networks Mohtarami et al. (2018).

The stylometric analysis model of Koppel et al. (2007) was used by Potthast et al. (2017b) when looking for hyperpartisanship. They used articles from nine news sources whose factuality has been manually verified by professional journalists. Writing style and complexity were also considered by Horne and Adalı (2017) to differentiate real news from fake news and satire. They used features such as the number of occurrences of different part-of-speech tags, swearing and slang words, stop words, punctuation, and negation as stylistic markers. They also used a number of readability measures. Rashkin et al. (2017) focused on a multi-class setting: real news, satire, hoax, or propaganda. Their supervised model relied on word n-grams.

Similarly to Potthast et al. (2017b), we believe that there is an inherent style in propaganda, regardless of the source publishing it. Many stylistic features have been proposed for authorship identification, i.e., the task of predicting whether a piece of text has been written by a particular author. One of the most successful representations for such a task is character-level n-grams Stamatatos (2009), and they turn out to represent some of our most important stylistic features.

More details about research on fact-checking and the spread of fake news online can be found in Lazer et al. (2018); Vosoughi et al. (2018); Thorne and Vlachos (2018).

3 System Description

We developed our system for detecting hyperpartisanship in news articles by training a logistic regression classifier using a set of engineered features that included the following: character and word n-grams, lexicon-based indicators, and readability and vocabulary richness measures. Below, we describe these features in detail.

Character n-grams.

Stamatatos (2009) argued that, for tasks where the topic is irrelevant, character-level representations are more sensitive than token-level ones. We hypothesize that this applies to hyperpartisan news detection, since articles on both sides of the political spectrum may be discussing the same topics. Stamatatos (2009) found that "the most frequent character n-grams are the most important features for stylistic purposes". These features capture different style markers, such as prefixes, suffixes and punctuation marks. Following the analysis in Barrón-Cedeño et al. (2019), we include TFiDF-weighted character 3-grams in our feature set.
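As an illustrative sketch (not the authors' code), TFiDF-weighted character 3-grams can be extracted with scikit-learn; the example documents and the `char_wb` analyzer choice are our own assumptions:

```python
# Sketch: TFiDF-weighted character 3-grams, as described in the text.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "The senator's plan is a disaster for hard-working families.",
    "The committee reviewed the proposal and scheduled a vote.",
]

# analyzer="char_wb" keeps n-grams within word boundaries; plain "char"
# would also span spaces -- either is a plausible setting here.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 3))
X = vectorizer.fit_transform(docs)

print(X.shape)  # (2, number of distinct character 3-grams)
```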

Word n-grams.

Bag-of-words (BoW) features are widely used for text classification. We extracted the most frequent n-grams, and we represented them using their TFiDF scores. We ignored n-grams that appeared in more than 90% of the documents, most of which contained stopwords and were irrelevant with respect to hyperpartisanship. Furthermore, we incorporated Naive Bayes by weighing the n-grams based on their importance for classification, as proposed by Wang and Manning (2012). We define $f^{(i)} \in \mathbb{R}^{|V|}$ as a row vector in the TFiDF feature matrix, representing the $i$-th training sample with a target label $y^{(i)} \in \{-1, +1\}$, where $|V|$ is the vocabulary size. We also define the count vectors $\mathbf{p} = \alpha + \sum_{i:\, y^{(i)}=+1} f^{(i)}$ and $\mathbf{q} = \alpha + \sum_{i:\, y^{(i)}=-1} f^{(i)}$, and we set the smoothing parameter $\alpha$ to 1. Finally, we calculate the log-count ratio vector:

$$\mathbf{r} = \log\!\left(\frac{\mathbf{p}/\|\mathbf{p}\|_1}{\mathbf{q}/\|\mathbf{q}\|_1}\right) \quad (1)$$

which is used to scale the TFiDF features to create the NB-TFiDF features as follows:

$$\hat{f}^{(i)} = \mathbf{r} \odot f^{(i)} \quad (2)$$
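The NB-TFiDF weighting of Wang and Manning (2012) can be sketched in a few lines of NumPy; the function name and toy data below are ours, not the paper's:

```python
# Sketch: NB-TFiDF weighting -- scale each TFiDF feature by the log-ratio
# of its smoothed total weight in positive vs. negative training documents.
import numpy as np

def nb_tfidf(F, y, alpha=1.0):
    """F: (n_samples, |V|) TFiDF matrix; y: labels in {-1, +1}."""
    p = alpha + F[y == 1].sum(axis=0)    # smoothed counts, positive class
    q = alpha + F[y == -1].sum(axis=0)   # smoothed counts, negative class
    r = np.log((p / p.sum()) / (q / q.sum()))  # log-count ratio vector
    return F * r  # element-wise scaling of every row

F = np.array([[0.5, 0.0, 0.2],
              [0.0, 0.7, 0.1],
              [0.4, 0.1, 0.0]])
y = np.array([1, -1, 1])
print(nb_tfidf(F, y).shape)  # (3, 3)
```

Features that occur mostly in one class get a large-magnitude ratio, so discriminative n-grams are amplified while class-neutral ones are damped.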
Bias Analysis.

We analyze the bias in the language used in the documents by (i) creating bias lexicons that contain left and right bias cues, and (ii) using these lexicons to compute two scores for each document, indicating the intensity of bias towards each ideology. To generate the list of cues that signal biased language, we use Semantic Orientation (SO) Turney (2002) to identify the words that are strongly associated with each of the left and right documents in the training dataset. Those SO values can be either positive or negative, indicating association with right or left biases, respectively. Then, we select words whose absolute SO value exceeds a threshold to create two bias lexicons: one of left-bias cues and one of right-bias cues. Finally, we use these lexicons to compute two bias scores per document according to Equation (3), where for each document $d$, the frequency of cues in the lexicon $L$ that are present in $d$ is normalized by the total number of words in $d$:

$$\text{bias}_L(d) = \frac{\sum_{c \in L} \text{freq}(c, d)}{|d|} \quad (3)$$
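The per-document bias score of Equation (3) amounts to a normalized lexicon-cue count; a minimal sketch (the function name and toy lexicons are ours):

```python
# Sketch: bias score = occurrences of lexicon cues in a document,
# normalized by the document length in words.
def bias_score(tokens, lexicon):
    lexicon = set(lexicon)
    return sum(1 for t in tokens if t in lexicon) / len(tokens)

doc = "the radical agenda will destroy our great nation".split()
left_lex = {"progressive", "equity"}
right_lex = {"radical", "agenda", "great"}

print(bias_score(doc, right_lex))  # 3 cues / 8 words = 0.375
print(bias_score(doc, left_lex))   # 0.0
```

Each document thus contributes two features: its score against the left lexicon and its score against the right lexicon.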
Lexicon-based Features.

Rashkin et al. (2017) studied the occurrence of specific types of words in different kinds of articles, and showed that words from certain lexicons (e.g., negation and swear words) appear more frequently in propaganda, satire, and hoax articles than in trustworthy articles. We capture this by extracting features that reflect the frequency of words from particular lexicons. We use 18 lexicons from the Wiktionary, Linguistic Inquiry and Word Count (LIWC) Pennebaker et al. (2001), Wilson’s subjectives Wilson et al. (2005), Hyland’s hedges Hyland (2015), and Hooper’s assertives Hooper (1975). For each lexicon, we count the total number of words in the article that appear in the lexicon. This resulted in 18 features, one for each lexicon.

Vocabulary Richness.

Potthast et al. (2017b) showed that hyperpartisan outlets tend to use a writing style that is different from mainstream outlets. Different topic-independent features have been proposed to characterize the vocabulary richness, style and complexity of a text. For this task, we used the following vocabulary richness features: (i) type–token ratio (TTR): the ratio of types to tokens in a text; (ii) hapax legomena: the number of types appearing once in a text; (iii) hapax dislegomena: the number of types appearing twice in a text; (iv) Honoré's R: a combination of types, tokens and hapax legomena Honore (1979):

$$R = \frac{100 \log N}{1 - V_1/V} \quad (4)$$

and (v) Yule's characteristic K: the chance of a word occurring in a text, assuming word occurrences follow a Poisson distribution Yule (1944):

$$K = 10^4 \cdot \frac{\sum_i i^2 V_i - N}{N^2} \quad (5)$$

where $N$ is the number of tokens in a text (all words, including repetitions), $V$ is the number of types (distinct words), $V_1$ is the number of hapax legomena, $i$ ranges over word frequencies (1 being the least frequent), and $V_i$ is the number of types occurring with frequency $i$.
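The five vocabulary richness measures can be computed from token counts alone; a self-contained sketch (our own implementation of the cited formulas):

```python
# Sketch: TTR, hapax legomena/dislegomena, Honore's R, and Yule's K
# from a tokenized text, following the formulas in the text.
import math
from collections import Counter

def richness(tokens):
    counts = Counter(tokens)
    N = len(tokens)                                  # tokens (with repeats)
    V = len(counts)                                  # types (distinct words)
    V1 = sum(1 for c in counts.values() if c == 1)   # hapax legomena
    V2 = sum(1 for c in counts.values() if c == 2)   # hapax dislegomena
    ttr = V / N
    honore_r = 100 * math.log(N) / (1 - V1 / V)      # assumes V1 < V
    # Yule's K: V_i = number of types occurring exactly i times
    freq_of_freq = Counter(counts.values())
    yule_k = 1e4 * (sum(i * i * Vi for i, Vi in freq_of_freq.items()) - N) / (N * N)
    return ttr, V1, V2, honore_r, yule_k

tokens = "the cat sat on the mat and the dog sat too".split()
print(richness(tokens))
```

Note that Honoré's R is undefined when every type is a hapax ($V_1 = V$); a production implementation would need to guard that case.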


We also used the following readability features that were originally designed to estimate the level of text complexity: 1) Flesch–Kincaid grade level: represents the US grade level necessary to understand a text Kincaid et al. (1975); 2) Flesch reading ease: a score measuring how difficult a text is to read Kincaid et al. (1975); and 3) Gunning fog index: estimates the years of formal education necessary to understand a text Gunning (1968).
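The three readability measures use standard published formulas; a sketch using their textbook definitions (counting syllables and "complex" 3+-syllable words in raw text needs a heuristic or dictionary, so counts are taken as inputs here):

```python
# Sketch: standard formulas for the three readability measures.
def flesch_kincaid_grade(words, sentences, syllables):
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

def flesch_reading_ease(words, sentences, syllables):
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def gunning_fog(words, sentences, complex_words):
    return 0.4 * ((words / sentences) + 100 * (complex_words / words))

# Example: 100 words, 5 sentences, 140 syllables, 10 complex words
print(flesch_kincaid_grade(100, 5, 140))   # ~8.73 (about 9th grade)
print(flesch_reading_ease(100, 5, 140))    # ~68.1 (fairly easy to read)
print(gunning_fog(100, 5, 10))             # 12.0 (high-school senior)
```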

                         Labeled by-Article         Labeled by-Publisher
  Features               Acc.  Prec. Rec.  F1       Acc.  Prec. Rec.  F1
1 BoW (TFiDF)            67.8  53.8  89.1  67.1     56.7  55.1  72.5  62.6
2 BoW (NB-TFiDF)         69.6  56.1  80.7  66.2     57.1  56.4  61.9  59.0
3   + Char trigrams      74.0  62.5  73.5  67.6     54.8  54.3  60.8  57.4
4   + Bias               75.2  67.7  62.6  65.1     54.5  55.0  50.4  52.6
5   + Lexical            75.2  67.0  64.7  65.8     52.3  52.3  51.5  51.9
6   + Vocab. Richness    75.8  67.1  67.6  67.4     50.9  50.8  52.5  51.7
7   + Readability        76.0  66.4  70.6  68.4     51.6  51.5  53.9  52.7
Table 1: An incremental analysis showing the performance of different feature combinations, evaluated on the validation sets labeled by article and by publisher.

4 Experiments and Results

4.1 Dataset

We trained our models on the Hyperpartisan News Dataset from SemEval-2019 Task 4 Kiesel et al. (2019), which is split by the task organizers into: 1) Labeled by-Publisher: contains 750K articles labeled via distant supervision, i.e., using the labels of their publishers, which were identified by BuzzFeed journalists or by the Media Bias/Fact Check project. Labels are evenly distributed across the "hyperpartisan" and "not-hyperpartisan" classes. This set is further split into 600K articles for training and 150K for validation. 2) Labeled by-Article: contains 645 articles labeled through crowd-sourcing (37% are hyperpartisan and 63% are not). Only articles with a consensus among annotators were included.

4.2 Experimental Setting

We train a logistic regression (LR) model with a Stochastic Average Gradient solver Schmidt et al. (2017) due to the large size of the dataset. In order to reduce overfitting, we use regularization. Feature normalization was needed since the different features represent different aspects of the text and hence have very different scales. We first tried to normalize each feature set by subtracting the mean and scaling it to unit variance. However, we found that multiplying the features by constant scaling factors resulted in better performance. The scaling factor for each family of features was a hyperparameter that was tuned during the fine-tuning experiments.
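A minimal sketch of this setup (our illustration with random stand-in data, not the authors' code): scikit-learn's logistic regression with the SAG solver, with each feature family multiplied by a constant scaling factor before training.

```python
# Sketch: LR with the SAG solver, with per-family constant scaling
# applied to the feature matrix before training.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_ngrams = rng.random((200, 50))   # stand-in for n-gram features
X_style = rng.random((200, 5))     # stand-in for stylistic features
y = rng.integers(0, 2, 200)

# Per-family scaling factors; in the paper these were tuned as
# hyperparameters (the values here are arbitrary placeholders).
NGRAM_SCALE, STYLE_SCALE = 1.0, 0.1
X = np.hstack([X_ngrams * NGRAM_SCALE, X_style * STYLE_SCALE])

clf = LogisticRegression(solver="sag", max_iter=1000)
clf.fit(X, y)
print(clf.score(X, y))
```

Because LR's gradient treats all dimensions uniformly, down-scaling a noisy feature family effectively shrinks its influence on the decision boundary, which is why constant scaling can beat standardization here.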

We trained the classifier using the 600K training examples labeled by-Publisher, and then used the remaining 150K examples for evaluation. We fine-tuned the hyperparameters on the 645 by-Article examples. The hyperparameters include the number of most-frequent word n-grams to use and the scaling parameters of the different feature families other than the n-grams. The best fine-tuning results suggested using the 200K most frequent word n-grams. We assessed the different feature sets, described in Section 3, by incrementally adding each set, one at a time, to the mix of all features.

4.3 Results

Table 1 illustrates the results obtained on both the by-Article set (which we used to fine-tune the model’s hyperparameters) and the by-Publisher set (which we used for evaluation). Our results suggest that scaling the TFiDF values through Naive Bayes is better than using raw TFiDF scores. Hence, these features were used for all subsequent experiments. It can also be observed that adding each group of features introduces a consistent improvement in accuracy on the by-Article data. However, we observed an opposite behaviour on the by-Publisher data. We believe this is due to the significant amount of noisy labels introduced by the distant supervision labeling strategy. Therefore, we based our decisions on the results obtained on the by-Article data since its labels are more accurate.

The normalization strategy, i.e., scaling the features using calibrated scaling parameters, introduced significant performance improvements. Unfortunately, we were not able to perform these calibration experiments by the competition’s deadline, hence we submitted the system that was available at that time, which is based on the BoW (NB-TFiDF) and character 3-gram features, as shown in row 3 in Table 1. Our system achieved a 72.9% accuracy on the test by-Article data, ranking 20th/42. It also achieved 60.8% accuracy on the test by-Publisher data, ranking 15th/42. All subsequent, and superior, results (rows 4–7) were obtained after the deadline.

5 Conclusion

In this paper, we present our submission to SemEval-2019 Task 4 on Hyperpartisan News Detection. We trained a logistic regression model with a feature set that included word and character n-grams, represented with TFiDF. This system achieved 72.9% and 60.8% accuracy on the test data labeled by-Article and by-Publisher, respectively.

We also evaluated additional features that represent different aspects of the article’s text such as its vocabulary richness, the kind of language it uses according to different lexicons, and its level of complexity. Initial experiments showed that these features hurt the model. However, with proper pre-processing and scaling we were able to achieve significant performance improvements of up to 2% in absolute accuracy. These results were obtained after the competition’s deadline, hence were not considered as part of our submission.

6 Acknowledgment

This research was carried out in collaboration between the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL) and the HBKU Qatar Computing Research Institute (QCRI).


  • Baird et al. (2017) Sean Baird, Doug Sibley, and Yuxi Pan. 2017. Talos targets disinformation with fake news challenge victory.
  • Baly et al. (2018a) Ramy Baly, Georgi Karadzhov, Dimitar Alexandrov, James Glass, and Preslav Nakov. 2018a. Predicting factuality of reporting and bias of news media sources. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP ’18, pages 3528–3539.
  • Baly et al. (2018b) Ramy Baly, Mitra Mohtarami, James Glass, Lluís Màrquez, Alessandro Moschitti, and Preslav Nakov. 2018b. Integrating stance detection and fact checking in a unified corpus. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT ’18, pages 21–27, New Orleans, LA, USA.
  • Barrón-Cedeño et al. (2019) Alberto Barrón-Cedeño, Giovanni Da San Martino, Israa Jaradat, and Preslav Nakov. 2019. Proppy: Organizing news coverage on the basis of their propagandistic content. Information Processing and Management.
  • Barrón-Cedeño et al. (2019) Alberto Barrón-Cedeño, Giovanni Da San Martino, Israa Jaradat, and Preslav Nakov. 2019. Proppy: A system to unmask propaganda in online news. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence (AAAI’19), AAAI’19, Honolulu, HI, USA.
  • Brill (2001) Ann M Brill. 2001. Online journalists embrace new marketing function. Newspaper Research Journal, 22(2):28.
  • Castillo et al. (2011) Carlos Castillo, Marcelo Mendoza, and Barbara Poblete. 2011. Information credibility on Twitter. In Proceedings of the 20th International Conference on World Wide Web, WWW ’11, pages 675–684, Hyderabad, India.
  • Ellul (1965) Jacques Ellul. 1965. Propaganda: The Formation of Men’s Attitudes. Vintage Books, United States.
  • Finberg et al. (2002) Howard Finberg, Martha L Stone, and Diane Lynch. 2002. Digital journalism credibility study. Online News Association. Retrieved November, 3:2003.
  • Gunning (1968) Robert Gunning. 1968. The Technique of Clear Writing. McGraw-Hill.
  • Hanselowski et al. (2018) Andreas Hanselowski, Avinesh PVS, Benjamin Schiller, Felix Caspelherr, Debanjan Chaudhuri, Christian M Meyer, and Iryna Gurevych. 2018. A retrospective analysis of the fake news challenge stance detection task. arXiv preprint arXiv:1806.05180.
  • Honore (1979) Anthony Honore. 1979. Some Simple Measures of Richness of Vocabulary. Association for Literary and Linguistic Computing Bulletin, 7(2):172–177.
  • Hooper (1975) Joan B. Hooper. 1975. On assertive predicates. In J. Kimball, editor, Syntax and Semantics, volume 4, page 91–124. Academic Press, New York.
  • Horne and Adalı (2017) Benjamin D Horne and Sibel Adalı. 2017. This Just In: Fake News Packs a Lot in Title, Uses Simpler, Repetitive Content in Text Body, More Similar to Satire than Real News. In Proceedings of the International Workshop on News and Public Opinion at ICWSM, Montreal, Canada.
  • Hyland (2015) Ken Hyland. 2015. Metadiscourse. In The International Encyclopedia of Language and Social Interaction. John Wiley & Sons.
  • Kiesel et al. (2019) Johannes Kiesel, Maria Mestre, Rishabh Shukla, Emmanuel Vincent, David Corney, Payam Adineh, Benno Stein, and Martin Potthast. 2019. Semeval-2019 task 4: Hyperpartisan news detection. In Proceedings of The 13th International Workshop on Semantic Evaluation. Association for Computational Linguistics.
  • Kincaid et al. (1975) J Peter Kincaid, Robert P Fishburne Jr, Richard L Rogers, and Brad S Chissom. 1975. Derivation of new readability formulas (automated readability index, fog count and flesch reading ease formula) for navy enlisted personnel. Memphis TN Naval Air Station, Research B.
  • Koppel et al. (2007) Moshe Koppel, Jonathan Schler, and Elisheva Bonchek-Dokow. 2007. Measuring differentiability: Unmasking pseudonymous authors. Journal of Machine Learning Research, 8:1261–1276.
  • Kulkarni et al. (2018) Vivek Kulkarni, Junting Ye, Steve Skiena, and William Yang Wang. 2018. Multi-view models for political ideology detection of news articles. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP ’18, Brussels, Belgium.
  • Lazer et al. (2018) David MJ Lazer, Matthew A Baum, Yochai Benkler, Adam J Berinsky, Kelly M Greenhill, Filippo Menczer, Miriam J Metzger, Brendan Nyhan, Gordon Pennycook, David Rothschild, et al. 2018. The science of fake news. Science, 359(6380):1094–1096.
  • Mihaylov et al. (2018) Todor Mihaylov, Tsvetomila Mihaylova, Preslav Nakov, Lluís Màrquez, Georgi Georgiev, and Ivan Koychev. 2018. The dark side of news community forums: Opinion manipulation trolls. Internet Research, 28(5):1292–1312.
  • Mohtarami et al. (2018) Mitra Mohtarami, Ramy Baly, James Glass, Preslav Nakov, Lluís Màrquez, and Alessandro Moschitti. 2018. Automatic stance detection using end-to-end memory networks. In Proceedings of the 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT ’18, pages 767–776, New Orleans, LA, USA.
  • Pennebaker et al. (2001) James W Pennebaker, Martha E Francis, and Roger J Booth. 2001. Linguistic Inquiry and Word Count: LIWC 2001. LIWC Operator’s Manual.
  • Popat et al. (2017) Kashyap Popat, Subhabrata Mukherjee, Jannik Strötgen, and Gerhard Weikum. 2017. Where the truth lies: Explaining the credibility of emerging claims on the Web and social media. In Proceedings of the 26th International Conference on World Wide Web Companion, WWW ’17 Companion, pages 1003–1012, Perth, Australia.
  • Potthast et al. (2017a) Martin Potthast, Johannes Kiesel, Kevin Reinartz, Janek Bevendorff, and Benno Stein. 2017a. A stylometric inquiry into hyperpartisan and fake news. arXiv preprint arXiv:1702.05638.
  • Potthast et al. (2017b) Martin Potthast, Johannes Kiesel, Kevin Reinartz, Janek Bevendorff, and Benno Stein. 2017b. A stylometric inquiry into hyperpartisan and fake news. CoRR, abs/1702.05638.
  • Rashkin et al. (2017) Hannah Rashkin, Eunsol Choi, Jin Yea Jang, Svitlana Volkova, and Yejin Choi. 2017. Truth of varying shades: Analyzing language in fake news and political fact-checking. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP ’17, pages 2931–2937, Copenhagen, Denmark.
  • Schmidt et al. (2017) Mark Schmidt, Nicolas Le Roux, and Francis Bach. 2017. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1-2):83–112.
  • Stamatatos (2009) Efstathios Stamatatos. 2009. A Survey of Modern Authorship Attribution Methods. Journal of the American Society for Information Science and Technology, 60(3):538–556.
  • Thorne and Vlachos (2018) James Thorne and Andreas Vlachos. 2018. Automated fact checking: Task formulations, methods and future directions. In Proceedings of the 27th International Conference on Computational Linguistics, COLING ’18, pages 3346–3359, Santa Fe, NM, USA.
  • Turney (2002) Peter D. Turney. 2002. Thumbs up or thumbs down?: Semantic orientation applied to unsupervised classification of reviews. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL ’02, pages 417–424, Stroudsburg, PA, USA. Association for Computational Linguistics.
  • Vosoughi et al. (2018) Soroush Vosoughi, Deb Roy, and Sinan Aral. 2018. The spread of true and false news online. Science, 359(6380):1146–1151.
  • Wang and Manning (2012) Sida Wang and Christopher D Manning. 2012. Baselines and bigrams: Simple, good sentiment and topic classification. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers-Volume 2, pages 90–94. Association for Computational Linguistics.
  • Wilson et al. (2005) Theresa Wilson, Janyce Wiebe, and Paul Hoffmann. 2005. Recognizing contextual polarity in phrase-level sentiment analysis. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 347–354. Association for Computational Linguistics.
  • Yule (1944) George Udny Yule. 1944. The Statistical Study of Literary Vocabulary. Cambridge University Press.