Propaganda is the expression of an opinion or an action by individuals or groups deliberately designed to influence the opinions or the actions of other individuals or groups with reference to predetermined ends [4]. We are interested in propaganda from a journalistic point of view: how news management lacking neutrality shapes information by purposefully emphasizing positive or negative aspects [5, p. 1]. Propaganda uses psychological and rhetorical techniques that are intended to go unnoticed in order to achieve maximum effect. As a result, malicious propaganda news outlets have proven able to achieve large-scale impact. In particular, the power of disinformation and propaganda was arguably demonstrated during recent events, such as Brexit and the 2016 U.S. Presidential campaign (see https://www.justice.gov/file/1035477/).
With the rise of the Web, a combination of freedom of expression and ease of publishing content online has nurtured a number of news outlets that produce or distribute propagandistic content. Social media has further amplified the problem by making it possible to reach millions of users almost instantaneously. Thus, with the aim of helping fight the rise of propaganda, here we introduce proppy, a system to unmask articles with propagandistic content, which can (i) help investigative journalists study propaganda online and (ii) raise awareness that a news article, or a news outlet in general, might be trying to influence people’s mindset.
To the best of our knowledge, proppy (http://proppy.qcri.org) is the first publicly available real-world, real-time monitoring and propaganda detection system for online news, which aims at raising awareness about propaganda.
Architecture of the System
Figure 1 shows the architecture of proppy. We describe its four modules next.
1. Article retrieval.
Proppy regularly monitors a variety of news outlets and extracts the content of the latest news articles from their websites. We use GDELT (http://gdeltproject.org) to obtain links to news articles and the Newspaper3k Python library (https://newspaper.readthedocs.io) to extract their content. Proppy then analyzes the articles in batches every 24 hours, and performs the remaining three steps.
2. Event identification.
We group the articles into events by clustering them with DBSCAN [2], using distributed document representations (doc2vec) [6] for article representation, pre-trained on articles from Associated Press. We compute the pairwise distances for DBSCAN as 1 minus the cosine similarity. DBSCAN has two hyper-parameters: the minimum number of members in a cluster and the maximum distance between two members of the same cluster, ε. We set the former parameter to 2, thus discarding singletons. We estimate the parameter ε on the METER corpus [1].
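The clustering step can be illustrated with a minimal pure-Python sketch. It assumes the usual DBSCAN convention that a point counts towards its own neighbourhood, so that with a minimum cluster size of 2 the clusters reduce to the connected components of the neighbourhood graph. The function names, the toy vectors, and the distance threshold are hypothetical and are not the values used in proppy.

```python
import math
from collections import deque


def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def cluster_events(vectors, eps):
    """Label each vector with a cluster id; singletons get the noise label -1.

    Assumption: minimum cluster size 2, counting the point itself. Every
    point with at least one neighbour within eps is then a core point, and
    the clusters are the connected components of the neighbourhood graph.
    """
    n = len(vectors)
    neighbours = [
        [j for j in range(n)
         if j != i and cosine_distance(vectors[i], vectors[j]) <= eps]
        for i in range(n)
    ]
    labels = [-1] * n
    next_label = 0
    for start in range(n):
        if labels[start] != -1 or not neighbours[start]:
            continue  # already assigned, or an isolated singleton
        labels[start] = next_label
        queue = deque([start])
        while queue:  # breadth-first expansion of the current cluster
            i = queue.popleft()
            for j in neighbours[i]:
                if labels[j] == -1:
                    labels[j] = next_label
                    queue.append(j)
        next_label += 1
    return labels


# Toy 2-d "embeddings" (hypothetical); real document vectors are much larger.
toy_vectors = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9), (-1.0, -1.0)]
toy_labels = cluster_events(toy_vectors, eps=0.05)
```

Here the first two and the next two vectors form one event each, while the last vector has no neighbour within the threshold and is discarded as a singleton.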
3. Deduplication.
Next, we discard near-duplicates using a standard text re-use technique: comparison of word n-grams [7] after standard pre-processing (case-folding, tokenization, and stopword removal). We compute the similarity between all pairs of documents in a cluster using the Jaccard coefficient. Once again, we use the METER corpus to optimize the value of n and the threshold for considering two documents near-duplicates. At run time, we discard all near-duplicates but one.
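The following sketch shows one way such a deduplication step might look. The stopword list, the tokenizer, the greedy keep-first strategy, and the values n=2 and threshold=0.8 are illustrative assumptions, not the settings tuned on the METER corpus.

```python
import re

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "and", "in", "of", "on", "the", "to"}


def word_ngrams(text, n):
    """Case-fold, tokenize, drop stopwords, and return the set of word n-grams."""
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower())
              if t not in STOPWORDS]
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard coefficient: intersection size over union size (0.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def drop_near_duplicates(docs, n, threshold):
    """Keep a document only if it is not a near-duplicate of one already kept."""
    kept = []
    for doc in docs:
        grams = word_ngrams(doc, n)
        if all(jaccard(grams, kept_grams) < threshold for _, kept_grams in kept):
            kept.append((doc, grams))
    return [doc for doc, _ in kept]


# Hypothetical cluster of three articles; the first two differ only in stopwords.
cluster = [
    "The minister resigned on Monday after the vote.",
    "The minister resigned on Monday, after a vote.",
    "Markets rallied as the central bank held rates.",
]
unique_docs = drop_near_duplicates(cluster, n=2, threshold=0.8)
```

After pre-processing, the first two toy articles share all their word bigrams, so only the first of them is kept alongside the unrelated third article.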
4. Propaganda Index Computation.
We train a maximum entropy classifier with L2 regularization to discriminate propagandistic vs. non-propagandistic news articles. We use the confidence of the classifier, a value in the range [0, 1], to group articles into bins. We call this value the propaganda index, since it reflects the probability for an article to have a propagandistic intent. We use four families of features:
Word n-gram features. We use tf.idf-weighted word n-grams.
Lexicon features. We try to capture the typical vocabulary of propaganda by considering representations reflecting the frequency of specific words from a number of lexicons coming from the Wiktionary, Linguistic Inquiry and Word Count (LIWC), Wilson’s subjectives, Hyland hedges, and Hooper’s assertives. Rashkin et al. [9] showed that the words in some of these lexicons appear more frequently in propagandistic than in trustworthy articles.
Style, vocabulary richness, and readability. Our writing style representation consists of tf.idf-weighted character 3-grams. This representation captures different style markers, such as prefixes, suffixes, and punctuation marks. We further consider the type-token ratio (TTR) as well as the number of tokens appearing exactly once or twice in the document: hapax legomena and dis legomena. Moreover, we combine types, tokens, and hapax legomena to compute Honoré’s R and Yule’s characteristic K. We also use three readability features originally designed to estimate the level of complexity of a text: Flesch–Kincaid grade level, Flesch reading ease, and the Gunning fog index.
NELA. We also integrate the NEws LAndscape (NELA) features [3]: 130 content-based features collected from the existing literature that measure different aspects of a news article (e.g., sentiment, bias, morality, complexity).
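To make the vocabulary richness features listed above concrete, the sketch below computes them for a single text. It assumes the commonly cited definitions R = 100 log N / (1 - V1/V) for Honoré's R and K = 10^4 (sum_i i^2 V_i - N) / N^2 for Yule's K, where N is the number of tokens, V the number of types, and V_i the number of types occurring exactly i times. The text above does not spell out the exact variants or the tokenizer used in proppy, so these are assumptions.

```python
import math
import re
from collections import Counter


def vocabulary_richness(text):
    """Return vocabulary richness measures for a non-empty text."""
    tokens = re.findall(r"[a-z]+", text.lower())  # simplistic tokenizer (assumption)
    if not tokens:
        raise ValueError("text contains no tokens")
    n_tokens = len(tokens)
    type_counts = Counter(tokens)
    n_types = len(type_counts)
    # spectrum[i] is the number of types that occur exactly i times.
    spectrum = Counter(type_counts.values())
    hapax, dis = spectrum[1], spectrum[2]
    # Honore's R is undefined when every type is a hapax; report infinity then.
    if hapax < n_types:
        honore_r = 100 * math.log(n_tokens) / (1 - hapax / n_types)
    else:
        honore_r = float("inf")
    yule_k = 1e4 * (sum(i * i * v for i, v in spectrum.items()) - n_tokens) / n_tokens ** 2
    return {
        "tokens": n_tokens,
        "types": n_types,
        "ttr": n_types / n_tokens,
        "hapax_legomena": hapax,
        "dis_legomena": dis,
        "honore_r": honore_r,
        "yule_k": yule_k,
    }


features = vocabulary_richness("the cat sat on the mat and the dog sat too")
```

For this toy sentence there are 11 tokens and 8 types, of which 6 occur once and 1 occurs twice.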
We evaluated proppy on data from Rashkin et al. [9] in a binary setup of distinguishing propaganda vs. non-propaganda. Their n-gram system yielded an F1 of 88.21, whereas proppy achieved 96.72 (+8.51), a statistically significant improvement (measured with the McNemar test).
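For readers unfamiliar with the McNemar test, the sketch below computes its chi-squared variant with continuity correction from the two discordant counts of a paired comparison. The text does not state which variant was used in the evaluation, and the counts in the example are hypothetical toy values, not results from the experiments.

```python
import math


def mcnemar_test(b, c):
    """McNemar's chi-squared test with continuity correction.

    b: items that system A classifies correctly and system B incorrectly.
    c: items that system B classifies correctly and system A incorrectly.
    Returns (statistic, p_value) for a chi-squared distribution with 1 d.o.f.
    """
    if b + c == 0:
        return 0.0, 1.0  # the two systems never disagree
    statistic = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of the chi-squared distribution with one degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical discordant counts, for illustration only.
statistic, p_value = mcnemar_test(b=10, c=30)
```

Only the items on which the two systems disagree enter the statistic; items that both systems get right or both get wrong carry no information about which system is better.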
Figure 2 shows a screenshot of proppy. The architecture follows a push publishing model: it automatically updates the material presented to the user, who needs to take no action other than exploring the available events. The left panel shows events from the last 24 hours. When the user clicks on an event, its articles are shown in the right panel, organized into five bins according to their propaganda index.
The articles in bins 1 and 2 are considered nearly non-propagandistic, whereas those in the two rightmost bins are considered propagandistic. In this way, the user can easily observe how different media cover related events along the propaganda dimension, which may guide her further exploration and judgment.
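The mapping from propaganda index to bin could be sketched as follows. It assumes that the index lies in [0, 1] and that the five bins have equal width; the text does not specify the actual bin boundaries, so both the function name and the boundaries are hypothetical.

```python
def propaganda_bin(index, n_bins=5):
    """Map a propaganda index in [0, 1] to a bin number from 1 to n_bins."""
    if not 0.0 <= index <= 1.0:
        raise ValueError("the propaganda index must lie in [0, 1]")
    # Equal-width bins (assumption); an index of exactly 1.0 goes to the last bin.
    return min(int(index * n_bins), n_bins - 1) + 1


# Hypothetical classifier confidences for the articles of one event.
example_bins = [propaganda_bin(x) for x in (0.05, 0.35, 0.5, 0.79, 1.0)]
```

Under these assumptions, an article with index 0.05 lands in bin 1 (least propagandistic) and one with index 1.0 in bin 5.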
Conclusion and Future Work
We have presented proppy, a publicly available real-world, real-time propaganda detection system for online news, which aims at raising awareness with the objective of limiting the impact of propaganda and helping fight disinformation. In future work, we plan to add support for multiple languages, and a pull mode where users will be able to submit any article and get its propaganda index.
References
[1] (2002) Building and annotating a corpus for the study of journalistic text reuse. In LREC, pp. 1678–1691.
[2] (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In Proc. of KDD, pp. 226–231.
[3] (2018) Sampling the news producers: a large news and feature data set for the study of the complex media landscape. In Proc. of ICWSM, pp. 518–527.
[4] (1938) How to Detect Propaganda. In Propaganda Analysis. Vol. I of the Publications of the Institute for Propaganda Analysis.
[5] (2012) Propaganda and Persuasion. 5th edition, SAGE, Los Angeles, CA.
[6] (2014) Distributed Representations of Sentences and Documents. In Proc. of ICML, pp. 1188–1196.
[7] (2004) A Theoretical Basis to the Automated Detection of Copying Between Texts, and its Practical Implementation in the Ferret Plagiarism and Collusion Detector. See [8].
[8] (2004) Plagiarism: Prevention, Practice and Policies Conference. Plagiarism Advice.
[9] (2017) Truth of varying shades: analyzing language in fake news and political fact-checking. In Proc. of EMNLP, pp. 2931–2937.