Propaganda is biased information that deliberately promotes a particular ideology or political orientation. It aims to influence the public's mentality and emotions, counting on people to respond in line with their personal beliefs. News propaganda is a sub-type of propaganda that presents lies, half-truths, and rumours in the guise of credible news. The emphasis on this manipulation distinguishes propaganda and its various classes from each other and from free expression. News propaganda can lead to the mass circulation of misleading information and shared agendas, to conflicts on religious or ethnic grounds, and even to violence and terrorism. Due to the massive size, high velocity, rich online user interaction, and diversity of such content, the manual identification of propaganda techniques is impractical. Hence, the development of a generalized system for propaganda detection in news articles is a vital task, not only for security analysts but also for society.
In SemEval-2020 Task 11, DaSanMartinoSemeval20task11 propose a corpus of news articles for propaganda detection. Each article is annotated at the fragment level with propaganda spans, each belonging to a propaganda technique. The task of propaganda detection is divided into two sub-tasks: span identification (sequence labelling) and technique classification (sequence classification). The span identification sub-task aims to detect the propagandist spans of text in the news articles, whereas the technique classification sub-task aims to classify those spans into various propaganda techniques. The work presented in this paper provides independent approaches for both sub-tasks. Most recent approaches to propaganda detection use pre-trained transformer models. In this work, we propose a multi-granularity knowledge sharing model built on top of the embeddings extracted from these transformer models for the span identification sub-task. Further, we show the effectiveness of ensembling a linguistic-feature-based machine learning classifier with these transformer models in covering the minority classes for the technique classification sub-task.
Before fake news and propaganda detection rose to prominence in NLP, mihalcea2009lie introduced the automatic detection of lying in text. The authors proposed three datasets, on abortion, the death penalty, and best friends, and approached the problem as multi-class classification. The datasets were annotated through the Amazon Mechanical Turk service. They used support vector machine and naive Bayes classifiers in their experiments. ciampaglia2015computational, on the other hand, employed knowledge graphs for computational fact-checking, utilizing data gathered from Wikipedia. Detection of propaganda in news articles was advocated by rashkin2017truth and barron2019proppy. The former proposed the use of Long Short-Term Memory (LSTM) models and machine learning methods for deception detection and for classifying news into different types: trusted, satire, hoax, and propaganda. The latter presented proppy (https://proppy.qcri.org/), the first publicly available real-world, real-time propaganda detection system for online news. Fake news and propaganda have been getting more attention recently [5, 13, 14]. Constrained resources and the lack of annotated corpora are considered an immense challenge for researchers in this field. da-san-martino-etal-2019-fine propose a new dataset of text articles annotated at the fragment level with one of the given propaganda techniques. To classify them, they employ BERT models for this high-granularity task. Given the relatively low count of some of these techniques, they merged similar underrepresented techniques into superclasses such as "Bandwagon, Reductio_ad_hitlerum", "Exaggeration, Minimisation", "Name_Calling, labelling", and "Whataboutism, Straw_Men, Red_Herring", and eliminated "Obfuscation, Intentional Vagueness, Confusion" to compile the final set of classes.
In this section, we explain our approach and the road map to our conclusions. A detailed explanation of our implementations is provided in Section 4. Inspired by da-san-martino-etal-2019-fine, we consider three granularities of propaganda detection: document level, sentence/fragment level, and token level. The Span Identification sub-task focuses on high-granularity propaganda detection through a token-level sequence labelling task, whereas the Technique Classification sub-task focuses on sentence/fragment-level classification. Even though each sub-task is restricted to a single granularity, context from the other granularities can be used by defining auxiliary objectives for each granularity.
3.1 Sub-Task 1: Span Identification
We approach the span identification sub-task as token-level binary classification. We label every token as belonging or not belonging to a propaganda span. A continuous sequence of propaganda tokens in the text combines to form a propaganda span. We use pre-trained word embeddings to represent the tokens. Because context from lower granularities is important during classification, token-level information alone is not sufficient for this task. To incorporate the lower-granularity context into token classification, we propose a multi-granularity knowledge-sharing model. We use two lower-granularity contexts: the sentence and the article (document) to which the token belongs. Figure 1 illustrates the high-level architecture of the proposed model for span identification. We train the proposed model simultaneously on three different objectives (one for each granularity): article-level regression, sentence-level classification, and token-level classification.
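The step from per-token labels back to spans can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `tokens_to_spans` and the use of character offsets per token are assumptions made for the example.

```python
def tokens_to_spans(token_offsets, token_labels):
    """Merge consecutive propaganda tokens into (start, end) character spans.

    token_offsets: list of (start, end) character offsets, one per token.
    token_labels:  list of 0/1 labels, 1 meaning the token is propaganda.
    Both inputs are hypothetical; the paper does not specify this interface.
    """
    spans = []
    current = None  # the span being built, as [start, end]
    for (start, end), label in zip(token_offsets, token_labels):
        if label:
            if current is None:
                current = [start, end]   # open a new span
            else:
                current[1] = end         # extend the open span
        elif current is not None:
            spans.append(tuple(current))  # a non-propaganda token closes the span
            current = None
    if current is not None:
        spans.append(tuple(current))      # close a span that reaches the end
    return spans
```

For example, four tokens at offsets (0, 3), (4, 9), (10, 12), (13, 18) with labels 0, 1, 1, 0 produce the single span (4, 12).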
3.1.1 Article-level Regression
For each article in the dataset, we have the article title and the article content split at the sentence level. The title of an article is a good summarized representation of the article's content. We mean-pool the word embeddings of the article's title to get the title embedding, and we represent each article by its title embedding. The title embedding is then passed through an article encoder (Figure 1) to get a lower-dimensional article encoding. For every article, we calculate the normalized count of sentences with propaganda spans. We refer to this value as the propaganda-ratio, defined according to Equation 1; it is restricted to the range [0, 1]. The article encoding is used for regression over the propaganda-ratio values of the articles. As the objective function for this regression task, we use the Smooth L1 Loss, which minimizes the error in the predicted values while avoiding exploding gradients from outlying predictions.
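The regression target and loss described above can be illustrated with a short sketch. It assumes that "normalized count" means the number of propaganda sentences divided by the total number of sentences, and it uses the common piecewise form of the Smooth L1 loss with a threshold `beta`; both function names and the `beta` default are assumptions for the example.

```python
def propaganda_ratio(sentence_labels):
    """Fraction of sentences containing a propaganda span (assumed normalization)."""
    if not sentence_labels:
        return 0.0
    return sum(1 for flag in sentence_labels if flag) / len(sentence_labels)


def smooth_l1(prediction, target, beta=1.0):
    """Smooth L1 loss for one prediction.

    Quadratic for small errors and linear for large ones, so a single
    outlying prediction contributes a bounded gradient.
    """
    diff = abs(prediction - target)
    if diff < beta:
        return 0.5 * diff * diff / beta
    return diff - 0.5 * beta
```

With this definition, an article whose sentences are labelled 1, 0, 0, 1 has a ratio of 0.5. The linear branch is what keeps the gradient magnitude constant for large errors, in contrast to a squared-error loss.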
3.1.2 Sentence-level Classification
We label every sentence in the article as containing or not containing a propaganda span. For every sentence, we obtain its sentence embedding by mean-pooling the word embeddings of the sentence. The sentence embedding is passed through a sentence encoder (Figure 1) to get a lower-dimensional sentence encoding. We concatenate the sentence encoding with its corresponding article encoding to incorporate the article context. The concatenated encoding is then used for the sentence-level binary classification task, with Binary Cross-Entropy Loss as the objective function.
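Mean pooling and the concatenation with article context can be sketched in plain Python. The sketch omits the learned sentence and article encoders, which sit between these two steps in the described model; the function names are hypothetical.

```python
def mean_pool(word_vectors):
    """Average a list of equal-length word vectors into one vector."""
    if not word_vectors:
        raise ValueError("cannot pool an empty sequence")
    dim = len(word_vectors[0])
    count = len(word_vectors)
    return [sum(vec[i] for vec in word_vectors) / count for i in range(dim)]


def with_article_context(sentence_encoding, article_encoding):
    """Concatenate a sentence encoding with its article encoding."""
    return list(sentence_encoding) + list(article_encoding)
```

The concatenated vector is what a sentence-level classifier would consume, so its input size is the sum of the two encoding sizes.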
3.1.3 Token-level Classification
For token-level binary classification, we pass the token's word embedding through a token encoder (Figure 1) to get a lower-dimensional token encoding. We concatenate the token encoding with its corresponding sentence and article encodings. This concatenated encoding is finally used for the token-level binary classification task. We use Binary Cross-Entropy Loss as the objective for this task. Concatenating each encoding with its corresponding lower-granularity encodings provides a unidirectional knowledge transfer from the lower to the higher granularity level. To implement an implicit bi-directional knowledge transfer, we train the tasks across all the granularities simultaneously. For this training, we define a combined multi-granularity objective function as shown in Equation 2, which includes a hyperparameter. Model parameters across all three granularities are trained together to optimize this objective function, resulting in a multi-task learning setting (Figure 1).
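The joint training signal can be sketched as below. The exact form of the combined objective is not recoverable from the text here, so the weighted sum in `combined_loss` and the weight `alpha` are assumptions chosen only to illustrate how one hyperparameter can balance the three losses; `token_input` shows the concatenation that carries context to the token classifier.

```python
import math


def binary_cross_entropy(prob, label, eps=1e-12):
    """BCE for one prediction; prob is the predicted probability of class 1."""
    prob = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def token_input(token_enc, sentence_enc, article_enc):
    """Concatenate a token encoding with its sentence and article encodings."""
    return list(token_enc) + list(sentence_enc) + list(article_enc)


def combined_loss(token_loss, sentence_loss, article_loss, alpha=0.5):
    """ASSUMED form: token loss plus alpha-weighted auxiliary losses.

    alpha is a hypothetical weighting hyperparameter, not a value
    taken from the paper.
    """
    return token_loss + alpha * (sentence_loss + article_loss)
```

Because all three terms are summed into one scalar, a single backward pass updates the encoders of every granularity, which is what makes the knowledge transfer implicitly bi-directional.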
3.2 Sub-Task 2: Technique Classification
This sub-task is a multi-class classification of propaganda techniques. Our system classifies each propagandist statement into one of the given techniques. The dataset is highly imbalanced, and sequence lengths vary widely. Therefore, we ensemble a machine learning algorithm with a deep learning transformer architecture to obtain a solution that generalizes better across the classes.
3.2.1 Machine Learning
For hand-crafted feature-based machine learning methods, we preprocess the sentences and extract their pragmatic and lexical features. Machine learning algorithms such as XGBoost and Logistic Regression use these features, perform substantially better than the baseline, and generalize well. Since news articles are written in formal language, largely free of typographical or linguistic errors, identifying these features is less technically challenging than in free-form text.
We preprocess the text to remove unwanted content and to enrich it for feature extraction. The preprocessing steps include UTF-8 conversion, removal of non-ASCII characters, lowercasing, lemmatizing/stemming, and removal of extra whitespace, newlines, and stop-words. We create the feature space for classification using three types of metadata: contextual (sequence length, counts of '!' and '?'), content (word and character n-gram TF-IDF, part of speech), and context-based (polarity of a sentence, obtained with the TextBlob API (https://textblob.readthedocs.io/en/dev/), and topic modeling). However, these models do not effectively capture the sequential nature of the data.
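A few of the preprocessing steps and surface features listed above can be sketched with the standard library alone. The stop-word list is a tiny illustrative stand-in, lemmatizing/stemming, TF-IDF weighting, polarity, and topic modeling are left out, and all function names are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real list would be much longer.
STOP_WORDS = frozenset({"the", "a", "an", "of", "and", "is", "to"})


def surface_features(span):
    """Length and punctuation counts, taken from the raw span."""
    return {
        "length": len(span),
        "exclamations": span.count("!"),
        "questions": span.count("?"),
    }


def clean(span):
    """Drop non-ASCII characters, lowercase, and remove stop-words and extra whitespace."""
    ascii_only = span.encode("ascii", errors="ignore").decode("ascii")
    words = re.findall(r"[a-z0-9]+", ascii_only.lower())
    return " ".join(word for word in words if word not in STOP_WORDS)


def char_ngrams(text, n):
    """Raw character n-gram counts (the input to a TF-IDF weighting)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))
```

The punctuation counts are computed on the raw span on purpose, since cleaning removes the very characters being counted.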
3.2.2 Deep Learning : Stacked LSTMs, Pretrained Embeddings, CNN-LSTMs
We shift our focus to deep learning models that retain sequential information by training Long Short-Term Memory (LSTM) networks. The non-linear decision boundaries of these neural networks capture the complex features inside text. We convert the sentences to lowercase, remove non-ASCII characters, and tokenize the text into words. During validation and testing, all unseen words are converted to a special [UNK] token. Only part of the development set's vocabulary intersects with the training set's vocabulary, and a share of the words in the training vocabulary occur fewer than five times.
Each word is encoded as a fixed-size vector, or embedding. These embeddings are randomly initialized and passed to a stacked LSTM network with a classification layer on top. These networks give us comparable results. For this approach, we stack two LSTM layers with the same hidden size. On increasing the number of stacked layers and the hidden size of each LSTM unit, the models tend to overfit and neglect the minority classes because of the relatively small feature space and the few training samples per class. Therefore, to improve the semantic relationships between words and to increase the training vocabulary, we propose systems that have prior knowledge of the language and a larger vocabulary. Such systems show improvement on these linguistically rich corpora over models with randomly initialized embeddings, which we attribute to their more refined and wider language modelling.
To increase the vocabulary of our model and to capture the contextual importance of words, we incorporate pre-trained word embeddings, "Global Vectors" (GloVe). These embeddings map each input token to a fixed-length vector. We set them to be non-trainable to avoid changing these vectors substantially. Pre-trained embeddings help the model understand the underlying relationships between words.
Since GloVe is trained on a large corpus to obtain vector representations of words, most of the vocabulary in both the development set and the training set intersects with its own. Only a small fraction of these words are therefore out of vocabulary, which addresses the vocabulary-coverage issue of the models above. On inspection, we find that most of the words missing from the GloVe vocabulary are compound words commonly used in the text, such as jawboned, antichrists, trumpists, cyberspies, and atleast. We therefore use sub-word tokenization to break each of these absent words into two in-vocabulary words, where the smaller word is at least two characters long, for example jaw boned, anti christs, trump ists, cyber spies, and at least. The sub-word tokenization increases our coverage further, with the remaining words being either proper nouns or non-alphabetic patterns. Since these words convey no meaning, they are removed during training and testing.
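The two-way split of out-of-vocabulary words can be sketched as follows. The text does not say which split is chosen when several are valid, so returning the first valid split from the left is an assumption, as are the function name and the toy vocabulary in the usage note.

```python
def split_oov(word, vocab, min_len=2):
    """Split an out-of-vocabulary word into two in-vocabulary words.

    Both parts must be at least min_len characters long. Returns the
    first valid split scanning from the left (an assumed tie-break),
    or None when no split exists.
    """
    for i in range(min_len, len(word) - min_len + 1):
        left, right = word[:i], word[i:]
        if left in vocab and right in vocab:
            return [left, right]
    return None
```

With a toy vocabulary containing "jaw", "boned", "at", and "least", the words "jawboned" and "atleast" split as in the examples above, while a word with no valid split returns None and would be dropped.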
We also employ CNN (Convolutional Neural Network) and CNN-LSTM models by adding convolution filters before extracting and pooling information from the embeddings. CNN-LSTM models help to extract a sequence of higher-level phrase representations. Because the LSTM layers receive reduced inputs, these models converge in fewer epochs and train faster per epoch, but the performance does not improve.
We pad all sequences shorter than a fixed maximum length and truncate longer ones to that length. The tokens are encoded as word embeddings. The resulting embeddings are fed to the stacked LSTM layers, with dropout regularization to prevent over-fitting. We pass the outputs of the LSTM layers to linear classification layers with sigmoid activation and train with the categorical cross-entropy loss. We use the Adam optimizer with a small learning rate to update the parameters.
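Padding, truncation, and the mapping of unseen words to an unknown-token id can be sketched in a few lines. The ids 0 and 1 for padding and unknown tokens, the function name, and the `max_len` values in the usage note are illustrative assumptions, not settings from the paper.

```python
def encode_sequence(tokens, vocab, max_len, pad_id=0, unk_id=1):
    """Map tokens to ids, truncate to max_len, and right-pad shorter sequences.

    vocab maps known tokens to integer ids; unknown tokens get unk_id.
    """
    ids = [vocab.get(token, unk_id) for token in tokens[:max_len]]
    return ids + [pad_id] * (max_len - len(ids))
```

For a vocabulary `{"fake": 2, "news": 3}`, the sequence `["fake", "news"]` with `max_len=4` becomes `[2, 3, 0, 0]`, so every batch row has the same length.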
3.2.3 Deep Learning : Transformer Architectures
Consequently, we move to deep pre-trained transformer architectures with an attention mechanism. We employ the BERT-base model, which has 12 layers. The BERT tokenizer implements word-piece tokenization, which eliminates out-of-vocabulary words by splitting words into sub-words. For example, the word "judgmental" is broken into ["judgment", "##al"]. This model has been pre-trained on a large corpus for masked language modelling and can hence be used for many downstream NLP tasks such as classification, question answering, and named-entity recognition. We freeze most of the layers of the model and train the remaining ones with a very small learning rate for four epochs, with a classification layer on top of the extracted pooled output embeddings.
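Word-piece tokenization is commonly implemented as a greedy longest-match-first search, which the following sketch illustrates on a toy vocabulary. It is a simplified stand-in rather than the BERT tokenizer itself: the vocabulary in the usage note is hypothetical, and the real tokenizer adds normalization and punctuation handling.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first word-piece split of a single word.

    Pieces after the first carry the '##' continuation prefix. If some
    position cannot be matched, the whole word maps to the unknown token.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # try a shorter candidate
        if piece is None:
            return [unk]
        pieces.append(piece)
        start = end
    return pieces
```

With the toy vocabulary `{"judgment", "judge", "##al"}`, the word "judgmental" splits into "judgment" followed by "##al", because the longest matching prefix is tried first.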
3.2.4 Ensemble Learning
Most of the deep learning models perform better than the machine learning models on the validation split. On closer inspection, however, we observed that the deep learning models tend to neglect the minority classes and have poor class-wise F1 scores for some of them (discussed in Section 5.2). The machine learning models, on the other hand, have no zero scores for any technique but prove less decisive for the majority classes. We infer that training an ensemble of these models can improve generalization across all the classes: it retains the gain in overall performance while avoiding zero scores for the minority classes. Therefore, we concatenate the scaled outputs from both models and pass them to a logistic regression classifier. We regularize this classifier with an L2 penalty and weight the classes in the objective in inverse proportion to their counts in the training set to obtain the final predictions. Our F1 scores for the minority classes show large improvements, with only a minor change in the overall micro-F1 score.
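Class weights in inverse proportion to class counts can be computed as in the sketch below. The particular normalization (total samples divided by the number of classes times the class count) is one common convention and is an assumption here; the text only states the inverse proportionality.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Per-class weights inversely proportional to class frequency.

    Uses the normalization n_samples / (n_classes * class_count), which
    is an assumed convention, so that a balanced dataset gets weight 1.0
    for every class.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}
```

A class that appears once among four samples of two classes receives weight 2.0, while the class that appears three times receives about 0.67, so errors on the rare class cost more in the objective.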
4 Experimental Setup
4.1 Sub-Task 1: Span Identification
For the span-identification sub-task, we use word embeddings from pre-trained transformer models, extracted using flairNLP (https://github.com/flairNLP/flair). We test our model with various embeddings such as BERT, RoBERTa, XLNet, and GPT-2, which differ in their model architectures and pre-training methods. These embeddings and their respective transformer models are kept non-trainable throughout our experiments so that the standalone performance of our proposed system can be monitored effectively. We split the given training dataset at the token level into training and validation splits. We compute the accuracy, F1 score, precision, and recall for the token-level binary classification. Since token-level classification has a high class imbalance, we use the token-level F1 score for validation. The official evaluation metrics for this task (span-level F1, precision, and recall) are calculated with respect to the overlaps between the predicted and the ground-truth spans.
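The token-level metrics can be computed as in the following sketch, which also shows why accuracy is a weak validation signal under class imbalance. It covers only the token-level binary metrics, not the official overlap-based span-level scores; the function name is hypothetical.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary token labels (1 = propaganda)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = correct / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

A model that predicts the majority class for every token scores high accuracy but an F1 of zero on the propaganda class, which is why the F1 score is the more informative choice here.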
4.2 Sub-Task 2: Technique Classification
For this sub-task, we use BERT (base-uncased variant), a pre-trained transformer model based on the work of devlin2018bert, with a classification layer on top. We extract the contextual pooled output embedding from the BERT model using transformers (https://github.com/huggingface/transformers), a Python package by Hugging Face. All layers except the last two and the classification layer are kept non-trainable. To avoid substantially updating the saved state of the parameters, we train the model for four epochs with a small learning rate that first increases in steps to its peak value and then decays linearly. The gradients are also clipped as discussed by DBLP:journals/corr/abs-1904-00962, and the AdamW optimizer is used to optimize the model. For machine learning models, we examine the feature importance scores from the XGBoost model for feature selection. We only consider features whose relative feature importance scores exceed a threshold, i.e. word n-gram tf-idf scores, character n-gram tf-idf scores, and the character count of the spans. We use a logistic regression model with an L2 penalty and class weights inversely proportional to the class counts in the training set to get the technique classification probabilities. The output logits from BERT's classification layer are scaled and concatenated with the scaled probabilities obtained from the logistic regression model. We pass this concatenated vector to another logistic regression classifier with the same hyper-parameters to get the final predictions.
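Building the input of the second-stage classifier can be sketched as below. The text does not name the scaling method, so per-column standardization (zero mean, unit variance) is an assumption, and the function names are hypothetical.

```python
import statistics


def scale_columns(rows):
    """Standardize each column to zero mean and unit variance (assumed scaling)."""
    scaled_columns = []
    for column in zip(*rows):
        mean = statistics.fmean(column)
        std = statistics.pstdev(column)
        # A constant column carries no information; map it to zeros.
        scaled_columns.append([(v - mean) / std if std > 0 else 0.0 for v in column])
    return [list(row) for row in zip(*scaled_columns)]


def stack_features(bert_logits, lr_probs):
    """Concatenate scaled BERT logits with scaled logistic regression probabilities."""
    scaled_logits = scale_columns(bert_logits)
    scaled_probs = scale_columns(lr_probs)
    return [a + b for a, b in zip(scaled_logits, scaled_probs)]
```

Each output row has one entry per class from each model, so with 2 classes the stacked vector has 4 entries. Scaling first keeps unbounded logits from dominating probabilities that lie between 0 and 1.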
5.1 Sub-Task 1: Span Identification
For the span-identification sub-task, our system is ranked on the final test-set leaderboard by its span-level F1 score. Table 1 shows the span-level metrics on the official development set and the token-level metrics on our token-level validation split. Our model achieves consistently high token-level accuracy across all the embeddings. We get the best token-level F1 scores with BERT and RoBERTa embeddings. The latter also outperforms all the other embeddings on the span-level metrics. Therefore, we use RoBERTa embeddings for the final submission to the task leaderboard. Performance on this sub-task could possibly be improved further by fine-tuning the respective transformer models so that the embeddings become trainable.
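Table 1. Column groups: Span-level Metrics (official development set); Token-level Metrics (validation split).

Table 2. Micro and macro F1 scores on the validation split and the development set.
|Model|Validation Micro F1|Validation Macro F1|Development Micro F1|Development Macro F1|
|BERT + Logistic Regression*|0.65|0.45|0.58|0.43|

Table 3. Class-wise F1 scores.
|Whataboutism, Straw_Men, Red_Herring|0.100|0.053|0.059|0.067|0.049|0.292|0.0588|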
5.2 Sub-Task 2: Technique Classification
For the technique-classification sub-task, our system is ranked on the final test set by its micro-F1 score. Table 2 shows the performance of our models on the development set. Since this is an imbalanced multi-class classification task, we also report our models' macro-F1 scores to test how well the proposed systems generalize across all the propaganda techniques, as macro averaging gives equal importance to all classes. Table 2 shows that on both the validation and development splits, the combined BERT and logistic regression model outperforms all other models, with macro-F1 scores of 0.45 and 0.43, respectively.
To further inspect the performance of our models on particular techniques, we also calculate the class-wise F1 scores, shown in Table 3. The machine learning model (logistic regression) achieves a relatively low micro-averaged F1 score on the development set but gives non-zero scores for all the classes. Stacked LSTM networks with randomly initialized embeddings do not achieve any improvement on the development set and yield near-zero scores for multiple classes as well. This is due to the vast vocabulary of this relatively small dataset. Consequently, using pre-trained GloVe embeddings with the stacked LSTM networks yields a significant improvement.
Similarly, while CNN-LSTM models are relatively fast, they lower the overall performance and yield zero scores for multiple classes. BERT outperforms the above models significantly; however, it does not perform well on the minority classes. The ensemble of BERT with the logistic regression model does not improve the micro-F1 score significantly, but unlike the other models it gives a non-zero F1 score for every class, which improves the macro-F1 score. We use this ensemble model as our final submission because it generalizes better across all the propaganda technique classes. Our results reveal that minority classes such as Appeal to authority, Bandwagon, Reductio ad Hitlerum, Black and white fallacy, and Thought terminating cliches are relatively difficult to predict.
We make our code publicly available as GitHub repositories. The source code for the proposed multi-granularity knowledge sharing model for sub-task 1 can be accessed at https://github.com/rajaswa/semeval2020-task11. The source code for the proposed ensemble of BERT with a linguistic-feature-based logistic regression model for sub-task 2 can be accessed at https://github.com/someshsingh22/News-Propaganda-Detection. Our models and results are available for benchmarking, comparison, and reproducibility. We do not share the experimental dataset, as per the task guidelines.
In this paper, we proposed systems for span identification and for imbalanced multi-class technique classification of propaganda spans in news articles. We analyzed the performance of various machine learning and deep learning architectures on these high-granularity tasks. For the span-identification sub-task, we evaluate our multi-granularity knowledge sharing model with the span-level F1 score on the test set. For the technique classification sub-task, we evaluate our ensemble of a pre-trained transformer model with logistic regression using the micro-F1 score. We further show the effectiveness of incorporating linguistic features, achieving non-zero F1 scores for all techniques and a gain in the macro-F1 score. Our results also reveal the limitations of deep learning models in capturing the minority-class techniques.
We plan to extend our work by using trainable transformer-model embeddings to improve performance on the span-identification sub-task. The work can be further enhanced by adding more granularities for knowledge sharing. The proposed knowledge-sharing model may also be used for closely related tasks such as fake news and hate speech detection, provided that appropriate application-specific objective functions are defined across multiple granularities for these tasks.
- NSIT@nlp4if-2019: propaganda detection from news articles using transfer learning. EMNLP-IJCNLP 2019, pp. 143.
- (2018) Contextual string embeddings for sequence labeling. In COLING 2018, 27th International Conference on Computational Linguistics, pp. 1638–1649.
- (2019) Organized persuasive communication: a new conceptual framework for research on public relations, propaganda and promotional culture. Critical Sociology 45 (3), pp. 311–328.
- Proppy: a system to unmask propaganda in online news. Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 9847–9848.
- From clickbait to fake news detection: an approach based on detecting the stance of headlines to articles. Proceedings of the 2017 EMNLP Workshop: Natural Language Processing meets Journalism, pp. 84–89.
- (2020-09) SemEval-2020 task 11: detection of propaganda techniques in news articles. In Proceedings of the 14th International Workshop on Semantic Evaluation, SemEval 2020, Barcelona, Spain.
- (2019-11) Fine-grained analysis of propaganda in news article. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 5636–5646.
- (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- (2019-06) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186.
- (2010) Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research 11 (Feb), pp. 625–660.
- (1999) Learning to forget: continual prediction with LSTM. In 1999 Ninth International Conference on Artificial Neural Networks ICANN 99 (Conf. Publ. No. 470), Vol. 2, pp. 850–855.
- (1984) Decline of a paradigm? Bias and objectivity in news media studies. Critical Studies in Media Communication 1 (3), pp. 229–259.
- Weakly supervised learning for fake news detection on Twitter. In 2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), pp. 274–277.
- (2018) Fake news detection. In 2018 IEEE International Students' Conference on Electrical, Electronics and Computer Science (SCEECS), pp. 1–5.
- (2018) Propaganda & persuasion. Sage Publications.
- (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- (2019) RoBERTa: a robustly optimized BERT pretraining approach. CoRR abs/1907.11692.
- (2017) Fixing weight decay regularization in Adam. CoRR abs/1711.05101.
- (2004) Complex linguistic features for text classification: a comprehensive study. In Advances in Information Retrieval, S. McDonald and J. Tait (Eds.), Berlin, Heidelberg, pp. 181–196.
- (2014) GloVe: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543.
- (2019) Language models are unsupervised multitask learners. OpenAI Blog 1 (8), pp. 9.
- (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
- (2019) XLNet: generalized autoregressive pretraining for language understanding. CoRR abs/1906.08237.
- (2017) Short-term load forecasting using EMD-LSTM neural networks with a XGBoost algorithm for feature importance evaluation. Energies 10 (8), pp. 1168.
- (2015) A C-LSTM neural network for text classification. arXiv preprint arXiv:1511.08630.