Propaganda is a form of communication deliberately designed to influence the opinions of readers for a specific purpose. Unlike fake news, which is entirely made up and refers to fabricated news with no verifiable facts, propaganda conveys information that is emotionally charged or biased, although it may be built upon an element of truth. This characteristic makes propaganda more effective and harder to notice as it spreads through social media platforms. There are many propaganda techniques; examples of propagandistic texts and definitions of the corresponding techniques are shown in Figure 1.
We study the problem of fine-grained propaganda detection in this work, which is possible thanks to the recent release of the Propaganda Techniques Corpus (Da San Martino et al., 2019). Different from earlier works (Rashkin et al., 2017; Wang, 2017) that mainly study propaganda detection at a coarse-grained level, namely predicting whether a document is propagandistic or not, the problem requires identifying the tokens that express particular propaganda techniques in news articles. Da San Martino et al. (2019) propose strong baselines in a multi-task learning manner, which are trained by binary detection of propaganda at the sentence level and fine-grained propaganda detection over 18 techniques at the token level. Such data-driven methods have the merits of convenient end-to-end learning and strong generalization; however, they cannot guarantee the consistency between sentence-level and token-level predictions. In addition, it is appealing to integrate human knowledge into data-driven approaches.
In this paper, we introduce an approach named LatexPRO that leverages logical and textual knowledge for propaganda detection. Following Da San Martino et al. (2019), we develop a BERT-based multi-task learning approach as the base model, which makes predictions for 18 propaganda techniques at both the sentence level and the token level. Based on that, we inject two types of knowledge as additional objectives to regularize the learning process. Specifically, we use logical knowledge by expressing the consistency between sentence-level and token-level predictions as propositional Boolean expressions. Moreover, we use the textual definitions of propaganda techniques by first representing each of them as a contextual vector and then minimizing the distances to the corresponding model parameters in semantic space.
We conduct extensive experiments on the Propaganda Techniques Corpus (PTC) (Da San Martino et al., 2019), a large manually annotated dataset for fine-grained propaganda detection. Experiments show that our knowledge-augmented method significantly improves a strong multi-task learning approach. In particular, results show that our model greatly improves precision, demonstrating that injecting declarative knowledge expressed in both natural language and first-order logic can help the model make more accurate predictions. More importantly, further analysis indicates that augmenting the learning process with declarative knowledge reduces the percentage of inconsistency in model predictions.
The contributions of this paper are summarized as follows:
We introduce an approach to leverage declarative knowledge expressed in both natural language and first-order logic for fine-grained propaganda detection.
We utilize both types of knowledge as regularizers in the learning process, which enables the model to make more consistent sentence-level and token-level predictions.
Extensive experiments on the PTC dataset (Da San Martino et al., 2019) demonstrate that our method achieves superior performance with high F1 and precision.
|Appeal to fear-prejudice||187||32||20|
|Appeal to Authority||91||2||23|
|Reductio ad hitlerum||44||5||5|
Following previous work (Da San Martino et al., 2019), we conduct experiments on tasks at two different granularities: sentence-level classification (SLC) and fragment-level classification (FLC). Formally, in both tasks, the input is a plain-text document containing a sequence of characters together with a set of propagandistic fragments, where each propagandistic text fragment is represented as a sequence of contiguous characters. For SLC, the target is to predict whether a sentence is propagandistic, which can be regarded as binary classification. For FLC, the target is to predict a set of propagandistic fragments and to assign each of them to one of the propaganda techniques.
This paper utilizes the Propaganda Techniques Corpus (PTC) (Da San Martino et al., 2019) for experiments. PTC is a manually annotated dataset for fine-grained propaganda detection, containing 293/57/101 articles and 14,857/2,108/4,265 corresponding sentences for training, validation and testing, respectively. Each article is annotated with the start and end of each propaganda text span as well as the type of propaganda technique. As the annotations of the official testing set are not publicly available, we divided the validation set into a validation set of 22 articles and a test set of 35 articles. The statistics of all 18 propaganda techniques and their frequencies (instances per technique) are shown in Table 1.
For the SLC task, we evaluate the models using precision, recall and F1 scores. For the FLC task, we adopt the evaluation script provided by Da San Martino et al. (2019) to calculate precision, recall and F1, which gives partial credit to imperfect matches at the character level. The FLC task is evaluated with two kinds of measures: (1) Full task is the overall task of detecting propagandistic fragments and identifying their techniques, while (2) Spans is a special case of the Full task that considers only the spans of fragments and ignores their propaganda techniques.
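To make the idea of character-level partial credit concrete, the following is a minimal sketch and not the official evaluation script. It assumes that fragments are given as hypothetical `(start, end, technique)` tuples with an exclusive end offset, and that a predicted fragment earns credit proportional to its character overlap with a gold fragment of the same technique.

```python
def overlap(a, b):
    """Number of characters shared by two half-open spans (start, end, ...)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def partial_match_prf(predicted, gold):
    """Precision, recall and F1 with partial credit for overlapping fragments.

    predicted, gold: lists of (start, end, technique) tuples (assumed layout).
    """
    def credit(s, t, norm):
        # Credit is given only when the techniques agree.
        if s[2] != t[2] or norm == 0:
            return 0.0
        return overlap(s, t) / norm

    # Precision normalizes the overlap by the predicted fragment length,
    # recall normalizes it by the gold fragment length.
    p_sum = sum(credit(s, t, s[1] - s[0]) for s in predicted for t in gold)
    r_sum = sum(credit(s, t, t[1] - t[0]) for s in predicted for t in gold)
    precision = p_sum / len(predicted) if predicted else 0.0
    recall = r_sum / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = [(0, 10, "Appeal to Authority")]
pred = [(5, 15, "Appeal to Authority")]
print(partial_match_prf(pred, gold))
```

Under these assumptions, a prediction that covers half of a gold fragment with the right technique receives half credit, while the same span with a different technique receives none. A Spans-style score could be obtained from the same function by giving every fragment the same placeholder technique.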
In this section, we present our approach LatexPRO, which injects declarative knowledge of fine-grained propaganda techniques into neural networks. A high-level illustration is shown in Figure 2. We first present our base model (§3.1), which is a multi-task learning neural architecture that slightly extends the model of Da San Martino et al. (2019). Afterwards, we introduce ways to regularize the learning process with logical knowledge about the consistency between sentence-level and token-level predictions (§3.2) and textual knowledge from literal definitions of propaganda techniques (§3.3). Finally, we describe the training and inference procedures (§3.4).
3.1 Base Model
To better exploit the sentence-level information and further help token-level prediction, we develop a fine-grained multi-task method as our base model, which makes predictions for 18 propaganda techniques at both the sentence level and the token level. Inspired by the success of pre-trained language models on various natural language processing downstream tasks, we adopt BERT (Devlin et al., 2019) as the backbone model here. To fine-tune the model, for each sentence, the input sequence is modified as “[CLS] sentence tokens [SEP]”. Specifically, we add 19 binary classifiers and one 19-way classifier on top of BERT, where all classifiers are implemented as linear layers. At the sentence level, we perform multiple binary classifications, which further supports leveraging declarative knowledge. The last representation of the special token [CLS], which is regarded as a summary of the semantic content of the input, is adopted to perform multiple binary classifications, including one binary classification of propaganda vs. non-propaganda and 18 binary classifications, one for each propaganda technique. We adopt sigmoid activation for each binary classifier. At the token level, the last representation of each token is fed into a linear layer to predict the propaganda technique over 19 categories (i.e., 18 categories of propaganda techniques plus one category for “none of them”). We adopt Softmax activation for the 19-way classifier. We apply two different losses for this multi-task learning process: the sentence-level loss, which is the binary cross-entropy loss of the multiple binary classifications, and the token-level loss, which is the focal loss (Lin et al., 2017) of the 19-way classification for each token and helps to address the class imbalance problem.
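The two losses of the base model can be sketched in plain Python as follows. This is an illustrative sketch rather than the authors' implementation: the function names are hypothetical, the focusing parameter `gamma=2.0` is the default suggested by Lin et al. (2017) and is not a value reported in this paper, and the class-balancing weight of the focal loss is omitted.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(logits, target, gamma=2.0):
    """Focal loss for one token: -(1 - p_t)^gamma * log(p_t).

    gamma is an assumed value; gamma = 0 recovers plain cross-entropy.
    """
    p_t = softmax(logits)[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def binary_cross_entropy(probs, labels, eps=1e-12):
    """Mean binary cross-entropy over several sigmoid outputs."""
    total = 0.0
    for p, y in zip(probs, labels):
        total += y * math.log(max(p, eps)) + (1 - y) * math.log(max(1.0 - p, eps))
    return -total / len(probs)


# A confidently correct token contributes far less than an uncertain one.
print(focal_loss([4.0, 0.0, 0.0], target=0))
print(focal_loss([0.5, 0.0, 0.0], target=0))
```

The factor `(1 - p_t) ** gamma` shrinks the loss of tokens that are already classified with high confidence, which is why the focal loss is a natural fit when most tokens belong to the “none of them” category.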
3.2 Inject Logical Knowledge
There are implicit logical constraints between predictions. However, neural networks are hard to interpret and need a large amount of training data to learn such implicit logic. Therefore, we further improve performance by using logical knowledge. To this end, we employ propositional Boolean expressions to explicitly regularize the model with a logic-driven objective, which improves the logical consistency between sentence-level and token-level predictions and makes our method more interpretable. For instance, in this work, if a propaganda class $c$ is predicted by the multiple binary classifiers (indicating that the sentence contains this propaganda technique), then token-level predictions belonging to class $c$ should also exist. We thus consider the propositional rule $F = A \Rightarrow B$, formulated as:
$$P(F) = P(A \Rightarrow B) = \neg P(A) \vee P(B) = 1 - P(A) + P(A)P(B) = P(A)(P(B) - 1) + 1,$$
where $A$ and $B$ are two variables. Specifically, substituting $f_c(x)$ and $g_c(x)$ into the above formula as $A$ and $B$, the logic rule can be written as:
$$P(F) = f_c(x)(g_c(x) - 1) + 1,$$
where $x$ denotes the input, $f_c(x)$ is the binary classifier for propaganda class $c$, and $g_c(x)$ is the probability that the fine-grained predictions for $x$ contain category $c$. $g_c(x)$ can be obtained by max-pooling over the probabilities of all token-level predictions for class $c$. Note that the probabilities of unpredicted classes are set to 0 to prevent any violation, i.e., ensuring that each class has a probability corresponding to it. Our objective here is maximizing $P(F)$, i.e., minimizing $L_{logic} = -\log P(F)$, to improve logical consistency between coarse- and fine-grained predictions.
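A minimal sketch of such a logic-driven objective is given below. It rests on several assumptions that are not confirmed by the text: the implication "sentence-level class implies token-level class" is relaxed with the product form 1 - P(A) + P(A)P(B), the loss is the negative log of that probability averaged over classes, and token probabilities are stored with a hypothetical layout in which column 0 is the “none of them” category and column k + 1 corresponds to technique k.

```python
import math


def implication_probability(p_a, p_b):
    """Soft truth value of A => B under the assumed product relaxation."""
    return 1.0 - p_a + p_a * p_b


def pooled_token_probability(token_probs, column):
    """Max-pool the probability of one class over the tokens predicted as it.

    Tokens whose arg-max is another class contribute 0, mirroring the note
    that probabilities of unpredicted classes are set to 0.
    """
    best = 0.0
    for row in token_probs:
        predicted = max(range(len(row)), key=row.__getitem__)
        if predicted == column:
            best = max(best, row[column])
    return best


def logic_loss(sentence_probs, token_probs, eps=1e-12):
    """Average of -log P(A => B) over techniques (averaging is an assumption)."""
    losses = []
    for k, f_c in enumerate(sentence_probs):
        g_c = pooled_token_probability(token_probs, k + 1)
        p_rule = implication_probability(f_c, g_c)
        losses.append(-math.log(max(p_rule, eps)))
    return sum(losses) / len(losses)


# One technique: the sentence head is confident (0.9).
print(logic_loss([0.9], [[0.2, 0.8]]))  # a token supports the class: small loss
print(logic_loss([0.9], [[0.9, 0.1]]))  # no token supports it: large loss
```

The loss is close to zero either when the sentence-level classifier does not fire or when at least one token is confidently assigned to the same technique, which is exactly the behavior the implication rule asks for.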
3.3 Inject Textual Knowledge
Declarative knowledge in natural language, i.e., the literal definitions of propaganda techniques in this work, can be regarded as textual knowledge that contains useful semantic information. To exploit this kind of knowledge, we adopt an additional encoder to encode the literal definition of each propaganda technique. Specifically, for each definition, the input sequence is modified as “[CLS] definition [SEP]” and fed into BERT. We adopt the last representation of the special token [CLS] as the definition representation of each propaganda technique class. We calculate the Euclidean distance between each predicted propaganda category representation and the corresponding definition representation. Our objective is to minimize the textual definition loss, which regularizes the model to refine the propaganda category representations.
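The distance-based regularizer described above can be sketched as follows. This is an illustration under assumptions: vectors are plain Python lists standing in for the category representations and the encoded definitions, and the per-class distances are averaged, which the text does not specify.

```python
import math


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def definition_loss(category_vectors, definition_vectors):
    """Mean distance between each category vector and its definition vector.

    Averaging over classes is an assumption made for this sketch.
    """
    distances = [
        euclidean_distance(w, d)
        for w, d in zip(category_vectors, definition_vectors)
    ]
    return sum(distances) / len(distances)


# Two toy classes in a 2-dimensional space.
category_vectors = [[0.0, 0.0], [1.0, 1.0]]
definition_vectors = [[3.0, 4.0], [1.0, 1.0]]
print(definition_loss(category_vectors, definition_vectors))
```

Minimizing this quantity pulls each category representation toward the encoding of its textual definition, so classes with few training instances can still borrow semantic signal from their definitions.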
3.4 Training and Inference
To train the whole model jointly, we use a weighted sum of four losses: the token-level loss, the fine-grained sentence-level loss, the textual definition loss and the logical loss. Hyper-parameters are employed to control the trade-off among these losses. During training, our goal is to minimize this weighted sum using stochastic gradient descent.
|Model||Spans P||Spans R||Spans F1||Full Task P||Full Task R||Full Task F1||Consistency|
|BERT (Da San Martino et al., 2019)||50.39||46.09||48.15||27.92||27.27||27.60||-|
|MGN (Da San Martino et al., 2019)||51.16||47.27||49.14||30.10||29.37||29.73||-|
For the SLC task, our method predicts “propaganda” only if the probability of the propagandistic binary classification for the positive class is above 0.7. This threshold is chosen according to the numbers of propaganda and non-propaganda samples in the training dataset. For the FLC task, to better use the coarse-grained (sentence-level) information to guide the fine-grained (token-level) prediction, we explicitly constrain the 19-way predictions at inference time. Prediction probabilities of the 18 fine-grained binary classifications that are above 0.9 are set to 1, and all others are set to 0. Then the Softmax probability of the 19-way predictions (except for the “none of them” class) of each token is multiplied by the corresponding 18 binarized probabilities of propaganda techniques. This means that our model only makes predictions for the propaganda techniques that it is strongly confident the sentence contains.
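The inference procedure can be sketched as below. The thresholds 0.7 and 0.9 come from the text; everything else is an assumption made for illustration, in particular the function names and the layout in which column 0 of each token's probability row is the “none of them” class and the remaining columns follow the order of the sentence-level technique classifiers.

```python
def predict_sentence(propaganda_prob, threshold=0.7):
    """SLC decision: label the sentence as propaganda only above the threshold."""
    return propaganda_prob > threshold


def gate_token_probs(token_probs, technique_probs, threshold=0.9):
    """Multiply token-level technique probabilities by binarized sentence gates.

    Column 0 ("none of them", assumed layout) is left untouched.
    """
    gates = [1.0 if p > threshold else 0.0 for p in technique_probs]
    gated = []
    for row in token_probs:
        gated.append([row[0]] + [p * g for p, g in zip(row[1:], gates)])
    return gated


def decode_tokens(token_probs, technique_probs):
    """Arg-max label per token after gating."""
    gated = gate_token_probs(token_probs, technique_probs)
    return [max(range(len(row)), key=row.__getitem__) for row in gated]


# One token, two techniques (toy sizes instead of 18).
token_probs = [[0.3, 0.6, 0.1]]
print(decode_tokens(token_probs, [0.95, 0.50]))  # technique 1 is allowed
print(decode_tokens(token_probs, [0.50, 0.95]))  # technique 1 is blocked
```

In the second call the token falls back to the “none of them” class, because the only technique it favors is not supported confidently enough at the sentence level. This illustrates why the gating trades some recall for precision.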
4.1 Experimental Settings
In this paper, we conduct experiments on the Propaganda Techniques Corpus (PTC) (Da San Martino et al., 2019), a large manually annotated dataset for fine-grained propaganda detection, as detailed in Section 2. Because the annotations of the official PTC test set are not publicly available, we split the original dev set into dev and test sets as described in Section 2, and we use the code released by Da San Martino et al. (2019) to run the baselines. We adopt the F1 score as the final metric to represent model performance. We select the best model on the dev dataset.
We adopt BERT base cased (Devlin et al., 2019) as the pre-trained model. We implement our model using Huggingface (Wolf et al., 2019). We use AdamW as the optimizer. The hyper-parameters that weight the losses are those of our best model on the dev dataset. We set the max sequence length to 256, the batch size to 16, the learning rate to 3e-5 and warmup steps to 500. We train our model for 20 epochs and adopt an early stopping strategy on the average validation F1 score of Spans and Full Task with a patience of 5. For all experiments, we set the random seed to 42 for reproducibility.
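The early stopping rule can be sketched as follows. The patience of 5 and the averaging of the Spans and Full Task F1 scores come from the text; the function name and the per-epoch score lists are hypothetical, and in a real run the scores would be computed on the validation set after each epoch.

```python
def early_stopping_epoch(spans_f1, full_f1, patience=5):
    """Return (best_epoch, best_score) under patience-based early stopping.

    spans_f1, full_f1: validation F1 per epoch (hypothetical inputs).
    Training stops once the averaged score has not improved for
    `patience` consecutive epochs.
    """
    best_score = float("-inf")
    best_epoch = -1
    stale = 0
    for epoch, (s, f) in enumerate(zip(spans_f1, full_f1)):
        score = (s + f) / 2.0
        if score > best_score:
            best_score, best_epoch, stale = score, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_epoch, best_score


spans = [0.40, 0.50, 0.45, 0.44, 0.43, 0.42, 0.41, 0.90]
full = [0.20, 0.30, 0.25, 0.24, 0.23, 0.22, 0.21, 0.90]
print(early_stopping_epoch(spans, full))
```

In this toy run the score peaks at epoch 1 and then fails to improve for five epochs, so training stops before the last epoch is ever evaluated.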
4.2 Models for Comparison
We compare our proposed methods with several baselines for fine-grained propaganda detection. Moreover, three variants of our method are provided to reveal the impact of each component. The notations LatexPRO (T+L), LatexPRO (T), and LatexPRO (L) denote our model that injects both textual and logical knowledge, only textual knowledge, and only logical knowledge, respectively. Each of these models is described as follows.
BERT (Da San Martino et al., 2019) adds a linear layer on the top of BERT, and is fine-tuned on SLC and FLC tasks, respectively.
MGN (Da San Martino et al., 2019) is a multi-task learning model, which regards the SLC task as the main task and drives the FLC task on the basis of the SLC task.
LatexPRO is our baseline model without leveraging declarative knowledge.
LatexPRO (T) augments LatexPRO with declarative textual knowledge in natural language, i.e., the literal definitions of propaganda techniques.
LatexPRO (L) injects logical knowledge by employing propositional Boolean expressions to explicitly regularize the model.
LatexPRO (T+L) is our full model in this paper.
|Propaganda Technique||MGN P||MGN R||MGN F1||LatexPRO P||LatexPRO R||LatexPRO F1||LatexPRO (T+L) P||LatexPRO (T+L) R||LatexPRO (T+L) F1|
|Appeal to Authority||0||0||0||0||0||0||0||0||0|
|Appeal to fear-prejudice||8.41||18.26||11.52||15.69||14.90||15.28||13.53||14.90||14.18|
|Reductio ad hitlerum||45.40||49.02||47.14||99.85||59.88||74.87||100.00||45.74||62.77|
4.3 Experiment Results and Analysis
Fragment-Level Propaganda Detection.
The results for the FLC task are shown in Table 2. Our basic model LatexPRO achieves better results than the other baseline models, which demonstrates the effectiveness of our fine-grained multi-task learning structure. It is worth noting that our full model LatexPRO (T+L) outperforms MGN by 10.06% precision and 2.85% F1 on the Spans task, and by 12.54% precision and 4.92% F1 on the Full task, which is considerable progress on this dataset. This demonstrates that leveraging declarative knowledge in text and first-order logic helps to predict the propaganda types more accurately. Moreover, our ablated models LatexPRO (T) and LatexPRO (L) both gain improvements over LatexPRO, while LatexPRO (L) gains more than LatexPRO (T). This indicates that injecting each kind of knowledge is useful on its own, and that the effects of the two kinds of knowledge can be combined. It should be noted that, compared with the baseline models, our models achieve superior performance thanks to high precision, at the cost of a slight loss in recall. This is mainly because our models tend to make predictions only for the propaganda types about which they are highly confident.
To further understand the performance of the models on the FLC task, we make a more detailed analysis of each propaganda technique. Table 3 shows the detailed performance on the Full task. Our models achieve precision and F1 improvements on almost all classes over the baseline model, and can also predict some low-frequency propaganda techniques, e.g., Whataboutism and Obfus.,Int. This further demonstrates that our method can alleviate the class imbalance problem and make more accurate predictions.
|Model||Precision||Recall||F1|
|BERT (Da San Martino et al., 2019)||58.26||57.81||58.03|
|MGN (Da San Martino et al., 2019)||57.41||62.50||59.85|
Sentence-Level Propaganda Detection.
Table 4 shows the performance of different models on the SLC task. The results indicate that our model achieves superior performance over the other baseline models. Compared to MGN, LatexPRO (T+L) increases the precision by 1.63%, recall by 9.16% and F1 score by 4.89%. This demonstrates the effectiveness of our model, and shows that our model can find more positive samples, which further benefits the token-level predictions for the FLC task.
4.4 Effectiveness of Improving Consistency
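We further define the following metric $MC$ to measure the consistency between the sentence-level predictions $Y_c$, which is a set of predicted propaganda technique classes, and the token-level predictions $Y_t$, which is a set of predicted propaganda techniques for the input tokens:
$$MC(Y_c, Y_t) = \frac{1}{|Y_t|} \sum_{y_t \in Y_t} \mathbb{1}_{Y_c}(y_t),$$
where $|Y_t|$ denotes a normalizing factor and $\mathbb{1}_{A}(x)$ represents the indicator function:
$$\mathbb{1}_{A}(x) = \begin{cases} 1 & \text{if } x \in A, \\ 0 & \text{if } x \notin A. \end{cases}$$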
Table 2 presents the consistency scores. A higher score indicates better consistency. Results illustrate that our methods with declarative knowledge substantially outperform the basic model LatexPRO. Compared to the basic model, our declarative knowledge augmented methods enrich the source information by introducing textual knowledge from propaganda definitions and logical knowledge from implicit logical rules between predictions, which enables the model to make more consistent predictions.
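A consistency score of this kind can be sketched as follows. The sketch assumes that the score is the fraction of distinct token-level techniques that also appear in the sentence-level prediction set, and that a sentence with no token-level technique counts as fully consistent; both choices, like the function name, are assumptions made for illustration.

```python
def consistency(sentence_techniques, token_techniques):
    """Fraction of token-level techniques confirmed at the sentence level.

    sentence_techniques: techniques predicted by the sentence classifiers.
    token_techniques: techniques predicted for the tokens of that sentence.
    """
    sentence_set = set(sentence_techniques)
    token_set = set(token_techniques)
    if not token_set:
        # Assumed convention: nothing predicted means nothing to contradict.
        return 1.0
    agreed = sum(1 for technique in token_set if technique in sentence_set)
    return agreed / len(token_set)


sentence_level = {"Appeal to Authority", "Reductio ad hitlerum"}
token_level = ["Appeal to Authority", "Appeal to Authority", "Whataboutism"]
print(consistency(sentence_level, token_level))
```

Here one of the two distinct token-level techniques is backed by the sentence-level prediction, so the two levels only partially agree.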
4.5 Case Study
Figure 3 gives a qualitative comparison example between MGN and our LatexPRO (T+L). Different colors represent different propaganda techniques. The results show that although MGN could predict the spans of fragments correctly, it fails to identify their techniques to some extent. However, our method shows promising results on both spans and specific propaganda techniques, which further confirms that our method can make more accurate predictions.
4.6 Error Analysis
Although our model has achieved the best performance, some types of propaganda techniques are still not identified, e.g., Appeal to Authority and Red Herring, as shown in Table 3. To explore why our model LatexPRO (T+L) cannot make predictions for those propaganda techniques, we compute a confusion matrix for the Full Task of the FLC task, and visualize it as a heatmap in Figure 4. We find that most of the off-diagonal elements fall in class O, which represents “none of them”. This demonstrates that most of the errors are cases wrongly classified as O. We think this is due to the imbalance between the propaganda and non-propaganda categories in the dataset. Similarly, Straw Men, Red Herring and Whataboutism are relatively low-frequency classes. How to deal with the class imbalance still needs further exploration.
5 Related work
Our work relates to fake news detection and the injection of first-order logic into neural networks. We will describe related studies in these two directions.
Fake news detection draws growing attention as the spread of misinformation on social media becomes easier and leads to stronger influence. Various types of fake news detection problems have been introduced. For example, there are 4-way classification of news documents (Rashkin et al., 2017) and 6-way classification of short statements (Wang, 2017). There are also sentence-level fact checking problems with various genres of evidence, including natural language sentences from Wikipedia (Thorne et al., 2018), semi-structured tables (Chen et al., 2019), and images (Zlatkova et al., 2019; Nakamura et al., 2019). Our work studies propaganda detection, a fine-grained problem that requires token-level prediction over 18 fine-grained propaganda techniques. The release of a large manually annotated dataset (Da San Martino et al., 2019) makes the development of large neural models possible, and also triggers our work, which improves a standard multi-task learning approach by augmenting it with declarative knowledge expressed in both natural language and first-order logic.
Neural networks have the merits of convenient end-to-end training and good generalization; however, they typically need a lot of training data and are not interpretable. On the other hand, logic-based expert systems are interpretable and require little or no training data. It is appealing to leverage the advantages of both worlds. In the NLP community, the injection of logic into neural networks can be generally divided into two groups. Methods in the first group regularize neural networks with logic-driven loss functions (Xu et al., 2017; Fischer et al., 2018; Li et al., 2019). For example, Rocktäschel et al. (2015) target the problem of knowledge base completion. After extracting and annotating propositional logical rules about relations in a knowledge graph, they ground these rules to facts from the knowledge graph and add a differentiable training loss function. Kruszewski et al. (2015) map text to Boolean representations, and derive loss functions based on implication at the Boolean level for entailment detection. Demeester et al. (2016) propose lifted regularization for knowledge base completion, which makes the logical loss functions independent of the number of grounded instances and extends them to unseen constants. The basic idea is that hypernyms have ordering relations and such relations correspond to component-wise comparison in semantic vector space. Hu et al. (2016) introduce a teacher-student model, where the teacher model is a rule-regularized neural network whose predictions are used to teach the student model. Wang and Poon (2018) generalize virtual evidence (Pearl, 2014) to arbitrary potential functions over inputs and outputs, and use deep probabilistic logic to integrate indirect supervision into neural networks. More recently, Asai and Hajishirzi (2020) regularize question answering systems with symmetric consistency and transitive consistency. The former creates a symmetric question by replacing words with their antonyms in a comparison question, while the latter is for causal reasoning questions and creates new examples when a positive causal relationship between two cause-effect questions holds.
Methods in the second group build logic-specific modules into the neural network itself. For example, Rocktäschel and Riedel (2017) target the problem of knowledge base completion, and use neural unification modules to recursively construct a model similar to the backward chaining algorithm of Prolog. Evans and Grefenstette (2018) develop a differentiable model of forward chaining inference, where weights represent a probability distribution over clauses. Li and Srikumar (2019) inject logic-driven neurons into existing neural networks by measuring the degree to which the head is true using probabilistic soft logic (Kimmig et al., 2012). Our approach belongs to the first direction, and to the best of our knowledge, our work is the first that augments neural networks with logical knowledge for propaganda detection.
In this paper, we propose a fine-grained multi-task learning approach, which leverages declarative knowledge to detect propaganda techniques in news articles. Specifically, the declarative knowledge is expressed in both natural language and first-order logic, which are used as regularizers to obtain better propaganda representations and to improve logical consistency between coarse- and fine-grained predictions, respectively. Extensive experiments on the PTC dataset demonstrate that our knowledge-augmented method achieves superior performance with more consistent sentence-level and token-level predictions.
- Chen et al. (2019). TabFact: a large-scale dataset for table-based fact verification. arXiv preprint arXiv:1909.02164.
- Da San Martino et al. (2019). Fine-grained analysis of propaganda in news article. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 5640–5650.
- Devlin et al. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL, pp. 4171–4186.
- Dong et al. (2019). Neural logic machines. arXiv preprint arXiv:1904.11694.
- Fischer et al. (2018). DL2: training and querying neural networks with logic.
- Kimmig et al. (2012). A short introduction to probabilistic soft logic. In Proceedings of the NIPS Workshop on Probabilistic Programming: Foundations and Applications, pp. 1–4.
- Li et al. (2019). A logic-driven framework for consistency of neural models. arXiv preprint arXiv:1909.00126.
- Li and Srikumar (2019). Augmenting neural networks with first-order logic. arXiv preprint arXiv:1906.06298.
- Lin et al. (2017). Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988.
- Nakamura et al. (2019). r/Fakeddit: a new multimodal benchmark dataset for fine-grained fake news detection. arXiv preprint arXiv:1911.03854.
- Pearl (2014). Probabilistic reasoning in intelligent systems: networks of plausible inference. Elsevier.
- Rashkin et al. (2017). Truth of varying shades: analyzing language in fake news and political fact-checking. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2931–2937.
- Thorne et al. (2018). FEVER: a large-scale dataset for fact extraction and verification. arXiv preprint arXiv:1803.05355.
- Wang (2017). "Liar, liar pants on fire": a new benchmark dataset for fake news detection. arXiv preprint arXiv:1705.00648.
- Wolf et al. (2019). HuggingFace's Transformers: state-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.
- Xu et al. (2017). A semantic loss function for deep learning with symbolic knowledge. arXiv preprint arXiv:1711.11157.
- Yang et al. (2017). Differentiable learning of logical rules for knowledge base reasoning. In Advances in Neural Information Processing Systems, pp. 2319–2328.
- Zlatkova et al. (2019). Fact-checking meets fauxtography: verifying claims about images. arXiv preprint arXiv:1908.11722.