
Graph Neural Networks with Continual Learning for Fake News Detection from Social Media

07/07/2020
by   Yi Han, et al.
The University of Melbourne

Although significant effort has been applied to fact-checking, the prevalence of fake news over social media, which has profound impact on justice, public trust and our society as a whole, remains a serious problem. In this work, we focus on propagation-based fake news detection, as recent studies have demonstrated that fake news and real news spread differently online. Specifically, considering the capability of graph neural networks (GNNs) in dealing with non-Euclidean data, we use GNNs to differentiate between the propagation patterns of fake and real news on social media. In particular, we concentrate on two questions: (1) Without relying on any text information, e.g., tweet content, replies and user descriptions, how accurately can GNNs identify fake news? Machine learning models are known to be vulnerable to adversarial attacks, and avoiding the dependence on text-based features can make the model less susceptible to the manipulation of advanced fake news fabricators. (2) How to deal with new, unseen data? In other words, how does a GNN trained on a given dataset perform on a new and potentially vastly different dataset? If it achieves unsatisfactory performance, how do we solve the problem without re-training the model on the entire data from scratch, which would become prohibitively expensive in practice as the data volumes grow? We study the above questions on two datasets with thousands of labelled news, and our results show that: (1) GNNs can indeed achieve comparable or superior performance without any text information to state-of-the-art methods. (2) GNNs trained on a given dataset may perform poorly on new, unseen data, and direct incremental training cannot solve the problem. In order to solve the problem, we propose a method that achieves balanced performance on both existing and new datasets, by using techniques from continual learning to train GNNs incrementally.


1. Introduction

While social media has facilitated the timely delivery of various types of information around the world, a consequence is that news is emerging at an unprecedentedly high rate, making it increasingly difficult to fact-check. A series of incidents over recent years has demonstrated the significant damage fake news can cause to society. Therefore, how to automatically and accurately identify fake news before it is widespread has become an urgent research challenge. Here we use the definition in (Zhou and Zafarani, 2018): fake news is intentionally and verifiably false news published by a news outlet—similar definitions have also been used in previous studies on fake news detection (Monti et al., 2019; Shu et al., 2019a, 2017; Ruchansky et al., 2017).

In our work, we focus on a propagation-based approach for fake news detection. In other words, we use the propagation pattern of news on social media, e.g., tweets and retweets of news on Twitter, to determine whether it is fake or not. The feasibility of this approach builds on (1) empirical evidence that fake news and real news spread differently online (Vosoughi et al., 2018); and (2) the latest development in graph neural networks (GNNs) (Bruna et al., 2013; Niepert et al., 2016; Ying et al., 2018; Wu et al., 2019) that has enhanced the performance of machine learning models on non-Euclidean data. In addition, as pointed out in (Monti et al., 2019), whereas content-based approaches require syntactic and semantic analyses, propagation-based approaches are language-agnostic, and can be less vulnerable to adversarial attacks (Szegedy et al., 2013; Goodfellow et al., 2014), where advanced news fabricators carefully manipulate the content in order to bypass detection.

The idea of using propagation patterns to detect fake news has been explored in a number of previous studies (Wu et al., 2015; Ma et al., 2017; Wu and Liu, 2018; Liu and Wu, 2018; Zhou and Zafarani, 2019; Shu et al., 2019b), where different types of models have been considered: Wu et al. (Wu et al., 2015) use a hybrid Support Vector Machine (SVM); Ma et al. (Ma et al., 2017) use a Propagation Tree Kernel; Wu and Liu (Wu and Liu, 2018) incorporate Long Short-Term Memory (LSTM) cells into the Recurrent Neural Network (RNN) model; Liu and Wu (Liu and Wu, 2018) use both RNNs and Convolutional Neural Networks (CNNs); Shu et al. (Shu et al., 2019b) and Zhou et al. (Zhou and Zafarani, 2019) propose different types of features and compare multiple commonly used machine learning models. The most relevant work is (Monti et al., 2019), which also applies GNNs to study propagation patterns. However, in addition to selecting a different GNN algorithm specifically designed for graph classification (refer to Section 2 for further explanation), our work mainly focuses on the following questions:


  • Section 3: Without relying on any text information, e.g., tweet content, replies and user descriptions, how accurately can GNNs identify fake news? It is demonstrated in Section 3 that even though our model is limited to a restricted set of eight features obtained from the social context—(1) whether the Twitter user is verified, (2) the timestamp when the user account was created, (3) the number of followers, (4) the number of friends, (5) the number of lists, (6) the number of favourites, (7) the number of statuses and (8) the timestamp of the tweet—GNNs can be trained on the propagation patterns and these features to achieve comparable or superior performance to state-of-the-art methods that require sophisticated analyses of tweet content, user replies, etc. We argue that the limited set of features can further enhance the security of our models against adversarial attacks, as previous work has shown that high dimensionality facilitates the generation of adversarial samples, resulting in an increased attack surface (Wang et al., 2016). In addition, we do not rely on the follower or following relations between Twitter users, since these types of information are more difficult to obtain in real time due to the rate limits of the Twitter APIs. Therefore, our model is more suitable for online detection;

  • Section 4: How to deal with new, unseen data? The above question is only concerned with the performance of GNNs on a single dataset. However, a trained model may face vastly different data in practice, and it is important to further investigate how models perform in this scenario. Specifically, we find that GNNs trained on a given dataset may perform poorly on another dataset, and direct incremental training cannot solve the problem—this issue has not been discussed in the previous work that uses GNNs for fake news detection. In order to solve the problem, we propose a method that applies techniques from continual learning to train GNNs incrementally, so that they achieve balanced performance on both existing and new datasets. The method avoids re-training the model on the entire data from scratch—new data always exist, and this becomes prohibitively expensive as data volumes grow.

The remainder of this paper is organised as follows: Section 2 briefly introduces the background on graph neural networks; Section 3 describes our content-free, GNNs-based fake news detection algorithm; Section 4 investigates how to deal with new, unseen data, and proposes a solution to achieve balanced performance on both existing and new data by applying techniques from continual learning; Section 5 reviews previous work in fake news detection on social media; and finally Section 6 concludes the paper and offers directions for future work.

2. Background on Graph Neural Networks

Although deep learning has witnessed tremendous success in a wide range of applications, including image classification, natural language processing and speech recognition, it mostly deals with data in Euclidean space. GNNs (Bruna et al., 2013; Niepert et al., 2016; Ying et al., 2018; Wu et al., 2019), by contrast, are designed to process data generated from non-Euclidean domains.

Consider a graph with $N$ vertices/nodes and a set of edges, where $A \in \mathbb{R}^{N \times N}$ is the adjacency matrix: $A_{ij} = 1$ if there is an edge from node $i$ to node $j$, and $A_{ij} = 0$ otherwise; $X \in \mathbb{R}^{N \times D}$ is the feature matrix, i.e., each node has $D$ features. Given $A$ and $X$ as inputs, the output of a GNN, i.e., the node embeddings, after the $k$-th step is $H^{(k)} = f(H^{(k-1)}, A; \theta^{(k)})$, where $f$ is the propagation function parameterised by $\theta^{(k)}$, and $H^{(0)}$ is initialised by the feature matrix, i.e., $H^{(0)} = X$.

There have been a number of implementations of the propagation function. A simple form is $H^{(k)} = \sigma(A H^{(k-1)} W^{(k)})$, where $\sigma(\cdot)$ is a non-linear activation function, e.g., the rectified linear unit (ReLU), and $W^{(k)}$ is the weight matrix for layer $k$. A popular implementation of the function is (Kipf and Welling, 2017): $H^{(k)} = \sigma\big(\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} H^{(k-1)} W^{(k)}\big)$, where $\hat{A} = A + I_N$ and $\hat{D}_{ii} = \sum_j \hat{A}_{ij}$. Please refer to (Wu et al., 2019) for more choices of the function.
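
As a concrete illustration, the following is a minimal NumPy sketch of the Kipf and Welling propagation rule above; it uses dense matrices and a single layer, whereas practical implementations rely on sparse operations.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One propagation step: H' = ReLU(D^{-1/2} (A + I) D^{-1/2} H W)."""
    A_hat = A + np.eye(A.shape[0])                 # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))  # D^{-1/2} as a vector
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(0.0, A_norm @ H @ W)         # ReLU activation
```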

GNNs can perform node regression, node classification, link prediction, edge classification or graph classification, depending on the requirements. In our work, since the goal is to label the propagation pattern of each item of news, which is a graph, we choose DiffPool (Ying et al., 2018), an algorithm specifically designed for graph classification. DiffPool extends any existing GNN model by further exploiting the structural information of graphs: at each layer $l$, it takes the node embeddings $Z^{(l)}$ produced by the underlying GNN and the adjacency matrix $A^{(l)}$, and learns a coarsened graph with fewer nodes, described by a new adjacency matrix $A^{(l+1)}$ and node embeddings $X^{(l+1)}$.
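
To make the coarsening step concrete, here is a minimal sketch under the assumption of a given soft assignment matrix $S^{(l)}$ whose rows sum to one; in the full method, $S^{(l)}$ is itself produced by a separate GNN.

```python
import numpy as np

def diffpool_coarsen(A, Z, S):
    """One DiffPool coarsening step (sketch).

    A: (n, n) adjacency matrix, Z: (n, d) node embeddings,
    S: (n, m) soft cluster assignments, rows summing to 1.
    Returns the coarsened graph with m < n nodes.
    """
    X_next = S.T @ Z      # embeddings of the m coarsened nodes
    A_next = S.T @ A @ S  # connectivity between coarsened nodes
    return A_next, X_next
```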

3. Propagation-based Fake News Detection

As mentioned in the introduction, we use the definition in  (Zhou and Zafarani, 2018) that fake news is intentionally and verifiably false news published by a news outlet. Once an item of news is published, it may be tweeted by multiple users. We call these tweets that directly reference the news URL initial/source tweets. Each of them and their retweets form a separate cascade (Vosoughi et al., 2018), and all the cascades form the propagation pattern of an item of news. The purpose of this work is to determine the validity of an item of news using its propagation pattern.

Formally, we define the propagation-based fake news detection problem as follows: given a set of labelled graphs $\{(G_i, y_i)\}_{i=1}^{n}$, where $G_i$ is the propagation pattern for news item $i$ and $y_i \in \{0, 1\}$ is the label of graph $G_i$, the goal is to learn a mapping $f: G_i \mapsto y_i$ that labels each graph.

In the remainder of this section, we first explain how we generate a graph, i.e., the adjacency matrix and the feature matrix, in Section 3.1, and then present experimental results that verify the effectiveness of the GNN-based detection algorithm in Section 3.2.

Figure 1. An illustration of the graph for each item of news.
Figure 2. Performance comparison on the dataset of PolitiFact (panels a–d: accuracy, precision, recall, F1). The first eight bars correspond to the results of eight fake news detection algorithms as reported in (Shu et al., 2019a), the second last bar is the result of our propagation-based method trained on the whole dataset, and the last bar is the result of our propagation-based method trained on the clipped dataset.

3.1. Data Generation

In order to generate the news propagation pattern, we use the FakeNewsNet dataset (Shu et al., 2018), which was collected specifically for the purpose of fake news detection. FakeNewsNet contains labelled news from two websites, politifact.com (https://www.politifact.com/) and gossipcop.com (https://www.gossipcop.com/); for each item of news it provides the news content (both linguistic and visual information), all the tweets and retweets, and the information of the corresponding Twitter users (refer to (Shu et al., 2018) for more details).

Adjacency matrix. As illustrated in Fig. 1, each item of news is represented as a graph, where a node refers to a tweet (including the corresponding user), either a source tweet that references the news or one of its retweets. As a special case, an extra node representing the news itself is added to connect all cascades together; all the feature values for this node are set to zero. Edges represent information flow, i.e., how the news transfers from one person to another. Specifically, there is an edge from node $i$ to node $j$ if (see the sketch after this list):

  • The user of node $j$ mentions the user of node $i$ in the tweet;

  • Tweet $i$ is public and tweet $j$ is posted within a certain period of time after tweet $i$, e.g., five hours.
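
As an illustration, the following is a minimal sketch of these two edge rules; the record fields (user, mentions, public, time) are illustrative names rather than the actual FakeNewsNet schema.

```python
def build_edges(tweets, window_sec=5 * 3600):
    """Directed edges (i, j) modelling information flow (sketch).

    Each tweet is a dict with hypothetical fields: 'user', 'mentions'
    (set of user ids), 'public' (bool) and 'time' (seconds).
    """
    edges = []
    for i, ti in enumerate(tweets):
        for j, tj in enumerate(tweets):
            if i == j:
                continue
            mention_rule = ti["user"] in tj["mentions"]
            time_rule = ti["public"] and 0 < tj["time"] - ti["time"] <= window_sec
            if mention_rule or time_rule:
                edges.append((i, j))
    return edges
```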

Note that the follower and following relations are not considered here, since it is more difficult to obtain these types of information in real time—Twitter applies a much stricter rate limit on the corresponding APIs. To keep our model better suited to the online setting, we only translate the above relations into edges.

Feature matrix. As mentioned earlier, we do not rely on any textual information in this work, including tweet content, user replies or user descriptions, and only choose the following information from the social context as the features for each node (a feature-extraction sketch follows the list):

  • Whether the user is verified;

  • The timestamp when the user was created, encoded as the number of months since March 2006—the time when Twitter was founded;

  • The number of followers;

  • The number of friends;

  • The number of lists;

  • The number of favourites;

  • The number of statuses;

  • The timestamp of the tweet, encoded as the number of seconds since the first tweet referencing the news was posted.
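
For illustration, a minimal sketch of the eight-dimensional feature vector is given below; the field names are hypothetical and do not necessarily match the FakeNewsNet schema.

```python
from datetime import datetime

TWITTER_FOUNDED = datetime(2006, 3, 1)

def node_features(user, tweet, first_tweet_time):
    """Build the eight social-context features for one node (sketch)."""
    created = user["created_at"]
    months_since = ((created.year - TWITTER_FOUNDED.year) * 12
                    + created.month - TWITTER_FOUNDED.month)
    return [
        1.0 if user["verified"] else 0.0,     # (1) verified flag
        float(months_since),                  # (2) account age in months
        float(user["followers_count"]),       # (3) followers
        float(user["friends_count"]),         # (4) friends
        float(user["listed_count"]),          # (5) lists
        float(user["favourites_count"]),      # (6) favourites
        float(user["statuses_count"]),        # (7) statuses
        (tweet["created_at"] - first_tweet_time).total_seconds(),  # (8) tweet time
    ]
```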

3.2. Experimental Verification

Using the method introduced in the previous subsection to generate the graphs (the adjacency and feature matrices), we test multiple DiffPool models with a range of different architectures: 2-4 pooling layers, 16-128 hidden dimensions and 16-128 embedding dimensions. As recommended by the authors in (Ying et al., 2018), we use DiffPool built on top of GraphSage (Hamilton et al., 2017).

Figure 3. Performance comparison on the dataset of GossipCop (panels a–d: accuracy, precision, recall, F1). The first eight bars correspond to the results of eight fake news detection algorithms as reported in (Shu et al., 2019a), the second last bar is the result of our propagation-based method trained on the whole dataset, and the last bar is the result of our propagation-based method trained on the clipped dataset.

In addition, we train GNNs first on the whole dataset of PolitiFact/GossipCop, and then on the clipped dataset that contains, for each item of news, only the first 100 tweets or the tweets from the first five hours, whichever is smaller (a sketch of this clipping rule follows). It is more critical to detect fake news at an early stage, before it becomes widespread: the wider fake news spreads, the more likely people are to trust it (Boehm, 1994), and once a first impression is formed, it is difficult to correct people’s perceptions (keersmaecker and Roets, 2017).
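
A minimal sketch of the clipping rule, assuming the tweets of each news item are sorted by time:

```python
def clip_tweets(tweets, max_tweets=100, window_sec=5 * 3600):
    """Keep the first 100 tweets or those from the first five hours,
    whichever yields the smaller set (tweets sorted by 'time')."""
    t0 = tweets[0]["time"]
    within_window = [t for t in tweets if t["time"] - t0 <= window_sec]
    return within_window[:max_tweets]
```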

In order to make our results comparable with those reported in (Shu et al., 2019a) (as they also tested fake news detection algorithms on the same dataset), we follow the same procedure to train and test the GNNs: randomly choose 75% of the news as the training data while keeping the rest as the test data, and report the average performance over five repeats (a sketch of this protocol follows). The model is evaluated with the following commonly used metrics: accuracy, precision, recall and F1 score.
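
In outline, the protocol looks as follows; train_fn and score_fn are placeholders for model training and metric computation, not functions from our implementation.

```python
import random
from statistics import mean

def repeated_holdout(graphs, labels, train_fn, score_fn,
                     repeats=5, train_frac=0.75, seed=0):
    """Random 75/25 split, averaged over five repeats (sketch)."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        idx = list(range(len(graphs)))
        rng.shuffle(idx)
        cut = int(train_frac * len(idx))
        train_idx, test_idx = idx[:cut], idx[cut:]
        model = train_fn([graphs[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        scores.append(score_fn(model, [graphs[i] for i in test_idx],
                               [labels[i] for i in test_idx]))
    return mean(scores)
```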

The experimental results are presented in Figs. 2 and 3, where (1) The first eight bars correspond to the results of eight fake news detection algorithms as reported in (Shu et al., 2019a) on the same dataset—Rhetorical Structure Theory (RST) (Rubin et al., 2015), Linguistic Inquiry and Word Count (LIWC) (Pennebaker et al., 2015), Hierarchical Attention Networks (HAN) (Yang et al., 2016), text-CNN (Kim, 2014), TCNN-URG (Qian et al., 2018), HPA-BLSTM (Guo et al., 2018), CSI (Ruchansky et al., 2017) and dEFEND (Shu et al., 2019a). Note that all of these methods require analysis on textual information, e.g., tweet content, user replies. (2) The second last bar is the result of our propagation-based method trained on the whole dataset. (3) The last bar is the result of our method trained on the clipped dataset.

As we can see from the figures, by only relying on the limited features as introduced in Section 3.1, our model can achieve comparable performance on the dataset of PolitiFact, and the best result on the dataset of GossipCop, not only when trained on the complete dataset, but also when trained using the first 100 tweets or the tweets from the first five hours for each item of news.

In addition, we have also tested clipped datasets that contain the first 100 (without the five hour time limit), 200, 500, 1000 and 1500 tweets for each item of news. Table 1 presents the performance of models trained on these datasets. The results further demonstrate the effectiveness of our proposed method.

Model efficiency. When training and testing our models, we also find that GNNs converge very quickly: most of the time it only takes dozens of epochs for the model to reach performance similar to that of the final model in terms of the four metrics, while each epoch lasts from a couple of seconds to several minutes, depending on the model structure and the size of the dataset.

All these results provide strong support for applying GNNs in propagation-based fake news detection.

Dataset     Metric     100    200    500    1000   1500
PolitiFact  Accuracy   0.850  0.861  0.860  0.883  0.890
            Precision  0.846  0.852  0.852  0.876  0.873
            Recall     0.852  0.858  0.860  0.880  0.890
            F1         0.846  0.854  0.855  0.876  0.880
GossipCop   Accuracy   0.882  0.881  0.894  0.889  0.902
            Precision  0.876  0.877  0.893  0.889  0.897
            Recall     0.884  0.877  0.894  0.891  0.900
            F1         0.879  0.877  0.893  0.888  0.899
Table 1. Performance of the GNN-based fake news detection algorithm on the clipped datasets that contain the first 100 (without the five-hour time limit), 200, 500, 1000 and 1500 tweets for each item of news.

4. Dealing with New Data

While the above results demonstrate the effectiveness of our proposed method on a single dataset, this section further studies the model performance on new data.

Let one dataset, e.g., PolitiFact, represent the existing data that our model has been trained on, and the other dataset, e.g., GossipCop, represent the unknown data that our model will face in the future. We find that models trained on PolitiFact do not perform well on GossipCop (Fig. 4), and vice versa (the figure for the latter case is omitted due to similarity).

Note that from here forward we focus on models trained on the clipped dataset with the first 100 tweets or the tweets from the first five hours for each item of news (the same experiments have also been performed on the complete dataset, with similar results; please refer to Appendix A.1 for more details). We do so because previous results have shown that models trained on this dataset achieve performance reasonably close to models trained on the complete dataset, and more importantly, as we emphasised in the previous section, it is crucial to detect fake news items before they become widespread.

Figure 4. Models trained on the clipped dataset of PolitiFact perform poorly on the dataset of GossipCop.

An examination of the graphs reveals that those generated from PolitiFact and GossipCop differ vastly in terms of the numbers of nodes and edges, which explains the observed behaviour.

Why not directly train on both datasets? A natural thought is to re-train the model on both datasets, but this may not be feasible, or at least not ideal, in practice: there will always be new data that our model has not seen before, and it does not make sense to re-train the model from scratch on the entire data every time a new dataset is obtained, especially as the data size grows and re-training becomes prohibitively expensive. In the remainder of this section, we address the issue of dealing with new, unseen data.

4.1. Incremental Training

We first test incremental training, i.e., further training the model obtained from PolitiFact (or GossipCop) on the other dataset, GossipCop (or PolitiFact). However, as shown in Fig. 5, the models then only perform well on the latter dataset, on which they were most recently trained, while achieving degraded results on the former dataset (the figure for models first trained on GossipCop and then on PolitiFact is omitted due to similarity). Note that during incremental training, we still randomly choose 75% of graphs as the training data and the rest as the test data.

This is similar to the problem of catastrophic forgetting (McCloskey and Cohen, 1989; Ratcliff, 1990; McClelland et al., 1995; French, 1999) in the field of continual learning: when a deep neural network is trained to learn a sequence of tasks, its performance on earlier tasks degrades after it learns new ones, as training on the new tasks overrides the weights learned for the old ones.

In our case, each new dataset can be considered as a new task. In the next subsection, we investigate how to solve the problem by applying techniques from continual learning.

Figure 5. Models first trained on the clipped dataset of PolitiFact and then on GossipCop only perform well on the latter dataset on which they were trained, i.e., GossipCop.

4.2. Continual Learning

In order to deal with catastrophic forgetting, a number of approaches have been proposed, which can be roughly classified into three types (Parisi et al., 2018): (1) regularisation-based approaches that add extra constraints to the loss function to prevent the loss of previous knowledge; (2) architecture-based approaches that selectively train a part of the network for each task, and expand the network when necessary for new tasks; (3) dual-memory-based approaches that build on complementary learning systems (CLS) theory (McClelland et al., 1995; Kumaran et al., 2016), and replay samples for memory consolidation.

In this paper, we choose the following two popular methods:

  • Gradient Episodic Memory (GEM) (Lopez-Paz and Ranzato, 2017)—GEM uses an episodic memory to store a number of samples from previous tasks, and when learning a new task $t$, it does not allow the loss over the samples held in memory for a previous task $k$ to increase beyond its value at the time the learning of task $k$ finished (see the projection sketch after this list);

  • Elastic Weight Consolidation (EWC) (Kirkpatrick et al., 2017)—its loss function consists of a quadratic penalty term on the change of the parameters, in order to prevent drastic updates to those parameters that are important to the old tasks.
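
To make GEM's constraint concrete, below is a minimal sketch of its gradient projection in the case of a single previous task (with several memories, GEM instead solves a small quadratic program).

```python
import torch

def gem_project(g, g_mem):
    """Project the current-task gradient g so the memory loss does not
    increase: if g conflicts with the memory gradient g_mem (negative
    dot product), remove the conflicting component (sketch).

    g, g_mem: flattened 1-D gradient tensors of equal length.
    """
    dot = torch.dot(g, g_mem)
    if dot < 0:
        g = g - (dot / torch.dot(g_mem, g_mem)) * g_mem
    return g
```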

Figure 6. Performance of models first trained on the clipped dataset of PolitiFact and then on GossipCop using GEM (panels: sample size = 100, 200, 300).
Figure 7. Performance of models first trained on the clipped dataset of PolitiFact and then on GossipCop using EWC (panels: sample size = 100, 200, 300).

In our case, the learning on the two datasets (PolitiFact and GossipCop) is treated as two tasks. When the model learns the first task, it is trained as usual; then during the learning of the second task, we apply GEM and EWC:

  • Let $\theta_1$ denote the model parameters after the first task, and $\mathcal{M}$ the set of instances sampled from the first dataset; then the optimisation problem under GEM becomes: minimise $\ell(\theta, D_2)$ subject to $\ell(\theta, \mathcal{M}) \le \ell(\theta_1, \mathcal{M})$, where $D_2$ is the second dataset and $\ell$ is the loss function.

  • Let $\lambda$ be the regularisation weight, $F$ the Fisher information matrix, and $\theta^*_1$ the parameters of the Gaussian distribution used by EWC to approximate the posterior of $\theta$ after the first task; then the loss function under EWC is: $\mathcal{L}(\theta) = \mathcal{L}_2(\theta) + \sum_i \frac{\lambda}{2} F_i (\theta_i - \theta^*_{1,i})^2$, where $\mathcal{L}_2$ is the loss on the second dataset. Note that when estimating the Fisher information matrix $F$, we sample a set of instances from the first dataset, and we compare the model performance under different sample sizes (an EWC loss sketch follows this list).
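
For illustration, a minimal PyTorch sketch of the EWC objective above; the diagonal Fisher estimates are assumed to be precomputed from samples of the first dataset.

```python
import torch

def ewc_loss(loss_new_task, params, params_old, fisher, lam):
    """EWC objective (sketch): new-task loss plus a quadratic penalty,
    weighted by the diagonal Fisher information, on drifting away from
    the parameters learned on the old task.

    params, params_old, fisher: matching lists of tensors.
    """
    penalty = sum((f * (p - p_old).pow(2)).sum()
                  for p, p_old, f in zip(params, params_old, fisher))
    return loss_new_task + 0.5 * lam * penalty
```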

In terms of parameters, we test sample sizes of 100, 200 and 300 (all samples are chosen randomly), as well as several values of the regularisation weight $\lambda$ (for EWC only).

Figs. 6 and 7 show the performance of models first trained on PolitiFact and then (incrementally) on GossipCop using GEM and EWC, respectively (the remaining results are given in Appendices A.2 and A.3). The results demonstrate that while all these models achieve relatively balanced performance over the two datasets, GEM-trained models generally work better than EWC-trained models. In addition, the results in Appendix A.1 show that when the model is incrementally trained using GEM on the whole dataset, the performance can be further improved.

Another point worth mentioning is that the EWC training process requires more fine-tuning: for example, we need to apply early stopping to ensure balanced results on both datasets when the model is trained with EWC.

Efficiency. In terms of efficiency, the following observations can be made from our experiments on both datasets: (1) compared with the normal training process, training with GEM and EWC requires slightly more time; (2) there is no significant difference in training time between GEM and EWC; and (3) the impact of the parameters, i.e., sample size and , on the training time is also not significant.

5. Related Work

Detecting fake news on social media has been a popular research problem over recent years. In this section, we briefly review the prior work on this topic. Specifically, similar to (Shu et al., 2017; Pierri and Ceri, 2019), we classify existing work into three categories: content-based approaches, context-based approaches and mixed approaches, the first two of which, as suggested by their names, mainly rely on news content and social context to extract features for detection, respectively.

5.1. Content-based Approaches

Content-based approaches use news headlines and body content to verify the validity of the news. They can be further classified into two categories: knowledge-based and style-based (Shu et al., 2017; Zhou and Zafarani, 2018).

5.1.1. Knowledge-based Detection

In order for this type of method to work, a knowledge base or knowledge graph (Nickel et al., 2016) has to be built first. Here, knowledge is represented in the form of a triple: (Subject, Predicate, Object), i.e., an SPO triple (32). Then, to verify an item of news, knowledge extracted from its content is compared with the facts in the knowledge graph (Wu et al., 2014; Ciampaglia et al., 2015; Shi and Weninger, 2016). If a triple is missing from the knowledge graph, different link prediction algorithms can be used to calculate the probability of an edge labelled Predicate existing from the Subject node to the Object node.

5.1.2. Style-based Detection

According to forensic psychological studies (Undeutsch, 1967), statements based on real-life experiences differ significantly in both content and quality from those derived from fabrication or fiction. Since the purpose of fake news is to mislead the public, it often exhibits unique writing styles that are rarely seen in real news. Therefore, style-based methods aim to identify these characteristics. For example, Pérez-Rosas et al. (Pérez-Rosas et al., 2018) train linear SVMs on the following linguistic features to detect fake news: unigrams, bigrams, punctuation, psycholinguistic, readability and syntax features. Other methods that fall into this category include (Horne and Adali, 2017; Volkova et al., 2017; Wang, 2017; Potthast et al., 2018).

In addition to textual information, images posted in social media have also been investigated to facilitate the detection of fake news (Jin et al., 2017; Yang et al., 2018; Wang et al., 2018; Zhou et al., 2020).

5.2. Context-based Approaches

Social context here refers to the interactions between users, including tweet, retweet, reply, mention and follow. These engagements provide valuable information for identifying fake news spread on social media.

Jin et al. (Jin et al., 2016) build a stance network where the weight of an edge represents how much each pair of posts support or deny each other. Then fake news detection is based on estimating the credibility of all the posts related to the news item, which can be formalised as a graph optimisation problem.

Tacchini et al. (Tacchini et al., 2017) propose to detect fake news based on user interactions, i.e., the users who liked news items on Facebook. Their experiments show that both the logistic regression based and the harmonic Boolean label crowdsourcing based methods can achieve high accuracy.

Unlike the above supervised methods, an unsupervised approach is proposed by Yang et al. (Yang et al., 2019). It builds a Bayesian probabilistic graphical model to capture the generative process among the validity of news, user opinions and user credibility.

Note that propagation-based approaches as mentioned in the introduction also belong to this category.

5.3. Mixed Approaches

Mixed approaches use both news content and associated user interactions over social media to differentiate between fake news and real news.

Ruchansky et al. (Ruchansky et al., 2017) design a three-module architecture that combines the text of a news article, the received user responses and the source of the news: (1) the first module takes the user responses, news content and user features as input, and trains a Recurrent Neural Network (RNN) to capture temporal representations of articles; (2) the second module is fed with user features to generate a score and a low-dimensional representation for each user; (3) the third module takes the output of the first two modules and trains a neural network to label the news item.

Zhang et al. (Zhang et al., 2018) propose to use a pre-extracted word set to construct explicit features from the news content, user profile and news subject description, and meanwhile use an RNN to learn latent features, such as inconsistency in news article content and latent profile patterns. Once the features are obtained, a deep diffusive network is built to learn the representations of news articles, creators and subjects.

Shu et al. (Shu et al., 2019c) use the tri-relationship among publishers, news articles and users to detect false news. Specifically, non-negative matrix factorization is used to learn the latent representations for news content and users, and the problem is formalised as an optimisation over the linear combination of each relation. Multiple machine learning algorithms are tested to solve the optimisation problem, and the results demonstrate its effectiveness.

In addition to the above work, two recent papers have started to work on explainability, i.e., why their model labels certain news items as fake (Popat et al., 2018; Shu et al., 2019a).

6. Conclusions and Future Work

The prevalence of fake news over social media has become a serious social problem. In this paper, we propose a context-based approach for fake news detection, more specifically a propagation-based method that uses GNNs to distinguish between the different propagation patterns of fake news and real news over social networks. Even though the method only requires a limited number of features obtained from the social context, and does not rely on any text information, it can achieve comparable or superior performance to state-of-the-art methods that require syntactic and semantic analyses.

In addition, we identify the problem that GNNs trained on a given dataset may not perform well on new data where the graph structure is vastly different, and direct incremental training cannot solve the issue. Since this is similar to the catastrophic forgetting problem in continual learning, we propose a technique that applies two popular approaches, GEM and EWC, during the incremental training, so that balanced performance can be achieved on both existing and new data. This avoids re-training on the entire data, as it becomes prohibitively expensive as data size grows.

For future work, we will investigate whether, to some extent, the catastrophic forgetting phenomenon in this case can be mitigated by the choice of features: either by increasing the number of features, or by finding “universal” features that work well despite the different graph structures.

Figure 8. Models trained on the whole dataset of GossipCop perform poorly on the dataset of PolitiFact.
Figure 9. Models first trained on the whole dataset of GossipCop and then on PolitiFact only perform well on the latter dataset on which they were trained, i.e., PolitiFact.
Figure 10. Performance of models first trained on the whole dataset of PolitiFact and then on GossipCop using GEM (panels: sample size = 100, 200, 300).
Figure 11. Performance of models first trained on the whole dataset of GossipCop and then on PolitiFact using GEM (panels: sample size = 100, 200, 300).

References

  • L. E. Boehm (1994) The validity effect: a search for mediating variables. Personality and Social Psychology Bulletin 20 (3), pp. 285–293. Cited by: §3.2.
  • J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun (2013) Spectral networks and locally connected networks on graphs. arXiv e-prints, pp. arXiv:1312.6203. Cited by: §1, §2.
  • G. L. Ciampaglia, P. Shiralkar, L. M. Rocha, J. Bollen, F. Menczer, and A. Flammini (2015) Computational fact checking from knowledge networks. PLOS ONE 10 (6), pp. 1–13. Cited by: §5.1.1.
  • R. M. French (1999) Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences 3 (4), pp. 128 – 135. External Links: ISSN 1364-6613 Cited by: §4.1.
  • I. J. Goodfellow, J. Shlens, and C. Szegedy (2014) Explaining and harnessing adversarial examples. eprint arXiv:1412.6572. Cited by: §1.
  • H. Guo, J. Cao, Y. Zhang, J. Guo, and J. Li (2018) Rumor detection with hierarchical social attention network. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, CIKM ’18, Torino, Italy, pp. 943–951. External Links: ISBN 978-1-4503-6014-2 Cited by: §3.2.
  • W. Hamilton, Z. Ying, and J. Leskovec (2017) Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems 30, pp. 1024–1034. Cited by: §3.2.
  • B. D. Horne and S. Adali (2017) This just in: fake news packs a lot in title, uses simpler, repetitive content in text body, more similar to satire than real news. arXiv e-prints, pp. arXiv:1703.09398. Cited by: §5.1.2.
  • Z. Jin, J. Cao, Y. Zhang, J. Zhou, and Q. Tian (2017) Novel visual and statistical image features for microblogs news verification. IEEE Transactions on Multimedia 19 (3), pp. 598–608. Cited by: §5.1.2.
  • Z. Jin, J. Cao, Y. Zhang, and J. Luo (2016) News verification by exploiting conflicting social viewpoints in microblogs. In Proceedings of the 30th AAAI, AAAI’16, Phoenix, Arizona, pp. 2972–2978. Cited by: §5.2.
  • J. D. keersmaecker and A. Roets (2017) ‘Fake news’: incorrect, but hard to correct. the role of cognitive ability on the impact of false information on social impressions. Intelligence 65, pp. 107 – 110. External Links: ISSN 0160-2896 Cited by: §3.2.
  • Y. Kim (2014) Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1746–1751. Cited by: §3.2.
  • T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. In Proceedings of the 5th International Conference on Learning Representations, Palais des Congrès Neptune, Toulon, France. Cited by: §2.
  • J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, D. Hassabis, C. Clopath, D. Kumaran, and R. Hadsell (2017) Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences 114 (13), pp. 3521. Cited by: 2nd item.
  • D. Kumaran, D. Hassabis, and J. L. McClelland (2016) What learning systems do intelligent agents need? complementary learning systems theory updated. Trends in Cognitive Sciences 20 (7), pp. 512–534. External Links: ISSN 1879-307X 1364-6613 Cited by: §4.2.
  • Y. Liu and Y. B. Wu (2018) Early detection of fake news on social media through propagation path classification with recurrent and convolutional networks. In Proceedings of the 32nd AAAI, pp. 354–361. Cited by: §1.
  • D. Lopez-Paz and M. Ranzato (2017) Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems 30, pp. 6467–6476. Cited by: 1st item.
  • J. Ma, W. Gao, and K. Wong (2017) Detect rumors in microblog posts using propagation structure via kernel learning. In Proceedings of the 55th ACL, pp. 708–717. Cited by: §1.
  • J. L. McClelland, B. L. McNaughton, and R. C. O’Reilly (1995) Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory.. Psychological Review 102 (3), pp. 419–457. External Links: ISSN 0033-295X 0033-295X Cited by: §4.1, §4.2.
  • M. McCloskey and N. J. Cohen (1989) Catastrophic interference in connectionist networks: the sequential learning problem. In Psychology of Learning and Motivation, Vol. 24, pp. 109 – 165. Cited by: §4.1.
  • F. Monti, F. Frasca, D. Eynard, D. Mannion, and M. M. Bronstein (2019) Fake news detection on social media using geometric deep learning. arXiv e-prints, pp. arXiv:1902.06673. Cited by: §1, §1, §1.
  • M. Nickel, K. Murphy, V. Tresp, and E. Gabrilovich (2016) A review of relational machine learning for knowledge graphs. Proceedings of the IEEE 104 (1), pp. 11–33. Cited by: §5.1.1.
  • M. Niepert, M. Ahmed, and K. Kutzkov (2016) Learning convolutional neural networks for graphs. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML’16, pp. 2014–2023. Cited by: §1, §2.
  • G. I. Parisi, R. Kemker, J. L. Part, C. Kanan, and S. Wermter (2018) Continual lifelong learning with neural networks: a review. arXiv e-prints, pp. arXiv:1802.07569. Cited by: §4.2.
  • J. W. Pennebaker, R. L. Boyd, K. Jordan, and K. Blackburn (2015) The development and psychometric properties of LIWC2015. Technical report External Links: Link Cited by: §3.2.
  • V. Pérez-Rosas, B. Kleinberg, A. Lefevre, and R. Mihalcea (2018) Automatic detection of fake news. In Proceedings of the 27th International Conference on Computational Linguistics, pp. 3391–3401. Cited by: §5.1.2.
  • F. Pierri and S. Ceri (2019) False news on social media: a data-driven survey. SIGMOD Record 48 (2), pp. 18–27. External Links: ISSN 0163-5808 Cited by: §5.
  • K. Popat, S. Mukherjee, A. Yates, and G. Weikum (2018) DeClarE: debunking fake news and false claims using evidence-aware deep learning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 22–32. Cited by: §5.3.
  • M. Potthast, J. Kiesel, K. Reinartz, J. Bevendorff, and B. Stein (2018) A stylometric inquiry into hyperpartisan and fake news. In Proceedings of the 56th ACL, pp. 231–240. Cited by: §5.1.2.
  • F. Qian, C. Gong, K. Sharma, and Y. Liu (2018) Neural user response generator: fake news detection with collective user intelligence. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, IJCAI’18, Stockholm, Sweden, pp. 3834–3840. External Links: ISBN 978-0-9992411-2-7 Cited by: §3.2.
  • R. Ratcliff (1990) Connectionist models of recognition memory: constraints imposed by learning and forgetting functions.. Psychological Review 97 (2), pp. 285–308. External Links: ISSN 0033-295X 0033-295X Cited by: §4.1.
  • [32] (1999) (Website). External Links: Link. Cited by: §5.1.1.
  • V. Rubin, N. Conroy, and Y. Chen (2015) Towards news verification: deception detection methods for news discourse. In Proceedings of the 48th Hawaii International Conference on System Sciences (HICSS48), Cited by: §3.2.
  • N. Ruchansky, S. Seo, and Y. Liu (2017) CSI: a hybrid deep model for fake news detection. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, CIKM ’17, Singapore, Singapore, pp. 797–806. External Links: ISBN 978-1-4503-4918-5 Cited by: §1, §3.2, §5.3.
  • B. Shi and T. Weninger (2016) Fact checking in heterogeneous information networks. In Proceedings of the 25th International Conference Companion on World Wide Web, Montréal, Québec, Canada, pp. 101–102. External Links: ISBN 978-1-4503-4144-8 Cited by: §5.1.1.
  • K. Shu, L. Cui, S. Wang, D. Lee, and H. Liu (2019a) DEFEND: explainable fake news detection. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’19, Anchorage, AK, USA, pp. 395–405. External Links: ISBN 978-1-4503-6201-6 Cited by: §1, Figure 2, Figure 3, §3.2, §3.2, §5.3.
  • K. Shu, D. Mahudeswaran, S. Wang, D. Lee, and H. Liu (2018) FakeNewsNet: a data repository with news content, social context and spatialtemporal information for studying fake news on social media. arXiv e-prints, pp. arXiv:1809.01286. Cited by: §3.1.
  • K. Shu, D. Mahudeswaran, S. Wang, and H. Liu (2019b) Hierarchical propagation networks for fake news detection: investigation and exploitation. arXiv e-prints, pp. arXiv:1903.09196. Cited by: §1.
  • K. Shu, A. Sliva, S. Wang, J. Tang, and H. Liu (2017) Fake news detection on social media: a data mining perspective. SIGKDD Explorations Newsletter 19 (1), pp. 22–36. External Links: ISSN 1931-0145 Cited by: §1, §5.1, §5.
  • K. Shu, S. Wang, and H. Liu (2019c) Beyond news contents: the role of social context for fake news detection. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, WSDM ’19, Melbourne, VIC, Australia, pp. 312–320. External Links: ISBN 978-1-4503-5940-5 Cited by: §5.3.
  • C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2013) Intriguing properties of neural networks. eprint arXiv:1312.6199. Cited by: §1.
  • E. Tacchini, G. Ballarin, M. L. Della Vedova, S. Moret, and L. de Alfaro (2017) Some like it hoax: automated fake news detection in social networks. arXiv e-prints, pp. arXiv:1704.07506. Cited by: §5.2.
  • U. Undeutsch (1967) Beurteilung der glaubhaftigkeit von aussagen. Handbuch der Psychologie, Band 11: Forensische Psychologie, pp. 26–181. Cited by: §5.1.2.
  • S. Volkova, K. Shaffer, J. Y. Jang, and N. Hodas (2017) Separating facts from fiction: linguistic models to classify suspicious and trusted news posts on twitter. In Proceedings of the 55th ACL, pp. 647–653. Cited by: §5.1.2.
  • S. Vosoughi, D. Roy, and S. Aral (2018) The spread of true and false news online. Science 359 (6380), pp. 1146–1151. External Links: ISSN 0036-8075 Cited by: §1, §3.
  • B. Wang, J. Gao, and Y. Qi (2016) A Theoretical Framework for Robustness of (Deep) Classifiers against Adversarial Examples. eprint arXiv:1612.00334. Note: arXiv: 1612.00334 External Links: Link Cited by: 1st item.
  • W. Y. Wang (2017) “Liar, liar pants on fire”: a new benchmark dataset for fake news detection. In Proceedings of the 55th ACL, pp. 422–426. Cited by: §5.1.2.
  • Y. Wang, F. Ma, Z. Jin, Y. Yuan, G. Xun, K. Jha, L. Su, and J. Gao (2018) EANN: event adversarial neural networks for multi-modal fake news detection. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’18, London, United Kingdom, pp. 849–857. External Links: ISBN 978-1-4503-5552-0 Cited by: §5.1.2.
  • K. Wu, S. Yang, and K. Q. Zhu (2015) False rumors detection on sina weibo by propagation structures. In 2015 IEEE 31st International Conference on Data Engineering, pp. 651–662. External Links: Document Cited by: §1.
  • L. Wu and H. Liu (2018) Tracing fake-news footprints: characterizing social media messages by how they propagate. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, WSDM ’18, Marina Del Rey, CA, USA, pp. 637–645. External Links: ISBN 978-1-4503-5581-0 Cited by: §1.
  • Y. Wu, P. K. Agarwal, C. Li, J. Yang, and C. Yu (2014) Toward computational fact-checking. Proc. VLDB Endow. 7 (7), pp. 589–600. External Links: ISSN 2150-8097 Cited by: §5.1.1.
  • Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu (2019) A comprehensive survey on graph neural networks. arXiv e-prints, pp. arXiv:1901.00596. Cited by: §1, §2, §2.
  • S. Yang, K. Shu, S. Wang, R. Gu, F. Wu, and H. Liu (2019) Unsupervised fake news detection on social media: a generative approach. Proceedings of the 33rd AAAI 33, pp. 5644–5651. Cited by: §5.2.
  • Y. Yang, L. Zheng, J. Zhang, Q. Cui, Z. Li, and P. S. Yu (2018) TI-CNN: convolutional neural networks for fake news detection. arXiv e-prints, pp. arXiv:1806.00749. Cited by: §5.1.2.
  • Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy (2016) Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1480–1489. Cited by: §3.2.
  • R. Ying, J. You, C. Morris, X. Ren, W. L. Hamilton, and J. Leskovec (2018) Hierarchical graph representation learning with differentiable pooling. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 4805–4815. Cited by: §1, §2, §2, §3.2.
  • J. Zhang, B. Dong, and P. S. Yu (2018) FAKEDETECTOR: effective fake news detection with deep diffusive neural network. arXiv e-prints, pp. arXiv:1805.08751. Cited by: §5.3.
  • X. Zhou, J. Wu, and R. Zafarani (2020) SAFE: similarity-aware multi-modal fake news detection. In Advances in Knowledge Discovery and Data Mining, pp. 354–367. External Links: ISBN 978-3-030-47436-2 Cited by: §5.1.2.
  • X. Zhou and R. Zafarani (2018) Fake news: a survey of research, detection methods, and opportunities. arXiv:1812.00315 [cs]. External Links: 1812.00315 Cited by: §1, §3, §5.1.
  • X. Zhou and R. Zafarani (2019) Network-based fake news detection: a pattern-driven approach. arXiv e-prints, pp. arXiv:1906.04210. Cited by: §1.

Appendix A Additional Experimental Results

Here we present the remaining experimental results.

A.1. Results of Models Trained on the Whole Dataset

We have run the same experiments on the whole datasets of PolitiFact and GossipCop, and our results also suggest that: (1) these models perform well only on the dataset on which they are trained (e.g., Fig. 8); (2) direct incremental training suffers from catastrophic forgetting (e.g., Fig. 9); (3) incremental training using GEM can mitigate the problem (Figs. 10 and 11). We did not run further experiments with EWC on the whole dataset, as the previous results in Section 4 have demonstrated that GEM works better than EWC in our case.

A.2. Results of Incrementally Trained Models using GEM on the Clipped Dataset

Fig. 12 demonstrates the performance of the models first trained on the clipped dataset of GossipCop and then incrementally on PolitiFact using GEM.

A.3. Results of Incrementally Trained Models using EWC on the Clipped Dataset

Table 2 (Table 3) demonstrates the performance of the models first trained on the clipped dataset of PolitiFact (GossipCop) and then incrementally on GossipCop (PolitiFact) using EWC.

Figure 12. Performance of models first trained on the clipped dataset of GossipCop and then on PolitiFact using GEM (panels: sample size = 100, 200, 300).
                  PolitiFact                           GossipCop
sample size   Accuracy  Precision  Recall  F1      Accuracy  Precision  Recall  F1
100           0.787     0.735      0.683   0.693   0.801     0.801      0.784   0.788
              0.780     0.728      0.674   0.688   0.787     0.782      0.775   0.778
              0.796     0.747      0.706   0.721   0.755     0.757      0.739   0.741
              0.815     0.791      0.713   0.736   0.764     0.758      0.741   0.746
              0.823     0.797      0.732   0.753   0.751     0.750      0.734   0.737
200           0.782     0.746      0.649   0.667   0.800     0.790      0.792   0.791
              0.812     0.806      0.689   0.714   0.754     0.746      0.737   0.739
              0.798     0.768      0.685   0.705   0.764     0.760      0.743   0.747
              0.803     0.769      0.708   0.725   0.752     0.740      0.725   0.730
              0.819     0.786      0.736   0.753   0.762     0.754      0.744   0.746
300           0.783     0.749      0.672   0.690   0.779     0.777      0.762   0.766
              0.797     0.755      0.700   0.717   0.768     0.763      0.745   0.748
              0.804     0.765      0.706   0.725   0.753     0.743      0.730   0.734
              0.809     0.774      0.720   0.737   0.753     0.748      0.729   0.732
              0.801     0.757      0.711   0.727   0.755     0.750      0.724   0.729
Table 2. Performance of models first trained on PolitiFact and then on GossipCop using EWC.
                  PolitiFact                           GossipCop
sample size   Accuracy  Precision  Recall  F1      Accuracy  Precision  Recall  F1
100           0.822     0.823      0.724   0.749   0.756     0.747      0.743   0.744
              0.808     0.780      0.722   0.740   0.764     0.758      0.743   0.747
              0.820     0.810      0.722   0.739   0.744     0.737      0.726   0.728
              0.849     0.794      0.782   0.788   0.763     0.759      0.739   0.744
              0.808     0.778      0.719   0.735   0.755     0.750      0.731   0.735
200           0.835     0.805      0.741   0.763   0.707     0.697      0.677   0.681
              0.789     0.733      0.707   0.716   0.734     0.727      0.707   0.712
              0.835     0.787      0.738   0.756   0.783     0.777      0.766   0.768
              0.817     0.774      0.697   0.716   0.717     0.714      0.723   0.713
              0.775     0.736      0.643   0.657   0.780     0.774      0.783   0.776
300           0.817     0.783      0.734   0.749   0.744     0.736      0.727   0.729
              0.817     0.774      0.720   0.737   0.755     0.746      0.737   0.740
              0.866     0.821      0.768   0.780   0.747     0.736      0.734   0.735
              0.822     0.805      0.729   0.747   0.787     0.780      0.786   0.781
              0.789     0.737      0.679   0.691   0.701     0.698      0.705   0.696
Table 3. Performance of models first trained on GossipCop and then on PolitiFact using EWC.