SGP: Spotting Groups Polluting the Online Political Discourse

10/16/2019 ∙ by Junhao Wang, et al. ∙ 0

Social media sites are becoming a key factor in politics. These platforms are easy to manipulate for the purpose of distorting the information space to confuse and distract voters. It is of paramount importance for social media platforms, users engaged in online political discussions, and government agencies to understand the dynamics on social media and to identify malicious groups that engage in misinformation campaigns and thus pollute the general discourse around a topic of interest. Past work on identifying such disruptive patterns has mostly focused on analyzing user-generated content such as tweets. In this study, we take a holistic approach and propose SGP to provide an informative bird's-eye view of all the activities in these social media sites around a broad topic and to detect coordinated groups suspected of engaging in misinformation campaigns. To show the effectiveness of SGP, we deploy it to provide a concise overview of polluting activity on Twitter around the upcoming 2019 Canadian Federal Elections, by analyzing over 60 thousand user accounts connected through 3.4 million connections and 1.3 million hashtags. Users in the polluting groups detected by SGP-flag are over 4x more likely to become suspended, while the majority of these highly suspicious users escaped Twitter's suspension algorithm. Moreover, while a few of the polluting hashtags detected are linked to misinformation campaigns, SGP-signature also flags others that have not been picked up on. More importantly, we also show that a large coordinated set of right-wing conservative groups based in the US is heavily engaged in Canadian politics.




1. Introduction

Figure 1. 13 polluting blocks detected by SGP in the Twitter space around the 2019 Canadian Federal Election. The popularity of the hashtags used within these groups reflects their common political creed. The background shows the interactions among these polluting groups through the followership network. We observe a big conservative block made up of multiple groups with different political creeds that are heavily engaged with each other; it encompasses the Trump MAGA US group and a domestic anti-Trudeau group.

With most of the interaction in our societies moving online, social media websites now play a key role in democracy and politics. It is becoming clear from recent major political polls that these platforms can be easily manipulated to pollute the information space, from employing bots to generate automated content to more sophisticated and complex information operation tactics. For example, the 2016 US Presidential Election is believed to have been swayed through interference from Russian trolls and bots (badawy2018analyzing). Such operations aim to distort the information space to confuse and distract voters, disseminate propaganda and disinformation to foster divisions, and paralyze the decision-making abilities of individuals (wilson2018assembling). The ultimate goals or motives of these operations might be hard to interpret, but their effect on public opinion, democracy and elections is clear (marwick2017media; bovet2019influence). Twitter, for example, reported possible engagement of 1.4 million users with the suspected “trolls” from the Russian government funded Internet Research Agency (policy2018update). Therefore, it is crucial that we develop tools to identify polluting behaviour at an early stage, enabling us to proactively monitor the space and work towards a healthy democratic society. Towards this goal, we present a novel approach to study the activity in the complex social media space and to identify groups which are polluting the information space in an unexpectedly organized manner by amplifying each other's voices and boosting each other's influence. More specifically, SGP detects groups of users who densely follow each other and also publish similar content, by jointly embedding the connections between the users and the content they post.
In extreme cases, bots generated from the same script behave in an almost identical or highly correlated manner (chavoshi2016identifying), and trolls or sockpuppets operated by the same person behind the scenes exhibit lock-step behaviour (kumar2017army).

Figure 2. Map of Twitter around the 2019 Canadian Election created by SGP, which provides a bird's-eye view of how the detected polluting groups of users engage in the discourse. Each colored node represents a group of users who behave in a strikingly similar way. The size of a group node corresponds to the group's engagement in Canadian politics. The size of a background node represents the individual-level engagement of a user, whereas the density of the background corresponds to how many users are engaging in that space. Polluting groups are marked in red. All groups that engage significantly with the rest of the map are tagged with their SGP-signature hashtag, and their outgoing edges are plotted to show the spread of their influence. The spread of the polluting groups is marked in red.

The main contributions of SGP are four-fold, demonstrated by each of its four components:

  • SGP-map maps out large scale activity on social media by jointly embedding user connections and the content they post

  • SGP-flag detects groups of users which are posting similar content and are also densely connected to each other, a common indicator of misinformation campaigns, which we call polluting groups

  • SGP-signature characterizes the engagement of the polluting groups and finds their political creed

  • SGP-meso explains how different polluting groups engage with each other and the rest of the population

To demonstrate its effectiveness, we employ the proposed SGP method to spot polluting groups active around the upcoming 2019 federal elections in Canada. In particular, we monitor the activity within Twitter, one of the most commonly used platforms for mobilizing the public at times of political unrest. Fig. 1 shows the thirteen polluting groups flagged by our proposed SGP in this domain, while fig. 2 shows how these groups engage in the overall discourse around this election.

SGP-flag (fig. 1) detects polluting groups of users who are over four times more likely to be suspended by Twitter (explained in detail later in the paper). However, most of these users slip through the filtering currently in place even though their behaviour is evidently very similar to that of the suspended users. On the other hand, several of the hashtags that these polluting groups use and that SGP-signature singles out, which we call polluting hashtags, are linked to misinformation campaigns, in particular #notAbot and #TrudeauMustGo (emma19; Orr19). Another key observation made possible by SGP-meso is that there seems to be a large block of global right-wing users, made up of multiple polluting groups that interact with each other and are highly active in Canadian politics. To the best of our knowledge, we are the first group to report this observation, which is made possible by the bird's-eye view that SGP-map provides.

While these observations are interesting and help us better navigate the map of Twitter activity around this important election, we want to emphasize that SGP provides a general and novel tool for computational social science and could be applied to study and investigate polluting groups engaged around any given topic. We verify this through extensive synthetic experiments, which show the scalability and effectiveness of SGP in a controlled setting where the polluting groups are synthetically injected and hence known. Code and more information on data generation and collection are released at http:

2. Proposed Method

We model the activity within complex online societies as an attributed graph, in which edges represent connections between the users, and attributes associated to the nodes encode the content posted by their corresponding user. The main intuition behind this model is that looking at these as a whole, i.e. how users are connected along with the content they post, enables us to infer patterns which only emerge at this higher level. In particular, at this level we would be able to detect and understand the engagement of polluting groups.

More formally, we use a directed attributed graph $G = (V, E, X)$. Here $V$ represents all the users and $E$ encodes the followership relationship between them, i.e. $A_{ij} = 1$ iff user $i$ is followed by user $j$, where $A$ is the adjacency matrix of $G$. The content posted by these users is modeled by $X$, where $X_{ij}$ shows how many times user $i$ has posted tweets using hashtag $j$. We use followership and tweets explicitly to better explain this data representation within the context of our application; however, it is straightforward to see that this model is general and can encode any type of connection between users, and the content they post, on other social media platforms. Table 1 summarizes this notation and also presents the list of all symbols we use in this paper.
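As a minimal sketch of this data representation, the snippet below builds the adjacency and attribute matrices from raw follow pairs and per-user hashtag counts. The function name, the input layout, and the choice to put the followed user on the row index are illustrative assumptions, not part of the original pipeline.

```python
import numpy as np

def build_attributed_graph(users, follows, hashtag_counts):
    """Build (A, X, tags) from raw records.

    users:          list of user ids
    follows:        iterable of (follower, followee) pairs (hypothetical input format)
    hashtag_counts: dict mapping user id -> {hashtag: number of uses}
    """
    idx = {u: i for i, u in enumerate(users)}
    tags = sorted({h for counts in hashtag_counts.values() for h in counts})
    tag_idx = {h: j for j, h in enumerate(tags)}

    A = np.zeros((len(users), len(users)), dtype=np.int8)
    X = np.zeros((len(users), len(tags)), dtype=np.int64)

    for follower, followee in follows:
        # Assumed convention: entry (i, j) is 1 iff user i is followed by user j.
        A[idx[followee], idx[follower]] = 1
    for user, counts in hashtag_counts.items():
        for tag, count in counts.items():
            X[idx[user], tag_idx[tag]] = count
    return A, X, tags
```

Dense arrays keep the sketch short; at the scale of the Twitter data described later, sparse matrices would be the more realistic choice.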

Figure 3. Resulting dense-block split given differently-sized dense blocks on adjacency matrix $A$ and attribute matrix $X$.

Given the adjacency matrix $A$ for graph $G$, and the attribute matrix $X$, we consider a polluting group to be a group of users which (permuting nodes based on them) introduces block structure on both $A$ and $X$. More specifically, we assume $A$ to be generated by a stochastic block model with a disjoint partition $\mathcal{C}$ that covers $V$, and probability matrix $P$, where $P_{kl}$ is the probability of an edge between nodes in $C_k$ and nodes in $C_l$. Let an element-wise function $f$ transform $X$ into $\bar{X}$ (e.g. by converting the non-zero elements to 1, so that $\bar{X}_{ij} = 1$ iff hashtag $j$ has been used by user $i$). We also assume $\bar{X}$ to be modeled by a bipartite stochastic block model that has the same node partition $\mathcal{C}$ as $A$, as well as an attribute partition $\mathcal{D}$. We don't require $\mathcal{D}$ to be disjoint or to cover all attributes. Based on this construction, we define polluting blocks as the subset of node partitions $C_k$ such that $P_{kk} > \tau$, where $\tau$ thresholds the edge probability.

Polluting groups do not cover all datapoints, but are in fact a few strikingly dense and small clusters. In particular, tiny clusters on the orders of with edge probability much higher than that of the background, which is very sparse. Finding these is a challenging task, since even in the decoupled format, the traditional algorithms that recover the ground truth partition for a bipartite stochastic block model could only discover clusters of size (mcsherry2001spectral). With certain assumptions on the dimension and construction of the bipartite graph, the current state of the art provably identifies clusters of size , which is still orders of magnitude greater than (NIPS2018_7643). We face a more complex problem since the partitions need to be jointly dense in both the stochastic block model for the adjacency matrix and the bipartite stochastic block model for the binarized attribute matrix.

Symbol          Definition
$G$             a directed attributed graph
$A$             adjacency matrix
$X$             attribute matrix
$Z$             embedding matrix
$f$             element-wise function that binarizes $X$
$\bar{X}$       $X$ binarized after applying $f$
$\mathcal{C}$   disjoint and covering partition of $V$
$\mathcal{D}$   partition of columns of $X$ or $\bar{X}$
$P$             edge probability matrix for partition $\mathcal{C}$
$p(S)$          edge probability for a subset of nodes $S$
$\tau$          edge probability threshold
$N(i)$          one-hop neighborhood of node $i$
$X_k$           projection of $X$ onto its top $k$ left singular vectors
$G[S]$          subgraph induced by $S$
Table 1. Symbols and Definitions

SGP is able to effectively recover these tiny clusters through an embedding-based dense block detection, which first jointly embeds the data onto Euclidean space, and then applies a centroid or density based clustering algorithm to find polluting blocks. We explain the four core components of SGP in the following:

SGP-map first jointly embeds $A$ and $X$ of $G$ into $Z$ by preserving one-hop neighborhood structural similarity and attribute information. This is done in two steps: (1) project $X$ onto its first $k$ left singular vectors using truncated singular value decomposition; (2) pass these signals as messages over $A$ by summation aggregation. Note that we relax the binarization of $X$ in step (1), because the left singular vectors of $X$ could preserve more information than those of $\bar{X}$. As shown in experiments on synthetic data, this representation is very powerful for discovering tiny clusters.

More formally: given $G$, SGP-map embeds each node $i$ as $z_i$ by summation aggregation of its one-hop neighbors' attributes projected onto the top-$k$ left singular vectors:

$$z_i = \sum_{j \in N(i)} (X_k)_j$$

In matrix form,

$$Z = A X_k$$

Given $Z$, SGP-map then applies a centroid or density based clustering algorithm to assign node embeddings to partitions $\mathcal{C}$, thus rendering a coarsened map of the data.
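The two embedding steps described above (truncated SVD of the attribute matrix, then summation over one-hop neighbors) can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, the projection is taken to be the top left singular vectors scaled by their singular values, and the clustering step is left out.

```python
import numpy as np

def project_top_k(X, k):
    """Project the attribute matrix onto its top-k left singular vectors.

    Assumption: each retained left singular vector is scaled by its singular
    value; the source text does not spell out this detail.
    """
    U, s, _ = np.linalg.svd(np.asarray(X, dtype=float), full_matrices=False)
    return U[:, :k] * s[:k]

def sgp_map_embedding(A, X, k):
    """Sum the projected attributes of each node's one-hop neighbors."""
    X_k = project_top_k(X, k)
    # Row i of the product is the sum of X_k[j] over all j with A[i, j] = 1.
    return np.asarray(A, dtype=float) @ X_k
```

A dense SVD is used only to keep the example self-contained; for large sparse matrices a truncated sparse solver would replace `np.linalg.svd`. The resulting rows can be passed to any centroid or density based clustering routine.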

SGP-flag detects polluting blocks in $G$ by defining a metric $m$ over the graph $G$ and a subset of nodes $S$, and then flags subsets whose metric is above a threshold as polluting blocks. More formally, given $G$, a disjoint covering partition $\mathcal{C}$ of $V$, and a threshold $\tau$, SGP-flag marks each $C_i \in \mathcal{C}$ as suspicious if

$$m(G, C_i) > \tau$$

If using edge probability:

$$p(C_i) > \tau$$

where $p(C_i)$ is the edge probability within $C_i$. Using this metric, the partitions returned by SGP-flag contain nodes that are connected with probability above $\tau$, and also share significantly more homogeneous attributes relative to a random sample. Besides edge probability, other metrics such as conductance can be used, but we found edge probability to work best in both synthetic and real experiments.
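A minimal sketch of the flagging step with the edge probability metric is given below. The exact formula is an assumption: edge probability is computed here as the number of directed edges inside the induced subgraph divided by the number of ordered node pairs, ignoring self-loops. Function names are illustrative.

```python
import numpy as np

def edge_probability(A, nodes):
    """Fraction of ordered node pairs inside `nodes` that are connected."""
    nodes = np.asarray(nodes)
    m = len(nodes)
    if m < 2:
        return 0.0
    sub = A[np.ix_(nodes, nodes)]
    edges = sub.sum() - np.trace(sub)  # drop self-loops
    return float(edges) / (m * (m - 1))

def sgp_flag(A, labels, threshold):
    """Return the partition ids whose internal edge probability exceeds the threshold."""
    flagged = []
    for group in np.unique(labels):
        members = np.flatnonzero(labels == group)
        if edge_probability(A, members) > threshold:
            flagged.append(int(group))
    return flagged
```

Swapping `edge_probability` for another set-level metric, such as conductance, only requires changing the function called inside the loop.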

SGP-map and SGP-flag together create a coarsened bird’s eye view of the data with anomalies flagged.

SGP-signature creates a concise, interpretable signature for each partition, both flagged and unflagged. It defines a metric over the binary attribute space and uses it to rank how informative each index of the ordered attribute set is in explaining the partition's overly homogeneous connections and attribute usage; the signature of a partition is then its most informative attribute.

More formally, given and a disjoint covering partitions of , and ordered attribute set , SGP-signature defines a metric for each partition and index of attribute set as:


this compares the local and global relative usage frequency of . The group signature for is simply derived as:


The resulting attribute is the most informative attribute that characterizes the partition. The aforementioned three components of SGP are able to identify anomalous subgroups, as well as provide a concise summarization of the data.
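To make the idea of comparing local and global relative usage concrete, here is a small sketch. The scoring formula is an assumption chosen for illustration (fraction of group members using an attribute divided by the fraction of all users using it); the metric actually used may differ. All names are hypothetical.

```python
import numpy as np

def signature_scores(X_bin, labels, group, eps=1e-12):
    """Score each attribute by local usage rate relative to global usage rate.

    Assumed metric: (share of the group's users with the attribute) divided by
    (share of all users with the attribute).
    """
    members = labels == group
    local_rate = X_bin[members].mean(axis=0)
    global_rate = X_bin.mean(axis=0)
    return local_rate / (global_rate + eps)

def group_signature(X_bin, labels, group, names):
    """Return the name of the highest-scoring attribute for the group."""
    return names[int(np.argmax(signature_scores(X_bin, labels, group)))]
```

With a ratio of this kind, a hashtag used by everyone scores close to 1 for every group, so it never becomes a signature unless nothing more distinctive exists.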


SGP-meso defines a higher-order metric to characterize the strength of partition (group) interactions. More formally, given $G$ and a disjoint covering partition $\mathcal{C}$ of $V$, SGP-meso defines a symmetrical pairwise metric over partitions:


This metric characterizes the strength of connection between a pair of partitions.
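One possible symmetric pairwise metric is sketched below, purely as an illustration: it counts edges between two groups in both directions and divides by the number of possible cross-group ordered pairs. This specific formula and the function name are assumptions, not the definition used by SGP-meso.

```python
import numpy as np

def group_interaction(A, labels, g1, g2):
    """Symmetric cross-group edge density between two partitions (assumed metric)."""
    a = np.flatnonzero(labels == g1)
    b = np.flatnonzero(labels == g2)
    if len(a) == 0 or len(b) == 0:
        return 0.0
    # Edges in both directions, so swapping g1 and g2 gives the same value.
    cross = A[np.ix_(a, b)].sum() + A[np.ix_(b, a)].sum()
    return float(cross) / (2 * len(a) * len(b))
```

Normalizing by group sizes keeps large groups from dominating the map simply because they have more members.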

These four components of SGP make it a powerful tool for joint anomaly detection and data summarization on attributed graphs. Next we demonstrate its effectiveness, scalability and interpretability with synthetic experiments and real-world data application.

3. Experiments

In this section, we first verify the effectiveness of the first two components of the proposed SGP method (SGP-map, SGP-flag) through a set of synthetic experiments with built-in ground truth, which approximate the real-world problem and enable us to provide a quantitative evaluation. Next, we discuss the observations obtained by applying all four components of SGP to real-world Twitter data and provide several pieces of evidence for the effectiveness of SGP in unveiling the dynamics of polluting groups around the 2019 Canadian federal election.

3.1. Validation on Synthetic Data

Figure 4. SGP-map recovers jointly dense blocks. Top: the embedding that separates the dense block. Bottom: the block structure in the corresponding matrices. corresponds to the jointly dense blocks in and , plotted on (ground-truth), and shows the blocks recovered by SGP-map (results).

Synthetic data model:

We use stochastic block models (holland1983stochastic) to synthesize data since they provide explicit control over dense block numbers, sizes and edge probabilities, as well as the background graph topology. To generate joint matrices, and without loss of generality, we assume the dimension of the binary attribute matrix to be the same as the total number of nodes in the graph. Then we create both the adjacency and binary attribute matrices using stochastic block models. We consider two settings, in which the dense blocks agree (simple) or disagree (hard, when blocks split into more). Figure 3 gives an example of the generated matrices.

Parameter settings:

To evaluate SGP-map and SGP-flag on synthetic graphs which correspond to the real world Twitter graph, we consider 6 settings in which a small portion of the nodes are in dense blocks and the rest are not partitioned. The exact parameters are summarized in Table 2. For example in the first case, we create a graph with 1000 nodes, in which 150 nodes are assigned to dense clusters of size 30, each with edge probability of 0.4, while the rest of the graph is sparse with edge probability of 0.005. This matches what we observe in our Twitter data, where background edge probability is low and we are interested in finding small dense clusters of size . In the blocks-disagree case, for each block, we randomly sample from dense block size and edge probability options.

For all experiments, we set number of top left singular vectors for SGP-map’s projection , and set the edge probability threshold of SGP-flag as .

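Blocks-agree
  Number of nodes                         1,000           50,500          100,000
  Background edge probability             0.005           0.005           0.005
  Nodes in dense blocks                   150             235             300
  Dense block size                        30              30              30
  Dense block edge probability            0.4             0.4             0.4
Blocks-disagree
  Number of nodes                         1,000           50,500          100,000
  Background edge probability             0.005           0.005           0.005
  Nodes in dense blocks                   150             235             300
  Dense block size options                20, 40          20, 40          20, 40
  Dense block edge probability options    0.1, 0.4, 0.7   0.1, 0.4, 0.7   0.1, 0.4, 0.7
Table 2. Parameters used for synthetic data generation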
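The first blocks-agree setting (1,000 nodes, 150 of them in dense blocks of size 30 with edge probability 0.4, background edge probability 0.005) can be generated with a sketch like the one below. The function name, the seed handling and the placement of the dense blocks on the first node indices are illustrative choices, not the released generator.

```python
import numpy as np

def planted_dense_blocks(n, n_blocks, block_size, p_dense, p_background, seed=0):
    """Sample a directed adjacency matrix with small dense blocks on a sparse background.

    Returns (A, labels), where labels[i] is the block id of node i, or -1 for
    background nodes.
    """
    rng = np.random.default_rng(seed)
    labels = np.full(n, -1)
    A = (rng.random((n, n)) < p_background).astype(np.int8)
    for b in range(n_blocks):
        idx = np.arange(b * block_size, (b + 1) * block_size)
        labels[idx] = b
        # Overwrite the within-block entries with the denser edge probability.
        A[np.ix_(idx, idx)] = rng.random((block_size, block_size)) < p_dense
    np.fill_diagonal(A, 0)  # no self-loops
    return A, labels

A, labels = planted_dense_blocks(1000, 5, 30, 0.4, 0.005)
```

A binary attribute matrix with the same block structure can be sampled the same way with a different seed; for the blocks-disagree setting, block sizes and probabilities would be drawn per block from the option lists in Table 2.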

Illustration of results:

As an example, Figure 4 shows SGP-map results on a synthetic graph of 1,250 nodes with 350 anomalies in dense blocks of sizes 50 and 100, and these blocks disagree on adjacency and attribute matrix, thus creating splits. The top row shows UMAP(mcinnes2018umap) 2-D visualization of SGP-map embeddings, colored from left to right by block partitions on adjacency, attribute, joint and inferred joint space. The bottom row shows the corresponding partitions on the attribute matrix on second column and adjacency matrix on other columns. As shown on the bottom right corner of Figure 4, SGP-map is able to capture the injected dense blocks on the split level and thus forms block-diagonal structure on adjacency matrix ordered by the inferred joint partition.

         Blocks-agree                                       Blocks-disagree
Method   ARI        NMI        F1         Time (s)          ARI        NMI        F1         Time (s)
1,000 nodes, 2,288 edges (15% nodes in dense block)
SGP      0.99±0.00  0.98±0.01  0.99±0.00  0.20±0.03         0.80±0.15  0.75±0.11  0.83±0.13  0.21±0.05
SVD (1)  0.99±0.00  0.98±0.01  0.99±0.00  0.21±0.04         0.80±0.14  0.75±0.10  0.83±0.13  0.22±0.06
SVD (2)  0.96±0.05  0.95±0.05  0.97±0.04  0.17±0.05         0.82±0.12  0.77±0.08  0.86±0.11  0.23±0.09
Fraudar  -          -          0.85±0.27  0.06±0.00         -          -          0.44±0.01  0.06±0.00
50,500 nodes, 1,277,467 edges (0.47% nodes in dense block)
SGP      0.93±0.08  0.93±0.08  0.93±0.08  15.62±1.44        0.52±0.29  0.53±0.28  0.52±0.29  28.57±9.66
SVD (1)  0.94±0.07  0.92±0.05  0.94±0.07  95.89±3.81        0.60±0.16  0.63±0.11  0.60±0.16  121.95±7.09
SVD (2)  0.27±0.17  0.00±0.00  0.27±0.17  23.23±3.98        0.29±0.19  0.29±0.17  0.29±0.19  27.32±3.10
Fraudar  -          -          0.01±0.00  15.24±0.20        -          -          0.01±0.00  15.24±0.20
100,000 nodes, 5,002,879 edges (0.25% nodes in dense block)
SGP      0.45±0.18  0.53±0.12  0.46±0.18  67.49±4.32        0.61±0.21  0.63±0.17  0.61±0.21  92.62±23.95
SVD (1)  0.58±0.13  0.62±0.09  0.58±0.13  1081.68±160.23    0.64±0.19  0.67±0.14  0.65±0.19  1143.04±74.33
SVD (2)  0.03±0.07  0.16±0.00  0.03±0.07  67.33±5.16        0.14±0.18  0.31±0.02  0.14±0.18  82.62±9.31
Fraudar  -          -          0.00±0.00  55.32±0.50        -          -          0.00±0.00  55.34±0.30
Table 3. Performance on synthetic experiments. SGP-map: we evaluate our results using the normalized mutual information (NMI) and adjusted Rand index (ARI) scores of the inferred blocks versus the ground truth. SGP-flag: we also report the F1 score on whether the inferred partitions that are above our edge probability threshold correspond to the jointly dense blocks.

Comparison with baselines:

We compare against two contender methods from the classic and deep-learning based anomaly detection methods: Fraudar, a heuristics-based dense block detection algorithm that only uses the adjacency matrix; and DOMINANT, which uses the node reconstruction loss from a graph convolutional autoencoder trained on both the adjacency and attribute matrices. We also consider two variants of SVD-based embedding: (1) applies truncated SVD directly, without the message passing step; SGP's performance is comparable with this variant but with much better scalability; (2) only uses sub-sampled nodes for truncated SVD; this provides a negligible speedup, however its performance drops significantly compared to SGP. Results are reported in Table 3.

We can see that SGP outperforms Fraudar even when not utilizing the attribute information (blocks-agree), in which case the attribute and adjacency matrices contain the same information. Results from DOMINANT are not reported in the table as it fails to converge on all the synthetic datasets. We observe that training the graph convolutional autoencoder has a high likelihood of divergence on sparse graphs with tiny dense blocks. We have also attempted using a graph autoencoder and a variational graph autoencoder (kipf2016variational) with different priors, such as unit Gaussian and mixture of Gaussians, on different graph neural net architectures including GraphSAGE (hamilton2017inductive) and graph attention network (velivckovic2017graph) for our embedding-based approach, but SGP beats their performance by a large margin, and hence the performance of these efforts is not reported. We conclude that for this specific task, SGP-map is superior to state-of-the-art deep learning based methods in terms of simplicity, scalability and performance. We find that the embeddings created by SGP-map are the most powerful representation for discovering small blocks on sparse attributed graphs.

3.2. Performance on Real-World Data

Data collection:

Starting in April 2019, we collected tweets related to the 2019 Canadian federal election through the Twitter streaming API, filtered by an evolving set of relevant hashtags based on recent political events; the final set of hashtags is shown in Table 4. We refer to this set of hashtags as the Canadian hashtags.

For each Twitter user that used any of the hashtags in table 4, we collected all usernames of his or her followers, as well as a sample of historical tweets between April and October 2019. This can be of different size for different users due to our data collection pipeline. For each user, we also tracked all hashtag usages in his or her sampled tweets, and created a feature vector where each entry is the frequency of using the corresponding hashtag.

#cdnpoli #canpoli
#cpc #SenCA
#cdnleft #pttory
#ptbloc #gpc
#crtc #goc
#BlackFaceTrudeau #TrudeauMustResign
#BlackFace #BrownFace
#ScheerLies #elexn43
#NotasAdvertised #TrudeauTheHyprocrite
#ptlib #lpc
#ndp #lavscam
#ptndp #ptgreen
#cdnsen #cpac
#CdesCom #TrudeauBlackFace
#BrownFaceTrudeau #TrudeauWorstPM
#Scheer #Andysresume
#elxn43 #elxn19
Table 4. Hashtags used for crawling the data which are related to Canadian politics and the 2019 federal election.

We only keep users that tweeted at least once using a hashtag in . Then we further filter this set to contain users who have at least 1 follower or follow at least 1 other user. The resulting directed attributed graph has nodes, edges and unique hashtags.

Data preprocessing:

Because entries of $X$ are highly skewed, with some users using significantly more hashtags than others and some hashtags significantly more popular than others, we apply doubly-normalized TF-IDF to give more importance to non-common hashtags. More specifically:


where is the total number of users, , and shows in how many users used the hashtag .
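As a rough illustration of this kind of reweighting, the sketch below applies a standard TF-IDF to a user-by-hashtag count matrix. The specific choices are assumptions for the example only (term frequency normalized by each user's total hashtag usage, inverse document frequency as the log of the number of users over the number of users who used the hashtag); the exact doubly-normalized form used here may differ.

```python
import numpy as np

def tfidf(X):
    """Standard TF-IDF over a user-by-hashtag count matrix (assumed variant)."""
    X = np.asarray(X, dtype=float)
    n_users = X.shape[0]
    row_sums = X.sum(axis=1, keepdims=True)
    # Per-user term frequency; users with no hashtags keep an all-zero row.
    tf = np.divide(X, row_sums, out=np.zeros_like(X), where=row_sums > 0)
    users_per_tag = (X > 0).sum(axis=0)
    # Hashtags used by nobody get a document count of 1 to avoid division by zero.
    idf = np.log(n_users / np.maximum(users_per_tag, 1))
    return tf * idf
```

Under this weighting, a hashtag used by every user receives weight zero, which matches the stated goal of giving more importance to uncommon hashtags.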

We consider the top 1,000 hashtags with highest mean TFIDF value to be important hashtags: , which is used later for SGP-signature. We measure each user’s engagement with Canadian politics by the ratio of at-least-once usage of Canadian hashtags:


Given the resulting matrices of user followerships and hashtags, we performed a preliminary exploratory analysis to discover anomalous patterns, including investigating how the degree of a node corresponds to the number of hashtags the user employs or to the entropy of their hashtag usage. Despite considerable effort, this type of traditional feature-based anomaly detection did not yield any significant insight, and hence is omitted from the results. This, however, underlines the importance of applying the proposed SGP method, which we discuss next.

Figure 5. SGP-map finds polluting groups of users exhibiting block-diagonal structure in both the adjacency (left) and attribute matrix (right) on the Twitter data.
Figure 6. SGP-map puts users of the same political creed close together. Here nodes are the individual users, size of the nodes corresponds to the level of engagements of that particular node in the Canadian politics. Nodes are colored the same if they belong to the same micro-cluster.

Applying SGP-map

Compared with synthetic data, some of the users in the Twitter data don't have any followers, and naive aggregation of signals would set their embeddings to 0. We therefore relax SGP-map by message passing on both $A$ and $A^\top$ and then concatenating the results to form the final node embedding:


Quality of the SGP-map

We first visualize a subset of the blocks identified by SGP-map in Figure 5, which shows a clear block structure for both $A$ (block-diagonal) and $X$ on the indices induced by these blocks. This indicates SGP-map's ability to discover tightly connected user groups engaging with different sets of hashtags.

Next, we visualize the node embeddings produced by SGP-map using a UMAP projection to 2D in Figure 6. The size of the point for each user corresponds to that user's level of engagement in Canadian politics. Background nodes, i.e. those that do not get clustered into small clusters, are plotted in a lighter shade of grey. Overlaid on each colored cluster of users is the most descriptive hashtag for that cluster, created by SGP-signature. We observe that SGP-map embeds groups with similar political creeds close to each other, thus forming an informative map of Twitter: the top middle is occupied by American conservative groups, indicated by #KAG; the center by international groups, signified by #Chinese, #Iranian and #Paris; the top right by pro-Scheer and anti-Trudeau groups; the middle right by anti-Scheer groups; the middle left by climate activist groups, evidenced by #climate and #AmazonRainforest; etc.

Figure 7. SGP-flag results corroborate the suspensions made by Twitter. Text size corresponds to the ratio of suspended users within each detected group, whereas the size of the node corresponds to the group's level of engagement in Canadian politics, similar to fig. 2. Groups colored red are the top 13 groups with the highest ratio of suspended users.

Quality of the SGP-flag

Given the partitions created by SGP-map, SGP-flag sets an edge probability threshold to flag a subset of them as polluting. The block-diagonal adjacency matrix induced by the resulting 13 polluting blocks is visualized in Figure 1, where we observe siloed blocks such as #7, as well as interacting ones, #4 and #3, which are likely American conservative groups. Another observation is that the American conservative block #3, with four suspended users, interacts with two smaller blocks with the hashtag signatures #LavScam and #Scheer4PM, which are likely Canadian pro-Scheer and anti-Trudeau groups. This could be considered a potential foreign influence on the Canadian 2019 Election, which is concerning.

These 13 polluting blocks are marked as red in Figure 2, where we plot background nodes in the largest cluster similarly as in Figure 6; we also plot each other cluster as a colored point, with size proportional to


which is an indicator of its engagement with Canadian politics on a group level. We can see that the 13 polluting blocks are highly engaged with Canadian politics, evidenced by their node size, and are close to each other in the embedding space.

We observe that within these 13 polluting blocks, 9 out of 327 users have been suspended in the past 6 months, a rate over 4 times higher than that of the other users; in total, 429 out of 69,709 users in our data were suspended. Many users in these 13 polluting blocks who are highly similar to the suspended accounts have not yet been identified and suspended by Twitter. However, SGP-flag is able to spot these users, who behave in the same manner as the suspended ones.
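The "over 4 times" figure can be checked directly from the counts quoted above. The short sketch below compares the suspension rate inside the flagged blocks with the rate among all remaining users; the function name is illustrative.

```python
def suspension_lift(susp_in, users_in, susp_total, users_total):
    """Ratio of the suspension rate inside the flagged blocks to the rate outside them."""
    rate_in = susp_in / users_in
    rate_out = (susp_total - susp_in) / (users_total - users_in)
    return rate_in / rate_out

# Counts quoted in the text: 9 of 327 flagged users, 429 of 69,709 users overall.
lift = suspension_lift(9, 327, 429, 69_709)
print(f"in-block rate:  {9 / 327:.4f}")
print(f"lift vs. other users: {lift:.2f}x")
```

The in-block rate is about 2.8%, against roughly 0.6% for the other users, which gives a lift of about 4.5.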

Quality of the SGP-signature

We apply SGP-signature to the set of important hashtags and return the signature defined in Equation 6, which characterizes the political creed of the users in each group. SGP-signature highlights the signature hashtag for each group in the Twitter maps of Figures 6 and 2. This characterizes the results in a concise manner, and explains the complex structure through which these groups are engaged in Canadian politics.

Contrasting fig. 7 and fig. 2, we observe that the signatures captured by SGP-flag on detected polluting blocks overlap with 8 out of 13 top suspended blocks’ signatures, including #Iranian, #KAG2020, #notAbot, #TrudeauMustGo and #Scheer4PM. This makes it a useful tool for spotting suspicious groups on social platforms. We also verify that two of these hashtags discovered by SGP-signature are known to be linked to misinformation campaigns (Orr19; emma19).

In particular, Figure 8 shows the relative group usage frequency for both Canadian hashtags and important hashtags for polluting block #3 detected by SGP-flag, which has the most suspended users. While block #3 is primarily an American conservative group judging by the important hashtags, a concerning observation emerges when zooming in on the Canadian hashtags: this block of users has significantly more engagement with one specific Canadian hashtag, #TrudeauMustGo, which has very recently been found to be related to a trending misinformation campaign against the Canadian 2019 Election (Orr19). Furthermore, another signature picked up by SGP-signature (#NotAbot) in the polluting blocks identified by SGP-map and SGP-flag has also been reported to be part of an information campaign (Orr19).

These two hashtags have so far been the primary hashtags used against the 2019 Canadian election, and both were detected before mainstream media coverage. This makes SGP-signature a powerful tool to assist in detecting trending misinformation campaigns before they make a significant mark.

Figure 8. SGP-signature finds informative hashtag in both important hashtag set (left) and Canadian hashtag set (right) for polluting block #3. Plotted for both sides are top 10 hashtags that are used by block #3 more often on a group basis than the background.

Quality of the SGP-meso

SGP-meso quantifies the strength of connection among all clusters, and thus enables the study of their spread and potential success. In figure 2, link between two clusters is plotted with line width proportional to their interaction defined in Equation 7; those that are connected to detected 13 polluting blocks are colored red, and other links are plotted as green. For any cluster with greater than a threshold, we overlay its SGP-signature signature on top.

One concerning observation from the SGP-meso result is the obvious existence of a large international right-wing group spanning America and Canada that actively engages with and potentially influences the Canadian 2019 Election. Their signature hashtags include #Scheer4PM, #KAG2020, #TrudeauMustGo, #LiberalsMustGo, etc. Another such concerning observation is the existence of the aforementioned signature #notAbot, where almost all groups with this signature are linked with the 13 detected polluting blocks. Such homogeneity could be a sign of a misinformation campaign.

A less concerning but still interesting observation emerges when inspecting Figures 2 and 7 side by side: SGP-flag identifies one out of three blocks with the signature #Iranian, and the top suspension ratio reveals the other two blocks with this signature. Yet there are no significant connections going from these three groups to other parts of the graph. Inspection of the users' tweets in these clusters reveals that the accounts in these blocks are primarily concerned with immigration issues and were mostly created in February 2019, right before the passing of Bill 21, a bill in Quebec that sets out a framework for a values test for skilled workers, which impacts immigration. The observed strong connection within a set of groups, with weak or no connection to other parts of the graph, could be a sign of a failed amplifying strategy.

Scalability of SGP

Here we analyze the time complexity of SGP.

The bulk of the computation done by SGP consists of:

  • Computing truncated singular value decomposition of a large sparse attribute matrix . denotes number of nonzero entries in a sparse matrix. For large sparse matrices, augmented Lanczos bidiagonalization algorithm (butler2018integrating) is the default option for computing truncated singular value decomposition, whose time complexity is , where is nonzero entries in , is number of iterations, is number of top left singular vectors to project and is constant. When , the time complexity is approximately .

  • Computing sparse matrix product between and , whose time complexity is for standard implementation, as mentioned above.

Thus the resulting complexity of SGP is . Assuming and are in the same range, SGP scales linearly with the number of edges in . It can analyze a realistic graph of size 100,000 in less than 2 minutes on a normal laptop.
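To make the linear-scaling argument concrete, the sketch below shows why iterative SVD solvers cost roughly (number of nonzeros) × (number of iterations): each iteration only needs sparse matrix-vector products, and each product touches every stored nonzero once. Note that this uses plain power iteration for the top singular value, swapped in for simplicity; it is not the augmented Lanczos bidiagonalization algorithm cited above, and the triplet storage format and all function names are assumptions made for illustration.

```python
import math


def matvec(entries, x, n_rows):
    """y = A x for A stored as (row, col, value) triplets; one pass over the nonzeros."""
    y = [0.0] * n_rows
    for r, c, v in entries:
        y[r] += v * x[c]
    return y


def rmatvec(entries, y, n_cols):
    """x = A^T y; again one pass over the nonzeros."""
    x = [0.0] * n_cols
    for r, c, v in entries:
        x[c] += v * y[r]
    return x


def top_singular_value(entries, n_rows, n_cols, iterations=50):
    """Power iteration on A^T A. Returns (estimate of sigma_1, nonzeros touched)."""
    x = [1.0 / math.sqrt(n_cols)] * n_cols  # unit-norm starting vector
    touched = 0
    sigma = 0.0
    for _ in range(iterations):
        y = matvec(entries, x, n_rows)
        x = rmatvec(entries, y, n_cols)
        touched += 2 * len(entries)  # two sparse products per iteration
        norm = math.sqrt(sum(v * v for v in x))
        if norm == 0.0:
            break
        sigma = math.sqrt(norm)  # ||A^T A x|| approaches sigma_1^2 for unit x
        x = [v / norm for v in x]
    return sigma, touched


# Toy 2x2 matrix diag(3, 2): two nonzeros, top singular value 3.
entries = [(0, 0, 3.0), (1, 1, 2.0)]
sigma, touched = top_singular_value(entries, n_rows=2, n_cols=2, iterations=50)
```

The work counter grows as 2 × nnz × iterations regardless of the matrix dimensions, which is the behaviour the complexity analysis relies on. In practice one would use an optimized sparse solver rather than this loop.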

4. Conclusions

In this paper we presented SGP, which discovers, characterizes and explains polluting groups in social media platforms. SGP is:

  • Holistic: SGP jointly models users' connections and content, i.e. it exploits both attribute and structural information, and is hence able to uncover high-level patterns and provide a bird's eye view of the activities in a social platform.

  • Effective: SGP performs better than both classic and deep-learning based baselines by a large margin. It is also able to identify polluting groups that are over 4 times more likely to get suspended.

  • Interpretable: As shown in Figure 2, the four components of SGP give a concise summary of the complex interaction dynamics of normal and anomalous groups on Twitter, and are able to capture groups engaging in misinformation campaigns, as well as how they interact with each other, before mainstream media does.

  • Scalable: SGP scales linearly with the number of edges under reasonable assumptions.


We will open-source our code after acceptance.

5. Background and Related Work

This paper builds upon many prior works. Here we briefly discuss the papers most relevant to understanding social media ecosystems, detecting anomalous users (e.g. bots, trolls), and how such users interact with democracy and politics. We also highlight the papers most relevant to the methodology used to embed graph information (e.g. graph representation learning) and to detect coordinated activity on these platforms (e.g. community detection, anomaly detection).

5.1. Online Misinformation

According to recent work on network theory (badawy2018analyzing; mitchell_gottfried_kiley_matsa_2014; wilson2018assembling), the vulnerability of social networks to information operations relates to the clustering of politicised online information spaces. This phenomenon, known as "echo chambers", describes the gathering of like-minded individuals into online communities. As illustrated by Marwick et al. (marwick2017media), the defiance of part of the population toward traditional media leads to the emergence of alternative (possibly biased or fake) news sources. Analysing the processes by which fake news spreads on Twitter, Bovet et al. (bovet2019influence) showed that trolls tend to map into small, politically biased networks of users, fed with multiple information sources, thus making normal users sensitive to hostile influence. Indeed, Stewart et al. (stewart2018examining) identified trolls as polarising elements of echo chambers, distorting the information space. Therefore, we hypothesise that trolls ground their activity mainly within small networks of densely connected users, while being active in politicised discussions to further polarise edge individuals.

5.2. Community Detection

Community detection in networks is the task of finding community structures within graphs. In particular, the problem involves a set of node classification tasks that attempt to ascertain an underlying clustered, segmented and relatively dense structure within a graph. Traditional approaches like modularity maximization, where modularity compares the number of edges inside the identified communities to the number expected in an unorganized graph, suffer from a resolution limit that can hide small communities, and do not scale well to contemporary social networks (lancichinetti2011limits). More recent research has identified the need for combining the underlying structure of the nodes within the network with their inherent attributes (liu2015community). Liu et al. (liu2015community) adopt the paradigm that a network graph results from interactions among nodes, and introduce the idea of content and influence propagation via random walks, analyzing the stable structure of this dynamical system to identify communities. Jia et al. (jia2017node) enhance node attributes by running k-nearest neighbors on the graph a priori and appending this information to node representations, demonstrating that this alleviates graph sparsity issues and improves the performance of community identification algorithms. Most recently, graph neural networks (GNNs) extend the convolutional neural network framework to graph structures by leveraging affine transformations of graph operators and node-wise or edge-wise activation functions. Chen et al. (chen2017supervised) introduce a new family of GNNs which rely on a non-backtracking graph operator defined on the line graph of edge adjacencies, facilitating scalable inference of communities on large, sparse graphs. A separate line of work aims to detect tiny communities. Neumann proposed an elegant and simple algorithm for provably finding community of size in a bipartite graph generated by a bipartite stochastic blockmodel, by first clustering left-side nodes based on the similarity of their neighbors, and then recovering the right-side partition by degree thresholding (NIPS2018_7643).
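As a concrete illustration of the modularity score mentioned above, the following minimal sketch computes the standard Newman-Girvan modularity of a given partition of an undirected, unweighted graph. It only evaluates a partition; it does not perform the maximization, and the function and variable names are chosen here for illustration.

```python
from collections import defaultdict


def modularity(edges, community):
    """Newman-Girvan modularity of an undirected, unweighted graph.

    edges: list of (u, v) pairs, each undirected edge listed once, no self-loops.
    community: dict mapping node -> community label.
    """
    m = len(edges)
    if m == 0:
        return 0.0
    internal = defaultdict(int)    # edges with both endpoints in community c
    degree_sum = defaultdict(int)  # total degree of the nodes in community c
    for u, v in edges:
        degree_sum[community[u]] += 1
        degree_sum[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    # Observed fraction of internal edges minus the fraction expected at random.
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)


# Two triangles joined by a single bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
merged = {node: "all" for node in range(6)}
q_split = modularity(edges, split)
q_merged = modularity(edges, merged)
```

Splitting the graph at the bridge gives a modularity of 5/14, while placing all nodes in one community gives 0, which matches the intuition that the two triangles are the natural communities.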


Table 5. SGP matches all specs, while competitors miss one or more of the features. Methods compared: Fraudar (hooi2016fraudar), Dominant (ding2019deep), DeepFD (8594881), EigenSpoke (DBLP:conf/pakdd/PrakashSSMF10), GNN-based methods, and pcv (NIPS2018_7643).

5.3. Anomaly Detection

Anomaly detection is a well-researched problem, but many techniques are not applicable to the extremely sparse graphs with a large set of nodes that characterize most modern social networks. ODDBALL is a classical approach that defines several metrics surrounding the density, weight, rank and eigenvalues associated with anomalous subgraphs, and computes these measures to identify anomalous blocks (akoglu2010oddball). Another approach, presented in (DBLP:conf/pakdd/PrakashSSMF10), detects persistent patterns, called EigenSpokes, which are found in large sparse social graphs. When the singular vectors of these graphs are plotted against each other (in so-called EE-plots), clear, separate lines or spokes that often align with the axes (EigenSpokes) are detected. EE-plots are indicative of fundamental clustering structures within these graphs. Alternatively, matrix factorization approaches have been extremely prolific in the anomaly detection literature. For example, Tong and Lin adapt non-negative matrix factorization (NMF) by enforcing constraints to identify anomalies in the residual graph after typical factorization, thereby capturing anomalies in the original whole graph (tong2011non). In line with recent advances involving deep learning, a major contribution in anomaly detection is DeepFD, an architecture developed by Wang et al. based on graph embeddings of both attributed and topological graphs (wang2018deep). Their work preserves graph structure and user behavior in order to improve adversarial robustness to fraudsters within networks of interest. With the popularization of graph neural networks, a graph convolutional autoencoder that encodes and decodes both the adjacency and attribute matrices is used to rank nodes by their degree of anomaly, indicated by node reconstruction error (ding2019deep).
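To illustrate the kind of local density metric that egonet-based approaches such as ODDBALL start from, the sketch below computes, for every node, the number of neighbors and the number of edges in its egonet (the node, its neighbors, and all edges among them). This is only the feature-extraction step under simplifying assumptions (undirected, unweighted graph); the scoring of anomalies is omitted, and the function name is hypothetical.

```python
from collections import defaultdict


def egonet_features(edges):
    """Return {node: (num_neighbors, num_egonet_edges)} for an undirected graph."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    features = {}
    for node, nbrs in adj.items():
        # Edges from the ego to each of its neighbors.
        ego_edges = len(nbrs)
        # Edges among the neighbors themselves (each one is seen twice).
        among = sum(len(adj[n] & nbrs) for n in nbrs) // 2
        features[node] = (len(nbrs), ego_edges + among)
    return features


# A star centered on node 0, and a separate 4-clique on nodes 4..7.
star = [(0, 1), (0, 2), (0, 3)]
clique = [(4, 5), (4, 6), (4, 7), (5, 6), (5, 7), (6, 7)]
features = egonet_features(star + clique)
```

Node 0 and node 4 both have three neighbors, but the clique member's egonet contains twice as many edges. Nodes whose egonets are unusually dense (near-cliques) or unusually sparse (near-stars) for their neighbor count are the kind of outliers such metrics are designed to surface.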

5.4. Dense Block Detection

The kind of lockstep behavior exhibited by agents who engage in information operations induces dense subgraphs within the larger graph (shin2016m; shin2016corescope; shin2017d). While such blocks are locally dense, they are often extremely sparse relative to the entire graph, and thus detection can be a difficult task. M-Zoom is a classical approach to this problem, which iteratively finds and removes dense blocks to prevent duplicate block querying (shin2016m). Shin et al. take an offline approach to the task in D-Cube, facilitating distributed, fast detection of dense blocks with provable guarantees on the accuracy of the identified blocks (shin2017d).
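The idea of searching for a locally dense block can be illustrated with the classical greedy peeling heuristic for a single dense subgraph, sketched below. This is not M-Zoom or D-Cube themselves; it is a simplified stand-in on a plain undirected graph, using average degree (edges divided by nodes) as the density measure, with names chosen for illustration.

```python
def densest_subgraph(edges):
    """Greedy peeling: repeatedly drop the lowest-degree node and keep the
    densest intermediate subgraph, where density = |E| / |V|."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    num_edges = sum(len(nbrs) for nbrs in adj.values()) // 2
    best_nodes = set(adj)
    best_density = num_edges / len(adj) if adj else 0.0
    while len(adj) > 1:
        # Remove the node with the fewest remaining neighbors.
        victim = min(adj, key=lambda n: len(adj[n]))
        for nbr in adj.pop(victim):
            adj[nbr].discard(victim)
            num_edges -= 1
        density = num_edges / len(adj)
        if density > best_density:
            best_nodes, best_density = set(adj), density
    return best_nodes, best_density


# A 4-clique on nodes 0..3 with a sparse tail 3-4-5-6 attached.
clique = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
tail = [(3, 4), (4, 5), (5, 6)]
block, density = densest_subgraph(clique + tail)
```

On this toy graph the peeling removes the tail nodes one by one and returns the 4-clique, whose density of 1.5 exceeds that of the whole graph. Finding several blocks would require removing the detected block and repeating, in the spirit of the iterative find-and-remove strategy described above.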

5.5. Graph Representation Learning

Graph representation learning aims to embed nodes, subgraphs, or whole graphs into Euclidean space. This type of learning is inherently difficult, as graphs are combinatorial structures with discrete nodes and edges. Thus, conventional learning modalities like neural networks often fail on these learning tasks, as they rely on continuous representations of data. In particular, unsupervised graph representation learning is interesting because most graphs are not fully specified; connections between nodes within our data are often hidden or unknown, particularly in large-scale graph structures such as social media networks. To this end, Kipf and Welling (kipf2016variational) develop the variational graph autoencoder, which uses a graph convolutional network as an encoder that parameterizes a Gaussian latent distribution. A decoder network then reconstructs the full graph, and the authors demonstrate that such reconstruction from the latent embeddings predicts unseen or masked links in the original network with good accuracy. Moreover, their framework can be trained end to end through classical variational inference. More recently, Hamilton et al. demonstrate greatly improved performance through their more general GraphSAGE approach (hamilton2017inductive). GraphSAGE is able to perform inductive learning and generate node embeddings for previously unseen graphs. Critically, even in settings where node attributes are not made explicitly available, GraphSAGE can be extended by computing additional node features, such as degree, from the network topology and substituting these for the node attributes.
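The degree-as-feature idea and the neighborhood aggregation at the heart of GraphSAGE can be sketched as follows. This is a deliberately stripped-down illustration: it performs one mean-aggregation step and concatenation, and omits the learned weight matrices, nonlinearity, and neighbor sampling of the actual method. The function names are hypothetical.

```python
def degree_features(adj):
    """Use node degree as a one-dimensional feature when no attributes exist."""
    return {node: [float(len(nbrs))] for node, nbrs in adj.items()}


def mean_aggregate(adj, features):
    """One GraphSAGE-style step without learned weights: concatenate each
    node's own features with the mean of its neighbors' features."""
    out = {}
    for node, nbrs in adj.items():
        dim = len(features[node])
        if nbrs:
            mean = [sum(features[n][d] for n in nbrs) / len(nbrs) for d in range(dim)]
        else:
            mean = [0.0] * dim  # isolated node: no neighborhood signal
        out[node] = features[node] + mean
    return out


# Toy adjacency lists: "a" is connected to "b" and "c"; "d" is isolated.
adj = {"a": ["b", "c"], "b": ["a"], "c": ["a"], "d": []}
h0 = degree_features(adj)
h1 = mean_aggregate(adj, h0)
```

Because the update for a node depends only on its local neighborhood, the same function can be applied to nodes or graphs not seen before, which is the sense in which such methods are inductive.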