Adversarial Socialbot Learning via Multi-Agent Deep Hierarchical Reinforcement Learning

by   Thai Le, et al.
Penn State University
University of Warwick

Socialbots are software-driven user accounts on social platforms that act autonomously (mimicking human behavior) with the aim of influencing the opinions of other users or spreading targeted misinformation for particular goals. As socialbots undermine the ecosystem of social platforms, they are often considered harmful. As such, there have been several computational efforts to automatically detect socialbots. However, to the best of our knowledge, the adversarial nature of these socialbots has not yet been studied. This raises the question: "can adversaries, controlling socialbots, exploit AI techniques to their advantage?" In answer, we demonstrate that adversaries can indeed exploit computational learning mechanisms such as reinforcement learning (RL) to maximize the influence of socialbots while avoiding detection. We first formulate adversarial socialbot learning as a cooperative game between two functional hierarchical RL agents. While one agent curates a sequence of activities that can avoid detection, the other agent aims to maximize network influence by selectively connecting with the right users. Our proposed policy networks train on a vast number of synthetic graphs and generalize better than baselines on unseen real-life graphs, both in terms of maximizing network influence (up to +18%) and sustainable stealthiness (up to +40% undetectability) under a strong bot detector (with 90% detection accuracy). During inference, the complexity of our approach scales linearly, independent of a network's structure and the virality of news. This makes our approach a practical adversarial attack when deployed in a real-life setting.




1. Introduction

Socialbots refer to automated user accounts on social platforms that attempt to behave like real human accounts, often controlled by automated software, humans, or a combination of both–i.e., cyborgs (Cresci, 2020). Different from traditional spambots, which may not have proper profiles or can be easily differentiated from regular accounts, socialbots often mimic the profiles and behaviors of real-life users by using a stolen profile picture or biography, building legitimate followships, replying to others, etc. (Cresci, 2020). Socialbots are often blamed for spreading divisive messages–e.g., hate speech, disinformation, and other low-credibility content, which have been shown to widen political divides and distrust among both online and offline communities (Cresci, 2020; Hindman and Barash, 2018; Le et al., 2020). To mitigate such harmful proliferation of socialbots, there has been extensive research, most of which focuses on how to effectively detect them (Mou and Lee, 2020; Dong and Liu, 2018; Yang et al., 2020). However, these works usually follow a cat-and-mouse game in which they passively wait for socialbot evasion to happen before they can react and develop a suitable detector (Cresci et al., 2021). Instead of following such a reactive scheme, proactively modeling socialbots and their adversarial behaviors on social platforms can better advance the next generation of bot detection research.

In particular, we pose the question “Can socialbots exploit computational learning mechanisms such as reinforcement learning to their advantage?” To the best of our knowledge, the adversarial nature of socialbots has not yet been fully explored and studied. However, it is plausible that adversaries who own a farm of socialbots operate them according to certain strategies (or algorithms). Therefore, proactively simulating such a computational learning mechanism and better understanding the adversarial aspect of socialbots would greatly benefit future research on socialbot detection.

In general, a socialbot has two main objectives that are adversarial in nature: (i) to facilitate mass propaganda propagation through social networks and (ii) to evade and survive under socialbot detectors. The first goal can be modeled as an NP-Hard influence maximization (IM) problem (Kempe et al., 2003) where the bot needs to build up its network of followers–i.e., seed users, over time such that any new messages propagated from the bot through these users can effectively spread out and influence many other people. Simultaneously, it also needs to systematically constrain its online behaviors such that it will not easily expose itself to socialbot detectors. Although the IM problem has been widely studied (Kempe et al., 2003; Chen et al., 2009; Kingi et al., 2020; Kamarthi et al., 2020), these works only focus on maximizing the network influence given a fixed and static budget on the number of seed nodes (that is relatively small), and they assume that every node is equally acquirable. However, these assumptions are not practical in our context. Not only does a socialbot need to continuously select the next best seed node or follower over a long temporal horizon–i.e., a potentially large budget of seed nodes, it also needs to consider that gaining the followship of a very influential actor–e.g., Elon Musk, is practically much more challenging than that of a normal user. At the same time, a socialbot that optimizes its network of followers must also refrain from suspicious behaviors–e.g., constantly following others, that can attract the attention of bot detectors. Thus, learning how to navigate a socialbot is a very practical yet challenging task with two intertwined goals that cannot be separately optimized. Toward this challenge, in this paper, we formulate the Adversarial Socialbot Learning (ASL) problem and design a multi-agent hierarchical reinforcement learning (HRL) framework to tackle it. Our main contributions are as follows.

  • First, we formulate a novel ASL problem as an optimization problem with constraints.

  • Second, we propose a solution to the ASL problem by framing it as a cooperative game of two HRL agents that represent two distinctive functions of a socialbot, namely (i) selecting the next best activity–e.g., tweet, retweet, reply, mention, and (ii) selecting the next best follower. We carefully design the RL agents and exploit unsupervised graph representation learning to minimize the potential computational cost resulting from a long time horizon and a large graph structure.

  • Third, we demonstrate that such RL agents can learn from synthetic graphs yet generalize well on real unseen graphs. Specifically, our experiments on a real-life dataset show that the learned socialbot outperforms baselines in terms of influence maximization while sustaining its longevity by continuously evading a strong black-box socialbot detector of 90% detection accuracy. During inference, in addition, the complexity of our approach scales linearly and is independent of a network’s structure and the virality of news.

  • Fourth, we release an environment built on OpenAI’s Gym library (Brockman et al., 2016). This enables researchers to simulate various adversarial behaviors of socialbots and develop novel bot detectors in a proactive manner.

2. Related Work

2.1. Socialbots Detection

The majority of previous computational works on socialbots within the last decade (Mou and Lee, 2020; Yang et al., 2020; Dong and Liu, 2018; Rodríguez-Ruiz et al., 2020; Sayyadiharikandeh et al., 2020; Cai et al., 2017; Wu et al., 2021) primarily focus on developing computer models to effectively detect bots on social networks (Cresci, 2020; Cresci et al., 2021). These models are usually trained on a ground-truth dataset using supervised learning algorithms–e.g., Random Forest, Decision Tree, SVM, to classify an individual social media account into a binary label–i.e., bot or legitimate (Cresci, 2020). Moreover, these learning algorithms usually depend on either a set of engineered statistical features such as the number of followers, tweeting frequency, etc. (Yang et al., 2020; Sayyadiharikandeh et al., 2020; Cresci et al., 2017b), or a deep learning network where the features are automatically learned from unstructured data such as an account’s description text. Even though there are many possible features that can be used to detect socialbots, statistical features that can be directly extracted from user metadata provided by official APIs–e.g., Twitter API, are more practical due to their favorable computational speed (Yang et al., 2020). In fact, many of the features utilized by the popular socialbot detection API Botometer fall into this category. Moreover, we later show that using simple statistical features derived from user metadata can help train a socialbot detector with around 90% prediction accuracy on a hold-out test set (Sec. 3.1). Regardless of how socialbot detectors extract their predictive features, they are mainly designed following a reactive schema where they learn how to detect socialbots after they appear (thus a training dataset can be collected).

2.2. Adversarial Socialbot Learning

While previous works help us better understand the detection aspect of socialbots, their learning aspect has not been widely studied (Cresci et al., 2021). Distinguished from learning how to detect socialbots using a stationary snapshot of their features, ASL computationally models the adversarial learning behaviors of socialbots over time. To the best of our knowledge, relevant works on this task are limited to (Cresci et al., 2019). This work adopts an evolutionary optimization algorithm to find different adversarial permutations of a fixed socialbot’s encoded activity sequence–e.g., “tweet→tweet→retweet→reply,…”, and examines whether such permutations can help improve the detection accuracy of a bot detector. However, such permutations, even though adversarial in nature, are just static snapshots of a socialbot and do not tell the whole story of how the bot evolves. In other words, we still lack a general computational framework that models the temporal dynamics of socialbots and their adversarial behaviors. Therefore, this paper aims to formally formulate their behaviors as a Markov Decision Process (MDP) (Howard, 1960) and design an RL framework to train socialbots that can optimize their adversarial goals on real-life networks.

We investigate two adversarial objectives of a socialbot: influencing people while evading socialbot detection. While the first one can be modeled as an IM task on graph networks, traditional IM algorithms–e.g., (Kempe et al., 2003; Chen et al., 2009; Kingi et al., 2020), assume that the number of seed nodes is relatively small and that all nodes are equally acquirable, neither of which is applicable in the socialbot context as previously described. There have also been a few works–e.g., (Li et al., 2019a; Tian et al., 2020), that apply RL to the IM task. Yet their scope is still limited to a single constraint on the budget number of seeds. Influence maximization under a temporal constraint–i.e., where being detected leads to early termination, is a non-trivial problem.

3. Problem Formulation

3.1. Social Network Environment

Network Representation and Influence Diffusion Model A social network includes users, their interactions and how they influence each other. We model this network as a directed graph . An edge between two users , denoted as , means can have influence on . also illustrates a piece of news can spread from to –i.e, follows (thus influences ).

To model the influence flow through the network, we adopt the Independent Cascade Model (ICM) (Goldenberg et al., 2001a, b), which is the most commonly used in the context of a social network (Kimura and Saito, 2006; Jendoubi et al., 2017; Li et al., 2017). In ICM, a node is either active or inactive. Once a node is activated, it has a single opportunity to activate or influence its inactive neighbors with a uniform activation probability . At first, every node is inactive except a set of seed nodes . After that, as the environment rolls out throughout a sequence of discrete timesteps, the influence will propagate from through the network by activating different nodes in following and . The process ends when no additional nodes are activated (Kamarthi et al., 2020; Li et al., 2021). Hence is also the virality of news–i.e., how fast a piece of news can travel through . We then use to denote the social network .

Let denote the spread function that measures how many nodes in a piece of information–e.g., fake news, can spread to from via the ICM. Given a fixed network structure and the news virality , different will result in different values of . Hence, selecting a good is decisive in optimizing the spread of influence on . However, choosing to maximize has already been proven to be an NP-Hard problem (Kempe et al., 2003).
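To make the diffusion process concrete, the following is a minimal sketch of one Independent Cascade rollout and a Monte Carlo estimate of the spread of a seed set. The function names, the adjacency-dictionary format, the toy graph, and the choice to count the seed nodes themselves in the spread are all assumptions made for illustration; this is not the paper's implementation.

```python
import random
from collections import deque


def simulate_icm(followers, seeds, p, rng):
    """Run one Independent Cascade rollout and return the set of activated nodes.

    followers[u] lists the nodes that u can influence (i.e., u's followers).
    """
    active = set(seeds)
    frontier = deque(seeds)
    while frontier:
        u = frontier.popleft()
        # A newly activated node gets a single chance to activate each
        # inactive neighbor, with the same (uniform) probability p.
        for v in followers.get(u, ()):
            if v not in active and rng.random() < p:
                active.add(v)
                frontier.append(v)
    return active


def estimate_spread(followers, seeds, p, n_runs=1000, seed=0):
    """Monte Carlo estimate of the expected number of activated nodes.

    Assumption: the seed nodes themselves are counted as activated.
    """
    rng = random.Random(seed)
    total = sum(len(simulate_icm(followers, seeds, p, rng)) for _ in range(n_runs))
    return total / n_runs


# Hypothetical star-like toy graph: node 0 is followed by 1..4, node 1 by 5.
toy_graph = {0: [1, 2, 3, 4], 1: [5]}
print(estimate_spread(toy_graph, seeds=[0], p=0.5))
```

With an activation probability of 1 the cascade reduces to plain reachability, and with probability 0 only the seeds are active, which gives two easy sanity checks for any implementation of the spread function.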

Socialbots. A socialbot is then a vertex in the network that attempts to mimic human behaviors for various aims–e.g., spreading propaganda or low-credibility content through the network (Shao et al., 2018; Cresci, 2020; Subrahmanian et al., 2016). It carries out a sequence of activities to simultaneously achieve two main objectives:

  1. Obj. 1: Optimizing its influence over the network by selectively collecting good seed nodes–i.e., followers, over time

  2. Obj. 2: Evading bot detectors–i.e., not being detected and removed

These two goals are often in tension in that improving Obj 1 typically hurts Obj 2 and vice versa. That is, while having a good network of followers enables a socialbot to spread disinformation to a large number of users at any time, having high undetectability helps it sustain this advantage over a long period. Socialbots are usually deployed in groups, and later-coming socialbots can easily inherit the previously established network of followers of a current one. If a bot is detected and removed from the network, not only can it lose its followers and expose itself to be used to develop stronger detectors, it can also risk revealing the identity of other bots–e.g., by way of guilt-by-association (Wang et al., 2017). This makes the sustainability achieved through Obj 2 distinctly more important than in previous literature–e.g., (Kamarthi et al., 2020; Kempe et al., 2003; Wen et al., 2017), where optimizing network influence plays the central role.

Feature | Description
#tweets | # of tweets posted by the user
#replies | # of replies posted by the user
#retweets | # of retweets posted by the user
#avg.tweets | average # of tweets posted per timestep
#avg.replies | average # of replies posted per timestep
#avg.retweets | average # of retweets posted per timestep
#retweet.ratio | #retweets / #tweets
#replies.ratio | #replies / #tweets
#retweet.replies.ratio | #retweets / #replies
#mentions.ratio | # of unique mentions posted per tweet
Table 1. Predictive features of the socialbot detector .

Relationship between and . denotes the activity sequence–i.e., the DNA of the bot (Cresci et al., 2017a). includes four possible types of actions to be made at every timestep , namely tweet, retweet, reply or mention, only the last three of which can directly interact with others to expand . Although these actions are specific to the Twitter context, other platforms provide similar functions. In practice, not every node requires an equal effort to convert into a follower. For example, a bot needs to accumulate its reputation over time and interact more frequently to convert an influencer–e.g., Elon Musk, rather than a normal user, into a follower. Since the real model underlying this observation is unknown, we model it using a simple heuristic:


where with hyper-parameter , is the number of times the socialbot is required by the environment to continuously interact with an influencer –i.e., high , for it to become a follower at . Intuitively, a bot with a good reputation over time–i.e., a high number of followers at the timestep –i.e., , can influence others to follow it more effortlessly than a newly created bot. Overall, encodes when and what type of interaction–i.e., retweet, reply or mention, to use to acquire a new follower , then decides the frequency of such interaction in . Thus, and are temporally co-dependent.

Socialbot Detection Model. Bot detectors are responsible for detecting and removing socialbots from . Let denote a model that predicts whether or not an account is a socialbot based on its activity sequence up to the timestep (). This sequence of ordered activities is then usually represented by socialbot detectors as an unordered list of statistical features such as the number of replies or tweets per day (Mou and Lee, 2020; Dong and Liu, 2018; Yang et al., 2020). In this paper, extracts and adopts several features (Table 1) from previous works for detection. Most of the features are utilized by the popular bot detection API Botometer (Davis et al., 2016). We train using the Random Forest (Svetnik et al., 2003) algorithm with supervised learning on a publicly available dataset (Yang et al., 2019; Mazza et al., 2019) of nearly 15K Twitter accounts, half of which are labeled as socialbots. Here we assume that is a black-box model–i.e., we do not have access to its parameters. achieves nearly 90% in F1 score on an unseen test set. Since and are co-dependent, we can easily see that also has effects on the detectability of a socialbot. Note that to focus on the study of the adversarial aspect of socialbots, we had to resort to a certain combination of account features and the socialbot detection model. An F1 score of 90% is also in line with SOTA detectors on a similar set of features (Mou and Lee, 2020).
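As an illustration of how an ordered activity sequence can be turned into the unordered statistical features of Table 1, here is a minimal sketch. The log format (a list of `(kind, target)` pairs, one per timestep), the function name `detector_features`, and the convention of returning 0.0 for undefined ratios are assumptions made for this example; the resulting feature dictionary is the kind of input a Random Forest classifier could then be trained on.

```python
from collections import Counter


def detector_features(activities):
    """Summarize an ordered activity log as the unordered features of Table 1.

    `activities` is a list of (kind, target) pairs, one per timestep, where
    kind is "tweet", "retweet", "reply" or "mention" and target is the user
    interacted with (None for a plain tweet). This log format is an assumption.
    """
    counts = Counter(kind for kind, _ in activities)
    n_steps = len(activities)
    tweets, replies, retweets = counts["tweet"], counts["reply"], counts["retweet"]
    unique_mentions = len({t for kind, t in activities if kind == "mention"})

    def ratio(num, den):
        # Assumption: undefined ratios (zero denominator) are reported as 0.0.
        return num / den if den else 0.0

    return {
        "#tweets": tweets,
        "#replies": replies,
        "#retweets": retweets,
        "#avg.tweets": ratio(tweets, n_steps),
        "#avg.replies": ratio(replies, n_steps),
        "#avg.retweets": ratio(retweets, n_steps),
        "#retweet.ratio": ratio(retweets, tweets),
        "#replies.ratio": ratio(replies, tweets),
        "#retweet.replies.ratio": ratio(retweets, replies),
        "#mentions.ratio": ratio(unique_mentions, tweets),
    }


# Hypothetical six-step activity log.
log = [("tweet", None), ("tweet", None), ("retweet", "a"),
       ("reply", "b"), ("mention", "c"), ("mention", "c")]
features = detector_features(log)
```

Because the features are order-free counts and ratios, two different orderings of the same activities map to the same feature vector, which is part of what lets a bot shape its activity mix to stay under a detector's decision boundary.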

Figure 1. An example of the ACORN HRL framework. As the environment rolls out, AgentI () decides which type of activity (T, R, A or M) to perform. Whenever an interactive action (R, A, M) is selected, AgentII () then selects a new follower. Since the selected user at is an influencer, needs to perform action “A” not once but multiple times to acquire (blue arrow). Whenever reaches an interval of , the bot detector is triggered (red arrow).

3.2. The ASL Problem and Objective Function

From the above analysis, this paper proposes to study the Adversarial Socialbot Learning (ASL) problem to achieve both Obj 1 and Obj 2. In other words, we aim to solve the following problem.

Problem: Adversarial Socialbot Learning (ASL) aims to develop an automatic socialbot that can exercise adversarial behaviors against a black-box bot detector while at the same time maximizing its influence on through a set of selective followers .

Specifically, we formulate this task as an optimization problem with the objective function as follows.

Objective Function: Given a black-box bot detection model and a social network environment that is characterized by , , , we want to optimize the objective function:

(2a) (2b) (2c) (2d)

Socialbot detector can run prediction on the socialbot every time it performs a new activity. However, and can potentially be very large. Thus, we assume that only runs detection every time new activities are added to (Eqn. 2b). This makes the earliest interval timestep at which a socialbot is detected and removed by (Eqn. 2b,c). Since is monotonically increasing on both and , to maximize , a socialbot cannot focus on only one of Obj 1 or Obj 2. In other words, Eqn. (2d) encourages the socialbot to simultaneously optimize both objectives.
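The interval-based detection described above can be sketched as a short loop that calls a black-box detector only after every fixed number of new activities and returns the earliest checkpoint at which the bot is flagged. The names `survival_time`, `interval`, and the toy threshold detector are hypothetical and stand in for the paper's detector and detection interval.

```python
def survival_time(activities, detector, interval):
    """Return the earliest checkpoint at which the detector flags the bot.

    The detector is treated as a black box that only sees the activity prefix,
    and it is only invoked after every `interval` new activities.
    Returns None if the bot is never flagged within the given sequence.
    """
    for t in range(interval, len(activities) + 1, interval):
        if detector(activities[:t]):
            return t
    return None


def too_many_retweets(prefix):
    # Hypothetical stand-in detector: flag accounts with more than 5 retweets.
    return prefix.count("retweet") > 5


sequence = ["retweet"] * 10
print(survival_time(sequence, too_many_retweets, interval=4))
```

In this toy setting the bot crosses the threshold at its sixth retweet, but it is only removed at the next checkpoint, so the detection time depends on the interval as well as on the activity sequence.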

4. The Proposed Method: Acorn

4.1. Markov Decision Process Formulation

The ASL problem can be formulated as an MDP which consists of a state set , an action set , a transition function , a reward function , a discount factor and the horizon . Since the space requirement for can be very large–i.e., for 4 possible activities and possible seed nodes, especially on a large network, this can make the task much more challenging to optimize due to a potential sparse reward problem. To overcome this, we transform it into an HRL framework of two functional agents, AgentI and AgentII, with a global reward (Figure 1). We call this the ACORN (Adversarial soCialbOts leaRniNg) framework. While AgentI is responsible for deciding which type of activity among {tweet, retweet, reply, mention} to perform at each timestep , AgentII is mainly responsible for –i.e., to select which follower to accumulate, only when AgentI chooses an interactive activity–i.e., retweet, reply, or mention. This reduces the overall space of to only . Since and are co-dependent (Sec. 3.1), the two agents need to continuously cooperate to optimize both influence maximization and undetectability. Note that the Markov assumption behind this MDP is not violated because both the influence function and the detection probability at time depend only on a statistical snapshot of the two agents at . This HRL task is then described in detail as follows.

State. Following (Florensa et al., 2017; Li et al., 2019b; Konidaris and Barto, 2007), we assume that the state space can be factorized into bot-specific and network-specific , and , where is the state space of AgentI and AgentII, respectively. Specifically, encodes (i) the number of followers of the bot and (ii) a snapshot of at timestep . While can directly store the actual sequence, this potentially induces a computational and space overhead, especially when becomes very large. Instead, we compact into a fixed vector summarizing the frequency of each tweet, retweet, reply, and mention action up to . This effectively limits the space complexity of to . Similarly, comprises (i) node2vec embeddings (Grover and Leskovec, 2016), which encode the structure of to vectors of size , (ii) a statistical snapshot of and (iii) information regarding , encoded as:


Previous works (Gogineni et al., 2020; Zhang et al., 2020) have often encoded the network structure via a parameterized Graph Convolutional Network (GCN) (Kipf and Welling, 2017) as part of the policy network. As this approach requires frequent parameter updates during training, we instead adopt node2vec() as an alternative unsupervised method which requires the calculation only once. While can be encoded as a one-hot vector , we enrich it by multiplying it with the binary condition (Sec. 3.1), which then results in Eq. (3). This enables AgentII to select nodes in accordance with the current reputation of the bot .

Action and Policy. Similarly, we factor into two different action spaces for AgentI and AgentII, respectively. , are both encoded as one-hot vectors, representing one of four available activities and one of potential followers, respectively. We then have two policies , that control AgentI and AgentII, respectively.

Reward. Even though we can directly reward the RL agents with at every timestep , this calculation will incur a large computational cost, especially when becomes large. Instead, we design an accumulative reward function that consists of a step reward and a delayed reward to incentivize the RL agents to maximize (Eqn. 2) as follows.


where is the interval timestep at which the bot is detected and the episode is terminated. The step reward , which can be efficiently computed, is the marginal gain on the network influence given a new follower selected at . Using the step reward with a discount factor, , helps avoid the sparse reward problem and encourages good follower selection early during an episode. Since , it also encourages the bot to survive against bot detection longer–i.e., to maximize . In other words, as long as the socialbot survives–i.e., increases, in order to make new connections, it will be able to influence more people. However, since is subadditive–i.e., , we then introduce the delayed reward at the end of each episode with a discount factor as a reward adjustment for each node selection step.
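The idea of a marginal-gain step reward combined with a discounted, delayed end-of-episode reward can be illustrated with a small sketch. The exact reward formula is not reproduced here, so everything below is an assumption: the spread is replaced by a deterministic reachability count (the special case where every activation succeeds), the delayed reward is treated as arriving one step after the last selection, and all function names and the toy graph are hypothetical.

```python
def reachable(followers, seeds):
    """Nodes reachable from the seeds; the spread when every activation succeeds."""
    seen, stack = set(seeds), list(seeds)
    while stack:
        u = stack.pop()
        for v in followers.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def step_rewards(followers, picks):
    """Marginal gain in (deterministic) spread for each newly selected follower."""
    rewards, chosen, previous = [], [], 0
    for v in picks:
        chosen.append(v)
        current = len(reachable(followers, chosen))
        rewards.append(current - previous)
        previous = current
    return rewards


def discounted_returns(rewards, gamma, delayed_reward=0.0):
    """Discounted return at each step.

    Assumption: the delayed reward arrives one step after the last selection,
    so it reaches earlier steps with progressively stronger discounting.
    """
    returns, running = [0.0] * len(rewards), delayed_reward
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


toy = {0: [1, 2], 1: [3], 4: []}
gains = step_rewards(toy, picks=[0, 4, 1])
print(gains, discounted_returns(gains, gamma=0.5, delayed_reward=8))
```

In the toy run the third pick adds nothing because its audience was already covered by the first pick, which is the diminishing-returns effect that motivates adjusting per-step rewards with an end-of-episode term.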

Figure 2. Examples of a real (Left) and synthetic (Right) news propagation networks on Twitter with a similar star-like shape structure.

4.2. Parameterization

The policy network of AgentI is a Multi-Layer Perceptron (MLP) followed by a softmax function that projects its state to a probability distribution over the 4 possible activities. We can then sample an activity from such a distribution. The policy network of AgentII utilizes a Convolutional Neural Network (CNN) (Kalchbrenner et al., 2014) to efficiently extract useful spatial features from the stack of representation vectors of all vertices calculated by node2vec() (Sec. 4.1), and an MLP to extract features from the rest of the components of . The resulting vectors are then concatenated as the final feature vector. Instead of directly projecting this feature onto the original action space of using an MLP, we adopt the parametric-action technique (Gauci et al., 2018; OpenAI et al., 2019) with invalid actions at each timestep –i.e., already chosen nodes, being masked.
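One common way to realize the masking part of such a policy is to set the logits of invalid actions to negative infinity before the softmax, so that already-chosen nodes receive zero probability. The sketch below shows only this masking step in plain Python; the function name and the example logits are hypothetical, and the actual policy networks are not reproduced.

```python
import math


def masked_softmax(logits, valid_mask):
    """Softmax over the logits with invalid actions forced to probability 0."""
    masked = [x if ok else -math.inf for x, ok in zip(logits, valid_mask)]
    top = max(masked)
    if top == -math.inf:
        raise ValueError("at least one action must be valid")
    # Subtracting the maximum keeps the exponentials numerically stable;
    # exp(-inf) evaluates to 0.0, which removes the masked actions.
    exps = [math.exp(x - top) for x in masked]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate followers; the second is already chosen.
probs = masked_softmax([1.0, 2.0, 3.0], [True, False, True])
print(probs)
```

Masking before normalization, rather than zeroing probabilities afterwards, keeps the output a valid distribution over the remaining candidates without an extra renormalization step.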

4.3. Learning Paradigm

Learning algorithm. We train the policy networks using the actor-critic Proximal Policy Optimization (PPO) algorithm (Schulman et al., 2017). It has a theoretical guarantee and is known to be versatile in various scenarios (Schulman et al., 2017; Song et al., 2021; Gogineni et al., 2020; Zhang et al., 2020). The actors refer to the two policy networks described above. Their critics share the same network structure but output a single scalar value as the estimated accumulated reward at each timestep.

Learning on synthetic and evaluating on real networks. We evaluate our method on real-world data. To make our RL model generalize well on unseen real networks (Figure 2, Left) with different possible configurations of , it is important to train our model on a sufficient number of diverse scenarios–i.e., training graphs. However, collecting such a training dataset often requires much time and effort. Hence, we propose to train our model on synthetic graphs, which can be efficiently generated on the fly during training (Kamarthi et al., 2020). To avoid distribution shifts between training and test graphs, we first collect a seed dataset of several news propagation networks and use their statistical properties () to randomly generate a synthetic graph (Figure 2, Right) for each training iteration. We describe this in detail in Section 5.

5. Experiment

5.1. Set-Up

Datasets. We collected the top 100 trending articles on Twitter from January 2021 to April 2021 and their corresponding propagation networks, with a maximum of 1.5K nodes, using the public Hoaxy API. The majority of these articles are relevant to the events surrounding the 2020 U.S. presidential election and the COVID-19 pandemic. We also share the observation of previous literature (Kamarthi et al., 2020; Sadikov et al., 2011) that retweet networks tend to have star-like shapes. These networks have a high intra-community and a low inter-community edge probability, suggesting multiple separate star-shaped communities with few connections among them. Therefore, viral news usually originates from a few very influential actors in social networks and quickly propagates to their followers.

Figure 3. We generate synthetic networks that resemble real networks’ structures on the fly to train ACORN and test it with real networks.
Figure 4. Performance comparison of a single socialbot under bot detection constraint.
Method | % | Steps | % | Steps | % | Steps
AgentI+H | 0.63 ± 0.43 | 1.2K ± 1K | 0.68 ± 0.36 | 1.2K ± 1K | 0.73 ± 0.31 | 1.2K ± 1K
AgentI+C | 0.73 ± 0.41 | 1.5K ± 968 | 0.71 ± 0.36 | 1.3K ± 1K | 0.77 ± 0.30 | 1.3K ± 1.1K
ACORN | 0.99 ± 0.10 | 2.1K ± 254 | 0.99 ± 0.10 | 2.0K ± 276 | 0.99 ± 0.10 | 2.0K ± 305
Table 2. Total survival timesteps v.s. network influence ratio after reaching

Training and Testing Set. Figure 3 illustrates how we utilize synthetic data during training. Since we observe that our framework generalizes better when trained with more complex graphs–i.e., more edges with high intra-community () and inter-community () edge probabilities, we first selected 10% of the collected real networks with the highest and as initial seed graphs–e.g., Figure 2, Left, to generate the training set and used the rest as the test set. Then, during training, we used the average statistics (, # of communities and their sizes) of the seed graphs to generate a stochastic, synthetic graph for each training episode of a maximum of timesteps–e.g., Figure 2, Right. These statistics are selected because they capture well the star-like shapes of a typical retweet network. Since the real activation probabilities of the collected networks are unknown, we used a fixed high value during training, which we found achieves the best results. We then reported the averaged results across 5 different random seeds on the remaining 90 real test networks with varied values and on a much longer horizon than . Note that this number of testing networks is much larger and more extensive than those of previous studies (Kamarthi et al., 2020; Wen et al., 2017; Kempe et al., 2003).
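As a rough illustration of generating a training graph from community statistics, the sketch below samples a directed graph from community sizes plus intra-community and inter-community edge probabilities (a plain stochastic block model). This is an assumed generator chosen for simplicity, not the paper's procedure; in particular, it does not by itself enforce the star-like shape of retweet networks, and the function and parameter names are hypothetical.

```python
import random


def synthetic_graph(community_sizes, p_intra, p_inter, seed=None):
    """Sample a directed graph from community sizes and edge probabilities.

    Returns (number of nodes, set of directed edges, community label per node).
    """
    rng = random.Random(seed)
    community_of = []
    for label, size in enumerate(community_sizes):
        community_of.extend([label] * size)
    n = len(community_of)
    edges = set()
    for u in range(n):
        for v in range(n):
            if u == v:
                continue
            same = community_of[u] == community_of[v]
            # Nodes in the same community connect with p_intra, others with p_inter.
            if rng.random() < (p_intra if same else p_inter):
                edges.add((u, v))
    return n, edges, community_of


# A fresh graph can be drawn for every training episode by changing the seed.
n, edges, labels = synthetic_graph([5, 8, 3], p_intra=0.6, p_inter=0.02, seed=1)
print(n, len(edges))
```

Drawing a new graph per episode exposes the agents to many structural variations that share the same summary statistics, which is the property the training scheme above relies on.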

Baselines. Since there are no previous works that address the ASL problem, we combined different approximation and heuristic approaches for the IM task with the socialbot detector evasion feature that is provided by learned AgentI as baselines:

  • AgentI+C. This baseline pairs AgentI with Cost-Effective Lazy Forward (CELF) (Leskovec et al., 2007), which exploits the submodularity of the spread function and was the first substantial improvement over the traditional Greedy method (Kempe et al., 2003) in terms of computational complexity.

  • AgentI+H. Since the network consists of several star-like communities, we also used the heuristic approach Degree (Kempe et al., 2003; Chen et al., 2009) that always selects the available node with the largest out-degree–i.e., the user with the largest # of followers.

  • AgentI*+C and AgentI*+H train the first-level agent independently from the second-level agent and combine it with CELF or the heuristic approach Degree, respectively. These are introduced to examine the dependency between the trained AgentI and AgentII.

Since the Greedy approach does not scale well with a large number of seeds, we excluded it from our experiments.
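The two node-selection strategies used in the baselines can be sketched as follows. The Degree heuristic ranks nodes by out-degree, while CELF keeps a priority queue of possibly stale marginal gains and only re-evaluates the top candidate, which is valid because marginal gains cannot grow under submodularity. For a deterministic example, the spread function here is a simple reachability count (hypothetical; a Monte Carlo ICM estimate would normally be used), and the function names and toy graph are assumptions.

```python
import heapq


def reachable(followers, seeds):
    """Nodes reachable from the seeds; stands in for the spread function."""
    seen, stack = set(seeds), list(seeds)
    while stack:
        u = stack.pop()
        for v in followers.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def degree_select(followers, k):
    """Degree heuristic: pick the k nodes with the most followers."""
    return sorted(followers, key=lambda u: len(followers[u]), reverse=True)[:k]


def celf(nodes, spread, k):
    """CELF: lazy greedy selection using possibly stale marginal gains."""
    selected, base = [], 0.0
    # Heap entries are (-gain, node, number of seeds when the gain was computed).
    heap = [(-spread([u]), u, 0) for u in nodes]
    heapq.heapify(heap)
    while heap and len(selected) < k:
        neg_gain, u, computed_at = heapq.heappop(heap)
        if computed_at == len(selected):
            # The gain is up to date and still the largest, so take the node.
            selected.append(u)
            base += -neg_gain
        else:
            # Stale entry: recompute its marginal gain and push it back.
            gain = spread(selected + [u]) - base
            heapq.heappush(heap, (-gain, u, len(selected)))
    return selected


graph = {0: [1, 2, 3], 1: [2, 3], 2: [], 3: [], 4: [5], 5: []}
spread = lambda seeds: len(reachable(graph, seeds))
print(degree_select(graph, 2), celf(list(graph), spread, 2))
```

On this toy graph the Degree heuristic picks two overlapping hubs, whereas CELF picks one hub from each community and reaches every node, which shows why degree alone can waste budget when audiences overlap.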

Models and Configurations. We used a fixed hyper-parameter setting. During training, we set , and

. We refer readers to the appendix for the detailed configurations of the RL agents. We ran all experiments on machines with Ubuntu OS (v18.04), a 20-core Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz, 93GB of RAM and a Titan Xp GPU 16GB. All implementations are written in Python (v3.8) with PyTorch (v1.5.1).

Figure 5. Empirical comparison of running time between CELF and AgentII (ACORN).
Figure 6. Insights on the learned policies.

5.2. Main Results

Network Influence Ratio. Figure 4 shows the network influence ratio–i.e., network influence over total number of users, under a bot detection environment given different number of budget seeds and values:


A high network influence ratio requires both (i) efficient followship selection and (ii) an efficient detection evasion strategy. Overall, ACORN outperforms all baselines with different news virality ( values). However, ACORN underperforms when is low–e.g., in Figure 4. This is because AgentII learns not to connect with the most influential nodes early in the process. This helps avoid disrupting the activity sequence, which would lead to early detection, especially when it gets closer to the next prediction interval of .

The larger the value, the further–i.e., the more hops, a piece of news can propagate through . Hence, as increases–i.e., the more viral a piece of news, utilizing the network structure to make new connections becomes crucial and more effective than simply selecting the most influential users. This is reflected in the inferior performance of AgentI+H compared with AgentI+C and ACORN in Figure 4, . This means that ACORN is able to utilize the network structure captured by node2vec and postpone short-term incentives–i.e., making friends with influential users, for the sake of long-term rewards. Overall, ACORN also behaves more predictably than the baselines in terms of the deviation of the influence ratio across several runs.
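The relation between news virality, propagation depth, and the network influence ratio can be illustrated with a small simulation. The sketch below is an assumption-laden toy rather than the evaluation code: it assumes an Independent Cascade model with one activation probability p for every edge, counts the seed users as influenced, and uses hypothetical names (simulate_cascade, influence_ratio).

```python
import random


def simulate_cascade(followers, seeds, p, rng):
    """Run one Independent Cascade; return (activated users, hops reached)."""
    active = set(seeds)
    frontier = list(seeds)
    hops = 0
    while frontier:
        nxt = []
        for u in frontier:
            for v in followers.get(u, ()):
                if v not in active and rng.random() < p:
                    active.add(v)
                    nxt.append(v)
        if nxt:
            hops += 1  # the news travelled one step further from the seeds
        frontier = nxt
    return active, hops


def influence_ratio(followers, seeds, p, rounds=200, seed=0):
    """Average (influenced users / all users) and average hop depth."""
    rng = random.Random(seed)
    users = set(followers) | {v for vs in followers.values() for v in vs}
    reached = 0
    depth = 0
    for _ in range(rounds):
        active, hops = simulate_cascade(followers, seeds, p, rng)
        reached += len(active)
        depth += hops
    return reached / (rounds * len(users)), depth / rounds
```

On a five-user chain seeded at its head, an activation probability of 0 leaves the ratio at the seed's own share and the depth at zero hops, while a probability of 1 reaches every user in four hops. Intermediate values fall in between, which is why users who are several hops away only matter once the news is viral enough to reach them.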

Survival Timesteps. We then evaluated whether a trained socialbot can survive even after collecting all followers. Table 2 shows that although we train a socialbot with a finite horizon , it can live on the network for a much longer period during testing, whereas the baselines were detected very early. Since only three of the four activities–i.e., tweet, retweet, reply, and mention, allow the socialbot to collect new followers, it is natural that socialbots need to survive much longer than steps–e.g., around 2.0K in Table 2, to accumulate all followers. This corresponds to 98%, 64%, and 56% of socialbots surviving–i.e., not being detected, after reaching for ACORN, AgentI+C, and AgentI+H, respectively. Our trained socialbot can also survive much longer if we keep it running during testing, even with different detection intervals . This implies that AgentI can generalize its adversarial activities against toward unseen real-life scenarios.

Dependency between RL Agents. The above results also demonstrate the effects of co-training AgentI and AgentII. First, the heuristic and CELF methods, when paired with the co-trained AgentI (blue & green lines, Figure 4), perform much better than when paired with an AgentI trained independently of AgentII (yellow & black lines, Figure 4). This shows that AgentI, when trained with AgentII, becomes more versatile and can help a socialbot survive for a much longer period of time, even when the socialbot only uses a heuristic node selection. Moreover, AgentI performs best when paired with AgentII. This shows that the two RL agents successfully learn to collaborate, not only to evade socialbot detection but also to effectively maximize network influence. This further reflects the co-dependency between the roles of and as analyzed in Sec. 3.1.

Figure 7. Performance of multiple socialbots under bot detection constraint on a large network.

Computational Analysis. We compared the computational complexity of AgentII with that of the CELF algorithm during inference. Even though CELF significantly improves on the traditional Greedy IM algorithm (Kempe et al., 2003), which has a computational complexity of  (Tang et al., 2014) (assuming each call of takes and only one round of Monte Carlo simulation is needed), its computation still greatly depends on , the size of the graph, and becomes computationally practical only when is small. The same holds for other traditional IM algorithms such as CELF++ (Goyal et al., 2011), TIM (Tang et al., 2014), and ASIM (Galhotra et al., 2015). To illustrate, CELF takes much more time to compute as increases, especially with large –i.e., more nodes need to be reached when computing (Figure 5). In contrast, with the complexity of the forward pass through , AgentII is able to scale linearly regardless of the network structure and the virality of the news during inference. Even though our framework requires computing the graph representation using node2vec, node2vec is specifically designed to scale to large graphs (Rossi et al., 2018), and we only need to run it once.
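To make the contrast concrete, the sketch below shows why scoring candidates from precomputed node representations costs time linear in the number of candidates, with no cascade simulation involved. It is only an illustration under strong assumptions: a single linear scoring layer stands in for the policy network, the embeddings are taken as given (e.g., precomputed once by node2vec), and select_next_node is a hypothetical name.

```python
def select_next_node(embeddings, weights, available):
    """Return the index of the best-scoring available node.

    embeddings: list of d-dimensional vectors, one per node (precomputed once)
    weights:    d-dimensional scoring vector (stand-in for a policy network)
    available:  list of booleans, False for nodes that cannot be selected
    """
    best, best_score = None, float("-inf")
    for node, vec in enumerate(embeddings):  # one pass over n nodes
        if not available[node]:
            continue
        score = sum(w * x for w, x in zip(weights, vec))  # O(d) per node
        if score > best_score:
            best, best_score = node, score
    return best
```

Each call touches every node once and never walks the edges or simulates propagation, so under these assumptions its cost does not change with the network structure or with how viral the news is.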

Insights on the Learned Policies. We summarized the node selection strategies of all methods in Figure 6. We observed that both the heuristic and CELF select very influential nodes with many followers (high out-degrees) very early. In contrast, AgentII acquires an array of normal users (low out-degrees) before connecting with influential ones. This results in early detection and removal of the baselines and sustainable survival of our approach. This shows that AgentII can learn to cope with the relationship constraint (Eqn. (1)) between and imposed by the environment. Moreover, the degrees of the users selected by ACORN have a long-tailed, right-skewed distribution, which means that ACORN overall still tries to maximize its network influence early in the process.

5.3. Multiple Socialbots Results

We have evaluated our approach on different real-life news propagation graphs. These networks can be considered as sub-graphs of a much larger social network. In practice, different sub-graphs can represent different communities of special interests–e.g., politics or COVID-19 news, or different characteristics–e.g., political orientation. Since socialbots usually aim to influence a specific group of users–e.g., anti-vaxxers, it is practical to deploy several bots working in tandem on different sub-graphs. To evaluate this scenario, we aggregated all 90 test sub-graphs into a large network of 135K nodes and deployed one learned socialbot on each sub-graph. Figure 7 shows that ACORN still outperforms the baselines, especially later in the time horizon. Moreover, ACORN can efficiently scale to a real-life setting thanks to its linear running time and highly parallel architecture.
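One simple way to build such an aggregated network is a disjoint union of the sub-graphs, with each sub-graph keeping its own contiguous block of node ids. The sketch below assumes the sub-graphs do not share users and are given as follower dictionaries; whether the original aggregation was disjoint is not stated, and merge_subgraphs is a hypothetical helper.

```python
def merge_subgraphs(subgraphs):
    """Disjoint union of follower graphs.

    Returns the merged graph and, for each input sub-graph, the range of
    node ids it occupies in the merged graph (one block per socialbot).
    """
    merged = {}
    blocks = []
    offset = 0
    for g in subgraphs:
        # collect every user, including those that only appear as followers
        nodes = sorted(set(g) | {v for vs in g.values() for v in vs})
        index = {u: offset + i for i, u in enumerate(nodes)}
        for u in nodes:
            merged[index[u]] = [index[v] for v in g.get(u, ())]
        blocks.append(range(offset, offset + len(nodes)))
        offset += len(nodes)
    return merged, blocks
```

Because the blocks do not overlap, each socialbot can run on its own block independently of the others, which is what makes the per-sub-graph deployment straightforward to parallelize.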

6. Discussion and Limitation

Our contribution goes beyond demonstrating that one can train adversarial socialbots to effectively navigate real-life networks using an HRL framework. We will also publish a multi-agent RL environment for the ASL task under the gym library (Brockman et al., 2016). This environment will enable researchers to test different RL agents and to examine and evaluate assumptions regarding the behaviors of socialbots, bot detection models, and the underlying influence diffusion models on synthetic and real-life news propagation networks. It remains a possibility that our proposed framework could be deliberately exploited to train and deploy socialbots to spread low-credibility content on social networks without being detected. To reduce any potential misuse of our work, we have also refrained from evaluating our framework with an actual socialbot detector API such as Botometer. However, ultimately, such misuse can occur (as much as the misuse of the latest AI techniques such as GAN or GPT is unavoidable). Yet, we firmly believe that the benefits of our framework in demonstrating the possible adversarial nature of socialbots, and in enabling researchers to understand and develop better socialbot detection models, far outweigh the possibility of misuse for developing “smarter” socialbots. In fact, by learning and simulating various adversarial behaviors of socialbots, we can now analyze the weaknesses of the current detectors. Moreover, we can also incorporate these adversarial behaviors to advance the development of novel bot detection models in a proactive manner (Cresci, 2020). Time-wise, this gives us a great advantage over the traditional reactive flow of developing socialbot detectors, where researchers and network administrators are always one step behind the malicious bot developers (Cresci, 2020).

One limitation of our current approach is that we only considered the statistical features of a bot detector that are relevant to four activities–i.e., tweet, retweet, reply, and mention (Table 1). While these features help achieve a detection F1 score of 90% on a real-life dataset, we hope to lay the foundation for further work that considers more complex network and content-based features (Efthimion et al., 2018; Mou and Lee, 2020; Yang et al., 2020; Mazza et al., 2019).

7. Conclusion and Future Work

This paper proposes a novel adversarial socialbot learning (ASL) problem where a socialbot needs to simultaneously maximize its influence on social networks and minimize its detectability by a strong black-box bot detector. We carefully designed and formulated this task as a cooperative game between two functional hierarchical reinforcement learning agents with a global reward. We demonstrated that the learned socialbots can sustain their presence on unseen real-life networks over a long period while outperforming other baselines in terms of network influence. During inference, the complexity of our approach also scales linearly with the number of followers and is independent of a network’s structure and the virality of the news. Our research is also the first step towards developing more complex adversarial socialbot learning settings where multiple socialbots can work together to achieve a common goal (Cresci, 2020). By simulating the learning of these socialbots under various realistic assumptions, we also hope to analyze their adversarial behaviors to develop effective detection models against more advanced socialbots in the future.


  • G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016) OpenAI gym. External Links: arXiv:1606.01540 Cited by: 4th item, §6.
  • C. Cai, L. Li, and D. Zengi (2017) Behavior enhanced deep bot detection in social media. In 2017 IEEE International Conference on Intelligence and Security Informatics (ISI), pp. 128–130. Cited by: §2.1.
  • W. Chen, Y. Wang, and S. Yang (2009) Efficient influence maximization in social networks. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 199–208. Cited by: §1, §2.2, 2nd item.
  • S. Cresci, R. Di Pietro, M. Petrocchi, A. Spognardi, and M. Tesconi (2017a) Social fingerprinting: detection of spambot groups through dna-inspired behavioral modeling. IEEE Transactions on Dependable and Secure Computing 15 (4), pp. 561–576. Cited by: §3.1.
  • S. Cresci, R. Di Pietro, M. Petrocchi, A. Spognardi, and M. Tesconi (2017b) The paradigm-shift of social spambots: evidence, theories, and tools for the arms race. In Proceedings of the 26th international conference on world wide web companion, pp. 963–972. Cited by: §2.1.
  • S. Cresci, M. Petrocchi, A. Spognardi, and S. Tognazzi (2019) Better safe than sorry: an adversarial approach to improve social bot detection. In Proceedings of the 10th ACM Conference on Web Science, pp. 47–56. Cited by: §2.2.
  • S. Cresci, M. Petrocchi, A. Spognardi, and S. Tognazzi (2021) The coming age of adversarial social bot detection. First Monday. Cited by: §1, §2.1, §2.2.
  • S. Cresci (2020) A decade of social bot detection. Communications of the ACM 63 (10), pp. 72–83. Cited by: §1, §2.1, §3.1, §6, §7.
  • C. A. Davis, O. Varol, E. Ferrara, A. Flammini, and F. Menczer (2016) Botornot: a system to evaluate social bots. In Proceedings of the 25th international conference companion on world wide web, pp. 273–274. Cited by: §3.1.
  • G. Dong and H. Liu (2018) Feature engineering for machine learning and data analytics. CRC Press. Cited by: §1, §2.1, §3.1.
  • P. G. Efthimion, S. Payne, and N. Proferes (2018) Supervised machine learning bot detection techniques to identify social twitter bots. SMU Data Science Review 1 (2), pp. 5. Cited by: §6.
  • C. Florensa, Y. Duan, and P. Abbeel (2017) Stochastic neural networks for hierarchical reinforcement learning. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, External Links: Link Cited by: §4.1.
  • S. Galhotra, A. Arora, S. Virinchi, and S. Roy (2015) Asim: a scalable algorithm for influence maximization under the independent cascade model. In Proceedings of the 24th International Conference on World Wide Web, pp. 35–36. Cited by: §5.2.
  • J. Gauci, E. Conti, Y. Liang, K. Virochsiri, Y. He, Z. Kaden, V. Narayanan, X. Ye, Z. Chen, and S. Fujimoto (2018) Horizon: facebook’s open source applied reinforcement learning platform. arXiv preprint arXiv:1811.00260. Cited by: §4.2.
  • T. Gogineni, Z. Xu, E. Punzalan, R. Jiang, J. Kammeraad, A. Tewari, and P. Zimmerman (2020) TorsionNet: a reinforcement learning approach to sequential conformer search. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33, pp. 20142–20153. External Links: Link Cited by: §4.1, §4.3.
  • J. Goldenberg, B. Libai, and E. Muller (2001a) Talk of the network: a complex systems look at the underlying process of word-of-mouth. Marketing letters 12 (3), pp. 211–223. Cited by: §3.1.
  • J. Goldenberg, B. Libai, and E. Muller (2001b) Using complex systems analysis to advance marketing theory development: modeling heterogeneity effects on new product growth through stochastic cellular automata. Academy of Marketing Science Review 9 (3), pp. 1–18. Cited by: §3.1.
  • A. Goyal, W. Lu, and L. V. Lakshmanan (2011) Celf++ optimizing the greedy algorithm for influence maximization in social networks. In Proceedings of the 20th international conference companion on World wide web, pp. 47–48. Cited by: §5.2.
  • A. Grover and J. Leskovec (2016) Node2vec: scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 855–864. Cited by: §4.1.
  • M. Hindman and V. Barash (2018) Disinformation, and influence campaigns on twitter. Knight Foundation: George Washington University. Cited by: §1.
  • R. A. Howard (1960) Dynamic programming and markov processes. Cited by: §2.2.
  • S. Jendoubi, A. Martin, L. Liétard, H. B. Hadji, and B. B. Yaghlane (2017) Two evidential data based models for influence maximization in twitter. Knowledge-Based Systems 121, pp. 58–70. Cited by: §3.1.
  • N. Kalchbrenner, E. Grefenstette, and P. Blunsom (2014) A convolutional neural network for modelling sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Baltimore, Maryland, pp. 655–665. External Links: Link, Document Cited by: §4.2.
  • H. Kamarthi, P. Vijayan, B. Wilder, B. Ravindran, and M. Tambe (2020) Influence maximization in unknown social networks: learning policies for effective graph sampling. In Proceedings of the 19th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS ’20, Richland, SC. External Links: ISBN 9781450375184 Cited by: §1, §3.1, §3.1, §4.3, §5.1, §5.1.
  • D. Kempe, J. Kleinberg, and É. Tardos (2003) Maximizing the spread of influence through a social network. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 137–146. Cited by: §1, §2.2, §3.1, §3.1, 1st item, 2nd item, §5.1, §5.2.
  • M. Kimura and K. Saito (2006) Tractable models for information diffusion in social networks. In European conference on principles of data mining and knowledge discovery, pp. 259–271. Cited by: §3.1.
  • H. Kingi, L. D. Wang, T. Shafer, M. Huynh, M. Trinh, A. Heuser, G. Rochester, and A. Paredes (2020) A numerical evaluation of the accuracy of influence maximization algorithms. Social Network Analysis and Mining 10 (1), pp. 1–10. Cited by: §1, §2.2.
  • T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), Cited by: §4.1.
  • G. D. Konidaris and A. G. Barto (2007) Building portable options: skill transfer in reinforcement learning. In IJCAI, Vol. 7, pp. 895–900. Cited by: §4.1.
  • T. Le, S. Wang, and D. Lee (2020) MALCOM: generating malicious comments to attack neural fake news detection models. In 2020 IEEE International Conference on Data Mining (ICDM), pp. 282–291. External Links: Document Cited by: §1.
  • J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos, J. VanBriesen, and N. Glance (2007) Cost-effective outbreak detection in networks. In Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 420–429. Cited by: 1st item.
  • D. Li, M. Lowalekar, and P. Varakantham (2021) CLAIM: curriculum learning policy for influence maximization in unknown social networks. arXiv preprint arXiv:2107.03603. Cited by: §3.1.
  • H. Li, M. Xu, S. S. Bhowmick, C. Sun, Z. Jiang, and J. Cui (2019a) Disco: influence maximization meets network embedding and deep learning. arXiv preprint arXiv:1906.07378. Cited by: §2.2.
  • M. Li, X. Wang, K. Gao, and S. Zhang (2017) A survey on information diffusion in online social networks: models and methods. Information 8 (4), pp. 118. Cited by: §3.1.
  • S. Li, R. Wang, M. Tang, and C. Zhang (2019b) Hierarchical reinforcement learning with advantage-based auxiliary rewards. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett (Eds.), pp. 1407–1417. External Links: Link Cited by: §4.1.
  • M. Mazza, S. Cresci, M. Avvenuti, W. Quattrociocchi, and M. Tesconi (2019) Rtbust: exploiting temporal patterns for botnet detection on twitter. In Proceedings of the 10th ACM Conference on Web Science, pp. 183–192. Cited by: §3.1, §6.
  • G. Mou and K. Lee (2020) Malicious bot detection in online social networks: arming handcrafted features with deep learning. In Social Informatics, S. Aref, K. Bontcheva, M. Braghieri, F. Dignum, F. Giannotti, F. Grisolia, and D. Pedreschi (Eds.), Cham, pp. 220–236. External Links: ISBN 978-3-030-60975-7 Cited by: §1, §2.1, §3.1, §6.
  • OpenAI, C. Berner, G. Brockman, B. Chan, V. Cheung, P. Dębiak, C. Dennison, D. Farhi, Q. Fischer, S. Hashme, C. Hesse, R. Józefowicz, S. Gray, C. Olsson, J. Pachocki, M. Petrov, H. P. de Oliveira Pinto, J. Raiman, T. Salimans, J. Schlatter, J. Schneider, S. Sidor, I. Sutskever, J. Tang, F. Wolski, and S. Zhang (2019) Dota 2 with large scale deep reinforcement learning. External Links: 1912.06680, Link Cited by: §4.2.
  • J. Rodríguez-Ruiz, J. I. Mata-Sánchez, R. Monroy, O. Loyola-Gonzalez, and A. López-Cuevas (2020) A one-class classification approach for bot detection on twitter. Computers & Security 91, pp. 101715. Cited by: §2.1.
  • R. A. Rossi, R. Zhou, and N. K. Ahmed (2018) Deep inductive graph representation learning. IEEE Transactions on Knowledge and Data Engineering 32 (3), pp. 438–452. Cited by: §5.2.
  • E. Sadikov, M. Medina, J. Leskovec, and H. Garcia-Molina (2011) Correcting for missing data in information cascades. In Proceedings of the fourth ACM international conference on Web search and data mining, pp. 55–64. Cited by: §5.1.
  • M. Sayyadiharikandeh, O. Varol, K. Yang, A. Flammini, and F. Menczer (2020) Detection of novel social bots by ensembles of specialized classifiers. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pp. 2725–2732. Cited by: §2.1.
  • J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §4.3.
  • C. Shao, G. L. Ciampaglia, O. Varol, K. Yang, A. Flammini, and F. Menczer (2018) The spread of low-credibility content by social bots. Nature communications 9 (1), pp. 1–9. Cited by: §3.1.
  • Y. Song, M. Steinweg, E. Kaufmann, and D. Scaramuzza (2021) Autonomous drone racing with deep reinforcement learning. arXiv preprint arXiv:2103.08624. Cited by: §4.3.
  • V. S. Subrahmanian, A. Azaria, S. Durst, V. Kagan, A. Galstyan, K. Lerman, L. Zhu, E. Ferrara, A. Flammini, and F. Menczer (2016) The darpa twitter bot challenge. Computer 49 (6), pp. 38–46. Cited by: §3.1.
  • V. Svetnik, A. Liaw, C. Tong, J. C. Culberson, R. P. Sheridan, and B. P. Feuston (2003) Random forest: a classification and regression tool for compound classification and qsar modeling. Journal of chemical information and computer sciences 43 (6), pp. 1947–1958. Cited by: §3.1.
  • Y. Tang, X. Xiao, and Y. Shi (2014) Influence maximization: near-optimal time complexity meets practical efficiency. In Proceedings of the 2014 ACM SIGMOD international conference on Management of data, pp. 75–86. Cited by: §5.2.
  • S. Tian, S. Mo, L. Wang, and Z. Peng (2020) Deep reinforcement learning-based approach to tackle topic-aware influence maximization. Data Science and Engineering 5 (1), pp. 1–11. Cited by: §2.2.
  • B. Wang, N. Z. Gong, and H. Fu (2017) GANG: detecting fraudulent users in online social networks via guilt-by-association on directed graphs. In 2017 IEEE International Conference on Data Mining (ICDM), pp. 465–474. Cited by: §3.1.
  • Z. Wen, B. Kveton, M. Valko, and S. Vaswani (2017) Online influence maximization under independent cascade model with semi-bandit feedback. External Links: Link Cited by: §3.1, §5.1.
  • Y. Wu, Y. Fang, S. Shang, J. Jin, L. Wei, and H. Wang (2021) A novel framework for detecting social bots with deep neural networks and active learning. Knowledge-Based Systems 211, pp. 106525. Cited by: §2.1.
  • K. Yang, O. Varol, C. A. Davis, E. Ferrara, A. Flammini, and F. Menczer (2019) Arming the public with artificial intelligence to counter social bots. Human Behavior and Emerging Technologies 1 (1), pp. 48–61. Cited by: §3.1.
  • K. Yang, O. Varol, P. Hui, and F. Menczer (2020) Scalable and generalizable social bot detection through data selection. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 1096–1103. Cited by: §1, §2.1, §3.1, §6.
  • C. Zhang, W. Song, Z. Cao, J. Zhang, P. S. Tan, and X. Chi (2020) Learning to dispatch for job shop scheduling via deep reinforcement learning. In Advances in Neural Information Processing Systems, Vol. 33, pp. 1621–1632. Cited by: §4.1, §4.3.