Deep Reinforcement Learning for Task-driven Discovery of Incomplete Networks

by Peter Morales, et al.
Northeastern University

Complex networks are often either too large for full exploration, partially accessible, or partially observed. Downstream learning tasks on incomplete networks can produce low-quality results. In addition, reducing the incompleteness of the network can be costly and nontrivial. As a result, network discovery algorithms optimized for specific downstream learning tasks and given resource collection constraints are of great interest. In this paper we formulate the task-specific network discovery problem in an incomplete network setting as a sequential decision making problem. Our downstream task is vertex classification. We propose a framework, called Network Actor Critic (NAC), which learns concepts of policy and reward in an offline setting via a deep reinforcement learning algorithm. A quantitative study is presented on several synthetic and real benchmarks. We show that offline models of reward and network discovery policies lead to significantly improved performance when compared to competitive online discovery algorithms.




1 Introduction

Complex networks are critical to many applications such as those in the social, cyber, and bio domains. We commonly have access only to partially observed data. The challenge is to discover enough of the complex network so that we can perform a learning task well. The network discovery step is especially critical when the learning task has the characteristics of a “needle in a haystack” problem. If the discovery process is not carefully tuned, the noise it introduces almost always overwhelms the signal. This presents an optimization problem: how should we grow the incomplete network to achieve a learning objective on the network, while at the same time minimizing the cost of observing new data?

In this work we view the network discovery problem through a decision-theoretic lens, where notions of utility and resource cost are naturally defined and jointly leveraged in a sequential, closed-loop manner. In particular, we leverage Reinforcement Learning (RL) and its mathematical formalism, Markov Decision Processes (MDP). MDP approaches have been successfully used in many other application settings [1, 2, 3]. However, the use of decision-theoretic approaches in the context of discovery of complex networks is novel and presents very interesting research opportunities. In particular, it requires learning effective models of reward that can capture properties of network structure at various topological scales and learning contexts. The network science community has defined many such topological and task quality metrics, but, to date, they have not been leveraged to guide the discovery of a partially observed, incomplete network. We consider the task of selective harvesting on graphs [9], where the learning objective is to maximize the collection of nodes of a particular type, under budget constraints. We make the following contributions:

  • We introduce a deep RL framework for task-driven discovery of incomplete networks. This formulation allows us to learn offline-trained models of environment dynamics and reward.

  • We show that, for a variety of complex learning scenarios, the added feature of learning from closely related scenarios leads to substantial performance improvements relative to existing online discovery methods.

  • We present an efficient way of organizing the state of possible discovered networks based on personalized Pagerank. Our approach achieves substantial reductions in training and convergence time.

  • Our approach is model-free, yet is able to generalize well to unseen real network topologies and tasks.

2 Related Work

Our learning task falls under the category of finding the largest number of nodes of a particular type under budget constraints. The node type can be specified by the node attributes (for example, follower nodes on a Twitter network), or it can be determined by a node’s participation in a particular class of behavior (for example, membership in anomalous activity). Unlike the problem setting in [4], we do not assume access to the full topology of the network and therefore have to perform the learning task with partial information.

Discovering incomplete networks with limited resources has received a lot of attention in recent literature. The primary learning objective in these works is to increase the visibility of the network topology by either increasing the number of newly discovered nodes [5, 6, 7], or by increasing network coverage [8]. Our problem setting is the most similar to selective harvesting [9]. Our approach differs from [9] by leveraging the Reinforcement Learning paradigm to estimate offline models of network discovery strategies (policy) and node utility (reward) that are state-aware. More specifically, our approach explicitly connects the utility of a discovery choice to the network state when that choice was made.

Reinforcement learning for tasks on complex networks is a relatively new perspective. Work in [15, 16] leverages Reinforcement Learning to engineer diffusion processes in networks assumed to be fully observed, while the authors in [12] focus on the problem of graph partitioning. You et al. [11] leverage Reinforcement Learning to generate novel molecular graphs with desired domain-specified properties. There are connections to our problem setting: graph generation is approached in a similar fashion to the network discovery problem, by iteratively expanding a seed graph via defined actions. There are, however, some important differences with our work. Since the application in [11] is molecular design, the graphs they consider are very small, and their definition of reward and environment dynamics is tailored to the biochemical domain. Our approach is more general and can support discovery of different types of networks and different network sizes. Our notion of reward is also more general in that we do not utilize domain-specific properties to guide the learning process. De Cao and Kipf [13], similarly to [11], focus on small molecular graph generation and, furthermore, do not consider the generation process as a sequence of actions. Finally, [14, 17] leverage deep Reinforcement Learning techniques to learn a class of greedy graph optimization heuristics on fully observed networks.

3 Problem Definition

We start with the assumption that a network contains a target subnetwork representing a set of relevant vertices. The objective is to strategically explore and expand the network so that we optimize discovery of these relevant vertices. The decision making agent is initially given partial information about the network, $G_0 = (V_0, E_0)$. A subset of those vertices have their relevance status revealed as well, with $0$ representing non-target vertices and $1$ representing target vertices. We assume our exploration starts from a seed vertex belonging to the partial target subnetwork. At each step $t$, the agent can choose from a set of vertices that are observed, but whose label is unknown. We refer to this set of vertices as the boundary set $B_t$. After selecting a vertex, the agent gains knowledge of the vertex label, as well as of the identity of all its neighbors. An immediate reward is given if the selected vertex belongs to the target subnetwork.
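The query mechanics described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the class name `DiscoveryEnv`, the adjacency-dictionary representation of the hidden network, and the 0/1 reward are all hypothetical choices made for the example.

```python
class DiscoveryEnv:
    """Toy selective-harvesting environment (hypothetical helper, not from the paper)."""

    def __init__(self, adjacency, targets, seed):
        # adjacency: dict mapping each vertex to the set of its neighbors (hidden full graph)
        # targets: set of target vertices (hidden ground truth)
        # seed: starting vertex, assumed to belong to the target subnetwork
        self.adjacency = adjacency
        self.targets = targets
        self.labels = {seed: 1}        # queried vertices and their revealed labels
        self.observed_edges = set()    # edges revealed so far
        self.boundary = set()          # observed vertices whose label is still unknown
        self._reveal_neighbors(seed)

    def _reveal_neighbors(self, v):
        # Querying v exposes all of its neighbors; unlabeled ones join the boundary.
        for u in self.adjacency[v]:
            self.observed_edges.add(frozenset((u, v)))
            if u not in self.labels:
                self.boundary.add(u)

    def step(self, v):
        """Query boundary vertex v and return the immediate reward (1 for a target, else 0)."""
        if v not in self.boundary:
            raise ValueError("only boundary vertices can be queried")
        self.boundary.remove(v)
        label = 1 if v in self.targets else 0
        self.labels[v] = label
        self._reveal_neighbors(v)
        return label


# Hypothetical four-vertex path-like graph with targets {0, 1}.
graph = {0: {1, 2}, 1: {0, 3}, 2: {0}, 3: {1}}
env = DiscoveryEnv(graph, targets={0, 1}, seed=0)
reward = env.step(1)
```

After the single query, vertex 1 has left the boundary and its previously unseen neighbor 3 has entered it, which is the state change an agent would observe at each step.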

This problem may be stated as a Markov Decision Process (MDP). An MDP is defined by the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{T}, r)$:

  • The state space, $\mathcal{S}$, is the set of intermediate discovered networks.

  • The action space, $\mathcal{A}_t = B_t$, at each step $t$, where $B_t$ is the set of boundary vertices at step $t$.

  • The transition model, $\mathcal{T}(s_t, a_t, s_{t+1})$, encodes how the network state changes by specifying the probability of state $s_t$ transitioning to $s_{t+1}$ given action $a_t$. We do not model this transition function explicitly and take the model-free approach, where we iteratively define and approximate reward without having to directly specify the network state transition probabilities. We make this more precise in Section 4.

  • The local reward function, $r(s_t, a_t)$, returns the reward gained by executing action $a_t$ in state $s_t$ and is defined as: $r(s_t, a_t) = 1$ if the vertex selected by $a_t$ belongs to the target subnetwork, and $0$ otherwise. The total cumulative, action-specific reward, also referenced as the value function $Q(s_t, a_t)$, is defined as:

    $$Q(s_t, a_t) = \sum_{k \geq 0} \gamma^{k} \, r(s_{t+k}, a_{t+k}) \qquad (1)$$

    where the sum runs over the remaining steps of the discovery trajectory, with $\gamma \in [0, 1]$ representing a discount factor that captures the utility of exploring future graph states. In the next section, we describe in detail our deep reinforcement learning algorithm.
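As a small worked example of the discounted cumulative reward, the sketch below computes it for every step of one trajectory with a backward pass. The function name `discounted_returns`, the sample reward sequence, and the value of the discount factor are assumptions chosen for illustration; they are not taken from the paper.

```python
def discounted_returns(rewards, gamma):
    """For every step t, return sum over k of gamma**k * rewards[t + k]."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk the trajectory backwards so each return reuses the one after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Hypothetical trajectory: target found, two misses, target found.
trajectory_rewards = [1, 0, 0, 1]
returns = discounted_returns(trajectory_rewards, gamma=0.5)
# returns == [1.125, 0.25, 0.5, 1.0]
```

The second entry shows how a step with no immediate reward still receives credit for a target found two queries later, which is the delayed-reward behavior the discount factor controls.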

Figure 1: Illustration of the estimation of the cumulative reward of a state over a trajectory with a given discount factor; red nodes represent the node type we would like to discover.

4 Network Actor Critic (NAC) Algorithm

4.1 Offline Learning and Policy Optimization

In our setting, learning happens offline over a training set of possible discovery paths. We use simulated instances of both background networks and target subnetworks to generate paths or trajectories over the network state space.

Each path represents an alternating sequence of discovered graphs and actions, $(s_0, a_0, s_1, a_1, \ldots)$, taken over $h$ steps. Since in this setting we have access to the ground truth vertex labels, we can map each discovery path to the corresponding cumulative reward value using equation (1). An illustration is given in Figure 1.

Given the sampled trajectories, one of our learning objectives becomes to approximate the value function $V_\phi$ by minimizing the loss $L_V(\phi)$,

$$L_V(\phi) = \sum_{t} \left( V_\phi(s_t) - R_t \right)^2 \qquad (2)$$

We formulate this objective by taking the input tuples of discovered graphs $G_t$, boundary nodes $B_t$ and corresponding cumulative reward values $R_t$ computed with equation (1), such that $V_\phi(G_t, B_t) \approx R_t$. The approximated function can then be utilized to estimate the policy function $\pi_\theta(a_t \mid s_t)$, which defines the action probability distribution at each state. In particular, we estimate the advantage of choosing one node versus another at state $s_t$,

$$\hat{A}_t = R_t - V_\phi(s_t) \qquad (3)$$

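This advantage, $\hat{A}_t$, is used to scale the policy gradient estimator, typically defined as $\hat{g} = \hat{\mathbb{E}}_t \left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, \hat{A}_t \right]$. We utilize a proximal policy optimization (PPO) method [23] in order to compute this gradient. PPO methods are widely utilized for policy network optimization and have been demonstrated to achieve state of the art performance on graph tasks [11]. The objective function utilized is defined in equation 4,

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t \left[ \min \left( \rho_t(\theta) \hat{A}_t, \; \mathrm{clip}(\rho_t(\theta), 1 - \epsilon, 1 + \epsilon) \, \hat{A}_t \right) \right] \qquad (4)$$

where $\rho_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{old}}(a_t \mid s_t)$ is the probability ratio between the updated and the previous policy, and the clipping parameter $\epsilon$ is used to bound the loss function and help with convergence. During offline training, we modify this objective to encourage exploration and reduce the number of required training epochs to converge to a solution. For equation 5, $H(\pi_\theta(\cdot \mid s_t))$ denotes the entropy of policy $\pi_\theta$ in state $s_t$ and is used, weighted by an exploration constant $c$, to balance exploitation vs exploration,

$$L(\theta) = \hat{\mathbb{E}}_t \left[ L^{CLIP}_t(\theta) + c \, H(\pi_\theta(\cdot \mid s_t)) \right] \qquad (5)$$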
Both learning objectives 2 and 5 are jointly optimized via an actor critic training framework. This framework is detailed further below in the description of the Network Actor Critic (NAC) algorithm. To reduce training time, multiple instantiations of agents are run simultaneously. Values collected by each agent are stored in a buffer, which is used to compute the losses for the value function and policy networks after a fixed time window of $T$ steps.
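To make the policy objective concrete, the sketch below evaluates a clipped surrogate with an entropy bonus on a small batch. It assumes the standard clipped form from the PPO literature; whether it matches the paper's equations 4 and 5 term by term is an assumption, and the function names, argument layout, and the symbols `epsilon` and `c` are hypothetical.

```python
import math


def clipped_surrogate(new_probs, old_probs, advantages, epsilon):
    """Mean PPO clipped surrogate over a batch of sampled (state, action) pairs."""
    total = 0.0
    for p_new, p_old, adv in zip(new_probs, old_probs, advantages):
        ratio = p_new / p_old
        clipped = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
        # Taking the minimum removes the incentive to move the ratio outside the clip range.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


def entropy(action_probs):
    """Shannon entropy of a single action distribution."""
    return -sum(p * math.log(p) for p in action_probs if p > 0.0)


def ppo_objective(new_probs, old_probs, advantages, policies, epsilon, c):
    """Clipped surrogate plus an entropy bonus weighted by the exploration constant c."""
    mean_entropy = sum(entropy(p) for p in policies) / len(policies)
    return clipped_surrogate(new_probs, old_probs, advantages, epsilon) + c * mean_entropy


# Hypothetical batch of one sample: the new policy is far more likely to pick the action.
gain = clipped_surrogate([0.9], [0.5], [1.0], epsilon=0.2)    # ratio 1.8 is clipped to 1.2
loss_side = clipped_surrogate([0.9], [0.5], [-1.0], epsilon=0.2)  # unclipped term -1.8 is kept
```

With a positive advantage the contribution is capped once the ratio exceeds `1 + epsilon`, while with a negative advantage the more pessimistic unclipped term is retained. A larger `c` rewards flatter action distributions, which corresponds to the exploration encouraged during offline training.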

4.1.1 Training and Network Details

The NAC algorithm is updated differently during offline training versus online evaluation. During offline training, the ADAM optimizer [25] is used to update network parameters $\theta$ and $\phi$ for the policy and value function networks, respectively. In offline training, eight agents simultaneously carry out the anomaly discovery task, each on a unique network realization generated using the random graphs outlined in Table 1. During offline training, the hyperparameters used are: , , , , , and learning rate . For online evaluation, a single agent is used with , , , , , and . The policy and value function networks are both composed of 3 convolutional layers with 64 hidden channels and a final fully connected layer.

4.2 Truncated Node Rank Embedding

One challenge that many reinforcement learning algorithms have to address is exploration of large state spaces. We consider the transformation of personalized PageRank (PPR) [18], which produces a ranking of vertices and allows for more effective detection of invariant structures among the potential network states [19, 20]. Furthermore, PPR fits naturally into our sequential network discovery setting and has been shown to effectively highlight other target nodes related to the initial seed network. We use the PPR ranking to reorder the rows of the original adjacency matrix. We further truncate this adjacency matrix for additional efficiency gains and only retain the adjacency matrix defined by the top $k$ vertices. $k$ is a parameter we select, and it defines the supporting network for computing potential discovery trajectories and long-term reward.
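The ranking and truncation steps can be sketched as follows. This is a minimal illustration under several assumptions: PPR is computed with plain power iteration, the restart probability `alpha` and the iteration count are arbitrary defaults, both rows and columns are reordered, and ties are broken by vertex id. None of these choices are confirmed by the paper, and the function names are hypothetical.

```python
def personalized_pagerank(adjacency, seeds, alpha=0.15, iterations=100):
    """Power iteration for PPR; adjacency maps each vertex to its set of neighbors."""
    vertices = list(adjacency)
    restart = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in vertices}
    rank = dict(restart)
    for _ in range(iterations):
        # Every step restarts at the seeds with probability alpha.
        nxt = {v: alpha * restart[v] for v in vertices}
        for v in vertices:
            neighbors = adjacency[v]
            if not neighbors:
                # Dangling vertex: hand its walking mass back to the seeds.
                for s in seeds:
                    nxt[s] += (1.0 - alpha) * rank[v] / len(seeds)
                continue
            share = (1.0 - alpha) * rank[v] / len(neighbors)
            for u in neighbors:
                nxt[u] += share
        rank = nxt
    return rank


def truncated_rank_matrix(adjacency, seeds, k):
    """Adjacency matrix restricted to the k top-ranked vertices, in PPR order."""
    rank = personalized_pagerank(adjacency, seeds)
    order = sorted(adjacency, key=lambda v: (-rank[v], v))[:k]
    matrix = [[1 if u in adjacency[v] else 0 for u in order] for v in order]
    return order, matrix


# Hypothetical star graph: seed vertex 0 in the center, three leaves.
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
order, matrix = truncated_rank_matrix(star, seeds={0}, k=2)
```

Because the ordering depends only on the seed-relative ranking, two discovered networks with the same structure around the seed map to the same truncated matrix regardless of how their vertices were originally labeled, which is one way to read the invariance argument above.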

set hyper-parameters: exploration constant $c$, learning rate $\alpha$, update window size $T$;
initialize: policy parameters $\theta$, value function parameters $\phi$, buffer ;
;
for t = 1, 2, … do
      ;
      for agent = 1, 2, …, N do
            ;
            take action and save reward ;
            ;
            save to buffer ;
      end for
      if t modulo T is 0 then
            Compute batch update tuples over horizon using ;
            Batch update $\phi$ via ADAM using eq. 2;
            Compute advantage using eq. 3;
            Batch update $\theta$ via ADAM using eq. 5;
      end if
end for
Algorithm 1 Network Actor Critic (NAC)

5 Experiments

We evaluate our algorithm against several learning scenarios for both synthetic and realistic datasets. Next we describe our datasets and baselines used for comparison.

5.1 Datasets

Synthetic Datasets: We approach synthetic graph generation by individually modeling a background network (i.e., the network that does not contain any of the target nodes), and the foreground network (i.e., the network that only contains the target nodes and the interactions among them). We use two models to generate samples of background networks. The Stochastic Block Model (SBM) [26] is a common generative graph model that allows us to model community structure as dense subgraphs sparsely connected with the rest of the network. The Lancichinetti–Fortunato–Radicchi (LFR) model [21] is another frequently used generative model that, in contrast to SBM, allows us to simulate network samples with skewed degree distributions and skewed community sizes, and therefore is able to capture more realistic and complex properties of real networks. Finally, we use the Erdős–Rényi (ER) model [26] to simulate the foreground network. ER is a simple generative model where vertices are connected with equal probability, which controls the density of the foreground network. Parameter choices for all the models above are detailed in Table 1.

In order to create a background plus foreground network sample, we select a subset of the nodes from the background network that will represent the identity of the target nodes. We then simulate an ER subnetwork on these nodes and replace their background induced subnetwork with the ER subnetwork. We reference this process in the rest of the paper as embedding the foreground subnetwork.
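The embedding step can be sketched as below. This is an illustrative reading of the procedure rather than the authors' generator: the edge-set representation, the function name `embed_foreground`, and the toy inputs are assumptions.

```python
import random


def embed_foreground(background_edges, target_nodes, density, rng):
    """Replace the subgraph induced by target_nodes with an ER graph of the given density."""
    targets = set(target_nodes)
    # Drop every background edge whose two endpoints are both target nodes.
    edges = {e for e in background_edges if not e <= targets}
    ordered = sorted(targets)
    # Connect each pair of target nodes independently with probability `density`.
    for i, u in enumerate(ordered):
        for v in ordered[i + 1:]:
            if rng.random() < density:
                edges.add(frozenset((u, v)))
    return edges


# Hypothetical background: a path 0-1-2-3, with vertices 0, 1, 2 chosen as targets.
background = {frozenset((0, 1)), frozenset((1, 2)), frozenset((2, 3))}
dense = embed_foreground(background, [0, 1, 2], density=1.0, rng=random.Random(0))
empty = embed_foreground(background, [0, 1, 2], density=0.0, rng=random.Random(0))
```

Edges that connect a target node to the rest of the background, such as the edge between 2 and 3 here, are left untouched, so only the induced subgraph on the target set changes.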

Real Datasets: We analyzed two Facebook datasets [22] representing pages of different categories as nodes and mutual likes as edges. For both cases, we study the discovery of a target set of vertices, where we control how we generate and embed them in the background network. In particular, we embed a synthetic foreground subnetwork consisting of a denser (anomalous) ER graph with 80 vertices (Table 2) and a fixed edge density. We also consider the Livejournal dataset [9]. This dataset represents an online social network with users representing the nodes, and their self-declared friendships representing the edges. For each user, there is also information on the groups they have joined. Similarly to [9], we use one of the listed groups as the target class. The Livejournal dataset represents a departure from the two Facebook datasets, both in its much larger size and in that its target class does not represent an anomaly. A few topological characteristics of the real networks described here, as well as details on their target classes, are listed in Table 2.

5.2 Baselines

We evaluate the NAC algorithm by comparing its performance with two top-performing online network discovery approaches. The Network Online Learning (NOL) [5] algorithm learns an online regression function that maximizes discovery of previously unobserved nodes for a given number of queries. We modify the objective of NOL to match our problem setting by requiring the discovery of previously unobserved nodes of a particular type. A second baseline we consider is the Directed Diversity Dynamic Thompson Sampling (D3TS) [9] approach. D3TS is a stochastic multi-armed bandit approach that leverages different node classifiers and Thompson sampling to diversify the selection of a boundary node. Finally, we compare to a simple fixed node selection heuristic referenced in [9] called Maximum Observed Degree (MOD). At every decision step, MOD selects the node with the highest number of observed neighbors that have the desired label.
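The MOD selection rule is simple enough to state directly in code. The sketch below is an illustration only: the function name, the dictionary-based inputs, and the tie-breaking by smallest vertex id are assumptions, since the text does not specify how ties are resolved.

```python
def max_observed_degree(boundary, observed_neighbors, labels):
    """Pick the boundary vertex with the most observed neighbors labeled as targets."""

    def target_degree(v):
        return sum(1 for u in observed_neighbors.get(v, ()) if labels.get(u) == 1)

    # Ties are broken by the smaller vertex id (an arbitrary choice for this sketch).
    return min(boundary, key=lambda v: (-target_degree(v), v))


# Hypothetical observed state: vertices 0 and 1 are known targets, 5 is a known non-target.
labels = {0: 1, 1: 1, 5: 0}
observed_neighbors = {2: {0}, 3: {0, 1}, 4: {5}}
choice = max_observed_degree({2, 3, 4}, observed_neighbors, labels)
# choice == 3, the only boundary vertex adjacent to two known targets
```

Because the rule looks only at immediate labeled neighbors, it has no way to value a query that yields no reward now but leads toward a distant group of targets.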

Model Type Parameters
SBM Background
LFR Background
ER Foreground
Table 1: Detailed list of parameter values used for synthetic networks. Number of vertices is represented by . SBM parameters are: represents the number of communities, the within-community edge probability for community , the across-community edge probability, such that . LFR parameters are: skewness parameters for degree and cluster size distributions respectively, represents the average network degree, represent the min and max values of degree distribution, and represent the sizes of smallest and largest clusters, and finally represent the size of the foreground subnetwork, number of foreground subnetworks and its edge probability, respectively.
Name                  # Nodes   # Edges   Target Type   Target Size
Facebook Politician   5,908     41,729    Synthetic     80
Facebook TV Shows     3,892     17,262    Synthetic     80
Livejournal           4,000k    35,000k   Real          1,400
Table 2: Characteristics of the real networks and corresponding target classes.

5.3 Learning Scenarios

In the first learning scenario, the goal is to detect a set of distributed anomalous vertices. They are represented by two cliques, each containing 40 vertices, that are embedded 2 to 3 hops away from each other. The training instances are networks generated by the SBM model, while the test cases are network instances generated by the LFR model. In this scenario, the discovery agent has to figure out 1) how to value longer exploration paths over the cost of including nodes not in the target set, and 2) how to adjust to topological differences between training and testing instances. In Figure 2(a), we consider a test case where detectability of the two cliques with complete network information is relatively easy (the average density of the background where the cliques are embedded is comparatively low). We observe that all the methods are able to find the first clique, yet all the baselines struggle once they enter the region where no clique nodes are present. The baselines eventually find some clique nodes, but, even then, they are unable to fully retrieve the second clique. NAC is able to leverage estimation of long-term reward and access to the offline policy to fully recover both cliques, and furthermore, is able to generalize to the more complex LFR topology.

In Figure 2(b), we consider a much harder case: embedding two disjoint dense subgraphs, each with density 0.2, in a background of density 0.05. These parameters are close to the detectability bound [24] for the complete network case. In this case, none of the baselines learns how to recover the second subgraph. NAC goes through a longer exploration phase, but eventually learns how to grow the network to identify the second subgraph. In Figure 3(a) and (b), we illustrate how our model trained on synthetic background networks generalizes to realistic background topologies. For this scenario, we trained with instances from both the LFR and SBM models. We observe that NAC generalizes very well to the Facebook network topologies and is able to fully discover the target nodes.

Figure 2: NAC discovers two anomalous cliques that are not adjacent. Panels: (a) easier target detectability; (b) harder target detectability.

Figure 3: NAC outperforms competitive online methods on real network topologies. Panels: (a) Facebook Politician; (b) Facebook TV Shows; (c) Livejournal.

In our last learning scenario (Figure 3(c)), we illustrate how our model generalizes to a test case where both the background network and the target set are from real data. Our model has only seen target class examples represented by a dense ER model, yet is able to discover an online Livejournal group with 1,400 users. We note the initial exploration cost, as NAC learns to adapt to the new target topology. Eventually, by query 850, NAC is able to discover the group members more efficiently, and by query 1400 it fully recovers the whole group. In Figure 4(a) and (b), we demonstrate how re-ordering the adjacency matrix of the observed network by the PPR score supports faster model convergence during training. We illustrate this by analyzing the convergence behavior on the test case described in Figure 2(a), but the behavior is consistent for all the different test cases considered. Finally, in Figure 4(c), we illustrate that NAC has learned strategies beyond picking a vertex with a high PPR score. In particular, NAC has learned how to explore regions where delayed reward is critical (in this example, the region between the two disjoint cliques).

Figure 4: NAC convergence on a test set, without and with PPR ranking (a, b). NAC queries do not always agree with highly ranked nodes (c). Panels: (a) without PPR; (b) with PPR; (c) NAC vs PPR.

6 Conclusions and Future Work

We introduced NAC, a deep RL framework for task-driven discovery of incomplete networks. NAC learns offline models of reward and network discovery policies based on a synthetically generated training set. NAC is able to learn effective strategies for the task of selective harvesting, especially for learning scenarios where the target class is relatively small and difficult to discriminate. We show that NAC strategies transfer well to unseen and more complex network topologies including real networks.

Our approach has opened up many interesting avenues for future research. The effectiveness and convergence of our algorithm rely on being trained on a sufficiently representative training set. It is valuable to further explore and quantify the limits of transferability of synthetically generated training sets. Interestingly, our current framework is flexible enough to incorporate additional discovery strategies generated from other methods, as part of the offline training process. This feature can lead to more efficient discovery strategies, but we leave the careful analysis for future work. Selecting an effective approximation strategy is another topic for future research. NAC leverages PageRank to quickly identify regions of relevance, but it is of great interest to identify other graph space embeddings that can support fast navigation through the network state space. Finally, the framework is general enough to support discovery for other network learning tasks. It is valuable to explore how a different learning objective changes the training, convergence, and generalizability requirements.