1 Introduction
Social interactions between people play an important role in spreading information and behavioral changes. The problem of identifying a small set of influential nodes in a social network that can help spread information to a large group is termed influence maximization (IM) [Kempe03]. It is widely used in applications such as viral marketing [Kempe03] and rumor control [Budak11], which rely on online social networks. In addition, IM has found useful applications in domains involving real-world physical social networks. These applications include identifying peer leaders in homeless youth networks to spread awareness about HIV [Wilder18, yadav2016using], identifying student leaders in school networks to disseminate information on substance abuse [valente2007identifying], and identifying users who can increase participation in microfinance [banerjee2013diffusion]. In real-world social networks, the network information is not readily available and is generally gathered by individually surveying the people who are part of the network. As conducting such surveys is a time-intensive process requiring substantial effort from a dedicated team of social work researchers, it is not practical to obtain the complete network structure. Therefore, the influence maximization problem in the real world is coupled with the problem of discovering the unknown network using a limited survey budget (i.e., the number of people who can be queried).
Most of the existing work [Wilder18, Bryan18, yadav2016using] that addresses real-world influence maximization problems performs network discovery by surveying nodes while exploiting a specific network property such as community structure. The CHANGE algorithm [Wilder18] is based on the friendship paradox and performs network discovery by surveying a random node and one of its neighbors. Each node reveals its neighbors upon being queried. The subgraph obtained after querying a limited set of nodes is used to pick a set of influential nodes with an influence maximization algorithm. A recent work by kamarthi2019influence provides a reinforcement learning based approach that automatically trains an agent for network discovery. They developed an extension to DQN, referred to as GeometricDQN, that learns policies for network discovery by extracting relevant graph properties, and it achieves better performance than the existing approaches. Like any other reinforcement learning approach, the work by kamarthi2019influence needs multiple interactions with the environment to perform exploration. As environment interactions are costly in real-world settings, the approach can be improved by reducing the number of environment interactions, i.e., by increasing the sample efficiency. That approach employs a myopic heuristic (new nodes discovered) to guide exploration, whereas we employ goal-directed learning to provide a forward-looking (non-myopic) heuristic.
In this work, we propose to model the network discovery problem as a goal-directed reinforcement learning problem. We take advantage of the Hindsight Experience Replay [Andrychowicz2017] framework, which suggests learning from the failed trajectories of an agent by replaying each episode with a goal different from the one the agent was trying to achieve (e.g., the state visited by the agent at the end of its failed trajectory). This increases sample efficiency, as the agent can get multiple experiences for learning from a single environment interaction. To further improve the performance, we use the curriculum-guided selection scheme proposed by Fang2019 to select the set of episodes for experience replay. While other works focus on improving sample efficiency [sukhbaatar2017intrinsic, burda2018large, colas2019curious], most of them are designed for domain-specific applications and perform only curiosity-driven learning, unlike our curriculum-guided selection scheme, which adaptively controls the exploration-exploitation tradeoff by gradually changing the preference between goal-proximity and diversity-based curiosity.
Contributions: In summary, following are the main contributions of the paper along different dimensions:


Problem: We convert the whole process of network discovery and influence maximization into a goal-directed learning problem. Unlike standard goal-directed learning problems, where the goal state is known, the goal state is not given in this problem. We provide a novel heuristic to generate goals for our problem setting.

Algorithm: We propose a new approach, CLAIM (Curriculum LeArning Policy for Influence Maximization in unknown social networks), which uses curriculum-guided hindsight experience replay and a goal-directed GeometricDQN architecture to learn sample-efficient policies for discovering the network structure.

Experiments: We perform experiments on social networks from three different domains and show that our approach improves the total number of influenced nodes by up to 7.51% over the existing approach.
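
| Notation | Description |
|---|---|
| $G^* = (V^*, E^*)$ | Entire unknown graph |
| $S$ | Set of nodes known initially |
| $G_t = (V_t, E_t)$ | Subgraph of $G^*$ discovered after $t$ queries |
| $N_G(u)$ | Neighbors of vertex $u$ in graph $G$ |
| $E(X, Y)$ | All direct edges that connect a node in set $X$ and a node in set $Y$ |
| $O(G)$ | Set of nodes from graph $G$ selected by the influence maximization algorithm |
| $I_G(A)$ | Expected number of nodes influenced in graph $G$ on choosing $A$ as the set of nodes to activate |

Table 1: Notation
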
2 Problem Description
The problem considered in this work involves discovering a subgraph of the unknown network such that the set of peer leaders chosen from the discovered subgraph maximizes the number of people influenced by peer leaders. We now describe both the components of the problem, i.e., network discovery and influence maximization in detail. The notations used in the problem description are defined in Table 1.

Network Discovery Problem: The network discovery problem can be described as a sequential decision-making problem where, at each step, the agent queries a node from the discovered subgraph. The queried node reveals its neighbors, expanding the discovered subgraph. The process continues for a fixed number of steps, determined by the budget constraint. Formally, we are initially given a set of nodes $S$, and the agent can observe all the neighbors of the nodes in $S$. Therefore, $G_0 = (S \cup N_{G^*}(S),\; E(S, N_{G^*}(S)))$. The agent has a budget of $T$ queries to gather additional information. For the $(t+1)$-th query, the agent can choose a node $u_t$ from $G_t$ and observe $G_{t+1}$:
$$G_{t+1} = \left(V_t \cup N_{G^*}(u_t),\; E_t \cup E(\{u_t\}, N_{G^*}(u_t))\right).$$
At the end of the network discovery process, i.e., after $T$ queries, we get the final discovered subgraph $G_T$. This graph is provided as an input to an IM algorithm.
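
To make the query mechanics concrete, the following is a minimal sketch of the discovery step, assuming the hidden network is stored as a dictionary of adjacency sets. The names `initial_subgraph` and `query` are illustrative and not part of the original work.

```python
from typing import Dict, Hashable, Set

Node = Hashable
Graph = Dict[Node, Set[Node]]  # undirected graph as adjacency sets


def initial_subgraph(full: Graph, seeds: Set[Node]) -> Graph:
    """Build the initial subgraph: seeds, their neighbors, and the edges between them."""
    known: Graph = {}
    for s in seeds:
        known.setdefault(s, set())
        for v in full[s]:
            known[s].add(v)
            known.setdefault(v, set()).add(s)
    return known


def query(full: Graph, known: Graph, u: Node) -> Graph:
    """Return a new subgraph in which node u has revealed all of its true neighbors."""
    if u not in known:
        raise ValueError("only already discovered nodes can be queried")
    expanded = {v: set(nbrs) for v, nbrs in known.items()}  # copy, keep input intact
    for v in full[u]:
        expanded[u].add(v)
        expanded.setdefault(v, set()).add(u)
    return expanded
```

Calling `query` once per step for the whole budget yields the final discovered subgraph that is handed to the influence maximization algorithm.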

Influence Maximization (IM): IM is the problem of choosing a set of influential nodes in a social network who can propagate information to the maximum number of nodes. In this paper, information propagation over the network is modeled using the Independent Cascade Model (ICM) [Kempe03], which is the most commonly used model in the literature. In the ICM, at the start of the process, only the nodes in the chosen initial set are active. The process unfolds over a series of discrete time steps, where at every step, each newly activated node attempts to activate each of its inactive neighbors and succeeds with some probability $p$. The process ends when no nodes are newly activated in a step. After discovering the subgraph through the network discovery process, we can use any standard influence maximization algorithm to find the best set of nodes to activate based on the available information. lowalekar2016robust showed the robustness of the well-known greedy approach [Kempe03] on medium-scale social network instances; the greedy approach also serves as the oracle in our paper.
Overall, given a set of initial nodes $S$ and its observed connections $N_{G^*}(S)$, our task is to find a sequence of queries $(u_0, u_1, \dots, u_{T-1})$ such that $G_T$ maximizes $I_{G^*}(O(G_T))$. Figure 1 shows a visual representation of the problem.
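
The cascade process and the greedy selection described above can be sketched as follows. This is an illustrative Monte Carlo version under assumptions that are not from the paper: the graph is a dictionary of adjacency sets, the expected influence is estimated by averaging a fixed number of simulated cascades, and the function names and the `runs` parameter are hypothetical.

```python
import random
from typing import Dict, Hashable, List, Sequence, Set

Node = Hashable
Graph = Dict[Node, Set[Node]]


def simulate_icm(graph: Graph, seeds: Sequence[Node], p: float, rng: random.Random) -> int:
    """Run one Independent Cascade and return the number of active nodes."""
    active = set(seeds)
    frontier = list(active)
    while frontier:
        newly_active = []
        for u in frontier:
            for v in graph.get(u, ()):
                # Each newly activated node gets one attempt per inactive neighbor.
                if v not in active and rng.random() < p:
                    active.add(v)
                    newly_active.append(v)
        frontier = newly_active
    return len(active)


def expected_influence(graph: Graph, seeds: Sequence[Node], p: float,
                       runs: int = 200, seed: int = 0) -> float:
    """Estimate the expected influence by averaging simulated cascades."""
    rng = random.Random(seed)
    return sum(simulate_icm(graph, seeds, p, rng) for _ in range(runs)) / runs


def greedy_im(graph: Graph, k: int, p: float, runs: int = 200, seed: int = 0) -> List[Node]:
    """Greedily add the node with the largest estimated influence, k times."""
    chosen: List[Node] = []
    for _ in range(min(k, len(graph))):
        best = max((v for v in graph if v not in chosen),
                   key=lambda v: expected_influence(graph, chosen + [v], p, runs, seed))
        chosen.append(best)
    return chosen
```

In the setting of the paper, `greedy_im` would be run on the discovered subgraph, while the resulting seed set would be evaluated with `expected_influence` on the full network.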
3 Background
In this section, we describe the relevant research, the MDP formulation and the GeometricDQN architecture used by kamarthi2019influence to solve the network discovery and influence maximization problem.
3.1 MDP Formulation
The social network discovery and influence maximization problem can be formally modelled as an MDP.

State: The current discovered graph $G_t$ is the state.

Actions: The nodes yet to be queried in network $G_t$ constitute the action space. So, the set of possible actions is $V_t \setminus (S \cup \{u_0, \dots, u_{t-1}\})$.

Rewards: A reward is obtained only at the end of the episode, i.e., after $T$ steps. It is the number of nodes influenced in the entire graph $G^*$ using $O(G_T)$, i.e., $I_{G^*}(O(G_T))$. The episode reward is denoted by $R_T$, where $T$ is the length of the episode (the budget on the number of queries available to discover the network).
Training: To train the agent in the MDP environment, the DQN algorithm is used. However, the original DQN architecture, which takes only the state representation as input and outputs the action values, cannot be used because the action set is not constant and depends on the current graph. Therefore, both state and action are provided as input to the DQN, and it predicts the state-action value. The DQN model can be trained using a single graph or multiple graphs. If we train simultaneously on multiple graphs, the problem becomes a partially observable MDP, as the next state is determined not only by the current state and current action but also by the graph being used. The range of reward values also depends on the size and structure of the graph; therefore, the reward value is normalized when multiple graphs are used for training.
$$R_T = \frac{I_{G^*}(O(G_T))}{OPT(G^*)}, \quad \text{where } OPT(G^*) = I_{G^*}(O(G^*)) \tag{1}$$
3.2 GeometricDQN
As described in the previous section, the state is the current discovered graph and the actions are the unqueried nodes in the current discovered graph. So, a good vector representation of the current discovered graph is required. It is also important to represent each node such that its representation encodes the structural information of the node in the context of the current discovered graph. Figure 2 shows the GeometricDQN architecture, which takes the state and action representations as input and outputs the $Q$ values. The details of the state and action representations are provided below.

State representation: The state is the current graph, and the GeometricDQN architecture uses Graph Convolutional Networks to generate graph embeddings. The graph is represented by the adjacency matrix $A \in \mathbb{R}^{n \times n}$ and a node feature matrix $F^{(k)} \in \mathbb{R}^{n \times d}$ in layer $k$, where $n$ is the number of nodes and $d$ is the number of features. The node features in the input layer of the graph convolutional network, i.e., $F^{(0)}$, are generated using random-walk based DeepWalk embeddings [Perozzi14] (DeepWalk learns node representations that are similar to those of other nodes that lie within a fixed proximity on multiple random walks).
Now, a graph convolutional layer derives node features using a transformation function $F^{(k+1)} = M(A, F^{(k)}; W^{(k)})$, where $W^{(k)}$ represents the weights of the $k$-th layer. Using the formulation in Ying19, the transformation function is given by
$$F^{(k+1)} = \mathrm{ReLU}\left(\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} F^{(k)} W^{(k)}\right),$$
where $\tilde{A}$ is the adjacency matrix with added self-connections, i.e., $\tilde{A} = A + I_n$ ($I_n$ is the identity matrix), and $\tilde{D}$ is the diagonal degree matrix of $\tilde{A}$, i.e., $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$. To better capture the global representation of the graph, differentiable pooling is used, which learns hierarchical representations of the graph in an end-to-end differentiable manner by iteratively coarsening the graph, using the graph convolutional layer as a building block. The output of the graph convolutional network is provided as input to a pooling layer.
Action representation: DeepWalk node embeddings are also used to represent actions. We use $\phi$ to denote the DeepWalk embeddings.
Therefore, if $G_t$ is the current graph (state) and $u_t$ is the current node to be queried (action), we represent the state as $(\phi(G_t), A_t)$, where $A_t$ is the adjacency matrix of $G_t$, and the action as $\phi(u_t)$, which are input to the network as shown in Figure 2.
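
As an illustration of the graph convolutional layer used for the state representation, the sketch below applies one symmetric-normalized propagation step with a ReLU activation in NumPy. It assumes a dense adjacency matrix and a given weight matrix; the function name `gcn_layer` is illustrative and this is not the authors' implementation.

```python
import numpy as np


def gcn_layer(adj: np.ndarray, feats: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """One graph-convolution layer: ReLU(D^-1/2 (A + I) D^-1/2 F W)."""
    a_tilde = adj + np.eye(adj.shape[0])      # add self-connections
    deg = a_tilde.sum(axis=1)                 # degrees of A + I, always >= 1
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    # Scale rows and columns by D^-1/2 without building the diagonal matrix.
    a_norm = a_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(a_norm @ feats @ weight, 0.0)
```

Stacking several such layers and pooling the resulting node features gives a fixed-size vector for the whole discovered graph, which is what the value network needs when the number of nodes changes from step to step.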
4 Our Approach: CLAIM
In this section, we present our approach, CLAIM (Curriculum LeArning Policy for Influence Maximization in unknown social networks). We first explain how the problem can be translated into a goal-directed learning problem. The advantage of this translation is that it allows us to increase sample efficiency by using Curriculum-guided Hindsight Experience Replay (CHER) [Fang2019]. CHER replays each episode with pseudo goals, so the agent gets multiple experiences from a single environment interaction, which increases the sample efficiency.
To use goal-directed learning in our setting, we first present our novel heuristic to generate goals and the modifications to the MDP formulation for goal-directed learning. After that, we present our algorithm to generate a curriculum learning policy using hindsight experience replay.
4.1 Goal Directed Reinforcement Learning
In Goal-Directed or Goal-Conditioned Reinforcement Learning [Andrychowicz2017, nair2018visual], an agent interacts with an environment to learn an optimal policy for reaching a certain goal state, or a goal defined by a function on the state space, in an initially unknown or only partially known state space. If the agent reaches the goal, the reinforcement learning method terminates, having solved the goal-directed exploration problem.
In these settings, the reward that the agent gets from the environment also depends on the goal that the agent is trying to achieve. A goal-conditioned $Q$ function $Q(s, a, g)$ [schaul15] learns the expected return for the goal $g$ starting from state $s$ and taking action $a$. Given a state $s$, action $a$, next state $s'$, goal $g$ and corresponding reward $r$, one can train an approximate $Q$ function parameterized by $w$ by minimizing the following Bellman error:
$$\mathcal{E}(w) = \frac{1}{2} \left\| Q_w(s, a, g) - \left( r + \gamma \max_{a'} Q_w(s', a', g) \right) \right\|^2,$$
where $\gamma$ is the discount factor.
This loss can be optimized using any standard off-policy reinforcement learning method [nair2018visual].
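
A small sketch may help to see how the goal enters the temporal-difference target when the action set varies from state to state. The function names and the callable `q_fn(state, action, goal)` are hypothetical stand-ins for a learned value network, not part of the original work.

```python
from typing import Any, Callable, Sequence

QFunction = Callable[[Any, Any, float], float]


def td_target(reward: float, gamma: float, q_fn: QFunction, next_state: Any,
              next_actions: Sequence[Any], goal: float, done: bool) -> float:
    """Goal-conditioned target: r + gamma * max over the actions available next."""
    if done or not next_actions:
        return reward
    return reward + gamma * max(q_fn(next_state, a, goal) for a in next_actions)


def bellman_error(q_fn: QFunction, state: Any, action: Any, goal: float,
                  target: float) -> float:
    """Half squared difference between the current estimate and the target."""
    return 0.5 * (q_fn(state, action, goal) - target) ** 2
```

Because the maximum is taken only over `next_actions`, the same code works when the set of unqueried nodes changes after every query.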
Generally, in goal-directed reinforcement learning problems, a set of goal states (or goals defined by a function on the state space) is given, and the agent needs to reach one of them. In our setting, however, no explicit goal state is given. To convert the network discovery and influence maximization problem into a goal-directed learning problem, we introduce the notion of goals for our problem. We define the goal as the expected long-term reward, i.e., the expected number of nodes that can be influenced in the network, and any state that achieves this goal value becomes a goal state. As we have a limited query budget to discover the network, the goal value depends heavily on the initial subgraph. If we use the same goal value for every start state, this common value will be a very loose upper bound (or a very loose lower bound) for some start states. Experimentally, we found that if the goal value is too far from the value that can actually be achieved, it slows down learning. So, we design a heuristic to compute a different goal for each start state.
4.1.1 Goal Generation Heuristic
As we need to generate the goal at the start of each episode (i.e., before the agent starts interacting with the environment), we need to compute the goal value without making any queries to the environment. We assume that, based on domain knowledge, the agent can get an estimate of the number of nodes ($|V|$) and edges ($|E|$) in the network, and also an estimate of the average number of nodes that can be influenced in the network irrespective of the start state ($\bar{I}$). We now describe how we use these estimates to design our heuristic to compute the goal value for each start state. Figure 3 shows the steps of our heuristic. As the network is unknown to the algorithm, we assume a network structure and compute the diffusion probability based on the assumed structure and the estimates of the number of nodes, edges and average influence. Using the computed diffusion probability, the given estimates and the assumed network structure, we generate a goal value for a given initial subgraph.
We assume that the network is undirected and uniformly distributed, i.e., each node is connected to $d$ nodes, where $d$ is the average degree obtained from the estimates $|V|$ and $|E|$. We also assume a local tree structure, as shown in Figure 3, to approximate the actual expected influence within the social network [chen2010scalable, wang2012scalable] (these simplified assumptions work well to approximate the influence propagation, and we observe in our experiments that our heuristic outputs a value close to the actual value). The root of a tree can be any of the nodes in $S$ (the initially given nodes), and each node is part of only one such tree. The influence propagation probability is assumed to be $p'$ and is considered to be the same for all edges. We now show how the value of $p'$ can be computed based on the network structure assumption and the available information.
Computing $p'$: We find a value of $p'$ such that the expected influence in our tree-structured network is similar to the estimate of the average influence $\bar{I}$. To compute the expected influence, i.e., the expected number of nodes activated in the network, we need to know the number of layers in the tree structure. Therefore, we first compute the number of layers in our assumed tree structure. Let $n_1$ be the number of nodes at the first layer. For subsequent layers, each node is connected to $d - 1$ nodes at the layer below it (one edge goes to the node at the layer above). We use $\delta$ to denote the quantity $d - 1$. As the total number of nodes in the graph is $|V|$, the sum of the number of nodes at all layers should be equal to $|V|$. Let $L$ denote the number of layers. Then,
$$n_1 + n_1 \delta + \dots + n_1 \delta^{L-1} = |V| \tag{2}$$
$$n_1 \, \frac{\delta^{L} - 1}{\delta - 1} = |V| \tag{3}$$
Solving for $L$ gives $L = \log_{\delta}\left(\frac{|V| (\delta - 1)}{n_1} + 1\right)$. Now, we compute the expected number of nodes activated (influenced) in our assumed network with the propagation probability $p'$. Let $\hat{I}$ denote the expected number of nodes influenced in the network. Then,
$$\hat{I} = n_1 p' + n_1 \delta p'^{2} + \dots + n_1 \delta^{L-1} p'^{L} \tag{4}$$
$$\hat{I} = n_1 p' \, \frac{(\delta p')^{L} - 1}{\delta p' - 1} \tag{5}$$
If our assumed network is similar to the actual network, the value of $\hat{I}$ should be close to $\bar{I}$, i.e., the average number of nodes influenced in the network. Therefore, to find the value of $p'$, we perform a search in the probability space and use the value of $p'$ that makes $\hat{I}$ closest to $\bar{I}$.

Computing the goal value for a given initial subgraph: Now, to compute the goal value for a given initial subgraph, we use the value of $p'$ computed above. The subgraph is known, i.e., the neighbors of the nodes in set $S$ ($N_{G^*}(S)$) are known. Therefore, the number of nodes at the first layer is equal to $|N_{G^*}(S)|$, i.e., $n_1 = |N_{G^*}(S)|$. From the next layer onwards, we assume a similar tree structure as before, with each node connected to $\delta$ nodes at the layer below it. Therefore, to compute the goal value, we substitute $n_1$ with $|N_{G^*}(S)|$ in equations 3 and 5 to compute the number of layers and the influence value. We use the value of $p'$ computed above and solve for the expected influence $\hat{I}$. The value obtained is the influence value we can achieve for the given subgraph based on the assumptions and the available information. We use this value of $\hat{I}$ as our goal for the subgraph.
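
The two stages of the heuristic can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' code: layer sizes are taken to grow geometrically (first-layer size times the branching factor per layer), a node at layer l is activated with probability p raised to l, the roots themselves are not counted, the number of layers is allowed to be fractional, and the search over the probability is a simple grid search. All function and parameter names are hypothetical.

```python
import math


def num_layers(n_nodes: float, first_layer: float, delta: float) -> float:
    """Solve first_layer * (delta**L - 1) / (delta - 1) = n_nodes for L (needs delta > 1)."""
    return math.log(n_nodes * (delta - 1) / first_layer + 1, delta)


def tree_influence(first_layer: float, delta: float, layers: float, p: float) -> float:
    """Closed form of the sum over l = 1..L of first_layer * delta**(l-1) * p**l."""
    ratio = delta * p
    if math.isclose(ratio, 1.0):
        return first_layer * p * layers        # every layer contributes the same amount
    return first_layer * p * (ratio ** layers - 1) / (ratio - 1)


def fit_probability(n_nodes: float, first_layer: float, delta: float,
                    avg_influence: float, grid: int = 1000) -> float:
    """Grid-search the probability whose tree influence is closest to the estimate."""
    layers = num_layers(n_nodes, first_layer, delta)
    candidates = [i / grid for i in range(1, grid + 1)]
    return min(candidates,
               key=lambda p: abs(tree_influence(first_layer, delta, layers, p) - avg_influence))


def goal_value(n_nodes: float, n_frontier: float, delta: float, p: float) -> float:
    """Goal for a start state whose first layer has n_frontier observed neighbors."""
    layers = num_layers(n_nodes, n_frontier, delta)
    return tree_influence(n_frontier, delta, layers, p)
```

The probability is fitted once from the global estimates, and `goal_value` is then evaluated per episode with the size of the observed neighborhood, so start states with larger neighborhoods receive larger goals.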
4.1.2 Modifications to the MDP formulation:
The state and action remain the same as before, but due to the introduction of goals, the reward function is now parameterized by the goal. Let $r_t^g$ denote the reward obtained at timestep $t$ when the goal is $g$. As we only get an episode reward, the reward is defined as follows (normalizing the reward using the goal stabilizes the learning):
(6) 
4.2 Algorithm
In this section, we describe the algorithm used to train the reinforcement learning agent. We use the DQN algorithm together with Curriculum-guided Hindsight Experience Replay to improve the sample efficiency. Algorithm 1 describes the detailed steps. Figure 4 provides a visual representation.
We train using multiple training graphs. In each episode, we sample a training graph $G^*$ and then sample an initial set of nodes $S$. We generate the input state by computing the DeepWalk embeddings at each timestep and use an $\epsilon$-greedy policy to select the action, i.e., the node to be queried. In step 1, we store the transitions according to standard experience replay, where we also add the goal to the experience buffer.
Steps 11 are the first set of crucial steps to improve the sample efficiency. As per the Hindsight Experience Replay technique proposed by Andrychowicz2017, we sample pseudo goals $G'$ and, in addition to storing each sample with the actual goal $g$ for the episode, we also store it with the desired goal (which the agent could not achieve in the failed trajectory) replaced by a pseudo goal $g' \in G'$. The reward with the pseudo goal is recomputed as per Equation 6.
While there are multiple possible strategies to generate the set of pseudo goals [Andrychowicz2017], the most common strategy is to use the goal achieved at the end of the episode. Therefore, in this work, we use $I_{G^*}(O(G_T))$ as $g'$.
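
The relabeling step can be illustrated as follows. The sketch assumes that only the last transition of an episode carries a reward and takes the goal-dependent terminal reward as a user-supplied callable `terminal_reward(achieved, goal)`, a hypothetical placeholder for the reward of Equation 6. The `Transition` container and the function name are likewise illustrative.

```python
from typing import Any, Callable, List, NamedTuple


class Transition(NamedTuple):
    state: Any
    action: Any
    reward: float
    next_state: Any
    goal: float


def relabel_with_achieved_goal(
    episode: List[Transition],
    achieved: float,
    terminal_reward: Callable[[float, float], float],
) -> List[Transition]:
    """Return the original transitions plus copies whose goal is the achieved value."""
    relabeled = []
    last = len(episode) - 1
    for t, tr in enumerate(episode):
        # Only the final step carries a reward; it is recomputed for the pseudo goal.
        reward = terminal_reward(achieved, achieved) if t == last else 0.0
        relabeled.append(tr._replace(goal=achieved, reward=reward))
    return list(episode) + relabeled
```

Each environment episode therefore contributes twice as many transitions to the buffer, and the relabeled half always describes a trajectory that reached its goal.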
Step 1 is the second crucial step towards improving the sample efficiency: for sampling experiences from the replay buffer, we use a curriculum-guided selection process that relies on goal-proximity and diversity-based curiosity [Fang2019]. Instead of sampling experiences uniformly, we select a subset of experiences based on the tradeoff between goal-proximity and diversity-based curiosity. This plays an important role in guiding the learning process. A large proximity value pushes the training towards the desired goals, while a large diversity value ensures exploration of different states and regions of the environment. To sample a subset $A$ of size $k$ for replay from the experience buffer $\mathcal{B}$, the following optimization needs to be solved:
$$\max_{A \subseteq B,\; |A| \le k} F(A) = \lambda F_{prox}(A) + F_{div}(A) \tag{7}$$
where $B$ is the uniformly sampled subset of size $mk$ from the buffer $\mathcal{B}$. Let $m = 3$, as Fang2019 does. $F_{prox}(A)$ measures the proximity of the achieved goals in $A$ to the desired goal $g$. The second term, $F_{div}(A)$, denotes the diversity of states and regions of the environment in $A$. The weight $\lambda$ is used to balance the tradeoff between the proximity and the diversity. The tradeoff is balanced such that it enforces a human-like learning strategy, with more curiosity-driven exploration in the earlier stages and the weight shifted to goal-proximity later.
In our work, we define the proximity as the similarity between goal values, and the diversity is based on the distance between visited states. This is because even though goal values (influence achieved) can be different, the states visited can still be very similar to each other. Formally, to define proximity, we use the difference between the achieved goal and the desired goal as the distance and subtract it from a large constant to get the similarity, i.e.,
$$F_{prox}(A) = \sum_{i \in A} \left( c - \left| g_i^{ac} - g \right| \right) \tag{8}$$
where $c$ is a large number that guarantees $F_{prox}(A) \ge 0$ for all possible $A$, and $g_i^{ac}$ is the achieved goal corresponding to experience $i$ in set $A$. For defining diversity, we need to compute the similarity between states, and the GeometricDQN architecture allows us to easily compute this value. Diversity is defined as follows:
$$F_{div}(A) = \sum_{j \in B} \max_{i \in A} \, sim\left(emb(s_i), emb(s_j)\right) \tag{9}$$
where $emb(s_i)$ denotes the embedding vector of the state (the representation of the graph in the embedding space) corresponding to experience $i$. The embedding vector of the state is the output of the graph convolution and pooling layers (the input to FC1) in Figure 2. $sim(\cdot, \cdot)$ denotes the similarity score between the vector representations and is computed by taking the dot product of the vectors.
This definition of diversity is inspired by the facility location function [cornuejols1977uncapacitated, lin2009graph], which was also used by Fang2019. Intuitively, the diversity term measures how well the selected experiences in set $A$ can represent the experiences from $B$. A large diversity score indicates that every achieved state in $B$ can find a sufficiently similar state in $A$. A diverse subset is more informative and thus helps improve the learning.
It has been shown that $F(A)$ defined in equation 7 is a monotone non-decreasing submodular function (it is a weighted sum of a non-negative modular function, $F_{prox}(A)$, and a submodular function, $F_{div}(A)$; please refer to the paper by Fang2019 for details). Therefore, even though exactly solving equation 7 is NP-hard, due to the submodularity property, the greedy algorithm can provide a solution with an approximation factor of $1 - 1/e$ [nemhauser1978analysis]. The greedy algorithm picks the top $k$ experiences from the buffered experiences $B$. It starts with $A$ as an empty set and, at each step, adds the experience $i \in B \setminus A$ that maximizes the marginal gain. We denote the marginal gain for experience $i$ by $F(i \mid A)$, and it is given by
$$F(i \mid A) = F(A \cup \{i\}) - F(A) = \lambda \left( c - \left| g_i^{ac} - g \right| \right) + \sum_{j \in B} \max\left(0,\; sim(emb(s_i), emb(s_j)) - \max_{l \in A} sim(emb(s_l), emb(s_j))\right) \tag{10}$$
At the end of each episode, the tradeoff coefficient $\lambda$ is multiplied by a discount rate, which produces a continuous shift of weight from the diversity score to the proximity score. The effect of $F_{div}(A)$ goes to zero as $\lambda \to \infty$.
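
The greedy selection can be sketched as below. This is an illustrative version, not the authors' implementation: each experience is given by a state embedding, an achieved goal value and a desired goal value; similarities are dot products assumed to be non-negative, so that the coverage of an empty selection is taken as zero; and the names `select_experiences`, `lam` and `c` are hypothetical.

```python
from typing import List, Sequence


def dot(x: Sequence[float], y: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(x, y))


def select_experiences(embeddings: Sequence[Sequence[float]], achieved: Sequence[float],
                       desired: Sequence[float], k: int, lam: float, c: float) -> List[int]:
    """Greedily pick k indices maximising lam * proximity + facility-location diversity."""
    n = len(embeddings)
    sim = [[dot(embeddings[i], embeddings[j]) for j in range(n)] for i in range(n)]
    prox = [c - abs(achieved[i] - desired[i]) for i in range(n)]
    best_cover = [0.0] * n      # best similarity of each candidate to the selected set
    chosen: List[int] = []
    for _ in range(min(k, n)):
        def gain(i: int) -> float:
            # Marginal gain: weighted proximity plus the improvement in coverage.
            return lam * prox[i] + sum(max(0.0, sim[i][j] - best_cover[j]) for j in range(n))
        pick = max((i for i in range(n) if i not in chosen), key=gain)
        chosen.append(pick)
        best_cover = [max(best_cover[j], sim[pick][j]) for j in range(n)]
    return chosen
```

With a small `lam` the selection favors experiences that cover dissimilar states, and as `lam` grows across episodes the same routine increasingly favors experiences whose achieved goal is close to the desired one.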
5 Experiments
The goal of the experiment section is to evaluate the performance of our approach, CLAIM, in comparison to the following state-of-the-art approaches:

Random: At each step, it randomly queries a node from the available unqueried nodes.

CHANGE Algorithm by Wilder18

GeometricDQN (Baseline) Algorithm by kamarthi2019influence

| Category | Train networks | Test networks |
|---|---|---|
| Retweet | copen, occupy | israel, damascus, obama, assad |
| Animal | plj, rob | bhp, kcs |
| FSW | zone 1 | zone 2, zone 3 |

Table 2: Train and test networks for each category

| Network category | Retweet | Retweet | Retweet | Retweet | Animals | Animals | FSW | FSW |
|---|---|---|---|---|---|---|---|---|
| Test network | israel | damascus | obama | assad | bhp | kcs | zone 2 | zone 3 |
| OPT value | 113.9 | 195.8 | 154.7 | 134.2 | 111.9 | 113.4 | 20.98 | 16.40 |
| Random | 31.17 | 84.71 | 40.81 | 69.44 | 36.80 | 54.39 | 13.26 | 12.31 |
| CHANGE | 32.42 | 92.41 | 48.61 | 69.77 | 35.87 | 54.52 | 12.60 | 10.51 |
| Geometric DQN | 37.33 | 105.2 | 52.01 | 75.12 | 40.12 | 60.81 | 13.65 | 12.35 |
| CLAIM | 38.55 | 113.1 | 54.67 | 77.49 | 42.25 | 64.58 | 13.94 | 12.48 |
| Improvement (%) | 3.27% | 7.51% | 5.11% | 3.15% | 5.31% | 6.20% | 2.12% | 1.05% |

Table 3: Comparison of the influence score of our proposed approach and existing approaches for each test network. For each network, a paired t-test is performed to assess the statistical significance of the better performance.

Dataset: The first network is the Retweet Network from Twitter [Rossi14]. The second is a set of Animal Interaction networks, which are contact networks of field voles (Microtus agrestis) inferred from mark–recapture data collected over 7 years and from four sites [davis2015spatial]. The third is a real-world physical network between Female Sex Workers (FSW) in a large Asian city divided into multiple zones. This is a confidential dataset that was recently collected in person by a non-profit by surveying different female sex workers. The goal in FSW networks is to discover the network and select a subset of FSWs from the discovered network to be enrolled in HIV prevention programs. The enrolled FSWs should be such that they can pass on the information (influence) to the maximum number of FSWs in the complete network. For each family of networks, we divide the networks into train and test data as shown in Table 2.
Experimental Settings: Our experimental settings are similar to the settings used in kamarthi2019influence. There are 5 nodes in the set $S$. All nodes in $S$ and their neighbors are known. We have a further budget of $T$ queries to discover the network. After getting the final subgraph $G_T$, we pick 10 nodes to activate using the greedy influence maximization algorithm. We use the same diffusion probability $p$ for all the edges.
5.1 Results
To demonstrate sample efficiency, we measure the performance of our approach against past approaches by the average number of nodes influenced over 100 runs under a fixed number of queries. Here are the key observations:

Average influence value: Table 3 compares the number of nodes influenced by the different algorithms. Each algorithm selects the set of nodes to activate from the discovered graph. As shown in the table, our approach consistently outperforms all existing approaches across the different networks. CLAIM learns a better policy in the same number of episodes and is hence more sample efficient. We would like to highlight that even a small consistent improvement in these settings is very important, as it can help protect more lives (for example, by educating people about HIV prevention).

Effect of density of the initial subgraph: The number of nodes which can be influenced in the graph is highly dependent on the position of the initial subgraph in the whole social network. Therefore, we also test the performance of CLAIM against the baseline approach on dense and sparse initial subgraphs (we identify the initial subgraph as dense or sparse based on the ratio of ). We compare the average influence values as shown in Figure 5. CLAIM outperforms the baseline in most of the cases, except the sparse case in the damascus network. The reason may be that the damascus network is extremely sparse and has some specific structural property that leads to this result.

Ablation Study: We also present the detailed results of our ablation study over all datasets in Table 4. We observe the effect of adding each component of CLAIM one by one. First, we add only the goal as a feature to the baseline model. Next, we add Hindsight Experience Replay, and finally we add the curriculum-guided selection of experiences for replay. These results indicate that a single component cannot guarantee a better result for all networks, and we need all three components to improve the performance across multiple datasets.

Stability check: We check the stability of CLAIM by comparing the performance of models trained using different random seeds. We train three models for both the baseline and CLAIM. Table 5 shows the mean and deviation of the influence value for the different networks. CLAIM not only achieves a high mean but also provides a low deviation, reflecting the stability of the approach.

| Network category | Retweet | Retweet | Retweet | Retweet | Animals | Animals | FSW | FSW |
|---|---|---|---|---|---|---|---|---|
| Test network | israel | damascus | obama | assad | bhp | kcs | zone 2 | zone 3 |
| Baseline (Geometric DQN) | 37.33 | 105.2 | 52.01 | 75.12 | 40.12 | 60.81 | 13.65 | 12.35 |
| Goal-directed Geometric DQN | 36.24 | 110.5 | 51.61 | 73.68 | 41.59 | 62.80 | 13.79 | 12.32 |
| Hindsight Experience Replay | 37.79 | 109.4 | 53.51 | 76.32 | 42.00 | 64.64 | 13.81 | 12.48 |
| Our proposed approach (CLAIM) | 38.55 | 113.1 | 54.67 | 77.49 | 42.25 | 64.58 | 13.94 | 12.48 |

Table 4: Ablation study for each test network

| Networks \ Method | Geometric DQN | CLAIM |
|---|---|---|
| israel | | |
| damascus | | |
| obama | | |
| assad | | |
| bhp | | |
| kcs | | |
| zone 2 | | |
| zone 3 | | |

Table 5: Stability of our approach compared to the baseline on different sets of 100 runs

Property insight: We also explore the properties of the selected nodes to further investigate why CLAIM performs better. We look at the degree centrality, closeness centrality, and betweenness centrality of the nodes queried in the underlying graph. In particular, we conduct an experiment using assad, a retweet network consisting of sparsely interconnected star graphs. As we can see in Figure 6, compared to the baseline approach, on average, CLAIM recognizes nodes with higher degree, closeness and betweenness centrality. As a result, CLAIM is able to discover a bigger network. The higher degree, closeness and betweenness centrality also show that CLAIM explores nodes that play an important role in the influence maximization problem. Besides, these values are large at the beginning, which means that CLAIM tends to explore a bigger graph first, then leverages the available information from the learned graph to find complex higher-order patterns that enable it to find key nodes during the intermediate timesteps, and finally utilises all the information to expand the discovered graph at the end.
6 Discussion
We provide a justification for the choices made in the paper.

Network structure assumption for goal generation: As we have no prior information about the networks except the initial nodes, we need to make some assumptions to compute the goal value. We assume that the network is uniformly distributed and use a tree structure to approximate the information propagation, as most networks observed for these problems have a similar structure or can be converted into this form with minimal loss of information.

Goodness of the heuristic used for goal generation: Experimentally, we observe that the goal value computed by our heuristic is close to the actual value. For example, for the training network copen, the influence value achieved by the model after training is within 20% of the goal value computed using the heuristic. In addition, most of the achieved influence values are much closer to, and smaller than, the computed goal. In the future, we will investigate different ways to generate a goal with a proven upper bound.
7 Conclusion
In this work, we proposed a sample-efficient reinforcement learning approach for the network discovery and influence maximization problem. Through detailed experiments, we show that our approach outperforms existing approaches on real-world datasets. In the future, we would like to extend this work to consider multiple queries at each timestep.