1 Introduction
Graphs are a pervasive representation that arises naturally in a variety of disciplines; however, their non-Euclidean structure has traditionally proven challenging for machine learning and decision-making approaches. The emergence of the Graph Neural Network learning paradigm Scarselli et al. (2009) and geometric deep learning more broadly Bronstein et al. (2017) have brought about encouraging breakthroughs in diverse application areas for graph-structured data: relevant examples include combinatorial optimization Vinyals et al. (2015); Bello et al. (2016); Khalil et al. (2017), recommendation systems Monti et al. (2017); Ying et al. (2018), and computational chemistry Gilmer et al. (2017); Jin et al. (2018); You et al. (2018); Bradshaw et al. (2019). A recent line of work focuses on goal-directed graph construction, in which the aim is to build or modify the topology of a graph (i.e., add a set of edges) so as to maximize the value of a global objective function. Unlike classic graph algorithms, which assume that the graph topology is static, in this setting the graph structure itself is dynamically
changed. As this task involves an element of exploration (optimal solutions are not known a priori), its formulation as a decision-making process is a suitable paradigm. Model-free reinforcement learning (RL) techniques have been applied in the context of the derivation of adversarial examples for graph-based classifiers Dai et al. (2018) and the generation of molecular graphs You et al. (2018). Darvariu et al. (2020) have formulated the optimization of a global structural graph property as an MDP and approached it using a variant of the RL-S2V Dai et al. (2018) algorithm, showing that generalizable strategies for improving a global network objective can be learned, and can obtain performance superior to prior approaches Beygelzimer et al. (2005); Schneider et al. (2011); Wang and Van Mieghem (2008); Wang et al. (2014) in some cases. However, when applying such approaches to improve the properties of real-world networked systems (such as infrastructure networks), two challenges become apparent:
Inability to account for spatial properties of the graphs: optimizing the topology of the graph alone is only part of the problem in many cases. A variety of real-world networks share the property that nodes are embedded in space, and this geometry has a strong relationship with the types of topologies that can be created Gastner and Newman (2006b); Barthélemy (2011). Since there is a cost associated with edge length, connections tend to be local, and long-range connections must be justified by some gain (e.g., providing connectivity to a hub). Furthermore, objective functions defined over nodes’ positions (such as efficiency) are key for understanding their organization Latora and Marchiori (2001).

Scalability: existing methods based on RL are challenging to scale, due to the sample complexity of current training algorithms, the linear increase of possible actions in the number of nodes, and the complexity of evaluating the global objectives (typically polynomial in the number of nodes). Additionally, training data (i.e., instances of real-world graphs) are scarce, and we are typically interested in a specific starting graph (e.g., a particular infrastructure network to be improved).
In this paper, we set out to address these shortcomings. For the first time in this emerging line of work, we consider the construction of spatial graphs as a decision-making process that explicitly captures the influence of space on graph-level objectives, realizable links between nodes, and connection budgets. Furthermore, to address the scalability issue, we propose to use planning methods in order to plan an optimal set of edges to add to the graph, which sidesteps the problem of sample complexity since we do not need to learn a policy. We adopt the Monte Carlo Tree Search framework – specifically, the UCT algorithm Kocsis and Szepesvári (2006) – and show it can be applied successfully in planning graph construction strategies. We illustrate our approach at a high level in Figure 1. Finally, we propose several improvements over the basic UCT method in the context of spatial networks. These relate to important characteristics of this family of problems: namely, their single-agent, deterministic nature; the inherent trade-off between the cost of edges and their contribution to the global objective; and an action space that is linear in the number of nodes in the network. Our proposed approach, Spatial Graph UCT (SG-UCT), is designed with these characteristics in mind.
As objective functions, in this study, we consider the global network properties of efficiency and robustness to targeted attacks. While these represent a variety of practical scenarios, our approach is broadly applicable to any other structural property. We perform an evaluation on synthetic graphs (generated by a spatial growth model) and several real-world graphs (internet backbone networks and metro transportation networks), comparing SG-UCT to UCT as well as a variety of baselines that have been proposed in the past. Our results show that SG-UCT performs best out of all methods in all the settings we tested; moreover, the performance gain over UCT is substantial (24% on average and up to 54% over UCT on the largest networks tested in terms of a robustness metric). In addition, we conduct an ablation study that explores the impact of the various algorithmic components.
2 Preliminaries
MDPs and Planning. Markov Decision Processes (MDPs) are widely adopted for the effective formalization of decision-making tasks. The decision maker, usually called an agent, interacts with an environment. When in a state s_t, the agent must take an action a_t out of the set A(s_t) of valid actions, receiving a reward r_t governed by the reward function R(s_t, a_t). Finally, the agent finds itself in a new state s_{t+1}, depending on a transition model P that governs the joint probability distribution P(s_{t+1} | a_t, s_t) of transitioning to state s_{t+1} after taking action a_t in state s_t. This sequence of interactions gives rise to a trajectory s_0, a_0, r_0, s_1, a_1, r_1, …, which continues until a terminal state is reached. In deterministic MDPs, for every state–action pair there exists a unique state s' s.t. P(s' | a_t, s_t) = 1. The tuple (S, A, P, R, γ) defines this MDP, where γ ∈ [0, 1] is a discount factor. We also define a policy π(a | s), a distribution of actions over states. There exists a spectrum of algorithms for constructing a policy, ranging from model-based algorithms (which assume knowledge of the MDP) to model-free algorithms (which require only samples of agent–environment interactions). In the cases in which the full MDP specification or a model are available, planning can be used to construct a policy, for example using forward search (Russell and Norvig, 2010, Chapter 10).
Monte Carlo Tree Search. Monte Carlo Tree Search (MCTS) is a model-based planning technique that addresses the inability to explore all paths in large MDPs by constructing a policy from the current state. Values of states are estimated through the returns obtained by executing simulations from the starting state. Upper Confidence Bounds for Trees (UCT), a variant of MCTS, consists of a tree search in which the decision at each node is framed as an independent multi-armed bandit problem. At decision time, the tree policy of the algorithm selects the child node corresponding to the action a that maximizes Q(s, a)/N(s, a) + C √(ln N(s) / N(s, a)), where Q(s, a) is the sum of returns obtained when taking action a in state s, N(s) is the number of parent node visits, N(s, a) the number of child node visits, and C is a constant that controls the level of exploration Kocsis and Szepesvári (2006). In the standard version of the algorithm, the returns are estimated using a random default policy when expanding a node. MCTS has been applied to great success in games such as Morpion Solitaire Rosin (2011), Hex Nash (1952); Anthony et al. (2017), and Go, which was previously thought computationally intractable Silver et al. (2016, 2018).
Spatial Networks. We define a spatial network as the tuple G = (V, E, ν, w). V is the set of vertices, and E ⊆ V × V is the set of edges. ν : V → P is a function that maps nodes in the graph to a set of positions P. We require that P admits a metric, i.e., there exists a function d defining a pairwise distance between elements in P. The tuple (P, d) defines a space, common examples of which include Euclidean space and spherical geometry. w : E → R⁺ associates a weight with each edge: a positive real-valued number that denotes its capacity.
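The UCT child-selection rule described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the bookkeeping structure (a mapping from actions to running return sums and visit counts) and all names are our own.

```python
import math

def uct_select(children, c_exploration=math.sqrt(2)):
    """Pick the action maximizing Q(s,a)/N(s,a) + C * sqrt(ln N(s) / N(s,a)).

    `children` maps each action to a (total_return, visit_count) pair.
    The parent visit count N(s) is taken as the sum of child visits.
    Unvisited children are expanded first. (Illustrative sketch.)
    """
    n_parent = sum(n for _, n in children.values())
    best_action, best_score = None, float("-inf")
    for action, (total_return, n_child) in children.items():
        if n_child == 0:
            return action  # always try unvisited children before exploiting
        score = total_return / n_child + c_exploration * math.sqrt(
            math.log(n_parent) / n_child
        )
        if score > best_score:
            best_action, best_score = action, score
    return best_action
```

Note the trade-off encoded in the two terms: the first favors actions with high average return, while the second favors rarely visited actions.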
Global Objectives in Spatial Networks. We consider two global objectives for spatial networks that are representative of a wide class of properties relevant in real-world situations. Depending on the domain, there are many other global objectives for spatial networks that can be considered, to which the approach that we present is directly applicable.
Efficiency. Efficiency is a metric quantifying how well a network exchanges information. It is a measure of how fast information can travel between any pair of nodes in the network on average, and is hypothesized to be an underlying principle for the organization of networks Latora and Marchiori (2001). Efficiency does not solely depend on topology, but also on the spatial distances between the nodes in the network. We adopt the definition of global efficiency as formalized by Latora and Marchiori, and let E(G) = 1/(N(N−1)) Σ_{i≠j} 1/d^sp(i, j), where d^sp(i, j) denotes the cumulative length of the shortest path between vertices i and j. To normalize this quantity, we divide it by the ideal efficiency E^ideal(G), obtained when all pairs of nodes are directly connected, and possible values are thus in [0, 1]. It is worth noting that efficiency is a more suitable metric for measuring the exchange of information than the inverse average path length between pairs of nodes: in the extreme case where the network is disconnected (and thus some path lengths are infinite), this metric does not go to infinity. More generally, this metric is better suited for systems in which information is exchanged in a parallel, rather than sequential, way Latora and Marchiori (2001). Efficiency can be computed in O(N³) by aggregating path lengths obtained using the Floyd–Warshall algorithm (in practice, this may be made faster by considering dynamic shortest path algorithms, e.g. Demetrescu and Italiano (2004)).
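A direct, unoptimized computation of normalized efficiency via Floyd–Warshall might look as follows. This is a sketch under simplifying assumptions (2D positions, uniform edge weights, edge length equal to Euclidean distance); input names are illustrative.

```python
import itertools
import math

def global_efficiency(positions, edges):
    """Normalized global efficiency of a spatial graph: average inverse
    shortest-path length, divided by the same quantity for the complete
    graph on the same node positions (sketch)."""
    n = len(positions)
    dist = lambda i, j: math.dist(positions[i], positions[j])
    INF = float("inf")
    # Floyd-Warshall over geometric edge lengths, O(n^3).
    d = [[0.0 if i == j else INF for j in range(n)] for i in range(n)]
    for u, v in edges:
        d[u][v] = d[v][u] = dist(u, v)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    # Unreachable pairs contribute zero, so disconnection is handled gracefully.
    eff = sum(1.0 / d[i][j]
              for i, j in itertools.permutations(range(n), 2)
              if d[i][j] < INF)
    ideal = sum(1.0 / dist(i, j)
                for i, j in itertools.permutations(range(n), 2))
    return eff / ideal
```

Note how unreachable pairs simply contribute zero to the sum, mirroring the point above about disconnected networks.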
Robustness. We consider the property of robustness, i.e., the resilience of the network in the face of removals of nodes. We adopt a robustness measure widely used in the literature Albert et al. (2000); Callaway et al. (2000) and of practical interest and applicability, based on the largest connected component (LCC), i.e., the component with the most nodes. In particular, we use the definition in Schneider et al. (2011), which considers the size of the LCC as nodes are removed from the network. We consider only the targeted attack case, as previous work has found it is more challenging Albert et al. (2000); Darvariu et al. (2020). We define the robustness measure as R(G) = 1/N Σ_{k=1}^{N} s(G, ξ, k), where s(G, ξ, k) denotes the fraction of nodes in the LCC of G after the removal of the first k nodes in the permutation ξ (in which nodes appear in descending order of their degrees). Possible values are in (0, 0.5]. This quantity can be estimated using Monte Carlo simulations and scales polynomially in the number of nodes.
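The robustness measure can be sketched as below. We assume, for simplicity of illustration, a single static removal order by initial degree (ties broken by node index); the measure above is defined over a degree-descending permutation and can be averaged over several such permutations via Monte Carlo simulation. Names are our own.

```python
def lcc_fraction(n, edges, removed):
    """Fraction of all n nodes inside the largest connected component
    once the nodes in `removed` are deleted (union-find sketch)."""
    alive = [v for v in range(n) if v not in removed]
    parent = {v: v for v in alive}
    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v
    for u, v in edges:
        if u in parent and v in parent:
            parent[find(u)] = find(v)
    sizes = {}
    for v in alive:
        r = find(v)
        sizes[r] = sizes.get(r, 0) + 1
    return max(sizes.values(), default=0) / n

def robustness(n, edges):
    """Average LCC fraction as nodes are removed one by one in
    descending order of their initial degree (targeted attack sketch)."""
    degree = {v: 0 for v in range(n)}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    order = sorted(range(n), key=lambda v: -degree[v])
    removed, total = set(), 0.0
    for v in order:
        removed.add(v)
        total += lcc_fraction(n, edges, removed)
    return total / n
```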
It is worth noting that the value of the objective functions typically increases the more edges exist in the network (the complete graph has both the highest possible efficiency and robustness). However, constructing a complete graph is wasteful in terms of resources, and so it may be necessary to balance the contribution of an edge to the objective with its cost. The method that we propose explicitly accounts for this tradeoff, which is widely observed in infrastructure and brain networks Gastner and Newman (2006a); Bullmore and Sporns (2012).
3 Proposed Method
In this section, we first formulate the construction of spatial networks in terms of a global objective function as an MDP. Subsequently, we propose a variant of the UCT algorithm (SG-UCT) for planning in this MDP, which exploits the characteristics of spatial networks.
3.1 Spatial Graph Construction as an MDP
Spatial Constraints in Network Construction. Spatial networks that can be observed in the real world typically incur a cost to edge creation. Take the example of a power grid: the cost of a link can be expressed as a function of its geographic distance as well as its capacity. It is vital to consider both aspects of link cost in the process of network construction. We let c((u, v)) denote the cost of edge (u, v) and c(E') = Σ_{e ∈ E'} c(e) be the cost of a set of edges E'. We consider c((u, v)) = w((u, v)) · d(ν(u), ν(v)) to capture the notion that longer, higher-capacity connections are more expensive – although different notions of cost may be desirable depending on the domain. To ensure fair comparisons between various networks, we normalize costs to be in [0, 1].
Problem Statement. Let G^N be the set of labeled, undirected, weighted, spatial networks with N nodes. We let F : G^N → R be an objective function, and b ∈ R⁺ be a modification budget. Given an initial graph G_0 = (V, E_0, ν, w), the aim is to add a set of edges E' to G_0 such that the resulting graph G' = (V, E_0 ∪ E', ν, w) satisfies:

    E' = argmax_{E'' : c(E'') ≤ b} F((V, E_0 ∪ E'', ν, w))    (1)
MDP Formulation. We next define the key elements of the MDP:
State: The state s_t is a 3-tuple (G_t, σ_t, b_t) containing the spatial graph G_t, an edge stub σ_t, and the remaining budget b_t. σ_t can be either the empty set ∅ or the singleton {v}, where v ∈ V. If the edge stub is non-empty, it means that the agent has “committed” in the previous step to creating an edge originating at the edge stub node.
Action: For scalability to large graphs, we let actions correspond to the selection of a single node in V (thus having at most |V| choices). We enforce spatial constraints in the following way: given a node v, we define the set C(v) ⊆ V of connectable nodes that represent realizable connections. We let:

    C(v) = { u ∈ V \ {v} | (v, u) ∉ E ∧ d(ν(v), ν(u)) ≤ θ · max_{(v, z) ∈ E_0} d(ν(v), ν(z)) },

which formalizes the idea that a node can only connect as far as a proportion θ of its longest existing connection, with the threshold fixed based on the initial graph G_0. This has the benefit of allowing long-range connections if they already exist in the network. Given an unspent connection budget b_t, we let the set C_{b_t}(v) ⊆ C(v) consist of those connectable nodes u whose cost c((v, u)) is not more than the unspent budget. Depending on the type of network being considered, in practice there may be different types of constraints on the connections that can be realized. For example, in transportation networks, there can be obstacles that make link creation impossible, such as prohibitive landforms or populated areas. In circuits and utility lines, planarity is a desirable characteristic, as it makes circuit design cheaper. Such constraints can be captured by the definition of C(v) and enforced by the environment when providing the agent with the available actions A(s_t). Conversely, letting C(v) contain all nodes not already connected to v recovers the simplified case where no constraints are imposed. Letting the degree of node v be deg(v), the available actions are defined as:

    A((G_t, ∅, b_t)) = { v ∈ V | deg(v) < N − 1 ∧ C_{b_t}(v) ≠ ∅ },
    A((G_t, {v}, b_t)) = C_{b_t}(v).
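The connectable-node constraint can be sketched as follows. This is illustrative only: `theta` plays the role of the proportion of the longest existing connection, and for simplicity the maximum is taken over the node's current neighbors rather than the initial graph.

```python
import math

def connectable(v, positions, edges, theta):
    """Nodes u that v may connect to: not v itself, not already a neighbor,
    and no farther than theta times v's longest existing connection
    (sketch of the spatial constraint; names are illustrative)."""
    dist = lambda a, b: math.dist(positions[a], positions[b])
    neighbors = {u for e in edges for u in e if v in e} - {v}
    if not neighbors:
        return set()
    reach = theta * max(dist(v, u) for u in neighbors)
    return {u for u in range(len(positions))
            if u != v and u not in neighbors and dist(v, u) <= reach}
```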
Transitions: The deterministic transition model adds an edge to the graph topology every two steps. Concretely, we define it as s_{t+1} = T(s_t, a_t), where

    T((G_t, ∅, b_t), v) = (G_t, {v}, b_t),
    T(((V, E_t, ν, w), {v}, b_t), u) = ((V, E_t ∪ {(v, u)}, ν, w), ∅, b_t − c((v, u))).
Reward: The final reward is defined as F(G_T), the value of the objective function at the final state, and all intermediary rewards are 0.
Episodes in this MDP proceed for an arbitrary number of steps until the budget is exhausted and no valid actions remain (concretely, until A(s_t) = ∅). Since we are in the finite horizon case, we let γ = 1. Given the MDP definition above, the problem specified in Equation 1 can be reinterpreted as finding the trajectory that starts at s_0 = (G_0, ∅, b) such that the final reward is maximal – actions along this trajectory will define the set of edges E'.
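The two-step transition structure above (first commit to an edge stub, then complete the edge and pay its cost) can be sketched as a single step function. This is a minimal illustration with assumed names; the stub is represented as `None` when empty.

```python
def step(state, action, cost):
    """One deterministic transition of the construction MDP (sketch).
    state = (edges, stub, budget). With an empty stub the agent picks
    the origin node; otherwise the chosen node completes the edge and
    the budget is reduced by its cost. `cost(u, v)` is assumed given."""
    edges, stub, budget = state
    if stub is None:
        # First of the two steps: commit to the edge stub node.
        return (edges, action, budget)
    # Second step: realize the edge (stub, action) and pay for it.
    u, v = stub, action
    return (edges | {(u, v)}, None, budget - cost(u, v))
```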
3.2 Algorithm
The formulation above can, in principle, be used with any planning algorithm for MDPs in order to identify an optimal set of edges to add to the network. The UCT algorithm, discussed in Section 2, is one such algorithm that has proven very effective in a variety of settings. We refer the reader to Browne et al. (2012) for an in-depth description of the algorithm and its various applications. However, the generic UCT algorithm assumes very little about the particulars of the problem under consideration, which, in the context of spatial network construction, may lead to suboptimal solutions. In this section, we identify and address concerns specific to this family of problems, and formulate the Spatial Graph UCT (SG-UCT) variant of UCT in Algorithm 1. The evaluation presented in Section 4 compares SG-UCT to UCT and other baselines, and contains an ablation study of SG-UCT’s components.
Best Trajectory Memorization (BTM). The standard UCT algorithm is applicable in a variety of settings, including multi-agent, stochastic environments. For example, in two-player games, an agent needs to replan from the new state that is arrived at after the opponent executes its move. However, the single-agent (puzzle), deterministic nature of the problem considered means that there is no need to replan trajectories after a stochastic event: the agent can plan all its actions from the very beginning in a single step. We thus propose the following modification over UCT: memorizing the trajectory with the highest reward found during the rollouts, and returning it at the end of the search. We name this Best Trajectory Memorization, shortened BTM. This is similar in spirit (albeit much simpler) to ideas used in Reflexive and Nested MCTS for deterministic puzzles, where the best move found at lower levels of a nested search is used to inform the upper level Cazenave (2007, 2009).
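BTM can be sketched as a thin wrapper around the rollout loop. The `simulate` callable below is an assumption of this sketch: it stands for running one rollout of the search and returning the action sequence together with its final reward.

```python
def search_with_btm(simulate, n_rollouts):
    """Best Trajectory Memorization (sketch): in a deterministic,
    single-agent MDP, keep the full action sequence of the best rollout
    seen during the search and return it directly at the end."""
    best_trajectory, best_reward = None, float("-inf")
    for _ in range(n_rollouts):
        trajectory, reward = simulate()
        if reward > best_reward:
            best_trajectory, best_reward = trajectory, reward
    return best_trajectory, best_reward
```

The key point is that the returned plan is an actually observed trajectory, so its reward is guaranteed, which would not hold in a stochastic environment.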
Cost-Sensitive Default Policy. The standard default policy used to perform out-of-tree actions in the UCT framework is based on random rollouts. While evaluating nodes using this approach is free from bias, rollouts can lead to high-variance estimates, which can hurt the performance of the search. Previous work has considered hand-crafted heuristics and learned policies as alternatives, although, perhaps counterintuitively, learned policies may lead to worse results Gelly and Silver (2007). As initially discussed in Section 2, the value of the objective functions we consider grows with the number of edges of the graph. We thus propose the following default policy for spatial networks: sampling each edge with probability inversely proportional to its cost. Formally, we let the probability of edge e being selected during rollouts be proportional to c(e)^{−α}, where α denotes the level of bias. α = 0 reduces to random choices, while α → ∞ selects the minimum-cost edge. This is very inexpensive from a computational point of view, as the edge costs only need to be computed once, at the start of the search.
Action Space Reduction. In certain domains, the number of actions available to an agent is large, which can greatly affect scalability. Previous work in RL has considered decomposing actions into independent sub-actions He et al. (2016), generalizing across similar actions by embedding them in a continuous space Dulac-Arnold et al. (2015), or learning which actions to eliminate via a supervision signal provided by the environment Zahavy et al. (2018). Previous work on planning considers progressively widening the search based on a heuristic Chaslot et al. (2008) or learning a partial policy for eliminating actions in the search tree Pinto and Fern (2017).
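Returning to the cost-sensitive default policy, one natural instance can be sketched as sampling edges with probability proportional to a negative power of their cost. This is an assumption of the sketch (one plausible parameterization of "inversely proportional with a bias level"), not necessarily the paper's exact form.

```python
import random

def cost_sensitive_choice(candidate_edges, cost, alpha):
    """Sample one edge with probability proportional to cost^(-alpha)
    (sketch of a cost-sensitive rollout policy). alpha = 0 recovers
    uniform random rollouts; large alpha approaches the min-cost edge."""
    weights = [cost(e) ** (-alpha) for e in candidate_edges]
    return random.choices(candidate_edges, weights=weights, k=1)[0]
```

Since the weights depend only on edge costs, they can be precomputed once at the start of the search, matching the inexpensiveness claim above.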
Concretely, with the MDP definition used, the action space grows linearly in the number of nodes. This is partly addressed by the imposed connectivity constraints: once an edge stub is selected (equivalently, at odd values of t), the branching factor of the search is small, since only connectable nodes need to be considered. However, the number of actions when selecting the origin node of the edge (even values of t) remains large, which might become detrimental to performance as the size of the network grows (we illustrate this in Figure 2). Can this be mitigated? We consider limiting the nodes that can initiate connections to a specific subset – which effectively prunes away all branches in the search tree that are not part of this set. Concretely, let a reduction policy ψ be a function that, given the initial graph G_0, outputs a strict subset of its nodes. (Learning a reduction policy in a data-driven way is also possible; however, the supervision signal needed (i.e., results of node rankings over multiple MCTS runs) is very expensive to obtain. Furthermore, since we prioritize performance on specific graph instances over generalizable policies, simple statistics may be sufficient. Still, a learned reduction policy that predicts an entire set at once may be able to identify better subsets than individual statistics alone. We consider this a worthwhile direction for future work.) Then, we modify our definition of allowed actions as follows: under a reduction policy ψ, only nodes in ψ(G_0) may be selected as edge stubs, i.e., A_ψ((G_t, ∅, b_t)) = { v ∈ ψ(G_0) | deg(v) < N − 1 ∧ C_{b_t}(v) ≠ ∅ }.
We investigate the following class of reduction policies: a node v is included in ψ(G_0) if and only if it is among the top q% of nodes ranked by a local node statistic φ(v). More specifically, we consider the statistics φ listed below, where Δ_{v,u} denotes the gain in the objective F obtained by adding the edge (v, u) to the graph. Since the performance of reduction strategies may depend heavily on the objective function, we treat the choice of φ as a hyperparameter to be optimized.

- Degree (DEG): φ(v) = deg(v); Inverse Degree (ID): φ(v) = 1/deg(v); Number of Connections (NC): φ(v) = |C(v)|.
- Best Edge (BE): φ(v) = max_{u ∈ C(v)} Δ_{v,u}; BE Cost-Sensitive (BECS): φ(v) = max_{u ∈ C(v)} Δ_{v,u} / c((v, u)).
- Average Edge (AE): φ(v) = mean_{u ∈ C(v)} Δ_{v,u}; AE Cost-Sensitive (AECS): φ(v) = mean_{u ∈ C(v)} Δ_{v,u} / c((v, u)).
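A reduction policy of this class can be sketched as ranking nodes by a statistic and keeping the top fraction. The helper below and its names are illustrative; any of the statistics above can be plugged in as `statistic`.

```python
def reduce_nodes(nodes, statistic, q):
    """Reduction policy sketch: keep the top nodes ranked by a local
    statistic, where q is the fraction of nodes retained (e.g. q = 0.4)."""
    k = max(1, int(q * len(nodes)))
    return set(sorted(nodes, key=statistic, reverse=True)[:k])

def make_degree(edges):
    """Build a degree statistic from an edge list (illustrative)."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return lambda v: deg.get(v, 0)
```

Usage: `reduce_nodes(nodes, make_degree(edges), 0.4)` retains the top 40% of nodes by degree; passing the negated statistic yields the inverse-degree variant.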
4 Experiments
4.1 Experimental Protocol
Table 1: Real-world graphs used in the evaluation.

Dataset   Graph       |V|   |E|
Internet  Colt        146   178
Internet  GtsCe       130   169
Internet  TataNld     141   187
Internet  UsCarrier   138   161
Metro     Barcelona   135   159
Metro     Beijing     126   139
Metro     Mexico      147   164
Metro     Moscow      134   156
Metro     Osaka       107   122
Definitions of Space and Distance. For all experiments in this paper, we consider the unit 2D square as our space, i.e., we let P = [0, 1]² and the distance d be the Euclidean distance. In case the graph is defined on a spherical coordinate system (as is the case with physical networks positioned on Earth), we use the WGS84 variant of the Mercator projection to project nodes to the plane, then normalize to the unit plane. For simplicity, we consider uniform weights, i.e., w(e) = 1 for all e ∈ E.
Synthetic and Real-World Graphs. As a means of generating synthetic graph data, we use the popular model proposed by Kaiser and Hilgetag (2004), which simulates a process of growth for spatial networks. Related to the Waxman model Waxman (1988), in this model the probability that a connection is created decays with its distance from existing nodes. The distinguishing feature of this model is that, unlike e.g. the Random Geometric Graph Dall and Christensen (2002), it produces connected networks: a crucial characteristic for the types of objectives we consider. We henceforth refer to this model as Kaiser–Hilgetag (shortened KH). We choose its parameters such that it yields sparse graphs with scale-free degree distributions – a structure similar to road infrastructure networks. We also evaluate performance on networks belonging to the following real-world datasets, detailed in Table 1: Internet (a dataset of internet backbone infrastructure from a variety of ISPs Knight et al. (2011)) and Metro (a dataset of metro networks in major cities around the world Roth et al. (2012)). Due to computational budget constraints, we limit the sizes of the networks considered.
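A spatial growth process in the spirit of Kaiser and Hilgetag (2004) can be sketched as follows. The parameterization (attachment probability `beta * exp(-alpha * distance)`, discarding candidates that attach to nothing) follows the usual convention for this model and is an assumption of this sketch, as are the parameter names.

```python
import math
import random

def kaiser_hilgetag(n, alpha, beta, seed=0):
    """Spatial growth sketch: candidate nodes appear at random positions
    in the unit square and attach to each existing node with probability
    beta * exp(-alpha * distance). Candidates that attach to nothing are
    discarded, so the resulting network is connected by construction."""
    rng = random.Random(seed)
    positions = [(rng.random(), rng.random())]
    edges = []
    while len(positions) < n:
        p = (rng.random(), rng.random())
        new_edges = [(i, len(positions))
                     for i, q in enumerate(positions)
                     if rng.random() < beta * math.exp(-alpha * math.dist(p, q))]
        if new_edges:  # keep only candidates that connect somewhere
            positions.append(p)
            edges.extend(new_edges)
    return positions, edges
```

Larger `alpha` makes attachment more distance-sensitive and hence the graphs sparser and more local, which is the regime of interest here.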
Setup. For all experiments, we allow agents a modification budget equal to a fixed proportion of the total cost of the edges of the original graph. Confidence intervals are computed using the results of multiple runs, each initialized using a different random seed. Rollouts are not truncated. We allow a fixed number of node expansions per move (a larger number of expansions can improve performance, but leads to diminishing returns), and select as the move at each step the node with the maximum average value (commonly referred to as MaxChild). Full details of the hyperparameter selection methodology and the values used are provided in the Appendix.
Baselines. The baselines we compare against are detailed below. These methods represent the major approaches that have been considered in the past for the problem of goal-directed graph modifications: namely, considering a local node statistic Beygelzimer et al. (2005), a shallow greedy criterion Schneider et al. (2011), or the spectral properties of the graph Wang and Van Mieghem (2008); Wang et al. (2014). We do not consider previous RL-based methods, since they are unsuitable for the scale of the largest graphs taken into consideration, for which the time needed to evaluate the objective functions makes training prohibitively expensive.

- Random: randomly selects an available action.
- Greedy: selects the edge (u, v) that gives the biggest improvement in F; formally, the edge that satisfies argmax_{(u,v)} [ F((V, E ∪ {(u, v)}, ν, w)) − F(G) ]. We also consider the cost-sensitive variant Greedy_CS, for which the gain is offset by the cost: argmax_{(u,v)} [ F((V, E ∪ {(u, v)}, ν, w)) − F(G) ] / c((u, v)).
- MinCost: selects the edge (u, v) that satisfies argmin_{(u,v)} c((u, v)).
- LBHB: adds an edge between the node with the Lowest Betweenness and the node with the Highest Betweenness; formally, letting the betweenness centrality of node v be B(v), this strategy adds an edge between the nodes argmin_v B(v) and argmax_v B(v).
- LDP: adds an edge between the vertices with the Lowest Degree Product, i.e., the vertices that satisfy argmin_{(u,v)} deg(u) · deg(v).
- FV: adds an edge between the pair of vertices with the maximum absolute difference in their corresponding components of the Fiedler Vector (the eigenvector of the graph Laplacian associated with the second-smallest eigenvalue) Wang and Van Mieghem (2008).
- ERes: adds an edge between the vertices with the highest pairwise Effective Resistance, i.e., the nodes that satisfy argmax_{(u,v)} Ω(u, v). Ω(u, v) is defined as Γ⁺_{uu} + Γ⁺_{vv} − 2Γ⁺_{uv}, where Γ⁺ is the pseudoinverse of the graph Laplacian Wang et al. (2014).
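As an illustration, the ERes criterion can be sketched via the Laplacian pseudoinverse; all names below are our own, and for simplicity the sketch restricts attention to non-adjacent pairs.

```python
import numpy as np

def effective_resistance_pair(n, edges):
    """Return the non-adjacent node pair with the highest effective
    resistance, computed as G[u,u] + G[v,v] - 2*G[u,v] where G is the
    pseudoinverse of the graph Laplacian (sketch of the ERes baseline)."""
    L = np.zeros((n, n))
    for u, v in edges:
        L[u, u] += 1
        L[v, v] += 1
        L[u, v] -= 1
        L[v, u] -= 1
    G = np.linalg.pinv(L)  # Moore-Penrose pseudoinverse of the Laplacian
    existing = {frozenset(e) for e in edges}
    best, best_val = None, -np.inf
    for u in range(n):
        for v in range(u + 1, n):
            if frozenset((u, v)) in existing:
                continue
            omega = G[u, u] + G[v, v] - 2 * G[u, v]
            if omega > best_val:
                best, best_val = (u, v), omega
    return best
```

On a path graph with unit edges, effective resistance coincides with path length, so the most distant endpoints are selected.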
4.2 Evaluation Results
Synthetic Graph Results. In this experiment, we consider KH graphs of sizes |V| ∈ {25, 50, 75}. The obtained results are shown in the top half of Table 2. We summarize our findings as follows: SG-UCT outperforms UCT and all other methods in all the settings tested, obtaining 13% and 32% better performance than UCT on the largest synthetic graphs for the efficiency and robustness measures respectively. For robustness, UCT outperforms all baselines, while for efficiency the performance of the Greedy baselines is superior to UCT. Interestingly, MinCost yields solutions that are superior to all other heuristics and comparable to search-based methods, while being very cheap to evaluate. Furthermore, UCT performance decays in comparison to the baselines as the size of the graph increases.
Real-world Graph Results. The results obtained for real-world graphs are shown in the bottom half of Table 2 (an extended version, split by individual graph instance, is shown in Table 5 in the Appendix). As with synthetic graphs, we find that SG-UCT performs better than UCT and all other methods in all settings tested. The differences in performance between SG-UCT and UCT are 10% and 39% for efficiency and robustness, respectively. We note that the Greedy approaches do not scale to the larger real-world graphs due to their complexity: O(|V|²) candidate edges need to be considered at each step, in comparison to the O(|V|) actions required by UCT and SG-UCT.
Table 2: Evaluation results on synthetic KH graphs (top) and real-world graphs (bottom).

Objective        Efficiency           Robustness
|V|              25     50     75     25     50     75
Random           0.128  0.089  0.077  0.031  0.033  0.035
Greedy           0.298  0.335  0.339  0.064  0.078  0.074
Greedy_CS        0.281  0.311  0.319  0.083  0.102  0.115
LDP              —      —      —      0.049  0.044  0.040
FV               —      —      —      0.051  0.049  0.049
ERes             —      —      —      0.054  0.057  0.052
MinCost          0.270  0.303  0.315  0.065  0.082  0.099
LBHB             0.119  0.081  0.072  —      —      —
UCT              0.288  0.307  0.311  0.092  0.112  0.120
SG-UCT (ours)    0.305  0.341  0.352  0.107  0.140  0.158
Objective   Dataset   Random  LDP    FV     ERes   MinCost  LBHB   UCT    SG-UCT (ours)
Efficiency  Internet  0.036   —      —      —      0.096    0.039  0.137  0.145
Efficiency  Metro     0.013   —      —      —      0.049    0.007  0.056  0.064
Robustness  Internet  0.014   0.025  0.015  0.021  0.072    —      0.083  0.128
Robustness  Metro     0.009   0.012  0.013  0.020  0.048    —      0.068  0.085
Ablation Study. We also conduct an ablation study in order to assess the impact of the individual components of SG-UCT, using the same KH synthetic graphs. The obtained results are shown in Table 3, where SG-UCT_BTM denotes UCT with best trajectory memorization, SG-UCT_MINCOST denotes UCT with the cost-based default policy, and SG-UCT_φq denotes UCT with a particular reduction policy based on the node statistic φ, with q representing the percentage of original nodes that are selected by the reduction policy. We find that BTM indeed brings a net improvement in performance: on average, 5% for efficiency and 11% for robustness. The benefit of the cost-based default policy is substantial (especially for robustness), ranging from 4% on small graphs to 27% on the largest graphs considered, and grows the higher the level of bias. This is further evidenced in Figure 3, which shows the average reward obtained as a function of α. In terms of reduction policies, even for a random selection of nodes, we find that the performance penalty paid is comparatively small: a 60% reduction in actions translates to at most a 15% reduction in performance, and as little as 5%; the impact of random action reduction becomes smaller as the size of the network grows. The best-performing reduction policies are those based on a node’s gains, with BECS and AECS outperforming UCT with no action reduction. For the robustness objective, a poor choice of bias can be harmful: prioritizing nodes with high degrees leads to a 32% reduction in performance compared to UCT, while a bias towards lower-degree nodes is beneficial.
Table 3: Ablation study results on synthetic KH graphs.

Objective           Efficiency           Robustness
|V|                 25     50     75     25     50     75
UCT                 0.288  0.307  0.311  0.092  0.112  0.120
SG-UCT_BTM          0.304  0.324  0.324  0.106  0.123  0.128
SG-UCT_MINCOST      0.299  0.327  0.333  0.105  0.131  0.153
SG-UCT_RAND80       0.284  0.305  0.303  0.091  0.111  0.119
SG-UCT_RAND60       0.271  0.288  0.288  0.089  0.107  0.115
SG-UCT_RAND40       0.238  0.263  0.271  0.083  0.102  0.110
SG-UCT_DEG40        0.237  0.262  0.255  0.069  0.086  0.092
SG-UCT_INVDEG40     0.235  0.268  0.283  0.094  0.114  0.124
SG-UCT_NC40         0.234  0.268  0.262  0.071  0.087  0.092
SG-UCT_BE40         0.286  0.304  0.297  0.088  0.108  0.115
SG-UCT_BECS40       0.290  0.316  0.319  0.097  0.115  0.121
SG-UCT_AE40         0.286  0.302  0.297  0.088  0.103  0.114
SG-UCT_AECS40       0.289  0.317  0.319  0.098  0.117  0.126
5 Discussion
Limitations. While we view our work as an important step towards improving evidence-based decision making in infrastructure systems, we were not able to consider realistic domain-specific concerns due to our level of abstraction. Another limitation is the fact that, due to our limited computational budget, we opted for hand-crafted reduction policies. In particular, the impact of the number of nodes selected by the reduction policy was not explored. Furthermore, as already discussed in Section 3.2, a learning-based reduction policy able to predict an entire set at once may lead to better performance. We believe that it is possible to scale our approach to substantially larger networks by considering a hierarchical representation (i.e., viewing the network as a graph of graphs) and by further speeding up the calculation of the objective functions (currently the main computational bottleneck).
Societal Impact and Implications. The proposed method is suitable as a framework for the optimization of a variety of infrastructure networks, a problem that has potential for societal benefit. Beyond the communication and metro networks used in our evaluation, we list road networks, water distribution networks, and power grids as relevant application areas. We cannot foresee any immediate negative societal impacts of this work. However, as with any optimization problem, the objective needs to be carefully analyzed and its implications considered before real-world deployment: for example, it may be the case that optimizing for a specific objective might lead to undesirable results, such as the introduction of a “rich get richer” effect Merton (1968), with well-connected nodes gaining ever more edges at the expense of less well-connected ones.
6 Conclusions
In this work, we have addressed the problem of spatial graph construction: namely, given an initial spatial graph, a budget defined in terms of edge lengths, and a global objective, finding a set of edges to be added to the graph such that the value of the objective is maximized. For the first time among related works, we have formulated this task as a deterministic MDP that accounts for how the spatial geometry influences the connections and organizational principles of real-world networks. Building on the UCT framework, we have considered several concerns in the context of this problem space and proposed the Spatial Graph UCT (SG-UCT) algorithm to address them. Our evaluation results show performance substantially better than UCT (24% on average and up to 54% in terms of a robustness measure) and all existing baselines taken into consideration.
Acknowledgments
This work was supported by The Alan Turing Institute under the UK EPSRC grant EP/N510129/1. The authors declare no competing financial interests with respect to this work.
References
 Albert et al. [2000] Réka Albert, Hawoong Jeong, and Albert-László Barabási. Error and attack tolerance of complex networks. Nature, 406(6794):378–382, 2000.
 Anthony et al. [2017] Thomas Anthony, Zheng Tian, and David Barber. Thinking Fast and Slow with Deep Learning and Tree Search. In NeurIPS, 2017.
 Barthélemy [2011] Marc Barthélemy. Spatial networks. Physics Reports, 499(1–3):1–101, 2011.
 Bello et al. [2016] Irwan Bello, Hieu Pham, Quoc V. Le, Mohammad Norouzi, and Samy Bengio. Neural Combinatorial Optimization with Reinforcement Learning. arXiv:1611.09940, 2016.
 Beygelzimer et al. [2005] Alina Beygelzimer, Geoffrey Grinstein, Ralph Linsker, and Irina Rish. Improving Network Robustness by Edge Modification. Physica A, 357:593–612, 2005.
 Bradshaw et al. [2019] John Bradshaw, Brooks Paige, Matt J. Kusner, Marwin H. S. Segler, and José Miguel Hernández-Lobato. A Model to Search for Synthesizable Molecules. In NeurIPS, 2019.
 Bronstein et al. [2017] Michael M. Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric Deep Learning: Going beyond Euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.
 Browne et al. [2012] Cameron B. Browne, Edward Powley, Daniel Whitehouse, Simon M. Lucas, et al. A Survey of Monte Carlo Tree Search Methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1–43, 2012.
 Bullmore and Sporns [2012] Ed Bullmore and Olaf Sporns. The economy of brain network organization. Nature Reviews Neuroscience, 13(5):336–349, 2012.
 Callaway et al. [2000] Duncan S. Callaway, M. E. J. Newman, Steven H. Strogatz, and Duncan J. Watts. Network robustness and fragility: Percolation on random graphs. Phys. Rev. Lett., 85:5468–5471, 2000.
 Cazenave [2007] Tristan Cazenave. Reflexive Monte-Carlo search. In Computer Games Workshop, 2007.
 Cazenave [2009] Tristan Cazenave. Nested Monte-Carlo search. In IJCAI, 2009.
 Chaslot et al. [2008] Guillaume M. J-B. Chaslot, Mark H. M. Winands, H. Jaap Van Den Herik, Jos W. H. M. Uiterwijk, and Bruno Bouzy. Progressive Strategies for Monte-Carlo Tree Search. New Mathematics and Natural Computation, 04(03):343–357, 2008.
 Dai et al. [2018] Hanjun Dai, Hui Li, Tian Tian, Xin Huang, Lin Wang, Jun Zhu, and Le Song. Adversarial attack on graph structured data. In ICML, 2018.
 Dall and Christensen [2002] Jesper Dall and Michael Christensen. Random geometric graphs. Physical Review E, 66(1):016121, 2002.
 Darvariu et al. [2020] Victor-Alexandru Darvariu, Stephen Hailes, and Mirco Musolesi. Improving the Robustness of Graphs through Reinforcement Learning and Graph Neural Networks. arXiv:2001.11279, 2020.
 Demetrescu and Italiano [2004] Camil Demetrescu and Giuseppe F. Italiano. A new approach to dynamic all pairs shortest paths. Journal of the ACM (JACM), 51(6):968–992, 2004.
 Dulac-Arnold et al. [2015] Gabriel Dulac-Arnold, Richard Evans, Hado van Hasselt, Peter Sunehag, Timothy Lillicrap, Jonathan Hunt, Timothy Mann, Theophane Weber, Thomas Degris, and Ben Coppin. Deep Reinforcement Learning in Large Discrete Action Spaces. In ICML, 2015.
 Fiedler [1973] Miroslav Fiedler. Algebraic connectivity of graphs. Czechoslovak Mathematical Journal, 23(2):298–305, 1973.
 Gastner and Newman [2006a] Michael T. Gastner and M. E. J. Newman. Shape and efficiency in spatial distribution networks. Journal of Statistical Mechanics: Theory and Experiment, 2006(01):P01015, 2006a.
 Gastner and Newman [2006b] Michael T. Gastner and M. E. J. Newman. The spatial structure of networks. The European Physical Journal B, 49(2):247–252, 2006b.
 Gelly and Silver [2007] Sylvain Gelly and David Silver. Combining online and offline knowledge in UCT. In ICML, 2007.
 Gilmer et al. [2017] Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. Neural Message Passing for Quantum Chemistry. In ICML, 2017.
 Hagberg et al. [2008] Aric Hagberg, Pieter Swart, and Daniel S. Chult. Exploring network structure, dynamics, and function using networkx. In SciPy, 2008.
 He et al. [2016] Ji He, Mari Ostendorf, Xiaodong He, Jianshu Chen, Jianfeng Gao, Lihong Li, and Li Deng. Deep Reinforcement Learning with a Combinatorial Action Space for Predicting Popular Reddit Threads. In EMNLP, 2016.
 Hunter [2007] J. D. Hunter. Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3):90–95, 2007.

 Jin et al. [2018] Wengong Jin, Regina Barzilay, and Tommi Jaakkola. Junction Tree Variational Autoencoder for Molecular Graph Generation. In ICML, 2018.
 Kaiser and Hilgetag [2004] Marcus Kaiser and Claus C. Hilgetag. Spatial growth of real-world networks. Physical Review E, 69(3):036103, 2004.
 Khalil et al. [2017] Elias Khalil, Hanjun Dai, Yuyu Zhang, Bistra Dilkina, and Le Song. Learning combinatorial optimization algorithms over graphs. In NeurIPS, 2017.
 Knight et al. [2011] Simon Knight, Hung X. Nguyen, Nickolas Falkner, Rhys Bowden, and Matthew Roughan. The Internet Topology Zoo. IEEE Journal on Selected Areas in Communications, 29(9):1765–1775, 2011.
 Kocsis and Szepesvári [2006] Levente Kocsis and Csaba Szepesvári. Bandit Based Monte-Carlo Planning. In ECML, 2006.
 Latora and Marchiori [2001] Vito Latora and Massimo Marchiori. Efficient Behavior of Small-World Networks. Physical Review Letters, 87(19):198701, 2001.
 McKinney et al. [2011] Wes McKinney et al. pandas: a foundational Python library for data analysis and statistics. Python for High Performance and Scientific Computing, 14(9):1–9, 2011.
 Merton [1968] Robert K. Merton. The Matthew effect in science. Science, 159(3810):56–63, 1968.
 Monti et al. [2017] Federico Monti, Michael M. Bronstein, and Xavier Bresson. Geometric Matrix Completion with Recurrent Multi-Graph Neural Networks. In ICML, 2017.
 Nash [1952] John Nash. Some games and machines for playing them. Technical report, Rand Corporation, 1952.
 Paszke et al. [2019] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, et al. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019.
 Pinto and Fern [2017] Jervis Pinto and Alan Fern. Learning partial policies to speedup MDP tree search via reduction to IID learning. The Journal of Machine Learning Research, 18(1):2179–2213, 2017.
 Rosin [2011] Christopher D. Rosin. Nested Rollout Policy Adaptation for Monte Carlo Tree Search. In IJCAI, 2011.
 Roth et al. [2012] Camille Roth, Soong Moon Kang, Michael Batty, and Marc Barthelemy. A long-time limit for world subway networks. Journal of The Royal Society Interface, 9(75):2540–2550, 2012.
 Russell and Norvig [2010] Stuart J. Russell and Peter Norvig. Artificial Intelligence: a Modern Approach. Prentice Hall, third edition, 2010.
 Scarselli et al. [2009] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. The Graph Neural Network Model. IEEE Transactions on Neural Networks, 20(1):61–80, 2009.
 Schneider et al. [2011] Christian M. Schneider, André A. Moreira, Joao S. Andrade, Shlomo Havlin, and Hans J. Herrmann. Mitigation of malicious attacks on networks. PNAS, 108(10):3838–3841, 2011.
 Silver et al. [2016] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
 Silver et al. [2018] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, et al. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419):1140–1144, 2018.
 Vinyals et al. [2015] Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer Networks. In NeurIPS, 2015.
 Wang and Van Mieghem [2008] Huijuan Wang and Piet Van Mieghem. Algebraic connectivity optimization via link addition. In Proceedings of the Third International Conference on Bio-Inspired Models of Network Information and Computing Systems (Bionetics), 2008.
 Wang et al. [2014] Xiangrong Wang, Evangelos Pournaras, Robert E. Kooij, and Piet Van Mieghem. Improving robustness of complex networks via the effective graph resistance. The European Physical Journal B, 87(9):221, 2014.

 Waskom [2021] Michael L. Waskom. Seaborn: statistical data visualization. Journal of Open Source Software, 6(60):3021, 2021.
 Waxman [1988] B. M. Waxman. Routing of multipoint connections. IEEE Journal on Selected Areas in Communications, 6(9):1617–1622, 1988.

 Ying et al. [2018] Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L. Hamilton, and Jure Leskovec. Graph Convolutional Neural Networks for Web-Scale Recommender Systems. In KDD, 2018.
 You et al. [2018] Jiaxuan You, Bowen Liu, Rex Ying, Vijay Pande, and Jure Leskovec. Graph Convolutional Policy Network for Goal-Directed Molecular Graph Generation. In NeurIPS, 2018.
 Zahavy et al. [2018] Tom Zahavy, Matan Haroush, Nadav Merlis, Daniel J. Mankowitz, and Shie Mannor. Learn What Not to Learn: Action Elimination with Deep Reinforcement Learning. In NeurIPS, 2018.
Appendix
Appendix A Additional Evaluation Details
Impact of Action Subsets on UCT Results. We include an additional experiment related to the Action Space Reduction problem discussed in Section 3.2. Starting from the same initial graph, we consider a selection of node subsets, each containing 40% of all nodes, sampled uniformly at random. We show the empirical distribution of the reward obtained by UCT with the different sampled subsets in Figure 4. Since the selected subset has an important impact on performance, a reduction policy yielding high-reward subsets is highly desirable: effectively, we want to bias subset selection towards the upper tail of the distribution of obtained rewards.
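The subset-sampling experiment above can be sketched as follows. The helper names (`sample_node_subset`, `reward_distribution`) and the `run_planner` callable, which stands in for a UCT run restricted to a given node subset, are our own illustrative choices, not the paper's code.

```python
import random

def sample_node_subset(nodes, fraction=0.4, rng=None):
    """Uniformly sample a fixed-size node subset; edge-addition actions are then
    restricted to endpoints within this subset."""
    rng = rng or random.Random()
    k = max(1, int(round(fraction * len(nodes))))
    # Sort first so that sampling is reproducible regardless of set ordering.
    return set(rng.sample(sorted(nodes), k))

def reward_distribution(nodes, run_planner, n_subsets=20, fraction=0.4, seed=0):
    """Empirical distribution of planner rewards over independently sampled
    subsets, as in the Figure 4 experiment. `run_planner` maps a node subset
    to the reward obtained by the (subset-restricted) planner."""
    rng = random.Random(seed)
    return [run_planner(sample_node_subset(nodes, fraction, rng))
            for _ in range(n_subsets)]
```

A reduction policy then amounts to replacing the uniform sampler with one biased towards subsets in the upper tail of this distribution.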
Hyperparameters. Hyperparameter optimization for UCT and SG-UCT is performed separately for each objective function and each synthetic graph model / real-world network dataset. For synthetic graphs, hyperparameters are tuned on a disjoint set of graphs; for real-world graphs, they are optimized separately for each graph. We consider a range of values for the exploration constant. Since the ranges of the rewards may vary across settings, we employ two means of standardization: during the tree search we use a standardized final reward, and further standardize by multiplying with the average reward observed at the root at the previous timestep, ensuring consistent levels of exploration. The hyperparameters for the ablation study are bootstrapped from those of standard UCT, while the exploration constant for the SG-UCT_{MINCOST} variant is optimized separately. These results are used to reduce the hyperparameter search space for SG-UCT for both synthetic and real-world graphs. The hyperparameter values used are shown in Table 4. For estimating the robustness objective we use Monte Carlo simulations.
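For reference, the exploration constant enters through the standard UCB1 child-selection rule of UCT (Kocsis and Szepesvári [2006]). The sketch below shows this generic rule only; the paper's reward standardization and the specific constant values are elided in the text above, so nothing here should be read as the paper's exact implementation.

```python
import math

def ucb1(total_value, visits, parent_visits, c):
    """UCB1 score for a child node: empirical mean plus an exploration bonus
    scaled by the exploration constant c (tuned per setting in Table 4)."""
    if visits == 0:
        return float("inf")  # unvisited children are expanded first
    return total_value / visits + c * math.sqrt(math.log(parent_visits) / visits)

def select_child(children, c):
    """children: list of (total_value, visit_count) pairs; returns the index of
    the child maximizing the UCB1 score."""
    parent_visits = sum(v for _, v in children) or 1
    scores = [ucb1(val, vis, parent_visits, c) for val, vis in children]
    return max(range(len(children)), key=scores.__getitem__)
```

Because the bonus term is additive, rescaling rewards changes the exploration/exploitation balance, which is why some form of reward standardization (as described above) is needed to keep exploration levels consistent across settings.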
Extended Results. Extended results for realworld graphs, split by graph instance, are shown in Table 5.
Appendix B Reproducibility
Implementation. We implement all approaches and baselines in Python using a variety of numerical and scientific computing packages Hunter [2007], Hagberg et al. [2008], McKinney et al. [2011], Paszke et al. [2019], Waskom [2021], while the calculations of the objective functions (efficiency and robustness) are performed in a custom C++ module, as they are the main speed bottleneck. In a future version, we will release the implementation as Docker containers together with instructions for reproducing (up to hardware differences) all the results reported in the paper, including tables and figures.
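As an illustration of why the objective calculations dominate the runtime, consider global efficiency (Latora and Marchiori [2001]): the average over ordered node pairs of 1/d(u, v), with 1/∞ = 0 for disconnected pairs, requiring an all-pairs shortest-path computation per evaluation. The pure-Python sketch below uses unweighted hop distances for simplicity (a spatial variant would use length-weighted shortest paths); the paper's C++ module is presumably an optimized equivalent, not this code.

```python
from collections import deque

def global_efficiency(n, adj):
    """Global efficiency of an unweighted graph with nodes 0..n-1 and adjacency
    lists `adj`: mean of 1/d(u, v) over ordered pairs, via one BFS per source."""
    total = 0.0
    for src in range(n):
        dist = {src: 0}
        q = deque([src])
        while q:  # breadth-first search from src
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        # Unreachable nodes contribute 0 (1/infinity).
        total += sum(1.0 / d for node, d in dist.items() if node != src)
    return total / (n * (n - 1)) if n > 1 else 0.0
```

Each evaluation costs O(n·(n+m)) with BFS, and the planner evaluates the objective after every candidate edge addition, which is why moving this computation to C++ (or maintaining distances incrementally, cf. Demetrescu and Italiano [2004]) pays off.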
Data Availability. In terms of the real-world datasets, the Internet dataset is publicly available without any restrictions and can be downloaded via the Internet Topology Zoo website, http://www.topologyzoo.org/dataset.html. The Metro dataset was originally collected by Roth et al. [2012] and was licensed to us by the authors for the purposes of this work. A copy of the Metro dataset can be obtained by contacting its original authors for licensing (see https://www.quanturb.com/data).
Infrastructure and Runtimes. Experiments were carried out on an internal cluster of 8 machines, each equipped with 2 Intel Xeon E5-2630 v3 processors and 128 GB RAM. On this infrastructure, all experiments reported in this paper took approximately 21 days to complete.
Table 4: Hyperparameter values used in the experiments. Each hyperparameter has one column per objective function (Eff. = efficiency, Rob. = robustness); dashes denote hyperparameters that do not apply to the given agent.

| Experiment | Graph | Agent | c (Eff.) | c (Rob.) | Red. policy (Eff.) | Red. policy (Rob.) | Sims (Eff.) | Sims (Rob.) |
|---|---|---|---|---|---|---|---|---|
| Internet | Colt | SG-UCT | 0.05 | 0.05 | AECS-40 | AECS-40 | 25 | 25 |
| | | UCT | 0.1 | 0.1 | — | — | — | — |
| | GtsCe | SG-UCT | 0.1 | 0.05 | AECS-40 | AECS-40 | 25 | 25 |
| | | UCT | 0.25 | 0.1 | — | — | — | — |
| | TataNld | SG-UCT | 0.05 | 0.05 | AECS-40 | AECS-40 | 25 | 25 |
| | | UCT | 0.1 | 0.1 | — | — | — | — |
| | UsCarrier | SG-UCT | 0.05 | 0.05 | AECS-40 | AECS-40 | 25 | 25 |
| | | UCT | 0.05 | 0.1 | — | — | — | — |
| Metro | Barcelona | SG-UCT | 0.05 | 0.05 | AECS-40 | AECS-40 | 25 | 25 |
| | | UCT | 0.05 | 0.05 | — | — | — | — |
| | Beijing | SG-UCT | 0.05 | 0.05 | AECS-40 | AECS-40 | 25 | 25 |
| | | UCT | 0.05 | 0.05 | — | — | — | — |
| | Mexico | SG-UCT | 0.05 | 0.05 | AECS-40 | AECS-40 | 25 | 25 |
| | | UCT | 0.25 | 0.05 | — | — | — | — |
| | Moscow | SG-UCT | 0.25 | 0.1 | AECS-40 | AECS-40 | 25 | 25 |
| | | UCT | 0.05 | 0.05 | — | — | — | — |
| | Osaka | SG-UCT | 0.05 | 0.05 | AECS-40 | AECS-40 | 25 | 25 |
| | | UCT | 0.05 | 0.25 | — | — | — | — |
| KH-25 | — | SG-UCT | 0.05 | 0.25 | AECS-40 | AECS-40 | 25 | 25 |
| | | UCT | 0.1 | 0.1 | — | — | — | — |
| | | SG-UCT_{MINCOST} | 0.1 | 0.1 | — | — | 25 | 25 |
| KH-50 | — | SG-UCT | 0.05 | 0.05 | AECS-40 | AECS-40 | 25 | 25 |
| | | UCT | 0.05 | 0.25 | — | — | — | — |
| | | SG-UCT_{MINCOST} | 0.05 | 0.25 | — | — | 10 | 25 |
| KH-75 | — | SG-UCT | 0.05 | 0.05 | AECS-40 | AECS-40 | 25 | 25 |
| | | UCT | 0.05 | 0.1 | — | — | — | — |
| | | SG-UCT_{MINCOST} | 0.05 | 0.1 | — | — | 25 | 25 |
Table 5: Extended results for real-world graphs, split by graph instance. Dashes denote baselines that do not apply to the given objective.

| Objective | Experiment | Graph | Random | LDP | FV | ERes | MinCost | LBHB | UCT | SG-UCT (ours) |
|---|---|---|---|---|---|---|---|---|---|---|
| Efficiency | Internet | Colt | 0.081 | — | — | — | 0.127 | 0.098 | 0.164 | 0.199 |
| | | GtsCe | 0.017 | — | — | — | 0.082 | 0.014 | 0.110 | 0.125 |
| | | TataNld | 0.020 | — | — | — | 0.078 | 0.015 | 0.102 | 0.110 |
| | | UsCarrier | 0.026 | — | — | — | 0.097 | 0.026 | 0.171 | 0.178 |
| | Metro | Barcelona | 0.020 | — | — | — | 0.063 | 0.003 | 0.067 | 0.076 |
| | | Beijing | 0.008 | — | — | — | 0.028 | 0.003 | 0.041 | 0.046 |
| | | Mexico | 0.007 | — | — | — | 0.032 | 0.011 | 0.037 | 0.041 |
| | | Moscow | 0.011 | — | — | — | 0.038 | 0.007 | 0.043 | 0.053 |
| | | Osaka | 0.017 | — | — | — | 0.082 | 0.010 | 0.093 | 0.102 |
| Robustness | Internet | Colt | 0.007 | 0.005 | 0.006 | 0.009 | 0.075 | — | 0.055 | 0.089 |
| | | GtsCe | 0.023 | 0.048 | 0.017 | 0.031 | 0.099 | — | 0.098 | 0.155 |
| | | TataNld | 0.017 | 0.011 | 0.002 | 0.013 | 0.074 | — | 0.093 | 0.119 |
| | | UsCarrier | 0.010 | 0.035 | 0.038 | 0.033 | 0.041 | — | 0.085 | 0.125 |
| | Metro | Barcelona | 0.020 | 0.010 | 0.009 | 0.036 | 0.071 | — | 0.076 | 0.115 |
| | | Beijing | 0.004 | 0.003 | 0.002 | 0.001 | 0.037 | — | 0.055 | 0.062 |
| | | Mexico | 0.007 | 0.003 | 0.005 | 0.011 | 0.038 | — | 0.051 | 0.068 |
| | | Moscow | 0.013 | 0.033 | 0.042 | 0.034 | 0.031 | — | 0.090 | 0.109 |
| | | Osaka | 0.003 | 0.008 | 0.011 | 0.015 | 0.064 | — | 0.066 | 0.072 |