Graphs are mathematical abstractions that can be used to model a variety of systems, from infrastructure and biological networks to social structures. Various methods and tools for analyzing networks have been developed; these have often been used for understanding the systems themselves. These range from mathematical models of how families of graphs are generated [Watts and Strogatz1998, Barabási and Albert1999] to measures of node centrality for capturing the roles of vertices [Bianchini et al.2005] and global characteristics of the networks [Newman2018], to name just a few.
A measure that has attracted significant interest from the network science community and practitioners is robustness [Newman2003] (sometimes also called resilience), which is typically defined as the capacity of the graph to withstand random failures, targeted attacks on key nodes, or some combination thereof. Under these attack strategies, a network is considered robust if a significant fraction of nodes have to be removed before it breaks into more than one connected component [Cohen et al.2000], its diameter (mean shortest pairwise node distance) increases [Albert et al.2000], or the size of its largest connected component diminishes [Beygelzimer et al.2005]. Previous works have studied the robustness of communication networks such as the Internet [Cohen et al.2001] and infrastructure networks such as those for transportation and energy distribution [Cetinay et al.2018]. Optimal configurations under attack strategies have also been discovered – for example, under the joint objective of resilience to both attack strategies, the optimal network has a bi-modal or tri-modal degree distribution [Valente et al.2004].
However, building a robust network from scratch is impractical, since networks are generally designed with a specific purpose in mind. For this reason, prior works have addressed the problem of modifying existing networks in order to improve their robustness. Beygelzimer et al. [Beygelzimer et al.2005] approach this problem by considering edge addition or rewiring, based on random or preferential (with respect to the degree of a node) modifications. In [Schneider et al.2011] the authors propose a “greedy” modification scheme based on random edge selection and swapping if the resilience metric improves. While simple and interpretable, these strategies may not yield the best solutions or generalize across networks with varying characteristics and sizes. Certainly, better solutions may be found by exhaustive search, but the combinatorial complexity of exploring all the possible topologies and the cost of computing the metric render this strategy infeasible.
In order to address this problem, we pose the question of whether generalizable robustness improvement strategies can be learned. We formalize the process of modifying the edges of a graph in order to maximize the value of a global objective function as a Markov Decision Process (MDP). In this framework, an agent is given a fixed budget of modifications (such as edge additions) to make to a graph, receiving rewards that are proportional to the improvement measured through the objective function. In particular, we adopt two measures of robustness in the presence of random and targeted attacks as objective functions. Reinforcement Learning (RL) is used to learn policies for performing these modifications. Methods such as Deep Q-Network (DQN) [Mnih et al.2015] have recently shown great promise in tackling high-dimensional decision-making problems by using deep neural networks as function approximators. Recently, with the emergence of Graph Neural Network (GNN) architectures that are able to operate on graph-structured data, RL algorithms have also been applied with great success to NP-hard combinatorial optimization problems. More specifically, we investigate a novel framework for enhancing graph robustness, starting from an approach first introduced in [Dai et al.2018] in the context of adversarial attacks on graph neural network classifiers. In that work, the authors consider the problem of changing a network in order to fool external classifiers, as has been done in the past for image classification. In this work, we are interested instead in changing the network structure in order to optimize a characteristic of the graph itself.
To the best of our knowledge, this is the first work that addresses the problem of learning how to build robust graphs using (Deep) RL. Since it addresses the problem of building robust networks with a DQN, we name our approach RNet–DQN. We believe that the contribution of this work is also methodological. The proposed approach can be applied to a variety of problems based on different definitions of resilience or modeled using objective functions associated with different characteristics of graphs. Robustness may be considered a case study in this regard.
The remainder of the paper is structured as follows. We provide the formalization of the problem as an MDP and define the robustness measures in Section 2. Section 3 provides a description of the proposed representation of states and actions for RL methods based on function approximation. We describe our experimental setup in Section 4, and discuss our main results in Section 5. In Section 6 we offer an analysis of the limitations of the present work and possible avenues for future research. Finally, in Section 7 we review and compare the key works in this area and conclude in Section 8.
2 Modeling Graph Robustness for Reinforcement Learning
MDPs are a formalization of decision-making processes. In an MDP, the decision maker, called the agent, interacts with an environment. The agent, upon finding itself in a state $s \in \mathcal{S}$, must take an action $a$ out of the set $\mathcal{A}(s)$ of valid ones. For each action taken, the agent receives a reward $r$, governed by the reward function $\mathcal{R}(s, a)$, which it seeks to maximize through its decisions over time. Finally, the agent finds itself in a new state $s'$, depending on a transition model $\mathcal{P}$ that governs the probability $P(s' \mid s, a)$ of transitioning to state $s'$ after taking action $a$ in state $s$.
This sequence of interactions gives rise to a trajectory $S_0, A_0, R_1, S_1, A_1, R_2, \dots, S_{T-1}, A_{T-1}, R_T$ in the case of episodic tasks. The tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$ fully defines this MDP, where $\gamma \in [0, 1]$ is a discount factor. We also define a policy $\pi(a \mid s)$, a distribution of actions over states, which fully defines the behavior of the agent. Given a particular policy $\pi$, the state-action value function $Q_\pi(s, a)$ is defined as the expected return when starting from $s$, taking action $a$, and subsequently following policy $\pi$. There are several methods that are able to iteratively derive the optimal state-action value function and policy (e.g., Generalized Policy Iteration).
Modeling Graph Construction.
Let $\mathbf{G}^{(N)}$ be the set of labeled, undirected, unweighted, connected graphs with $N$ nodes; $\mathbf{G}^{(N,m)}$ is then the subset of this set in which graphs have $m$ edges. Let $\mathcal{F}$ be an objective function defined on $\mathbf{G}^{(N)}$, and $L \in \mathbb{N}$ be a modification budget. Given an initial graph $G_0 = (V, E_0) \in \mathbf{G}^{(N,m_0)}$, the aim is to perform a series of $L$ edge additions to $G_0$ such that the resulting graph $G_* = (V, E_*)$ satisfies:
$$G_* = \operatorname*{arg\,max}_{G' \in \mathbf{G}'} \mathcal{F}(G'), \quad \text{where } \mathbf{G}' = \{ G = (V, E) \in \mathbf{G}^{(N, m_0 + L)} \mid E_0 \subset E \}.$$
This can be seen as a sequential decision-making problem in which an agent has to take actions with the goal of improving each of the intermediate graphs that arise in the sequence $G_0, G_1, \dots, G_{L-1}$. We consider tasks to be episodic; each episode proceeds for at most $L$ steps until the agent has exhausted its action budget or there are no valid actions (e.g., the graph is fully connected). An episode visualization is shown in Figure 1. We map our problem to an MDP as follows:
State: The state $S_t$ is $G_t$, i.e., the graph at time $t$.
Action: In our formulation, actions correspond to edge additions. Specifically, we write $A_t = (v, u)$ for the addition of an edge between vertices with labels $v$, $u$.
Transitions: The transition function is deterministic. $S_{t+1}$ is fully determined by $S_t$ and $A_t$.
Reward: The reward is defined as follows:
Definition of Objective Functions for Robustness.
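In this study, we are interested in the robustness of graphs as objective functions. Given a graph $G = (V, E)$, we let the critical fraction $p(G, \xi) \in [0, 1]$ be the minimum fraction of nodes that have to be removed from $G$ in some order $\xi$ for it to become disconnected (i.e., have more than one connected component). The higher this fraction is, the more robust the graph can be said to be. Certainly, the order $\xi$ in which nodes are removed can have an impact on $p$, and corresponds to the two attack strategies. We consider both random permutations $\xi_{random}$ of nodes in $G$, as well as permutations $\xi_{targeted}$, which are subject to the constraint that the nodes must appear in decreasing order of their degree in this permutation, i.e., writing $d_v$ for the degree of node $v$ and $\xi(v)$ for its position in the permutation,
$$\forall v, u \in V: \quad d_v > d_u \implies \xi_{targeted}(v) < \xi_{targeted}(u).$$
We define the objective functions in the following way:
Expected Critical Fraction to Random Removal: $\mathcal{F}_{random}(G) = \mathbb{E}_{\xi_{random}}\left[ p(G, \xi_{random}) \right]$
Expected Critical Fraction to Targeted Removal: $\mathcal{F}_{targeted}(G) = \mathbb{E}_{\xi_{targeted}}\left[ p(G, \xi_{targeted}) \right]$
To obtain an estimate of these quantities, we generate $R$ permutations $\xi^{(1)}, \dots, \xi^{(R)}$ and average over the critical fraction computed for each permutation:
$$\hat{\mathcal{F}}_{random}(G) = \frac{1}{R} \sum_{i=1}^{R} p(G, \xi^{(i)}_{random}), \qquad \hat{\mathcal{F}}_{targeted}(G) = \frac{1}{R} \sum_{i=1}^{R} p(G, \xi^{(i)}_{targeted}).$$
We use $\mathcal{F}_{random}$ and $\mathcal{F}_{targeted}$ to mean their estimates obtained in this way in the remainder of this paper.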
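The estimation procedure described above can be sketched in plain Python as follows. This is a minimal illustration rather than the implementation used in the experiments: the graph is assumed to be a dictionary mapping each node to its set of neighbours, all function names are hypothetical, ties in degree are broken at random, and a graph that never disconnects (such as a clique) is assumed to have a critical fraction of 1.

```python
import random


def is_disconnected(adj, removed):
    """True if the nodes not in `removed` form more than one connected component."""
    remaining = [v for v in adj if v not in removed]
    if len(remaining) <= 1:
        return False  # zero or one surviving node cannot be split further
    seen = {remaining[0]}
    stack = [remaining[0]]
    while stack:  # depth-first search restricted to surviving nodes
        v = stack.pop()
        for u in adj[v]:
            if u not in removed and u not in seen:
                seen.add(u)
                stack.append(u)
    return len(seen) < len(remaining)


def critical_fraction(adj, order):
    """Fraction of nodes removed, following `order`, when the graph first disconnects."""
    removed = set()
    for count, v in enumerate(order, start=1):
        removed.add(v)
        if is_disconnected(adj, removed):
            return count / len(adj)
    return 1.0  # assumption: graphs that never disconnect score 1


def random_order(adj, rng):
    nodes = list(adj)
    rng.shuffle(nodes)
    return nodes


def targeted_order(adj, rng):
    """Nodes by decreasing initial degree; ties are broken at random."""
    nodes = random_order(adj, rng)
    # Python's sort is stable, so equal-degree nodes keep their shuffled order.
    return sorted(nodes, key=lambda v: -len(adj[v]))


def estimate_objective(adj, make_order, num_permutations, seed=0):
    """Average the critical fraction over `num_permutations` sampled removal orders."""
    rng = random.Random(seed)
    total = sum(critical_fraction(adj, make_order(adj, rng))
                for _ in range(num_permutations))
    return total / num_permutations


# A star graph: removing the hub (the highest-degree node) disconnects it at once.
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
f_targeted = estimate_objective(star, targeted_order, num_permutations=10)
f_random = estimate_objective(star, random_order, num_permutations=10)
```

On the star graph the targeted estimate equals 1/5, since the hub is always removed first, whereas the random estimate is at least as large because the hub may survive several removals.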
3 Learning How to Build Robust Graphs with Function Approximation
We will now discuss the design and implementation of a scalable solution for learning how to build robust graphs starting from the modeling of the problem described in Section 2. While this formulation may allow us to work with a tabular RL method, the number of states quickly becomes intractable: for example, there are approximately $10^{57}$ labeled, unweighted, connected graphs with 20 vertices. We thus require a means of considering graph properties that are label-agnostic, permutation-invariant, and generalize across similar states and actions. Graph Neural Network architectures address these requirements. In particular, in our solution we use a graph representation based on structure2vec (S2V) [Dai et al.2016], a GNN architecture inspired by mean field inference in graphical models. Given an input graph $G = (V, E)$ where nodes $v \in V$ have feature vectors $x_v$ and edges $(v, u)$ have feature vectors $w(v, u)$, its objective is to produce for each node $v$ a feature vector $\mu_v$ that captures the structure of the graph as well as interactions between neighbors. This is performed in several rounds of aggregating the features of neighbors and applying a non-linear activation function $F$ (such as a neural network or kernel function) parametrized by $\Theta$. For each round $k \in \{1, 2, \dots, K\}$, the network simultaneously applies an update of the form:
$$\mu_v^{(k+1)} = F\left( x_v, \{ w(v, u) \}_{u \in \mathcal{N}(v)}, \{ \mu_u^{(k)} \}_{u \in \mathcal{N}(v)}; \Theta \right),$$
where $\mathcal{N}(v)$ is the neighbourhood of node $v$. Once node-level embeddings are obtained, permutation-invariant embeddings for a subgraph $S$ can be derived by summing the node-level embeddings: $\mu(S) = \sum_{v_i \in S} \mu_{v_i}$.
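To make the round-based update concrete, the sketch below implements a deliberately simplified variant in plain Python. It is an assumption-laden illustration and not the S2V parametrization itself: node features are scalars, edge features are ignored, and the update is taken to be a ReLU applied to `theta1 * x_v + theta2 @ sum of neighbour embeddings`; the names `s2v_embeddings` and `graph_embedding` are hypothetical. Its main purpose is to show that the summed embedding does not depend on node labels.

```python
import random


def relu(vec):
    return [max(0.0, x) for x in vec]


def matvec(mat, vec):
    return [sum(m * x for m, x in zip(row, vec)) for row in mat]


def s2v_embeddings(adj, features, theta1, theta2, rounds):
    """Run `rounds` of simultaneous neighbourhood aggregation; returns node -> vector."""
    dim = len(theta1)
    mu = {v: [0.0] * dim for v in adj}
    for _ in range(rounds):
        new_mu = {}
        for v in adj:
            # Sum the current embeddings of the neighbours of v.
            agg = [sum(mu[u][i] for u in adj[v]) for i in range(dim)]
            msg = matvec(theta2, agg)
            new_mu[v] = relu([theta1[i] * features[v] + msg[i] for i in range(dim)])
        mu = new_mu  # all nodes are updated simultaneously
    return mu


def graph_embedding(mu, nodes):
    """Permutation-invariant embedding: the sum of the node-level embeddings."""
    dim = len(next(iter(mu.values())))
    return [sum(mu[v][i] for v in nodes) for i in range(dim)]


rng = random.Random(0)
dim = 4
theta1 = [rng.uniform(-1, 1) for _ in range(dim)]
theta2 = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]

path = {0: {1}, 1: {0, 2}, 2: {1}}
relabelled = {2: {0}, 0: {2, 1}, 1: {0}}  # the same path with nodes renamed 0->2, 1->0, 2->1
ones = {v: 1.0 for v in path}

emb_a = graph_embedding(s2v_embeddings(path, ones, theta1, theta2, rounds=3), path)
emb_b = graph_embedding(s2v_embeddings(relabelled, ones, theta1, theta2, rounds=3), relabelled)
```

Up to floating-point rounding, `emb_a` and `emb_b` coincide, which is the label-agnostic behaviour required of the state representation.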
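We use Q-learning [Watkins and Dayan1992], which estimates the state-action value function $Q(s, a)$ introduced earlier and derives a deterministic policy that acts greedily with respect to it. The agent interacts with the environment and updates its estimates according to:
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right],$$
where $\alpha$ is the learning rate.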
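For intuition, the standard Q-learning update can be written in tabular form as below. This is only an illustrative sketch: the approach in this paper approximates the value function with a neural network rather than a table, and the default step size `alpha`, the discount `gamma`, and the string-valued states and actions are hypothetical.

```python
from collections import defaultdict


def q_learning_update(q, state, action, reward, next_state, next_actions,
                      alpha=0.1, gamma=1.0):
    """Apply one Q-learning update in place and return the new estimate."""
    # Value of the best action available in the next state (0 if it is terminal).
    best_next = max((q[(next_state, a)] for a in next_actions), default=0.0)
    td_error = reward + gamma * best_next - q[(state, action)]
    q[(state, action)] += alpha * td_error
    return q[(state, action)]


def greedy_action(q, state, actions):
    """Deterministic policy that acts greedily with respect to the estimates."""
    return max(actions, key=lambda a: q[(state, a)])


q = defaultdict(float)  # unseen state-action pairs start at 0
first = q_learning_update(q, "s0", "add(0,2)", 1.0, "s1", [])
second = q_learning_update(q, "s0", "add(0,2)", 1.0, "s1", [])
best = greedy_action(q, "s0", ["add(0,2)", "add(0,3)"])
```

Each update moves the estimate a fraction `alpha` of the way towards the bootstrapped target, so repeated updates with the same reward approach that reward geometrically.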
In the case of high-dimensional state and action spaces, solutions that use a neural network for estimating $Q(s, a)$ have been successful in a variety of domains ranging from general game-playing to continuous control [Mnih et al.2015, Lillicrap et al.2016]. In particular, we use the DQN algorithm: a sample-efficient method that improves on neural fitted Q-iteration by use of an experience replay buffer and an iteratively updated target network for state-action value function estimation. We opt for the Double DQN variant introduced in [Van Hasselt et al.2016], which addresses the “optimistic” action-value estimates of standard DQN by use of two different model instances for greedy action selection and state-action value prediction. We use two-step returns, which can speed up and improve the learning process significantly.
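The combination of Double DQN with two-step returns can be illustrated by how the regression target is formed. The sketch below is an assumption-based outline, not the training code used here: `q_online` and `q_target` stand in for the two model instances as plain callables, and the function name and the toy values are hypothetical.

```python
def two_step_double_dqn_target(rewards, next_state, next_actions,
                               q_online, q_target, gamma=1.0):
    """Target for (s_t, a_t) given rewards (r_{t+1}, r_{t+2}) and next_state s_{t+2}."""
    # Discounted sum of the rewards observed over the two steps.
    n_step_return = sum(gamma ** i * r for i, r in enumerate(rewards))
    if not next_actions:  # the episode ended within the two steps
        return n_step_return
    # Double DQN: the online model selects the action, the target model evaluates it.
    best = max(next_actions, key=lambda a: q_online(next_state, a))
    return n_step_return + gamma ** len(rewards) * q_target(next_state, best)


# Toy value tables standing in for the two networks.
online = {"a": 1.0, "b": 2.0}
target = {"a": 10.0, "b": 3.0}
y = two_step_double_dqn_target(
    rewards=(0.5, 0.25), next_state="s2", next_actions=["a", "b"],
    q_online=lambda s, a: online[a], q_target=lambda s, a: target[a], gamma=0.9)
```

In this toy case the online model picks action `b`, so the target bootstraps from the target model's value of 3 for `b` rather than from its larger value of 10 for `a`, which is the mechanism that tempers over-optimistic estimates.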
A key problem is the representation of the state $S_t$. One possibility is to represent the state by computing the S2V embedding for the subgraph consisting of all nodes that are linked by an edge added up to time $t$, and to represent each action by concatenating the embeddings for nodes $v$ and $u$ at time $t$. However, we note that defining actions in this way does not scale to large graphs, since at every step $O(|V|^2)$ actions have to be considered. Instead, we follow the approach introduced in [Dai et al.2018] and decompose each action $A_t$ into two steps, $A_t^{(1)}$ and $A_t^{(2)}$: $A_t^{(1)}$ corresponds to the selection of the first node linked by an edge, and $A_t^{(2)}$ to the second. In this way, the agent has to consider only a much more manageable number of $O(|V|)$ actions at each timestep.
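The two-step decomposition can be pictured as two successive node selections, each over at most as many candidates as there are nodes. The helper names below are hypothetical and the graph is again assumed to be a dictionary of neighbour sets; the sketch only enumerates valid choices and does not include the learned value function.

```python
def valid_first_nodes(adj):
    """Nodes that still have at least one non-neighbour to connect to."""
    n = len(adj)
    return [v for v in adj if len(adj[v]) < n - 1]


def valid_second_nodes(adj, first):
    """Nodes that can be linked to `first` by a new edge."""
    return [u for u in adj if u != first and u not in adj[first]]


def add_edge(adj, v, u):
    """Deterministic transition: return a copy of the graph with edge (v, u) added."""
    new_adj = {w: set(nbrs) for w, nbrs in adj.items()}
    new_adj[v].add(u)
    new_adj[u].add(v)
    return new_adj


path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
first_choices = valid_first_nodes(path)        # first sub-action
second_choices = valid_second_nodes(path, 0)   # second sub-action, given node 0
next_graph = add_edge(path, 0, second_choices[-1])
```

Each sub-action ranges over a list of length at most the number of nodes, whereas enumerating edges directly would require scoring every unconnected pair.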
4 Experimental Setup
We compare against the following baselines:
Random: Randomly selects an available action.
Greedy: Uses lookahead and selects the action that gives the biggest improvement in the estimated value of $\mathcal{F}$ over one edge addition.
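Both baselines can be sketched in a few lines. The code below is an illustrative outline under stated assumptions: graphs are dictionaries of neighbour sets, the function names are hypothetical, and `min_degree_fraction` is a toy stand-in objective chosen only because it is deterministic; in the experiments the objective would be one of the estimated robustness measures.

```python
import random
from itertools import combinations


def non_edges(adj):
    """All node pairs that are not yet connected, i.e. the available actions."""
    return [(v, u) for v, u in combinations(sorted(adj), 2) if u not in adj[v]]


def with_edge(adj, v, u):
    new_adj = {w: set(nbrs) for w, nbrs in adj.items()}
    new_adj[v].add(u)
    new_adj[u].add(v)
    return new_adj


def random_baseline(adj, budget, rng):
    """Add `budget` edges, each chosen uniformly among the available ones."""
    chosen = []
    for _ in range(budget):
        candidates = non_edges(adj)
        if not candidates:
            break
        edge = rng.choice(candidates)
        adj = with_edge(adj, *edge)
        chosen.append(edge)
    return adj, chosen


def greedy_baseline(adj, objective, budget):
    """Add `budget` edges, each maximizing the objective after one-step lookahead."""
    chosen = []
    for _ in range(budget):
        candidates = non_edges(adj)
        if not candidates:
            break
        edge = max(candidates, key=lambda e: objective(with_edge(adj, *e)))
        adj = with_edge(adj, *edge)
        chosen.append(edge)
    return adj, chosen


def min_degree_fraction(adj):
    """Toy objective (an assumption for this example): minimum degree over N - 1."""
    return min(len(nbrs) for nbrs in adj.values()) / (len(adj) - 1)


path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
greedy_graph, greedy_edges = greedy_baseline(path, min_degree_fraction, budget=1)
random_graph, random_edges = random_baseline(path, budget=1, rng=random.Random(0))
```

Note that the greedy baseline calls the objective once per candidate edge at every step, which is the source of its cost on larger graphs.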
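We study performance on graphs generated by the following models:
Erdős–Rényi (ER): A random graph model [Erdős and Rényi1960].
Barabási–Albert (BA): A growth model with preferential attachment [Barabási and Albert1999].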
We generate a set of graphs using the 2 graph models above. We train the agent using graphs with vertices and a number of edge additions equal to 1% of total possible edges (2 for ). At each step, the algorithm proceeds over all graphs in . During training, we assess the performance of the agent on a disjoint set every 100 steps. We evaluate the performance of the algorithm and baselines over a disjoint set of graphs . We use . Graphs are generated using a wrapper around the networkx Python package [Hagberg et al.2008]. In case the candidate graph returned by the generation procedure is not connected, it is rejected, and another one generated until the set reaches the specified cardinality. To estimate the value of the objective functions, we use a number of permutations . When repeating the evaluation for larger graphs, we consider and scale (for ER), and depending on .
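The rejection step for disconnected candidates can be sketched as follows. This is a standard-library illustration and not the networkx-based wrapper used in the experiments: the $G(n, p)$ generator, the function names, and the values of `n`, `p` and the set size are all assumptions made for the example.

```python
import random
from itertools import combinations


def erdos_renyi(n, p, rng):
    """Sample a G(n, p) graph: each possible edge is included independently with probability p."""
    adj = {v: set() for v in range(n)}
    for v, u in combinations(range(n), 2):
        if rng.random() < p:
            adj[v].add(u)
            adj[u].add(v)
    return adj


def is_connected(adj):
    """Check connectivity with a depth-first search from an arbitrary node."""
    start = next(iter(adj))
    seen = {start}
    stack = [start]
    while stack:
        v = stack.pop()
        for u in adj[v] - seen:
            seen.add(u)
            stack.append(u)
    return len(seen) == len(adj)


def generate_connected_graphs(num_graphs, n, p, seed=0):
    """Keep sampling until `num_graphs` connected graphs have been collected."""
    rng = random.Random(seed)
    graphs = []
    while len(graphs) < num_graphs:
        candidate = erdos_renyi(n, p, rng)
        if is_connected(candidate):  # disconnected candidates are rejected
            graphs.append(candidate)
    return graphs


graphs = generate_connected_graphs(num_graphs=5, n=10, p=0.4, seed=0)
```

Rejection sampling of this kind conditions the generated set on connectivity, so the resulting graphs are not distributed exactly as the unconditioned model would produce them.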
is divided into batches of size . Training proceeds for steps. We use a learning rate and a discount factor since we are in the finite horizon case. We use a value of the exploration rate that we decay linearly from to for the first steps, then fix for the rest of the training. We use latent variables and a hidden layer of size . The only hyperparameter we tune is the number of message passing rounds, for , selecting the agent with the best performance over .
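A linearly decayed exploration rate of the kind described above can be written as a small schedule function. The start value, end value and number of decay steps used in the example are hypothetical placeholders, since the concrete settings are not reproduced in this text; the function names are likewise illustrative.

```python
import random


def linear_epsilon(step, eps_start, eps_end, decay_steps):
    """Decay linearly from eps_start to eps_end over decay_steps, then stay fixed."""
    if step >= decay_steps:
        return eps_end
    return eps_start + (eps_end - eps_start) * step / decay_steps


def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random valid action with probability epsilon, else the best-valued one."""
    if rng.random() < epsilon:
        return rng.choice(list(q_values))
    return max(q_values, key=q_values.get)


# Hypothetical settings, for illustration only.
eps_mid = linear_epsilon(50, eps_start=1.0, eps_end=0.1, decay_steps=100)
eps_late = linear_epsilon(500, eps_start=1.0, eps_end=0.1, decay_steps=100)
action = epsilon_greedy({"a": 0.2, "b": 0.7}, epsilon=0.0, rng=random.Random(0))
```

With `epsilon=0.0` the selection is purely greedy, which corresponds to how a trained policy would be used at evaluation time.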
Training, hyperparameter optimization and evaluation are performed separately for each graph family and objective function . Since deep RL algorithms are notoriously dependent on parameter initializations and stochastic aspects of the environment [Henderson et al.2018], we aggregate the results over 5 runs of the training-evaluation loop.
5 Results
In Table 1, we present the main results of our experimental evaluation. For , the policies learned by RNet–DQN yield solutions that outperform the greedy approach in both the ER and BA cases. Results for improving are not as strong but still considerably better than random. This is important since the RNet–DQN learned policies generalize across a disjoint set of test graphs, while the greedy solution has to be calculated individually for each graph in this set. Across all types of graphs and performance measures, RNet–DQN performed statistically significantly better than random.
Efficient Scaling to Larger Graphs.
The most desirable property of the proposed approach is the ability to train models on graphs of small size, and use them to perform predictions on larger graphs. Learning on smaller graphs can be performed more quickly owing to the reduced workload in constructing the learning representation, the lower number of possible actions, and the fact that the objective function can be evaluated faster. Thus, we use the models trained on graphs of size as described in Section 4 and evaluate their and the baselines’ performance on graphs with up to vertices (only up to for greedy due to computational cost, see next paragraph). We show the results obtained in Figure 2. We find that the performance of RNet–DQN decreases relatively little for both objectives when considering BA graphs. For ER graphs, the performance decays rapidly, and the learned policies perform worse than random for graphs of size and up. This suggests that the properties of ER graphs in terms of robustness fundamentally change as their size grows, and the features learned by RNet–DQN are no longer relevant for improving the objective functions beyond a particular size multiplier.
It is also important to compare the computational costs of our approach versus the baselines in order to understand the trade-offs. Empirically, we observed that the greedy baseline becomes simply too expensive to evaluate beyond . We also measured the average decision time for the different agents when performing the evaluation. We display these results in Figure 3. While the speeds of the different agents are not directly comparable given the different components involved in the implementation, we note that the greedy baseline scales much worse, since its decision time rises sharply.
We next analyze the computational complexity of the proposed approach. Assume that in order to compute the objective function for a single graph instance we need to perform $B$ operations. There are $O(|V|^2)$ actions to consider at each step. For the greedy agent, there are thus $O(|V|^2 \times B)$ computations involved in one step for picking an action. In contrast, the RNet–DQN agent does not need to evaluate the objective function explicitly after training. As explained in Section 3, the MDP decomposition into two steps means that the number of actions to consider for each edge addition is $O(|V|)$, while performing the forward pass in the network to obtain the state-action value estimate is an $O(|V| + |E|)$ operation. Thus, the RNet–DQN agent performs $O(|V| \times (|V| + |E|))$ operations at each step. In the worst case, where $|E| = O(|V|^2)$, this means a complexity of $O(|V|^3)$.
In the case of the two objective functions taken into consideration, for each computation of the critical fraction for a given permutation, we need to calculate the number of connected components (an $O(|V| + |E|)$ operation) for each of the $O(|V|)$ nodes in the permutation to be removed. Since we use a number of permutations equal to $|V|$, we thus obtain a complexity of $B = O(|V|^2 \times (|V| + |E|))$. In the worst case, this means evaluating the graph objective function has complexity $O(|V|^4)$. Hence, the greedy agent can take up to $O(|V|^6)$ operations per step, which is simply too expensive for non-trivially sized graphs. This large difference in cost is also captured by the empirical measurement shown earlier.
It is worth noting that the analysis above does not account for the cost of training the model, whose complexity can be hard to determine as its runtime depends on several hyperparameters as well as the problem. Nevertheless, the training involves evaluating the objective function once per timestep for each training graph. The approach is thus advantageous in situations where predictions need to be made over many graphs or the model can be scaled to large graphs for which computing the objective function is expensive.
6 Limitations and Future Work
In this section, we discuss some of the limitations of the proposed approach and potential future directions of this work. Since we are using a Deep Reinforcement Learning approach, the typical caveats for this class of algorithms apply. We observed quite significant differences in performance for different random initializations of the network weights under the same hyperparameters; while we report average performance in our figures, sometimes significantly better solutions were found. We are confident that smarter exploration strategies, tailored to the objective functions and graph structure at hand, can lead to solutions that are more consistent under different initializations. We have also experimented with a larger number of edge additions per episode, but the amount of noise involved made the algorithm very unstable in these scenarios.
While our solution is able to learn generalizable strategies and can be considered efficient in terms of computational complexity, it necessarily trades off performance on individual graph instances compared to the greedy solution. However, the two solutions are not necessarily mutually exclusive: the model-based approach in this work can be used to provide a prior to the greedy search regarding the edge additions that are likely to be promising, reducing the space of actions that have to be considered. Indeed, in some real-world networks (for example, transportation) where there is an underlying geometry to the graph, the addition of edges may be limited by practical concerns such as a constraint on pairwise distances.
The applicability of the proposed framework is not limited to robustness. Indeed, it supports an arbitrary objective function defined on graphs. Potential examples of functions frequently used in the network science community are communicability and efficiency [Newman2018]. Additionally, the approach can be advantageous when the objective function is expensive to evaluate – this is the case with dynamic processes involving simulations on networks such as traffic or epidemics [Barrat et al.2008]. While in this work we have only considered the topological structure, the GNN framework allows node and edge features to be included if available. These have the potential to improve performance if related to the objective function.
In the present work, we only examined the addition of edges as possible actions. One can also address the task of removing edges in a graph – Braess’s paradox [Braess1968] suggests removal may counterintuitively lead to improved efficiency. Allowing for the removal of edges together with additions can give rewiring, which would greatly increase the space of possible graphs that can be constructed in this way.
7 Related Work
Network resilience to random errors and targeted attacks was first discussed by Albert et al. [Albert et al.2000], who examine the average shortest path distance as a function of the number of removed nodes. Performing an analysis of two scale-free communication networks, they find that this type of network has good robustness to random failure but is vulnerable to targeted attacks. A more extensive investigation by Holme et al. [Holme et al.2002] analyzes the robustness of several real-world networks as well as some generated by means of synthetic models. The authors investigate different attack strategies based on degree and betweenness centrality, finding that generally recomputing the centralities after the removal of nodes can yield more efficient attack strategies. Various analytical results have been obtained that describe the breakdown thresholds of network models under the two attack strategies [Cohen et al.2000, Cohen et al.2001]. As previously discussed, several works have considered the problem of improving the resilience of existing networks [Beygelzimer et al.2005, Schneider et al.2011, Schneider et al.2013].
Graph Neural Networks.
Neural network architectures able to deal not solely with Euclidean but also with manifold and graph data have been developed in recent years [Bronstein et al.2017], and applied to a variety of problems where their capacity for representing structured, relational information can be exploited [Battaglia et al.2018]. The research community has approached NP-hard graph problems such as Minimum Vertex Cover and the Traveling Salesman Problem using modern recurrent neural network architectures with attention; approximate solutions have been found to combinatorial optimization problems by framing them as a supervised learning [Vinyals et al.2015] or RL [Bello et al.2016] task. Combining GNNs with RL algorithms has yielded models capable of solving several graph optimization problems with the same architecture while generalizing to graphs an order of magnitude larger than those used during training [Khalil et al.2017]. However, the specific problems do not involve the dynamic improvement of the graph itself. More recently, the problem of modifying a graph by adding edges in such a way as to fool a graph or node-level classifier was investigated by Dai et al. [Dai et al.2018]. In that work, the authors are not interested in studying graphs’ properties, but in finding ways of disguising network changes in order to fool classifiers, similar to adversarial samples in the case of image classifiers based on deep neural networks. We use the approach discussed in [Dai et al.2018], called RL–S2V, as a starting point for our work.
Improving an Objective Function on a Graph.
The problem of building a graph with certain desirable properties was perhaps first recognized in the context of designing neural network architectures such that their performance is maximized [Harp et al.1990]. Recently, approaches have emerged that use RL [Zoph and Le2017, Liu et al.2018] to discover architectures that can deliver state-of-the-art performance on several computer vision benchmarks.
8 Conclusions
In this work, we have addressed the problem of improving the robustness of graphs in presence of random and targeted removal of nodes by learning how to add edges in an effective way. We have modeled the problem of improving the value of an arbitrary global objective function as a Markov Decision Process and we have approached it using Reinforcement Learning and a Graph Neural Network architecture. To the best of our knowledge, this is the first work that addresses the problem of learning how to build robust graphs using (Deep) Reinforcement Learning.
We have evaluated our solution, named RNet–DQN, considering graphs generated through the Erdős–Rényi and Barabási–Albert models. Our experimental results indicate that this method can perform significantly better than making random additions and, in some cases, exceed a greedy baseline. This novel approach offers several advantages: the learned policies can transfer to out-of-sample graphs as well as graphs larger in size than those used during training (for scale-free networks), without having to estimate the objective function post-training. This is important because the naïve greedy solution can be prohibitively expensive to compute for large graphs. Instead, our approach is highly scalable, offering an $O(|V|^3)$ speed-up with respect to it at evaluation time.
Finally, it is worth noting that our contribution is also methodological. The proposed approach can be applied to other problems based on different definitions of resilience or considering fundamentally different objective functions representing other characteristics of graphs. Learning how to improve graph robustness may be considered as a case study of the more general methodology proposed in this paper.
This work was supported by The Alan Turing Institute under the EPSRC grant EP/N510129/1.
- [Albert et al.2000] Réka Albert, Hawoong Jeong, and Albert-László Barabási. Error and attack tolerance of complex networks. Nature, 406(6794):378–382, 2000.
- [Barabási and Albert1999] Albert-László Barabási and Réka Albert. Emergence of Scaling in Random Networks. Science, 286(5439):509–512, 1999.
- [Barrat et al.2008] Alain Barrat, Marc Barthelemy, and Alessandro Vespignani. Dynamical Processes on Complex Networks. Cambridge University Press, 2008.
- [Battaglia et al.2018] Peter W. Battaglia, Jessica B. Hamrick, Victor Bapst, Álvaro Sánchez-González, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.
- [Bello et al.2016] Irwan Bello, Hieu Pham, Quoc V. Le, Mohammad Norouzi, and Samy Bengio. Neural Combinatorial Optimization with Reinforcement Learning. arXiv:1611.09940 [cs, stat], 2016.
- [Beygelzimer et al.2005] Alina Beygelzimer, Geoffrey Grinstein, Ralph Linsker, and Irina Rish. Improving Network Robustness by Edge Modification. Physica A, 357:593–612, 2005.
- [Bianchini et al.2005] Monica Bianchini, Marco Gori, and Franco Scarselli. Inside PageRank. ACM Trans. Internet Technol., 5(1):92–128, February 2005.
- [Braess1968] Dietrich Braess. Über ein paradoxon aus der verkehrsplanung. Unternehmensforschung, 12(1):258–268, 1968.
- [Bronstein et al.2017] Michael M. Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric Deep Learning: Going beyond Euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, July 2017.
- [Cetinay et al.2018] Hale Cetinay, Karel Devriendt, and Piet Van Mieghem. Nodal vulnerability to targeted attacks in power grids. Applied Network Science, 3(1):34, 2018.
- [Cohen et al.2000] Reuven Cohen, Keren Erez, Daniel ben Avraham, and Shlomo Havlin. Resilience of the Internet to Random Breakdowns. Physical Review Letters, 85(21):4626–4628, 2000.
- [Cohen et al.2001] Reuven Cohen, Keren Erez, Daniel ben Avraham, and Shlomo Havlin. Breakdown of the Internet under Intentional Attack. Physical Review Letters, 86(16):3682–3685, 2001.
- [Dai et al.2016] Hanjun Dai, Bo Dai, and Le Song. Discriminative embeddings of latent variable models for structured data. In ICML, 2016.
- [Dai et al.2018] Hanjun Dai, Hui Li, Tian Tian, Xin Huang, Lin Wang, Jun Zhu, and Le Song. Adversarial attack on graph structured data. In ICML, 2018.
- [Erdős and Rényi1960] Paul Erdős and Alfréd Rényi. On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci, 5(1):17–60, 1960.
- [Hagberg et al.2008] Aric Hagberg, Pieter Swart, and Daniel S. Chult. Exploring network structure, dynamics, and function using networkx. In SciPy, 2008.
- [Harp et al.1990] Steven A. Harp, Tariq Samad, and Aloke Guha. Designing application-specific neural networks using the genetic algorithm. In NeurIPS, 1990.
- [Henderson et al.2018] Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, et al. Deep reinforcement learning that matters. In AAAI, 2018.
- [Holme et al.2002] Petter Holme, Beom Jun Kim, Chang No Yoon, and Seung Kee Han. Attack vulnerability of complex networks. Physical Review E, 65(5), 2002.
- [Khalil et al.2017] Elias Khalil, Hanjun Dai, Yuyu Zhang, Bistra Dilkina, and Le Song. Learning combinatorial optimization algorithms over graphs. In NeurIPS, 2017.
- [Lillicrap et al.2016] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, et al. Continuous control with deep reinforcement learning. In ICLR, 2016.
- [Liu et al.2018] Hanxiao Liu, Karen Simonyan, Oriol Vinyals, Chrisantha Fernando, and Koray Kavukcuoglu. Hierarchical representations for efficient architecture search. In ICLR, 2018.
- [Mnih et al.2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
- [Newman2003] M. E. J. Newman. The Structure and Function of Complex Networks. SIAM Review, 45(2):189–190, 2003.
- [Newman2018] M. E. J. Newman. Networks. Oxford University Press, 2018.
- [Paszke et al.2019] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, et al. Pytorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019.
- [Schneider et al.2011] Christian M. Schneider, André A. Moreira, José S. Andrade, Shlomo Havlin, and Hans J. Herrmann. Mitigation of malicious attacks on networks. PNAS, 108(10):3838–3841, 2011.
- [Schneider et al.2013] Christian M. Schneider, Nuri Yazdani, Nuno A. M. Araújo, Shlomo Havlin, and Hans J. Herrmann. Towards designing robust coupled networks. Nature Scientific Reports, 3(1), December 2013.
- [Valente et al.2004] André X. C. N. Valente, Abhijit Sarkar, and Howard A. Stone. Two-Peak and Three-Peak Optimal Complex Networks. Physical Review Letters, 92(11), 2004.
- [Van Hasselt et al.2016] Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double q-learning. In AAAI, 2016.
- [Vinyals et al.2015] Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer Networks. In NeurIPS, 2015.
- [Watkins and Dayan1992] Christopher J. C. H. Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.
- [Watts and Strogatz1998] Duncan J. Watts and Steven H. Strogatz. Collective dynamics of ‘small-world’ networks. Nature, 393(6684):440, June 1998.
- [Zoph and Le2017] Barret Zoph and Quoc V. Le. Neural Architecture Search with Reinforcement Learning. In ICLR, 2017.