1 Introduction
In the recent past, deep learning has proved to be an extremely valuable tool for robotics. Harnessing the power of deep neural networks has emerged as a successful approach to designing policies that map sensor inputs to control outputs for complex tasks. These include, but are not limited to, learning to play video games
[1, 2], learning complex control policies for robot tasks [3] and learning to plan with only sensor information [4, 5, 6]. While these works are impressive, they often concern themselves with controlling a single entity. However, in many real world applications, especially in fields like robotics there exists a need to control multiple robots interact with each other in cooperative or competitive settings. Examples include warehouse management with teams of robots [7], multirobot furniture assembly [8], and concurrent control and communication for teams of robots [9]. In such a scenario, as the number of robots increases, the dimensionality of the input space and the control space both increase making it much harder to learn meaningful policies.This paper looks to tackle the problem of learning individual control policies by exploiting the underlying graph structure among the robots. We start with the hypothesis that the difficulty of learning scalable policies for multiple robots can be attributed to two key issues: dimensionality and partial information. Consider an environment with
robots (This work uses bold font when talking about a collection of items, vectors and matrices). Each robot receives partial observations of the environment. In order for a robot to learn a meaningful control policy, it must interact with some subset of all agents,
. Finding the right subset of neighbors to learn from is in itself a challenging problem. Further, in order to ensure that the method scales, as the number of robots increase, one needs to ensure that the cardinality of the subset of neighbors , remains fixed or grows very slowly.To solve these problems for large scale multirobot control, we draw inspiration from convolutional neural networks (CNNs). CNNs consist of sequentially composed layers where each layer comprises of banks of linear time invariant filters to extract local features along with pooling operations to reduce the dimensionality. Convolutions regularize the linear transform to exploit the underlying structure of the data and hence constrain the search space for the linear transform that minimizes the cost function. However, CNNs cannot be directly applied to irregular data elements that have arbitrary pairwise relationships defined by an underlying graph. To overcome this limitation, there has been the advent of a new architecture called graph convolutional networks (GCNs)
[10, 11, 12]. Similar to CNNs, GCNs consist of sequentially composed layers that regularize the linear transform in each layer to be a graph convolution with a bank of graph filters and the weights of the filter are learned by minimizing some cost function.When controlling a large swarm of robots operating in a Euclidean space , the underlying graph can be defined as where is the set of nodes and is the set of edges. For example, we define each robot to be a node and add an edge between robots if they are are less than distance apart. This graph acts as a support for the data vector where is the state representation of robot . Now, our GCN exploits this graph and at each node aggregates information from its neighbors (See Section 2.2). This information is propagated forward to the next layer through a non linear transformation. The output of the final layer is given by , where are independent policies for the robots. Similar to standard reinforcement learning, we execute these policies in the environment, collect a centralized reward and use policy gradients [13] to update the weights of policy network. We call this algorithm Graph Policy Gradients (GPG) (See Fig. 1).
One possible concern with our proposed algorithm is that, when working with onpolicy methods for a large number of robots, it is highly likely that, due to exploration/entropy the graph might change during training. Such a setting can result in an explosion in the number of possible graphs that one needs to learn over as the number of agents increases. We show that in light of this, our choice of GCNs for policy parametrization is well motivated because it can be shown that graph convolutions are permutation equivariant, i.e if one were to reorder the node ordering of the graph and the corresponding graph signal, then the output of the graph convolution does not change [14]. This is important because it reduces the dimensionality of the problem and helps training converge faster when training with many robots.
To demonstrate the efficacy of our proposed algorithm, we perform experiments on simulated formation flying with robots. Designing controllers and trajectories for formation flying is an important problem in multirobot literature [15, 16] and is in fact a good test bed for other multirobot control problems [17]. In our experiments, it is shown that GPG is able to converge as the number of robots are increased in comparison to state of the art onpolicy reinforcement learning algorithms that employ fully connected networks to parametrize the policy. We also show that our graph filters are able to learn valid local features by training them on a small number of robots and transferring the behavior to a very large number of robots. For example, we train a GCN for three robots to maintain formation, avoid collisions and follow their trajectories. Then, we initialize a swarm of many robots and use this same graph filter over the entire graph to generate desired behavior. We show that the desired formation flying behavior is achieved without updating the weights for the larger swarm thus achieving zeroshot transfer. Lastly, the ability of GPG to adapt to more complex dynamics and control is demonstrated.
2 Methodology
2.1 Preliminaries
We pose learning the controller for a large number of robots as a policy learning problem in a collaborative Markov team [18]. The team is composed of robots generically indexed by which at any given point in time occupy a position in configuration space and must choose an action in action space. The team and environment are assumed to be Markovian. Thus, if one were to collect the robot configurations in the vector and actions in the vector
then, the evolution of the system is completely determined by the conditional transition probability
. The transition dynamics are also assumed to be the same for all agents, so that if we swap two of them in configuration and action space we expect to see the same statistical evolution. This is a natural consequence when, the robotic swarm is assumed to be composed of homogeneous robots. In this work, there exists no external communication between robots. Instead, we define a graph that encodes information about each robots neighbors. Robot can directly communicate only with its one hop neighbors. Communication with neighbors that are further away is possible only indirectly. Each robot has access to its own states and. In this paper, we do not design which nodes each robot should talk to or what each robot must communicate. Each robot must learn a policy (probability distribution) from which the robot samples actions;
. Let . As robots operate in the environment, they collect a centralized global reward . Collapsing everything, we are interested in computing such that the expected sum of rewards over some time horizon is maximized:(1) 
where are the parameters of . At a high level, the global reward encodes desired behavior for the swarm. As the number of robots increase, the dimensionality of and increases too making the problem of learning nontrivial. Our proposition here to instead learn only compounds the difficulty of the problem as the size of the graph grows exponentially as the number of robots increase ( for robots). In the next section, we discuss using a graph convolutional network as a parametrization for to overcome this problem by extracting local information from the graph structure.
2.2 Graph Convolutional Networks
Consider a graph described by a set of nodes denoted , and a set of edges denoted . This graph is considered as the support for a data signal where the value is assigned to node . The relation between and is given by a a matrix called the graph shift operator. The elements of given as respect the sparsity of the graph, i.e , . Valid examples for are the adjacency matrix, the graph laplacian, and the random walk matrix. In this paper, we consider the normalized graph Laplacian similar to [10]:
(2) 
A key property of is that it is assumed symmetric, with decomposition where
is the eigenvector matrix and
is the eigenvalue matrix of
. defines a map between graph signals that represents local exchange of information between a node and its onehop neighbors. More concretely, if the set of neighbors of node is given by then :(3) 
Eq. 3 performs a simple aggregation of data at node from its neighbors that are onehop away. The aggregation of data at all nodes in the graph is denoted . By repeating this operation, one can access information from nodes located further away. For example, aggregates information from its hop neighbors (see Fig 2). Now one can define the spectral localized graph convolution as :
(4) 
where is a linear shift invariant graph filter [19] with coefficients . Similar to CNNs the output of the GCN is fed into a pointwise nonlinear function. Thus, the final form of the graph convolution is given as:
(5) 
where is a pointwise nonlinearity. A visualization of a GCN can be seen in Fig. 2.
2.3 Permutation Equivariance of Graph Convolutional Networks
To control robots, we propose defining a graph where each robot is a node and robots within distance of each other are connected by an edge. The robots are all initialized with random policies and by exploring different actions, they learn what policies best optimize the global reward. Such exploration can change the ordering of the configuration . This can lead to an explosion in the possible number of graphs that our policies have to learn over. However, [14] proves a key property for graph convolutional filters. Given a set of permutation matrices :
(6) 
where the operation permutes the elements of the vector then, it can be shown that :
Theorem 1:
Let graph be defined with a graph shift operator . Further, define to be the permuted graph with for and any it holds that :
(7) 
Proof. See Appendix
A consequence of Theorem 1 is that the output of the graph convolution does not change under reordering of the graph nodes as long as the topology of the graph stays the same. Intuitively, if the graph exhibits several nodes that have the same graph neighborhoods, then the graph convolution filter can be translated to every other node with the same neighborhood. When learning control for a large number of robots, this property helps in reducing dimensionality of the problem.
2.4 Formation Flying
We choose formation flying as a test bed for controlling a large number of robots. In this work, the robots are optimized to produce desired behavior in terms of trajectory following. This behavior can easily be modified from formation flying to other multirobot objectives such as flocking [20], information gathering [21], collaborative mapping [22] and multirobot coverage [23].
Consider a two dimensional Euclidean space with homogeneous point mass robots indexed by . In later experiments, we use the robot models defined in the AirSim simulator [24] for our experiments. Each robot has a desired goal position in the euclidean space. Collectively for all robots, all goal locations are denoted . At time , the robot’s position in the plane is given by . The state representation for robot that is used by our learning architecture consists only of its own relative position to the goal, i.e . We choose relative positions to goals as a representation of the robots, in order to maintain permutation invariance. It is important to note here that robot cannot arbitrarily communicate with any other robot in the swarm. Instead we define a graph that encodes pairwise relationships between robots. Robot can directly communicate only with its one hop neighbors. Communication with neighbors that are further away is possible only indirectly. For each robot, given an action , the state of the robot evolves according to some stationary dynamics distribution with conditional density . In this work, we work with continuous control only since it poses a much harder problem and is also more realistic when working with robots. Two conditions are encoded for robots to maintain formation. These are collision avoidance, and waypoint reaching for robots. The necessary and sufficient condition to ensure collision avoidance between robots is given as :
(8) 
where is the euclidean distance and is a userdefined minimum distance between robots. Let us also define an assignment matrix as :
(9) 
where is some threshold region of acceptance. The necessary and sufficient condition for all robots to be cover their assigned goals at some time is then:
(10) 
where I
is the identity matrix. With these definitions in place, we now define the problem statement considered in this paper:
Problem 1. Given an initial set of robot configurations , a graph defining relationships between robots and some goals g, compute a set of policies such that executing actions results in a sequence of states that satisfy Eq.8 and at time , satisfy the assignment constraint in Eq.10.
2.5 Graph Policy Gradients
We proposing solving the statement in Problem 1 by exploiting the underlying graph structure. Given an initial swarm of robots, a graph can be defined by setting each robot to be a node. Next, an edge is added between nodes, if:
(11) 
where is some user defined threshold to connect two robots. It is assumed that at time , the position of robot (given as ) and as a consequence the relative position of robot (given as ) to its own goal is known precisely. The relative positions of all the robots at time are collected into the vector . The graph defined earlier, acts as the support for . We do not evolve the graph over time since interchanging homogeneous nodes results in an equivalent graph convolution as discussed in Section 2.3. To ensure that the topology stays constant, the number of neighbors for each node is kept fixed, i.e robot is connected to its say three or four nearest neighbors. We also experiment with a time varying graph, in our simulations and it is observed that for a small number of robots, policies using computed time varying graphs converge slower than fixed graphs (see Fig. 4). However, as the number of robots is increased, policies computed using time varying graphs do not converge whereas policies trained using a fixed graph still converge to desired behavior. One possible reason for this is that when the number of robots is small, the number of possible graphs is also small and it is easier to learn over this small number of graphs. As the number of robots increase, the number of possible graphs increase exponentially leading to a very large number of graphs that the robot needs to learn over.
To compute the policies, a GCN architecture with layers is initialized. At each layer according to Eq.5, the output is given as:
(12) 
where is a pointwise nonlinear function, and
. In practice, the final layer outputs are parameters of Gaussian distributions from which actions are sampled. Intuitively at every node, the GCN architecture aggregates information and uses this information to compute policies. In order to satisfy the constraints given in Eq.
8 and Eq.10, we formulate a centralized reward structure of the form:(13) 
Each robot receives the same reward an attempts to learn a policy that best optimizes this reward. It is assumed that the policies for the robots are independent. Thus, the overall loss function for all the robots as given in Eq.
1(14) 
where now represents the filter weights of the GCN for . Consider a trajectory . Since the reward along a trajectory is the same for all robots and all robot policies are assumed independent, using direct differentiation the policy gradient is given as :
(15) 
For the full algorithm, please see the appendix. The weights
, are then updated using any variant of stochastic gradient descent. We call this algorithm Graph Policy Gradients (GPG). In the next section we demonstrate how GPG is able to learn meaningful policies even as the number of robots increase and is also able to transfer filters learned from a small number of robots to a larger number of robots.
3 Experiments
To investigate the performance of GPG, we design experiments to answer three key questions :

Can GPG learn policies that achieve desired behavior as the number of robots increase?

How well do the graph filters learned by GPG transfer to a large number of robots?

Does GPG work with more complex dynamics and controls?
3.1 Training for Point Mass Formation Flying with GPG
We first look to establish if GPG can train for formation flying for simple point mass dynamics as the number of robots are increased. The state of every robot is its absolute position to its goal and the action at time for robot is the change in and position in the plane. Such a setting necessitates communication between robots. One could argue that the robots simply have to learn to take actions such that tends to zero. However, consider the example in Fig 3. When the goals are reasonably far away and each robot can take actions that could cause it to collide with its neighbors. Each robot must communicate with its neighbors who in turn must communicate with theirs in order to achieve collision free trajectories that take robots to their assigned goals.
To establish relevant baselines, we choose vanilla policy gradients (VPG) that uses the same policy gradient hyperparameters as GPG but differs from GPG in that it uses a 2 layer fully connected network for policy paramterization. We also compare with Proximal Policy Optimization (PPO)
[25] a state of the art on policy method for learning continuous policies. In this paper, we employ a batched version of PPO for faster computation [26]. The PPO baseline also employs fully connected networks. Another choice of baseline used in this paper PPO with a recurrent network architecture since the problem on hand requires communication between agents and shared memory is one possible mechanism. The training curves for our experiments with 3,5 and 10 robots can be seen in Fig 5. We observe right away that GPG is able to converge in all three cases. VPG does not produce any meaningful behavior and both variants of PPO only produce partial results.In the experiments above, we maintain a static graph while training in light of the permutation equivariance property of GCNs. While static graphs are used during training to speed up the training process, during testing, the graphs are dynamic since the graph convolution yields the same result. We also experiment with a dynamic graph that evolves as the robots move in space. As hypothesized and guided by the intuition of Theorem 1, using dynamic graphs increases sample complexity of the problem whereas the static graph converges faster. This can be seen in Fig 4.
3.2 Zero Shot Policy Transfer for Formation Flying
It can be observed from Fig 5 that while GPG is able to converge as the number of robots increase, the number of samples required also increases. Training GPG for ten robots alone takes over nine hours on a 12GB NVIDIA 2080Ti GPU whereas we wish to learn policies for hundreds of robots. One possible solution is to realize that the filters trained extract local features and the same filter can be used when working with many more robots. To demonstrate this, we train a filter with a small number of robots (depending on the formation we are interested in) and use the same filter weights without any gradient updates on the larger swarm.
Another point to note is that when training for the smaller swarm, we only train for goals that are a small distance away to reduce the number of samples (it is easier to discover goals that are close by than it is to discover goals much further away) required for training. However, during execution these policies are able to converge even when the goals are hundred units away. A snapshot of some of the formations we can produce using this zero shot transfer can be seen in Fig. 6. In Fig. 6 (left) and (center), policies are trained on three robots and utilize only 1hop neighbor information. The figure of eight formation necessitates more communication especially when robots at the edges cross over to form the figure of eight. In Fig. 6 (right) the filters are learned by training with five robots with each robot utilizing 3 hop neighbor information.
With these results, it can be concluded that GPG is capable of learning complex behaviors for a small number of robots. These behaviors can then be transferred to larger more complex swarms. To the best of our knowledge there exists no other method to learn continuous control policies that can achieve similar results for so many robots.
3.3 Complex dynamics and control
In the experiments considered above, actions for robot are simply the change in position in the plane. Further, it is assumed that full control of robots can be achieved and at every step robots have zero velocities. These assumptions are infeasible in the real world. Thus, we also conduct experiments to test the efficacy of GPG on more realistic dynamics and control.
To simulate this, we use the AirSim simulator introduced in [24]. Robots are now assumed to have finite mass and inertia and obey single integrator dynamics. The state for each robot used by GPG is now given as where is the velocity at time for robot . Thus, the state for each robot is not just its relative position but also includes its current velocity. The actions now represent change in velocity in the plane. Similar to the point mass experiments, policies are first learned for a small number of robots in AirSim. In addition to the non simplistic dynamics, the difficulty of training policies in AirSim is further compounded by the fact that AirSim is a real time simulator and produces an order of magnitued fewer training points as compared to the pointmass simulation over a fixed period of time. Thus, it is even more imperative when working with such a simulator to be able to learn policies for a small number of robots that require fewer samples and then be able to transfer these policies to a larger number of robots for which direct training would otherwise be infeasible. GPG being a model free algorithm faces no additional difficulty in adapting to the single integrator dynamics. A snapshot of results produced using GPG in AirSim can be seen in Fig 4. In this case, the baselines used in Section 3.1 are unable to learn reasonable behaviors within 50K episodes. This can be attributed to the fact that the observations in AirSim are noisier and the baselines most likely need a lot more samples to converge.
4 Related Work
In the recent past, there has been an immense amount of interest in learning policies for a large collection of agents. Literature in multiagent RL has argued for a centralized training decentralized execution scheme [27, 28] where each agent has its own policy network but during train time, it has access to an additional critic network that coordinates information among all agents. Naturally, such a scheme is not scalable as the input size of the critic network grows as the number of robots increase in addition to the inherent complexity of training hundreds of separate policies simultaneously. Another line of thinking in this field, has been the idea to approximate the policies of all the agents using metalearning [29, 30]. These works while impressive, still run into the problem of having to execute rollouts for a large number of agents to learn a meaningful metarepresentation. Closest to our work, perhaps is the work of [31] where the authors propose using CNNs on an encoded graph representation of the agents. In contrast to our work, the policies learned are discrete and the convolutions do not fully exploit the local graph structure.
5 Conclusion
The problem of learning policies that map state representation to control commands for a large number of robots is immensely challenging, even if the robots are homogeneous. In this work, we propose using the underlying graph structure as additional information. To exploit information from the graph we propose using GCNs to parametrize the policies. Using GCNs we are able to learn local filters defined on the graph that extract local information. The additional benefit of using these local feature extractors is that they can be used to alleviate the problem of needing a large number of rollouts as the number of robots increase as the number of robots increase as demonstrated by the experiments in Section 3.2. It might even be possible to add other sensor representations such as camera images or lidar instead of requiring exact positions of the robots in space. These could be first processed using standard architectures such as CNNs and then be fed into GPG to learn policies for a large swarm of robots directly from onboard sensors. We leave this for future work.
References
 Mnih et al. [2013] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

Mnih et al. [2016]
V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver,
and K. Kavukcuoglu.
Asynchronous methods for deep reinforcement learning.
In
International Conference on Machine Learning
, pages 1928–1937, 2016.  Levine et al. [2015] S. Levine, C. Finn, T. Darrell, and P. Abbeel. Endtoend training of deep visuomotor policies. arXiv preprint arXiv:1504.00702, 2015.
 Pathak et al. [2017] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell. Curiositydriven exploration by selfsupervised prediction. arXiv preprint arXiv:1705.05363, 2017.
 Khan et al. [2018] A. Khan, C. Zhang, N. Atanasov, K. Karydis, V. Kumar, and D. D. Lee. Memory augmented control networks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=HyfHgI6aW.
 Gupta et al. [2017] S. Gupta, J. Davidson, S. Levine, R. Sukthankar, and J. Malik. Cognitive mapping and planning for visual navigation. arXiv preprint arXiv:1702.03920, 2017.
 Enright and Wurman [2011] J. Enright and P. R. Wurman. Optimization and coordinated autonomy in mobile fulfillment systems. 2011.
 Knepper et al. [2013] R. A. Knepper, T. Layton, J. Romanishin, and D. Rus. Ikeabot: An autonomous multirobot coordinated furniture assembly system. In Robotics and Automation (ICRA), 2013 IEEE International Conference on, pages 855–862. IEEE, 2013.
 Stephan et al. [2017] J. Stephan, J. Fink, V. Kumar, and A. Ribeiro. Concurrent control of mobility and communication in multirobot systems. IEEE Transactions on Robotics, 33(5):1248–1254, October 2017.
 Kipf and Welling [2016] T. N. Kipf and M. Welling. Semisupervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
 Gama et al. [2018] F. Gama, A. G. Marques, G. Leus, and A. Ribeiro. Convolutional neural network architectures for signals supported on graphs. IEEE Transactions on Signal Processing, 67(4):1034–1049, 2018.
 Wu et al. [2019] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu. A comprehensive survey on graph neural networks. arXiv preprint arXiv:1901.00596, 2019.
 Sutton and Barto [1998] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998.
 Gama et al. [2019] F. Gama, J. Bruna, and A. Ribeiro. Stability properties of graph neural networks. arXiv preprint arXiv:1905.04497, 2019.
 Turpin et al. [2012a] M. Turpin, N. Michael, and V. Kumar. Trajectory design and control for aggressive formation flight with quadrotors. Autonomous Robots, 33(12):143–156, 2012a.
 Turpin et al. [2012b] M. Turpin, N. Michael, and V. Kumar. Decentralized formation control with variable shapes for aerial robots. In 2012 IEEE international conference on robotics and automation, pages 23–30. IEEE, 2012b.
 Kumar and Michael [2012] V. Kumar and N. Michael. Opportunities and challenges with autonomous micro aerial vehicles. The International Journal of Robotics Research, 31(11):1279–1291, 2012.
 Littman [1994] M. L. Littman. Markov games as a framework for multiagent reinforcement learning. In Machine Learning Proceedings 1994, pages 157–163. Elsevier, 1994.
 [19] S. Segarra, A. G. Marques, and A. Ribeiro. Optimal graphfilter design and applications to distributed linear network operators. IEEE Transactions on Signal Processing, 65(15):4117–4131.
 Zavlanos et al. [2007] M. M. Zavlanos, A. Jadbabaie, and G. J. Pappas. Flocking while preserving network connectivity. In 2007 46th IEEE Conference on Decision and Control, pages 2919–2924. IEEE, 2007.
 Schlotfeldt et al. [2018] B. Schlotfeldt, D. Thakur, N. Atanasov, V. Kumar, and G. J. Pappas. Anytime planning for decentralized multirobot active information gathering. IEEE Robotics and Automation Letters, 3(2):1025–1032, 2018.
 Ko et al. [2003] J. Ko, B. Stewart, D. Fox, K. Konolige, and B. Limketkai. A practical, decisiontheoretic approach to multirobot mapping and exploration. In Proceedings 2003 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2003)(Cat. No. 03CH37453), volume 4, pages 3232–3238. IEEE, 2003.

Choset [2001]
H. Choset.
Coverage for robotics–a survey of recent results.
Annals of mathematics and artificial intelligence
, 31(14):113–126, 2001.  Shah et al. [2018] S. Shah, D. Dey, C. Lovett, and A. Kapoor. Airsim: Highfidelity visual and physical simulation for autonomous vehicles. In Field and service robotics, pages 621–635. Springer, 2018.
 Schulman et al. [2017] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
 Hafner et al. [2017] D. Hafner, J. Davidson, and V. Vanhoucke. Tensorflow agents: Efficient batched reinforcement learning in tensorflow. arXiv preprint arXiv:1709.02878, 2017.
 Lowe et al. [2017] R. Lowe, Y. Wu, A. Tamar, J. Harb, O. P. Abbeel, and I. Mordatch. Multiagent actorcritic for mixed cooperativecompetitive environments. In Advances in Neural Information Processing Systems, pages 6379–6390, 2017.
 Foerster et al. [2018] J. N. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson. Counterfactual multiagent policy gradients. In ThirtySecond AAAI Conference on Artificial Intelligence, 2018.
 Parisotto et al. [2019] E. Parisotto, S. Ghosh, S. B. Yalamanchi, V. Chinnaobireddy, Y. Wu, and R. Salakhutdinov. Concurrent meta reinforcement learning. arXiv preprint arXiv:1903.02710, 2019.
 Khan et al. [2018] A. Khan, C. Zhang, D. D. Lee, V. Kumar, and A. Ribeiro. Scalable centralized deep multiagent reinforcement learning via policy gradients. arXiv preprint arXiv:1805.08776, 2018.
 Jiang et al. [2018] J. Jiang, C. Dun, and Z. Lu. Graph convolutional reinforcement learning for multiagent cooperation. arXiv preprint arXiv:1810.09202, 2018.
Appendix A Permutation Equivariance
The proof for Theorem 1 was first given in [14]. We reiterate it here due to its importance to this body of work.
Given a set of permutation matrices :
(16) 
where the operation permutes the elements of the vector then, it can be shown that :
Theorem 1 Let graph be defined with a graph shift operator . Further, define to be the permuted graph with for and any it holds that :
(17) 
Proof for Theorem 1
Given that is a permutation matrix. This implies
is also an orthogonal matrix. This implies
. Thus,(18) 
Then,
(19) 
Finally, using
(20) 
Hence, proved.
Appendix B Experiments in AirSim
For the experiments in AirSim, we assume the robots to obey single integrator dynamics.
Robot ’s position in the plane at time is given as and its velocity at time is given as . The action chosen by the robot gives the change in velocities. The state of robot then evolves according to :
(21) 
where is sampling time. To speed up training, the velocities are clipped in a range of m/s. The robots are initialized along a straight line with a fixed distance between them. An external controller is used to arm the robots and for takeoff. Once all the robots are at a fixed height, control is handed over to GPG. Further, when the robots reach their goals, they do not enter hover mode. GPG outputs small velocities for the robots and as a result when robots are close to their goals, they tend to oscillate around the goal point.
Appendix C Hyperparameters
For the point mass experiments, when learning onehop neighbor information, we use a 2 layer GCN with a tanh nonlinearity. The input layer consists of 16 hidden units and is fed the state representation . The output of this layer is then fed into a layer and a layer both of which have 16 hidden units. Actions for the robots are sampled using these parameters. We use the Deep Graph Library (DGL) to represent our GCNs. The learning rate is set to e
and uses the Adam optimizer. For the VPG baseline, the learning rate and the optimizer remain the same. The network is parametrized by fully connected layers where the input to the first layer is the state representation and consists of 64 hidden units followed by a ReLU non linearity. The output of this first layer is then fed into separate
andlayers each consisting of 64 hidden units and ReLU activation functions. We invite the reader to take a look at the code attached with this paper for more details.
Appendix D Graph Policy Gradients
Algorithm 1 summarizes the full graph policy gradients algorithm.
Initialize robots and policies parametrized by a GCN whose weights are given by and graph with graph shift operator