Graph Policy Gradients for Large Scale Robot Control

by   Arbaaz Khan, et al.

In this paper, we consider the problem of learning policies to control a large number of homogeneous robots. To this end, we propose a new algorithm we call Graph Policy Gradients (GPG) that exploits the underlying graph symmetry among the robots. The curse of dimensionality one encounters when working with a large number of robots is mitigated by employing a graph convolutional neural (GCN) network to parametrize policies for the robots. The GCN reduces the dimensionality of the problem by learning filters that aggregate information among robots locally, similar to how a convolutional neural network is able to learn local features in an image. Through experiments on formation flying, we show that our proposed method is able to scale better than existing reinforcement methods that employ fully connected networks. More importantly, we show that by using our locally learned filters we are able to zero-shot transfer policies trained on just three robots to over hundred robots.


page 8

page 12


Graph Policy Gradients for Large Scale Unlabeled Motion Planning with Constraints

In this paper, we present a learning method to solve the unlabelled moti...

Large Scale Distributed Collaborative Unlabeled Motion Planning with Graph Policy Gradients

In this paper, we present a learning method to solve the unlabelled moti...

Zero-shot generalization using cascaded system-representations

This paper proposes a new framework named CASNET to learn control polici...

Zero-shot Recognition via Semantic Embeddings and Knowledge Graphs

We consider the problem of zero-shot recognition: learning a visual clas...

PipeGCN: Efficient Full-Graph Training of Graph Convolutional Networks with Pipelined Feature Communication

Graph Convolutional Networks (GCNs) is the state-of-the-art method for l...

RMP2: A Structured Composable Policy Class for Robot Learning

We consider the problem of learning motion policies for acceleration-bas...

Distributed Map Classification using Local Observations

We consider the problem of classifying a map using a team of communicati...

1 Introduction

In the recent past, deep learning has proved to be an extremely valuable tool for robotics. Harnessing the power of deep neural networks has emerged as a successful approach to designing policies that map sensor inputs to control outputs for complex tasks. These include, but are not limited to, learning to play video games 

[1, 2], learning complex control policies for robot tasks [3] and learning to plan with only sensor information [4, 5, 6]. While these works are impressive, they often concern themselves with controlling a single entity. However, in many real world applications, especially in fields like robotics there exists a need to control multiple robots interact with each other in co-operative or competitive settings. Examples include warehouse management with teams of robots [7], multi-robot furniture assembly [8], and concurrent control and communication for teams of robots [9]. In such a scenario, as the number of robots increases, the dimensionality of the input space and the control space both increase making it much harder to learn meaningful policies.

This paper looks to tackle the problem of learning individual control policies by exploiting the underlying graph structure among the robots. We start with the hypothesis that the difficulty of learning scalable policies for multiple robots can be attributed to two key issues: dimensionality and partial information. Consider an environment with

robots (This work uses bold font when talking about a collection of items, vectors and matrices). Each robot receives partial observations of the environment. In order for a robot to learn a meaningful control policy, it must interact with some subset of all agents,

. Finding the right subset of neighbors to learn from is in itself a challenging problem. Further, in order to ensure that the method scales, as the number of robots increase, one needs to ensure that the cardinality of the subset of neighbors , remains fixed or grows very slowly.

Figure 1: Graph Policy Gradients. Robots are randomly initialized and, based on some user set thresholds, a graph is defined. Information from K-hop neighbors is aggregated at each node by learning local filters. These local features are then used to learn policies to produce desired behavior.

To solve these problems for large scale multi-robot control, we draw inspiration from convolutional neural networks (CNNs). CNNs consist of sequentially composed layers where each layer comprises of banks of linear time invariant filters to extract local features along with pooling operations to reduce the dimensionality. Convolutions regularize the linear transform to exploit the underlying structure of the data and hence constrain the search space for the linear transform that minimizes the cost function. However, CNNs cannot be directly applied to irregular data elements that have arbitrary pairwise relationships defined by an underlying graph. To overcome this limitation, there has been the advent of a new architecture called graph convolutional networks (GCNs)  

[10, 11, 12]. Similar to CNNs, GCNs consist of sequentially composed layers that regularize the linear transform in each layer to be a graph convolution with a bank of graph filters and the weights of the filter are learned by minimizing some cost function.

When controlling a large swarm of robots operating in a Euclidean space , the underlying graph can be defined as where is the set of nodes and is the set of edges. For example, we define each robot to be a node and add an edge between robots if they are are less than distance apart. This graph acts as a support for the data vector where is the state representation of robot . Now, our GCN exploits this graph and at each node aggregates information from its neighbors (See Section 2.2). This information is propagated forward to the next layer through a non linear transformation. The output of the final layer is given by , where are independent policies for the robots. Similar to standard reinforcement learning, we execute these policies in the environment, collect a centralized reward and use policy gradients  [13] to update the weights of policy network. We call this algorithm Graph Policy Gradients (GPG) (See Fig. 1).

One possible concern with our proposed algorithm is that, when working with on-policy methods for a large number of robots, it is highly likely that, due to exploration/entropy the graph might change during training. Such a setting can result in an explosion in the number of possible graphs that one needs to learn over as the number of agents increases. We show that in light of this, our choice of GCNs for policy parametrization is well motivated because it can be shown that graph convolutions are permutation equivariant, i.e if one were to reorder the node ordering of the graph and the corresponding graph signal, then the output of the graph convolution does not change  [14]. This is important because it reduces the dimensionality of the problem and helps training converge faster when training with many robots.

To demonstrate the efficacy of our proposed algorithm, we perform experiments on simulated formation flying with robots. Designing controllers and trajectories for formation flying is an important problem in multi-robot literature  [15, 16] and is in fact a good test bed for other multi-robot control problems [17]. In our experiments, it is shown that GPG is able to converge as the number of robots are increased in comparison to state of the art on-policy reinforcement learning algorithms that employ fully connected networks to parametrize the policy. We also show that our graph filters are able to learn valid local features by training them on a small number of robots and transferring the behavior to a very large number of robots. For example, we train a GCN for three robots to maintain formation, avoid collisions and follow their trajectories. Then, we initialize a swarm of many robots and use this same graph filter over the entire graph to generate desired behavior. We show that the desired formation flying behavior is achieved without updating the weights for the larger swarm thus achieving zero-shot transfer. Lastly, the ability of GPG to adapt to more complex dynamics and control is demonstrated.

2 Methodology

2.1 Preliminaries

We pose learning the controller for a large number of robots as a policy learning problem in a collaborative Markov team [18]. The team is composed of robots generically indexed by which at any given point in time occupy a position in configuration space and must choose an action in action space. The team and environment are assumed to be Markovian. Thus, if one were to collect the robot configurations in the vector and actions in the vector

then, the evolution of the system is completely determined by the conditional transition probability

. The transition dynamics are also assumed to be the same for all agents, so that if we swap two of them in configuration and action space we expect to see the same statistical evolution. This is a natural consequence when, the robotic swarm is assumed to be composed of homogeneous robots. In this work, there exists no external communication between robots. Instead, we define a graph that encodes information about each robots neighbors. Robot can directly communicate only with its one hop neighbors. Communication with neighbors that are further away is possible only indirectly. Each robot has access to its own states and

. In this paper, we do not design which nodes each robot should talk to or what each robot must communicate. Each robot must learn a policy (probability distribution) from which the robot samples actions;

. Let . As robots operate in the environment, they collect a centralized global reward . Collapsing everything, we are interested in computing such that the expected sum of rewards over some time horizon is maximized:


where are the parameters of . At a high level, the global reward encodes desired behavior for the swarm. As the number of robots increase, the dimensionality of and increases too making the problem of learning non-trivial. Our proposition here to instead learn only compounds the difficulty of the problem as the size of the graph grows exponentially as the number of robots increase ( for robots). In the next section, we discuss using a graph convolutional network as a parametrization for to overcome this problem by extracting local information from the graph structure.

2.2 Graph Convolutional Networks

Consider a graph described by a set of nodes denoted , and a set of edges denoted . This graph is considered as the support for a data signal where the value is assigned to node . The relation between and is given by a a matrix called the graph shift operator. The elements of given as respect the sparsity of the graph, i.e , . Valid examples for are the adjacency matrix, the graph laplacian, and the random walk matrix. In this paper, we consider the normalized graph Laplacian similar to  [10]:


A key property of is that it is assumed symmetric, with decomposition where

is the eigenvector matrix and

is the eigenvalue matrix of

. defines a map between graph signals that represents local exchange of information between a node and its one-hop neighbors. More concretely, if the set of neighbors of node is given by then :


Eq. 3 performs a simple aggregation of data at node from its neighbors that are one-hop away. The aggregation of data at all nodes in the graph is denoted . By repeating this operation, one can access information from nodes located further away. For example, aggregates information from its -hop neighbors (see Fig 2). Now one can define the spectral -localized graph convolution as :


where is a linear shift invariant graph filter  [19] with coefficients . Similar to CNNs the output of the GCN is fed into a pointwise non-linear function. Thus, the final form of the graph convolution is given as:


where is a pointwise non-linearity. A visualization of a GCN can be seen in Fig. 2.

Figure 2: Graph Convolutional Networks. GCNs aggregate information between nodes and their neighbors. For each -hop neighborhood (illustrated by the increasing disks), record (Eq. 3) to build which exhibits a regular structure (Eq. 5). a) The value at each node when initialized and at the b) one-hop neighborhood. c) two-hop neighborhood. d) three-hop neighborhood.

2.3 Permutation Equivariance of Graph Convolutional Networks

To control robots, we propose defining a graph where each robot is a node and robots within distance of each other are connected by an edge. The robots are all initialized with random policies and by exploring different actions, they learn what policies best optimize the global reward. Such exploration can change the ordering of the configuration . This can lead to an explosion in the possible number of graphs that our policies have to learn over. However,  [14] proves a key property for graph convolutional filters. Given a set of permutation matrices :


where the operation permutes the elements of the vector then, it can be shown that :

Theorem 1:

Let graph be defined with a graph shift operator . Further, define to be the permuted graph with for and any it holds that :


Proof. See Appendix

A consequence of Theorem 1 is that the output of the graph convolution does not change under reordering of the graph nodes as long as the topology of the graph stays the same. Intuitively, if the graph exhibits several nodes that have the same graph neighborhoods, then the graph convolution filter can be translated to every other node with the same neighborhood. When learning control for a large number of robots, this property helps in reducing dimensionality of the problem.

2.4 Formation Flying

We choose formation flying as a test bed for controlling a large number of robots. In this work, the robots are optimized to produce desired behavior in terms of trajectory following. This behavior can easily be modified from formation flying to other multi-robot objectives such as flocking [20], information gathering [21], collaborative mapping [22] and multi-robot coverage [23].

Consider a two dimensional Euclidean space with homogeneous point mass robots indexed by . In later experiments, we use the robot models defined in the AirSim simulator  [24] for our experiments. Each robot has a desired goal position in the euclidean space. Collectively for all robots, all goal locations are denoted . At time , the robot’s position in the plane is given by . The state representation for robot that is used by our learning architecture consists only of its own relative position to the goal, i.e . We choose relative positions to goals as a representation of the robots, in order to maintain permutation invariance. It is important to note here that robot cannot arbitrarily communicate with any other robot in the swarm. Instead we define a graph that encodes pairwise relationships between robots. Robot can directly communicate only with its one hop neighbors. Communication with neighbors that are further away is possible only indirectly. For each robot, given an action , the state of the robot evolves according to some stationary dynamics distribution with conditional density . In this work, we work with continuous control only since it poses a much harder problem and is also more realistic when working with robots. Two conditions are encoded for robots to maintain formation. These are collision avoidance, and waypoint reaching for robots. The necessary and sufficient condition to ensure collision avoidance between robots is given as :


where is the euclidean distance and is a user-defined minimum distance between robots. Let us also define an assignment matrix as :


where is some threshold region of acceptance. The necessary and sufficient condition for all robots to be cover their assigned goals at some time is then:


where I

is the identity matrix. With these definitions in place, we now define the problem statement considered in this paper:

Problem 1. Given an initial set of robot configurations , a graph defining relationships between robots and some goals g, compute a set of policies such that executing actions results in a sequence of states that satisfy Eq.8 and at time , satisfy the assignment constraint in Eq.10.

2.5 Graph Policy Gradients

We proposing solving the statement in Problem 1 by exploiting the underlying graph structure. Given an initial swarm of robots, a graph can be defined by setting each robot to be a node. Next, an edge is added between nodes, if:


where is some user defined threshold to connect two robots. It is assumed that at time , the position of robot (given as ) and as a consequence the relative position of robot (given as ) to its own goal is known precisely. The relative positions of all the robots at time are collected into the vector . The graph defined earlier, acts as the support for . We do not evolve the graph over time since interchanging homogeneous nodes results in an equivalent graph convolution as discussed in Section 2.3. To ensure that the topology stays constant, the number of neighbors for each node is kept fixed, i.e robot is connected to its say three or four nearest neighbors. We also experiment with a time varying graph, in our simulations and it is observed that for a small number of robots, policies using computed time varying graphs converge slower than fixed graphs (see Fig. 4). However, as the number of robots is increased, policies computed using time varying graphs do not converge whereas policies trained using a fixed graph still converge to desired behavior. One possible reason for this is that when the number of robots is small, the number of possible graphs is also small and it is easier to learn over this small number of graphs. As the number of robots increase, the number of possible graphs increase exponentially leading to a very large number of graphs that the robot needs to learn over.

To compute the policies, a GCN architecture with layers is initialized. At each layer according to Eq.5, the output is given as:


where is a pointwise non-linear function, and

. In practice, the final layer outputs are parameters of Gaussian distributions from which actions are sampled. Intuitively at every node, the GCN architecture aggregates information and uses this information to compute policies. In order to satisfy the constraints given in Eq.

8 and Eq.10, we formulate a centralized reward structure of the form:


Each robot receives the same reward an attempts to learn a policy that best optimizes this reward. It is assumed that the policies for the robots are independent. Thus, the overall loss function for all the robots as given in Eq.



where now represents the filter weights of the GCN for . Consider a trajectory . Since the reward along a trajectory is the same for all robots and all robot policies are assumed independent, using direct differentiation the policy gradient is given as :


For the full algorithm, please see the appendix. The weights

, are then updated using any variant of stochastic gradient descent. We call this algorithm Graph Policy Gradients (GPG). In the next section we demonstrate how GPG is able to learn meaningful policies even as the number of robots increase and is also able to transfer filters learned from a small number of robots to a larger number of robots.

3 Experiments

To investigate the performance of GPG, we design experiments to answer three key questions :

  • Can GPG learn policies that achieve desired behavior as the number of robots increase?

  • How well do the graph filters learned by GPG transfer to a large number of robots?

  • Does GPG work with more complex dynamics and controls?

3.1 Training for Point Mass Formation Flying with GPG

Figure 3: Formation Flying.

We first look to establish if GPG can train for formation flying for simple point mass dynamics as the number of robots are increased. The state of every robot is its absolute position to its goal and the action at time for robot is the change in and position in the plane. Such a setting necessitates communication between robots. One could argue that the robots simply have to learn to take actions such that tends to zero. However, consider the example in Fig 3. When the goals are reasonably far away and each robot can take actions that could cause it to collide with its neighbors. Each robot must communicate with its neighbors who in turn must communicate with theirs in order to achieve collision free trajectories that take robots to their assigned goals.

Figure 4: 10 robot Formation Flying. Using a static graph during training increases the sample efficiency of GPG. A dynamic graph, i.e a graph that evolves over time as the robots move in space takes longer to converge.

To establish relevant baselines, we choose vanilla policy gradients (VPG) that uses the same policy gradient hyperparameters as GPG but differs from GPG in that it uses a 2 layer fully connected network for policy paramterization. We also compare with Proximal Policy Optimization (PPO)  

[25] a state of the art on policy method for learning continuous policies. In this paper, we employ a batched version of PPO for faster computation [26]. The PPO baseline also employs fully connected networks. Another choice of baseline used in this paper PPO with a recurrent network architecture since the problem on hand requires communication between agents and shared memory is one possible mechanism. The training curves for our experiments with 3,5 and 10 robots can be seen in Fig 5. We observe right away that GPG is able to converge in all three cases. VPG does not produce any meaningful behavior and both variants of PPO only produce partial results.

Figure 5: Training for Formation Flight. Point mass robots are trained for formation flight. The reward is a centralized reward. Each curve is produced by running three independent runs of the algorithm. Darker line represents mean and shaded area represents mean standard deviation of mean.

In the experiments above, we maintain a static graph while training in light of the permutation equivariance property of GCNs. While static graphs are used during training to speed up the training process, during testing, the graphs are dynamic since the graph convolution yields the same result. We also experiment with a dynamic graph that evolves as the robots move in space. As hypothesized and guided by the intuition of Theorem 1, using dynamic graphs increases sample complexity of the problem whereas the static graph converges faster. This can be seen in Fig 4.

3.2 Zero Shot Policy Transfer for Formation Flying

It can be observed from Fig 5 that while GPG is able to converge as the number of robots increase, the number of samples required also increases. Training GPG for ten robots alone takes over nine hours on a 12GB NVIDIA 2080Ti GPU whereas we wish to learn policies for hundreds of robots. One possible solution is to realize that the filters trained extract local features and the same filter can be used when working with many more robots. To demonstrate this, we train a filter with a small number of robots (depending on the formation we are interested in) and use the same filter weights without any gradient updates on the larger swarm.

Figure 6: Zero Shot Transfer to Large Number of Robots. (Left) Policies are trained for three robots to reach goals that are a small distance away. Robots are randomly initialized in a rectangular region and must reach goals much further than those in the training set. (Center) Robots are initialized on a circle and must execute policies such that the resulting shape is an arrowhead. (Right) Policies trained on swarms of five robots are transferred over to form a figure of eight. The choice of 101/51 robots is arbitrary and is not a limiting threshold.

Another point to note is that when training for the smaller swarm, we only train for goals that are a small distance away to reduce the number of samples (it is easier to discover goals that are close by than it is to discover goals much further away) required for training. However, during execution these policies are able to converge even when the goals are hundred units away. A snapshot of some of the formations we can produce using this zero shot transfer can be seen in Fig. 6. In Fig. 6 (left) and (center), policies are trained on three robots and utilize only 1-hop neighbor information. The figure of eight formation necessitates more communication especially when robots at the edges cross over to form the figure of eight. In Fig. 6 (right) the filters are learned by training with five robots with each robot utilizing 3 hop neighbor information.

With these results, it can be concluded that GPG is capable of learning complex behaviors for a small number of robots. These behaviors can then be transferred to larger more complex swarms. To the best of our knowledge there exists no other method to learn continuous control policies that can achieve similar results for so many robots.

3.3 Complex dynamics and control

In the experiments considered above, actions for robot are simply the change in position in the plane. Further, it is assumed that full control of robots can be achieved and at every step robots have zero velocities. These assumptions are infeasible in the real world. Thus, we also conduct experiments to test the efficacy of GPG on more realistic dynamics and control.

Figure 7: 11 robot Arrow Head Formation. The robots are spawned at ground and are manually controlled for takeoff. Once the robots are at a desired height, control is handed over to GPG. (see Appendix for detailed figures)

To simulate this, we use the AirSim simulator introduced in  [24]. Robots are now assumed to have finite mass and inertia and obey single integrator dynamics. The state for each robot used by GPG is now given as where is the velocity at time for robot . Thus, the state for each robot is not just its relative position but also includes its current velocity. The actions now represent change in velocity in the plane. Similar to the point mass experiments, policies are first learned for a small number of robots in AirSim. In addition to the non simplistic dynamics, the difficulty of training policies in AirSim is further compounded by the fact that AirSim is a real time simulator and produces an order of magnitued fewer training points as compared to the point-mass simulation over a fixed period of time. Thus, it is even more imperative when working with such a simulator to be able to learn policies for a small number of robots that require fewer samples and then be able to transfer these policies to a larger number of robots for which direct training would otherwise be infeasible. GPG being a model free algorithm faces no additional difficulty in adapting to the single integrator dynamics. A snapshot of results produced using GPG in AirSim can be seen in Fig 4. In this case, the baselines used in Section 3.1 are unable to learn reasonable behaviors within 50K episodes. This can be attributed to the fact that the observations in AirSim are noisier and the baselines most likely need a lot more samples to converge.

4 Related Work

In the recent past, there has been an immense amount of interest in learning policies for a large collection of agents. Literature in multi-agent RL has argued for a centralized training decentralized execution scheme [27, 28] where each agent has its own policy network but during train time, it has access to an additional critic network that co-ordinates information among all agents. Naturally, such a scheme is not scalable as the input size of the critic network grows as the number of robots increase in addition to the inherent complexity of training hundreds of separate policies simultaneously. Another line of thinking in this field, has been the idea to approximate the policies of all the agents using meta-learning  [29, 30]. These works while impressive, still run into the problem of having to execute rollouts for a large number of agents to learn a meaningful meta-representation. Closest to our work, perhaps is the work of  [31] where the authors propose using CNNs on an encoded graph representation of the agents. In contrast to our work, the policies learned are discrete and the convolutions do not fully exploit the local graph structure.

5 Conclusion

The problem of learning policies that map state representation to control commands for a large number of robots is immensely challenging, even if the robots are homogeneous. In this work, we propose using the underlying graph structure as additional information. To exploit information from the graph we propose using GCNs to parametrize the policies. Using GCNs we are able to learn local filters defined on the graph that extract local information. The additional benefit of using these local feature extractors is that they can be used to alleviate the problem of needing a large number of rollouts as the number of robots increase as the number of robots increase as demonstrated by the experiments in Section 3.2. It might even be possible to add other sensor representations such as camera images or lidar instead of requiring exact positions of the robots in space. These could be first processed using standard architectures such as CNNs and then be fed into GPG to learn policies for a large swarm of robots directly from onboard sensors. We leave this for future work.


Appendix A Permutation Equivariance

The proof for Theorem 1 was first given in  [14]. We reiterate it here due to its importance to this body of work.

Given a set of permutation matrices :


where the operation permutes the elements of the vector then, it can be shown that :

Theorem 1 Let graph be defined with a graph shift operator . Further, define to be the permuted graph with for and any it holds that :


Proof for Theorem 1
Given that is a permutation matrix. This implies

is also an orthogonal matrix. This implies

. Thus,




Finally, using


Hence, proved.

Appendix B Experiments in AirSim

For the experiments in AirSim, we assume the robots to obey single integrator dynamics.

Figure 8: Training 3 Robots in AirSim.

Robot ’s position in the plane at time is given as and its velocity at time is given as . The action chosen by the robot gives the change in velocities. The state of robot then evolves according to :


where is sampling time. To speed up training, the velocities are clipped in a range of m/s. The robots are initialized along a straight line with a fixed distance between them. An external controller is used to arm the robots and for takeoff. Once all the robots are at a fixed height, control is handed over to GPG. Further, when the robots reach their goals, they do not enter hover mode. GPG outputs small velocities for the robots and as a result when robots are close to their goals, they tend to oscillate around the goal point.

Figure 9: Arrowhead Formation Flying for 11 robots in AirSim.

Appendix C Hyperparameters

For the point mass experiments, when learning one-hop neighbor information, we use a 2 layer GCN with a tanh nonlinearity. The input layer consists of 16 hidden units and is fed the state representation . The output of this layer is then fed into a layer and a layer both of which have 16 hidden units. Actions for the robots are sampled using these parameters. We use the Deep Graph Library (DGL) to represent our GCNs. The learning rate is set to e

and uses the Adam optimizer. For the VPG baseline, the learning rate and the optimizer remain the same. The network is parametrized by fully connected layers where the input to the first layer is the state representation and consists of 64 hidden units followed by a ReLU non linearity. The output of this first layer is then fed into separate


layers each consisting of 64 hidden units and ReLU activation functions. We invite the reader to take a look at the code attached with this paper for more details.

Appendix D Graph Policy Gradients

Algorithm 1 summarizes the full graph policy gradients algorithm.

Initialize robots and policies parametrized by a GCN whose weights are given by and graph with graph shift operator

1:while True do
2:     for time  do
3:         for robots  do
4:              Record each robots state
5:              Aggregate information at each robot from its neighbors according to Eq. 3
6:         end for
7:         Collect
8:         Compute localized graph convolution as given in Eq.4
9:         Stack graph convolution layers such that and
10:         Sample
11:         Execute and collect reward over all robots =
12:     end for
13:     Record trajectory )].
14:     Compute the graph policy gradients as given in Eq 15 and update weights
15:end while
Algorithm 1 Graph Policy Gradients.