Inferring the underlying relations between components of complex dynamic systems can inform decision making for autonomous agents. One natural system with complex dynamics is crowd navigation (i.e., navigation in the presence of multiple humans). The crowd navigation task is challenging because the agent must predict and plan relative to likely human motions so as to avoid collisions and remain at safe and socially appropriate distances from people. Some prior work predicts human trajectories using hand-crafted social interaction models or by modeling the temporal behavior of humans. Although these methods can estimate human trajectories, they do not use the predictions to inform the navigation policy. Other recent works [10, 12, 8] use deep reinforcement learning (DRL) to learn a socially compliant policy. These policies either ignore human interactions or approximate them with heuristics. They also simplify the human motion prediction problem with unrealistic assumptions such as linear human motion, and typically consider only the current state of humans instead of incorporating predicted human motion to inform the policy.
Graph Neural Networks (GNNs) are among the most powerful tools for modeling objects and their relations (interactions). A variant of GNNs is the Graph Convolutional Network (GCN), where relations between nodes are defined by an adjacency matrix. Whereas in GCNs the relations between all nodes are given, Wang et al. and Grover et al. propose to learn the relations between objects and use the learned attention to compute new features. Inspired by this work on relational reasoning and GNN models, we propose a relational graph learning model for crowd navigation that reasons about the relations between agents (robot and humans) and then uses a GCN to compute interactions between the agents. With the predicted interaction features of both the robot and the humans, our approach jointly plans efficient robot navigation and predicts the motion of present humans. Figure 1 illustrates how interaction reasoning between all agents and explicit prediction can yield a farsighted navigation policy.
Planning and prediction of multi-agent states is fundamental in many problems, including the general case of decision making in an environment with non-cooperative agents. In this paper, we address this problem with a relational graph learning approach for model-based RL, which predicts future human motions and plans for crowd navigation simultaneously. We show that our model outperforms baselines from prior work and carry out ablation studies to measure the planning budget on navigation performance. We also qualitatively demonstrate that our approach mitigates undesirable robot behaviors in challenging scenarios. The code is available at github.com/ChanganVR/RelationalGraphLearning.
II Related Work
Crowd navigation. Mobile robot navigation in crowds is challenging due to the complex ways in which human intent and social interactions determine human motions. Prior work has used rule-based algorithms to characterize the interactions between agents. Helbing et al. proposed to model the interactions as "social forces". RVO and ORCA solve for collision avoidance under reciprocal assumptions. Interacting Gaussian Processes (IGP) [29, 30, 31] model the trajectory of each agent as an individual Gaussian Process and introduce an interaction potential term that couples the individual GPs. Rule-based methods rely heavily on hand-crafted models for navigation. Social-LSTM used an LSTM to predict human trajectories directly from large-scale datasets. However, for crowd navigation, forecasting models do not directly yield an action policy. Recently, Chen et al. [10, 9] proposed to use DRL for crowd navigation by learning a value network that encodes state values.
LM-SARL improved on previous work by learning the robot state representation with attentive pooling over pairwise interaction features. However, this model only partially captures crowd interactions, since it makes significant simplifying assumptions about the underlying state transition without explicitly modeling human motions. Most recently, LeTS-Drive used online belief-tree search to learn value and policy functions for autonomous driving in a crowded space. Although this approach models intentions and interactions between the vehicle and humans, the interaction modeling is coarse-grained: it uses Convolutional Neural Networks (CNNs) to process stacked frames of the environment together with the history and intended path of the vehicle. In contrast, we model pairwise interactions among all agents, which, coupled with our graph learning, explicitly captures relations between agents and models higher-order interactions.
Relational reasoning. Relational reasoning [3, 33] aims to capture relationships between entities such as decision-making agents [40, 1], image pixels, words, or humans and objects [7, 14]. The relations and entities are typically represented as a connected graph [38, 34], and standard tools for graph-based learning, such as Graph Neural Networks (GNNs) [17, 39, 23, 16], are applied. GNNs are a class of neural networks that learn functions on graphs, representing objects as nodes and relations as edges. Knowledge about object dynamics can be encoded in the GNN node update function, and interaction dynamics in the edge update function. Most current GNNs operate on interacting systems with known and fixed graph inputs, such as human skeletons or particles connected by springs [4, 23]. However, many interacting systems have unknown relations, for example humans in sports and in crowds. Thus, it is important to infer relations among the entities and learn based on the inferred relations. In crowd navigation, the temporal dynamics of these relations are useful for planning safe and efficient navigation (e.g., understanding the joint motion of a pair of friends crossing the robot's path while walking close to each other, in contrast to the motion of a pair of strangers walking in opposite directions and passing each other). Inspired by Kipf et al., who estimate the graph connectivity from trajectories using an auto-encoder architecture, our model dynamically infers the crowd-robot relation graph at each time step and learns the state representation for each agent based on this graph.
MPC, MCTS and model-based DRL. Model predictive control (MPC) is a family of control algorithms that leverage models of the system to predict state changes and plan control accordingly. Traditional MPC [37, 25] usually assumes access to a known environment model, which is frequently unrealistic. Monte-Carlo Tree Search (MCTS) has been used for decision-time planning by estimating action values based on many simulated trajectories in complex search problems such as the game of Go . More recent model-based RL methods first acquire a predictive model of the world, and then use that model to make decisions. Finn et al.  learned a state transition model by predicting future frames to achieve goal-directed robotic manipulation. VPN  and Predictron  learned to predict future abstract states that are informative for planning in Atari Games. In contrast, our model predictive relational graph takes a set of raw human states (e.g., positions, velocities) as input and predicts multiple interacting human trajectories. To our knowledge, we are the first to integrate relational learning with a model-based DRL algorithm for crowd navigation.
We first describe how we formulate the crowd navigation problem with deep RL, then introduce our relational graph learning model for modeling interactions in the crowd. In addition, we show how this model can be integrated into a simplified Monte-Carlo Tree Search (MCTS) approach for d-step planning. Figure 2 shows an overview of our approach.
III-A Deep Reinforcement Learning for Crowd Navigation
In this work, we address the crowd navigation task in which the robot navigates through a crowd of humans to a goal position as efficiently and safely as possible. This task is formulated as a sequential decision making problem in recent works [10, 12, 8]. Each agent (human or robot) observes the others' observable state, including position $\mathbf{p} = [p_x, p_y]$, velocity $\mathbf{v} = [v_x, v_y]$ and radius $\rho$ (an abstract measure of size). Each agent also maintains an internal state, such as a goal position $\mathbf{g}$ and preferred speed $v_{pref}$. We assume actions are instantaneous and the agent always arrives at the target position in the next time step. We use $s_t$ and $w_t^i$ to denote the robot state and the observed state of human $i$ at time $t$, respectively. The robot input state is defined as $s_t^{jn} = [s_t, w_t^1, \ldots, w_t^n]$. The optimal policy, $\pi^*: s_t^{jn} \mapsto a_t$, maximizes the expected return:

$$\pi^*(s_t^{jn}) = \operatorname*{argmax}_{a_t} \; R(s_t^{jn}, a_t) + \gamma^{\Delta t \cdot v_{pref}} \int_{s_{t+\Delta t}^{jn}} P(s_{t+\Delta t}^{jn} \mid s_t^{jn}, a_t)\, V^*(s_{t+\Delta t}^{jn})\, ds_{t+\Delta t}^{jn}$$

where $R(s_t^{jn}, a_t)$ is the reward received at time $t$, $\gamma \in (0, 1)$ is the discount factor, $V^*$ is the optimal value function, and $P(s_{t+\Delta t}^{jn} \mid s_t^{jn}, a_t)$ is the transition probability from time $t$ to time $t + \Delta t$. The preferred velocity $v_{pref}$ is used as a normalization term in the discount factor for numerical reasons.
We follow the formulation of the reward function defined in Chen et al., which rewards accomplishing the task while penalizing collisions or uncomfortable distances:

$$R_t(s_t^{jn}, a_t) = \begin{cases} -0.25 & \text{if } d_t < 0 \\ -0.1 + d_t/2 & \text{else if } d_t < 0.2 \\ 1 & \text{else if } \mathbf{p}_t = \mathbf{p}_g \\ 0 & \text{otherwise} \end{cases}$$

where $d_t$ is the minimum separation distance between the robot and the humans during the time period $[t - \Delta t, t]$.
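As a concrete sketch, the piecewise reward above can be written as a small function. The numerical constants (collision penalty of -0.25, discomfort distance of 0.2, discomfort term -0.1 + d/2) follow the formulation of Chen et al.; the function signature itself is our own illustration:

```python
def reward(d_min, reached_goal, discomfort_dist=0.2):
    """Piecewise reward: d_min is the minimum robot-human separation over the
    last time step (negative means the robot overlapped with a human)."""
    if d_min < 0:                  # collision
        return -0.25
    if reached_goal:               # task accomplished
        return 1.0
    if d_min < discomfort_dist:    # uncomfortably close; penalty shrinks with distance
        return -0.1 + d_min / 2
    return 0.0                     # otherwise no reward
```

Note that the discomfort term goes from -0.1 at contact to 0 at the discomfort distance, so the penalty smoothly vanishes as the robot keeps its distance.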
We do not make assumptions about the state transition model as in previous work. Instead, we take a model-based approach and learn a state transition model for predicting human motion.
This problem statement can be applied to a set of more general tasks where there are non-cooperative agents and the decision-maker only receives their observable states but does not know about their intents or hidden policies.
III-B Model Predictive Relational Graph Learning
The interactions (i.e., spatial relations) between humans are important for robot navigation and for predicting future human motion. Previous work does not learn such interaction features for the robot and humans simultaneously. Here, we model the crowd as a graph, reason about the relations between all agents and use a GCN to compute the robot and human state representations. Using the graph model as a building block, we further construct two other modules: a value estimation module which estimates the value of the current state and a state prediction module which predicts the state at the next time step.
Relational graph learning. The key challenge in crowd navigation is to learn a good representation of the crowd that encodes interactions among all agents. Chen et al. show that attentive crowd aggregation improves both interpretability and performance by modeling one-way human-robot interactions. This motivates us to model the crowd and the robot as a directed graph $G = (V, E)$, in which the edge from agent $i$ to agent $j$ indicates how much attention agent $i$ pays to agent $j$, or the importance of agent $j$ to agent $i$. This pairwise relation is not known a priori, so it is inferred with a pairwise similarity function (relation inference). After the relations between all agents are inferred, a graph neural network propagates information from node to node and computes the state representations for all agents (interaction modeling), as shown in Figure 2 a).
LM-SARL can be viewed as an approximation of our Relational Graph Learning (RGL) formulation, in that it learns robot interaction features with respect to all humans and uses attentive pooling to aggregate them. Our RGL formulation learns not only the attention of the robot to humans but also the attention from humans to humans. Apart from learning the robot state representation, RGL simultaneously learns state representations for the other agents and propagates these features to the robot node by message passing. In this way, RGL also models higher-order interactions. For example, RGL fully models human-human interactions rather than using a local map (as in LM-SARL) to approximate them. This approach also compares favorably to LeTS-Drive, which reasons about neither pairwise state interactions nor human-human interactions.
III-B1 Relation Inference
The initial values of the graph vertices are the state values of the agents: $[s_t; w_t^1; \ldots; w_t^n]$. Since robot and human states have different dimensions, we use two multilayer perceptrons (MLPs), $f_r(\cdot)$ and $f_h(\cdot)$, to embed them into a latent space, resulting in a matrix $X$, where the first row is the latent state of the robot and the remaining rows are the latent states of the humans. Given the feature matrix $X$, a relation matrix $A$ is computed using a pairwise similarity function. Following Wang et al., we use an embedded Gaussian as the similarity function. The pairwise form is given by $A_{ij} = \operatorname{softmax}_j\!\left(\theta(x_i)^\top \phi(x_j)\right)$ and the matrix form by $A = \operatorname{softmax}\!\left(\theta(X)\,\phi(X)^\top\right)$, where $\theta(X) = X W_\theta$, $\phi(X) = X W_\phi$, and $W_\theta$, $W_\phi$ are learned embedding matrices. A learned relation is illustrated in Figure 1, where the thickness of a line indicates the strength of the pairwise interaction.
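A minimal NumPy sketch of the relation inference step, assuming the embedded-Gaussian similarity is normalized per row with a softmax as in Wang et al.; the weight names `W_theta` and `W_phi` and the dimensions are illustrative:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def infer_relations(X, W_theta, W_phi):
    """Embedded-Gaussian similarity: A_ij = softmax_j(theta(x_i) . phi(x_j)).
    X: (N, D) latent agent features (row 0 = robot, remaining rows = humans).
    W_theta, W_phi: (D, D') learned embedding matrices (names assumed)."""
    theta = X @ W_theta          # (N, D')
    phi = X @ W_phi              # (N, D')
    return softmax(theta @ phi.T, axis=1)  # (N, N) row-stochastic relation matrix

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 16))   # robot + 5 humans, 16-dim latent states
A = infer_relations(X, rng.normal(size=(16, 16)), rng.normal(size=(16, 16)))
```

Each row of the resulting matrix is a distribution over agents, i.e., how agent i divides its attention among all agents (including itself).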
III-B2 Interaction Modeling
With the feature matrix $X$ and the relation matrix $A$, we use a GCN to compute the pairwise interaction features. The message passing rule is defined by $H^{(l+1)} = \sigma\!\left(A H^{(l)} W^{(l)}\right)$, where $W^{(l)}$ is a layer-specific trainable weight matrix, $H^{(l)}$ is the node-level feature matrix of layer $l$, and $\sigma(\cdot)$ is an activation function. The feature of node $i$ at level $l+1$ aggregates its neighbors' node features at level $l$, weighted by the relations stored in matrix $A$. Let $H^{(0)} = X$; after $L$ layer propagations, we obtain the state representation matrix $H^{(L)}$, whose row $h_i^{(L)}$ is the state representation for agent $i$, encoding its local interactions.
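The propagation rule above amounts to repeated matrix multiplications. A sketch, assuming ReLU as the activation and using random weights purely for illustration:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def gcn_propagate(X, A, weights):
    """H^(l+1) = relu(A H^(l) W^(l)): each layer mixes node features across
    agents, weighted by the relation matrix A, then applies a linear transform."""
    H = X
    for W in weights:
        H = relu(A @ H @ W)
    return H  # row i is agent i's interaction-aware state representation

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 16))                 # robot + 5 humans
A = np.full((6, 6), 1.0 / 6.0)               # uniform relations, for illustration only
weights = [rng.normal(size=(16, 16)),        # two-layer GCN as used in our experiments
           rng.normal(size=(16, 32))]
H = gcn_propagate(X, A, weights)
```

With two layers, information from every human reaches the robot node through at most two hops, which is how higher-order (human-human-robot) interactions enter the robot's representation.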
Value estimation. Our value estimation module consists of two models: a relational graph model that infers the robot state representation and a subsequent value network that predicts the value of the state. The value network is an MLP that predicts values over a discretized action space by looking ahead, as shown in Figure 2 b).
State prediction. We assume the robot's action can be perfectly executed, so the next robot state $s_{t+\Delta t}$ can be computed directly from the current state $s_t$ and action $a_t$. Our state prediction module models interactions between human states to predict future human states, as shown in Figure 2 c). In the action selection phase, previous works [10, 8] rely on either querying the ground truth or off-the-shelf algorithms to approximate the next human states, which does not benefit from end-to-end learning. Our state prediction module consists of two models: first, a relational graph model infers the relations between all agents and computes their layer-$L$ interaction features; then, a successor motion prediction model uses the human state representations to predict each human's next state $\hat{w}_{t+\Delta t}^i$, where $\hat{w}_{t+\Delta t}^i$ is the predicted state for human $i$, with $i = 1, \ldots, n$. The motion prediction network is modeled by an MLP. In addition, to simplify the prediction task, we use a hand-crafted function to estimate the reward based on the predicted human states, denoted by $\hat{R}(s_t, a_t)$.
III-C d-step Planning
The use of DRL for crowd navigation addresses the shortsighted collision avoidance of traditional methods by learning a state value function. However, with imperfect value estimation and unknown human behavior policies, the robot can sometimes get stuck in a local minimum with suboptimal behaviors. By predicting future states and future rewards, our approach can simulate d steps into the future to provide a better estimate of the state values. Furthermore, by controlling the depth of this simulation, we can trade off computation for better performance. Our approach is visualized in the tree search diagram shown in Figure 2 d).
Compared to Chen et al., our model does not need to query the environment or use off-the-shelf trajectory predictors to obtain the next human states. We follow the d-step planning method proposed in Oh et al., performing rollouts using the crowd state prediction and value estimation up to d steps in the future, and selecting the action with the maximum d-step predicted return:

$$R_t^d = \hat{r}_t + \sum_{k=1}^{d-1} \gamma^{k}\, \hat{r}_{t+k} + \gamma^{d}\, V\!\left(\hat{s}_{t+d}\right)$$

where $\hat{r}_{t+k}$ is the predicted reward $k$ steps ahead and $\hat{s}_{t+d}$ is the predicted state after $d$ steps.
With d-step planning, computation increases exponentially with search depth and width. Due to our large action space, we use action space clipping to reduce the computational cost. Intuitively, the value function estimates the quality of entering a state. Using this estimate, we recursively search the top-w next states with one-step lookahead. Compared to considering only the top action, d-step planning provides a better state value estimate when the agent encounters unseen states. Pseudocode for the d-step planning is summarized in Algorithm 1. V-PLAN is a recursive function that takes a raw state as input and simulates future trajectories up to depth d with action space clipping width w. Figure 2 d) shows a toy example of the tree search for a small depth and clipping width.
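The V-PLAN recursion with action-space clipping can be sketched as follows. The callables `value_fn`, `predict_fn`, and `reward_fn` stand in for the value network, the state prediction module, and the hand-crafted reward estimate; the names and signatures are our own illustration, not the paper's interface:

```python
def v_plan(state, depth, width, value_fn, predict_fn, reward_fn, actions, gamma=0.9):
    """Recursive d-step value estimate with action-space clipping (Algorithm 1 sketch).
    value_fn(state) -> scalar; predict_fn(state, a) -> predicted next state;
    reward_fn(state, a) -> estimated reward for taking a in state."""
    if depth == 0:
        return value_fn(state)
    # Score every action with a one-step lookahead, keep only the top-w candidates.
    scored = []
    for a in actions:
        nxt = predict_fn(state, a)
        scored.append((reward_fn(state, a) + gamma * value_fn(nxt), a, nxt))
    scored.sort(key=lambda t: t[0], reverse=True)
    # Expand only the clipped candidate set one level deeper.
    best = -float("inf")
    for _, a, nxt in scored[:width]:
        ret = reward_fn(state, a) + gamma * v_plan(
            nxt, depth - 1, width, value_fn, predict_fn, reward_fn, actions, gamma)
        best = max(best, ret)
    return best
```

As a toy usage, with a 1-D state, `predict_fn = lambda s, a: s + a`, zero rewards, and `value_fn` measuring distance to a goal, the recursion simply finds the action sequence that moves closest to the goal within d steps.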
III-D Joint Value Estimation and State Prediction Learning
Pseudocode for the joint state prediction and value estimation learning scheme is given in Algorithm 2. We first use imitation learning, with experience collected from a demonstrator ORCA policy, to initialize the model, and then use RL to refine the policy. During RL training, we use temporal-difference learning with experience replay and a fixed target network to train the value estimation function, and supervised learning to train the state prediction function. Sharing graph models between the two functions makes RL training unstable and harder to optimize, so we use separate graph models for them. Compared to previous training schemes, in addition to the extra steps for training the state prediction model, our training algorithm uses the d-step return as the target value rather than the one-step target, which provides a more accurate value estimate.
TABLE I: Success rate, collision rate, extra navigation time, average return, and maximum difference from the return upper bound for each method, including our full model (MP-RGL-Multistep). ± indicates standard deviation measured using five independently seeded training runs.
IV Experimental Results
Implementation details. The embedding functions $f_r$ and $f_h$, the value network, and the motion prediction network are MLPs, and the output dimension of each GCN layer is the same for all layers. For a fair comparison with the baseline method, we use the value network for the baseline as reported in Chen et al. All parameters are trained using Adam with a fixed learning rate and discount factor. The exploration rate of the ε-greedy policy decays linearly over the first training episodes. We assume holonomic kinematics for the robot, i.e., it can move in any direction. The action space consists of discrete actions: speeds exponentially spaced in (0, $v_{pref}$] and headings evenly spaced in [0, 2π).
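A sketch of the action space construction described above. The counts (5 speeds, 16 headings) and the exact exponential spacing formula are assumptions for illustration, chosen so that speeds lie in (0, v_pref]:

```python
import numpy as np

def build_action_space(v_pref=1.0, n_speeds=5, n_headings=16):
    # Speeds exponentially spaced in (0, v_pref]: (e^(i/n) - 1) / (e - 1) * v_pref,
    # so the largest speed is exactly v_pref and spacing grows with speed.
    speeds = [(np.exp(i / n_speeds) - 1) / (np.e - 1) * v_pref
              for i in range(1, n_speeds + 1)]
    # Headings evenly spaced in [0, 2*pi), excluding the duplicate endpoint.
    headings = np.linspace(0.0, 2 * np.pi, n_headings, endpoint=False)
    return [(v, h) for v in speeds for h in headings]

actions = build_action_space()  # 5 speeds x 16 headings = 80 discrete actions
```

Denser sampling near low speeds gives the planner fine control when maneuvering close to humans, while still covering the full speed range.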
Humans are simulated by the ORCA policy, the parameters of which are sampled from a Gaussian distribution to introduce behavioral diversity. We use circle crossing scenarios for our experiments: humans are randomly positioned on a circle of fixed radius, with random perturbations added to their x and y coordinates. The maximum speed of the agent, $v_{pref}$, is 1 m/s, and the goal is fixed so that it can be reached in a minimum time of 8 seconds. To fully evaluate the effectiveness of the proposed model, we study the simulation setting where the robot is invisible to humans. As a result, the simulated humans react only to other humans but not to the robot. This setting is a clean testbed for validating the model's ability to reason about human-robot and human-human interactions without affecting human behaviors. All models are evaluated on 500 random test cases.
Quantitative Evaluation. The state-of-the-art method LM-SARL assumes that the next ground-truth state is given during training. To demonstrate the effectiveness of our relational graph learning model, we use a linear motion model (agents keep the velocities from their last state) both in LM-SARL and as our state prediction model, so that the comparison isolates robot state representation learning. These two models are denoted LM-SARL-Linear and RGL-Linear, respectively. We refer to our full model as MP-RGL-Multistep and to the model with one-step lookahead as MP-RGL-Onestep. For all RGL-based models, we use a two-layer GCN. For MP-RGL-Multistep, we fix the planning depth d and the clipping width w. We do not compare with LeTS-Drive, as it focuses on components other than interaction modeling and assumes different inputs. Table I reports the rates of success and collision, the average extra navigation time (navigation time beyond the minimum possible 8 seconds), and the average return, which is the cumulative discounted reward averaged over all steps in the 500 test cases. To provide a clearer interpretation of the average return metric, we add one more metric: the difference between the average return and its upper bound. The upper bound corresponds to the imaginary case where there are no other humans and the agent heads straight to the goal at all times.
As expected, the performance of LM-SARL-Linear drops compared to the originally reported results once the unrealistic assumption of access to the next ground-truth state is replaced with a linear motion model. The first comparison, between LM-SARL-Linear and RGL-Linear, reflects the improvement brought by modeling crowd interaction with graph neural networks. The comparison between RGL-Linear and MP-RGL-Onestep then shows that using a learned state prediction model leads to a better estimate of human motions and thus a better navigation policy. Finally, the best model in the table is our MP-RGL-Multistep. Comparing it to the one-step version, we see that d-step planning improves the success rate, reduces extra time to reach the goal, and increases the average return. Even the best-performing model is not collision-free: the robot is invisible to humans and the policy does not have access to the next ground-truth state, making some collisions unavoidable (e.g., when the robot is surrounded by humans converging towards it).
Effect of planning budget. In our d-step planning, both the tree depth d and the action space clipping width w influence computation time and planning performance. With larger d, our approach looks further into the future and plans further ahead. With larger w, we consider more actions at each depth and can predict a wider range of possible outcomes, potentially finding a better path. We study how the planning budget influences performance by testing the MP-RGL-Onestep model from Table I with various test-time planning budgets. Simply increasing the planning depth decreases the extra time and improves the average return, and with larger w the performance further improves. From these experiments, we conclude that our d-step planning leads to better performance at both learning and decision time.
Investigation of running time. The GNN-based state feature extractor is computationally efficient compared to attention-based crowd navigation work. The GNN computation amounts to a matrix multiplication, with negligible change in cost when the number of agents N is relatively small. Search complexity depends mostly on the depth d and width w. The running time of our full model is 0.095 s, 0.101 s, 0.114 s, and 0.115 s per decision in scenarios with 5, 10, 15, and 20 humans, respectively, on an i7-4790 CPU. This shows that i) our approach runs at about 10 FPS, which is sufficient for real-time settings when the robot moves at relatively slow speeds, and ii) the impact of scaling the number of human agents is small.
Qualitative evaluation. We further investigate how our full model handles two challenging scenarios. The first one is when shortsighted behaviors lead to collisions, and the second one shows the robot freezing when it is close to both the goal and potential danger.
The scenario shown in Figure 2(a) is a typical example of the first case. The one-step lookahead policy assigns almost equally high values to actions in two directions, with one of them being the highest. This is reasonable for a shortsighted policy, since no human currently occupies either direction. However, taking the higher-valued action would result in a collision with human 0. By looking two steps ahead, the policy anticipates an unavoidable collision in two steps if it moves in that direction now. The advantage of the multi-step lookahead policy also validates the effectiveness of encoding higher-order human interactions. By observing the strong relation between human 0 and human 3, who are moving in the same direction and are very close to each other, the robot predicts where human 3 will move and predicts negative rewards for actions toward that region. Thus, the two-step lookahead policy assigns low values to that direction and avoids a collision by taking an action in the other direction.
The scenario shown in Figure 2(b) is a typical example of the second case, where human 3 is standing near the robot's goal position. The one-step lookahead policy assigns the highest value to an action that steers away from the goal and a lower value to the action heading toward it, because an action toward the goal incurs a discomfort penalty for stepping into the comfort zone of human 3. The two-step lookahead policy can predict the large positive reward obtained after taking an action toward the goal: the positive reward two steps later compensates for the one-step discomfort penalty. Thus, the two-step lookahead policy assigns the highest value to the action toward the goal and makes a non-myopic decision.
V Conclusion
In this paper, we address crowd navigation with a relational graph learning approach. By formulating human-robot and human-human interactions as a graph and using GCNs to compute interaction features, our model can estimate state values as well as predict human motion. Augmented by d-step planning, our model explicitly plans into the future under a given search budget. We show that our model outperforms baseline methods by a large margin and can handle challenging navigation scenarios. The relational graph learning approach we propose can be extended in several ways. For example, we do not model the temporal dynamics of past human trajectories, which could help infer the intent of individuals and the group structure among humans.
- (2019-06-04) Learning transferable cooperative behavior in multi-agent teams.
- (2016-06) Social LSTM: human trajectory prediction in crowded spaces. pp. 961–971.
- Relational inductive biases, deep learning, and graph networks.
- (2016-12-01) Interaction networks for learning about objects, relations and physics.
- (2008-05) Reciprocal velocity obstacles for real-time multi-agent navigation. In 2008 IEEE International Conference on Robotics and Automation, pp. 1928–1935.
- (2019-05-28) LeTS-Drive: driving in a crowd by learning from tree search.
- (2015-12) HICO: a benchmark for recognizing human-object interactions in images. In 2015 IEEE International Conference on Computer Vision (ICCV), pp. 1017–1025.
- (2018-09-24) Crowd-robot interaction: crowd-aware robot navigation with attention-based deep reinforcement learning.
- (2017-03-26) Socially aware motion planning with deep reinforcement learning.
- (2016-09-26) Decentralized non-communicating multiagent collision avoidance with deep reinforcement learning.
- Structure inference machines: recurrent neural networks for analyzing relations in group activity recognition.
- (2018-05-04) Motion planning among dynamic, decision-making agents with deep reinforcement learning.
- (2016-10-03) Deep visual foresight for planning robot motion.
- (2017-04-24) Detecting and recognizing human-object interactions.
- (2018-06-17) Learning policy representations in multiagent systems.
- (2017-06-07) Inductive representation learning on large graphs.
- (2017-09-16) Representation learning on graphs: methods and applications.
- (1995-05-01) Social force model for pedestrian dynamics. 51 (5), pp. 4282–4286.
- (2017-06-19) VAIN: attentional multi-agent predictive modeling.
- (2018) Hierarchical relational networks for group activity recognition and retrieval. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 721–736.
- (2014-12) Adam: a method for stochastic optimization. arXiv:1412.6980.
- (2018-02-13) Neural relational inference for interacting systems.
- (2016-09-09) Semi-supervised classification with graph convolutional networks.
- (2017-03-08) A structured self-attentive sentence embedding.
- (2005-02-01) Robust model predictive control of constrained linear systems with bounded disturbances. 41 (2), pp. 219–224.
- (2017-07-11) Value prediction network.
- (2016-01) Mastering the game of Go with deep neural networks and tree search. 529 (7587), pp. 484–489.
- The predictron: end-to-end learning and planning.
- (2010-10) Unfreezing the robot: navigation in dense, interacting crowds. In 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 797–803.
- (2013-05) Robot navigation in dense human crowds: the case for cooperation. In 2013 IEEE International Conference on Robotics and Automation, pp. 2153–2160.
- (2017-12) Sparse interacting Gaussian processes: efficiency and optimality theorems of autonomous crowd navigation. In 2017 IEEE 56th Annual Conference on Decision and Control (CDC), pp. 327–334.
- (2011) Reciprocal n-body collision avoidance. In Robotics Research, C. Pradalier, R. Siegwart, and G. Hirzinger (Eds.), Springer Tracts in Advanced Robotics, pp. 3–19.
- (2017-06-12) Attention is all you need.
- (2017-10-30) Graph attention networks.
- Social attention: modeling attention in human crowds.
- (2017-11-21) Non-local neural networks.
- (2006-12) Trajectory free linear model predictive control for stable walking in the presence of strong perturbations. In 2006 6th IEEE-RAS International Conference on Humanoid Robots, pp. 137–142.
- (2017-01-09) Scene graph generation by iterative message passing.
- (2019) How powerful are graph neural networks?
- (2018-06-05) Relational deep reinforcement learning.