Log In Sign Up

Where to go next: Learning a Subgoal Recommendation Policy for Navigation Among Pedestrians

Robotic navigation in environments shared with other robots or humans remains challenging because the intentions of the surrounding agents are not directly observable and the environment conditions are continuously changing. Local trajectory optimization methods, such as model predictive control (MPC), can deal with those changes but require global guidance, which is not trivial to obtain in crowded scenarios. This paper proposes to learn, via deep Reinforcement Learning (RL), an interaction-aware policy that provides long-term guidance to the local planner. In particular, in simulations with cooperative and non-cooperative agents, we train a deep network to recommend a subgoal for the MPC planner. The recommended subgoal is expected to help the robot in making progress towards its goal and accounts for the expected interaction with other agents. Based on the recommended subgoal, the MPC planner then optimizes the inputs for the robot satisfying its kinodynamic and collision avoidance constraints. Our approach is shown to substantially improve the navigation performance in terms of number of collisions as compared to prior MPC frameworks, and in terms of both travel time and number of collisions compared to deep RL methods in cooperative, competitive and mixed multiagent scenarios.


Learning to Socially Navigate in Pedestrian-rich Environments with Interaction Capacity

Existing navigation policies for autonomous robots tend to focus on coll...

Learning Deep SDF Maps Online for Robot Navigation and Exploration

We propose an algorithm to (i) learn online a deep signed distance funct...

MAPPER: Multi-Agent Path Planning with Evolutionary Reinforcement Learning in Mixed Dynamic Environments

Multi-agent navigation in dynamic environments is of great industrial va...

Collision Avoidance in Pedestrian-Rich Environments with Deep Reinforcement Learning

Collision avoidance algorithms are essential for safe and efficient robo...

Occlusion-Aware MPC for Guaranteed Safe Robot Navigation with Unseen Dynamic Obstacles

For safe navigation in dynamic uncertain environments, robotic systems r...

Autonomous Navigation of AGVs in Unknown Cluttered Environments: log-MPPI Control Strategy

Sampling-based model predictive control (MPC) optimization methods, such...

APPLE: Adaptive Planner Parameter Learning from Evaluative Feedback

Classical autonomous navigation systems can control robots in a collisio...

I Introduction

Autonomous robot navigation in crowds remains difficult due to the interaction effects among navigating agents. Unlike multi-robot environments, robots operating among pedestrians require decentralized algorithms that can handle a mixture of other agents’ behaviors without depending on explicit communication between agents.

Several state-of-the-art collision avoidance methods employ model-predictive control (MPC) with online optimization to compute motion plans that are guaranteed to respect important constraints [31]. These constraints could include the robot’s nonlinear kino-dynamics model or collision avoidance of static obstacles and other dynamic, decision-making agents (e.g., pedestrians). Although online optimization becomes less computationally practical for extremely dense scenarios, modern solvers enable real-time motion planning in many situations of interest [3].

A key challenge is that the robot’s global goal is often located far beyond the planning horizon, meaning that a local subgoal or cost-to-go heuristic must be specified instead. This is straightforward in a static environment (e.g., using euclidean/diffusion 

[7] distance), but the presence interactive agents makes it difficult to quantify which subgoals will lead to the global goal quickest. A body of work addresses this challenge with deep reinforcement learning (RL), in which agents learn a model of the long-term cost of actions in an offline training phase (usually in simulation) [6, 4, 10, 11]. The learned model is fast-to-query during online execution, but the way learned costs/policies have been used to date does not provide guarantees on collision avoidance or feasibility with respect to the robot dynamics.

Fig. 1: Proposed navigation architecture. The subgoal planner observes the environment and suggests the next subgoal position to the local motion planner, the MPC. The MPC then computes a local trajectory and the robot executes the next optimal control command, which minimizes the distance to the provided position reference while respecting collision and kinodynamic constraints.

In this paper, we introduce Goal Oriented Model Predictive Control (GO-MPC), which enhances state-of-art online optimization-based planners with a learned global guidance policy. In an offline RL training phase, an agent learns a policy that uses the current world configuration (the states of the robot and other agents, and a global goal) to recommend a local subgoal for the MPC, as depicted in Fig. 1. Then, the MPC generates control commands ensuring that the robot and collision avoidance constraints are satisfied (if a feasible solution is found) while making progress towards the suggested subgoal. Our approach maintains the kino-dynamic feasibility and collision avoidance guarantees inherent in an MPC formulation, while improving the average time-to-goal and success rate by leveraging past experience in crowded situations.

The main contributions of this work are:

  • A goal-oriented Model Predictive Control method (GO-MPC) for navigation among interacting agents, which utilizes a learned global guidance policy (recommended subgoal) in the cost function and ensures that dynamic feasibility and collision avoidance constraints are satisfied when a feasible solution to the optimization problem is found;

  • An algorithm to train an RL agent jointly with an optimization-based controller in mixed environments, which is directly applicable to real-hardware, reducing the sim to real gap.

Finally, we present simulation results demonstrating an improvement over several state-of-art methods in challenging scenarios with realistic robot dynamics and a mixture of cooperative and non-cooperative neighboring agents. Our approach shows different navigation behaviors: navigating through the crowd when interacting with cooperative agents, avoiding congestion areas when non-cooperative agents are present and enabling communication-free decentralized multi-robot collision avoidance.

I-a Related Work

I-A1 Navigation Among Crowds

Past work on navigation in cluttered environments often focuses on interaction models using geometry [36, 38], physics [15], topologies [27, 28], handcrafted functions [35], and cost functions [21, 21]

or joint probability distributions

[39] learned from data. While accurate interaction models are critical for collision avoidance, this work emphasizes that the robot’s performance (time-to-goal) is highly dependent on the quality of its cost-to-go model (i.e., the module that recommends a subgoal for the local planner). Designing a useful cost-to-go model in this problem remains challenging, as it requires quantifying how “good” a robot’s configuration is with respect to dynamic, decision-making agents. In [6], deep RL was introduced as a way of modeling cost-to-go through an offline training phase; the online execution used simple vehicle and interaction models for collision-checking. Subsequent works incorporated other interactions to generate more socially compliant behavior within the same framework [5, 4]. To relax the need for simple online models, [10] moved the collision-checking to the offline training phase. While these approaches use pre-processed information typically available from perception pipelines (e.g., pedestrian detection, tracking systems), other works proposed to learn end-to-end policies [11, 34]

. Although all of these RL-based approaches learn to estimate the cost-to-go, the online implementations do not provide guarantees that the recommended actions will satisfy realistic vehicle dynamics or collision avoidance constraints. Thus, this work builds on the promising idea of learning a cost-to-go model, but we start from an inherently safe MPC formulation.

I-A2 Learning-Enhanced MPC

Outside the context of crowd navigation, numerous recent works have proposed learning-based solutions to overcome some of the known limitations of optimization-based methods (e.g., nonlinear MPC) [16]. For example, solvers are often sensitive to the quality of the initial guess hence, [26] proposes to learn a policy from data that efficiently “warm-starts” a MPC. Model inaccuracies can lead to sub-optimal MPC solution quality; [1] proposes to learn a policy by choosing between two actions with the best expected reward at each timestep: one from model-free RL and one from a model-based trajectory optimizer. Alternatively, RL can be used to optimize the weights of an MPC-based Q-function approximator or to update a robust MPC parametrization [41]. When the model is completely unknown, [30] shows a way of learning a dynamics model to be used in MPC. Computation time is another key challenge: [42] learns a cost-to-go estimator to enable shortening of the planning horizons without sacrificing much solution quality, although their approach differs from this work as it uses local and linear function approximators which limits its applicability to high-dimensional state spaces.The closest related works address cost tuning with learning. MPC’s cost functions are replaced with a value function learned via RL offline in [25] (terminal cost) and [13] (stage cost).[20] deployed value function learning on a real robot outperforming an expert-tuned MPC. While these ideas also use RL for a better cost-to-go model, this work focuses on the technical challenge of learning a subgoal policy required for navigation through crowds avoiding the approximation issues and extrapolation issues to unseen events. Moreover, this work learns to set terminal constraints rather than setting a cost with a value function.

I-A3 Combining MPC with RL

Recently, there is increasing interest on approaches combining the strengths of MPC and RL as suggested in [9]

. For instance, optimization-based planning has been used to explore high-reward regions and distill the knowledge into a policy neural network, rather than a neural network policy to improve an optimization.

[23, 24, 29].

Similar to our approach, [18] utilizes the RL policy during training to ensure exploration and employs a MPC to optimize sampled trajectories from the learned policy at test time. Moreover, policy networks have be used to generate proposals for a sampling-based MPC [40], or to select goal positions from a predefined set [14].

Nevertheless, to the extent of our knowledge, approaches combining the benefits of both optimization and learning-based methods were not explored in the context of crowd navigation. Moreover, the works exploring a similar idea of learning a cost-to-go model do not allow to explicitly define collision constraints and ensure safety. To overcome the previous issues, in this paper, we explore the idea of learning a cost-to-go model to directly generate subgoal positions, which lead to higher long-term rewards and too give the role of local collision avoidance and kinematics constraints to an optimization-based planner.

Such cost-to-go information can be formulated as learning a value function for the ego-agent state-space providing information which states are more valuable [13]. In contrast, we propose to learn a policy directly informing which actions lead to higher rewards allowing to directly incorporate the MPC controller in the training phase.

Ii Preliminaries

Throughout this paper, vectors are denoted in bold lowercase letters,

, matrices in capital, , and sets in calligraphic uppercase, . denotes the Euclidean norm of and denotes the weighted squared norm. The variables denote the state and action variables used in the RL formulation, and denote the control state and action commands used in the optimization problem.

Ii-a Problem Formulation

Consider a scenario where a robot must navigate from an initial position to a goal position on the plane , surrounded by non-communicating agents. At each time-step , the robot first observes its state (defined in Sec.III-A2) and the set of the other agents states , then takes action , leading to the immediate reward and next state , under the transition model .

We use the superscript to denote the -th nearby agent and omit the superscript when referring to the robot. For each agent , denotes its position, its velocity at step relative to a inertial frame, and the agent radius. We assume that each agent’s current position and velocity are observed (e.g., with on-board sensing) while other agents’ motion intentions (e.g., goal positions) are unknown. Finally, denotes the area occupied by the robot and by each surrounding agent, at time-step .The goal is to learn a policy for the robot that minimizes time to goal while ensuring collision-free motions, defined as: equationparentequation

s.t. (1a)

where (1a) are the transition dynamic constraints considering the dynamic model , (1b) the terminal constraints, (1c) the collision avoidance constraints and , and are the set of admissible states, inputs (e.g., to limit the robot’s maximum speed) and the set of admissible control states, respectively. Note that we only constraint the control states of the robot. Moreover, we assume other agents have various behaviors (e.g., cooperative or non-cooperative): each agent samples a policy from a closed set (defined in Sec.II-C) at the beginning of each episode.

Ii-B Agent Dynamics

Real robotic systems’ inertia imposes limits on linear and angular acceleration. Thus, we assume a second-order unicycle model for the robot [22]:


where and are the agent position coordinates and is the heading angle in a global frame. is the agent forward velocity, denotes the angular velocity and, the linear and angular acceleration, respectively.

Ii-C Modeling Other Agents’ Behaviors

In a real scenario, agents may follow different policies and show different levels of cooperation. Hence, in contrast to previous approaches, we do not consider all the agents to follow the same policy [10, 12]. At the beginning of an episode, each non-ego agent either follows a cooperative or a non-cooperative policy. For the cooperative policy, we employ the Reciprocal Velocity Obstacle (RVO) [37] model with a random cooperation coefficient111This coefficient is denoted as in [36] sampled at the beginning of the episode. The “reciprocal” in RVO means that all agents follow the same policy and use the cooperation coefficient to split the collision avoidance effort among the agents (e.g., a coefficient of 0.5 means that each agent will apply half of the effort to avoid the other). In this work, for the non-cooperative agents, we consider both constant velocity (CV) and non-CV policies. The agents following a CV model drive straight in the direction of their goal position with constant velocity. The agents following a non-CV policy either move in sinusoids towards their final goal position or circular motion around their initial position.

Iii Method

Learning a sequence of intermediate goal states that lead an agent toward a final goal destination can be formulated as a single-agent sequential decision making problem. Because parts of the environment can be difficult to model explicitly, the problem can be solved with a reinforcement learning framework. Hence, we propose a two-level planning architecture, as depicted in Figure 1, consisting of a subgoal recommender (Section III-A2) and an optimization-based motion planner (Section II-C). We start by defining the RL framework and our’s policy architecture (SectionIII-A2). Then, we formulate the MPC to execute the policy’s actions and ensure local collision avoidance (SectionIII-B).

Iii-a Learning a Subgoal Recommender Policy

We aim to develop a decision-making algorithm to provide an estimate of the cost-to-go in dynamic environments with mixed-agents. In this paper, we propose to learn a policy directly informing which actions lead to higher rewards.

Iii-A1 RL Formulation

As in [6], the observation vector is composed by the ego-agent and the surrounding agents states, defined as:


where is the ego-agent state and the -th agent state at step . Moreover, is the ego-agent’s distance to goal and is the distance to the -th agent.

Here, we seek to learn the optimal policy for the ego-agent

mapping the ego-agent’s observation of the environment to a probability distribution of actions. We consider a continuous action space

and define an action as position increments providing the direction maximizing the ego-agent rewards, defined as: equationparentequation


where are the position increments, the maximum linear velocity and are the network policy parameters. Moreover, to ensure that the next sub-goal position is within the planning horizon of the ego-agent, we bound the action space according with the planning horizon of the optimization-based planner and its dynamic constraints, as represented in Eq. 4b.

We design the reward function to motivate the ego-agent to reach the goal position while penalizing collisions:


where is the distance to the closest surrounding agent. allows to adapt the reward function as shown in the ablation study (Sec.IV-C), rewards the agent if reaches the goal penalizes if it collides with any other agents. In Section. IV-C we analyze its influence in the behavior of the learned policy.

Iii-A2 Policy Network Architecture

Fig. 2: Proposed network policy architecture.

A key challenge in collision avoidance among pedestrians is that the number of nearby agents can vary between timesteps. Because feed-forward NNs require a fixed input vector size, prior work [10]

proposed the use of Recurrent Neural Networks (RNNs) to compress the

agent states into a fixed size vector at each time-step. Yet, that approach discarded time-dependencies of successive observations (i.e., hidden states of recurrent cells).

Here, we use the “store-state” strategy, as proposed in [19]. During the rollout phase, at each time-step we store the hidden-state of the RNN together with the current state and other agents state, immediate reward and next state, . Moreover, the previous hidden-state is feed back to warm-start the RNN in the next step, as depicted in Fig.2. During the training phase, we use the stored hidden-states to initialize the network. Our policy architecture is depicted in Fig. 2. We employ a RNN to encode a variable sequence of the other agents states and model the existing time-dependencies. Then, we concatenate the fixed-length representation of the other agent’s states with the ego-agent’s state to create a join state representation. This representation vector is fed to two fully-connected layers (FCL). The network has two output heads: one estimates the probability distribution parameters of the policy’s action space and the other estimates the state-value function . and

are the mean and variance of the policy’s distribution, respectively.

Iii-B Local Collision Avoidance: Model Predictive Control

Here, we employ MPC to generate locally optimal commands respecting the kino-dynamics and collision avoidance constraints. To simplify the notation used, hereafter, we assume the current time-step as zero.

Iii-B1 State and Control Inputs

We define the ego-agent control input vector as and the control state as following the dynamics model defined in Section II-B.

Iii-B2 Dynamic Collision Avoidance

We define a set of nonlinear constraints to ensure that the MPC generates collision-free control commands for the ego-agent (if a feasible solution exists). To limit the problem complexity and ensure to find a solution in real-time, we consider a limited number of surrounding agents , with . Consider as the set of all surrounding agent states, than the set of the -th closest agents is:

Definition 1.

A set is the set of the -th closest agents if the euclidean distance .

We represent the area occupied by each agent as a circle with radius . To ensure collision-free motions, we impose that each circle does not intersect with the area occupied by the ego-agent resulting in the following set of inequality constraints:


for each planning step . This formulation can be extended for agents with general quadratic shapes, as in [3].

Iii-B3 Cost function

The subgoal recommender provides a reference position guiding the ego-agent toward the final goal position and minimizing the cost-to-go while accounting for the other agents. The terminal cost is defined as the normalized distance between the ego-agent’s terminal position (after a planning horizon ) and the reference position (with weight coefficient ):


To ensure smooth trajectories, we define the stage cost as a quadratic penalty on the ego-agent control commands


where is the weight coefficient.

Iii-B4 MPC Formulation

The MPC is then defined as a non-convex optimization problem


In this work, we assume a constant velocity model estimate of the other agents’ future positions, as in [3].

Iii-C Ppo-Mpc

In this work, we train the policy using a state-of-art method, Proximal Policy Optimization (PPO) [33], but the overall framework is agnostic to the specific RL training algorithm. We propose to jointly train the guidance policy and value function with the MPC, as opposed to prior works [10] that use an idealized low-level controller during policy training (that cannot be implemented on a real robot). Algorithm 1 describes the proposed training strategy and has two main phases: supervised and RL training. First, we randomly initialize the policy and value function parameters . Then, at the beginning of each episode we randomly select the number of surrounding agents between , the training scenario and the surrounding agents policy. More details about the different training scenarios and considered is given in Sec.IV-B.

An initial RL policy is unlikely to lead an agent to a goal position. Hence, during the warm-start phase, we use the MPC as an expert and perform supervised training to train the policy and value function parameters for steps. By setting the MPC goal state as the ego-agent final goal state and solving the MPC problem, we obtain a locally optimal sequence of control states . For each step, we define and store the tuple containing the network hidden-state, state, next state, and reward in a buffer . Then, we compute advantage estimates [32] and perform a supervised training step


where are the value function and policy parameters, respectively. Note that and share the same parameter except for the final layer, as depicted in Fig.2. Afterwards, we use Proximal Policy Optimization (PPO) [33] with clipped gradients for training the policy. PPO is a on-policy method addressing the high-variance issue of policy gradient methods for continuous control problems. We refer the reader to [33] for more details about the method’s equations. Please note that our approach is agnostic to which RL algorithm we use. Moreover, to increase the learning speed during training, we gradually increase the number of agents in the training environments (curriculum learning [2]).

1:Inputs: planning horizon , value fn. and policy parameters , number of supervised and RL training episodes , number of agents , , and reward function
2:Initialize states: ,
3:while  do
4:     Initialize and
5:     for  do
6:         if  then
7:              Solve Eq.9 considering
8:              Set
9:         else
11:         end if
13:         Store
14:         if done then
16:              Reset hidden-state:
17:              Initialize:
18:         end if
19:     end for
20:     if  then
21:         Supervised training: Eq.10 and Eq.11
22:     else
23:         PPO training [33]
24:     end if
25:end while
Algorithm 1 PPO-MPC Training

Iv Results

This section quantifies the performance throughout the training procedure, provides an ablation study, and compares the proposed method (sample trajectories and numerically) against the following baseline approaches:

  • MPC: Model Predictive Controller from Section III-B with final goal position as position reference, ;

  • DRL [10]: state-of-the-art Deep Reinforcement Learning approach for multi-agent collision avoidance

To analyze the impact of a realistic kinematic model during training, we consider two variants of the DRL method  [10]: the same RL algorithm [10] was used to train a policy under a first-order unicycle model, referred to as DRL, and a second-order unicycle model (Eq.2), referred to as DRL-2. All experiments use a second-order unicycle model (Eq.2) in environments with cooperative and non-cooperative agents to represent realistic robot/pedestrian behavior. Animations of sample trajectories accompany the paper.

Iv-a Experimental setup

The proposed training algorithm builds upon the open-source PPO implementation provided in the Stable-Baselines

[17] package. We used a laptop with an Intel Core i7 and 32 GB of RAM for training. To solve the non-linear and non-convex MPC problem of Eq. 9, we used the ForcesPro [8]

solver. If no feasible solution is found within the maximum number of iterations, then the robot decelerates. All MPC methods used in this work consider collision constraints with up to the closest six agents so that the optimization problem can be solved in less than 20ms. Moreover, our policy’s network has an average computation time of 2ms with a variance of 0.4ms for all experiments. Hyperparameter values are summarized in

Table I.

Planning Horizon 2 s Num. mini batches 2048
Number of Stages 20 3
0.99 -10
Clip factor 0.1 Learning rate
TABLE I: Hyper-parameters.

Iv-B Training procedure

To train and evaluate our method we have selected four navigation scenarios, similar to [10, 4, 11]:

  • Symmetric swapping: Each agent’s position is randomly initialized in different quadrants of the x-y plane, where all agents have the same distance to the origin. Each agent’s goal is to swap positions with an agent from the opposite quadrant.

  • Asymmetric swapping: As before, but all agents are located at different distances to the origin.

  • Pair-wise swapping: Random initial positions; pairs of agents’ goals are each other’s intial positions

  • Random: Random initial & goal positions

Each training episode consists of a random number of agents and a random scenario. At the start of each episode, each other agent’s policy is sampled from a binomial distribution (80% cooperative, 20% non-cooperative). Moreover, for the cooperative agents we randomly sample a cooperation coefficient

and for the non-cooperative agents is randomly assigned a CV or non-CV policy (i.e., sinusoid or circular). Fig. 3 shows the evolution of the robot average reward and the percentage of failure episodes. The top sub-plot compares our method average reward with the two baseline methods: DRL (with pre-trained weights) and MPC. The average reward for the baseline methods (orange, yellow) drops as the number of agents increases (each vertical bar). In contrast, our method (blue) improves with training and eventually achieves higher average reward for 10-agent scenarios than baseline methods achieve for 2-agent scenarios. The bottom plot demonstrates that the percentage of collisions decreases throughout training despite the number of agents increasing.

Fig. 3: Moving average rewards and percentage of failure episodes during training. The top plot shows our method average episode reward vs DRL [10] and simple MPC.

Iv-C Ablation Study

A key design choice in RL is the reward function; here, we study the impact on policy performance of three variants of reward. The sparse reward uses (only non-zero reward upon reaching goal/colliding). The time reward uses (penalize every step until reaching goal). The progress reward uses (encourage motion toward goal). Aggregated results in Table II show that the resulting policy trained with a time reward function allows the robot to reach the goal with minimum time, to travel the smallest distance, and achieve the lowest percentage of failure cases. Based on these results, we selected the policy trained with the time reward function for the subsequent experiments.

Time to Goal [s] % failures (% collisions / % timeout) Traveled distance Mean [m]
# agents 6 8 10 6 8 10 6 8 10
Sparse Reward 8.00 8.51 8.52 0 (0 / 0) 1 (0 / 1) 2 (1 / 1) 13.90 14.34 14.31
Progress Reward 8.9 8.79 9.01 2 (1 / 1) 3 (3 / 0) 1 (1 / 0) 14.75 14.57 14.63
Time Reward 7.69 8.03 8.12 0 (0 / 0) 0 (0 / 0) 0 (0 / 0) 13.25 14.01 14.06
TABLE II: Ablation Study: Discrete reward function leads to better policy than sparse, dense reward functions. Results are aggregated over 200 random scenarios with agents.

Iv-D Qualitative Analysis

This section compares and analyzes trajectories for different scenarios. Fig. 4 shows that our method resolves a failure mode of both RL and MPC baselines. The robot has to swap position with a non-cooperative agent (red, moving right-to-left) and avoid a collision. We overlap the trajectories (moving left-to-right) performed by the robot following our method (blue) versus the baseline policies (orange, magenta). The MPC policy (orange) causes a collision due to the dynamic constraints and limited planning horizon. The DRL policy avoids the non-cooperative agent, but due to its reactive nature, only avoids the non-cooperative agent when very close, resulting in larger travel time. Finally, when using our approach, the robot initiates a collision avoidance maneuver early enough to lead to a smooth trajectory and faster arrival at the goal.

Fig. 4: Two agents swapping scenario. In blue is depicted the trajectory of robot, in red the non-cooperative agent, in purple the DRL agent and, in orange the MPC.
Fig. 5: Sample trajectories with mixed agent policies (robot: blue, cooperative: green, non-cooperative: red). In (a), all agents are cooperative; in (b), two are cooperative and five non-cooperative (const. vel.); in (c), three are cooperative and two non-cooperative (sinusoidal). The GO-MPC agent avoids non-cooperative agents differently than cooperative agents.
(a) DRL [10]
(b) DRL-2 (ext. of [10])
(c) GO-MPC
Fig. 6: 8 agents swapping positions. To simulate a multi-robot environment, all agents follow the same policy.

We present results for mixed settings in Fig. 5 and homogeneous settings in Fig. 6 with agents. In mixed settings, the robot follows our proposed policy while the other agents either follow an RVO [37] or a non-cooperative policy (same distribution as in training). Fig. 5 demonstrates that our navigation policy behaves differently when dealing with only cooperative agents or both cooperative and non-cooperative. Whereas in Fig. 4(a) the robot navigates through the crowd, Fig. 4(b) shows that the robot takes a longer path to avoid the congestion.

In the homogeneous setting, all agents follow our proposed policy. Fig. 6 shows that our method achieves faster time-to-goal than two DRL baselines. Note that this scenario was never introduced during the training phase, nor have the agents ever experienced other agents with the same policy before. Following the DRL policy (Fig. 5(a)), all agents navigate straight to their goal positions leading to congestion in the center with reactive avoidance. The trajectories from the DRL-2 approach (Fig. 5(b)) are more conservative, due to the limited acceleration available. In contrast, the trajectories generated by our approach (Fig. 5(c)), present a balance between going straight to the goal and avoiding congestion in the center, allowing the agents to reach their goals faster and with smaller distance traveled.

Iv-E Performance Results

This section aggregates performance of the various methods across 200 random scenarios. Performance is quantified by average time to reach the goal position, percentage of episodes that end in failures (either collision or timeout), and the average distance traveled.

The numerical results are summarized in Table III. Our method outperforms each baseline for both mixed and homogeneous scenarios. To evaluate the statistical significance, we performed pairwise Mann–Whitney U-tests between GO-MPC and each baseline (95 confidence). GO-MPC shows statistically significant performance improvements over the DRL-2 baseline in terms of travel time and distance, and the DRL baseline in term of travel time for six agents and travel distance for ten agents. . For homogeneous scenarios, GO-MPC is more conservative than DRL and MPC baselines resulting in a larger average traveled distance. Nevertheless, GO-MPC is reaches the goals faster than each baseline and is less conservative than DRL-2, as measured by a significantly lower average distance traveled.

Finally, considering higher-order dynamics when training DRL agents (DRL-2) improves the collision avoidance performance. However, it also increases the average time to goal and traveled distance, meaning a more conservative policy that still under-performs GO-MPC in each metric.

Time to Goal (mean std) [s] % failures (% collisions / % deadlocks) Traveled Distance (mean std) [m]
# agents 6 8 10 6 8 10 6 8 10
Mixed Agents
MPC 11.2 2.2 11.3 2.4 11.0 2.2 13 (0 / 0) 22 (0 / 0) 22 (22 / 0) 12.24 2.3 12.40 2.5 12.13 2.3
DRL [10] 13.7 3.0 13.7 3.1 14.4 3.3 17 (17 / 0) 23 (23 / 0) 29 (29 / 0) 13.75 3.3 13.80 4.0 14.40 3.3
DRL-2 [10]+ 15.3 2.3 16.1 2.2 16.7 2.2 6 (6 / 0) 10 (10 / 0) 13 (13 / 0) 14.86 2.3 16.05 2.2 16.66 2.2
GO-MPC 12.7 2.7 12.9 2.8 13.3 2.8 0 (0 / 0) 0 (0 / 0) 0 (0 / 0) 13.65 2.7 13.77 2.8 14.29 2.8
MPC 17.37 2.9 16.38 1.5 16.64 1.7 30 (29 / 1) 36 (25 / 11) 35 (28 / 7) 11.34 2.1 10.86 2.3 10.62 2.8
DRL [10] 14.18 2.4 14.40 2.7 14.64 3.3 16 (14 / 2) 20 (18 / 2) 20 (20 / 0) 12.81 2.3 12.23 2.3 12.23 3.2
DRL-2 [10]+ 15.96 3.1 17.47 4.2 15.96 4.5 17 (11 / 6) 29 (21 / 8) 28 (24 / 4) 15.17 3.0 15.85 4.2 15.40 4.5
GO-MPC 13.77 2.9 14.30 3.3 14.63 2.9 0 (0 / 0) 0 (0 / 0) 2 (1 / 1) 14.67 2.9 15.09 3.3 15.12 2.9
TABLE III: Statistics for 200 runs of proposed method (GO-MPC) compared to baselines (MPC, DRL [10] and DRL-2, an extension of [10]): time to goal and traveled distance for the successful episodes, and number of episodes resulting in collision for agents. For the mixed setting, 80% of agents are cooperative, and 20% are non-cooperative.

V Conclusions & Future Work

This paper introduced a subgoal planning policy for guiding a local optimization planner. We employed DRL methods to learn a subgoal policy accounting for the interaction effects among the agents. Then, we used an MPC to compute locally optimal motion plans respecting the robot dynamics and collision avoidance constraints. Learning a subgoal policy improved the collision avoidance performance among cooperative and non-cooperative agents as well as in multi-robot environments. Moreover, our approach can reduce travel time and distance in cluttered environments. Future work could account for environment constraints.


  • [1] G. Bellegarda and K. Byl (2019) Combining benefits from trajectory optimization and deep reinforcement learning. External Links: 1910.09667 Cited by: §I-A2.
  • [2] Y. Bengio, J. Louradour, R. Collobert, and J. Weston (2009) Curriculum learning. In

    Proceedings of the 26th annual international conference on machine learning

    pp. 41–48. Cited by: §III-C.
  • [3] B. Brito, B. Floor, L. Ferranti, and J. Alonso-Mora (2019) Model predictive contouring control for collision avoidance in unstructured dynamic environments. IEEE Robotics and Automation Letters 4 (4), pp. 4459–4466. Cited by: §I, §III-B2, §III-B4.
  • [4] C. Chen, Y. Liu, S. Kreiss, and A. Alahi (2019) Crowd-robot interaction: crowd-aware robot navigation with attention-based deep reinforcement learning. In 2019 International Conference on Robotics and Automation (ICRA), pp. 6015–6022. Cited by: §I-A1, §I, §IV-B.
  • [5] Y. F. Chen, M. Everett, M. Liu, and J. P. How Socially aware motion planning with deep reinforcement learning. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Cited by: §I-A1.
  • [6] Y. F. Chen, M. Liu, M. Everett, and J. P. How (2017) Decentralized non-communicating multiagent collision avoidance with deep reinforcement learning. In 2017 IEEE international conference on robotics and automation (ICRA), pp. 285–292. Cited by: §I-A1, §I, §III-A1.
  • [7] Y. F. Chen, S. Liu, M. Liu, J. Miller, and J. P. How (2016) Motion planning with diffusion maps. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Cited by: §I.
  • [8] A. Domahidi and J. Jerez (2014-07) FORCES Professional. Note: embotech GmbH ( Cited by: §IV-A.
  • [9] D. Ernst, M. Glavic, F. Capitanescu, and L. Wehenkel (2008) Reinforcement learning versus model predictive control: a comparison on a power system problem. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39 (2), pp. 517–529. Cited by: §I-A3.
  • [10] M. Everett, Y. F. Chen, and J. P. How (2021) Collision avoidance in pedestrian-rich environments with deep reinforcement learning. IEEE Access 9 (), pp. 10357–10377. External Links: Document Cited by: §I-A1, §I, §II-C, §III-A2, §III-C, Fig. 3, 5(a), 5(b), 2nd item, §IV-B, TABLE III, §IV.
  • [11] T. Fan, P. Long, W. Liu, and J. Pan Distributed multi-robot collision avoidance via deep reinforcement learning for navigation in complex scenarios. The International Journal of Robotics Research. Cited by: §I-A1, §I, §IV-B.
  • [12] T. Fan, P. Long, W. Liu, and J. Pan Fully distributed multi-robot collision avoidance via deep reinforcement learning for safe and efficient navigation in complex scenarios. arXiv preprint arXiv:1808.03841. Cited by: §II-C.
  • [13] F. Farshidian, D. Hoeller, and M. Hutter (2019) Deep value model predictive control. arXiv preprint arXiv:1910.03358. Cited by: §I-A2, §I-A3.
  • [14] C. Greatwood and A. G. Richards (2019) Reinforcement learning and model predictive control for robust embedded quadrotor guidance and control. Autonomous Robots 43 (7), pp. 1681–1693. Cited by: §I-A3.
  • [15] D. Helbing and P. Molnar (1995) Social force model for pedestrian dynamics. Physical review E 51 (5), pp. 4282. Cited by: §I-A1.
  • [16] L. Hewing, K. P. Wabersich, M. Menner, and M. N. Zeilinger (2020) Learning-based model predictive control: toward safe learning in control. Annual Review of Control, Robotics, and Autonomous Systems 3, pp. 269–296. Cited by: §I-A2.
  • [17] A. Hill, A. Raffin, M. Ernestus, A. Gleave, A. Kanervisto, R. Traore, P. Dhariwal, C. Hesse, O. Klimov, A. Nichol, M. Plappert, A. Radford, J. Schulman, S. Sidor, and Y. Wu (2018) Stable baselines. GitHub. Cited by: §IV-A.
  • [18] Z. Hong, J. Pajarinen, and J. Peters (2019) Model-based lookahead reinforcement learning. External Links: 1908.06012 Cited by: §I-A3.
  • [19] S. Kapturowski, G. Ostrovski, W. Dabney, J. Quan, and R. Munos (2019) Recurrent experience replay in distributed reinforcement learning. In International Conference on Learning Representations, External Links: Link Cited by: §III-A2.
  • [20] N. Karnchanachari, M. I. Valls, D. Hoeller, and M. Hutter (2020) Practical reinforcement learning for mpc: learning from sparse objectives in under an hour on a real robot. External Links: 2003.03200 Cited by: §I-A2.
  • [21] B. Kim and J. Pineau (2016) Socially adaptive path planning in human environments using inverse reinforcement learning. International Journal of Social Robotics 8 (1), pp. 51–66. Cited by: §I-A1.
  • [22] S. M. LaValle Planning algorithms. Cambridge university press. Cited by: §II-B.
  • [23] S. Levine and V. Koltun (2013) Guided policy search. In International Conference on Machine Learning, pp. 1–9. Cited by: §I-A3.
  • [24] S. Levine and V. Koltun (2013) Variational policy search via trajectory optimization. In Advances in neural information processing systems, Cited by: §I-A3.
  • [25] K. Lowrey, A. Rajeswaran, S. Kakade, E. Todorov, and I. Mordatch (2018) Plan online, learn offline: efficient learning and exploration via model-based control. arXiv preprint arXiv:1811.01848. Cited by: §I-A2.
  • [26] N. Mansard, A. DelPrete, M. Geisert, S. Tonneau, and O. Stasse (2018) Using a memory of motion to efficiently warm-start a nonlinear predictive controller. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 2986–2993. Cited by: §I-A2.
  • [27] C. I. Mavrogiannis and R. A. Knepper (2019) Multi-agent path topology in support of socially competent navigation planning. The International Journal of Robotics Research 38 (2-3), pp. 338–356. Cited by: §I-A1.
  • [28] C. I. Mavrogiannis, W. B. Thomason, and R. A. Knepper (2018) Social momentum: a framework for legible navigation in dynamic multi-agent environments. In Proceedings of the 2018 ACM/IEEE International Conference on Human-Robot Interaction, pp. 361–369. Cited by: §I-A1.
  • [29] I. Mordatch and E. Todorov (2014) Combining the benefits of function approximation and trajectory optimization.. In Robotics: Science and Systems, Vol. 4. Cited by: §I-A3.
  • [30] A. Nagabandi, G. Kahn, R. S. Fearing, and S. Levine (2017) Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. External Links: 1708.02596 Cited by: §I-A2.
  • [31] B. Paden, M. Čáp, S. Z. Yong, D. Yershov, and E. Frazzoli (2016) A survey of motion planning and control techniques for self-driving urban vehicles. IEEE Transactions on intelligent vehicles. Cited by: §I.
  • [32] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel (2015) High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438. Cited by: §III-C.
  • [33] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. External Links: 1707.06347 Cited by: §III-C, §III-C, 23.
  • [34] L. Tai, J. Zhang, M. Liu, and W. Burgard (2018)

    Socially compliant navigation through raw depth inputs with generative adversarial imitation learning

    In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1111–1117. Cited by: §I-A1.
  • [35] P. Trautman, J. Ma, R. M. Murray, and A. Krause (2015) Robot navigation in dense human crowds: statistical models and experimental studies of human–robot cooperation. The International Journal of Robotics Research 34 (3), pp. 335–356. Cited by: §I-A1.
  • [36] J. Van Den Berg, S. J. Guy, M. Lin, and D. Manocha (2011) Reciprocal n-body collision avoidance. In Robotics research, Cited by: §I-A1, footnote 1.
  • [37] J. Van den Berg, M. Lin, and D. Manocha (2008) Reciprocal velocity obstacles for real-time multi-agent navigation. In 2008 IEEE International Conference on Robotics and Automation, pp. 1928–1935. Cited by: §II-C, §IV-D.
  • [38] J. Van Den Berg, J. Snape, S. J. Guy, and D. Manocha Reciprocal collision avoidance with acceleration-velocity obstacles. In 2011 IEEE International Conference on Robotics and Automation, Cited by: §I-A1.
  • [39] A. Vemula, K. Muelling, and J. Oh (2017) Modeling cooperative navigation in dense human crowds. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 1685–1692. Cited by: §I-A1.
  • [40] T. Wang and J. Ba Exploring model-based planning with policy networks. In International Conference on Learning Representations, Cited by: §I-A3.
  • [41] M. Zanon and S. Gros (2020) Safe reinforcement learninge using robust mpc. IEEE Transactions on Automatic Control. Cited by: §I-A2.
  • [42] M. Zhong, M. Johnson, Y. Tassa, T. Erez, and E. Todorov (2013) Value function approximation and model predictive control. In 2013 IEEE symposium on adaptive dynamic programming and reinforcement learning (ADPRL), pp. 100–107. Cited by: §I-A2.