I Introduction
Autonomous robot navigation in crowds remains difficult due to the interaction effects among navigating agents. Unlike in multi-robot environments, robots operating among pedestrians require decentralized algorithms that can handle a mixture of other agents' behaviors without depending on explicit communication between agents.
Several state-of-the-art collision avoidance methods employ model-predictive control (MPC) with online optimization to compute motion plans that are guaranteed to respect important constraints [31]. These constraints could include the robot's nonlinear kinodynamic model or collision avoidance of static obstacles and other dynamic, decision-making agents (e.g., pedestrians). Although online optimization becomes less computationally practical for extremely dense scenarios, modern solvers enable real-time motion planning in many situations of interest [3].
A key challenge is that the robot's global goal is often located far beyond the planning horizon, meaning that a local subgoal or cost-to-go heuristic must be specified instead. This is straightforward in a static environment (e.g., using Euclidean or diffusion [7] distance), but the presence of interactive agents makes it difficult to quantify which subgoals will lead to the global goal quickest. A body of work addresses this challenge with deep reinforcement learning (RL), in which agents learn a model of the long-term cost of actions in an offline training phase (usually in simulation) [6, 4, 10, 11]. The learned model is fast to query during online execution, but the way learned costs/policies have been used to date does not provide guarantees on collision avoidance or feasibility with respect to the robot dynamics. In this paper, we introduce Goal-Oriented Model Predictive Control (GO-MPC), which enhances state-of-the-art online optimization-based planners with a learned global guidance policy. In an offline RL training phase, an agent learns a policy that uses the current world configuration (the states of the robot and other agents, and a global goal) to recommend a local subgoal for the MPC, as depicted in Fig. 1. Then, the MPC generates control commands ensuring that the robot dynamics and collision avoidance constraints are satisfied (if a feasible solution is found) while making progress towards the suggested subgoal. Our approach maintains the kinodynamic feasibility and collision avoidance guarantees inherent in an MPC formulation, while improving the average time-to-goal and success rate by leveraging past experience in crowded situations.
The main contributions of this work are:

A goal-oriented Model Predictive Control method (GO-MPC) for navigation among interacting agents, which utilizes a learned global guidance policy (recommended subgoal) in its cost function and ensures that dynamic feasibility and collision avoidance constraints are satisfied whenever a feasible solution to the optimization problem is found;

An algorithm to train an RL agent jointly with an optimization-based controller in mixed environments, which is directly applicable to real hardware, reducing the sim-to-real gap.
Finally, we present simulation results demonstrating an improvement over several state-of-the-art methods in challenging scenarios with realistic robot dynamics and a mixture of cooperative and non-cooperative neighboring agents. Our approach exhibits different navigation behaviors: navigating through the crowd when interacting with cooperative agents, avoiding congestion areas when non-cooperative agents are present, and enabling communication-free decentralized multi-robot collision avoidance.
I-A Related Work
I-A.1 Navigation Among Crowds
Past work on navigation in cluttered environments often focuses on interaction models using geometry [36, 38], physics [15], topologies [27, 28], hand-crafted functions [35], and cost functions [21] or joint probability distributions [39] learned from data. While accurate interaction models are critical for collision avoidance, this work emphasizes that the robot's performance (time-to-goal) is highly dependent on the quality of its cost-to-go model (i.e., the module that recommends a subgoal for the local planner). Designing a useful cost-to-go model for this problem remains challenging, as it requires quantifying how "good" a robot's configuration is with respect to dynamic, decision-making agents. In [6], deep RL was introduced as a way of modeling the cost-to-go through an offline training phase; the online execution used simple vehicle and interaction models for collision-checking. Subsequent works incorporated other interactions to generate more socially compliant behavior within the same framework [5, 4]. To relax the need for simple online models, [10] moved the collision-checking to the offline training phase. While these approaches use preprocessed information typically available from perception pipelines (e.g., pedestrian detection and tracking systems), other works proposed learning end-to-end policies [11, 34]. Although all of these RL-based approaches learn to estimate the cost-to-go, the online implementations do not provide guarantees that the recommended actions will satisfy realistic vehicle dynamics or collision avoidance constraints. Thus, this work builds on the promising idea of learning a cost-to-go model, but starts from an inherently safe MPC formulation.
I-A.2 Learning-Enhanced MPC
Outside the context of crowd navigation, numerous recent works have proposed learning-based solutions to overcome some of the known limitations of optimization-based methods (e.g., nonlinear MPC) [16]. For example, solvers are often sensitive to the quality of the initial guess; hence, [26] proposes to learn a policy from data that efficiently "warm-starts" an MPC. Model inaccuracies can lead to suboptimal MPC solution quality; [1] proposes to learn a policy by choosing, at each timestep, the action with the best expected reward from two candidates: one from model-free RL and one from a model-based trajectory optimizer. Alternatively, RL can be used to optimize the weights of an MPC-based Q-function approximator or to update a robust MPC parametrization [41]. When the model is completely unknown, [30] shows a way of learning a dynamics model to be used in MPC. Computation time is another key challenge: [42] learns a cost-to-go estimator to enable shortening of the planning horizon without sacrificing much solution quality, although that approach differs from this work in that it uses local and linear function approximators, which limits its applicability to high-dimensional state spaces. The closest related works address cost tuning with learning. MPC cost functions are replaced with a value function learned via RL offline in [25] (terminal cost) and [13] (stage cost). [20] deployed value function learning on a real robot, outperforming an expert-tuned MPC. While these ideas also use RL to obtain a better cost-to-go model, this work focuses on the technical challenge of learning a subgoal policy for navigation through crowds, avoiding value-function approximation issues and extrapolation issues in unseen events. Moreover, this work learns to set terminal constraints rather than setting a cost with a value function.
I-A.3 Combining MPC with RL
Recently, there has been increasing interest in approaches combining the strengths of MPC and RL, as suggested in [9]. For instance, optimization-based planning has been used to explore high-reward regions and distill the knowledge into a policy neural network, rather than using a neural network policy to improve an optimization [23, 24, 29]. Similar to our approach, [18] utilizes the RL policy during training to ensure exploration and employs an MPC to optimize sampled trajectories from the learned policy at test time. Moreover, policy networks have been used to generate proposals for a sampling-based MPC [40], or to select goal positions from a predefined set [14].
Nevertheless, to the best of our knowledge, approaches combining the benefits of both optimization- and learning-based methods have not been explored in the context of crowd navigation. Moreover, the works exploring the similar idea of learning a cost-to-go model do not allow one to explicitly define collision constraints and ensure safety. To overcome these issues, in this paper we explore the idea of learning a cost-to-go model that directly generates subgoal positions leading to higher long-term rewards, while giving the role of local collision avoidance and kinematic constraints to an optimization-based planner.
Such cost-to-go information can be formulated as learning a value function over the ego-agent's state space, providing information about which states are more valuable [13]. In contrast, we propose to learn a policy that directly informs which actions lead to higher rewards, which allows us to directly incorporate the MPC controller in the training phase.
II Preliminaries
Throughout this paper, vectors are denoted by bold lowercase letters, $\mathbf{x}$, matrices by capital letters, $M$, and sets by calligraphic uppercase letters, $\mathcal{S}$. $\|\mathbf{x}\|$ denotes the Euclidean norm of $\mathbf{x}$ and $\|\mathbf{x}\|_Q^2 = \mathbf{x}^\top Q \mathbf{x}$ denotes the weighted squared norm. The variables $s_t$, $a_t$ denote the state and action used in the RL formulation, and $z_k$, $u_k$ denote the control state and control input used in the optimization problem.
II-A Problem Formulation
Consider a scenario where a robot must navigate from an initial position $p_0$ to a goal position $g$ on the plane, surrounded by $n$ non-communicating agents. At each timestep $t$, the robot first observes its state $s_t$ (defined in Sec. III-A) and the set of the other agents' states $S_t = \{s_t^1, \dots, s_t^n\}$, then takes an action $a_t$, leading to the immediate reward $r_t$ and the next state $s_{t+1}$ under the transition model $P(s_{t+1} \mid s_t, a_t)$.
We use the superscript $i$ to denote the $i$-th nearby agent and omit the superscript when referring to the robot. For each agent $i$, $p_t^i$ denotes its position and $v_t^i$ its velocity at step $t$ relative to an inertial frame, and $r^i$ the agent radius. We assume that each agent's current position and velocity are observed (e.g., with onboard sensing), while other agents' motion intentions (e.g., goal positions) are unknown. Finally, we denote the areas occupied by the robot and by each surrounding agent at timestep $t$. The goal is to learn a policy $\pi$ for the robot that minimizes time to goal while ensuring collision-free motions, defined as:
$$\min_{\pi} \ \mathbb{E}\left[\, t_g \mid a_t \sim \pi(\cdot \mid s_t, S_t) \,\right]$$
s.t. $\quad s_{t+1} = f(s_t, a_t)$, (1a)
$\quad s_{t_g} = s_g$, (1b)
$\quad \|p_t - p_t^i\|_2 \ge r + r^i, \ \forall i \in \{1, \dots, n\}$, (1c)
$\quad s_t \in \mathcal{S}, \ a_t \in \mathcal{A}, \ z_t \in \mathcal{Z}$, (1d)
where (1a) are the transition dynamic constraints under the dynamics model $f$, (1b) the terminal constraints, (1c) the collision avoidance constraints, and $\mathcal{S}$, $\mathcal{A}$ and $\mathcal{Z}$ are the set of admissible states, the set of admissible inputs (e.g., to limit the robot's maximum speed), and the set of admissible control states, respectively. Note that we only constrain the control states of the robot. Moreover, we assume other agents may have various behaviors (e.g., cooperative or non-cooperative): each agent samples a policy from a closed set of policies (defined in Sec. II-C) at the beginning of each episode.
II-B Agent Dynamics
Real robotic systems' inertia imposes limits on linear and angular acceleration. Thus, we assume a second-order unicycle model for the robot [22]:

$$\dot{x} = v \cos\psi, \quad \dot{y} = v \sin\psi, \quad \dot{\psi} = \omega, \quad \dot{v} = u_a, \quad \dot{\omega} = u_\alpha, \tag{2}$$

where $x$ and $y$ are the agent position coordinates and $\psi$ is the heading angle in a global frame; $v$ is the agent forward velocity, $\omega$ denotes the angular velocity, and $u_a$, $u_\alpha$ the linear and angular acceleration, respectively.
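A minimal Euler-integration sketch of this model, under a simple fixed-step discretization (the function name and step size are illustrative, not from the paper):

```python
import math

def unicycle_step(state, u, dt):
    """One Euler step of the second-order unicycle model of Eq. (2).

    state = (x, y, psi, v, omega); u = (u_a, u_alpha)."""
    x, y, psi, v, omega = state
    u_a, u_alpha = u
    return (
        x + v * math.cos(psi) * dt,   # x_dot = v cos(psi)
        y + v * math.sin(psi) * dt,   # y_dot = v sin(psi)
        psi + omega * dt,             # psi_dot = omega
        v + u_a * dt,                 # v_dot = u_a
        omega + u_alpha * dt,         # omega_dot = u_alpha
    )
```

An MPC discretizes these dynamics in the same way (possibly with a higher-order integrator) to form the equality constraints along the planning horizon.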
II-C Modeling Other Agents' Behaviors
In a real scenario, agents may follow different policies and show different levels of cooperation. Hence, in contrast to previous approaches [10, 12], we do not assume all agents follow the same policy. At the beginning of an episode, each non-ego agent follows either a cooperative or a non-cooperative policy. For the cooperative policy, we employ the Reciprocal Velocity Obstacle (RVO) [37] model with a random cooperation coefficient (following the notation of [36]) sampled at the beginning of the episode. The "reciprocal" in RVO means that all agents follow the same policy and use the cooperation coefficient to split the collision avoidance effort among the agents (e.g., a coefficient of 0.5 means that each agent will apply half of the effort to avoid the other). For the non-cooperative agents, we consider both constant-velocity (CV) and non-CV policies. Agents following a CV model drive straight toward their goal position with constant velocity. Agents following a non-CV policy either move in sinusoids toward their final goal position or in a circular motion around their initial position.
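The behavior mixture above can be sketched as follows; the coefficient range and the helper names are assumptions for illustration, the 80/20 split matches the training distribution described later (Sec. IV-B), and a real RVO update would come from a library such as RVO2 rather than being implemented here:

```python
import math
import random

def sample_agent_behavior(rng=random):
    """Sample one non-ego agent's policy type (Sec. II-C): cooperative RVO
    agents get a random cooperation coefficient; the rest follow a
    constant-velocity (CV) or non-CV (sinusoid/circle) motion.
    The coefficient range is an assumed placeholder."""
    if rng.random() < 0.8:
        return ("rvo", rng.uniform(0.1, 1.0))
    if rng.random() < 0.5:
        return ("cv", None)
    return (rng.choice(["sinusoid", "circle"]), None)

def cv_step(pos, goal, speed, dt):
    """Constant-velocity agent: drive straight toward the goal at fixed speed."""
    dx, dy = goal[0] - pos[0], goal[1] - pos[1]
    d = math.hypot(dx, dy)
    if d < 1e-9:
        return pos
    return (pos[0] + speed * dx / d * dt, pos[1] + speed * dy / d * dt)
```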
III Method
Learning a sequence of intermediate goal states that lead an agent toward a final goal destination can be formulated as a single-agent sequential decision-making problem. Because parts of the environment can be difficult to model explicitly, the problem can be solved with a reinforcement learning framework. Hence, we propose a two-level planning architecture, as depicted in Fig. 1, consisting of a subgoal recommender (Sec. III-A) and an optimization-based motion planner (Sec. III-B). We start by defining the RL framework and our policy architecture (Sec. III-A). Then, we formulate the MPC that executes the policy's actions and ensures local collision avoidance (Sec. III-B).
III-A Learning a Subgoal Recommender Policy
We aim to develop a decision-making algorithm that provides an estimate of the cost-to-go in dynamic environments with mixed agent behaviors. In this paper, we propose to learn a policy that directly informs which actions lead to higher rewards.
III-A.1 RL Formulation
As in [6], the observation vector is composed of the ego-agent's state and the surrounding agents' states, defined as:

$$o_t = [s_t, s_t^1, \dots, s_t^n] \tag{3}$$

where $s_t$ is the ego-agent state and $s_t^i$ the $i$-th agent state at step $t$. Moreover, $d_g = \|p_t - g\|_2$ is the ego-agent's distance to the goal and $d^i = \|p_t - p_t^i\|_2$ is the distance to the $i$-th agent.
Here, we seek to learn the optimal policy $\pi^*$ for the ego-agent, mapping the ego-agent's observation of the environment to a probability distribution over actions. We consider a continuous action space $\mathcal{A} \subset \mathbb{R}^2$ and define an action $a_t$ as position increments providing the direction that maximizes the ego-agent's reward, defined as:

$$a_t = [\Delta x_t, \Delta y_t] \sim \pi_\theta(\cdot \mid o_t), \tag{4a}$$
$$\|a_t\|_2 \le v_{\max} N \Delta t, \tag{4b}$$
$$p_{\text{ref}} = p_t + a_t, \tag{4c}$$

where $[\Delta x_t, \Delta y_t]$ are the position increments, $v_{\max}$ is the maximum linear velocity, and $\theta$ are the network policy parameters. Moreover, to ensure that the next subgoal position is within the planning horizon of the ego-agent, we bound the action space in accordance with the planning horizon of the optimization-based planner and its dynamic constraints, as represented in Eq. (4b).
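The bound of Eq. (4b) can be implemented as a simple clipping of the recommended increment; treating the reachable radius as $v_{\max} N \Delta t$ is an assumption consistent with the text, and the function name is illustrative:

```python
import math

def clip_subgoal(delta, v_max, horizon_time):
    """Clip a recommended position increment (dx, dy) so the subgoal stays
    within the region reachable over the planning horizon; horizon_time
    plays the role of N * dt in Eq. (4b)."""
    r_max = v_max * horizon_time
    norm = math.hypot(delta[0], delta[1])
    if norm <= r_max or norm < 1e-9:
        return delta                      # already reachable (or zero increment)
    scale = r_max / norm                  # shrink onto the reachable disk
    return (delta[0] * scale, delta[1] * scale)
```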
We design the reward function to motivate the ego-agent to reach its goal position while penalizing collisions:

$$r(o_t) = \begin{cases} r_{\text{goal}} & \text{if } p_t = g \\ r_{\text{coll}} & \text{if } d_{\min} < r + r^i \\ r_\beta & \text{otherwise} \end{cases} \tag{5}$$

where $d_{\min}$ is the distance to the closest surrounding agent, $r_\beta$ allows adapting the reward function as shown in the ablation study (Sec. IV-C), $r_{\text{goal}}$ rewards the agent if it reaches the goal, and $r_{\text{coll}}$ penalizes collisions with any other agent. In Sec. IV-C we analyze its influence on the behavior of the learned policy.
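A sketch of this reward, covering the three variants studied in the ablation of Sec. IV-C; the numeric values of `r_goal`, `r_coll`, `beta`, and the goal tolerance are illustrative placeholders, not the paper's tuned constants:

```python
def reward(dist_to_goal, prev_dist_to_goal, closest_agent_dist, radius_sum,
           variant="time", r_goal=1.0, r_coll=-0.25, beta=0.01, goal_tol=0.1):
    """Reward sketch for Eq. (5): r_goal on reaching the goal, r_coll on
    collision, and a variant-dependent shaping term r_beta otherwise."""
    if closest_agent_dist < radius_sum:          # collision with the nearest agent
        return r_coll
    if dist_to_goal < goal_tol:                  # goal reached (tolerance assumed)
        return r_goal
    if variant == "sparse":
        return 0.0                               # nonzero only at goal/collision
    if variant == "time":
        return -beta                             # constant per-step penalty
    if variant == "progress":
        return beta * (prev_dist_to_goal - dist_to_goal)  # reward motion toward goal
    raise ValueError(variant)
```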
III-A.2 Policy Network Architecture
A key challenge in collision avoidance among pedestrians is that the number of nearby agents can vary between timesteps. Because feedforward NNs require a fixed input vector size, prior work [10] proposed the use of Recurrent Neural Networks (RNNs) to compress the agent states into a fixed-size vector at each timestep. Yet, that approach discarded the time-dependencies of successive observations (i.e., the hidden states of the recurrent cells). Here, we use the "store-state" strategy proposed in [19]. During the rollout phase, at each timestep we store the hidden state of the RNN together with the current state, the other agents' states, the immediate reward, and the next state. Moreover, the previous hidden state is fed back to warm-start the RNN in the next step, as depicted in Fig. 2. During the training phase, we use the stored hidden states to initialize the network. Our policy architecture is depicted in Fig. 2. We employ an RNN to encode a variable sequence of the other agents' states and model the existing time-dependencies. Then, we concatenate the fixed-length representation of the other agents' states with the ego-agent's state to create a joint state representation. This representation vector is fed to two fully connected layers (FCL). The network has two output heads: one estimates the probability distribution parameters of the policy's action space and the other estimates the state-value function; $\mu$ and $\sigma$ are the mean and variance of the policy's distribution, respectively.
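The "store-state" bookkeeping above reduces to saving the pre-step hidden state with every transition and feeding it back at the next step. A minimal sketch with stand-in callables for the environment and recurrent policy (all names are illustrative):

```python
def rollout_with_stored_state(env_step, policy, h0, s0, T):
    """'Store-state' rollout sketch (following [19]): the RNN hidden state is
    saved with every transition and fed back at the next step, so training can
    re-initialize the recurrent cells exactly as during collection.

    policy(s, h) -> (action, next hidden state); env_step(s, a) -> (s', r)."""
    buffer, h, s = [], h0, s0
    for _ in range(T):
        a, h_next = policy(s, h)
        s_next, r = env_step(s, a)
        buffer.append((h, s, a, r, s_next))   # store the *pre-step* hidden state
        h, s = h_next, s_next
    return buffer
```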
III-B Local Collision Avoidance: Model Predictive Control
Here, we employ MPC to generate locally optimal commands respecting the kinodynamic and collision avoidance constraints. To simplify notation, hereafter we take the current timestep as zero.
III-B.1 State and Control Inputs
We define the ego-agent control input vector as $u = [u_a, u_\alpha]$ and the control state as $z = [x, y, \psi, v, \omega]$, following the dynamics model defined in Sec. II-B.
III-B.2 Dynamic Collision Avoidance
We define a set of nonlinear constraints to ensure that the MPC generates collision-free control commands for the ego-agent (if a feasible solution exists). To limit the problem complexity and ensure a solution can be found in real time, we consider a limited number $\hat{n}$ of surrounding agents, with $\hat{n} \le n$. Consider $\mathcal{X}$ as the set of all surrounding agents' states; then the set of the $\hat{n}$ closest agents is:
Definition 1.
A set $\hat{\mathcal{X}} \subseteq \mathcal{X}$ with $|\hat{\mathcal{X}}| = \hat{n}$ is the set of the $\hat{n}$ closest agents if, for every agent $i \in \hat{\mathcal{X}}$ and every agent $j \in \mathcal{X} \setminus \hat{\mathcal{X}}$, the Euclidean distances satisfy $\|p - p^i\|_2 \le \|p - p^j\|_2$.
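Definition 1 amounts to sorting the surrounding agents by Euclidean distance to the ego-agent and keeping the first $\hat{n}$; a minimal sketch (function and argument names are illustrative):

```python
import math

def closest_agents(ego_pos, agent_positions, n):
    """Return the indices of the n agents closest to the ego-agent
    (Definition 1), sorted by increasing Euclidean distance."""
    dists = [(math.hypot(p[0] - ego_pos[0], p[1] - ego_pos[1]), i)
             for i, p in enumerate(agent_positions)]
    dists.sort()                       # sort by distance, index breaks ties
    return [i for _, i in dists[:n]]
```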
We represent the area occupied by each agent $i$ as a circle with radius $r^i$. To ensure collision-free motions, we impose that each such circle does not intersect with the area occupied by the ego-agent, resulting in the following set of inequality constraints:

$$\|p_k - p_k^i\|_2 \ge r + r^i, \quad \forall i \in \{1, \dots, \hat{n}\}, \tag{6}$$

for each planning step $k \in \{0, \dots, N\}$. This formulation can be extended to agents with general quadratic shapes, as in [3].
III-B.3 Cost Function
The subgoal recommender provides a reference position $p_{\text{ref}}$ guiding the ego-agent toward the final goal position and minimizing the cost-to-go while accounting for the other agents. The terminal cost is defined as the normalized distance between the ego-agent's terminal position $p_N$ (after a planning horizon $N$) and the reference position (with weight coefficient $Q_N$):

$$J^N = Q_N \frac{\|p_N - p_{\text{ref}}\|_2}{\|p_0 - p_{\text{ref}}\|_2}. \tag{7}$$
To ensure smooth trajectories, we define the stage cost as a quadratic penalty on the ego-agent control commands

$$J_k^u = \|u_k\|_{Q_u}^2, \tag{8}$$

where $Q_u$ is the weight coefficient.
III-B.4 MPC Formulation
The MPC is then defined as the non-convex optimization problem

$$\min_{z_{1:N},\, u_{0:N-1}} \ J^N(z_N) + \sum_{k=0}^{N-1} J_k^u(u_k) \tag{9}$$
$$\text{s.t.} \quad z_{k+1} = f(z_k, u_k), \quad \|p_k - p_k^i\|_2 \ge r + r^i \ \ \forall i \in \{1, \dots, \hat{n}\}, \quad u_k \in \mathcal{U}, \ z_k \in \mathcal{Z}.$$
In this work, we assume a constant-velocity model to estimate the other agents' future positions, as in [3].
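This constant-velocity prediction, used to evaluate the collision constraints of Eq. (6) along the horizon, can be sketched as follows (function name and arguments are illustrative):

```python
def predict_constant_velocity(pos, vel, horizon_steps, dt):
    """Constant-velocity prediction of another agent's future positions over
    the MPC horizon (Sec. III-B.4, following [3])."""
    return [(pos[0] + vel[0] * k * dt, pos[1] + vel[1] * k * dt)
            for k in range(1, horizon_steps + 1)]
```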
III-C PPO-MPC
In this work, we train the policy using a state-of-the-art method, Proximal Policy Optimization (PPO) [33], but the overall framework is agnostic to the specific RL training algorithm. We propose to jointly train the guidance policy and value function with the MPC, as opposed to prior works [10] that use an idealized low-level controller during policy training (one that cannot be implemented on a real robot). Algorithm 1 describes the proposed training strategy and has two main phases: supervised and RL training. First, we randomly initialize the policy and value function parameters. Then, at the beginning of each episode, we randomly select the number of surrounding agents, the training scenario, and the surrounding agents' policies. More details about the different training scenarios and the numbers of agents considered are given in Sec. IV-B.
An initial RL policy is unlikely to lead an agent to a goal position. Hence, during the warm-start phase, we use the MPC as an expert and perform supervised training of the policy and value function parameters for a fixed number of steps. By setting the MPC goal state to the ego-agent's final goal state and solving the MPC problem, we obtain a locally optimal sequence of control states. For each step, we define and store the tuple containing the network hidden state, state, next state, and reward in a buffer. Then, we compute advantage estimates [32] and perform a supervised training step

$$\min_{\phi} \sum_{t} \left( V_\phi(o_t) - \hat{R}_t \right)^2, \tag{10}$$
$$\min_{\theta} \sum_{t} \left\| a_t^{\text{MPC}} - \mu_\theta(o_t) \right\|^2, \tag{11}$$

where $\phi$ and $\theta$ are the value function and policy parameters, respectively, $\hat{R}_t$ is the return estimate, and $a_t^{\text{MPC}}$ the expert action. Note that $V_\phi$ and $\pi_\theta$ share the same parameters except for the final layer, as depicted in Fig. 2. Afterwards, we use Proximal Policy Optimization (PPO) [33] with a clipped objective to train the policy. PPO is an on-policy method addressing the high-variance issue of policy gradient methods for continuous control problems. We refer the reader to [33] for more details about the method's equations. Note that our approach is agnostic to which RL algorithm is used. Moreover, to increase the learning speed during training, we gradually increase the number of agents in the training environments (curriculum learning [2]).
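The two-phase structure of Algorithm 1 can be sketched as a training skeleton. All callables are stand-ins for the paper's components (a rollout with the MPC in the loop, the supervised warm-start step of Eqs. (10)-(11), and the PPO update), and the curriculum schedule and agent-count range are assumptions:

```python
import random

def train(num_episodes, warmstart_episodes, supervised_update, ppo_update,
          rollout, max_agents_schedule, seed=0):
    """Two-phase training skeleton of Algorithm 1: a supervised warm-start
    phase imitating the MPC expert, then PPO updates, with a curriculum over
    the number of surrounding agents. Returns (supervised steps, PPO steps)."""
    rng = random.Random(seed)
    n_sup = n_rl = 0
    for ep in range(num_episodes):
        # curriculum learning [2]: the admissible number of agents grows with training
        n_agents = rng.randint(2, max_agents_schedule(ep))
        buffer = rollout(n_agents)          # transitions (hidden state, s, a, r, s')
        if ep < warmstart_episodes:
            supervised_update(buffer)       # regress policy/value toward the MPC expert
            n_sup += 1
        else:
            ppo_update(buffer)              # clipped-objective PPO step
            n_rl += 1
    return n_sup, n_rl
```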
IV Results
This section quantifies the performance throughout the training procedure, provides an ablation study, and compares the proposed method, both qualitatively (sample trajectories) and numerically, against the following baseline approaches:

MPC: the Model Predictive Controller from Section III-B with the final goal position as its position reference;

DRL [10]: a state-of-the-art Deep Reinforcement Learning approach for multi-agent collision avoidance.
To analyze the impact of a realistic kinematic model during training, we consider two variants of the DRL method [10]: the same RL algorithm [10] was used to train a policy under a first-order unicycle model, referred to as DRL, and under a second-order unicycle model (Eq. (2)), referred to as DRL2. All experiments use a second-order unicycle model (Eq. (2)) in environments with cooperative and non-cooperative agents to represent realistic robot/pedestrian behavior. Animations of sample trajectories accompany the paper.
IV-A Experimental Setup
The proposed training algorithm builds upon the open-source PPO implementation provided in the Stable-Baselines [17] package. We used a laptop with an Intel Core i7 and 32 GB of RAM for training. To solve the nonlinear and non-convex MPC problem of Eq. (9), we used the ForcesPro [8] solver. If no feasible solution is found within the maximum number of iterations, then the robot decelerates. All MPC methods used in this work consider collision constraints with up to the six closest agents so that the optimization problem can be solved in less than 20 ms. Moreover, our policy's network has an average computation time of 2 ms with a variance of 0.4 ms for all experiments. Hyperparameter values are summarized in Table I.

| Planning Horizon | 2 s | Num. mini batches | 2048 |
| Number of Stages | 20 |  | 3 |
|  | 0.99 |  | 10 |
| Clip factor | 0.1 | Learning rate |  |
IV-B Training Procedure
To train and evaluate our method, we selected four navigation scenarios, similar to [10, 4, 11]:

Symmetric swapping: Each agent’s position is randomly initialized in different quadrants of the xy plane, where all agents have the same distance to the origin. Each agent’s goal is to swap positions with an agent from the opposite quadrant.

Asymmetric swapping: As before, but all agents are located at different distances to the origin.

Pairwise swapping: Random initial positions; pairs of agents' goals are each other's initial positions;

Random: Random initial & goal positions
Each training episode consists of a random number of agents and a random scenario. At the start of each episode, each other agent's policy is sampled from a binomial distribution (80% cooperative, 20% non-cooperative). Moreover, for the cooperative agents we randomly sample a cooperation coefficient, and each non-cooperative agent is randomly assigned a CV or non-CV policy (i.e., sinusoidal or circular motion). Fig. 3 shows the evolution of the robot's average reward and the percentage of failure episodes. The top subplot compares our method's average reward with the two baseline methods: DRL (with pre-trained weights) and MPC. The average reward for the baseline methods (orange, yellow) drops as the number of agents increases (each vertical bar). In contrast, our method (blue) improves with training and eventually achieves a higher average reward for 10-agent scenarios than the baseline methods achieve for 2-agent scenarios. The bottom plot demonstrates that the percentage of collisions decreases throughout training despite the increasing number of agents.
IV-C Ablation Study
A key design choice in RL is the reward function; here, we study the impact of three reward variants on policy performance. The sparse reward provides a nonzero reward only upon reaching the goal or colliding. The time reward additionally penalizes the agent at every step until it reaches the goal. The progress reward instead rewards motion toward the goal at every step. Aggregated results in Table II show that the policy trained with the time reward function allows the robot to reach the goal in minimum time, to travel the smallest distance, and to achieve the lowest percentage of failure cases. Based on these results, we selected the policy trained with the time reward function for the subsequent experiments.
Time to Goal [s]  % failures (% collisions / % timeout)  Traveled distance Mean [m]  

# agents  6  8  10  6  8  10  6  8  10 
Sparse Reward  8.00  8.51  8.52  0 (0 / 0)  1 (0 / 1)  2 (1 / 1)  13.90  14.34  14.31 
Progress Reward  8.90  8.79  9.01  2 (1 / 1)  3 (3 / 0)  1 (1 / 0)  14.75  14.57  14.63 
Time Reward  7.69  8.03  8.12  0 (0 / 0)  0 (0 / 0)  0 (0 / 0)  13.25  14.01  14.06 
IV-D Qualitative Analysis
This section compares and analyzes trajectories for different scenarios. Fig. 4 shows that our method resolves a failure mode of both the RL and MPC baselines. The robot has to swap positions with a non-cooperative agent (red, moving right-to-left) and avoid a collision. We overlay the trajectories (moving left-to-right) performed by the robot following our method (blue) versus the baseline policies (orange, magenta). The MPC policy (orange) causes a collision due to the dynamic constraints and the limited planning horizon. The DRL policy avoids the non-cooperative agent but, due to its reactive nature, only does so when already very close, resulting in a larger travel time. Finally, when using our approach, the robot initiates a collision avoidance maneuver early enough to produce a smooth trajectory and a faster arrival at the goal.
We present results for mixed settings in Fig. 5 and homogeneous settings in Fig. 6. In mixed settings, the robot follows our proposed policy while the other agents either follow an RVO [37] or a non-cooperative policy (same distribution as in training). Fig. 5 demonstrates that our navigation policy behaves differently when dealing with only cooperative agents versus both cooperative and non-cooperative agents. Whereas in Fig. 5(a) the robot navigates through the crowd, Fig. 5(b) shows that the robot takes a longer path to avoid the congestion.
In the homogeneous setting, all agents follow our proposed policy. Fig. 6 shows that our method achieves a faster time-to-goal than the two DRL baselines. Note that this scenario was never introduced during the training phase, nor have the agents ever experienced other agents with the same policy before. Following the DRL policy (Fig. 6(a)), all agents navigate straight to their goal positions, leading to congestion in the center with reactive avoidance. The trajectories from the DRL2 approach (Fig. 6(b)) are more conservative, due to the limited acceleration available. In contrast, the trajectories generated by our approach (Fig. 6(c)) present a balance between going straight to the goal and avoiding congestion in the center, allowing the agents to reach their goals faster and with a smaller distance traveled.
IV-E Performance Results
This section aggregates performance of the various methods across 200 random scenarios. Performance is quantified by average time to reach the goal position, percentage of episodes that end in failures (either collision or timeout), and the average distance traveled.
The numerical results are summarized in Table III. Our method outperforms each baseline for both mixed and homogeneous scenarios. To evaluate statistical significance, we performed pairwise Mann–Whitney U-tests between GO-MPC and each baseline (95% confidence). GO-MPC shows statistically significant performance improvements over the DRL2 baseline in terms of travel time and distance, and over the DRL baseline in terms of travel time for six agents and travel distance for ten agents. For homogeneous scenarios, GO-MPC is more conservative than the DRL and MPC baselines, resulting in a larger average traveled distance. Nevertheless, GO-MPC reaches the goals faster than each baseline and is less conservative than DRL2, as measured by a significantly lower average distance traveled.
Finally, considering higher-order dynamics when training DRL agents (DRL2) improves the collision avoidance performance. However, it also increases the average time to goal and traveled distance, yielding a more conservative policy that still underperforms GO-MPC in each metric.
Time to Goal (mean ± std) [s]  % failures (% collisions / % deadlocks)  Traveled Distance (mean ± std) [m]  
# agents  6  8  10  6  8  10  6  8  10 
Mixed Agents  
MPC  11.2 ± 2.2  11.3 ± 2.4  11.0 ± 2.2  13 (0 / 0)  22 (0 / 0)  22 (22 / 0)  12.24 ± 2.3  12.40 ± 2.5  12.13 ± 2.3 
DRL [10]  13.7 ± 3.0  13.7 ± 3.1  14.4 ± 3.3  17 (17 / 0)  23 (23 / 0)  29 (29 / 0)  13.75 ± 3.3  13.80 ± 4.0  14.40 ± 3.3 
DRL2 [10]+  15.3 ± 2.3  16.1 ± 2.2  16.7 ± 2.2  6 (6 / 0)  10 (10 / 0)  13 (13 / 0)  14.86 ± 2.3  16.05 ± 2.2  16.66 ± 2.2 
GO-MPC  12.7 ± 2.7  12.9 ± 2.8  13.3 ± 2.8  0 (0 / 0)  0 (0 / 0)  0 (0 / 0)  13.65 ± 2.7  13.77 ± 2.8  14.29 ± 2.8 
Homogeneous  
MPC  17.37 ± 2.9  16.38 ± 1.5  16.64 ± 1.7  30 (29 / 1)  36 (25 / 11)  35 (28 / 7)  11.34 ± 2.1  10.86 ± 2.3  10.62 ± 2.8 
DRL [10]  14.18 ± 2.4  14.40 ± 2.7  14.64 ± 3.3  16 (14 / 2)  20 (18 / 2)  20 (20 / 0)  12.81 ± 2.3  12.23 ± 2.3  12.23 ± 3.2 
DRL2 [10]+  15.96 ± 3.1  17.47 ± 4.2  15.96 ± 4.5  17 (11 / 6)  29 (21 / 8)  28 (24 / 4)  15.17 ± 3.0  15.85 ± 4.2  15.40 ± 4.5 
GO-MPC  13.77 ± 2.9  14.30 ± 3.3  14.63 ± 2.9  0 (0 / 0)  0 (0 / 0)  2 (1 / 1)  14.67 ± 2.9  15.09 ± 3.3  15.12 ± 2.9 
V Conclusions & Future Work
This paper introduced a subgoal planning policy for guiding a local optimization planner. We employed DRL methods to learn a subgoal policy accounting for the interaction effects among the agents. Then, we used an MPC to compute locally optimal motion plans respecting the robot dynamics and collision avoidance constraints. Learning a subgoal policy improved the collision avoidance performance among cooperative and noncooperative agents as well as in multirobot environments. Moreover, our approach can reduce travel time and distance in cluttered environments. Future work could account for environment constraints.
References
[1] (2019) Combining benefits from trajectory optimization and deep reinforcement learning. arXiv:1910.09667. Cited by: §I-A.2.
[2] (2009) Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41–48. Cited by: §III-C.
[3] (2019) Model predictive contouring control for collision avoidance in unstructured dynamic environments. IEEE Robotics and Automation Letters 4 (4), pp. 4459–4466. Cited by: §I, §III-B.2, §III-B.4.
[4] (2019) Crowd-robot interaction: crowd-aware robot navigation with attention-based deep reinforcement learning. In 2019 International Conference on Robotics and Automation (ICRA), pp. 6015–6022. Cited by: §I-A.1, §I, §IV-B.
[5] Socially aware motion planning with deep reinforcement learning. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). Cited by: §I-A.1.
[6] (2017) Decentralized non-communicating multiagent collision avoidance with deep reinforcement learning. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 285–292. Cited by: §I-A.1, §I, §III-A.1.
[7] (2016) Motion planning with diffusion maps. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). Cited by: §I.
[8] (2014) FORCES Professional. embotech GmbH (http://embotech.com/FORCESPro). Cited by: §IV-A.
[9] (2008) Reinforcement learning versus model predictive control: a comparison on a power system problem. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39 (2), pp. 517–529. Cited by: §I-A.3.
[10] (2021) Collision avoidance in pedestrian-rich environments with deep reinforcement learning. IEEE Access 9, pp. 10357–10377. Cited by: §I-A.1, §I, §II-C, §III-A.2, §III-C, Fig. 3, 5(a), 5(b), 2nd item, §IV-B, Table III, §IV.
[11] Distributed multi-robot collision avoidance via deep reinforcement learning for navigation in complex scenarios. The International Journal of Robotics Research. Cited by: §I-A.1, §I, §IV-B.
[12] Fully distributed multi-robot collision avoidance via deep reinforcement learning for safe and efficient navigation in complex scenarios. arXiv:1808.03841. Cited by: §II-C.
[13] (2019) Deep value model predictive control. arXiv:1910.03358. Cited by: §I-A.2, §I-A.3.
[14] (2019) Reinforcement learning and model predictive control for robust embedded quadrotor guidance and control. Autonomous Robots 43 (7), pp. 1681–1693. Cited by: §I-A.3.
[15] (1995) Social force model for pedestrian dynamics. Physical Review E 51 (5), pp. 4282. Cited by: §I-A.1.
[16] (2020) Learning-based model predictive control: toward safe learning in control. Annual Review of Control, Robotics, and Autonomous Systems 3, pp. 269–296. Cited by: §I-A.2.
[17] (2018) Stable Baselines. GitHub. Cited by: §IV-A.
[18] (2019) Model-based lookahead reinforcement learning. arXiv:1908.06012. Cited by: §I-A.3.
[19] (2019) Recurrent experience replay in distributed reinforcement learning. In International Conference on Learning Representations. Cited by: §III-A.2.
[20] (2020) Practical reinforcement learning for MPC: learning from sparse objectives in under an hour on a real robot. arXiv:2003.03200. Cited by: §I-A.2.
[21] (2016) Socially adaptive path planning in human environments using inverse reinforcement learning. International Journal of Social Robotics 8 (1), pp. 51–66. Cited by: §I-A.1.
[22] Planning Algorithms. Cambridge University Press. Cited by: §II-B.
[23] (2013) Guided policy search. In International Conference on Machine Learning, pp. 1–9. Cited by: §I-A.3.
[24] (2013) Variational policy search via trajectory optimization. In Advances in Neural Information Processing Systems. Cited by: §I-A.3.
[25] (2018) Plan online, learn offline: efficient learning and exploration via model-based control. arXiv:1811.01848. Cited by: §I-A.2.
[26] (2018) Using a memory of motion to efficiently warm-start a nonlinear predictive controller. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 2986–2993. Cited by: §I-A.2.
 [27] (2019) Multiagent path topology in support of socially competent navigation planning. The International Journal of Robotics Research 38 (23), pp. 338–356. Cited by: §IA1.
 [28] (2018) Social momentum: a framework for legible navigation in dynamic multiagent environments. In Proceedings of the 2018 ACM/IEEE International Conference on HumanRobot Interaction, pp. 361–369. Cited by: §IA1.
 [29] (2014) Combining the benefits of function approximation and trajectory optimization.. In Robotics: Science and Systems, Vol. 4. Cited by: §IA3.
 [30] (2017) Neural network dynamics for modelbased deep reinforcement learning with modelfree finetuning. External Links: 1708.02596 Cited by: §IA2.
 [31] (2016) A survey of motion planning and control techniques for selfdriving urban vehicles. IEEE Transactions on intelligent vehicles. Cited by: §I.
 [32] (2015) Highdimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438. Cited by: §IIIC.
 [33] (2017) Proximal policy optimization algorithms. External Links: 1707.06347 Cited by: §IIIC, §IIIC, 23.

[34]
(2018)
Socially compliant navigation through raw depth inputs with generative adversarial imitation learning
. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1111–1117. Cited by: §IA1.  [35] (2015) Robot navigation in dense human crowds: statistical models and experimental studies of human–robot cooperation. The International Journal of Robotics Research 34 (3), pp. 335–356. Cited by: §IA1.
 [36] (2011) Reciprocal nbody collision avoidance. In Robotics research, Cited by: §IA1, footnote 1.
 [37] (2008) Reciprocal velocity obstacles for realtime multiagent navigation. In 2008 IEEE International Conference on Robotics and Automation, pp. 1928–1935. Cited by: §IIC, §IVD.
 [38] Reciprocal collision avoidance with accelerationvelocity obstacles. In 2011 IEEE International Conference on Robotics and Automation, Cited by: §IA1.
 [39] (2017) Modeling cooperative navigation in dense human crowds. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 1685–1692. Cited by: §IA1.
 [40] Exploring modelbased planning with policy networks. In International Conference on Learning Representations, Cited by: §IA3.
 [41] (2020) Safe reinforcement learninge using robust mpc. IEEE Transactions on Automatic Control. Cited by: §IA2.
 [42] (2013) Value function approximation and model predictive control. In 2013 IEEE symposium on adaptive dynamic programming and reinforcement learning (ADPRL), pp. 100–107. Cited by: §IA2.