Human-Inspired Multi-Agent Navigation using Knowledge Distillation

by Pei Xu, et al.
Clemson University

Despite significant advancements in the field of multi-agent navigation, agents still lack the sophistication and intelligence that humans exhibit in multi-agent settings. In this paper, we propose a framework for learning a human-like general collision avoidance policy for agent-agent interactions in fully decentralized, multi-agent environments. Our approach uses knowledge distillation with reinforcement learning to shape the reward function based on expert policies extracted from human trajectory demonstrations through behavior cloning. We show that agents trained with our approach can take human-like trajectories in collision avoidance and goal-directed steering tasks not provided by the demonstrations, outperforming the experts as well as learning-based agents trained without knowledge distillation.




I Introduction

The problem of decentralized multi-agent navigation has been extensively studied in a variety of fields including robotics, graphics, and traffic engineering. Existing state-of-the-art planners can provide formal guarantees about the collision-free motion of the agents allowing applicability on physical robots [FS98, orca, alonso2013optimal]. In addition, recent learning-based approaches are capable of end-to-end steering and human-aware robot navigation [long2017deep, long2018towards, chen2019crowd]. Despite recent advancements, though, existing agents cannot typically exhibit the level of sophistication and decision making that humans do in multi-agent settings. While the behavior of the agents can be improved through limited one-way communication [hildreth2019coordinating, godoy2020c], and accounting for social norms [alahi2016social, chen2017socially, gupta2018social], here we explore an alternative, data-driven approach that takes advantage of publicly available human trajectory datasets. Our approach starts with expert demonstrations obtained from such trajectories and learns human-like navigation policies in fully decentralized multi-agent environments using reinforcement learning.

The idea of imitating expert behaviors is not new. For example, behavior cloning techniques have been successfully applied to autonomous driving [bojarski2016end], motion planning for autonomous ground robots [pfeiffer2017perception], and distributed robot navigation [long2017deep], among others. However, pure imitation learning techniques are severely limited by the quality of the expert training dataset and cannot scale well to the multi-agent navigation domain due to its open-ended nature.

Combining expert demonstrations with reinforcement learning can address the problem of insufficient training samples, with typical approaches relying on bootstrapping reinforcement learning with supervised learning [peters2008reinforcement, rajeswaran2017learning], inverse reinforcement learning [ng2000algorithms], and generative adversarial imitation learning [ho2016generative]. However, such techniques are not directly applicable to the task of human-like collision avoidance learning since they typically assume reliable and representative expert demonstrations for the task at hand. Unfortunately, the experts (pedestrians) in human trajectory datasets are biased to some degree because i) we only have access to a limited amount of interaction data, which does not necessarily capture the behavior that the same expert will exhibit in a different setting; and ii) trajectory datasets cannot capture the non-deterministic nature of human decision making, i.e., there is more than one trajectory that a human expert can take in the same setting.

To address these issues, we propose to use knowledge distillation [hinton2015distilling] to learn a human-like navigation policy through expert policies extracted from human demonstrations. Given the imperfect nature of the experts, we do not directly optimize over them through an extra term in the objective function, as is typically proposed in the literature [bertsekas2011approximate, nair2018overcoming]. Instead, we utilize the expert policies to shape the reward function during reinforcement learning while promoting goal-directed and collision-free navigation. The resulting trained agents can surpass the experts, and achieve better performance than both planning-based agents and pure learning-based agents trained without any expert reward signal, while behaving in a human-like and more adept manner.

Overall, this paper makes the following contributions:

  1. We introduce a reinforcement learning approach for human-inspired multi-agent navigation that exploits human trajectory demonstrations and knowledge distillation to train a collision avoidance policy in homogeneous, fully decentralized settings.

  2. We experimentally show that the trained policy enables agents to take human-like actions for collision avoidance and goal-directed steering, and can generalize well to unseen scenarios not provided by the demonstrations, surpassing the performance of the experts and that of pure reinforcement-learning based agents or planning-based agents.

  3. We provide related code and pre-trained policies that can be used by the research community as baselines to facilitate further research and advance the field of human-inspired multi-agent navigation.

II Related Work

II-A Multi-Agent Navigation

State-of-the-art techniques for decentralized multi-agent navigation can be broadly classified into local planning approaches and learning-based methods. Existing local planning approaches that rely on social forces, energy-based formulations, and rule-based techniques have been successfully applied to a variety of domains and have been shown to generate human-like collision avoidance behavior [RE99, helbing, prl]. In robotics, geometric local planners based on the concepts of velocity obstacles and time to collision [FS98, rvo, orca] are widely applicable as they provide formal guarantees about the collision-free behavior of the agents and can be extended to account for motion and sensing uncertainty, allowing implementation on actual robots [alonso2013optimal, gorca, nhttc]. However, despite their robustness, local planning approaches typically require careful parameter tuning, which can limit their applicability to unseen environments. In addition, such approaches typically rely on user-defined assumptions about the optimal motion principles that govern the interactions between the agents.

Learning-based decentralized methods typically employ a reinforcement learning paradigm to address the problem of limited training data, allowing agents to learn navigation policies through interactions with the environment [how, chen2017decentralized]. Such methods do not make assumptions explicitly about what the optimal policy of the agents should be, but rather let the agents learn that policy through trial and error based on a reward function. Despite lacking theoretical collision avoidance guarantees, such approaches allow for better generalization to new environments and conditions as compared to local planning methods, with some of the most recent works enabling crowd-aware navigation for physical robots [how2, chen2019crowd, liu2020social] as well as fully distributed multi-robot navigation [long2018towards, fan2018fully]. Our work is complementary to such learning-based methods, as we consider a reinforcement learning framework combined with imitation learning to train a human-like navigation policy applicable to homogeneous multi-agent navigation settings.

II-B Imitation Learning

Below we consider imitation learning approaches that rely on a limited number of expert demonstrations collected offline. Under this assumption, prior work has focused on extracting an action policy to generate expert-like controls. Behavior cloning methods learn policies by directly matching the state-action pair as the input and output of the policy to the expert demonstrations through supervised learning [pomerleau1989alvinn, bojarski2016end]. Such approaches have also been explored for motion planning and distributed multi-robot navigation where expert demonstrations are based on simulations [pfeiffer2017perception, long2017deep]. Similar ideas are also applicable when the expert is represented by a policy distribution, in which case policy distillation can be performed to minimize the divergence or cross entropy between the target policy and the expert policies [rusu2015policy, czarnecki2019distilling]. Inverse reinforcement learning (IRL) [ng2000algorithms] methods estimate a parameterized reward function based on the expert trajectories and perform training using reinforcement learning. IRL has been successfully applied to robot navigation through crowded environments [henry2010learning, kretzschmar2016socially, kim2016socially]. Generative adversarial imitation learning (GAIL) [ho2016generative] has also been recently exploited for socially compliant robot navigation using raw depth images [tai2018socially].

While highly relevant, the aforementioned methods have difficulties when applied to the task of multi-agent human-like collision avoidance learning in general environments. A typical assumption in imitation learning is that the expert policies or demonstrations are reliable. However, the demonstrations provided by real human pedestrian datasets are biased to some degree as a dataset can only contain a limited number of human-human interactions in certain situations. Also, there is a lot of uncertainty in human decision making which cannot be captured by trajectory datasets. To address these issues, we leverage the idea of knowledge distillation [hinton2015distilling], and perform optimization implicitly through reward shaping during reinforcement learning based on the imperfect expert policies learned from human pedestrian trajectory demonstrations.

III Approach

We propose a reinforcement learning framework for collision-free and human-inspired multi-agent navigation. We assume a homogeneous, fully decentralized setup consisting of holonomic disk-shaped agents that do not explicitly communicate with each other but share the same action policy to perform navigation based on their own observations. To generate human-like actions, we take advantage of publicly available human trajectory data and perform reward shaping through knowledge distillation in the learning process. Overall, our approach adopts a two-stage training process: (1) supervised learning on expert demonstrations from human trajectory data, and (2) reinforcement learning with knowledge distillation to generate a general action policy.

III-A Problem Formulation

We consider a decentralized multi-agent control setting, where each agent acts independently, but all agents share the same action policy $\mathbf{a}_{i,t} \sim \pi_\theta(\cdot \mid \mathbf{o}_{i,t})$, where $\mathbf{a}_{i,t}$ denotes the action that agent $i$ samples at a given time step $t$, $\mathbf{o}_{i,t}$ is the observation that the agent receives at $t$, and $\theta$ denotes the parameters of the policy. The interaction process between an agent and the environment, which also includes other agents, is a partially observable Markov decision process, where each agent only relies on its own observations to make decisions.

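For an $N$-agent environment, let $\mathcal{M}(S_{t+1} \mid S_t, A_t)$ be the transition model of the Markov decision process with state space $\mathcal{S}$ and action space $\mathcal{A}$. Then, at time $t$, $\mathbf{o}_{i,t} \in S_t$ is a partially observable state for the $i$-th agent and $\mathbf{a}_{i,t} \in A_t$. Given a time horizon $T$ and the trajectory $\tau = (S_1, A_1, \ldots, A_{T-1}, S_T)$, we optimize the parameters $\theta$ by maximizing the cumulative reward

$$J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[ R(\tau) \right], \tag{1}$$

where $R(\tau) = \sum_{i=1}^{N} \sum_{t=1}^{T-1} \gamma^{t-1} r_{i,t}$ is the total cumulative reward achieved by all agents with the step reward $r_{i,t}$ and discount factor $\gamma$, and $p_\theta(\tau)$ is the state-action visitation distribution satisfying $p_\theta(\tau) = p(S_1) \prod_{t=1}^{T-1} \mathcal{M}(S_{t+1} \mid S_t, A_t) \prod_{i=1}^{N} \pi_\theta(\mathbf{a}_{i,t} \mid \mathbf{o}_{i,t})$.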

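Given that the decision process of each agent is completely independent under the fully decentralized context, the problem in Eq. 1 can be solved using a standard reinforcement learning setup by optimizing the combined cumulative reward of all agents based on their own trajectories, i.e.,

$$J(\theta) = \sum_{i=1}^{N} \mathbb{E}_{\tau_i \sim p_\theta(\tau_i)}\left[ \sum_{t=1}^{T-1} \gamma^{t-1} r_{i,t} \right], \tag{2}$$

where $\tau_i = (\mathbf{o}_{i,1}, \mathbf{a}_{i,1}, \ldots, \mathbf{a}_{i,T-1}, \mathbf{o}_{i,T})$ is the observation-action trajectory for the $i$-th agent itself.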

III-B Observation and Action Space

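The observation space of each agent $i$ is defined using a local coordinate system based on the agent's current position, $\mathbf{p}_{i,t} \in \mathbb{R}^2$, and velocity, $\mathbf{v}_{i,t} \in \mathbb{R}^2$, as: $\mathbf{o}_{i,t} = [\mathbf{n}_{0,t}, \mathbf{n}_{1,t}, \ldots, \mathbf{n}_{m,t}, \mathbf{g}_i - \mathbf{p}_{i,t}, \mathbf{v}_{i,t}]$, where

  • $\mathbf{n}_{j,t} = [\mathbf{p}_{j,t} - \mathbf{p}_{i,t}, \mathbf{v}_{j,t} - \mathbf{v}_{i,t}]$ denotes the local state of the $j$-th neighbor. We assume that the agent has an observing radius $r_o$, and another agent $j$ is considered as its neighbor if $\lVert \mathbf{p}_{j,t} - \mathbf{p}_{i,t} \rVert \le r_o$.

  • $\mathbf{n}_{0,t} = [\mathbf{0}, \mathbf{0}]$ is the neighborhood representation for the agent itself. This is a dummy neighbor representation that simplifies data processing when there is no neighbor in the observable range.

  • $\mathbf{g}_i - \mathbf{p}_{i,t}$ is the relative goal position, where $\mathbf{g}_i$ is the goal position of the agent defined in the global coordinate system.

  • $\mathbf{v}_{i,t}$ is the current agent's velocity.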

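The action space is a continuous 2D space denoting the expected velocity of the agent at the next time step, i.e., $\mathbf{a}_{i,t} \in \mathbb{R}^2$. When applied to the agent, the expected velocity is scaled to satisfy the maximal speed $v_{\max}$ at which the agent can move:

$$\mathbf{v}_{i,t+1} = \begin{cases} \mathbf{a}_{i,t} & \text{if } \lVert \mathbf{a}_{i,t} \rVert \le v_{\max}, \\ v_{\max}\, \mathbf{a}_{i,t} / \lVert \mathbf{a}_{i,t} \rVert & \text{otherwise.} \end{cases} \tag{3}$$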
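As an illustration of the observation layout and the speed scaling described in this section, the following minimal Python sketch builds a local observation and clamps an action to the maximum speed. The function names (`build_observation`, `scale_action`), the tuple layout, the use of relative neighbor velocities, and the default observing radius are assumptions made for illustration; they are not taken from the authors' implementation.

```python
import math

def build_observation(p_i, v_i, g_i, others, obs_radius=5.0):
    """Return (neighbors, relative goal, own velocity) for agent i.

    others: list of (position, velocity) tuples for the other agents.
    obs_radius is a hypothetical value chosen only for this sketch.
    """
    # Dummy self-neighbor so the neighbor list is never empty.
    neighbors = [(0.0, 0.0, 0.0, 0.0)]
    for p_j, v_j in others:
        dx, dy = p_j[0] - p_i[0], p_j[1] - p_i[1]
        # Keep only agents inside the observing radius.
        if math.hypot(dx, dy) <= obs_radius:
            neighbors.append((dx, dy, v_j[0] - v_i[0], v_j[1] - v_i[1]))
    rel_goal = (g_i[0] - p_i[0], g_i[1] - p_i[1])
    return neighbors, rel_goal, tuple(v_i)

def scale_action(a, v_max=2.5):
    """Clamp the commanded velocity to v_max while keeping its direction."""
    speed = math.hypot(a[0], a[1])
    if speed <= v_max:
        return (a[0], a[1])
    return (a[0] * v_max / speed, a[1] * v_max / speed)
```

Because the neighbor list always contains the dummy entry, downstream code can process it uniformly whether or not any real neighbor is in range.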


III-C Reward

Following prior work [how, chen2019crowd], we employ a reward function that gives a high reward signal as a bonus if an agent reaches its goal position, and a negative reward signal as a penalty to those that collide with any other agent. These reward signals urge the agent to reach its goal in as few steps as possible while avoiding collisions. As opposed to introducing additional terms in the reward function to regularize the agent trajectories, such as penalties for large angular speeds [long2018towards] and uncomfortable distances [liu2020social] or terms that promote social norms [chen2017socially], our work mainly relies on human expert policies, $f_e$, extracted from real crowd trajectories to shape the reward.

In particular, we seek for the agents to exhibit navigation behavior in the style of the experts without explicitly dictating their behaviors. However, expert policies obtained by learning from human trajectory demonstrations are typically not general enough to lead agents to effectively resolve collisions in unseen environments. As such, we do not directly optimize the action policy to clone the behaviors of imperfect experts, e.g., by introducing auxiliary loss terms in the objective function (Eq. 2). Instead, we propose to use knowledge distillation and exploit the behavior error averaged over expert policies to shape the reward function.

During inference, expert policies would become unreliable if the agent's speed is significantly different from the demonstrations. Humans typically prefer to walk at a certain preferred speed depending on the environment and their own personality traits, and their behavior patterns may differ considerably at different walking speeds. As such, we introduce a velocity regularization term in the reward function to encourage each agent to move at a preferred speed $v^*$ similar to the ones observed in the demonstrations. To promote goal-seeking behavior and avoid meaningless exploration led by imperfect experts, $v^*$ is used to scale the goal velocity that points from an agent's current position to its goal.

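Given an environment with $N$ agents, the complete reward function of our proposed imitation learning approach with knowledge distillation from experts is defined as follows:

$$r_{i,t} = \begin{cases} r_{\text{arrival}} & \text{if } \lVert \mathbf{g}_i - \mathbf{p}_{i,t+1} \rVert \le R, \\ r_{\text{collision}} & \text{if } \lVert \mathbf{p}_{j,t+1} - \mathbf{p}_{i,t+1} \rVert \le 2R, \\ w_e r^e_{i,t} + w_v r^v_{i,t} & \text{otherwise,} \end{cases} \tag{4}$$

where $j = 1, \ldots, N$ with $j \ne i$, $R$ is the agent radius, and $w_e$ and $w_v$ are weight coefficients. The knowledge distillation reward term $r^e_{i,t}$ and the velocity regularization term $r^v_{i,t}$ are computed as: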


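$$r^e_{i,t} = \exp\!\left( \frac{\sigma_e}{K} \sum_{k=1}^{K} \lVert \mathbf{a}_{i,t} - f_{e_k}(\mathbf{o}_{i,t}) \rVert \right), \qquad r^v_{i,t} = \exp\!\left( \sigma_v \lVert \mathbf{v}_{i,t+1} - \mathbf{v}^*_{i,t} \rVert \right), \tag{5}$$

where $f_{e_k}$ is the $k$-th of $K$ expert policies, $\sigma_e$ and $\sigma_v$ are scale coefficients, and $\mathbf{v}^*_{i,t} = v^* (\mathbf{g}_i - \mathbf{p}_{i,t}) / \lVert \mathbf{g}_i - \mathbf{p}_{i,t} \rVert$ is the goal velocity of the agent having a magnitude equal to the preferred speed $v^*$.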
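To make the structure of the reward concrete, the sketch below computes a step reward with an arrival bonus, a collision penalty, and otherwise a weighted sum of a knowledge distillation term and a velocity regularization term. It is only a sketch under stated assumptions: the exponential form of the two shaping terms, the arrival threshold of one radius, the function name `step_reward`, and every coefficient default other than the agent radius (0.1 m) and the preferred speed (1.3 m/s) are placeholders, not values reported by the authors.

```python
import math

def step_reward(p, p_next, v_next, goal, others_next, action, expert_actions,
                radius=0.1, v_pref=1.3, r_arrival=1.0, r_collision=-1.0,
                w_e=0.5, w_v=0.5, sigma_e=-1.0, sigma_v=-1.0):
    """Shaped step reward for one agent (coefficient defaults are placeholders)."""
    # Arrival bonus: the next position is within one radius of the goal (assumed threshold).
    if math.dist(p_next, goal) <= radius:
        return r_arrival
    # Collision penalty: two disks of equal radius overlap.
    if any(math.dist(p_next, q) <= 2 * radius for q in others_next):
        return r_collision
    # Knowledge distillation term: action error averaged over the expert predictions.
    mean_err = sum(math.dist(action, e) for e in expert_actions) / len(expert_actions)
    r_e = math.exp(sigma_e * mean_err)
    # Velocity regularization: deviation from the goal velocity at the preferred speed.
    d = math.dist(p, goal)
    if d > 0:
        v_goal = tuple(v_pref * (g - c) / d for g, c in zip(goal, p))
    else:
        v_goal = (0.0, 0.0)
    r_v = math.exp(sigma_v * math.dist(v_next, v_goal))
    return w_e * r_e + w_v * r_v
```

With negative scale coefficients, both shaping terms lie in (0, 1] and reach 1 only when the agent matches the experts and moves straight to the goal at the preferred speed, so the expert signal shapes the reward rather than entering the objective as a separate loss.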

III-D Policy Learning

The learning process of our approach has two stages: (1) obtain expert policies by supervised learning from real human pedestrian trajectories, and (2) train a shared action policy in multi-agent environments through reinforcement learning to maximize Eq. 2 with the reward function defined in Eq. 4.

Supervised Learning. Expert policies are trained by supervised learning on human pedestrian trajectory datasets where the observation-action training data is extracted as described in Section III-B. The goal position is defined by the last position of each pedestrian appearing in the datasets. We use a neural network denoted by $f_e(\cdot \mid \phi_e)$ with parameters $\phi_e$ to perform action prediction, which is optimized by minimizing the mean squared error between the predicted and the ground truth action:

$$L(\phi_e) = \mathbb{E}_{(\mathbf{o}, \mathbf{a})}\left[ \lVert f_e(\mathbf{o} \mid \phi_e) - \mathbf{a} \rVert^2 \right]. \tag{6}$$


Trajectory data in human datasets is typically recorded by sensors at fixed and rather sparse time intervals. Training on such limited, discrete data is prone to overfitting, as the network can simply memorize all the data. To alleviate this issue, we perform data augmentation during training by randomly sampling a continuous timestamp and using linear interpolation to estimate the corresponding observation-action data from the discrete trajectories. In addition to linear interpolation, we adopt two more types of data augmentation: scenario flipping and rotation. During training, the x- and/or y-axis of each observation-action datum is randomly flipped, and the observation-action local coordinate system is randomly rotated. This helps increase the generality of the training samples without destroying the relational information of the human-human interactions captured in the dataset.
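The three augmentations can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' pipeline: the function names, the 50% flip probability, and the uniform rotation range are assumptions, and the sketch interpolates positions only, whereas the paper interpolates full observation-action data.

```python
import math
import random

def interpolate(traj, t):
    """Linearly interpolate a position from samples [(time, (x, y)), ...]."""
    for (t0, p0), (t1, p1) in zip(traj, traj[1:]):
        if t0 <= t <= t1:
            w = (t - t0) / (t1 - t0)
            return (p0[0] + w * (p1[0] - p0[0]), p0[1] + w * (p1[1] - p0[1]))
    raise ValueError("t is outside the recorded time range")

def flip_rotate(vectors, flip_x, flip_y, angle):
    """Apply the same flip and rotation to every 2D vector of one sample."""
    c, s = math.cos(angle), math.sin(angle)
    out = []
    for x, y in vectors:
        if flip_x:
            x = -x
        if flip_y:
            y = -y
        out.append((c * x - s * y, s * x + c * y))
    return out

def augment(vectors, rng=random):
    """Draw random flips and one random rotation for an observation-action sample."""
    return flip_rotate(vectors,
                       rng.random() < 0.5,   # assumed flip probability
                       rng.random() < 0.5,
                       rng.uniform(-math.pi, math.pi))
```

Applying one shared flip and rotation to all position, velocity, and action vectors of a sample keeps relative distances and angles between agents intact, which is why the relational information of the interaction survives the augmentation.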

While multiple expert policies can be trained by using different datasets, the action patterns of pedestrians recorded at different times and/or under different scenarios may vary considerably. This could lead to too much divergence when we ensemble the expert policies in the reward function of the reinforcement learning stage (cf. Eq. 4). As such, in our implementation, we extract expert policies from only one pedestrian trajectory dataset. Under this setting, the deviation caused by stochastic factors during expert training can be effectively eliminated by data augmentation (Section IV-E). This allows us to employ just a single expert policy in all of our experiments and obtain a final action policy that is comparable to the one obtained with multiple experts, but with a significant training speedup.

Reinforcement Learning. To perform reinforcement learning, we exploit the DPPO algorithm [heess2017emergence], which is a distributed training scheme that relies on the PPO algorithm [schulman2017proximal]. DPPO optimizes Eq. 2 using the policy gradient method [sutton2000policy] by maximizing $\mathbb{E}_t[\log \pi_\theta(\mathbf{a}_{i,t} \mid \mathbf{o}_{i,t}) \hat{A}_{i,t}]$, where $\hat{A}_{i,t}$ is an estimate of the discounted cumulative reward term with bias subtraction, which in our implementation is computed by generalized advantage estimation (GAE) [schulman2015high].
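For readers unfamiliar with GAE, the following sketch shows the standard backward recursion on a single agent's trajectory. The discount and smoothing defaults (`gamma`, `lam`) are placeholder values, not the hyperparameters used in the paper, and the value estimates are assumed to come from a separate value network.

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation for one trajectory.

    values must have len(rewards) + 1 entries; the last one is the
    bootstrap value of the state reached after the final reward.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference error.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

Setting `lam` to 0 reduces the estimate to the one-step TD error, while `lam` equal to 1 recovers the discounted return minus the value baseline.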

To encourage exploration and avoid premature convergence, we also introduce a differential entropy loss, resulting in the following objective function:

$$L(\theta) = \mathbb{E}_t\left[ \log \pi_\theta(\hat{\mathbf{a}}_{i,t} \mid \hat{\mathbf{o}}_{i,t}) \hat{A}_{i,t} + \beta H\big(\pi_\theta(\cdot \mid \hat{\mathbf{o}}_{i,t})\big) \right], \tag{7}$$

where $H$ represents the entropy of the action policy, and $\beta$ is a scalar coefficient of the policy entropy loss. The parameters $\hat{\mathbf{o}}_{i,t}$ and $\hat{\mathbf{a}}_{i,t}$ denote the observation and action, respectively, after performing goal direction alignment, i.e., after rotating the local coordinate system described in Section III-B such that the goal is always in a fixed direction (the positive direction of the x-axis in our implementation). In this way, we can remove one dimension of the relative goal position $\mathbf{g}_i - \mathbf{p}_{i,t}$ and use the distance to the goal position instead, i.e., $\lVert \mathbf{g}_i - \mathbf{p}_{i,t} \rVert$. This trick helps reinforcement learning, as it reduces the complexity of the state space and is beneficial to exploration. Intuitively, agents in the goal-aligned local system are more likely to encounter similar observations and thus exploit previous experience more effectively.
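The goal direction alignment amounts to a planar rotation of the local frame. The sketch below, with hypothetical function names, rotates observation vectors so that the goal lies on the positive x-axis and rotates a sampled action back to the original local frame; it assumes that actions produced in the aligned frame must be rotated back before being applied.

```python
import math

def align_to_goal(rel_goal, vectors):
    """Rotate vectors so the goal lies on +x.

    Returns (goal distance, rotated vectors, rotation angle of the goal).
    """
    angle = math.atan2(rel_goal[1], rel_goal[0])
    c, s = math.cos(-angle), math.sin(-angle)
    rotated = [(c * x - s * y, s * x + c * y) for x, y in vectors]
    return math.hypot(rel_goal[0], rel_goal[1]), rotated, angle

def to_local_frame(aligned_action, angle):
    """Rotate an action expressed in the goal-aligned frame back by +angle."""
    c, s = math.cos(angle), math.sin(angle)
    x, y = aligned_action
    return (c * x - s * y, s * x + c * y)
```

After alignment the relative goal is fully described by a single scalar, the goal distance, which is the dimensionality reduction mentioned in the text.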

We do not apply the goal-alignment trick during expert policy training, since it could result in too much overfitting. In human trajectory datasets, pedestrians move mainly towards their goals, and hence with the alignment trick the output actions (velocities) would mostly respond to the goal direction. As a result, expert policies would prefer orientation adaptation over speed adaptation, which can often lead to understeering.

Figure 1: Network architecture employed as the policy network during reinforcement learning. The policy is defined as a Gaussian distribution with $\mu_\theta$ as the mean value and an input-independent parameter $\sigma$ as the standard deviation. The same architecture is adopted for the value network as well as the expert policy learning.

Network architecture. We use a bivariate Gaussian distribution with independent components as the action policy, where the mean values are provided by the policy network with the architecture shown in Fig. 1, and the standard deviation is a 2-dimensional observation-independent parameter. Neighbor representations, $\mathbf{n}_{1,t}, \ldots, \mathbf{n}_{m,t}$ and $\mathbf{n}_{0,t}$, are simply added together with the agent representation, $[\mathbf{g}_i - \mathbf{p}_{i,t}, \mathbf{v}_{i,t}]$, after the embedding networks. Here, $\mathbf{n}_{0,t}$ denotes the neighborhood representation of the agent itself and is always kept such that the network can work normally when no neighbors are observed. This network architecture can support observation representations with an arbitrary number of neighbors, and the neighborhood embedding result does not rely on the input order of the neighbor representation sequence. We use the same architecture for the value network to perform value estimation for the GAE computation, and to train the expert policies through supervised learning without goal-aligned observations.
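The order-independence of the neighborhood embedding follows from summing per-neighbor embeddings. The toy sketch below illustrates the idea with a single linear-plus-ReLU layer per branch in plain Python; the layer sizes, the helper names (`linear_relu`, `encode`, `sample_action`), and the single-layer embeddings are assumptions for illustration and do not reproduce the architecture of Fig. 1.

```python
import random

def linear_relu(x, weights, bias):
    """One dense layer followed by ReLU; weights is a list of rows."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, bias)]

def encode(neighbors, agent_state, nb_params, ag_params):
    """Sum the neighbor embeddings, then add the agent embedding."""
    pooled = [0.0] * len(nb_params[1])
    for n in neighbors:  # includes the dummy self-neighbor
        e = linear_relu(n, *nb_params)
        pooled = [p + ei for p, ei in zip(pooled, e)]
    agent = linear_relu(agent_state, *ag_params)
    return [p + ai for p, ai in zip(pooled, agent)]

def sample_action(mean, std, rng=random):
    """Sample from a bivariate Gaussian with independent components."""
    return (rng.gauss(mean[0], std[0]), rng.gauss(mean[1], std[1]))
```

Because addition is commutative, shuffling the neighbor list leaves the pooled embedding unchanged up to floating-point rounding, and the dummy self-neighbor guarantees at least one term in the sum.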

Table I: Quantitative Evaluation. Columns: 20-Circle, 24-Circle, 20-Corridor, 24-Corridor, 20-Square, 24-Square. Rows: Success Rate (higher is better), Extra Distance (lower is better), and Energy Efficiency (higher is better), each reported for ORCA, SL, RL, and Ours. (Numeric entries are not available in this copy.)

IV Experiments

IV-A Simulation Environment Setup

Our simulation environment has three different scenarios shown in Fig. 2. In the circle scenarios, agents are roughly placed on the circumference of a circle and have to reach their antipodal positions; in the corridor scenarios, agents are randomly initialized at the two sides of the environment, oriented either vertically or horizontally, and have random goals at the opposite sides; and in the square crossing scenarios, agents are placed at the four sides of the environment and need to walk across the opposite sides to reach randomly placed goals. To increase the generality, all scenarios are generated randomly during simulation. The circle environment has a random radius varying from 4m to 6m. The other two environments have widths and heights in the range of 8m to 12m. During each simulation, 6 to 20 agents sharing the same action policy under optimization are placed into the environments and are given stochastic goal positions.

The simulation time step size is 0.04s, while agents receive control signals every 0.12s. All agents are modeled as disks with a radius of 0.1m, and have a preferred speed of 1.3m/s with a maximum speed limit of 2.5m/s to imitate human walking patterns in the dataset for expert policy training. During simulation, collided agents are kept as static obstacles to other agents, while agents that arrive at their goals are removed from the environment. Each simulation episode terminates if all non-collided agents reach their goals or a time limit of 120s is met.
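The timing above implies three simulation steps per control signal. The short sketch below makes this explicit for a holonomic disk agent; it assumes that the commanded velocity is held constant between control updates and uses simple Euler integration, neither of which is stated in the text.

```python
SIM_DT = 0.04       # simulation time step (s)
CONTROL_DT = 0.12   # control period (s)
SUBSTEPS = round(CONTROL_DT / SIM_DT)  # simulation steps per control signal

def advance(position, velocity, substeps=SUBSTEPS, dt=SIM_DT):
    """Integrate a holonomic agent over one control period (Euler steps)."""
    x, y = position
    for _ in range(substeps):
        x += velocity[0] * dt
        y += velocity[1] * dt
    return (x, y)
```

At the preferred speed of 1.3 m/s, an agent therefore covers roughly 0.156 m between consecutive decisions, which is larger than its 0.1 m radius.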

IV-B Training Details

We exploit the students dataset [lerner2007crowds] for supervised learning of expert policies. This dataset records the trajectories of 434 pedestrians in an open campus environment, providing a more general setting to learn typical expert policies as compared to environments in other publicly available datasets consisting of too many obstacles and/or terrain restrictions. Each pedestrian is considered as a disk agent with a radius of 0.1m, estimated by the typical minimal distance between trajectories; the goal position is chosen as the last position of each pedestrian’s trajectory; velocities are estimated by finite differences.

Before training, we preprocess the raw data to remove pedestrians that stand still or saunter without clear goals, as trajectories of such pedestrians cannot help learn navigation policies and would become training noise. After data cleansing, we kept the observation-action records from 300 out of the 434 pedestrians for expert training (Eq. 6), while the rest of the pedestrian trajectories are only used as part of the neighborhood representation of the active pedestrians. In the following experiments, we use by default only one expert policy trained with data augmentation (see Section IV-E for related analysis). The default values of the reward function weights and all other hyperparameters are provided with the released code, along with related videos.

IV-C Quantitative Evaluation

Figure 2: Our environment consists of: (a) a circle crossing scenario, (b) a corridor scenario, and (c) a square crossing scenario. During each simulation episode, a scenario is randomly chosen and 6 to 20 agents are put into the scenario randomly. Red arrows indicate goal directions.

We compare our method to three approaches: the geometric-based method of Optimal Reciprocal Collision Avoidance (ORCA) [orca], a supervised learning approach (SL), and a reinforcement learning approach without knowledge distillation (RL). To prevent ORCA agents from staying too close to each other, we increase the agent radius by 20% in the ORCA simulator. SL denotes the performance of the expert policy obtained by stage 1 of our approach. RL uses the reward function from [long2018towards], which optimizes agents to reach their goals as fast as possible without collisions. Since RL performs optimization based on the maximal agent speed but our performance evaluations are computed based on a preferred speed of 1.3m/s, for fairness we set the maximal speed in RL training and testing cases to 1.3m/s instead of the default value of 2.5m/s.

We perform 50 trials of testing, which are the same across different methods, using the three scenarios shown in Fig. 2. We run comparisons using three evaluation metrics and report the results in Table I. Reported numbers are the mean ± std. Success rate denotes the percentage of agents that reached their goals. Extra distance is the additional distance in meters that an agent had to traverse instead of taking the straight line path to the goal. This metric is recorded only for agents that successfully reached their goals. Energy efficiency measures the ratio between an agent's progress towards its goal and its energy consumption at each time step [godoy2020c]. The energy consumption is computed following the definition in [godoy2020c] such that the optimal speed in terms of energy expended per meter traversed is 1.3m/s.
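The first two metrics can be computed directly from recorded trajectories, as the sketch below shows. The function names and the representation of a trajectory as a list of 2D points are assumptions; the energy efficiency metric is omitted because its exact formula is defined in [godoy2020c] and not restated here.

```python
import math

def path_length(points):
    """Total length of a polyline given as a list of (x, y) points."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))

def extra_distance(points, goal):
    """Traveled length minus the straight-line distance from start to goal (m)."""
    return path_length(points) - math.dist(points[0], goal)

def success_rate(reached_flags):
    """Percentage of agents that reached their goals."""
    return 100.0 * sum(reached_flags) / len(reached_flags)
```

For instance, an agent that detours through (3, 4) on its way from (0, 0) to (6, 0) travels 10 m instead of 6 m, giving an extra distance of 4 m.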

As shown in Table I, our approach achieves results that are as good as or better than those of ORCA in terms of success rate and energy efficiency, while it significantly outperforms RL and SL in most of the scenarios. Given that the expert policies obtained by SL are used in the second stage of our approach, it is clear that agents with knowledge distillation can surpass the experts to achieve collision-free and energy-efficient behavior patterns. Regarding the extra distance traversed, agents with our approach favor shortest paths in the corridor and square scenarios. In the circle scenarios, ORCA outperforms the rest of the methods as its agents prefer to take the straight line path to the goal that passes through the center of the environment by mainly adapting their speeds. However, as discussed below, these are not typically the types of trajectories that humans would exhibit in similar situations. In addition, we note that both the performance and behavior of ORCA agents vary considerably depending on the choice of parameters, including the size of the simulation time step and the safety buffer added to the radius of the agent.

IV-D Comparisons to Human Reference Trajectories

Figure 3: Trajectories of different methods in a 2-agent head-to-head scenario.
Figure 4: Value heatmaps of our approach in the 2-agent interaction scenario. Brighter radial bins indicate the action decisions that are considered better by the agents. Dark bins indicate the actions that agents do not prefer.
Figure 5: Agent trajectories in a 6-agent circle scenario. Each column is captured at the same time fraction of the whole trajectories.

Figure 3 compares the trajectories generated by different methods in a symmetric head-to-head interaction scenario to reference data obtained from motion captured human experiments [moussaid]. Due to the symmetric nature of the scenario, ORCA agents fail to reach their goals and end up standing still next to each other in the middle of the environment. RL agents maintain a high walking speed close to their maximal one and solve the interaction by performing an early orientation adaptation as opposed to the reference data where the avoidance happens later on. Trajectories generated by our approach more closely match the reference trajectories. To further analyze the decision making of the agents in our approach, Fig. 4 depicts the corresponding action value heatmaps obtained by estimation through the value network of DPPO. As shown, at 0.72s the agents implicitly negotiate to pass each other from the right side; and later on at the 0.92s mark every action that can lead to an imminent collision is forbidden to both agents, with agent 1 preferring to turn a bit to the right while moving at a speed of at least 1m/s while agent 2 decides to veer a bit to the right and accelerate.

Figure 5 shows trajectories from a 6-agent circle scenario, where the reference human trajectories were obtained from [wolinski2014parameter]. In this scenario, ORCA agents mainly resolve collisions by adapting their speeds; they move slowly at the beginning due to the uncertainty about what their neighbors are doing and eventually traverse quickly, mostly along straight lines, towards their goals. RL agents, on the other hand, prefer to travel at their maximum speed towards the center of the environment, and then adapt their orientations in unison, resulting in a vortex-like avoidance pattern. Agents in our approach exhibit more variety by using a combination of both speed and orientation adaptation to balance the tradeoff between collision avoidance and goal steering. As such, the resulting trajectories can capture to some degree the diversity that humans exhibit in the reference data.

To statistically analyze the quality of generated trajectories, we reproduce the students scenario with different methods and run novelty detection using the K-LPE algorithm [zhao2009anomaly]. Figure 6 shows how the corresponding agent trajectories compare to the trajectories in the students reference dataset based on the metrics of the agent's linear speed, angular speed, and distance to the nearest neighbor. As can be seen, the majority of trajectories generated by our approach have low anomaly scores as compared to ORCA and RL. Figure 7 further highlights that agents in our approach can more closely match the velocity distribution of the ground truth data.

IV-E Sensitivity Analysis

Figure 6: Distributions of anomaly scores on students scenario.
Figure 7: Agent velocity distributions on students scenario.

Figure 8(a) shows the performance of action policies trained with different experts in the 24-Circle scenario. As can be seen, the performance varies considerably when employing a single expert trained without data augmentation. However, the uncertainty caused by stochastic factors of supervised training can be effectively eliminated by introducing data augmentation during expert policy training. Using either multiple augmented experts or a single expert with data augmentation results in comparable performance. As such, and taking into account the fact that the inference time increases dramatically as more expert policies are exploited simultaneously, we employed a single expert policy with data augmentation in all of our experiments. In our tests, using two V100 GPUs, training with three expert policies took approximately six hours, while training with a single expert took only around three and a half hours.

Figure 8(b) compares the training results obtained by using the full reward function, which considers both the knowledge distillation term and the velocity regularization term (Eq. 5), to results obtained by using only one of the two terms. As shown in the figure, the expert policy by itself is not reliable enough: using only the knowledge distillation term leads to worse performance across all three metrics. However, when combined with the velocity reward term, the expert can help agents achieve the best overall performance.
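The ablation over reward terms can be pictured with the following sketch. It does not reproduce Eq. 5: the exponential form of each term, the weights, the scale parameters, and all function names are hypothetical and serve only to show how the two terms might be combined and switched off individually, and how several experts could contribute to the distillation term at once.

```python
import numpy as np

def distillation_reward(action, expert_actions, scale=1.0):
    """Hypothetical distillation term: close to 1 when the agent's action
    matches the experts' suggested actions on average."""
    expert_actions = np.atleast_2d(np.asarray(expert_actions, dtype=float))
    errors = np.linalg.norm(expert_actions - np.asarray(action, dtype=float), axis=-1)
    return float(np.exp(-scale * errors.mean()))

def velocity_reward(velocity, goal_velocity, scale=1.0):
    """Hypothetical velocity regularization term: close to 1 when the agent
    moves at its goal-directed preferred velocity."""
    diff = np.asarray(velocity, dtype=float) - np.asarray(goal_velocity, dtype=float)
    return float(np.exp(-scale * np.linalg.norm(diff)))

def shaped_reward(action, velocity, goal_velocity, expert_actions,
                  w_expert=0.5, w_velocity=0.5):
    """Weighted sum of both terms; the weights are assumed values."""
    return (w_expert * distillation_reward(action, expert_actions)
            + w_velocity * velocity_reward(velocity, goal_velocity))
```

In this sketch, setting `w_velocity=0` or `w_expert=0` mimics the single-term settings of the comparison, and passing one row or several rows in `expert_actions` corresponds to using a single expert or multiple experts; every extra expert requires one more policy inference per reward evaluation.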


(a) Performance with different expert policies.
(b) Performance with different reward terms.
Figure 8: Sensitivity analysis based on performance in the 24-agent circle scenario. (a) The multiple-expert test cases use three experts simultaneously for reward computation. Single-expert results are the average of three training trials, each of which uses only one of the corresponding three experts. (b) Comparisons of different reward terms using a single expert with data augmentation.

We propose a framework for learning a human-like general collision avoidance policy for agent-agent interactions in fully decentralized multi-agent environments. To do so, we use knowledge distillation implicitly in a reinforcement learning setup to shape the reward function based on expert policies extracted from human pedestrian trajectory demonstrations. Our approach can help agents surpass the experts, and achieve better performance and more human-like action patterns, compared to using reinforcement learning without knowledge distillation and to existing geometric planners.

A limitation of our approach is that we assume all agents are of the same type, and particularly are holonomic disk-shaped agents. While, in theory, the same trained policy can be ported to other types of agents, the resulting behavior can be very conservative as the true geometry of the target agent will be ignored during training. In addition, we assume that trained agents can behave similarly to the expert pedestrians that provide the demonstrations, ignoring the kinodynamic constraints of specific robot platforms. Even though this issue can be addressed by relying on a controller to convert human-like velocity commands to low-level robot control inputs, it also opens an interesting direction for future work that focuses on mining robot-friendly action patterns rather than action commands from human trajectories. Another avenue for future work is extending our method to mixed settings where agents can interact with humans. The recent works of [gupta2018social, liu2020social, sathyamoorthy2020densecavoid] on socially-aware robot navigation can provide some interesting ideas towards this research direction.
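As an illustration of the kind of controller mentioned above, the following sketch maps a holonomic velocity command to linear and angular velocity commands for a unicycle-type (e.g., differential-drive) robot. The platform, the function name, the proportional turning rule, and all gains and limits are assumptions chosen for illustration; they are not part of our method.

```python
import math

def holonomic_to_unicycle(vx, vy, heading, max_speed=1.5,
                          turn_gain=2.0, max_turn_rate=2.0):
    """Convert a desired planar velocity (vx, vy) into (linear, angular)
    commands for a robot currently facing `heading` (radians)."""
    speed = min(math.hypot(vx, vy), max_speed)
    if speed == 0.0:
        return 0.0, 0.0
    desired_heading = math.atan2(vy, vx)
    # Wrap the heading error into [-pi, pi].
    error = math.atan2(math.sin(desired_heading - heading),
                       math.cos(desired_heading - heading))
    # Slow down while misaligned; never drive backwards in this sketch.
    linear = speed * max(0.0, math.cos(error))
    # Proportional turning toward the desired heading, clipped to the limit.
    angular = max(-max_turn_rate, min(max_turn_rate, turn_gain * error))
    return linear, angular
```

A command aligned with the current heading passes through unchanged, whereas a sideways command produces mostly rotation, which is exactly the kind of deviation from the holonomic assumption that robot-friendly action patterns would need to account for.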

VI Acknowledgments

This work was partially supported by an Amazon Research Award.