SOCIALGYM: A Framework for Benchmarking Social Robot Navigation

09/22/2021
by   Jarrett Holtz, et al.
The University of Texas at Austin

Robots moving safely and in a socially compliant manner in dynamic human environments is an essential benchmark for long-term robot autonomy. However, it is not feasible to learn and benchmark social navigation behaviors entirely in the real world, as learning is data-intensive, and it is challenging to make safety guarantees during training. Therefore, simulation-based benchmarks that provide abstractions for social navigation are required. A framework for these benchmarks would need to support a wide variety of learning approaches, be extensible to the broad range of social navigation scenarios, and abstract away the perception problem to focus on social navigation explicitly. While there have been many proposed solutions, including high fidelity 3D simulators and grid world approximations, no existing solution satisfies all of the aforementioned properties for learning and evaluating social navigation behaviors. In this work, we propose SOCIALGYM, a lightweight 2D simulation environment for robot social navigation designed with extensibility in mind, and a benchmark scenario built on SOCIALGYM. Further, we present benchmark results that compare and contrast human-engineered and model-based learning approaches to a suite of off-the-shelf Learning from Demonstration (LfD) and Reinforcement Learning (RL) approaches applied to social robot navigation. These results demonstrate the data efficiency, task performance, social compliance, and environment transfer capabilities for each of the policies evaluated to provide a solid grounding for future social navigation research.


I Introduction

Deploying robots that navigate safely alongside people and move in a socially compliant manner is one of the ultimate goals of robotics. However, training and evaluating in real-world human environments presents both safety concerns and scalability challenges for many learning algorithms, compounding the difficulty of developing robust approaches to robot social navigation. As such, an extensible benchmark for social navigation is a critical step along the path to deploying socially compliant robots.

In order to accommodate the rapidly expanding pool of promising machine learning techniques, a framework for these benchmarks should provide the infrastructure for integrating new approaches. In addition, the ability to adjust the reward, observation space, and properties of the agent and environment is important for capturing social navigation scenarios and enabling varied learning approaches. Finally, such a framework should provide a representation with reasonable simulation fidelity while also abstracting away the perception task to focus on social navigation. Many different solutions for simulating robots and humans in dynamic environments have been proposed for social navigation research. However, to the best of our knowledge, no existing solution satisfies all of the desired properties for a benchmark framework.

Fig. 1: Diagram of interaction between modules in SOCIALGYM. Dashed boxes represent interchangeable modules, while purple boxes contain configurable parameters. Blue arrows represent requests between modules, and green arrows represent responses.


Our proposed solution is SOCIALGYM, a 2D simulation environment built on the Robot Operating System (ROS) that provides a lightweight and configurable option for training and evaluating social navigation behaviors. We chose 2D simulation to abstract away the perception problem and focus on the interactions between navigation and people in dynamic human environments. To streamline integration into common reinforcement learning workflows, SOCIALGYM implements the OpenAI Gym interface for training and evaluation. SOCIALGYM is modular, so that modules can be exchanged for others to increase the range of possible benchmarks; an overview of module interactions and configurability is shown in Fig. 1, and a visualization of the simulation environment in Fig. 2. The benchmark included with SOCIALGYM focuses on the action selection problem presented in work on Multipolicy Decision Making [18] and program synthesis for social navigation [14], where social navigation is modeled as an action selection problem and the optimal action is selected from a set of discrete sub-policies. Using the included benchmark, we compare and contrast a suite of engineered approaches, model-based symbolic learning, Learning from Demonstration (LfD), and Reinforcement Learning (RL) to provide a baseline for future comparisons and to highlight promising directions for future research. Our results demonstrate the data efficiency, scenario performance, social compliance, and generalizability of different policies, and our analysis highlights the relative strengths and weaknesses of different techniques. In addition to providing the benchmark for multipolicy decision-making in social navigation, SOCIALGYM is designed to be configurable and extensible in the hopes of providing a stable base for a variety of social navigation benchmarks.
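
Because SOCIALGYM exposes the standard OpenAI Gym interface, any Gym-compatible agent can interact with it through the usual reset/step loop. The sketch below illustrates that loop under the assumption of a hypothetical environment id ("SocialGym-v0") and a random placeholder policy; it is not the package's actual registration name.

```python
import gym

env = gym.make("SocialGym-v0")            # hypothetical id; any Gym-compliant env works
obs = env.reset()                         # initial observation
done, total_reward = False, 0.0
while not done:
    action = env.action_space.sample()    # placeholder for a learned or engineered policy
    obs, reward, done, info = env.step(action)
    total_reward += reward
env.close()
print(total_reward)
```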

The contributions of this work are as follows:

  • SOCIALGYM, a 2D simulation environment for robot social navigation,

  • a complete toolkit that includes robot simulation, human simulation, navigation stack, and simple baseline behaviors,

  • OpenAI gym integration to provide a standard interface for many learning approaches,

  • a benchmark based on multipolicy decision making for robot social navigation,

  • evaluation results and analysis for several reinforcement learning and learning from demonstration approaches for use as baselines for future research, and

  • an open source repository for SOCIALGYM.

II Related work

Model-based approaches with application-specific engineered behaviors have historically been applied to social navigation [18, 13, 9, 19, 2]. More recently, approaches have sought to leverage machine learning techniques to learn general social navigation behaviors [4, 12, 3, 6, 24].

Existing simulators and benchmarks are either at the level of high-fidelity 3D simulation, which focuses on the raw perception problem of humans in the environment, or are simplified grid-world models of navigation. In contrast, SOCIALGYM provides sensor readings at the level of laser scans and human detections.

Many general-purpose high-fidelity 3D simulators have been developed for robotics. Of particular note is the Gazebo simulator [15], a powerful and extensible simulator commonly used with ROS. Other high-fidelity simulators have been developed using gaming physics and graphics engines such as Unity [11] or Unreal [8] with autonomous driving in mind, some of which can simulate other agents in the environment, such as AirSim [23] and CARLA [7]. In addition to these more general-purpose simulators, some solutions have been proposed specifically for the social navigation problem. The Social Environment for Autonomous Navigation (SEAN) [25] uses a similar Unity-based approach to more general simulators while providing social navigation-specific metrics and scenarios. While SEAN does not provide any particular benchmark, SocNavBench [1] is designed as a social benchmark that generates photo-realistic sensory input directly from pre-recorded real-world datasets to strike a balance between simulation and recorded datasets. In addition to simulation-based benchmarks, a small number of purely dataset-based benchmarks have been proposed, including SocNav1 [17] and Social Robot Indoor Navigation (SRIN) [20]. While datasets are a critical benchmark tool, they often require an initial sensor processing step and are not easily extensible. Alternatively, simplified grid-world type environments, such as MiniGrid [5], can also be used to approximate social navigation by modeling humans with dynamic obstacles, although this is not commonly used for benchmarking.

III Social Navigation in Robotics

We frame social navigation as a discounted-reward partially observable Markov Decision Process (POMDP) $\langle S, A, T, R, \Omega, O, \gamma \rangle$ consisting of: the state space $S$, where a state $s = \langle x_r, v_r, g, H \rangle$, $x_r$ is the robot pose, $v_r$ is the robot velocity, $g$ is the goal pose, and $H$ is a list of the human poses and velocities; the action space $A$, with actions represented either as discrete motion primitives [18] or continuous local planning actions [16]; the world transition function

$$T(s' \mid s, a) \qquad (1)$$

for the probabilistic transition to state $s'$ when taking action $a$ at previous state $s$; the reward function $R(s, a)$; the set of observations $\Omega$; a set of conditional observation probabilities $O(o \mid s', a)$; and the discount factor $\gamma$. The solution to this POMDP is represented as a policy $\pi$ that decides what action to take based on the previous observation-action pair. The optimal social navigation policy $\pi^*$ maximizes the expectation over the cumulative discounted rewards:

$$\pi^* = \operatorname*{arg\,max}_\pi \; \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t)\right], \qquad a_t \sim \pi(\cdot \mid o_t, a_{t-1}). \qquad (2)$$

We next describe how SOCIALGYM provides modules to simulate the components of the POMDP and interfaces with different approaches to learn $\pi^*$.

IV System architecture

SOCIALGYM is built from four modules that simulate different components of the POMDP for social robot navigation and coordinate to simulate the full POMDP. These modules are the Gym module, the Environment module, the Human module, and the Navigation module.

The Gym module is the top-level interface between the agent and the simulator: it handles simulation of the agent's action selection policy, which selects an action $a_t$ at each timestep based on the current observation $o_t$. The Gym module then steps the simulation by passing $a_t$ to the Environment module, which returns a new state $s_{t+1}$ from which the Gym module calculates the reward $r_t$ and derives a new observation $o_{t+1}$. The Environment module handles the representation of the state and coordinates with the Human and Navigation modules to simulate the transition function given $a_t$ at each timestep. The Human module takes in the state and action and controls the components of the transition governed by the humans' response to the environment by sending updated human positions and velocities to the Environment module. In turn, the Navigation module controls the components of the transition related to robot execution of $a_t$ by providing a baseline navigation behavior for evaluation and global and local planners for use in discrete action spaces.

While our current implementation provides a single complete benchmark scenario and extensions to enable various others, it is essential to note that each module can be freely replaced with any alternative that implements a prerequisite ROS service. In this way, ROS services make our approach accessible to a wide variety of existing approaches written in ROS. The components of SOCIALGYM and their interactions are shown in Fig. 1; in the following sections, we describe the abstractions of each module and high-level technical details, while in Section IV-E we describe the concrete instantiation used for our included benchmark.
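
As a rough illustration of this module-swap mechanism, the sketch below shows how a replacement Human module might expose a ROS service that the rest of the system could call. The service name, the social_gym.srv package, and the HumanUpdate message fields are all hypothetical placeholders, not the actual interfaces defined by SOCIALGYM.

```python
import rospy
# Hypothetical service definition; the real prerequisite interfaces live in SOCIALGYM.
from social_gym.srv import HumanUpdate, HumanUpdateResponse

def handle_human_update(req):
    # Compute new human poses/velocities from the state in the request.
    # Placeholder: echo the previous poses and velocities unchanged.
    return HumanUpdateResponse(poses=req.poses, velocities=req.velocities)

if __name__ == "__main__":
    rospy.init_node("custom_human_module")
    rospy.Service("human_update", HumanUpdate, handle_human_update)
    rospy.spin()
```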

Fig. 2: Visualization of one timestep of SOCIALGYM, with humans shown as green circles, walls as black lines, the laser scan as red lines, the robot as a blue box, and recent trajectories as dotted lines.


IV-A Gym integration

The Gym module builds on the OpenAI Gym framework to interface between our ROS-based simulation framework and the learning agent. This module has four responsibilities: coordinating the generation and resetting of scenarios; receiving actions from the agent and using them to coordinate state transitions of the environment; receiving states from the environment and converting them into observations; and evaluating performance by calculating metrics and the value of the reward function.
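
A minimal sketch of how these responsibilities might fit together in a Gym-style step() method is shown below; the class, state fields, and helper methods are illustrative stand-ins rather than the actual implementation, which exchanges ROS requests with the other modules.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SimState:
    """Illustrative stand-in for the state returned by the Environment module."""
    robot_pose: np.ndarray
    goal_pose: np.ndarray
    goal_reached: bool = False
    timed_out: bool = False

class GymModuleSketch:
    """Illustrative only: the real Gym module issues ROS requests to the other modules."""

    def __init__(self, state: SimState):
        self.state = state

    def request_environment_step(self, action) -> SimState:
        # Placeholder for the ROS request/response with the Environment module.
        return self.state

    def observe(self, state: SimState) -> np.ndarray:
        # Map the state to an observation vector (here: robot and goal poses only).
        return np.concatenate([state.robot_pose, state.goal_pose])

    def compute_reward(self, state: SimState, action) -> float:
        # Placeholder reward: negative distance to the goal.
        return -float(np.linalg.norm(state.goal_pose - state.robot_pose))

    def step(self, action):
        state = self.request_environment_step(action)
        obs = self.observe(state)
        reward = self.compute_reward(state, action)
        done = state.goal_reached or state.timed_out
        return obs, reward, done, {}

state = SimState(robot_pose=np.zeros(3), goal_pose=np.array([2.0, 1.0, 0.0]))
print(GymModuleSketch(state).step(action=0))
```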

Generating scenarios

A scenario is described by a tuple consisting of the map of static obstacles in the scene, the simulation timestep, the start pose and goal pose of the robot, and a list of pairs describing the initial pose and goal pose of each human in the scene. SOCIALGYM randomly generates new scenarios based on several parameters that describe the range of possible initial conditions and goals. These parameters are the map, the minimum and maximum number of humans, the number of iterations for a given scene, and the maximum total number of iterations. In addition, scenario generation requires a list of poses that describes the legal start and end locations in the map. Given this configuration, a new scenario is generated as follows (a minimal sketch of this procedure follows the list):

  1. the map and the list of legal poses are configured by the user,

  2. a random start pose and goal pose for the robot are selected from the list of legal poses,

  3. a random number of humans is chosen between the configured minimum and maximum,

  4. the humans are generated by selecting random start and goal poses from the list of legal poses such that they do not overlap,

  5. and finally, the environment is initialized and the initial state is sent to the agent.

Each random scenario is run until it completes in either success or failure the configured number of times, and then a new scenario is generated, up to the maximum total number of iterations. We define the success state to be reached when the robot arrives at its goal pose, and the failure state to be reached when the scenario times out before the goal is reached.
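
The sketch below illustrates the scenario generation procedure just described, assuming legal start and goal locations are given as a list of poses; the function name, parameters, and pose representation are illustrative, and it does not enforce non-overlap between different humans' poses.

```python
import random

def generate_scenario(legal_poses, min_humans, max_humans):
    # Select distinct start and goal poses for the robot.
    robot_start, robot_goal = random.sample(legal_poses, 2)
    # Choose a random number of humans within the configured bounds.
    n_humans = random.randint(min_humans, max_humans)
    # Humans draw start/goal pairs from the poses not used by the robot.
    available = [p for p in legal_poses if p not in (robot_start, robot_goal)]
    humans = [tuple(random.sample(available, 2)) for _ in range(n_humans)]
    return robot_start, robot_goal, humans

# Example with poses given as (x, y, theta) tuples.
poses = [(0, 0, 0), (5, 0, 0), (0, 5, 0), (5, 5, 0), (2, 2, 0)]
print(generate_scenario(poses, min_humans=1, max_humans=3))
```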

Updating the environment and observation

The Gym module controls the update loop of the simulation as follows:

  1. the Gym module receives an action $a_t$ from the agent,

  2. $a_t$ is sent as a request to the Environment module,

  3. the Environment module executes the state transition given $a_t$,

  4. the Gym module receives a response state $s_{t+1}$,

  5. an observation function is employed to map $s_{t+1}$ to an observation $o_{t+1}$, where $o_{t+1}$ is an observation vector containing both discrete and continuous variables,

  6. the reward $r_t$ is calculated using $s_{t+1}$ and $a_t$, and finally

  7. $o_{t+1}$ and $r_t$ are passed to the agent for the next iteration.

In addition to calculating the reward, the Gym module can optionally produce several metric values for use in performance evaluation (a sketch of two of these metrics follows the list below). The provided Gym module supports the following metrics:

  • time to goal: the elapsed time when the agent reaches the goal;

  • distance from goal: the distance between the robot's final pose and the goal;

  • distance traveled: the total length of the path traversed by the robot;

  • force: approximates the maximum force exerted between the robot and the humans;

  • blame: takes into account the robot velocity at times of close encounters, penalizing robot velocities directed towards humans; it is computed from the closest point to each human on the line segment traced by the robot's motion during the timestep;

  • human collision count; and

  • static obstacle collision count.
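
As a concrete illustration, the sketch below computes two of the simpler metrics (distance from goal and distance traveled) from a recorded robot trajectory; the force and blame metrics additionally require the human states and are omitted here.

```python
import numpy as np

def distance_from_goal(trajectory: np.ndarray, goal: np.ndarray) -> float:
    """Distance between the final robot position and the goal."""
    return float(np.linalg.norm(trajectory[-1] - goal))

def distance_traveled(trajectory: np.ndarray) -> float:
    """Sum of Euclidean step lengths along the robot trajectory."""
    return float(np.linalg.norm(np.diff(trajectory, axis=0), axis=1).sum())

traj = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.5]])
print(distance_from_goal(traj, np.array([2.0, 2.0])), distance_traveled(traj))
```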

IV-B Environment simulation

The Environment module is responsible for representing the state and updating it based on robot actions and the state transition function.

The environment updates the robot state according to the robot motion model and the selected action. For discrete actions, the action must first be converted to continuous velocities by the Navigation module. In addition to updating the robot pose, the environment also needs to update all humans in the scene. However, since the action does not provide velocities for the humans, we need to describe how humans move in response to the environment. Here, the Human module (Section IV-C) uses the updated state from the environment to calculate new positions and velocities for each human in the scene.

IV-C Human simulation

The Human module is responsible for simulating the components of the transition function concerned with how humans behave. This requires computing new human poses and velocities at each timestep based on some model of humans moving through the environment towards a goal.

Each human in the scene moves back and forth between a starting position and a goal pose by planning a global path of intermediary nodes between the two that avoids static obstacles, and employs a separate local planner to handle dynamic obstacles. For global planning, we utilize the Navigation module as described in Section IV-D. The local planner employed is PedsimROS [21], a ROS module that models human behaviors according to the social force model [13, 9]. In brief, the social force model represents each agent in the environment as exerting a repelling force on every other agent, while goals exert attractive forces. We refer the reader to the original work [13] and the PedsimROS implementation [21] for further details.
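
For intuition, a minimal sketch of a social-force-style velocity update is shown below; the gains and the exponential repulsion term are illustrative choices, while PedsimROS implements the full model from [13].

```python
import numpy as np

def social_force_step(pos, vel, goal, others, dt=0.05,
                      k_goal=1.0, k_rep=2.0, rep_range=0.8):
    # Attractive force pulling the agent toward its goal.
    to_goal = goal - pos
    f_goal = k_goal * to_goal / (np.linalg.norm(to_goal) + 1e-6)
    # Repulsive forces pushing the agent away from nearby agents.
    f_rep = np.zeros(2)
    for other in others:
        diff = pos - other
        dist = np.linalg.norm(diff) + 1e-6
        f_rep += k_rep * np.exp(-dist / rep_range) * diff / dist
    # Integrate the resulting force to update velocity and position.
    vel = vel + dt * (f_goal + f_rep)
    return pos + dt * vel, vel

p, v = social_force_step(np.array([0.0, 0.0]), np.zeros(2),
                         np.array([5.0, 0.0]), [np.array([1.0, 0.2])])
print(p, v)
```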

IV-D Navigation

The Navigation module serves three purposes. For general use, it provides a global planning interface used by the Human module, and implementations of baseline deterministic behaviors that do not require learning and can be used as comparisons. For our included benchmark scenario, the Navigation module further provides implementations of the actions that make up the discrete sub-policies of our action space.

The global planning component of the Navigation module is responsible for planning an obstacle-free path from a start location to a goal location. Global planning receives a request containing a start position in the map and a target position, and returns a response containing an intermediary local goal along an obstacle-free path from the start to the target.

The local planner is responsible for avoiding obstacles immediately visible to the robot and calculating control commands to take the robot to the next waypoint on the global plan. The local planner is only called as part of the complete navigation solution, which receives a request containing the current state and returns a response containing the command velocity to be executed by the Environment. For the local planner, we implement a trajectory rollout planner [10] that interfaces with the global planner to get the next local goal, plans obstacle-free trajectories if possible, and comes to a stop otherwise.
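
The sketch below illustrates the trajectory rollout idea: sample candidate command velocities, forward-simulate each over a short horizon, discard colliding rollouts, and choose the one ending closest to the local goal, stopping if none is collision-free. The sampling ranges, unicycle motion model, and scoring are illustrative rather than the planner's actual parameters.

```python
import numpy as np

def trajectory_rollout(pose, local_goal, obstacles, horizon=1.0, dt=0.1,
                       v_samples=(0.0, 0.25, 0.5, 1.0),
                       w_samples=(-1.0, -0.5, 0.0, 0.5, 1.0), clearance=0.3):
    best_cmd, best_cost = (0.0, 0.0), np.inf      # default command: come to a stop
    for v in v_samples:
        for w in w_samples:
            x, y, th = pose
            collided = False
            for _ in range(round(horizon / dt)):  # forward-simulate a unicycle model
                x += v * np.cos(th) * dt
                y += v * np.sin(th) * dt
                th += w * dt
                if any(np.hypot(x - ox, y - oy) < clearance for ox, oy in obstacles):
                    collided = True
                    break
            if collided:
                continue
            cost = np.hypot(x - local_goal[0], y - local_goal[1])
            if cost < best_cost:                  # keep the rollout ending nearest the goal
                best_cmd, best_cost = (v, w), cost
    return best_cmd                               # (linear, angular) command velocity

print(trajectory_rollout((0.0, 0.0, 0.0), (2.0, 0.0), obstacles=[(1.0, 1.0)]))
```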

IV-E Social Action Selection Benchmark Scenario

As part of SOCIALGYM, we include a complete benchmark for social navigation action selection in a multipolicy setting similar to the one employed in prior work [14, 18]. At each timestep, the agent is responsible for choosing one action from a discrete set of actions representing the existing sub-policies of a navigation behavior. These sub-policies are designed as robust behaviors built on the deterministic Navigation module presented in Section IV-D.

A given benchmark in SOCIALGYM is defined by the choice of module implementations, action space, observation space, and reward function, and can be additionally customized for complexity and difficulty via the configuration parameters of the underlying modules. Examples of these configurable parameters include:

  • the map,

  • the number of humans used,

  • the parameters of the human behavior,

  • the robot motion model, and

  • the amount of noise in the simulation.

In the following, we describe the observation, action, and reward spaces for the Social Action Selection Benchmark. For more information on the underlying configuration of the submodules, we refer the reader to the supplementary material and implementation.

We define the discrete action space to contain four sub-policies: stopping in place (Halt), navigating towards the next goal (GoAlone), following a human (Follow), and passing a human (Pass). The observation space consists of the global and local goal poses, the robot velocity, and the relative poses and velocities of the humans visible to the robot, with all non-visible human poses and velocities set to 0. As such, the observation function needs to zero out the observations of all humans that are not visible to the robot because they are occluded by static obstacles in the environment. We choose this observation space to minimize environment-specific features in the observation, using only local coordinates and velocities to reduce the risk of overfitting to specific training examples.
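
The sketch below illustrates how such an observation vector could be assembled, with non-visible humans zeroed out; the fixed maximum number of humans and the exact layout are assumptions rather than the benchmark's actual encoding.

```python
import numpy as np

def build_observation(goal, local_goal, robot_vel, humans, visible, max_humans=8):
    # Fixed-length vector: goal, local goal, robot velocity, then max_humans slots.
    obs = [np.asarray(goal, dtype=float),
           np.asarray(local_goal, dtype=float),
           np.asarray(robot_vel, dtype=float)]
    for i in range(max_humans):
        if i < len(humans) and visible[i]:
            rel_pose, rel_vel = humans[i]          # relative pose/velocity in the robot frame
            obs.append(np.concatenate([rel_pose, rel_vel]))
        else:
            obs.append(np.zeros(6))                # occluded or absent human -> zeros
    return np.concatenate(obs)

humans = [(np.array([1.0, 0.5, 0.0]), np.array([0.1, 0.0, 0.0]))]
obs = build_observation([4, 4, 0], [2, 2, 0], [0.5, 0.0, 0.0], humans, visible=[True])
print(obs.shape)
```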

Finally, the reward function is a weighted linear combination of three metrics: the distance gained towards the goal since the last timestep, the maximum force between the robot and a human, and the maximum blame resulting from the robot's actions and the humans in the environment. Additionally, the agent receives a bonus if the goal is reached during a timestep. This reward function can be written with weights $w_1, \ldots, w_4$ as $R = w_1\,\Delta d_{goal} + w_2\,F_{max} + w_3\,B_{max} + w_4\,\mathbb{1}[\text{goal reached}]$. The weights can be used to specify user- or task-specific preferences with respect to the tradeoff between social compliance and time to goal.
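
The sketch below illustrates one plausible implementation of this reward; the sign conventions and weight values are assumptions chosen so that progress is rewarded and force and blame are penalized.

```python
def step_reward(dist_prev, dist_now, max_force, max_blame, goal_reached,
                w_progress=1.0, w_force=0.5, w_blame=0.5, w_goal=100.0):
    # Distance gained toward the goal since the last timestep.
    progress = dist_prev - dist_now
    # Weighted combination: reward progress, penalize force and blame (assumed signs).
    reward = w_progress * progress - w_force * max_force - w_blame * max_blame
    if goal_reached:
        reward += w_goal        # one-time bonus for reaching the goal
    return reward

print(step_reward(dist_prev=5.0, dist_now=4.6, max_force=0.2, max_blame=0.1,
                  goal_reached=False))
```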


V Benchmark and evaluation

To demonstrate results obtained with SOCIALGYM, we provide experimental results using the benchmark scenario described in Section IV-E and a series of baseline learned and engineered policies. Our reference implementations include two Reinforcement Learning (RL) implementations from the Stable-Baselines project [22], Proximal Policy Optimization (PPO) and Deep Q-Networks (DQN); two Learning from Demonstration (LfD) approaches, Behavior Cloning (BC) and Generative Adversarial Imitation Learning (GAIL), from the HumanCompatibleAI project [26]; two engineered navigation solutions, trajectory rollout based navigation (GoAlone) and a social reference solution (Ref); and a symbolic learning from demonstration approach based on iterative program synthesis (PIPS) [14]. We evaluate the learned techniques on three criteria in three sets of experiments. First, we evaluate the training efficiency of each technique by comparing the number of steps and amount of data needed to learn reasonable policies (Section V-A). Then we evaluate the performance of each policy in terms of the social metrics defined in Section IV-A and compare the learned policies to the engineered policies (Section V-B). Finally, we consider the generalizability of the learned policies by transferring them to a new simulated environment and comparing their performance (Section V-C).
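
For reference, the sketch below shows how the RL baselines can be wired up against a Gym-compliant SOCIALGYM environment using the Stable-Baselines3 API; the environment id and the training budget are illustrative placeholders.

```python
import gym
from stable_baselines3 import DQN, PPO

env = gym.make("SocialGym-v0")          # hypothetical environment id

ppo = PPO("MlpPolicy", env, verbose=1)  # Proximal Policy Optimization baseline
ppo.learn(total_timesteps=100_000)      # illustrative training budget
ppo.save("ppo_social_nav")

dqn = DQN("MlpPolicy", env, verbose=1)  # Deep Q-Network baseline
dqn.learn(total_timesteps=100_000)
dqn.save("dqn_social_nav")
```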

Fig. 3: Progressive performance of DQN and PPO with more training iterations (success count, total time, and average reward per step vs. training iterations, in multiples of 2,000). The shaded regions represent the 90% confidence interval.

V-A Data Efficiency and Learning Rates

The learning algorithms we evaluate fit into two major categories that each require different learning procedures: reinforcement learning approaches learn via interaction with the environment, while Learning from Demonstration approaches require a priori demonstrations of the desired behavior. For each set of techniques, we evaluate and compare the training process, learning rate, and data efficiency when learning with SOCIALGYM in the following paragraphs.

Reinforcement Learning

Training new RL algorithms is a plug-and-play process in which the agent interacts with the environment and iteratively updates its policy. To evaluate the sample efficiency of the considered RL approaches, we trained each model for a fixed total number of timesteps and created a checkpoint of the learning progress at regular intervals. For each model, we evaluate performance on the same test scenarios and report it in terms of the reward described in Section IV-E in Fig. 3. Our evaluation shows two key results: first, both RL algorithms require a large number of timesteps before their performance peaks, and second, the PPO approach is significantly more data efficient than the DQN approach we evaluated. In general, the DQN approach is particularly sensitive to the initial conditions and performs more poorly than the other evaluated approaches on the small sample of trials used for these experiments.
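
The sketch below illustrates such a checkpoint-and-evaluate procedure with Stable-Baselines3; the checkpoint interval, number of checkpoints, and evaluation episode count are illustrative rather than the values used in our experiments.

```python
import gym
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

env = gym.make("SocialGym-v0")                      # hypothetical environment id
model = PPO("MlpPolicy", env)

checkpoint_rewards = []
for k in range(1, 6):                               # illustrative number of checkpoints
    model.learn(total_timesteps=2_000, reset_num_timesteps=False)
    mean_r, std_r = evaluate_policy(model, env, n_eval_episodes=20)
    checkpoint_rewards.append((k * 2_000, mean_r, std_r))
print(checkpoint_rewards)
```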

Learning from Demonstration

To evaluate the data efficiency of the considered LfD approaches, we generated a single demonstration set from our reference behavior and trained multiple policies with decreasing quantities of data, evaluating each model on the same test scenarios, as shown in Fig. 4. We report the performance in terms of the success rate and the reward earned. Notably, PIPS cannot learn policies from the larger demonstration sets, as the symbolic synthesis approach used by PIPS does not scale to large numbers of demonstrations. Conversely, while the DNN-based approaches can be trained on smaller demonstration sets, their performance decreases significantly as the amount of training data is reduced; they only perform similarly to the PIPS-based policy when the largest number of samples is used. There are two likely reasons for this improved data efficiency: first, the program structure used for synthesis provides additional constraints, and second, PIPS selectively subsamples the data by windowing around transitions between actions, an optimization that is not part of the more general-purpose neural approaches.

Training Size    BC (%)   BC r    GAIL (%)   GAIL r   PIPS (%)   PIPS r
155,400          84       0.12    59         0.06     -          -
116,600          16       0.01    40         0.04     -          -
77,700           37       0.04    38         0.02     -          -
38,800           70       0.08    21         0.02     -          -
3,500            49       0.05    27         0.01     88         0.16

Fig. 4: Performance of LfD approaches with varying numbers of demonstrations in terms of the success rate (%) and the average per-timestep reward r.

V-B Performance of Learned Models

To evaluate and compare the learned behaviors, we evaluate them on the same randomly generated set of trials. We report the results in terms of four primary metrics: successful trials, force, blame, and time to goal, as described in Section IV-A. We report these metrics, rather than a score or reward, to better compare the LfD and RL approaches and to better position the results with respect to task performance. We report the percentage of successful trials in Fig. 7 and the metric results in Fig. 5.

In addition to the learned behaviors, Fig. 7 and Fig. 5 feature the engineered GoAlone behavior utilizing only trajectory rollout as described in Section IV-D, and the engineered Reference behavior (Ref) used for comparison and for generating demonstrations for the LfD algorithms. In Fig. 7, a trial is counted as successful if the goal is reached within a bounded time without any collisions between the robot and humans or walls in the environment. In terms of pure success rate, no social behavior is as efficient as the GoAlone policy, likely due to the halting robot problem, where the robot is often left waiting for humans to pass long enough that the scenario times out. When comparing the learned behaviors, we find that the LfD behaviors have a better overall success rate than the RL approaches, suggesting that continuous scenario demonstrations convey the delayed reward of reaching the goal better than the reward function does. When we look at the metric performance of each approach, we see the success rates reflected in the time to goal, with the more successful approaches featuring an overall lower time to goal, while the less successful approaches are slower. An exception is the GAIL policy, which was neither the most nor the least successful while also being the slowest policy. As would be expected, we see a tradeoff between time to goal and social compliance reflected to some degree in both the force and blame graphs, with the slower behaviors exhibiting lower force and blame. In particular, BC and PIPS roughly match the performance of the Ref behavior from which demonstrations were drawn in terms of all three metrics, while in contrast, GAIL, PPO, and DQN optimize for social compliance over time to goal.

Fig. 5: Performance of the evaluated policies (GoAlone, Ref, DQN, PPO, BC, GAIL, PIPS) in the training environment: force, blame, and time to goal as a function of the number of humans. The shaded regions represent the 90% confidence interval.

Fig. 6: Performance of the evaluated policies (GoAlone, Ref, DQN, PPO, BC, GAIL, PIPS) in a novel environment: force, blame, and time to goal as a function of the number of humans. The shaded regions represent the 90% confidence interval.

Policy     Test (%)   Transfer (%)
GoAlone    97.5       97.3
Ref        89.4       92.4
BC         92.3       49.9
GAIL       83.3       18.9
PIPS       88.9       89.2
DQN        54.1       8.8
PPO        55.1       47.2

Fig. 7: Success rates (%) in the training environment (Test) and after transferring to a novel environment (Transfer).

V-C Environment Transfer Performance

To evaluate the ability of the learned policies to transfer between environments, we introduce a new set of scenarios using a novel map with a significantly different configuration than the map used for policy training. This new map consists of a different set of static obstacles and a new set of possible start and goal locations for both the robot and the humans. For comparison, the environment used for training and the evaluation environment used for this experiment are shown in Fig. 8.

Fig. 8: Environment used for training and evaluation on the left, and new environment for transfer evaluation on the right.

We evaluate each policy on the same random trials drawn from the new environment and report the performance in terms of force, blame, and time to goal in Fig. 6, and in terms of the success rate in Fig. 7. The first key result is that the engineered policies (GoAlone, Ref) do not degrade in success rate when transferred between environments, while all learned policies except PIPS degrade significantly. The time to goal metric reflects this, with more successful behaviors achieving lower average times to goal and behaviors more likely to fail achieving much slower average times to goal. The new environment is less difficult for the engineered policies, as the distribution of humans is sparser in the larger, more open map. This suggests that despite our efforts to design an observation space and reward that are environment-agnostic, the non-symbolic learning algorithms are still making decisions that are highly environment-dependent. In terms of the social metrics, the results for the learned policies mirror those of the engineered policies: the new environment is more manageable, with lower force and blame achieved by all behaviors thanks to the sparser human distribution. These results demonstrate what we would expect, namely that model-based and symbolic approaches transfer between environments better than black-box model-free approaches.

VI Discussion and future work

In this paper, we presented SOCIALGYM, a configurable simulator and benchmarking tool for socially navigating robots. SOCIALGYM provides an interface for easily integrating learning algorithms with a 2D simulator designed to abstract away perception and localization and focus on the difficult task of learning robust robot behaviors for navigating amongst humans in complex environments. Further, we presented empirical results from evaluating a suite of baseline algorithms in our provided social action selection benchmark that demonstrate the utility of SOCIALGYM. While SOCIALGYM provides an extensible interface, it is difficult to imagine that any one solution will perfectly fit the needs of myriad researchers, and we currently provide only a single benchmark evaluation case. It is our hope that SOCIALGYM will reduce the barrier to entry for evaluating learning algorithms for social robot navigation by allowing other researchers to build on an extensible foundation for social navigation learning and evaluation.

References

  • [1] A. Biswas, A. Wang, G. Silvera, A. Steinfeld, and H. Admoni (2021) SocNavBench: a grounded simulation testing framework for evaluating social navigation. THRI.
  • [2] K. Charalampous, I. Kostavelis, and A. Gasteratos (2016) Robot navigation in large-scale social maps: an action recognition approach. Expert Systems with Applications 66, pp. 261–273.
  • [3] C. Chen, Y. Liu, S. Kreiss, and A. Alahi (2019) Crowd-robot interaction: crowd-aware robot navigation with attention-based deep reinforcement learning. In ICRA, pp. 6015–6022.
  • [4] Y. F. Chen, M. Everett, M. Liu, and J. P. How (2017) Socially aware motion planning with deep reinforcement learning. In IROS, pp. 1343–1350.
  • [5] M. Chevalier-Boisvert, L. Willems, and S. Pal (2018) Minimalistic gridworld environment for OpenAI Gym. GitHub. https://github.com/maximecb/gym-minigrid
  • [6] P. Ciou, Y. Hsiao, Z. Wu, S. Tseng, and L. Fu (2018) Composite reinforcement learning for social robot navigation. In IROS, pp. 2553–2558.
  • [7] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun (2017) CARLA: an open urban driving simulator. In Proceedings of the 1st Annual Conference on Robot Learning, pp. 1–16.
  • [8] Unreal Engine.
  • [9] G. Ferrer, A. Garrell, and A. Sanfeliu (2013) Robot companion: a social-force based approach with human awareness-navigation in crowded environments. In IROS, pp. 1688–1694.
  • [10] B. P. Gerkey (2008) Planning and control in unstructured terrain.
  • [11] J. K. Haas (2014) A history of the Unity game engine.
  • [12] T. V. D. Heiden, C. Weiss, N. S. Nagaraja, and H. V. Hoof (2020) Social navigation with human empowerment driven reinforcement learning. ICANN, abs/2003.08158.
  • [13] D. Helbing and P. Molnár (1995) Social force model for pedestrian dynamics. Phys. Rev. E 51, pp. 4282–4286.
  • [14] J. Holtz, S. Andrews, A. Guha, and J. Biswas (2022) Iterative program synthesis for adaptable social navigation. In IROS.
  • [15] N. Koenig and A. Howard (2004) Design and use paradigms for Gazebo, an open-source multi-robot simulator. In IEEE/RSJ International Conference on Intelligent Robots and Systems, Sendai, Japan, pp. 2149–2154.
  • [16] D. V. Lu, D. B. Allan, and W. D. Smart (2013) Tuning cost functions for social navigation. In Social Robotics, G. Herrmann, M. J. Pearson, A. Lenz, P. Bremner, A. Spiers, and U. Leonards (Eds.), Cham, pp. 442–451.
  • [17] L. Manso, P. Trujillo, L. Calderita, D. R. Faria, and P. Bachiller (2020) SocNav1: a dataset to benchmark and learn social navigation conventions. ArXiv abs/1909.02993.
  • [18] D. Mehta, G. Ferrer, and E. Olson (2016) Autonomous navigation in dynamic social environments using multi-policy decision making. In IROS, pp. 1190–1197.
  • [19] J. Mumm and B. Mutlu (2011) Human-robot proxemics: physical and psychological distancing in human-robot interaction. In HRI, pp. 331–338.
  • [20] K. M. O. and A. B.R. (2020) SRIN: a new dataset for social robot indoor navigation. Glob J Eng Sci.
  • [21] PedsimROS.
  • [22] A. Raffin, A. Hill, M. Ernestus, A. Gleave, A. Kanervisto, and N. Dormann (2019) Stable Baselines3. GitHub. https://github.com/DLR-RM/stable-baselines3
  • [23] S. Shah, D. Dey, C. Lovett, and A. Kapoor (2017) AirSim: high-fidelity visual and physical simulation for autonomous vehicles. CoRR abs/1705.05065.
  • [24] L. Tai, J. Zhang, M. Liu, and W. Burgard (2018) Socially compliant navigation through raw depth inputs with generative adversarial imitation learning. In ICRA.
  • [25] N. Tsoi, M. Hussein, J. Espinoza, X. Ruiz, and M. Vázquez (2020) SEAN: social environment for autonomous navigation. In Proceedings of the 8th International Conference on Human-Agent Interaction, HAI '20, New York, NY, USA, pp. 281–283.
  • [26] S. Wang, S. Toyer, A. Gleave, and S. Emmons (2020) The imitation library for imitation learning and inverse reinforcement learning. GitHub. https://github.com/HumanCompatibleAI/imitation