Role Playing Learning for Socially Concomitant Mobile Robot Navigation

05/29/2017 ∙ by Mingming Li, et al. ∙ National University of Singapore 0

In this paper, we present the Role Playing Learning (RPL) scheme for a mobile robot to navigate socially with its human companion in populated environments. Neural networks (NN) are constructed to parameterize a stochastic policy that directly maps sensory data collected by the robot to its velocity outputs, while respecting a set of social norms. An efficient simulative learning environment is built with maps and pedestrians trajectories collected from a number of real-world crowd data sets. In each learning iteration, a robot equipped with the NN policy is created virtually in the learning environment to play itself as a companied pedestrian and navigate towards a goal in a socially concomitant manner. Thus, we call this process Role Playing Learning, which is formulated under a reinforcement learning (RL) framework. The NN policy is optimized end-to-end using Trust Region Policy Optimization (TRPO), with consideration of the imperfectness of robot's sensor measurements. Simulative and experimental results are provided to demonstrate the efficacy and superiority of our method.



There are no comments yet.


page 8

Code Repositories


Codes for Learning and Simulation of Role Playing Learning for Socially Concomitant Mobile Robot Navigation

view repo
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

The capability to navigate in densely populated and dynamic environments is one of the most important features that enable the deployment of mobile robots in unstructured environment, such as schools, shopping malls and transportation hubs. The key difference between the problem of navigating among humans and the traditional path planning and obstacle avoidance problems is that humans tend to smoothly evade each other interactively and cooperatively, rather than remaining static or maintaining an indifferent trajectory dynamics. In other words, there are social norms that need to be understood and complied to achieve maximum comfort of all involved pedestrians during navigation. We refer to this as the problem of social navigation, which aims to model such social norms and develop a robotic navigation policy that is socially acceptable to the pedestrians around.

For social navigation, the traditional approaches based on Dynamic Window Approach (DWA) [1] or potential fields [2, 3] are usually of limited efficacy as pedestrians are simply regarded as uncooperative obstacles. An illustrative example is the freezing robot problem (FRP) [4, 5], where a mobile robot will be stuck in a narrow corridor when facing a crowd of people if it lacks the ability to predict the joint collision avoidance behaviors of human pedestrians. To this end, researches have been done to understand the principles of humans’ joint collision avoidance strategies and one of the pioneering works are the social force model (SFM) [6, 7]. Other joint collision avoidance model such as reciprocal velocity obstacles (RVO) have been proposed in [8, 9, 10], with an underlying assumption that all involved agents adopt the same collision avoidance strategies. These ideas are also applied to visual tracking of pedestrians [11, 12]. More recently, several attempts are made to learn probabilistic models of pedestrians’ trajectories during joint collision avoidance, based on which the robot’s navigation decision is generated such that it is able to behave naturally and correctly in similar situations [13, 5, 14, 15].

In this paper, we propose to augment the dimensions of human-robot interaction in social navigation by further endowing robot with appropriate group behaviors when it is travelling with a human companion. This capability is highly desirable for assistive mobile robots [16, 17, 18], which serve as assistants and companions and are expected to travel along with theirx human partners in not only home environment but also possibly crowded public areas. In other words, apart from understanding the collision avoidance behaviors of pedestrians, the robot also needs to consider the motion of its companion so as to maintain a sense of affinity when they are travelling together towards a certain goal. We call this socially concomitant navigation (SCN) and it is more challenging than the aforementioned social navigation problem, where the robot is assumed to travel alone with a simpler pursuit of reaching a specific goal while being free of collision.

To address the problem of SCN, we develop a new learning scheme called Role Playing Learning (RPL). Particularly, we formulate such problem under the framework of Partially Observable Markov Decision Process (POMDP) and reinforcement learning (RL). A neural network (NN) is used to parameterize the navigation policy of the robot, which is optimized to gives proper steering commands for the next time instance based on the robot’s current and previous observations to its surroundings. To facilitate the RL process, we create a simulative navigation environment by mirroring a collections of real world pedestrians data sets and develop an on-policy optimization method called Partially Observable Trust Region Policy Optimization (PO-TRPO). In each run in an optimization iteration, the robot will attempt to play itself as a companion of a randomly chosen pedestrian by executing the NN navigation policy. The NN policy is then optimized using PO-TRPO based on a batch of collected trajectories. Compared to the existing analytically derived or data-driven approaches, our RPL scheme has the following advantages:

  1. RPL scheme is less restrictive. It does not rely on the assumption that the robot and other agents (pedestrians) share the same decision-making models [8, 9, 10, 5, 14] or that the navigation goals of pedestrians are known [5, 14].

  2. The formulation of RPL scheme is more generalizable and flexible. Our formulation contain no manually-defined feature and domain knowledge (e.g., statistics of pedestrians’ behaviors). It is not hardware-specific and can be easily modified to incorporate kinematics of different mobile robot platforms, sensor specifications and navigation objectives. In addition, unlike [15, 14], the learned navigation policy operates without assess to the global map of the environment. Therefore, it is not environment-specific and is well generalizable to unmet real-world scenarios.

  3. We explicitly consider the noise and limitation of the robot’s sensor measurements. Most approaches for social navigation assume that the robot has full and accurate knowledge of interested variables, such as positions or distance of pedestrians and obstacles [8, 9, 10, 14]. On the contrary, our RPL schemes is rooted from the situation where the robot can only perceive those lie within its sensor’s Field of View (FoV), with the existence of measurement noise.

  4. As a RL-based approach, RPL is efficient. Although RPL aims at solving tasks that involve interaction among robot, humans and physical environment, it does not require participation of human in both data collection and learning, which is known to be tedious and time-consuming. Instead, the learning process is safely automated in a simulative yet realistic environment with no human intervention.

We evaluate the performance of our approach in both simulations and real-world experiments, by comparing it with a baseline planner based on RVO [8] and humans, repectively. We also show that, with some tricks, the learned navigation policy can still be effective when the navigation scenario is reduced to the aforementioned social navigation, which means the robot is travelling without human companion.

The remainder of this paper is organized as follows. Related work is first reviewed in Section II. In Section III, the problem of SCN is formulated as a POMDP and associated definitions are given. RPL scheme and PO-TRPO algorithm are described in Section IV. Sections V and VI provide extensive results of simulation and experiment, followed by some concluding remarks in Section VII.

Ii Related Work

The problems of robot navigation in populated and dynamic environment can be addressed from a number of angles, which can be largely classified into two groups as in the following subsections.

Ii-a Interactive Behaviors Models

Many researches have been proposed to describe the interactive navigation behaviors of humans by fitting a computational model to the observed pedestrians trajectories [19]. In this way, the robot’s path planner is able to understand pedestrians’ intention during joint collision avoidance and actively calculate an optimal route towards its goal.

In the field of robotics, a majority of work in this direction is done via inverse reinforcement learning (IRL) [20], which learns a cost function that explains the observed behaviors. For example, maximum entropy IRL [21] is adopted in a number works [22, 23, 24, 25, 26] for discrete human behavior prediction and route planning. However, discrete representation is less desirable when modeling trajectories, which are in nature continuous and has higher order dynamics, such as velocities and acceleration. Instead, [15] adopts Maximum-A-Posteriori Bayesian IRL [27] to learn appropriate navigation behavior of a specific mobile robot from a set of demonstration trajectories. Note that, the demonstration data in [15] is specific to configurations of the robot and its sensor and has to be collected via human operation, which could be time-consuming. On the other hand, [13, 14] learns probabilistic models of composite trajectories of pedestrians from video data by maximum entropy learning and IRL. To better capture the characteristics of observed trajectories, they propose to develop their models based on a set of features that are hand-crafted according to the domain knowledge from psychological studies. In addition, those features contain velocities and accelerations of pedestrians, which, in practice, are hard to precisely measure. Besides, interacting Gaussian process (IGP) is derived in [5] to model the joint trajectories of pedestrian while explicitly considering the effects of observation noise. Nevertheless, the design of IGP also requires several hand-crafted kernels that are formulated based on the priori information in a specific application scenario.

Other than researchers in robotics, the community of computer vision also possess great interest in pedestrian modeling. One of the important topics is trajectory prediction in video space. In

[11], Linear Trajectory Avoidance (LTA) is developed as a dynamic model for pedestrians in video space for short-term trajectory prediction and it is integrated into visual tracking system. Gaussian process is adopted in [28] to learn the motion pattern of pedestrians. Recently, Social LSTM is proposed in [29] for human trajectory prediction in crowd space. Similarly, the feature of social sensitivity is developed in [30] to analyze trajectories of pedestrians and bicyclists. While the above methods can effectively predict the navigation intention of pedestrians in videos, it is still unclear how to apply these model to navigation of robot in real scenarios.

Ii-B Steering Models

In contrast to learning behavior models of pedestrians, a more direct perspective is to develop a steering model that outputs the immediate navigation actions given the robot’s current observation to the environment. One of the pioneering work in this direction is the social force model (SFM) [6], which uses energy/potential functions to encode the social status of pedestrian. Then, the navigation motivation of a pedestrian can be derived by taking the gradients of these energy functions. Following this idea, subsequent work [31, 32, 33, 12] propose to infer the optimal parameters of the energy function by fitting them to video data. However, they are likely to produce suboptimal results if the demonstration data from humans are imperfect. In [34], the authors integrate a people tracker and an iterative planner, with which the robot actively follows the pedestrian travelling in a similar direction to navigate through crowded environment. [35] follow the same idea and formulate the choice of a pedestrian to follow as a Multi-Policy Decision Making process. On the other hands, [36] develops a hierarchical POMDP for predictive navigation in dynamic environment. The idea is to predict the motion of pedestrians and generate a environment-specific cost map for path planning and obstacle avoidance.

Other than navigating in a pedestrian-aware manner, several reactive collision avoidance techniques have also been developed, such as DWA [1, 37], velocity obstacles [38] and reciprocal velocity obstacles (RVO) [8, 9, 10]. The common idea of these methods is to treat pedestrians as moving obstacles and reactively update the planner every short periods to achieve collision avoidance. As mentioned in Section I, these methods are less effective for social navigation as they lack predictive abilities and are based on some restrictive assumptions, such as accurate knowledge of moving agents’ velocities [37] and that all agents adopt the identical collision avoidance strategy [8, 9, 10].

Our proposed navigation policy belongs to the steering models. It takes an observation vector as input and outputs the navigation action through a stochastic neural networks. During RPL, our policy is optimized by the PO-TRPO algorithm, which is derived based on the recent advances in deep reinforcement learning (DRL)

[39, 40]. DRL exploits the massive representation power of deep neural networks (DNN) [41] to build a complex yet sophisticated decision model, with which an agent can directly learn from raw signals instead of carefully crafted feature and tends to act more intelligently. Recently, there are several attempts in using DNN and DRL for robot navigation. For example, an end-to-end motion planner is learned in [42] to map raw sensor data of a laser range finder onto steering commands of a mobile robot. In [43], a decentralized multi-agent collision avoidance policy is learned via DRL, which can be thought as a DRL version of the original RVO approach [8]. Finally, a target-driven visual navigation policy for home environment is learned in [44] via DRL. They create a set 3D virtual home environments for effective and efficient training of the agent, which shares a similar idea with our proposed RPL scheme.

Iii Problem Formulation

To formulate the problem of socially concomitant navigation, we gives the following rules of SCN:

  1. The robot should reach its goal as fast as possible;

  2. The robot should not collide with any of the pedestrians or its companion, or run into any obstacle;

  3. The robot should not run too far away from its companion.

The above rules serve as a generic description of the robot’s desired performance during navigation. To give concrete definitions, consider the navigation process as an infinite-horizon discounted POMDP in discrete time, defined by the tuple . is a finite set of states reflecting the navigation status of the robot. is a finite set of actions . In this paper, it is defined as a twosome of the translational and rotational velocities of a synchro-drive mobile robot, i.e., . is state-transition mapping, which is characterized by the dynamics of the robot, the other humans and the environment. Without loss of generality, we assume deterministic state transition, i.e., , where are the state and action taken at time . is the set of the robot’s observation to the state and

denotes the conditional observation probability distribution. Note that, in practice, the robot’s observation has only incomplete access to

or is subject to certain measurement noise, which implies . is the initial state distribution, i.e., . is a scalar reward given to the robot and is the reward discount factor.

Robot motion dynamics: In this paper, synchro-drive mobile robots are considered, whose motion equation can be approximated by assuming the robot’s velocities to be constant within a certain short time period [1] with length . Particularly, let and denote the robot’s heading and its positions in a 2D Cartesian space at time , respectively. and represent the robot’s translational and rotational velocities. Define and . When the robot has nonzero rotational velocity, i.e., , we have


Otherwise, when ,


With the above formulations, our goal is optimizing a stochastic navigation policy with parameters in order to maximize the expected discounted reward:


where denotes the whole trajectory and . The specific definitions of the above ingredients for SCN will be elaborated as follows:

State: Given and , define the distance and direction of a point to the robot as follows:


Then, the robot’s distance to the goal located at are computed as and denotes the offset angle between the robot’s current heading and its goal. Similarly, we can define the twosomes , or to describe the relative position of a pedestrians , a companion or an obstacle to the robot. With such definitions, the state is defined to incorporate the information related to the robot’s navigation status as follows:


where is the current action vector and


The vector includes the distances and directions of closest pedestrians while includes those of the robot’s companion.

The vector is a compact description to the robot’s perception of the surrounding environment. Particularly, the boundaries of the occupied space (obstacles) in the environment are represented as a finite point set . Then, the 9 variables in are defined based on the following assumption

Assumption 1

An obstacle has no effect on the robot’s navigation decision if it satisfies , where is a predefined finite constant.

By Assumption 1, it is sufficient to consider only obstacles in that are closed enough to the robot, whose distances are less than . In practice, this limit may correspond to the robot’s perception range. Let


The components in vector are described as follows:

The distance to the nearest obstacle located at heading of the robot, i.e.,


where is a small constant.

For , they represent the distance and direction of the closest and farthest obstacles on the robot’s left () and right side (), respectively, which are defined mathematically as follows:


Then, the variables in can be simply determined as the distance and directions of the above points according to Eqs. (6) and (7). Figure. 1 provides a comprehensive illustration of the state variables and .

Fig. 1: Illustration of the state variables in Eq. (8). The blue, yellow and green circles represent the robot, its companion (Com.) and the pedestrians (Ped.) respectively. The red dashed circle with a radii represents the boundary of the set in Eq. 12. The black arrow shows the current heading of the robot. Considering the robot’s current position as the origin, the polar coordinates of the pedestrians, the companion, the closest( and the farthest) obstacles in each direction are compactly represented as vectors .

Observation: As discussed in the previous sections, sensors mounted on the robot are always subject to various kinds of limitation and measurement noise, which must be taken into account in order to develop a robust and practical navigation system. To this end, we define as the robot’s observation to the true state as follows:


By Eq. (18), we assume that the robot has accurate information about the goal position and its current taken action (i.e., the velocity commands output to the robot’s motor) while its observations to may be imperfect. Particularly, consider the Field of Views (FoVs) for the robot’s pedestrian and obstacle detectors illustrated as Fig. 2.

Fig. 2: Field of Views of the pedestrians (green) and obstacles (blue) detectors. The arrow () points towards the current heading of the robot. The constants and denote the maximum and minimum offset angles in the corresponding FoVs. Finally, and represent the maximum detection ranges for the pedestrian and obstacle detectors, respectively. The values of these constants should be determined according the specific configurations of the robot’s sensor and the corresponding detection algorithms. Any pedestrian/obstacle outside the FoVs is not observable and therefore will be omitted.

Mathematically, let finite point sets and denote the current FoVs of pedestrian and obstacle detectors, characterized by threesomes and , respectively. The robot’s observations to the pedestrians’ relative positions are obtained as






for , with being the measurement noise/error.

Similarly, define


Compared to the states in (13) to (17), only the obstacles within are observable. Thus, is formulated as


where is the measurement noise/error for obstacle detection. The closest observed obstacles on the robot’s left and right sides are defined in a similar way as:


Then, their distance and directions to the robot are calculated using Eqs. (6) and (7). For observation to the robot’s companions, we rely on the following assumptions.

Assumption 2

The companions are always observable to the robot.

Then, , where

Remark 1

By Eqs. (20) and (23) to (28), it is implied that the observation/measurement noises and

are additive and independent in different observations. A typical example of such noise is the Additive Gaussian White Noise (AGWN).

Remark 2

Our general formulations of states (8) and observations (18) are applicable to various types of onboard sensors, such as range sensors [45, 46], RGB-D [47], Time-of-Flight (ToF) [48] and omnidirectional cameras [49]

, as long as the interested positions can be extracted/estimated from the sensor’s raw measurements.

Remark 3

The mathematical definitions of the variables in observations are given for better understanding and are required only in the simulative RPL process. In practice, it is clear that these values can be directly measured via the robot’s onboard sensors without accessing the actual 2-D Cartesian coordinates of the considered point sets (e.g., , and ). For example, consider a robot equipped with a laser range finder. These distances and offset angles can be easily obtained from the returned ranges array[50].

Reward function: A scalar reward will be given to the robot as an award of reaching the goal or a penalty of colliding with obstacles/pedestrians/companions or losing its companions. Particularly, at time , the process of SCN will be terminated if any of the following three termination conditions is true.

1) Goal Reaching Condition


2) Collision Conditions


3) Stray Condition


Based on the above three terminal conditions, a reward will be given to the robot as follows:


Clearly, a positive reward will be given to the robot if it reaches its goal and it will receive a large negative reward if it collides with anything or be stray from its companion. Otherwise, the robot will receive an intermediate reward , which penalizes the robot for its rotational velocity to encourage a smoother trajectory with less turning behaviors.

Iv Role Playing Learning

In this section, we described the RPL scheme to learn an effective navigation policy for SCN in an efficient data-driven manner. The core idea is to transform the crowd trajectories data collected from real-world into a simulative and dynamic navigation environment, where the robot can play itself as a virtual pedestrian and iteratively improve the performance of via Partially Observable Trusted Region Policy Optimization (PO-TRPO).

Consider a set of simulative navigation environment . Each environment contains a set of pedestrian trajectories and a binary map that annotates the 2-D Cartesian coordinates of obstacles/occupied space in the environment. With , the abstract process of RPL is described by the following pseudo codes in Algorithm 1.

  Initialize navigation policy
  for  do
     while Number of collected sample time steps Batch_size do
        Randomly choose an environment from and then a trajectory from .
        Initialize the robot’s position at and initial velocities . Set . Choose the robot’s heading such that .
        Choose SCN Mode with probability 0.5
        if SCN Mode then
           Assign as the trajectory of the robot’s companion, where
           Create a synthesized companion that moves along the robot
        end if
        Assign all other trajectories in as pedestrians.
        while None of the termination conditions in (29) to (33) is satisfied do
           Update the states and observations of the robot according to Eqs. (8) and (18).
           Let the robot execute its policy .
           Update the robot’s position according to dynamics (1) to (4)
           Calculate the current reward from Eq. (34)
           Update the positions of the companion and pedestrians according to the trajectories in .
        end while
     end while
     Update using PO-TRPO.
  end for
Algorithm 1 Role Playing Learning

Companion Synthesization in non-SCN mode: As described in Algorithm 1, RPL actually incorporates two different navigation scenarios: the SCN proposed in this paper and the traditional social navigation scenario, where the robot has no human companion. This helps develop a navigation policy adaptable to both situations, with no restrictive assumption on the existence of companion. Particularly, the companion position vector and its observation are synthesized, with for every time step while . It is clear that the synthesized is equivalent to the situation where the companion is travelling non-distractively along the robot with a constant distance and guarantee that termination conditions (31) and (33) are always false.

On the other hand, in SCN mode, the companion is assigned with a truncated trajectory such that the initial robot-companion distance is sufficiently large.

Fig. 3: Structure of the deep policy network . At time , the observation vector

is input to the feature network, which is a feedforward multi-layer perceptron (MLP). The output of the feature network is then fed to a LSTM network

[51], a recurrent network for aggregation of the information collected through the navigation process. The LSTM network’s outputs are assigned as the mean vector of the diagonal Gaussian unit on the right. The covariance matrix , however, is independent of amd it is designed to be gradually decreasing during training and fixed during tests and experiments. Finally, the actions are drawn according to .

In this paper, we construct a deep policy neural network to parameterize the navigation policy , whose structure is shown in Fig. 3. The policy network is to be trained with the Trust Region Policy Optimization (TRPO) [40] method. However, the original TRPO method is derived based on fully observable MDP, which can not be directly applied to our problem due to the imperfect observation in our formulation and practice. Thus, we proposed to extend the original TRPO algorithm as PO-TRPO, which will be described in the following subsections.

Iv-a Trusted Region Policy Optimization

The TRPO [40] algorithm is an effective on-policy optimization method for large nonlinear policies and tends to give monotonic improvement during the iterative optimization process. To be specific, a fully observable MDP is considered by TRPO and therefore the policy to be optimized is formulated as , where is the parameter vector of the policy . Note that, determines the action directly from the true state , which differs from our observation-based policy . Let us consider the following standard definitions of the state-action value function , the value function and the advantage function :




In addition, define as the discounted visitation frequencies


where , and are generated according to and . Let denote the old parameters in last iteration. TRPO proposes to optimize the parameters iteratively regarding the following objective function:

maximize (40)
subject to (41)

where is the importance sampling distribution and

is the Kullback-Leibler divergence between the old and current policies.

Iv-B Partially Observable TRPO

As mentioned, our navigation problem is considered as a POMDP. The policy depends on the observation instead of the true state. Therefore, we write the objective function (40) and the constraint (41) as

maximize (42)
subject to (43)

For PO-TRPO, samples are collected by executing the old policy to generate a set of trajectories, such as . Therefore,


where .

Next, for a trajectory , we use the generalized advantage estimation (GAE) [39] to construct an empirical estimation of the advantage function as the following:




and is the estimation of the value function (36) with parameters (and being the old parameters). By collecting a set of trajectories , is obtained by solving the following constrained regression problem [39]:

minimize (47)
subject to (48)

Finally, as the conditional observation probability distribution is independent of parameters and time, we obtain an estimation of the objective function (42) and the constraints (43) by replacing the expectations with sample averages as:

maximize (49)
subject to (50)



which has the same form as the one obtained in [39], except that the policy is conditioned on observation instead.

Finally, the constrained optimization problem described in (49) and (50) is solved by conjugate gradient algorithm [52] algorithm. To summarize, the pseudo code for PO-TRPO update in Algorithm 1 is given as below:

  Compute the estimated advantages for all time steps using GAE with the estimated value function .
  Update with objective function (49) and constraints (50)
  Update with objective function (47) and constraints (48)
Algorithm 2 PO-TRPO

V Simulation

As a data-driven approach, our deep neural network policy requires a massive amount of data to learn the socially concomitant navigation behavior. In this section, we describe how to construct a simulative environment according to the proposed RPL scheme. Particularly, the environments, the deep neural network policy and the PO-TRPO algorithm (Algorithm 2) are developed under the framework of RLLAB [53]. We make use of trajectories of interacting pedestrians collected from five different data sets, which includes the ETH and Hotel video clips from the ETH Walking Pedestrians (EWAP) [11], the motion capture (MC) data set from [14], as well as the Zara and UCY video clips from [32]. Note that, the Zara and UCY data sets have multiple subsets: Zara01, Zara02, Zara03, UCY01 and UCY03. Thus, there are totally 8 different RPL environments, i.e., . The details of these 8 environments are summarized in Tab. I.

Name ETH Hotel MC Zara01
No. Trajectories 365 420 324 148
Name Zara02 Zara03 UCY01 UCY03
No. Trajectories 204 137 413 434
TABLE I: Details of the 8 RPL environments

Each trajectory in these environment provides the ID and a sequences of 2-D Cartesian positions of a pedestrian with a sampling period second. In addition, eight binary grid maps representing the occupied space/static obstacles are given. However, these maps are kept unknown to the robot throughout training and evaluation. They are only used to simulate the robot’s perception to the environment as the state and observation . Without loss of generality, we use the ETH data set as the evaluation environment and all other data sets in the Tab. I as training environments. In other words, the learned policy’s performance will be assessed in an RPL environment that is excluded during training, which reflects whether it can properly generalize to uncovered situations.

As some of the trajectories in these environments are of people who were wandering or remained approximately stationary, they are excluded from the candidates of the robot’s companion but will still be considered as pedestrians when the robot is navigating in the same environment.

We use a feed-forward neural network with 2 hidden layers as the feature network in our NN policy, containing 256 and 64 Tanh units, respectively. Its output is then fed to a LSTM network with 64 units. The variance of of the Gaussian output unit

is chosen to be linearly decaying from 0.5 to 0.05 in 100 training iterations, which effectively encourages exploration during the early stage of learning and ensures convergence of the navigation policy. For GAE, a 3 layer feed-forward network with 256,64,16 Tanh units are used, with and . The update step size for policy network is adaptively chosen as . For GAE update, a fixed step size is used. The update batch size (Batch_size in Algorithm 1) is 50000.

In RPL, we consider at most 3 pedestrians (i.e., ). Thus, the state and observation will only describe the 3 closest pedestrians and omit the others. For the situation where less that 3 pedestrians are perceived, dummy static pedestrians will be created in the remote corner of the environment so as to maintain the dimensions of and .

Considering a Kobuki Turtlebot 2 with a Hokuyo URG-04LX laser range finder [50] mounted on its top, we specify the sensor limitation of the robot in simulation as follows:


The measurement noises and are modeled by zero-mean Gaussian and with their variances specified as follows:


Finally, the maximum translational and rotational velocities are assigned as and , i.e., and and . An example of our RPL environment constructed from the ETH data set is illustrated in Fig. 4

(a) Real-world Environment
(b) Simulative Environment for RPL
Fig. 4: An illustrative example our RPL simulative environment. The black curve represents the trajectory of the robot navigating toward its goal (the red dot). The yellow curve denotes the trajectory of the robot’s companion. Besides, there are a number of blue curves representing the pedestrians perceived by the robot and the green lines denotes the fences around the entrance of the university (bottom center). Note that, all trajectories of pedestrians are not synthesized but captured from the video. Thus, the robot can be thought as playing a role as an extra person in an realistic environment.

V-a Results

We trained our deep policy network for 1200 iterations with the data from RPL environments except for the held-out ETH environment. The curve of average discounted return obtained from each batch of trajectories is visualized in Fig. 5

Fig. 5: Average discounted return as RPL progresses

We compare the performance of our policy with a planner based on RVO [8], where the robot, its companion and the surrounding pedestrians are treated agents. In every time steps, the positions and velocities of all agents are given to the planner. Note that, for fair comparison, the agents’ positions are subject to noise described in (54) and (55). For observations to obstacles, we assume the planner has full and perfect knowledge as required in the original RVO algorithm. With this protocol, we update the robot’s position according to the planner’s output and update the positions of the other agents according to their own trajectories in the RPL environments. The same termination conditions in Section III are applied to the robot directed by the RVO-based planner to determine whether the robot has conducted an successful navigation. For both of our policy and the RVO-based planner, we conduct 300 trials in the evaluation environment and compute the rates (in percentages) of different terminal conditions (RG: the robot reaches the goal successfully; HC/HP/HO: the robot hits a companion/pedestrian/obstacle; and LC: the robot loses its companion). The performance statistics of our policy and the RVO-based planner in SCN scenarios are listed in Tab. II.

Terminal Condition RG LC HC HP HO
Our policy 77 5.7 6.7 9.6 1
RVO 29.7 47 0.3 23 0
TABLE II: Rates of different terminal conditions of our policy and RVO-based planner in SCN scenarios

It can bee seen from Tab. II that our policy performs much better than the RVO-based planner in SCN. The RVO-based planner has a much lower success rate (29.7%) while its rate of LC is , suggesting that it frequently losses its companion in SCN. Clearly, this is due to the fact that RVO is in nature a collision avoidance algorithm. Thus, it simply takes the robot’s companion as another normal agent and the robot tends to stay far behind its companion to avoid collision instead of actively following it. On the contrary, our policy achieves a much higher success rate (77%). This indicates that it learns to effectively balance the objectives of SCN so that the robot is able to reach the prescribed goal while maintaining its distance to its companion and avoiding collision with other agents in the environment.

In addition to SCN, the scenarios without companion are also tested, which, as analyzed in the previous sections, reduces to the traditional social navigation scenarios. The comparative results are shown in Tab. III.

Terminal Condition RG HP HO
Our policy 84 13.7 2.3
RVO 80 18 2
TABLE III: Rates of different terminal conditions of our policy and RVO-based planner in traditional social navigation scenarios

For situations without companion, our policy still outperforms the RVO-based planner with higher success rate (84% to 80%) and lower HP rate (13.7% to 18%).

Finally, it is worth noting that the RVO-based planner requires velocities of the companion/pedestrians and an accurate global map of the static obstacles. Conversely, our policy depends only on position measurements that are directly accessible from the robot’s onboard sensors, which is therefore much simpler and more practical.

Vi Experiments

In experiments, we assess the performance of our developed navigation policy by comparing it with humans in the same scenarios. Particularly, a robot and a human are to repeat each specific navigation scenario for 10 times, respectively. Then, the the following two metrics are calculated:

  1. Average minimum distance to the pedestrians (): the average of the minimum distance between the robot/compared human to other pedestrians throughout a trajectory .

  2. Average maximum distance to the companion (): the average of the maximum distance between the robot/compared human to its/his companion throughout a trajectory.

We use the same mobile platform (a synchron-drive Turtlebot 2 with a Kobuki base) and the same laser range finder (Hokuyo URG-04LX) simulated in last section. For pedestrian detection and localization, we adopt the ROS-compatible leg tracker in [54]. We use an ultra wideband (UWB) indoor positioning system to localize the companion and the navigation goal, which can then be easily mapped to the observations and based on the odometry of the robot. Finally, a laptop is placed onboard as the processing unit and the policy is operated with a period of 0.1 second. The experiments are conducted in a narrow corridor with width of 1.56 meters as shown in Fig. 6, which is a typical scenario that requires pedestrians to navigate cooperatively.

Fig. 6: The narrow corridor where experiments are performed

Vi-a Scenario 1: Traditional Social Navigation

In this subsection, we examine our method’s performance in traditional social navigation scenario. Particularly, the robot is required to pass the corridor with two oncoming pedestrians and arrive at a goal that is 7 meters ahead. In addition, a control experiment of 3 humans (one as the compared human and the other two as pedestrians) is conducted in the same space. The metric is computed. Example trajectories of the robot and the human control are shown in Fig. 7. In the robotic experiments, the trajectories of pedestrians are obtained from the robot’s laser range finder while the robot’s trajectory is based on its own odometry sensor. On the other hand, all trajectories in the human control experiments are captured using the UWB localization system.

(a) Trajectories of the robot (moving from left to right) and other two pedestrians (moving from right to left).
(b) Human control experiment in a similar navigation scenario. The black trajectory is from left to right and the other two are from right to left.
Fig. 7: Comparison between the robot with our policy and human control experiment in a social navigation scenario

From Fig. 7, it is clear that the robot with our policy is able understand human’s cooperative behavior for collision avoidance and navigate in an appropriate manner such that both itself and the other two pedestrians can successfully pass through the corridor. Specifically, when observing the two pedestrians (blue and purple) 4 meters ahead. The robot started to approach the wall on its left side so as to create free space on the right for the pedestrians to smoothly walk through. By comparing both figures in Fig. 7, we can see that the robot is as proactive as human since both black trajectories in Fig. 7(a) and Fig. 7(b) started to make space for the oncoming pedestrians at the early stage of cooperative avoidance process. As for the performance metrics, the average minimal distance to pedestrians for our robot is m. Although it is smaller than that of the human control experiments (=0.56m), this value still indicates a safe and decent navigation behavior of our robot as its radius is only 0.17m.

Vi-B Scenario 2: Socially Concomitant Navigation

In this subsection, the scenario of SCN is studied. A human companion initially standing in front of the robot will start to walk through the same corridor while another pedestrian is passing from the other end. As described in the previous sections, the robot with our policy should closely navigate with its companion and avoid the oncoming pedestrian cooperatively. An additional metric is used to evaluate the performance of our policy by comparing with the statistics obtained from another 10 human control experiments. Example trajectories are shown in Fig. 8 and the performance metrics and are summarized in Tab. IV.

(a) Trajectories of the robot and its companion (moving from left to right) and a pedestrian (moving from right to left).
(b) Human control experiment in a similar SCN. The black (compared human) and orange (companion) trajectories are from left to right and the blue (pedestrian) trajectory is from right to left.
Fig. 8: Comparison between the robot with our policy and human control experiment in a SCN scenario
(m) (m)
Robot 0.49 1.05
Compared Human 0.37 1
TABLE IV: Performance metrics of the robot and human controls in SCN scenarios

As shown in Fig. 8 and Tab. IV, the robot is able to achieve both objectives of SCN. On one hand, it is effectively engaged into the joint collision avoidance process. The resulted behavior is similar to that observed in the last subsection and the robot even has a slightly larger . On the other hand, the average maximum distance is 1.05m, which is within the limit (2m) we specified in the learning process and nearly the same as that of the compared human, showing that the robot can actively navigate along with its companion instead of deviating to other areas or lagging itself behind. This shows that the robot driven by our policy is able to understand the pace of its companion and achieve a similar sense of companionship in terms of distance.

In sum, the above results demonstrate the practical efficacy of our methods for both the traditional social navigation and the more complicated SCN scenarios. It proves that the policy learned from our RPL simulative environment is transferable to uncovered real-world situations.

Vii Conclusions

In this paper, the problem of socially concomitant navigation (SCN) has been investigated and formulated under a POMDP framework, with explicit considerations of the limitation and inaccuracy of mobile robots’ onboard sensors. The Partially Observable TRPO (PO-TRPO) algorithm has been proposed for optimization of navigation policies. The Role Playing Learning (RPL) scheme has been developed to enable efficient and safe reinforcement learning of navigation policies by mirroring a large amount of real-world pedestrian trajectories into simulative environments. Comparative simulation and experiment studies have demonstrated the efficacy and superiority of our policy in both SCN and traditional social navigation scenarios.


  • [1] D. F. W. B. S. Thrun, D. Fox, and W. Burgard, “The dynamic window approach to collision avoidance,” IEEE Transactions on Robotics and Automation, vol. 4, p. 1, 1997.
  • [2] Y. K. Hwang and N. Ahuja, “A potential field approach to path planning,” IEEE Transactions on Robotics and Automation, vol. 8, no. 1, pp. 23–32, 1992.
  • [3] S. S. Ge and Y. J. Cui, “New potential functions for mobile robot path planning,” IEEE Transactions on robotics and automation, vol. 16, no. 5, pp. 615–620, 2000.
  • [4] P. Trautman and A. Krause, “Unfreezing the robot: Navigation in dense, interacting crowds,” in Intelligent Robots and Systems (IROS), 2010 IEEE/RSJ International Conference on, 2010, pp. 797–803.
  • [5] P. Trautman, J. Ma, R. M. Murray, and A. Krause, “Robot navigation in dense human crowds: Statistical models and experimental studies of human–robot cooperation,” The International Journal of Robotics Research, vol. 34, no. 3, pp. 335–356, 2015.
  • [6] D. Helbing and P. Molnar, “Social force model for pedestrian dynamics,” Physical review E, vol. 51, no. 5, p. 4282, 1995.
  • [7] D. Helbing, I. Farkas, and T. Vicsek, “Simulating dynamical features of escape panic,” Nature, vol. 407, no. 6803, pp. 487–490, 2000.
  • [8] J. Van Den Berg, S. J. Guy, M. Lin, and D. Manocha, “Reciprocal n-body collision avoidance,” in Robotics research, 2011, pp. 3–19.
  • [9] J. Van Den Berg, P. Abbeel, and K. Goldberg, “Lqg-mp: Optimized path planning for robots with motion uncertainty and imperfect state information,” The International Journal of Robotics Research, vol. 30, no. 7, pp. 895–913, 2011.
  • [10] J. Van den Berg, M. Lin, and D. Manocha, “Reciprocal velocity obstacles for real-time multi-agent navigation,” in Robotics and Automation, 2008. ICRA 2008. IEEE International Conference on, 2008, pp. 1928–1935.
  • [11] S. Pellegrini, A. Ess, K. Schindler, and L. Van Gool, “You’ll never walk alone: Modeling social behavior for multi-target tracking,” in 2009 IEEE 12th International Conference on Computer Vision, 2009, pp. 261–268.
  • [12] K. Yamaguchi, A. C. Berg, L. E. Ortiz, and T. L. Berg, “Who are you with and where are you going?” in

    Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on

    , 2011, pp. 1345–1352.
  • [13] M. Kuderer, H. Kretzschmar, C. Sprunk, and W. Burgard, “Feature-based prediction of trajectories for socially compliant navigation.” in Robotics: science and systems, 2012.
  • [14] H. Kretzschmar, M. Spies, C. Sprunk, and W. Burgard, “Socially compliant mobile robot navigation via inverse reinforcement learning,” The International Journal of Robotics Research, p. 0278364915619772, 2016.
  • [15] B. Kim and J. Pineau, “Socially adaptive path planning in human environments using inverse reinforcement learning,” International Journal of Social Robotics, vol. 8, no. 1, pp. 51–66, 2016.
  • [16] A. Bicchi, A. Fagiolini, and L. Pallottino, “Towards a society of robots,” IEEE Robotics & Automation Magazine, vol. 17, no. 4, pp. 26–36, 2010.
  • [17] H.-M. Gross, C. Schroeter, S. Mueller, M. Volkhardt, E. Einhorn, A. Bley, C. Martin, T. Langner, and M. Merten, “Progress in developing a socially assistive mobile home robot companion for the elderly with mild cognitive impairment,” in Intelligent Robots and Systems (IROS), 2011 IEEE/RSJ International Conference on, 2011, pp. 2430–2437.
  • [18] H. Wang and X. P. Liu, “Adaptive shared control for a novel mobile assistive robot,” IEEE/ASME Transactions on Mechatronics, vol. 19, no. 6, pp. 1725–1736, 2014.
  • [19] B. D. Argall, S. Chernova, M. Veloso, and B. Browning, “A survey of robot learning from demonstration,” Robotics and autonomous systems, vol. 57, no. 5, pp. 469–483, 2009.
  • [20] P. Abbeel and A. Y. Ng, “Apprenticeship learning via inverse reinforcement learning,” in

    Proceedings of the twenty-first international conference on Machine learning

    , 2004, p. 1.
  • [21] B. D. Ziebart, A. L. Maas, J. A. Bagnell, and A. K. Dey, “Maximum entropy inverse reinforcement learning.” in AAAI, vol. 8, 2008, pp. 1433–1438.
  • [22] N. D. Ratliff, J. A. Bagnell, and M. A. Zinkevich, “Maximum margin planning,” in Proceedings of the 23rd international conference on Machine learning, 2006, pp. 729–736.
  • [23] B. D. Ziebart, N. Ratliff, G. Gallagher, C. Mertz, K. Peterson, J. A. Bagnell, M. Hebert, A. K. Dey, and S. Srinivasa, “Planning-based prediction for pedestrians,” in Intelligent Robots and Systems, 2009. IROS 2009. IEEE/RSJ International Conference on, 2009, pp. 3931–3936.
  • [24] P. Henry, C. Vollmer, B. Ferris, and D. Fox, “Learning to navigate through crowded environments,” in Robotics and Automation (ICRA), 2010 IEEE International Conference on, 2010, pp. 981–986.
  • [25] P. Vernaza and D. Bagnell, “Efficient high dimensional maximum entropy modeling via symmetric partition functions,” in Advances in Neural Information Processing Systems, 2012, pp. 575–583.
  • [26] K. M. Kitani, B. D. Ziebart, J. A. Bagnell, and M. Hebert, “Activity forecasting,” in European Conference on Computer Vision.   Springer, 2012, pp. 201–214.
  • [27] J. Choi and K.-E. Kim, “Map inference for bayesian inverse reinforcement learning,” in Advances in Neural Information Processing Systems, 2011, pp. 1989–1997.
  • [28] K. Kim, D. Lee, and I. Essa, “Gaussian process regression flow for analysis of motion trajectories,” in Computer vision (ICCV), 2011 IEEE international conference on, 2011, pp. 1164–1171.
  • [29] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese, “Social lstm: Human trajectory prediction in crowded spaces,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 961–971.
  • [30] A. Robicquet, A. Sadeghian, A. Alahi, and S. Savarese, “Learning social etiquette: Human trajectory understanding in crowded scenes,” in European Conference on Computer Vision, 2016, pp. 549–565.
  • [31] A. Johansson, D. Helbing, and P. K. Shukla, “Specification of the social force pedestrian model by evolutionary adjustment to video tracking data,” Advances in complex systems, vol. 10, no. supp02, pp. 271–288, 2007.
  • [32] A. Lerner, Y. Chrysanthou, and D. Lischinski, “Crowds by example,” in Computer Graphics Forum, vol. 26, no. 3.   Wiley Online Library, 2007, pp. 655–664.
  • [33] D. Helbing and A. Johansson, “Pedestrian, crowd and evacuation dynamics,” in Encyclopedia of Complexity and Systems Science, 2009, pp. 6476–6495.
  • [34] J. Müller, C. Stachniss, K. Arras, and W. Burgard, “Socially inspired motion planning for mobile robots in populated environments,” in Proc. of International Conference on Cognitive Systems, 2008.
  • [35] D. Mehta, G. Ferrer, and E. Olson, “Autonomous navigation in dynamic social environments using multi-policy decision making,” in Intelligent Robots and Systems (IROS), 2016 IEEE/RSJ International Conference on, 2016, pp. 1190–1197.
  • [36] A. F. Foka and P. E. Trahanias, “Probabilistic autonomous robot navigation in dynamic environments with human motion prediction,” International Journal of Social Robotics, vol. 2, no. 1, pp. 79–94, 2010.
  • [37] M. Seder and I. Petrovic, “Dynamic window based approach to mobile robot motion control in the presence of moving obstacles,” in Robotics and Automation, 2007 IEEE International Conference on, 2007, pp. 1986–1991.
  • [38] P. Fiorini and Z. Shiller, “Motion planning in dynamic environments using velocity obstacles,” The International Journal of Robotics Research, vol. 17, no. 7, pp. 760–772, 1998.
  • [39] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, “High-dimensional continuous control using generalized advantage estimation,” arXiv preprint arXiv:1506.02438, 2015.
  • [40] J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel, “Trust region policy optimization,” CoRR, abs/1502.05477, 2015.
  • [41]

    Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,”

    Nature, vol. 521, no. 7553, pp. 436–444, 2015.
  • [42] M. Pfeiffer, M. Schaeuble, J. Nieto, R. Siegwart, and C. Cadena, “From perception to decision: A data-driven approach to end-to-end motion planning for autonomous ground robots,” arXiv preprint arXiv:1609.07910, 2016.
  • [43] Y. F. Chen, M. Liu, M. Everett, and J. P. How, “Decentralized non-communicating multiagent collision avoidance with deep reinforcement learning,” arXiv preprint arXiv:1609.07845, 2016.
  • [44] Y. Zhu, R. Mottaghi, E. Kolve, J. J. Lim, A. Gupta, L. Fei-Fei, and A. Farhadi, “Target-driven visual navigation in indoor scenes using deep reinforcement learning,” arXiv preprint arXiv:1609.05143, 2016.
  • [45] D.-G. Choi, Y. Bok, J.-S. Kim, and I. S. Kweon, “Extrinsic calibration of 2-d lidars using two orthogonal planes,” IEEE Transactions on Robotics, vol. 32, no. 1, pp. 83–98, 2016.
  • [46] L. M. Miller and T. D. Murphey, “Optimal planning for target localization and coverage using range sensing,” in Automation Science and Engineering (CASE), 2015 IEEE International Conference on, 2015, pp. 501–508.
  • [47] F. Endres, J. Hess, J. Sturm, D. Cremers, and W. Burgard, “3-d mapping with an rgb-d camera,” IEEE Transactions on Robotics, vol. 30, no. 1, pp. 177–187, 2014.
  • [48] S. Foix, G. Alenya, J. Andrade-Cetto, and C. Torras, “Object modeling using a tof camera under an uncertainty reduction approach,” in Robotics and Automation (ICRA), 2010 IEEE International Conference on, 2010, pp. 1306–1312.
  • [49]

    M. Liu and R. Siegwart, “Topological mapping and scene recognition with lightweight color descriptors for an omnidirectional camera,”

    IEEE Transactions on Robotics, vol. 30, no. 2, pp. 310–324, 2014.
  • [50] L. Kneip, F. Tâche, G. Caprari, and R. Siegwart, “Characterization of the compact hokuyo urg-04lx 2d laser range scanner,” in Robotics and Automation, 2009. ICRA’09. IEEE International Conference on, 2009, pp. 1447–1454.
  • [51]

    S. Hochreiter and J. Schmidhuber, “Long short-term memory,”

    Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
  • [52] J. Nocedal and S. J. Wright, “Conjugate gradient methods,” Numerical optimization, pp. 101–134, 2006.
  • [53] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel, “Benchmarking deep reinforcement learning for continuous control,” arXiv preprint arXiv:1604.06778, 2016.
  • [54] A. Leigh, J. Pineau, N. Olmedo, and H. Zhang, “Person tracking and following with 2d laser scanners,” in Robotics and Automation (ICRA), 2015 IEEE International Conference on, 2015, pp. 726–733.