Codes for Learning and Simulation of Role Playing Learning for Socially Concomitant Mobile Robot Navigation
In this paper, we present the Role Playing Learning (RPL) scheme for a mobile robot to navigate socially with its human companion in populated environments. Neural networks (NN) are constructed to parameterize a stochastic policy that directly maps sensory data collected by the robot to its velocity outputs, while respecting a set of social norms. An efficient simulative learning environment is built with maps and pedestrians trajectories collected from a number of real-world crowd data sets. In each learning iteration, a robot equipped with the NN policy is created virtually in the learning environment to play itself as a companied pedestrian and navigate towards a goal in a socially concomitant manner. Thus, we call this process Role Playing Learning, which is formulated under a reinforcement learning (RL) framework. The NN policy is optimized end-to-end using Trust Region Policy Optimization (TRPO), with consideration of the imperfectness of robot's sensor measurements. Simulative and experimental results are provided to demonstrate the efficacy and superiority of our method.READ FULL TEXT VIEW PDF
This paper proposes an end-to-end deep reinforcement learning approach f...
Humans are known to construct cognitive maps of their everyday surroundi...
Moving in dynamic pedestrian environments is one of the important
We propose Improved Memories Learning (IMeL), a novel algorithm that tur...
In Reinforcement Learning (RL), an agent explores the environment and
We aim to enable a mobile robot to navigate through environments with de...
The architecture of the neural networks used in Deep Reinforcement Learn...
Codes for Learning and Simulation of Role Playing Learning for Socially Concomitant Mobile Robot Navigation
The capability to navigate in densely populated and dynamic environments is one of the most important features that enable the deployment of mobile robots in unstructured environment, such as schools, shopping malls and transportation hubs. The key difference between the problem of navigating among humans and the traditional path planning and obstacle avoidance problems is that humans tend to smoothly evade each other interactively and cooperatively, rather than remaining static or maintaining an indifferent trajectory dynamics. In other words, there are social norms that need to be understood and complied to achieve maximum comfort of all involved pedestrians during navigation. We refer to this as the problem of social navigation, which aims to model such social norms and develop a robotic navigation policy that is socially acceptable to the pedestrians around.
For social navigation, the traditional approaches based on Dynamic Window Approach (DWA)  or potential fields [2, 3] are usually of limited efficacy as pedestrians are simply regarded as uncooperative obstacles. An illustrative example is the freezing robot problem (FRP) [4, 5], where a mobile robot will be stuck in a narrow corridor when facing a crowd of people if it lacks the ability to predict the joint collision avoidance behaviors of human pedestrians. To this end, researches have been done to understand the principles of humans’ joint collision avoidance strategies and one of the pioneering works are the social force model (SFM) [6, 7]. Other joint collision avoidance model such as reciprocal velocity obstacles (RVO) have been proposed in [8, 9, 10], with an underlying assumption that all involved agents adopt the same collision avoidance strategies. These ideas are also applied to visual tracking of pedestrians [11, 12]. More recently, several attempts are made to learn probabilistic models of pedestrians’ trajectories during joint collision avoidance, based on which the robot’s navigation decision is generated such that it is able to behave naturally and correctly in similar situations [13, 5, 14, 15].
In this paper, we propose to augment the dimensions of human-robot interaction in social navigation by further endowing robot with appropriate group behaviors when it is travelling with a human companion. This capability is highly desirable for assistive mobile robots [16, 17, 18], which serve as assistants and companions and are expected to travel along with theirx human partners in not only home environment but also possibly crowded public areas. In other words, apart from understanding the collision avoidance behaviors of pedestrians, the robot also needs to consider the motion of its companion so as to maintain a sense of affinity when they are travelling together towards a certain goal. We call this socially concomitant navigation (SCN) and it is more challenging than the aforementioned social navigation problem, where the robot is assumed to travel alone with a simpler pursuit of reaching a specific goal while being free of collision.
To address the problem of SCN, we develop a new learning scheme called Role Playing Learning (RPL). Particularly, we formulate such problem under the framework of Partially Observable Markov Decision Process (POMDP) and reinforcement learning (RL). A neural network (NN) is used to parameterize the navigation policy of the robot, which is optimized to gives proper steering commands for the next time instance based on the robot’s current and previous observations to its surroundings. To facilitate the RL process, we create a simulative navigation environment by mirroring a collections of real world pedestrians data sets and develop an on-policy optimization method called Partially Observable Trust Region Policy Optimization (PO-TRPO). In each run in an optimization iteration, the robot will attempt to play itself as a companion of a randomly chosen pedestrian by executing the NN navigation policy. The NN policy is then optimized using PO-TRPO based on a batch of collected trajectories. Compared to the existing analytically derived or data-driven approaches, our RPL scheme has the following advantages:
The formulation of RPL scheme is more generalizable and flexible. Our formulation contain no manually-defined feature and domain knowledge (e.g., statistics of pedestrians’ behaviors). It is not hardware-specific and can be easily modified to incorporate kinematics of different mobile robot platforms, sensor specifications and navigation objectives. In addition, unlike [15, 14], the learned navigation policy operates without assess to the global map of the environment. Therefore, it is not environment-specific and is well generalizable to unmet real-world scenarios.
We explicitly consider the noise and limitation of the robot’s sensor measurements. Most approaches for social navigation assume that the robot has full and accurate knowledge of interested variables, such as positions or distance of pedestrians and obstacles [8, 9, 10, 14]. On the contrary, our RPL schemes is rooted from the situation where the robot can only perceive those lie within its sensor’s Field of View (FoV), with the existence of measurement noise.
As a RL-based approach, RPL is efficient. Although RPL aims at solving tasks that involve interaction among robot, humans and physical environment, it does not require participation of human in both data collection and learning, which is known to be tedious and time-consuming. Instead, the learning process is safely automated in a simulative yet realistic environment with no human intervention.
We evaluate the performance of our approach in both simulations and real-world experiments, by comparing it with a baseline planner based on RVO  and humans, repectively. We also show that, with some tricks, the learned navigation policy can still be effective when the navigation scenario is reduced to the aforementioned social navigation, which means the robot is travelling without human companion.
The remainder of this paper is organized as follows. Related work is first reviewed in Section II. In Section III, the problem of SCN is formulated as a POMDP and associated definitions are given. RPL scheme and PO-TRPO algorithm are described in Section IV. Sections V and VI provide extensive results of simulation and experiment, followed by some concluding remarks in Section VII.
The problems of robot navigation in populated and dynamic environment can be addressed from a number of angles, which can be largely classified into two groups as in the following subsections.
Many researches have been proposed to describe the interactive navigation behaviors of humans by fitting a computational model to the observed pedestrians trajectories . In this way, the robot’s path planner is able to understand pedestrians’ intention during joint collision avoidance and actively calculate an optimal route towards its goal.
In the field of robotics, a majority of work in this direction is done via inverse reinforcement learning (IRL) , which learns a cost function that explains the observed behaviors. For example, maximum entropy IRL  is adopted in a number works [22, 23, 24, 25, 26] for discrete human behavior prediction and route planning. However, discrete representation is less desirable when modeling trajectories, which are in nature continuous and has higher order dynamics, such as velocities and acceleration. Instead,  adopts Maximum-A-Posteriori Bayesian IRL  to learn appropriate navigation behavior of a specific mobile robot from a set of demonstration trajectories. Note that, the demonstration data in  is specific to configurations of the robot and its sensor and has to be collected via human operation, which could be time-consuming. On the other hand, [13, 14] learns probabilistic models of composite trajectories of pedestrians from video data by maximum entropy learning and IRL. To better capture the characteristics of observed trajectories, they propose to develop their models based on a set of features that are hand-crafted according to the domain knowledge from psychological studies. In addition, those features contain velocities and accelerations of pedestrians, which, in practice, are hard to precisely measure. Besides, interacting Gaussian process (IGP) is derived in  to model the joint trajectories of pedestrian while explicitly considering the effects of observation noise. Nevertheless, the design of IGP also requires several hand-crafted kernels that are formulated based on the priori information in a specific application scenario.
Other than researchers in robotics, the community of computer vision also possess great interest in pedestrian modeling. One of the important topics is trajectory prediction in video space. In, Linear Trajectory Avoidance (LTA) is developed as a dynamic model for pedestrians in video space for short-term trajectory prediction and it is integrated into visual tracking system. Gaussian process is adopted in  to learn the motion pattern of pedestrians. Recently, Social LSTM is proposed in  for human trajectory prediction in crowd space. Similarly, the feature of social sensitivity is developed in  to analyze trajectories of pedestrians and bicyclists. While the above methods can effectively predict the navigation intention of pedestrians in videos, it is still unclear how to apply these model to navigation of robot in real scenarios.
In contrast to learning behavior models of pedestrians, a more direct perspective is to develop a steering model that outputs the immediate navigation actions given the robot’s current observation to the environment. One of the pioneering work in this direction is the social force model (SFM) , which uses energy/potential functions to encode the social status of pedestrian. Then, the navigation motivation of a pedestrian can be derived by taking the gradients of these energy functions. Following this idea, subsequent work [31, 32, 33, 12] propose to infer the optimal parameters of the energy function by fitting them to video data. However, they are likely to produce suboptimal results if the demonstration data from humans are imperfect. In , the authors integrate a people tracker and an iterative planner, with which the robot actively follows the pedestrian travelling in a similar direction to navigate through crowded environment.  follow the same idea and formulate the choice of a pedestrian to follow as a Multi-Policy Decision Making process. On the other hands,  develops a hierarchical POMDP for predictive navigation in dynamic environment. The idea is to predict the motion of pedestrians and generate a environment-specific cost map for path planning and obstacle avoidance.
Other than navigating in a pedestrian-aware manner, several reactive collision avoidance techniques have also been developed, such as DWA [1, 37], velocity obstacles  and reciprocal velocity obstacles (RVO) [8, 9, 10]. The common idea of these methods is to treat pedestrians as moving obstacles and reactively update the planner every short periods to achieve collision avoidance. As mentioned in Section I, these methods are less effective for social navigation as they lack predictive abilities and are based on some restrictive assumptions, such as accurate knowledge of moving agents’ velocities  and that all agents adopt the identical collision avoidance strategy [8, 9, 10].
Our proposed navigation policy belongs to the steering models. It takes an observation vector as input and outputs the navigation action through a stochastic neural networks. During RPL, our policy is optimized by the PO-TRPO algorithm, which is derived based on the recent advances in deep reinforcement learning (DRL)[39, 40]. DRL exploits the massive representation power of deep neural networks (DNN)  to build a complex yet sophisticated decision model, with which an agent can directly learn from raw signals instead of carefully crafted feature and tends to act more intelligently. Recently, there are several attempts in using DNN and DRL for robot navigation. For example, an end-to-end motion planner is learned in  to map raw sensor data of a laser range finder onto steering commands of a mobile robot. In , a decentralized multi-agent collision avoidance policy is learned via DRL, which can be thought as a DRL version of the original RVO approach . Finally, a target-driven visual navigation policy for home environment is learned in  via DRL. They create a set 3D virtual home environments for effective and efficient training of the agent, which shares a similar idea with our proposed RPL scheme.
To formulate the problem of socially concomitant navigation, we gives the following rules of SCN:
The robot should reach its goal as fast as possible;
The robot should not collide with any of the pedestrians or its companion, or run into any obstacle;
The robot should not run too far away from its companion.
The above rules serve as a generic description of the robot’s desired performance during navigation. To give concrete definitions, consider the navigation process as an infinite-horizon discounted POMDP in discrete time, defined by the tuple . is a finite set of states reflecting the navigation status of the robot. is a finite set of actions . In this paper, it is defined as a twosome of the translational and rotational velocities of a synchro-drive mobile robot, i.e., . is state-transition mapping, which is characterized by the dynamics of the robot, the other humans and the environment. Without loss of generality, we assume deterministic state transition, i.e., , where are the state and action taken at time . is the set of the robot’s observation to the state and
denotes the conditional observation probability distribution. Note that, in practice, the robot’s observation has only incomplete access toor is subject to certain measurement noise, which implies . is the initial state distribution, i.e., . is a scalar reward given to the robot and is the reward discount factor.
Robot motion dynamics: In this paper, synchro-drive mobile robots are considered, whose motion equation can be approximated by assuming the robot’s velocities to be constant within a certain short time period  with length . Particularly, let and denote the robot’s heading and its positions in a 2D Cartesian space at time , respectively. and represent the robot’s translational and rotational velocities. Define and . When the robot has nonzero rotational velocity, i.e., , we have
Otherwise, when ,
With the above formulations, our goal is optimizing a stochastic navigation policy with parameters in order to maximize the expected discounted reward:
where denotes the whole trajectory and . The specific definitions of the above ingredients for SCN will be elaborated as follows:
State: Given and , define the distance and direction of a point to the robot as follows:
Then, the robot’s distance to the goal located at are computed as and denotes the offset angle between the robot’s current heading and its goal. Similarly, we can define the twosomes , or to describe the relative position of a pedestrians , a companion or an obstacle to the robot. With such definitions, the state is defined to incorporate the information related to the robot’s navigation status as follows:
where is the current action vector and
The vector includes the distances and directions of closest pedestrians while includes those of the robot’s companion.
The vector is a compact description to the robot’s perception of the surrounding environment. Particularly, the boundaries of the occupied space (obstacles) in the environment are represented as a finite point set . Then, the 9 variables in are defined based on the following assumption
An obstacle has no effect on the robot’s navigation decision if it satisfies , where is a predefined finite constant.
By Assumption 1, it is sufficient to consider only obstacles in that are closed enough to the robot, whose distances are less than . In practice, this limit may correspond to the robot’s perception range. Let
The components in vector are described as follows:
The distance to the nearest obstacle located at heading of the robot, i.e.,
where is a small constant.
For , they represent the distance and direction of the closest and farthest obstacles on the robot’s left () and right side (), respectively, which are defined mathematically as follows:
Then, the variables in can be simply determined as the distance and directions of the above points according to Eqs. (6) and (7). Figure. 1 provides a comprehensive illustration of the state variables and .
Observation: As discussed in the previous sections, sensors mounted on the robot are always subject to various kinds of limitation and measurement noise, which must be taken into account in order to develop a robust and practical navigation system. To this end, we define as the robot’s observation to the true state as follows:
By Eq. (18), we assume that the robot has accurate information about the goal position and its current taken action (i.e., the velocity commands output to the robot’s motor) while its observations to may be imperfect. Particularly, consider the Field of Views (FoVs) for the robot’s pedestrian and obstacle detectors illustrated as Fig. 2.
Mathematically, let finite point sets and denote the current FoVs of pedestrian and obstacle detectors, characterized by threesomes and , respectively. The robot’s observations to the pedestrians’ relative positions are obtained as
for , with being the measurement noise/error.
where is the measurement noise/error for obstacle detection. The closest observed obstacles on the robot’s left and right sides are defined in a similar way as:
The companions are always observable to the robot.
Then, , where
Our general formulations of states (8) and observations (18) are applicable to various types of onboard sensors, such as range sensors [45, 46], RGB-D , Time-of-Flight (ToF)  and omnidirectional cameras  , as long as the interested positions can be extracted/estimated from the sensor’s raw measurements.
, as long as the interested positions can be extracted/estimated from the sensor’s raw measurements.
The mathematical definitions of the variables in observations are given for better understanding and are required only in the simulative RPL process. In practice, it is clear that these values can be directly measured via the robot’s onboard sensors without accessing the actual 2-D Cartesian coordinates of the considered point sets (e.g., , and ). For example, consider a robot equipped with a laser range finder. These distances and offset angles can be easily obtained from the returned ranges array.
Reward function: A scalar reward will be given to the robot as an award of reaching the goal or a penalty of colliding with obstacles/pedestrians/companions or losing its companions. Particularly, at time , the process of SCN will be terminated if any of the following three termination conditions is true.
1) Goal Reaching Condition
2) Collision Conditions
3) Stray Condition
Based on the above three terminal conditions, a reward will be given to the robot as follows:
Clearly, a positive reward will be given to the robot if it reaches its goal and it will receive a large negative reward if it collides with anything or be stray from its companion. Otherwise, the robot will receive an intermediate reward , which penalizes the robot for its rotational velocity to encourage a smoother trajectory with less turning behaviors.
In this section, we described the RPL scheme to learn an effective navigation policy for SCN in an efficient data-driven manner. The core idea is to transform the crowd trajectories data collected from real-world into a simulative and dynamic navigation environment, where the robot can play itself as a virtual pedestrian and iteratively improve the performance of via Partially Observable Trusted Region Policy Optimization (PO-TRPO).
Consider a set of simulative navigation environment . Each environment contains a set of pedestrian trajectories and a binary map that annotates the 2-D Cartesian coordinates of obstacles/occupied space in the environment. With , the abstract process of RPL is described by the following pseudo codes in Algorithm 1.
Companion Synthesization in non-SCN mode: As described in Algorithm 1, RPL actually incorporates two different navigation scenarios: the SCN proposed in this paper and the traditional social navigation scenario, where the robot has no human companion. This helps develop a navigation policy adaptable to both situations, with no restrictive assumption on the existence of companion. Particularly, the companion position vector and its observation are synthesized, with for every time step while . It is clear that the synthesized is equivalent to the situation where the companion is travelling non-distractively along the robot with a constant distance and guarantee that termination conditions (31) and (33) are always false.
On the other hand, in SCN mode, the companion is assigned with a truncated trajectory such that the initial robot-companion distance is sufficiently large.
In this paper, we construct a deep policy neural network to parameterize the navigation policy , whose structure is shown in Fig. 3. The policy network is to be trained with the Trust Region Policy Optimization (TRPO)  method. However, the original TRPO method is derived based on fully observable MDP, which can not be directly applied to our problem due to the imperfect observation in our formulation and practice. Thus, we proposed to extend the original TRPO algorithm as PO-TRPO, which will be described in the following subsections.
The TRPO  algorithm is an effective on-policy optimization method for large nonlinear policies and tends to give monotonic improvement during the iterative optimization process. To be specific, a fully observable MDP is considered by TRPO and therefore the policy to be optimized is formulated as , where is the parameter vector of the policy . Note that, determines the action directly from the true state , which differs from our observation-based policy . Let us consider the following standard definitions of the state-action value function , the value function and the advantage function :
In addition, define as the discounted visitation frequencies
where , and are generated according to and . Let denote the old parameters in last iteration. TRPO proposes to optimize the parameters iteratively regarding the following objective function:
where is the importance sampling distribution and
is the Kullback-Leibler divergence between the old and current policies.
As mentioned, our navigation problem is considered as a POMDP. The policy depends on the observation instead of the true state. Therefore, we write the objective function (40) and the constraint (41) as
For PO-TRPO, samples are collected by executing the old policy to generate a set of trajectories, such as . Therefore,
Next, for a trajectory , we use the generalized advantage estimation (GAE)  to construct an empirical estimation of the advantage function as the following:
and is the estimation of the value function (36) with parameters (and being the old parameters). By collecting a set of trajectories , is obtained by solving the following constrained regression problem :
Finally, as the conditional observation probability distribution is independent of parameters and time, we obtain an estimation of the objective function (42) and the constraints (43) by replacing the expectations with sample averages as:
which has the same form as the one obtained in , except that the policy is conditioned on observation instead.
As a data-driven approach, our deep neural network policy requires a massive amount of data to learn the socially concomitant navigation behavior. In this section, we describe how to construct a simulative environment according to the proposed RPL scheme. Particularly, the environments, the deep neural network policy and the PO-TRPO algorithm (Algorithm 2) are developed under the framework of RLLAB . We make use of trajectories of interacting pedestrians collected from five different data sets, which includes the ETH and Hotel video clips from the ETH Walking Pedestrians (EWAP) , the motion capture (MC) data set from , as well as the Zara and UCY video clips from . Note that, the Zara and UCY data sets have multiple subsets: Zara01, Zara02, Zara03, UCY01 and UCY03. Thus, there are totally 8 different RPL environments, i.e., . The details of these 8 environments are summarized in Tab. I.
Each trajectory in these environment provides the ID and a sequences of 2-D Cartesian positions of a pedestrian with a sampling period second. In addition, eight binary grid maps representing the occupied space/static obstacles are given. However, these maps are kept unknown to the robot throughout training and evaluation. They are only used to simulate the robot’s perception to the environment as the state and observation . Without loss of generality, we use the ETH data set as the evaluation environment and all other data sets in the Tab. I as training environments. In other words, the learned policy’s performance will be assessed in an RPL environment that is excluded during training, which reflects whether it can properly generalize to uncovered situations.
As some of the trajectories in these environments are of people who were wandering or remained approximately stationary, they are excluded from the candidates of the robot’s companion but will still be considered as pedestrians when the robot is navigating in the same environment.
We use a feed-forward neural network with 2 hidden layers as the feature network in our NN policy, containing 256 and 64 Tanh units, respectively. Its output is then fed to a LSTM network with 64 units. The variance of of the Gaussian output unitis chosen to be linearly decaying from 0.5 to 0.05 in 100 training iterations, which effectively encourages exploration during the early stage of learning and ensures convergence of the navigation policy. For GAE, a 3 layer feed-forward network with 256,64,16 Tanh units are used, with and . The update step size for policy network is adaptively chosen as . For GAE update, a fixed step size is used. The update batch size (Batch_size in Algorithm 1) is 50000.
In RPL, we consider at most 3 pedestrians (i.e., ). Thus, the state and observation will only describe the 3 closest pedestrians and omit the others. For the situation where less that 3 pedestrians are perceived, dummy static pedestrians will be created in the remote corner of the environment so as to maintain the dimensions of and .
Considering a Kobuki Turtlebot 2 with a Hokuyo URG-04LX laser range finder  mounted on its top, we specify the sensor limitation of the robot in simulation as follows:
The measurement noises and are modeled by zero-mean Gaussian and with their variances specified as follows:
Finally, the maximum translational and rotational velocities are assigned as and , i.e., and and . An example of our RPL environment constructed from the ETH data set is illustrated in Fig. 4
We trained our deep policy network for 1200 iterations with the data from RPL environments except for the held-out ETH environment. The curve of average discounted return obtained from each batch of trajectories is visualized in Fig. 5
We compare the performance of our policy with a planner based on RVO , where the robot, its companion and the surrounding pedestrians are treated agents. In every time steps, the positions and velocities of all agents are given to the planner. Note that, for fair comparison, the agents’ positions are subject to noise described in (54) and (55). For observations to obstacles, we assume the planner has full and perfect knowledge as required in the original RVO algorithm. With this protocol, we update the robot’s position according to the planner’s output and update the positions of the other agents according to their own trajectories in the RPL environments. The same termination conditions in Section III are applied to the robot directed by the RVO-based planner to determine whether the robot has conducted an successful navigation. For both of our policy and the RVO-based planner, we conduct 300 trials in the evaluation environment and compute the rates (in percentages) of different terminal conditions (RG: the robot reaches the goal successfully; HC/HP/HO: the robot hits a companion/pedestrian/obstacle; and LC: the robot loses its companion). The performance statistics of our policy and the RVO-based planner in SCN scenarios are listed in Tab. II.
It can bee seen from Tab. II that our policy performs much better than the RVO-based planner in SCN. The RVO-based planner has a much lower success rate (29.7%) while its rate of LC is , suggesting that it frequently losses its companion in SCN. Clearly, this is due to the fact that RVO is in nature a collision avoidance algorithm. Thus, it simply takes the robot’s companion as another normal agent and the robot tends to stay far behind its companion to avoid collision instead of actively following it. On the contrary, our policy achieves a much higher success rate (77%). This indicates that it learns to effectively balance the objectives of SCN so that the robot is able to reach the prescribed goal while maintaining its distance to its companion and avoiding collision with other agents in the environment.
In addition to SCN, the scenarios without companion are also tested, which, as analyzed in the previous sections, reduces to the traditional social navigation scenarios. The comparative results are shown in Tab. III.
For situations without companion, our policy still outperforms the RVO-based planner with higher success rate (84% to 80%) and lower HP rate (13.7% to 18%).
Finally, it is worth noting that the RVO-based planner requires velocities of the companion/pedestrians and an accurate global map of the static obstacles. Conversely, our policy depends only on position measurements that are directly accessible from the robot’s onboard sensors, which is therefore much simpler and more practical.
In experiments, we assess the performance of our developed navigation policy by comparing it with humans in the same scenarios. Particularly, a robot and a human are to repeat each specific navigation scenario for 10 times, respectively. Then, the the following two metrics are calculated:
Average minimum distance to the pedestrians (): the average of the minimum distance between the robot/compared human to other pedestrians throughout a trajectory .
Average maximum distance to the companion (): the average of the maximum distance between the robot/compared human to its/his companion throughout a trajectory.
We use the same mobile platform (a synchron-drive Turtlebot 2 with a Kobuki base) and the same laser range finder (Hokuyo URG-04LX) simulated in last section. For pedestrian detection and localization, we adopt the ROS-compatible leg tracker in . We use an ultra wideband (UWB) indoor positioning system to localize the companion and the navigation goal, which can then be easily mapped to the observations and based on the odometry of the robot. Finally, a laptop is placed onboard as the processing unit and the policy is operated with a period of 0.1 second. The experiments are conducted in a narrow corridor with width of 1.56 meters as shown in Fig. 6, which is a typical scenario that requires pedestrians to navigate cooperatively.
In this subsection, we examine our method’s performance in traditional social navigation scenario. Particularly, the robot is required to pass the corridor with two oncoming pedestrians and arrive at a goal that is 7 meters ahead. In addition, a control experiment of 3 humans (one as the compared human and the other two as pedestrians) is conducted in the same space. The metric is computed. Example trajectories of the robot and the human control are shown in Fig. 7. In the robotic experiments, the trajectories of pedestrians are obtained from the robot’s laser range finder while the robot’s trajectory is based on its own odometry sensor. On the other hand, all trajectories in the human control experiments are captured using the UWB localization system.
From Fig. 7, it is clear that the robot with our policy is able understand human’s cooperative behavior for collision avoidance and navigate in an appropriate manner such that both itself and the other two pedestrians can successfully pass through the corridor. Specifically, when observing the two pedestrians (blue and purple) 4 meters ahead. The robot started to approach the wall on its left side so as to create free space on the right for the pedestrians to smoothly walk through. By comparing both figures in Fig. 7, we can see that the robot is as proactive as human since both black trajectories in Fig. 7(a) and Fig. 7(b) started to make space for the oncoming pedestrians at the early stage of cooperative avoidance process. As for the performance metrics, the average minimal distance to pedestrians for our robot is m. Although it is smaller than that of the human control experiments (=0.56m), this value still indicates a safe and decent navigation behavior of our robot as its radius is only 0.17m.
In this subsection, the scenario of SCN is studied. A human companion initially standing in front of the robot will start to walk through the same corridor while another pedestrian is passing from the other end. As described in the previous sections, the robot with our policy should closely navigate with its companion and avoid the oncoming pedestrian cooperatively. An additional metric is used to evaluate the performance of our policy by comparing with the statistics obtained from another 10 human control experiments. Example trajectories are shown in Fig. 8 and the performance metrics and are summarized in Tab. IV.
As shown in Fig. 8 and Tab. IV, the robot is able to achieve both objectives of SCN. On one hand, it is effectively engaged into the joint collision avoidance process. The resulted behavior is similar to that observed in the last subsection and the robot even has a slightly larger . On the other hand, the average maximum distance is 1.05m, which is within the limit (2m) we specified in the learning process and nearly the same as that of the compared human, showing that the robot can actively navigate along with its companion instead of deviating to other areas or lagging itself behind. This shows that the robot driven by our policy is able to understand the pace of its companion and achieve a similar sense of companionship in terms of distance.
In sum, the above results demonstrate the practical efficacy of our methods for both the traditional social navigation and the more complicated SCN scenarios. It proves that the policy learned from our RPL simulative environment is transferable to uncovered real-world situations.
In this paper, the problem of socially concomitant navigation (SCN) has been investigated and formulated under a POMDP framework, with explicit considerations of the limitation and inaccuracy of mobile robots’ onboard sensors. The Partially Observable TRPO (PO-TRPO) algorithm has been proposed for optimization of navigation policies. The Role Playing Learning (RPL) scheme has been developed to enable efficient and safe reinforcement learning of navigation policies by mirroring a large amount of real-world pedestrian trajectories into simulative environments. Comparative simulation and experiment studies have demonstrated the efficacy and superiority of our policy in both SCN and traditional social navigation scenarios.
Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, 2011, pp. 1345–1352.
Proceedings of the twenty-first international conference on Machine learning, 2004, p. 1.
Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,”Nature, vol. 521, no. 7553, pp. 436–444, 2015.
M. Liu and R. Siegwart, “Topological mapping and scene recognition with lightweight color descriptors for an omnidirectional camera,”IEEE Transactions on Robotics, vol. 30, no. 2, pp. 310–324, 2014.
S. Hochreiter and J. Schmidhuber, “Long short-term memory,”Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.