In the last years, an increasing amount of robots have been deployed in public environments making evident the need of human-aware navigation capabilities. Environments where robots are capable of navigating such as hospitals, hotels, restaurants, train stations or airports have been subject of research studies .
Classical robot navigation approaches tend to focus on efficiency during the path planning and execution; variables such as path length, clearance and smoothness are used to quantitatively evaluate performance 
. Nevertheless, robots using such approaches tend to disturb the crowd flows and to move in unexpected manners. Tackling these problems, more recent works have have proposed different strategies for socially acceptable navigation for robots in complex environments where human presence and activities are unpredictable. These works, can be classified into two groups: model-based and learning-based methods.
The major exponents of model-based methods rely on social psychology and cognitive sciences to generate human-like paths for robot navigation. One of the most relevant approaches is the Social Force Model (SFM) , which proposed to model pedestrian’s behaviour whose motion is influenced by other pedestrians by means of repulsive forces. Many studies have implemented this method but also produced several variations , later applied to real world environment navigation where robots can avoid or go along with people . However, these works show limitations such as the need for parameter calibration in different robots or the need of additional sensors for pedestrian tracking. Also, model-based methods are based on geometric relations, but it is still unclear if pedestrians always follow such models.
In contrast, learning-based methods use policies for defining human-like behaviours, which are usually learnt from human demonstrations by matching feature statistics about pedestrians. These methods apply machine learning techniques such as Inverse Reinforcement Learning (IRL) to model the factors that motivate people’s actions instead of the actions themselves. An experimental comparison of features and learning algorithms based on IRL is presented in, where the authors conclude that it is more effective to invest effort on designing features rather than on learning algorithms. More recent approaches include the use of deep reinforcement learning in order to set restrictions  (i.e. passing at the left side of the people) instead of learning the features that describe human paths.
Nevertheless, few efforts have been made concerning the problem of labelling humans as an interactive agents when blocking the robot motion trajectory. Current state of the art navigation planners will either propose an alternative path (if possible) or freeze the motion until the path is free.
Our work is part of the CrowdBot European project. This project articulates ethical questions with technical problems in the goal of enabling robots, autonomous ones or semi-autonomous ones to navigate in crowded environments. Navigating through human crowds is a tough challenge for a robot. Crowds can cause severe sensor occlusions and often don’t leave much free space for the robot to move in, leading to what’s known as the ”freezing robot problem”. Furthermore, we contemplate semi-static scenarios where the robot can not pass unless it interact with the humans blocking its path. In a previous work  we presented a robot behavior for interactive navigation used in deadlock situations. In such work the robot used speech with the purpose of asking for permission to pass when its path was blocked by humans. Nevertheless, there exist situations where a physical contact is needed when a robot is navigating, as is the case of crowded and noisy environments (i.e. a cocktail party), where people standing and talking are unaware of the intentions of the robot.
The goal of this work is to provide a wheeled humanoid robot with a behavior that is capable of unfreezing the robot in deadlock situations in semi-static, crowded and loud environments. One of the constraints on the design of this behavior was that the modality should be other than audio, as a the voice of the robot could be not heard in a loud environment. Also, as the robot used in this work is the humanoid Pepper robot, we take advantage of the physical modality. This modality should respect social norms such as touching only body parts that are socially allowed. A study on the meaning of touch  showed that the hands and shoulders are considered regions that serve to direct the recipient’s attention, also most of the touches were initiated by hand alone.
A robot behavior that could be used in the context of this work is the one of touching the shoulder of a person from the back, unfortunately the robot used in our work is too small to perform such behaviour. Then, we decided that the behavior should consist of moving one arm of the robot until reaching the hand of the person blocking the path. Furthermore, in order to increase the safety of the robot behavior, it should track the target with its head while performing the movement.
Ii Related Work
Ii-a Unfreezing Robot Problem
Navigation methods described in the introduction, usually focus on the motion of the robot. The work presented in  presents an approach called ”Interactive Gaussian Processes” (IGP) based on the assumption that agents solve the freezing problem by engaging in a cooperative strategy to create feasible trajectories, such approach should work for agents in motion. A more recent approach for unfreezing the robot in crowded environments, is presented in  which consist of a recovery behavior in case of SLAM failure, this behavior used a reinforcement learning based local navigation policy to find a set of recovery positions with rich features which allow the robot to continue its navigation.
Nevertheless, there are situations where more complex social behaviours are needed in order to deal with the freezing robot problem. Only a few research studies have focus on interactive navigation using robot capabilities. An approach for management of deadlock situations at narrow passages is presented in , where the robot lets the conflicting person pass and waits in a non-disturbing waiting position. Other work  proposes an interactive navigation in two simulated scenarios: In the first one, the robot asks for collaboration to enter a room. In the second one, the robot asks for permission to navigate between two people who are talking. In another work 
, the authors of propose the use of multi-modal rule-based system for interactive navigation, the interaction modals included visual, acoustics, and haptics, where the possible actions of the robot for clearing its path were: robot movement, speech, arm contraction, passive touch, and notifying touch, results showed that their system solved freezing problems, provided a safe and efficient trajectory while improving humans perception.
Ii-B Target Reaching Behavior
The choice of using a target reaching behavior for solving the unfreezing the robot problem relates our work with several works on the target reaching problem, which is one of the most important problems in robotics, as it is needed for tasks such as manipulation, grasping and object interaction.
The problem of the movement of a kinematic chain given a desired position of the end effector is a well-known problem in classical robotics known as Inverse Kinematics (IK) where the most common approach is to use a Jacobian matrix to find a linear approximation 
. There are many public software libraries for solving motion planning for robotics. However, IK solvers do not ensure a solution and they could require variable time to find one. Approaches using neural networks has been proposed for stereo-head robot arm coordination
allowing the robot to reach a given a spatial target using different joints. Also, using genetic algorithms with neural networks has allowed robotic arms to reach target at random target locations while avoiding obstacles Novel approaches for solving the target reaching problem include the use of Spiking Neural Networks (SNN) , which shows promising results but are still far for being a practical solution.
Current approaches in the state of the art for generating robot motions use Deep Reinforcement Learning (DRL) methods, i.e.  a quadruped robot learns to walk using such approach. Controllers based on DRL can be very computationally efficient at runtime, they provide a control policy learning system, but they do suffer from extremely slow training times. Because of this, the common way to train DRL in robotics is through simulation, an example of this is the work of  where the authors teach a simulated robot to move like an animal, then the learned policy is adapted to a real robot. An empirical evaluation of off-policy DRL algorithms on vision-based robotic grasping tasks in a simulation environment is presented in .
DRL has been used before for teaching the Pepper robot to interact with humans, in 
the authors propose the use of a dual stream convolutional neural network, the actions of the robot include waving hand, waiting, looking towards the human, and handshake.
In this work, we use a DRL method to perform a target reaching behavior on the Pepper robot given a dynamic spatial target.
Iii Proposed System
The proposed Target Reaching Behavior is composed of two modules that are coordinated using the Robotic Operative System (ROS) framework, see Fig.2. The first module, in charge of the perception of the robot. This module consist of a target detector using the Yolo v3 algorithm trained on the Open Image Dataset. The target detection module outputs a 3D target position using a RealSense D435 device set on the head of the robot. The target detector module is described in section IV. The second module, in charge of the actions of the robot, consist of a DRL method using the Proximal Policy Optimization algorithm trained on a physical simulator. The target reaching module is presented in section V.
Iv Target Detection
The target of our system for reaching by the robot are human hands and human arms that will be located at a close range distance.
There exist solutions in the state of the art of hand detection that are readily available. One of these systems is OpenPose , which can not only detect hands, but also the whole human skeleton. Nevertheless, OpenPose was proved to be unsuitable for our needs, this is because of the short distance between the robot and the person in front of it in our scenario. Also, because of the specifications of the RGB-D sensor and the size of the robot, we need a system capable of detecting hands at a distance between 0.2 m. and 0.8 m.
Due to the high success of deep neural networks in computer vision tasks, YOLOv3
was chosen as the object detector to be used, which is one of the state-of-the-art deep learning algorithms that perform real-time object detection.
The system uses, as other deep learning methods, a supervised learning approach for which it needs to be trained on a database of annotated images. For this reason, the Open Images Dataset(OID) was selected. This database provides a large collection of labelled images and their bounding boxes. This dataset was chosen because the images are very diverse and often contain complex scenes with several objects,allowing the model to learn more complex patterns. For training the target detector, two classes were selected: human arm, and human hand. For each class, 10,000 images were downloaded. The arm class allows a secondary target in case the hands of the person are not visible or are not recognized.
Iv-B YOLO algorithm
The version of YOLO used in this work is the implementation of PyTorch-YOLOv3111https://github.com/eriklindernoren/PyTorch-YOLOv3#pytorch-yolov3. The images labels were pre-processed in order to meet the requirements of YOLOv3. The network was trained from scratch with the images downloaded from the Open Images Dataset.
Iv-C YOLO Training
The model was trained for 400 Epochs of 2400 steps each epoch, the mean average precision (mAP) on the testing dataset was around 0.24% and a loss value of 0.67, while it still has room for improvement it is acceptable for our needs.
The training took about 30 hours on a PC composed of a 1080 Tegra Ti GPU. Further improvement on the mAP would require more and more time and it will increase only a small fraction of the current result. The loss value of the neural network is presented in Fig.3 showing the relative time for each step, as the training was done in separate times each 100 epochs, it can be seen that the log after 100 epochs had descended to 2.82, after 200 epochs it was 1.14, after 300 epochs it was 0.83 and the final value after 400 epochs was 0.67. In the validation set, the arms present a better mAP (0.33) than the hands (0.27).
The model was also tested with images streamed from the RealSense depth camera D435 set on the robot in order to verify that it fulfils the needs of this module. An example is shown in Fig.4, where it can be seen that the model is able to detect human hands at a close distance.
Iv-D Model Inference
The inference is done using the PyTorch framework inside a ROS package in order to facilitate the communication with the other modules of the project. Depending on the availability of CUDA the implementation of the model uses GPU or CPU to make the inference. The detections are re-scaled to the original size of the image, discarding too small (1:10) or too big images (8:10) using empirically obtained thresholds.
In order to speed up the conversion of the depth points in the region of interest to a single 3D position, the 3D target is calculated taking the depth information of n random pixels (pre-defined to 10) from the center of the bounding box and computing its mean. Afterwards, the ROS tf2 package is used to translate the hand position to the base of the robot in order to perform the hand gesture. Finally, the target is published to a ROS topic if it is a new detection or if the previous published point is a range of 0.2m.
V Target Reaching Behavior
In this section, we define our formulation for the target reaching behavior. Then we describe the training of the method in simulation. Finally we show the results in the real robot.
V-a Problem Statement
The goal of this module is to generate a robot movement for allowing it to reach a given 3D position target with its hand. The scenario where the robot gesture will be used, which is a Human-Robot Interaction, imposes some restrictions such as the movement of the robot needs to be fast enough to follow changes in the position of the target (hand or arm), also as the size of the robot is small compared to an adult person, using the hip joint to reach a point with the hand would be useful as this increase the action space of the robot. Currently, the NAOqi framework has a module for pointing with the arm to a given 3D position. Nevertheless, it has some limitations like the movement does not include the hip and the head of the robot and it is done in a blocking call, which means that the robot will execute following orders until the movement is done. This is a disadvantage for having a tracking with the head at the same time. Instead, having a head tracking the target and maintaining it in the field of view of the robot could be of aid for assuring a reactive gesture.
V-B Problem Formulation
Policy gradients methods are one of the major pieces in recent improvements in deep learning for control. Based on this, we formulate the problem of finding a trajectory of joint positions for a kinematic chain of a robot as a partially observable Markov decision process as a tuplewhere is a set of states of the agent, is a set of observations, is a set of Actions,
is a state transition probability function, where, and is a reward function = .
We chose the Proximal Policy Optimization (PPO) algorithm  for learning the policy for our target reaching behaviour. PPO seeks to improve previous policy gradient methods, while using only first-order optimization. Also, PPO uses an objective with clipped probability ratios, which forms a lower bound of the performance of the policy. To optimize policies, it is alternated between sampling data from the policy and performing several epochs of optimization on the sampled data.
We describe the observations, actions, and reward as follows:
The observations are represented as [, , , , ] where are the the angle in radians of the joints of the kinematic chain, is the 3D position of the right hand of the training robot, is the 3D target position ,
is the direction vector of the head of the training robot, and is the direction vector of the position of the head of the training robot towards the position of the hand of the second robot . All observations are normalized between .
The actions are represented as  the normalized velocities of the joints. The velocities were also limited to for the head joints and to for the rest of the joints.
We use a combination of two rewards, the first one is related to the movement of the arm of the robot, and the second one is related to the movement of the head.
The arm reward is defined as:
The head reward is defined as:
The rewards were combined as follows:
Where and are weights empirically obtained, a greater weight for the head reward would not allow the learning of the rest of the joints which need a more complex behavior.
V-C Training Environment
In this work, OpenAI Baselines222https://github.com/openai/baselines are used, it is a set of implementations of RL algorithms of the state of the art, more specifically a fork of Baselines called Stable Baselines333https://stable-baselines.readthedocs.io/en/master/index.html  is used, which provides unified structure and documentation for the algorithms.
In order to train the model, we used a physical simulator called qiBullet  which is a Bullet-based simulator for the Pepper and NAO robots. The qiBullet simulator inherits the cross-platform properties of the PyBullet Python module and Bullet physics engine. Also, PyBullet is a fast and easy to use Python module for robotics simulation and machine learning, with a focus on sim-to-real transfer.
The simulated environment used in this work involves two robots, one of them is controlled by the PPO algorithm while the other is set at random positions. The goal is to train a policy able to perform a hand gesture that touches the hand of the other robot regardless of its position, while tracking it with the head and the right hand of the robot. The second robot is used because of simplicity. The simulation runs at the default rate of 1/240 seconds using the method stepSimulation().
V-C1 Robot model
The model of the simulated Pepper robot is given by the qiBullet simulator. The joints for training the target reaching behavior are: hip pitch, right shoulder pitch, right shoulder roll, right shoulder elbow, head yaw, and head pitch. Other joints were excluded even if they could improve the action space of the robot, because it would lead to loss of naturalness of the motions.
The target is given by the 3D position of the left hand of a second robot having the world as reference frame. The position of the second robot and its left hand are generated randomly at each episode inside a region in front of the robot to be trained. The x coordinate is set in a range of and the coordinate is set between . The left shoulder pitch is set in a range of . This range gives a target space that contains several reachable targets and some that are not. The position of the second robot is intentionally set at the right side of the training robot in order to facilitate the learning as the physical design of the robot poses further restrictions to the reaching space. Furthermore, an offset of cm in the axe was added to the target, so the robot would not try to reach the joint position inside the hand of the other robot.
Training the robot in simulation, the combined reward increased in the first 10 million steps, after that it decreased and only got similar results after more than 30 million steps. The stop condition to restart the episodes was when the arm of the robot made a collision with itself, and after 5 seconds of the beginning of the episode. The model was saved each 10 million steps. The discounted reward of the training is shown in Fig.5, where each color represents 10 million episodes
The robot achieved to reach the target at different positions that were randomly generated, also it was able to adapt its movement to dynamic targets. Different robot motions for different target positions are shown in Fig.6 where the motion can be seen from a side view, and Fig.7 where the motion can be observed from a top view.
Vi Real Robot Implementation
This section present the target reaching module implemented in ROS for using it in the real robot.
The Target reaching ROS module is composed of 3 nodes:
Target frame publisher
Frame transformations publisher
The target frame publisher node initialize a target frame having as reference the base of the robot.
The frame transformation publisher node is in charge of transforming between different reference frames. In the simulation environment, the reference frame was the world frame, in the real robot it is the base of the robot. The Realsense sensor was added to the urdf file in order to use the sensor position and orientation for the ROS frame transformations.
The model prediction node obtains the observations and update the model for making a prediction for the next actions.
The ROS modules where setup to run at a speed of 50 Hz. The movement of the robot joints was controlled by relative angle and velocity, the angle rotation was fixed at 0.2 radians while the velocity was controlled by the predictions of the actions of the trained model.
Vii Conclusion and Future Work
In this work we presented a method for solving the freezing robot problem in a semi-static crowded and noisy environment. The method is composed of two separated modules. The ”perception” module is in charge of detecting the target which in our case was either the hand or the arm of the person blocking the path of the robot. The ”action” module is in charge of generating the gesture of the robot in order to touch the target. Having separated modules for the perception and the action of the robot, give the advantage of solve each problem separately. Each module can be replaced for other method of better accuracy. We used the Yolo v3 algorithm for fast detection, it could be replaced for other detection algorithms. For the action module, we used a policy trained on simulation using the PPO reinforcement learning algorithm. The generated policy was able to generate motion gestures that reached a dynamic target on simulation and tested on a real robot. Further work will measure the performance of the robot behaviour in real life-like scenarios. The code of the environment for training the policy in simulation can be found in: https://github.com/softbankrobotics-research/qi_gym/tree/reaching_target.
-  (2004) Introduction to inverse kinematics with jacobian transpose, pseudoinverse and damped least squares methods. IEEE Journal of Robotics and Automation 17 (1-19), pp. 16. Cited by: §II-B.
-  (2019) QiBullet, a bullet-based simulator for the pepper and nao robots. arXiv preprint arXiv:1909.00779. Cited by: §V-C.
-  (2017) Socially aware motion planning with deep reinforcement learning. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1343–1350. Cited by: §I.
-  (2019) Enabling socially competent navigation through incorporating hri. arXiv preprint arXiv:1904.09116. Cited by: §I.
-  (2019) Getting robots unfrozen and unlost in dense pedestrian crowds. IEEE Robotics and Automation Letters 4 (2), pp. 1178–1185. Cited by: §II-A.
-  (2013) Robot companion: a social-force based approach with human awareness-navigation in crowded environments. In 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 1688–1694. Cited by: §I.
-  (1999) A neural network for target reaching with a robot arm using a stereohead. In IEEE SMC’99 Conference Proceedings. 1999 IEEE International Conference on Systems, Man, and Cybernetics (Cat. No. 99CH37028), Vol. 2, pp. 842–847. Cited by: §II-B.
-  (2018) Learning to walk via deep reinforcement learning. arXiv preprint arXiv:1812.11103. Cited by: §II-B.
-  (1995) Social force model for pedestrian dynamics. Physical review E 51 (5), pp. 4282. Cited by: §I.
-  (2017) A guide robot at the airport: first impressions. In Proceedings of the Companion of the 2017 ACM/IEEE International Conference on Human-Robot Interaction, pp. 149–150. Cited by: §I.
-  (2019) A preliminary study of interactive navigation framework with situation-adaptive multimodal inducement: pass-by scenario. International Journal of Social Robotics, pp. 1–22. Cited by: §II-A.
-  (2018) The open images dataset v4: unified image classification, object detection, and visual relationship detection at scale. arXiv preprint arXiv:1811.00982. Cited by: §IV-A.
Networks of spiking neurons: the third generation of neural network models. Neural networks 10 (9), pp. 1659–1671. Cited by: §II-B.
-  (2018) Issues in existing robotic service in restaurants and hotels. In Proceedings of 3rd International Conference on Internet of Things and Connected Technologies (ICIoTCT), pp. 26–27. Cited by: §I.
-  (1996) Evolving obstacle avoidance behavior in a robot arm. In From animals to animats 4: Proceedings of the fourth international conference on simulation of adaptive behavior, Vol. 4, pp. 468. Cited by: §II-B.
-  (2020) Learning agile robotic locomotion skills by imitating animals. arXiv preprint arXiv:2004.00784. Cited by: §II-B.
-  (2018) Deep reinforcement learning for vision-based robotic grasping: a simulated comparative evaluation of off-policy methods. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 6284–6291. Cited by: §II-B.
-  (2016) Robot gains social intelligence through multimodal deep reinforcement learning. In 2016 IEEE-RAS 16th International Conference on Humanoid Robots (Humanoids), pp. 745–751. Cited by: §II-B.
-  (2018) Yolov3: an incremental improvement. arXiv preprint arXiv:1804.02767. Cited by: §IV.
-  (2017) Long-term demonstration experiment of autonomous mobile robot in a science museum. In 2017 IEEE International Symposium on Robotics and Intelligent Sensors (IRIS), pp. 304–310. Cited by: §I.
-  (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §V-B.
-  (2010) Unfreezing the robot: navigation in dense, interacting crowds. In 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 797–803. Cited by: §II-A.
-  (2015) “Go ahead, please”: recognition and resolution of conflict situations in narrow passages for polite mobile robot navigation. In International Conference on Social Robotics, pp. 643–653. Cited by: §II-A.
-  (2018) Mobile robot control and navigation: a global overview. Journal of Intelligent & Robotic Systems 91 (1), pp. 35–58. Cited by: §I.
-  (2014) Inverse reinforcement learning algorithms and features for robot navigation in crowds: an experimental comparison. In 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 1341–1346. Cited by: §I.
-  (2018) Planning human-robot interaction for social navigation in crowded environments. In Workshop of Physical Agents, pp. 195–208. Cited by: §II-A.
-  (2011) Social force model with explicit collision prediction. EPL (Europhysics Letters) 93 (6), pp. 68005. Cited by: §I.