The next generation of mobile robots needs to be socially compliant to be accepted by humans. As simple as this task may seem, defining compliance formally is not trivial. Yet, classical reinforcement learning (RL) relies upon hard-coded reward signals. In this work, we go beyond this approach and provide the agent with intrinsic motivation using empowerment. Empowerment maximizes the influence of an agent on its near future and has been shown to be a good model for biological behaviors. It has also been used by artificial agents to learn complicated and generalized actions. Whereas self-empowerment maximizes an agent's influence on its own future, our robot strives for the empowerment of the people in its environment, so that they are not disturbed by the robot when pursuing their goals. We show that our robot has a positive influence on humans, as it minimizes their travel time and distance while moving efficiently to its own goal. The method can be used in any multi-agent system that requires a robot to solve a particular task involving interaction with humans.
Recent advances in sensor and control technologies have allowed the development of robots that assist people. These autonomous agents are seen in household, industrial, and traffic environments. A key challenge in these settings is that the robot must plan safe, collision-free paths; almost equally important is that these paths must be socially compliant. For example, if a human observer interprets the motion of the robot correctly, the likelihood of a collision is lower.
When people navigate, they follow certain unwritten rules, which can differ from situation to situation and even vary depending on the people involved. Robots that interact with people need to act according to these unwritten rules. According to Kruse et al., there are three main requirements for a robot to navigate in a social way: comfort, naturalness, and sociability. If these criteria are violated, a robot could create dangerous situations that it could otherwise have avoided.
Most methods for social navigation explicitly model interactions among agents or social conventions. Explicitly defining rules for social navigation is hard, as human behavior often varies between persons and from one context to another. To tackle this problem, other works implicitly model these aspects through imitation. The drawback of this approach is that the learning outcome depends on the availability of sufficient high-quality demonstrations, which can be resource-intensive to obtain. An alternative is to learn these aspects via trial and error. This approach relies on (hard-coded) reward signals; however, defining social compliance formally as a reward function is not trivial.
A different approach is to provide the agent with rewards that it generates itself. One of these intrinsic reward systems is called empowerment. Empowerment maximizes the influence of an agent on its near future; it has been shown to be a good model of biological behaviors and has enabled artificial agents to learn complicated and generalized behaviors.
Self-empowered robots try to impact their environment maximally. In contrast, robots that strive for human empowerment try to maintain the humans' influence on the environment. We propose a robot that behaves such that others are maximally empowered, i.e., it drives to states where its neighbors are at their full potential. For example, when moving around people, the robot must keep an appropriate distance: far enough not to limit their space or block their way. Keeping one's distance is regarded as one aspect of human comfort.
Our agent strives for the empowerment of people so that they are minimally disturbed in pursuing their goals. Our contribution is to use the concept of human empowerment introduced by Salge and Polani as an intrinsic reward function for social navigation and to compare the resulting method on state-of-the-art robotic navigation benchmarks. Also, inspired by prior work, we test it on two metrics that assess social behavior. Last, we study the influence of the robot on people, and vice versa, with two new metrics. These experiments demonstrate that social characteristics can be achieved by using human empowerment. Our agent tries to empower humans instead of itself. In addition, since this does not require a hand-crafted cost function, it is applicable to any multi-agent system that requires a robot to interact with humans in a socially compliant way.
Many works have designed interaction models that enhance the social awareness in robot navigation. We discuss these methods first and motivate the practicality of deep reinforcement learning. We proceed by describing empowerment as a member of a greater family of intrinsic motivators for reinforcement learning and argue why it can be used for social navigation.
Goals for social navigation can be divided into three broad categories: comfort, naturalness, and sociability. Examples of comfort are respecting personal space, avoiding erratic behavior, and not interfering with others' movement. Naturalness is mostly related to how similar a robot's motion is to human behavior, e.g., smooth and interpretable, while sociability is mostly related to social conventions and etiquette.
Having defined these, one might be tempted to create a navigation framework that satisfies all the currently known requirements. Previous works have attempted exactly this. Well-engineered methods are the Social Force Model, Interacting Gaussian Processes (IGP), ORCA, and RVO. These methods obtain collision-free paths, but with few additional social characteristics, and they rely heavily on hand-crafting the model.
In contrast to these model-based approaches, deep learning models have been shown to produce more human-like paths. Deep neural networks allow a policy to better understand humans and comply with their social rules. Early works separate the prediction of the environment and the planning task of the policy into two neural networks. However, this may cause the freezing-robot problem, as the predicted human motion could take up all the available future space.
To obtain an action policy directly, imitation learning (IL) and inverse reinforcement learning (IRL) derive policies from demonstrations. While this might seem promising, a large data set is required due to the stochastic nature of people. As an alternative, deep reinforcement learning (DRL) aims to learn cooperative strategies by interacting with the environment. Finding a proper cost function that encourages a robot to navigate in a social manner is anything but trivial. Even if a cost function might appear obvious in some cases (e.g., collision-free motion and keeping a distance from neighbors), it often has to be regularized, e.g., to be smooth and risk-averse. This leads to an alternating process of reviewing the agent's behavior and adapting the cost function accordingly to achieve the desired result.
Instead of shaping the reward function to achieve the desired behavior, an emerging field within reinforcement learning focuses on intrinsic motivation. There are many ways to intrinsically motivate an agent, and one possible technique is called empowerment. Empowerment has been applied to teach agents task-independent behavior and to train in settings with sparse rewards. Examples of such tasks are stabilizing an inverted pendulum, teaching a biped to walk, and even winning Atari video games. For the interested reader, we recommend the survey by Aubret et al., which covers intrinsic motivation for reinforcement learning extensively.
Empowerment is the channel capacity between actions and future states; maximizing it maximizes the influence of an agent on its near future. This quantity can be computed for individual states of the world. Once these quantities are known, an agent can move to highly empowered states, giving it more control than it would otherwise have.
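In a small deterministic world, this channel capacity reduces to the logarithm of the number of distinct states reachable within the horizon, which makes the idea concrete. A minimal sketch (the 1-D corridor and clamping dynamics are our illustration, not the paper's environment):

```python
import math

def reachable_states(state, actions, step, n_steps):
    """Enumerate distinct states reachable in n_steps under deterministic dynamics."""
    frontier = {state}
    for _ in range(n_steps):
        frontier = {step(s, a) for s in frontier for a in actions}
    return frontier

def empowerment(state, actions, step, n_steps):
    """For deterministic transitions, n-step empowerment is log2 of the
    number of distinct reachable states (channel capacity in bits)."""
    return math.log2(len(reachable_states(state, actions, step, n_steps)))

# 1-D corridor of cells 0..4; moving into a wall clamps the position.
clamp = lambda s, a: max(0, min(4, s + a))
center = empowerment(2, (-1, 0, 1), clamp, 1)  # three distinct outcomes
corner = empowerment(0, (-1, 0, 1), clamp, 1)  # the wall removes one option
```

As expected, the state at the wall is less empowered than the one in the open, matching the intuition that constrained states offer less control.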
Self-empowerment maximizes the influence of an agent on its own future. This may have the exact opposite effect to cooperation and social interaction, as a self-empowered agent will try to push people away so it can reach as many future locations as possible. In contrast, an agent that strives for the empowerment of others maintains their influence on their own futures.
To this end, we propose a robot that aims to maximize the empowerment of its neighbors. As a consequence, the robot will respect people's personal space and not hinder them in pursuing their goals. In addition, our method is generally applicable to any human-robot environment, since it does not require a (hard-coded) reward function on which many other DRL methods for social navigation rely.
Our goal is to teach an agent how to safely navigate to its goal in a socially compliant manner. These two objectives can be achieved by a combination of two rewards. In this section, we will describe our agent and the two types of rewards.
We consider the system to be Markovian: each next state depends only on the current state and the agent's action, not on any prior history. A value network model is trained to accurately approximate the optimal value function, which implicitly encodes social cooperation between agents and the empowerment of the people, see Eq. 1. Here, $R(s, a)$ is the reward function and $\pi^*$ is the optimal policy that maximizes the expected return, with discount factor $\gamma$.
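Equation 1 did not survive extraction; the standard optimal value function and policy it refers to, in our reconstructed notation, read:

```latex
V^*(s_t) = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^{k}\, R\big(s_{t+k}, \pi^*(s_{t+k})\big)\right],
\qquad
\pi^*(s_t) = \operatorname*{arg\,max}_{a}\; R(s_t, a)
  + \gamma\, \mathbb{E}_{s_{t+1}}\!\left[V^*(s_{t+1})\right].
```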
The first task of the agent is to reach its goal while avoiding collisions and keeping a comfortable distance to humans.
Equation 2 defines the environmental reward function for this task, with the robot's state denoted by $s$ and its action by $a$. Similar to other DRL methods for social navigation, we reward task accomplishment and penalize collisions or uncomfortable distances.
Here, $d_g$ denotes the robot's distance to the goal during a time interval and $d_i$ its distance to neighbor $i$. The robot is rewarded when its position reaches the goal position, but penalized if it gets too close to another agent's position.
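As a rough sketch of how such an environmental reward can be implemented (the constants and radii below are illustrative assumptions, not the paper's exact values):

```python
import math

# Illustrative constants; the paper does not state its exact values here.
GOAL_REWARD = 1.0
COLLISION_PENALTY = -0.25
DISCOMFORT_DIST = 0.2   # metres, the threshold used in the experiments
ROBOT_RADIUS = HUMAN_RADIUS = 0.3

def environmental_reward(robot_pos, goal_pos, human_positions):
    """Sketch of the task reward: reach the goal, avoid collisions,
    and keep a comfortable distance from every human."""
    dists = [math.dist(robot_pos, h) for h in human_positions]
    if dists and min(dists) < ROBOT_RADIUS + HUMAN_RADIUS:
        return COLLISION_PENALTY                      # collision
    if math.dist(robot_pos, goal_pos) < ROBOT_RADIUS:
        return GOAL_REWARD                            # goal reached
    sep = min(dists) - (ROBOT_RADIUS + HUMAN_RADIUS) if dists else float("inf")
    if sep < DISCOMFORT_DIST:
        return -0.1 * (DISCOMFORT_DIST - sep)         # discomfort penalty
    return 0.0
```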
The robot’s own state,
, consists of a 2D position vectorand 2D velocity vector . The human states are denoted by , which is a concatenated vector of states of all humans participating in the scene. Each entry is similar to the robot’s state, namely, . The final state of the robot is the joined state of the humans and robot, . Its action is a desired velocity vector,
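The joint state is then a simple concatenation; a minimal sketch assuming 4-D per-agent states of the form [px, py, vx, vy]:

```python
import numpy as np

def joint_state(robot, humans):
    """Concatenate the robot state [px, py, vx, vy] with every human
    state of the same shape into the joint state fed to the value network."""
    return np.concatenate([np.asarray(robot)] + [np.asarray(h) for h in humans])

s = joint_state([0.0, 0.0, 0.5, 0.0],
                [[1.0, 2.0, -0.3, 0.1], [-1.5, 0.5, 0.0, 0.4]])
# resulting shape: (4 + 4 * n_humans,)
```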
The second task of the robot is to consider people in its neighborhood and respond to their intentions in a socially compliant manner. Designing a reward function for this is not trivial, not least because of the stochasticity of human behavior. This is where we use empowerment, an information-theoretic formulation of an agent's influence on its near future.
Empowerment, in our case, motivates the robot to go to states in which its neighbors are most empowered. The robot thus aims to maximize the empowerment of another person rather than its own, which Salge and Polani call human empowerment, in contrast to robot empowerment. As a result, the robot will avoid obstructing the human, for example by getting too close or by interfering with the human's actions, both of which Kruse et al. defined as social skills.
Equation 3 gives the definition of empowerment as the maximal mutual information for a state $s$. It is the channel capacity between action $a$ and future state $s'$, maximized over the source policy $\omega$. The policy $\omega$ is part of the human's decision-making system.
The lower part expresses this empowerment in terms of entropies. It corresponds to increasing the diversity of decisions while limiting those decisions that have no effect. Intuitively, the empowerment of a person reflects his or her ability to influence the future.
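The definition referred to as Equation 3 is missing from the extracted text; in our notation, with source policy $\omega$, the standard form is:

```latex
\mathfrak{E}(s) \;=\; \max_{\omega(a \mid s)} I(S'; A \mid s)
\;=\; \max_{\omega(a \mid s)} \big[ H(A \mid s) \;-\; H(A \mid S', s) \big].
```

The first entropy term rewards diverse decisions; subtracting the second removes decisions whose effect cannot be distinguished in the future state.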
The human state takes an ego-centric parameterization. Each state is an occupancy grid map centered around the person. It is a 3D tensor whose last two dimensions run over the height and width of the grid. Each entry contains the presence and velocity vector of a neighbor at that location. The resulting state of the humans is a concatenated vector, and the actions take continuous values.
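A sketch of such an ego-centric grid (the channel layout, grid size, and indexing are our assumptions; only the 0.1 m resolution is taken from the experiments):

```python
import numpy as np

def occupancy_grid(center, neighbors, size=8, res=0.1):
    """Ego-centric occupancy grid around one person.
    Channels: [occupied, vx, vy]; `res` is the cell size in metres
    (0.1 m, matching the resolution reported in the experiments)."""
    grid = np.zeros((3, size, size))
    half = size * res / 2.0
    for (px, py, vx, vy) in neighbors:
        dx, dy = px - center[0], py - center[1]
        if abs(dx) < half and abs(dy) < half:      # neighbor inside the map
            i = int((dx + half) / res)
            j = int((dy + half) / res)
            grid[:, i, j] = (1.0, vx, vy)          # presence + velocity
    return grid
```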
Empowerment can equivalently be defined by the KL divergence between the joint distribution of actions and future states and the product of their marginals. Computing this quantity exactly is difficult, and many methods have been designed to deal with this. Recent works provide an efficient way to estimate a lower bound on empowerment via variational methods.
Instead of the intractable planning distribution, a variational approximation is used to obtain a lower bound. This bound can then be maximized over the parameters of the source, variational, and planning networks. A third neural network computes the future state from the current state and action.
The gradient can be computed with respect to the joint parameters of these networks, denoted by $\theta$. For the continuous case, it is estimated using Monte-Carlo integration.
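The two equations referenced here are missing from the extracted text. Following the variational-empowerment literature, a consistent reconstruction in our notation, with source $\omega$, variational planning distribution $q$, and transition model $P$, is:

```latex
I(S'; A \mid s) \;\ge\; \hat{I}(s)
= \mathbb{E}_{\omega(a \mid s)\, P(s' \mid s, a)}
\big[ \log q(a \mid s', s) - \log \omega(a \mid s) \big],
\qquad
\nabla_{\theta} \hat{I}(s) \;\approx\; \frac{1}{N} \sum_{i=1}^{N}
\nabla_{\theta} \big[ \log q(a_i \mid s'_i, s) - \log \omega(a_i \mid s) \big].
```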
We are free to choose any type of distribution, and since human movement is not discrete, we model the source, planning, and variational distributions as continuous distributions.
The robot with policy $\pi$ learns to safely navigate to its goal and to move to human-empowered states. This is achieved by training a value network with the reward function in Eq. 8, which combines the mutual information with the environmental reward. A hyper-parameter regulates the trade-off between social compliance and safety.
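A sketch of this combination (averaging the per-human mutual-information estimates is our assumption; the paper only specifies a weighted combination of the two terms):

```python
def combined_reward(r_env, human_mi_estimates, beta=0.25):
    """Sketch of the combined reward: environmental reward plus the
    beta-weighted mutual-information (empowerment) estimate averaged over
    all humans. beta = 0.25 matches the value used in the experiments."""
    mi = sum(human_mi_estimates) / len(human_mi_estimates)
    return r_env + beta * mi
```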
Algorithm 1 shows the full algorithm. A set of demonstrations from the ORCA policy is used to give the robot a head start (lines 1-3). This speeds up learning, because experiences in which the robot reaches the goal are then part of the memory. Lines 4-22 describe the exploration phase during an episode, the calculation of the empowerment estimate, and the network updates. The behavior policy collects samples of experience tuples until a final state is reached (lines 6-9); random actions are selected with probability $\epsilon$. Once these are collected, our hypothetical human policy, together with the variational and transition networks, is used to estimate the mutual information (lines 10-14).
Finally, the networks are trained with a random mini-batch obtained from the memory (lines 15-21). Our value network is optimized by the temporal-difference method (TD(0)) with the standard experience-replay and fixed-target-network techniques, where a separate target network provides the bootstrapping targets. The source, variational, and transition networks are updated through gradient ascent.
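The TD(0) update against a fixed target network can be sketched as follows (the linear value function and learning rate are illustrative stand-ins for the paper's neural network):

```python
import numpy as np

class LinearValueNet:
    """Tiny linear stand-in for the value network: V(s) = w . s."""
    def __init__(self, dim):
        self.w = np.zeros(dim)
    def value(self, s):
        return float(self.w @ np.asarray(s))

def td0_update(net, target_net, s, r, s_next, gamma=0.9, lr=0.01):
    """One TD(0) step with a fixed target network:
    delta = r + gamma * V_target(s') - V(s), then w <- w + lr * delta * s."""
    delta = r + gamma * target_net.value(s_next) - net.value(s)
    net.w = net.w + lr * delta * np.asarray(s)
    return delta

net, target = LinearValueNet(2), LinearValueNet(2)
delta = td0_update(net, target, s=[1.0, 0.0], r=1.0, s_next=[0.0, 1.0])
# in practice the target network is synced to the online weights only
# every K updates, which stabilizes training
```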
One distinction from other works on empowerment and social navigation policies is the state representation. The behavior policy uses the joint state to navigate collision-free to its goal, while the source, variational, and transition networks take the occupancy grids centered around each human as states for computing the mutual information.
The first three experiments quantitatively compare our socially compliant robot (SCR) with other robot strategies. We continue our evaluation by simulating an experiment with several people that shows the robot's and the humans' movements, which allows us to evaluate their interaction.
We compare our robot with four existing state-of-the-art methods: ORCA, CADRL, LSTM-RL, and SARL. First, we use familiar metrics, namely the success rate, collision rate, time to reach the goal, discomfort-distance rate, and reward. Next, inspired by prior work, we evaluate the jerk. Last, we measure the time people need to reach their goals and the robot's path length, to assess the influence of the robot on the people relative to reaching its own goal efficiently.
The simulator used in this work starts and terminates an episode with five humans and the robot. The humans' decisions are simulated with the ORCA policy of van den Berg et al. to calculate their actions. This policy uses the optimal reciprocal assumption, which avoids other agents while moving.
We implemented the networks in PyTorch and trained them with a batch size of 100 for 10k episodes. For the value network, the learning rate and the discount factor of 0.9 are the same as in Chen et al. The exploration rate decays linearly from 0.5 to 0.1 in the first 5k episodes and stays at 0.1 for the remaining 5k episodes. The trade-off parameter was chosen to be 0.25.
Table 1 reports the success, collision, time, discomfort-distance rate, and reward for state-of-the-art robot navigation strategies. Success is the rate of the robot reaching its goal without a collision, and Collision is the rate of the robot colliding with humans, both averaged over 100 episodes. Our Socially Compliant Robot (SCR) and SARL both outperform the other baselines on the standard metrics. Next, we look more thoroughly into the robot's navigation time and compare it with that of the humans.
Table 2 shows travel times and distances of both the humans and the robot. Time is the robot's navigation time to reach its goal in seconds, and H time is the average navigation time of a human to reach his or her goal in seconds. The simulator allows making the robot invisible to the humans; this setting serves as a test bed for validating the other policies' (SARL visible and SCR) abilities to reason about interactions with the humans. Keeping both the robot's and the humans' times low indicates that the policy does not disturb the humans in pursuing their goals while still moving quickly to its own goal. The path length is calculated to make sure that the robot moves efficiently and to rule out unnecessary detours, since this cannot be evaluated from travel time alone. The invisible SARL has no influence on the time humans need to travel; on the other hand, its travel distance and time are higher than those of the visible SARL and SCR, which indicates that the robot makes detours around the humans. Conversely, the visible SARL has a low travel distance and time, but the human travel times are the highest. This may be because it has learned that humans avoid the robot. The numbers for SCR show that its travel time and that of the humans are nearly the same, suggesting that, due to the application of empowerment, our method has learned to minimally disturb other persons while moving to its own goal efficiently.
Next, we examine how we can evaluate social compliance further. Disc. is the frequency with which the separation distance between a human and the robot is less than 0.2 m, see Table 1, column 5. Both SARL and SCR spend the least amount of time close to a human, but earlier studies state that people judge robots negatively if the separation distance between them is low. Therefore, in Fig. 2 we show that even though SCR is, on average, as close to humans as SARL, it never goes below a minimum distance of 0.1 m. This can be explained by the fact that low proximity would result in lower empowerment, since the chance of a collision is high. The occupancy grids have a resolution of 0.1 m, which is fine enough to detect a collision.
Figure 3 shows SARL and SCR navigating through a crowd of five people. The left panels show SARL (a, b) and the right panels SCR (c, d) at two different time steps. The trajectories indicate that SARL goes directly to its goal, while SCR waits at t=6 (c). Moreover, at t=9.2, SARL has reached its goal, but only two out of five humans have reached theirs (b, purple and light blue stars). In contrast, SCR reaches its goal at t=10.5, by which time all people have reached their final destinations (d). SARL overtakes two people (a, red and green) and alters the path of another (a, blue); SCR, on the contrary, lets them pass (c, red, green, and blue). SARL uses occupancy maps to model the pairwise interaction between humans, so it cannot incorporate the influence of the robot on each human. In contrast, SCR uses empowerment maps for each human that have high values if it does not block anyone.
Inspired by state-of-the-art reinforcement learning techniques, we applied a method called empowerment to give an agent socially intelligent traits. Contrary to self-empowerment, we taught the agent to maximize the empowerment of its neighbors. Our experiments show that our policy outperforms other strategies on state-of-the-art robot navigation benchmarks. We continued our evaluation with additional metrics that assess the social intelligence of the robot more accurately; on these, our robot showed the best performance. The influence of the robot's motion on people is difficult to evaluate in simulation. Thus, future work includes deploying the policy on hardware and evaluating it in real-world experiments.