Social navigation with human empowerment driven reinforcement learning

03/18/2020 ∙ by Tessa van der Heiden, et al. ∙ BMW ∙ University of Amsterdam

The next generation of mobile robots needs to be socially compliant to be accepted by humans. As simple as this task may seem, defining compliance formally is not trivial. Yet, classical reinforcement learning (RL) relies upon hard-coded reward signals. In this work, we go beyond this approach and provide the agent with intrinsic motivation using empowerment. Empowerment maximizes the influence of an agent on its near future and has been shown to be a good model for biological behaviors. It has also been used for artificial agents to learn complicated and generalized actions. Whereas self-empowerment maximizes the influence of an agent on its own future, our robot strives for the empowerment of people in its environment, so that they are not disturbed by the robot when pursuing their goals. We show that our robot has a positive influence on humans, as it minimizes the travel time and distance of humans while moving efficiently to its own goal. The method can be used in any multi-agent system that requires a robot to solve a particular task involving human interactions.




1 Introduction

Recent advances in sensor and control technologies have allowed the development of robots that assist people. These autonomous agents are seen in household [25], industrial [1] and traffic environments [35]. A key challenge in these settings is that the robot must plan safe, collision-free paths and, almost equally important, that these paths must be socially compliant. For example, if a human observer interprets the motion of the robot correctly, the likelihood of a collision is lower [31].

Figure 1: Our socially compliant robot (SCR) uses its position, speed and goal to solve its task. It also uses occupancy maps centered around each human to compute empowerment. This allows it to move to its goal while minimally disturbing people in pursuing their goals.

When people navigate they follow certain unwritten rules [49], which can be different from situation to situation [46] and even vary depending on the people involved [41]. Robots that interact with people need to act according to these unwritten rules [13], [31], [32], [30]. According to Kruse et al. [31], there are three main requirements for a robot to navigate in a social way, i.e., comfort, naturalness, and sociability. If these criteria are violated, a robot could create dangerous situations that it could have avoided otherwise [4].

Most methods for social navigation explicitly model interactions among agents or social conventions [19], [33], [34], [29], [39], [21]. Explicitly defining rules for social navigation is hard as human behavior often varies between persons and from one context to another. To tackle this problem, other works implicitly model these aspects through imitation [30], [14], [45], [48]. The drawback of this approach is that the learning outcomes depend on the availability of sufficient high-quality demonstrations, which can be resource consuming to obtain. An alternative is to learn these aspects via trial and error [50], [7]. This approach relies on (hard-coded) reward signals. However, defining social compliance formally as a reward function is not trivial.

A different approach is to provide the agent with rewards that it generates itself. One of these intrinsic reward systems is called empowerment [28]. Empowerment maximizes the influence of an agent on its near future and has been shown to be a good model of biological behaviors as well as useful for artificial agents that must learn complicated and generalized behaviors [9], [23].

Self-empowered robots try to impact their environment maximally. In contrast, robots that strive for human empowerment try to maintain the human’s influence on the environment [44]. We propose a robot that behaves such that others are maximally empowered, so it drives to states where its neighbors are at their full potential. For example, when moving around people, the robot must keep an appropriate distance, far enough to not limit their space or block their way. Keeping distance is regarded as being one aspect of human comfort [31].

Our agent strives for the empowerment of people in order to minimize the disturbance to them while they pursue their goals. Our contribution is to use the concept of human empowerment introduced by Salge and Polani [44] as an intrinsic reward function for social navigation and to evaluate the resulting method on state-of-the-art robotic navigation benchmarks. Also, inspired by [13] and [31], we test it on two metrics that assess social behavior. Last, we study the influence of the robot on people and vice versa with two new metrics. These demonstrations show that social characteristics can be achieved by using human empowerment. Our agent tries to empower humans instead of itself. In addition, since this does not require a cost function, it is applicable to any multi-agent system that requires a robot to interact with humans in a socially compliant way.

2 Related work

Many works have designed interaction models that enhance the social awareness in robot navigation. We discuss these methods first and motivate the practicality of deep reinforcement learning. We proceed by describing empowerment as a member of a greater family of intrinsic motivators for reinforcement learning and argue why it can be used for social navigation.

2.1 Social navigation

Goals for social navigation can be divided into three broad categories: comfort, naturalness and sociability [31]. Examples of comfort are respecting personal space, avoiding erratic behavior and not interfering with others’ movement. Naturalness mostly relates to how similar a robot’s motion is to human behavior, e.g. smooth and interpretable, while sociability mostly relates to social conventions and etiquette.

Having defined these, one might be tempted to create a navigation framework that satisfies all the currently known requirements. Previous works have tried exactly this. Well-engineered methods are the Social Force Model [18], Interacting Gaussian Process (IGP) [51], ORCA [22] and RVO [53]. These methods obtain collision-free paths, but with a limited number of additional social characteristics. They also rely heavily on hand-crafting the model.

In contrast to these model-based approaches, deep learning models have been shown to produce more human-like paths [17]. Deep neural networks allow a policy to have a better understanding of humans and comply with their social rules [10]. Early works separate the prediction of the environment and the planning task of the policy with two neural networks [16], [3]. However, this may cause the freezing robot problem, as the predicted human motion could take up all the available future space [51].

In order to directly obtain an action policy, imitation learning (IL) and inverse reinforcement learning (IRL) learn policies from demonstrations [32], [30], [39]. While this might seem promising, a large data set is required due to the stochastic nature of people. As an alternative, deep reinforcement learning (DRL) aims to learn cooperative strategies by interacting with the environment [7], [40], [6], [12]. However, finding a proper cost function that encourages a robot to navigate in a social manner is far from trivial. Even if a cost function might appear obvious in some cases (e.g. collision-free and keeping distance to neighbours), it often has to be regularised, e.g., to be smooth and risk-averse. This leads to an alternating process of reviewing the behaviour of the agent and adapting the cost function accordingly to achieve the desired result.

2.2 Empowerment

Instead of shaping the reward function to achieve the desired behavior, an emerging field within reinforcement learning focuses on intrinsic motivation [42], [9], [38]. There are many different ways to intrinsically motivate an agent [20], [11], and one possible technique is called empowerment [27], [43]. Empowerment has been applied to teach agents task-independent behavior and to train them in settings with sparse rewards. Examples of such tasks are stabilizing an inverted pendulum, teaching a biped to walk [23] and even winning Atari video games [24]. We refer the interested reader to Aubret et al. [2], who provide an extensive survey of empowerment for reinforcement learning.

Empowerment is the channel capacity between actions and future states and maximizes the influence of an agent on its near future. This quantity can be computed for individual states of the world. Once these quantities are known, an agent could go to highly empowered states giving it more control than otherwise.
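For a small discrete toy problem, the channel capacity in this definition can be computed exactly with the classical Blahut-Arimoto iteration. The sketch below is illustrative only and is not the estimator used in this work, which targets continuous spaces. The function name `empowerment_bits` and the example transition tables are assumptions made for the illustration.

```python
import math

def empowerment_bits(p_next, iters=200):
    """Channel capacity max_w I(a; s') in bits for a single state.

    p_next[a][j] is the probability of reaching future state j after action a.
    The source distribution w(a) is optimized with Blahut-Arimoto updates.
    """
    n_actions = len(p_next)
    n_states = len(p_next[0])
    w = [1.0 / n_actions] * n_actions  # start from a uniform source

    def marginal(w):
        return [sum(w[a] * p_next[a][j] for a in range(n_actions))
                for j in range(n_states)]

    def kl_to_marginal(a, marg):
        # D( p(.|a) || marginal ), in nats
        return sum(p * math.log(p / marg[j])
                   for j, p in enumerate(p_next[a]) if p > 0)

    for _ in range(iters):
        marg = marginal(w)
        scores = [w[a] * math.exp(kl_to_marginal(a, marg)) if w[a] > 0 else 0.0
                  for a in range(n_actions)]
        total = sum(scores)
        w = [s / total for s in scores]

    marg = marginal(w)
    nats = sum(w[a] * kl_to_marginal(a, marg)
               for a in range(n_actions) if w[a] > 0)
    return nats / math.log(2)

# Four actions, each leading to a distinct future state: fully "free".
free = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
# Four actions that all lead to the same future state: fully "blocked".
blocked = [[1, 0], [1, 0], [1, 0], [1, 0]]
# Two of the four actions collapse onto the same future state.
partial = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
```

In this toy setting a state in which every action leads to a distinct outcome has higher empowerment than one in which all actions collapse onto the same outcome, which matches the intuition of "more control".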

Self-empowerment maximizes the influence of an agent on its own future. This may have the exact opposite effect to cooperation and social interaction, as a self-empowered agent will try to push people away so it can reach as many future locations as possible. In contrast, an agent that strives for the empowerment of others maintains their influence on their own futures [44].

To this end, we propose a robot that aims to maximize the empowerment of its neighbors. As a consequence, the robot will respect people’s personal space and not hinder them in pursuing their goals. In addition, our method is generally applicable to any human-robot environment, since it does not require the (hard-coded) reward function on which many other DRL methods for social navigation rely.

Earlier computations were only applicable in discrete state-action spaces, but recently [15], [23] and [37] have shown efficient implementations for continuous settings. In our work, we build upon these models.

3 Methodology

Our goal is to teach an agent how to safely navigate to its goal in a socially compliant manner. These two objectives can be achieved by a combination of two rewards. In this section, we will describe our agent and the two types of rewards.

We consider the system to be Markovian: each next state depends only on the current state and the agent’s action, and not on any prior history. A value network is trained to approximate the optimal value function $V^*$, which implicitly encodes social cooperation between agents and the empowerment of the people, see Eq. 1.

$$V^*(s_t) = \mathbb{E}\left[\sum_{k=t}^{T} \gamma^{\,k-t}\, R\big(s_k, \pi^*(s_k)\big)\right] \qquad (1)$$

Here $R$ is the reward function and $\pi^*$ is the optimal policy that maximizes the expected return, with discount factor $\gamma$.

3.1 Reward for safe navigation

The first task of the agent is to reach its goal while avoiding collisions and keeping a comfortable distance to humans.

Equation 2 defines the environmental reward function for this task as a function of the robot’s state and action. Similar to other DRL methods for social navigation [8], [6], [7], we reward task accomplishment and penalize collisions and uncomfortable distances.

The reward depends on the robot’s distance to the goal during a time interval and on its distance to each neighbor. The robot is rewarded when its current position reaches the position of the goal, but penalized if its position is too close to that of another agent.
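A reward of this shape can be sketched as follows. This is a minimal illustration under stated assumptions and does not reproduce the paper's Eq. 2: the function name, the agent radii, the goal tolerance and all numeric reward values are hypothetical placeholders.

```python
import math

def environment_reward(robot_pos, goal_pos, human_positions,
                       robot_radius=0.3, human_radius=0.3,
                       goal_tolerance=0.3, discomfort_dist=0.2,
                       r_goal=1.0, r_collision=-0.25, discomfort_scale=0.5):
    """Illustrative safe-navigation reward; every default value is an assumption."""
    # Smallest surface-to-surface gap between the robot and any human.
    min_gap = min(
        (math.dist(robot_pos, h) - robot_radius - human_radius
         for h in human_positions),
        default=float("inf"),
    )
    if min_gap < 0:                       # overlapping bodies: collision
        return r_collision
    if math.dist(robot_pos, goal_pos) < goal_tolerance:
        return r_goal                     # task accomplished
    if min_gap < discomfort_dist:         # too close for comfort
        return -discomfort_scale * (discomfort_dist - min_gap)
    return 0.0
```

The ordering of the checks is a design choice of this sketch: a collision takes precedence over reaching the goal, so a policy cannot be rewarded for arriving while in contact with a person.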

The robot’s own state consists of a 2D position vector and a 2D velocity vector. The human states form a concatenated vector of the states of all humans participating in the scene, where each entry is similar to the robot’s state, namely a position and a velocity. The final state of the robot is the joint state of the humans and the robot. Its action is a desired velocity vector.

3.2 Empowerment for social compliance

The second task of the robot is to consider people in its neighborhood and respond to their intentions in a socially compliant manner. Designing a reward function for that is not trivial, among other things due to the stochasticity in people’s behaviors. This is where we use empowerment [27], [43], an information-theoretic formulation of an agent’s influence on its near future.

3.2.1 Human empowerment

In our case, empowerment motivates the robot to go to states in which its neighbors are most empowered. The robot thus aims to maximize the empowerment of another person rather than its own, which Salge and Polani [44] call human empowerment in contrast to robot empowerment. As a result, the robot will avoid obstructing the human, for example by getting too close or by interfering with the human’s actions, both of which Kruse et al. [31] define as social skills.

Equation 3 defines the empowerment $\mathcal{E}(s)$ as the maximal mutual information $I$ for a state $s$. It is the channel capacity between action $a$ and future state $s'$, maximized over the source policy $\omega$. The policy $\omega$ is part of the human’s decision-making system.

$$\begin{aligned}
\mathcal{E}(s) &= \max_{\omega}\, I(a; s' \mid s) \\
&= \max_{\omega}\, \big[ H(a \mid s) - H(a \mid s', s) \big]
\end{aligned} \qquad (3)$$

The lower part defines the empowerment with entropies $H$. It corresponds to increasing the diversity of decisions while at the same time limiting those decisions that have no effect. Intuitively, the empowerment of a person reflects his or her ability to influence their future.

The human state takes an ego-centric parameterization [8]. Each state is an occupancy grid map centered around the person. It is a 3D tensor whose first two indices run over the height and width of the grid, and each entry contains the presence and velocity vector of a neighbor at that location. The resulting state of the humans is the concatenation of these maps, and the actions are continuous values.
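Such an ego-centric map can be built as in the sketch below. The grid size, the cell size and the three-channel layout (presence, x-velocity, y-velocity) are assumptions chosen for illustration; the function name `ego_occupancy_map` is hypothetical.

```python
def ego_occupancy_map(person_pos, neighbors, n_cells=8, cell_size=0.5):
    """Return an n_cells x n_cells x 3 nested list centered on person_pos.

    neighbors is a list of (x, y, vx, vy) tuples in world coordinates.
    Each cell stores [presence, vx, vy]; neighbors outside the grid are ignored.
    """
    half = n_cells * cell_size / 2.0
    grid = [[[0.0, 0.0, 0.0] for _ in range(n_cells)] for _ in range(n_cells)]
    for x, y, vx, vy in neighbors:
        dx, dy = x - person_pos[0], y - person_pos[1]
        if not (-half <= dx < half and -half <= dy < half):
            continue  # outside the local window
        col = min(n_cells - 1, int((dx + half) // cell_size))
        row = min(n_cells - 1, int((dy + half) // cell_size))
        cell = grid[row][col]
        cell[0] = 1.0      # presence flag
        cell[1] += vx      # velocities of co-located neighbors are summed
        cell[2] += vy
    return grid

# One neighbor 0.6 m to the right of the person, one far outside the window.
grid = ego_occupancy_map((1.0, 1.0),
                         [(1.6, 1.0, 0.5, 0.0), (10.0, 10.0, 1.0, 1.0)])
```

Because the map is centered on the person rather than on the robot, the robot appears in it as just another neighbor, which is what allows the empowerment of that person to depend on where the robot is.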

3.2.2 Estimating empowerment with neural networks

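The mutual information underlying empowerment can be written as the KL divergence between the joint distribution $p(s', a \mid s)$ and the product of the marginal distributions $p(s' \mid s)$ and $\omega(a \mid s)$:

$$I(a; s' \mid s) = D_{KL}\big(p(s', a \mid s)\,\|\,p(s' \mid s)\,\omega(a \mid s)\big) = \iint p(s', a \mid s)\, \ln \frac{p(s', a \mid s)}{p(s' \mid s)\,\omega(a \mid s)}\, ds'\, da \qquad (4)$$

The main problem with the formulation in Eq. 4 is its intractability, due to the integral over all future states. Since the introduction of empowerment [27], [43], many have designed methods to deal with this. Recent works provide an efficient method to estimate a lower bound on empowerment, $\hat{I}$, via variational methods [23], [37], [5].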
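Why a variational distribution yields a lower bound can be sketched in a few lines. The notation used here ($\omega$ for the source policy, $q$ for a variational distribution over actions, $s'$ for the future state) is assumed for illustration; this is the standard argument from the variational information maximisation literature [37], not a derivation taken from this paper.

```latex
\begin{aligned}
I(a; s' \mid s)
  &= \mathbb{E}_{p(s', a \mid s)}\!\left[\ln \frac{p(a \mid s', s)}{\omega(a \mid s)}\right] \\
  &= \mathbb{E}_{p(s', a \mid s)}\!\left[\ln \frac{q(a \mid s', s)}{\omega(a \mid s)}\right]
   + \mathbb{E}_{p(s' \mid s)}\!\Big[ D_{KL}\big(p(a \mid s', s)\,\|\,q(a \mid s', s)\big) \Big] \\
  &\ge \mathbb{E}_{p(s', a \mid s)}\!\left[\ln \frac{q(a \mid s', s)}{\omega(a \mid s)}\right]
\end{aligned}
```

The inequality follows because the KL divergence is non-negative, and the bound is tight exactly when $q$ equals the true posterior $p(a \mid s', s)$.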


Instead of a planning distribution is used, which is approximated with the variational approximation to obtain a lower bound. can now be maximized over the parameters of the source, , variational and planning networks. is a third neural network that computes the future state from and

The gradient can be computed as follows, in which the joint parameters of , and is denoted by :


Using Monte-Carlo integration to estimate the continuous case, we can obtain the following gradient:


We are free to choose any type of distribution and since human movement is not discrete, we model both , and

as Gaussian distributions.
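The Monte-Carlo estimate of such a lower bound can be checked on a toy problem where the true mutual information is known in closed form. The sketch below assumes a one-dimensional Gaussian channel (action plus additive noise) and fixed, hand-chosen Gaussian distributions in place of trained networks; all names and numbers are assumptions for illustration.

```python
import math
import random

def log_normal_pdf(x, mean, var):
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def mc_lower_bound(var_a, var_noise, q_gain, q_var, n=20000, seed=0):
    """Monte-Carlo estimate of E[ln q(a | s') - ln w(a)] in nats.

    Toy channel: a ~ N(0, var_a), s' = a + noise, noise ~ N(0, var_noise).
    q(a | s') = N(q_gain * s', q_var) stands in for the planning network.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        a = rng.gauss(0.0, math.sqrt(var_a))
        s_next = a + rng.gauss(0.0, math.sqrt(var_noise))
        total += (log_normal_pdf(a, q_gain * s_next, q_var)
                  - log_normal_pdf(a, 0.0, var_a))
    return total / n

var_a, var_noise = 1.0, 0.25
true_mi = 0.5 * math.log(1.0 + var_a / var_noise)    # closed form for this channel
gain = var_a / (var_a + var_noise)                   # exact posterior mean gain
post_var = var_a * var_noise / (var_a + var_noise)   # exact posterior variance
tight = mc_lower_bound(var_a, var_noise, gain, post_var)
loose = mc_lower_bound(var_a, var_noise, 0.5, 1.0)   # deliberately poor q
```

With the exact posterior the estimate matches the true mutual information up to sampling noise, while a poorly chosen distribution gives a strictly smaller value. This gap is what gradient ascent on the planning network's parameters closes.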

3.3 Algorithm

The robot’s policy learns to safely navigate to its goal and to go to states in which humans are empowered. This is achieved by training a value network with the reward function in Eq. 8, which combines the estimated mutual information with the environmental reward. A hyper-parameter regulates the trade-off between social compliance and safety.
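One simple way to realize such a trade-off is a convex combination of the two terms. The sketch below assumes this form purely for illustration and is not the paper's Eq. 8: the function name, the argument names and the choice of which term the weight multiplies are all assumptions. The default of 0.25 mirrors the hyper-parameter value reported in Section 4.2.

```python
def combined_reward(empowerment_estimate, env_reward, beta=0.25):
    """Hypothetical trade-off between the two reward terms (not the paper's Eq. 8).

    beta weights the empowerment estimate, (1 - beta) the environmental reward.
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return beta * empowerment_estimate + (1.0 - beta) * env_reward
```

Setting the weight to 0 recovers a purely task-driven agent, and setting it to 1 gives an agent driven only by the empowerment of its neighbors.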


Algorithm 1 shows the full algorithm. A set of demonstrations from the ORCA policy is used to give the robot a head start (lines 1-3). This speeds up learning, because experiences in which the robot reaches the goal are now part of the memory. Lines 4-22 describe the exploration phase during an episode, the calculation of the empowerment estimate and the network updates. The behavior policy collects samples of experience tuples until a final state is reached (lines 6-9). Random actions are selected with probability ε. Once these are collected, our hypothetical human policy, together with the planning and transition networks, is used to estimate the empowerment (lines 10-14).

Finally, the networks are trained with a random mini-batch obtained from the memory (lines 15-21). Our value network is optimized by the temporal-difference method (TD(0) [47]) with standard experience replay and fixed target network techniques [8], [36], [6]. A separate target network provides the bootstrap values. The source, planning and transition networks are updated through gradient ascent.
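The interplay of TD(0), experience replay and a fixed target network can be sketched with a tabular value function in place of a neural network. The two-step toy memory, the learning rate and the target synchronization interval below are assumptions for illustration; only the discount factor of 0.9 is taken from the text.

```python
import random

def td0_update(value, target_value, batch, gamma=0.9, lr=0.1):
    """One TD(0) sweep over a mini-batch for a tabular value function.

    value and target_value map states to floats; batch holds
    (state, reward, next_state, done) tuples drawn from the replay memory.
    """
    for state, reward, next_state, done in batch:
        bootstrap = 0.0 if done else target_value.get(next_state, 0.0)
        target = reward + gamma * bootstrap        # uses the fixed target table
        error = target - value.get(state, 0.0)
        value[state] = value.get(state, 0.0) + lr * error
    return value

# Toy replay memory: s0 -> s1 (no reward), s1 -> goal (reward 1, terminal).
memory = [("s0", 0.0, "s1", False), ("s1", 1.0, "goal", True)]
value, target_value = {}, {}
rng = random.Random(0)
for step in range(2000):
    batch = [rng.choice(memory) for _ in range(2)]  # random mini-batch
    td0_update(value, target_value, batch)
    if step % 50 == 0:
        target_value = dict(value)                  # periodic target sync
```

Bootstrapping from a periodically synchronized copy rather than from the table being updated keeps the regression target stable between synchronizations, which is the motivation for the fixed target network technique [36].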

One distinction from other works on empowerment (e.g. [23]) and social navigation policies (e.g. [6]) is the state representation. The behavior policy uses the joint state to navigate collision-free to its goal. The source, planning and transition networks take the occupancy grids centered around each human as states for the computation of the empowerment estimate.

1: Initialize value network with demonstrations
2: Initialize target value network
3: Initialize experience replay memory
4: for episode m = 1: M do
5:     Obtain initial state from environment
6:     repeat
7:          Select action with the ε-greedy behavior policy
8:          Store experience tuple in memory
9:     until a final state is reached
10:     for time = 1: T do
11:          Obtain the empowerment estimate
12:          Add it to the stored experience
13:     end for
14:     for batch b = 1: B do
15:          Get random experience tuple from memory
16:          Compute target
17:          Update value network with gradient descent
18:          Update source, planning and transition networks with gradient ascent
19:     end for
20:     Update target value network
21: end for
22: return value, source, planning and transition networks
Algorithm 1: Human empowerment estimation, for the value network and the source, planning and transition networks.

4 Experiments

The first three experiments quantitatively compare our socially compliant robot (SCR) with other robot strategies. We continue our evaluation with a simulated scenario involving several people, which shows the robot and human movements and allows us to evaluate their interaction.

4.1 Metrics and models

We compare our robot with four existing state-of-the-art methods: ORCA [52], CADRL [8], LSTM-RL [12] and SARL [6]. First, we use metrics similar to those defined in [6], namely the success rate, collision rate, time to reach the goal, discomfort distance rate and reward. Next, inspired by [54] and [31], we evaluate the jerk. Last, we measure the time people need to reach their goals and the robot’s path length, to assess the influence of the robot on the people relative to reaching its own goal efficiently.
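Jerk, the time derivative of acceleration, can be estimated from a sampled trajectory with finite differences. The sketch below is one plausible way to compute such a smoothness metric; the exact definition used in the evaluation is not given in the text, so the function name and the choice of averaging the jerk magnitude are assumptions.

```python
import math

def mean_abs_jerk(positions, dt):
    """Mean jerk magnitude of equally spaced 2-D positions (finite differences)."""
    def diff(seq):
        return [((b[0] - a[0]) / dt, (b[1] - a[1]) / dt)
                for a, b in zip(seq, seq[1:])]

    jerk = diff(diff(diff(positions)))  # position -> velocity -> accel -> jerk
    if not jerk:
        raise ValueError("need at least four positions")
    return sum(math.hypot(jx, jy) for jx, jy in jerk) / len(jerk)

constant_velocity = [(float(t), 0.0) for t in range(6)]
constant_accel = [(float(t * t), 0.0) for t in range(6)]
cubic = [(0.0, 0.0), (1.0, 0.0), (8.0, 0.0), (27.0, 0.0), (64.0, 0.0)]
```

Trajectories with constant velocity or constant acceleration have zero jerk under this measure, so only changes in acceleration, such as abrupt braking or swerving, are penalized.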

4.2 Implementation details

The simulator used in this work is obtained from [6]. It starts and terminates an episode with five humans and the robot. The humans’ decisions are simulated with the ORCA policy of van den Berg et al. [52], [53]. This policy uses the optimal reciprocal collision avoidance assumption to avoid other agents while moving.

We implemented the networks in PyTorch and trained them with a batch size of 100 for 10k episodes. For the value network, the discount factor is 0.9. The exploration rate of the ε-greedy policy decays linearly from 0.5 to 0.1 in the first 5k episodes and stays at 0.1 for the remaining 5k episodes. These values, as well as the learning rate of the value network, are the same as in Chen et al. [6]. The trade-off hyper-parameter was chosen to be 0.25.
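The exploration schedule described above can be written as a small function. The function and parameter names are illustrative; the numbers are those stated in the text.

```python
def exploration_rate(episode, start=0.5, end=0.1, decay_episodes=5000):
    """Linear decay from start to end over decay_episodes, then constant."""
    if episode >= decay_episodes:
        return end
    return start + (end - start) * episode / decay_episodes
```

Halfway through the decay phase, at episode 2500, this gives a rate of 0.3.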

Similar values for the learning rates were used for the other networks. The value network was trained with stochastic gradient descent, as in Chen et al. [6]. The planning, source and transition networks were trained with Adam [26], similar to [23].

4.3 State-of-the-art navigation benchmark

Table 1 reports the success rate, collision rate, time, discomfort distance rate and reward for state-of-the-art robot navigation strategies. Success is the rate at which the robot reaches its goal without a collision and Collision is the rate at which the robot collides with humans, averaged over 100 episodes. Our Socially Compliant Robot (SCR) and SARL both outperform the other baselines on the standard metrics. Next, we look more thoroughly into the robot’s navigation time and compare it with the time of the humans.

| Methods | Success | Collision | Time | Disc | Reward |
|---|---|---|---|---|---|
| ORCA | 0.99 | .000* | 12.3 | 0.00* | .284 |
| CADRL | 0.94 | .035 | 10.8 | 0.10 | .291 |
| LSTM-RL | 0.98 | .022 | 11.3 | 0.05 | .299 |
| SARL | 1.00** | .001 | 10.6 | 0.03 | .334 |
| SCR (ours) | 1.00** | .002 | 10.9 | 0.03 | .331 |

Table 1: Both SCR and SARL outperform the other baselines, which can be seen by the best values (bold) and second best (underline). ORCA does not have any collisions, because this is the central idea behind the method (*). The numbers are computed for 500 different test scenarios. Both SARL and SCR reached their goals in more than 497 out of 500 tests (**).

4.4 Influence of robot on humans and vice versa

Table 2 shows the travel times and distances of both the humans and the robot. Time is the robot’s navigation time to reach its goal in seconds and H Time is the average navigation time of a human to reach his or her goal in seconds. The simulator allows the robot to be made invisible to the humans. This setting serves as a test bed for validating the abilities of the other policies (SARL visible and SCR) to reason about interactions with the humans. Keeping both the robot’s and the humans’ times low indicates that the policy moves quickly to its own goal without disturbing the humans in pursuing theirs. The path length is also calculated to rule out that the robot takes unnecessary detours, since this cannot be evaluated from travel time alone.

The invisible SARL has no influence on the time humans need to travel. On the other hand, its travel distance and time are higher than those of the visible SARL and SCR, which indicates that the robot makes detours around the humans. In contrast, the visible SARL has a low travel distance and time, but the human travel times are highest. This may be because it has learned that humans avoid the robot. The numbers for SCR show that its travel time and that of the humans are nearly the same. They suggest that, due to the application of empowerment, our method has learned to minimally disturb other persons while moving to its own goal efficiently.
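Path length is the sum of the distances between consecutive trajectory samples. A minimal sketch follows; the function name is illustrative.

```python
import math

def path_length(positions):
    """Total length of a piecewise-linear 2-D trajectory."""
    return sum(math.dist(a, b) for a, b in zip(positions, positions[1:]))
```

Two trajectories can take the same time while differing in length if the speeds differ, which is why the distance is reported alongside the travel time.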

| Method | Time [s] | Distance [m] | H Time [s] |
|---|---|---|---|
| SARL (invisible) | 11.5 | 10.7 | 9.1 |
| SARL (visible) | 10.6 | 9.2 | 10.7 |
| SCR (ours) | 10.9 | 9.3 | 9.1 |

Table 2: As a result of empowerment, SCR does not disturb people’s movements as their travel times are low. It reaches its own goal efficiently as well. On the contrary, SARL (visible) has learned that people avoid it, so their travel times are higher than its own. SARL (invisible) takes a large detour, because it is not seen by people, so they cannot avoid it.
Figure 2: (a) Separation distance [m]; (b) jerk. The robot with our policy SCR (a, blue) is on average as close to humans as the robot with policy SARL (a, red), but never closer than 0.1 m. Our policy (b, blue) also has a lower jerk than SARL (b, red). SCR avoids getting too close to a person and avoids non-smooth behavior, as both lower the empowerment of its neighbors.

4.5 Separation distance

Next, we examine how social compliance can be evaluated further. Disc. is the frequency with which the separation distance between a human and the robot is less than 0.2 m (see Table 1, column 5). Both SARL and SCR spend the least amount of time close to a human, but [13] and [31] state that people judge robots negatively if the separation distance between them is low. Therefore, Fig. 2 shows that even though SCR is on average as close to humans as SARL, it does not come closer than 0.1 m. This can be explained by the fact that close proximity would result in lower empowerment, since the chance of a collision is high. The occupancy grids have a resolution of 0.1 m, which is fine enough to detect a collision.
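A discomfort frequency of this kind can be computed as the fraction of time steps at which the robot is closer than the threshold to at least one human. The sketch below assumes synchronized trajectories and measures the gap between agent surfaces. The function name and the radius parameters are assumptions (a radius of 0 reduces to center distance); the 0.2 m threshold is the one stated in the text.

```python
import math

def discomfort_rate(robot_traj, human_trajs, threshold=0.2,
                    robot_radius=0.0, human_radius=0.0):
    """Fraction of time steps with a robot-human gap below threshold."""
    steps = len(robot_traj)
    if steps == 0:
        raise ValueError("empty trajectory")
    uncomfortable = 0
    for t, robot_pos in enumerate(robot_traj):
        gaps = [math.dist(robot_pos, traj[t]) - robot_radius - human_radius
                for traj in human_trajs]
        if gaps and min(gaps) < threshold:
            uncomfortable += 1
    return uncomfortable / steps

robot = [(0.0, 0.0)] * 3
humans = [[(1.0, 0.0), (0.1, 0.0), (0.15, 0.0)]]
rate = discomfort_rate(robot, humans)
```

A frequency alone does not show how close the robot gets during those steps, which is why the distribution of separation distances is examined as well.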

Figure 3: (a) SARL at t=6; (b) SARL at t=9.2; (c) SCR at t=6; (d) SCR at t=10.5. SARL (a, b) and SCR (c, d) in a scene with 5 humans. The humans’ destinations lie on the opposite side of the origin of the x, y axes from their initial locations. While SARL reaches its destination quickly, only two out of five humans have reached theirs (b, purple and light blue stars). In contrast, SCR waits at t=6 (c) and all humans reach their destinations (d), denoted with stars. Note the two persons (red, blue) who need to adjust their motion to avoid the robot (orange) with policy SARL (a) compared to SCR (c).

4.6 Qualitative results

Figure 3 shows SARL and SCR navigating through a crowd of five people. The left panels show SARL (a, b) and the right panels SCR (c, d) at two different time steps. The trajectories indicate that SARL goes directly to its goal, while SCR waits at t=6 (c). Moreover, at t=9.2, SARL has reached its goal, but only two out of five humans have reached theirs (b, purple and light blue stars). In contrast, SCR reaches its goal at t=10.5, but all people have reached their final destinations (d). SARL overtakes two people (a, red and green) and alters the path of another (a, blue), whereas SCR lets them pass (c, red, green and blue). SARL uses occupancy maps to model the pairwise interaction between humans [6], so it cannot incorporate the influence of the robot on each human. SCR, on the other hand, uses empowerment maps for each human that have high values if it does not block anyone.

5 Conclusion and future work

Inspired by state-of-the-art reinforcement learning techniques, we applied a method called empowerment to give an agent socially intelligent traits. Contrary to self-empowerment, we taught the agent to maximize the empowerment of its neighbors. Our experiments show that our policy outperforms other strategies on state-of-the-art robot navigation benchmarks. We continued our evaluation with additional metrics that assess the social intelligence of the robot more accurately. On these additional metrics, our robot showed the best performance. The influence of the robot’s motion is difficult to evaluate by people in simulation. Thus, future work includes the deployment of the policy on hardware and evaluating it in real-world experiments.


  • [1] R.E. Andersen, S. Madsen, A.B.K. Barlo, S.B. Johansen, M. Nør, R.S. Andersen, and S. Bøgh (2019) Self-learning processes in smart factories: deep reinforcement learning for process control of robot brine injection. In 29th International Conference on Flexible Automation and Intelligent Manufacturing, Cited by: §1.
  • [2] A. Aubret, L. Matignon, and S. Hassas (2019) A survey on intrinsic motivation in reinforcement learning. arXiv preprint arXiv:1908.06976. Cited by: §2.2.
  • [3] S. Bansal, V. Tolani, S. Gupta, J. Malik, and C. Tomlin (2019) Combining optimal control and learning for visual navigation in novel environments. arXiv preprint arXiv:1903.02531. Cited by: §2.1.
  • [4] A. Bautin, L. Martinez-Gomez, and T. Fraichard (2010) Inevitable collision states: a probabilistic perspective. In 2010 IEEE ICRA, pp. 4022–4027. Cited by: §1.
  • [5] Y. Burda, R. Grosse, and R. Salakhutdinov (2015) Importance weighted autoencoders. arXiv preprint arXiv:1509.00519. Cited by: §3.2.2.
  • [6] C. Chen, Y. Liu, S. Kreiss, and A. Alahi (2019) Crowd-robot interaction: crowd-aware robot navigation with attention-based deep reinforcement learning. In 2019 ICRA, pp. 6015–6022. Cited by: §2.1, §3.1, §3.3, §3.3, §4.1, §4.2, §4.2, §4.2, §4.6.
  • [7] Y. Chen, M. Everett, M. Liu, and J. P. How (2017) Socially aware motion planning with deep reinforcement learning. In 2017 IEEE/RSJ IROS, pp. 1343–1350. Cited by: §1, §2.1, §3.1.
  • [8] Y. Chen, M. Liu, M. Everett, and J. P. How (2017) Decentralized non-communicating multiagent collision avoidance with deep reinforcement learning. In 2017 IEEE ICRA, pp. 285–292. Cited by: §3.1, §3.2.1, §3.3, §4.1.
  • [9] N. Chentanez, A.G. Barto, and S.P. Singh (2005) Intrinsically motivated reinforcement learning. In NeurIPS, pp. 1281–1288. Cited by: §1, §2.2.
  • [10] E. Cross, R. Hortensius, and A. Wykowska (2019) From social brains to social robots: applying neurocognitive insights to human-robot interaction. Philosophical Transactions of the Royal Society of London. Series B, Biological sciences 374. Cited by: §2.1.
  • [11] N. Dilokthanakul, C. Kaplanis, N. Pawlowski, and M. Shanahan (2019) Feature control as intrinsic motivation for hierarchical reinforcement learning. IEEE Transactions on Neural Networks and Learning Systems. Cited by: §2.2.
  • [12] M. Everett, Y. Chen, and J.P. How (2018) Motion planning among dynamic, decision-making agents with deep reinforcement learning. In 2018 IEEE/RSJ IROS, pp. 3052–3059. Cited by: §2.1, §4.1.
  • [13] T. Fong, I. Nourbakhsh, and K. Dautenhahn (2003) A survey of socially interactive robots. Robotics and autonomous systems 42 (3-4), pp. 143–166. Cited by: §1, §1, §4.5.
  • [14] W. Gao, D. Hsu, W. Lee, S. Shen, and K. Subramanian (2017) Intention-net: integrating planning and deep learning for goal-directed autonomous navigation. arXiv preprint arXiv:1710.05627. Cited by: §1.
  • [15] K. Gregor, D.J. Rezende, and D. Wierstra (2016) Variational intrinsic control. arXiv preprint arXiv:1611.07507. Cited by: §2.2.
  • [16] D. Gu and H. Hu (2002) Neural predictive control for a car-like mobile robot. Robotics and Autonomous Systems 39, pp. 73–86. Cited by: §2.1.
  • [17] T. Gu and J. Dolan (2014) Toward human-like motion planning in urban environments. In 2014 IEEE Intelligent Vehicles Symposium Proceedings, pp. 350–355. Cited by: §2.1.
  • [18] D. Helbing and P. Molnar (1995) Social force model for pedestrian dynamics. Physical review E 51 (5), pp. 4282. Cited by: §2.1.
  • [19] P. Henry, C. Vollmer, B. Ferris, and D. Fox (2010) Learning to navigate through crowded environments. In 2010 IEEE International Conference on Robotics and Automation, pp. 981–986. Cited by: §1.
  • [20] N. Jaques, A. Lazaridou, E. Hughes, C. Gulcehre, P. Ortega, D. Strouse, J.Z. Leibo, and N. De Freitas (2019) Social influence as intrinsic motivation for multi-agent deep reinforcement learning. In ICML, pp. 3040–3049. Cited by: §2.2.
  • [21] C. Johnson and B. Kuipers (2018) Socially-aware navigation using topological maps and social norm learning. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pp. 151–157. Cited by: §1.
  • [22] I. Karamouzas, P. Heil, P. Van Beek, and M.H. Overmars (2009) A predictive collision avoidance model for pedestrian simulation. In International workshop on motion in games, pp. 41–52. Cited by: §2.1.
  • [23] M. Karl, M. Soelch, P. Becker-Ehmck, D. Benbouzid, P. van der Smagt, and J. Bayer (2017) Unsupervised real-time control through variational empowerment. arXiv preprint arXiv:1710.05101. Cited by: §1, §2.2, §2.2, §3.2.2, §3.3, §4.2.
  • [24] H. Kim, J. Kim, Y. Jeong, S. Levine, and H.O. Song (2019) EMI: exploration with mutual information. In ICML, pp. 3360–3369. Cited by: §2.2.
  • [25] J. Kim, A.K. Mishra, R. Limosani, M. Scafuro, N. Cauli, J. Santos-Victor, B. Mazzolai, and F. Cavallo (2019) Control strategies for cleaning robots in domestic applications: a comprehensive review. International Journal of Advanced Robotic Systems 16 (4). Cited by: §1.
  • [26] D.P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.2.
  • [27] A.S. Klyubin, D. Polani, and C.L. Nehaniv (2005) Empowerment: a universal agent-centric measure of control. In 2005 IEEE Congress on Evolutionary Computation, Vol. 1, pp. 128–135. Cited by: §2.2, §3.2.2, §3.2.
  • [28] A.S. Klyubin, D. Polani, and C. Nehaniv (2005) Empowerment: a universal agent-centric measure of control. Vol. 1, pp. 128–135. ISBN 0-7803-9363-5. Cited by: §1.
  • [29] M. Kollmitz, K. Hsiao, J. Gaa, and W. Burgard (2015) Time dependent planning on a layered social cost map for human-aware robot navigation. In 2015 European Conference on Mobile Robots (ECMR), pp. 1–6. Cited by: §1.
  • [30] H. Kretzschmar, M. Spies, C. Sprunk, and W. Burgard (2016) Socially compliant mobile robot navigation via inverse reinforcement learning. The International Journal of Robotics Research 35 (11), pp. 1289–1307. Cited by: §1, §1, §2.1.
  • [31] T. Kruse, A.K. Pandey, R. Alami, and A. Kirsch (2013) Human-aware robot navigation: a survey. Robotics and Autonomous Systems 61 (12), pp. 1726–1743. Cited by: §1, §1, §1, §1, §2.1, §3.2.1, §4.1, §4.5.
  • [32] M. Kuderer, H. Kretzschmar, C. Sprunk, and W. Burgard (2012) Feature-based prediction of trajectories for socially compliant navigation.. In Robotics: science and systems, Cited by: §1, §2.1.
  • [33] D.V. Lu, D.B. Allan, and W.D. Smart (2013) Tuning cost functions for social navigation. In International Conference on Social Robotics, pp. 442–451. Cited by: §1.
  • [34] M. Luber, L. Spinello, J. Silva, and K.O. Arras (2012) Socially-aware robot navigation: a learning approach. In 2012 IEEE/RSJ IROS, pp. 902–907. Cited by: §1.
  • [35] P. Mannion, V. Talpaert, I. Sobh, B.R. Kiran, S. Yogamani, A. El-Sallab, and P. Perez (2019) Exploring applications of deep reinforcement learning for real-world autonomous driving systems. arXiv preprint arXiv:1901.01536. Cited by: §1.
  • [36] V. Mnih, K. Kavukcuoglu, D. Silver, A.A. Rusu, J. Veness, M.G. Bellemare, A. Graves, M. Riedmiller, A.K. Fidjeland, G. Ostrovski, et al. (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529. Cited by: §3.3.
  • [37] S. Mohamed and D.J. Rezende (2015) Variational information maximisation for intrinsically motivated reinforcement learning. In NeurIPS, pp. 2125–2133. Cited by: §2.2, §3.2.2.
  • [38] P. Oudeyer, F. Kaplan, and V.V. Hafner (2007) Intrinsic motivation systems for autonomous mental development. IEEE Transactions on Evolutionary Computation 11 (2), pp. 265–286. Cited by: §2.2.
  • [39] M. Pfeiffer, U. Schwesinger, H. Sommer, E. Galceran, and R. Siegwart (2016) Predicting actions to act predictably: cooperative partial motion planning with maximum entropy models. In 2016 IEEE/RSJ IROS, pp. 2096–2101. Cited by: §1, §2.1.
  • [40] A. Pokle, R. Martín-Martín, P. Goebel, V. Chow, H.M. Ewald, J. Yang, Z. Wang, A. Sadeghian, D. Sadigh, S. Savarese, et al. (2019) Deep local trajectory replanning and control for robot navigation. arXiv preprint arXiv:1905.05279. Cited by: §2.1.
  • [41] A. Robicquet, A. Sadeghian, A. Alahi, and S. Savarese (2016) Learning social etiquette: human trajectory understanding in crowded scenes. In European conference on computer vision, pp. 549–565. Cited by: §1.
  • [42] R.M. Ryan and E.L. Deci (2000) Intrinsic and extrinsic motivations: classic definitions and new directions. Contemporary educational psychology 25 (1), pp. 54–67. Cited by: §2.2.
  • [43] C. Salge, C. Glackin, and D. Polani (2014) Empowerment–an introduction. In Guided Self-Organization: Inception, pp. 67–114. Cited by: §2.2, §3.2.2, §3.2.
  • [44] C. Salge and D. Polani (2017) Empowerment as replacement for the three laws of robotics. Frontiers in Robotics and AI 4, pp. 25. Cited by: §1, §1, §2.2, §3.2.1.
  • [45] K. Shiarlis, J. Messias, and S. Whiteson (2017) Acquiring social interaction behaviours for telepresence robots via deep learning from demonstration. In 2017 IEEE/RSJ IROS, pp. 37–42. Cited by: §1.
  • [46] A. Sieben, J. Schumann, and A. Seyfried (2017) Collective phenomena in crowds—where pedestrian dynamics need social psychology. PLoS one 12 (6), pp. e0177328. Cited by: §1.
  • [47] R.S. Sutton, A.G. Barto, et al. (1998) Introduction to reinforcement learning. Vol. 2, MIT press Cambridge. Cited by: §3.3.
  • [48] L. Tai, J. Zhang, M. Liu, and W. Burgard (2018) Socially compliant navigation through raw depth inputs with generative adversarial imitation learning. In 2018 IEEE ICRA, pp. 1111–1117. Cited by: §1.
  • [49] A. Templeton, J. Drury, and A. Philippides (2018) Walking together: behavioural signatures of psychological crowds. Royal Society open science 5 (7), pp. 180172. Cited by: §1.
  • [50] S. Thrun (1995) An approach to learning mobile robot navigation. Robotics and Autonomous systems 15 (4), pp. 301–319. Cited by: §1.
  • [51] P. Trautman and A. Krause (2010) Unfreezing the robot: navigation in dense, interacting crowds. In 2010 IEEE/RSJ IROS, pp. 797–803. Cited by: §2.1, §2.1.
  • [52] J. van den Berg, S.J. Guy, J. Snape, M.C. Lin, and D. Manocha. RVO2 library: reciprocal collision avoidance for real-time multi-agent simulation. Cited by: §4.1, §4.2.
  • [53] J. Van den Berg, M. Lin, and D. Manocha (2008) Reciprocal velocity obstacles for real-time multi-agent navigation. In 2008 IEEE International Conference on Robotics and Automation, pp. 1928–1935. Cited by: §2.1, §4.2.
  • [54] P. Vinayavekhin, M. Tatsubori, D. Kimura, Y. Huang, G. Magistris, A. Munawar, and R. Tachibana (2017) Human-like hand reaching by motion prediction using long short-term memory. In International Conference on Social Robotics, pp. 156–166. Cited by: §4.1.