Social Behavior Learning with Realistic Reward Shaping

10/16/2018 ∙ by Yuan Gao, et al. ∙ Uppsala universitet

Deep reinforcement learning has been widely applied in the field of robotics recently to study tasks like locomotion and grasping, but applying it to social robotics remains a challenge. In this paper, we present a deep learning scheme that acquires a prior model of robot behavior in a simulator as a first phase to be further refined through learning from subsequent real-world interactions involving physical robots. The scheme, which we refer to as Staged Social Behavior Learning (SSBL), considers different stages of learning in social scenarios. Based on this scheme, we implement robot approaching behaviors towards a small group generated from F-formation and evaluate the performance of different configurations using objective and subjective measures. We found that our model generates more socially-considerate behavior compared to a state-of-the-art model, i.e. social force model. We also suggest that SSBL could be applied to a wide class of social robotics applications.







I Introduction

Deep reinforcement learning (DRL) algorithms provide a framework for automatic robot perception and control [1] [2]. In recent years, methods based on DRL have achieved great performance in different control tasks like grasping and locomotion [3]. However, the question of how to make robots learn appropriate social behaviors under modern frameworks remains underexplored, partly due to the lack of cross-disciplinary synergies in human-robot interaction studies. As a consequence, the interaction scenarios studied in previous research have been limited to simplified cases and the algorithms studied to relatively simple ones [4].

Our approach is to learn a prior model in a simulator first and then refine the policy using model-based reinforcement learning (RL) in the real world. Learning a prior model in a simulator for human-robot interaction (HRI) has several potential benefits. First, it can save a significant amount of real-world interaction. Several works [5] [6] have shown that learning a model for physical interactions can help robots learn faster in the real world. Second, in social interactions, humans have little tolerance for random behaviors [7] and lose interest quickly if the model deviates too much from social norms. Third, the mathematical modeling of social interactions in a simulated setting allows researchers to control factors more rigorously, which can help with the issue of replicability. Finally, learning a prior model can make use of previously developed models, such as DeepPose [8] or Emonets [9]. However, unlike simulating physical interactions [10], simulating HRI poses a different set of challenges. One of the main challenges is that it is hard for the simulator to accurately model relevant human behavior. Simulators of physical interactions are based on physical laws which are well understood, while human behavior is less predictable. Nevertheless, two ways have been considered to simulate social feedback based on real-world signals. The first is to use computational models [11] that have been studied in experiments [12], and the second is to use machine learning methods.


In this paper, we consider how to efficiently learn a prior social interaction model in the simulator and propose a Staged Social Behavior Learning (SSBL) scheme for learning appropriate social behavior with continuous actions. Specifically, we consider a task in which the robot moves towards a small group generated from F-formation [14] based on their simulated social feedback. The task is learned in an end-to-end fashion i.e. from vision to social behaviors in a virtual environment. SSBL involves a pipeline for simulated social robot learning that deconstructs a social task into three steps. In the first step, the robot learns a compressed representation of the world from vision or other modalities. This step is important because it significantly reduces the complexity of the DRL problem. After the compressed information is learned, the algorithm learns a dynamical model from a prior model which is built upon social forces [11] [15] in the environment. The last step is to make sure that the learned behavior follows the social standard by using simulated social norms as realistic reward.

II Literature Review

II-A Deep Reinforcement Learning in HRI

Reinforcement learning (RL) has been used since the early days of HRI. One of the few early works that considered using social feedback as cumulative rewards was conducted by Bozinovski [16] [17]. After that, many papers in HRI started to investigate the effect of RL algorithms like Exp3 [18] or Q-learning [19] in social robotics settings. However, since such algorithms lack the ability to capture important features from high-dimensional signals [20], their applicability to HRI problems remains limited. After the era of deep learning started in 2006 [21], many different algorithms were proposed to understand different modalities in HRI, for example, ResNet [22] for image processing and Long Short-Term Memory [23] (LSTM)-based solutions for text processing. As a consequence, some HRI researchers started to investigate deep learning's role in the area of HRI. A pioneering work was conducted by Qureshi in 2017 [24]. In this work, a Deep Q-Network (DQN) [20] was used to learn a mapping from visual input to one of several predefined actions for greeting people. Another work was conducted by Madson [25], where a DQN was used for learning generalized, high-level representations from both visual and auditory signals.

II-B Learning Representations

RL based solely on visual observations has been used to solve complex tasks such as playing Atari games [26], driving simulated cars [27] and navigating mazes [28]. However, learning policies directly from high-dimensional data such as images requires a large number of samples, which makes it intractable in social robot learning [29]. One solution is to use low-dimensional hand-crafted features as the state, but this would reduce learning autonomy.

Prior works have utilized deep autoencoders (AEs) to learn a state representation, including Lange et al. [30]. Several variants of AEs have been applied as well, including attempts by Böhmer et al. [29] to learn the dynamics of the environment by constructing an AE predicting the next image, and Finn et al. [31] who adopted a spatial AE (SAE) to learn an intermediate representation consisting of image coordinates of relevant features. The latter suggested that this intermediate representation made it particularly well suited for high-dimensional continuous control.

II-C Group Behavior

Numerous works have been done in group dynamic behaviors. Particle-based methods [32] [33] simulate global collective behaviors of large scale groups or crowds. For modeling small scale groups, agent-based methods [34] [35] are adopted to simulate the behavior of each individual based on rules of behavior. Specifically, in a small multi-party conversation group, Kendon [14] proposed the F-formation system to define the positions and orientations of individuals within a group, which characterized dynamic group behaviors. Several studies have been carried out that concern robot approaching behaviors towards small groups i.e. in which an agent moves towards a group in an attempt to join an ongoing task or conversation. Ramírez et al. [36] adopted inverse reinforcement learning, involving several participants demonstrating approaching behaviors for a robot to learn. Pedica et al. [37] integrated behavior trees in their reactive method to simulate lifelike social behaviors, including robot approaching behavior towards groups. Both approaching and leaving behaviors are considered in [38], where a finite state machine is utilized in the transitions between different social behaviors. Jan et al. [11] presented an algorithm for simulating movement of agents, such as an agent joining the conversation. The agents dynamically move to new locations, but without proper orientations.

III Methodology

In the following sections, we introduce the fundamental concepts needed to train a prior model for robot approaching behaviors in accordance with SSBL. Section III-A introduces some basic concepts and details on how the environment was set up. Sections III-B to III-D describe the three stages of SSBL training. Specifically Section III-B pertains to the state representation and its training procedure. It also details the various architectures used to evaluate this step. Section III-C shows how one can formulate the training of a dynamical model within an RL framework. Section III-D shows how social norms can be acquired by utilizing concepts from the Social Force Model (SFM) [14]. In the original SFM work [14], SFM was also used to generate social robot behaviors, which will be used as a baseline for evaluating our learned policy.

Figure 1: Two views of the environment setup. The left image shows a top-down view of the environment. The right image shows the first-person view from the robot’s perspective. The environment contains two Simulated Human Agents (SHAs: green and blue), and a robot agent (gray). Circles around the human agents and the robot represent different levels of space as discussed in Section III-C.
Figure 2: Schematic view of the encoder used both by the Finn-architecture and ours. The rightmost image is not a part of the network, but a visualization of two of the positions computed by the encoder. The slight discrepancy between the positions of the activation peaks in the feature maps and the circles in the output image is due to the convolutions not using any padding, so that the feature maps only represent the center pixels of the input.

III-A Environment Setup

In order to simulate robot approaching behaviors, we first build a simulator using the Unity 3D game engine. The environment consists of a square floor surrounded by four walls. A conversation group which contains two Simulated Human Agents (SHAs) is spawned at a random position within this domain. The robot agent is spawned outside the group and performs approaching behaviors. The virtual agents (agents in the group and robot agent) are pre-defined assets which resemble the SoftBank Pepper robot. Figure 1 shows an example of the environment's top-down and first-person views. The blue and green agents are SHAs and the gray one at the top right of the top-down view is the robot agent. The first-person view (Figure 1, right) is from the robot agent's perspective.

In this paper, we are mainly concerned with the task of learning a prior model in the simulator for a robot's approaching behaviors towards small groups of individuals. The task can be formulated as an RL problem. Let us consider $s_t$ and $a_t$ as the state and action of the robot agent at time $t$, respectively. Learning the dynamic behavior for approaching a group can be viewed as maximizing the expected cumulative reward $\mathbb{E}_{\tau}[R(\tau)]$ over trajectories $\tau = (s_0, a_0, s_1, a_1, \dots)$, where $R(\tau)$ is the cumulative reward over $\tau$. The expectation is under the distribution $p(\tau) = p(s_0) \prod_t \pi(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$, where $\pi$ is the policy we would like to train and $p(s_{t+1} \mid s_t, a_t)$ is the forward model determined by the environment.
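The objective can be illustrated with a minimal Python sketch. It assumes an undiscounted sum of per-step rewards and estimates the expectation by averaging over sampled trajectories; the function names and the toy reward values are hypothetical and are not taken from our implementation.

```python
from typing import List, Sequence


def cumulative_reward(step_rewards: Sequence[float]) -> float:
    """Sum the per-step rewards of one trajectory (assumed undiscounted)."""
    return float(sum(step_rewards))


def expected_return(trajectories: List[Sequence[float]]) -> float:
    """Monte Carlo estimate of the expected cumulative reward.

    Each trajectory is given as the list of rewards collected while
    following the policy in the environment.
    """
    if not trajectories:
        raise ValueError("need at least one trajectory")
    return sum(cumulative_reward(t) for t in trajectories) / len(trajectories)


# Hypothetical rewards from three sampled episodes.
sampled = [[-0.1, -0.1, 1.0], [-0.1, -0.1, -0.1, 1.0], [-0.1, -0.1]]
estimate = expected_return(sampled)
```

Policy optimization then amounts to adjusting the policy so that this estimate increases.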

III-B State representations

In our experiments, we try three modes of representing the environment state to the robot agent: Vector, CameraOnly and CameraSpeed. The first mode is a vector-based representation, consisting of the positions and velocities of all the agents, together with the positions of the walls. This representation is ideal for learning, so it serves as an upper bound on the performance of this task.

The second and third modes are designed to resemble two common robotic settings: one where the robot is equipped with a camera, and one where the robot has both a camera and the ability to estimate its speed. In these modes, the full states are given as $s_t = I_t$ and $s_t = (I_t, v_t)$ respectively. Here $I_t$ is the visual information from the robot's first-person view rendered by the Unity engine, and $v_t$ the velocity of the robot.

The method employed in this work is to learn a mapping from input images to simplified low-dimensional state representations, thus circumventing some of the problems associated with RL from high-dimensional input [26]. To do this, we utilize an autoencoder (AE) [39], a neural net $f$ that maps inputs to themselves, s.t. $f(x) \approx x$. An AE can be decomposed into an encoder $e$ and a decoder $d$, $f = d \circ e$. By choosing the intermediate representation to be comparatively low-dimensional, $e(I_t)$ or $(e(I_t), v_t)$ could serve as a simplified but sufficient representation of the state, facilitating accelerated learning.

We implemented and evaluated two different AEs. The first one is a regular convolutional AE, whose encoder and decoder are built from convolutional layers and fully connected layers.

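The second AE is based on the deep SAE described in [31], but with some significant variations. In the following sections, we refer to it as the Spatial Auto-encoder Variant (SAEV). The SAEV encoder consists of a stack of convolutional layers followed by a mapping that extracts feature locations. The convolutional layers use exponential linear unit (ELU) activations [40], except for the last one, which uses a spatial softmax activation that normalizes each feature map over its pixel positions:

$$p^k_{ij} = \frac{\exp(a^k_{ij})}{\sum_{i',j'} \exp(a^k_{i'j'})},$$

where $a^k_{ij}$ is the pre-activation value at pixel $(i, j)$ of the $k$th feature map.
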
The mapping takes a number of feature maps, which it treats as bivariate probability distributions. For each, a feature location is estimated by the expectation values:

$$\hat{x}_k = \sum_{i,j} x_{ij}\, p^k_{ij}, \qquad \hat{y}_k = \sum_{i,j} y_{ij}\, p^k_{ij},$$

where $p^k_{ij}$ is the value at coordinate $(i, j)$ of the $k$th feature map of the input and $(x_{ij}, y_{ij})$ is the image position of that coordinate. The presence $\rho_k$ of a feature is defined as a weighted sum over the feature map. Intuitively, a feature map which is highly localized around the estimated position has a presence near 1, whereas one that is very spread out will have a presence close to 0. The output of the mapping is the concatenation of the $(\hat{x}_k, \hat{y}_k, \rho_k)$ of each feature map. In other words, the intermediate representation contains actual image coordinates of the features.
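To make the position estimate concrete, the following minimal Python sketch applies a spatial softmax to a single toy feature map and computes the expected pixel coordinates. It assumes the standard softmax without a temperature parameter and uses plain pixel indices as coordinates; the presence term is omitted, and all names are hypothetical.

```python
import math
from typing import List, Tuple

FeatureMap = List[List[float]]


def spatial_softmax(feature_map: FeatureMap) -> FeatureMap:
    """Normalize one feature map into a distribution over pixel positions."""
    peak = max(max(row) for row in feature_map)  # subtract max for stability
    exps = [[math.exp(v - peak) for v in row] for row in feature_map]
    total = sum(sum(row) for row in exps)
    return [[v / total for v in row] for row in exps]


def expected_position(prob_map: FeatureMap) -> Tuple[float, float]:
    """Expected (x, y) pixel coordinate under the distribution."""
    x_hat = sum(p * x for row in prob_map for x, p in enumerate(row))
    y_hat = sum(p * y for y, row in enumerate(prob_map) for p in row)
    return x_hat, y_hat


# A toy 3x4 activation map with a strong peak at column 2, row 1.
activations = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 9.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
probs = spatial_softmax(activations)
x_hat, y_hat = expected_position(probs)
```

Because almost all probability mass sits on the peak, the estimated position lies very close to column 2, row 1.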

The main difference between our SAEV architecture and the SAE described in [31] is the decoder. Our decoder first applies a transformation that takes the 3-tuples of estimated position and presence and maps each to a feature map. This creates feature maps with peaks at the estimated positions that decrease radially outwards according to the ELU [40]. To the output of this transformation, three convolutional layers are applied, the first two with ELU activations and the last with a sigmoid activation, followed by an addition operation with a trainable constant to complete the decoder. The constant addition operation frees up the prior stages of the architecture to focus on learning positions of things that are not always in the same place.

All models are trained using the Adam optimizer [41] on a loss function consisting of three components: the reconstruction error, a presence-based loss that encourages localized features, and the smoothness loss defined in [31]. For the convolutional AE, the presence loss is ill-defined and thus that term was omitted. One can now use the intermediate representation as input to the RL framework, or to visualize the corresponding image coordinates, as is shown in Figure 2.

Figure 3: Positions extracted by the encoder, visualized on top of their corresponding input images. The coloring of a feature is consistent across the three images. Most features are visualized as dots, except two features that have been chosen to be visualized as circles to more clearly show how they track their intended objects.

III-C Modeling group behavior

In a realistic multi-party conversation group, the individuals within it stand in appropriate positions with respect to others. This positional and orientational relationship has been defined as an F-formation as proposed by Kendon [14]. It characterizes a group of two or more individuals, typically in a conversation, to share information and interact with each other. Most importantly, it defines the o-space which is a common focused space in the group in which all individuals look inward and is exclusive to those external. When conditions change, such as a new individual joining the group, the group members should change position or orientation in order to form a new group including the newcomer. Jan et al. [11] proposed a group model which simulates these behaviors by a social force field. In this paper, we use an extended Social Force model which maintains F-formation through repositioning and reorientating by a conversation force field. This force field is produced and updated by three forces: a repulsion force, an equality force, and a cohesion force. The details of social force fields are described in [15]. In order to better model conversation groups, Hall’s proxemics theory [42] is adopted when generating social force fields, i.e. the repulsion, equality and cohesion forces occurring in personal, social and public spaces, respectively.

The repulsion force prevents other agents from stepping inside an agent's personal space by pushing them away. It is computed from the position of the agent currently being evaluated, the positions of the other agents inside its personal space, the radius of its personal space, and the distance to the closest agent inside the personal space; the exact expression (Equation 9) follows [15].

The equality force keeps the o-space shared among all group members by generating an attraction or a repulsion force towards a point in the o-space. In addition, an orientation force towards the o-space is generated to change body orientation. Both the equality force and the equality orientation (Equation 10) depend on the number of other agents inside the social space, the centroid of the group, and the mean distance of the members from the centroid.

The cohesion force prevents an agent from being isolated from a group and keeps agents close to each other by generating an attraction force. The cohesion force and cohesion orientation (Equation 11) depend on the number of other agents inside the public area, the conversation center, and the radius of the o-space. A scaling factor reduces the magnitude of the cohesion force if the agent is surrounded by other agents in its social area.

To include a component in the reward function that drives the robot to approach the group, we incorporate the extended social force model described previously and take the line integral over a path in the aforementioned force fields, namely the force fields in personal, social and public spaces, to be the group forming reward. Mathematically, the group forming reward for the robot agent is defined as

$$R_{g} = \int_{\tau} \mathbf{F}(\mathbf{x}) \cdot d\mathbf{x},$$

where $\mathbf{x}$ is the position of the robot along the trajectory $\tau$, and $\mathbf{F}$ is the combined force on the robot agent. Note that the force fields depend on the positioning of all agents, including the SHAs, but for notational simplicity, this is not made explicit in the formulae.
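In practice, a line integral of this kind can be approximated along a discretized path. The sketch below uses the midpoint rule on a polyline; the attraction field pulling towards the origin is a hypothetical stand-in for the combined social force, and none of the names are taken from our implementation.

```python
from typing import Callable, List, Tuple

Vec = Tuple[float, float]


def line_integral(force: Callable[[Vec], Vec], path: List[Vec]) -> float:
    """Approximate the line integral of `force` along a polyline path.

    Each segment contributes force(midpoint) . (end - start).
    """
    total = 0.0
    for (x0, y0), (x1, y1) in zip(path, path[1:]):
        mid = ((x0 + x1) / 2.0, (y0 + y1) / 2.0)
        fx, fy = force(mid)
        total += fx * (x1 - x0) + fy * (y1 - y0)
    return total


def attraction_to_origin(pos: Vec) -> Vec:
    """Hypothetical stand-in force field pulling towards the origin."""
    return (-pos[0], -pos[1])


# Moving straight towards the attractor yields a positive reward.
approach_path = [(2.0, 0.0), (1.5, 0.0), (1.0, 0.0)]
reward = line_integral(attraction_to_origin, approach_path)
```

Moving along the force gives a positive value and moving against it gives a negative one, which is what lets the integral act as a reward for approaching the group.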

Together with the group forming reward, another reward function, called the non-increasing reward, is added to ensure that the energy in the force field is non-increasing. It is defined using an indicator function over the set of points along the robot's trajectory that satisfy a condition on the force field. These two reward functions help the robot agent to approach the group center. To add further incentive to complete the task, two other reward components are added: a time penalty based on the times at which an episode starts and ends, and a bonus reward for successful approaching behavior within the required number of time steps.

III-D Following Social Norms

In order to make the robot adhere to social norms when it is approaching the group, simulated feedback from other agents is taken into consideration. The robot agent therefore considers the impact of its own behavior on others, which is important in generating appropriate real-world robot approaching behaviors. Here, we define the summation of all the line integrals of the SHAs' paths in the force fields,

$$R_{s} = \sum_{i=1}^{N} \int_{\tau_i} \mathbf{F}_i(\mathbf{x}) \cdot d\mathbf{x},$$

where $N$ is the total number of SHAs, $\tau_i$ is the path of the $i$th SHA, and $\mathbf{F}_i$ is the combined force acting on it.

The final reward is a combination of all five rewards. Each is associated with a weight to indicate the importance of that reward category. On top of the weights considered for each category of rewards, two other weights are used to influence the behavior of the robot. One is the egoism weight, which decides how much the robot agent considers achieving its own goal of approaching the group center. The other, the altruism weight, decides how much it cares about other agents, meaning that it avoids pushing the SHAs around. The final reward is the weighted combination of these terms.

By balancing the different weights, we produce a realistic reward function that captures important notions from human social interaction, such as respecting the private space of others.
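One way such a weighted combination could be organized is sketched below. This is only an illustration under stated assumptions: it assumes that the egoism weight scales the robot's own goal-directed terms and that the altruism weight scales the feedback term from the SHAs. The class, the function, and all numerical values are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class RewardTerms:
    group_forming: float   # line integral along the robot's own path
    non_increasing: float  # term discouraging increases in field energy
    time_penalty: float    # cost for elapsed episode time
    success_bonus: float   # bonus for approaching within the time limit
    sha_feedback: float    # summed line integrals along the SHAs' paths


def final_reward(terms: RewardTerms, weights: Dict[str, float],
                 egoism: float, altruism: float) -> float:
    """Weighted combination of the five reward terms.

    ASSUMPTION: egoism scales the robot's own goal-directed terms and
    altruism scales the SHA feedback term; this grouping is illustrative.
    """
    own_goal = (weights["group_forming"] * terms.group_forming
                + weights["non_increasing"] * terms.non_increasing
                + weights["time_penalty"] * terms.time_penalty
                + weights["success_bonus"] * terms.success_bonus)
    social = weights["sha_feedback"] * terms.sha_feedback
    return egoism * own_goal + altruism * social


# Hypothetical values for one episode.
terms = RewardTerms(1.5, -0.2, -0.1, 1.0, -0.4)
unit_weights = {name: 1.0 for name in vars(terms)}
total = final_reward(terms, unit_weights, egoism=1.0, altruism=0.5)
```

Under this grouping, raising the altruism weight makes an episode in which the SHAs are pushed around less rewarding, while raising the egoism weight favors reaching the group quickly.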

IV Results

We used a DRL algorithm called Proximal Policy Optimization (PPO) [43] to learn an appropriate behavior for the robot agent. We selected PPO due to its stability advantages [44] over DQN-based RL algorithms. We used the ML-Agents Toolkit [45] to carry out our experiments.

IV-A Model Configurations

To determine what state representation and type of network structure for the value and policy networks are the most suitable for robot approaching behavior, we evaluate combinations of state representations, and network architectures. For the state representations containing visual information, we evaluate both AEs (conv and SAEV from section III-B). The network structures considered are Feed-Forward (FF) networks and LSTM networks. Table I shows the model configurations and their corresponding performance.

Model                         Reward   Percentage
Vector + LSTM (Baseline)      -0.256   100.00%
CameraOnly + SAEV + FF        -0.869    57.06%
CameraOnly + SAEV + LSTM      -0.804    61.63%
CameraOnly + conv + FF        -0.81     61.18%
CameraOnly + conv + LSTM      -1.091    41.51%
CameraSpeed + SAEV + LSTM     -0.544    79.80%
CameraSpeed + conv + LSTM     -0.709    68.22%
Random policy                 -1.684     0.00%
Table I: Results of the different configurations. The reported results are the best cumulative reward of the model.

Performance is measured both as cumulative reward, described in Section III-A and smoothed with an exponentially weighted running average, and as percentages. Percentages express relative performance, such that 100% corresponds to the baseline performance and 0% to the mean performance of a uniformly random agent. Figure 4 shows the learning curve of the best model, which uses the image and the robot's speed as input, the output of the SAEV as the learned state representation and an LSTM as the policy network.
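The two performance measures can be sketched in a few lines of Python. The percentage is assumed to be a linear rescaling between the random-policy reward and the baseline reward, which reproduces the values in Table I up to rounding of the reported rewards. The smoothing factor `alpha` is an assumption, since the value we used is not stated here, and the function names are hypothetical.

```python
from typing import List


def relative_performance(reward: float, baseline: float,
                         random_reward: float) -> float:
    """Map a reward to a percentage: random agent -> 0, baseline -> 100."""
    return 100.0 * (reward - random_reward) / (baseline - random_reward)


def exponential_moving_average(values: List[float],
                               alpha: float = 0.1) -> List[float]:
    """Smooth a reward curve; `alpha` is an assumed smoothing factor."""
    smoothed: List[float] = []
    for v in values:
        prev = smoothed[-1] if smoothed else v
        smoothed.append(alpha * v + (1.0 - alpha) * prev)
    return smoothed


BASELINE, RANDOM = -0.256, -1.684  # rewards reported in Table I
pct = relative_performance(-0.544, BASELINE, RANDOM)  # CameraSpeed + SAEV + LSTM
```

The same rescaling applied to the other rows gives values within a few hundredths of a percentage point of those listed in the table.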

Figure 4: Learning curves of the best model compared to a uniformly random agent and the baseline.

IV-B Approaching Behaviour: Perceptual Study

We compare the robot approaching behavior learned by our model with the one generated by SFM [15]. In a study conducted by Pedica et al. [12], it was shown that SFM increased the believability of static group forming. A major drawback of SFM is that it is directly controlled by the social forces and therefore does not adapt to the current situation in the environment. We hypothesize that a learned agent that is able to accelerate and decelerate based on the simulated social feedback in an RL framework can introduce more believability and social appropriateness. In order to compare the behaviors of the SFM and our model, we implemented a version of SFM and compared it with a model learned with the reward function defined in Section III-D. Figure 5 shows paths sampled from our trained model and paths sampled from the SFM with the same initial positions.

In order to subjectively assess the behavior of our learned model compared to the behavior generated by SFM, we created a questionnaire to evaluate the approaching behaviors using subjective measures. In this study, the social behaviours are considered from three dimensions, namely polite, sociable and rude, suggested by Okal and Arras [46].

Figure 5: Comparison of generated paths. The left image shows paths generated by our model and the right image shows paths generated by the social force model with the same initial positions.

The questionnaire contains six videos of approaching behaviour in total. The videos show six different approaching behaviors of the robot towards groups, from three starting locations, generated by both our model and the SFM. The participants observe the behaviors from a top-down view. For each video, the participant answers four questions. The first three questions ask the participants to rate how polite, sociable or rude they consider the behaviour, respectively, and the fourth question asks how human-like they consider the robot. For each question, the participants select a point on a 1-7 Likert scale, where 1 means "not at all" and 7 means "very". The videos and their corresponding questions are given to the participants in a random order.

In total, 20 participants answered the questions. They have an average age of 28.25 and mixed cultural backgrounds. Figure 6 shows their ratings of the approaching behaviours generated by the two models. We found that people consider the behavior generated by our model to be significantly more polite, less rude and more sociable. However, we did not find the approaching behaviour generated by our model to be significantly more human-like than the one generated by SFM. This might be because human-likeness is hard to measure when more than one factor is involved [47].

Figure 6: Comparison of behaviors generated by two models. The behavior generated by our model is generally considered to be more polite, less rude and more sociable.

V Conclusion and Future Work

In this work, we proposed a scheme (SSBL) that can be considered as a general framework for social robot learning. In order to demonstrate it, we implemented a robot approaching behavior task based on this scheme. For this task, we designed a reward function combining concepts from SFM and Hall's proxemics theory to enable the robot agent to learn a dynamical model which takes social norms into account. Our results show that we can generate more socially-considerate behavior than SFM, and the SAEV outperforms the vanilla convolutional AE on this task for both CameraOnly and CameraSpeed inputs. The code is publicly available at erSocial/tree/master

In the future, we will conduct studies where humans are asked to qualitatively assess the behavior of our learned model compared to the behavior generated by SFM in real-world situations. Regarding the model configuration experiments, we will also investigate how to utilize more subtle real-world human feedback such as engagement to refine our learned model using model-based RL algorithms. As a result, the robot will exhibit comfortable approaching behavior by taking other affective measures into account. We will also conduct policy refinement experiments in the real world with the Pepper robot and actual humans.