Reinforcement Learning Approaches in Social Robotics

09/21/2020 ∙ by Neziha Akalin, et al.

There is a growing body of literature that formulates social human-robot interactions as sequential decision-making tasks. In such cases, reinforcement learning arises naturally since the interaction is a key component in both reinforcement learning and social robotics. This article surveys reinforcement learning approaches in social robotics. We propose a taxonomy that categorizes reinforcement learning methods in social robotics according to the nature of the reward function. We discuss the benefits and challenges of such methods and outline possible future directions.




1. Introduction

With the proliferation of social robots in society, these systems will affect their users in several facets of life through assistance, cooperation, and collaboration. In order to facilitate natural interaction, researchers in social robotics have focused on robots that can adapt to diverse conditions and to the different users with whom they interact. Recently, there has been great interest in the use of machine learning methods for adaptive social robots (Keizer2014; de2015robots; tsiakas2016; hemminghaus2017; Khamassi2018; ritschel2019adaptive).

Machine Learning (ML) algorithms can be categorized into three subfields (mlbook): supervised learning, unsupervised learning, and reinforcement learning. In supervised learning, correct input/output pairs are available and the goal is to find a correct mapping from input to output space. In unsupervised learning, output data is not available and the goal is to find patterns in the input data. Reinforcement Learning (RL) (sutton1998introduction) is a framework for decision-making problems in which an agent interacts through trial-and-error with its environment to discover an optimal behavior. The agent does not receive direct feedback of correctness; instead, it receives scarce feedback about the actions it has taken in the past. The agent tunes its behavior over time via this feedback signal, i.e., reward (or penalty). The agent’s goal is learning to take actions that maximize the reward.

Interaction is a key component in both RL and social robotics. Therefore, RL can be a suitable approach for social human-robot interaction. Humans perform sequential decision-making in daily life. Sequential decision-making describes problems that require successive observations, i.e., that cannot be solved with a single action (barto1989learning). Social human-robot interactions can thus be formulated as sequential decision-making tasks, i.e., RL problems. The goal of the robot in these types of interactions would be to learn an action-selection strategy that optimizes some performance metric, such as user satisfaction. Social robots can learn social skills from their own actions, without demonstrations, through uncontrolled interaction experiences. This is especially valuable since interaction dynamics are difficult to model and sometimes even humans cannot explain why they behave in a certain way. Therefore, RL may enable social robots to adapt their behaviors to their human partners for natural human-robot interaction.

However, several problems may occur in practice. RL requires many attempts to learn how to solve a task, which makes it challenging to apply in real-world scenarios with social robots. The requirement for longitudinal interaction can result in loss of interest and fatigue in humans. An alternative is to train the algorithm in a simulation model and subsequently deploy it on the real robot. However, simulating the real world can be very difficult, especially with regard to modeling relevant human behaviors: simulating the human requires a predictive model of human interactive behaviors and social norms, as well as modeling the uncertainty of the real world. Furthermore, the use of RL in social robotics poses other challenges, such as devising proper reward functions and policies and dealing with the sparseness of the reward signals. Despite these immediate challenges, the use of RL for social robotics is still in its early stages, and this article presents an up-to-date survey of RL approaches in social robotics. Due to the complexity of social interactions and the real world, most studies applying RL are trained and tested in simulation. However, real-world interactions are extremely important for understanding social robots’ compatibility and efficacy as well as their broader impact on users, yet few examples of real-world interactions exist. Although there is a plethora of RL research in the general field of robotics, in this review we focus on works involving social robots.

bartneck2004design (bartneck2004design) define a “social robot” as “an autonomous or semi-autonomous robot that interacts and communicates with humans by following the behavioral norms expected by the people with whom the robot is intended to interact.” Following this definition, they stress that a social robot has a physical embodiment. Considering the definition of (bartneck2004design), we exclude studies with simulations and virtual agents that do not have a physical embodiment. We also exclude studies with industrial robots and studies that do not include any interaction with humans. We exclusively focus on papers that comprise a social robot(s) and human input/user studies. In other words, we include studies that use simulations for training but test on a physical robot deployment with user studies. Likewise, studies that use explicit or implicit human input in the learning process are included.

The choice of reward function is crucial in RL for robotics, where the problem is also referred to as the “curse of goal specification” (kober2013reinforcement). As kober2013reinforcement stated, the key challenge is to differentiate between a feasible and an unreasonable amount of exploration. In social robotics, the presence of the human in the loop adds further complexity to this issue. We provide an organizational structure of the research on RL in social robotics based on the nature of the reward design. In surveying research on RL and social robotics, four major themes emerged: (1) Interactive Reinforcement Learning (IRL), (2) intrinsically motivated methods, (3) social signals driven methods, and (4) task performance driven methods.

It is worth noting that there exist review papers on RL in robotics, such as applications of RL in robotics in general (kober2013reinforcement; kormushev2013reinforcement), policy search in robot learning (deisenroth2013survey), and deep reinforcement learning in soft robotics (bhagat2019deep). Even though RL has been applied to a variety of scenarios and domains within social robotics, to the best of our knowledge there exists no survey of the topic in this particular research field. The main purpose of this work is to serve as a reference guide that provides a quick overview of the literature for social robotics researchers who aim to use RL in their research, giving them access to a collection of papers on RL applications in the field in one place.

The remainder of this paper is organised as follows: In Section 2, we provide a brief overview of social robotics research and established definitions for the term “social robot”. We describe the methodology used for paper selection and introduce the categories by which we group the papers in Section 3. In Section 4, after a brief general overview of reinforcement learning, we present the various RL approaches used in social robotics, categorized by the applied type of reward function. In Section 5, we discuss the benefits and challenges of applying RL in social robotics. In Section 6, we present the evaluation methodologies used in the papers. We conclude the paper in Section 7, with a number of reflections and an outlook on the future of the field.

2. Social Robotics

The research field of social robotics aims at designing, developing, and evaluating robots that interact with humans within the social roles attributed to them. Since we aim to survey RL approaches for social robotics, it is necessary here to clarify exactly what is meant by a “social robot”. A variety of definitions for a social robot have been suggested in the literature.

fong2003survey presented a review on socially interactive robots and discussed different aspects of design considerations (e.g., embodiment, emotion, dialog, personality, perception) and impact of such robots on humans (fong2003survey). Their definition of a social robot was given as: “Social robots are embodied agents that are part of a heterogeneous group: a society of robots or humans. They are able to recognize each other and engage in social interactions, they possess histories (perceive and interpret the world in terms of their own experience), and they explicitly communicate with and learn from each other.” breazeal2003toward defined social robots from a human observer’s perspective: “the class of robots that people anthropomorphize in order to interact with them” (breazeal2003toward). She argued that people will attribute human or animal-like features (anthropomorphize) to social robots in order to understand and interact with them. Another definition was given by duffy2003anthropomorphism as: “A physical entity embodied in a complex, dynamic, and social environment sufficiently empowered to behave in a manner conducive to its own goals and those of its community” (duffy2003anthropomorphism).

bartneck2004design proposed a definition of social robots and discussed their properties (i.e., form, modality, social norms, autonomy, and interactivity) (bartneck2004design). Their definition was as follows: “A social robot is an autonomous or semi-autonomous robot that interacts and communicates with humans by following the behavioral norms expected by the people with whom the robot is intended to interact.” Adding to this definition, they emphasized that a social robot has a physical embodiment. hegel2009understanding proposed a definition considering the above-mentioned definitions: “A social robot is a robot plus a social interface”, where the social interface is the sum of all social features by means of which an observer judges the robot as a social interaction partner (hegel2009understanding). yan2014survey reviewed perception methods for Human-Robot Interaction (HRI) in social robots; in their work, they also defined a social robot as “a robot which can execute designated tasks, and the necessary condition turning a robot into a social robot is the ability to interact with humans by adhering to certain social cues and rules” (yan2014survey). darling2012extending defined a social robot as “a physically embodied, autonomous agent that communicates and interacts with humans on an emotional level” (darling2012extending). The reader may refer to the review paper on scientific and popular definitions of the social robot (sarrica2019many) for more references.

We can see a wide spectrum of characteristics in these definitions. However, two important aspects become prominent in the definitions mentioned so far: embodiment and interaction/communication capability. Therefore, the scope of this survey is limited to studies that include physically embodied robots and real-world interactions. Based on the presented definitions, in this paper we consider social robots as embodied agents that can interact and communicate with humans. Figure 1 shows some of the social robots that were used in the reviewed papers.

Figure 1. Some of the social robots referenced in the reviewed papers.

The pictures of (a) the Pepper robot, (b) the Nao robot, and (d) the Furhat robot were taken by the authors. (c) Mini robot: the figure is adapted from (maroto2018bio), licensed under the Creative Commons Attribution license. (e) Maggie robot: the figure is from, accessed 20 March 2020, licensed under the Creative Commons Attribution license. (f) iCat robot: the figure is used with permission; photo credit to Christoph Bartneck.

3. Survey Structure and Methodology

We carried out a systematic search with the objective of reviewing papers about reinforcement learning applications in social robotics. As a first step, we identified two main key terms: “reinforcement learning” AND “social robot*”. In addition, we used the following key terms: “social* assist* robot*”, “user stud*”, “human input”, “feedback”, “service robot*”, “rehabilitation robot*”, “therapy robot*”, “robot tutor*”, “human in the loop”, “interactive learning”. We searched in the following databases: Web of Science, Scopus, SpringerLink, ScienceDirect, ACM Digital Library, IEEE Xplore and Academic Search Elite. In the second step, as previously mentioned, we eliminated papers based on exclusion and inclusion criteria to narrow down the search results.

The scope was limited to studies that include a physical social robot and real-world human-robot interactions with users (i.e., the study should use human input during the learning process or should include field trials with users), as opposed to industrial robots and simulated environments. We applied two elimination steps to the search results. In the first step, we evaluated the results by reading the abstracts; in the second step, we read the full papers of the remaining publications. We also examined the reference lists of included articles to avoid skipping any relevant studies. We included only full papers; extended abstracts were eliminated. The earliest published work that meets our inclusion criteria appeared in 2006, and the survey cutoff date was September 2019. The final list includes 55 studies satisfying the inclusion criteria. After surveying research on RL and social robotics and performing full-text assessments, four major themes emerged based on the design of the reward mechanisms (see Figure 2):

Figure 2. Reinforcement Learning approaches in social robotics.
  • Interactive reinforcement learning: In these methods, humans are involved in the learning process either by providing feedback or guidance to the agent, i.e., the feedback signal the agent receives depends exclusively on human input. This approach, in which the human guides the agent with explicit or implicit feedback, is known as Interactive Reinforcement Learning (IRL). In the explicit feedback approach, the feedback of the human teacher is noiseless: the human teacher observes the agent’s actions and environment states, and subsequently provides feedback to the agent through an interface or through the robot’s (touch) sensors. In the implicit feedback approach, the human teacher guides the learning, but the feedback is not direct (i.e., emotions, speech, etc.); instead, it depends on the robot’s perception and recognition. Figure 3a depicts an illustration of IRL, where a human participant is part of the environment.

  • Intrinsically motivated methods: Despite the existence of different intrinsic motivations in the RL literature (oudeyer2008can), the most common approaches in social robotics depend on the robot maintaining an optimal internal state by considering both internal and external circumstances. Figure 3b depicts an illustration of intrinsically motivated RL.

  • Social signals driven methods: In these methods, the reward signal depends on the human’s social signals. These methods use social cues that occur naturally during the interaction (e.g., emotions, engagement, interaction distance, gestures, gaze direction, etc.). The difference from the explicit feedback variant of IRL is that in social signals driven methods, the user is not aware that her/his social signals or emotions are guiding the learning.

  • Task performance driven methods: In these methods, the reward depends on either the robot’s task performance or the human interactant’s task performance, or a combination of both.

The proposed taxonomy aims to facilitate and guide the choice of a suitable algorithm by social robotics researchers in their application domain. For that purpose, we elaborate on the different methods that are tested in real world scenarios with a physical robot.

Figure 3. Illustration of (a) IRL, and (b) intrinsically motivated RL.

4. Reinforcement Learning Approaches for Social Robotics

Reinforcement Learning (sutton1998introduction) is a framework for decision-making problems. Markov Decision Processes (MDPs) are mathematical models for describing the interaction between an agent and its environment. Formally, an MDP is denoted as a tuple of five elements (S, A, T, R, γ), where S represents the state space (i.e., the set of possible states), A represents the action space (i.e., the set of possible actions), T is the transition function that represents the probability of transitioning from one state to another given a particular action, R represents the reward function, and γ is the discount factor that determines the importance of future rewards, 0 ≤ γ ≤ 1. The agent interacts with its environment in discrete time steps; at each time step t, the agent receives a representation of the environmental state s_t, takes an action a_t, moves to the next state s_{t+1}, and receives a scalar reward r_{t+1}. The agent’s goal is to learn an optimal behavior that maximizes the expected cumulative discounted reward. This optimal behavior is called the optimal policy. The policy, π, is defined as a mapping from states to actions, π: S → A, where π(a|s) is the probability of taking action a given state s.

The value of a policy π, namely the value function, is used to evaluate states based on the total reward the agent receives over time. RL methods that approximate the value function instead of directly learning a policy are called value-based methods. For each learned policy π, there are two related value functions: the state value function V^π(s) and the state-action value function (or quality function) Q^π(s, a), where the superscript π means the agent follows policy π in each step. The value functions are expressed via the Bellman equation (bellman1952theory). Q-learning (watkins1989learning) is a model-free, value-based algorithm and one of the most popular RL algorithms using discounted reward (gosavi2006boundedness). The Q-learning algorithm iteratively applies the Bellman optimality equation. We found different variations of Q-learning in social robotics, where it is the most commonly used RL method.
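As a purely illustrative sketch (not code from any surveyed paper), the iterative Bellman optimality update at the heart of tabular Q-learning can be written as follows; the environment interface `env_step` is a hypothetical stand-in for the agent's interaction with the world:

```python
import random

def q_learning(env_step, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning with epsilon-greedy exploration.

    env_step(state, action) -> (next_state, reward, done)
    is a hypothetical environment interface; episodes start in state 0.
    """
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # epsilon-greedy action selection
            if random.random() < epsilon:
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda x: Q[s][x])
            s2, r, done = env_step(s, a)
            # Bellman optimality update toward r + gamma * max_a' Q(s', a')
            target = r + (0.0 if done else gamma * max(Q[s2]))
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2
    return Q
```

On a toy chain task where one action moves toward a rewarded terminal state, the learned table comes to prefer that action in every state.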

Policy-based methods do not use value function models; instead, they directly search for an optimal policy π*. In these methods, the policy is parameterized as π_θ, and the focus is on policy optimization: finding parameters θ that maximize the expected return using gradient-based or gradient-free optimization (arulkumaran2017brief).
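A minimal policy-based sketch, assuming a stateless task with a softmax policy over action preferences: the REINFORCE-style update below performs stochastic gradient ascent on the expected reward, with a running-average baseline to reduce variance. The environment interface `pull` is hypothetical:

```python
import math, random

def softmax(h):
    m = max(h)
    e = [math.exp(x - m) for x in h]
    z = sum(e)
    return [x / z for x in e]

def reinforce_bandit(pull, n_actions, steps=2000, lr=0.1):
    """REINFORCE on a stateless task: pull(action) -> reward."""
    h = [0.0] * n_actions        # policy parameters (action preferences)
    baseline = 0.0
    for t in range(1, steps + 1):
        p = softmax(h)
        a = random.choices(range(n_actions), weights=p)[0]
        r = pull(a)
        baseline += (r - baseline) / t          # running-average baseline
        # grad of log pi(a) w.r.t. h[b] is 1[a == b] - p[b]
        for b in range(n_actions):
            h[b] += lr * (r - baseline) * ((1.0 if b == a else 0.0) - p[b])
    return softmax(h)
```

With a clearly superior action, the returned action probabilities concentrate on it.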

The learning in RL progresses over discrete time steps as the agent interacts with the environment. Obtaining an optimal policy requires a considerable amount of interaction with the environment, which results in high memory and computational complexity. Therefore, tabular approaches that represent state value functions, V(s), or state-action value functions, Q(s, a), as explicit tables are limited to low-dimensional problems and become unsuitable for large state spaces. A common way to overcome this difficulty is to generalize state-value estimates by using a set of features in each state. In other words, the idea is to use a parameterized functional form with a weight vector w, representing V or Q as V̂(s, w) or Q̂(s, a, w) instead of tables. Such approximate solution methods are called function approximators. Reducing the state space by using the generalization capabilities of neural networks, especially deep neural networks, is becoming increasingly popular. Deep Learning (DL) has the ability to perform automatic feature extraction from raw data. Deep Reinforcement Learning (DRL) introduces DL to approximate the optimal policy and/or optimal value functions (rlbook_new). Recently, there has been increasing interest in using DL to scale RL to problems with high-dimensional state spaces. For natural interaction, it is important that social robots possess human-like social interaction skills, which require features extracted from high-dimensional signals. In fact, several researchers have begun to examine the applicability of DRL in social robotics (Qureshi2016; qureshi2017show; cuayahuitl2019data).
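To make the function-approximation idea concrete, here is a hedged sketch of a single semi-gradient Q-learning step for a linear approximator Q̂(s, a, w) = w · φ(s, a); the feature vectors and helper names are illustrative only, not from any surveyed system:

```python
def qhat(w, phi):
    # linear value estimate: dot product of weights and features
    return sum(wi * xi for wi, xi in zip(w, phi))

def semi_gradient_q_update(w, phi_sa, r, phis_next,
                           alpha=0.1, gamma=0.9, done=False):
    """One semi-gradient Q-learning step for Q_hat(s, a, w) = w . phi(s, a).

    phi_sa     -- feature vector of the current state-action pair
    phis_next  -- feature vectors of all actions in the next state
                  (ignored when done is True)
    Returns the updated weight vector.
    """
    target = r if done else r + gamma * max(qhat(w, p) for p in phis_next)
    delta = target - qhat(w, phi_sa)
    # for a linear approximator, the gradient of Q_hat w.r.t. w is phi_sa
    return [wi + alpha * delta * xi for wi, xi in zip(w, phi_sa)]
```

Each call moves the estimate for the visited features a fraction of the way toward the bootstrapped target, so repeated visits converge toward it.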

The overall goal of an RL agent is to maximize the expected cumulative reward over time, as stated in the “reward hypothesis” (rlbook_new). The reward in RL is used as a basis for discovering an optimal behavior. Hence, reward design is important to elicit desired behaviors in RL-based systems. We categorized the reviewed papers based on the nature of the reward. The summary of the included papers is given in Table LABEL:summary_table.

4.1. Interactive Reinforcement Learning

Combining human and machine intelligence may be effective for solving computationally hard problems (holzinger2016towards). Integrating human feedback into RL can be done in different ways, such as shaping the action policy (the human is involved in the action selection mechanism) or shaping the reward and the value function (knox2012reinforcement). The most frequently seen approach in social robotics is using human feedback as the reward. In this approach, the human teacher is involved in the learning phase by providing feedback on the action taken by the agent (i.e., right or wrong). This online guidance reduces the action space by narrowing down the action choices (thomaz2006reinforcement2), which speeds up training by accelerating convergence to an optimal policy (the action-selection strategy that, in each state, leads to the maximum reward). In these approaches, the human can be in the learning loop with varying types of input, such as providing feedback explicitly through an interface, or via speech and social cues. Therefore, this category comprises two subcategories, explicit feedback and implicit feedback, which are explained in the following subsections.
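As an illustrative sketch of human feedback used as the reward (not the implementation of any surveyed system), Q-learning can take its reward signal entirely from a human teacher; `get_state`, `do_action`, and `human_feedback` are hypothetical interfaces standing in for the robot's perception, actuation, and feedback channel:

```python
import random

def interactive_q_learning(get_state, do_action, human_feedback,
                           n_states, n_actions, steps=200,
                           alpha=0.2, gamma=0.9, epsilon=0.2):
    """Q-learning where the reward comes entirely from a human teacher.

    human_feedback(state, action) -> float, e.g., +1 / -1 button presses
    (0 when the teacher gives no feedback for a step). All interface
    names are hypothetical.
    """
    Q = [[0.0] * n_actions for _ in range(n_states)]
    s = get_state()
    for _ in range(steps):
        if random.random() < epsilon:
            a = random.randrange(n_actions)
        else:
            a = max(range(n_actions), key=lambda x: Q[s][x])
        do_action(a)
        s2 = get_state()
        r = human_feedback(s, a)          # explicit human reward
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2
    return Q
```

With a teacher who consistently rewards one action and penalizes the rest, the table quickly comes to prefer the rewarded action.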


4.1.1. Explicit Feedback

In this approach, the feedback or guidance from the human teacher is noiseless and direct, in the form of numerical values provided via a button, a Graphical User Interface (GUI), or the robot’s touch sensors. The studies under this category are (Barraquand2008; suay2011effect; suay12; Knox2013; yang2017; schneider2017exploring; Churamani2018; Gamborino18; Tseng18; ritschel2019adaptive). For a quick summary, see Table LABEL:summary_table.

Barraquand2008 (Barraquand2008) conducted five experiments with different modifications of the classical Q-learning algorithm. The Sony AIBO robot interacted with the user to learn appropriate behaviors based on the current social situation. The participant provided feedback through tactile sensors, caressing the robot for positive feedback and tapping it for negative feedback. In the experiments, the authors investigated how social factors can be used to improve learning. The aim was for the robot to learn polite behavior, defined as behavior stimulating pleasure, in situations where the user was engaged in different activities such as reading, sleeping, or talking on the phone. The authors noted that proper credit assignment (i.e., actions that lead to higher cumulative reward are more valuable) improved the effectiveness of learning for social interaction.

suay2011effect performed experiments similar to those presented in (thomaz2006adding) in a real-world scenario with the Nao robot (suay2011effect). In (thomaz2006adding), the human teacher trained a virtual robot to bake a cake in a domain called Sophie’s Kitchen. (suay2011effect) reproduced the interface, where the human trainer observed the robot in its environment via a webcam and, through a GUI, provided reward based on the robot’s past actions or anticipatory guidance for selecting future actions. They conducted four sets of experiments (small state space and only reward, large state space and only reward, small state space and reward plus guidance, large state space and reward plus guidance) to investigate the effect of teacher guidance and state space size on learning performance in IRL. The task was object sorting, and the size of the state space depended on the object descriptor features. Their results showed that guidance accelerated learning by significantly decreasing the learning time and the number of states explored. They observed that human guidance helped the robot to reduce the action space, and its positive effect was more visible in the large state space. In a similar vein, (suay12) conducted a user study in which 31 participants taught the Nao robot to catch robotic toys using one of three algorithms: Behavior Networks, Interactive Reinforcement Learning (IRL), and Confidence-Based Autonomy. The study compared these algorithms in terms of usability and teaching performance by non-expert users. In IRL, the participants provided positive or negative feedback through an on-screen interface. In terms of teaching performance, users achieved better results using Confidence-Based Autonomy; however, IRL was better at modeling user behavior. The authors noted that teaching with IRL required more time than the other methods because users had a tendency to stop rewarding or to vary their reward strategy, which affected the training time.

Knox2013 (Knox2013) demonstrated the applicability of the TAMER framework (knox2009interactively) on a physically embodied robot. TAMER (Training an Agent Manually via Evaluative Reinforcement) enables a human trainer to train an agent by interactively shaping its policy via feedback delivered through push buttons, speech, or an interface. The first author of the paper trained the Nexi robot using a remote with two buttons (positive: +1, negative: -1). The scenario involved an artifact, with the Nexi robot learning five interactive navigation actions. The 3-dimensional location and orientation of the artifact were estimated through a motion capture system. The authors discussed the resulting transparency challenges: when the artifact was out of range, the motion capture system could not detect its position, and since this information was not visible to the human trainer, it affected the training time.
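A much-simplified sketch of the TAMER idea: rather than maximizing a discounted return, the agent fits a model Ĥ(s, a) of the human trainer's reinforcement and acts greedily (myopically) on it. The real framework handles delayed feedback with credit assignment and function approximation, which this toy tabular version omits; all interface names are hypothetical:

```python
import random

def tamer_train(get_state, do_action, human_reward,
                n_states, n_actions, steps=300, alpha=0.3, epsilon=0.1):
    """TAMER-style sketch: learn H_hat(s, a), a model of the human
    trainer's feedback, and act greedily on it (no discounting).

    human_reward(state, action) -> float, e.g., +1 / -1 button presses.
    """
    H = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(steps):
        s = get_state()
        if random.random() < epsilon:
            a = random.randrange(n_actions)
        else:
            a = max(range(n_actions), key=lambda x: H[s][x])  # greedy on H_hat
        do_action(a)
        h = human_reward(s, a)
        H[s][a] += alpha * (h - H[s][a])  # supervised update toward feedback
    return H
```

Note the contrast with Q-learning: the update target is the human's immediate reinforcement, not a bootstrapped future-value estimate.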

yang2017 (yang2017) proposed a Q-learning based approach that combines homeostasis and IRL. Internal factors, i.e., drives and motivations, worked as a triggering mechanism to initiate the robot’s services. However, the reward in the real-world experiments was given by the user touching the robot’s head, left hand, or right hand to provide positive, negative, or dispensable feedback, respectively. For this reason, we categorized this paper under IRL with explicit reward. The authors trained their model in a simulator and deployed it on the Pepper robot. The target group of the application was older people with dementia, and the purpose was to provide support with daily activities such as reporting the weather, giving schedule reminders, and having conversations with the older person.

schneider2017exploring (schneider2017exploring) investigated a dueling bandit learning approach (the double Thompson sampling algorithm (wu2016double)) for preference learning in an exercise preference scenario with two different embodiments, i.e., a virtual Nao and a physical Nao, and a non-embodied system (i.e., a GUI on a computer). In the user studies, users were asked to choose among six different exercises in different categories. The users were shown two exercises at a time and asked to select one of them, after which the preference matrix was updated. After ten minutes of interaction, the ranked exercise categories were presented to the user. In this respect, the approach is similar to IRL with explicit feedback, since the participants provided feedback via a button. The authors did not find any difference between the conditions regarding preference.
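The preference-learning setup can be sketched as a dueling bandit. The toy learner below is a simplified Thompson-sampling variant in the spirit of double Thompson sampling, not the full D-TS algorithm of (wu2016double); `duel` is a hypothetical stand-in for the user's button press choosing between two presented exercises:

```python
import random

def preference_dueling(duel, n_items, rounds=300):
    """Simplified Thompson-sampling dueling bandit for preference learning.

    duel(i, j) -> i or j, whichever item the user prefers.
    Maintains Beta posteriors over pairwise win probabilities and
    duels the two items with the highest sampled average win rates.
    """
    wins = [[0] * n_items for _ in range(n_items)]
    for _ in range(rounds):
        # sample a plausible win probability for every ordered pair
        theta = [[random.betavariate(wins[i][j] + 1, wins[j][i] + 1)
                  if i != j else 0.5
                  for j in range(n_items)] for i in range(n_items)]
        score = [sum(row) / n_items for row in theta]
        first = max(range(n_items), key=lambda i: score[i])
        second = max((i for i in range(n_items) if i != first),
                     key=lambda i: score[i])
        winner = duel(first, second)
        loser = second if winner == first else first
        wins[winner][loser] += 1
    # final ranking by total observed wins
    totals = [sum(wins[i]) for i in range(n_items)]
    return sorted(range(n_items), key=lambda i: -totals[i])
```

With simulated users whose preferences are consistent, the most-preferred item ends up ranked first.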

Churamani2018 (Churamani2018) presented a deep hybrid neural model enabling the Nico robot to express empathy towards users. They focused on both recognizing the user’s emotions and generating emotions for the robot to display. The presented model consisted of three modules: an emotion perception module, an intrinsic emotion module, and an emotion expression module. For the perception module, both visual and audio channels were used to train a Growing-When-Required (GWR) network. The purpose was not simply to mimic the user’s expression; instead, the intrinsic emotion module provided a long-term, slowly evolving model of affect over the entire interaction. For the emotion expression module, they used a Deep Deterministic Policy Gradient (DDPG) based actor-critic architecture. The reward was the symmetry of the eyebrows and mouth in offline pre-training, whereas in online training the reward was provided by the participant deciding whether the expressed facial expression was appropriate. The Nico robot expressed its emotions through programmable LED displays in the eyebrow and mouth area. Six participants were involved in the online training.

ritschel2019adaptive (ritschel2019adaptive) employed the robot Reeti as an autonomous companion where the robot adapted its linguistic style to the individual user’s preferences. They defined the learning tasks as k-armed bandit problems (rlbook_new). The adaptation was done based on the explicit human feedback given via buttons. They tested the system with one female and one male older participant. In the scenario, the robot provided several functionalities such as information retrieval, reminders, communication, and entertainment as well as health-related recommendations. They reported that the participants’ preferences varied depending on the politeness strategies and the context.
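A k-armed bandit of this kind can be sketched with sample-average value estimates and ε-greedy selection; `feedback` is a hypothetical stand-in for the user's button input, and the arm set (e.g., candidate linguistic styles) is purely illustrative:

```python
import random

def epsilon_greedy_bandit(feedback, n_arms, steps=500, epsilon=0.1):
    """k-armed bandit driven by explicit user feedback.

    feedback(arm) -> numeric reward (e.g., a button press rating
    the utterance style just used); a hypothetical interface.
    """
    q = [0.0] * n_arms      # estimated value per arm
    n = [0] * n_arms        # pull counts
    for _ in range(steps):
        if random.random() < epsilon:
            a = random.randrange(n_arms)        # explore
        else:
            a = max(range(n_arms), key=lambda i: q[i])  # exploit
        r = feedback(a)
        n[a] += 1
        q[a] += (r - q[a]) / n[a]   # incremental sample-average update
    return q
```

Over repeated interactions, the estimates single out whichever arm the user rewards most often, which is the adaptation behavior the k-armed bandit formulation targets.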

Tseng18 (Tseng18) proposed a model-based RL strategy for a service robot to learn varying user needs and preferences and adjust its behaviors. The proposed reward model shaped the reward through human feedback, specifically by calculating temporal correlations between robot actions and human feedback. The participants provided feedback using an interface. In the user study, ten participants interacted with the ARIO robot in three different experiments: learning the needs and preferences of new users, adapting to changes in user needs and preferences, and learning in different contextual situations. They compared the proposed method with a model-based approach, R-Max (brafman2002r), and a model-free approach, Q-learning. The collected cumulative reward and the acceptance rate of services offered by the robot were higher with the proposed method.

Gamborino18 (Gamborino18) presented an IRL approach for socially assistive robots that support children in emotionally difficult situations. In the proposed method, the human trainer selects actions for the social robot RoBoHoN (a small humanoid smartphone robot) through an interface, with the purpose of improving the mood of the child depending on her/his current affective state. Whereas in previous studies human feedback was converted into the reward signal, in this approach it was instead used for policy shaping. The affective state of the child was based on seven basic facial emotions and engagement, obtained with the Affectiva software (mcduff2016affdex) and stored in an input feature vector to classify the mood of the child as good or bad. The emotions were binarized as 1 or 0 depending on whether their value was greater or less than the average, respectively. The robot suggested a set of actions to the trainer; the aim was for the suggestions to match the trainer’s action selections over the course of training, so that the agent could eventually act standalone. In the user study, 29 children interacted with the robot, and the authors reported that the robot improved the children’s mood, based on a comparison of pre- and post-PANAS-C questionnaire (laurent1999measure) results.

4.1.2. Implicit Feedback

In this approach, similar to the explicit feedback, the human is aware that she/he is guiding the learning by providing feedback to the agent, however, the feedback is not directly provided by using any interface. Rather, the human emotions or verbal instructions act as guidance signals. In this approach, the feedback is not noise-free, since it also depends on the success of perception and recognition of the emotions or speech. We call this type of feedback implicit feedback. The studies in this category are (Thomaz2007)(thomaz08)(gruneberg2012lesson)(gruneberg2013approach)(nejat2008can), and  (patompak2019learning). The details of each paper are given below (for a quick summary, see Table LABEL:summary_table).

Thomaz2007 (Thomaz2007) focused on the asymmetric meaning of positive and negative feedback from human teachers in IRL. In their previous IRL experiments, they observed that human trainers may have multiple intentions when giving a negative reward, such as signaling both that the last action was bad and that future actions should correct the current state. They performed experiments on two different platforms: the Leonardo robot learned to press buttons, and a virtual agent learned to bake a cake (Sophie's kitchen). The virtual agent responded to negative reward by taking an UNDO action, i.e., the opposite action. In the experiments with the Leonardo robot, the human teacher provided verbal feedback; after negative feedback, the robot expected the teacher to guide it through refining the example using speech and gestures (collaborative dialog). They compared the standard IRL algorithm with the proposed one and reported that the proposed algorithm yielded a more efficient and robust exploration strategy: fewer states were visited, fewer failures occurred, and fewer action trials were needed to learn the task.
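The UNDO response can be sketched as a lookup from actions to pre-defined opposites; the action names below are illustrative and are not Sophie's kitchen's actual action set:

```python
# Hypothetical action set with pre-defined opposites (UNDO actions).
OPPOSITE = {"pick_up": "put_down", "put_down": "pick_up",
            "move_left": "move_right", "move_right": "move_left"}

def respond_to_feedback(last_action, feedback):
    """Interpret negative human feedback asymmetrically: it signals both
    'that action was bad' and 'correct the current state', so the agent
    replies with the opposite (UNDO) action."""
    if feedback < 0:
        return OPPOSITE[last_action]
    return None  # positive feedback carries no corrective intent

undo = respond_to_feedback("pick_up", feedback=-1)
```

The key design point is that the sign of the feedback changes not just the value update but the agent's immediate behavior.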

thomaz08 (thomaz08) explored how self-exploration and human social guidance can be coupled to leverage intrinsically motivated active learning. They called the presented approach socially guided exploration, in which the robot could learn from intrinsic motivations but could also take advantage of a human teacher's guidance when available. The robot had two motivational drives: mastery and novelty. In this approach, motivational drives do not feed the reward directly, but rather mediate between learning a new task, practicing a learned task, and exploring the environment. The Leonardo robot learned to perform a number of tasks with a pre-programmed puzzle box, taught by nine naive participants who were unfamiliar with robotics and machine learning. The robot learner with human guidance generalized better to new starting states and reached the desired goal states faster than with self-exploration alone. They also examined the interaction dynamics and human teaching behavior, for example, participants' tendency to wait for eye contact with the robot before saying the next utterance. Another observation was that the participants mirrored the robot's facial expressions and tone of voice in their feedback utterances. The exact type of RL used is not stated explicitly, but the authors describe a goal-oriented hierarchical task model in which the human teacher could suggest an action, label goal states, and also provide feedback. In this sense, the presented approach differs from standard IRL. This paper is categorized under implicit feedback because the human teacher guides by way of speech or gestures, not through an interface.

The authors in (gruneberg2012lesson) and (gruneberg2013approach) investigated the idea of the subjective agent, explained as an agent that processes its inputs and decides on its own actions in a way that exhibits autonomous relations to itself and others. For that purpose, they used IRL with a modified reward mechanism. In the experimental scenarios, two different tasks were described: a robotic arm keeping a pendulum balanced, and a Nao robot sorting balls. For the robotic arm, the human teacher provided two-digit positive and negative feedback through an interface. Since our main focus is social robots, these papers are categorized based on the Nao robot scenario. In that scenario, one human teacher provided positive reward with a smile and negative reward with a frown in a ball sorting task where the goal was to hand red balls to the teacher and to throw green balls away. The agent evaluated the feedback by comparing it to a history of previous feedback received for the same behavior. If the new feedback was consistent with the history, the robot modified its reward function and decided by itself whether and to what extent its behavior should be modified. However, no further implementation or technical details about the RL were provided in the papers.

nejat2008can (nejat2008can) presented preliminary results of a socially assistive system for older people. This system utilized Q-learning, where the user provides verbal feedback to the robot. In the scenario, the robot reminded the user about daily activities while exhibiting affective behaviors based on the user's current affective state, as inferred from their body language.

zarinbal2019new (zarinbal2019new) presented preliminary results with a social robot for text summarization. They used Q-learning for query-based scientific document summarization, with the Nao robot gathering the user's preferences to tune the summary. In the experiment, one user provided reward to the robot through his facial expressions, which were classified as like, dislike, and neutral, helping the robot score the sentences for the summary.

patompak2019learning (patompak2019learning) proposed a dynamic social force model for social HRI. The authors considered two interaction areas: a quality interaction area and a private area. The quality interaction area was defined as the distance from which users can be engaged in high-quality interactions with robots. The proposed model was designed with a fuzzy inference system whose membership parameters were optimized using the R-learning algorithm (schwartz1993reinforcement). R-learning is an average-reward RL approach; it does not discount or divide experience into distinct episodes with finite returns (mahadevan1996average). They argued that R-learning suited the scenario because they intended to take every interaction experience into account equally. The purpose of the proposed system was to navigate the robot to locations where users can be engaged in high-quality interactions without intruding into the private area. They tested the proposed method in simulations and with the Pepper robot. In the real robot experiments, five participants interacted with the Pepper robot and provided positive or negative verbal rewards regarding the social distance, based on whether they felt comfortable or uncomfortable. They reported that the robot learned to interact with the participants at a proper distance, between the boundaries of the interaction and private areas, even though each participant had a different range of preferred distances. The interaction duration per person was not mentioned.
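The R-learning update (schwartz1993reinforcement) can be sketched in a few lines. The state and action names, step sizes, and the greedy test below are illustrative assumptions; the structure of the update, learning action values relative to an average reward rho with no discounting, is the defining feature of the algorithm:

```python
from collections import defaultdict

ACTIONS = ["approach", "retreat", "stay"]   # illustrative robot actions

def r_learning_update(R, rho, s, a, r, s_next, beta=0.1, alpha=0.01):
    """One R-learning step: values are learned relative to the average
    reward rho, with no discounting and no episode boundaries, which
    suits continuing (non-episodic) interaction."""
    best_next = max(R[(s_next, b)] for b in ACTIONS)
    best_here = max(R[(s, b)] for b in ACTIONS)
    R[(s, a)] += beta * (r - rho + best_next - R[(s, a)])
    if R[(s, a)] >= best_here:  # adjust rho only after (near-)greedy actions
        rho += alpha * (r - rho + best_next - best_here)
    return rho

R = defaultdict(float)
rho = 0.0
# Positive verbal reward for moving to a comfortable social distance:
rho = r_learning_update(R, rho, s="far", a="approach", r=1.0, s_next="comfortable")
```

Because rho tracks the long-run average reward, every interaction experience contributes equally, which matches the paper's stated motivation for choosing R-learning.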

4.2. Social Signals Driven Methods

Human social signals, including various body signals and affective states, have been used as reward signals in RL approaches in social robotics. The most widely used signals are human emotions, as emotions have a great influence on decision-making (lerner2015emotion). Computational emotion models have been studied by many researchers in agent decision-making architectures, especially in RL, where studies focus on agent/robot emotions (moerland2018emotion). Since emotions also play an important role in social interaction, communication, and social robotics (fong2003survey), various studies consider these aspects in RL for social robotics. As opposed to the implicit feedback described in the previous section, in this approach the human interactant is not aware that her/his feedback is used in the learning process. The studies under this category are (mitsunaga2006robot)(mitsunaga2008adapting)(leite2011modelling)(addo2014applying)(Chiang2015)(gordon2016affective)(Ritschel2017b)(Weber2018a)(martins2019), and  (ramachandran2019). The details of each paper are given below (for a quick summary, see Table LABEL:summary_table).

The authors in (mitsunaga2006robot) and (mitsunaga2008adapting) proposed an adaptive robotic system based on Policy Gradient Reinforcement Learning (PGRL). The robot Robovie II was taught to adjust its behaviors (i.e., proxemics zones, eye contact ratio, waiting time between utterance and gesture, and motion speed) according to human discomfort signals (i.e., the amount of body repositioning and the time spent gazing at the robot). These discomfort signals were used as a reward, and the goal of the robot was to minimize them, thereby reducing the discomfort experienced by the human interactant.
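PGRL of this kind operates on continuous behavior parameters rather than discrete actions. A minimal sketch of the underlying idea follows, with a toy quadratic discomfort model standing in for the measured body signals; all names, constants, and the toy model are assumptions, not the papers' implementation:

```python
def finite_diff_gradient(params, discomfort, eps=0.05):
    """Central-difference estimate of how each behavior parameter
    (e.g., interaction distance) affects the measured discomfort."""
    grad = []
    for i in range(len(params)):
        up = list(params); up[i] += eps
        dn = list(params); dn[i] -= eps
        grad.append((discomfort(up) - discomfort(dn)) / (2 * eps))
    return grad

# Toy model: the person is most comfortable at a 1.0 m distance.
discomfort = lambda p: (p[0] - 1.0) ** 2
params = [1.8]                                        # current interaction distance
grad = finite_diff_gradient(params, discomfort)
params = [p - 0.1 * g for p, g in zip(params, grad)]  # descend to reduce discomfort
```

Each gradient step nudges the behavior parameters in the direction that lowers the observed discomfort signal, which is exactly the minimization objective described above.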

leite2011modelling (leite2011modelling) used a multi-armed bandit for empathetic supportive strategies in the context of a chess companion robot for children. The difference in the probabilities of the user being in a positive mood before and after employing a supportive strategy was used as a reward. The child's affective state was calculated using visual facial features (smile and gaze) and contextual features of the game (game evolution, i.e., winning/losing, and chessboard configuration). In a between-subject user study, 40 elementary school students interacted with the iCat robot. The conditions were empathetic (empathetic strategies based on the adaptive algorithm), neutral (no empathetic strategies), and random empathetic (randomly selected empathetic strategies). In the empathetic conditions, the robot provided empathetic supportive responses through facial expressions, encouraging comments, or prosocial actions. The authors reported that children perceived the robot in both empathetic versions as engaging and helpful, and discussed that long-term interaction would be necessary to see the effects of the empathetic robot.
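A mood-difference reward fits naturally into a standard bandit update. The sketch below is a generic epsilon-greedy bandit, not the paper's exact algorithm; the strategy names and epsilon value are illustrative assumptions:

```python
import random

class MoodBandit:
    """Epsilon-greedy multi-armed bandit: each arm is a supportive
    strategy; the reward is the change in the probability that the
    child is in a positive mood after the strategy is applied."""
    def __init__(self, strategies, epsilon=0.1):
        self.values = {s: 0.0 for s in strategies}
        self.counts = {s: 0 for s in strategies}
        self.epsilon = epsilon

    def select(self):
        if random.random() < self.epsilon:
            return random.choice(list(self.values))   # explore
        return max(self.values, key=self.values.get)  # exploit

    def update(self, strategy, mood_before, mood_after):
        reward = mood_after - mood_before   # difference in P(positive mood)
        self.counts[strategy] += 1
        n = self.counts[strategy]
        self.values[strategy] += (reward - self.values[strategy]) / n

bandit = MoodBandit(["encourage", "facial_expression", "prosocial_action"])
bandit.update("encourage", mood_before=0.4, mood_after=0.7)
```

Because the reward is a before/after difference, a strategy is only credited when it actually shifts the estimated mood, not merely when the child happens to be in a good mood.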

addo2014applying (addo2014applying) employed a torso Nao robot for entertaining a human audience by telling jokes. After each joke, the human participant provided a verbal feedback (i.e., reward) such as “very funny”, “funny”, “indifferent” and “not funny”. They used Q-learning where the actions of the robot were pre-classified jokes. The robot interacted with 14 participants, and the authors mentioned that the participants’ mood improved progressively.
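The joke-selection setting maps directly onto tabular Q-learning once the verbal ratings are converted to numbers. The reward mapping, state names, and hyperparameters below are illustrative assumptions; only the use of Q-learning with verbal feedback as reward comes from the paper:

```python
from collections import defaultdict

# Illustrative mapping from verbal ratings to numeric rewards.
FEEDBACK_REWARD = {"very funny": 2.0, "funny": 1.0,
                   "indifferent": 0.0, "not funny": -1.0}
JOKES = ["pun", "one_liner", "knock_knock"]   # pre-classified joke categories

Q = defaultdict(float)

def q_update(state, joke, feedback, next_state, alpha=0.2, gamma=0.9):
    """Tabular Q-learning step: bootstrap on the best next joke's value."""
    r = FEEDBACK_REWARD[feedback]
    best_next = max(Q[(next_state, j)] for j in JOKES)
    Q[(state, joke)] += alpha * (r + gamma * best_next - Q[(state, joke)])

q_update(state="audience_neutral", joke="pun", feedback="funny",
         next_state="audience_amused")
```

After this single update, the pun's value in the neutral-audience state moves a fraction (alpha) of the way toward the observed reward.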

Chiang2015 (Chiang2015) proposed a Q-learning based approach for personalizing the interruption strategies of the human-like robot ARIO based on the user's attention and the robot's belief about the person's awareness of it (which the authors called the robot's theory of awareness). They formulated the problem around user attention as a Human-Aware Markov Decision Process. Human attention was estimated with a trained Hidden Markov Model (HMM) from human social cues (face direction, body direction, and voice detection). The reward consisted of predefined numerical values based on the robot's theory of awareness about the attention and engagement level of the user. The robot had six actions (gestures: head shake and arm wave; navigation: approach the user and move around; audio: make a sound and call the user's name) to draw the user's attention while the user was reading. Five participants interacted with the robot; after two hours of interaction, the optimal policy had converged. The robot developed personalized policies for each user depending on their interruption preferences.

In the work by  (gordon2016affective) an affective tutoring system for children was presented. The system included an Android tablet and the Tega robot setup integrated with the Affectiva software (mcduff2016affdex) for facial emotion recognition. They used the State-Action-Reward-State-Action (SARSA) algorithm where the reward was a weighted sum of engagement and valence values obtained from the Affectiva software. They employed the robot as a second language tutor that helped children in learning new Spanish words throughout three sessions. The authors compared two conditions: the robot with personalized motivational strategies and a non-personalized robot with a fixed policy. In the experiments, 27 children interacted with the Tega robot. They reported that the personalized robot resulted in increased long-term positive valence.
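SARSA differs from Q-learning in that it bootstraps on the action actually chosen next rather than the greedy one. The sketch below shows one SARSA step with a weighted engagement/valence reward; the weights, state and action names, and hyperparameters are illustrative assumptions, not values from the paper:

```python
from collections import defaultdict

Q = defaultdict(float)

def affective_reward(engagement, valence, w_e=0.5, w_v=0.5):
    """Weighted sum of engagement and valence (weights are assumptions)."""
    return w_e * engagement + w_v * valence

def sarsa_update(s, a, r, s_next, a_next, alpha=0.1, gamma=0.95):
    """On-policy SARSA update: uses Q(s', a') for the action the policy
    actually selected next, in contrast to Q-learning's max over actions."""
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])

r = affective_reward(engagement=0.8, valence=0.6)
sarsa_update(s="word_intro", a="praise", r=r, s_next="word_quiz", a_next="hint")
```

Being on-policy makes SARSA sensitive to the exploratory behavior the tutor actually exhibits during the sessions, which can matter when exploration itself affects the child's affective state.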

Ritschel2017b (Ritschel2017b) presented a social-cues-driven Q-learning approach for adapting the Reeti robot to keep the user engaged during the interaction. The user's engagement was estimated from the user's movement, captured through a Kinect 2 sensor, using a Dynamic Bayesian Network. The change in engagement was used as a reward in a storytelling scenario to adapt the robot's utterances to the personality of the user. The user study was not described in detail and the number of participants was not specified.

Similar to (gordon2016affective), park2019model (park2019model) used the Tega robot as a language learning companion for young children. A personalized policy was trained through 6-8 sessions of interaction using a tabular Q-learning algorithm. In the scenario, each child interacted with the robot one-on-one, with the child and the robot telling stories to each other. The reward function was a weighted sum of engagement and learning, where the engagement depended on the child's affective arousal value, divided into four quartiles, obtained using the Affectiva software (mcduff2016affdex). They compared children's engagement and learning outcomes in three groups: a personalized robot, a fixed-curriculum non-personalized robot, and a baseline group without any robot intervention. They reported that the children in the personalized group were more engaged and learned more compared to the other groups. The children's body pose was analyzed with OpenPose (cao2017realtime); they noted that electrodermal activity (EDA) collected with the E4 armband and a forward-leaning pose were correlated with engagement.

Weber2018a (Weber2018a) incorporated social signals, namely the participant's vocal laughs and visual smiles, as a reward in the learning process. Their purpose was to learn the user's humor preferences in an unobtrusive manner in order to improve the engagement skills of the robot. In a joke-telling scenario, the Reeti robot adapted its sense of humor (grimaces, sounds, three kinds of jokes, and their combinations) using Q-learning with a linear function approximator whose weights were initialized from a normal distribution. They compared the adaptive robot and a non-adaptive robot (telling random jokes) in between-subject experiments with 24 participants in total. They reported that, based on the subjective data and the measured amusement level, the adaptive robot entertained the participants better than the baseline condition.

In addition to the articles outlined above, there are articles using Partially Observable Markov Decision Process (POMDP) based techniques for social robots (martins2019) (ramachandran2019). POMDPs are an extension of MDPs in which the current state is not fully observable; instead, the agent maintains a belief about being in a particular state, which is updated after each action and observation. The work by martins2019 (martins2019) presents a user-adaptive decision-making technique based on a simplified version of model-based RL and a POMDP formulation. In their method, the values of robot actions depended on the actions' influence on the user. The robot models the user's state with the goal of keeping the user in a positive state, performing an action and estimating the resulting user status in each iteration. They tested the proposed method in a scenario where the state space consists of both user-related features, such as user satisfaction and health, and robot-related features, such as speaking volume and proximity to the user, in a simulated environment and also with the GrowMu social robot and human users. However, the user study is not detailed; the procedure and the number of participants are not specified. In their simulation results, the system was able to take actions that kept the user in a “valuable” state.
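The POMDP belief update mentioned above has a standard Bayesian form: b'(s') is proportional to O(o | s', a) multiplied by the transition-weighted prior, then normalized. The sketch below implements that update with an illustrative two-state user model; all probabilities, state names, and the "greet"/"smile" labels are assumptions:

```python
def belief_update(belief, action, observation, T, O):
    """Bayesian POMDP belief update:
    b'(s') is proportional to O(o | s', a) * sum_s T(s' | s, a) * b(s)."""
    new_belief = {}
    for s_next in belief:
        predicted = sum(T[(s_next, s, action)] * belief[s] for s in belief)
        new_belief[s_next] = O[(observation, s_next, action)] * predicted
    norm = sum(new_belief.values())
    return {s: v / norm for s, v in new_belief.items()}

# Illustrative two-state user model (all probabilities are assumptions).
T = {("satisfied", "satisfied", "greet"): 0.8,
     ("unsatisfied", "satisfied", "greet"): 0.2,
     ("satisfied", "unsatisfied", "greet"): 0.4,
     ("unsatisfied", "unsatisfied", "greet"): 0.6}
O = {("smile", "satisfied", "greet"): 0.9,
     ("smile", "unsatisfied", "greet"): 0.2}

belief = belief_update({"satisfied": 0.5, "unsatisfied": 0.5}, "greet", "smile", T, O)
```

Starting from a uniform belief, observing a smile after greeting shifts most of the probability mass onto the "satisfied" state, which is the mechanism such systems use to act on a hidden user state.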

In the work by ramachandran2019 (ramachandran2019), a robot tutor modeled students' engagement and knowledge states and selected appropriate assistive actions using a POMDP framework. They tested the proposed approach in a between-subject study with 28 students over five sessions, where 14 students interacted with a robot following a fixed assistive action selection policy and 14 students interacted with a robot featuring a POMDP-based action selection policy. Their user study results showed that personalized tutoring with the POMDP-based action selection policy improved learning gains compared to the fixed policy.

4.3. Intrinsically Motivated Methods

It is a common approach to examine biological and psychological decision-making mechanisms and to adopt similar methods for autonomous systems. One such approach is combining intrinsic motivation with reinforcement learning. Intrinsic motivation is a concept in psychology that denotes an internal natural drive to explore the environment and to gain new knowledge and skills; the activities are done for their inherent satisfaction rather than external rewards (ryan2000intrinsic). Researchers have proposed computational approaches that use intrinsic motivation (oudeyer2009intrinsic). In intrinsically motivated RL, the main idea is to use intrinsic motivations as a form of reward (chentanez2005intrinsically). Although there are different intrinsic motivation models within the RL framework (oudeyer2008can), the majority of the studies in social robotics depend on the idea of maintaining the internal well-being of the robot. We describe these in detail in the following subsection.

Silva12 (Silva12) presented a robotic architecture that included an artificial motivational system driving the robot's behaviors to satisfy its intrinsic needs, so-called necessities. The motivational system comprised necessity units implemented as simple perceptrons with recurrent connections. They compared the performance of three different RL algorithms for shared attention in social robotics: contingency learning, Q-learning, and Economic TG (ETG). ETG is a relational RL algorithm that incorporates a tree-based method to store examples (da2009relational). Shared attention is a form of communication in which an individual's attention to an object or event can be observed by another individual; it involves a sequence of steps: mutual gaze, gaze following, imperative pointing, and declarative pointing. In the experiments, they used a robotic head (WHA8030 of Dr. Robot). Because ETG performed better in the simulation experiments, it was used in the real-world experiments, where one of the authors interacted with the robotic head.

Qureshi2018 (Qureshi2018) proposed an intrinsically motivated RL approach for a humanoid robot learning human-like social skills. In their previous paper (Qureshi2016), the reward was based on the success of the robot's handshaking attempt, obtained from a touch sensor attached to the Pepper robot's right hand. In (Qureshi2018), the proposed method utilized three basic events to represent the current state of the interaction: eye contact, smile, and handshake. The occurrence of these events at the next time step was predicted from the state-action pair by a neural network called Pnet. Another neural network, called Qnet, was employed for the action selection policy guided by the intrinsic reward. The reward was the prediction error of Pnet, i.e., the error between the actual occurrence of events and Pnet's prediction. They compared the total reward collected in three days of experiments in a public place, each day following a different policy (random policy, Qnet policy, and the previously employed method (Qureshi2016)). The proposed model led to more human-like behaviors; they reported that the robot captured the intention-depicting elements of human behaviors (e.g., body language, walking trajectory, or any ongoing activity).
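The prediction-error reward itself is simple to express. The squared-error form below is an assumption (the paper defines the reward as Pnet's prediction error without committing to this sketch's exact formula); the three components stand for the eye contact, smile, and handshake events:

```python
def intrinsic_reward(predicted, actual):
    """Prediction-error intrinsic reward: the event predictor's error on
    binary social events (eye contact, smile, handshake) is largest in
    situations the model has not yet mastered, driving exploration there."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual))

# The predictor considered a handshake unlikely, but one occurred:
r = intrinsic_reward(predicted=[0.9, 0.5, 0.1], actual=[1, 0, 1])
```

Surprising interactions (here, the unexpected handshake) yield large rewards, so the policy is pulled toward social situations where the event model still has something to learn.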

4.3.1. Homeostasis-based Methods

Homeostasis, as defined by cannon1939wisdom, refers to a continuous process of maintaining an optimal internal state in the physiological condition of the body for survival (cannon1939wisdom).  berridge2004motivation explains homeostasis motivation with a thermostat example that behaves as a regulatory system by continuously measuring the actual room temperature and comparing it with a predefined set point, and activating the air conditioning system if the measured temperature deviates from the predefined set point (berridge2004motivation). In the same manner, the body maintains its internal equilibrium through a variety of voluntary and involuntary processes and behaviors.

The majority of existing literature on homeostasis-based RL in social robotics is presented by the same research group (perula2019bioinspired) (maroto2018bio) (castro2014learning) (castro2013autonomous) (Castro-Gonzalez2011) (malfaz2011biologically). These studies introduced a biologically inspired approach that depends on homeostasis. The robot's goal was to keep its well-being as high as possible while considering both internal and external circumstances. The common theme in these studies is that the robot has motivations and drives (needs), where each drive is connected to a motivation. These motivations serve as action stimulation to satiate the drives. A drive can be seen as a deficit that leads the agent to take action in order to reduce this deficit and maintain internal equilibrium. The ideal value for a drive is zero, corresponding to the absence of need. The robot learns how to act in order to maintain its drives within an acceptable range.
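The drive/well-being mechanism common to these studies can be sketched as follows. The drive names, the well-being scale, and the reward-as-change-in-well-being formulation are illustrative assumptions consistent with the description above, not the group's actual implementation:

```python
class HomeostaticRobot:
    """Drives measure the deficit of each need (ideal value: 0); well-being
    falls as drives grow, and the reward is the change in well-being."""
    def __init__(self, drives):
        self.drives = dict(drives)   # e.g., {"interaction": 3.0, "rest": 1.0}
        self.ideal = 10.0            # well-being when no need is unmet (assumed scale)

    def well_being(self):
        return self.ideal - sum(self.drives.values())

    def step(self, effects):
        """Apply an action's effect on each drive; return the reward."""
        before = self.well_being()
        for drive, delta in effects.items():
            self.drives[drive] = max(0.0, self.drives[drive] + delta)
        return self.well_being() - before

robot = HomeostaticRobot({"interaction": 3.0, "rest": 1.0})
reward = robot.step({"interaction": -2.0})  # playing with the user satiates the drive
```

An action that satiates a drive raises well-being and therefore yields a positive reward, so standard RL machinery (e.g., Q-learning, as in these papers) can learn which actions keep the drives near zero.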

In (maroto2018bio), the social robot Mini tried to maintain its well-being in a game-playing scenario by interacting with the user when the user was close to the robot. The robot had two motivations, social and relaxed, with two corresponding drives: interaction and rest, respectively. The authors used Q-learning, in which the reward was the variation of the robot's well-being. The limitation of this study is that the algorithm was tested with only one participant. Likewise, in (perula2019bioinspired), the robot Mini learned different policies for each user in an educational game scenario. In this paper, however, the robot had an additional motivation, please, whose corresponding drive was the user's satisfaction. Two participants interacted with the social robot Mini and rated their satisfaction through an interface; low ratings increased the user's satisfaction drive and high ratings decreased it. In (castro2014learning)(castro2013autonomous) and (Castro-Gonzalez2011), a variation of the traditional Q-learning algorithm was used in addition to the homeostasis-based approach. The authors referred to the proposed algorithm as “Object Q-learning”. It was implemented on the social robot Maggie, who lived in a laboratory and interacted with several objects in the environment (e.g., a music player, a docking station, or humans). In order to reduce the state space, the robot learned what to do with each object without considering its relation to other objects; in other words, they assumed that executing an action associated with a certain object does not affect the state of the robot in relation to the rest of the objects. (castro2013autonomous) is closely linked to the other papers, with one difference: a discrete emotion (fear) was used as one of the motivations. Unlike other motivation-drive pairs, no drive was associated with the fear ‘motivation’ (i.e., fear is not a deficiency of any need). Instead, the fear ‘motivation’ was linked to dangerous situations (situations that could damage the robot) and directed the robot toward a secure state. For example, the motivation social was not updated if a user who occasionally hit the robot was nearby.

4.4. Task Performance Driven Methods

Task performance is the effectiveness with which an agent performs the given task, and the performance metrics can vary for different tasks. In these methods, the design of the reward function is based on task-driven measures, which often include some problem-specific information, especially the task performance of the robot, task performance of the human, or task performance of both the robot and the human.

4.4.1. Human Task Performance Driven Methods

In these methods, the reward function is based on the user’s success in the task related to the interaction with the robot. The studies in this category are (Tapus2008)(Tsiakas18) and  (gao2018robot). For a quick summary, see Table LABEL:summary_table.

The work in (Tapus2008) presented a Policy Gradient Reinforcement Learning (PGRL) based approach for a non-contact robot used during rehabilitation exercises, investigating the role of the robot's personality. The robot's duty was monitoring, assisting, encouraging, and socially interacting with post-stroke users. The robot, an ActiveMedia Pioneer 2-DX mobile robot, adapted its personality (by changing interaction distance, speed and amount of movement, and vocal content: what the robot says and how it says it) as a function of the user's extroversion-introversion level, with the purpose of improving the user's task performance. The reward function was based on user performance, measured as the number of movements performed and/or time-on-task, tracked by the robot through a lightweight motion-capture system worn by the user. The user studies involved 19 participants, and the results showed that extroverts seemed to prefer challenging vocal content, whereas introverts favored more nurturing vocal content. Users tended to spend more time with the robot when the robot's personality matched their own.

The work by  Tsiakas18 (Tsiakas18) employed the Nao robot for personalized cognitive training. The reward function was based on the task-related user parameters such as task performance and task engagement, which was tracked through an EEG. They used Q-learning for adapting the robot’s feedback and task difficulty in order to maximize the user’s performance as well as engagement. They collected data from 50 participants interacting with the Nao robot. Using the collected data, they created simulated users to train the RL algorithm. They noted that the user task engagement was a useful piece of information for adaptation.

The work in (gao2018robot) involved a robot tutor for helping users solve logic puzzles. The Pepper robot adapted its verbal supportive behaviors to maximize the user's task performance. They used the user's task performance together with the user's verbal feedback as the reward for a multi-armed bandit RL algorithm. They conducted a between-subjects user study which showed that participants may not always prefer personalized behavior; in this study, participants preferred more varied behavior.

4.4.2. Robot Task Performance Driven Methods

In these methods, the reward design depends on the robot's task performance. Examples of task performance measures include robot behaviors that satisfy user preferences, accurate completion of the task, finishing the task within a desired amount of time, and visiting certain states. The studies in this category are (Ranatunga2011)(Keizer2014)(Qureshi2016)(papaioannou2017hybrid)(hemminghaus2017)(lathuiliere2018)(Chen2018)(lathuiliere2019)(ritschel2018drink), and (cuayahuitl2019data). The details of each paper are given below (for a quick summary, see Table LABEL:summary_table).

Ranatunga2011 (Ranatunga2011) proposed the use of RL for human-like head-eye coordination behavior in the social robot Zeno. They sought to use the robot for the purpose of visually engaging patients with cognitive impairments, especially for the treatment of sensorimotor impairments. The reward function was based on the head and eye kinematic scheme of the robot (head angular velocity, eye angular displacement, the distance of an object to be tracked from the eyes, etc.). The system was tested with only one participant and the authors mentioned that the system managed to keep the participant engaged during the interaction.

In Keizer2014 (Keizer2014), a robot bartender was presented as a system that supported simultaneous interactions with multiple users. The system applied a range of ML techniques and included a modified iCat robot (with additional manipulator arms with grippers) and multimodal input sensors for tracking facial expressions, gaze behavior, body language, and the location of the users in the environment. The presented system included two main components: a Social State Recognizer (SSR) and a Social Skills Executor (SSE). The SSR handled vision and speech input to maintain a model of the social state. Two SSR implementations were presented: a hand-coded rule-based system informed by human-human interaction data, and a model trained with supervised learning techniques on a multimodal corpus. The SSE was an RL-based system, which received a state update from the SSR and generated a response by selecting the appropriate robot behavior. The reward function was a weighted sum of task-related parameters (e.g., whether the drink was served, whether the correct drink was served, the system's attention to the current user, etc.). An experimental evaluation was conducted with 37 participants to compare the hand-coded and the trained system. The authors reported that the trained SSR performed better and was faster at detecting user engagement than the hand-coded SSR, while the latter was more stable. However, the two SSR systems obtained similar subjective scores.

Qureshi2016 (Qureshi2016)

proposed to use a Deep Q-Network (DQN) for a social robot greeting people based on social norms. In their work, they succeeded to map two different visual input sources (RGBD camera and webcam of the Pepper robot) to discrete actions (waiting, looking towards the human, hand waving and handshaking) of the robot. The reward was obtained from a touch sensor located on the robot’s right hand to detect handshaking. The robot received a predefined numerical reward (1 or -0.1) based on a successful or unsuccessful handshake. The proposed multimodal DQN consisted of two identical streams of Convolutional Neural Networks (CNN), one for grayscale frames and another for the depth frames, for action-value function estimation. The grayscale and depth images were processed independently, and the Q-values from both streams were fused for selecting the best possible action. In this method, there were two phases: the data generation phase and the training phase. In the data generation phase, the Pepper robot interacted with the environment and collected data. After this phase, the training phase began. This two-stage algorithm was useful in that it did not pause the interaction for training. 

(Qureshi2016) used 14 days of interaction data, where each day of the experiment corresponded to one episode. The same authors applied another variation of DQN, the Multimodal Deep Attention Recurrent Q-Network (MDARQN), to the same handshaking scenario to make the robot's actions perceivable (qureshi2017show). This study (qureshi2017show) differed from the previous one (Qureshi2016) by an additional recurrent attention model, which enabled the Q-network to focus on selected regions of the input and also provided computational benefits by reducing the number of training parameters.
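The dual-stream Q-value fusion and the sparse handshake reward can be sketched as follows. This is a minimal plain-Python sketch: the action names, the Q-values, and the averaging fusion rule are illustrative assumptions, not the paper's exact implementation.

```python
# Hypothetical discrete action set of the greeting robot.
ACTIONS = ["wait", "look_towards_human", "hand_wave", "handshake"]

def fuse_and_select(q_gray, q_depth):
    """Average the Q-values from the grayscale and depth streams,
    then greedily pick the action with the highest fused value."""
    fused = [(g + d) / 2.0 for g, d in zip(q_gray, q_depth)]
    best = max(range(len(fused)), key=fused.__getitem__)
    return ACTIONS[best], fused

def handshake_reward(success):
    """Sparse predefined reward from the touch sensor: 1 or -0.1."""
    return 1.0 if success else -0.1

action, _ = fuse_and_select(q_gray=[0.1, 0.4, 0.2, 0.9],
                            q_depth=[0.3, 0.2, 0.1, 0.7])
```

Here the two streams would in practice be the outputs of the two CNNs; averaging is only one plausible fusion choice.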

Papaioannou et al. (papaioannou2017hybrid) presented an RL approach for task-based dialogue on the Pepper robot, where the robot assisted visitors of a shopping mall by providing information about the shops, directions to the shops, current discounts, etc. The system was tested with 41 participants. The reward consisted of predefined numerical values based on the robot's task completion, including the user's engagement. In the user study, they compared a robot with task-based dialogue only to a robot that combined task-based dialogue with chat. Users preferred the hybrid system (task-based dialogue with chat) and spent more time with it than with the task-based-only system.

hemminghaus2017 (hemminghaus2017) used Q-learning to adapt the behaviors of the robot head Furhat in a memory game scenario. They compared learning-based and random behaviors in a user study in which 26 participants played a memory game with eighteen cards showing pictures of leaf contours. The reward depended on how successful the robot's action was (whether it helped the user to find the correct pair of cards) and on the execution cost of the action the robot had taken. Each action type (e.g., gaze, facial expression, head gesture, and speech) had a predefined cost. They reported that users needed less time to solve the memory game when they interacted with the adaptive robot compared with the nonadaptive one.
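A reward of this form (task success minus a per-action execution cost) combined with a tabular Q-learning update might look like the following sketch; the state names, cost values, and learning parameters are illustrative assumptions, not values from the paper.

```python
from collections import defaultdict

# Hypothetical execution costs per action modality.
ACTION_COST = {"gaze": 0.1, "head_gesture": 0.2,
               "facial_expression": 0.2, "speech": 0.4}

def reward(action_helped_user, action):
    """Success signal minus the predefined cost of the executed action."""
    return (1.0 if action_helped_user else 0.0) - ACTION_COST[action]

def q_update(Q, state, action, r, next_state, alpha=0.1, gamma=0.9):
    """Standard tabular Q-learning update."""
    best_next = max(Q[next_state].values()) if Q[next_state] else 0.0
    Q[state][action] += alpha * (r + gamma * best_next - Q[state][action])

Q = defaultdict(lambda: defaultdict(float))
r = reward(True, "speech")              # costly but helpful hint
q_update(Q, "user_stuck", "speech", r, "pair_found")
```

With such a cost term, the learned policy tends to prefer the cheapest action that still helps the user.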

In (lathuiliere2018), the reward function was defined over the number of visible people (face reward) and the presence of speech sources in the camera field of view (speaker reward), observed from the temporal sequence of camera and microphone observations. The robot turned its head (possible actions were rotate left, right, up, down, or stay still) towards the users to maximize the number of observed faces and speech sources in the camera's field of view. The authors combined Q-learning with a Long Short-Term Memory (LSTM) network that fused audio and visual data to control the gaze of the robotic head and direct it towards targets of interest. The proposed DRL model was trained in a simulated environment with simulated people moving and speaking, and on the publicly available AVDIAR dataset. In this offline training, they compared the reward obtained with four different networks: early fusion and late fusion of audio and video data, as well as audio-only and video-only variants. Two participants took part in live experiments with the Nao robot, for which the late-fusion network was selected. The reward function included an adjustment parameter that weighted the speaker reward so as to favor speaking people.
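The shape of such a reward can be sketched as below; the weighting value is an illustrative assumption (the paper defines its own adjustment parameter).

```python
def gaze_reward(n_faces, n_speakers, alpha=1.5):
    """Sum of the face reward and the speaker reward, where alpha > 1
    weights the speaker reward so the robot favors speaking people."""
    return n_faces + alpha * n_speakers

# Turning towards two silent faces vs. one speaking person:
silent_pair = gaze_reward(n_faces=2, n_speakers=0)
one_speaker = gaze_reward(n_faces=1, n_speakers=1)
```

With alpha above 1, a single speaker outweighs an extra silent face, which is the intended bias.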

(lathuiliere2019) extended the study presented in (lathuiliere2018) by investigating the impact of the discount factor, the window size (the number of past observations that affect the decision), and the LSTM network size. They reported that in the experiments with the AVDIAR dataset, high discount factors were prone to overfitting, whereas in the simulated environment low discount factors resulted in worse performance. Smaller window sizes accelerated training, but larger window sizes performed better in the simulated environment. Changing the LSTM size did not make a substantial difference in the results.

Chen2018 (Chen2018) proposed a multi-robot system for providing service in a drinking-at-a-bar scenario. There were three robots: an information robot (for understanding customers' emotions and intentions), a music robot (for playing music based on customers' emotions), and a waiter robot (for selecting drinks based on customers' intentions). The authors used a modified Q-learning algorithm combined with fuzzy inference, called information-driven fuzzy friend-Q (IDFFQ) learning, for understanding the user's emotional intention and adapting the behaviors of the multi-robot system accordingly. The reward was based on task completion (i.e., the robots selected the drink the user preferred) and on human satisfaction with the robots' task performance. Seven basic facial emotions (happiness, neutral, sadness, surprise, fear, disgust, and anger) were recognized through facial action units. User identification included the customer's religion (i.e., if the customer is Muslim, the robots do not offer alcoholic drinks). Simulation experiments were performed on data collected from eight participants. Fuzzification of emotions used triangular and trapezoidal membership functions in the pleasure-arousal plane. The authors simulated the bar environment in their laboratory and conducted experiments with twelve participants. Compared with their previously proposed algorithm, they noted that the new algorithm performed better (it collected more reward, and the robots' response times were much faster).

ritschel2018drink (ritschel2018drink) employed the social robot Reeti as a nutrition adviser whose purpose was to convince the user to select a healthy drink. The robotic system included a social robot and custom hardware for obtaining information about the selected drink (i.e., its quantity and nutritional values). The robot interacted with 78 participants at a public event. The problem was formalized as an n-armed bandit problem in which the robot's actions were scripted pieces of spoken advice. The reward depended on the calorie content and quantity of the selected drink, which in turn reflected the robot's performance in convincing the user to select a healthy drink.
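An n-armed bandit formulation of this kind can be sketched as follows, with epsilon-greedy selection and incremental sample-average value estimates; the arm semantics and all parameter values are illustrative assumptions.

```python
import random

class AdviceBandit:
    """Each arm corresponds to one scripted piece of spoken advice."""

    def __init__(self, n_arms, eps=0.1):
        self.counts = [0] * n_arms
        self.values = [0.0] * n_arms   # estimated mean reward per arm
        self.eps = eps

    def select_arm(self):
        if random.random() < self.eps:              # explore
            return random.randrange(len(self.values))
        return max(range(len(self.values)),         # exploit
                   key=self.values.__getitem__)

    def update(self, arm, reward):
        """Incremental sample-average update of the arm's value estimate."""
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

bandit = AdviceBandit(n_arms=3, eps=0.0)  # eps=0 for a deterministic example
bandit.update(1, 0.8)                     # advice 1 led to a healthy choice
```

In the actual scenario the reward would be computed from the calories and quantity of the drink the user ultimately chose.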

cuayahuitl2019data (cuayahuitl2019data) proposed a method for training a social robot to play multimodal games. In their scenario, human participants played 'Noughts and Crosses' on two different grids (small and big) against the Pepper robot. They used a CNN to recognize game moves, i.e., handwriting on the grid. These visual perceptions and the participant's verbal utterances were given as input to their modified DQN. The proposed DQN algorithm refines the action set at each step so that the agent learns to infer the effects of its actions (such as selecting actions that lead to winning or avoid losing). The reward consisted of predefined numerical values based on the robot's performance in the game: the robot received the highest reward when 'about to win' or 'winning', and the lowest reward when 'about to lose' or 'losing'. The RL approach was trained in simulation, then deployed and tested on the real Pepper robot with human participants. The game scenario was tested over four non-consecutive days with 130 participants in total. The performance of the algorithm improved over time, and the author reported that the proposed method required less data as the size of the grid increased.

4.4.3. Ensemble Human and Robot Task Performance Driven Methods

In these methods, the reward function depends on both the human’s and the robot’s task performance.

For example, in (chan2012) and (chan2011learning) the robot received the highest reward if the user completed the task successfully, while the robot was also rewarded for actions that suited the current situation. Likewise, in (moro18), the robot was rewarded for actions that transitioned the user into a desirable state (e.g., completing the activity). Positive reward was provided when the user was focused on the activity and correctly finished the corresponding activity in the task.
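A minimal sketch of such an ensemble reward, combining the user's task success with the appropriateness of the robot's own action, is given below; the weight values are illustrative assumptions, not values from the cited works.

```python
def ensemble_reward(user_completed_task, robot_action_suitable,
                    w_user=1.0, w_robot=0.5):
    """Reward both the human's task performance and the suitability
    of the robot's action for the current situation."""
    r = w_user if user_completed_task else 0.0
    r += w_robot if robot_action_suitable else -w_robot
    return r

# Highest reward: the user succeeded and the robot's action fit the situation.
best = ensemble_reward(True, True)
worst = ensemble_reward(False, False)
```

The two weights control how strongly the robot is credited for the user's outcome versus its own behavior.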

The socially assistive robot Brian 2.0 was employed as a social motivator, giving assistance, encouragement, and celebration in a memory game scenario (chan2012; chan2011minimizing; chan2011learning). In the scenario, the participants interacted with the robot one-to-one, with the objective of finding the matching pictures in a memory card game (a 4x4 grid of 16 picture cards). The robot's behaviors were adapted using the MAXQ method to reduce the activity-induced stress of the user. MAXQ is a hierarchical approach that decomposes the target problem into smaller subproblems by decomposing the value function of an MDP into combinations of value functions of smaller constituent MDPs (dietterich2000hierarchical). The authors argued that the MAXQ algorithm was suitable for memory game scenarios due to its temporal, state, and sub-task abstractions; these abstractions also helped to reduce the number of Q-values that needed to be stored. The detailed system was presented in (chan2012). The system used three different types of sensory information: a noise-canceling microphone for recognizing human verbal actions, an emWave ear-clip heart rate sensor for the user's affective arousal level, and a webcam for monitoring the activity state (i.e., whether matching card pairs were found). The authors used a two-stage training process involving offline training followed by online training. The purpose of the first stage was to determine the optimal behaviors for the robot with respect to the card game; this offline training was carried out on a human user simulation model created from the interaction data of ten participants. In the second stage, they aimed to personalize the robot to the user's state (affective arousal and game state) through online interactions with different participants. The affective arousal and user activity state together formed the user state (e.g., stressed: high arousal and no matching card; pleased: low arousal and matching card). The success of the robot's actions was judged by whether the person's user state improved from a stressed state to a stress-free state.

moro18 (moro18) is closely linked to (chan2012; chan2011learning) in the sense that the robot adapts itself to the user's affective state or level of cognition. moro18 (moro18) proposed an algorithm combining Learning from Demonstration (LfD) and Q-learning to personalize robot behavior to a user's cognitive abilities. The scenario was an assistive tea-making activity for older people with dementia. Fifteen graduate students from healthcare fields were asked to demonstrate preparing a cup of tea in a kitchen as if they were helping an older person with dementia. During these demonstrations, their movements and speech were recorded, and the participants' behaviors (e.g., pointing gestures) were mapped onto the Casper robot. These learned behaviors were then labeled according to their verbal and nonverbal content (e.g., assertive speech and few gestures). Here RL, in particular Q-learning, comes into play: the robot's purpose was to learn a personalized behavior based on the user's cognition (given that the target users are older people with dementia). The robot learns to select the labeled behavior that is most likely to transition the user into the desired state, namely being focused on the activity and completing the correct step.

ghadirzadeh2016sensorimotor (ghadirzadeh2016sensorimotor) presented a model-based Gaussian Process (GP) Q-learning framework to facilitate physical interaction between humans and robots. In the presented scenario, a human and a PR2 robot jointly balanced a ball on a plank. The robot learned the dynamics of the collaboration from its own sensorimotor experiences. Instead of a reward function, they modeled a cost function for state-action pairs, where the cost was defined as the weighted squared Euclidean distance to the target state.
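Such a cost term can be sketched as below; the state encoding (ball position and velocity) and the weight values are illustrative assumptions, not the paper's.

```python
def state_cost(state, target, weights):
    """Weighted squared Euclidean distance to the target state;
    the agent acts to minimize this cost rather than maximize a reward."""
    return sum(w * (s - t) ** 2 for s, t, w in zip(state, target, weights))

# e.g., (ball position, ball velocity), with the goal of a centered ball at rest:
cost = state_cost(state=[0.2, -0.1], target=[0.0, 0.0], weights=[1.0, 0.5])
```

The weights let the designer trade off which state dimensions matter most for the joint task.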

5. Benefits and Challenges of Applying Reinforcement Learning in Social Robotics

To achieve longitudinal interaction with social robots, it is important for such robots to learn incrementally from interactions with non-expert end-users. In continuously evolving interactions, where user needs and preferences change over time, hand-coded rules are labor-intensive. Even though rule-based systems are deterministic, it can be difficult to write rules for complex interaction patterns. RL can be used to formulate these kinds of problems, but such formulations differ from traditional RL problems since they involve humans in the learning cycle. However, there are many technical challenges to address in order to implement RL successfully in social robotics and HRI. In the following, we discuss some of the benefits and challenges related to human presence in the learning cycle.

One of the drawbacks of online learning through interaction with a human is the long interaction time required, which can be tedious and impractical for users; the considerable amount of interaction time can also wear out the robot's hardware. Besides, in IRL, human teachers tend to give less frequent feedback (due to boredom and fatigue) as the learning progresses, resulting in diminished cumulative reward (isbell2001social). Likewise, human teachers tend to provide more positive reward than punishment, i.e., they have a positive bias (suay2011effect) (thomaz2006reinforcement) (Thomaz2007). Yet another problem in IRL is the transparency issues that might arise during the training of a physical robot by human reward (thomaz08) (Knox2013). (Knox2013) used an audible alarm to alert the trainer about the robot's loss of sensing. suay12 (suay12) observed that experts could teach the defined task within a predefined time frame, whereas the same amount of time was not enough for inexperienced users. One suggested solution is algorithmic transparency during training, which exposes the internal policy to the human teacher. However, a presentation of the model of the agent's internal policy might be obscure for naive human teachers; this information should therefore be presented in a straightforward, easy-to-understand way to avoid confusion. For example, in (thomaz08), human trainers waited for the Leonardo robot to make eye contact with them before they continued teaching; the eye contact was interpreted as the robot being ready for the next action. Such transparent robot behaviors, which reveal the internal state of the learning process, should be taken into account for guiding human trainers in IRL. As noted in several studies, in IRL, the human teacher's positive and negative reward can carry much more deliberate meaning than a simple good or bad feedback signal (Thomaz2007; thomaz08), and the learning agent should be aware of these subtle meanings.

The exploration-exploitation dilemma is a well-known problem in RL and refers to the choice between taking actions in order to discover the environment and taking actions that have already proven effective in producing reward (rlbook_new). Social robotics researchers use different approaches to deal with this trade-off, such as the epsilon-greedy policy (patompak2019learning), the epsilon-decreasing policy (chan2012), and the Boltzmann distribution (perula2019bioinspired). The epsilon-greedy strategy exploits knowledge to maximize reward (greedily choosing the current best action), except that with probability ε it selects a random action (rlbook_new). The epsilon-decreasing strategy decreases ε over time, thereby progressing towards exploitative behaviour (rlbook_new). Boltzmann exploration draws the action to execute from a Boltzmann distribution over action values, where a temperature parameter balances exploration and exploitation (high temperature values select actions nearly at random, low temperature values select actions greedily) (rlbook_new).
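The three exploration strategies above can be sketched in a few lines of plain Python; the Q-values and parameter values in the example are illustrative.

```python
import math
import random

def epsilon_greedy(q_values, eps):
    """With probability eps pick a random action, otherwise the greedy one."""
    if random.random() < eps:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)

def decayed_epsilon(eps0, step, decay=0.99):
    """Epsilon-decreasing: shrink eps over time towards exploitation."""
    return eps0 * decay ** step

def boltzmann_probs(q_values, temperature):
    """Boltzmann (softmax) action probabilities: high temperature gives
    near-uniform exploration, low temperature near-greedy exploitation."""
    exps = [math.exp(q / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]
```

For instance, `boltzmann_probs([0.1, 0.9, 0.2], temperature=0.05)` concentrates almost all probability on the best action, while a temperature of 10 yields a nearly uniform distribution.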

Despite the mentioned challenges, there are also advantages of using RL in social robotics. One of the main advantages is that the robot can learn a personalized adaptation for each interactant, that is, a different policy for each user; the social robot can autonomously change its behavior for different interaction partners. In IRL, the immediate reward provided by the human teacher has the potential to improve training by reducing the number of interactions. Human teachers' guidance significantly reduces the number of states explored, and the impact of teacher guidance is proportional to the size of the state space: it increases as the state space grows (suay2011effect). In RL, how to achieve a goal is not specified; instead, the goal is encoded and the agent devises its own strategy for achieving it. Intrinsically motivated reward signals can be useful in many real-world scenarios where sparse rewards make goal-directed behavior challenging. Social-signal-driven approaches have the advantage of using the signals that the user exhibits naturally during the interaction, so collecting the reward requires no extra effort. However, changes in social signals tend to be gradual, which strongly affects the time to convergence, and the role of human social factors deserves extra attention in online learning methods. The combination of RL with deep neural networks has shown success in many application areas, and DRL has become a trending technique in social robotics, with more and more work appearing in recent years. It has the advantage of not needing manual feature engineering (cuayahuitl2019data) and of producing human-like behaviors for social robots (Qureshi2016).

6. Evaluation Methodologies

The past decade has seen the rapid growth of social robots in diverse uncontrolled environments such as homes, schools, hospitals, shopping centers, and museums. In this review, we have seen application domains in a range of fields including therapy (hemminghaus2017), eldercare (nejat2008can), HRI in general (Churamani2018), entertainment (Weber2018a), navigation (patompak2019learning), healthcare (schneider2017exploring), education (park2019model), personal robots (maroto2018bio), rehabilitation (Tapus2008), and human-robot collaboration (ghadirzadeh2016sensorimotor). Research in the field of social robotics and human-robot interaction becomes crucial as more and more robots enter our lives. This brings many challenges, as social robots are required to deal with the dynamic and stochastic elements of social interaction on top of the usual challenges in robotics. Moreover, validating social robotics systems with users necessitates efficient evaluation methodologies. Recent studies underline the importance of evaluation and assessment methodologies in HRI (sim2015extensive), yet developing a standardised way of evaluation remains a challenge. Furthermore, in RL-based robotic systems, various human-level factors (personal preferences, attitudes, emotions, etc.) need to be explored to ensure that the learned policy leads to better HRI. Additionally, how can we evaluate whether the learned policy conveys the intended social skill(s)? As an example, in Qureshi2016's study, the model's performance on a test dataset was evaluated by three volunteers who judged whether the robot's action was appropriate for the current scenario (Qureshi2016). This kind of methodology seems useful for validating the naturalness of the learned skill.

The papers in the scope of this manuscript used different evaluation and assessment methodologies for their algorithms and for their systems with users. In this section, we present these methodologies, which can help social robotics researchers consider common evaluation practices for their systems. We identify three types of evaluation methodologies: evaluation from the algorithmic point of view, evaluation and assessment of user-experience-related subjective metrics, and evaluation of both learning-algorithm-related and user-experience-related factors. In the last category, some papers compare the personalized policy with real user preferences and discuss the learned policy for each participant. Several papers present the proposed system and some experimental design without explicitly specifying the details.

However, we would like to draw attention to the importance of comparative evaluation methodologies that consider both the learned policy and the user's opinion about the robot's actions. As an example, (chan2012; mitsunaga2006robot; Chiang2015) presented the policy for each participant as well as a discussion of the effectiveness of the robot's behavior on the user, based on user comments and subjective evaluations.

7. Future Outlook

In this review paper, we have given an overview of work on RL in social robotics. Many interesting problems and open questions remain to be solved. RL applications on physically embodied robots are limited due to the enormous complexity and uncertainty of real-world social interactions. Wider use of RL on physical social robots will shed further light on this topic.

Despite the fact that there are goal-oriented approaches for social robot learning (lockerd2004tutelage; liu2014interactive), in the current literature, the social robot that learns through RL has only one goal, such as performing a single task and optimizing a single reward function. However, in many real world scenarios, a robot may need to perform a diverse set of tasks. As an example, socially assistive robots designed with the purpose of assisting older people in their houses may need to accomplish several tasks such as medication reminders, detecting issues, informing caregivers, and managing plans. Applying the multi-goal RL framework (sutton2011horde) for social robots would be a fruitful area for future work. Multi-goal RL enables an agent to learn multiple goals, hence the agent can generalize the desired behavior and transfer skills to unseen goals and tasks (bai2019guided).

Another interesting future direction is the application of multi-objective RL in social robotics. Task efficiency and user satisfaction, for example, can be two objectives that the robot tries to maximize simultaneously by formalizing the problem as a multi-objective MDP. As an example, (hao2019emotion) presented a multi-objective weighted RL approach in which the agent had two objectives: minimizing the cost of service execution and eliminating the user's negative emotions. We refer interested readers to the survey on multi-objective decision making for a more detailed explanation of the topic (roijers2013survey).
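One common way to reduce such a multi-objective formulation to a standard scalar-reward RL problem is linear scalarization of the reward vector; the sketch below uses hypothetical objective names and weights.

```python
def scalarized_reward(reward_vector, weights):
    """Collapse a vector of per-objective rewards into a single scalar
    via a weighted sum (linear scalarization)."""
    return sum(w * r for w, r in zip(weights, reward_vector))

# Two hypothetical objectives: task efficiency and user satisfaction.
r = scalarized_reward(reward_vector=[0.7, 0.4], weights=[0.5, 0.5])
```

Linear scalarization is only the simplest option; non-linear scalarization or computing a set of Pareto-optimal policies are alternatives discussed in the multi-objective decision-making literature.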

Recent developments in the field of deep neural networks have led to an increasing interest in DRL. Applying DRL in social robotics has also received recent attention; however, studies have focused on small action sets and single-task scenarios. In this regard, social robots with larger sets of actions would be a fruitful area for further work. Another future direction is a deeper investigation of the hyper-parameters of RL in social robotics. This was briefly discussed in (Keizer2014): in turn-based interactions, relatively small discount factors are more common, whereas for frame-based interactions with rather long trajectories, relatively high discount factors seem to be more suitable. In deep networks, the selection of hyper-parameters affects the accuracy of the algorithm (zhang2017intent). This also applies to DRL; (lathuiliere2019) presented several experiments evaluating the impact of some of the principal parameters of their deep network structure.

Thus far, model-free RL, which focuses on learning a value function or a policy through trial and error, has been the most commonly used approach in social robotics. Model-based RL, which learns a transition model of the environment that can serve as a simulation, remains to be further explored. Although it is difficult to model human reactions, having a model can play a crucial role in reducing the number of required interactions with the real world. The model-based approach can also mitigate the hardware wear problem that arises in model-free RL in robotics because of the considerable amount of interaction time; simulating the interaction environment can ease training without manual interventions and maintenance. Nonetheless, transferring policies learned in simulation directly to the physical robot may not be trivial due to under-modeling and uncertainty about system dynamics (kober2013reinforcement). A common limitation is that most of the works are not generalizable, that is, knowledge learned by one robot cannot be utilized by another, nor can task knowledge be transferred to other tasks. The Google AI team trained a model-based Deep Planning Network (PlaNet) agent that achieved six different tasks (i.e., cartpole swing-up, cartpole balance, finger spin, cheetah run, etc.) (hafner2018learning). A similar approach for a physical social robot would be an interesting future direction.

RL problems are formalized as MDPs in fully observable environments. However, when it comes to HRI, not all required observations are available, due to the underlying psychological states driving human behavior. It has been demonstrated that POMDPs are able to model the uncertainties and inherent ambiguities of real-world HRI scenarios (kostavelis2017pomdp). (hausknecht2015deep) proposed a method that couples a Long Short-Term Memory with a Deep Q-Network to handle the noisy observations characteristic of POMDPs; a similar approach could help social robotics problems better capture the dynamics of the environment. We included two examples of POMDP approaches in social robotics, (ramachandran2019) and (martins2019); further investigation would be an interesting future direction.
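The core mechanism that lets a POMDP track such hidden user states is a belief update over the latent states. A minimal discrete Bayes filter sketch follows; the two-state 'engaged'/'disengaged' example and all probability values are illustrative assumptions.

```python
def belief_update(belief, transition, obs_likelihood):
    """One POMDP belief update: predict with the transition model,
    then correct with the likelihood of the latest observation."""
    n = len(belief)
    predicted = [sum(transition[s][s2] * belief[s] for s in range(n))
                 for s2 in range(n)]
    unnorm = [obs_likelihood[s2] * predicted[s2] for s2 in range(n)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

# Hidden user states: 0 = engaged, 1 = disengaged (hypothetical).
transition = [[0.9, 0.1],   # engaged users mostly stay engaged
              [0.3, 0.7]]
belief = belief_update([0.5, 0.5], transition,
                       obs_likelihood=[0.8, 0.2])  # observation hints at engagement
```

A POMDP policy then selects actions as a function of this belief rather than of a single observed state.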

In this article, we have reviewed RL approaches in social robotics, provided a taxonomy based on reward design, visited the most common evaluation methods, and discussed benefits and challenges of RL applications in social robotics. We have highlighted the points that remain to be explored, including approaches that have so far received less attention. To conclude, we are still far from general-purpose, robust, and versatile social robots that can learn several skills from naive users through real-world interactions, despite tremendous leaps in computing power and advances in learning methods. Despite the immediate challenges, we see steady progress of RL applications in social robotics, with increasing interest in recent years.


This research was funded by European Union’s Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement No 721619 for the SOCRATES project.