Improving interactive reinforcement learning: What makes a good teacher?

by Francisco Cruz, et al.
University of Hamburg

Interactive reinforcement learning has become an important apprenticeship approach to speed up convergence in classic reinforcement learning problems. In this regard, a variant of interactive reinforcement learning is policy shaping, which uses a parent-like trainer to propose the next action to be performed and by doing so reduces the search space through advice. On some occasions, the trainer may be another artificial agent which in turn was trained using reinforcement learning methods to afterward become an advisor for other learner-agents. In this work, we analyze internal representations and characteristics of artificial agents to determine which agent may outperform others as a trainer-agent. Using a polymath agent as an advisor, as compared to a specialist agent, leads to a larger reward and faster convergence of the reward signal and also to a more stable behavior in terms of the state visit frequency of the learner-agents. Moreover, we analyze system interaction parameters in order to determine how influential they are in the apprenticeship process, where the consistency of feedback is much more relevant when dealing with different learner obedience parameters.




1 Introduction

Reinforcement learning (RL) (Sutton & Barto, 1998) is a behavior-based approach which allows an agent, either an infant or a robot, to learn a task by interacting with its environment and observing how the environment responds to the agent’s actions. RL has been shown in robotics (Kober et al., 2013; Kormushev et al., 2013) and in infant studies (Hämmerer & Eppinger, 2012; Deak et al., 2014) to be successful in terms of acquiring new skills, mapping situations to actions (Cangelosi & Schlesinger, 2015).

To learn a task, an RL agent has to interact with its environment over time in order to collect enough knowledge about the intended task. Nevertheless, on some occasions, it is impractical to leave the agent to only learn autonomously, mainly due to time restrictions and therefore, we aim to find a way to accelerate the learning process for RL.

In domestic and natural environments, adaptive agent behavior is needed utilizing approaches used by humans and animals. Interactive reinforcement learning (IRL) allows speeding up the apprenticeship process by using a parent-like advisor to support the learning by delivering useful advice in selected episodes. This reduces the search space and thus allows the task to be learned faster than by an agent exploring fully autonomously (Suay & Chernova, 2011; Cruz et al., 2015). In this regard, the parent-like teacher guides the learning robot, enhancing its performance in the same manner as external caregivers may support infants in the accomplishment of a given task, with the provided support frequently decreasing over time. This teaching technique has become known as parental scaffolding (Breazeal & Velásquez, 1998; Ugur et al., 2015).

The parent-like teacher can be either a human user or another artificial agent. By using artificial agents as teachers, some properties have been studied so far, such as the effects of delivering advice in different episodes and with different strategies during the learning process (Torrey & Taylor, 2013; Taylor et al., 2014) and the effects of different probabilities and consistency of feedback (Griffith et al., 2013; Cruz et al., 2014, 2016). Nonetheless, to the best of our knowledge, there is no study so far about the implications of utilizing artificial teachers with different characteristics and different internal representations of the knowledge based on their previous experience. Moreover, the effects when the learner ignores some of the advice have also not been studied in artificial agent-agent interaction, although some insights are given in Griffiths’ work using human-human interaction with a computational interface (Griffiths et al., 2012).

In this paper, we study effects of agent-agent interaction in terms of achieved learning when parent-like teachers differ in essence and when learner agents vary in the way they incorporate the advice. We have seen differences in the performance which could lead to adaptive behavior in order to reduce interactive feedback between trainer and learner.

This paper is organized as follows: in the second section, we present background and related work about IRL from both neuroscience and computational points of view. The third section shows the proposed IRL scenario which has been previously used but is updated here to further integrate multi-modal advice from human teachers. In the fourth section, we present the experimental set-up and obtained results. Finally, the fifth section gives an overall discussion including main conclusions and future work.

2 Interactive Reinforcement Learning

Learning in humans and animals has been widely studied by neuroscience, yielding a better understanding of how the brain can acquire new cognitive skills. We currently know that RL is associated with cognitive memory and decision-making in animals’ and humans’ brains in terms of how behavior is generated (Niv, 2009). In general, computational neuroscience has interpreted data and used abstract and formal theories to help understand functions in the brain.

In this regard, RL is a method used to address optimal decision-making, attempting to maximize collected reward and minimize the punishment over time. It is a mechanism utilized by humans and in robotic agents. In developmental learning, it plays an important role since it allows infants to learn through exploration of the environment and connect experiences with pleasant feelings which are associated with higher levels of dopamine in the brain (Wise et al., 1978; Gershman & Niv, 2015).

RL is a plausible method to develop goal-directed action strategies. During an episode, an agent explores the state space within the environment selecting random actions which move the agent to a new state. Moreover, a reward signal is received after performing an action, which may encode a positive compensation or a negative punishment. Over time, the agent learns the value of the states in terms of future reward, or reward proximity, and how to get to states with higher values to reach the target by performing actions (Weber et al., 2008).

In robotics, RL has been used to allow robotic agents to autonomously explore their environment in order to develop new skills (Wiering & Van Otterlo, 2012; Mnih et al., 2015). To solve an RL problem means to find at least one optimal policy that collects the highest reward possible in the long run. Such a policy is known in psychology as a set of stimulus–response rules (Kornblum et al., 1990). Optimal policies are denoted by $\pi^*$ and share the action-value function which is denoted by $q^*$ and defined as $q^*(s,a) = \max_{\pi} q^{\pi}(s,a)$. The optimal action–value function can be solved through the Bellman optimality equation for $q^*$:

$$q^*(s,a) = \sum_{s'} p(s' \mid s,a) \left[ r(s,a,s') + \gamma \max_{a'} q^*(s',a') \right] \qquad (1)$$

where $s$ is the current state, $a$ is the taken action, $s'$ is the next state reached by performing the action $a$ in the state $s$, and $a'$ are possible actions that could be taken in $s'$. In the equation, $p$ represents the probability of reaching the state $s'$ given that the current state is $s$ and the selected action is $a$, and $r$ is the received reward for performing action $a$ in the state $s$ for reaching the state $s'$. The parameter $\gamma$ is known as discount rate and represents how influential future rewards are (Sutton & Barto, 1998). The gray box in Fig. 1 shows the general description of the RL framework, where the environment is represented by domestic objects which are related to our scenario which is described in the next section.
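As an illustration of the Bellman optimality equation, the following minimal sketch computes a single backup for one state–action pair of a tabular problem. The function name `bellman_backup`, the dictionary layout of the transition model, and the toy values are assumptions made for this example and are not taken from the paper.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: transitions[(s, a)] lists (probability, next_state, reward).
Transitions = Dict[Tuple[str, str], List[Tuple[float, str, float]]]
QTable = Dict[Tuple[str, str], float]


def bellman_backup(q: QTable, transitions: Transitions, s: str, a: str,
                   gamma: float, actions: List[str]) -> float:
    """One Bellman optimality backup for the pair (s, a)."""
    total = 0.0
    for prob, s_next, reward in transitions[(s, a)]:
        # Value of the best action available in the next state
        # (unknown pairs default to 0.0 in this sketch).
        best_next = max(q.get((s_next, a_next), 0.0) for a_next in actions)
        total += prob * (reward + gamma * best_next)
    return total


# Toy example with made-up numbers: one deterministic transition.
toy_transitions = {("s0", "go"): [(1.0, "s1", 0.0)]}
toy_q = {("s1", "go"): 1.0, ("s1", "stay"): 0.5}
backed_up = bellman_backup(toy_q, toy_transitions, "s0", "go", 0.9, ["go", "stay"])
```

Repeating such backups over all state–action pairs until the values stop changing is one standard way to solve the equation when the transition model is known; the paper instead learns the values from experience.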

Figure 1: An interactive reinforcement learning approach with policy shaping. The agent autonomously performs action $a$ in state $s$, obtaining reward $r$ and reaching the next state $s'$. In selected states, the trainer advises the learner-agent, changing the action to be performed in the environment.

In the learning phase, to solve equation 1, one strategy is to allow the agent to perform actions considering transitions from state–action pair to state–action pair rather than transitions from state to state only. Accordingly, the on-policy method SARSA (Rummery & Niranjan, 1994) updates every state–action value according to the equation:

$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma Q(s',a') - Q(s,a) \right] \qquad (2)$$

where $Q(s,a)$ is the value of the state–action pair and $\alpha$ the learning rate.
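The SARSA update can be sketched in a few lines. This is a minimal tabular version; the function name `sarsa_update`, the dictionary-based Q-table, and the numeric values of the learning rate and discount factor are illustrative assumptions, not the settings used in the experiments.

```python
from collections import defaultdict
from typing import DefaultDict, Tuple

QTable = DefaultDict[Tuple[str, str], float]


def sarsa_update(q: QTable, s: str, a: str, r: float,
                 s_next: str, a_next: str,
                 alpha: float, gamma: float) -> float:
    """Move Q(s, a) towards the on-policy target r + gamma * Q(s', a')."""
    td_target = r + gamma * q[(s_next, a_next)]
    q[(s, a)] += alpha * (td_target - q[(s, a)])
    return q[(s, a)]


# Illustrative call with placeholder values (all Q-values start at 0.0 here).
q_table: QTable = defaultdict(float)
new_value = sarsa_update(q_table, "s0", "GET", 1.0, "s1", "GO LEFT",
                         alpha=0.5, gamma=0.9)
```

Because the target uses the action actually selected in the next state rather than the greedy one, the update stays on-policy, which is what distinguishes SARSA from Q-learning.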

Although the next action can be autonomously selected by choosing the best known action at the moment, represented by the highest state–action pair, an intuitive strategy to speed up the learning process would be to include external advice in the apprenticeship loop; early research on this topic using both humans and robots can be found in (Lin, 1991). When using IRL, an action is interactively encouraged by a trainer with a priori knowledge about the desired goal (Thomaz et al., 2005; Thomaz & Breazeal, 2006; Knox et al., 2013). In IRL, using a trainer to advise an agent on future actions is known as policy shaping (Cederborg et al., 2015; Amir et al., 2016).

Supportive advice can be obtained from diverse sources like expert and non-expert humans, artificial agents with perfect knowledge about the task, or previously trained artificial agents with certain knowledge about the task. In this work, an artificial trainer-agent which was itself previously trained through RL is used to provide advice, which has been formerly used in other works. For instance, in (Cruz et al., 2014) advice is given based on an interaction probability and consistency of feedback. In Taylor’s works, interaction is based on a maximal budget of advice and they studied which moment is better to give advice during the training (Torrey & Taylor, 2013; Taylor et al., 2014). Fig. 1 shows a general overview of the agent–agent scheme where the trainer provides advice in selected episodes to the learner-agent to bootstrap its learning process.

Although interactive advice improves the learning performance of learner-agents, a problem which remains open and that can significantly affect the agent’s performance is the need of a good trainer since consecutive mistakes may lead to a worse training time (Cruz et al., 2016). In principle, one may think that an expert agent with a larger accumulated reward should be a good candidate to become the trainer. Expert agents, either human or artificial, have been used in different reinforcement learning approaches using advice (e.g.: da Silva et al. (2017); Ahmadabadi & Asadpour (2002); Ahmadabadi et al. (2000); Price & Boutilier (1999)). However, when we look into the internal knowledge representation, this may not necessarily be the best option. On some occasions, agents with lower overall performance may be better trainers due to a possibly vast experience about less common states (i.e. states that do not necessarily lead to the optimal performance) and therefore, may give better advice in those states. Some insights on using trainer-agents with different abilities have been discussed by Taylor et al. (2011) in a simulated robot soccer domain by using a human-agent transfer approach.

3 Domestic Robot Scenario

In this paper, we extend a previously used RL scenario which consists of a robotic agent performing a cleaning task (Cruz et al., 2016). Here, we do not deal with contextual affordances and, therefore, we do not have to previously learn them which results in a shorter training time, in general.

The current scenario comprises two objects, three locations, and seven actions. The robot is placed in front of a table in order to clean it up. In this scenario, there are two objects: a cup which is initially at a random location on the table and needs to be relocated as the table is being wiped, and a sponge which is used by the robot in order to wipe different sections of the table.

Three locations have been defined in the cleaning scenario: left and right to refer to each of the two sections of the table, and one additional position called home which is the robot’s arm’s initial position and the location where the sponge is placed when not being used. Furthermore, seven domain-specific actions are allowed in this scenario defined as follows:

  1. GET: allows the robot to pick up the object which is placed in the same location as its hand.

  2. DROP: allows the robot to put down the object held in its hand. The object is placed in the same location where the hand is.

  3. GO HOME: moves the hand to the home position.

  4. GO LEFT: moves the hand to the left position.

  5. GO RIGHT: moves the hand to the right position.

  6. CLEAN: allows the robot to clean the section of the table at the current hand position if holding the sponge.

  7. ABORT: cancels the execution of the cleaning task at any time and returns to the initial state.

Each state is represented by using a state vector of four variables:

  1. the object held in the agent’s hand (if any),

  2. the agent’s hand position,

  3. the position of the cup, and

  4. a 2-tuple with the condition of each side of the table, i.e. whether the table surface is clean or dirty.

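Therefore, the state vector at any time is characterized as follows:

$$s = \langle \mathit{handObj}, \mathit{handPos}, \mathit{cupPos}, \mathit{sideCond} \rangle \qquad (3)$$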

Figure 2: Outline of state transitions in the defined cleaning scenario. Two different paths are possible to reach a final state. Each path implies a different number of intermediate states which influence the total amount of collected reward during a learning episode. Thus path A comprises 23 states and path B 31 states.

As long as the agent successfully finishes the task, a positive reward is given to it, whereas a negative reward is given if a failed-state is reached. In this context, a failed-state is a state from which the robot cannot continue the expected task execution, for instance when attempting to pick up an object while already holding another object. Furthermore, a small negative reward is given in all other cases to encourage the agent to take shorter paths towards a final state. Therefore, the reward function can be posed as:


At the beginning of each training episode, the robot’s hand is free at the home location, the sponge is also placed at the home position, while the cup is at either the left or the right location, and both table sections are dirty. Therefore, the initial state may be represented as:


From the initial state, the state vector is updated every time after performing an action according to the state transition table as shown in Table 3. In the current scenario, considering the state vector features, there are 53 different states which represent two divergent paths to two final states. Fig. 2 depicts a summarized illustration of the transitions to reach a final state assuming the cup to be initially at the left position. The figure also shows the number of states involved in each path. Therefore, each path leads to a different number of transited states which in turn also leads to a different accumulated reward.

Table 3: State vector transitions. After performing an action, the agent reaches either a new state or a failed condition; in the latter case, the agent starts another training episode from the initial state $s_0$.

Action      State vector update
Get         if handPos == home && handObj == cup then FAILED
            if handPos == cupPos && handObj == sponge then FAILED
            if handPos == home then handObj = sponge
            if handPos == cupPos then handObj = cup
Drop        if handPos == home && handObj == cup then FAILED
            if handPos != home && handObj == sponge then FAILED
            otherwise handObj = free
Go pos      handPos = pos
            if handObj == cup then cupPos = pos
Clean       if handPos == cupPos then FAILED
            if handPos == home then FAILED
            if handObj == sponge then sideCond[handPos] = clean
Abort       handPos = home
            handObj = free
            cupPos = random(pos)
            sideCond = [dirty]*pos

Note: pos may be any defined location; therefore, three actions are represented by the Go pos transition, i.e. go left, go right, and go home.
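The transition rules above translate almost directly into code. The following is a minimal sketch of one literal reading of the table, not the authors' implementation. Several points are assumptions of this example: the rules are evaluated top to bottom, a case not covered by any rule leaves the state unchanged, the cup is reset to one of the two table sections on abort, and all names (`State`, `step`, `initial_state`) are hypothetical.

```python
import random
from dataclasses import dataclass, replace
from typing import Optional, Tuple

TABLE_SIDES = ("left", "right")  # table sections; "home" is the third location


@dataclass(frozen=True)
class State:
    hand_obj: str = "free"   # "free", "sponge" or "cup"
    hand_pos: str = "home"   # "home", "left" or "right"
    cup_pos: str = "left"
    side_cond: Tuple[str, str] = ("dirty", "dirty")  # (left, right)


def initial_state(rng=random) -> State:
    # Assumption: the cup starts on one of the two table sections.
    return State(cup_pos=rng.choice(TABLE_SIDES))


def step(s: State, action: str, rng=random) -> Optional[State]:
    """Apply one action; return the next state, or None for FAILED."""
    if action == "GET":
        if s.hand_pos == "home" and s.hand_obj == "cup":
            return None
        if s.hand_pos == s.cup_pos and s.hand_obj == "sponge":
            return None
        if s.hand_pos == "home":
            return replace(s, hand_obj="sponge")
        if s.hand_pos == s.cup_pos:
            return replace(s, hand_obj="cup")
        return s  # assumption: nothing to pick up, state unchanged
    if action == "DROP":
        if s.hand_pos == "home" and s.hand_obj == "cup":
            return None
        if s.hand_pos != "home" and s.hand_obj == "sponge":
            return None
        return replace(s, hand_obj="free")
    if action in ("GO HOME", "GO LEFT", "GO RIGHT"):
        pos = action.split()[1].lower()
        # A held cup travels with the hand.
        cup_pos = pos if s.hand_obj == "cup" else s.cup_pos
        return replace(s, hand_pos=pos, cup_pos=cup_pos)
    if action == "CLEAN":
        if s.hand_pos == s.cup_pos or s.hand_pos == "home":
            return None
        if s.hand_obj == "sponge":
            sides = list(s.side_cond)
            sides[TABLE_SIDES.index(s.hand_pos)] = "clean"
            return replace(s, side_cond=tuple(sides))
        return s  # assumption: cleaning without the sponge has no effect
    if action == "ABORT":
        return initial_state(rng)
    raise ValueError(f"unknown action: {action}")
```

For example, starting with the cup on the left, the sequence GET, GO RIGHT, CLEAN cleans the right section, whereas GET, GO LEFT, CLEAN fails because the cup still occupies the left section.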

As defined, the same transitions may be used in scaled-up scenarios where more locations are defined on the table in a larger grid since the definition of transitions is done by only considering the object held by the robot and the hand position in reference to either the home location or the cup position.

Fig. 3 shows the domestic robotic scenario with two robotic agents where one agent becomes the trainer by learning the task using autonomous RL. The second agent performs the same task supported by the trainer-agent with selected advice using the IRL framework.

Figure 3: Two robotic agents performing a domestic task in the defined home scenario. The trainer-agent advises the learner-agent in selected states what action to perform next.

4 Experiments and Results

In the following subsections, the experimental set-up will be explained in detail. Initially, we look into the internal representation and visited states of prospective advisor agents in order to explore which features may be important to act as a good trainer. Afterward, we compare the behavior of both the advisor and the learner in terms of the internal representation, visited states, and collected reward. Finally, we evaluate some system interaction parameters like frequency of feedback, consistency of feedback, and learner behavior.

All experiments included the training of agents over a fixed number of episodes. Q-values were randomly initialized using a uniform distribution. Other parameters were the learning rate $\alpha$ and the discount factor $\gamma$. Besides this, we used $\epsilon$-greedy action selection. To assess the interaction between learner and trainer-agents we used a fixed probability of feedback as a base; nevertheless, we afterward varied this parameter along with the consistency of feedback and learner behavior. All the aforementioned parameters were empirically determined and related to our scenario.
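The set-up described above, uniformly initialized Q-values combined with $\epsilon$-greedy action selection, can be sketched as follows. The helper names `init_q` and `epsilon_greedy` are hypothetical, and the initialization range and exploration rate in the example call are placeholders rather than the values used in the experiments.

```python
import random
from typing import Dict, List, Tuple

QTable = Dict[Tuple[str, str], float]


def init_q(states: List[str], actions: List[str],
           low: float, high: float, rng: random.Random) -> QTable:
    """Initialize every Q(s, a) from a uniform distribution on [low, high]."""
    return {(s, a): rng.uniform(low, high) for s in states for a in actions}


def epsilon_greedy(q: QTable, state: str, actions: List[str],
                   epsilon: float, rng: random.Random) -> str:
    """Explore with probability epsilon, otherwise take the best known action."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: q[(state, a)])


# Placeholder values for illustration only.
rng = random.Random(0)
example_actions = ["GET", "DROP", "CLEAN"]
example_q = init_q(["s0", "s1"], example_actions, 0.0, 1.0, rng)
chosen = epsilon_greedy(example_q, "s0", example_actions, epsilon=0.1, rng=rng)
```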

4.1 Choosing an Advisor Agent

To acquire a sample of trainer-agents, autonomous RL was performed with a set of agents, each of them a prospective trainer for the IRL approach. In the presented scenario, there are agents with diverse behaviors which differ mostly in the path they choose until reaching a final state. First, there are agents which most of the time choose the same path to complete the task, either path A or path B, which leads to a biased behavior due to the way the knowledge is acquired during the learning process. From this kind of behavior and taking into account our scenario, there exist agents that regularly take the shorter path (path A) and others that take the longer one (path B); we refer to them as the specialist-A and the specialist-B agents respectively. In both cases, agents successfully accomplish the task, although they accumulate different amounts of average reward. The specialist-A agents are the ones with better performance in terms of collected reward since fewer state transitions are needed to reach the final state. Second, there are agents with a more homogeneously distributed experience, meaning that they do not have a favorite sequence to follow and have equally explored both paths. We refer to such agents as polymath agents.

Figure 4: Frequency of visits per state for two agents. It is possible to observe two different behaviors. The biased (specialist-A) agent gained experience mostly on the shorter path, whereas the homogeneously-distributed (polymath) agent gained experience through most states.

To illustrate this, Fig. 4 shows a frequency histogram of visited states for two potential trainer-agents over all training episodes. The histogram shows two distinct distributions, one for a specialist-A agent in gray and one for a polymath agent in blue. The specialist-A agent decided to clean the table following the shorter path most of the time and, therefore, there is an important concentration of visits among the intermediate states needed to complete the task on this path. Furthermore, there is a clear subset of states which was never visited during the learning. In contrast, the polymath agent visited all the states and transits on both paths to a similar extent. In the case of the specialist-B agent, there is also a concentration of visits among a subset of states, similarly to the specialist-A agent. The specialist-B agent decided most of the time to clean following the longer path, concentrating its visits on the states along that path and barely visiting another range of states. Therefore, we do include this agent in the results hereafter but we do not present it in some plots to make the relevant information more accessible.

To further analyze the agents’ behavior we took three representative agents, one per class, that we will from now on use with the respective names: specialist-A agent with biased behavior for the shorter path, specialist-B agent with biased behavior for the longer path, and polymath agent with unbiased behavior. The specialist-A agent visited each state an average of 1121.21 times with a standard deviation of 1570.75, and accumulated an average reward of 0.11105 per episode and a total reward of 333.15 during the whole training. The specialist-B agent visited each state on average 1561.15 times, obtaining a more diverse experience than the previous agent but certainly not homogeneously distributed, which can also be appreciated in the standard deviation of 1628.70. The specialist-B agent accumulated an average reward of -0.17839 for each episode and a total of -535.18. In the case of the polymath agent, each state was visited an average of 1307.51 times with a standard deviation of 947.96. The accumulated average reward was -0.00427 per episode and the total reward was -12.82 during the whole training. Table 4.1 shows a summary of the performance of the three aforementioned agents.

Table 4.1: Visited states, standard deviation, reward accumulated per episode, and total collected reward for three agents from classes with different behavior. The agents show different characteristics as a result of the autonomous learning process.

Agent                Mean visits per state   Std. dev.   Avg. reward per episode   Total reward R   Characteristic
Specialist-A agent   1121.21                 1570.75     0.11105                   333.15           Largest accumulated reward
Specialist-B agent   1561.15                 1628.70     -0.17839                  -535.18          Largest amount of experience
Polymath agent       1307.51                 947.96      -0.00427                  -12.82           Smallest standard deviation

Nevertheless, accumulating plenty of reward does not necessarily lead to becoming a good trainer. In fact, it only means that the agent is able to select the shorter path most of the time from the initial state, but the experience collected in other states not involved in that route is absent or barely present and therefore, such an agent cannot give good advice in those states where it does not know how to act optimally.

For a good trainer to emerge with knowledge of most of the situations or in all possible states we suggest an agent with a small standard deviation from the mean frequency over all visited states, which represents a better distribution of the experience during the training. We select the trainer-agent computing:


where is the set of all the trained agents and their respective visited states during the learning process.
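The selection criterion described above, picking the agent whose visits are most evenly spread over the states, can be sketched as follows. The function name `select_trainer` is hypothetical, the visit counts are made-up numbers for illustration, and the use of the population standard deviation (rather than the sample version) is an assumption, since the text does not say which one was used.

```python
from statistics import pstdev
from typing import Dict, List


def select_trainer(visit_counts: Dict[str, List[int]]) -> str:
    """Return the agent with the smallest standard deviation of state visits."""
    return min(visit_counts, key=lambda name: pstdev(visit_counts[name]))


# Made-up visit counts per state for two hypothetical candidate trainers.
candidate_visits = {
    "specialist": [900, 850, 0, 0, 10],
    "polymath": [400, 350, 300, 380, 330],
}
selected = select_trainer(candidate_visits)
```

In this toy data the specialist never visits two of the five states, so its visit counts vary widely and the more evenly experienced candidate is selected.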

Therefore, we propose that a good trainer is, in essence, an agent which not only collects more rewards but also shows a fairly distributed experience. From the three agents shown above, the polymath agent has the smallest standard deviation (947.96, see Table 4.1) and thus might be a good advisor. In Fig. 4, the experience distribution of such an agent is shown in blue and this experience distribution suggests that the agent has the knowledge to advise what action to perform in most of the states. In the case of the initial state, the frequency is much higher in comparison since this state is visited every time at the beginning of a learning episode. In fact, similar frequencies are observed in this state for a biased distribution.

Figure 5: Internal knowledge representation for three possible parent-like advisors in terms of Q-values, namely the specialist-A, the specialist-B, and the polymath agent. The specialist-A agent shown in figure a), despite collecting more reward, does not have enough knowledge to advise a learner in every situation represented by the blue box. A similar situation is experienced by the specialist-B agent, as shown in figure b). The polymath agent shown in figure c) has overall much more distributed knowledge which allows it to better advise a learner-agent.

We also recorded the internal representation of the knowledge through the Q-values to confirm the lack of learning in a subset of states. Fig. 5 shows a heat map of the internal Q-values of three agents, the specialist-A, the specialist-B, and the polymath agent. Warmer regions represent larger values and colder regions lower values. In fact, the coldest regions are associated with failed-states from where the agent should start a new episode, obtaining the negative reward defined in Eq. 4. In Fig. 5, it can be observed that the specialist-A agent may be an inferior advisor since there exists a whole region uniformly in yellow, which shows no knowledge about what action to prefer. In the case of the specialist-B agent, there exists a region which shows much less knowledge on what action to prefer when comparing it with the two other agents. In other words, the learned policies are partially incomplete as highlighted by the blue boxes in Fig. 5. On the contrary, the policy learned by the polymath agent is much more complete when observing the same regions as highlighted by the green boxes. It is important to note that the region on top is in all cases colder than the rest because it is the most distant one from the final states where a positive reward is given, but in spite of that, the polymath agent is still able to select a suitable action according to the learned policy.

4.2 Comparing Advisor and Learner Behavior

Once we had chosen trainer-agents, we were able to compare how influential such a trainer was in the learning process of a learner. We used two agents shown in the previous subsection, the specialist-A and the polymath agent, the former with the largest accumulated reward and the latter with the smallest standard deviation.

Fig. 6 shows the frequency with which each state was visited for learner-agents on average using the specialist-A agent with biased frequency distribution as a trainer. We can observe a large standard deviation for visited states in IRL agents in most of the cases, which suggests diversity in terms of frequency for those states among the learner-agents. Fig. 7 shows the average frequency of visits for each state for learner-agents using the polymath agent as a trainer which has a more homogeneous frequency distribution. It can be observed that the standard deviation for visited states in IRL agents is much lower in comparison to the previous case. This shows a more stable behavior in terms of visiting frequency in learner-agents when using the polymath trainer-agent.

Figure 6: Visited states for the specialist-A RL trainer-agent and average state visits of IRL learner-agents. The averaged frequency for IRL agents moreover includes the standard deviation for visited states showing that in many cases the trainer-agent does not know how to advise and in consequence leads the learner-agent to dissimilar behavior.

Figure 7: Visited states for the polymath RL trainer-agent and average state visits of IRL learner-agents. The averaged frequency for IRL agents includes the standard deviation which in this case is considerably lower as the learners are assisted by a trainer with more knowledge about the task-space which also leads learner-agents to have more stable behavior as they are consistently advised.

By using the specialist-A agent as a trainer in our IRL approach, the average collected reward is slightly higher in comparison with autonomous RL. In general, the IRL approach collects the reward faster than RL but reaches a similar magnitude in later episodes. Fig. 8 depicts the average collected reward during the first episodes using autonomous RL and IRL approaches in yellow and red respectively, using the specialist-A agent as the trainer in the case of IRL. The gray curves show the collected reward convolved with a window of 30 values to smooth the results shown.
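One common way to smooth a reward curve with a window of 30 values is a moving average, which is a convolution with a uniform kernel. The sketch below assumes exactly that; the kernel shape and the handling of the curve's edges are not stated in the text, so the uniform kernel and the choice to return only fully covered positions are assumptions of this example, as is the function name `moving_average`.

```python
from typing import List, Sequence


def moving_average(values: Sequence[float], window: int = 30) -> List[float]:
    """Average each run of `window` consecutive values (no edge padding)."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(values) < window:
        return []
    running_sum = sum(values[:window])
    smoothed = [running_sum / window]
    for i in range(window, len(values)):
        # Slide the window one step: add the new value, drop the oldest.
        running_sum += values[i] - values[i - window]
        smoothed.append(running_sum / window)
    return smoothed


# Small illustration with a window of 2 instead of 30.
example_smoothed = moving_average([1, 2, 3, 4], window=2)
```

With this edge handling, a curve of N values yields N - 29 smoothed points for the default window of 30.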

On the other hand, by using the polymath agent as the trainer, the IRL approach converges both faster and to a higher amount of reward when compared with the previous case. This is because the polymath agent knows the task-space better and is able to advise correctly in more situations than the specialist agent. In consequence, this allows the learner to complete the task faster and therefore accumulate more reward. Fig. 9 shows the average collected reward over episodes for RL and IRL approaches. Once again, the gray curves show the collected reward convolved with a window of 30 values to smooth the results shown. In the following experiments, only smooth curves will be used to simplify the analysis of the results.

Figure 8: Average collected reward by agents using RL and IRL approaches. In this case, a biased trainer (the specialist-A agent) is used to advise the learner-agents. The advice slightly improves the performance in terms of accumulated reward and convergence speed.
Figure 9: Average collected reward by agents using RL and IRL approaches. When using an unbiased trainer-agent (the polymath agent), the accumulated reward is higher and the convergence speed faster in comparison with the previous case using a biased agent as an advisor.

Therefore, IRL is in general beneficial for a learner-agent in terms of accumulated reward and convergence speed. Nevertheless, the selection of the trainer can have significant implications on the learner’s performance. In the following subsection, we analyze the main interaction parameters in order to understand how influential they are regarding the learner’s performance when being advised by a potentially good trainer.

4.3 Evaluating Interaction Parameters

As part of this study, we evaluated the involved interaction parameters, namely probability of feedback, consistency of feedback, and whether the learner follows the received advice or not, in order to mimic actual human-human behavior where the learner occasionally does not follow the advice (Griffiths et al., 2012). We called the latter parameter learner obedience; at its lowest value, the agent never follows the advice and thus corresponds to a pure RL learner. Probability and consistency of feedback correspond to the frequency of giving advice to the learner and the degree to which such advice is rational in the current state, respectively.
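The interplay of the three interaction parameters can be sketched as a single action-selection step. This is an illustration under stated assumptions, not the authors' implementation: all three parameters are treated as probabilities in [0, 1], inconsistent advice is modeled as a uniformly random action other than the trainer's best one, and the names `choose_action`, `feedback_prob`, `consistency`, and `obedience` are hypothetical.

```python
import random
from typing import Dict, List, Tuple

QTable = Dict[Tuple[str, str], float]


def choose_action(q_learner: QTable, q_trainer: QTable, state: str,
                  actions: List[str], feedback_prob: float,
                  consistency: float, obedience: float,
                  epsilon: float, rng: random.Random) -> str:
    """Select the learner's next action, possibly overridden by advice."""
    # The learner's own epsilon-greedy choice.
    if rng.random() < epsilon:
        own = rng.choice(actions)
    else:
        own = max(actions, key=lambda a: q_learner[(state, a)])

    # The trainer only gives advice with probability feedback_prob.
    if rng.random() >= feedback_prob:
        return own

    # Consistent advice is the trainer's best action; otherwise a wrong one
    # (assumption: drawn uniformly from the remaining actions).
    best = max(actions, key=lambda a: q_trainer[(state, a)])
    if rng.random() < consistency:
        advice = best
    else:
        others = [a for a in actions if a != best]
        advice = rng.choice(others) if others else best

    # The learner follows the advice with probability obedience.
    return advice if rng.random() < obedience else own


# Toy Q-tables in which learner and trainer disagree.
toy_learner = {("s", "a"): 1.0, ("s", "b"): 0.0}
toy_trainer = {("s", "a"): 0.0, ("s", "b"): 1.0}
toy_rng = random.Random(1)
advised = choose_action(toy_learner, toy_trainer, "s", ["a", "b"],
                        1.0, 1.0, 1.0, 0.0, toy_rng)
```

With an obedience of zero the function always returns the learner's own choice, which matches the description of a pure RL learner.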

Initially, we used a fixed probability of feedback of 0.25 with different values of consistency. A similar probability of feedback has been used in (Cruz et al., 2016) and therefore, we used it as a base to start the evaluation. The idea then was to test the system over a number of different values of consistency of feedback and learner obedience. Fig. 10 shows the collected reward for the different values of consistency of feedback and learner obedience. In all cases, the lowest learner obedience, shown in black, corresponds to autonomous RL, which is shown in yellow in Fig. 8. The collected rewards indicate that, in general, the more consistent the feedback, the better the performance. Even though that difference in performance seems intuitive, it is important to note that, even with comparatively high values of consistency, the learner does not achieve significantly better performance than autonomous RL, while on the other hand, an idealistic perfect consistency allows the learner-agent to achieve much higher collected rewards than autonomous RL even when the learner obedience is low. Therefore, in the current scenario, wrong advice has an important negative effect since it not only leads to the execution of more intermediate steps but also, in many cases, leads to failed-states and thus to a high negative reward and the start of a new learning episode. Further on in this section, we test additional values of consistency to observe how influential small variations in this parameter are.

Figure 10: Collected reward for different values of learner obedience using a fixed probability of feedback of 0.25 and four different values of consistency of feedback.

In Fig. 10, agents that follow the advice only some of the time, depicted in green, perform much better when the consistency of feedback is lower, because the agent is able to ignore wrong advice and select an action on its own. In contrast, agents that follow the advice all the time, depicted in red, perform much better in the presence of consistent feedback.

Figure 11: Collected reward for different learner obedience levels using several probabilities and consistencies of feedback. Higher probabilities of feedback do not necessarily lead to discernible improvements in the overall performance; however, important differences can be noted as higher consistencies of feedback are used.

Next, we varied the probability of feedback to test how influential different consistencies of feedback and different learner obedience levels are. Fig. 11 shows the accumulated reward for further probabilities of feedback (the outcome using a probability of feedback of 0.25 is already shown in Fig. 10), combined with several consistencies of feedback and learner obedience levels.

In Fig. 11, the columns show the performance over different probabilities of feedback, while the rows show the performance over different values of consistency. Along each row, higher probabilities of feedback do not considerably improve the collected reward, suggesting that more frequent interactive feedback does not necessarily enhance the overall performance; rather, it is the consistency of feedback that makes the prominent difference. Indeed, down each column, that is, with the same probability of feedback, different values of consistency lead to significant improvements in the collected reward. Consequently, consistency of feedback has much more impact on the final learning performance. For instance, with the consistency of feedback used in the fourth row of Fig. 11, the accumulated reward remains high in all cases, whereas with the consistency used in the third row, the accumulated reward tends to decrease slightly as trainer advice increases. In other words, more interactive feedback does not help when the consistency of feedback is poor, that is, when the advice is bad.

Finally, since the consistency of feedback shows considerable sensitivity to small variations, we performed one additional experiment keeping the probability of feedback fixed at 0.25, as in Fig. 10, since we use this value as the base. We tested additional, closely spaced values of the consistency of feedback (two further values are already shown in Fig. 10) to evaluate how these slight changes affect the overall performance. Fig. 12 shows the accumulated rewards for the different learner obedience levels. Such small differences in the consistency of feedback can lead to dissimilar outcomes, ranging from behavior similar to autonomous RL at the lower end of the tested range to behavior similar to a fully and correctly advised learner-agent at the upper end. Therefore, even a small proportion of bad advice can considerably impoverish the learning process. This shows how important it is to select trainers that can give useful advice in most states, since specialized trainers, despite being more successful themselves from the initial state, have limited knowledge of states that lie outside their specialized policy.

In our approach, we have used the probability of feedback to control how much advice is given to the learner-agent, in terms of assistance during selected training episodes. As mentioned above, the consistency of feedback makes it possible to mimic the behavior of human trainer-agents, who are prone to making mistakes during the learning process. Nevertheless, at this point, all instances of advice are received by the learner-agent without any discrimination between right and wrong advice. As discussed, inconsistent feedback may in fact slow down the learning process in terms of accumulated reward. Therefore, the learner obedience parameter is an effective way for learner-agents to suppress the influence of inconsistent feedback by disregarding some pieces of wrong advice. In this way, the learner-agents are able to accumulate more reward during the learning process.

Figure 12: Collected reward for different values of learner obedience using a fixed probability of feedback of 0.25 and four higher values of consistency of feedback.

5 Conclusions and Future Work

In this work, we presented a comparison of artificial agents that are used as parent-like teachers in an IRL cleaning scenario. We have defined three classes of trainer-agents related to our scenario. The agents differ in their characteristics and consequently in the performance obtained during their own learning process and, in turn, as trainers. The three agents vary in their main properties, which are reflected in their behavior: i) the specialist-A agent has the largest accumulated reward, ii) the specialist-B agent has the largest amount of experience in terms of the number of explored states, and iii) the polymath agent has the smallest standard deviation.

It has been shown that the agents diverge in their internal representation of knowledge, expressed through state–action Q-values, since there are states in which it is not possible to distinguish which actions lead to greater reward. Using the polymath agent as an advisor leads to both greater reward and faster convergence of the reward signal, and also to more stable behavior in terms of the state visit frequency of the learner-agents. This can be seen in the standard deviation for each visited state when compared with the case of the specialist-A agent as a trainer.

IRL generally helps to improve the performance of an RL agent using parent-like advice. Nonetheless, it is important to take into account that higher levels of interaction do not necessarily have a direct impact on the total accumulated reward. More importantly, the consistency of feedback seems to be more relevant when dealing with different learner obedience parameters (or a noisy or unreliable communication channel) since small variations can lead to considerably different amounts of collected reward.

Agents with a smaller standard deviation are preferred candidates to be parent-like teachers since they have a much better distribution of knowledge among the states. This allows them to adequately advise learner-agents on what action to perform in specific states. Agents with biased knowledge distributions collect more reward themselves but nevertheless have a subset of states where they cannot properly advise learners. This leads to worse performance in the apprenticeship process in terms of maximal collected reward, convergence speed, and behavior stability, represented as the standard deviation for each visited state.
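As a rough illustration of this selection criterion, the following sketch ranks candidate trainers by how evenly their experience is spread over the states. It assumes that the standard deviation is computed over normalized per-state visit frequencies; the function names `visit_spread` and `pick_trainer` are hypothetical, and the sketch is not the analysis code used in this work.

```python
from statistics import pstdev


def visit_spread(visit_counts):
    """Population standard deviation of normalized per-state visit frequencies.

    Assumption: `visit_counts` holds one non-negative count per state,
    collected while the candidate trainer was itself learning.
    """
    total = sum(visit_counts)
    if total == 0:
        raise ValueError("no visits recorded")
    frequencies = [count / total for count in visit_counts]
    return pstdev(frequencies)


def pick_trainer(candidates):
    """Return the name of the candidate with the smallest visit spread.

    `candidates` maps a trainer name to its list of per-state visit counts.
    """
    return min(candidates, key=lambda name: visit_spread(candidates[name]))
```

With this measure, an agent whose visits are concentrated in a few states receives a larger spread than one that has visited all states about equally often, so the latter would be picked as trainer.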

The finding that an expert in a certain domain is not necessarily a good teacher might also help the understanding of biological or natural systems in terms of assistive teaching. For instance, a good soccer player is not necessarily a good soccer trainer. We are not aware of studies that confirm this in biological systems or human-human interaction. However, Taylor et al. (2011) gave some interesting insights about a human-agent interaction approach. Also, Griffiths et al. (2012) studied different teacher behaviors to improve the apprenticeship in learner-agents. Although their experiments are based on human-human interaction, they have used tutors that have mastered a given task without any classification about the level of expertise.

An important direction for future work is to investigate how the obtained results can be scaled up to either larger discrete or continuous scenarios. Many real-world problems have inherently continuous characteristics, and many of them have been addressed using autonomous RL by discretizing the state-action space. This discretization may introduce hidden states or hidden actions for the RL agent. However, a human trainer may not know or have access to this discrete representation and may advise actions that do not map directly onto the discrete state-action representation used by the learner-agent. Therefore, if the learner-agent maps the given advice into the discrete representation, this could introduce a slight error which could accumulate over time, rendering the learned policy useless. An alternative is to address the problem directly in its continuous representation, but to the best of our knowledge, continuous IRL has not been studied yet. It can be expected that RL agents behave similarly in continuous and discrete scenarios since they are designed to find the optimal solution maximizing the collected reward.

Moreover, adaptive learner behavior can be explored, allowing the learner-agent to decide which advice to follow depending on the knowledge it has collected about the current state at a specific time. The learner-agent would then act with varying values of the learner obedience parameter, adapting it in real time. Greater learner obedience can be expected at the beginning of the learning process, but over time the learner-agent should take its own experience more into account and therefore follow its own policy instead of the parent-like advice, leading to smaller obedience values. In the same way, if new space is explored and the reward consequently gets worse, parent-like advice could be used once again. This leads to a dynamic learning process that takes advice into account when necessary while avoiding bad advice when possible.
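One possible form of such an adaptive schedule is sketched below. It is only an illustration of the idea and has not been evaluated in this work: the function name `update_obedience`, the multiplicative decay, the additive boost when the reward drops, and all default constants are assumptions.

```python
def update_obedience(obedience, recent_reward, previous_reward,
                     decay=0.99, boost=0.2, floor=0.0, ceiling=1.0):
    """Return the obedience to use in the next episode (hypothetical rule).

    Assumptions: obedience is a probability in [floor, ceiling]; rewards are
    per-episode totals; a drop in reward signals that new, poorly known
    states are being explored, so advice becomes more valuable again.
    """
    if recent_reward < previous_reward:
        # Reward got worse: rely more on the parent-like advice again.
        obedience = obedience + boost
    else:
        # Reward is stable or improving: trust the learner's own policy more.
        obedience = obedience * decay
    # Keep the value a valid probability.
    return min(ceiling, max(floor, obedience))
```

Starting from an obedience of 1.0 and calling this once per episode yields a slowly decreasing obedience while learning goes well, with temporary increases whenever the collected reward deteriorates.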


Acknowledgments

The authors gratefully acknowledge partial support by CONICYT scholarship 5043, the German Research Foundation DFG under project CML (TRR 169), the European Union under project SECURE (No 642667), and the Hamburg Landesforschungsförderungsprojekt CROSS.


References

  • Ahmadabadi et al. (2000) Ahmadabadi, M. N., Asadpur, M., Khodanbakhsh, S. H., & Nakano, E. (2000). Expertness measuring in cooperative learning. Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 2261–2267.
  • Ahmadabadi & Asadpour (2002) Ahmadabadi, M. N., & Asadpour, M. (2002). Expertness based cooperative Q-learning. IEEE Transactions on Systems, Man, and Cybernetics, Vol. 32, No. 1, pp. 66–76.
  • Amir et al. (2016) Amir, O., Kamar, E., Kolobov, A., & Grosz, B. (2016). Interactive teaching strategies for agent training. Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pp. 804–811.
  • Breazeal & Velásquez (1998) Breazeal, C., & Velásquez, J. (1998). Toward teaching a robot ’infant’ using emotive communication acts. Proceedings of the Simulated Adaptive Behavior Workshop on Socially Situated Intelligence, pp. 25–40.
  • Cangelosi & Schlesinger (2015) Cangelosi, A. & Schlesinger, M. (2015). Developmental Robotics: From Babies to Robots. MIT Press.
  • Cederborg et al. (2015) Cederborg, T., Grover, I., Isbell, C. L., & Thomaz, A. L. (2015). Policy shaping with human teachers. Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pp. 3366–3372.
  • Cruz et al. (2014) Cruz, F., Magg, S., Weber, C., & Wermter, S. (2014). Improving reinforcement learning with interactive feedback and affordances. Proceedings of the IEEE International Conference on Development and Learning and Epigenetic Robotics (ICDL-EpiRob), pp. 165–170.
  • Cruz et al. (2015) Cruz, F., Twiefel, J., Magg, S., Weber, C., & Wermter, S. (2015). Interactive reinforcement learning through speech guidance in a domestic scenario. Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN), pp. 1341–1348.
  • Cruz et al. (2016) Cruz, F., Magg, S., Weber, C., & Wermter, S. (2016). Training agents with interactive reinforcement learning and contextual affordances. IEEE Transactions on Cognitive and Developmental Systems (TCDS), Vol. 8, No. 4, pp. 271–284.
  • da Silva et al. (2017) da Silva, F. L., Glatt, R., & Costa, A. H. R. (2017). Simultaneously Learning and Advising in Multiagent Reinforcement Learning. Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems (AAMAS), pp. 1100–1108.
  • Deak et al. (2014) Deak, G. O., Krasno, A. M., Triesch, J., Lewis, J., & Sepeta, L. (2014). Watch the hands: Infants can learn to follow gaze by seeing adults manipulate objects. Developmental Science, Vol. 17, No. 2, pp. 270–281.
  • Gershman & Niv (2015) Gershman, S. J., & Niv, Y. (2015). Novelty and inductive generalization in human reinforcement learning. Topics in Cognitive Science, Vol. 7, No. 3, pp. 391–415.
  • Griffith et al. (2013) Griffith, S., Subramanian, K., Scholz, J., Isbell, C., & Thomaz, A. (2013). Policy shaping: Integrating human feedback with reinforcement learning. Advances in Neural Information Processing Systems (NIPS), pp. 2625–2633.
  • Griffiths et al. (2012) Griffiths, S., Nolfi, S., Morlino, G., Schillingmann, L., Kuehnel, S., Rohlfing, K., & Wrede, B. (2012). Bottom-up learning of feedback in a categorization task. Proceedings of the IEEE International Conference on Development and Learning and Epigenetic Robotics (ICDL-EpiRob), pp. 1–6.
  • Hämmerer & Eppinger (2012) Hämmerer, D., & Eppinger, B. (2012). Dopaminergic and prefrontal contributions to reward-based learning and outcome monitoring during child development and aging. Developmental Psychology, Vol. 48, No. 3, pp. 862–874.
  • Knox et al. (2013) Knox, W. B., Stone, P., & Breazeal, C. (2013). Teaching agents with human feedback: a demonstration of the tamer framework. Proceedings of the ACM International Conference on Intelligent User Interfaces Companion, pp. 65–66.
  • Kober et al. (2013) Kober, J., Bagnell, J. A., & Peters, J. (2013). Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, Vol. 32, No. 11, pp. 1–37.
  • Kormushev et al. (2013) Kormushev, P., Calinon, S., & Caldwell, D. (2013). Reinforcement learning in robotics: Applications and real-world challenges. Robotics, Vol. 2, No. 3, pp. 122–148.
  • Kornblum et al. (1990) Kornblum, S., Hasbroucq, T., & Osman, A. (1990). Dimensional overlap: cognitive basis for stimulus-response compatibility–a model and taxonomy. Psychological Review, Vol. 97, No. 2, pp. 253–270.
  • Lin (1991) Lin, L. J. (1991). Programming Robots Using Reinforcement Learning and Teaching. Proceedings of the Association for the Advancement of Artificial Intelligence Conference (AAAI), pp. 781–786.
  • Mnih et al. (2015) Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., & Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, Vol. 518, No. 7540, pp. 529–533.
  • Niv (2009) Niv, Y. (2009). Reinforcement learning in the brain. Journal of Mathematical Psychology, Vol. 53, No. 3, pp. 139–154.
  • Price & Boutilier (1999) Price, B. & Boutilier, C. (1999). Implicit imitation in multiagent reinforcement learning. Proceedings of the International Conference on Machine Learning (ICML), pp. 325–334.
  • Rummery & Niranjan (1994) Rummery, G. A. & Niranjan, M. (1994). On-line Q-learning using connectionist systems, Technical report CUED/F-INFENG/TR166, Cambridge University Engineering Department, Cambridge, U.K.
  • Suay & Chernova (2011) Suay, H. B., & Chernova, S. (2011). Effect of human guidance and state space size on interactive reinforcement learning. IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), pp. 1–6.
  • Sutton & Barto (1998) Sutton, R. S. & Barto, A. G. (1998). Reinforcement learning: An Introduction, Cambridge, MA, USA: Bradford Book.
  • Taylor et al. (2011) Taylor, M. E., Suay, H. B., & Chernova, S. (2011). Integrating reinforcement learning with human demonstrations of varying ability. Proceedings of the 10th International Conference on Autonomous Agents and Multiagent Systems (AAMAS), pp. 617–624.
  • Taylor et al. (2014) Taylor, M. E., Carboni, N., Fachantidis, A., Vlahavas, I., & Torrey, L. (2014). Reinforcement learning agents providing advice in complex video games. Connection Science, Vol. 26, No. 1, pp. 45–63.
  • Thomaz et al. (2005) Thomaz, A. L., Hoffman, G., & Breazeal, C. (2005). Real-time interactive reinforcement learning for robots. Proceedings of the Workshop on Human Comprehensible Machine Learning, pp. 9–13.
  • Thomaz & Breazeal (2006) Thomaz, A. L., & Breazeal, C. (2006). Reinforcement learning with human teachers: Evidence of feedback and guidance with implications for learning performance. Proceedings of the Association for the Advancement of Artificial Intelligence Conference (AAAI), Vol. 6, pp. 1000–1005.
  • Torrey & Taylor (2013) Torrey, L. & Taylor, M. (2013). Teaching on a budget: Agents advising agents in reinforcement learning. Proceedings of the International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS), pp. 1053–1060.
  • Ugur et al. (2015) Ugur, E., Nagai, Y., Celikkanat, H., & Oztop, E. (2015). Parental scaffolding as a bootstrapping mechanism for learning grasp affordances and imitation skills. Robotica, Vol. 33, No. 5, pp. 1163–1180.
  • Weber et al. (2008) Weber, C., Elshaw, M., Wermter, S., Triesch, J., & Willmot, C. (2008). Reinforcement Learning Embedded in Brains and Robots, chapter 7. I-Tech Education and Publishing.
  • Wiering & Van Otterlo (2012) Wiering, M., & Van Otterlo, M. (2012). Reinforcement Learning, State-of-the-Art. Springer Heidelberg.
  • Wise et al. (1978) Wise, R. A., Spindler, J., & Gerberg, G. J. (1978). Neuroleptic-induced “anhedonia” in rats: pimozide blocks reward quality of food. Science, New Series, Vol. 201, No. 4352, pp. 262–264.