Newtonian Action Advice: Integrating Human Verbal Instruction with Reinforcement Learning

A goal of Interactive Machine Learning (IML) is to enable people without specialized training to teach agents how to perform tasks. Many of the existing machine learning algorithms that learn from human instructions are evaluated using simulated feedback and focus on how quickly the agent learns. While this is valuable information, it ignores important aspects of the human-agent interaction such as frustration. In this paper, we present the Newtonian Action Advice agent, a new method of incorporating human verbal action advice with Reinforcement Learning (RL) in a way that improves the human-agent interaction. In addition to simulations, we validated the Newtonian Action Advice algorithm by conducting a human-subject experiment. The results show that Newtonian Action Advice can perform better than Policy Shaping, a state-of-the-art IML algorithm, both in terms of RL metrics like cumulative reward and human factors metrics like frustration.




1 Background

1.1 Reinforcement Learning

Reinforcement learning (RL) is a form of machine learning influenced by behavioral psychology in which an agent learns what actions to take by receiving rewards or punishments from its environment (Sutton and Barto, 1998; Skinner, 1938). The probability people will repeat an action in a given circumstance is increased or decreased if they receive positive or negative reinforcement.

Most RL problems are modeled as Markov Decision Processes (MDPs), in which the agent learns a policy that maps states to actions such that its expected reward is maximized. An MDP is a tuple $(S, A, T, R, \gamma)$ that describes $S$, the states of the domain; $A$, the actions the agent can take; $T$, the transition dynamics describing the probability that a new state will be reached given the current state and action; $R$, the reward earned by the agent; and $\gamma$, a discount factor in which $0 \le \gamma \le 1$.
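To make the tuple concrete, the following minimal sketch stores the five MDP components (states, actions, transition dynamics, reward, and discount factor) in a Python dataclass and computes a discounted return. The class name, field names, the assumption that the discount factor lies in [0, 1], and the two-state toy domain are illustrative and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = str
Action = str


@dataclass
class MDP:
    """Container for the five MDP components (all names are illustrative)."""
    states: List[State]
    actions: List[Action]
    # transition[(s, a)] maps each next state s' to its probability
    transition: Dict[Tuple[State, Action], Dict[State, float]]
    reward: Callable[[State, Action, State], float]
    gamma: float  # discount factor, assumed here to lie in [0, 1]

    def validate(self) -> None:
        """Check the discount factor and that each transition row sums to 1."""
        if not 0.0 <= self.gamma <= 1.0:
            raise ValueError("gamma must lie in [0, 1]")
        for key, row in self.transition.items():
            if abs(sum(row.values()) - 1.0) > 1e-9:
                raise ValueError(f"probabilities for {key} do not sum to 1")


def discounted_return(rewards: List[float], gamma: float) -> float:
    """Sum of gamma**t * r_t over a finite reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


# Hypothetical two-state toy domain, used only to exercise the container.
toy = MDP(
    states=["start", "goal"],
    actions=["stay", "go"],
    transition={
        ("start", "stay"): {"start": 1.0},
        ("start", "go"): {"start": 0.1, "goal": 0.9},
        ("goal", "stay"): {"goal": 1.0},
        ("goal", "go"): {"goal": 1.0},
    },
    reward=lambda s, a, s_next: 1.0 if (s != "goal" and s_next == "goal") else 0.0,
    gamma=0.9,
)
toy.validate()
```

An RL agent would use such a description only implicitly: it observes states, rewards, and transitions through interaction rather than reading the transition table directly.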

Bayesian Q-Learning is an MDP-based RL algorithm in which the utility of each state-action pair is represented as a probabilistic point estimate of the expected long-term discounted reward (Dearden et al., 1998). Bayesian Q-Learning was used as the underlying RL algorithm for both the Policy Shaping and Newtonian Action Advice (NAA) interaction methods in this work.

1.2 Learning from critique: Policy Shaping

Critique was initially used directly as a reward signal (Isbell et al., 2001), but it was later shown (Knox and Stone, 2010; Thomaz and Breazeal, 2008) that it is more efficient to use critique as policy information. Policy Shaping is an interaction algorithm that enables a human teacher’s critique to be incorporated into a Bayesian Q-learning agent as policy information (Griffith et al., 2013) and was used in this work. Cederborg et al. (2015) investigated how to interpret silence while learning from critique with policy shaping.
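As a rough illustration of what using critique as policy information can look like, the sketch below follows the formulation reported by Griffith et al. (2013): the probability that an action is optimal is estimated from the difference between the positive and negative critiques it has received, under an assumed feedback-consistency parameter, and that estimate is multiplied with the RL agent's own action distribution. The function names, the default consistency of 0.8, and the example counts are illustrative assumptions, not values from this paper.

```python
from typing import Dict


def critique_policy(delta: Dict[str, int], consistency: float = 0.8) -> Dict[str, float]:
    """Estimate P(action is optimal) from critique counts.

    delta[action] = (# positive critiques) - (# negative critiques).
    consistency is the assumed probability that a critique is correct.
    """
    probs = {}
    for action, d in delta.items():
        good = consistency ** d
        bad = (1.0 - consistency) ** d
        probs[action] = good / (good + bad)
    return probs


def combine(rl_probs: Dict[str, float], critique_probs: Dict[str, float]) -> Dict[str, float]:
    """Multiply the two action distributions and renormalize."""
    product = {a: rl_probs[a] * critique_probs[a] for a in rl_probs}
    total = sum(product.values())
    return {a: p / total for a, p in product.items()}


# Hypothetical counts: "right" was praised twice, "left" criticized once.
feedback = critique_policy({"left": -1, "right": 2, "up": 0})
rl = {"left": 0.4, "right": 0.3, "up": 0.3}
policy = combine(rl, feedback)
```

With no critique for an action (a difference of zero), the critique term is 0.5 for every action and the combined policy reduces to the RL agent's own distribution.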

1.3 Learning from advice

Various forms of advice have been developed in other work, including linking one condition to each action (Maclin et al., 2005), and linking a condition to rewards (MacGlashan et al., 2015). Several connect conditions to higher-level actions that are defined by the researcher instead of primitive actions (Maclin et al., 2005; Kuhlmann et al., 2004; Joshi et al., 2012). Argall et al. (2008) create policies using demonstrations and advice. Meriçli et al. (2014) parse language into a graphical representation and finally to primitive actions. Maclin et al. (2005) have the person provide a relative preference of actions. Sivamurugan and Ravindran (2012) explored learning multiple interpretations of instructions. Tellex et al. (2011) represent natural language commands as probabilistic graphical models.

Most methods are permanently influenced by the advice. Kuhlmann et al. (2004) can adjust for bad advice by learning biased function approximation values that negate the advice. Maclin et al. (2005) use a penalty for not following the advice that decreases with experience. Newtonian Action Advice differs because the advice can be overwritten by new, contradictory advice in the future.

Many researchers incorporate advice using IF-THEN rules and formal command languages (Maclin et al., 2005; Kuhlmann et al., 2004); if the state meets a condition, then the learner takes the advice into account. Formal command languages and IF-THEN rules require advice that is state specific and contains numbers. Similar to this work, the advice in Argall et al. (2008) does not require people to give specific numbers for continuous state variables, but uses a set of predefined advice operators.

1.4 Natural Language Processing (NLP)

1.4.1 Automatic Speech Recognition (ASR)

ASR software transcribes the human teacher’s verbal instructions to written text. The human-subject experiment in this work used the Sphinx ASR software (Huggins-Daines et al., 2006).

1.4.2 Sentiment Analysis

Sentiment analysis is an NLP tool used to classify movie, book, and product reviews into positive and negative (Pang and Lee, 2008). Sentiment analysis has not been widely applied to action selection. One method we previously developed for using sentiment analysis is to classify natural language advice into advice of ‘what to do’ and warnings of ‘what not to do’ (Krening et al., 2017). Many approaches to learning from language instruction require people to provide instructions using specific words, often in a specific order or format (Meriçli et al., 2014). Thomason et al. (2015) worked to get around limitations like keyword search by creating an agent that learns semantic meaning from the human. In this work we created a method of using sentiment analysis to filter verbal critique into positive and negative, which furthers the goal of allowing people to provide verbal instructions without being limited to a specific dictionary of words.

This work uses Stanford’s deep learning sentiment analysis software (Manning et al., 2014), which uses Recursive Neural Tensor Networks and the Stanford Sentiment Treebank (Socher et al., 2013). The Stanford Sentiment Treebank is a corpus of fully labeled parse trees built from the movie review dataset of Pang and Lee (2005).

2 Newtonian Action Advice

Newtonian Action Advice is a teaching interaction algorithm we designed to enable an RL agent to learn from human action advice. The theory is a metaphor of Newtonian dynamics: objects in motion stay in motion unless acted on by an external force. In the Newtonian Action Advice model, a piece of action advice provided by the human is an external force on the agent. Once a person provides action advice (e.g., “Go right”), the agent will immediately move in the direction of the external force, superseding the RL agent’s normal action selection choice of exploration vs. exploitation. The model contains natural friction that ‘slows down’ the agent’s need to follow the human’s advice. The friction ensures that after some amount of time, the agent will resume the underlying RL algorithm’s exploration policy. The advice does not necessarily need to specify directional motion; the metaphor of advice as a force pushes the agent to follow the advice as opposed to the RL algorithm’s action selection mechanism.

Figure 1: Simple Force Model. Actions are an external force acting on the agent, and ‘friction’ determines the amount of time the action will be followed after the advice is given.

Newtonian Action Advice was designed to behave in a manner that is more natural and intuitive for the human teacher than other IML algorithms such as Policy Shaping. The force model allows each piece of advice to be generalized through time. If a person says, “go right,” the Newtonian Action Advice agent will move right and keep moving right until the ‘friction’ causes the agent to resume normal exploration. The simplicity of the force model is a feature to improve the human experience; people deal with Newtonian mechanics in their everyday life and are used to objects moving in a Newtonian manner. For example, if a ball is thrown straight up in the air the same way multiple times, it will always rise to a certain height and fall back to the ground. The motion is predictable, and one does not need formal physics training to recognize and expect the motion. We expect that loosely mimicking this will help create a user-friendly experience.

If advice was followed in one state, the Newtonian Action Advice algorithm will cause the agent to follow the same advice if the state is seen in the future. This means that a person will only have to provide advice once for a given situation. Also, only the latest advice is saved for a state, so people can correct any mistakes they made or change the desired policy in real time.

Bayesian Q-Learning was used as NAA’s underlying RL algorithm. This choice was primarily made so the Newtonian Action Advice algorithm could be more directly compared to Policy Shaping. The structure of the NAA algorithm is such that the BQ-L algorithm could be exchanged for a different RL algorithm.

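procedure NAA
    for each time step do
        Listen for human advice
        if human advice given then
            newAdvice()
        actionSelection()
        Take action and get reward
        Update Bayesian Q-learning policy with reward
procedure newAdvice()
    Add the new (state, advice) pair to the advice dictionary
    Set the parameter that tells actionSelection to follow the new advice
procedure actionSelection()
    if new advice was recently given and should still be followed then
        if the (state, advice) pair is not in the dictionary then
            Add the (state, advice) pair to the dictionary
        timesNewAdviceFollowed += 1
        if timesNewAdviceFollowed has reached the friction threshold then
            Reset parameters so BQL action selection resumes on the next time step
    If advice has been given for this state, choose between the human advice and the algorithm
    Otherwise, if advice has not been given for this state in the past, use the action recommended from the BQL algorithm
    return the chosen action
Algorithm 1 Newtonian Action Advice algorithm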

In Algorithm 1, NAA is the main procedure. At each time step, the agent listens for advice. If advice is given, the agent updates its internal advice dictionary. The agent then chooses and takes an action, receives a reward, and updates the Bayesian Q-learning policy.

The New Advice procedure adds the new (state, advice) pair to the agent’s dictionary and sets a parameter that will tell the action selection procedure to follow the new advice.

The Action Selection procedure first checks whether advice has recently been given and should still be followed. If the advice is being generalized through time (due to low friction) and the new (state, advice) pair has not been added to the dictionary, it is added at this time. If this state is revisited in the future, the recent advice given for a previous state will be applied as if it had been given for this state, too. The timer that tracks the friction parameter threshold is then updated. If the timer indicates that the advice has been followed for long enough, parameters are reset so the agent returns to the Bayesian Q-learning action selection on the next time step.

If advice has previously been given for this state, the agent must choose between the human advice and the BQL suggestion. For this work, we always choose the human’s advice. If a researcher wants to encourage more exploration, a different method can be chosen (e.g., an algorithm similar to $\epsilon$-greedy applied to human vs. agent action selection instead of exploration vs. exploitation). However, we have found that following advice in a probabilistic manner increases frustration and uncertainty, since the agent seems to disregard advice. If no advice has been given for or generalized to the current state, the action is chosen by the Bayesian Q-learning action selection method.
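The procedures described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class and attribute names are hypothetical, the underlying learner is reduced to a callable `rl_action` standing in for Bayesian Q-learning action selection, the reward update is omitted, and advice generalized through time is assumed to overwrite any older advice stored for a state.

```python
from typing import Callable, Dict, Hashable, Optional


class NAAAgent:
    """Sketch of Newtonian Action Advice action selection (names are illustrative)."""

    def __init__(self, rl_action: Callable[[Hashable], str], persistence_steps: int):
        self.rl_action = rl_action                  # stand-in for BQL action selection
        self.persistence_steps = persistence_steps  # steps each piece of advice persists
        self.advice: Dict[Hashable, str] = {}       # latest advice per state
        self.current_advice: Optional[str] = None   # advice still acting as a "force"
        self.times_followed = 0

    def new_advice(self, state: Hashable, action: str) -> None:
        """Store the advice for this state and start following it."""
        self.advice[state] = action
        self.current_advice = action
        self.times_followed = 0

    def select_action(self, state: Hashable) -> str:
        if self.current_advice is not None:
            # Generalize the recent advice through time to the current state.
            self.advice[state] = self.current_advice
            self.times_followed += 1
            if self.times_followed >= self.persistence_steps:
                # "Friction" has stopped the advice; resume normal selection next step.
                self.current_advice = None
                self.times_followed = 0
        if state in self.advice:
            return self.advice[state]   # always follow human advice when it exists
        return self.rl_action(state)    # otherwise defer to the RL algorithm


# Usage with a placeholder RL policy that always answers "explore".
agent = NAAAgent(rl_action=lambda s: "explore", persistence_steps=2)
agent.new_advice(state=0, action="right")
trace = [agent.select_action(s) for s in (0, 1, 2)]
```

In this trace the advice given in state 0 is also followed in state 1, after which the agent falls back to the placeholder RL policy in state 2. If state 1 is revisited later, the stored advice is reused without the teacher repeating it.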

2.1 Combining Supervised and Reinforcement Learning to allow for personalization by end-users

When designing an interactive machine learning algorithm, one first must ask: what is the goal of IML? Is the goal to use human instruction to decrease training time? Or is the goal to enable people to teach an agent to perform a task in the way the human intends?

If the goal of IML is to use human instruction to decrease the amount of time it takes to train the RL agent, at first glance it seems like the best of both worlds. We get to use powerful RL algorithms that are capable of learning from their environment instead of having policies hard-coded or models from datasets built a priori; and we get to decrease training time by getting useful human input.

However, if the goal is to get the agent to perform a task that a non-expert specifies in the way the human wants the task done, then we have a problem.

The policy an RL agent learns is very sensitive to the reward function. In this work, as well as most IML research, the reward function is provided by the researcher prior to the experiment. Given a reward function, an RL algorithm will learn a policy to maximize the reward, but the policy learned may be very alien to a human mind. The agent will technically complete the task efficiently, but not in a way that makes a lot of sense to a human. Given a reward function and human input, the RL algorithm may initially learn a policy that conforms to the human’s instructions, but eventually might learn a policy that solely maximizes cumulative reward. This may satisfy a goal of decreasing training time since the human will have shown the RL agent high-earning states earlier than it would have explored, but the policy may still be baffling to a non-expert. The human teacher may feel like their instructions were disregarded in the long-term, creating feelings of frustration and powerlessness when they cannot directly control the agent’s policies.

Newtonian Action Advice can be seen as a way to combine supervised and reinforcement learning. A human provides information that is used to create policies that are static to the agent (supervised learning), but can be overwritten by the human. The agent uses reinforcement learning to determine the best course of action for parts of the state space in which human advice is not given. Not only does this enable the agent to use human input to decrease training time, it also empowers the human teacher to customize how the agent performs the task. It is possible the learned policy will be near-optimal instead of optimal from an objective analysis of the cumulative reward; but the agent’s performance will be in greater accordance with the human teacher’s instructions, which will increase the human’s satisfaction with the agent’s performance.

2.2 Choosing the friction parameter

When calculating the algorithm’s ‘friction’ parameter, it is more straightforward to think of the parameter as the number of steps each piece of advice should persist for, $p$. Increasing $p$ causes each piece of advice to be followed for a longer time, which causes a metaphorically lower friction in the model.

$p_{des}$ is the desired number of steps advice should persist for

$p_{min}$ is the minimum number of steps advice should persist for

$p_{max}$ is the maximum number of steps advice should persist for

$f$ is the domain update rate

$t_{des}$ is the desired time between given advice

$t_{min}$ is the minimum time between given advice

$t_{max}$ is the maximum time between given advice

$$p_{min} = f \, t_{min} \tag{1}$$

$$p_{max} = f \, t_{max} \tag{2}$$

$$p_{des} = f \, t_{des} \tag{3}$$

Equations 1-3 show how to calculate the minimum, maximum, and desired values of the friction parameter. Two items should be noted for $p_{min}$. First, the value of $t_{min}$ has a lower bound based on human limitations. You cannot expect people to provide advice infinitely quickly. It is nonsensical to provide advice that only lasts for a fraction of a second; if $t_{min}$ is too small, the agent may follow the advice for such a short amount of time that it is not perceivable by the human. We suggest […]. Second, the advice must last for at least one time step, so $p_{min} \ge 1$.

Depending on the domain, task, and nature of the actions, we suggest a starting value of $t_{des}$ between 2 and 8 seconds.

Equation 4 shows the bounds on the friction parameter. The value of $p_{min}$ has the potential to run into a hard boundary based on human limitations, while $p_{max}$ is a flexible boundary based on desired behavior.

$$p_{min} \le p_{des} \le p_{max} \tag{4}$$

Instead of calculating the friction parameter using the desired time between when advice is given, the average duration of each action can be used.

$n_{des}$ is the desired number of actions between when advice is given

$s_{a}$ is the average number of steps it takes to complete an action

$t_{a}$ is the average time it takes to complete an action

If the human teacher is instructing the agent to take primitive actions, then $s_{a} = 1$. If the human is providing instructions for higher order actions, then $s_{a} > 1$. An action must last for the duration of at least one time step, so $s_{a} \ge 1$. Equations 5 and 6 show expressions for $p_{des}$ depending on whether the researcher has easier access to $s_{a}$ or $t_{a}$.

$$p_{des} = n_{des} \, s_{a} \tag{5}$$

$$p_{des} = n_{des} \, f \, t_{a} \tag{6}$$

If primitive actions are being used, it is likely that the ideal $p_{des}$ will be fairly large because the human teacher will want a given piece of advice to be followed for several consecutive time steps. If higher order actions are being used, a smaller $p_{des}$ may be beneficial because it will already take the agent several time steps to carry out the higher order action.
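The parameter choice can be sketched numerically. The example below assumes that the number of steps advice persists for equals the domain update rate (steps per second) multiplied by a time in seconds, and that the desired value is clamped between the minimum and maximum, with a floor of one step. The function names, the rounding choices, and the example numbers are hypothetical and are not taken from the paper.

```python
import math
from typing import Optional


def persistence_steps(update_rate_hz: float, t_des: float, t_min: float, t_max: float) -> int:
    """Convert desired/min/max times between advice (seconds) into a step count."""
    p_min = max(1, math.ceil(update_rate_hz * t_min))       # advice lasts at least one step
    p_max = max(p_min, math.floor(update_rate_hz * t_max))  # flexible upper bound
    p_des = round(update_rate_hz * t_des)
    return min(max(p_des, p_min), p_max)                    # clamp into [p_min, p_max]


def persistence_from_actions(
    n_actions: float,
    steps_per_action: Optional[float] = None,
    time_per_action: Optional[float] = None,
    update_rate_hz: Optional[float] = None,
) -> int:
    """Alternative: derive the step count from the average duration of an action."""
    if steps_per_action is not None:
        steps = n_actions * steps_per_action
    elif time_per_action is not None and update_rate_hz is not None:
        steps = n_actions * update_rate_hz * time_per_action
    else:
        raise ValueError("give steps_per_action, or time_per_action with update_rate_hz")
    return max(1, round(steps))


# Hypothetical domain updating twice per second, with advice expected every 4 seconds.
p = persistence_steps(update_rate_hz=2.0, t_des=4.0, t_min=1.0, t_max=10.0)
```

Under these assumptions the example yields 8 steps per piece of advice. A desired time shorter than the minimum is raised to the minimum step count rather than rejected, which is one possible design choice among several.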

3 Method

We validated our NAA algorithm first with oracles to test the theoretical performance of the algorithm, and then with a human-subject experiment to compare the human teacher’s experience with another IML interaction algorithm, Policy Shaping.

Many of the existing machine learning algorithms that learn from human feedback are evaluated using oracles and focus on how quickly the agent learns. While this is valuable information, it ignores other important aspects of the human-agent interaction such as how humans react to the agent. For example, an oracle will never get frustrated with the agent or confused by its actions. If the interaction method affects how frustrated people are, regardless of the underlying machine learning algorithm, then perhaps that interaction method should be avoided. To that end, interactive machine learning agents should not solely be designed to optimize theoretical learning curves from simulations, but also to create a positive experience for the human teacher.

Both the simulations and human-subject experiment used the same task domain. The oracle or human participant was required to teach agents to rescue a person in Radiation World, a game developed in the Unity Minecraft environment (Figure 2). In the experimental scenario, there has been a radiation leak and a person is injured and immobile. The agent must find the person and take him to the exit while avoiding the radiation.

Figure 2: Radiation World Initial Condition

3.1 Constructing Oracles

We first tested the Newtonian Action Advice algorithm with simulations that used a constructed oracle to simulate human feedback. Each oracle was instantiated with a probability, $P$, that determined how often to check for advice from the oracle. When a simulation tested the algorithm’s performance for a given value of $P$, a random decimal, $r$, was chosen between 0 and 1 at every time step. The agent checked for advice if $r < P$, and otherwise did not. We provided the advice for the oracles to test several cases, including maximum friction, two cases of minimal advice, and decreased friction.

The same oracle algorithm controlling when to check for instructions was used to test the Policy Shaping agent. The same advice dictionary was used for the NAA and Policy Shaping oracles. The advice dictionary was converted to critique for the Policy Shaping agent in the following manner: if the agent took the advised action for the state, the critique was positive; otherwise, the critique was negative.
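The oracle mechanism and the advice-to-critique conversion can be sketched as below. This is an illustrative reading of the description, not the authors' code: the class and function names, the use of +1 and -1 for positive and negative critique, and the choice to return no critique for states without advice are assumptions.

```python
import random
from typing import Dict, Hashable, Optional


class Oracle:
    """Simulated teacher that is consulted with a fixed probability each time step."""

    def __init__(self, advice: Dict[Hashable, str], check_probability: float,
                 rng: Optional[random.Random] = None):
        self.advice = advice
        self.check_probability = check_probability
        self.rng = rng if rng is not None else random.Random()

    def maybe_advise(self, state: Hashable) -> Optional[str]:
        """Return the advised action for this state, or None if not consulted."""
        if self.rng.random() < self.check_probability:
            return self.advice.get(state)
        return None


def advice_to_critique(advice: Dict[Hashable, str], state: Hashable,
                       action_taken: str) -> Optional[int]:
    """Turn action advice into critique: +1 if the advised action was taken, else -1."""
    advised = advice.get(state)
    if advised is None:
        return None  # assumption: no critique when the dictionary has no advice here
    return 1 if action_taken == advised else -1


# Hypothetical advice dictionary keyed by grid coordinates.
advice_dict = {(0, 0): "down", (0, 1): "right"}
oracle = Oracle(advice_dict, check_probability=0.2, rng=random.Random(0))
```

Sharing one advice dictionary between the two conversions keeps the information available to both agents equivalent, which is the property the comparison relies on.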

3.2 Human-Subject Experiment

We conducted a repeated measures human-subject experiment in which we investigated the effect of two different interaction methods, NAA and Policy Shaping, on the human’s experience of teaching the agent. Both the interaction methods shared the same underlying Bayesian Q-learning algorithm.

Each participant trained two agents with different interaction methods: NAA and Policy Shaping. Both agents learned from verbal natural language instruction, which was transcribed to text using ASR software. After language processing, the processed human instructions were sent to an interaction algorithm (Policy Shaping or NAA). Then, the interaction algorithm worked with the RL agent to determine action selection.

The Policy Shaping agent learned by incorporating a human teacher’s positive and negative critique. People were instructed to provide positive or negative critique in response to the agent’s actions. We used sentiment analysis as a filter to enable people to provide verbal critique without restricting their vocabulary. For example, a participant could give varied critique such as, “Good job”, “That’s great”, “That is a bad idea”, and “You’re wasting time”.

The NAA agent learned from a human teacher’s action advice. Participants were instructed to tell the agent to move in a desired direction. For example, if participants wanted the agent to move right, they should say, “right.” The only four words the participants should have used while training the action advice agent were, “up,” “down,” “left,” and “right.” These four directions were grounded to the agent’s actions.

The experiment collected data from 24 participants. None of our participants were ML experts. We made a concerted effort to recruit non-technical participants. Our participants included a piano teacher, lawyer, director of photography, political science student, and Marine veteran.

The experiment randomly split participants into two groups. The first group trained the Policy Shaping agent first, and the second group trained the NAA agent first. Participants were told to stop training when either the agent was performing as they intended or the participant wanted to stop for any reason. Thus, the training time varied for each participant and interaction method. After participants finished training an agent, they filled out a questionnaire concerning the experience. After training both agents, the participants filled out a questionnaire directly comparing the two agents.

In these questionnaires, the participants scored frustration, perceived performance, transparency, immediacy, and perceived intelligence. For example, immediately after training an agent, the participants were asked to score the intelligence of the agent on a continuous scale from [0:10]. A value of 0 indicated that the agent was not intelligent, while 10 meant very intelligent. The same scale of [0:10] was used for additional human factors metrics including performance, frustration, transparency, and immediacy. Values of 0 corresponded to poor performance, low frustration, non-transparent use of feedback, and a slower response time. Values of 10 meant excellent performance, high frustration, clear use of feedback, and an immediate response time, respectively.

4 Results and Discussion

4.1 Simulations

4.1.1 No generalization through time (extreme friction)

We simulated how the percentage of time advice is followed impacts performance. The simulation was set up so the NAA agent did not generalize a given piece of advice to other states immediately after the advice was given, meaning that one piece of advice counted for only one time step (maximum friction, with advice persisting for a single step). The oracle was built with advice given for every square in the grid (Figure 3).

Incorporating human instruction by using the Newtonian Action Advice algorithm allows the RL agent to achieve a higher level of performance in many fewer episodes than without human input. As advice is given for a greater number of individual time steps (the probability of advice increasing from 20% to 90%), the agent accumulates more reward and completes each episode with fewer actions (Figure 4). The case with no human input is shown as BQL on the figures.

Figure 3: Advice given to simulation to avoid radiation.

Figure 4: How the amount of advice provided impacts performance (advice given 20, 50, and 90 percent of the time). (a) Reward. (b) Number of steps to complete an episode.

Figure 5: Minimal advice for two paths. (a) Minimal advice given to simulation to complete task with minimal steps. (b) Minimal advice given to simulation to avoid radiation.

4.1.2 Minimal Advice - shortest path

The minimal advice to take the shortest path (which takes the agent next to the radiation) consists of only two pieces of action advice, equivalent to a human saying, “First move down. Then go right.” The minimal advice used to create the oracle in this case is represented in Figure 5(a).

Given only two pieces of advice, the NAA agent was able to complete the episode in 10 steps, achieving a reward of 102.0 in every episode. The NAA agent was set to follow each piece of advice for a fixed number of steps before returning to the BQ-L baseline action selection.

The NAA model with a decreased friction parameter is what allows a human teacher to say “down, right,” instead of, “down, down, down, down, down, right, right, right, right, right.” It makes for a much better and more intuitive user experience to provide less instruction and not have to constantly repeat advice.

4.1.3 Minimal Advice - avoiding radiation

The minimal advice to take a path that avoids the radiation consists of only four pieces of action advice, equivalent to a human saying, “First move right. Then go down. Move left then immediately right after rescuing the injured person.” This advice, which was used to construct the oracle for this case, can be seen in Figure 5(b).

Given only four pieces of advice, the NAA agent was able to complete the episode in 12 steps, achieving a reward of 100.0 in every episode. The NAA agent was set to follow each piece of advice for a fixed number of steps before returning to the BQ-L baseline action selection.

This case is an example of the optimal vs. customized discussion in Section 2.1. The learned policy was near-optimal instead of optimal from an objective analysis of the cumulative reward since the path to avoid the radiation was slightly longer, but the agent’s performance was in accordance with the human teacher’s advised path.

4.1.4 Generalization through time (friction effect)

We studied how the algorithm performs as the friction of the NAA model is decreased (i.e., the number of steps each piece of advice persists for is increased). Sections 4.1.2 and 4.1.3 have already shown that a small amount of advice paired with a lowered friction can enable the NAA agent to perform optimally or near-optimally from the very first episode. To test the friction effect more rigorously, we built an oracle with the same advice given for every square as Section 4.1.1 (Figure 3).

When advice is given 20% of the time, the agent with a lower friction initially performs better than the agent with a higher friction. However, in later episodes the agent with lower friction earns a lower cumulative reward while taking more steps to complete each episode compared to the high friction agent. In general, these results indicate that, while lowering the friction can increase initial performance, it can also cause a lower-performing policy to be learned by the agent.

But what is really going on in this case? We have seen in Sections 4.1.2 and 4.1.3 that minimal advice paired with a lowered friction enables the agent to perform optimally or near-optimally from the very first episode. Why would providing more advice harm the agent’s performance, particularly when Section 4.1.1 showed that increasing advice increases performance? The core issue is a limitation of the oracle. At every time step, the oracle listens for advice with a probability of 20%. This is not how a human would provide advice. The decreased performance in this case occurs when the agent spends time repeatedly banging into walls after advice has been generalized to a wall state instead of the oracle providing advice for that wall state. The probability is low enough that this behavior is not corrected for many episodes. Human teachers who observed this behavior would quickly provide an extra piece of advice to make sure the agent did not fruitlessly waste time.

When people decrease advice, they tend to limit themselves to the most important pieces of advice, such as the minimal advice cases. The oracle has no way to know which advice is the most important, and so provides advice in a way that is not indicative of human behavior. A possible solution to this problem is to build more elaborate oracles that more accurately represent human behavior. There are three main issues with this approach: 1) the use of and response to an algorithm will vary across individuals, so multiple contradictory oracles would need to be constructed, 2) an oracle’s ability to provide a type of input does not mean a human is likely or able to provide that input in reality, and 3) it is very unlikely that even the most elaborate oracle could simulate the human’s response to the agent, such as frustration. A more practical solution to this problem is to test algorithms with human-subject experiments.

This case shows why IML researchers should verify interaction algorithms with human-subject experiments in addition to simulations. If we had analyzed these results without understanding the limitations of the oracle, we might have discarded parameterizations using a lower friction.

Figure 6: Generalization through time (advice given 20, 50, and 90 percent of the time). (a) Reward. (b) Number of steps to complete an episode.

4.1.5 Comparison of Newtonian Action Advice and Policy Shaping

Figure 7: Comparison of Newtonian Action Advice and Policy Shaping. (a) Reward. (b) Number of steps to complete an episode.

Figure 7 shows that, given equivalent input, Newtonian Action Advice can learn faster using less human instruction than Policy Shaping. Even when Policy Shaping used advice 98% of the time, it learned more slowly than the NAA agent that was using input only 20% of the time. The oracle used the same setup as Sections 4.1.1 and 4.1.4 (Figure 3).

When learning from human teachers in practice, however, the performance of each agent is entirely dependent on the instruction provided by the person. Neither agent is guaranteed to perform better than the other. If the human provides no instructions, the Policy Shaping and NAA agents perform equally since they both reduce to a Bayesian Q-learning algorithm. In order to investigate how the performance of the agents varied with real human teachers, as well as how the human experience was impacted by interacting with each agent, we conducted a human-subject experiment.

4.2 Human Subject Results

Immediately after training each agent, participants were asked to score aspects of their experience training the agent, including frustration, perceived performance, transparency, immediacy, and perceived intelligence (Figure 8). Paired t-tests were conducted for each metric, with the null hypothesis that the mean of the pairwise differences between the two paired groups was zero. We found that all measured aspects of the human experience differed significantly between the two agents.
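As an illustration of the paired t-test described above, the following is a minimal sketch of the test statistic using only the Python standard library. The function name `paired_t_statistic` and the example ratings are hypothetical and made up for illustration; they are not the study's data.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for a paired t-test.

    Null hypothesis: the mean of the pairwise differences is zero.
    Each index must refer to the same participant in both lists.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have the same length")
    if len(scores_a) < 2:
        raise ValueError("need at least two pairs")

    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")

    t = mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1


# Made-up frustration ratings for six hypothetical participants
# (NOT results from the experiment).
naa_ratings = [2, 3, 1, 2, 4, 2]
ps_ratings = [5, 6, 4, 4, 6, 5]
t_value, dof = paired_t_statistic(naa_ratings, ps_ratings)
print(f"t = {t_value:.3f}, degrees of freedom = {dof}")
```

The statistic is then compared against a t-distribution with n - 1 degrees of freedom to obtain a p-value. A paired design fits this experiment because each participant trained both agents.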

Figure 8: Comparison of the human experience.

In summary, compared to Policy Shaping, participants found the Newtonian Action Advice agent to be less frustrating, clearer and more immediate in terms of how the agent used human input, better able to complete the task as the person intended, and more intelligent.

In addition to creating a better human experience, the NAA agent also performed better than Policy Shaping in terms of objective RL metrics. The average training time was smaller for the NAA agent. The number of steps the agent took to complete each episode was smaller for NAA. The average reward was higher for NAA than Policy Shaping. However, the amount of input provided by the human teachers was statistically equal for the two interaction algorithms.

5 Conclusions

This paper presented Newtonian Action Advice, a method to integrate a human’s interactive action advice (e.g., “move left”) with reinforcement learning. For equivalent human input, Newtonian Action Advice performs better than Policy Shaping, both in terms of RL metrics like cumulative reward and human factors metrics like frustration.


This work was funded under ONR grant number N000141410003.


  • Argall et al. (2008) Argall, B. D., Browning, B., and Veloso, M. (2008). Learning robot motion control with demonstration and advice-operators. In Intelligent Robots and Systems, 2008. IROS 2008. IEEE/RSJ International Conference on, pages 399–404. IEEE.
  • Cederborg et al. (2015) Cederborg, T., Grover, I., Isbell, C. L., and Thomaz, A. L. (2015). Policy shaping with human teachers. In IJCAI, pages 3366–3372.
  • Dearden et al. (1998) Dearden, R., Friedman, N., and Russell, S. (1998). Bayesian q-learning. In AAAI/IAAI, pages 761–768.
  • Griffith et al. (2013) Griffith, S., Subramanian, K., Scholz, J., Isbell, C., and Thomaz, A. L. (2013). Policy shaping: Integrating human feedback with reinforcement learning. In Advances in Neural Information Processing Systems, pages 2625–2633.
  • Huggins-Daines et al. (2006) Huggins-Daines, D., Kumar, M., Chan, A., Black, A. W., Ravishankar, M., and Rudnicky, A. I. (2006). Pocketsphinx: A free, real-time continuous speech recognition system for hand-held devices. In Acoustics, Speech and Signal Processing, 2006. ICASSP 2006 Proceedings. 2006 IEEE International Conference on, volume 1, pages I–I. IEEE.
  • Isbell et al. (2001) Isbell, C., Shelton, C. R., Kearns, M., Singh, S., and Stone, P. (2001). A social reinforcement learning agent. In Proceedings of the fifth international conference on Autonomous agents, pages 377–384. ACM.
  • Joshi et al. (2012) Joshi, M., Khobragade, R., Sarda, S., Deshpande, U., and Mohan, S. (2012). Object-oriented representation and hierarchical reinforcement learning in infinite mario. In Tools with Artificial Intelligence (ICTAI), 2012 IEEE 24th International Conference on, volume 1, pages 1076–1081. IEEE.
  • Knox and Stone (2010) Knox, W. B. and Stone, P. (2010). Combining manual feedback with subsequent mdp reward signals for reinforcement learning. In Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems: Volume 1, pages 5–12. International Foundation for Autonomous Agents and Multiagent Systems.
  • Krening et al. (2017) Krening, S., Harrison, B., Feigh, K. M., Isbell, C. L., Riedl, M., and Thomaz, A. (2017). Learning from explanations using sentiment and advice in rl. IEEE Transactions on Cognitive and Developmental Systems, 9(1):44–55.
  • Kuhlmann et al. (2004) Kuhlmann, G., Stone, P., Mooney, R., and Shavlik, J. (2004). Guiding a reinforcement learner with natural language advice: Initial results in robocup soccer. In The AAAI-2004 workshop on supervisory control of learning and adaptive systems.
  • MacGlashan et al. (2015) MacGlashan, J., Babes-Vroman, M., desJardins, M., Littman, M., Muresan, S., Squire, S., Tellex, S., Arumugam, D., and Yang, L. (2015). Grounding english commands to reward functions. In Proceedings of Robotics: Science and Systems, Rome, Italy.
  • Maclin et al. (2005) Maclin, R., Shavlik, J., Torrey, L., Walker, T., and Wild, E. (2005). Giving advice about preferred actions to reinforcement learners via knowledge-based kernel regression. In Proceedings of the National Conference on Artificial intelligence, volume 20, page 819. Menlo Park, CA; Cambridge, MA; London; AAAI Press; MIT Press; 1999.
  • Manning et al. (2014) Manning, C. D., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S. J., and McClosky, D. (2014). The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60.
  • Meriçli et al. (2014) Meriçli, C., Klee, S. D., Paparian, J., and Veloso, M. (2014). An interactive approach for situated task specification through verbal instructions. In Proceedings of the 2014 international conference on Autonomous agents and multi-agent systems, pages 1069–1076. International Foundation for Autonomous Agents and Multiagent Systems.
  • Pang and Lee (2005) Pang, B. and Lee, L. (2005). Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 115–124. Association for Computational Linguistics.
  • Pang and Lee (2008) Pang, B. and Lee, L. (2008). Opinion mining and sentiment analysis. Foundations and trends in information retrieval, 2(1-2):1–135.
  • Sivamurugan and Ravindran (2012) Sivamurugan, M. S. and Ravindran, B. (2012). Instructing a reinforcement learner. In FLAIRS Conference.
  • Skinner (1938) Skinner, B. F. (1938). The behavior of organisms: An experimental analysis.
  • Socher et al. (2013) Socher, R., Perelygin, A., Wu, J. Y., Chuang, J., Manning, C. D., Ng, A. Y., and Potts, C. (2013). Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the conference on empirical methods in natural language processing (EMNLP), pages 1631–1642. Citeseer.
  • Sutton and Barto (1998) Sutton, R. S. and Barto, A. G. (1998). Reinforcement learning: An introduction, volume 1. MIT press Cambridge.
  • Tellex et al. (2011) Tellex, S., Kollar, T., Dickerson, S., Walter, M. R., Banerjee, A. G., Teller, S. J., and Roy, N. (2011). Understanding natural language commands for robotic navigation and mobile manipulation. In AAAI, volume 1, page 2.
  • Thomason et al. (2015) Thomason, J., Zhang, S., Mooney, R. J., and Stone, P. (2015). Learning to interpret natural language commands through human-robot dialog. In IJCAI, pages 1923–1929.
  • Thomaz and Breazeal (2008) Thomaz, A. L. and Breazeal, C. (2008). Teachable robots: Understanding human teaching behavior to build more effective robot learners. Artificial Intelligence, 172(6-7):716–737.