1 Introduction
A key goal of interactive learning research is to allow robots and other artificial agents to leverage human knowledge to learn useful tasks more efficiently. Such agents often require large amounts of data to learn tasks on their own, data which, while easy to generate in simulation, can be expensive to collect in real-world settings. Through interactive learning, an agent can rely on human knowledge in addition to its own, limited experience. The challenge is learning from modes of communication that are natural for users (whom we will refer to as teachers) who lack expertise in programming or AI. Here we consider two such modes: task demonstrations [Argall et al. 2009] and evaluative feedback [Knox, Stone, and Breazeal 2013]. As these types of data are expensive to collect, we need to extract as much information as possible from each interaction with a teacher. In this work, we develop an algorithm, Behavior Aware Modeling (BAM), which incorporates data provided by a teacher for several different tasks into a single model of the agent's environment, allowing the teacher's knowledge to be shared across all of the tasks the agent is learning.
A teacher’s understanding of their environment is implicit in the way they choose to perform different tasks, and BAM allows an agent to build this knowledge into its own model of the world. There are many settings where an agent must learn to operate in an environment that its human teachers would already know well. For example, a robot that transports materials around a hospital needs to learn the layout of the building, and how the people within it will behave (e.g. will someone pushing a gurney make way for the robot?). A smart home system will be more effective if it knows how the HVAC system affects different parts of the house, and where the occupants are likely to be at a given time. While an agent could acquire such information on its own, this might be expensive or even dangerous (e.g. the robot blocking a gurney to see what happens). This work shows how such knowledge can be acquired from human teachers, in combination with an agent’s own experience.
BAM treats the problem of learning multiple tasks as a set of Markov decision processes, each with the same transition dynamics, and captures the teacher's understanding of the environment with a model [Brafman and Tennenholtz 2002, Deisenroth and Rasmussen 2011] of these dynamics, while learning separate cost functions encoding the goals of each of the tasks. BAM can be seen as a generalization of Inverse Reinforcement Learning (IRL), which learns to perform tasks by identifying the cost functions defining those tasks [Abbeel and Ng 2004, Vroman 2014]. The key difference between BAM and IRL is that IRL relies solely on the agent's own understanding of the transition dynamics, while BAM also exploits the teacher's knowledge of these dynamics. BAM combines its own experience and prior knowledge of the dynamics with that provided by the teacher, and because its dynamics model is shared across tasks, the policy it learns for one task can incorporate data provided for another.

In addition to deriving a novel algorithm for learning tasks and dynamics from demonstrations and evaluative feedback, we also present empirical results comparing BAM against existing approaches to learning from such data. We evaluate BAM both with simulated teachers and with human teachers through a large-scale, web-based user study. Our results demonstrate that by explicitly capturing a teacher's understanding of the environment, BAM can significantly reduce the total effort required to teach an agent to perform a collection of tasks. Interactive learning is limited by the amount of information a human teacher can provide, and so by reducing the effort required of the teacher, BAM may allow interactive learning to be applied to more complex and useful tasks than would otherwise be feasible.
2 Related Work
This work considers the case where a human teacher demonstrates a set of tasks, and then provides evaluative feedback while the agent attempts to perform these tasks itself. As there is no observable cost function, standard reinforcement learning methods cannot be directly applied. For learning from demonstrations, the simplest approach would be behavioral cloning [Bain and Sammut 1995, Pomerleau 1989], where a supervised learning algorithm is used to find a mapping from states to actions based on the teacher's actions. Behavioral cloning, however, often struggles to find robust policies for sequential tasks [Ross, Gordon, and Bagnell 2011, Atkeson and Schaal 1997]. By using knowledge of the dynamics of the environment, and finding cost functions describing the tasks being taught, inverse reinforcement learning can produce policies which are robust over longer time horizons, and which generalize to states not encountered during training [Abbeel and Ng 2004, Ng and Russell 2000].

The dynamics knowledge on which IRL depends is sometimes provided in the form of explicit state transition probabilities [Ramachandran 2007, Syed, Bowling, and Schapire 2008], but in more realistic settings the agent must acquire this information through its own interaction with the environment [Abbeel and Ng 2004, Bloem and Bambos 2014, Boularias, Kober, and Peters 2011]. In contrast to existing IRL approaches, the algorithm described here also learns about the dynamics from the teacher's demonstrations and feedback. BAM interprets teacher actions in much the same way as many existing IRL algorithms [Ramachandran 2007, Neu and Szepesvári 2007, Vroman 2014], assuming that each action is sampled from a Boltzmann distribution over the expected return values, or Q-values, of the possible actions in the current state. BAM is most closely related to Maximum-Likelihood IRL (ML-IRL) [Vroman 2014], in which these Q-values are assumed to be computed through a differentiable, soft value iteration process, and costs are found via gradient ascent on the log-likelihood of the data.

We note that other recent work has taken a similar approach to ours, learning both cost functions and dynamics parameters from human demonstrations via a maximum-likelihood IRL algorithm [Herman et al. 2016]. In contrast to that work, however, this work demonstrates that incorporating teacher knowledge into an agent's dynamics model is beneficial in the context of real-time, interactive learning. We also consider the problem of learning multiple tasks (with multiple cost functions) simultaneously. Finally, our algorithm can incorporate positive and negative feedback from the teacher, relying on a variation of the Bayesian interpretation of feedback developed in [Loftin et al. 2016].
3 Algorithms
The behavior-aware modeling algorithm assumes that information coming from the teacher, both demonstrations of the target tasks and evaluative feedback given in response to the agent's behavior, depends on some state-action value function Q known only to the teacher. This function itself depends on the cost function which the teacher associates with the current task, as well as the teacher's internal model of the dynamics of the environment (which we assume is equivalent to the true dynamics). We then assume that the Q-function for a task is computed via a finite number of steps of the same soft value iteration used in ML-IRL, defined by

(1)  Q^{t+1}(s, a) = -C(s, a) + γ Σ_{s'} P(s'|s, a) V^t(s')

(2)  V^t(s) = Σ_a (e^{β Q^t(s, a)} / Z_s) Q^t(s, a)

where Z_s = Σ_{a'} e^{β Q^t(s, a')} is a normalization term. This soft value iteration has the advantage of being differentiable, and accounts for the teacher's potentially suboptimal behavior. It also allows the teacher's actions and feedback to depend on states that they would not under optimal planning.
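As a concrete reference, the soft value iteration above can be sketched in NumPy; the array shapes, function name, and default parameter values here are our own choices, not the paper's:

```python
import numpy as np

def soft_value_iteration(P, C, beta=1.0, gamma=0.95, n_steps=50):
    """Soft value iteration over a tabular MDP (sketch).

    P: (S, A, S) transition probabilities, C: (S, A) costs.
    Returns the (S, A) soft Q-values after n_steps iterations.
    """
    S, A, _ = P.shape
    V = np.zeros(S)
    for _ in range(n_steps):
        # Q(s, a) = -C(s, a) + gamma * sum_{s'} P(s'|s, a) V(s')
        Q = -C + gamma * (P @ V)
        # Boltzmann weights over actions (shifted for numerical stability)
        logits = beta * Q
        logits -= logits.max(axis=1, keepdims=True)
        pi = np.exp(logits)
        pi /= pi.sum(axis=1, keepdims=True)
        # V(s) = sum_a pi(a|s) Q(s, a): expected value under the soft policy
        V = (pi * Q).sum(axis=1)
    return Q
```

Because every step is smooth in the costs and transition probabilities, the resulting Q-values can be differentiated with respect to the cost and dynamics parameters, which is what makes gradient-based estimation possible.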
The BAM algorithm estimates the transition dynamics and the cost function for each task by maximizing their probability given the data provided by the teacher (feedback and demonstrated actions), as well as the agent's direct observations of state transitions. We divide the training data provided to BAM into a set D containing the teacher's actions and feedback, and a set D_T containing the state transitions observed when either the teacher or the agent takes an action. We further divide D into sets D_i for each task i being taught. D_T consists of state transitions (s, a, s'), while each set D_i consists of state-action pairs (s, a) and feedback events (s, a, f). Similar to Bayesian IRL [Ramachandran 2007], BAM assumes that a teacher samples actions from a Boltzmann distribution such that p(a|s) ∝ e^{β Q(s, a)}.

To incorporate positive and negative feedback, we employ a version of the SABL feedback model developed in [Loftin et al. 2016]. This version, which we refer to as Advantage-SABL (A-SABL), defines the probability of receiving a positive or negative feedback signal from the teacher in terms of the advantage of the most recent action. Specifically, we define the advantage of action a in state s as A(s, a) = Q(s, a) - (1/|A|) Σ_{a'} Q(s, a') (the advantage under a random policy). The probabilities of receiving positive feedback f+ or negative feedback f- for a are then

(3)  p(f+ | s, a) = μ+ [(1 - ε) S(s, a) + ε (1 - S(s, a))]

(4)  p(f- | s, a) = μ- [(1 - ε)(1 - S(s, a)) + ε S(s, a)]

where S(s, a) = 1 / (1 + e^{-σ A(s, a)}). The probability of receiving no feedback is simply 1 - p(f+ | s, a) - p(f- | s, a). μ+ and μ- are tunable parameters that define the probability of receiving explicit feedback given that the teacher interprets an action as correct or incorrect, while ε is the teacher's error rate, and σ is a scale factor.
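To make the feedback model concrete, here is a minimal sketch. The logistic link between the advantage and the probability that the teacher judges an action correct is our reading of the model, and all parameter defaults are illustrative:

```python
import numpy as np

def asabl_feedback_probs(Q, s, a, mu_plus=0.5, mu_minus=0.5, eps=0.1, sigma=1.0):
    """A-SABL-style feedback likelihood (sketch with an assumed sigmoid link).

    Q: (S, A) action values. Returns (p_positive, p_negative, p_no_feedback).
    """
    # Advantage relative to a random policy: A(s, a) = Q(s, a) - mean_a' Q(s, a')
    adv = Q[s, a] - Q[s].mean()
    # Assumed probability that the teacher judges the action correct
    p_correct = 1.0 / (1.0 + np.exp(-sigma * adv))
    # Explicit feedback arrives with prob mu+/mu-, corrupted by error rate eps
    p_pos = mu_plus * ((1.0 - eps) * p_correct + eps * (1.0 - p_correct))
    p_neg = mu_minus * ((1.0 - eps) * (1.0 - p_correct) + eps * p_correct)
    return p_pos, p_neg, 1.0 - p_pos - p_neg
```

Note that when the advantage is zero the model is indifferent, assigning equal probability to positive and negative feedback.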
3.1 Behavior Aware Modeling
BAM works with a parametric space of dynamics models and cost functions, and computes maximum-likelihood estimates of the parameters θ of the dynamics model P_θ, as well as the parameters ω_i of the cost functions for each of the tasks being taught. Both the dynamics parameters θ and the parameters of each cost function (which we write compactly as ω) are learned via gradient ascent on their log-probability; that is, BAM maximizes the objective function

(5)  L(θ, ω) = Σ_i Σ_{(s,a) ∈ D_i} ln p(a | s; Q_{θ,ω_i}) + Σ_i Σ_{(s,a,f) ∈ D_i} ln p(f | s, a; Q_{θ,ω_i}) + Σ_{(s,a,s') ∈ D_T} ln P_θ(s' | s, a) + R(θ) + Σ_i R(ω_i)

where Q_{θ,ω} is the Q-function computed under the model P_θ, for the task defined by ω, and where R(θ) and R(ω) are regularization terms. The most computationally difficult part of the optimization is the gradient of ln p(a|s) w.r.t. θ and ω. The gradient w.r.t. θ,

(6)  ∇_θ ln p(a|s) = β ( ∇_θ Q(s, a) - Σ_{a'} π(a'|s) ∇_θ Q(s, a') )

where π(a|s) = e^{β Q(s, a)} / Z_s, depends on π and on the Jacobian of the Q-values for state s w.r.t. θ. The gradient for ω takes the same form. We assume that Q = Q^n, where Q^t is the t-th step of a soft value iteration process. For state s and action a we then have

(7)  ∇_θ Q^{t+1}(s, a) = γ Σ_{s'} [ ∇_θ P_θ(s'|s, a) V^t(s') + P_θ(s'|s, a) ∇_θ V^t(s') ]

(8)  ∇_ω Q^{t+1}(s, a) = -∇_ω C_ω(s, a) + γ Σ_{s'} P_θ(s'|s, a) ∇_ω V^t(s')

and for both θ and ω we have

(9)  ∇ V^t(s) = Σ_a π^t(a|s) [ 1 + β Q^t(s, a) - β V^t(s) ] ∇ Q^t(s, a)

where π^t(a|s) = e^{β Q^t(s, a)} / Z_s, that is, π^t(·|s) is the Boltzmann action distribution for state s. In our implementation, we first accumulate the likelihood terms for all demonstrated actions and feedback before computing this gradient. Rather than maximizing Equation 5 for θ and the ω_i simultaneously, we have found empirically that alternating between optimizing θ and optimizing ω is more efficient and reliable in finding good estimates of the dynamics and cost functions.
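The alternating optimization schedule described above can be sketched as follows. For brevity, this sketch substitutes finite-difference gradients for the analytic value-iteration gradient recursions; all function and parameter names are ours:

```python
import numpy as np

def numeric_grad(f, x, h=1e-5):
    """Central finite-difference gradient (stand-in for the analytic gradients)."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2.0 * h)
    return g

def alternating_ascent(loglik, theta, omegas, lr=0.1, outer_iters=20, inner_iters=5):
    """Alternate between the dynamics parameters theta and the per-task cost
    parameters omega_i (sketch of an alternating optimization schedule).

    loglik(theta, omegas) -> scalar objective to maximize.
    """
    for _ in range(outer_iters):
        for _ in range(inner_iters):  # update theta with the costs held fixed
            theta = theta + lr * numeric_grad(lambda t: loglik(t, omegas), theta)
        for i in range(len(omegas)):  # update each omega_i with theta held fixed
            for _ in range(inner_iters):
                omegas[i] = omegas[i] + lr * numeric_grad(
                    lambda w: loglik(theta, omegas[:i] + [w] + omegas[i + 1:]),
                    omegas[i])
    return theta, omegas
```

In practice the inner objective would be the log-likelihood of Equation 5; the coordinate-ascent structure is the point of the sketch.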
4 Simulated Teacher Experiments
To understand how learning dynamics can reduce the effort needed to teach a set of behaviors, we compare BAM against two other approaches to interactive learning, using data generated by a simulated teacher. We consider three classes of learning problems, which we refer to as domains, with discrete state and action spaces. For each domain, we define multiple environments, where each environment is defined by its specific transition probabilities, which are initially unknown to the learning agents. Within each environment, we define one or more tasks, each defined by a different cost function, which is also initially unknown. In our experiments, an individual agent attempts to learn all the tasks within a single environment. Each environment also defines a space of possible dynamics models and cost functions from which the agents must choose. In all three domains, the true cost functions were zero everywhere except at the goal states.
The first domain, which we refer to as navigation (see Figures 1a and 1b), is a grid world in which any grid cell may be blocked by an obstacle. Each task in this domain is defined by a goal location, while the dynamics are defined by the set of cells that are blocked by obstacles. The space of cost functions has one parameter for each cell in the grid, potentially allowing obstacles to be represented as high-cost cells. The dynamics model also has one parameter per cell, and a transition into a cell fails with probability determined by that cell's parameter.
The two other domains are grid worlds in which each grid cell has an additional feature that affects the transition dynamics, such that multiple states correspond to a single cell. In the farming domain (see Figures 1c and 1d), the agent may carry one of three farm implements (a plow, a sprinkler, or a harvester), and the implement it carries determines which cells it is able to enter. Cells representing dirt fields require the plow, cells with immature crops require the sprinkler, and cells with fully grown crops require the harvester. There are also cells in which the agent can pick up each implement. Tasks are defined by groups of cells that the agent must reach. The space of dynamics models is a space of mappings from each implement to the probability that it works on a certain type of cell, that is, a mapping from the three implements to distributions over the three cell types.
In the gravity domain there are four possible gravity directions, and the agent cannot move in the direction opposite to the current gravity. A task is defined by a goal cell, and certain cells allow the agent to change the gravity direction so that it can reach the goal. Each of these cells has a color which determines how it changes the direction of gravity, and the space of dynamics models is a space of mappings from colors to distributions over the gravity directions. The cost function spaces for the farming and gravity domains allow an independent cost for every grid cell (but not for every state).
4.1 Alternative Algorithms
We wish to determine whether incorporating teacher knowledge into the agent's dynamics model allows for more efficient learning than using a model built solely from directly observed state transitions. As the BAM algorithm can be viewed as a generalization of maximum-likelihood IRL to the problem of learning transition dynamics, we compare BAM against a version of ML-IRL that uses a dynamics model which does not incorporate information from the teacher. This model-based IRL algorithm first finds a maximum-likelihood model of the transition dynamics based only on the observed transitions in D_T, and then uses ML-IRL to find the cost function for each task. Model-based IRL selects its dynamics model from the same space of models as BAM does, and selects its actions greedily.
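For a tabular environment, the baseline's dynamics estimate amounts to (smoothed) transition counts; a sketch, where the Laplace smoothing constant is our choice:

```python
import numpy as np

def ml_dynamics(transitions, n_states, n_actions, alpha=1.0):
    """Maximum-likelihood tabular dynamics from observed (s, a, s') triples,
    with Laplace smoothing so unvisited pairs stay well-defined (sketch)."""
    counts = np.full((n_states, n_actions, n_states), alpha)
    for s, a, s_next in transitions:
        counts[s, a, s_next] += 1.0
    # Normalize each (s, a) row into a distribution over next states
    return counts / counts.sum(axis=2, keepdims=True)
```

BAM differs in that its dynamics estimate is additionally shaped by the likelihood of the teacher's actions and feedback, not only by these counts.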
Both BAM and ML-IRL generalize to states for which they have received no teacher data by learning cost functions that describe the tasks being taught. As this approach may be ineffective, or even counterproductive, in some cases, we also compare BAM against a behavioral cloning algorithm that uses a tabular representation and therefore does not generalize between states. To incorporate both feedback and demonstrations, our behavioral cloning algorithm finds a table of values for each state-action pair, rather than a policy, and selects its actions greedily (as with BAM and model-based IRL). These values can be interpreted simply as the log-probabilities of each action, with the greedy policy selecting the most probable action.
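A tabular cloning baseline of this kind can be sketched as follows; the particular score update (demonstration counts plus signed feedback) and its weighting are our illustrative choices:

```python
import numpy as np
from collections import defaultdict

def cloning_scores(demos, feedback, n_actions, fb_weight=1.0):
    """Per state-action scores from demonstrations and +/-1 feedback (sketch)."""
    scores = defaultdict(lambda: np.zeros(n_actions))
    for s, a in demos:        # each demonstrated action votes for itself
        scores[s][a] += 1.0
    for s, a, f in feedback:  # f is +1 (reward) or -1 (punishment)
        scores[s][a] += fb_weight * f
    return scores

def greedy_action(scores, s, n_actions, rng=None):
    """Greedy w.r.t. the table; unseen states fall back to a random action."""
    rng = rng or np.random.default_rng(0)
    if s not in scores:
        return int(rng.integers(n_actions))
    return int(np.argmax(scores[s]))
```

The random fallback for unseen states matters below: it explains much of the cloning agent's behavior in the navigation experiments.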
4.2 Learning from Demonstration
In our first set of experiments, the simulated teacher only provided demonstrations, and the agent did no exploration on its own. In each round, a single demonstration of the optimal policy was given for each task in the current environment, after which the agent updated its policies for each of these tasks to incorporate the new demonstrations. To best reflect the behavior of real teachers, each demonstration was terminated after the goal state was reached, with a final action allowing the agent to observe the teacher remaining at the goal. We evaluated the set of policies an agent had learned at a given point in time by their total return, that is, the sum of the expected returns of the policies for each task. The expected return of each policy was estimated by running 50 simulated episodes under that policy (the agent did not observe these episodes).
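This evaluation procedure amounts to Monte-Carlo rollouts; a sketch for a tabular MDP with a deterministic policy, where the cost-based return convention, horizon, and discount are our assumptions:

```python
import numpy as np

def estimate_return(P, C, policy, start_dist, gamma=0.95,
                    n_episodes=50, horizon=100, seed=0):
    """Average discounted return of a deterministic policy over simulated
    episodes (sketch). P: (S, A, S), C: (S, A), policy: (S,) action indices."""
    rng = np.random.default_rng(seed)
    n_states = P.shape[0]
    total = 0.0
    for _ in range(n_episodes):
        s = rng.choice(n_states, p=start_dist)
        ret, discount = 0.0, 1.0
        for _ in range(horizon):
            a = policy[s]
            ret += discount * -C[s, a]  # return = discounted sum of negative costs
            discount *= gamma
            s = rng.choice(n_states, p=P[s, a])
        total += ret
    return total / n_episodes
```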
Figure 2 shows the total return of the policies learned by each of the algorithms, as a percentage of the total return of the optimal policies, plotted against the number of rounds of demonstrations. We can see that BAM substantially outperforms both other algorithms in the farming environments, while outperforming behavioral cloning in the gravity environments and model-based IRL in the Doorway environment. Furthermore, BAM dominates the other algorithms in that it always performs at least as well as the strongest alternative, while model-based IRL and behavioral cloning perform inconsistently.
The apparent advantage of the BAM algorithm over model-based IRL in the Doorway environment (see Figure 1b) is of particular interest. The initial state in this environment is chosen randomly from the bottom three rows of the grid, such that the agent must go through the doorway to reach any of the goals. While the walls are never encountered during the optimal demonstrations, BAM is able to identify the location of the doorway and share this information across all four tasks. This example suggests that inferring a teacher's dynamics model may be an effective alternative to extracting intermediate policies, commonly represented as options [Sutton, Precup, and Singh 1999] in RL, from human demonstrations. Rather than learning a policy for going through the doorway, BAM can find a dynamics model which leads the agent to use the doorway, thus capturing the same behavior. Model-based IRL, however, must encode the doorway within its task-specific cost functions, and so must learn this behavior separately for each task.
The BAM algorithm also has an advantage in the farming environments. In both of these environments, the agent must move away from the target field to retrieve the plow. While the agent never directly observes the fact that the other implements do not work on the target field, it can infer this from the fact that the teacher went out of their way to reach the correct implement. Finally, we attribute the relatively strong performance of behavioral cloning in the navigation environments to the fact that the cloning agent follows a random policy until it reaches a state it has observed before (which does not take very long), after which it knows the optimal policy all the way to the goal. Even so, BAM still performs at least as well as behavioral cloning in these environments.
4.3 Global Cost Functions
One of the main advantages of the BAM algorithm in terms of reducing teacher effort is that it is able to share a teacher's knowledge across multiple tasks through the dynamics model. An alternative (and potentially simpler) approach to sharing such information would be to learn a global cost function in addition to the task-specific cost functions, such that the policies learned for each task would be optimal for the sum of the global and task costs. The global cost function might encode much of the same teacher knowledge that BAM captures with its dynamics model. For example, in the navigation domain, unobserved obstacles could be represented as states with a high global cost, with goals captured as low-cost states in the task-specific cost functions.
To understand the benefits of learning a global dynamics model, we compared a version of model-based IRL that learns a global cost function against both BAM and model-based IRL without global costs. Figure 3 shows the results of these comparisons, using the same training protocol and environments as the experiments in Section 4.2. In most cases the use of a global cost function offers little advantage over the model-based IRL algorithm: the total return for the global-cost algorithm is less than or equal to that of model-based IRL. In fact, using a global cost function actually hurts performance in the gravity domain. This reflects the fact that a global cost function can capture only immediate costs, not the outcomes of unseen actions. The fact that BAM dominates the global-cost algorithm in the gravity and farming domains can be explained by noting that the space of global cost functions is much less constrained (and has many more parameters) than the space of dynamics models, such that the agent requires less data to learn a good dynamics model than to learn a good global cost function. The simpler space of dynamics models is possible because we have domain-specific prior knowledge of the possible dynamics, knowledge which would be difficult to incorporate into a space of global cost functions.
4.4 Demonstrations and Feedback
As we are interested in the case where the teacher uses a combination of demonstrations and evaluative feedback, we also conducted a set of experiments in which, in each round, the agent first observed a set of demonstrations of each task, and then attempted each task on its own while the simulated teacher provided feedback. These experiments (Figure 4) show a major improvement for model-based IRL in most domains, though BAM still has an advantage in the navigation environments. The improvements for model-based IRL in the farming and gravity domains largely reflect the fact that the agent can now explore the environment under its own policies, and can therefore directly observe transitions that would not have been seen during the teacher's optimal demonstrations. This exploration is less valuable in the navigation domain, however, as the space of dynamics models is much more complex; inferring the dynamics from the teacher's behavior is still more efficient in this domain than requiring that each obstacle be observed directly. As learning from feedback alone (with no demonstrations) proved to be very inefficient for all three algorithms in our domains, we did not evaluate BAM when learning from feedback alone.
5 Human-Subjects Experiments
Total Sessions  Successful  Episodes Required  Actions Required

Two Rooms  50%  80%  50%  80%  50%  80%
BAM  27  24  14  7.0  9.9  164.2  215.9
Model-Based IRL  27  25  *6  7.4  12.8  168.1  258.5
Behavioral Cloning  27  17  *1  8.4  20.0  207.3  461.0
Doorway  50%  80%  50%  80%  50%  80%
BAM  27  19  5  9.7  14.7  167.4  217.6
Model-Based IRL  27  18  4  *12.9  *26.6  215.4  *441.2
Behavioral Cloning  27  12  4  *13.7  *27.7  271.0  *517.7
Two Fields  50%  80%  50%  80%  50%  80%
BAM  31  27  18  2.8  3.6  61.4  87.6
Model-Based IRL  31  27  12  *4.4  *6.5  *103.1  131.0
Behavioral Cloning  31  *10  *0  *8.4  N/A  *272.1  N/A
Three Fields  50%  80%  50%  80%  50%  80%
BAM  31  24  20  1.3  1.6  26.9  29.8
Model-Based IRL  31  22  12  *1.8  *2.4  *38.3  *53.0
Behavioral Cloning  31  17  *2  *2.7  7.5  *62.8  *201.5
Table 1: A (*) indicates that the value is significantly different (p < 0.05, t-test or Fisher's exact test) from the value for BAM, while (**) indicates that the result remains significant under a Benjamini-Yekutieli [Benjamini and Yekutieli 2001] correction for multiple comparisons (with a false discovery rate of 20%).

As the purpose of this work is to allow non-expert humans to teach agents with less effort than is possible with existing approaches, we need to determine whether the BAM algorithm actually reduces this effort for real human teachers. We therefore compared BAM against the model-based IRL and behavioral cloning algorithms from Section 4.1 in a web-based user study in which participants each trained several learning agents in the four environments shown in Figure 1. Participants were recruited through the Amazon Mechanical Turk platform, and were paid $3.00 US for completing the experiment. Each participant was randomly assigned to either the navigation or farming domain, and went through a brief tutorial showing them how to use the training interface in that domain. Participants assigned to the navigation domain first taught three agents in the Two Rooms environment, and then taught three agents in the Doorway environment. Those assigned to the farming domain first taught agents in the Two Fields environment, and then in the Three Fields environment. Across both domains we had a total of 58 unique participants who completed the study, out of 129 (possibly non-unique) participants who began the experiment. These participants spent an average of 33 minutes (standard deviation 17 minutes) working on the study, including the tutorial.
5.1 Learning Sessions
Participants taught one agent at a time, and could not return to an agent after moving on to the next one. The interface shown to the participants allowed them to take control of the agent and demonstrate a task, or to ask the agent to try to perform the task itself. During demonstrations, the participant controlled the agent using the arrow keys to move up, down, left, or right. While the agent was acting on its own, participants had the option of providing positive and negative feedback, also using the keyboard. Participants were prompted to reset the environment to a random initial state after a task was completed, and had the option of moving the agent to desired start locations themselves. The participant could switch between tasks at any time, and the agent would associate any demonstrated action or feedback signal with the currently selected task. Participants were required to provide at least one demonstrated action for each task before being allowed to move on to teaching the next agent.
We compared the BAM algorithm against both behavioral cloning and model-based IRL. Participants were asked to train three agents in both of the environments they encountered, with each agent being controlled by a different learning algorithm. The order in which these algorithms were presented was randomized for each participant, and each agent was rendered in a different color to highlight the fact that it did not know what the previous agents had learned. The learning agents themselves were configured identically to those used in the simulation experiments (see Section 4). Learning updates occurred only after the participant ended a demonstration or an evaluation episode. Because the teacher had no way to demonstrate remaining in the same location, at the end of a demonstration a synthetic no-op action was shown to the agent to allow it to identify goal states.
5.2 Results
Table 1 shows the number of learning sessions in which the total return of the agent's policies reached 50% and 80% of the optimal total return, as well as the average number of episodes (including demonstrations and agent-controlled episodes) and individual actions (both teacher and agent actions) required for the agent to reach these thresholds. The total return was estimated by running each of the agent's learned policies for 1000 simulated episodes. We can see in Table 1 that BAM dominates the alternative algorithms in almost every case, save for the 50% threshold in the Two Rooms environment. For the 80% threshold in particular, BAM reduces the number of actions and episodes required across all four environments, and is more likely to reach 80% of the optimum. Multiway ANOVAs (with algorithm, environment, and session order as factors) show that BAM's advantages over model-based IRL and behavioral cloning in terms of the number of episodes and actions needed to reach the 50% and 80% thresholds are statistically significant. Fisher's exact tests (for all environments combined) show that BAM's superior success rates versus the alternatives at the 80% threshold are also significant.
More specifically, in the Two Rooms environment (rows 3 to 5), Fisher's exact test shows that the difference between the number of sessions using BAM (14 sessions) and model-based IRL (6 sessions) that reached the 80% threshold is significant. In the Doorway environment (rows 7 to 9), t-tests also show that BAM requires significantly fewer actions and episodes than the alternatives to reach the thresholds. To address multiplicity, we also perform a Benjamini-Yekutieli [Benjamini and Yekutieli 2001] correction on all of the environment-specific comparisons. Though some results are no longer significant under this correction, BAM's advantages over model-based IRL in terms of the number of episodes required at the 80% threshold (column 6), and in terms of the number of actions required to reach the 50% threshold (column 7), remain significant in both the Two Fields and Three Fields environments. These results demonstrate that BAM is more efficient and more reliable than model-based IRL or behavioral cloning when learning from human teachers, meaning that an agent using BAM can learn more complex sets of behaviors than the alternatives with no additional effort on the part of the teacher.
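For reference, this style of analysis can be reproduced with standard tools. The sketch below applies SciPy's Fisher exact test to the Two Rooms 80% success counts from Table 1, and implements the Benjamini-Yekutieli step-up procedure directly; the p-values passed to the correction in the usage example are illustrative only:

```python
import numpy as np
from scipy.stats import fisher_exact

def benjamini_yekutieli(pvals, q=0.2):
    """Benjamini-Yekutieli step-up procedure: returns a boolean mask of
    rejected hypotheses, controlling FDR at level q under arbitrary dependence."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    c_m = np.sum(1.0 / np.arange(1, m + 1))       # BY dependence penalty
    thresh = q * np.arange(1, m + 1) / (m * c_m)  # step-up thresholds
    passed = p[order] <= thresh
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = int(np.max(np.nonzero(passed)[0]))    # largest index that passes
        reject[order[:k + 1]] = True
    return reject

# 2x2 contingency table: sessions reaching 80% vs. not, BAM vs. model-based IRL
_, pval = fisher_exact([[14, 27 - 14], [6, 27 - 6]])
```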
6 Conclusions
In this work, we have presented a novel approach to interactive learning that takes full advantage of a teacher's understanding of the learning agent's environment. We have shown that BAM dominates existing approaches (which ignore the teacher's dynamics knowledge) across many different domains and measures of teacher effort. Future work will focus on scaling BAM to more complex problems. This includes replacing exact value iteration with sparse search in BAM's planning model, along the lines of [MacGlashan and Littman 2015]. In many domains, however, building a sufficiently accurate one-step model may be difficult, and so we are also interested in finding compact, high-level dynamics models (e.g. the value iteration networks developed in [Tamar et al. 2016]) that capture the teacher's knowledge of the environment. Finally, we are particularly interested in implementing BAM on physical robots, a key application area for interactive learning. Overall, we are extremely encouraged by BAM's performance in reducing teacher workload both in simulation and when learning from real humans, and our results provide a strong foundation for future work applying this approach to real-world domains.
References

[Abbeel and Ng 2004] Abbeel, P., and Ng, A. Y. 2004. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning. ACM.
[Argall et al. 2009] Argall, B. D.; Chernova, S.; Veloso, M.; and Browning, B. 2009. A survey of robot learning from demonstration. Robotics and Autonomous Systems 57(5):469–483.
[Atkeson and Schaal 1997] Atkeson, C. G., and Schaal, S. 1997. Robot learning from demonstration. In Proceedings of the Fourteenth International Conference on Machine Learning, volume 97, 12–20.
[Bain and Sammut 1995] Bain, M., and Sammut, C. 1995. A framework for behavioural cloning. In Machine Intelligence 15.
[Benjamini and Yekutieli 2001] Benjamini, Y., and Yekutieli, D. 2001. The control of the false discovery rate in multiple testing under dependency. Annals of Statistics 1165–1188.
[Bloem and Bambos 2014] Bloem, M., and Bambos, N. 2014. Infinite time horizon maximum causal entropy inverse reinforcement learning. In Decision and Control (CDC), 2014 IEEE 53rd Annual Conference on, 4911–4916. IEEE.
[Boularias, Kober, and Peters 2011] Boularias, A.; Kober, J.; and Peters, J. 2011. Relative entropy inverse reinforcement learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 182–189.
[Brafman and Tennenholtz 2002] Brafman, R. I., and Tennenholtz, M. 2002. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research 3(Oct):213–231.
[Deisenroth and Rasmussen 2011] Deisenroth, M., and Rasmussen, C. E. 2011. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), 465–472.
[Herman et al. 2016] Herman, M.; Gindele, T.; Wagner, J.; Schmitt, F.; and Burgard, W. 2016. Inverse reinforcement learning with simultaneous estimation of rewards and dynamics. In Artificial Intelligence and Statistics, 102–110.
[Knox, Stone, and Breazeal 2013] Knox, B.; Stone, P.; and Breazeal, C. 2013. Training a robot via human feedback: A case study. In Social Robotics, volume 8239 of Lecture Notes in Computer Science, 460–470.
[Loftin et al. 2016] Loftin, R.; Peng, B.; MacGlashan, J.; Littman, M. L.; Taylor, M. E.; Huang, J.; and Roberts, D. L. 2016. Learning behaviors via human-delivered discrete feedback: modeling implicit feedback strategies to speed up learning. Autonomous Agents and Multi-Agent Systems 30(1):30–59.
[MacGlashan and Littman 2015] MacGlashan, J., and Littman, M. L. 2015. Between imitation and intention learning. In IJCAI, 3692–3698.
[Neu and Szepesvári 2007] Neu, G., and Szepesvári, C. 2007. Apprenticeship learning using inverse reinforcement learning and gradient methods. In Proceedings of the Twenty-Third Conference on Uncertainty in Artificial Intelligence, 295–302. AUAI Press.
[Ng and Russell 2000] Ng, A. Y., and Russell, S. J. 2000. Algorithms for inverse reinforcement learning. In Proceedings of the Seventeenth International Conference on Machine Learning, 663–670.
[Pomerleau 1989] Pomerleau, D. A. 1989. ALVINN: An autonomous land vehicle in a neural network. In Advances in Neural Information Processing Systems, 305–313.
[Ramachandran 2007] Ramachandran, D. 2007. Bayesian inverse reinforcement learning. In Proceedings of the Twentieth International Joint Conference on Artificial Intelligence.
[Ross, Gordon, and Bagnell 2011] Ross, S.; Gordon, G.; and Bagnell, D. 2011. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 627–635.
[Sutton, Precup, and Singh 1999] Sutton, R. S.; Precup, D.; and Singh, S. 1999. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112(1-2):181–211.
[Syed, Bowling, and Schapire 2008] Syed, U.; Bowling, M.; and Schapire, R. E. 2008. Apprenticeship learning using linear programming. In Proceedings of the 25th International Conference on Machine Learning, 1032–1039. ACM.
[Tamar et al. 2016] Tamar, A.; Wu, Y.; Thomas, G.; Levine, S.; and Abbeel, P. 2016. Value iteration networks. In Advances in Neural Information Processing Systems, 2154–2162.
[Vroman 2014] Vroman, M. C. 2014. Maximum likelihood inverse reinforcement learning. Ph.D. Dissertation, Rutgers, The State University of New Jersey, New Brunswick.
Appendix A User Interface for Human-Subjects Experiments
Appendix B Experimental Environments
Appendix C Simulation Results
Appendix D Statistical Analysis of Human-Subjects Experiments
Performance Measure  Threshold  Model-Based IRL  Behavioral Cloning

Success Rate  50%  0.4276  3.926e-10
  80%  1.298e-4  2.2e-16
Episodes Required  50%  6.47e-4  1.25e-12
  80%  9.38e-06  2e-16
Actions Required  50%  0.00376  1.20e-10
  80%  1.92e-4  1.34e-14
Environment  Performance Measure  Threshold  Model-Based IRL  Behavioral Cloning

Two Rooms  Success Rate  50%  1  5.367713e-02
  80%  4.727039e-02  1.291742e-04
  Episodes Required  50%  0.7058005  0.2355695
  80%  0.11362855  N/A
  Actions Required  50%  0.9153665  0.1368809
  80%  0.3346513  N/A
Doorway  Success Rate  50%  1  9.777568e-02
  80%  1  1
  Episodes Required  50%  4.159614e-02  7.416946e-03
  80%  3.728851e-02  4.383471e-02
  Actions Required  50%  0.1120207  5.307102e-02
  80%  4.483479e-02  4.535574e-02
Two Fields  Success Rate  50%  1  4.309968e-06
  80%  0.2035422  2.230871e-07
  Episodes Required  50%  1.252016e-04  9.592095e-05
  80%  1.197617e-04  N/A
  Actions Required  50%  4.849809e-03  1.767247e-03
  80%  7.011737e-02  N/A
Three Fields  Success Rate  50%  1  6.200178e-02
  80%  4.130673e-02  2.376083e-06
  Episodes Required  50%  1.985265e-02  5.721056e-04
  80%  1.475193e-03  0.1536753
  Actions Required  50%  4.085855e-02  2.319712e-03
  80%  4.712152e-03  2.819027e-02