Agent-Agnostic Human-in-the-Loop Reinforcement Learning

01/15/2017 ∙ by David Abel, et al. ∙ Brown University

Providing Reinforcement Learning agents with expert advice can dramatically improve various aspects of learning. Prior work has developed teaching protocols that enable agents to learn efficiently in complex environments; many of these methods tailor the teacher's guidance to agents with a particular representation or underlying learning scheme, offering effective but specialized teaching procedures. In this work, we explore protocol programs, an agent-agnostic schema for Human-in-the-Loop Reinforcement Learning. Our goal is to incorporate the beneficial properties of a human teacher into Reinforcement Learning without making strong assumptions about the inner workings of the agent. We show how to represent existing approaches such as action pruning, reward shaping, and training in simulation as special cases of our schema and conduct preliminary experiments on simple domains.




1 Introduction

A central goal of Reinforcement Learning (RL) is to design agents that learn in a fully autonomous way. An engineer designs a reward function, input/output channels, and a learning algorithm. Then, apart from debugging, the engineer need not intervene during the actual learning process. Yet fully autonomous learning is often infeasible due to the complexity of real-world problems, the difficulty of specifying reward functions, and the presence of potentially dangerous outcomes that constrain exploration.

Consider a robot learning to perform household chores. Human engineers create a curriculum, moving the agent between simulation, practice environments, and real house environments. Over time, they may tweak reward functions, heuristics, sensors, and state or action representations. They may intervene directly in real-world training to prevent the robot damaging itself, destroying valuable goods, or harming people it interacts with.

In this example, humans do not just design the learning agent: they are also in the loop of the agent’s learning process, as is typical for many learning systems. Self-driving cars learn with humans ready to intervene in dangerous situations. Facebook’s algorithm for recommending trending news stories has humans filtering out inappropriate content [1]. In both examples, the agent’s environment is complex and non-stationary, and there is a wide range of damaging outcomes (like a traffic accident). As RL is applied to increasingly complex real-world problems, such interactive guidance will be critical to the success of these systems.

Prior literature has investigated how people can help RL agents learn more efficiently through different methods of interaction Ng et al. (1999); Knox and Stone (2011); Loftin et al. (2014); Peng et al. (2016); Thomaz and Breazeal (2006); Wiewiora et al. (2003); Judah et al. (2010); Griffith et al. (2013); K.V.N. (2012); Torrey and Taylor (2013); Knox and Stone (2009); Zhan et al. (2016); Walsh et al. (2011); Torrey and Taylor (2012); Maclin and Shavlik (1996); Driessens and Džeroski (2004). Often, the human’s role is to pass along knowledge about relevant quantities of the RL problem, like $Q$-values, action optimality, or the true reward for a particular state-action pair. This way, the person can bias exploration, prevent catastrophic outcomes, and accelerate learning.

Most existing work develops agent-specific protocols for human interaction, that is, protocols for human interaction or advice that are designed for a specific RL algorithm (such as $Q$-learning). For instance, Griffith et al. (2013) investigate the power of policy advice for a Bayesian $Q$-learner. Other works assume that the states of the MDP take a particular representation, or that the action space is discrete or finite. Making explicit assumptions about the agent’s learning process can enable more powerful teaching protocols that leverage insights about the learning algorithm or representation.

1.1 Our contribution: agent-agnostic guidance of RL algorithms

Our goal is to develop a framework for human-agent interaction that is (a) agent-agnostic and (b) can capture a wide range of ways a human can help an RL agent. Such a setting is informative about the structure of general teaching paradigms and the relationship and interplay of pre-existing teaching methods, and is suggestive of new teaching methodologies, which we discuss in Section 6. Additionally, approaching human-in-the-loop RL from a maximally general standpoint can help illustrate the relationship between the requisite power of a teacher and the teacher’s effectiveness on learning. For instance, we demonstrate sufficient conditions on a teacher’s knowledge about an environment that enable effective action pruning of an arbitrary agent (by “effective” we mean pruning bad actions while never pruning an optimal action; see Remark 3 below). Results of this form can again be informative about the general structure of teaching RL agents.

We make two simplifying assumptions. First, we consider environments where the state is fully observed; that is, the learning agent interacts with a Markov Decision Process (MDP) (Puterman, 2014; Kaelbling et al., 1996; Sutton and Barto, 1998). Second, we note that conducting experiments with an actual human in the loop creates a huge amount of work for a human, and can slow down training to an unacceptable degree. For this reason, we focus on programmatic instantiations of humans-in-the-loop; a person informed about the task (MDP) in question will write a program to facilitate various teaching protocols.

There are obvious disadvantages to agent-agnostic protocols. The agent is not specialized to the protocol, so it is unable to ask the human informative questions as in Amir et al. (2016), or will not have an observation model that faithfully represents the process the human uses to generate advice, as in Griffith et al. (2013); Judah et al. (2010). Likewise, the human cannot provide optimally informative advice to the agent as they don’t know the agent’s prior knowledge, exploration technique, representation, or learning method.

Conversely, agent-specific protocols may perform well for one type of algorithm or environment, but poorly on others. In many cases, without further hand-engineering, agent-specific protocols can’t be adapted to a variety of agent-types. When researchers tackle challenging RL problems, they tend to explore a large space of algorithms with important structural differences: some are model-based vs. model-free, some approximate the optimal policy, others a value function, and so on. It takes substantial effort to adapt an advice protocol to each such algorithm. Moreover, as advice protocols and learning algorithms become more complex, greater modularity will help limit design complexity.

In our framework, the interaction between the person guiding the learning process, the agent, and the environment is formalized as a protocol program. This program controls the channels between the agent and the environment based on human input, pictured in Figure 1. This gives the teacher extensive control over the agent: in an extreme case, the agent can be prevented from interacting with the real environment entirely and only interact with a simulation. At the same time, we require that the human only interact with the agent during learning through the protocol program—both agent and environment are a black box to the human.

Figure 1: A general setup for RL with a human in the loop. By instantiating with different protocol programs, we can implement different mechanisms for human guidance of RL agents.

2 Framework

Any system for RL with a human in the loop has to coordinate three components:

  1. The environment is an MDP and is specified by a tuple $M = (S, A, T, R, \gamma)$, where $S$ is the state space, $A$ is the action space, $T : S \times A \times S \to [0, 1]$ denotes the transition function, a probability distribution on states given a state and action, $R : S \times A \to \mathbb{R}$ is the reward function, and $\gamma \in [0, 1]$ is the discount factor.

  2. The agent is a (stateful, potentially stochastic) function $L : S \times \mathbb{R} \to A$.

  3. The human can receive and send advice information of flexible type, say $X_H$ and $Y_H$, so we will treat the human as a (stateful, potentially stochastic) function $H : X_H \to Y_H$. For example, $X_H$ might contain the history of actions, states, and rewards so far, and a new proposed action $a$, and $Y_H$ might be an action as well, either equivalent to $a$ (if accepted) or different (if rejected). We assume that the human knows in general terms how their responses will be used and is making a good-faith effort to be helpful.

The interaction between the environment, the agent, and a human advisor sets up a mechanism design problem: how can we design an interface that orchestrates the interaction between these components such that the combined system maximizes the expected sum of $\gamma$-discounted rewards from the environment? In other words, how can we write a protocol program $P$ that can take the place of a given agent $L$, but that achieves higher rewards by making efficient use of information gained through sub-calls to $L$ and $H$?
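The coordination problem above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `agent_control`, `run_episode`, and `chain_step` are hypothetical, the environment is reduced to a single step function, and the toy chain MDP is invented for the example. The point it shows is that a protocol program has the same call signature as the agent it wraps.

```python
from typing import Callable, Hashable, Tuple

State = Hashable
Action = Hashable

# An agent maps (state, reward) to an action; it may keep internal state.
Agent = Callable[[State, float], Action]
# A protocol program exposes the same interface as an agent.
Protocol = Callable[[State, float], Action]
# Hypothetical environment interface: (state, action) -> (next_state, reward).
EnvStep = Callable[[State, Action], Tuple[State, float]]


def agent_control(agent: Agent) -> Protocol:
    """Trivial protocol: pass every observation straight to the agent."""
    def protocol(state: State, reward: float) -> Action:
        return agent(state, reward)
    return protocol


def run_episode(protocol: Protocol, env_step: EnvStep, start: State,
                gamma: float, steps: int) -> float:
    """Drive the environment with the protocol; return the discounted return."""
    state, reward, total = start, 0.0, 0.0
    for t in range(steps):
        action = protocol(state, reward)
        state, reward = env_step(state, action)
        total += (gamma ** t) * reward
    return total


def chain_step(state: int, action: int) -> Tuple[int, float]:
    """Toy chain MDP (an assumption): move by +1/-1 on 0..3, reward 1 on reaching 3."""
    nxt = max(0, min(3, state + action))
    return nxt, (1.0 if nxt == 3 else 0.0)
```

Because the environment loop only ever calls `protocol(state, reward)`, any of the richer protocols discussed later can be substituted for `agent_control` without touching the agent or the loop.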

By formalizing existing and new techniques as programs, we facilitate understanding and comparison of these techniques within a common framework. By abstracting from particular agents and environments, we may better understand the mechanisms underlying effective teaching for Reinforcement Learning by developing portable and modular teaching methods.

3 Capturing Existing Advice Schemes

Naturally, protocol programs cannot capture all advice protocols. Any protocol that depends on prior knowledge of the agent’s learning algorithm, representation, priors, or hyperparameters is ruled out. Despite this constraint, the framework can capture a range of existing protocols where a human-in-the-loop guides an agent.

Figure 1 shows that the human can manipulate the actions ($A$) sent to the environment, the agent’s observed states ($S$), and observed rewards ($R$). This points to the following combinatorial set of protocol families in which the human manipulates one or more of these components to influence learning:

$$\{S,\ A,\ R,\ \{S, A\},\ \{S, R\},\ \{A, R\},\ \{S, A, R\}\}$$

The first three elements of the set correspond to state manipulation (which can correspond to abstraction or training in simulation), action pruning, and reward shaping protocol families. The remaining elements represent families of teaching schemes that modify multiple elements of the agent’s learning; these protocols may introduce powerful interplay between the different components, which we hope future work will explore.
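The set of protocol families is simply every non-empty subset of the three channels the human can manipulate. As a small sketch (the names `CHANNELS` and `protocol_families` are hypothetical), it can be enumerated directly:

```python
from itertools import combinations

# The three channels a protocol program can manipulate:
# observed states (S), actions (A), and observed rewards (R).
CHANNELS = ("S", "A", "R")


def protocol_families(channels=CHANNELS):
    """Return every non-empty subset of channels, smallest subsets first."""
    return [set(combo)
            for size in range(1, len(channels) + 1)
            for combo in combinations(channels, size)]


families = protocol_families()
```

The first three entries are the single-channel families (state manipulation, action pruning, reward shaping); the remaining four combine channels.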

We now demonstrate simple ways in which protocol programs instantiate typical methods for intervening in an agent’s learning process.

Algorithm 1 Agent in control (standard)
procedure agentControl(s, r)
    return L(s, r)
end procedure

Algorithm 2 Human in control
procedure humanControl(s, r)
    return H(s, r)
end procedure

Algorithm 3 Action pruning
Δ ← H.Δ    ▷ To Prune: S × A → {0, 1}
procedure pruneActions(s, r)
    a ← L(s, r)
    while Δ(s, a) do    ▷ If Needs Pruning
        r ← H[(s, a)]
        a ← L(s, r)
    end while
    return a
end procedure

Algorithm 4 Reward manipulation
procedure manipulateReward(s, r)
    r_hat ← H(s, r)
    return L(s, r_hat)
end procedure

Algorithm 5 Training in simulation
M* ← (S, A, T*, R*, γ)    ▷ Simulation
η ← []    ▷ History: array of (s, r, a)
procedure trainInSimulation(s, r)
    s_sim ← s
    r_sim ← r
    while H(η) ≠ “agent is ready” do
        a_sim ← L(s_sim, r_sim)
        append (s_sim, r_sim, a_sim) to η
        r_sim ∼ R*(s_sim, a_sim)
        s_sim ∼ T*(s_sim, a_sim)
    end while
    return L(s, r)
end procedure

Figure 2: Many schemes for human guidance of RL algorithms can be expressed as protocol programs. These programs have the same interface as the agent $L$, but can be safer or more efficient learners by making use of human advice $H$.

3.1 Reward shaping

Section 2 defined the reward function as part of the MDP $M$. However, while humans generally don’t design the environment, we do design reward functions. Usually the reward function is hand-coded prior to learning and must accurately assign reward values to any state the agent might reach. An alternative is to have a human generate the rewards interactively: the human observes the state and action and returns a scalar to the agent. This setup has been explored in work on TAMER (Knox and Stone, 2009). A similar setup (with an agent-specific protocol) was applied to robotics by Daniel et al. (2014). It is straightforward to represent rewards that are generated interactively (or online) using protocol programs.

We now turn to other protocols in which the human manipulates rewards. These protocols assume a fixed reward function that is part of the MDP $M$.

3.1.1 Reward shaping and Q-value initialization

In Reward Shaping protocols, the human engineer changes the rewards given by some fixed reward function in order to influence an agent’s learning. Ng et al. (1999) introduced potential-based shaping, which shapes rewards without changing an MDP’s optimal policy. In particular, each reward received from the environment is augmented by a shaping function defined in terms of a potential function $\Phi$ over states:

$$F(s, a, s') = \gamma \Phi(s') - \Phi(s), \tag{1}$$

so the agent actually receives $R(s, a) + F(s, a, s')$. Wiewiora et al. (2003) showed potential shaping to be equivalent (for $Q$-learners) to a subset of $Q$-value initializations under some assumptions. Further, Devlin and Kudenko (2012) propose dynamic potential shaping functions that change over time. That is, the shaping function also takes as input two time parameters, $t$ and $t'$, such that:

$$F(s, t, s', t') = \gamma \Phi(s', t') - \Phi(s, t), \tag{2}$$

where $t < t'$. Their main result is that dynamic shaping functions of this form also guarantee optimal policy invariance. Similarly, Wiewiora et al. (2003) extend potential shaping to potential-based advice functions, which identify a similar class of shaping functions on $(s, a)$ pairs.

In Section 4, we show that our framework captures reward shaping, and consequently, a limited notion of $Q$-value initialization.
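One way to build intuition for why potential-based shaping leaves the optimal policy unchanged is that the discounted shaping terms telescope along any trajectory, so they contribute an amount that depends only on the first and last states. The sketch below checks this numerically using the standard potential-based form from Ng et al. (1999), $F(s, a, s') = \gamma \Phi(s') - \Phi(s)$. The potential `phi`, the trajectory, and the function names are illustrative assumptions, not taken from the paper.

```python
def shaping(phi, s, s_next, gamma):
    """Potential-based shaping term: gamma * phi(s') - phi(s)."""
    return gamma * phi(s_next) - phi(s)


def discounted_shaping_sum(phi, states, gamma):
    """Sum of gamma^t * F over consecutive states of a trajectory."""
    return sum((gamma ** t) * shaping(phi, s, s_next, gamma)
               for t, (s, s_next) in enumerate(zip(states, states[1:])))


def phi(s):
    """Illustrative potential (an assumption): negative distance to a goal at 5."""
    return -abs(5 - s)


trajectory = [0, 1, 2, 1, 2, 3, 4, 5]
gamma = 0.9
total = discounted_shaping_sum(phi, trajectory, gamma)

# Telescoping: only the endpoints of the trajectory matter.
steps = len(trajectory) - 1
expected = (gamma ** steps) * phi(trajectory[-1]) - phi(trajectory[0])
```

Here `total` and `expected` agree up to floating-point error, regardless of the detour the trajectory takes through intermediate states.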

3.2 Training in Simulation

It is common practice to train an agent in simulation and transfer it to the real world once it performs well enough. Algorithm 5 (Figure 2) shows how to represent the process of training in simulation as a protocol program. We let $M$ represent the real-world decision problem and let $M^*$ be a simulator for $M$ that is included in the protocol program. Initially the protocol program has the agent interact with $M^*$ while the human observes the interaction. When the human decides the agent is ready, the protocol program has the agent interact with $M$ instead.
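The simulate-then-deploy pattern can be sketched as a stateful protocol. This is an illustration under assumptions rather than the authors' code: `SimulationProtocol`, `human_ready`, and `sim_step` are hypothetical names, and the `max_sim_steps` cap is an added safeguard so the loop terminates even if the human never signals readiness.

```python
class SimulationProtocol:
    """Rehearse the agent in a simulator until the human says it is ready."""

    def __init__(self, agent, human_ready, sim_step, max_sim_steps=10_000):
        self.agent = agent              # black-box learner: (state, reward) -> action
        self.human_ready = human_ready  # history -> bool, stands in for the human
        self.sim_step = sim_step        # simulator: (state, action) -> (state, reward)
        self.max_sim_steps = max_sim_steps  # safeguard, not part of the paper
        self.history = []               # (state, reward, action) triples seen in simulation

    def __call__(self, state, reward):
        sim_state, sim_reward = state, reward
        steps = 0
        # The agent only touches the simulator until the human approves.
        while not self.human_ready(self.history) and steps < self.max_sim_steps:
            sim_action = self.agent(sim_state, sim_reward)
            self.history.append((sim_state, sim_reward, sim_action))
            sim_state, sim_reward = self.sim_step(sim_state, sim_action)
            steps += 1
        # Once ready, the real observation is passed to the agent.
        return self.agent(state, reward)
```

From the agent's point of view nothing distinguishes simulated from real transitions; the switch is decided entirely inside the protocol.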

3.3 Action Pruning

Action pruning is a technique for dynamically removing actions from the MDP to reduce the branching factor of the search space. Such techniques have been shown to accelerate learning and planning time Sherstov and Stone (2005); Hansen et al. (1996); Rosman and Ramamoorthy (2012); Abel et al. (2015). In Section 5, we apply action-pruning to prevent catastrophic outcomes during exploration, a problem explored by Lipton et al. (2016); Garcia and Fernandez (2011, 2012); Hans et al. (2008); Moldovan and Abbeel (2012).

Protocol programs allow action pruning to be carried out interactively. Instead of having to decide which actions to prune prior to learning, the human can wait to observe the states that are actually encountered by the agent, which may be valuable in cases where the human has limited knowledge of the environment or the agent’s learning ability. In Section 4, we exhibit an agent-agnostic protocol for interactively pruning actions that preserves the optimal policy while removing some bad actions.

Our pruning protocol is illustrated in a gridworld with lava pits (Figure 3). The agent is represented by a gray circle, “G” is a goal state that provides reward +1, and the red cells are lava pits with a large negative reward. All white cells provide reward 0.

Figure 3: The human allows movement from state 34 to 33 but blocks the agent from falling in lava (at 43).

At each time step, the human checks whether the agent’s chosen action would move it into a lava pit. If it would not (as in moving DOWN from state 34), the agent continues as normal. If it would (as in moving RIGHT from state 33), the human bypasses sending any action to the true MDP (preventing movement right) and sends the agent a next state of 33. The agent doesn’t actually fall in the lava, but the human sends it a large negative reward. After this negative reward, the agent is less likely to try the action again. For the protocol program, see Algorithm 3 in Figure 2.

Note that the agent receives no explicit signal that their attempted catastrophic action was blocked by the human. They observe a big negative reward and a self-loop but no information about whether the human or environment generated their observation.
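The gridworld intervention can be sketched as follows. The cell coordinates follow the figure's numbering (state 33 is cell (3, 3), lava at cell (4, 3)); everything else is an assumption made for illustration, including the penalty value, the names `prune_actions` and `enters_lava`, the scripted stand-in agent, and the `max_retries` bound.

```python
from typing import Dict, Set, Tuple

Cell = Tuple[int, int]
MOVES: Dict[str, Cell] = {"UP": (0, 1), "DOWN": (0, -1),
                          "LEFT": (-1, 0), "RIGHT": (1, 0)}
LAVA: Set[Cell] = {(4, 3)}  # cell 43 in the figure
PENALTY = -10.0             # hypothetical "big negative reward"


def enters_lava(state: Cell, action: str) -> bool:
    """Stand-in for the human's judgement: would this move land in lava?"""
    dx, dy = MOVES[action]
    return (state[0] + dx, state[1] + dy) in LAVA


def prune_actions(agent, should_prune, penalty, max_retries=100):
    """Wrap an agent so that pruned actions never reach the environment."""
    def protocol(state, reward):
        action = agent(state, reward)
        for _ in range(max_retries):
            if not should_prune(state, action):
                return action
            # Blocked: the agent observes a self-loop and a penalty instead.
            action = agent(state, penalty)
        raise RuntimeError("agent kept proposing pruned actions")
    return protocol


observations = []


def scripted_agent(state, reward):
    """Toy agent: tries RIGHT, switches to DOWN after a negative reward."""
    observations.append((state, reward))
    return "DOWN" if reward < 0 else "RIGHT"


protocol = prune_actions(scripted_agent, enters_lava, PENALTY)
chosen = protocol((3, 3), 0.0)
```

The agent's second observation is the same state paired with the penalty, which is exactly the self-loop described above; it has no way to tell that the human, rather than the environment, produced it.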

3.4 Manipulating state representation

The agent’s state representation can have a significant influence on its learning. Suppose the states of MDP $M$ consist of a number of features, defining a state vector $s$. The human engineer can specify a mapping $\phi$ such that the agent always receives $\phi(s)$ in place of this vector $s$. Such mappings are used to specify high-level features of state that are important for learning, or to dynamically hide confusing features from the agent.

This transformation of the state vector is normally fixed before learning. A protocol program can allow the human to provide processed states or high-level features interactively. By the time the human stops providing features, the agent might have learned to generate them on its own (as in Learning with Privileged Information (Vapnik and Vashist, 2009; Pechyony and Vapnik, 2010)).

Other methods have focused on state abstraction functions to decrease learning time and preserve the quality of learned behavior, as in Li et al. (2006); Ortner (2013); Even-Dar and Mansour (2003); Jong and Stone (2005); Dean et al. (1997); Abel et al. (2016); Ferns et al. (2006). Using a state abstraction function, agents compress representations of their environments, enabling deeper planning and lower sample complexity. Any state aggregation function can be implemented by a protocol program, perhaps dynamically induced through interaction with a teacher.
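A state manipulation protocol only needs to apply a mapping to the state before the agent sees it. The sketch below is illustrative: `with_state_map`, `drop_noise`, and `recording_agent` are hypothetical names, and the example mapping (discarding a distracting third feature) is an assumption chosen to show aggregation of raw states into one abstract state.

```python
def with_state_map(agent, phi):
    """Protocol that shows the agent phi(s) in place of the raw state s."""
    def protocol(state, reward):
        return agent(phi(state), reward)
    return protocol


def drop_noise(state):
    """Hypothetical mapping: keep (x, y), discard a distracting third feature."""
    x, y, _noise = state
    return (x, y)


seen = []


def recording_agent(state, reward):
    """Toy agent that records what it observes and always returns 'noop'."""
    seen.append(state)
    return "noop"


protocol = with_state_map(recording_agent, drop_noise)
protocol((1, 2, 0.73), 0.0)
protocol((1, 2, 0.11), 0.0)
```

Two raw states that differ only in the ignored feature reach the agent as the same abstract state, so `seen` holds `(1, 2)` twice. A mapping that changes over time (for example, one a teacher updates during learning) would fit the same wrapper.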

4 Theory

Here we illustrate some simple ways in which our proposed agent-agnostic interaction scheme captures other existing agent-agnostic protocols. The following results all concern Tabular MDPs, but are intended to offer intuition for high-dimensional or continuous environments as well.

4.1 Reward Shaping

First we observe that protocol programs can precisely capture methods for shaping reward functions.

Remark 1: For any reward shaping function $F$, including potential-based shaping, potential-based advice, and dynamic potential-based advice, there is a protocol that produces the same rewards.

To construct such a protocol for a given $F$, simply let the reward output by the protocol, $\hat{r}$, take on the value $r + F$ at each time step, with $F$ evaluated on the current transition. That is, in Algorithm 4, simply define $H(s, r) = r + F$.
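The construction can be sketched as a stateful wrapper around the agent. This is one possible reading, with assumptions stated plainly: the protocol is taken to observe each reward together with the state that follows it, so it remembers the previous state to evaluate the shaping term; the shaping function used is the potential-based form $\gamma \Phi(s') - \Phi(s)$; and `ShapingProtocol` is a hypothetical name.

```python
class ShapingProtocol:
    """Add a potential-based shaping term to each reward before the agent sees it."""

    def __init__(self, agent, phi, gamma):
        self.agent = agent
        self.phi = phi            # potential function over states
        self.gamma = gamma
        self.prev_state = None    # no transition has been observed yet

    def __call__(self, state, reward):
        shaped = reward
        if self.prev_state is not None:
            # Shaping term for the transition prev_state -> state.
            shaped += self.gamma * self.phi(state) - self.phi(self.prev_state)
        self.prev_state = state
        return self.agent(state, shaped)


received = []


def recording_agent(state, reward):
    """Toy agent that records the rewards it is shown."""
    received.append(reward)
    return "noop"


protocol = ShapingProtocol(recording_agent, phi=lambda s: float(s), gamma=0.5)
protocol(0, 0.0)   # first observation: nothing to shape
protocol(2, 1.0)   # 1.0 + 0.5 * 2.0 - 0.0
protocol(1, 0.0)   # 0.0 + 0.5 * 1.0 - 2.0
```

The agent is untouched; only the reward channel between environment and agent is rewritten, which is what makes the construction agent-agnostic.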

4.2 Action Pruning

We now show that there is a simple class of protocol programs that carry out action pruning of a certain form.

Remark 2: There is a protocol for pruning actions in the following sense: for any set of state-action pairs $\Delta \subseteq S \times A$, the protocol ensures that, for each pair $(s, a) \in \Delta$, action $a$ is never executed in the MDP in state $s$.

The protocol is as described in Section 3.3 and shown in Algorithm 3. The premise is this: in all cases where the agent executes an action that should be pruned, the protocol gives the agent low reward and forces the agent to self-loop.

Knowing which actions to prune is itself a challenging problem. Often, it is natural to assume that the human guiding the learner knows something about the environment of interest (such as where high rewards or catastrophes lie), but may not know every detail of the problem. Thus, we consider a case in which the human has partial (but useful) knowledge about the problem of interest, represented as an approximate $Q$-function. The next remark shows there is a protocol based on approximate knowledge with two properties: (1) it never prunes an optimal action, (2) it limits the magnitude of the agent’s worst mistake:

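Remark 3: Assuming the protocol designer has a $\beta$-optimal $Q$ function $Q_H$:

$$\| Q^*(s, a) - Q_H(s, a) \|_\infty \le \beta, \tag{3}$$

there exists a protocol that never prunes an optimal action, but prunes all actions so that the agent’s mistakes are never more than $4\beta$ below optimal. That is, for all times $t$:

$$V^{L_t}(s_t) \ge V^*(s_t) - 4\beta, \tag{4}$$

where $L_t$ is the agent’s policy after $t$ timesteps.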

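Proof of Remark 3.

The protocol designer has a $\beta$-approximate $Q$ function, denoted $Q_H$, defined as above. Consider the state-specific action pruning function $H(s)$:

$$H(s) = \{ a \in A \mid Q_H(s, a) \ge \max_{a'} Q_H(s, a') - 2\beta \}. \tag{5}$$

The protocol prunes all actions not in $H(s)$ according to the self-loop method described above. This protocol induces a pruned Bellman Equation over available actions, $H(s)$, in each state:

$$V_H(s) = \max_{a \in H(s)} \Big( R(s, a) + \gamma \sum_{s'} T(s, a, s') V_H(s') \Big). \tag{6}$$

Let $a^*$ denote the true optimal action: $a^* = \arg\max_{a'} Q^*(s, a')$. To preserve the optimal policy, we need $a^* \in H(s)$, for each state. Note that $a^* \notin H(s)$ when:

$$Q_H(s, a^*) < \max_{a} Q_H(s, a) - 2\beta. \tag{7}$$

But by definition of $Q_H$, which is within $\beta$ of $Q^*$ everywhere:

$$Q_H(s, a^*) \ge Q^*(s, a^*) - \beta \ge \max_{a} Q^*(s, a) - \beta \ge \max_{a} Q_H(s, a) - 2\beta. \tag{8}$$

Thus, $a^* \notin H(s)$ can never occur. Furthermore, observe that $H(s)$ retains all actions $a$ for which:

$$Q_H(s, a) \ge \max_{a'} Q_H(s, a') - 2\beta \tag{9}$$

holds. Thus, in the worst case, the following two hold:

  1. The optimal action estimate is $\beta$ too low: $Q_H(s, a^*) = Q^*(s, a^*) - \beta$.

  2. The action with the lowest value, $a_{bad}$, is $\beta$ too high: $Q_H(s, a_{bad}) = Q^*(s, a_{bad}) + \beta$.

From Equation 9, observe that the minimal $Q^*(s, a_{bad})$ such that $a_{bad} \in H(s)$ is:

$$Q^*(s, a_{bad}) + \beta \ge Q^*(s, a^*) - \beta - 2\beta \quad \therefore \quad Q^*(s, a_{bad}) \ge Q^*(s, a^*) - 4\beta.$$

Thus, this pruning protocol never prunes an optimal action, but prunes all actions worse than $4\beta$ below $a^*$ in value. We conclude that the agent may never execute an action $4\beta$ below optimal. ∎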
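The per-state argument in the proof can be checked numerically. The sketch below assumes the pruning rule keeps every action whose estimated value is within $2\beta$ of the best estimate, and that the estimate is within $\beta$ of the true $Q^*$ everywhere. It then verifies, on randomly generated values for a single state, that the optimal action is always kept and that every kept action has true value within $4\beta$ of the optimum. The function names and the random test setup are hypothetical, and the check covers only the one-step $Q$-value condition, not the full value-function bound.

```python
import random


def retained_actions(q_h_row, beta):
    """Actions whose estimated value is within 2 * beta of the best estimate."""
    threshold = max(q_h_row.values()) - 2 * beta
    return {a for a, q in q_h_row.items() if q >= threshold}


def check_pruning_properties(num_actions=6, beta=0.1, trials=1000, seed=0):
    """Randomized check of the two per-state properties for one state."""
    rng = random.Random(seed)
    for _ in range(trials):
        # True values, and an estimate perturbed by at most beta per action.
        q_star = {a: rng.uniform(-1.0, 1.0) for a in range(num_actions)}
        q_h = {a: q + rng.uniform(-beta, beta) for a, q in q_star.items()}
        kept = retained_actions(q_h, beta)
        best = max(q_star, key=q_star.get)
        if best not in kept:
            return False  # an optimal action was pruned
        if any(q_star[a] < q_star[best] - 4 * beta - 1e-12 for a in kept):
            return False  # a kept action is more than 4 * beta below optimal
    return True
```

For instance, with estimates `{"a": 1.0, "b": 0.85, "c": 0.5}` and `beta = 0.1`, the threshold is 0.8, so actions `"a"` and `"b"` are retained and `"c"` is pruned.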

5 Experiments

This section applies our action pruning protocols (Section 3.3 and Remarks 2 and 3 above) to concrete RL problems. In Experiment 1, action pruning is used to prevent the agent from trying catastrophic actions, i.e. to achieve safe exploration. In Experiment 2, action pruning is used to accelerate learning.

5.1 Protocol for Preventing Catastrophes

Human-in-the-loop RL can help prevent disastrous outcomes that result from ignorance of the environment’s dynamics or of the reward function. Our goal for this experiment is to prevent the agent from taking catastrophic actions. These are real-world actions so costly that we want the agent to never take them. (We allow an RL agent to take sub-optimal actions while learning. Catastrophic actions are not allowed because their cost is orders of magnitude worse than that of non-catastrophic actions.) This notion of catastrophic action is closely related to ideas in “Safe RL” (Garcia and Fernandez, 2015; Moldovan and Abbeel, 2012) and to work on “significant rare events” (Paul et al., 2016).

Section 3.3 describes our protocol program for preventing catastrophes in finite MDPs using action pruning. There are two important elements of this program:

  1. When the agent tries a catastrophic action $a$ in state $s$, the agent is blocked from executing the action in the real world, and the agent receives state and reward $(s, r_{bad})$, where $r_{bad}$ is an extreme negative reward.

  2. This $(s, a)$ pair is stored so that the protocol program can automate the human’s intervention, which could allow the human to stop monitoring after all catastrophes have been stored.

This protocol prevents catastrophic actions while preserving the optimal policy and having only minimal side-effects on the agent’s learning. We can extend this protocol to environments with high-dimensional state spaces. Element (1) above remains the same. But (2) must be modified: preventing future catastrophes requires generalization across catastrophic actions (as there will be infinitely many such actions). We discuss this setting in Appendix A.
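The second element, storing blocked pairs so the human can eventually step away, can be sketched for the finite case as below. This is an assumption-laden illustration: `CatastropheBlocker`, `ask_human`, the `monitoring` flag, and the `max_retries` bound are hypothetical, and the stored set only generalizes to pairs the human has already flagged.

```python
class CatastropheBlocker:
    """Block catastrophic actions and remember them for later automation."""

    def __init__(self, agent, ask_human, bad_reward, max_retries=100):
        self.agent = agent
        self.ask_human = ask_human    # (state, action) -> bool, the human's call
        self.bad_reward = bad_reward  # extreme negative reward shown on a block
        self.max_retries = max_retries
        self.known_bad = set()        # stored (state, action) pairs
        self.monitoring = True        # set to False once the human stops watching
        self.human_queries = 0

    def is_catastrophic(self, state, action):
        if (state, action) in self.known_bad:
            return True               # automated: no human needed
        if not self.monitoring:
            return False
        self.human_queries += 1
        if self.ask_human(state, action):
            self.known_bad.add((state, action))
            return True
        return False

    def __call__(self, state, reward):
        action = self.agent(state, reward)
        for _ in range(self.max_retries):
            if not self.is_catastrophic(state, action):
                return action
            # Blocked: same state, extreme negative reward.
            action = self.agent(state, self.bad_reward)
        raise RuntimeError("agent kept proposing blocked actions")
```

Turning `monitoring` off is only safe once every catastrophic pair has been stored, which is the condition stated in element (2) above.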

5.2 Experiment 1: Preventing Catastrophes in a Pong-like Game

Our protocol for preventing catastrophes is intended for use in a real-world environment. Here we provide a preliminary test of our protocol in a simple video game.

Our protocol treats the RL agent as a black box. To this end, we applied our protocol to an open-source implementation of the state-of-the-art RL algorithm “Trust Region Policy Optimization” from Duan et al. (2016). The environment was Catcher, a simplified version of Pong with non-visual state representation. Since there are no catastrophic actions in Catcher, we modified the game to give a large negative reward when the paddle’s speed exceeds a speed limit. We compare the performance of an agent who is assisted by the protocol (“Pruned”), and so is blocked from the catastrophic actions, to the performance of a normal RL agent (“Not Pruned”).

We did not use an actual human in the loop. Instead the agent was blocked by a protocol program that checked whether each action would exceed the speed limit. This is essentially the protocol outlined in Appendix A but with the classifier trained offline to recognize catastrophes. Future work will test similar protocols using actual humans. (In this experiment a human can easily recognize catastrophic actions by reading the agent’s speed directly from the game state.)

Figure 4 shows the agent’s mean performance (with SD over 16 trials) over the course of learning. We see that the agent with protocol support (“Pruned”) performed much better overall. This is unsurprising, as it was blocked from ever doing a catastrophic action. The gap in mean performance is large early on but diminishes as the “Not Pruned” agent learns to avoid high speeds. By the end (i.e. after 400,000 actions), “Not Pruned” is close to “Pruned” in mean performance but its total returns over the whole period are around 5 times worse. While the “Pruned” agent observes incongruous state transitions due to being blocked by our protocol, Figure 4 suggests these observations do not have negative side effects on learning.

Figure 4: Preventing Catastrophic Speeds
Figure 5: Pruning in Taxi.

5.3 Protocol for Accelerating Learning

We also conducted a simple experiment in the Taxi domain from Dietterich (2000). The Taxi problem is a more complex version of grid world: each problem instance consists of a taxi and some number of passengers. The agent directs the taxi to each passenger, picks the passenger up, brings them to their destination, and drops them off.

We use Taxi to evaluate the effect of our action pruning protocol for accelerating learning in discrete MDPs. There is a natural procedure for pruning suboptimal actions that dramatically reduces the size of the reachable state space: if the taxi is carrying a passenger but is not at the passenger’s destination, we prune the dropoff action by returning the agent back to its current state with -0.01 reward. This prevents the agent from exploring a large portion of the state space, thus accelerating learning.
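The pruning rule for Taxi is a simple predicate on the state and the proposed action. The sketch below assumes a hypothetical state layout (`TaxiState` with taxi position, a carrying flag, and the passenger's destination) and a hypothetical action name `"dropoff"`; only the rule itself and the -0.01 reward come from the text.

```python
from typing import NamedTuple, Tuple

PRUNE_REWARD = -0.01  # reward returned with the self-loop when dropoff is pruned


class TaxiState(NamedTuple):
    """Hypothetical single-passenger Taxi state."""
    taxi: Tuple[int, int]          # taxi position
    carrying: bool                 # is the passenger in the taxi?
    destination: Tuple[int, int]   # passenger's destination


def prune_dropoff(state: TaxiState, action: str) -> bool:
    """Prune dropoff while carrying a passenger away from their destination."""
    return (action == "dropoff"
            and state.carrying
            and state.taxi != state.destination)
```

A predicate like this can be passed as the pruning test of an action pruning protocol, with `PRUNE_REWARD` as the reward the agent observes when the action is blocked.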

5.4 Experiment 2: Accelerated Learning in Taxi

We evaluated $Q$-learning (Watkins and Dayan, 1992) and R-max (Brafman and Tennenholtz, 2003) with and without action pruning in a simple instance with one passenger. The taxi starts at , the passenger at with destination . We ran standard $Q$-learning with $\epsilon$-greedy exploration and R-max with a planning horizon of four. Results are displayed in Figure 5.

Our results suggest that the action pruning protocol simplifies the problem for a $Q$-learner, and dramatically so for R-max. In the allotted number of episodes, we see that pruning substantially improves the overall cumulative reward achieved; in the case of R-max, the agent is able to effectively solve the problem after a small number of episodes. Further, the results suggest that the agent-agnostic method of pruning is effective without having any internal access to the agent’s code.

6 Conclusion

We presented an agent-agnostic method for giving guidance to Reinforcement Learning agents. Protocol programs written in this framework apply to any possible RL agent, so sophisticated schemes for human-agent interaction can be designed in a modular fashion without the need for adaptation to different RL algorithms. We presented some simple theoretical results that relate our method to existing schemes for interactive RL and illustrated the power of action pruning in two toy domains.

A promising avenue for future work is dynamic state manipulation protocols, which can guide an agent’s learning process by incrementally obscuring confusing features, highlighting relevant features, or simply reducing the dimensionality of the representation. Additionally, future work might investigate whether certain types of value initialization protocols can be captured by protocol programs, such as the optimistic initialization for arbitrary domains developed by Machado et al. (2014). Moreover, the full combinatorial space of learning protocols is suggestive of teaching paradigms that have yet to be explored. We hypothesize that there are powerful teaching methods that take advantage of the interplay between state manipulation, action pruning, and reward shaping. A further challenge is to extend the formalism to account for the interplay between multiple agents, in both competitive and cooperative settings.

Additionally, in our experiments, all protocols are explicitly programmed in advance. In the future, we’d like to experiment with dynamic protocols with a human in the loop during the learning process.

Lastly, an alternate perspective on the framework is that of a centaur system: a joint Human-AI decision maker (Swartout, 2016). Under this view, the human trains and queries the AI dynamically in cases where the human needs help. In the future, we’d like to establish and investigate formalisms relevant to the centaur view of the framework.


This work was supported by Future of Life Institute grant 2015-144846 and by the Future of Humanity Institute (Oxford). We thank Shimon Whiteson, James MacGlashan, and D. Ellis Hershkowitz for helpful conversations.


  • [1] How does facebook determine what topics are trending? Accessed: 2016-10-12.
  • Abel et al. [2015] David Abel, David Ellis Hershkowitz, Gabriel Barth-Maron, Stephen Brawner, Kevin O’Farrell, James MacGlashan, and Stefanie Tellex. Goal-based action priors. In ICAPS, pages 306–314, 2015.
  • Abel et al. [2016] David Abel, D Ellis Hershkowitz, and Michael L. Littman. Near optimal behavior via approximate state abstraction. In Proceedings of The 33rd International Conference on Machine Learning, 2016.
  • Amir et al. [2016] Ofra Amir, Ece Kamar, Andrey Kolobov, and Barbara Grosz. Interactive teaching strategies for agent training. IJCAI, 2016.
  • Brafman and Tennenholtz [2003] Ronen I Brafman and Moshe Tennenholtz. R-max-a general polynomial time algorithm for near-optimal reinforcement learning. The Journal of Machine Learning Research, 3:213–231, 2003.
  • Daniel et al. [2014] Christian Daniel, Malte Viering, Jan Metz, Oliver Kroemer, and Jan Peters. Active reward learning. In Proceedings of Robotics Science & Systems, 2014.
  • Dean et al. [1997] Thomas Dean, Robert Givan, and Sonia Leach. Model reduction techniques for computing approximately optimal solutions for Markov decision processes. In Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence, pages 124–131. Morgan Kaufmann Publishers Inc., 1997.
  • Devlin and Kudenko [2012] Sam Devlin and Daniel Kudenko. Dynamic potential-based reward shaping. Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems (AAMAS), (June):433–440, 2012.
  • Dietterich [2000] Thomas G Dietterich. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13:227–303, 2000.
  • Driessens and Džeroski [2004] Kurt Driessens and Sašo Džeroski. Integrating guidance into relational reinforcement learning. Machine Learning, 57(3):271–304, 2004.
  • Duan et al. [2016] Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. arXiv preprint arXiv:1604.06778, 2016.
  • Even-Dar and Mansour [2003] Eyal Even-Dar and Yishay Mansour. Approximate equivalence of Markov decision processes. In Learning Theory and Kernel Machines, pages 581–594. Springer, 2003.
  • Ferns et al. [2006] Norman Ferns, Pablo Samuel Castro, Doina Precup, and Prakash Panangaden. Methods for computing state similarity in markov decision processes. Proceedings of the 22nd conference on Uncertainty in artificial intelligence, 2006.
  • Garcia and Fernandez [2011] Javier Garcia and Fernando Fernandez. Safe reinforcement learning in high-risk tasks through policy improvement. IEEE SSCI 2011: Symposium Series on Computational Intelligence - ADPRL 2011: 2011 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, pages 76–83, 2011.
  • Garcia and Fernandez [2012] Javier Garcia and Fernando Fernandez. Safe exploration of state and action spaces in reinforcement learning. Journal of Artificial Intelligence Research, 45:515–564, 2012. ISSN 10769757.
  • Garcia and Fernandez [2015] Javier Garcia and Fernando Fernandez. A Comprehensive Survey on Safe Reinforcement Learning. The Journal of Machine Learning Research, 16:1437–1480, 2015.
  • Griffith et al. [2013] Shane Griffith, Kaushik Subramanian, and J Scholz. Policy Shaping: Integrating Human Feedback with Reinforcement Learning. Advances in Neural Information Processing Systems (NIPS), pages 1–9, 2013.
  • Hans et al. [2008] Alexander Hans, Daniel Schneegaß, Anton Maximilian Schäfer, and Steffen Udluft. Safe Exploration for Reinforcement Learning. Proceedings of the 16th European Symposium on Artificial Neural Networks, (April):143–148, 2008.
  • Hansen et al. [1996] Eric A Hansen, Andrew G Barto, and Shlomo Zilberstein. Reinforcement learning for mixed open-loop and closed-loop control. In NIPS, pages 1026–1032. Citeseer, 1996.
  • Jong and Stone [2005] Nicholas K Jong and Peter Stone. State abstraction discovery from irrelevant state variables. In IJCAI, pages 752–757, 2005.
  • Judah et al. [2010] Kshitij Judah, Saikat Roy, Alan Fern, and Thomas G Dietterich. Reinforcement Learning Via Practice and Critique Advice. AAAI, pages 481–486, 2010.
  • Kaelbling et al. [1996] Leslie Pack Kaelbling, Michael L Littman, and Andrew W Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, pages 237–285, 1996.
  • Knox and Stone [2009] W Bradley Knox and Peter Stone. Interactively shaping agents via human reinforcement: The tamer framework. In Proceedings of the fifth international conference on Knowledge capture, pages 9–16. ACM, 2009.
  • Knox and Stone [2011] W Bradley Knox and Peter Stone. Augmenting reinforcement learning with human feedback. Proceedings of the ICML Workshop on New Developments in Imitation Learning, page 8, 2011.
  • K.V.N. [2012] Pradyot K.V.N. Beyond Rewards: Learning from Richer Supervision. (August), 2012.
  • Li et al. [2006] Lihong Li, Thomas J Walsh, and Michael L Littman. Towards a unified theory of state abstraction for mdps. In ISAIM, 2006.
  • Lipton et al. [2016] Zachary C Lipton, Jianfeng Gao, Lihong Li, Jianshu Chen, and Li Deng. Combating reinforcement learning’s sisyphean curse with intrinsic fear. arXiv preprint arXiv:1611.01211, 2016.
  • Loftin et al. [2014] Robert Loftin, Bei Peng, James MacGlashan, Michael L Littman, Matthew E Taylor, Jie Huang, and David L Roberts. Learning something from nothing: Leveraging implicit human feedback strategies. In Robot and Human Interactive Communication, 2014 RO-MAN: The 23rd IEEE International Symposium on, pages 607–612. IEEE, 2014.
  • Machado et al. [2014] Marlos C. Machado, Sriram Srinivasan, and Michael Bowling. Domain-Independent Optimistic Initialization for Reinforcement Learning. AAAI Workshop on Learning for General Competency in Video Games, 2014.
  • Maclin and Shavlik [1996] Richard Maclin and Jude W. Shavlik. Creating advice-taking reinforcement learners. Machine Learning, 22:251–281, 1996.
  • Moldovan and Abbeel [2012] Teodor Mihai Moldovan and Pieter Abbeel. Safe Exploration in Markov Decision Processes. Proceedings of the 29th International Conference on Machine Learning, 2012.
  • Ng et al. [1999] Andrew Y Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, volume 99, pages 278–287, 1999.
  • Ortner [2013] Ronald Ortner. Adaptive aggregation for reinforcement learning in average reward Markov decision processes. Annals of Operations Research, 208(1):321–336, 2013.
  • Paul et al. [2016] Supratik Paul, Kamil Ciosek, Michael A Osborne, and Shimon Whiteson. Alternating optimisation and quadrature for robust reinforcement learning. arXiv preprint arXiv:1605.07496, 2016.
  • Pechyony and Vapnik [2010] Dmitry Pechyony and Vladimir Vapnik. On the Theory of Learning with Privileged Information. Advances in Neural Information Processing Systems 23, pages 1894–1902, 2010.
  • Peng et al. [2016] Bei Peng, James MacGlashan, Robert Loftin, Michael L Littman, David L Roberts, and Matthew E Taylor. A need for speed: Adapting agent action speed to improve task learning from non-expert humans. In Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems, pages 957–965. International Foundation for Autonomous Agents and Multiagent Systems, 2016.
  • Puterman [2014] Martin L Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014.
  • Rosman and Ramamoorthy [2012] Benjamin Rosman and Subramanian Ramamoorthy. What good are actions? accelerating learning using learned action priors. In Development and Learning and Epigenetic Robotics (ICDL), 2012 IEEE International Conference on, pages 1–6. IEEE, 2012.
  • Sherstov and Stone [2005] A.A. Sherstov and P. Stone. Improving action selection in mdp’s via knowledge transfer. In Proceedings of the 20th national conference on Artificial Intelligence, pages 1024–1029. AAAI Press, 2005.
  • Sutton and Barto [1998] Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
  • Swartout [2016] William R Swartout. Virtual humans as centaurs: Melding real and virtual. In International Conference on Virtual, Augmented and Mixed Reality, pages 356–359. Springer, 2016.
  • Thomaz and Breazeal [2006] Andrea Lockerd Thomaz and Cynthia Breazeal. Reinforcement learning with human teachers: Evidence of feedback and guidance with implications for learning performance. AAAI, 6:1000–1005, 2006.
  • Torrey and Taylor [2013] Lisa Torrey and Matthew Taylor. Teaching on a budget: Agents advising agents in reinforcement learning. In Proceedings of the 2013 international conference on Autonomous agents and multi-agent systems, pages 1053–1060. International Foundation for Autonomous Agents and Multiagent Systems, 2013.
  • Torrey and Taylor [2012] Lisa Torrey and Matthew E. Taylor. Help an agent out: Student/teacher learning in sequential decision tasks. Proceedings of the Adaptive and Learning Agents Workshop 2012, ALA 2012 - Held in Conjunction with the 11th International Conference on Autonomous Agents and Multiagent Systems, AAMAS 2012, pages 41–48, 2012.
  • Vapnik and Vashist [2009] Vladimir Vapnik and Akshay Vashist. A new learning paradigm: Learning using privileged information. Neural Networks, 22(5-6):544–557, 2009.
  • Walsh et al. [2011] Thomas J. Walsh, Daniel Hewlett, and Clayton T Morrison. Blending Autonomous Exploration and Apprenticeship Learning. Advances in Neural Information Processing Systems 24, pages 2258–2266, 2011.
  • Watkins and Dayan [1992] Christopher JCH Watkins and Peter Dayan. Q-learning. Machine learning, 8(3-4):279–292, 1992.
  • Wiewiora et al. [2003] Eric Wiewiora, Garrison Cottrell, and Charles Elkan. Principled methods for advising reinforcement learning agents. In ICML, pages 792–799, 2003.
  • Zhan et al. [2016] Yusen Zhan, Haitham Bou Ammar, and Matthew E. Taylor. Theoretically-Grounded Policy Advice from Multiple Teachers in Reinforcement Learning Settings with Applications to Negative Transfer. pages 1–10, 2016.

Appendix A Protocol program for preventing catastrophes in high-dimensional state spaces

We provide an informal overview of the protocol program for avoiding catastrophes. We focus on the differences between the high-dimensional case and the finite case described in Section 5.1. In the finite case, pruned actions are stored in a table. When the human is satisfied that all catastrophic actions are in the table, the human’s monitoring of the agent can be fully automated by the protocol program. The human may need to be in the loop until the agent has attempted each catastrophic action once – after that the human can “retire”.
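The finite case recalled above can be sketched in a few lines. This is only an illustration under stated assumptions: the class `TableBlocker`, the `human_blocks` callback (a stand-in for the human's judgment), and the toy state and action names are all hypothetical and do not come from the paper.

```python
from typing import Callable, Hashable, Set, Tuple

State = Hashable
Action = Hashable


class TableBlocker:
    """Finite-case pruning: blocked state-action pairs live in a look-up table."""

    def __init__(self, human_blocks: Callable[[State, Action], bool]) -> None:
        self.human_blocks = human_blocks  # hypothetical stand-in for the human
        self.pruned: Set[Tuple[State, Action]] = set()
        self.human_queries = 0
        self.human_retired = False

    def is_blocked(self, state: State, action: Action) -> bool:
        # Pairs already in the table are blocked without consulting the human.
        if (state, action) in self.pruned:
            return True
        # Once the human has retired, only the table is consulted.
        if self.human_retired:
            return False
        self.human_queries += 1
        if self.human_blocks(state, action):
            self.pruned.add((state, action))
            return True
        return False

    def retire_human(self) -> None:
        self.human_retired = True


# Toy usage: a single catastrophic pair, chosen purely for illustration.
blocker = TableBlocker(lambda s, a: (s, a) == ("cliff_edge", "forward"))
first = blocker.is_blocked("cliff_edge", "forward")    # human blocks, pair is stored
second = blocker.is_blocked("cliff_edge", "forward")   # answered from the table
blocker.retire_human()
after_retire = blocker.is_blocked("start", "forward")  # unseen pair, no human query
```

In this sketch the human is consulted only for pairs that are not yet in the table, so each catastrophic pair costs the human at most one intervention.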

In the high-dimensional case, where the state space may be infinite, we replace this look-up table with a supervised classification algorithm. All visited state-actions are stored and labeled (“catastrophic” or “not catastrophic”) based on whether the human decides to block them. Once this labeled set is large enough to serve as a training set, the human trains the classifier and tests its performance on held-out instances. If the classifier passes the test, the human can be replaced by the classifier. Otherwise the data-gathering process continues until the training set is large enough for the classifier to pass the test.
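The gather, train, and test loop described above might look roughly like the following. It is a minimal sketch and not the authors' implementation: the 1-nearest-neighbour rule stands in for whatever supervised classifier is used, and the `human_blocks` oracle, the two-dimensional features, the 80/20 split, the pass criterion (no missed catastrophic held-out example), and the `MAX_ROUNDS` cap are all assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[float], int]  # (state-action features, 1 = catastrophic)


def nearest_label(train: List[Example], x: Sequence[float]) -> int:
    """1-nearest-neighbour prediction; a stand-in for any supervised classifier."""
    def sq_dist(example: Example) -> float:
        return sum((a - b) ** 2 for a, b in zip(example[0], x))
    return min(train, key=sq_dist)[1]


def passes_held_out_test(train, held_out, max_missed=0) -> bool:
    """Pass only if at most max_missed catastrophic held-out examples are missed."""
    missed = sum(
        1 for x, y in held_out if y == 1 and nearest_label(train, x) == 0
    )
    return missed <= max_missed


def human_blocks(x) -> int:
    """Hypothetical human: blocks action 1 when the position exceeds 0.8."""
    position, action = x
    return int(position > 0.8 and action == 1.0)


rng = random.Random(0)
data: List[Example] = []
human_in_loop, rounds, MAX_ROUNDS = True, 0, 20
while human_in_loop and rounds < MAX_ROUNDS:
    rounds += 1
    # The human monitors 200 more state-actions and labels each one.
    for _ in range(200):
        x = (rng.random(), float(rng.randint(0, 1)))
        data.append((x, human_blocks(x)))
    # Hold out the most recent 20% of the labeled set for testing.
    split = int(0.8 * len(data))
    train, held_out = data[:split], data[split:]
    # The human is replaced only once the classifier passes the test.
    human_in_loop = not passes_held_out_test(train, held_out)
```

The pass criterion here counts only missed catastrophes. A fuller test would presumably also bound false alarms, since blocking safe actions is what produces side-effects on the agent's learning.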

If the class of catastrophic actions is learnable by the classifier, this protocol prevents all catastrophes and has minimal side-effects on the agent’s learning. However, there are limitations of the protocol that will be the subject of future work:

  • The human may need to monitor the agent for a very long time to provide sufficient training data. One possible remedy is for the human to augment the training set by adding synthetically generated states to it. For example, the human might add noise to genuine states without altering their labels. Alternatively, extra training data could be sampled from an accurate generative model.

  • Some catastrophic outcomes have a “local” cause that is easy to block: if a car moves very slowly, it can avoid hitting an obstacle by braking at the last second. Others do not: if a car has lots of momentum, it cannot be slowed down quickly enough. In such cases a human in the loop would have to recognize the danger some time before the actual catastrophe would take place.

  • To prevent catastrophes from ever taking place, the classifier needs to correctly identify every catastrophic action. This requires strong guarantees about the generalization performance of the classifier. Yet the distribution over state-action instances is non-stationary (violating the usual i.i.d. assumption).
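The remedy mentioned in the first limitation, adding noise to genuine states without altering their labels, could be sketched as follows. The function name `augment_with_noise`, the Gaussian noise model, and the values of `copies` and `sigma` are illustrative assumptions. The sketch also assumes the noise is small enough that the human's label would still be correct for the perturbed state, which need not hold near the boundary between catastrophic and safe states.

```python
import random
from typing import List, Tuple

Example = Tuple[Tuple[float, ...], int, int]  # (state, action, label)


def augment_with_noise(
    examples: List[Example], copies: int = 3, sigma: float = 0.01, seed: int = 0
) -> List[Example]:
    """Return the genuine examples followed by noisy copies of each one."""
    rng = random.Random(seed)
    augmented = list(examples)
    for state, action, label in examples:
        for _ in range(copies):
            # Perturb only the state; the action and the label are copied unchanged.
            noisy_state = tuple(v + rng.gauss(0.0, sigma) for v in state)
            augmented.append((noisy_state, action, label))
    return augmented


# Two hypothetical human-labeled examples (label 1 = catastrophic).
genuine = [((0.95, 0.10), 1, 1), ((0.20, 0.40), 0, 0)]
augmented = augment_with_noise(genuine, copies=3)
```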