Mo' States Mo' Problems: Emergency Stop Mechanisms from Observation

12/03/2019 ∙ Samuel Ainsworth et al. ∙ University of Washington

In many environments, only a relatively small subset of the complete state space is necessary in order to accomplish a given task. We develop a simple technique using emergency stops (e-stops) to exploit this phenomenon. Using e-stops significantly improves sample complexity by reducing the amount of required exploration, while retaining a performance bound that efficiently trades off the rate of convergence with a small asymptotic sub-optimality gap. We analyze the regret behavior of e-stops and present empirical results in discrete and continuous settings demonstrating that our reset mechanism can provide order-of-magnitude speedups on top of existing reinforcement learning methods.


1 Introduction

In this paper, we consider the problem of determining when, along a training roll-out, feedback from the environment is no longer beneficial and an intervention, such as resetting the agent to the initial state distribution, is warranted. We show that such interventions can naturally trade a small sub-optimality gap for a dramatic decrease in sample complexity. In particular, we focus on the reinforcement learning setting in which the agent has access to a reward signal in addition to either (a) an expert supervisor triggering the e-stop mechanism in real time or (b) expert state-only demonstrations used to “learn” an automatic e-stop trigger. Both settings fall into the same framework.
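
Concretely, both settings can be thought of as supplying a boolean trigger that is queried during a training roll-out. The sketch below is purely illustrative and not taken from the paper; the class and method names (EStopTrigger, should_stop, and the button/support-set attributes) are hypothetical placeholders for the two sources of the trigger.

```python
# Minimal sketch (illustrative, not the paper's code): both e-stop settings
# reduce to a boolean trigger queried once per step of a training roll-out.
from abc import ABC, abstractmethod


class EStopTrigger(ABC):
    """Hypothetical interface: decide whether to cut a roll-out short."""

    @abstractmethod
    def should_stop(self, state) -> bool:
        ...


class HumanEStop(EStopTrigger):
    """Setting (a): a supervisor presses an e-stop button in real time."""

    def __init__(self, button):
        self.button = button  # e.g. polls a physical or GUI e-stop button

    def should_stop(self, state) -> bool:
        return self.button.is_pressed()


class LearnedEStop(EStopTrigger):
    """Setting (b): trigger built offline from expert state-only demos."""

    def __init__(self, support_set):
        self.support_set = support_set  # states the expert was observed to visit

    def should_stop(self, state) -> bool:
        return state not in self.support_set
```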

Evidence already suggests that using simple, manually designed heuristic resets can dramatically improve training time. For example, the classic pole-balancing problem originally introduced in [WidrowSmith1964] prematurely terminates an episode and resets to an initial distribution whenever the pole exceeds some fixed angle off vertical. More subtly, such manually designed reset rules are hard-coded into many popular OpenAI gym environments [1606.01540].
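
As a concrete illustration (ours, not the paper's), such a hand-designed reset rule can be written as a gym-style wrapper. The 12-degree threshold mirrors the limit hard-coded in gym's CartPole environment and is used here only as an example; the wrapper assumes the classic gym step API returning (obs, reward, done, info).

```python
import math

import gym


class AngleLimitReset(gym.Wrapper):
    """Illustrative hand-designed reset rule in the spirit of classic
    pole balancing: end the episode as soon as the pole tilts too far."""

    def __init__(self, env, max_angle_rad=12 * math.pi / 180):
        super().__init__(env)
        self.max_angle_rad = max_angle_rad  # 12 degrees, as in gym's CartPole

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        pole_angle = obs[2]  # CartPole observation: [x, x_dot, theta, theta_dot]
        if abs(pole_angle) > self.max_angle_rad:
            done = True  # premature termination; caller resets to the initial distribution
        return obs, reward, done, info
```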

Some recent approaches have demonstrated empirical success learning when to intervene, either in the form of resetting, collecting expert feedback, or falling back to a safe policy [Eysenbach2017; Laskey2016; Richter2017; Kahn2017]. We specifically study reset mechanisms which are more natural for human operators to provide – in the form of large red buttons, for example – and thus perhaps less noisy than action or value feedback [Bagnell2015]. Further, we show how to build automatic reset mechanisms from state-only observations, which are often widely available, e.g. in the form of videos [Torabi2018].

The key idea of our method is to build a support set related to the expert’s state-visitation probabilities, and to terminate the episode with a large penalty when the agent leaves this set, as visualized in LABEL:fig:overview. This support set defines a modified MDP and can either be constructed implicitly, via an expert supervisor triggering e-stops in real time, or constructed a priori from observation-only roll-outs of an expert policy. As we will show, using a support set explicitly restricts exploration to a smaller state space while maintaining guarantees on the learner’s performance. We emphasize that our technique for incorporating observations applies to any reinforcement learning algorithm in either continuous or discrete domains.
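
To make the idea concrete, here is a minimal sketch of one way such a support-set e-stop could be realized for continuous states, assuming a simple grid discretization of expert observations. The function names, bin width, and penalty value are illustrative choices of ours, not the paper's construction, which is developed in LABEL:sec:learned-estop.

```python
import numpy as np


def build_support_set(expert_states, bin_width=0.25):
    """Hypothetical construction: discretize expert state observations and
    keep the set of visited cells as an approximate support set."""
    return {
        tuple(np.floor(np.asarray(s) / bin_width).astype(int))
        for s in expert_states
    }


class SupportSetEStop:
    """Sketch of the modified MDP: leaving the (approximate) support of the
    expert's state-visitation distribution ends the episode with a penalty.
    Assumes a gym-like env with reset() and step() returning 4-tuples."""

    def __init__(self, env, support_set, bin_width=0.25, penalty=-100.0):
        self.env = env
        self.support_set = support_set
        self.bin_width = bin_width
        self.penalty = penalty  # large negative reward discourages leaving the set

    def reset(self):
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        cell = tuple(np.floor(np.asarray(obs) / self.bin_width).astype(int))
        if cell not in self.support_set:
            reward += self.penalty
            done = True  # e-stop: reset to the initial state distribution
        return obs, reward, done, info
```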

The contributions and organization of the remainder of the paper are as follows.

  • We provide a general framework for incorporating arbitrary emergency stop (e-stop) interventions from a supervisor into any reinforcement learning algorithm using the notion of support sets in LABEL:sec:interventions.

  • We present methods and analysis for building support sets from observations in LABEL:sec:learned-estop, allowing for the creation of automatic e-stop devices.

  • In LABEL:sec:experiments we empirically demonstrate on benchmark discrete and continuous domains that our reset mechanism allows us to naturally trade off a small asymptotic sub-optimality gap for significantly improved convergence rates with any reinforcement learning method.

  • Finally, in LABEL:sec:support-types, we generalize the concept of support sets to a spectrum of set types and discuss their respective tradeoffs.