In this paper, we consider the problem of determining when, along a training roll-out, feedback from the environment is no longer beneficial and an intervention, such as resetting the agent to the initial state distribution, is warranted. We show that such interventions can naturally trade a small sub-optimality gap for a dramatic decrease in sample complexity. In particular, we focus on the reinforcement learning setting in which the agent has access to a reward signal in addition to either (a) an expert supervisor triggering an emergency stop (e-stop) mechanism in real time or (b) expert state-only demonstrations used to “learn” an automatic e-stop trigger. Both settings fall into the same framework.
Evidence already suggests that using simple, manually designed heuristic resets can dramatically improve training time. For example, the classic pole-balancing problem, originally introduced in WidrowSmith1964, prematurely terminates an episode and resets to the initial state distribution whenever the pole exceeds some fixed angle off-vertical. More subtly, such manually designed reset rules are hard-coded into many popular OpenAI Gym environments 1606.01540.
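As a concrete illustration, such a hard-coded reset heuristic amounts to a fixed termination predicate on the state. The sketch below is purely illustrative; the threshold value and function name are our own placeholders, not the actual Gym CartPole implementation.

```python
import math

# Hypothetical angle threshold (in radians) beyond which the episode
# is prematurely terminated and the agent is reset.
ANGLE_LIMIT_RAD = 12 * math.pi / 180  # e.g. 12 degrees off-vertical

def should_reset(pole_angle_rad: float) -> bool:
    """Trigger an episode reset once the pole exceeds the fixed angle."""
    return abs(pole_angle_rad) > ANGLE_LIMIT_RAD
```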
Some recent approaches have demonstrated empirical success in learning when to intervene, whether by resetting, collecting expert feedback, or falling back to a safe policy Eysenbach2017; Laskey2016; Richter2017; Kahn2017. We specifically study reset mechanisms that are more natural for human operators to provide, for example in the form of large red buttons, and thus perhaps less noisy than action or value feedback Bagnell2015. Further, we show how to build automatic reset mechanisms from state-only observations, which are often widely available, e.g. in the form of videos Torabi2018.
The key idea of our method is to build a support set related to the expert’s state-visitation probabilities, and to terminate the episode with a large penalty whenever the agent leaves this set, as visualized in LABEL:fig:overview. This support set defines a modified MDP and can be constructed either implicitly, via an expert supervisor triggering e-stops in real time, or a priori, from observation-only roll-outs of an expert policy. As we will show, using a support set explicitly restricts exploration to a smaller state space while maintaining guarantees on the learner’s performance. We emphasize that our technique for incorporating observations applies to any reinforcement learning algorithm in either continuous or discrete domains.
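The e-stop idea can be sketched in a few lines: estimate a support set from expert state-only roll-outs, then terminate any learner episode that leaves the set, charging a large penalty. Everything here (the class name `SupportSetEStop`, the `penalty` parameter, the toy chain states) is our own minimal illustration under the assumption of a discrete state space, not an API from the paper.

```python
class SupportSetEStop:
    """Automatic e-stop built from expert state-only demonstrations."""

    def __init__(self, expert_trajectories, penalty=-100.0):
        # Support set: every state the expert was observed to visit.
        self.support = {s for traj in expert_trajectories for s in traj}
        self.penalty = penalty

    def check(self, state):
        """Return (terminate, extra_reward) for the state just visited."""
        if state not in self.support:
            # E-stop: end the episode with a large penalty, restricting
            # exploration to the expert's visited region.
            return True, self.penalty
        return False, 0.0

# Toy usage: the expert stays within states 0..4 of a chain, so a
# learner wandering to state 7 triggers the e-stop.
estop = SupportSetEStop(expert_trajectories=[[0, 1, 2, 3, 4], [0, 1, 2, 2, 3]])
done, r = estop.check(7)
```

In a continuous domain, the membership test would be replaced by, e.g., a learned density model or distance threshold; the wrapper structure stays the same.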
The contributions and organization of the remainder of the paper are as follows.
We provide a general framework for incorporating arbitrary emergency stop (e-stop) interventions from a supervisor into any reinforcement learning algorithm using the notion of support sets in LABEL:sec:interventions.
We present methods and analysis for building support sets from observations in LABEL:sec:learned-estop, allowing for the creation of automatic e-stop devices.
In LABEL:sec:experiments we empirically demonstrate on benchmark discrete and continuous domains that our reset mechanism allows us to naturally trade off a small asymptotic sub-optimality gap for significantly improved convergence rates with any reinforcement learning method.
Finally, in LABEL:sec:support-types, we generalize the concept of support sets to a spectrum of set types and discuss their respective tradeoffs.