Woulda, Coulda, Shoulda: Counterfactually-Guided Policy Search

by Lars Buesing, et al.

Learning policies on data synthesized by models can in principle quench the thirst of reinforcement learning algorithms for large amounts of real experience, which is often costly to acquire. However, simulating plausible experience de novo is a hard problem for many complex environments, often resulting in biases for model-based policy evaluation and search. Instead of de novo synthesis of data, here we assume logged, real experience and model alternative outcomes of this experience under counterfactual actions, i.e., actions that were not actually taken. Based on this, we propose the Counterfactually-Guided Policy Search (CF-GPS) algorithm for learning policies in POMDPs from off-policy experience. It leverages structural causal models for counterfactual evaluation of arbitrary policies on individual off-policy episodes. CF-GPS can improve on vanilla model-based RL algorithms by making use of available logged data to de-bias model predictions. In contrast to off-policy algorithms based on Importance Sampling, which re-weight data, CF-GPS leverages a model to explicitly consider alternative outcomes, allowing the algorithm to make better use of experience data. We find empirically that these advantages translate into improved policy evaluation and search results on a non-trivial grid-world task. Finally, we show that CF-GPS generalizes the previously proposed Guided Policy Search and that reparameterization-based algorithms such as Stochastic Value Gradient can be interpreted as counterfactual methods.
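To make the idea of counterfactual evaluation on a single logged episode concrete, the following is a minimal sketch under strong simplifying assumptions. It is not the CF-GPS algorithm itself. It assumes a toy, fully observed one-dimensional environment whose structural equation is `next_state = state + action + noise`, so the noise of a logged episode can be recovered exactly, whereas the paper targets POMDPs where it would have to be inferred. All function names, the dynamics, and the reward (negative distance from zero) are hypothetical choices made for illustration.

```python
def step(state, action, noise):
    """Toy structural equation (an assumption): the next state is a
    deterministic function of state, action, and an exogenous noise term."""
    return state + action + noise


def rollout(policy, initial_state, noises):
    """Run a policy through the model for a fixed sequence of noise values."""
    states, actions = [initial_state], []
    for noise in noises:
        action = policy(states[-1])
        actions.append(action)
        states.append(step(states[-1], action, noise))
    return states, actions


def infer_noise(states, actions):
    """Recover the noise that explains each logged transition.
    Exact here only because the toy model is additive and fully observed."""
    return [s_next - s - a for s, a, s_next in zip(states, actions, states[1:])]


def counterfactual_return(policy, states, actions):
    """Evaluate `policy` on the logged episode: keep the inferred noise,
    replace the logged actions with the policy's actions, and re-simulate."""
    noises = infer_noise(states, actions)
    cf_states, _ = rollout(policy, states[0], noises)
    # Hypothetical reward: negative distance from the origin at each step.
    return -sum(abs(s) for s in cf_states[1:])


def stay_put(state):
    """Hypothetical behavior policy that generated the logged data."""
    return 0


def toward_zero(state):
    """Hypothetical target policy to be evaluated counterfactually."""
    return -1 if state > 0 else (1 if state < 0 else 0)


if __name__ == "__main__":
    # A logged episode collected under the behavior policy.
    logged_states, logged_actions = rollout(stay_put, 2, [1, -1, 0, 1])
    print(counterfactual_return(stay_put, logged_states, logged_actions))
    print(counterfactual_return(toward_zero, logged_states, logged_actions))
```

Because the counterfactual rollout reuses the noise inferred from the logged episode, evaluating the behavior policy reproduces the logged trajectory exactly, while evaluating a different policy answers the question of what would have happened in that same episode had other actions been taken.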




