
On Instrumental Variable Regression for Deep Offline Policy Evaluation
We show that the popular reinforcement learning (RL) strategy of estimating the state-action value (Q-function) by minimizing the mean squared Bellman error leads to a regression problem with confounding, the inputs and output noise being correlated. Hence, direct minimization of the Bellman error can result in significantly biased Q-function estimates. We explain why fixing the target Q-network in Deep Q-Networks and Fitted Q Evaluation provides a way of overcoming this confounding, thus shedding new light on this popular but not well understood trick in the deep RL literature. An alternative approach to address confounding is to leverage techniques developed in the causality literature, notably instrumental variables (IV). We bring together here the literature on IV and RL by investigating whether IV approaches can lead to improved Q-function estimates. This paper analyzes and compares a wide range of recent IV methods in the context of offline policy evaluation (OPE), where the goal is to estimate the value of a policy using logged data only. By applying different IV techniques to OPE, we are not only able to recover previously proposed OPE methods such as model-based techniques but also to obtain competitive new techniques. We find empirically that state-of-the-art OPE methods are closely matched in performance by some IV methods such as AGMM, which were not developed for OPE. We open-source all our code and datasets at https://github.com/liyuan9988/IVOPEwithACME.
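To make the "fixed target" trick discussed in the abstract concrete, here is a minimal tabular sketch of Fitted Q Evaluation: each iteration regresses Q(s, a) onto the target r + γ·Q_target(s', π(s')), holding Q_target frozen during the fit so the regression targets do not share noise with the estimate being updated. The two-state MDP and all numbers below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

gamma = 0.9
# Hypothetical 2-state, single-action MDP under a fixed policy pi:
# state 0 transitions to state 1; state 1 transitions to itself.
next_state = np.array([1, 1])
reward = np.array([1.0, 0.0])

# True Q^pi solves the Bellman equation Q = r + gamma * Q[next_state]:
# Q(1) = 0 + 0.9 * Q(1)  ->  Q(1) = 0;  Q(0) = 1 + 0.9 * 0 = 1.
true_q = np.array([1.0, 0.0])

q = np.zeros(2)
for _ in range(50):
    # Targets are computed from the *frozen* current q (the target network);
    # in the tabular case the least-squares fit is exact, so q jumps to them.
    target = reward + gamma * q[next_state]
    q = target

print(q)  # converges to [1. 0.], matching true_q
```

With function approximation the same structure applies, except the inner regression is solved approximately by gradient steps while the target network stays fixed for many updates; the paper's point is that this decoupling removes the correlation between regression inputs and target noise that biases direct Bellman-error minimization.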