Post-Episodic Reinforcement Learning Inference

02/17/2023
by Vasilis Syrgkanis, et al.

We consider estimation and inference with data collected from episodic reinforcement learning (RL) algorithms, i.e., adaptive experimentation algorithms that in each period (also called an episode) interact multiple times, sequentially, with a single treated unit. Our goal is to evaluate counterfactual adaptive policies after data collection and to estimate structural parameters such as dynamic treatment effects, which can be used for credit assignment (e.g., what was the effect of the first-period action on the final outcome?). Such parameters can be framed as solutions to moment equations, but not as minimizers of a population loss function, which leads to Z-estimation approaches in the case of static data. However, such estimators fail to be asymptotically normal when the data are collected adaptively. We propose a re-weighted Z-estimation approach with carefully designed adaptive weights that stabilize the episode-varying estimation variance, which results from the nonstationary policy that typical episodic RL algorithms invoke. We identify proper weighting schemes that restore the consistency and asymptotic normality of the re-weighted Z-estimators, which allows for hypothesis testing and the construction of reliable confidence regions for the target parameters. Primary applications include dynamic treatment effect estimation and dynamic off-policy evaluation.
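The re-weighting idea can be illustrated with a minimal sketch. It is not the authors' estimator: it assumes a single-period, two-armed toy experiment rather than the multi-period episodic setting, a linear inverse-propensity moment for the mean reward of one arm, and square-root-propensity weights as one plausible variance-stabilizing choice. All function names, the simulated policy, and the weight choice are hypothetical and introduced only for illustration.

```python
import math
import random


def solve_weighted_z(weights, slopes, intercepts):
    """Solve sum_t w_t * (intercepts_t - slopes_t * theta) = 0 for scalar theta."""
    numerator = sum(w * b for w, b in zip(weights, intercepts))
    denominator = sum(w * a for w, a in zip(weights, slopes))
    return numerator / denominator


def simulate_adaptive_data(n_episodes, mean_arm1=1.0, mean_arm0=0.0, seed=0):
    """Toy two-armed experiment whose propensity depends on past rewards."""
    rng = random.Random(seed)
    sums, counts = [0.0, 0.0], [0, 0]
    records = []  # each entry: (action, propensity of arm 1, reward)
    for _ in range(n_episodes):
        est = [sums[a] / counts[a] if counts[a] else 0.0 for a in (0, 1)]
        # Nonstationary policy: favor the arm that currently looks better.
        p1 = 0.9 if est[1] >= est[0] else 0.1
        action = 1 if rng.random() < p1 else 0
        reward = rng.gauss(mean_arm1 if action == 1 else mean_arm0, 1.0)
        sums[action] += reward
        counts[action] += 1
        records.append((action, p1, reward))
    return records


def estimate_arm1_mean(records, adaptive=True):
    """Weighted Z-estimate of the arm-1 mean from an inverse-propensity moment.

    Moment per episode: m_t(theta) = 1{A_t = 1} / p_t * (Y_t - theta).
    """
    weights, slopes, intercepts = [], [], []
    for action, p1, reward in records:
        ipw = (1.0 if action == 1 else 0.0) / p1
        # Assumed weight choice: sqrt(propensity) offsets the 1/propensity
        # scaling of the moment's conditional variance.
        weights.append(math.sqrt(p1) if adaptive else 1.0)
        slopes.append(ipw)
        intercepts.append(ipw * reward)
    return solve_weighted_z(weights, slopes, intercepts)


records = simulate_adaptive_data(5000)
theta_hat = estimate_arm1_mean(records)
print(f"weighted Z-estimate of the arm-1 mean: {theta_hat:.3f}")
```

In this toy setting the conditional variance of the unweighted moment scales like 1/p_t, so it changes from episode to episode as the policy adapts; multiplying by sqrt(p_t) makes it constant. Passing `adaptive=False` gives the unweighted estimate for comparison.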


research
10/05/2022

Doubly Robust Proximal Synthetic Controls

To infer the treatment effect for a single treated unit using panel data...
research
12/01/2020

Evaluating (weighted) dynamic treatment effects by double machine learning

We consider evaluating the causal effects of dynamic treatments, i.e. of...
research
02/08/2020

Inference for Batched Bandits

As bandit algorithms are increasingly utilized in scientific studies, th...
research
06/01/2021

Post-Contextual-Bandit Inference

Contextual bandit algorithms are increasingly replacing non-adaptive A/B...
research
02/17/2021

Counterfactual Inference of the Mean Outcome under a Convergence of Average Logging Probability

Adaptive experiments, including efficient average treatment effect estim...
research
05/05/2021

Policy Learning with Adaptively Collected Data

Learning optimal policies from historical data enables the gains from pe...
research
10/16/2019

Generative Learning of Counterfactual for Synthetic Control Applications in Econometrics

A common statistical problem in econometrics is to estimate the impact o...
