SPEED: Experimental Design for Policy Evaluation in Linear Heteroscedastic Bandits

01/29/2023
by   Subhojyoti Mukherjee, et al.

We study the problem of optimal data collection for policy evaluation in linear bandits. In policy evaluation, we are given a target policy and asked to estimate the expected cumulative reward it will obtain when executed in an environment formalized as a multi-armed bandit. We focus on the linear bandit setting with heteroscedastic reward noise; this is the first work to study an optimal data collection strategy for policy evaluation under heteroscedastic reward noise in this setting. We first formulate an optimal design for weighted least squares estimates in the heteroscedastic linear bandit setting that reduces the MSE of the estimated value of the target policy. We term this policy-weighted least squares estimation and use this formulation to derive the optimal behavior policy for data collection. We then propose a novel algorithm, SPEED (Structured Policy Evaluation Experimental Design), that tracks the optimal behavior policy, and we derive its regret with respect to the optimal behavior policy. Finally, we empirically validate that SPEED leads to policy evaluation with mean squared error comparable to that of the oracle strategy and significantly lower than that of simply running the target policy.
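
As a rough illustration of the weighted least squares step mentioned in the abstract, the sketch below estimates a linear reward parameter from samples with unequal noise variances and then evaluates a target policy with that estimate. It is not the SPEED algorithm or the paper's estimator: it assumes the per-sample noise variances are known, and the names `weighted_least_squares`, `policy_value`, `noise_var`, `arms`, and `pi` are hypothetical and chosen only for this example.

```python
import numpy as np


def weighted_least_squares(X, y, noise_var):
    """Estimate theta by minimizing sum_i (y_i - x_i^T theta)^2 / noise_var_i.

    Assumption for this sketch: noise_var holds the known (positive)
    noise variance of each sample. Noisier samples get smaller weights.
    """
    w = 1.0 / np.asarray(noise_var, dtype=float)
    A = X.T @ (X * w[:, None])      # weighted Gram matrix: X^T W X
    b = X.T @ (w * y)               # weighted moment vector: X^T W y
    return np.linalg.solve(A, b)    # requires X^T W X to be invertible


def policy_value(theta, arms, pi):
    """Expected one-step reward of policy pi: sum_a pi(a) * x_a^T theta."""
    return float(pi @ (arms @ theta))
```

A usage example under the same assumptions: given a feature matrix `arms` with one row per action, collect rows `X = arms[pulls]` for the pulled actions, pass the observed rewards and their variances to `weighted_least_squares`, and call `policy_value(theta_hat, arms, pi)` with the target policy's action probabilities. The choice of which actions to pull (the behavior policy) is what the paper optimizes; this sketch leaves it to the caller.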


research
03/09/2022

ReVar: Strengthening Policy Evaluation via Reduced Variance Sampling

This paper studies the problem of data collection for policy evaluation ...
research
01/19/2021

Minimax Off-Policy Evaluation for Multi-Armed Bandits

We study the problem of off-policy evaluation in the multi-armed bandit ...
research
02/26/2022

Safe Exploration for Efficient Policy Evaluation and Comparison

High-quality data plays a central role in ensuring the accuracy of polic...
research
05/06/2020

DTR Bandit: Learning to Make Response-Adaptive Decisions With Low Regret

Dynamic treatment regimes (DTRs) are personalized, sequential treatm...
research
11/29/2021

Robust On-Policy Data Collection for Data-Efficient Policy Evaluation

This paper considers how to complement offline reinforcement learning (R...
research
07/12/2023

On Collaboration in Distributed Parameter Estimation with Resource Constraints

We study sensor/agent data collection and collaboration policies for par...
research
10/25/2020

Tractable contextual bandits beyond realizability

Tractable contextual bandit algorithms often rely on the realizability a...
