Off-Policy Risk Assessment in Contextual Bandits

by Audrey Huang, et al.

To evaluate prospective contextual bandit policies when experimentation is not possible, practitioners often rely on off-policy evaluation, using data collected under a behavioral policy. While off-policy evaluation studies typically focus on the expected return, practitioners often care about other functionals of the reward distribution (e.g., to express aversion to risk). In this paper, we first introduce the class of Lipschitz risk functionals, which subsumes many common functionals, including variance, mean-variance, and conditional value-at-risk (CVaR). For Lipschitz risk functionals, the error in off-policy risk estimation is bounded by the error in off-policy estimation of the cumulative distribution function (CDF) of rewards. Second, we propose Off-Policy Risk Assessment (OPRA), an algorithm that (i) estimates the target policy's CDF of rewards and (ii) generates a plug-in estimate of the risk. Given a collection of Lipschitz risk functionals, OPRA provides estimates for each with corresponding error bounds that hold simultaneously. We analyze both importance sampling and variance-reduced doubly robust estimators of the CDF. Our primary theoretical contributions are (i) the first concentration inequalities for both types of CDF estimators and (ii) guarantees on our Lipschitz risk functional estimates, which converge at a rate of O(1/√n). For practitioners, OPRA offers a practical solution for providing high-confidence assessments of policies using a collection of relevant metrics.
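To make the estimate-then-plug-in recipe concrete, here is a minimal sketch of the simpler of the two estimators the abstract mentions: an importance-sampling estimate of the target policy's reward CDF, followed by a plug-in CVaR computed from that CDF. The function names (`is_cdf`, `plugin_cvar`), the quantile-grid approximation, and the specific CVaR convention (lower-tail average, CVaR_α = (1/α) ∫₀^α F⁻¹(u) du) are illustrative assumptions, not the paper's code; the variance-reduced doubly robust estimator is not shown.

```python
import numpy as np

def is_cdf(rewards, weights, grid):
    """Importance-sampling estimate of the target policy's reward CDF.

    rewards: rewards observed under the behavior policy
    weights: importance ratios pi(a|x) / beta(a|x) for each sample
    grid:    points t at which to evaluate F_hat(t)
    """
    r = np.asarray(rewards, dtype=float)
    w = np.asarray(weights, dtype=float)
    # F_hat(t) = (1/n) * sum_i w_i * 1{r_i <= t}; nondecreasing since w_i >= 0
    return np.array([np.mean(w * (r <= t)) for t in grid])

def plugin_cvar(grid, cdf, alpha=0.05):
    """Plug-in CVaR at level alpha: average of lower-tail quantiles,
    approximating (1/alpha) * integral_0^alpha F^{-1}(u) du on a
    uniform grid of quantile levels."""
    grid = np.asarray(grid, dtype=float)
    levels = np.linspace(alpha / 100, alpha, 100)
    # Empirical inverse CDF: smallest grid point whose CDF value reaches u
    idx = np.minimum(np.searchsorted(cdf, levels), len(grid) - 1)
    return float(np.mean(grid[idx]))
```

With all importance weights equal to 1 (behavior policy equals target policy), `is_cdf` reduces to the ordinary empirical CDF, and `plugin_cvar` with a small `alpha` returns the average reward in the worst tail. For a Lipschitz risk functional, the abstract's bound says the error of such a plug-in estimate is controlled by the (sup-norm) error of the CDF estimate itself.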






