Exploring Offline Policy Evaluation for the Continuous-Armed Bandit Problem

08/21/2019
by Jules Kruijswijk, et al.

The (contextual) multi-armed bandit problem (MAB) provides a formalization of sequential decision-making that has many applications. However, validly evaluating MAB policies is challenging: we either resort to simulations, which inherently include debatable assumptions, or to expensive field trials. Recently, an offline evaluation method based on empirical data has been suggested, relaxing these assumptions; it can also be used to evaluate multiple competing policies in parallel. This method is, however, not directly suited to the continuous-armed bandit (CAB) problem, an often-encountered version of the MAB problem in which the action set is continuous rather than discrete. We propose and evaluate an extension of the existing method such that it can be used to evaluate CAB policies. We empirically demonstrate that our method provides a relatively consistent ranking of policies. Furthermore, we detail how our method can be used to select policies in a real-life CAB problem.
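The offline evaluation method the abstract refers to is in the spirit of the replay approach: stream logged interactions past a candidate policy and count the reward only when the policy's chosen action matches the logged action. With a continuous action set an exact match almost never occurs, so one natural extension is to accept logged actions that fall within a small window around the policy's proposal. The sketch below illustrates that general idea only; it is not the authors' implementation, and the names (`replay_evaluate_cab`, `bandwidth`, the `choose`/`update` policy interface) and the assumption of uniformly logged actions on [0, 1] are ours for illustration.

```python
import numpy as np

def replay_evaluate_cab(logged_actions, logged_rewards, policy, bandwidth=0.05):
    """Replay-style offline evaluation with an acceptance window,
    sketched for a continuous action set on [0, 1].

    A logged interaction is accepted when the logged action lies within
    `bandwidth` of the action the candidate policy proposes; accepted
    rewards update the policy and are averaged into its score.
    """
    total_reward, n_accepted = 0.0, 0
    for a_logged, reward in zip(logged_actions, logged_rewards):
        a_proposed = policy.choose()                 # policy's continuous action
        if abs(a_proposed - a_logged) <= bandwidth:  # "match" up to the window
            policy.update(a_logged, reward)          # policy learns online, as in replay
            total_reward += reward
            n_accepted += 1
    # Average reward over accepted interactions (0.0 if none were accepted).
    return total_reward / max(n_accepted, 1), n_accepted

class UniformRandomPolicy:
    """Baseline policy: proposes uniform random actions and never learns."""
    def choose(self):
        return np.random.rand()
    def update(self, action, reward):
        pass

# Toy usage: logs generated by uniform exploration over [0, 1], with a
# synthetic reward that peaks at action 0.7.
rng = np.random.default_rng(0)
actions = rng.random(10_000)
rewards = (rng.random(10_000) < (1.0 - np.abs(actions - 0.7))).astype(float)
score, n = replay_evaluate_cab(actions, rewards, UniformRandomPolicy())
print(f"estimated mean reward: {score:.3f} over {n} accepted interactions")
```

The window size trades data efficiency against fidelity: a narrower `bandwidth` keeps accepted actions closer to what the policy actually proposed but discards more of the log, which is consistent with the abstract's emphasis on a consistent ranking of policies rather than an exact value estimate.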

Related research

06/26/2020 · Online Learning with Corrupted Context: Corrupted Contextual Bandits
We consider a novel variant of the contextual bandit problem (i.e., the ...

07/22/2011 · Robustness of Anytime Bandit Policies
This paper studies the deviations of the regret in a stochastic multi-ar...

05/21/2020 · Off-policy Learning for Remote Electrical Tilt Optimization
We address the problem of Remote Electrical Tilt (RET) optimization usin...

09/16/2022 · Sales Channel Optimization via Simulations Based on Observational Data with Delayed Rewards: A Case Study at LinkedIn
Training models on data obtained from randomized experiments is ideal fo...

11/06/2018 · contextual: Evaluating Contextual Multi-Armed Bandit Problems in R
Over the past decade, contextual bandit algorithms have been gaining in ...

03/15/2016 · Optimal Sensing via Multi-armed Bandit Relaxations in Mixed Observability Domains
Sequential decision making under uncertainty is studied in a mixed obser...

02/08/2021 · Counterfactual Contextual Multi-Armed Bandit: a Real-World Application to Diagnose Apple Diseases
Post-harvest diseases of apple are one of the major issues in the econom...
