Reward Selection with Noisy Observations

07/12/2023
by Kamyar Azizzadenesheli, et al.

We study a fundamental problem in optimization under uncertainty. There are n boxes; each box i contains a hidden reward x_i. Rewards are drawn i.i.d. from an unknown distribution 𝒟. For each box i, we see y_i, an unbiased estimate of its reward, which is drawn from a Normal distribution with known standard deviation σ_i (and an unknown mean x_i). Our task is to select a single box, with the goal of maximizing our reward. This problem captures a wide range of applications, e.g. ad auctions, where the hidden reward is the click-through rate of an ad. Previous work in this model [BKMR12] proves that the naive policy, which selects the box with the largest estimate y_i, is suboptimal, and suggests a linear policy, which selects the box i with the largest y_i - c · σ_i, for some c > 0. However, no formal guarantees are given about the performance of either policy (e.g., whether their expected reward is within some factor of the optimal policy's reward). In this work, we prove that both the naive policy and the linear policy are arbitrarily bad compared to the optimal policy, even when 𝒟 is well-behaved, e.g. has monotone hazard rate (MHR), and even under a "small tail" condition, which requires that not too many boxes have arbitrarily large noise. On the flip side, we propose a simple threshold policy that gives a constant approximation to the reward of a prophet (who knows the realized values x_1, …, x_n) under the same "small tail" condition. We prove that when this condition is not satisfied, even an optimal clairvoyant policy (that knows 𝒟) cannot get a constant approximation to the prophet, even for MHR distributions, implying that our threshold policy is optimal against the prophet benchmark, up to constants.
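To make the setup concrete, the following is a minimal simulation sketch of the two policies defined in the abstract (naive and linear) next to the prophet benchmark. It is an illustration under stated assumptions, not the paper's experiments: the choice of 𝒟 as Exponential(1) (an MHR distribution), the alternating noise levels 0.1 and 5.0, the value c = 1, and the function names are all hypothetical. The threshold policy is not sketched because the abstract does not specify it.

```python
import random


def naive_policy(y, sigma):
    """Return the index of the box with the largest estimate y_i."""
    return max(range(len(y)), key=lambda i: y[i])


def linear_policy(y, sigma, c):
    """Return the index of the box with the largest y_i - c * sigma_i."""
    return max(range(len(y)), key=lambda i: y[i] - c * sigma[i])


def simulate(n=50, trials=2000, c=1.0, seed=0):
    """Average hidden reward of each policy over repeated random instances.

    Assumptions (not from the source): rewards follow Exponential(1),
    and boxes alternate between low noise (0.1) and high noise (5.0).
    """
    rng = random.Random(seed)
    totals = {"naive": 0.0, "linear": 0.0, "prophet": 0.0}
    for _ in range(trials):
        # Hidden rewards x_i, drawn i.i.d. from the assumed distribution.
        x = [rng.expovariate(1.0) for _ in range(n)]
        # Known standard deviations sigma_i (assumed pattern).
        sigma = [0.1 if i % 2 == 0 else 5.0 for i in range(n)]
        # Observed unbiased estimates y_i ~ Normal(x_i, sigma_i).
        y = [rng.gauss(x[i], sigma[i]) for i in range(n)]
        totals["naive"] += x[naive_policy(y, sigma)]
        totals["linear"] += x[linear_policy(y, sigma, c)]
        # The prophet sees the realized x_i and takes the best one.
        totals["prophet"] += max(x)
    return {name: total / trials for name, total in totals.items()}


if __name__ == "__main__":
    for name, avg in simulate().items():
        print(f"{name}: {avg:.3f}")
```

Because the prophet always collects max x_i, its average is an upper bound on both policies in every run; how far the naive and linear policies fall below it depends on the assumed noise pattern and on c.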
