Contextual Bandits in a Survey Experiment on Charitable Giving: Within-Experiment Outcomes versus Policy Learning

11/22/2022
by Susan Athey, et al.

We design and implement an adaptive experiment (a “contextual bandit”) to learn a targeted treatment assignment policy, where the goal is to use a participant's survey responses to determine which charity to expose them to in a donation solicitation. The design balances two competing objectives: optimizing the outcomes for the subjects in the experiment (“cumulative regret minimization”) and gathering data that will be most useful for policy learning, that is, for learning an assignment rule that will maximize welfare if used after the experiment (“simple regret minimization”). We evaluate alternative experimental designs by collecting pilot data and then conducting a simulation study. Next, we implement our selected algorithm. Finally, we perform a second simulation study anchored to the collected data that evaluates the benefits of the algorithm we chose. Our first result is that the value of a learned policy in this setting is higher when data is collected via uniform randomization than when it is collected adaptively using standard cumulative regret minimization or policy learning algorithms. We propose a simple heuristic for adaptive experimentation that improves upon uniform randomization from the perspective of policy learning, at the expense of increasing cumulative regret relative to alternative bandit algorithms. The heuristic modifies an existing contextual bandit algorithm by (i) imposing a lower bound on assignment probabilities that decays slowly, so that no arm is discarded too quickly, and (ii) after adaptively collecting data, restricting policy learning to select from arms where sufficient data has been gathered.
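The two parts of the heuristic can be illustrated with a minimal Python sketch. The abstract does not specify the floor schedule, how the floor is enforced, or what counts as "sufficient data", so everything concrete here is an assumption made for illustration: the schedule `(1/K) * t**(-decay)`, the mixing-with-uniform rule in `apply_floor`, the `min_count` threshold, and all function names are hypothetical and not taken from the paper. The value estimates are for a single context; a contextual policy would repeat the last step for each context.

```python
from typing import List, Sequence


def probability_floor(t: int, num_arms: int, decay: float = 0.5) -> float:
    """Hypothetical slowly decaying floor: (1/K) * t**(-decay), for t >= 1."""
    if t < 1:
        raise ValueError("t must be >= 1")
    return (1.0 / num_arms) * t ** (-decay)


def apply_floor(probs: Sequence[float], floor: float) -> List[float]:
    """Mix the bandit's probabilities with a uniform component.

    Each arm receives at least `floor`, and the result still sums to 1
    provided `probs` sums to 1 and floor <= 1 / len(probs).
    """
    k = len(probs)
    if not 0.0 <= floor <= 1.0 / k:
        raise ValueError("floor must lie in [0, 1/K]")
    return [floor + (1.0 - k * floor) * p for p in probs]


def eligible_arms(counts: Sequence[int], min_count: int) -> List[int]:
    """Arms with at least `min_count` observations (assumed sufficiency rule)."""
    return [arm for arm, n in enumerate(counts) if n >= min_count]


def restricted_best_arm(value_estimates: Sequence[float],
                        eligible: Sequence[int]) -> int:
    """Pick the highest estimated value among eligible arms only."""
    if not eligible:
        raise ValueError("no arm has sufficient data")
    return max(eligible, key=lambda arm: value_estimates[arm])


if __name__ == "__main__":
    # Part (i): floor the assignment probabilities at step t = 100 with 4 arms.
    floor = probability_floor(t=100, num_arms=4)
    assignment = apply_floor([0.97, 0.01, 0.01, 0.01], floor)
    print(floor, assignment)

    # Part (ii): learn a policy only over arms with enough data.
    arms = eligible_arms(counts=[500, 12, 230, 3], min_count=50)
    print(restricted_best_arm([0.1, 0.9, 0.3, 0.8], arms))
```

In this sketch, arms 1 and 3 have the highest estimated values but too few observations, so the restricted policy chooses among arms 0 and 2 only. This is the intent of step (ii): noisy estimates from rarely sampled arms should not drive the learned policy.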


