Multi-Armed Bandit Problem with Temporally-Partitioned Rewards: When Partial Feedback Counts

06/01/2022
by   Giulia Romano, et al.
0

There is a rising interest in industrial online applications where data becomes available sequentially. Inspired by the recommendation of playlists to users where their preferences can be collected during the listening of the entire playlist, we study a novel bandit setting, namely Multi-Armed Bandit with Temporally-Partitioned Rewards (TP-MAB), in which the stochastic reward associated with the pull of an arm is partitioned over a finite number of consecutive rounds following the pull. This setting, unexplored so far to the best of our knowledge, is a natural extension of delayed-feedback bandits to the case in which rewards may be dilated over a finite-time span after the pull instead of being fully disclosed in a single, potentially delayed round. We provide two algorithms to address TP-MAB problems, namely, TP-UCB-FR and TP-UCB-EW, which exploit the partial information disclosed by the reward collected over time. We show that our algorithms provide better asymptotical regret upper bounds than delayed-feedback bandit algorithms when a property characterizing a broad set of reward structures of practical interest, namely alpha-smoothness, holds. We also empirically evaluate their performance across a wide range of settings, both synthetically generated and from a real-world media recommendation problem.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
11/13/2022

Generalizing distribution of partial rewards for multi-armed bandits with temporally-partitioned rewards

We investigate the Multi-Armed Bandit problem with Temporally-Partitione...
research
07/27/2018

Task Recommendation in Crowdsourcing Based on Learning Preferences and Reliabilities

Workers participating in a crowdsourcing platform can have a wide range ...
research
01/13/2022

Contextual Bandits for Advertising Campaigns: A Diffusion-Model Independent Approach (Extended Version)

Motivated by scenarios of information diffusion and advertising in socia...
research
12/13/2020

Adaptive Algorithms for Multi-armed Bandit with Composite and Anonymous Feedback

We study the multi-armed bandit (MAB) problem with composite and anonymo...
research
03/01/2023

Multi-Armed Bandits with Generalized Temporally-Partitioned Rewards

Decision-making problems of sequential nature, where decisions made in t...
research
01/24/2019

The Assistive Multi-Armed Bandit

Learning preferences implicit in the choices humans make is a well studi...
research
11/15/2020

DORB: Dynamically Optimizing Multiple Rewards with Bandits

Policy gradients-based reinforcement learning has proven to be a promisi...

Please sign up or login with your details

Forgot password? Click here to reset