
Exploration, Exploitation, and Engagement in Multi-Armed Bandits with Abandonment

by Zixian Yang, et al.

The multi-armed bandit (MAB) is a classic model for understanding the exploration-exploitation trade-off. The traditional MAB model for recommendation systems assumes that the user stays in the system for the entire learning horizon. In new online education platforms such as ALEKS, or new video recommendation systems such as TikTok and YouTube Shorts, the amount of time a user spends on the app depends on how engaging the recommended content is. Users may temporarily leave the system if the recommended items fail to engage them. To understand exploration, exploitation, and engagement in these systems, we propose a new model, called MAB-A, where "A" stands for abandonment and the abandonment probability depends on the current recommended item and the user's past experience (called the state). We propose two algorithms, ULCB and KL-ULCB, both of which do more exploration (being optimistic) when the user likes the previous recommended item and less exploration (being pessimistic) when the user does not like the previous item. We prove that both ULCB and KL-ULCB achieve logarithmic regret, O(log K), where K is the number of visits (or episodes). Furthermore, the regret bound under KL-ULCB is asymptotically sharp. We also extend the proposed algorithms to the general-state setting. Simulation results confirm our theoretical analysis and show that the proposed algorithms have significantly lower regret than traditional UCB, KL-UCB, and Q-learning-based algorithms.
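The optimistic/pessimistic switch described in the abstract can be illustrated with a small sketch. The abstract does not give the exact ULCB index, so the code below assumes a standard UCB1-style confidence bonus, sqrt(c · log t / n), and simply mirrors it to obtain a lower confidence bound. The function names, the constant `c`, and the boolean `liked_previous` (standing in for the user's state) are all hypothetical and chosen for illustration only.

```python
import math


def confidence_bounds(counts, means, t, c=2.0):
    """Return (upper, lower) confidence bounds for each arm.

    Assumption: a UCB1-style bonus sqrt(c * log t / n). The paper's
    actual index may differ; this is only an illustrative choice.
    """
    upper, lower = [], []
    for n, mean in zip(counts, means):
        if n == 0:
            # An arm that was never pulled has an unbounded interval.
            upper.append(math.inf)
            lower.append(-math.inf)
        else:
            bonus = math.sqrt(c * math.log(max(t, 2)) / n)
            upper.append(mean + bonus)
            lower.append(mean - bonus)
    return upper, lower


def select_arm(counts, means, t, liked_previous):
    """Pick an arm optimistically or pessimistically.

    liked_previous=True  -> maximize the upper bound (more exploration).
    liked_previous=False -> maximize the lower bound (less exploration).
    """
    upper, lower = confidence_bounds(counts, means, t)
    scores = upper if liked_previous else lower
    return max(range(len(scores)), key=scores.__getitem__)


# Arm 0 is well sampled; arm 1 has a higher mean but only one sample.
counts = [10, 1]
means = [0.5, 0.6]
optimistic_choice = select_arm(counts, means, t=11, liked_previous=True)
pessimistic_choice = select_arm(counts, means, t=11, liked_previous=False)
```

In this toy setting the optimistic rule favors the rarely tried arm, whose wide interval gives it the larger upper bound, while the pessimistic rule stays with the well-sampled arm. That matches the intuition in the abstract: explore while the user is engaged, and play it safe when abandonment is more likely.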



