UCB-based Algorithms for Multinomial Logistic Regression Bandits

03/21/2021
by Sanae Amani, et al.
Out of the rich family of generalized linear bandits, perhaps the most well-studied are logistic bandits, which are used in problems with binary rewards: for instance, when the learner/agent tries to maximize profit over a user who can select one of two possible outcomes (e.g., 'click' vs. 'no-click'). Despite remarkable recent progress and improved algorithms for logistic bandits, existing works do not address practical situations where the number of outcomes the user can select is larger than two (e.g., 'click', 'show me later', 'never show again', 'no click'). In this paper, we study such an extension. We use the multinomial logit (MNL) model for the probability of each of K+1 ≥ 2 possible outcomes (the +1 stands for the 'no click' outcome): we assume that for a learner's action x_t, the user selects outcome i according to an MNL probabilistic model with corresponding unknown parameter θ_{*,i}. Each outcome i is also associated with a revenue parameter ρ_i, and the goal is to maximize the expected revenue. For this problem, we present MNL-UCB, an upper confidence bound (UCB)-based algorithm that achieves regret Õ(dK√T) with small dependency on problem-dependent constants that can otherwise be arbitrarily large and lead to loose regret bounds. We present numerical simulations that corroborate our theoretical results.
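To make the model concrete, the sketch below computes MNL choice probabilities and the resulting expected revenue for a single action. The function names, the (K, d) parameter layout, and the convention that the 'no click' outcome is a reference alternative with utility fixed at 0 are illustrative assumptions for this sketch, not details taken from the paper:

```python
import numpy as np

def mnl_probabilities(x, Theta):
    """MNL choice probabilities for K+1 outcomes given action x.

    x: (d,) action/feature vector.
    Theta: (K, d) matrix stacking the per-outcome parameters (assumed
    layout); the (K+1)-th 'no click' outcome is treated as the reference
    alternative with utility 0, a standard MNL identification choice.
    """
    utilities = Theta @ x                         # shape (K,)
    logits = np.concatenate([utilities, [0.0]])   # append reference outcome
    logits -= logits.max()                        # for numerical stability
    expu = np.exp(logits)
    return expu / expu.sum()                      # shape (K+1,), sums to 1

def expected_revenue(x, Theta, rho):
    """Expected revenue of action x: sum_i P(outcome i | x) * rho_i."""
    return float(mnl_probabilities(x, Theta) @ rho)
```

A UCB-style algorithm in this setting would, at each round, pick the action maximizing an optimistic version of `expected_revenue` built from parameter estimates and confidence bonuses; the sketch above only covers the probabilistic model itself.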
