Bandit Linear Optimization for Sequential Decision Making and Extensive-Form Games

03/08/2021
by Gabriele Farina, et al.

Tree-form sequential decision making (TFSDM) extends classical one-shot decision making by modeling tree-form interactions between an agent and a potentially adversarial environment. It captures the online decision-making problems that each player faces in an extensive-form game, as well as Markov decision processes and partially-observable Markov decision processes where the agent conditions on observed history. Over the past decade, considerable effort has gone into designing online optimization methods for TFSDM. Virtually all of that work has been in the full-feedback setting, where the agent has access to counterfactuals, that is, information on what would have happened had the agent chosen a different action at any decision node. Little is known about the bandit setting, where no counterfactual information is available, even though this setting has been well understood for almost 20 years in one-shot decision making. In this paper, we give the first algorithm for the bandit linear optimization problem for TFSDM that offers both (i) linear-time iterations (in the size of the decision tree) and (ii) O(√T) cumulative regret in expectation compared to any fixed strategy, at all times T. This is made possible by new results that we derive, which may have independent uses as well: (1) the geometry of the dilated entropy regularizer, (2) the autocorrelation matrix of the natural sampling scheme for sequence-form strategies, (3) the construction of an unbiased estimator for linear losses for sequence-form strategies, and (4) a refined regret analysis for mirror descent when using the dilated entropy regularizer.
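To make the distinction between full feedback and bandit feedback concrete, the sketch below shows the standard importance-weighted loss estimator for the one-shot case only, where the agent picks a single action from a probability vector and observes only that action's loss. It is not the paper's sequence-form estimator, and the function names `importance_weighted_estimate` and `expected_estimate` are hypothetical, chosen for illustration. The check at the end confirms that the estimate is unbiased: averaged over the agent's own randomness, it recovers the full loss vector that was never observed.

```python
from fractions import Fraction

def importance_weighted_estimate(probs, action, observed_loss):
    """Estimate the full loss vector from bandit feedback.

    Only the loss of the played action is observed; it is divided by the
    probability of playing that action, and all other entries are zero.
    """
    estimate = [Fraction(0)] * len(probs)
    estimate[action] = Fraction(observed_loss) / probs[action]
    return estimate

def expected_estimate(probs, losses):
    """Exact expectation of the estimate over the action drawn from probs."""
    n = len(probs)
    total = [Fraction(0)] * n
    for a in range(n):
        est = importance_weighted_estimate(probs, a, losses[a])
        for i in range(n):
            total[i] += probs[a] * est[i]
    return total

# Illustrative numbers (assumed, not from the paper).
probs = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
losses = [Fraction(3), Fraction(-1), Fraction(2)]

# Unbiasedness: the expected estimate equals the true loss vector.
assert expected_estimate(probs, losses) == losses
```

Exact rational arithmetic is used so that the unbiasedness check holds with equality rather than up to floating-point error. The estimator requires every action to have nonzero probability.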

03/08/2021

Model-Free Online Learning in Unknown Sequential Decision Making Problems and Games

Regret minimization has proved to be a versatile tool for tree-form sequ...
01/31/2021

Online Markov Decision Processes with Aggregate Bandit Feedback

We study a novel variant of online finite-horizon Markov Decision Proces...
02/14/2020

On State Variables, Bandit Problems and POMDPs

State variables are easily the most subtle dimension of sequential decis...
05/27/2022

Improving Bidding and Playing Strategies in the Trick-Taking game Wizard using Deep Q-Networks

In this work, the trick-taking game Wizard with a separate bidding and p...
08/01/2011

Exploiting Agent and Type Independence in Collaborative Graphical Bayesian Games

Efficient collaborative decision making is an important challenge for mu...
02/05/2018

Wireless Optimisation via Convex Bandits: Unlicensed LTE/WiFi Coexistence

Bandit Convex Optimisation (BCO) is a powerful framework for sequential ...
09/02/2022

Optimal design of lottery with cumulative prospect theory

A lottery is a popular form of gambling between a seller and multiple bu...