Thompson Sampling for Linear-Quadratic Control Problems

03/27/2017
by Marc Abeille, et al.

We consider the exploration-exploitation tradeoff in linear quadratic (LQ) control problems, where the state dynamics are linear and the cost function is quadratic in states and controls. We analyze the regret of Thompson sampling (TS) (a.k.a. posterior sampling for reinforcement learning) in the frequentist setting, i.e., when the parameters characterizing the LQ dynamics are fixed. Despite the empirical and theoretical success of TS in a wide range of problems, from multi-armed bandits to linear bandits, we show that when studying the frequentist regret of TS in control problems, we need to trade off the frequency of sampling optimistic parameters against the frequency of switches in the control policy. This results in an overall regret of O(T^{2/3}), which is significantly worse than the O(√T) regret achieved by the optimism-in-the-face-of-uncertainty algorithm in LQ control problems.
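To make the setting concrete, here is a toy scalar sketch of Thompson sampling with lazy policy updates. It is not the authors' algorithm and rests on several simplifying assumptions: a one-dimensional system x_{t+1} = a x_t + b u_t + w_t where only the drift a is unknown (b is known), stage cost q x² + r u², a Gaussian prior on a, sampling from the exact posterior (without any variance inflation a frequentist analysis may require), and a hypothetical admissible set |ã| ≤ a_max enforced by rejection. Each sampled ã is turned into a linear policy u = K x by solving the scalar Riccati equation P = q + a²P − (abP)²/(r + b²P) and setting K = −abP/(r + b²P). All names (solve_scalar_dare, lq_gain, thompson_lq, tau, a_max) are illustrative; tau is the assumed fixed number of steps between resampling, so it controls both how often a parameter is drawn and how often the policy switches.

```python
import math
import random


def solve_scalar_dare(a, b, q, r, tol=1e-12, max_iter=100_000):
    """Stabilizing solution of the scalar discrete algebraic Riccati equation
    P = q + a^2 P - (a b P)^2 / (r + b^2 P), via value iteration from P = q.
    Assumes b != 0, q > 0, r > 0."""
    p = q
    for _ in range(max_iter):
        p_next = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)
        if abs(p_next - p) < tol:
            return p_next
        p = p_next
    return p


def lq_gain(a, b, q, r):
    """Infinite-horizon optimal feedback u = K x for the model (a, b), cost q x^2 + r u^2."""
    p = solve_scalar_dare(a, b, q, r)
    return -a * b * p / (r + b * b * p)


class ScalarGaussianPosterior:
    """Posterior over the unknown drift a in x' = a x + b u + w, w ~ N(0, sigma^2),
    with b known and prior a ~ N(0, 1 / lam). Simplifying assumption, see lead-in."""

    def __init__(self, lam=1.0, sigma=0.1):
        self.precision = lam                # posterior precision of a
        self.weighted_sum = 0.0             # sum of x * (x' - b u) / sigma^2
        self.inv_noise_var = 1.0 / sigma ** 2

    def update(self, x, target):
        # target = x' - b u, which equals a x + w under the model
        self.precision += self.inv_noise_var * x * x
        self.weighted_sum += self.inv_noise_var * x * target

    def mean(self):
        return self.weighted_sum / self.precision

    def variance(self):
        return 1.0 / self.precision

    def sample(self, rng):
        return rng.gauss(self.mean(), math.sqrt(self.variance()))


def sample_admissible(post, rng, a_max, max_tries=1000):
    """Hypothetical rejection step onto the bounded set |a| <= a_max."""
    for _ in range(max_tries):
        a = post.sample(rng)
        if abs(a) <= a_max:
            return a
    return max(-a_max, min(a_max, post.mean()))  # fallback: clipped posterior mean


def thompson_lq(a_true, b, q=1.0, r=1.0, sigma=0.1, horizon=500, tau=10,
                a_max=2.0, seed=0):
    """Lazily updated TS controller. Returns (total cost, policy switches, posterior)."""
    rng = random.Random(seed)
    post = ScalarGaussianPosterior(lam=1.0, sigma=sigma)
    x, total_cost, switches, gain = 0.0, 0.0, 0, 0.0
    for t in range(horizon):
        if t % tau == 0:
            # Draw a new parameter and switch policy only once every tau steps.
            a_tilde = sample_admissible(post, rng, a_max)
            gain = lq_gain(a_tilde, b, q, r)
            switches += 1
        u = gain * x
        total_cost += q * x * x + r * u * u
        x_next = a_true * x + b * u + rng.gauss(0.0, sigma)
        post.update(x, x_next - b * u)
        x = x_next
    return total_cost, switches, post
```

For example, thompson_lq(1.1, 0.5, tau=10) runs 500 steps on an open-loop unstable system and switches policy 50 times. Varying tau exposes the knob behind the trade-off described in the abstract: tau=1 draws a fresh parameter (and may switch policy) at every step, while a larger tau switches rarely but also refreshes the sampled parameter less often. Comparing total_cost with the cost of running lq_gain(a_true, b, q, r) on the same noise sequence would give an empirical regret estimate.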
