Variational Bayesian Reinforcement Learning with Regret Bounds

07/25/2018
by   Brendan O'Donoghue, et al.

We consider the exploration-exploitation trade-off in reinforcement learning and show that an agent imbued with a risk-seeking utility function is able to explore efficiently, as measured by regret. The parameter that controls how risk-seeking the agent is can be optimized exactly, or annealed according to a schedule. We call the resulting algorithm K-learning and show that the corresponding K-values are optimistic for the expected Q-values at each state-action pair. The K-values induce a natural Boltzmann exploration policy whose 'temperature' parameter is equal to the risk-seeking parameter. This policy achieves an expected regret bound of Õ(L^3/2√(S A T)), where L is the time horizon, S is the number of states, A is the number of actions, and T is the total number of elapsed time-steps. This bound is only a factor of L larger than the established lower bound. K-learning can be interpreted as mirror descent in the policy space, and it is similar to other well-known methods in the literature, including Q-learning, soft Q-learning, and maximum entropy policy gradient, and is closely related to optimism and count-based exploration methods. K-learning is simple to implement, as it only requires adding a bonus to the reward at each state-action pair and then solving a Bellman equation. We conclude with a numerical example demonstrating that K-learning is competitive with other state-of-the-art algorithms in practice.
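The recipe in the abstract (add a bonus to the reward, solve a Bellman equation for the K-values, then act via a Boltzmann policy at the risk-seeking temperature) can be sketched in a tabular finite-horizon setting. This is a minimal illustrative sketch, not the paper's exact construction: the toy MDP, the count-based bonus form `tau / sqrt(n)`, and the soft (log-sum-exp) backup are all assumptions made here for concreteness.

```python
import numpy as np

# Toy finite-horizon tabular MDP (sizes and dynamics are arbitrary).
S, A, L = 3, 2, 10                            # states, actions, horizon
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(S, A))    # P[s, a] = next-state distribution
R = rng.uniform(0, 1, size=(S, A))            # mean rewards
counts = np.ones((S, A))                      # visit counts (init 1 to avoid div by zero)
tau = 1.0                                     # risk-seeking / temperature parameter

bonus = tau / np.sqrt(counts)                 # count-based exploration bonus (assumed form)

# Backward induction: soft (log-sum-exp) Bellman backup on the bonus-augmented
# reward, which makes the K-values optimistic for the expected Q-values.
K = np.zeros((L + 1, S, A))
for t in range(L - 1, -1, -1):
    V_next = tau * np.log(np.exp(K[t + 1] / tau).sum(axis=1))  # soft value per state
    K[t] = R + bonus + P @ V_next             # (S, A, S) @ (S,) -> (S, A)

def boltzmann_policy(s, t):
    """Action probabilities at state s, step t: softmax of K-values at temperature tau."""
    logits = K[t, s] / tau
    p = np.exp(logits - logits.max())         # shift for numerical stability
    return p / p.sum()

print(boltzmann_policy(0, 0))
```

In practice the counts would be updated as the agent interacts with the environment and the K-values re-solved (or annealed via a temperature schedule), but the structure above — bonus, Bellman solve, Boltzmann action selection — is the whole loop.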


Related research

- 06/22/2020: Risk-Sensitive Reinforcement Learning: Near-Optimal Risk-Sample Tradeoff in Regret. "We study risk-sensitive reinforcement learning in episodic Markov decisi..."
- 03/16/2017: Minimax Regret Bounds for Reinforcement Learning. "We consider the problem of provably optimal exploration in reinforcement..."
- 02/18/2023: Efficient exploration via epistemic-risk-seeking policy optimization. "Exploration remains a key challenge in deep reinforcement learning (RL)...."
- 03/18/2018: Adaptive prior probabilities via optimization of risk and entropy. "An agent choosing between various actions tends to take the one with the..."
- 09/15/2017: The Uncertainty Bellman Equation and Exploration. "We consider the exploration/exploitation problem in reinforcement learni..."
- 02/07/2021: State-Aware Variational Thompson Sampling for Deep Q-Networks. "Thompson sampling is a well-known approach for balancing exploration and..."
- 10/24/2022: Opportunistic Episodic Reinforcement Learning. "In this paper, we propose and study opportunistic reinforcement learning..."
