Exploration versus exploitation in reinforcement learning: a stochastic control approach

12/04/2018
by Haoran Wang, et al.

We consider reinforcement learning (RL) in continuous time and study the problem of achieving the best trade-off between exploration of a black-box environment and exploitation of current knowledge. We propose an entropy-regularized reward function involving the differential entropy of the distributions of actions, and we motivate and devise an exploratory formulation for the feature dynamics that captures repetitive learning under exploration. The resulting optimization problem is a resurrection of classical relaxed stochastic control. We carry out a complete analysis of the problem in the linear-quadratic (LQ) case and deduce that the optimal control distribution for balancing exploitation and exploration is Gaussian. This in turn interprets and justifies the widely adopted Gaussian exploration in RL, beyond its simplicity for sampling. Moreover, exploitation and exploration are reflected by the mean and the variance of the Gaussian distribution, respectively. We also find that a more random environment contains more learning opportunities, in the sense that less exploration is needed, other things being equal. As the weight of exploration decays to zero, we prove that the solution to the entropy-regularized LQ problem converges to that of the classical LQ problem. Finally, we characterize the cost of exploration, which is shown to be proportional to the entropy regularization weight and inversely proportional to the discount rate in the LQ case.
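
The roles of mean and variance described above can be illustrated with a deliberately simplified, static toy problem. This is not the paper's continuous-time LQ model. It assumes a hypothetical one-step reward -(u - u_star)^2 / 2, a Gaussian action distribution N(mean, sigma^2), and an entropy weight called lam; all of these names are introduced here for illustration only. Under those assumptions the entropy-regularized objective is maximized at mean = u_star (exploitation) and sigma^2 = lam (exploration).

```python
import math


def gaussian_entropy(sigma):
    """Differential entropy (in nats) of a Gaussian with standard deviation sigma."""
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)


def regularized_objective(mean, sigma, lam, u_star=1.0):
    """Expected toy reward plus lam times the entropy of the action distribution.

    Assumption: reward(u) = -(u - u_star)^2 / 2 with u ~ N(mean, sigma^2), so
    E[reward] = -((mean - u_star)^2 + sigma^2) / 2.
    """
    expected_reward = -0.5 * ((mean - u_star) ** 2 + sigma ** 2)
    return expected_reward + lam * gaussian_entropy(sigma)


def best_sigma(lam):
    """Maximizer of -sigma^2 / 2 + lam * ln(sigma): setting the derivative
    -sigma + lam / sigma to zero gives sigma = sqrt(lam)."""
    return math.sqrt(lam)


if __name__ == "__main__":
    # Smaller entropy weights give narrower action distributions.
    for lam in (1.0, 0.25, 0.01):
        sigma = best_sigma(lam)
        value = regularized_objective(1.0, sigma, lam)
        print(f"lam={lam}: sigma={sigma}, objective={value}")
```

In this toy, the optimal standard deviation shrinks to zero as lam decreases, so the action distribution collapses onto the deterministic choice u_star. That mirrors, in a much simpler setting, the convergence to the classical problem stated in the abstract.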


research
04/25/2019

Continuous-Time Mean-Variance Portfolio Selection: A Reinforcement Learning Framework

We approach the continuous-time mean-variance (MV) portfolio selection w...
research
04/25/2019

Continuous-Time Mean-Variance Portfolio Optimization via Reinforcement Learning

We consider continuous-time Mean-variance (MV) portfolio optimization pr...
research
08/17/2022

Choquet regularization for reinforcement learning

We propose Choquet regularizers to measure and manage the level of explo...
research
07/26/2019

Large scale continuous-time mean-variance portfolio allocation via reinforcement learning

We propose to solve large scale Markowitz mean-variance (MV) portfolio a...
research
10/13/2015

Dual Control for Approximate Bayesian Reinforcement Learning

Control of non-episodic, finite-horizon dynamical systems with uncertain...
research
06/20/2023

Reward Shaping via Diffusion Process in Reinforcement Learning

Reinforcement Learning (RL) models have continually evolved to navigate ...
research
01/09/2020

Regularity and stability of feedback relaxed controls

This paper proposes a relaxed control regularization with general explor...
