Provably Efficient Maximum Entropy Exploration

12/06/2018
by Elad Hazan, et al.

Suppose an agent is in a (possibly unknown) Markov decision process (MDP) in the absence of a reward signal. What might we hope the agent can efficiently learn to do? One natural, intrinsically defined objective is for the agent to learn a policy that induces a distribution over the state space which is as uniform as possible, measured in an entropic sense. Despite the corresponding mathematical program being non-convex, our main result provides a provably efficient method (both in terms of sample size and computational complexity) to construct such a maximum-entropy exploratory policy. Key to our algorithmic methodology is the conditional gradient method (a.k.a. the Frank-Wolfe algorithm), which relies on an approximate MDP solver as a subroutine.
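As a rough illustration of this approach (not the paper's implementation), the sketch below assumes a small tabular MDP with a known transition tensor and uses exact finite-horizon value iteration as the MDP-solver oracle; at each Frank-Wolfe step the gradient of the entropy of the induced state distribution is handed to the solver as a state-only reward, and the returned policy is mixed into the iterate. All names here (`max_ent_explore`, `solve_mdp`, `state_distribution`) are hypothetical.

```python
import numpy as np

def state_distribution(P, pi, horizon, s0=0):
    """Average state-visitation distribution of policy pi over `horizon` steps."""
    n_states = P.shape[0]
    d = np.zeros(n_states)
    cur = np.zeros(n_states)
    cur[s0] = 1.0
    for _ in range(horizon):
        d += cur
        # One step of the Markov chain induced by pi:
        # next[x] = sum_{s,a} cur[s] * pi[s,a] * P[s,a,x]
        cur = np.einsum("s,sa,sax->x", cur, pi, P)
    return d / horizon

def solve_mdp(P, reward, horizon):
    """Finite-horizon value iteration for a state-only reward; returns a greedy policy."""
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    for _ in range(horizon):
        Q = reward[:, None] + P @ V          # Q[s, a] = r(s) + E[V(s')]
        V = Q.max(axis=1)
    pi = np.zeros((n_states, n_actions))
    pi[np.arange(n_states), Q.argmax(axis=1)] = 1.0
    return pi

def max_ent_explore(P, horizon=50, iters=100, eps=1e-6):
    """Frank-Wolfe on the entropy H(d) = -sum_s d(s) log d(s) of the state distribution."""
    n_states, n_actions, _ = P.shape
    mixture = [np.full((n_states, n_actions), 1.0 / n_actions)]  # start from the uniform policy
    weights = [1.0]
    for t in range(iters):
        # State distribution of the current mixture of policies.
        d = sum(w * state_distribution(P, pi, horizon) for w, pi in zip(weights, mixture))
        # Linear-optimization oracle: best response to the entropy gradient r(s) = -log d(s) - 1.
        reward = -np.log(d + eps) - 1.0
        pi_new = solve_mdp(P, reward, horizon)
        # Standard Frank-Wolfe step size; the iterate stays a convex combination of policies.
        eta = 2.0 / (t + 2)
        weights = [w * (1 - eta) for w in weights] + [eta]
        mixture.append(pi_new)
    return mixture, weights
```

Representing the iterate as a weighted mixture of deterministic policies mirrors how Frank-Wolfe iterates are convex combinations of oracle outputs; executing the mixture means sampling one policy according to the weights and following it for the episode.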

Related research

Active Model Estimation in Markov Decision Processes (03/06/2020)
We study the problem of efficient exploration in order to learn an accur...

Entropy Maximization for Markov Decision Processes Under Temporal Logic Constraints (07/09/2018)
We study the problem of synthesizing a policy that maximizes the entropy...

A Relation Analysis of Markov Decision Process Frameworks (08/18/2020)
We study the relation between different Markov Decision Process (MDP) fr...

Fast Rates for Maximum Entropy Exploration (03/14/2023)
We consider the reinforcement learning (RL) setting, in which the agent ...

Solving Multi-Objective MDP with Lexicographic Preference: An application to stochastic planning with multiple quantile objective (05/10/2017)
In most common settings of Markov Decision Process (MDP), an agent evalu...

An Incremental Off-policy Search in a Model-free Markov Decision Process Using a Single Sample Path (01/31/2018)
In this paper, we consider a modified version of the control problem in ...
