Robbins-Monro conditions for persistent exploration learning strategies

08/01/2018
by Dmitry B. Rokhlin, et al.

We formulate simple assumptions implying the Robbins-Monro conditions for the Q-learning algorithm with a local learning rate that depends on the number of visits to a particular state-action pair (the local clock) and the number of iterations (the global clock). It is assumed that the Markov decision process is communicating and that the learning policy ensures persistent exploration. Restrictions are imposed on the functional dependence of the learning rate on the local and global clocks. The result partially confirms the conjecture of Bradtke (1994).
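For illustration only, the sketch below shows tabular Q-learning with a local learning rate driven by the visit count n(s,a); the schedule alpha = 1/n(s,a)^omega with 1/2 < omega <= 1, the epsilon-greedy exploration policy, and the Gymnasium-style environment interface are assumptions made here for the example and are not the exact scheme or assumptions analyzed in the paper. Along the visit sequence of each pair, such a schedule satisfies the Robbins-Monro conditions (the sum of the rates diverges, the sum of their squares converges), provided every pair is visited infinitely often.

import numpy as np

def q_learning_local_clock(env, num_steps=100_000, gamma=0.95, omega=0.7, eps=0.1):
    # Illustrative sketch, not the paper's algorithm.
    # Local clock: n(s, a) = number of visits to the state-action pair.
    # Learning rate alpha = 1 / n(s, a)**omega with 1/2 < omega <= 1 satisfies
    # sum alpha = infinity and sum alpha^2 < infinity along each pair's visits.
    nS, nA = env.observation_space.n, env.action_space.n
    Q = np.zeros((nS, nA))
    visits = np.zeros((nS, nA), dtype=int)      # local clocks n(s, a)
    s, _ = env.reset()
    for t in range(1, num_steps + 1):           # t plays the role of the global clock
        # epsilon-greedy learning policy: a simple way to keep exploration persistent
        if np.random.rand() < eps:
            a = np.random.randint(nA)
        else:
            a = int(np.argmax(Q[s]))
        s_next, r, terminated, truncated, _ = env.step(a)
        visits[s, a] += 1
        alpha = 1.0 / visits[s, a] ** omega     # local learning rate
        target = r if terminated else r + gamma * np.max(Q[s_next])
        Q[s, a] += alpha * (target - Q[s, a])
        if terminated or truncated:
            s, _ = env.reset()
        else:
            s = s_next
    return Q

Usage would be, for example, q_learning_local_clock(gymnasium.make("FrozenLake-v1")); any finite, communicating MDP with discrete observation and action spaces fits the sketch.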
