An Adiabatic Theorem for Policy Tracking with TD-learning

10/24/2020
by Neil Walton, et al.

We evaluate the ability of temporal difference learning to track the reward function of a policy as it changes over time. Our results apply a new adiabatic theorem that bounds the mixing time of time-inhomogeneous Markov chains. We derive finite-time bounds for tabular temporal difference learning and Q-learning when the policy used for training changes in time. To achieve this, we develop bounds for stochastic approximation under asynchronous adiabatic updates.
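The setting described in the abstract is asynchronous tabular TD(0) tracking the value of a policy that drifts slowly while learning proceeds. Below is a minimal, self-contained sketch of that setup for illustration only: the small random MDP (`P`, `R`), the softmax drift in `policy`, the discount `gamma`, and the constant step size `alpha` are all assumed for the example and are not the construction or constants analysed in the paper.

```python
import numpy as np

# Illustrative sketch (not the paper's construction): tabular TD(0)
# tracking the value function of a slowly changing (adiabatic) policy.
rng = np.random.default_rng(0)
n_states, n_actions = 5, 2

# Random MDP: transition kernel P[a, s, s'] and reward table R[s, a].
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
R = rng.uniform(size=(n_states, n_actions))
gamma = 0.9

def policy(t, n_steps):
    """Softmax policy whose action preferences drift slowly with time t."""
    prefs = np.linspace(-1.0, 1.0, n_actions) * (t / n_steps)
    logits = np.tile(prefs, (n_states, 1))
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)  # pi[s, a]

V = np.zeros(n_states)   # tabular value estimate being tracked
alpha = 0.05             # constant step size, so the estimate can follow a moving target
s = 0
n_steps = 20_000

for t in range(n_steps):
    pi = policy(t, n_steps)
    a = rng.choice(n_actions, p=pi[s])
    s_next = rng.choice(n_states, p=P[a, s])
    r = R[s, a]
    # Asynchronous TD(0) update: only the currently visited state is updated.
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
    s = s_next

print("tracked value estimate:", np.round(V, 3))
```

A constant (rather than decaying) step size is used here because the target value function keeps moving; the paper's finite-time bounds quantify how well such updates can track it when the policy changes slowly enough.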


