Reinforcement Learning under Drift

06/07/2019

We propose algorithms with state-of-the-art dynamic regret bounds for undiscounted reinforcement learning under drifting non-stationarity, where both the reward functions and the state transition distributions are allowed to evolve over time. Our main contributions are: 1) a tuned Sliding Window Upper-Confidence Bound for Reinforcement Learning with Confidence-Widening (SWUCRL2-CW) algorithm, which attains low dynamic regret bounds against the optimal non-stationary policy in various cases; and 2) the Bandit-over-Reinforcement Learning (BORL) framework, which further allows us to attain these dynamic regret bounds in a parameter-free manner.
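
To make the sliding-window idea concrete, here is a minimal Python sketch of a windowed estimate with an optimistic upper bound. It is not the SWUCRL2-CW algorithm itself: the class name `SlidingWindowEstimate`, the Hoeffding-style radius, the failure probability `delta`, and the reading of "confidence widening" as a constant slack `eta` added to the radius are all assumptions made for illustration, and observations are assumed to lie in [0, 1].

```python
import math
from collections import deque


class SlidingWindowEstimate:
    """Mean of the most recent `window` observations plus a confidence radius.

    Illustrative only: names, radius formula, and `eta` are assumptions,
    not definitions taken from the paper.
    """

    def __init__(self, window, eta=0.0, delta=0.05):
        # A bounded deque drops the oldest sample once `window` is exceeded,
        # so stale data from before a drift is forgotten automatically.
        self.samples = deque(maxlen=window)
        self.eta = eta      # hypothetical extra slack ("widening")
        self.delta = delta  # hypothetical failure probability

    def update(self, value):
        self.samples.append(value)

    def mean(self):
        return sum(self.samples) / len(self.samples) if self.samples else 0.0

    def radius(self):
        n = max(1, len(self.samples))
        # Hoeffding-style radius for values in [0, 1], widened by eta.
        return math.sqrt(math.log(2.0 / self.delta) / (2.0 * n)) + self.eta

    def upper_bound(self):
        # Optimistic estimate, clipped to the assumed value range.
        return min(1.0, self.mean() + self.radius())


if __name__ == "__main__":
    est = SlidingWindowEstimate(window=3, eta=0.1)
    for reward in [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]:  # reward drifts from 0 to 1
        est.update(reward)
    print(est.mean(), est.upper_bound())
```

With a window of 3, the estimate after the drift depends only on the three most recent rewards, which is the behaviour a sliding window is meant to provide. A shorter window adapts faster but has a larger radius, and this trade-off is what a parameter-free scheme would have to tune automatically.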
