
Reinforcement Learning for Non-Stationary Markov Decision Processes: The Blessing of (More) Optimism
We consider undiscounted reinforcement learning (RL) in Markov decision...

Dynamic Regret of Policy Optimization in Nonstationary Environments
We consider reinforcement learning (RL) in episodic MDPs with adversaria...

Non-stationary Reinforcement Learning without Prior Knowledge: An Optimal Black-box Approach
We propose a black-box reduction that turns a certain reinforcement lear...

Autonomous exploration for navigating in non-stationary CMPs
We consider a setting in which the objective is to learn to navigate in ...

Minimum-Delay Adaptation in Non-Stationary Reinforcement Learning via Online High-Confidence Change-Point Detection
Nonstationary environments are challenging for reinforcement learning a...

Nonstationary Reinforcement Learning with Linear Function Approximation
We consider reinforcement learning (RL) in episodic Markov decision proc...

A Provably-Efficient Model-Free Algorithm for Constrained Markov Decision Processes
This paper presents the first model-free, simulator-free reinforcement l...
Near-Optimal Regret Bounds for Model-Free RL in Non-Stationary Episodic MDPs
We consider model-free reinforcement learning (RL) in non-stationary Markov decision processes (MDPs). Both the reward functions and the state transition distributions are allowed to vary over time, either gradually or abruptly, as long as their cumulative variation magnitude does not exceed certain budgets. We propose an algorithm, named Restarted Q-Learning with Upper Confidence Bounds (RestartQ-UCB), for this setting, which adopts a simple restarting strategy and an extra optimism term. Our algorithm outperforms the state-of-the-art (model-based) solution in terms of dynamic regret. Specifically, RestartQ-UCB with Freedman-type bonus terms achieves a dynamic regret of O(S^{1/3} A^{1/3} Δ^{1/3} H T^{2/3}), where S and A are the numbers of states and actions, respectively, Δ > 0 is the variation budget, H is the number of steps per episode, and T is the total number of steps. We further show that our algorithm is near-optimal by establishing an information-theoretic lower bound of Ω(S^{1/3} A^{1/3} Δ^{1/3} H^{2/3} T^{2/3}), which to the best of our knowledge is the first impossibility result in non-stationary RL in general.
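The abstract names the algorithm's two ingredients: a periodic restarting strategy (so stale data from before an environment change cannot bias the estimates) and an extra optimism bonus on top of Q-learning. A minimal tabular sketch of that pattern follows; it is illustrative only, using a simple Hoeffding-style bonus in place of the paper's Freedman-type terms, and the `restart_interval`, bonus constant `c`, and toy environment are all hypothetical.

```python
import numpy as np

def restarted_q_ucb(env_reset, env_step, S, A, H, T, restart_interval, c=1.0):
    """Sketch of restarted optimistic Q-learning for episodic tabular MDPs.

    Not the paper's exact RestartQ-UCB: the bonus here is Hoeffding-style,
    and `restart_interval` / `c` are hypothetical tuning parameters.
    """
    K = T // H                         # number of episodes
    Q = np.full((H, S, A), float(H))   # optimistic initialization at H
    N = np.zeros((H, S, A))            # visit counts since the last restart
    steps_since_restart = 0
    total_reward = 0.0
    for _ in range(K):
        if steps_since_restart >= restart_interval:
            # Restart: discard old estimates so data gathered before an
            # environment change cannot bias the current Q-values.
            Q[:] = float(H)
            N[:] = 0
            steps_since_restart = 0
        s = env_reset()
        for h in range(H):
            a = int(np.argmax(Q[h, s]))         # greedy w.r.t. optimistic Q
            s_next, r = env_step(s, a, h)
            N[h, s, a] += 1
            n = N[h, s, a]
            lr = (H + 1) / (H + n)              # standard optimistic Q-learning rate
            bonus = c * np.sqrt(H**3 / n)       # optimism term (assumed form)
            v_next = Q[h + 1, s_next].max() if h + 1 < H else 0.0
            Q[h, s, a] = min((1 - lr) * Q[h, s, a] + lr * (r + v_next + bonus),
                             float(H))
            total_reward += r
            steps_since_restart += 1
            s = s_next
    return Q, total_reward

# Toy deterministic 2-state environment (hypothetical, for demonstration only).
S_, A_, H_ = 2, 2, 3
def reset():
    return 0
def step(s, a, h):
    return (s + a) % S_, (1.0 if a == 1 else 0.1)

Q, G = restarted_q_ucb(reset, step, S_, A_, H_, T=600, restart_interval=150)
```

With `restart_interval=150` steps and `H=3`, the sketch wipes its estimates every 50 episodes; the paper's analysis instead ties the restart schedule to the variation budget Δ to obtain the stated T^{2/3} dynamic regret.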