Efficient Learning in Non-Stationary Linear Markov Decision Processes

by Ahmed Touati et al.

We study episodic reinforcement learning in non-stationary linear (a.k.a. low-rank) Markov Decision Processes (MDPs), i.e., both the reward and the transition kernel are linear with respect to a given feature map and are allowed to evolve either slowly or abruptly over time. For this problem setting, we propose OPT-WLSVI, an optimistic model-free algorithm based on weighted least squares value iteration which uses exponential weights to smoothly forget data that are far in the past. We show that our algorithm, when competing against the best policy at each time, achieves a regret that is upper bounded by 𝒪(d^{7/6} H^2 Δ^{1/3} K^{2/3}), where d is the dimension of the feature space, H is the planning horizon, K is the number of episodes and Δ is a suitable measure of non-stationarity of the MDP. This is the first regret bound for non-stationary reinforcement learning with linear function approximation.
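The exponential-forgetting idea behind the weighted least squares step can be illustrated with a small sketch. The snippet below is not the paper's algorithm, only a minimal exponentially weighted ridge regression: each past sample is discounted by a factor `gamma` per time step, so stale data from before a change in the environment is smoothly down-weighted. The function name and hyperparameters (`gamma`, `reg`) are illustrative assumptions, not the paper's notation.

```python
import numpy as np

def ewls_estimate(features, targets, gamma=0.99, reg=1.0):
    """Exponentially weighted ridge regression (sketch of the forgetting
    mechanism used in weighted least squares value iteration).

    With t samples, sample s receives weight gamma**(t - 1 - s), so the
    most recent sample has weight 1 and older samples decay geometrically.
    """
    t, d = features.shape
    w = gamma ** np.arange(t - 1, -1, -1)  # oldest sample gets smallest weight
    # Weighted Gram matrix with ridge regularization, and weighted targets.
    A = (features * w[:, None]).T @ features + reg * np.eye(d)
    b = (features * w[:, None]).T @ targets
    return np.linalg.solve(A, b)
```

On drifting data, e.g. a linear reward whose parameter switches halfway through the data stream, this estimator tracks the recent parameter, whereas the unweighted estimate (`gamma=1.0`) averages over both regimes.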




Variational Regret Bounds for Reinforcement Learning

We consider undiscounted reinforcement learning in Markov decision proce...

Efficient Policy Learning for Non-Stationary MDPs under Adversarial Manipulation

A Markov Decision Process (MDP) is a popular model for reinforcement lea...

Non-Stationary Markov Decision Processes: a Worst-Case Approach using Model-Based Reinforcement Learning

This work tackles the problem of robust zero-shot planning in non-statio...

Horizon-Free Reinforcement Learning in Polynomial Time: the Power of Stationary Policies

This paper gives the first polynomial-time algorithm for tabular Markov ...

Transition Tensor Markov Decision Processes: Analyzing Shot Policies in Professional Basketball

In this paper we model basketball plays as episodes from team-specific n...

Online Reinforcement Learning for Periodic MDP

We study learning in periodic Markov Decision Processes (MDP), a special ty...

A maximum-entropy approach to off-policy evaluation in average-reward MDPs

This work focuses on off-policy evaluation (OPE) with function approxima...