Near-Optimal Goal-Oriented Reinforcement Learning in Non-Stationary Environments

05/25/2022
by Liyu Chen, et al.

We initiate the study of dynamic regret minimization for goal-oriented reinforcement learning modeled by a non-stationary stochastic shortest path problem with changing cost and transition functions. We start by establishing a lower bound Ω((B_⋆ S A T_⋆ (Δ_c + B_⋆^2 Δ_P))^{1/3} K^{2/3}), where B_⋆ is the maximum expected cost of the optimal policy of any episode starting from any state, T_⋆ is the maximum hitting time of the optimal policy of any episode starting from the initial state, SA is the number of state-action pairs, Δ_c and Δ_P are the amount of changes of the cost and transition functions respectively, and K is the number of episodes. The different roles of Δ_c and Δ_P in this lower bound inspire us to design algorithms that estimate costs and transitions separately. Specifically, assuming the knowledge of Δ_c and Δ_P, we develop a simple but sub-optimal algorithm and another more involved minimax optimal algorithm (up to logarithmic terms). These algorithms combine the ideas of finite-horizon approximation [Chen et al., 2022a], special Bernstein-style bonuses of the MVP algorithm [Zhang et al., 2020], adaptive confidence widening [Wei and Luo, 2021], as well as some new techniques such as properly penalizing long-horizon policies. Finally, when Δ_c and Δ_P are unknown, we develop a variant of the MASTER algorithm [Wei and Luo, 2021] and integrate the aforementioned ideas into it to achieve O(min{B_⋆ S √(A L K), (B_⋆^2 S^2 A T_⋆ (Δ_c + B_⋆ Δ_P))^{1/3} K^{2/3}}) regret, where L is the unknown number of changes of the environment.
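To make the scaling of the two bounds in the abstract concrete, the following is a minimal Python sketch that evaluates their leading terms. It is an illustration only: constants and logarithmic factors are dropped, and the function and parameter names (`lower_bound_rate`, `upper_bound_rate`, `B`, `T`, `delta_c`, `delta_P`) are hypothetical and do not come from the paper.

```python
import math


def lower_bound_rate(B, S, A, T, delta_c, delta_P, K):
    """Leading term of the lower bound, constants dropped:
    (B * S * A * T * (delta_c + B^2 * delta_P))^(1/3) * K^(2/3).
    """
    inner = B * S * A * T * (delta_c + B**2 * delta_P)
    return inner ** (1 / 3) * K ** (2 / 3)


def upper_bound_rate(B, S, A, T, delta_c, delta_P, K, L):
    """Leading term of the regret bound for unknown delta_c and delta_P,
    constants and log factors dropped: the minimum of a switching-based
    term and a variation-based term.
    """
    switching = B * S * math.sqrt(A * L * K)
    inner = B**2 * S**2 * A * T * (delta_c + B * delta_P)
    variation = inner ** (1 / 3) * K ** (2 / 3)
    return min(switching, variation)


if __name__ == "__main__":
    # Hypothetical problem sizes, chosen only to show how the terms compare.
    params = dict(B=2.0, S=10, A=4, T=5.0, delta_c=1.0, delta_P=0.5, K=10_000)
    print("lower bound rate:", lower_bound_rate(**params))
    print("upper bound rate:", upper_bound_rate(**params, L=3))
```

In this sketch, Δ_P is weighted by B_⋆^2 in the lower bound but only by B_⋆ in the upper bound, exactly as the two formulas are stated in the abstract. Both variation-based terms grow as K^{2/3}, so doubling K scales them by 2^{2/3}.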

research
03/24/2021

Minimax Regret for Stochastic Shortest Path

We study the Stochastic Shortest Path (SSP) problem in which an agent ha...
research
12/18/2021

Improved No-Regret Algorithms for Stochastic Shortest Path with Linear MDP

We introduce two new no-regret algorithms for the stochastic shortest pa...
research
02/07/2022

Policy Optimization for Stochastic Shortest Path

Policy optimization is among the most popular and successful reinforceme...
research
05/13/2022

Upside-Down Reinforcement Learning Can Diverge in Stochastic Environments With Episodic Resets

Upside-Down Reinforcement Learning (UDRL) is an approach for solving RL ...
research
06/09/2021

Online Learning for Stochastic Shortest Path Model via Posterior Sampling

We consider the problem of online reinforcement learning for the Stochas...
research
06/15/2021

Implicit Finite-Horizon Approximation and Efficient Optimal Algorithms for Stochastic Shortest Path

We introduce a generic template for developing regret minimization algor...
research
05/04/2022

Second Order Path Variationals in Non-Stationary Online Learning

We consider the problem of universal dynamic regret minimization under e...
