
Stochastic Shortest Path: Minimax, Parameter-Free and Towards Horizon-Free Regret
We study the problem of learning in the stochastic shortest path (SSP) s...

Near-optimal Regret Bounds for Stochastic Shortest Path
Stochastic shortest path (SSP) is a well-known problem in planning and c...

Online Learning for Stochastic Shortest Path Model via Posterior Sampling
We consider the problem of online reinforcement learning for the Stochas...

Implicit Finite-Horizon Approximation and Efficient Optimal Algorithms for Stochastic Shortest Path
We introduce a generic template for developing regret minimization algor...

No-Regret Exploration in Goal-Oriented Reinforcement Learning
Many popular reinforcement learning problems (e.g., navigation in a maze...

Regret Bounds for Stochastic Shortest Path Problems with Linear Function Approximation
We propose two algorithms for episodic stochastic shortest path problems...

On Uninformative Optimal Policies in Adaptive LQR with Unknown B-Matrix
This paper presents local asymptotic minimax regret lower bounds for ada...
Minimax Regret for Stochastic Shortest Path
We study the Stochastic Shortest Path (SSP) problem, in which an agent has to reach a goal state at minimum total expected cost. In the learning formulation of the problem, the agent has no prior knowledge of the costs and dynamics of the model. She interacts with the model repeatedly for K episodes and has to learn to approximate the optimal policy as closely as possible. In this work we show that the minimax regret for this setting is O(B_⋆√(S A K)). Here B_⋆ is a bound on the expected cost of the optimal policy from any state, S is the number of states, and A is the number of actions. This matches the lower bound of Rosenberg et al. (2020) up to logarithmic factors, and improves their regret bound by a factor of √(S). Our algorithm runs in polynomial time per episode and is based on a novel reduction to reinforcement learning in finite-horizon MDPs. To that end, we provide an algorithm for the finite-horizon setting whose leading regret term depends only logarithmically on the horizon, yielding the same regret guarantees for SSP.
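The following is a minimal Python sketch that makes the quantities in the abstract concrete. It is built on a hypothetical two-state SSP. The instance, its costs, and the helper names finite_horizon_values, regret and bound_scale are illustrative assumptions, not taken from the paper. The sketch does three things:

- It approximates the optimal cost-to-go J⋆ by backward induction on an H-step truncation of the SSP.
- It takes B_⋆ as the largest optimal cost-to-go.
- It evaluates the regret R_K = Σ_k (cost of episode k) − K·J⋆(s_init).

Plain truncation is only the simplest form of finite-horizon approximation. It is not the paper's reduction, whose point is a finite-horizon regret bound with only logarithmic dependence on the horizon.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical toy SSP (not from the paper): states 0 and 1, absorbing goal 2.
# P[s][a] lists (next_state, probability); C[s][a] is the immediate cost.
GOAL = 2
P: Dict[int, Dict[int, List[Tuple[int, float]]]] = {
    0: {0: [(1, 1.0)], 1: [(GOAL, 0.5), (0, 0.5)]},
    1: {0: [(GOAL, 0.9), (0, 0.1)], 1: [(1, 1.0)]},  # action 1 at state 1 loops forever
}
C: Dict[int, Dict[int, float]] = {
    0: {0: 0.2, 1: 1.0},
    1: {0: 0.5, 1: 0.1},
}


def finite_horizon_values(horizon: int) -> Dict[int, float]:
    """Backward induction on the H-step truncation of the SSP.

    V_0 = 0 and V_h(s) = min_a [C(s, a) + sum_s' P(s'|s, a) V_{h-1}(s')],
    with the goal fixed at cost 0. In this toy instance all costs are
    positive and a proper policy exists, so V_h increases towards J*.
    """
    v = {0: 0.0, 1: 0.0, GOAL: 0.0}
    for _ in range(horizon):
        new_v = {GOAL: 0.0}
        for s in P:
            new_v[s] = min(
                C[s][a] + sum(p * v[nxt] for nxt, p in P[s][a]) for a in P[s]
            )
        v = new_v
    return v


def regret(episode_costs: List[float], j_star_init: float) -> float:
    """SSP regret: total cost paid over K episodes minus K * J*(s_init)."""
    return sum(episode_costs) - len(episode_costs) * j_star_init


def bound_scale(b_star: float, num_states: int, num_actions: int, k: int) -> float:
    """B* * sqrt(S * A * K): the order of the minimax rate, ignoring
    constants and logarithmic factors."""
    return b_star * math.sqrt(num_states * num_actions * k)


if __name__ == "__main__":
    for h in (1, 5, 200):
        print("H =", h, finite_horizon_values(h))
    j_star = finite_horizon_values(200)
    b_star = max(j_star[s] for s in P)  # B*: largest optimal cost-to-go
    print("B* =", b_star)
    print("regret =", regret([1.0, 0.5, 1.0], j_star[0]))
    print("bound scale =", bound_scale(b_star, num_states=2, num_actions=2, k=10_000))
```

Consider a short horizon such as H = 1. There the truncated problem prefers the cheap self-loop at state 1 (cost 0.1 per step), which never reaches the goal. Only with a longer horizon does the goal-reaching action win. This illustrates why a naive truncation needs a horizon long enough for proper policies to finish, and why a horizon-independent leading regret term matters.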