Improved No-Regret Algorithms for Stochastic Shortest Path with Linear MDP
We introduce two new no-regret algorithms for the stochastic shortest path (SSP) problem with a linear MDP that significantly improve over the only existing results, those of Vial et al. (2021). Our first algorithm is computationally efficient and achieves a regret bound of O(√(d^3 B_⋆^2 T_⋆ K)), where d is the dimension of the feature space, B_⋆ and T_⋆ are upper bounds on the expected cost and the hitting time of the optimal policy, respectively, and K is the number of episodes. The same algorithm with a slight modification also achieves logarithmic regret of order O((d^3 B_⋆^4 / (c_min^2 gap_min)) ln^5(d B_⋆ K / c_min)), where gap_min is the minimum sub-optimality gap and c_min is the minimum cost over all state-action pairs. Our result is obtained by developing a simpler and improved analysis for the finite-horizon approximation of Cohen et al. (2021) with a smaller approximation error, which might be of independent interest. On the other hand, using variance-aware confidence sets in a global optimization problem, our second algorithm is computationally inefficient but achieves the first "horizon-free" regret bound O(d^{3.5} B_⋆ √K) with no polynomial dependency on T_⋆ or 1/c_min, almost matching the Ω(d B_⋆ √K) lower bound of Min et al. (2021).
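As a rough reading aid, the two √K-type bounds above can be compared directly. The sketch below assumes that constants and any logarithmic factors are ignored, since the abstract states only orders of growth, so the thresholds it gives are indicative rather than exact.

```latex
% Ratio of the horizon-free bound to the efficient algorithm's bound
% (constants and logarithmic factors ignored; an assumption of this sketch).
\[
\frac{d^{3.5} B_\star \sqrt{K}}{\sqrt{d^{3} B_\star^{2} T_\star K}}
  = \frac{d^{3.5} B_\star \sqrt{K}}{d^{1.5} B_\star \sqrt{T_\star K}}
  = \frac{d^{2}}{\sqrt{T_\star}}
\]
% Ratio of the horizon-free bound to the stated lower bound.
\[
\frac{d^{3.5} B_\star \sqrt{K}}{d\, B_\star \sqrt{K}} = d^{2.5}
\]
```

Under these assumptions, the horizon-free bound is the smaller of the two when T_⋆ exceeds roughly d^4, and it sits a factor of d^{2.5} above the lower bound, which is the sense in which it "almost" matches in its dependence on B_⋆ and K.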