Variance-reduced Q-learning is minimax optimal
We introduce and analyze a form of variance-reduced Q-learning. For γ-discounted MDPs with finite state space X and action space U, we prove that it yields an ϵ-accurate estimate of the optimal Q-function in the ℓ_∞-norm using O( (D / (ϵ^2 (1-γ)^3)) log(D/(1-γ)) ) samples, where D = |X| × |U|. This guarantee matches known minimax lower bounds up to a logarithmic factor in the discount complexity, and our method is the first form of model-free Q-learning proven to achieve the worst-case optimal cubic scaling in the discount complexity parameter 1/(1-γ) accompanied by optimal linear scaling in the state and action space sizes. By contrast, our past work shows that ordinary Q-learning has worst-case quartic scaling in the discount complexity.
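To make the recentering idea concrete, here is a minimal sketch of one epoch of an SVRG-style variance-reduced Q-learning update on a tabular MDP. It assumes a generative model: the hypothetical callable `sample_next_state(x, u)` draws a next state from P(·|x, u), and `reward` is a |X| × |U| array. The step size and epoch lengths below are illustrative, not the schedule analyzed in the paper.

```python
import numpy as np

def empirical_bellman_pair(Q, Q_bar, reward, sample_next_state, gamma):
    # One-sample empirical Bellman operator applied to Q and Q_bar, reusing the
    # same next-state draw for each (x, u) so the noise cancels in the difference.
    T_Q, T_Qbar = np.empty_like(Q), np.empty_like(Q_bar)
    n_states, n_actions = Q.shape
    for x in range(n_states):
        for u in range(n_actions):
            x_next = sample_next_state(x, u)  # generative-model sample from P(.|x, u)
            T_Q[x, u] = reward[x, u] + gamma * np.max(Q[x_next])
            T_Qbar[x, u] = reward[x, u] + gamma * np.max(Q_bar[x_next])
    return T_Q, T_Qbar

def vr_q_epoch(Q_bar, reward, sample_next_state, gamma, n_recenter=1000, n_steps=1000):
    # One epoch: estimate the Bellman operator at the reference point Q_bar by
    # Monte Carlo averaging, then run recentered (variance-reduced) Q-learning steps.
    T_bar = np.zeros_like(Q_bar)
    for _ in range(n_recenter):
        T_bar += empirical_bellman_pair(Q_bar, Q_bar, reward,
                                        sample_next_state, gamma)[0]
    T_bar /= n_recenter

    Q = Q_bar.copy()
    for k in range(1, n_steps + 1):
        lam = 1.0 / (1.0 + (1.0 - gamma) * k)  # illustrative step size
        T_Q, T_Qbar = empirical_bellman_pair(Q, Q_bar, reward,
                                             sample_next_state, gamma)
        # Recentered update: noisy Bellman step at Q, corrected by the reference point.
        Q = (1.0 - lam) * Q + lam * (T_Q - T_Qbar + T_bar)
    return Q
```

In practice, epochs of this form would be repeated with the output Q used as the next epoch's reference point, which is what drives the per-step noise down and yields the improved dependence on 1/(1-γ).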