SVRG Meets AdaGrad: Painless Variance Reduction
Variance reduction (VR) methods for finite-sum minimization typically require the knowledge of problem-dependent constants that are often unknown and difficult to estimate. To address this, we use ideas from adaptive gradient methods to propose AdaSVRG, which is a fully adaptive variant of SVRG, a common VR method. AdaSVRG uses AdaGrad in the inner loop of SVRG, making it robust to the choice of step-size, and allowing it to adaptively determine the length of each inner-loop. When minimizing a sum of n smooth convex functions, we prove that AdaSVRG requires O(n + 1/ϵ) gradient evaluations to achieve an ϵ-suboptimality, matching the typical rate, but without needing to know problem-dependent constants. However, VR methods including AdaSVRG are slower than SGD when used with over-parameterized models capable of interpolating the training data. Hence, we also propose a hybrid algorithm that can adaptively switch from AdaGrad to AdaSVRG, achieving the best of both stochastic gradient and VR methods, but without needing to tune their step-sizes. Via experiments on synthetic and standard real-world datasets, we validate the robustness and effectiveness of AdaSVRG, demonstrating its superior performance over other "tune-free" VR methods.
READ FULL TEXT