Linear Convergence and Implicit Regularization of Generalized Mirror Descent with Time-Dependent Mirrors

The following questions are fundamental to understanding the properties of over-parameterization in modern machine learning: (1) Under what conditions and at what rate does training converge to a global minimum? (2) What form of implicit regularization occurs through training? While significant progress has been made in answering both questions for gradient descent, they remain less completely answered for general optimization methods. In this work, we establish sufficient conditions for linear convergence and obtain approximate implicit regularization results for generalized mirror descent (GMD), a generalization of mirror descent with a possibly time-dependent mirror. GMD subsumes popular first-order optimization methods including gradient descent, mirror descent, and preconditioned gradient descent methods such as Adagrad. Using the Polyak-Lojasiewicz inequality, we first present a simple analysis under which non-stochastic GMD converges linearly to a global minimum. We then present a novel, Taylor-series-based analysis to establish sufficient conditions for linear convergence of stochastic GMD. As a corollary, our result establishes sufficient conditions and provides learning rates for linear convergence of stochastic mirror descent and Adagrad. Lastly, we obtain approximate implicit regularization results for GMD by proving that GMD converges to an interpolating solution that is approximately the closest interpolating solution to the initialization in ℓ2-norm in the dual space, thereby generalizing the result of Azizan, Lale, and Hassibi (2019) in the full batch setting.
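To make the idea of a time-dependent mirror concrete, here is a minimal sketch, not the paper's implementation. It assumes the update takes the form φ_t(w_{t+1}) = φ_t(w_t) − η ∇f(w_t) with a diagonal linear mirror φ_t(w) = d_t ⊙ w, so that d_t = 1 recovers gradient descent and d_t equal to the square root of the accumulated squared gradients gives an Adagrad-style step. The function names (`gmd`, `grad`, `loss`), the toy data, and the step sizes are all illustrative assumptions, and the loss is an over-parameterized least-squares problem chosen because it satisfies the Polyak-Lojasiewicz inequality.

```python
import math

def residuals(w, X, y):
    # r_i = <x_i, w> - y_i for each training example
    return [sum(a * b for a, b in zip(row, w)) - t for row, t in zip(X, y)]

def loss(w, X, y):
    # f(w) = 0.5 * sum_i r_i^2
    return 0.5 * sum(r * r for r in residuals(w, X, y))

def grad(w, X, y):
    # grad f(w) = X^T r
    r = residuals(w, X, y)
    return [sum(ri * row[j] for ri, row in zip(r, X)) for j in range(len(w))]

def gmd(X, y, w0, lr, steps, mirror="identity", eps=1e-8):
    """Full-batch GMD sketch with an assumed diagonal mirror phi_t(w) = d_t * w.

    Solving phi_t(w_next) = phi_t(w) - lr * grad f(w) for w_next gives
    w_next = w - lr * grad f(w) / d_t (elementwise).
    """
    w = list(w0)
    acc = [0.0] * len(w)  # running sum of squared gradients (Adagrad-style)
    for _ in range(steps):
        g = grad(w, X, y)
        if mirror == "adagrad":
            acc = [a + gj * gj for a, gj in zip(acc, g)]
            d = [math.sqrt(a) + eps for a in acc]  # time-dependent mirror
        else:
            d = [1.0] * len(w)  # fixed identity mirror: plain gradient descent
        w = [wj - lr * gj / dj for wj, gj, dj in zip(w, g, d)]
    return w

# Two equations, three unknowns: infinitely many interpolating solutions.
X = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
y = [1.0, 2.0]

w_gd = gmd(X, y, [0.0, 0.0, 0.0], lr=0.1, steps=200)
w_ada = gmd(X, y, [0.0, 0.0, 0.0], lr=0.5, steps=2000, mirror="adagrad")
```

In this toy problem, gradient descent started from zero stays in the row space of `X`, so it approaches the minimum ℓ2-norm interpolating solution (0, 1, 1). The Adagrad-style run also drives the loss to zero, but the interpolating solution it selects need not be the same one, which is the kind of dependence on the mirror that the implicit regularization result describes.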




