The Case for Full-Matrix Adaptive Regularization

06/08/2018 ∙ Naman Agarwal et al. ∙ Google, Princeton University

Adaptive regularization methods come in diagonal and full-matrix variants. However, only the former have enjoyed widespread adoption in training large-scale deep models. This is due to the computational overhead of manipulating a full matrix in high dimension. In this paper, we show how to make full-matrix adaptive regularization practical and useful. We present GGT, a truly scalable full-matrix adaptive optimizer. At the heart of our algorithm is an efficient method for computing the inverse square root of a low-rank matrix. We show that GGT converges to first-order local minima, providing the first rigorous theoretical analysis of adaptive regularization in non-convex optimization. In preliminary experiments, GGT trains faster across a variety of synthetic tasks and standard deep learning benchmarks.


1 Introduction

Stochastic gradient descent is the workhorse behind the recent deep learning revolution. This simple and age-old algorithm has been supplemented with a variety of enhancements to improve its practical performance, and sometimes its theoretical guarantees.

Amongst the acceleration methods there are three main categories: momentum, adaptive regularization, and variance reduction. Momentum (in its various incarnations, like heavy-ball or Nesterov acceleration) is the oldest enhancement. It has a well-developed theory, and is known to improve practical convergence in a variety of tasks, small and large. It is also easy to implement. Variance reduction is the most recent advancement; in theory and practice, it is mostly applicable to convex optimization, and is thus less influential in deep learning.

This brings us to adaptive regularization: the most sophisticated, most difficult to implement, and most debated of the acceleration methods. While state-of-the-art optimizers such as Adam and AdaGrad [KB14, DHS11] do use adaptive regularization, they do so in a very limited form: with diagonal matrices, often marketed as per-coordinate adaptive learning-rate methods. Despite solid theoretical guarantees, the practical value of diagonal adaptive regularization as compared to “vanilla” SGD has been the subject of much debate [WRS17]. However, the efficacy of full-matrix adaptive regularization has been relatively unexplored. This is due to the prohibitive computational cost associated with full-matrix operations: full AdaGrad requires taking the inverse square root of a large matrix.

In this paper, we present GGT, a practical solution to the computational problems plaguing full-matrix adaptive regularization, making this technique scalable for modern deep models. At the heart of our method is a simple, GPU-friendly way to apply the inverse square root of the low-rank second-moment matrix of recent gradients; see Figure 1. GGT’s running time is comparable to state-of-the-art optimizers.

We proceed to show that full-matrix preconditioning allows for much better exploitation of anisotropic curvature in loss landscapes. First, we show synthetic experiments which demonstrate clear benefits of GGT over baselines, especially when the problem is ill-conditioned. Then, we implement GGT at scale, and show that the benefits translate to faster training on standard deep learning benchmarks. Our improvement is most salient in complicated landscapes like RNN training.

Our algorithm comes with theoretical guarantees. We give the first proof of convergence to first-order critical points for an algorithm with adaptive regularization in a stochastic non-convex setting, whose rate is dependent on an adaptivity ratio. We show examples where our bound is stronger than that for SGD, providing some theoretical basis for our empirical findings.

1.1 Related Work

Since the introduction of AdaGrad [DHS11], diagonal adaptive regularization has been a mainstay in the machine learning practitioner’s toolbox. A quick perusal of the literature shows that these methods have continued to thrive in the deep learning era, and appear in all major frameworks [AAB16, PGC17, CLL15]. By citation count (or GitHub search hits), Adam [KB14] is by far the most popular adaptive optimizer for training a variety of modern deep models. For this reason, this paper’s exposition is targeted towards a full-matrix drop-in replacement for Adam; however, our techniques extend straightforwardly to a plethora of related variants, like RMSprop [TH12], Adadelta [Zei12], Nadam [Doz16], etc.

Full-matrix adaptive regularization has existed alongside the more commonly used diagonal-matrix manifestation since their common inception in [DHS11]; however, a major obstacle to the scalability of these methods is the need for the storage and inversion of square matrices in the model dimension. This becomes prohibitively expensive even at moderate model sizes, while state-of-the-art models regularly have many millions of parameters.

Matrix sketching has been employed to approximate the AdaGrad preconditioner [KMK16b, MRVW16]; however, the sketched estimate for the matrix inverse can be sensitive to noise. In the former, the authors report a 5-10× overhead over AdaGrad, even on relatively small models; we could not find a usable GPU implementation for their requisite rank-1 QR update. [GKS18] propose a way to do AdaGrad with Kronecker products of full-matrix preconditioners, a more limited setting which requires knowledge of the model’s structure. Finally, as we argue in Section 3.1, there is intrinsic value in “forgetting” past curvature using an exponential window. With this, a low-rank preconditioning matrix naturally arises, allowing us to bypass the computational need for sketching in the model dimension or architecture-dependent restriction of the preconditioner.

Our algorithm bears a superficial resemblance to L-BFGS [LN89], a version of BFGS [Bro70, Fle70, Gol70, Sha70] which uses a sliding window of gradient history. Although some are viable for large-scale implementation, these quasi-Newton methods, along with (subsampled, online, cubic-regularized) Newton methods [EM15, ABH17, LACBL16, HAK07, AAZB17, CDHS17] exhibit very different dynamics than the standard optimizers in deep learning, and thus have not seen widespread adoption. We find recent deep learning applications of second-order methods (e.g. [MG15, MBJ18]) to be intriguing, though outside the scope of this paper.

Recently, the role of adaptive regularization has been a hotly contested topic. In [WRS17], the authors suggest that properly-tuned SGD exhibits superior generalization to adaptive methods. In turn, [KS17] propose switching the optimizer from Adam to SGD at the end of training, to reap the advantages of each. Influentially, Adam’s convergence has been the object of recent scrutiny [RKK18]. However, Adam continues to enjoy successful convergence in practice; the problematic construction involves pathological outlier gradients. We do not use the analyses of Adam or AMSGrad.

Figure 1: Sketch of how GGT performs fast full-matrix preconditioning. Note that the inverse matrices are understood here to be Moore-Penrose pseudoinverses; see Section 2.1 for a full treatment.

2 The GGT Algorithm

Our main algorithmic contribution is GGT, an efficient first-order algorithm for full-matrix adaptive preconditioning. In brief, GGT uses the preconditioner from full-matrix AdaGrad, with gradient history attenuated exponentially as in Adam, and truncated to a window parameter w. The name GGT acts as a convenient mnemonic for the gradient second-moment matrix GGᵀ maintained by full-matrix AdaGrad, even though we never compute this matrix.

The mathematical specification of GGT is given in Algorithm 1, in the usual model of stochastic optimization (see Section 4), with stochastic gradients g_t. Notice that the coordinate-wise scaling of Adam is recovered by zeroing out the off-diagonal entries of G_tG_tᵀ.

1:Input: initializer x_1, window size w, learning rate schedule {η_t}, attenuation parameter β2 ≤ 1.
2:for t = 1, ..., T do
3:     Receive stochastic gradient g_t.
4:     Let G_t = [ g_t, β2 g_{t−1}, ..., β2^{w−1} g_{t−w+1} ], where g_k = 0 if k ≤ 0.
5:     Update x_{t+1} ← x_t − η_t (G_t G_tᵀ)^{−1/2} g_t, with the inverse square root taken as a pseudoinverse (see Section 2.1).
6:end for
Algorithm 1 GGT adaptive optimizer
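For concreteness, the following Python sketch spells out a single step of Algorithm 1 in its naive dense form. It is an illustration only (function and variable names are ours, and a small ε ridge stands in for the pseudoinverse of Figure 1); the O(d³) eigendecomposition it performs is exactly the cost that Section 2.1 eliminates.

```python
import numpy as np

def ggt_step_naive(x, recent_grads, lr, beta2=1.0, eps=1e-4):
    """One GGT step, written naively for exposition; O(d^3), not the fast version.

    x            -- current parameters, shape (d,)
    recent_grads -- list of the last w stochastic gradients, most recent first
    lr           -- learning rate eta_t for this step
    beta2        -- exponential attenuation of older gradients (1.0 = plain window)
    eps          -- small ridge standing in for the pseudoinverse of Figure 1
    """
    d = x.shape[0]
    # Columns of G are recent gradients, older ones scaled down by powers of beta2.
    G = np.stack([beta2**i * g for i, g in enumerate(recent_grads)], axis=1)
    M = G @ G.T + eps * np.eye(d)                 # full d x d second-moment matrix
    evals, evecs = np.linalg.eigh(M)              # dense eigendecomposition: O(d^3)
    M_inv_sqrt = (evecs * evals**-0.5) @ evecs.T  # (G G^T + eps I)^{-1/2}
    return x - lr * M_inv_sqrt @ recent_grads[0]  # precondition the newest gradient
```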

GGT provides the power of full-matrix adaptive regularization at a cost not much larger than SGD. This crucially exploits the fact that only a small window of historical gradients is used for preconditioning. The intuition for using a small window, as opposed to the entire history, is clear (and time-tested, by the ubiquity of Adam): the curvature of the loss surface changes, rendering previous gradient information obsolete. We expand on the benefits of forgetting gradients in Section 3.1.

The fact that the preconditioning matrix is based on a small window of gradients implies that it has low rank. GGT exploits this fact by computing the inverse square root of the empirical covariance matrix indirectly, as outlined in Figure 1. In effect, instead of inverting a full matrix in the dimension of parameters, GGT uses the special matrix structure to invert a matrix whose dimension is only the window size. The remainder of this section discusses efficient implementation and some heuristics.

GGT comes with provable guarantees even for non-convex optimization: it is guaranteed to converge to a first-order critical point. Its rate of convergence is never significantly slower than that of SGD, and under some favorable geometric conditions it can be significantly faster. These theoretical bounds are made precise in Section 4.

2.1 Fast low-rank preconditioning

The window parameter w should be roughly the number of copies of the model that fit in RAM; this is the regime we target in our large-scale experiments. A pessimistic but principled choice is w ≈ 1/(1−β2), which truncates on the time scale of the exponential attenuation. Our key observation, highlighted in Figure 1, is that the inversion of the large low-rank matrix GGᵀ can be performed by diagonalizing the small w×w matrix GᵀG, along with some extremely GPU-friendly matrix-vector operations.

The basic intuition is contained in Figure 1, but it remains to include the εI term. We derive the full update here. Let G ∈ R^{d×w} and g ∈ R^d be arbitrary, with w ≤ d. Write the singular value decomposition G = UΣVᵀ, with U ∈ R^{d×d}, Σ ∈ R^{d×w}, V ∈ R^{w×w}. Let Σ_w ∈ R^{w×w} be the top left block of Σ. Let U = [U_w | U_⊥], so that the columns of U_w ∈ R^{d×w} form an orthonormal basis for the column space of G, and those of U_⊥ for its orthogonal complement, noting that U_⊥ᵀG = 0. Then, we have

(GGᵀ + εI)^{−1/2} g = (1/√ε) g + U_w [ (Σ_w² + εI_w)^{−1/2} − (1/√ε) I_w ] U_wᵀ g.

The first term is none other than an SGD update step. The rest can be computed by taking the eigendecomposition GᵀG = V Σ_w² Vᵀ, giving U_w = G V Σ_w^{−1} (interpreted as a pseudoinverse on zero singular values). We prefer this to taking the direct SVD of G, which is considerably slower on GPU.
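The identity above can be checked numerically. Below is a minimal NumPy sketch (our own illustration, not reference code; the ε ridge plays the role of the pseudoinverse) that applies (GGᵀ + εI)^{−1/2} to a vector using only a w×w eigendecomposition, together with a sanity check against the direct dense computation.

```python
import numpy as np

def apply_inv_sqrt(G, v, eps=1e-4):
    """Return (G @ G.T + eps*I)^{-1/2} @ v using only w x w eigen-operations.

    G : (d, w) matrix of recent (attenuated) gradients; v : vector of length d.
    """
    s2, V = np.linalg.eigh(G.T @ G)          # small Gram matrix G^T G = V diag(s2) V^T
    s2 = np.clip(s2, 0.0, None)              # remove tiny negative round-off
    # Coefficients of the low-rank correction, applied as G V diag(c) V^T G^T v:
    #   c_i = ((s_i^2 + eps)^{-1/2} - eps^{-1/2}) / s_i^2.
    c = ((s2 + eps) ** -0.5 - eps ** -0.5) / np.maximum(s2, 1e-12)
    z = V.T @ (G.T @ v)                      # coordinates of G^T v in the eigenbasis
    return eps ** -0.5 * v + G @ (V @ (c * z))   # SGD-like term + low-rank correction

# Sanity check against the direct dense computation (small d only).
rng = np.random.default_rng(0)
G, v, eps = rng.normal(size=(40, 6)), rng.normal(size=40), 1e-4
evals, U = np.linalg.eigh(G @ G.T + eps * np.eye(40))
dense = U @ (evals ** -0.5 * (U.T @ v))
assert np.allclose(apply_inv_sqrt(G, v, eps), dense)
```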

Using a cyclic buffer to store and update G, the algorithm takes O(dw + w³) (sequential) time per iteration, and O(dw) memory in total. Iterating over the model parameters to update G incurs the same overhead cost as in the usual adaptive optimizers. The matrix multiplication and SVD operations benefit from decades of extensive hardware-level optimizations.

In the experiments in Section 3, we observed a modest running-time overhead over SGD in CNN training, and roughly a 2× overhead in RNN training; we note that this ratio could be even smaller in reinforcement learning (where the environment causes the time bottleneck), or universally with a more optimized implementation.

2.2 Tweaks for GGT on deep models

Below, we list some practical suggestions for applying GGT to training large-scale models.

Momentum. In order to bring GGT closer to a drop-in replacement for Adam, we can add momentum to the gradient steps: let v_t ← β1 v_{t−1} + g_t, and apply the preconditioner to v_t to compute the update step. We use momentum in all large-scale experiments, with the standard choice β1 = 0.9. We also get a small performance boost by using the v_t, instead of the raw gradients, to update G. On the other hand, as long as the window is shorter than the attenuation time scale 1/(1−β2), it makes little difference to choose β2 = 1, letting the window (rather than exponential attenuation) forget stale gradient information.
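As a sketch of this tweak (variable and function names are ours; apply_inv_sqrt refers to the low-rank preconditioner sketched in Section 2.1), the momentum buffer simply replaces the raw gradient on both sides of the preconditioned step:

```python
import numpy as np

def ggt_momentum_step(x, v_buf, grad, grad_window, w, lr, beta1=0.9, eps=1e-4):
    """One GGT step with heavy-ball momentum; `apply_inv_sqrt` is the sketch above."""
    v_buf = beta1 * v_buf + grad                # momentum buffer v_t = beta1*v_{t-1} + g_t
    grad_window.append(v_buf.copy())            # feed v_t (rather than g_t) into the window
    del grad_window[:-w]                        # keep only the w most recent columns
    G = np.stack(grad_window, axis=1)
    x = x - lr * apply_inv_sqrt(G, v_buf, eps)  # precondition the momentum term
    return x, v_buf
```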

Interpolation with SGD. We note the possibility of decoupling the two scalar coefficients in the efficient update step: the one on the SGD-like term and the one on the low-rank correction. Appealingly, this allows the user to tune GGT’s behavior to be arbitrarily close to that of SGD.

Numerical concerns. For greater numerical stability, it is possible to add a small multiple of the identity matrix to GᵀG before computing its eigendecomposition, without noticeable differences in training.

3 Experiments

In this section, we present an empirical study of GGT. We begin with some simple experiments, showing that adaptive methods help in the presence of ill-conditioned optimization problems, as well as the value of limited gradient memory. Next, we evaluate the performance of GGT on larger-scale deep learning tasks. Finally, we present some interesting empirical insights on the training dynamics in deep learning models. Our visualizations of gradient spectra suggest that adaptive optimizers are indeed correcting for changing anisotropic curvature in the loss landscape.

3.1 Synthetic data: when do adaptivity and forgetfulness help?

The original theorems on the behavior of adaptive first-order methods are established from the perspective of online convex optimization [DHS11]. The dynamics are less understood on realistic loss landscapes in stochastic optimization. For this reason, we begin our experimental section with some simple empirical comparisons between full- and diagonal-matrix adaptive optimizers and SGD. Figure 2 summarizes our findings.

In each synthetic experiment, we generated an ill-conditioned landscape, and compared SGD with adaptive optimizers, excluding the typical accompanying heuristics (i.e. no momentum, regularization, or learning rate schedule). We tested diagonal-matrix preconditioners with and without exponential gradient attenuation (like Adam and AdaGrad, respectively), and their full-matrix analogues. The experiments were robust with respect to the choice of ε and batch size.

In the first synthetic experiment (Figure 2, left), we exhibit an instance of logistic regression in dimension 10, with samples generated from an extremely anisotropic Gaussian distribution, and binary labels determined by a random hyperplane. SGD converges the slowest, and diagonal AdaGrad consistently accelerates optimization. Finally, full-matrix preconditioning (using cubic-time matrix inversion) converges the fastest. In this setting, adding a window improved convergence, but not drastically; we elaborate below.

Next, we show an optimization problem (Figure 2, right) which accentuates the utility of exponentially decaying gradient memory. We consider the problem of minimizing the logarithmic barrier function of a randomly generated anisotropic polytope, otherwise known as finding its analytic center: this replaces the logistic loss terms with −log(b_i − ⟨a_i, x⟩), with the a_i generated the same way as above, and the b_i generated uniformly at random. We observed the same ranking of convergence rates as in the first experiment, but the improvement afforded by the window was much clearer.
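Both synthetic objectives can be written down in a few lines. The following sketch fixes one plausible instantiation; the sample count, the anisotropy profile, and the distribution of the offsets b_i are our own illustrative choices, not necessarily the exact ones used in Figure 2.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 10, 1000
scales = np.logspace(0, 3, d)              # anisotropic feature scales (our choice)
A = rng.normal(size=(n, d)) * scales       # rows a_i drawn from an anisotropic Gaussian
w_star = rng.normal(size=d)
y = np.sign(A @ w_star)                    # binary labels from a random hyperplane

def logistic_loss(x):
    """Ill-conditioned logistic regression objective (Figure 2, left)."""
    return np.mean(np.log1p(np.exp(-y * (A @ x))))

b = rng.uniform(1.0, 2.0, size=n)          # offsets keeping x = 0 strictly feasible
def log_barrier(x):
    """Analytic-center objective for the polytope {x : Ax <= b} (Figure 2, right)."""
    slack = b - A @ x
    return np.inf if np.any(slack <= 0) else -np.sum(np.log(slack))
```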

The primary conclusion of our synthetic experiments is to demonstrate some small-scale settings in which adaptive regularization ameliorates anisotropy in the optimization landscape. A subtler point is that the windowed variants can help with changing curvature, even for convex losses. Note that the curvature of the former landscape is constant (in that its Hessian matrix at different locations only changes by a scalar factor). The latter setting, in contrast, features a changing curvature (its Hessians do not commute in general), necessitating “forgetfulness” in adaptive curvature estimation.

In Section 3.4, we will return to these proof-of-concept optimization instances, connecting them to an empirical study of curvature in more realistic landscapes.

Figure 2: Synthetic experiments on convex loss functions, demonstrating the value of adaptive regularization and attenuation of gradient history. Left: An ill-conditioned instance of logistic regression. Adaptive regularization finds a good preconditioner, accelerating optimization. Right: Minimizing a barrier function, an example where the curvature changes with position. Optimization is further accelerated by forgetting outdated gradient information.

3.2 GGT on deep convolutional models

We investigated the training dynamics of GGT on a typical deep architecture for computer vision. For this, we used a 26-layer 3-branch residual network with Shake-Shake regularization, recently proposed in [Gas17]. Aside from its ability to reach state-of-the-art classification accuracy, this architecture also features a relatively low parameter count, enabling the use of a large window parameter w.

In each experiment, we kept the cosine learning rate annealing schedule proposed in the paper, originally from [LH16]; performance degraded consistently and significantly with a fixed learning rate. For both Adam and GGT, we chose the commonly used attenuation parameters; for SGD, we used standard heavy-ball momentum. With correctly tuned RMSprop and Adadelta, with the same window parameters, training curves were virtually identical to those for Adam. We used the standard data augmentation techniques of 4-pixel padding + random cropping and horizontal flipping.

Our results are shown in Figure 3 (top). In terms of training loss, GGT consistently dominated existing optimizers. We corroborate a number of observations from previous empirical studies of the generalization of optimizers. Most prominently, we found that SGD generalized slightly better than all other methods, including ours, towards the end of training [WRS17, KS17]. The gap is less dramatic than that seen in [WRS17] for two reasons: we only show curves with a tuned and annealed learning rate; also, we use an architecture with powerful explicit regularization techniques which have gained attention since their publication. Our preliminary observation is that GGT shrinks this gap slightly, and we expect that there is vastly more empirical work to be done concerning architectures synergistically tuned to default optimizers.

We also verify the long-held empirical observation that the learning rate decay of AdaGrad is too aggressive (e.g. in [Zei12]), resulting in convergence to a poor solution. Finally, as noted in [WRS17], we find that using a sufficiently low learning rate for any optimizer can result in a better training loss curve, but not without significantly degrading generalization performance.

3.3 GGT on recurrent models

Figure 3: Results of CNN and RNN experiments. GGT dominates in training loss across both tasks, and generalizes better on the RNN task. Top: CIFAR-10 classification with a 3-branch ResNet. Bottom: PTB character-level language modeling with a 3-layer LSTM.

Next, we move to recurrent architectures for language modeling. We train a 3-layer LSTM [HS97] for character-level modeling of the Penn Treebank dataset [MKM94]. This is the setting in which we observe the most striking improvement over baselines. The particularities of this optimization task, and why it might be especially amenable to full-matrix regularization, remain a fruitful research direction [PMB13]. Figure 3 (bottom) shows training and validation perplexities for the first few epochs; no optimizer makes significant progress afterwards.

The state of the art for character-level language modeling is less thoroughly documented than its word-level counterpart, though we note that our end-to-end result (final validation perplexity) is competitive with that shown in [KMK16a], and lower than the final perplexities reached by Adam, AdaGrad, and SGD. Note that Adam is the de facto standard optimizer for language modeling [MDB17]. Even with iterations taking twice the time, we outperform all baselines in wall-clock time throughout training.

We also tried using GGT as a drop-in replacement for Adam in the state-of-the-art word-level language modeling code accompanying [MKS17, MKS18]. Although we were competitive with Adam, we only observed an improvement in the first few epochs. We hypothesize that the advantage of full-matrix regularization in this setting is more marginal, as the gradients in the embedding layers are naturally sparse in the vocabulary (“one-hot”) basis.

3.4 Empirical insights on the spectral decay

Figure 4: Evolution of the spectrum of the gradient matrix during training. Each vertical slice is a density heatmap of the eigenvalues of GᵀG. The black lines indicate the minimum and maximum eigenvalues, smoothed in time by a median filter. Top: CNN training. Approaching the end of training, the gradients become more anisotropic. Bottom: RNN training. Within the first few epochs, the gradients become more isotropic, then stabilize. (Truncated to 5 epochs; the density was visually stable for the remainder of training.)

In this section, we unify the insights gleaned from the synthetic experiments and deep learning benchmarks. Along the way, we provide some interesting anecdotal observations on the evolution of the preconditioner matrices’ singular values.

We plot the density of the spectrum of the low-rank preconditioner as training progresses. Since the fast implementation of GGT takes an eigendecomposition of GᵀG, we can read off the distribution of eigenvalues during training at no additional computational cost. Figure 4 visualizes the result of this experiment for the CNN and RNN training settings from the previous two sections. In each case, we observe a consistently large condition number, noting that this can be visualized as the vertical range in the logarithmic plot.
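Since the w×w eigendecomposition is computed inside every GGT step anyway, logging the spectrum costs essentially nothing. A minimal sketch (names are ours; the eigenvalues are recomputed here only for self-containment):

```python
import numpy as np

def log_spectrum(G, history):
    """Record the singular values of the gradient window G, a byproduct of each GGT step."""
    s2 = np.linalg.eigvalsh(G.T @ G)       # the same w x w eigenproblem GGT already solves
    history.append(np.sqrt(np.clip(s2, 0.0, None)))
```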

This visualization affords a new way to see how CNN and RNN landscapes are fundamentally different: their gradient spectra evolve in very distinct ways over the course of training. Interestingly, the condition number of the CNN landscape surges near the end, which may be related to the low-rank structure of well-trained nets noted by [AGNZ18], who derive rank-dependent generalization bounds for neural networks. On recurrent models, the rapidly evolving spectral structure at the early stage of training indicates a possibly more complex landscape. Intriguingly, the enormous condition number correlates with the massive lead of GGT over the others, confirming our intuition that full-matrix preconditioning ameliorates anisotropy.

To our knowledge, this is the first empirical study of this kind, using the covariance matrix of recent gradients as a surrogate to examining the changing curvature of the loss landscape. In the spirit of recent empirical lenses of this flavor [RGYSD17, LXTG17], we leave this as a way to visualize deep learning dynamics, possibly of independent exploratory interest.

4 A convergence rate analysis with adaptivity

In this section we outline an idealized version of GGT, for which we can prove convergence to an approximate first-order critical point faster than SGD. As far as we know, this is the first provable guarantee for an adaptive gradient method in the non-convex setting.

Throughout this section, we consider the setting of stochastic optimization of a differentiable non-convex function f, equipped with an unbiased, variance-bounded stochastic gradient oracle; that is, given a point x, an algorithm can query independent stochastic gradients ∇̃f(x) such that

E[∇̃f(x)] = ∇f(x),    E[‖∇̃f(x) − ∇f(x)‖²] ≤ σ².

The objective, as is standard in the theory of non-convex optimization (see, e.g. [GL13, AZH16]), is to find an ε-approximate stationary point x; that is, ‖∇f(x)‖ ≤ ε.

4.1 A suitable abstraction for GGT

Even in the convex setting, a convergence theorem in the form of that shown in the Adam paper [KB14] is mild, and not useful for reasoning about the benefit of adaptivity or gradient memory. In particular, the bound degrades with the attenuation parameters β1 and β2. Although [RKK18] fix a technical glitch in the Adam proof by prescribing a closely related algorithm, the convergence guarantees are of the same form.

Instead, we argue that it is more illuminating to analyze a somewhat idealized relative of the algorithm, in exchange for stronger bounds. For this, we move to a variant of (full-matrix) AdaGrad with “epochs”, or restarts, fully specified in the appendix as Algorithm 2. These restarts can be seen as another justification for using a window, in addition to our aforementioned experimental and intuitive arguments. To quantify the improvement of adaptive regularization, we define the adaptivity ratio μ as the ratio between the regret of the adaptive algorithm on the observed gradient sequence and the corresponding upper bound on the regret of (non-adaptive) online gradient descent; the precise definition is given in Appendix A.1.

Here g_1, ..., g_T denotes the sequence of stochastic gradients, x_1, ..., x_T the sequence of points played by the adaptive regularization algorithm, and x* a comparator. For convex optimization problems, x* is naturally the global minimum, but for non-convex optimization it is a subtler choice, which we detail in Appendix A.

This ratio characterizes the benefit of using adaptive regularization, and was shown in [DHS11] to be always bounded by a quantity independent of T, and potentially much smaller. Specifically, it was shown to be at times inversely proportional to the dimension in certain convex optimization problems, providing a theoretical justification for the speedup of adaptive regularization algorithms. For the sake of completeness, in Appendix A.2 we restate one setting exemplifying this important fact.

4.2 Adaptive convergence rate guarantee

We informally state the main theorem below. We defer the full bound, with no smoothness constants or logarithmic factors suppressed, as well as all technical proofs, to Appendix A.

Theorem 4.1.

Let f be a bounded, Lipschitz, and smooth function with stochastic gradient oracle ∇̃f, whose variance is at most σ². In expectation, Algorithm 2 outputs an ε-approximate critical point of f, with a number of calls to ∇̃f governed by the adaptivity ratio μ; the precise bound is given in Theorem A.1.

This theorem matches, and potentially improves upon, the known analysis for stochastic gradient descent, with the introduction of the data-dependent adaptivity constant μ into the leading-order term governing the rate of convergence. Since [DHS11] bounded μ by a quantity independent of T, our theorem yields a rate of convergence that is never much worse than that of SGD, and is faster whenever μ is small.

We prove this result using two reductions. The first converts the online regret bound for the idealized algorithm to a convergence rate governed by the adaptivity ratio μ, for a well-conditioned convex function. This gives us an intermediate adaptive convergence result for convex optimization.

In our second reduction, using a modification of the usual descent lemma used in analyzing gradient descent in the non-convex setting, we reduce a smooth non-convex optimization problem to a sequence of well-conditioned convex optimization problems. We highlight the conceptual link between this two-stage analysis and the value of forgetting gradient history: the non-convex optimization problem is decomposed into a sequence of convex “soft trust-region” problems, between which the idealized GGT algorithm restarts its adaptive regularization. In Appendix A, we translate these intuitions into the formal main convergence theorem.
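To make the decomposition concrete, the following sketch records the per-epoch subproblem and why it is well-conditioned; the notation is ours and the constants are illustrative rather than the exact ones used in Algorithm 2.

```latex
% Sketch of the per-epoch ``soft trust-region'' subproblem (our notation;
% constants illustrative). In epoch k, the idealized algorithm restarts its
% adaptive regularization and approximately minimizes
\[
  f_k(x) \;:=\; f(x) \;+\; \lambda\,\lVert x - x_k\rVert^2, \qquad \lambda \gtrsim L .
\]
% Since \nabla f is L-Lipschitz, f_k is (2\lambda - L)-strongly convex and
% (2\lambda + L)-smooth for \lambda > L/2, hence well-conditioned; approximately
% minimizing it either certifies that \lVert\nabla f(x_k)\rVert is small or
% strictly decreases f, and summing over epochs yields the convergence rate.
```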

5 Conclusion

This work investigates full-matrix adaptive regularization: our main contribution is to make this technique viable for large-scale optimization, by a method for efficient multiplication by the inverse square root of a full second-moment matrix over a short window of gradients. This leads to a new algorithm, GGT, a truly scalable optimization algorithm with full-matrix adaptive preconditioning.

Through synthetic experiments, we have shown that GGT accelerates optimization in ill-conditioned loss landscapes; this is supported by accompanying adaptive convergence guarantees. Preliminary experiments show accelerated convergence on standard deep learning benchmarks, with very different training dynamics from existing diagonal adaptive methods. We accompany our algorithm and experiments with the first theoretical guarantees for adaptive regularization in the non-convex setting, giving examples of provably faster convergence to first-order critical points. We hope that GGT will be the first of a new class of algorithms for the modern large-scale optimization toolbox, and to foster new discussion towards an ever-elusive understanding of loss landscapes in deep learning.

Acknowledgments

We are grateful to Yoram Singer, Tomer Koren, Nadav Cohen, and Sanjeev Arora for helpful discussions.

References

Appendix A Full adaptive convergence analysis

To reiterate the main paper, in this section we develop an idealized version of GGT, which features full-matrix adaptive regularization and a principled choice of windowing epochs. We prove that it can converge to an approximate first-order critical point faster than SGD, with convergence rate controlled by an adaptivity ratio μ. To our knowledge, this is the first provable guarantee for an adaptive gradient method in the non-convex setting.

We consider the standard setting of stochastic optimization of a differentiable non-convex function f, equipped with a bounded-variance stochastic gradient oracle; that is, given a point x, we can query independent stochastic gradients ∇̃f(x) such that

E[∇̃f(x)] = ∇f(x),    E[‖∇̃f(x) − ∇f(x)‖²] ≤ σ².

The objective, as is standard in non-convex optimization, is to find a point x for which ‖∇f(x)‖ ≤ ε. We will also assume that f has an L-Lipschitz gradient; i.e. ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖ for all x, y.

Our algorithm makes a reduction to the case of online convex optimization. The setting is formally as follows: given a convex set K and a class of convex functions F over K, an adversary selects a sequence f_1, ..., f_T ∈ F, and the player selects a sequence of points x_1, ..., x_T ∈ K. The standard objective here is to minimize regret, defined as

Regret_T := Σ_{t=1}^T f_t(x_t) − min_{x∈K} Σ_{t=1}^T f_t(x).

Two popular algorithms to minimize online regret are online gradient descent (OGD) [Zin03] and AdaGrad [DHS11]. Due to adaptive regularization, AdaGrad can often be advantageous over OGD. We capture this notion by defining the adaptivity ratio as

μ := ( Σ_{t=1}^T f_t(x_t) − Σ_{t=1}^T f_t(x*) ) / ( D √( Σ_{t=1}^T ‖∇_t‖² ) ),

where ∇_t := ∇f_t(x_t) and x* := argmin_{x∈K} Σ_{t=1}^T f_t(x). The numerator is the regret of AdaGrad, and the denominator is proportional to the upper bound on the regret of OGD.

It follows from the bounds of [DHS11] that, for the diagonal version of AdaGrad, μ is never larger than order √d, and can be much smaller, depending on the geometry of the optimization problem. The value of μ for full-matrix AdaGrad is unknown in general, but examples are known for which it is significantly smaller than one. For completeness, we conclude this section with such an example.

In the rest of this section, we will be using AdaGrad as a subroutine in our proposed algorithms. In this regard, while stating the bounds for our algorithms, we use μ as an upper bound on the adaptivity ratio of each individual run of the AdaGrad subroutine. Furthermore, our algorithms will instantiate the online setting in the stochastic setting, where the f_t are picked randomly. In such settings, D will denote an upper bound on the distance to the comparator at each step of each run; a weak upper bound on D is given by the diameter of the underlying set.

A.1 Main Theorem

Theorem A.1.

Consider a non-convex function f whose gradient is bounded and L-Lipschitz for all x, and an initial point x_1 whose suboptimality f(x_1) − min_x f(x) is bounded. Suppose Algorithm 2 is run with a suitable choice of the regularization parameter λ and per-epoch iteration budget. Then the point x output by Algorithm 2 satisfies E[‖∇f(x)‖] ≤ ε, in a number of stochastic gradient oracle calls controlled by the adaptivity ratio μ, as determined by Theorems A.2 and A.3.

1:Input: initial point x_1, regularization parameter λ, number of epochs K, per-epoch budget T, convex optimization algorithm A (Algorithm 3).
2:for k = 1 to K do
3:     Define the regularized objective f_k(x) := f(x) + λ‖x − x_k‖².
4:     x_{k+1} ← Run Algorithm 3 on f_k for T steps, starting at x_k.
5:end for
6:Output: x sampled uniformly from {x_1, ..., x_K}.
Algorithm 2 Non-convex via iterative convex optimization

We prove the theorem in two steps. First, we prove the following theorem about Algorithm 2, which reduces a smooth non-convex optimization problem to a sequence of well-conditioned (strongly convex and smooth) convex optimization problems.

Theorem A.2.

Consider a non-convex function f with an L-Lipschitz gradient and a point x_1 whose suboptimality is bounded. Further, suppose we are given an iterative algorithm A with the following guarantee: given a smooth and strongly convex function accessible through a stochastic gradient oracle, if A is run on it for T steps, it produces a point whose expected suboptimality is at most a prescribed error. Then the point x output by Algorithm 2 satisfies a bound on E[‖∇f(x)‖] in terms of this per-epoch error, the regularization parameter λ, and the number of epochs.

Further, we propose an AdaGrad-like algorithm (Algorithm 3) that minimizes a smooth and strongly convex function to any prescribed error using a bounded number of calls to the stochastic gradient oracle. We prove the following theorem regarding Algorithm 3.

Theorem A.3.

Suppose f is a strongly convex and smooth function equipped with a variance-bounded stochastic gradient oracle. If we have an initial point x_1 with bounded suboptimality, then, when run with suitably chosen epoch lengths and learning rates, Algorithm 3 guarantees that the expected suboptimality of its output is at most the prescribed error, in a total number of oracle calls that depends on the adaptivity ratio μ.

Note that, due to the presence of μ in the above bound, we hope that the above could be much better than OGD. Our analysis closely follows the analysis of SGD for strongly convex functions given by [HK14]. We now prove Theorem A.1 using Theorems A.2 and A.3.

Proof of Theorem A.1.

The theorem follows as a consequence of running Algorithm 3 on the regularized objective f_k, starting at the point x_k in round k, which by Theorem A.3 guarantees the per-epoch suboptimality required by Theorem A.2. Combining this with our choice of the regularization parameter λ, we may invoke Theorem A.2 to arrive at the desired bound. ∎

1:Initialize: starting point x_1, initial epoch length T_1, initial learning rate η_1.
2:for epochs e = 1, 2, ... do
3:     Set: the epoch's learning rate η_e and length T_e; reset the adaptive preconditioner.
4:     for t = 1 to T_e do
5:         Query: the stochastic gradient oracle at the current iterate, obtaining g_t, where E[g_t] = ∇f(x_t).
6:         Update: the full-matrix AdaGrad preconditioner with g_t.
7:         Update: the iterate with a preconditioned (projected) gradient step of size η_e.
8:         Set: the running average of the epoch's iterates.
9:     end for
10:     Set: the next epoch's starting point to the average iterate of epoch e; update T_{e+1} and η_{e+1}.
11:end for
12:Output: the average iterate of the final epoch.
Algorithm 3 AdaGrad with epochs

We prove Theorem A.2 and Theorem A.3 in the rest of the section.

Proof of Theorem A.2.

From the statement of the theorem, we have a guarantee on the expected suboptimality of each x_{k+1} with respect to the regularized objective f_k. Now consider the chain of inequalities, holding for any epoch k, that relates f(x_{k+1}) to f(x_k) through f_k and the gradient ∇f(x_k), where the last inequality follows from smoothness. Setting λ on the order of the smoothness constant, summing the inequality over the epochs, and rearranging gives a bound on the expected squared gradient norm of a uniformly sampled iterate, which proves the theorem. ∎

Proof of Theorem A.3.

Define Δ_e to be the expected suboptimality of the iterate at the end of epoch e. The first claim is that Δ_e decreases geometrically with the epoch index e; note that the claim is true for the first epoch by the assumption on the initial point.

Additionally, for any epoch e, the regret guarantee of the AdaGrad subroutine, together with strong convexity and smoothness, bounds the suboptimality of the epoch's average iterate, where the last inequality in this argument is due to the induction hypothesis and Jensen's inequality. As a consequence, the prescribed error is reached after a number of epochs logarithmic in the initial suboptimality.

The total number of stochastic gradient oracle calls the algorithm makes is the sum of the epoch lengths. ∎

A.2 Example: the advantage of adaptivity

This section shows an example, originally provided in [DHS11], where the constant of adaptivity for full-matrix AdaGrad is much smaller than 1, so that adaptive regularization methods have a significant advantage over SGD. Consider the setting where, in each iteration, we receive a training example z_t with label y_t and suffer the hinge loss max{0, 1 − y_t⟨z_t, x⟩}. Let V be an orthonormal matrix and let v_1, ..., v_d denote its columns; for a suitably large domain, the sum of the columns is an optimum. Suppose that, for a fixed i, we receive a block of examples aligned with v_i before moving on to v_{i+1}.

We show that in this case μ is much smaller than 1. We initialize x_1 = 0. After the first iteration, AdaGrad moves directly along v_1, and we have zero loss until the algorithm sees an example aligned with v_2. Since v_1 and v_2 are orthogonal, the gradient history accumulated along v_1 does not dampen the step along v_2. Similarly, AdaGrad suffers only a constant loss in each dimension, and its total loss is independent of the number of iterations.