a learner makes predictions in the form of a vector $x_t$ belonging to a convex domain $\mathcal{K} \subseteq \mathbb{R}^d$ for $T$ rounds. After predicting $x_t \in \mathcal{K}$ on round $t$, a convex function $f_t$ is revealed to the learner, potentially in an adversarial or adaptive way, based on the learner’s past predictions. The learner then endures a loss $f_t(x_t)$ and also receives its gradient $\nabla f_t(x_t)$ as feedback. (Footnote 1: Our analysis is applicable, with minor changes, to non-differentiable convex functions with subgradients as feedback.)
The goal of the learner is to achieve low cumulative loss relative to any fixed vector in the domain $\mathcal{K}$, a quantity coined regret. Formally, the learner attempts to cap the quantity $R_T = \sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} f_t(x^\star)$ for every fixed comparator $x^\star \in \mathcal{K}$.
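As a small worked illustration of the regret quantity, the following sketch computes the learner's cumulative loss minus that of a fixed comparator. The linear losses, the predictions, and the comparator are hypothetical toy values chosen only for this example, and the helper name `regret` is not from the paper.

```python
import numpy as np

def regret(loss_fns, predictions, comparator):
    """Cumulative loss of the learner minus that of a fixed comparator."""
    learner = sum(f(x) for f, x in zip(loss_fns, predictions))
    fixed = sum(f(comparator) for f in loss_fns)
    return learner - fixed

# Hypothetical toy instance: linear losses f_t(x) = <g_t, x> over three rounds.
gs = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([-1.0, 0.0])]
loss_fns = [lambda x, g=g: float(g @ x) for g in gs]
predictions = [np.zeros(2), np.array([-0.5, 0.0]), np.array([-0.5, -0.5])]
comparator = np.array([0.0, -1.0])

# Learner pays 0 + 0 + 0.5, the comparator pays 0 - 1 + 0.
toy_regret = regret(loss_fns, predictions, comparator)
```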
Online Convex Optimization has proven useful in the context of stochastic convex optimization, and numerous algorithms in this domain can be seen and analyzed as online optimization methods; we again refer to (Hazan, 2016) for a thorough survey of many of these algorithms. Any online algorithm achieving a sublinear regret can be readily converted to a stochastic convex optimization algorithm with convergence rate $O(R_T / T)$, where $R_T$ denotes the regret after $T$ rounds, using a standard technique called online-to-batch conversion (Cesa-Bianchi et al., 2004).
The online approach is particularly effective for the analysis of adaptive optimization methods, namely, algorithms that change the nature of their update rule on-the-fly so as to adapt to the geometry of the observed data (i.e., perceived gradients). The update rule of such algorithms often takes the form $x_{t+1} = x_t - H_t^{-1} g_t$, where $g_t$ is a (possibly stochastic) gradient of the function $f_t$ evaluated at $x_t$, and $H_t$ is a regularization matrix, or a preconditioner, used to skew the gradient step in a desirable way. Importantly, the matrix $H_t$ may be chosen in an adaptive way based on past gradients, and might even depend on the gradient $g_t$ of the same step. The online optimization apparatus, in which the objective functions may vary almost arbitrarily, is very effective in dealing with these intricate dependencies. For a recent survey on adaptive methods in online learning and their analysis techniques see (McMahan, 2014).
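To make the role of the preconditioner concrete, here is a minimal sketch of a single preconditioned gradient step, assuming the update has the form $x_{t+1} = x_t - H_t^{-1} g_t$ with a symmetric positive definite $H_t$. The function name and the numbers are hypothetical and serve only as an illustration.

```python
import numpy as np

def preconditioned_step(x, g, H):
    """Return x - H^{-1} g for a symmetric positive definite H."""
    # Solve H d = g rather than forming the inverse explicitly.
    return x - np.linalg.solve(H, g)

x = np.array([1.0, 1.0])
g = np.array([2.0, 0.5])
H = np.diag([4.0, 1.0])   # a larger entry shrinks the step in that coordinate
x_next = preconditioned_step(x, g, H)
```

With this diagonal choice, the first coordinate's gradient is divided by 4 and the second is left unchanged, so the two coordinates end up moving by the same amount despite very different gradient magnitudes.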
One of the well-known adaptive online algorithms is AdaGrad (Duchi et al., 2011), which is commonly used in machine learning for training sparse linear models. AdaGrad also became popular for training deep neural networks. Intuitively, AdaGrad employs an adaptive regularization for maintaining a step-size on a per-coordinate basis, and can thus perform aggressive updates on informative yet rarely seen features (a similar approach was also taken by McMahan and Streeter, 2010). Another adaptive algorithm known in the online learning literature is the Online Newton Step (ONS) algorithm (Hazan et al., 2007). ONS incorporates an adaptive regularization technique for exploiting directional (non-isotropic) curvature of the objective function. While these adaptive regularization algorithms appear similar to each other, their derivation and analysis are disparate and technically involved. Furthermore, it is often difficult to gain insights into the specific choices of the matrices used for regularization and what role they play in the analysis of the resulting algorithms.
In this paper, we present a general framework from which adaptive algorithms such as AdaGrad and ONS can be derived using a streamlined scheme. Our framework is parameterized by a potential function $\Phi$. Different choices of $\Phi$ give rise to concrete adaptive algorithms. Informally, after choosing a potential $\Phi$, the algorithm computes its regularization matrix, a preconditioner, for iterate $t$ by solving a minimization of the form $\min_{H \succ 0} \big\{ \sum_{s \le t} g_s^\top H^{-1} g_s + \Phi(H) \big\}$, where $g_s^\top H^{-1} g_s$ is the squared dual norm of the gradient $g_s$ with respect to $H$.
Thus the algorithm strikes a balance between the potential of $H$, $\Phi(H)$, and the quality of $H$ as a regularizer for controlling the norms of the gradients with respect to the observations thus far. Not only does this balance give a natural interpretation of the regularization used by common adaptive algorithms, it also makes their analysis rather simple: an adaptive regularization algorithm can be viewed as a follow-the-leader (FTL) algorithm that operates over the class of positive definite matrices. We can thus analyze adaptive regularization methods using simple and well-established FTL analyses.
Solving the minimization above over positive definite matrices is, in general, a non-trivial task. However, in certain cases we can obtain a closed form solution that gives rise to efficient algorithms. For instance, to obtain AdaGrad we pick the potential $\Phi(H) = \operatorname{tr}(H)$ and solve the minimization via elementary differentiation, which leads to regularizers of the form $H_t = \big(\sum_{s \le t} g_s g_s^\top\big)^{1/2}$. To obtain ONS we pick $\Phi(H) = \log\det(H)$, which yields $H_t = \sum_{s \le t} g_s g_s^\top$, which constitutes the ONS update.
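The two closed forms can be checked numerically. The sketch below assumes that the objective is $\operatorname{tr}(G H^{-1}) + \Phi(H)$ with $G$ the accumulated outer products of gradients, and that the potentials are $\operatorname{tr}(H)$ and $\log\det(H)$ up to scaling; these forms, the small ridge added to $G$, and all names are assumptions made for illustration. It verifies only that the gradient of the objective vanishes at the claimed solutions.

```python
import numpy as np

def matrix_power_sym(A, p):
    """Apply x -> x**p to the eigenvalues of a symmetric positive definite A."""
    w, U = np.linalg.eigh(A)
    return (U * w**p) @ U.T

rng = np.random.default_rng(0)
grads = rng.normal(size=(20, 3))
# Accumulated outer products; the small ridge is an assumption for invertibility.
G = grads.T @ grads + 1e-3 * np.eye(3)

# Gradient of tr(G H^{-1}) + Phi(H) is -H^{-1} G H^{-1} + grad Phi(H).
def grad_with_trace_potential(H):     # Phi(H) = tr(H), grad Phi(H) = I
    Hinv = np.linalg.inv(H)
    return -Hinv @ G @ Hinv + np.eye(3)

def grad_with_logdet_potential(H):    # Phi(H) = log det(H), grad Phi(H) = H^{-1}
    Hinv = np.linalg.inv(H)
    return -Hinv @ G @ Hinv + Hinv

H_sqrt = matrix_power_sym(G, 0.5)     # square-root (AdaGrad-style) regularizer
H_full = G                            # ONS-style regularizer

residual_sqrt = np.abs(grad_with_trace_potential(H_sqrt)).max()
residual_full = np.abs(grad_with_logdet_potential(H_full)).max()
```

Both residuals should be at the level of floating-point round-off, which is consistent with the square root arising from the trace potential and the plain sum arising from the log-determinant potential.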
For both AdaGrad and ONS, we also derive diagonal versions of the algorithms by constraining the minimization to diagonal positive definite matrices. We also show that by further constraining the minimization to positive multiples of the identity matrix, one can recover familiar matrix-free (scalar) online algorithms that adaptively tune their step-size parameter according to observed gradients. As in the case of full matrices, the resulting minimization over matrices can be solved in closed form and the analyses follow seamlessly from the choice of the potential. Lastly, we note that our analysis is stated for the mirror-descent family of algorithms; nevertheless, our approach can also be used to analyze dual-averaging-type algorithms, also referred to as follow-the-regularized-leader algorithms.
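We denote by $\mathcal{S}_+$ the positive definite cone, i.e., the set of all $d \times d$ positive definite matrices. We use $\operatorname{diag}(A)$ to denote the diagonal matrix whose diagonal coincides with the diagonal elements of $A$ and whose off-diagonal elements are $0$. The trace of the matrix $A$ is denoted $\operatorname{tr}(A)$. The element-wise inner product of matrices $A$ and $B$ is denoted $A \bullet B = \operatorname{tr}(A^\top B)$.
The spectral norm of a matrix $A$ is denoted $\|A\|_2 = \max_{x \neq 0} \|Ax\| / \|x\|$, where $\|x\|$ is the Euclidean norm of $x$. We denote by $\|x\|_H = \sqrt{x^\top H x}$ the norm of $x$ with respect to a positive definite matrix $H$. The dual norm of $\|\cdot\|_H$ is denoted $\|\cdot\|_H^*$ and is equal to $\|g\|_H^* = \sqrt{g^\top H^{-1} g}$. We denote by $\Pi_{\mathcal{K}}^{H}(x) = \arg\min_{y \in \mathcal{K}} \|x - y\|_H$ the projection of $x$ onto a bounded convex set $\mathcal{K}$ with respect to the norm induced by $H$. When $H = I$, the identity matrix, we omit the superscript and simply use $\Pi_{\mathcal{K}}$ to denote the typical Euclidean projection operator.
Given a symmetric matrix $A$ and a function $\phi : \mathbb{R} \to \mathbb{R}$, we define $\phi(A)$ as the matrix obtained by applying $\phi$ to the eigenvalues of $A$. Formally, let us rewrite $A$ using its spectral decomposition, $A = \sum_{i=1}^{d} \lambda_i u_i u_i^\top$, where $(\lambda_i, u_i)$ are $A$’s $i$’th eigenvalue and eigenvector respectively. Then, we define $\phi(A) = \sum_{i=1}^{d} \phi(\lambda_i) u_i u_i^\top$. The function $\phi$ is said to be operator monotone if $A \succeq B \succeq 0$ implies that $\phi(A) \succeq \phi(B)$. A classic result in matrix theory used in our analysis is the Löwner-Heinz Theorem (see, for instance, Theorem 2.6 in Carlen, 2010), which in particular asserts that the function $x \mapsto x^{\alpha}$ is operator monotone for any $\alpha \in [0, 1]$. (Interestingly, it is not the case for $\alpha > 1$.) We also use an elementary identity from matrix calculus to compute derivatives of matrix traces: $\nabla_A \operatorname{tr}(\phi(A)) = \phi'(A)$.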
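The definition of a matrix function through the spectral decomposition, and the contrast between the square root and the square with respect to operator monotonicity, can be illustrated numerically. This is a minimal sketch; the helper names and the two 2x2 matrices are hypothetical choices made for the example.

```python
import numpy as np

def apply_to_spectrum(A, f):
    """Apply a scalar function f to the eigenvalues of a symmetric matrix A."""
    w, U = np.linalg.eigh(A)          # A = U diag(w) U^T
    return (U * f(w)) @ U.T

def is_psd(M, tol=1e-9):
    """Check positive semidefiniteness up to a small numerical tolerance."""
    return bool(np.linalg.eigvalsh((M + M.T) / 2).min() >= -tol)

A = np.array([[1.0, 1.0], [1.0, 1.0]])
B = np.array([[2.0, 1.0], [1.0, 1.0]])   # B - A = [[1, 0], [0, 0]] is PSD

# The clip guards against tiny negative eigenvalues caused by round-off.
sqrt = lambda w: np.sqrt(np.clip(w, 0.0, None))
square = lambda w: w**2

order_kept_by_sqrt = is_psd(apply_to_spectrum(B, sqrt) - apply_to_spectrum(A, sqrt))
order_kept_by_square = is_psd(apply_to_spectrum(B, square) - apply_to_spectrum(A, square))
```

Here the square root preserves the ordering between `A` and `B`, while squaring does not: the difference of the squares is `[[3, 1], [1, 0]]`, which has a negative determinant.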
2 Unified Adaptive Regularization
In this section we describe and analyze the meta-algorithm for Adaptive Regularization (AdaReg). The pseudocode of the algorithm is given in Algorithm 1. AdaReg constructs a succession of matrices , each of which is applied to the corresponding instantaneous gradient . The matrices act as pre-conditioners which reshape the gradient-based directions. In order to construct the pre-conditioners, AdaReg is provided with a potential function $\Phi$ over a subset of the positive definite matrices. On each round, the algorithm casts a trade-off involving two terms. The first term promotes pre-conditioners which are inversely proportional to the accumulated outer products of gradients, namely,
The second term “pulls” back towards, typically, the zero matrix and is facilitated by $\Phi$. We define the initial regularizer .
We now state the main regret bound we prove for Algorithm 1, from which all the results in this paper are derived.
For any it holds that
That is, the regret of the algorithm is controlled by the magnitude of the gradients measured by a norm which is, in some sense, the best possible in hindsight: it is the one that minimizes the sum of the gradients’ norms plus a regularization term. The regularization term, which stems from the choice of the potential function $\Phi$, facilitates an explicit trade-off in the resulting regret bound between minimizing the gradients’ norms with respect to $H$ and controlling the magnitude of $\Phi(H)$. The second summation term in the regret bound measures the stability of the algorithm in choosing its regularization matrices: an algorithm that changes the matrices frequently and abruptly is thus unlikely to perform well.
To prove Theorem 1, we rely on two standard tools in online optimization. The first is the Follow-the-Leader / Be-the-Leader (FTL-BTL) lemma.
Lemma 2 (FTL-BTL Lemma, Kalai and Vempala, 2005).
Let be an arbitrary sequence of functions defined over a domain . For , let , then,
(The $f_0$ term is often used as regularization.)
The second tool is a standard bound for the Online Mirror Descent (OMD) algorithm, that allows for a different mirror map on each step (e.g., Duchi et al., 2011). The version of this algorithm relevant in the context of this paper starts from an arbitrary initialization and makes updates of the form,
For any , and , if are provided according to Eq. 5, the following bound holds,
For completeness, the proofs of both lemmas are given in Appendix A. We now proceed with a short proof of the theorem.
Proof of Theorem 1.
From the convexity of , it follows that . We thus get,
Hence, to obtain the claim from Lemma 3 we need to show that
To this end, define functions by setting , and
for . Then, by definition, is a minimizer of over matrices . Lemma 2 for the functions now yields
Expanding the expressions for the , we get
2.1 Spectral regularization
As we show in the sequel, the potential will often have the form , where is a (scalar) monotonically increasing function with a positive first derivative. We call this a spectral potential. In this case , and further, if then Item 2 of the algorithm becomes
Hence, the derivation of concrete algorithms from the general framework becomes extremely simple for spectral potentials and amounts to a simple transformation of the eigenvalues of the matrix . Furthermore, as we discuss below, spectral potentials make the derivation of simplified diagonal (and scalar) versions of the algorithms a straightforward task.
2.2 Diagonal regularization
To obtain a diagonal version of Algorithm 1, i.e., a version in which the maintained matrices are restricted to be diagonal, the only modification required in the algorithm is to set to be the set of all positive definite diagonal matrices, denoted . Specifically, when is a spectral potential and , then Item 2 of the algorithm becomes
Indeed, for a diagonal we have , and the minimizer of the latter over all positive definite matrices, according to Eq. 6, is the matrix . Since the latter is a diagonal matrix, it is also the minimizer of over all diagonal positive definite matrices. Consequently, diagonal versions of adaptive algorithms are obtained by replacing the full matrix in Algorithm 1 with its diagonal counterpart. In Sections 3.2 and 4.2 below, we spell out how this is accomplished for the AdaGrad and ONS algorithms.
2.3 Isotropic regularization
To obtain the corresponding scalar versions of Algorithm 1 (namely, analogous algorithms that only adaptively maintain a single scalar step-size), we can modify the algorithm so that is optimized over the set of all positive multiples of the identity matrix. If we let be a spectral potential and let , then the update in Item 2 of the algorithm is equivalent to
To see this, note that for we have
Since the minimizer of the latter over all positive definite matrices is , it is also the minimizer over . (Footnote 2: We note that we could have arrived at the same result by choosing the potential and minimizing over . However, notice that this is not a spectral potential and its analysis is more technically involved.)
3 AdaReg ⇒ AdaGrad
We now derive AdaGrad (Duchi et al., 2011) from the AdaReg meta-algorithm. We first describe how to obtain the full-matrix version of the algorithm. In Section 3.2 we provide the derivation of AdaGrad’s diagonal version. Finally, in Section 3.3 we show that the well-studied adaptive version of online gradient descent can be viewed, and derived based on our framework, as a scalar version of AdaGrad. The three versions employ a potential parameterized by ,
and differ by the domain of admissible matrices . Since is a spectral potential, as we can rewrite, for . Simple calculus yields that , which in turn gives that
3.1 Full-matrix AdaGrad
AdaGrad employs the following update on each iteration,
where , for all , and is the step-size parameter. In the analysis below, we only assume that the domain is bounded and its Euclidean diameter is bounded by .
To obtain AdaGrad from Algorithm 1, we choose the potential function over the domain and set . The values of the parameters and are determined in the sequel. According to Eq. 10, the norm-regularization matrices used by Algorithm 1 are indeed , the same used by AdaGrad. Note that for projecting back to the domain , we can use a projection with respect to the norm rather than . Since the two norms only differ by a scale, this difference has no effect on the projection step. We now invoke Theorem 1 and bound the second term of the bound in Eq. 3,
We bound the left term using for a matrix and a vector . Setting and recalling that the diameter of is bounded by , we get
To bound the right term above we can use the same technique. We need to show though that for all . Indeed, the difference is PSD since and is operator monotone. We thus get,
Combining the two bounds we get,
Since , together with and a choice of , we have
for any . Note that can be taken arbitrarily small.
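The full-matrix update can be sketched as follows. This is a simplified illustration rather than the analyzed algorithm: it assumes the regularizer is the square root of $\epsilon I$ plus the accumulated outer products of gradients, it omits the projection onto the domain (the unconstrained case), and the function name, the toy quadratic, and the parameter values are hypothetical.

```python
import numpy as np

def full_matrix_adagrad_step(x, g, G, eta):
    """One unconstrained step: accumulate g g^T, then move along G^{-1/2} g."""
    G = G + np.outer(g, g)
    w, U = np.linalg.eigh(G)                    # G = U diag(w) U^T
    direction = U @ ((U.T @ g) / np.sqrt(w))    # G^{-1/2} g
    return x - eta * direction, G

# Hypothetical toy problem: gradients of the quadratic 0.5 * x^T A x.
A = np.diag([100.0, 1.0])
eta, eps = 0.5, 1e-6
x, G = np.array([1.0, 1.0]), eps * np.eye(2)
step_norms = []
for _ in range(50):
    x_new, G = full_matrix_adagrad_step(x, A @ x, G, eta)
    step_norms.append(float(np.linalg.norm(x - x_new)))
    x = x_new
```

Because the accumulated matrix always dominates the latest outer product, the Euclidean length of every step is at most `eta`, and the very first step has length essentially equal to `eta` regardless of the gradient's scale.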
3.2 Diagonal AdaGrad
Duchi et al. (2011) presented a diagonal version of AdaGrad that uses faster updates based on diagonal regularization matrices,
where and for all . Following (Duchi et al., 2011), in the analysis of the diagonal algorithm we will assume a bound on the diameter of with respect to the -norm, which we denote by .
In order to obtain the diagonal version of AdaGrad we choose the same potential , but optimize over a domain restricted to diagonal positive definite matrices. From Eq. 7 and Eq. 10, the induced regularization matrices are , which recovers the diagonal version of AdaGrad.
Invoking Theorem 1 and repeating the arguments for Full-matrix AdaGrad with replacing , we obtain that
where the inequality uses the fact that for a diagonal positive semidefinite matrix , it holds that for any vector . Furthermore, for the second sum in Eq. 3 we have
where we used the elementary yet constructive fact that the support of non-zeros of the product is the same as . Overall, with the choice of we obtain the regret bound
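The diagonal update amounts to a separate step-size per coordinate. The sketch below is a minimal unconstrained version (no projection), assuming the per-coordinate regularizer is the square root of the accumulated squared gradients; the function name, the tiny initial value `eps`, and the toy gradient sequence are hypothetical.

```python
import numpy as np

def diagonal_adagrad_step(x, g, s, eta):
    """One unconstrained step with a per-coordinate step-size eta / sqrt(s)."""
    s = s + g * g                     # accumulate squared gradients per coordinate
    return x - eta * g / np.sqrt(s), s

eta, eps = 0.1, 1e-12
x, s = np.zeros(2), eps * np.ones(2)  # eps avoids division by zero (an assumption)
for t in range(1, 11):
    # Coordinate 0 is active on every round; coordinate 1 only on round 10.
    g = np.array([1.0, 1.0 if t == 10 else 0.0])
    x_new, s = diagonal_adagrad_step(x, g, s, eta)
    last_step = x - x_new
    x = x_new
```

On the last round the rarely seen coordinate moves by roughly `eta`, while the frequently seen one moves by only `eta / sqrt(10)`, which reflects the aggressive updates on rare features mentioned in the introduction.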
3.3 Isotropic AdaGrad: Adaptive Gradient Descent
A classic adaptive version of the Online Gradient Descent (OGD) algorithm uses standard projected gradient updates of the form
with the decreasing step-size policy for an appropriate constant . For simplicity, we make the mild assumption that to avoid degenerate cases.
We now show how the adaptive OGD algorithm is obtained from our framework as a scalar version of AdaGrad. To establish this, consider the potential and fix the domain to be the set of all positive multiples of the identity matrix. Let us also set . Recalling Eq. 8, the resulting regularization matrices used by Algorithm 1 are . By setting we get,
recovering the adaptive OGD algorithm. In order to obtain a regret bound for the Isotropic AdaGrad algorithm, we can apply Theorem 1 and repeat the arguments for Full-matrix AdaGrad. First, we have
We also use the following,
Overall, with the choice of we obtain the regret bound
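A minimal sketch of adaptive projected gradient descent is given below, assuming the domain is a Euclidean ball and the step-size is a constant divided by the square root of the accumulated squared gradient norms. The constant `c`, the radius, and the function names are hypothetical placeholders and are not the constants used in the analysis; the first gradient is assumed to be nonzero.

```python
import numpy as np

def project_onto_ball(x, radius):
    """Euclidean projection onto the ball of the given radius."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def adaptive_ogd(grads, x0, radius, c):
    """Projected gradient steps with step-size c / sqrt(sum of ||g_s||^2)."""
    x = project_onto_ball(np.array(x0, dtype=float), radius)
    sq_norm_sum = 0.0
    iterates = [x]
    for g in grads:
        sq_norm_sum += float(g @ g)          # assumes the first gradient is nonzero
        eta_t = c / np.sqrt(sq_norm_sum)     # decreasing, data-dependent step-size
        x = project_onto_ball(x - eta_t * g, radius)
        iterates.append(x)
    return iterates

grads = [np.array([1.0, 0.0])] * 4
iterates = adaptive_ogd(grads, x0=[0.0, 0.0], radius=1.0, c=0.5)
```

With a constant gradient, the step-sizes shrink as 0.5, 0.5/sqrt(2), 0.5/sqrt(3), and the iterate reaches the boundary of the ball on the third round, after which the projection keeps it there.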
3.4 A $p$-norm extension of AdaGrad
We conclude this section with a simple spectral extension of AdaGrad which regularizes according to the $p$-norm of the spectral coefficients. To do so, let us choose
We then have, where , and therefore is a spectral potential. Elementary calculus yields , which in turn gives
We can use Theorem 1 as before to obtain the following regret bound,
We now set
Setting yields the AdaGrad update with the same regret bound obtained above. Moreover, the choice of provides the best regret bound among all choices for . To see that, let us denote the eigenvalues of by . The product of the two traces in Eq. 12 amounts to
by the Cauchy-Schwarz inequality. (Footnote 3: We note, however, that the worst case analysis does not necessarily transfer to actual performance on real problems, and choosing a $p$-norm regularization with may prove useful in practice.)
4 AdaReg ⇒ Online Newton Step
where as before and for all . Here again is a fixed step-size. Throughout this section, we assume that the are -Lipschitz, namely, for and the domain’s diameter, , both with respect to the Euclidean norm.
4.1 Full-matrix ONS
Let us first describe how the full-matrix version of ONS is derived through a specific choice for potential, . We assume that the cost functions are -exp-concave over the domain , with the following minor abuse of terminology. Concretely, we assume that for all and ,
To obtain the ONS update, we use the potential,
over and choose a fixed for some . Since is equal to where , is a spectral potential. We get that
Therefore, the pre-conditioning matrices induced by the potential in Algorithm 1 are , which gives rise to the update rule used by ONS.
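For illustration, a single ONS-style step can be sketched as below. The sketch assumes the pre-conditioner is the accumulated sum of gradient outer products on top of an initial multiple of the identity, and it omits the projection onto the domain; the function name and the numbers are hypothetical.

```python
import numpy as np

def ons_step(x, g, G, eta):
    """One unconstrained step: accumulate g g^T, then move along G^{-1} g."""
    G = G + np.outer(g, g)
    # Solve G d = g rather than forming the inverse explicitly.
    return x - eta * np.linalg.solve(G, g), G

x0 = np.zeros(2)
G0 = np.eye(2)                        # initial regularizer (an assumed choice)
g = np.array([2.0, 0.0])
x1, G1 = ons_step(x0, g, G0, eta=1.0)
```

After the step, `G1` equals `diag(5, 1)`, so the update along the first coordinate is `2 / 5 = 0.4` and the iterate becomes `[-0.4, 0]`. Compared with the square-root pre-conditioner of AdaGrad, the full accumulated matrix damps directions with large past gradients more strongly.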
since . In addition we have, . Let us denote the eigenvalues of by . Then, the eigenvalues of are and is bounded as follows,
The inequality above stems from the fact that the eigenvalues of are upper bounded by , since for all due to the -Lipschitz assumption. In order to put everything together, let us define and apply Theorem 1 (Eq. 3) to ; we obtain
(From Eq. 13)