In this work we study stochastic algorithms for solving regularized empirical risk minimization problems, i.e., optimization problems of the form
$$\min_{x \in \mathbb{R}^d} \left[ P(x) := f(x) + \psi(x) \right], \qquad f(x) := \frac{1}{n}\sum_{i=1}^n f_i(x). \tag{1}$$
We assume that the functions $f_i : \mathbb{R}^d \to \mathbb{R}$ are smooth and convex, and $\psi : \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$ is a proper, closed and convex regularizer admitting a cheap proximal operator. We write $[n] := \{1, 2, \dots, n\}$.
Proximal gradient descent. A baseline method for solving problem (1) is (proximal) gradient descent (GD). This method performs a gradient step in $f$, followed by a proximal step (the proximal operator is defined via $\mathrm{prox}_{\alpha\psi}(x) := \arg\min_{y \in \mathbb{R}^d} \left\{ \alpha\psi(y) + \tfrac{1}{2}\|y - x\|^2 \right\}$) in $\psi$, i.e.,
$$x^{k+1} = \mathrm{prox}_{\alpha\psi}\left(x^k - \alpha \nabla f(x^k)\right), \tag{2}$$
where $\alpha > 0$ is a stepsize. GD performs well when both $n$ and $d$ are not too large. However, in the big data (large $n$) and/or big parameter (large $d$) case, the formation of the gradient $\nabla f(x^k)$ becomes overly expensive, rendering GD inefficient in both theory and practice. A typical remedy is to replace the gradient by a cheap-to-compute random approximation. That is, one replaces $\nabla f(x^k)$ with a random vector $g^k$ whose mean is the gradient: $\mathbb{E}[g^k] = \nabla f(x^k)$, i.e., with a stochastic gradient. This results in the (proximal) stochastic gradient descent (SGD) method:
$$x^{k+1} = \mathrm{prox}_{\alpha\psi}\left(x^k - \alpha g^k\right). \tag{3}$$
Below we comment on the typical approaches to constructing $g^k$ in the big $n$ and big $d$ regimes.
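As a concrete illustration of the proximal gradient iteration described above, here is a minimal NumPy sketch for an $\ell_1$ regularizer $\psi(x) = \lambda\|x\|_1$, whose proximal operator is soft-thresholding. The quadratic loss, the data `A`, `b`, the weight `lam`, and the stepsize are our own toy assumptions, not taken from the paper:

```python
import numpy as np

def prox_l1(x, t):
    """Proximal operator of t*||.||_1 (soft-thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def proximal_gd(grad_f, x0, alpha, steps, lam):
    """Proximal GD: x <- prox_{alpha*psi}(x - alpha*grad f(x)), psi = lam*||.||_1."""
    x = x0.copy()
    for _ in range(steps):
        x = prox_l1(x - alpha * grad_f(x), alpha * lam)
    return x

# Toy instance: f(x) = (1/n)*0.5*||A x - b||^2, psi = lam*||x||_1 (our assumption).
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
n = A.shape[0]
grad_f = lambda x: A.T @ (A @ x - b) / n
x = proximal_gd(grad_f, np.zeros(5), alpha=0.3, steps=500, lam=0.1)
```

With this stepsize the iterates settle at a fixed point of the prox-gradient map, which characterizes a solution of the composite problem.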
Proximal SGD. In the big $n$ regime, the simplest choice is to set
$$g^k = \nabla f_i(x^k) \tag{4}$$
for an index $i \in [n]$ chosen uniformly at random. By construction, it is $n$ times cheaper to compute this estimator than the gradient $\nabla f(x^k)$, which is a key driving force behind the efficiency of this variant of SGD. However, there is an infinite array of other possibilities for constructing an unbiased estimator needell2016batched ; gower2019sgd . Depending on how $g^k$ is formed, (3) specializes to one of the many existing variants of proximal SGD, each with different convergence properties and proofs.
Proximal RCD. In the big $d$ regime (this is interesting even if $n = 1$), the simplest choice is to set
$$g^k = d \left\langle \nabla f(x^k), e_j \right\rangle e_j, \tag{5}$$
where $\langle \cdot, \cdot \rangle$ is the standard Euclidean inner product, $e_j$ is the $j$-th standard unit basis vector in $\mathbb{R}^d$, and $j \in [d]$ is chosen uniformly at random. With this estimator, (3) specializes to (proximal) randomized coordinate descent (RCD). There are situations where it is $d$ times cheaper to compute the partial derivative $\langle \nabla f(x^k), e_j \rangle$ than the gradient, which is a key driving force behind the efficiency of RCD RCDM . However, there is an infinite array of other possibilities for constructing an unbiased estimator of the gradient in a similar way NSync ; PCDM ; ESO .
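To make the two estimators concrete, here is a minimal NumPy sketch that builds (4) and (5) and verifies unbiasedness by exact averaging over the uniform index. The quadratic losses $f_i$, the data `A`, `b`, and the dimensions are our own toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 8, 4
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

# Toy losses (our assumption): f_i(x) = 0.5*(a_i^T x - b_i)^2.
grad_fi = lambda x, i: (A[i] @ x - b[i]) * A[i]
grad_f  = lambda x: A.T @ (A @ x - b) / n

def sgd_estimator(x, i):
    """Estimator (4): gradient of a single sampled f_i (i uniform on [n])."""
    return grad_fi(x, i)

def rcd_estimator(x, j):
    """Estimator (5): d * (j-th partial derivative of f) * e_j (j uniform on [d])."""
    g = np.zeros(d)
    g[j] = d * grad_f(x)[j]
    return g

x = rng.standard_normal(d)
# Exact expectations over the uniform index confirm unbiasedness.
mean_sgd = np.mean([sgd_estimator(x, i) for i in range(n)], axis=0)
mean_rcd = np.mean([rcd_estimator(x, j) for j in range(d)], axis=0)
```

Both averages reproduce $\nabla f(x)$ exactly, which is the unbiasedness property $\mathbb{E}[g^k] = \nabla f(x^k)$ used throughout.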
Issues. For the sake of argument in the rest of this section, assume that $f$ is a $\mu$-strongly convex function, and let $x^*$ be the (necessarily unique) solution of (1). It is well known that in this case, method (3) with estimator $g^k$ defined as in (4) does not in general converge to $x^*$. Instead, SGD converges linearly to a neighborhood of $x^*$ whose size is proportional to the stepsize $\alpha$ and the noise $\sigma^2 := \frac{1}{n}\sum_{i=1}^n \|\nabla f_i(x^*)\|^2$, and inversely proportional to $\mu$ moulines2011non ; NeedellWard2015 . In the generic regime with $\sigma^2 > 0$, the neighborhood is nonzero, causing issues with convergence. This situation does not change even when tricks such as mini-batching or importance sampling (or a combination of both) are applied needell2014stochastic ; needell2016batched ; gower2019sgd . While these tricks affect both the (linear convergence) rate and the size of the neighborhood, they are incapable (except, of course, in the special case when one uses the full-batch approximation $g^k = \nabla f(x^k)$) of ensuring convergence to the solution. However, a remedy does exist: the non-convergence can be resolved by using one of the many variance-reduction strategies for constructing $g^k$ developed over the last several years SAG ; SAGA ; johnson2013 ; mairal2013optimization ; SDCA . Further, while it is well known that method (3) with estimator $g^k$ defined as in (5) (i.e., randomized coordinate descent) converges to $x^*$ when $\psi \equiv 0$ RCDM ; UCDC ; NSync , it is also known that it does not generally converge to $x^*$ unless the regularizer $\psi$ is separable (e.g., $\psi(x) = \|x\|_1$ or $\psi(x) = \|x\|_2^2$). In hanzely2018sega , an alternative estimator (known as SEGA) was constructed from the same (random) partial derivative information $\langle \nabla f(x^k), e_j \rangle$, one that does not suffer from this incompatibility with general regularizers $\psi$. This work resolved a long-standing open problem in the theory of RCD methods.
Having experienced a “Cambrian explosion” in the last 10 years, the world of efficient SGD methods is remarkably complex. There is a large and growing set of rules for constructing the gradient estimators $g^k$, with differing levels of sophistication and varying theoretical and practical properties. It includes the classical estimator (4), as well as an infinite array of mini-batch Li2014 and importance sampling NeedellWard2015 ; IProx-SDCA variants, and a growing list of variance-reduced variants SAGA . Furthermore, there are estimators of the coordinate descent variety, including the simplest one based on (5) RCDM , more elaborate variants utilizing the arbitrary sampling paradigm ALPHA , and variance-reduced methods capable of handling general non-separable regularizers hanzely2018sega .
New general method and a single convergence theorem. In this paper we propose a general method—which we call GJS—which reduces to many of the aforementioned classical, as well as several recently developed, SGD-type methods in special cases. We provide a single convergence theorem, establishing a linear convergence rate for GJS, assuming $f$ to be smooth and quasi-strongly convex. In particular, we obtain the following methods in special cases, or their generalizations, always recovering the best-known convergence guarantees or improving upon them: SAGA SAGA ; qian2019saga ; gazagnadou2019optimal , JacSketch gower2018stochastic , LSVRG hofmann2015variance ; LSVRG , SEGA hanzely2018sega , and ISEGA mishchenko201999 (see Table 1, in which we list 17 special cases). This is the first time such a direct connection is made between many of these methods, which previously required different intuitions and dedicated analyses. Our general method, and hence also all the special cases we consider, can work with a regularizer $\psi$. This provides novel (although not hard) results for some methods, such as LSVRG.
First unification of SGD and RCD. As a by-product of the generality of GJS, we obtain the first unification of variance-reduced SGD and RCD methods. To the best of our knowledge, there is no algorithm besides GJS whose complexity is captured by a single theorem and which specializes to SGD-type and RCD-type methods at the same time.
Generalizations to arbitrary sampling. Many of the specialized methods we develop are cast in a very general arbitrary sampling paradigm NSync ; quartz ; ALPHA , which allows the estimator $g^k$ to be formed through information contained in a random subset of $[n]$ (by computing $\nabla f_i(x^k)$ for $i$ in the subset) or a random subset of $[d]$ (by computing the partial derivatives $\langle \nabla f(x^k), e_j \rangle$ for $j$ in the subset), where these subsets are allowed to follow an arbitrary distribution. In particular, we extend SEGA hanzely2018sega , LSVRG hofmann2015variance ; LSVRG and ISEGA mishchenko201999 to this setup. Likewise, GJS specializes to an arbitrary sampling extension of the SGD-type method SAGA SAGA ; qian2019saga , obtaining state-of-the-art rates. As a special case of the arbitrary sampling paradigm, we obtain importance sampling versions of all the mentioned methods.
New methods. GJS can be specialized to many new methods. To illustrate this, we construct 10 new methods as special cases, some with intriguing structure and properties (see Section 6 and Table 1; Table 2 summarizes the rates).
Relation to JacSketch. Our method can be seen as a vast generalization of the recently proposed Jacobian sketching method JacSketch gower2018stochastic in several directions, notably by enabling arbitrary randomized linear (i.e., sketching) operators, extending the analysis to the proximal case, and replacing the strong convexity assumption with quasi-strong convexity.
Surprising connections. Lastly, we show a surprising link between SEGA and SAGA. In particular, SAGA is a special case of SEGA; and the novel rate we obtain for SEGA recovers the tight complexity of SAGA qian2019saga ; gazagnadou2019optimal (see Appendix R). Similarly, we recover tight rates for LSVRG using a result for SVRCD.
Limitations. We focus on developing methods capable of enjoying a linear convergence rate with a fixed stepsize and do not consider the non-convex setting. Although there exist several accelerated variance reduced algorithms lan2018optimal ; allen2017katyusha ; pmlr-v80-zhou18c ; pmlr-v89-zhou19c ; LSVRG ; kulunchakov2019estimate , we do not consider such methods here.
Notation. Let $e$ be the vector of all ones (in $\mathbb{R}^n$ or $\mathbb{R}^d$, as the context dictates), and let $e_i$ (resp. $e_j$) be the $i$-th (resp. $j$-th) unit basis vector in $\mathbb{R}^n$ (resp. $\mathbb{R}^d$). By $\|\cdot\|$ we denote the standard Euclidean norm in $\mathbb{R}^n$ and $\mathbb{R}^d$. Matrices are denoted by upper-case bold letters. Given $\mathbf{X}, \mathbf{Y} \in \mathbb{R}^{d \times n}$, let $\langle \mathbf{X}, \mathbf{Y} \rangle := \mathrm{Tr}\left(\mathbf{X}^\top \mathbf{Y}\right)$, and let $\|\mathbf{X}\| := \langle \mathbf{X}, \mathbf{X} \rangle^{1/2}$ be the Frobenius norm. By $\mathbf{X}_{:i}$ (resp. $\mathbf{X}_{j:}$) we denote the $i$-th column (resp. $j$-th row) of matrix $\mathbf{X}$. By $\mathbf{I}_n$ (resp. $\mathbf{I}_d$) we denote the $n \times n$ (resp. $d \times d$) identity matrix. Upper-case calligraphic letters, such as $\mathcal{S}$, $\mathcal{U}$ and $\mathcal{M}$, are used to denote (deterministic or random) linear operators mapping $\mathbb{R}^{d \times n}$ to $\mathbb{R}^{d \times n}$. The most frequently used notation is summarized in Table 3 in Appendix C.
A key object in this paper is the Jacobian matrix
$$\mathbf{G}(x) := \left[\nabla f_1(x), \dots, \nabla f_n(x)\right] \in \mathbb{R}^{d \times n}.$$
Note that $\nabla f(x) = \frac{1}{n}\mathbf{G}(x) e$.
Extending the insights from gower2018stochastic , one of the key observations of this work is that random linear transformations (sketches) of the Jacobian $\mathbf{G}(x^k)$ can be used to construct unbiased estimators of the gradient of $f$. For instance, the sketch $\mathbf{X} \mapsto n\mathbf{X} e_i e_i^\top$ applied to $\mathbf{G}(x^k)$ leads (after multiplication by $\frac{1}{n}e$) to the simple SGD estimator (4), and $\mathbf{X} \mapsto d\, e_j e_j^\top \mathbf{X}$ gives the simple RCD estimator (5). We will consider more elaborate examples later on. It will be useful to embed these estimators into $\mathbb{R}^{d \times n}$. For instance, instead of the vector $\nabla f_i(x^k)$ we consider the matrix $\mathbf{G}(x^k) e_i e_i^\top$. Note that all columns of this matrix are zero, except for the $i$-th column, which is equal to $\nabla f_i(x^k)$. Similarly, instead of the vector $d \langle \nabla f(x^k), e_j \rangle e_j$ we will consider the matrix $d\, e_j e_j^\top \mathbf{G}(x^k)$. All rows of this matrix are zero, except for the $j$-th row, which consists of the $j$-th partial derivatives of the functions $f_i$ for $i \in [n]$, scaled by $d$.
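The Jacobian and the two matrix embeddings just described can be sketched in a few lines of NumPy. The quadratic losses and all data and dimensions below are our own toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 6, 3
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

# Toy losses (our assumption): f_i(x) = 0.5*(a_i^T x - b_i)^2, so the
# Jacobian G(x) has columns grad f_i(x) and shape (d, n).
def jacobian(x):
    return (A * (A @ x - b)[:, None]).T

x = rng.standard_normal(d)
G = jacobian(x)
e = np.ones(n)
grad = G @ e / n                 # grad f(x) = (1/n) G(x) e

i, j = 2, 1
Ei = np.outer(np.eye(n)[i], np.eye(n)[i])        # e_i e_i^T
Ej = np.outer(np.eye(d)[j], np.eye(d)[j])        # e_j e_j^T
col_embed = G @ Ei               # only column i nonzero, equal to grad f_i(x)
row_embed = d * Ej @ G           # only row j nonzero, the j-th partials scaled by d
```

The identity $\nabla f(x) = \frac{1}{n}\mathbf{G}(x)e$ and the zero patterns of both embeddings can be read off directly from the arrays.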
Random projections. Generalizing from these examples, we consider a random linear operator (“sketch”) $\mathcal{S} : \mathbb{R}^{d \times n} \to \mathbb{R}^{d \times n}$. By $\mathcal{S}^*$ we denote the adjoint of $\mathcal{S}$, i.e., the linear operator satisfying $\langle \mathcal{S}\mathbf{X}, \mathbf{Y} \rangle = \langle \mathbf{X}, \mathcal{S}^*\mathbf{Y} \rangle$ for all $\mathbf{X}, \mathbf{Y} \in \mathbb{R}^{d \times n}$. Given $\mathcal{S}$, we let $\mathcal{P}$ be the (random) projection operator onto $\mathrm{Range}(\mathcal{S}^*)$. That is,
$$\mathcal{P} := \mathcal{S}^* \left(\mathcal{S}\mathcal{S}^*\right)^{\dagger} \mathcal{S}, \tag{6}$$
where $\dagger$ denotes the Moore–Penrose pseudoinverse. The identity operator is denoted by $\mathcal{I}$. We say that a random operator $\mathcal{U}$ is identity in expectation, or unbiased, when $\mathbb{E}[\mathcal{U}] = \mathcal{I}$; i.e., when $\mathbb{E}[\mathcal{U}\mathbf{X}] = \mathbf{X}$ for all $\mathbf{X} \in \mathbb{R}^{d \times n}$.
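For the right sketch $\mathcal{S}\mathbf{X} = \mathbf{X}\mathbf{R}$ considered below, the projection onto $\mathrm{Range}(\mathcal{S}^*)$ reduces, by our own short calculation, to $\mathcal{P}\mathbf{X} = \mathbf{X}\mathbf{R}(\mathbf{R}^\top\mathbf{R})^{\dagger}\mathbf{R}^\top$. A minimal numerical check (the thin random $\mathbf{R}$ is our own choice for the demo):

```python
import numpy as np

rng = np.random.default_rng(3)
d, n, r = 4, 6, 2
R = rng.standard_normal((n, r))   # thin random right-sketch matrix (demo assumption)

S     = lambda X: X @ R                                   # sketch  S(X)  = X R
S_adj = lambda Y: Y @ R.T                                 # adjoint S*(Y) = Y R^T
P     = lambda X: X @ R @ np.linalg.pinv(R.T @ R) @ R.T   # S* (S S*)^+ S

X  = rng.standard_normal((d, n))
PX = P(X)
W  = rng.standard_normal((d, r))
lhs, rhs = np.sum(S(X) * W), np.sum(X * S_adj(W))         # adjoint identity
Y  = S_adj(rng.standard_normal((d, r)))                   # element of Range(S*)
```

The checks below confirm the defining properties: the adjoint identity, idempotence of $\mathcal{P}$, and that $\mathcal{P}$ fixes $\mathrm{Range}(\mathcal{S}^*)$.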
We will often consider the following sketching operators $\mathcal{S}$ (the algorithm we develop is, however, not limited to such sketches):
(i) Right sketch. Let $\mathbf{R} \in \mathbb{R}^{n \times n}$ be a random matrix. Define $\mathcal{S}$ by $\mathcal{S}\mathbf{X} := \mathbf{X}\mathbf{R}$ (“R-sketch”). Notice that $\mathcal{S}^*\mathbf{X} = \mathbf{X}\mathbf{R}^\top$. In particular, if $S$ is a random subset of $[n]$, we can define $\mathbf{R} = \sum_{i \in S} e_i e_i^\top$. The resulting operator (“R-sampling”) satisfies $\mathcal{S}\mathbf{X} = \sum_{i \in S} \mathbf{X}_{:i} e_i^\top$. If we let $p_i := \mathbb{P}(i \in S) > 0$, and instead define $\mathbf{R} = \sum_{i \in S} \frac{1}{p_i} e_i e_i^\top$, then $\mathbb{E}[\mathbf{R}] = \mathbf{I}_n$ and hence $\mathcal{S}$ is unbiased.

(ii) Left sketch. Let $\mathbf{L} \in \mathbb{R}^{d \times d}$ be a random matrix. Define $\mathcal{S}$ by $\mathcal{S}\mathbf{X} := \mathbf{L}\mathbf{X}$ (“L-sketch”). Notice that $\mathcal{S}^*\mathbf{X} = \mathbf{L}^\top\mathbf{X}$. In particular, if $L$ is a random subset of $[d]$, we can define $\mathbf{L} = \sum_{j \in L} e_j e_j^\top$. The resulting operator (“L-sampling”) satisfies $\mathcal{S}\mathbf{X} = \sum_{j \in L} e_j e_j^\top \mathbf{X}$. If we let $p_j := \mathbb{P}(j \in L) > 0$, and instead define $\mathbf{L} = \sum_{j \in L} \frac{1}{p_j} e_j e_j^\top$, then $\mathbb{E}[\mathbf{L}] = \mathbf{I}_d$ and hence $\mathcal{S}$ is unbiased.

(iii) Scaling/Bernoulli. Let $\xi$ be a Bernoulli random variable, i.e., $\xi = 1$ with probability $p$ and $\xi = 0$ with probability $1 - p$, where $p > 0$. Define $\mathcal{S}$ by $\mathcal{S}\mathbf{X} := \xi\mathbf{X}$ (“scaling”). Then $\mathcal{S}^* = \mathcal{S}$. If we instead define $\mathcal{S}\mathbf{X} := \frac{\xi}{p}\mathbf{X}$, then $\mathcal{S}$ is unbiased.

(iv) LR sketch. All the above operators can be combined. In particular, we can define $\mathcal{S}\mathbf{X} := \mathbf{L}\mathbf{X}\mathbf{R}$. All of the above arise as special cases of this: (i) arises for $\mathbf{L} = \mathbf{I}_d$ and random $\mathbf{R}$, (ii) for random $\mathbf{L}$ and $\mathbf{R} = \mathbf{I}_n$, and (iii) for $\mathbf{L} = \xi\mathbf{I}_d$ and $\mathbf{R} = \mathbf{I}_n$.
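A minimal NumPy rendering of the three operators and their unbiased (importance-weighted) variants may help; the single-index sampling scheme and the uniform probabilities below are our own simplifying assumptions. Unbiasedness is checked by exact enumeration rather than simulation:

```python
import numpy as np

rng = np.random.default_rng(4)
d, n = 3, 5
X = rng.standard_normal((d, n))

p = np.full(n, 1.0 / n)      # column sampling probabilities (our assumption)
q = np.full(d, 1.0 / d)      # row sampling probabilities (our assumption)

def r_sampling(X, S, p=None):
    """Keep columns in S (R-sampling); divide by p_i for the unbiased variant."""
    out = np.zeros_like(X)
    for i in S:
        out[:, i] = X[:, i] / (1.0 if p is None else p[i])
    return out

def l_sampling(X, L, q=None):
    """Keep rows in L (L-sampling); divide by q_j for the unbiased variant."""
    out = np.zeros_like(X)
    for j in L:
        out[j, :] = X[j, :] / (1.0 if q is None else q[j])
    return out

def scaling(X, xi, prob=None):
    """Bernoulli scaling; divide by prob for the unbiased variant."""
    return (xi / (1.0 if prob is None else prob)) * X

# Exact expectations over single-index sampling / the Bernoulli variable:
ER = sum(p[i] * r_sampling(X, [i], p) for i in range(n))
EL = sum(q[j] * l_sampling(X, [j], q) for j in range(d))
prob = 0.3
ES = prob * scaling(X, 1, prob) + (1 - prob) * scaling(X, 0, prob)
```

All three expectations equal `X`, i.e., the reweighted operators are identity in expectation.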
4 Generalized Jacobian Sketching (GJS)
We are now ready to describe our method (formalized as Algorithm 1).
Let $\mathcal{S}$ be a random linear operator (e.g., right sketch, left sketch, or scaling), and let $\mathcal{U}$ be an unbiased random linear operator, i.e., $\mathbb{E}[\mathcal{U}] = \mathcal{I}$. We propose to construct the gradient estimator as
$$g^k := \frac{1}{n}\mathbf{J}^k e + \frac{1}{n}\,\mathcal{U}\left(\mathbf{G}(x^k) - \mathbf{J}^k\right) e, \tag{7}$$
where the matrices $\mathbf{J}^k \in \mathbb{R}^{d \times n}$ are constructed iteratively. Note that, taking expectation in (7), we get
$$\mathbb{E}\left[g^k\right] = \frac{1}{n}\mathbf{J}^k e + \frac{1}{n}\left(\mathbf{G}(x^k) - \mathbf{J}^k\right) e = \frac{1}{n}\mathbf{G}(x^k) e = \nabla f(x^k), \tag{8}$$
and hence $g^k$ is indeed unbiased. We will construct $\mathbf{J}^k$ so that $\mathbf{J}^k \to \mathbf{G}(x^*)$. By doing so, the variance of $g^k$ decreases throughout the iterations, completely vanishing at $x^*$. The sequence $\{\mathbf{J}^k\}$ is updated as follows:
$$\mathbf{J}^{k+1} := \mathbf{J}^k + \mathcal{S}\left(\mathbf{G}(x^k) - \mathbf{J}^k\right). \tag{9}$$
That is, we sketch the Jacobian $\mathbf{G}(x^k)$, obtaining the sketch $\mathcal{S}\mathbf{G}(x^k)$, and seek to use this information to construct a new matrix $\mathbf{J}^{k+1}$ which is consistent with this sketch, and as close to $\mathbf{J}^k$ as possible. The intuition here is as follows: if we repeated the sketch-and-project process (9) for fixed $x^k$, the matrices $\mathbf{J}^k$ would converge to $\mathbf{G}(x^k)$ at a linear rate SIMAX2015 ; inverse . This process can be seen as SGD applied to a certain quadratic stochastic optimization problem ASDA ; gower2018stochastic . Instead, we take just one step of this iterative process, change $x^k$, and repeat. Note that the unbiased sketch $\mathcal{U}(\mathbf{G}(x^k) - \mathbf{J}^k)$ in (7) also claims access to $\mathbf{G}(x^k)$. Specific variants of GJS are obtained by choosing specific operators $\mathcal{S}$ and $\mathcal{U}$ (see Section 6).
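The following toy NumPy sketch instantiates (7) and (9) with $\mathcal{S}$ equal to single-column R-sampling and $\mathcal{U}$ equal to $n$ times the same sampling, which (with $\psi \equiv 0$, so the prox is the identity) yields a SAGA-style update. The quadratic losses, data, and stepsize rule are our own assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 30, 4
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

# Toy losses (our assumption): f_i(x) = 0.5*(a_i^T x - b_i)^2, so grad f_i
# is the i-th Jacobian column.
grad_fi = lambda x, i: (A[i] @ x - b[i]) * A[i]

x = np.zeros(d)
J = np.zeros((d, n))                    # Jacobian estimate J^k
L_max = (A ** 2).sum(axis=1).max()      # largest per-function smoothness constant
alpha = 1.0 / (3.0 * L_max)             # conservative SAGA-style stepsize (assumption)
for k in range(8000):
    i = rng.integers(n)
    gi = grad_fi(x, i)
    g = J.mean(axis=1) + (gi - J[:, i]) # estimator (7) in this special case
    J[:, i] = gi                        # update (9): overwrite sketched column i
    x -= alpha * g                      # prox step with psi = 0 (identity)

x_star = np.linalg.lstsq(A, b, rcond=None)[0]
```

Because the Jacobian estimate converges along with the iterates, the variance of `g` vanishes and the method reaches the exact solution, with no neighborhood of convergence.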
We now describe the main result of this paper, which relies on a relaxed strong convexity assumption on $f$ and a more precise smoothness assumption on the functions $f_i$.
Problem (1) has a unique minimizer $x^*$, and $f$ is $\mu$-quasi strongly convex, i.e.,
$$f(x^*) \geq f(x) + \left\langle \nabla f(x), x^* - x \right\rangle + \frac{\mu}{2}\left\|x^* - x\right\|^2 \quad \text{for all } x \in \mathbb{R}^d. \tag{10}$$
Functions $f_i$ are convex and $\mathbf{M}_i$-smooth for some symmetric positive semidefinite matrices $\mathbf{M}_i \in \mathbb{R}^{d \times d}$, i.e.,
$$f_i(y) \leq f_i(x) + \left\langle \nabla f_i(x), y - x \right\rangle + \frac{1}{2}\left\|y - x\right\|_{\mathbf{M}_i}^2 \quad \text{for all } x, y \in \mathbb{R}^d, \tag{11}$$
where $\|h\|_{\mathbf{M}_i}^2 := \langle \mathbf{M}_i h, h \rangle$.
Inequality (11) generalizes classical $L_i$-smoothness, which is obtained in the special case $\mathbf{M}_i = L_i \mathbf{I}_d$. The usefulness of this assumption comes from i) the fact that ERM problems typically satisfy (11) in a non-trivial way ESO ; gower2019sgd , and ii) the fact that our method is able to utilize the full information contained in these matrices for further acceleration (via increased stepsizes). Given matrices $\mathbf{M}_1, \dots, \mathbf{M}_n$ from Assumption 5.1, let $\mathcal{M}$ be the linear operator defined via $(\mathcal{M}\mathbf{X})_{:i} := \mathbf{M}_i \mathbf{X}_{:i}$ for $i \in [n]$. It is easy to check that this operator is self-adjoint and positive semi-definite, and that its square root is given by $(\mathcal{M}^{1/2}\mathbf{X})_{:i} = \mathbf{M}_i^{1/2} \mathbf{X}_{:i}$. The pseudoinverse $\mathcal{M}^{\dagger}$ of this operator plays an important role in our main result.
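The columnwise operator $\mathcal{M}$ and its square root can be checked numerically; the random PSD matrices below are our own toy stand-ins for the smoothness matrices $\mathbf{M}_i$:

```python
import numpy as np

rng = np.random.default_rng(6)
d, n = 3, 4
# Random symmetric PSD stand-ins for the smoothness matrices M_i (assumption).
Ms = []
for _ in range(n):
    B = rng.standard_normal((d, d))
    Ms.append(B @ B.T)

def M_op(X):
    """(M X)_{:i} = M_i X_{:i}, columnwise."""
    return np.column_stack([Ms[i] @ X[:, i] for i in range(n)])

def M_sqrt(X):
    """Square root acts columnwise via the matrix square roots M_i^{1/2}."""
    cols = []
    for i in range(n):
        w, V = np.linalg.eigh(Ms[i])
        cols.append(V @ (np.sqrt(np.maximum(w, 0.0)) * (V.T @ X[:, i])))
    return np.column_stack(cols)

X = rng.standard_normal((d, n))
Y = rng.standard_normal((d, n))
```

The checks below confirm self-adjointness with respect to the trace inner product and the square-root property $\mathcal{M}^{1/2}\mathcal{M}^{1/2} = \mathcal{M}$.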
Let Assumption 5.1 hold. Let $\mathcal{B}$ be any linear operator commuting with $\mathcal{S}$, and assume that $\mathcal{B}$ commutes with $\mathcal{M}^{1/2}$. Define the Lyapunov function
$$\Psi^k := \left\|x^k - x^*\right\|^2 + \alpha \left\|\mathcal{B}\left(\mathbf{J}^k - \mathbf{G}(x^*)\right)\right\|^2, \tag{12}$$
where $x^k$ and $\mathbf{J}^k$ are the random iterates produced by Algorithm 1 with stepsize $\alpha > 0$. Suppose that $\alpha$ and $\mathcal{B}$ are chosen so that inequalities (13) and (14) hold for all $x \in \mathbb{R}^d$ and all $\mathbf{X} \in \mathbb{R}^{d \times n}$. Then for all $k \geq 0$, we have
$$\mathbb{E}\left[\Psi^k\right] \leq \left(1 - \alpha\mu\right)^k \Psi^0. \tag{15}$$
The above theorem is very general, as it applies to essentially arbitrary random linear operators $\mathcal{S}$ and $\mathcal{U}$. It postulates a linear convergence rate of a Lyapunov function composed of two terms: the distance of $x^k$ from $x^*$, and a weighted distance of the Jacobian estimate $\mathbf{J}^k$ from $\mathbf{G}(x^*)$. Hence, we obtain convergence of both the iterates and the Jacobian estimates to $x^*$ and $\mathbf{G}(x^*)$, respectively. Inequalities (13) and (14) are mainly assumptions on the stepsize $\alpha$, and are used to define a suitable weight operator $\mathcal{B}$. See Lemma E.1 for a general statement on when these inequalities are satisfied. However, we give concrete and simple answers for all special cases of GJS in the appendix. For a summary of how the operator $\mathcal{B}$ is chosen in special cases, and the particular complexity results derived from this theorem, we refer to Table 2.
We use the trivial choice of $\mathcal{B}$ in almost all special cases. With this choice, the commutation condition is automatically satisfied, and inequality (14) is required to hold for all matrices $\mathbf{X} \in \mathbb{R}^{d \times n}$. However, a non-trivial choice of $\mathcal{B}$ is sometimes useful; e.g., in the analysis of a subspace variant of SEGA hanzely2018sega . Further, the results of Theorem 5.2 can be generalized from quasi-strong convexity to a strong growth condition karimi2016linear on $f$ (see Appendix S). While interesting, these are not the key results of this work, and we therefore defer them to the appendix.
6 Special Cases
As outlined in the introduction, GJS (Algorithm 1) is a surprisingly versatile method. In Table 1 we list 7 existing methods (in some cases, generalizations of existing methods), and we also construct 10 new variance-reduced methods. We further provide a summary of all specialized iteration complexity results, together with a guide to the corollaries which state them (see Table 2 in the appendix).