We consider a convex composite optimization problem
where is a conic combination (with coefficients ) of a very large number of smooth convex functions , and is a proper closed convex function.111Our choice to consider general weights is not significant; we do it for convenience as will be clear from the relevant part of the text. Hence, the reader may without the risk of missing any key ideas assume that for all . We do not assume to be smooth. In particular, can be the indicator function of a nonempty closed convex set, turning problem (1) into a constrained minimization of function . We are interested in the regime where , although all our theoretical results hold without this assumption.
In a typical setup in the literature, for all ,
corresponds to the loss of a supervised machine learning modelon example from a training dataset of size , and represents the average loss (i.e., empirical risk). Problems of the form (1) are often called “finite-sum” or regularized empirical risk minimization (ERM) problems, and are of immense importance in supervised learning, essentially forming the dominant training paradigm Shalev-Shwartz & Ben-David (2014).
1.1 Variance-reduced methods
One of the most successful methods for solving ERM problems is stochastic gradient descent (SGD)Robbins & Monro (1951); Nemirovski et al. (2009) and its many variants, including those with minibatches Takáč et al. (2013), importance sampling Needell et al. (2015); Zhao & Zhang (2015) and momentum Loizou & Richtárik (2017a, b).
One of the most interesting developments in recent years concerns variance-reduced variants of SGD. The first method in this category is the celebrated222Schmidt et al. (2017) received the 2018 Lagrange Prize in continuous optimization for their work on SAG. stochastic average gradient (SAG) method of Schmidt et al. (2017). Many additional variance-reduced methods were proposed since, including SDCA Richtárik & Takáč (2014); Shalev-Shwartz & Zhang (2013); Shalev-Shwartz (2016), SAGA Defazio et al. (2014), SVRG Johnson & Zhang (2013); Xiao & Zhang (2014b), S2GD Konečný & Richtárik (2017); Konečný et al. (2016), MISO Mairal (2015), JacSketch Gower et al. (2018) and SAGD Bibi et al. (2018).
1.2 SAGA: the known and the unknown
Since the SAG gradient estimator is not unbiased, SAG is notoriously hard to analyze. Soon after SAG was proposed, the SAGA methodDefazio et al. (2014) was developed, obtained by replacing the biased SAG estimator by a similar, but unbiased, SAGA estimator. This method admits a simpler analysis, retaining the favourable convergence properties of SAG. SAGA is one of the early and most successful variance-reduced methods for (1).
Better understanding of the behaviour of SAGA remains one of the open challenges in the literature. Consider problem (1) with for all . Assume, for simplicity, that each is -smooth and is
-strongly convex. In this regime, the iteration complexity of SAGA with uniform sampling probabilities is, where , which was established already by Defazio et al. (2014). Schmidt et al. (2015) conjectured that there exist nonuniform sampling probabilities for which the complexity improves to , where . However, the “correct” importance sampling strategy leading to this result was not discovered until recently in the work of Gower et al. (2018), where the conjecture was resolved in the affirmative. One of the key difficulties in the analysis was the construction of a suitable stochastic Lyapunov function controlling the iterative process. Likewise, until recently, very little was known about the minibatch performance of SAGA, even for the simplest uniform minibatch strategies. Notable advancements in this area were made by Gower et al. (2018), who have the currently best rates for SAGA with standard uniform minibatch strategies and the first importance sampling results for a block variant of SAGA.
Our contributions can be summarized as follows.
SAGA with arbitrary sampling. We study the performance of SAGA under fully general data sampling strategies known in the literature as arbitrary sampling, generalizing all previous results, and obtaining an array of new theoretically and practically useful samplings. We call our general method SAGA-AS. Our theorems are expressed in terms of new stochastic and deterministic Lyapunov functions, the constructions of which was essential to our success.
In the arbitrary sampling paradigm, first proposed by Richtárik & Takáč (2016) in the context of randomized coordinate descent methods, one considers all (proper) random set valued mappings (called “samplings”) with values being subsets of . A sampling is uniquely determined by assigning a probability to all subsets of . A sampling is called proper333It does not make sense to consider samplings that are not proper. Indeed, if for some , a method based on will lose access to and, consequently, ability to solve (1). if probability of each being sampled is positive; that is, if for all . Finally, the term “arbitrary sampling” refers to an arbitrary proper sampling.
Smooth case. We perform an iteration complexity analysis in the smooth case (), assuming is -strongly convex. Our analysis generalizes the results of Defazio et al. (2014) and Gower et al. (2018) to arbitrary sampling. The JacSketch method Gower et al. (2018)
and its analysis rely on the notion of a bias-correcting random variable. Unfortunately, such a random variable does not exist for SAGA-AS. We overcome this obstacle by proposing a
bias-correcting random vector (BCRV)which, as we show, always exists. While Gower et al. (2018); Bibi et al. (2018) consider particular suboptimal choices, we are able to find the BCRV which minimizes the iteration complexity bound. Unlike all known and new variants of SAGA considered in Gower et al. (2018), SAGA-AS does not arise as a special case of JacSketch. Our linear rates for SAGA-AS are the same as those for the primal-dual stochastic fixed point method Quartz Qu et al. (2015) (the first arbitrary sampling based method for (1)) in the regime when Quartz is applicable, which is the case when an explicit strongly convex regularizer is present. In contrast, we do not need an explicit regularizer, which means that SAGA-AS can utilize the strong convexity of fully, even if the strong convexity parameter is not known. While the importance sampling results in Gower et al. (2018) require each to be strongly convex, we impose this requirement on only.
Nonsmooth case. We perform an iteration complexity analysis in the general nonsmooth case. When the regularizer is strongly convex, which is the same setting as that considered in Qu et al. (2015), our iteration complexity bounds are essentially the same as that of Quartz. However, we also prove linear convergence results, with the same rates, under a quadratic functional growth condition (which does not imply strong convexity) Necoara et al. (2018). These are the first linear convergence result for any variant of SAGA without strong convexity. Moreover, to the best of our knowledge, SAGA-AS is the only variance-reduced method which achieves linear convergence without any a priori knowledge of the error bound condition number.
Our arbitrary sampling rates are summarized in Table 1.
1.4 Brief review of arbitrary sampling results
The arbitrary sampling paradigm was proposed by Richtárik & Takáč (2016), where a randomized coordinate descent method with arbitrary sampling of subsets of coordinates was analyzed for unconstrained minimization of a strongly convex function. Subsequently, the primal-dual method Quartz with arbitrary sampling of dual variables was studied in Qu et al. (2015) for solving (1) in the case when is strongly convex (and for all ). An accelerated randomized coordinate descent method with arbitrary sampling called ALPHA was proposed by Qu & Richtárik (2016a) in the context of minimizing the sum of a smooth convex function and a convex block-separable regularizer. A key concept in the analysis of all known methods in the arbitrary sampling paradigm is the notion of expected separable overapproximation (ESO), introduced by Richtárik & Takáč (2016), and in the context of arbitrary sampling studied in depth by Qu & Richtárik (2016b). A stochastic primal-dual hybrid gradient algorithm (aka Chambolle-Pock) with arbitrary sampling of dual variables was studied by Chambolle et al. (2017). Recently, an accelerated coordinate descent method with arbitrary sampling for minimizing a smooth and strongly convex function was studied by Hanzely & Richtárik (2018). Finally, the first arbitrary sampling analysis in a nonconvex setting was performed by Horváth & Richtárik (2018), which is also the first work in which the optimal sampling out of class of all samplings of a given minibatch size was identified.
2 The Algorithm
Let be defined by
and let be the Jacobian of at .
Gower et al. (2018) propose a new family of variance reduced SGD methods—called JacSketch—which progressively build a variance-reduced estimator of the gradient via the utilization of a new technique they call Jacobian sketching. As shown in Gower et al. (2018), state-of-the-art variants SAGA can be obtained as a special case of JacSketch. However, SAGA-AS does not arise as a special case of JacSketch. In fact, the generic analysis provided in Gower et al. (2018) (Theorem 3.6) is too coarse and does not lead to good bounds for any variants of SAGA with importance sampling. On the other hand, the analysis of Gower et al. (2018) which does do well for importance sampling does not generalize to arbitrary sampling, regularized objectives or regimes without strong convexity.
In this section we provide a brief review of the JacSketch method, establishing some useful notation along the way, with the goal of pointing out the moment of departure from JacSketch construction which leads to SAGA-AS. The iterations of JacSketch have the form
where is a fixed step size, and
is an variance-reduced unbiased estimator of the gradientbuilt iteratively through a process involving Jacobian sketching, sketch-and-project and bias-correction.
Starting from an arbitrary matrix , at each of JacSketch an estimator of the true Jacobian is constructed using a sketch-and-project iteration:
Above, is the Frobenius norm444Gower et al. (2018) consider a weighted Frobenius norm, but this is not useful for our purposes., and
is a random matrix drawn from some ensamble of matricesin an i.i.d. fashion in each iteration. The solution to (3) is the closest matrix to consistent with true Jacobian in its action onto . Intuitively, the higher is, the more accurate will be as an estimator of the true Jacobian . However, in order to control the cost of computing , in practical applications one chooses . In the case of standard SAGA, for instance, is a random standard unit basis vector in chosen uniformly at random (and hence ).
where is an orthogonal projection matrix (onto the column space of ), and denotes the Moore-Penrose pseudoinverse. Since is constructed to be an approximation of , and since , where , it makes sense to estimate the gradient via . However, this gradient estimator is not unbiased, which poses dramatic challenges for complexity analysis. Indeed, the celebrated SAG method, with its infamously technical analysis, uses precisely this estimator in the special case when is chosen to be a standard basis vector in sampled uniformly at random. Fortunately, as show in Gower et al. (2018), an unbiased estimator can be constructed by taking a random linear combination of and :
where is a bias-correcting random variable (dependent on ), defined as any random variable for which Under this condition, becomes an unbiased estimator of . The JacSketch method is obtained by alternating optimization steps (2) (producing iterates ) with sketch-and-project steps (producing ).
2.2 Bias-correcting random vector
In order to construct SAGA-LS, we take a departure here and consider a bias-correcting random vector
instead. From now on it will be useful to think of as an diagonal matrix, with the vector embedded in its diagonal. In contrast to (4), we propose to construct via
It is easy to see that under the following assumption, will be an unbiased estimator of .
Assumption 2.1 (Bias-correcting random vector).
We say that the diagonal random matrix is a bias-correcting random vector if
2.3 Choosing distribution
In order to complete the description of SAGA-AS, we need to specify the distribution . We choose to be a distribution over random column submatrices of the identity matrix . Such a distribution is uniquely characterized by a random subset of the columns of , i.e., a random subset of . This leads us to the notion of a sampling, already outlined in the introduction.
Definition 2.2 (Sampling).
A sampling is a random set-valued mapping with values being the subsets of . It is uniquely characterized by the choice of probabilities associated with every subset of . Given a sampling , we let . We say that is proper if for all .
So, given a proper sampling , we sample matrices as follows: i) Draw a random set , ii) Define (random column submatrix of corresponding to columns ). For and sampling define vector as follows:
It is easy to observe (see Lemma 4.7 in Gower et al. (2018)) that for we have the identity
In order to simplify notation, we will write instead of .
Let be a proper sampling and define by setting . Then condition (5) is equivalent to
This condition is satisfied by the default vector .
In general, there is an infinity of bias-correcting random vectors characterized by (7). In SAGA-AS we reserve the freedom to choose any of these vectors.
By putting all of the development above together, we have arrived at SAGA-AS (Algorithm 1). Note that since we consider problem (1) with a regularizer , the optimization step involves a proximal operator, defined as
To shed more light onto the key steps of SAGA-AS, note that that an alternative way of writing the Jacobian update is
while the gradient estimate can be alternatively written as
3 Analysis in the Smooth Case
In this section we consider problem (1) in the smooth case; i.e., we let .
3.1 Main result
Given an arbitrary proper sampling , and bias-correcting random vector , for each define
where is the cardinality of the set . As we shall see, these quantities play a key importance in our complexity result, presented next.
Let be an arbitrary proper sampling, and let be a bias-correcting random vector satisfying (7). Let be -strongly convex and be convex and -smooth. Let be the iterates produced by Algorithm 1. Consider the stochastic Lyapunov function
where for all . If stepsize satisfies
then This implies that if we choose equal to the upper bound in (9), then as long as
Our result involves a novel stochastic Lyapunov function , different from that in Gower et al. (2018). We call this Lyapnuov function stochastic because it is a random variable when conditioned on and . Virtually all analyses of stochastic optimization algorithms in the literature (to the best of our knowledge, all except for that in Gower et al. (2018)) involve deterministic Lyapunov functions.
The result posits linear convergence with iteration complexity whose leading term involves the quantities , (which depend on the sampling only) and on , and (which depend on the properties and structure of ).
3.2 Optimal bias-correcting random vector
Note that the complexity bound gets better as get smaller. Having said that, even for a fixed sampling , the choice of is not unique, Indeed, this is because depends on the choice of . In view of Lemma 2.3, we have many choices for this random vector. Let be the collection of all bias-correcting random vectors associated with sampling . In our next result we will compute the bias-correcting random vector which leads to the minimal complexity parameters . In the rest of the paper, let
Let be a proper sampling. Then
(i) for all , and the minimum is obtained at given by for all ;
(ii) for all .
3.3 Importance Sampling for Minibatches
In this part we construct an importance sampling for minibatches. This is in general a daunting task, and only a handful of papers exist on this topic. In particular, Csiba & Richtárik (2018) and Hanzely & Richtárik (2019) focused on coordinate descent methods, and Gower et al. (2018) considered minibatch SAGA with importance sampling over subsets of forming a partition.
Since is nonlinear, it is difficult to minimize (10) generally. Instead, we choose for all , in which case , and the complexity bound becomes
From now on we focus on so-called independent samplings. Alternative results for partition samplings can be found in Section A of the Supplementary. In particular, let be formed as follows: for each we flip a biased coin, independently, with probability of success . If we are successful, we include in . This type of sampling was also considered for nonconvex ERM problems in Horváth & Richtárik (2018) and for accelerated coordinate descent in Hanzely & Richtárik (2018).
First, we calculate for this sampling . Let the sampling includes every in it, independently, with probability . Then
Since , the complexity bound (11) has an upper bound as
Next, we minimize the upper bound (12) over under the condition . Let for and denote . If , then it is easy to see that minimize (12), and consequently, (12) becomes Otherwise, in order to minimize (12), we can choose for , and for such that . By choosing this optimal probability, (12) becomes
Notice that from Theorem 3.1, for -nice sampling, the iteration complexity is
Hence, compared to -nice sampling, the iteration complexity for importance independent sampling could be at most times worse, but could also be times better in some extreme cases.
3.4 SAGA-AS vs Quartz
In this section, we compare our results for SAGA-AS with known complexity results for the primal-dual method Quartz of Qu et al. (2015). We do this because this was the first and (with the exception of the dfSDCA method of Csiba & Richtárik (2015)) remains the only SGD-type method for solving (1) which was analyzed in the arbitrary sampling paradigm. Prior to this work we have conjectured that SAGA-AS would attain the same complexity as Quartz. As we shall show, this is indeed the case.
The problem studied in Qu et al. (2015) is
Quartz guarantees the duality gap to be less than in expectation using at most
iterations, where the parameters are assumed to satisfy the following expected separable overapproximation (ESO) inequality, which needs to hold for all :
If in addition is -smooth, then in (15) is smooth with Let us now consider several particular samplings:
We can see that as long as , the two bounds are essentially the same.
Parallel (-nice) sampling. is said to be -nice if it selects from all subsets of of cardinality , uniformly at random. By Lemma 6 in Qu et al. (2015),
where is the -th row of , and for each , is the number of nonzero blocks in the -th row of matrix , i.e., In the dense case, i.e., , . Hence, (16) becomes
By choosing , we get , and our iteration complexity bound in Theorem 3.1 becomes
We can see the bounds are also essentailly the same up to some constant factors. However, if , then Quartz enjoys a tighter bound.
The parameters used in Qu et al. (2015) allow one to exploit the sparsity of the data matrix and achieve almost linear speedup when is sparse or has favourable spectral properties. In the next section, we study further SAGA-AS in the case when the objective function is of the form (14), and obtain results which, like Quartz, are able to improve with data sparsity.
4 Analysis in the Composite Case
We now consider the general problem (1) with .
In order to be able to take advantage of data sparsity, we assume that functions take the form
Then clearly . Thus if SAGA-AS starts with for some , , then we always have for some , . We assume that the set of minimizers is nonempty, and let for . Further, denote the closest optimal solution from . Further, for any define to be the set of points with objective value bounded by , i.e.,
We make several further assumptions:
Assumption 4.1 (Smoothness).
Each is -smooth and convex, i.e.,
Assumption 4.2 (Quadratic functional growth condition; see Necoara et al. (2018)).
For any , there is such that for any
Assumption 4.3 (Nullspace consistency).
For any we have
4.2 Linear convergence under quadratic functional growth condition
Our first complexity result states a linear convergence rate of SAGA-AS under the quadratic functional growth condition.
where for all . Then there is a constant such that the following is true. If stepsize satisfies
then This implies that if we choose equal to the upper bound in (20), then when
Also note that the Lyapunov function considered is not stochastic (i.e., is not random).
Non-strongly convex problems in the form of (1) and (18) which satisfy Assumptions 4.1, 4.2 and 4.3 include the case when each is strongly convex and is polyhedral Necoara et al. (2018). In particular, Theorem 4.4
applies to the following logistic regression problem that we use in the experiments (and )
It is important to note that the estimation of the constant in Theorem 4.4 is not an easy task. However we can obtain satisfying (20) without any knowledge of and keep the same order of complexity bound. W.l.o.g. we can assume . Then (20) is satisfied by taking . This makes SAGA stand out from all the stochastic variance reduced methods including SDCA Shalev-Shwartz & Zhang (2013), SVRG Xiao & Zhang (2014a), SPDC Zhang & Lin (2015), etc. We also note that an adaptive strategy was proposed for SVRG in order to avoid guessing the constant Xu et al. (2017). Their adaptive approach leads to a triple-looped algorithm (as SVRG is already double-looped) and linear convergence as we have in Theorem 4.4 was not obtained.
4.3 Linear convergence for strongly convex regularizer
For the problem studied in Qu et al. (2015) where the regularizer is -strongly convex (and hence Assumption (4.2) holds and the minimizer is unique), we obtain the following refinement of Theorem 4.4.