1 Introduction
We consider a convex composite optimization problem
(1) $\min_{x\in\mathbb{R}^d} \left\{ P(x) := f(x) + \psi(x) \right\}, \qquad f(x) := \sum_{i=1}^n v_i f_i(x),$
where $f$ is a conic combination (with coefficients $v_i > 0$) of a very large number of smooth convex functions $f_i$, and $\psi$ is a proper closed convex function. (Our choice to consider general weights is not significant; we do it for convenience, as will be clear from the relevant part of the text. Hence, the reader may assume, without the risk of missing any key ideas, that $v_i = \frac{1}{n}$ for all $i$.) We do not assume $\psi$ to be smooth. In particular, $\psi$ can be the indicator function of a nonempty closed convex set, turning problem (1) into a constrained minimization of the function $f$. We are interested in the regime where $n$ is very large, although all our theoretical results hold without this assumption.
In a typical setup in the literature, for all $i$, $f_i(x)$ corresponds to the loss of a supervised machine learning model on example $i$ from a training dataset of size $n$, and $f$ represents the average loss (i.e., the empirical risk). Problems of the form (1) are often called "finite-sum" or regularized empirical risk minimization (ERM) problems, and are of immense importance in supervised learning, essentially forming the dominant training paradigm Shalev-Shwartz & Ben-David (2014).

1.1 Variance-reduced methods
One of the most successful methods for solving ERM problems is stochastic gradient descent (SGD) Robbins & Monro (1951); Nemirovski et al. (2009) and its many variants, including those with mini-batches Takáč et al. (2013), importance sampling Needell et al. (2015); Zhao & Zhang (2015) and momentum Loizou & Richtárik (2017a, b).

One of the most interesting developments in recent years concerns variance-reduced variants of SGD. The first method in this category is the celebrated stochastic average gradient (SAG) method of Schmidt et al. (2017) (who received the 2018 Lagrange Prize in continuous optimization for this work). Many additional variance-reduced methods have been proposed since, including SDCA Richtárik & Takáč (2014); Shalev-Shwartz & Zhang (2013); Shalev-Shwartz (2016), SAGA Defazio et al. (2014), SVRG Johnson & Zhang (2013); Xiao & Zhang (2014b), S2GD Konečný & Richtárik (2017); Konečný et al. (2016), MISO Mairal (2015), JacSketch Gower et al. (2018) and SAGD Bibi et al. (2018).
1.2 SAGA: the known and the unknown
Since the SAG gradient estimator is not unbiased, SAG is notoriously hard to analyze. Soon after SAG was proposed, the SAGA method Defazio et al. (2014) was developed, obtained by replacing the biased SAG estimator with a similar, but unbiased, SAGA estimator. This method admits a simpler analysis while retaining the favourable convergence properties of SAG. SAGA is one of the earliest and most successful variance-reduced methods for (1).

A better understanding of the behaviour of SAGA remains one of the open challenges in the literature. Consider problem (1) with $v_i = \frac{1}{n}$ for all $i$. Assume, for simplicity, that each $f_i$ is $L_i$-smooth and $f$ is $\mu$-strongly convex. In this regime, the iteration complexity of SAGA with uniform sampling probabilities is $\mathcal{O}\left(\left(n + \frac{L_{\max}}{\mu}\right)\log\frac{1}{\epsilon}\right)$, where $L_{\max} := \max_i L_i$, which was established already by Defazio et al. (2014). Schmidt et al. (2015) conjectured that there exist nonuniform sampling probabilities for which the complexity improves to $\mathcal{O}\left(\left(n + \frac{\bar{L}}{\mu}\right)\log\frac{1}{\epsilon}\right)$, where $\bar{L} := \frac{1}{n}\sum_i L_i$. However, the "correct" importance sampling strategy leading to this result was not discovered until recently, in the work of Gower et al. (2018), where the conjecture was resolved in the affirmative. One of the key difficulties in the analysis was the construction of a suitable stochastic Lyapunov function controlling the iterative process. Likewise, until recently very little was known about the mini-batch performance of SAGA, even for the simplest uniform mini-batch strategies. Notable advancements in this area were made by Gower et al. (2018), who obtained the currently best rates for SAGA with standard uniform mini-batch strategies and the first importance sampling results for a block variant of SAGA.

1.3 Contributions
Table 1: Summary of our complexity results in the arbitrary sampling paradigm: the smooth strongly convex regime (Theorem 3.1), the composite regime under quadratic functional growth (Theorem 4.4), and the composite regime with a strongly convex regularizer (Theorem 4.6).
Our contributions can be summarized as follows.
SAGA with arbitrary sampling. We study the performance of SAGA under fully general data sampling strategies known in the literature as arbitrary sampling, generalizing all previous results and obtaining an array of new, theoretically and practically useful samplings. We call our general method SAGA-AS. Our theorems are expressed in terms of new stochastic and deterministic Lyapunov functions, the construction of which was essential to our success.
In the arbitrary sampling paradigm, first proposed by Richtárik & Takáč (2016) in the context of randomized coordinate descent methods, one considers all (proper) random set-valued mappings (called "samplings") $S$ with values being subsets of $[n] := \{1,2,\dots,n\}$. A sampling is uniquely determined by assigning a probability $p_C \geq 0$ to every subset $C$ of $[n]$. A sampling is called proper if the probability of each $i$ being sampled is positive; that is, if $p_i := \mathbb{P}(i \in S) > 0$ for all $i$. (It does not make sense to consider samplings that are not proper. Indeed, if $p_i = 0$ for some $i$, a method based on $S$ loses access to $f_i$ and, consequently, the ability to solve (1).) Finally, the term "arbitrary sampling" refers to an arbitrary proper sampling.
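To make the notion concrete, here is a small Python sketch (ours, not from the paper) that enumerates the probabilities $p_C$ of an independent sampling and checks properness by computing the marginals $p_i = \mathbb{P}(i \in S)$:

```python
import itertools

def independent_sampling_pmf(p):
    """Probability p_C of each subset C of [n] under an independent sampling:
    element i is included with probability p[i], independently of the rest."""
    n = len(p)
    pmf = {}
    for r in range(n + 1):
        for C in itertools.combinations(range(n), r):
            prob = 1.0
            for i in range(n):
                prob *= p[i] if i in C else 1 - p[i]
            pmf[frozenset(C)] = prob
    return pmf

def is_proper(pmf, n):
    """A sampling is proper if P(i in S) > 0 for every i in [n]."""
    marginals = [sum(q for C, q in pmf.items() if i in C) for i in range(n)]
    return all(m > 0 for m in marginals), marginals

pmf = independent_sampling_pmf([0.5, 0.25, 1.0])
proper, marginals = is_proper(pmf, 3)
print(proper, [round(m, 6) for m in marginals])
```

For an independent sampling the marginals recover the inclusion probabilities exactly, so the sampling is proper precisely when every inclusion probability is positive.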
Smooth case. We perform an iteration complexity analysis in the smooth case ($\psi \equiv 0$), assuming $f$ is strongly convex. Our analysis generalizes the results of Defazio et al. (2014) and Gower et al. (2018) to arbitrary sampling. The JacSketch method Gower et al. (2018) and its analysis rely on the notion of a bias-correcting random variable. Unfortunately, such a random variable does not exist for SAGA-AS. We overcome this obstacle by proposing a bias-correcting random vector (BCRV) which, as we show, always exists. While Gower et al. (2018); Bibi et al. (2018) consider particular suboptimal choices, we are able to find the BCRV which minimizes the iteration complexity bound. Unlike all known and new variants of SAGA considered in Gower et al. (2018), SAGA-AS does not arise as a special case of JacSketch. Our linear rates for SAGA-AS are the same as those for the primal-dual stochastic fixed point method Quartz Qu et al. (2015) (the first arbitrary sampling based method for (1)) in the regime where Quartz is applicable, which is the case when an explicit strongly convex regularizer is present. In contrast, we do not need an explicit regularizer, which means that SAGA-AS can utilize the strong convexity of $f$ fully, even if the strong convexity parameter is not known. While the importance sampling results in Gower et al. (2018) require each $f_i$ to be strongly convex, we impose this requirement on $f$ only.

Nonsmooth case. We perform an iteration complexity analysis in the general nonsmooth case. When the regularizer $\psi$ is strongly convex, which is the same setting as that considered in Qu et al. (2015), our iteration complexity bounds are essentially the same as those of Quartz. However, we also prove linear convergence results, with the same rates, under a quadratic functional growth condition (which does not imply strong convexity) Necoara et al. (2018). These are the first linear convergence results for any variant of SAGA without strong convexity. Moreover, to the best of our knowledge, SAGA-AS is the only variance-reduced method which achieves linear convergence without any a priori knowledge of the error bound condition number.
Our arbitrary sampling rates are summarized in Table 1.
1.4 Brief review of arbitrary sampling results
The arbitrary sampling paradigm was proposed by Richtárik & Takáč (2016), where a randomized coordinate descent method with arbitrary sampling of subsets of coordinates was analyzed for unconstrained minimization of a strongly convex function. Subsequently, the primal-dual method Quartz with arbitrary sampling of dual variables was studied in Qu et al. (2015) for solving (1) in the case when $\psi$ is strongly convex (and $v_i = \frac{1}{n}$ for all $i$). An accelerated randomized coordinate descent method with arbitrary sampling called ALPHA was proposed by Qu & Richtárik (2016a) in the context of minimizing the sum of a smooth convex function and a convex block-separable regularizer. A key concept in the analysis of all known methods in the arbitrary sampling paradigm is the notion of expected separable overapproximation (ESO), introduced by Richtárik & Takáč (2016) and, in the context of arbitrary sampling, studied in depth by Qu & Richtárik (2016b). A stochastic primal-dual hybrid gradient algorithm (aka Chambolle-Pock) with arbitrary sampling of dual variables was studied by Chambolle et al. (2017). Recently, an accelerated coordinate descent method with arbitrary sampling for minimizing a smooth and strongly convex function was studied by Hanzely & Richtárik (2018). Finally, the first arbitrary sampling analysis in a nonconvex setting was performed by Horváth & Richtárik (2018), which is also the first work in which the optimal sampling out of the class of all samplings of a given mini-batch size was identified.
2 The Algorithm
Let $F:\mathbb{R}^d\to\mathbb{R}^n$ be defined by $F(x) := (f_1(x), f_2(x), \dots, f_n(x))$, and let $\nabla F(x) := [\nabla f_1(x), \dots, \nabla f_n(x)] \in \mathbb{R}^{d\times n}$ be the Jacobian of $F$ at $x$.
2.1 JacSketch
Gower et al. (2018) propose a new family of variance-reduced SGD methods, called JacSketch, which progressively build a variance-reduced estimator of the gradient via a new technique they call Jacobian sketching. As shown in Gower et al. (2018), state-of-the-art variants of SAGA can be obtained as special cases of JacSketch. However, SAGA-AS does not arise as a special case of JacSketch. In fact, the generic analysis provided in Gower et al. (2018) (Theorem 3.6) is too coarse and does not lead to good bounds for any variant of SAGA with importance sampling. On the other hand, the specialized analysis of Gower et al. (2018) which does do well for importance sampling does not generalize to arbitrary sampling, regularized objectives, or regimes without strong convexity.
In this section we provide a brief review of the JacSketch method, establishing some useful notation along the way, with the goal of pointing out the moment of departure from the JacSketch construction which leads to SAGA-AS. The iterations of JacSketch have the form
(2) $x^{k+1} = x^k - \alpha g^k,$
where $\alpha > 0$ is a fixed stepsize, and $g^k$ is a variance-reduced, unbiased estimator of the gradient $\nabla f(x^k)$,
built iteratively through a process involving Jacobian sketching, sketch-and-project, and bias-correction.

Starting from an arbitrary matrix $\mathbf{J}^0 \in \mathbb{R}^{d\times n}$, at each iteration of JacSketch an estimator $\mathbf{J}^{k+1}$ of the true Jacobian $\nabla F(x^k)$ is constructed using a sketch-and-project iteration:
(3) $\mathbf{J}^{k+1} = \operatorname*{arg\,min}_{\mathbf{J}\in\mathbb{R}^{d\times n}} \left\{ \|\mathbf{J} - \mathbf{J}^k\|_F^2 \;:\; \mathbf{J}\mathbf{S}_k = \nabla F(x^k)\mathbf{S}_k \right\}.$
Above, $\|\cdot\|_F$ is the Frobenius norm (Gower et al. (2018) consider a weighted Frobenius norm, but this is not useful for our purposes), and $\mathbf{S}_k \in \mathbb{R}^{n\times\tau}$ is a random matrix drawn from some ensemble of matrices $\mathcal{D}$ in an i.i.d. fashion in each iteration. The solution to (3) is the matrix closest to $\mathbf{J}^k$ that is consistent with the true Jacobian in its action on $\mathbf{S}_k$. Intuitively, the larger $\tau$ is, the more accurate an estimator of the true Jacobian $\mathbf{J}^{k+1}$ will be. However, in order to control the cost of computing $\mathbf{J}^{k+1}$, in practical applications one chooses $\tau \ll n$. In the case of standard SAGA, for instance, $\mathbf{S}_k$ is a random standard unit basis vector in $\mathbb{R}^n$ chosen uniformly at random (and hence $\tau = 1$). The projection subproblem (3) has the explicit solution (see Lemma B.1 in Gower et al. (2018)):
$\mathbf{J}^{k+1} = \mathbf{J}^k + \left(\nabla F(x^k) - \mathbf{J}^k\right)\mathbf{\Pi}_k, \qquad \mathbf{\Pi}_k := \mathbf{S}_k\left(\mathbf{S}_k^\top\mathbf{S}_k\right)^\dagger\mathbf{S}_k^\top,$
where $\mathbf{\Pi}_k$ is an orthogonal projection matrix (onto the column space of $\mathbf{S}_k$), and $\dagger$ denotes the Moore-Penrose pseudoinverse. Since $\mathbf{J}^{k+1}$ is constructed to be an approximation of $\nabla F(x^k)$, and since $\nabla f(x) = \frac{1}{n}\nabla F(x)e$, where $e := (1,1,\dots,1)^\top \in \mathbb{R}^n$, it makes sense to estimate the gradient via $\frac{1}{n}\mathbf{J}^{k+1}e$. However, this gradient estimator is not unbiased, which poses dramatic challenges for the complexity analysis. Indeed, the celebrated SAG method, with its infamously technical analysis, uses precisely this estimator in the special case when $\mathbf{S}_k$ is chosen to be a standard basis vector in $\mathbb{R}^n$ sampled uniformly at random. Fortunately, as shown in Gower et al. (2018), an unbiased estimator can be constructed by taking a random linear combination of $\frac{1}{n}\mathbf{J}^k e$ and $\frac{1}{n}\mathbf{J}^{k+1}e$:
(4) $g^k = \frac{1-\theta_{\mathbf{S}_k}}{n}\,\mathbf{J}^k e + \frac{\theta_{\mathbf{S}_k}}{n}\,\mathbf{J}^{k+1}e,$
where $\theta_{\mathbf{S}_k}$ is a bias-correcting random variable (dependent on $\mathbf{S}_k$), defined as any random variable for which $\mathbb{E}\left[\theta_{\mathbf{S}}\mathbf{\Pi}\right]e = e$. Under this condition, $g^k$ becomes an unbiased estimator of $\nabla f(x^k)$. The JacSketch method is obtained by alternating the optimization steps (2) (producing iterates $x^k$) with the sketch-and-project steps (producing $\mathbf{J}^k$).
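As a numerical sanity check (our own, using NumPy), the snippet below forms the projector $\mathbf{\Pi} = \mathbf{S}(\mathbf{S}^\top\mathbf{S})^\dagger\mathbf{S}^\top$ for a random column submatrix $\mathbf{S}$ of the identity (the sketch used by SAGA) and verifies that the sketch-and-project update $\mathbf{J}^{k+1} = \mathbf{J}^k + (\nabla F(x^k) - \mathbf{J}^k)\mathbf{\Pi}$ matches the true Jacobian in its action on $\mathbf{S}$ while leaving the orthogonal complement untouched:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, tau = 5, 8, 3

J = rng.standard_normal((d, n))        # current Jacobian estimate J^k
G = rng.standard_normal((d, n))        # stand-in for the true Jacobian grad F(x^k)
S = np.eye(n)[:, rng.choice(n, size=tau, replace=False)]  # column submatrix sketch

Pi = S @ np.linalg.pinv(S.T @ S) @ S.T  # orthogonal projector onto range(S)
J_new = J + (G - J) @ Pi                # sketch-and-project update

# The update matches the true Jacobian in its action on S ...
print(np.allclose(J_new @ S, G @ S))    # True
# ... and leaves the action on the orthogonal complement unchanged.
print(np.allclose(J_new @ (np.eye(n) - Pi), J @ (np.eye(n) - Pi)))  # True
```

For this choice of sketch, only the $\tau$ sampled columns of the Jacobian estimate are refreshed, which is exactly the SAGA table update.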
2.2 Bias-correcting random vector
In order to construct SAGA-AS, we take a departure here and consider a bias-correcting random vector $\theta_{\mathbf{S}} \in \mathbb{R}^n$ instead. From now on it will be useful to think of $\theta_{\mathbf{S}}$ as an $n\times n$ diagonal matrix $\mathbf{\Theta}_{\mathbf{S}} := \mathrm{Diag}(\theta_{\mathbf{S}})$, with the vector embedded in its diagonal. In contrast to (4), we propose to construct $g^k$ via
$g^k = \frac{1}{n}\mathbf{J}^k e + \frac{1}{n}\left(\nabla F(x^k) - \mathbf{J}^k\right)\mathbf{\Pi}_k\mathbf{\Theta}_{\mathbf{S}_k} e.$
It is easy to see that, under the following assumption, $g^k$ will be an unbiased estimator of the gradient $\nabla f(x^k)$.
Assumption 2.1 (Bias-correcting random vector).
We say that the diagonal random matrix $\mathbf{\Theta}_{\mathbf{S}} = \mathrm{Diag}(\theta_{\mathbf{S}})$ defines a bias-correcting random vector if
(5) $\mathbb{E}\left[\mathbf{\Pi}\,\mathbf{\Theta}_{\mathbf{S}}\, e\right] = e.$
2.3 Choosing the distribution $\mathcal{D}$
In order to complete the description of SAGA-AS, we need to specify the distribution $\mathcal{D}$. We choose $\mathcal{D}$ to be a distribution over random column submatrices of the $n\times n$ identity matrix $\mathbf{I}$. Such a distribution is uniquely characterized by a random subset of the columns of $\mathbf{I}$, i.e., by a random subset of $[n]$. This leads us to the notion of a sampling, already outlined in the introduction.
Definition 2.2 (Sampling).
A sampling $S$ is a random set-valued mapping with values being the subsets of $[n]$. It is uniquely characterized by the choice of probabilities $p_C := \mathbb{P}(S = C)$ associated with every subset $C$ of $[n]$. Given a sampling $S$, we let $p_i := \mathbb{P}(i \in S)$. We say that $S$ is proper if $p_i > 0$ for all $i$.
So, given a proper sampling $S$, we sample matrices $\mathbf{S} \sim \mathcal{D}$ as follows: (i) draw a random set $S \subseteq [n]$; (ii) define $\mathbf{S} = \mathbf{I}_S$, the random column submatrix of $\mathbf{I}$ corresponding to the columns in $S$. For $x \in \mathbb{R}^n$ and a sampling $S$, define the vector $x_S \in \mathbb{R}^n$ by $(x_S)_i := x_i 1_{(i\in S)}$.
It is easy to observe (see Lemma 4.7 in Gower et al. (2018)) that for $\mathbf{S} = \mathbf{I}_S$ we have the identity
(6) $\mathbf{\Pi} = \mathbf{I}_S\mathbf{I}_S^\top = \mathrm{Diag}(e_S).$
In order to simplify notation, we will write $\theta$ instead of $\theta_{\mathbf{I}_S}$.
Lemma 2.3.
Let $S$ be a proper sampling. Then condition (5) is equivalent to
(7) $\mathbb{E}\left[\theta_i 1_{(i\in S)}\right] = 1 \quad \text{for all } i \in [n].$
This condition is satisfied by the default vector given by $\theta_i = \frac{1}{p_i}$ for all $i$.
In general, there is an infinity of bias-correcting random vectors characterized by (7). In SAGA-AS we reserve the freedom to choose any of these vectors.
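As a quick sanity check of the default choice $\theta_i = \frac{1}{p_i}$, the following snippet (ours, not from the paper) verifies condition (7) by exhaustive enumeration of all subsets for a small independent sampling:

```python
import itertools

# Independent sampling over n = 3 elements with inclusion probabilities p_i.
p = [0.5, 0.25, 0.8]
n = len(p)

def prob_of(C):
    """Probability that the independent sampling S equals the subset C."""
    q = 1.0
    for i in range(n):
        q *= p[i] if i in C else 1 - p[i]
    return q

subsets = [C for r in range(n + 1) for C in itertools.combinations(range(n), r)]

# Default bias-correcting vector: theta_i = 1/p_i, constant across subsets.
theta = [1.0 / pi for pi in p]

# Condition (7): E[theta_i * 1(i in S)] = 1 for every i.
lhs_vals = [sum(prob_of(C) * theta[i] for C in subsets if i in C)
            for i in range(n)]
print(lhs_vals)  # each entry equals 1 (up to rounding)
```

The check succeeds because $\mathbb{E}[\theta_i 1_{(i\in S)}] = \frac{1}{p_i}\,\mathbb{P}(i\in S) = 1$ whenever the sampling is proper.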
2.4 SAGA-AS
By putting all of the development above together, we arrive at SAGA-AS (Algorithm 1). Note that since we consider problem (1) with a regularizer $\psi$, the optimization step involves a proximal operator, defined as
$\mathrm{prox}_{\alpha\psi}(x) := \operatorname*{arg\,min}_{u\in\mathbb{R}^d}\left\{\psi(u) + \frac{1}{2\alpha}\|u-x\|^2\right\}.$
To shed more light on the key steps of SAGA-AS, note that an alternative way of writing the Jacobian update is
$\mathbf{J}^{k+1}_{:i} = \begin{cases} \nabla f_i(x^k), & i \in S_k, \\ \mathbf{J}^k_{:i}, & i \notin S_k, \end{cases}$
while the gradient estimate can alternatively be written as
$g^k = \frac{1}{n}\sum_{i=1}^n \mathbf{J}^k_{:i} + \frac{1}{n}\sum_{i\in S_k}\theta_i\left(\nabla f_i(x^k) - \mathbf{J}^k_{:i}\right).$
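To make the construction concrete, here is a minimal, illustrative Python sketch of a SAGA-AS-style loop for ridge-regularized least squares, assuming uniform weights $v_i = \frac{1}{n}$, an independent sampling with equal inclusion probabilities, and the default bias-correcting vector $\theta_i = \frac{1}{p_i}$. The parameter values are ours, and this is a sketch rather than a faithful transcription of Algorithm 1:

```python
import numpy as np

def saga_as(A, b, lam=1.0, alpha=0.01, p=0.5, iters=6000, seed=0):
    """Illustrative SAGA-AS-style loop for
        min_x (1/n) sum_i 0.5*(a_i^T x - b_i)^2 + (lam/2)*||x||^2,
    with an independent sampling (each i included w.p. p) and the
    default bias-correcting vector theta_i = 1/p_i."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    probs = np.full(n, p)
    x = np.zeros(d)
    J = np.zeros((d, n))  # Jacobian estimate: column i approximates grad f_i(x)
    for _ in range(iters):
        S = np.flatnonzero(rng.random(n) < probs)   # draw the sampling S_k
        grads = A[S].T * (A[S] @ x - b[S])          # fresh gradients for i in S_k
        # unbiased gradient estimate: mean of J plus bias-corrected correction
        g = J.mean(axis=1) + ((grads - J[:, S]) / probs[S]).sum(axis=1) / n
        J[:, S] = grads                             # refresh the sampled columns
        x = (x - alpha * g) / (1.0 + alpha * lam)   # prox step for (lam/2)||x||^2
    return x

# Sanity check against the closed-form ridge regression solution.
rng = np.random.default_rng(42)
A = rng.standard_normal((50, 5))
b = rng.standard_normal(50)
x_saga = saga_as(A, b)
x_star = np.linalg.solve(A.T @ A / 50 + 1.0 * np.eye(5), A.T @ b / 50)
print(np.linalg.norm(x_saga - x_star))
```

For the quadratic penalty, the proximal operator has the closed form used in the update; a different regularizer would simply swap in its own prox.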
3 Analysis in the Smooth Case
In this section we consider problem (1) in the smooth case; i.e., we let $\psi \equiv 0$.
3.1 Main result
Given an arbitrary proper sampling $S$ and a bias-correcting random vector $\theta$ satisfying (7), for each $i \in [n]$ define
(8)
where $|S|$ denotes the cardinality of the set $S$. As we shall see, these quantities play a key role in our complexity result, presented next.
Theorem 3.1.
Let $S$ be an arbitrary proper sampling, and let $\theta$ be a bias-correcting random vector satisfying (7). Let $f$ be $\mu$-strongly convex and each $f_i$ be convex and $L_i$-smooth. Let $\{x^k\}$ be the iterates produced by Algorithm 1. Consider the stochastic Lyapunov function
where for all . If the stepsize $\alpha$ satisfies
(9)
then This implies that if we choose $\alpha$ equal to the upper bound in (9), then as long as
Our result involves a novel stochastic Lyapunov function, different from that in Gower et al. (2018). We call this Lyapunov function stochastic because it remains a random variable even when conditioned on $x^k$ and $\mathbf{J}^k$. Virtually all analyses of stochastic optimization algorithms in the literature (to the best of our knowledge, all except for that in Gower et al. (2018)) involve deterministic Lyapunov functions.
The result posits linear convergence with an iteration complexity whose leading term involves quantities which depend on the sampling $S$ only, together with quantities which depend on the properties and structure of $f$.
3.2 Optimal biascorrecting random vector
Note that the complexity bound improves as the quantities defined in (8) get smaller. Having said that, even for a fixed sampling $S$, these quantities are not uniquely determined: they depend on the choice of the bias-correcting random vector $\theta$. In view of Lemma 2.3, we have many choices for this random vector. Consider the collection of all bias-correcting random vectors associated with the sampling $S$. In our next result we compute the bias-correcting random vector which leads to the minimal complexity parameters. In the rest of the paper, let
Lemma 3.2.
Let be a proper sampling. Then
(i)
for all , and the minimum is obtained at given by for all ;
(ii)
for all .
3.3 Importance Sampling for Mini-batches
In this part we construct an importance sampling for mini-batches. This is in general a daunting task, and only a handful of papers on this topic exist. In particular, Csiba & Richtárik (2018) and Hanzely & Richtárik (2019) focused on coordinate descent methods, and Gower et al. (2018) considered mini-batch SAGA with importance sampling over subsets of $[n]$ forming a partition.
Let $\tau := \mathbb{E}[|S|]$ be the expected mini-batch size. By Lemma 3.2 and Theorem 3.1, if we choose the optimal $\theta$ given by Lemma 3.2, then the iteration complexity bound becomes
(10)
Since this bound is a nonlinear function of the sampling probabilities, it is difficult to minimize (10) in general. Instead, we choose the default $\theta_i = \frac{1}{p_i}$ for all $i$, in which case the complexity bound becomes
(11) 
From now on we focus on so-called independent samplings. Alternative results for partition samplings can be found in Section A of the supplementary material. In particular, let $S$ be formed as follows: for each $i \in [n]$ we flip a biased coin, independently, with probability of success $p_i$. If the flip is successful, we include $i$ in $S$. This type of sampling was also considered for nonconvex ERM problems in Horváth & Richtárik (2018) and for accelerated coordinate descent in Hanzely & Richtárik (2018).
First, we calculate the relevant quantities for this sampling $S$, which includes every $i$, independently, with probability $p_i$. Then
Since , the complexity bound (11) is upper bounded by
(12) 
Next, we minimize the upper bound (12) over the probabilities $p_1,\dots,p_n$ subject to the condition $\sum_i p_i = \tau$. Let for and denote . If , then it is easy to see that minimizes (12), and consequently (12) becomes Otherwise, in order to minimize (12), we can set $p_i = 1$ for some indices and choose the remaining probabilities such that $\sum_i p_i = \tau$. By choosing these optimal probabilities, (12) becomes
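One standard way to realize capped importance probabilities of this kind for an independent sampling with expected mini-batch size $\tau$ is sketched below; the weights $w_i$ and the proportionality rule are our illustrative assumptions, not necessarily the paper's exact formula:

```python
import numpy as np

def capped_importance_probs(w, tau):
    """Inclusion probabilities for an independent sampling with expected
    size tau: p_i proportional to the weight w_i, capped at 1, with the
    remaining budget redistributed among the uncapped indices.
    Assumes 0 < tau <= len(w)."""
    w = np.asarray(w, dtype=float)
    n = len(w)
    p = np.zeros(n)
    capped = np.zeros(n, dtype=bool)
    budget = tau
    while True:
        scale = budget / w[~capped].sum()
        p[~capped] = scale * w[~capped]
        over = (p > 1) & ~capped
        if not over.any():
            break
        p[over] = 1.0          # cap the offending probabilities at 1
        capped |= over
        budget = tau - capped.sum()  # redistribute what is left
    return p

p = capped_importance_probs([10.0, 1.0, 1.0, 1.0], tau=2)
print(p, p.sum())  # probabilities sum to the expected batch size tau
```

The loop terminates because the set of capped indices strictly grows in every pass, and the returned probabilities always satisfy $\sum_i p_i = \tau$ with $p_i \leq 1$.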
Notice that from Theorem 3.1, for nice sampling, the iteration complexity is
(13) 
Hence, compared to nice sampling, the iteration complexity for importance independent sampling could be at most times worse, but could also be times better in some extreme cases.
3.4 SAGA-AS vs Quartz
In this section we compare our results for SAGA-AS with the known complexity results for the primal-dual method Quartz of Qu et al. (2015). We do this because Quartz was the first and (with the exception of the dfSDCA method of Csiba & Richtárik (2015)) remains the only SGD-type method for solving (1) which has been analyzed in the arbitrary sampling paradigm. Prior to this work we conjectured that SAGA-AS would attain the same complexity as Quartz. As we shall show, this is indeed the case.
The problem studied in Qu et al. (2015) is
(14) $\min_{x\in\mathbb{R}^d}\left\{\frac{1}{n}\sum_{i=1}^n \phi_i(\mathbf{A}_i x) + \lambda g(x)\right\},$
where $\mathbf{A}_i \in \mathbb{R}^{m\times d}$, each $\phi_i$ is smooth and convex, and $g$ is a strongly convex function. When $g$ is also smooth, problem (14) can be written in the form of problem (1) with $v_i = \frac{1}{n}$ and
(15) $f_i(x) = \phi_i(\mathbf{A}_i x), \qquad \psi(x) = \lambda g(x).$
Quartz guarantees the duality gap to be less than $\epsilon$ in expectation using at most
(16)
iterations, where the parameters are assumed to satisfy the following expected separable overapproximation (ESO) inequality, which needs to hold for all $h \in \mathbb{R}^d$:
(17)
If, in addition, $g$ is smooth, then $\psi$ in (15) is smooth with Let us now consider several particular samplings:
Serial samplings. A sampling $S$ is said to be serial if $|S| = 1$ with probability 1. So, $p_i = \mathbb{P}(S = \{i\})$ for all $i$. By Lemma 5 in Qu et al. (2015), . Hence the iteration complexity bound (16) becomes
By choosing the default $\theta_i = \frac{1}{p_i}$ (for serial samplings, this is both the default choice mentioned in Lemma 2.3 and the optimal choice in view of Lemma 3.2), our iteration complexity bound in Theorem 3.1 becomes
We can see that as long as , the two bounds are essentially the same.
Parallel ($\tau$-nice) sampling. A sampling $S$ is said to be $\tau$-nice if it selects uniformly at random from all subsets of $[n]$ of cardinality $\tau$. By Lemma 6 in Qu et al. (2015),
where $\mathbf{A}_{j:}$ is the $j$th row of $\mathbf{A}$ and, for each $j$, $\omega_j$ denotes the number of nonzero blocks in the $j$th row of the matrix $\mathbf{A}$. In the dense case, i.e., when $\omega_j = n$ for all $j$, (16) becomes
By choosing , we get , and our iteration complexity bound in Theorem 3.1 becomes
We can see that the bounds are also essentially the same, up to constant factors. However, if , then Quartz enjoys a tighter bound.
The ESO parameters used in Qu et al. (2015) allow one to exploit the sparsity of the data matrix and achieve an almost linear speedup when the data is sparse or has favourable spectral properties. In the next section, we further study SAGA-AS in the case when the objective function is of the form (14), and obtain results which, like those for Quartz, are able to improve with data sparsity.
4 Analysis in the Composite Case
We now consider the general problem (1) with a possibly nonsmooth regularizer $\psi \not\equiv 0$.
4.1 Assumptions
In order to be able to take advantage of data sparsity, we assume that the functions $f_i$ take the form
(18) $f_i(x) = \phi_i(\mathbf{A}_i x).$
Then clearly $\nabla f_i(x) = \mathbf{A}_i^\top \nabla\phi_i(\mathbf{A}_i x) \in \mathrm{Range}(\mathbf{A}_i^\top)$. Thus, if SAGA-AS starts with $\mathbf{J}^0_{:i} = \mathbf{A}_i^\top y_i^0$ for some $y_i^0$, then we always have $\mathbf{J}^k_{:i} = \mathbf{A}_i^\top y_i^k$ for some $y_i^k$. We assume that the set of minimizers of (1) is nonempty, and let for . Further, denote by $[x]^*$ the optimal solution closest to $x$. Further, for any $c$, define the set of points with objective value bounded by $c$, i.e.,
We make several further assumptions:
Assumption 4.1 (Smoothness).
Each is smooth and convex, i.e.,
Assumption 4.2 (Quadratic functional growth condition; see Necoara et al. (2018)).
For any $c$, there is $\mu(c) > 0$ such that for any $x$ in the corresponding level set,
(19) $P(x) - P^* \geq \frac{\mu(c)}{2}\,\|x - [x]^*\|^2.$
Assumption 4.3 (Nullspace consistency).
For any $x^*, y^* \in \mathcal{X}^*$ we have $\mathbf{A}_i x^* = \mathbf{A}_i y^*$ for all $i \in [n]$.
4.2 Linear convergence under quadratic functional growth condition
Our first complexity result states a linear convergence rate for SAGA-AS under the quadratic functional growth condition.
Theorem 4.4.
Assume each $\phi_i$ is smooth. Let $w$ be a positive vector satisfying (17) for a proper sampling $S$. Let $\{x^k\}$ be the iterates produced by Algorithm 1 with the default $\theta_i = \frac{1}{p_i}$ for all $i$. Fix any $c$. Consider the Lyapunov function
where for all . Then there is a constant $\mu(c) > 0$ such that the following is true: if the stepsize $\alpha$ satisfies
(20)
then This implies that if we choose $\alpha$ equal to the upper bound in (20), then when
Also note that the Lyapunov function considered here is not stochastic (i.e., it is not random).
Non-strongly convex problems of the form (1), (18) which satisfy Assumptions 4.1, 4.2 and 4.3 include the case when each $\phi_i$ is strongly convex and $\psi$ is polyhedral Necoara et al. (2018). In particular, Theorem 4.4 applies to the regularized logistic regression problem that we use in the experiments:
(21)
Remark 4.5.
It is important to note that estimating the constant in Theorem 4.4 is not an easy task. However, we can obtain a stepsize $\alpha$ satisfying (20) without any knowledge of this constant, and keep the same order of the complexity bound. W.l.o.g. we can assume . Then (20) is satisfied by taking . This makes SAGA stand out from all other stochastic variance-reduced methods, including SDCA Shalev-Shwartz & Zhang (2013), SVRG Xiao & Zhang (2014a), SPDC Zhang & Lin (2015), etc. We also note that an adaptive strategy was proposed for SVRG in order to avoid guessing this constant Xu et al. (2017). Their adaptive approach leads to a triple-looped algorithm (as SVRG is already double-looped), and linear convergence as in our Theorem 4.4 was not obtained.