This paper connects two algorithms which until now have remained remarkably disjoint in the literature: the randomized Kaczmarz algorithm for solving linear systems and the stochastic gradient descent (SGD) method for optimizing a convex objective using unbiased gradient estimates. The connection enables us to make contributions by borrowing from each body of literature to the other. In particular, it helps us highlight the role of weighted sampling for SGD and obtain a tighter guarantee on the linear convergence regime of SGD.
Recall that stochastic gradient descent is a method for minimizing a convex objective based on access to unbiased stochastic gradient estimates, i.e. to an estimate for the gradient at a given point , such that . Viewing as an expectation , the unbiased gradient estimate can be obtained by drawing and using its gradient: . SGD originated as “Stochastic Approximation” in the pioneering work of Robbins and Monroe 
, and has recently received renewed attention for confronting very large scale problems, especially in the context of machine learning[4, 42, 31, 2]. Classical analysis of SGD shows a polynomial rate on the sub-optimality of the objective value, , namely for non-smooth objectives, and for smooth, or non-smooth but strongly convex objectives. Such convergence can be ensured even if the iterates do not necessarily converge to a unique optimum , as might be the case if is not strongly convex. Here we focus on the strongly convex case, where the optimum is unique, and on convergence of the iterates to the optimum .
Bach and Moulines  recently provided a non-asymptotic bound on the convergence of the iterates in strongly convex SGD, improving on previous results of this kind [26, Section 2.2][3, Section 3.2]. In particular, they showed that if each is smooth and if is a minimizer of (almost) all , i.e. , then goes to zero exponentially, rather than polynomially, in . That is, reaching a desired accuracy of requires a number of steps that scales only logarithmically in . Bach and Moulines’s bound on the required number of iterations further depends on the average squared conditioning number , where is the Lipschitz constant of (i.e. are “-smooth”), and is -strongly convex. If is not an exact minimizer of each , the bound degrades gracefully as a function of , and includes an unavoidable term that behaves as .
In a seemingly independent line of research, the Kaczmarz method was proposed as an iterative method for solving (usually overdetermined) systems of linear equations . The simplicity of the method makes it useful in a wide array of applications ranging from computer tomography to digital signal processing [16, 27, 18]. Recently, Strohmer and Vershynin 
proposed a variant of the Kaczmarz method using a random selection method which select rows with probability proportional to their squared norm, and showed that using this selection strategy, a desired accuracy ofcan be reached in the noiseless setting in a number of steps that scales like and linearly in the condition number.
1.1. Importance sampling in stochastic optimization
From a birds-eye perspective, this paper aims to extend the notion of importance sampling from stochastic sampling methods for numerical linear algebra applications, to more general stochastic convex optimization problems. Strohmer and Vershynin’s incorporation of importance sampling into the Kaczmarz setup  is just one such example, and most closely related to the SGD set-up. But importance sampling has also been considered in stochastic coordinate-descent methods [33, 40]. There also, the weights are proportional to some power of the Lipschitz constants (of the gradient coordinates).
Importance sampling has also played a key role in designing sampling-based low-rank matrix approximation algorithms – both row/column based sampling and entry-wise sampling – where it goes by the name of leverage score sampling. The resulting sampling methods are again proportional to the squared Euclidean norms of rows and columns of the underlying matrix. See [5, 25, 44, 9], and references therein for applications to the column subset selection problem and matrix completion. See [14, 24, 48] for applications of importance sampling to the Nyström Method.
Importance sampling has also been introduced to the compressive sensing framework, where it translates to sampling rows of an orthonormal matrix proportionally to their squared inner products with the rows of a second orthonormal matrix in which the underlying signal is assumed sparse. See [39, 20] for more details.
Inspired by the analysis of Strohmer and Vershynin and Bach and Moulines, we prove convergence results for stochastic gradient descent as well as for SGD variants where gradient estimates are chosen based on a weighted sampling distribution, highlighting the role of importance sampling in SGD.
We first show (Corollary 2.2 in Section 2) that without perturbing the sampling distribution, we can obtain a linear dependence on the uniform conditioning , but it is not possible to obtain a linear dependence on the average conditioning . This is a quadratic improvement over the previous results  in regimes where the components have similar Lipschitz constants.
We then turn to importance sampling, using a weighted sampling distribution. We show that weighting components proportionally to their Lipschitz constants , as is essentially done by Strohmer and Vershynin, can reduce the dependence on the conditioning to a linear dependence on the average conditioning . However, this comes at an increased dependence on the residual . But, we show that by only partially biasing the sampling towards , we can enjoy the best of both worlds, obtaining a linear dependence on the average conditioning , without amplifying the dependence on the residual. Thus, using importance sampling, we obtain a guarantee dominating, and improving over the previous best-known results  (Corollary 3.1 in Section 2).
In Section 4, we consider the benefits of reweighted SGD also in other scenarios and regimes. We show how also for smooth but not-strongly-convex objectives, importance sampling can improve a dependence on a uniform bound over smoothness, , to a dependence on the average smoothness
—such an improvement is not possible without importance sampling. For non-smooth objectives, we show that importance sampling can eliminate a dependence on the variance in the Lipschitz constants of the components. In parallel work we recently became aware of,Zhao and Zhang  also consider importance sampling for non-smooth objectives, including composite objectives, suggesting the same reweighting as we obtain here.
Finally, in Section 5, we turn to the Kaczmarz algorithm, explain how it is an instantiation of SGD, and how using partially biased sampling improves known guarantees in this context as well. We show that the randomized Kaczmarz method with uniform i.i.d. row selection can be recast as an instance of preconditioned Stochastic Gradient Descent acting on a re-weighted least squares problem and through this connection, provide exponential convergence rates for this algorithm. We also consider the Kaczmarz algorithm corresponding to SGD with hybrid row selection strategy which shares the exponential convergence rate of Strohmer and Vershynin  while also sharing a small error residual term of the SGD algorithm. This presents a clear tradeoff between convergence rate and the convergence residual, not present in other results for the method.
2. SGD for Strongly Convex Smooth Optimization
We consider the problem of minimizing a smooth convex function,
where is of the form for smooth functionals over endowed with the standard Euclidean norm , or over a Hilbert space with the norm . Here is drawn from some source distribution over an arbitrary probability space. Throughout this manuscript, unless explicitly specified otherwise, expectations will be with respect to indices drawn from the source distribution . That is, we write . We also denote by the “residual” quantity at the minimum,
We will instate the following assumptions on the function :
Each is continuously differentiable and the gradient function has Lipschitz constant ; that is,
for all vectorsand .
has strong convexity parameter ; that is, for all vectors and .
We denote the supremum of the support of , i.e. the smallest such that a.s., and similarly denote the infimum. We denote the average Lipschitz constant as .
A unbiased gradient estimate for can be obtained by drawing and using as the estimate. The SGD updates with (fixed) step size based on these gradient estimates are then given by:
where are drawn i.i.d. from . We are interested in the distance of the iterates from the unique minimum, and denote the initial distance by .
Bach and Moulines [1, Theorem 1] considered this setting111Bach and Moulines’s results are somewhat more general. Their Lipschitz requirement is a bit weaker and more complicated, but in terms of yields (2.3). They also study the use of polynomial decaying step-sizes, but these do not lead to improved runtime if the target accuracy is known ahead of time. and established that
SGD iterations of the form (2.2), with an appropriate step-size, are sufficient to ensure , where the expectations is over the random sampling. As long as , i.e. the same minimizer minimizes all components (though of course it need not be a unique minimizer of any of them), this yields linear convergence to , with a graceful degradation as . However, in the linear convergence regime, the number of required iterations scales with the expected squared conditioning . In this paper, we reduce this quadratic dependence to a linear dependence. We begin with a guarantee ensuring linear dependence, though with a dependence on rather than :
Let each be convex where has Lipschitz constant , with a.s., and let be -strongly convex. Set , where . Suppose that . Then the SGD iterates given by (2.2) satisfy:
where the expectation is with respect to the sampling of .
If we are given a desired tolerance, , and we know the Lipschitz constants and parameters of strong convexity, we may optimize the step-size , and obtain:
For any desired , using a step-size of
we have that after
SGD iterations, , where and where the expectation is with respect to the sampling of .
Substituting into the second term of (2.4) and simplifying gives the bound
Now asking that
substituting for , and rearranging to solve for , shows that we need such that
Utilizing the fact that for and rearranging again yields the requirement that
The crux of the improvement over Bach and Moulines is in a tighter recursive equation. Bach and Moulines rely on the recursion
whereas we use the Co-Coercivity Lemma A.1, with which we can obtain the recursion
where is the Lipschitz constant of the component used in the current iterate. The significant difference is that one of the factors of (an upper bound on the second derivative), in the third term inside the parenthesis, is replaced by (a lower bound on the second derivative of ). A complete proof can be found in the appendix.
Comparison to results of Bach and Moulines.
Our bound (2.5) replaces the dependence on the average square conditioning with a linear dependence on the uniform conditioning . When all Lipschitz constants are of similar magnitude, this is a quadratic improvement in the number of required iterations. However, when different components have widely different scaling, i.e. are highly variable, the supremum might be larger then the average square conditioning.
Considering the above, one might hope to obtain a linear dependence on the average conditioning . However, as the following example shows, this is not possible. Consider a uniform source distribution over quadratics, with the first quadratic being and all others being , and . Any method must examine in order to recover to within error less then one, but by uniformly sampling indices , this takes iterations in expectation. It is easy to verify that in this case, , and . For large , a linear dependence on would mean that a constant number of iterations suffice (as ), but we just saw that any method that sampled uniformly must consider at least samples in expectation to get non-trivial error. Note that both and indeed correspond to the correct number of iterations required by SGD.
We therefore see that the choice between a dependence on the average quadratic conditioning , or a linear dependence on the uniform conditioning , is unavoidable. A linear dependence on the average conditioning is not possible with any method that samples from the source distribution . In the next Section, we will show how we can obtain a linear dependence on the average conditioning , using importance sampling, i.e. by sampling from a modified distribution.
3. Importance Sampling
We will now consider stochastic gradient descent, where gradient estimates are sampled from a weighted distribution.
3.1. Reweighting a Distribution
For a weight function which assigns a non-negative weight to each index , the weighted distribution is defined as the distribution such that
where is an event (subset of indices) and its indicator function. For a discrete distribution with probability mass function this corresponds to weighting the probabilities to obtain a new probability mass function:
Similarly, for a continuous distribution, this corresponds to multiplying the density by and renormalizing.
One way to construct the weighted distribution , and sample from it, is through rejection sampling: sample , and accept with probability , for some . Otherwise, reject and continue to re-sample until a suggestion is accepted. The accepted samples are then distributed according to .
We use to denote an expectation where indices are sampled from the weighted distribution . An important property of such an expectation is that for any quantity that depends on :
where recall that the expectations on the r.h.s. are with respect to . In particular, when , we have that . In fact, we will consider only weights s.t. , and refer to such weights as normalized.
3.2. Reweighted SGD
For any normalized weight function , we can weight each component , defining:
The representation (3.3) is an equivalent, and equally valid, stochastic representation of the objective , and we can just as well base SGD on this representation. In this case, at each iteration we sample and then use as an unbiased gradient estimate. SGD iterates based on the representation (3.3), which we will also refer to as -weighted SGD, are then given by
where are drawn i.i.d. from .
The important observation here is that all SGD guarantees are equally valid for the -weighted updates (3.4)–the objective is the same objective , the sub-optimality is the same, and the minimizer is the same. We do need, however, to calculate the relevant quantities controlling SGD convergence with respect to the modified components and the weighted distribution .
3.3. Strongly Convex Smooth Optimization using Weighted SGD
We now return to the analysis of strongly convex smooth optimization and investigate how re-weighting can yield a better guarantee. To do so, we must analyze the relevant quantities involved.
The Lipschitz constant of each component is now scaled, and we have, . The supremum is given by:
It is easy to verify that (3.5) is minimized by the weights
and that with this choice of weights
Note that the average Lipschitz constant is invariant under weightings.
Before applying Corollary 2.2, we must also calculate:
3.4. Partially biased sampling
If , i.e. we are in the “realizable” situation, with true linear convergence, then we also have . In this case, we already obtain the desired guarantee: linear convergence with a linear dependence on the average conditioning , strictly improving over Bach and Moulines. However, the inequality in (3.8) might be tight in the presence of components with very small that contribute towards the residual error (as might well be the case for a small component). When , we therefore get a dissatisfying scaling of the second term, relative to Bach and Moulines, by a factor of .
Fortunately, we can easily overcome this factor. To do so, consider sampling from a distribution which is a mixture of the original source distribution and its re-weighting using the weights (3.6). That is, sampling using the weights:
We refer to this as partially biased sampling. Using these weights, we have
Plugging these into Corollary 2.2 we obtain:
We now obtain the desired linear scaling on , without introducing any additional factor to the residual term, except for a constant factor of two. We thus obtain a result which dominates Bach and Moulines (up to a factor of 2) and substantially improves upon it (with a linear rather than quadratic dependence on the conditioning).
One might also ask whether the previous best known result (2.3) could be improved using weighted sampling. The relevant quantity to consider is the average square Lipschitz constant for the weighted representation: (3.3):
Interestingly, this quantity is minimized by the same weights as , given by (3.6), and with these weights we have:
Again, we can use the partially biased weights give in (3.10), which yields and also ensures . In any case, we get a dependence on instead of , which is indeed an improvement. Thus, the Bach and Moulines guarantee is also improved by using biased sampling, and in particular the partially biased sampling specified by the weights (3.10). However, relying on Bach and Moulines we still have a quadratic dependence on , as opposed to the linear dependence we obtain in Corollary 3.1.
3.5. Implementing Importance Sampling
As discussed above, when the magnitudes of are highly variable, importance sampling is necessarily in order to obtain a dependence on the average, rather than worst-case, conditioning. In some applications, especially when the Lipschitz constants are known in advance or easily calculated or bounded, such importance sampling might be possible by directly sampling from . This is the case, for example, in trigonometric approximation problems or linear systems which need to be solved repeatedly, or when the Lipschitz constant is easily computed from the data, and multiple passes over the data are needed anyway. We do acknowledge that in other regimes, when data is presented in an online fashion, or when we only have sampling access to the source distribution (or the implied distribution over gradient estimates), importance sampling might be difficult.
One option that could be considered, in light of the above results, is to use rejection sampling to simulate sampling from . For the weights (3.6), this can be done by accepting samples with probability proportional to . The overall probability of accepting a sample is then , introducing an additional factor of . This results in a sample complexity with a linear dependence on , as in Corollary 2.2 (for the weights (3.10), we can first accept with probability 1/2, and then if we do not accept, perform this procedure). Thus, if we are presented samples from , and the cost of obtaining the sample dominates the cost of taking the gradient step, we do not gain (but do not lose much either) from rejection sampling. We might still gain from rejection sampling if the cost of operating on a sample (calculating the actual gradient and taking a step according to it) dominates the cost of obtaining it and (a bound on) the Lipschitz constant.
3.6. A Family of Partially Biased Schemes
The choice of weights (3.10
) corresponds to an equal mix of uniform and fully biased sampling. More generally, we could consider sampling according to any one of a family of weights which interpolate between uniform and fully biased sampling:
To be concrete, we summarize below the a template algorithm for SGD with partially biased sampling:
Initial estimate Bias parameter Step size Tolerance parameter Access to the source distribution If : bounds on the Lipschitz constants ; the weights derived from them (see eq. 3.16); and access to the weighted distribution . Estimated solution to the problem Draw an index .
For arbitrary , we have the bounds
Plugging these quantities into Corollary 2.2, we obtain:
In this corollary, even if is close to 1, i.e. we add only a small amount of bias to the sampling, we obtain a bound with a linear dependence on the average conditioning (multiplied by a factor of ), since we can bound .
4. Importance Sampling for SGD in Other Scenarios
In the previous Section, we considered SGD for smooth and strongly convex objectives, and were particularly interested in the regime where the residual is low, and the linear convergence term is dominant. Weighted SGD could of course be relevant also in other scenarios, and we now briefly survey them, as well as relate them to our main scenario of interest.
4.1. Smooth, Not Strongly Convex
When each component is convex, non-negative, and has an -Lipschitz gradient, but the objective is not necessarily strongly convex, then after
iterations of SGD with an appropriately chosen step-size we will have , where is an appropriate averaging of the iterates Srebro et al. . The relevant quantity here determining the iteration complexity is again . Furthermore, Srebro et al. , relying on an example from Foygel and Srebro , point out that the dependence on the supremum is unavoidable and cannot be replaced with the average Lipschitz constant . That is, if we sample gradients according to the source distribution , we must have a linear dependence on .
The only quantity in the bound (4.1) that changes with a re-weighting is —all other quantities (, , and the sub-optimality ) are invariant to re-weightings. We can therefor replace the dependence on with a dependence on by using a weighted SGD as in (3.4). As we already calculated, the optimal weights are given by (3.6), and using them we have . In this case, there is no need for partially biased sampling, and we obtain that with an appropriate step-size,
We again see that using importance sampling allows us to reduce the dependence on , which is unavoidable without biased sampling, to a dependence on .
4.2. Non-Smooth Objectives
We now turn to non-smooth objectives, where the components might not be smooth, but each component is -Lipschitz. Roughly speaking, is a bound on the first derivative (gradient) of , while is a bound on the second derivatives of
. Here, the performance of SGD depends on the second moment. The precise iteration complexity depends on whether the objective is strongly convex or whether is bounded, but in either case depends linearly on (see e.g. [32, 43]).
By using weighted SGD we can replace the linear dependence on with a linear dependence on , where is the Lipschitz constant of the scaled and is given by . Again, this follows directly from the standard SGD guarantees, where we consider the representation (3.3) and use any subgradient from .
We can calculate:
which is minimized by the weights:
where . Using these weights we have . Using importance sampling, we can thus reduce the linear dependence on to a linear dependence on . Its helpful to recall that . What we save is therefore exactly the variance of the Lipschitz constants .
In parallel work, Zhao and Zhang  also consider importance sampling for stochastic optimization for non-smooth objectives. Zhao and Zhang consider a more general setting, with a composite objective that is only partially linearized. But also there, the iteration complexity depends on the second moment of the gradient estimates, and the analysis performed above applies (Zhao and Zhang perform a specialized analysis instead).
4.3. Non-Realizable Regime
Returning to the smooth and strongly convex setting of Sections 2 and 3, let us consider more carefully the residual term . This quantity definitively depends on the weighting, and in the analysis of Section 3.3, we avoided increasing it too much, introducing partial biasing for this purpose. However, if this is the dominant term, we might want to choose weights so as to minimize this term. The optimal weights here would be proportional to . The problem is that we do not know the minimizer , and so cannot calculate these weights. Approaches which dynamically update the weights based on the current iterates as a surrogate for are possible, but beyond the scope of this paper.
An alternative approach is to bound and so . Taking this bound, we are back to the same quantity as in the non-smooth case, and the optimal weights are proportional to . Note that this is a different weighting then using weights proportional to , which optimize the linear-convergence term as studied in Section 3.3.
To understand how weighting according to and are different, consider a generalized linear objective where , and is a scalar function with and . We have that while . Weighting according to the Lipschitz constants of the gradients, i.e. the “smoothness” parameters, as in (3.6), versus weighting according to the Lipschitz constants of as in (4.4), thus corresponds to weighting according to versus , and are rather different. We can also calculate that weighing by (i.e. following (3.6)), yields . That is, weights proportional to yield a suboptimal gradient-dependent term (the same dependence as if no weighting at all was used). Conversely, using weights proportional to , i.e. proportional to yields – a suboptimal dependence, though better then no weighting at all.
Again, as with partially biased sampling, we can weight by the average, and ensure both terms are optimal up to a factor of two.
5. The least squares case and the Randomized Kaczmarz Method
A special case of interest is the least squares problem, where
with an -dimensional vector, an matrix with rows , and is the least-squares solution. Writing the least squares problem (5.1) in the form (2.1), we see that the source distribution is uniform over , the components are , the Lipschitz constants are , the average Lipschitz constant is , the strong convexity parameter is , so that , and the residual is . Note that in the case that is not full-rank, one can instead replace
with the smallest nonzero eigenvalue ofas in [23, Equation (3)]. In that case, we instead write as the appropriate condition number.
The randomized Kaczmarz method [46, 7, 17, 27, 49, 8, 47, 15, 52, 28] for solving the least squares problem (5.1) begins with an arbitrary estimate , and in the th iteration selects a row i.i.d. at random from the matrix and iterates by:
where the step size in the standard method.
Strohmer and Vershynin provided the first non-asymptotic convergence rates, showing that drawing rows proportionally to leads to provable exponential convergence in expectation for the full-rank case . Their method can easily be extended to the case when the matrix is not full-rank to yield convergence to some solution, see e.g. [23, Equation (3)]. Recent works use acceleration techniques to improve convergence rates [22, 34, 11, 38, 52, 12, 10, 6, 35, 36, 37, 30, 29].
The reduction of the quadratic dependence on the conditioning to a linear dependence in Theorem 2.1, as well as the use of biased sampling which we investigate here was motivated by Strohmer and Vershynin’s analysis of the randomized Kaczmarz method. Indeed, applying Theorem 2.1 to the weighted SGD iterates (2.2) for (5.1) with the weights (3.6) and a stepsize of yields precisely the Strohmer and Vershynin  guarantee.
Understanding the randomized Kaczmarz method as SGD allows us also to obtain improved methods and results for the randomized Kaczmarz method:
As shown by Strohmer and Vershynin  and extended by Needell , the randomized Kaczmarz method with weighted sampling exhibits exponential convergence, but only to within a radius, or convergence horizon, of the least-squares solution. This is because a step-size of is used, and so the second term in (2.4) does not vanish. It has been shown [49, 8, 47, 15, 30] that changing the step size can allow for convergence inside of this convergence horizon, although non-asymptotic results have been difficult to obtain. Our results allow for finite-iteration guarantees with arbitrary step-sizes and can be immediately applied to this setting. Indeed, applying Theorem 2.1 with the weights (3.6) gives
Let be an matrix with rows . Set , where is the minimizer of the problem
Suppose that . Set , . Then the expected error at the iteration of the Kaczmarz method described by (5.2) with row selected with probability satisfies
with . The expectation is taken with respect to the weighted distribution over the rows.
When e.g. , we recover the exponential rate of Strohmer and Vershynin  up to a factor of , and nearly the same convergence horizon. For arbitrary , Corollary 5.1 implies a tradeoff between a smaller convergence horizon and a slower convergence rate.
Uniform Row Selection.
The Kaczmarz variant of Strohmer and Vershynin  calls for weighted row sampling, and thus requires pre-computing all the row norms. Although certainly possible in some applications, in other cases this might be better avoided. Understanding the randomized Kaczmarz as SGD allows us to apply Theorem 2.1 also with uniform weights (i.e. to the unbiased SGD), and obtain a randomized Kaczmarz using uniform sampling, which converges to the least-squares solution and enjoys finite-iteration guarantees:
Let be an matrix with rows . Let be the diagonal matrix with terms , and consider the composite matrix . Set , where is the minimizer of the weighted least squares problem
Suppose that . Then the expected error after iterations of the Kaczmarz method described by (5.2) with uniform row selection satisfies