 # Well-Conditioned Methods for Ill-Conditioned Systems: Linear Regression with Semi-Random Noise

Classical iterative algorithms for linear system solving and regression are brittle to the condition number of the data matrix. Even a semi-random adversary, constrained to only give additional consistent information, can arbitrarily hinder the resulting computational guarantees of existing solvers. We show how to overcome this barrier by developing a framework which takes state-of-the-art solvers and "robustifies" them to achieve comparable guarantees against a semi-random adversary. Given a matrix which contains an (unknown) well-conditioned submatrix, our methods obtain computational and statistical guarantees as if the entire matrix was well-conditioned. We complement our theoretical results with preliminary experimental evidence, showing that our methods are effective in practice.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Given matrix and labels , the fundamental problems of solving linear systems () and performing linear regression () are pervasive in statistics and scientific computing, and serve as benchmarks inspiring development of popular methods in convex optimization and numerical linear algebra (see e.g. [ATK88]). We consider first-order methods for these problems with state-of-the-art dependence on standard conditioning measures of the problem, and initiate the study of their tolerance to semi-random noise, a well-studied corruption model for testing robustness of algorithms to model misspecification (cf. discussion in Section 1.2).

Consider, for instance, an overdetermined linear system . When has a small condition number, i.e. , gradient methods rapidly converge. However, what if a subset of the rows is well-conditioned (e.g.  is small), but the whole matrix is not? If was known a priori, we could solve in just this subset of rows to efficiently recover the solution, ignoring the additional data. Counterintuitively, by giving additional consistent information, an adversary can arbitrarily decrease conditioning, breaking performance of first-order methods. Motivated by this phenomenon, we address the following main question.

###### Question 1.

Can we design algorithms for linear system solving and linear regression in matrices which contain a subset of rows with some “good property”, with computational and statistical guarantees comparable to if the entire matrix had this “good property”?

Our main contribution is answering this question affirmatively under standard settings where first-order methods apply. Our algorithms proceed by learning a reweighting of the rows of , such that the resulting reweighted matrix inherits the desirable “good property” provided there exists a well-behaved subset of rows. Moreover, the reweighting scheme itself is computationally efficient and does not bottleneck the overall runtime by more than a polylogarithmic factor. In this sense, our reweighting scheme “robustifies” standard first-order solvers against consistent (but possibly adversarial) noise by allowing them to perform algorithmic tasks in the best subset of examples, effectively ignoring poorly-behaved rows. The specific good properties we consider are motivated by the runtime guarantees of state-of-the-art first-order methods.

### 1.1 Linear systems and regression in the classical setting

We define the main algorithmic tasks we consider in this work. For simplicity, we study “overconstrained” variants of these tasks throughout, where has no kernel and solutions are unique.

###### Problem 1 (Overconstrained linear system solving).

Given full column rank and , find the unique with .

When choosing an algorithm for Problem 1, a primary consideration is its computational guarantees, i.e. convergence rate in terms of matrix properties. We define two fundamental matrix properties:

 (1)

When is clear from context, we refer to these quantities as and ; note that . denotes the standard condition number of and

can be thought of as an “average condition number”, as it is the ratio of the average eigenvalue of

and its smallest. First-order methods for Problem 1 largely fall into two regimes,111We focus on first-order methods with linear or sublinear dimension dependence in this work. Algorithms explicitly solving linear systems [CW13, NN13, LMP13, CLM+15] with runtimes at least , where is the matrix multiplication constant [WIL12, GAL14], also exist, but are expensive in practice. which we term “full gradient” and “row sampling”. Full gradient methods have polynomial dependence on and linear dependence on , the number of nonzero entries of ; row sampling methods are randomized and depend polynomially on , and linearly (or sublinearly) on dimensions , . We state two state-of-the-art runtimes for Problem 1 in these two regimes, which serve as benchmarks for our robust runtimes, and defer a broader, more formal discussion to Appendix B for brevity.

###### Proposition 1 ([Nes83]).

In the setting of Problem 1, accelerated gradient descent produces with in iterations, each taking

matrix-vector multiplications in

.

###### Proposition 2 ([Jz13, Akk+20]).

Stochastic variance reduced gradient (SVRG) methods solve Problem

1 to distance to the solution in iterations, which can be accelerated to , each taking

-dimensional vector operations. Both of these results hold with high probability

.

We remark that at the cost of a dependence on rather than , logarithmic factors in the runtime of Proposition 2 can be removed [ALL17]. Finally, for sparse matrices the dimension-dependent factors in Proposition 2 can be improved (e.g. the additive can be improved to an and the per-iteration complexity depends on row sparsity); such improvements also apply to our methods, and we omit them for brevity of runtime statements.

###### Remark 1.

Propositions 1 and 2 also solve regression (minimizing ) and inversion222To avoid overloading the definition of “system solving” (i.e. with respect to our Problem 1), we use inversion to mean applying an inverse to a vector rather than explicitly computing the inverse of a matrix. (applying the inverse of to a vector ) in corresponding runtimes depending on . For a more thorough discussion of this reduction, see Appendix B.

The second problem we consider in this work is a statistical variant of linear regression.

###### Problem 2 (Statistical regression).

Given full column rank and produced via

 b=Axtrue+ξ,ξ∼N(0,Σ), (2)

produce so that the mean-squared error is small (expectations are taken over ).

Problem 2 is a strict generalization of Problem 1, since we can set . Although our regression algorithms inherit the computational speedups of their Problem 1 counterparts, the minimizer to the regression problem changes under reweightings of and . Nonetheless, we demonstrate that under well-studied noise models (choices of in (2)), our reweighted regression solutions obtain mean-squared error guarantees as good or better than their unweighted counterparts, consistent with our empirical findings; we note that our risk guarantees are via surrogate upper bounds for mean squared error, which may be loose in some settings (cf. Section 5 for a discussion). This motivates the statistical notion of linear regression we study (used in the definition of Problem 2).

### 1.2 Semi-random noise models

The semi-random noise framework was first studied in the context of graph coloring [BS95]. It is a well-studied model in statistics and theoretical computer science and aims to bridge the average-case and worst-case guarantees of algorithms. A typical semi-random model consists of an (unknown) planted instance which a classical algorithm performs well against, augmented by additional information given by a “monotone” or “helpful” adversary masking the planted instance. In particular, if the algorithm fails when given additional, helpful information, then perhaps that algorithm has overfit to its original problem specification. This model has been studied for e.g. clique finding [JER92, FK00], community detection [FK01, MPW16], and graph partitioning [MMV12] in random graph models, where the monotone adversary removes edges between clusters and adds edges within clusters. Recent work [CG18] also defined a semi-random noise model for matrix completion, where entries of a matrix are revealed with only greater probability than in the classical setting.

Our semi-random adversary takes a well-conditioned planted set of rows

and appends additional rows to form a larger (possibly ill-conditioned) matrix

. This adversary for linear systems and regression is motivated through lenses of model misspecification and average-case complexity. As data is typically assumed to be randomly generated from a distribution (which may not be uniformly well-conditioned), it is desirable for algorithms to attain good guarantees even when only a portion of the data is well-behaved. Existing methods have dependences on parameters such as and (1

), which are sensitive to outliers that can arbitrarily affect these parameters. Our setting differs from most prior work (

[CG18] being a notable exception) in that semi-random noise has typically been studied for computationally hard optimization problems, where the goal is to prove polynomial-time tractability. For our problem, polynomial-time algorithms unconditionally exist, and the goal is to match the performance of the fastest algorithms in the uncorrupted, well-conditioned setting. Our guarantees fit into a broader program of giving tight runtime bounds for algorithmic tasks under data corruption (e.g. [CG18, CDG19b, CDG+19a, DHL19]).

Finally, we note there are first-order iterative methods which can handle a few large eigenvalue directions if the rest of the spectrum is well-conditioned, such as the conjugate gradient method (see discussion in [MNS+18]). Our results apply broadly to an arbitrary semi-random adversary.

### 1.3 Our results

In the following, the notation denotes that matrix is formed by vertically concatenating a row subset of . We also define the following quantities, analogous to (1), denoting the conditioning of the “best row subset” of a matrix:

 κg(A):=minAg⊆Aκ(Ag),τg(A):=minAg⊆Aτ(Ag). (3)

We give full algorithm statements in Appendix A, but describe them in Sections 3 and 4.

###### Theorem 1.

Algorithm 2, , takes as input a full column rank matrix

and an estimate of

within a polynomial multiplicative factor in problem parameters, and returns nonnegative diagonal weight matrix , such that333We say if for some constant .

 κ(W12A)=O(κg(A)), in time ˜O(√κg(A)⋅nnz(A)),

with high probability in . Thus, there is a (randomized) algorithm which solves Problem 1 in time with high probability in .

Compared to Proposition 1, we obtain a runtime equivalent (up to logarithmic factors) to applying accelerated gradient descent on , if was as well-conditioned as its best subset of rows, without a priori knowledge of the best subset. We give the algorithm and its analysis in Section 3.

###### Theorem 2.

Algorithm 3, , takes as input a full column rank matrix and an estimate of within a polynomial multiplicative factor in problem parameters, and returns nonnegative diagonal weight matrix , such that

 τ(W12A)=O(τg(A)), in time ˜O((n+√dτg(A))⋅d),

with high probability in . Thus, there is a (randomized) algorithm which solves Problem 1 in time with high probability in .

Compared to Proposition 2, our runtime is equivalent (up to logarithmic factors) to those of accelerated variance-reduced methods on , if had a trace-to-smallest-eigenvalue bound as good as its best subset of rows. The algorithm and analysis are given in Section 4. We remark that even when there does not exist a well-conditioned subset of rows , the guarantees of Theorem 1 and Theorem 2 inherit the conditioning of the most well-conditioned mixture . We state our claims with respect to the best row subsets for simplicity, and consistency with the noise model.

We apply our weight-learning methods to study Problem 2 under several generative models in Section 5. We show in various settings that the minimizer of the reweighted regression problem has risk upper bounds as good or better than the original regression minimizer; moreover, the regression procedure inherits the reweighted runtime improvements.

## 2 Preliminaries

General notation. Applied to a vector, is the norm. is the simplex in dimensions, the subset of with unit norm. Applied to matrices, is the Schatten- norm. is the multivariate Gaussian with specified mean and covariance. is the set of naturals .

Matrices. , , and Tr denote the largest eigenvalue, smallest eigenvalue, and trace of a symmetric matrix. The set of symmetric matrices is , and the positive semidefinite cone is . The inner product between matrices is the trace product, . We use the Loewner order on : if and only if . is the identity of appropriate dimension when clear. for is the diagonal matrix with diagonal entries ; we use when context is clear. For full-rank , . For with eigendecomposition , , where is entrywise to the diagonal.

Approximations. is an -approximation to if , for , . An -approximation is an “-multiplicative approximation” and a -approximation is a “-additive approximation”. This notation simplifies bounds in the remainder of this section.

Our algorithms instantiate the recent mixed positive SDP solver of [JLL+20], with careful arguments about per-iteration costs via instance structure. The solver of [JLL+20] gives width-independent (multiplicative) guarantees on a family of structured SDP instances with positivity constraints (cf. Appendix E for discussion of width-independence). The specializations known as “pure packing” SDPs has found great utility in the robust statistics literature [CG18, CDG19b], but for our purposes we require a full mixed packing-covering solver. We give a variant of the main claim of [JLL+20].

###### Definition 1 (MPC feasibility problem).

Given sets of matrices and , and error tolerance , the mixed packing-covering (MPC) feasibility problem asks to return weights such that

 λmax⎛⎝∑i∈[n]wiPi⎞⎠≤(1+ϵ)λmin⎛⎝∑i∈[n]wiCi⎞⎠, (4)

or conclude that the following is infeasible for :

 λmax⎛⎝∑i∈[n]wiPi⎞⎠≤λmin⎛⎝∑i∈[n]wiCi⎞⎠. (5)
###### Proposition 3 ([Jll+20]).

Algorithm 1, , takes inputs , , and error tolerance , and solves problem (4), (5), in iterations, where , . Each iteration uses -dimensional vector operations, and computes, for with for ,

 (6) (ϵ20,e−10log(ndρ)ϵTr(Ci))-approximations to ⟨Ci,exp(−∑i∈[n]wiCi)Trexp(−∑i∈[n]wiCi)⟩∀i∈[n].

We give Algorithm 1 in Appendix A. While this latter additive guarantee appears cumbersome, in our instances will be a constant, so the error must simply be bounded by .

We prove Theorem 1, the guarantees of our full gradient-based method, in this section. The proof relies on a number of computational building blocks for implementing steps of Proposition 3, which we state as Lemmas 1 and 2 during the course of the proof of Theorem 1, and defer the proofs of these helper lemmas to Appendix C.

###### Proof of Theorem 1.

Assume for now that we know , denoted as for brevity. Throughout, let the rows of be . We instantiate Proposition 3 with sufficiently small constant , and

 Pi:=aia⊤i,Ci:=κgaia⊤i. (7)

This instance has , where is defined in Proposition 3. Moreover, as (5) is feasible (by setting to be the indicator of the best row set ), the algorithm cannot conclude infeasibility, and thus will return weights satisfying (4) in iterations. The resulting implication is

 κ(W12A)=λmax(∑i∈[n]wiaia⊤i)λmin(∑i∈[n]wiaia⊤i)=κgλmax(∑i∈[n]wiPi)λmin(∑i∈[n]wiCi)=O(κg).

#### Approximating matrix exponential-vector products.

We now state Lemmas 1 and 2, helper lemmas which bound the complexity of computing quantities in (6) to sufficient accuracy throughout the algorithm. Lemma 1 is a standard computational subroutine in prior positive SDP algorithms [PTZ16, ALO16a]. However, obtaining Lemma 2 requires further insight. In its proof, we exploit the existence of degree- polynomials in which approximate . Since

 λmax(I+∑i∈[n]wiCiΔ)λmin(I+∑i∈[n]wiCiΔ)≤λmax(I+∑i∈[n]wiCiΔ)=O(κgR)=˜O(κg)

throughout our algorithm, we can apply efficiently via Proposition 1. This exploits structure of our instance (7), and that our guess is an underestimate to certify cheap iterate implementation. In the following, we assume for simplicity that Proposition 1 solves a linear system in exactly in time . We defer proofs to Appendix C, and discuss precision issues for solvers in Appendix D; the resulting runtime loss is no more than a logarithmic factor.

###### Lemma 1.

Define by (7), , and constant . We can compute -approximations to and all in with high probability.

###### Lemma 2.

Define by (7), , , and constant . We can compute an -approximation to and -approximations to all in with high probability.

Packing inner products. By the assumption in Proposition 3, in each iteration for some , where we recall is a constant. Therefore, by Lemma 1 we compute -multiplicative approximations to and all in time . Combining these guarantees yields the desired approximations.

Covering inner products. For some , by assumption and in each iteration. Here, we used that for all . Therefore, by Lemma 2 we compute the ratio between each and to the desired approximation in time .444Composing an -approximation with an -approximation yields a -approximation for their product, for sufficiently small , .

Searching for . We now lift the assumption of knowledge of . Recall that we assume access to a polynomially bounded estimate to this quantity. By starting at the lower bound of our estimate and iteratively incrementing by multiplicative factors of , we incur only a logarithmic overall overhead in runtime as is a constant. Moreover, each run will never have a worse runtime by more than a constant (as is never significantly overestimated), and the definition of feasibility implies the quality of the output will not be affected by more than a constant. We remark that in practice, this search procedure was not needed; see Appendix G for a discussion.

Linear system solving. As , and the linear system was assumed consistent, by using Proposition 1 on our learned reweighting, we attain the runtime guarantee for Problem 1. ∎

## 4 Row sampling methods

We prove Theorem 2, the analog of Theorem 1, when the matrix contains a subset of rows satisfying a good trace-to-bottom-eigenvalue bound . In its course, we give a helper Lemma 3 (which again exploits SDP structure to certify fast system solves in gradient computations), whose proof is deferred to Appendix C. Solver precision issues are discussed in Appendix D.

###### Proof of Theorem 2.

Assume for simplicity that we know , denoted as for brevity. Throughout, as in Theorem 1, let rows of be . We instantiate Proposition 3 with a sufficiently small constant, and

 Pi:=∥ai∥22,Ci:=τgaia⊤i. (8)

In this instance, . Further, all , so . By definition of , (5) is feasible (by letting indicate the best subset ) so the algorithm returns weights satisfying (4) in iterations. The resulting implication is

 τ(W12A)=Tr(∑i∈[n]wiaia⊤i)λmin(∑i∈[n]wiaia⊤i)=τgλmax(∑i∈[n]wiPi)λmin(∑i∈[n]wiCi)=O(τg).

Inner product computations. We use the following helper claim.

###### Lemma 3.

Define by (8), constant , , and . We can compute -approximations to all and an -approximation to in with high probability.

We can precompute the (constant) inner products in in , as they are -dimensional and thus always simply . Next, by assumption, in each iteration , for some , and . By the structure of our covering matrices,

 Tr⎛⎝∑i∈[n]wiCi⎞⎠=τg∑i∈[n]wi∥ai∥22≤τgR.

Lemma 3 thus computes ratios between all and to the desired quality in time each iteration.

Cleaning up. The remaining details (incrementally searching for and solving the linear system in the reweighted matrix) follow analogously to Theorem 1, and we omit them for brevity. ∎

## 5 Linear regression

We derive guarantees for applying our algorithms to statistical linear regression, Problem 2. Our approach is to return the minimizer555We assume in this section for simplicity that we can solve all regression problems exactly; this is not prohibitive since in all our settings algorithms depend logarithmically on the target accuracy. of a weighted regression problem . As in Section 1.3, we obtain computational wins by learning via Theorems 1 and 2. We further bound incurred risk under the generative model for noise , where is the true underlying feature vector. When is a multiple of , this is the well-studied homoskedastic setting pervasive in statistical modeling. When varies with the data , the model is “heteroskedastic” (cf. [GRE90]). In most cases, we do not directly give guarantees on exact mean squared errors via our preprocessing, but rather certify (possibly loose) upper bound surrogates. Preliminary experiments (cf. Section 6) corroborate our predictions of improved risk in planted instances, albeit not to the degree implied by the conservative surrogate bounds; we leave direct certification of conditioning and risk simultaneously without going through a surrogate bound as an interesting future direction.

### 5.1 Heteroskedastic statistical guarantees

“Noisy features”. Consider the setting where the covariance in the model (2) has the form , for matrix . Under this assumption, we can rewrite (2) as Intuitively, this corresponds to exact observations under noisy features . As in this case always, regression is equivalent to linear system solving (Problem 1), and thus the solution to regression has . We then compute the risk

 E[∥x∗−xtrue∥22]=E[∥∥ξ′∥∥22]=Tr[Σ′].

This guarantee is invariant under the specific linear system solved, so we can obtain computational speedups via Theorems 12 without statistical loss.

“Row norm noise”. Consider the setting where the covariance in the model (2) has the form

 (9)

Intuitively, this corresponds to the setting where noise is independent across examples and the size scales linearly with the squared row norm. We prove the following in Appendix F.

###### Lemma 4.

Under the generative model (2), (9), letting , we have the bound .

It is immediate that if we knew rows with an improved , solving the restricted instance yields improved (surrogate) risk guarantees. We now consider the statistical guarantees of reweighted regression, via a helper calculation.

###### Lemma 5.

Under the generative model (2), (9), letting be the solution of regression in , for nonnegative diagonal weights , we have the bound .

This yields the following corollary by applying the preprocessing of Theorem 2 (see also Remark 1).

###### Corollary 1.

Consider an instance of Problem 2, generated by (2), (9). There is a (randomized) algorithm which takes as input and an estimate of , and returns with high probability with

 E[∥~x−xtrue∥22]≤O(σ2τg(A)), in time ˜O((n+√dτg(A))⋅d).

Thus, Corollary 1 improves the surrogate risk bound in Lemma 4 by a factor of .

### 5.2 Homoskedastic statistical guarantees

We consider the homoskedastic setting where in (2), which yields the bound (cf. Lemma 8)

 E[∥x∗−xtrue∥22]≤σ2d⋅1λmin(A⊤A). (10)

In this setting, the notion of adversarial semi-random noise is at odds in the computational and statistical senses; by giving additional rows, the bound (

10) can only improve, since is monotonically increasing as rows are added to the data matrix. To address this, we show anywhere along the “computational-statistical tradeoff curve”, we can efficiently learn weights matching the performance of the best planted instance.

###### Lemma 6.

Consider an instance of Problem 2, generated by (2) with . For any pair where there exists satisfying , there is a (randomized) algorithm which takes as input , , and and returns with high probability with

 E[∥~x−xtrue∥22]≤O(σ2d⋅νg), in time ˜O(√κg⋅nnz(A)).

Lemma 6 shows that for any minimum eigenvalue “level” (certifying a statistical guarantee à la (10)), we can obtain the best computational guarantee for solving regression. We prove Lemma 6 and give an analog with a runtime dependence on for (see Lemma 10) in Appendix F.

## 6 Experiments

We perform preliminary experiments to validate utility of our method and its guarantees; all were done on a 2015 laptop using an Intel Core i5 CPU with 2.9 GHz clock frequency and 8 GB RAM. All code for these experiments, as well as a README containing instructions, can be found at the following web address: https://github.com/kjtian/semirandom-regression-experiments.

Problem instance. We consider the following “hard planted instance”, parameterized by four nonnegative integers and scaling factor . The matrix is composed of matrices , , vertically concatenated, with the rows randomly shuffled.

1. is a set of random orthogonal basis vectors, scaled by .

2. is a set of random unit vectors all of which are orthogonalized against .

3. is a set of random unit vectors, scaled by .

Note that this is a hard planted instance since the matrix consisting of downweighted and all of will be very well-conditioned, but it is difficult to identify rows of due to ’s presence.

Parameters. We used Algorithm 3 for preprocessing in each case. We found that the method was quite robust to aggressive parameter choices. We used a degree- rational approximation, Johnson-Lindenstrauss sketch directions (see Algorithm 3), step size , and ran for iterations; cf. Appendix G for further discussion.

Robustifying SAG. The scikit-learn.linear_model implementation of SAG [SRB17]

in their ridge regression solver was very brittle to

instances as described above, and did not converge for Problem 1. Algorithm 3 was able to learn a reweighted matrix for which SAG rapidly converged to the correct solution, while only calling SAG to implement our rational approximation within the algorithm. For , the end-to-end runtime of our procedure using SAG for every linear system solve was seconds. We defer additional details regarding this experiment to Appendix G.

Improving mean-squared error for heteroskedastic noise. We considered the “row norm noise” setting of Section 5.1, for with , , , and varying levels at even intervals. The results are presented in Figure 1. Our preprocessing stabilized and decreased mean squared errors, with stronger performance as (the number of “clean” rows in our model) increased. This is in line with what our theory predicts, as more clean rows improves concentration of the planted instance (resembling the identity), which is the setting where our surrogate risk bounds are roughly tight; conversely, bounds for the total instance were too pessimistic, but our method consistently obtained better risk. For both solving the (unweighted) regression problem and in each linear system solve in our procedure, we used the default “lsqr” ridge regression solver in scikit-learn.

For parameter settings used in these experiments, our end-to-end runtime using the “lsqr” solver in subroutines was slower than directly using the “lsqr” solver by a somewhat significant factor (roughly to ). However, for linear regression in much larger dimensions (e.g. ), our runtime was more competitive with “lsqr” (around slower). Our code was optimized only to a point and the scale of problems we considered was limited; we leave for future investigation whether high-performance implementations can obtain more competitive runtimes in practice for broader parameter ranges. Figure 1: Squared errors from true labels for d=100,200,300, averaged over 15 runs. Vertical error bars represent 1 standard deviation. Unweighted regression (vanilla “lsqr”) errors are in blue, reweighted regression (after preprocessing by our algorithm) errors are in orange. Our algorithm achieves better error and is more consistent; as d increases, the performance gap also widens.

Harder planted instance and further experiments. In Appendix G, we present a variety of additional experimental findings. Specifically, we consider a slightly more involved “hard problem instance” than the model presented in this section, which obtains a more dramatic value under the initial reweighting of our Algorithm 3, scaling polynomially in even when is . This allows for very poorly-behaved instances under “naïve defenses” when is comparable to .

We provide empirical results regarding the stability of decaying values across iterations of our preprocessing procedure for this harder instance. We also replicate Figure 1 for this instance under the heteroskedastic noise model, and show how the mean squared error decays as a function of increasing the dimension . In all cases, our preprocessing both decreases and stabilizes the risk under this noise model, in line with our theoretical predictions.

### Acknowledgments

We thank Arun Jambulapati for helpful discussions. This research was supported in part by NSF CAREER Award CCF-1844855 and a PayPal research gift award.

## References

• [ACH03] D. Achlioptas (2003) Database-friendly random projections: johnson-lindenstrauss with binary coins. J. Comput. Syst. Sci. 66 (4), pp. 671–687. Cited by: Fact 1.
• [AKK+20] N. Agarwal, S. M. Kakade, R. Kidambi, Y. T. Lee, P. Netrapalli, and A. Sidford (2020) Leverage score sampling for faster accelerated regression and ERM. In International Conference on Algorithmic Learning Theory 2020, Cited by: Appendix B, Proposition 2.
• [ALO16a] Z. Allen Zhu, Y. T. Lee, and L. Orecchia (2016) Using optimization to obtain a width-independent, parallel, simpler, and faster positive SDP solver. In Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2016, Arlington, VA, USA, January 10-12, 2016, pp. 1824–1831. Cited by: §E.1, §3.
• [AQR+16b] Z. Allen Zhu, Z. Qu, P. Richtárik, and Y. Yuan (2016) Even faster accelerated coordinate descent using non-uniform sampling. In

Proceedings of the 33nd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016

,
pp. 1110–1119. Cited by: Appendix B.
• [ALL17] Z. Allen-Zhu (2017) Katyusha: the first direct acceleration of stochastic gradient methods. J. Mach. Learn. Res. 18, pp. 221:1–221:51. Cited by: §1.1.
• [ATK88] K. A. Atkinson (1988) An introduction to numerical analysis. John Wiley and Sons. Cited by: §1.
• [BBN13] M. Baes, M. Bürgisser, and A. Nemirovski (2013) A randomized mirror-prox method for solving structured large-scale matrix saddle-point problems. SIAM J. Optimization 23 (2), pp. 934–962. Cited by: §E.2.
• [BS95] A. Blum and J. Spencer (1995) Coloring random and semi-random k-colorable graphs. J. Algorithms 19 (2), pp. 204–234. Cited by: §1.2.
• [CDS+19] Y. Carmon, J. C. Duchi, A. Sidford, and K. Tian (2019) A rank-1 sketch for matrix multiplicative weights. In Conference on Learning Theory, COLT 2019, 25-28 June 2019, Phoenix, AZ, USA, pp. 589–623. Cited by: §E.2.
• [CDG+19a] Y. Cheng, I. Diakonikolas, R. Ge, and D. P. Woodruff (2019) Faster algorithms for high-dimensional robust covariance estimation. In Conference on Learning Theory, COLT 2019, 25-28 June 2019, Phoenix, AZ, USA, pp. 727–757. Cited by: §1.2.
• [CDG19b] Y. Cheng, I. Diakonikolas, and R. Ge (2019) High-dimensional robust mean estimation in nearly-linear time. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2019, San Diego, California, USA, January 6-9, 2019, pp. 2755–2771. Cited by: §1.2, §2.
• [CG18] Y. Cheng and R. Ge (2018) Non-convex matrix completion against a semi-random adversary. In Conference On Learning Theory, COLT 2018, Stockholm, Sweden, 6-9 July 2018, pp. 1362–1394. Cited by: §1.2, §1.2, §2.
• [CW13] K. L. Clarkson and D. P. Woodruff (2013) Low rank approximation and regression in input sparsity time. In

Symposium on Theory of Computing Conference, STOC’13, Palo Alto, CA, USA, June 1-4, 2013

,
pp. 81–90. Cited by: footnote 1.
• [CLM+15] M. B. Cohen, Y. T. Lee, C. Musco, C. Musco, R. Peng, and A. Sidford (2015) Uniform sampling for matrix approximation. In Proceedings of the 2015 Conference on Innovations in Theoretical Computer Science, ITCS 2015, Rehovot, Israel, January 11-13, 2015, pp. 181–190. Cited by: footnote 1.
• [DBL14] A. Defazio, F. R. Bach, and S. Lacoste-Julien (2014) SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pp. 1646–1654. Cited by: Appendix B.
• [DHL19] Y. Dong, S. B. Hopkins, and J. Li (2019)

Quantum entropy scoring for fast robust mean estimation and improved outlier detection

.
In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pp. 6065–6075. Cited by: §1.2.
• [FK01] U. Feige and J. Kilian (2001) Heuristics for semirandom graph problems. J. Comput. Syst. Sci. 63 (4), pp. 639–671. Cited by: §1.2.
• [FK00] U. Feige and R. Krauthgamer (2000) Finding and certifying a large hidden clique in a semirandom graph. Random Struct. Algorithms 16 (2), pp. 195–208. Cited by: §1.2.
• [FGK+15] R. Frostig, R. Ge, S. M. Kakade, and A. Sidford (2015) Un-regularizing: approximate proximal point and faster stochastic algorithms for empirical risk minimization. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pp. 2540–2548. Cited by: Appendix B.
• [GAL14] F. L. Gall (2014)

Powers of tensors and fast matrix multiplication

.
In International Symposium on Symbolic and Algebraic Computation, ISSAC ’14, Kobe, Japan, July 23-25, 2014, pp. 296–303. Cited by: footnote 1.
• [GRE90] W. H. Greene (1990) Econometric analysis. Prentice Hall. Cited by: §5.
• [JLL+20] A. Jambulapati, Y. T. Lee, J. Li, S. Padmanabhan, and K. Tian (2020) Positive semidefinite programming: mixed, parallel, and width-independent. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, STOC 2020, Cited by: Appendix A, §E.1, §E.1, §2, Proposition 3.
• [JER92] M. Jerrum (1992) Large cliques elude the metropolis process. Random Struct. Algorithms 3 (4), pp. 347–360. Cited by: §1.2.
• [JZ13] R. Johnson and T. Zhang (2013)

Accelerating stochastic gradient descent using predictive variance reduction

.
In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, pp. 315–323. Cited by: Proposition 2.
• [LS13] Y. T. Lee and A. Sidford (2013) Efficient accelerated coordinate descent methods and faster algorithms for solving linear systems. In 54th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2013, 26-29 October, 2013, Berkeley, CA, USA, pp. 147–156. Cited by: Proposition 4.
• [LMP13] M. Li, G. L. Miller, and R. Peng (2013) Iterative row sampling. In 54th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2013, 26-29 October, 2013, Berkeley, CA, USA, pp. 127–136. Cited by: footnote 1.
• [LMH15] H. Lin, J. Mairal, and Z. Harchaoui (2015) A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pp. 3384–3392. Cited by: Appendix B.
• [MMV12] K. Makarychev, Y. Makarychev, and A. Vijayaraghavan (2012) Approximation algorithms for semi-random partitioning problems. In Proceedings of the 44th Symposium on Theory of Computing Conference, STOC 2012, New York, NY, USA, May 19 - 22, 2012, pp. 367–384. Cited by: §1.2.
• [MPW16] A. Moitra, W. Perry, and A. S. Wein (2016) How robust are reconstruction thresholds for community detection?. In Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2016, Cambridge, MA, USA, June 18-21, 2016, pp. 828–841. Cited by: §1.2.
• [MNS+18] C. Musco, P. Netrapalli, A. Sidford, S. Ubaru, and D. P. Woodruff (2018) Spectrum approximation beyond fast matrix multiplication: algorithms and hardness. In 9th Innovations in Theoretical Computer Science Conference, ITCS 2018, January 11-14, 2018, Cambridge, MA, USA, pp. 8:1–8:21. Cited by: §1.2.
• [NN13] J. Nelson and H. L. Nguyen (2013) OSNAP: faster numerical linear algebra algorithms via sparser subspace embeddings. In 54th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2013, 26-29 October, 2013, Berkeley, CA, USA, pp. 117–126. Cited by: footnote 1.
• [NES83] Y. Nesterov (1983) A method for solving a convex programming problem with convergence rate . Doklady AN SSSR 269, pp. 543–547. Cited by: Appendix B, Appendix B, Proposition 1.
• [PTZ16] R. Peng, K. Tangwongsan, and P. Zhang (2016) Faster and simpler width-independent parallel algorithms for positive semidefinite programming. CoRR abs/1201.5135v3. Cited by: §E.1, §3.
• [SV14] S. Sachdeva and N. K. Vishnoi (2014) Faster algorithms via approximation theory. Foundations and Trends in Theoretical Computer Science 9 (2), pp. 125–210. Cited by: Fact 2, Fact 3.
• [SRB17] M. Schmidt, N. L. Roux, and F. R. Bach (2017) Minimizing finite sums with the stochastic average gradient. Math. Program. 162 (1-2), pp. 83–112. Cited by: Appendix B, §6.
• [SZ16] S. Shalev-Shwartz and T. Zhang (2016) Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Math. Program. 155 (1-2), pp. 105–145. Cited by: Appendix B.
• [SV06] T. Strohmer and R. Vershynin (2006) A randomized solver for linear systems with exponential convergence. In

Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, 9th International Workshop on Approximation Algorithms for Combinatorial Optimization Problems, APPROX 2006 and 10th International Workshop on Randomization and Computation, RANDOM 2006, Barcelona, Spain, August 28-30 2006, Proceedings

,
pp. 499–507. Cited by: Proposition 4.
• [WIL12] V. V. Williams (2012) Multiplying matrices faster than coppersmith-winograd. In Proceedings of the 44th Symposium on Theory of Computing Conference, STOC 2012, New York, NY, USA, May 19 - 22, 2012, pp. 887–898. Cited by: footnote 1.

## Appendix A Algorithm statements

In this section, we give pseudocode for (the algorithm of Proposition 3 from [JLL+20]), (the algorithm of Theorem 1), and (the algorithm of Theorem 2). We note in the algorithm statement of [JLL+20] (and for the provable runtime of Proposition 3), was chosen to be , and was chosen to be .

The parameter settings required for the algorithms in this section required to yield our provable guarantees can be found in proofs of Theorem 1 and 2. We found in practice (cf. Section 6) that these parameter settings were on the conservative side and could be chosen more aggressively.