1 Introduction
Given a matrix $\mathbf{A}$ and labels $b$, the fundamental problems of solving linear systems ($\mathbf{A} x = b$) and performing linear regression ($\min_x \|\mathbf{A} x - b\|_2$) are pervasive in statistics and scientific computing, and serve as benchmarks inspiring the development of popular methods in convex optimization and numerical linear algebra (see e.g. [ATK88]). We consider first-order methods for these problems with state-of-the-art dependence on standard conditioning measures of the problem, and initiate the study of their tolerance to semi-random noise, a well-studied corruption model for testing robustness of algorithms to model misspecification (cf. discussion in Section 1.2).
Consider, for instance, an overdetermined linear system $\mathbf{A} x = b$. When $\mathbf{A}$ has a small condition number, gradient methods rapidly converge. However, what if a subset of the rows is well-conditioned (i.e. its condition number is small), but the whole matrix is not? If this subset were known a priori, we could solve in just this subset of rows to efficiently recover the solution, ignoring the additional data. Counterintuitively, by giving additional consistent information, an adversary can arbitrarily degrade conditioning, breaking the performance of first-order methods. Motivated by this phenomenon, we address the following main question.
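This phenomenon is easy to exhibit numerically. The following sketch (a toy instance, not the paper's construction) starts from a perfectly conditioned planted system and appends a single consistent but hugely scaled row; the solution is unchanged, but the condition number degrades by orders of magnitude:

```python
import numpy as np

rng = np.random.default_rng(0)
x_star = rng.standard_normal(3)

# Planted, well-conditioned rows: an orthonormal basis (condition number 1).
A_good = np.eye(3)
b_good = A_good @ x_star

# A "helpful" adversary appends a row that is *consistent* with x_star
# but hugely scaled in one direction.
v = np.array([[1.0, 1.0, 1.0]]) * 1e4
A = np.vstack([A_good, v])
b = np.concatenate([b_good, v @ x_star])

cond_good = np.linalg.cond(A_good)
cond_full = np.linalg.cond(A)

# The least-squares solution is unchanged: the extra information is consistent.
x_hat = np.linalg.lstsq(A, b, rcond=None)[0]
assert np.allclose(x_hat, x_star)
assert cond_full > 1000 * cond_good  # conditioning degrades by orders of magnitude
```

Gradient methods applied to the augmented system slow down in proportion to the new condition number, even though the planted subsystem alone would be trivial to solve.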
Question 1.
Can we design algorithms for linear system solving and linear regression in matrices which contain a subset of rows with some “good property”, with computational and statistical guarantees comparable to if the entire matrix had this “good property”?
Our main contribution is answering this question affirmatively under standard settings where first-order methods apply. Our algorithms proceed by learning a reweighting of the rows of $\mathbf{A}$, such that the resulting reweighted matrix inherits the desirable "good property" provided there exists a well-behaved subset of rows. Moreover, the reweighting scheme itself is computationally efficient and does not bottleneck the overall runtime by more than a polylogarithmic factor. In this sense, our reweighting scheme "robustifies" standard first-order solvers against consistent (but possibly adversarial) noise by allowing them to perform algorithmic tasks in the best subset of examples, effectively ignoring poorly-behaved rows. The specific good properties we consider are motivated by the runtime guarantees of state-of-the-art first-order methods.
1.1 Linear systems and regression in the classical setting
We define the main algorithmic tasks we consider in this work. For simplicity, we study "overconstrained" variants of these tasks throughout, where $\mathbf{A}$ has no kernel and solutions are unique.
Problem 1 (Overconstrained linear system solving).
Given full column rank $\mathbf{A} \in \mathbb{R}^{n \times d}$ and $b \in \mathbb{R}^n$, find the unique $x^\star \in \mathbb{R}^d$ with $\mathbf{A} x^\star = b$.
When choosing an algorithm for Problem 1, a primary consideration is its computational guarantees, i.e. convergence rate in terms of matrix properties. We define two fundamental matrix properties:
$$\kappa(\mathbf{A}) := \frac{\lambda_{\max}(\mathbf{A}^\top \mathbf{A})}{\lambda_{\min}(\mathbf{A}^\top \mathbf{A})}, \qquad \bar{\kappa}(\mathbf{A}) := \frac{\mathrm{Tr}(\mathbf{A}^\top \mathbf{A})}{d \cdot \lambda_{\min}(\mathbf{A}^\top \mathbf{A})}. \quad (1)$$
When $\mathbf{A}$ is clear from context, we refer to these quantities as $\kappa$ and $\bar{\kappa}$; note that $\bar{\kappa} \le \kappa$. $\kappa$ denotes the standard condition number of $\mathbf{A}^\top \mathbf{A}$, and $\bar{\kappa}$ can be thought of as an "average condition number", as it is the ratio of the average eigenvalue of $\mathbf{A}^\top \mathbf{A}$ and its smallest. First-order methods for Problem 1 largely fall into two regimes,¹ which we term "full gradient" and "row sampling". Full gradient methods have polynomial dependence on $\kappa$ and linear dependence on $\mathrm{nnz}(\mathbf{A})$, the number of nonzero entries of $\mathbf{A}$; row sampling methods are randomized and depend polynomially on $\bar{\kappa}$, and linearly (or sublinearly) on the dimensions $n$, $d$. We state two state-of-the-art runtimes for Problem 1 in these two regimes, which serve as benchmarks for our robust runtimes, and defer a broader, more formal discussion to Appendix B for brevity.

¹We focus on first-order methods with linear or sublinear dimension dependence in this work. Algorithms explicitly solving linear systems [CW13, NN13, LMP13, CLM+15], with runtimes at least $d^\omega$ where $\omega$ is the matrix multiplication constant [WIL12, GAL14], also exist, but are expensive in practice.

Proposition 1 ([NES83]).
Accelerated gradient descent solves Problem 1 to distance $\epsilon$ from the solution in $O(\sqrt{\kappa} \log(1/\epsilon))$ iterations, each requiring $O(\mathrm{nnz}(\mathbf{A}))$ time.
Proposition 2 ([JZ13, AKK+20]).
Stochastic variance reduced gradient (SVRG) methods solve Problem 1 to distance $\epsilon$ from the solution in $O((n + d\bar{\kappa}) \log(1/\epsilon))$ iterations, which can be accelerated to $O((n + \sqrt{n d \bar{\kappa}}) \log(1/\epsilon))$, each taking $O(1)$ $d$-dimensional vector operations. Both of these results hold with high probability.
We remark that at the cost of a worse conditioning dependence, logarithmic factors in the runtime of Proposition 2 can be removed [ALL17]. Finally, for sparse matrices the dimension-dependent factors in Proposition 2 can be improved (e.g. the per-iteration complexity depends only on row sparsity); such improvements also apply to our methods, and we omit them for brevity of runtime statements.
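The (unaccelerated) SVRG scheme of Proposition 2 can be sketched for least squares as follows. The step size and epoch length below are heuristic assumptions, not the theory's tuned constants, and the labels are exactly consistent so the per-sample gradient variance vanishes at the optimum:

```python
import numpy as np

def svrg_least_squares(A, b, outer=30, inner=None, eta=None, seed=0):
    """Minimal SVRG sketch for f(x) = (1/2n)||Ax - b||^2, following [JZ13]:
    each epoch computes one full gradient at an anchor point, then takes
    cheap variance-reduced stochastic steps."""
    n, d = A.shape
    rng = np.random.default_rng(seed)
    inner = inner or 2 * n
    # Heuristic step size ~ 1 / (2 * max row smoothness).
    eta = eta or 1.0 / (2 * np.max(np.sum(A * A, axis=1)))
    x = np.zeros(d)
    for _ in range(outer):
        x_anchor = x.copy()
        full_grad = A.T @ (A @ x_anchor - b) / n  # exact gradient at the anchor
        for _ in range(inner):
            i = rng.integers(n)
            # Variance-reduced gradient: g_i(x) - g_i(anchor) + full gradient.
            g = (A[i] @ x - A[i] @ x_anchor) * A[i] + full_grad
            x = x - eta * g
    return x

rng = np.random.default_rng(2)
A = rng.standard_normal((200, 5))
x_star = rng.standard_normal(5)
b = A @ x_star
x_hat = svrg_least_squares(A, b)
assert np.linalg.norm(x_hat - x_star) < 1e-3
```

Each inner step touches a single row, matching the "row sampling" regime's $d$-dimensional per-iteration cost.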
Remark 1.
Propositions 1 and 2 also solve regression (minimizing $\|\mathbf{A} x - b\|_2$) and inversion² (applying the inverse of $\mathbf{A}^\top \mathbf{A}$ to a vector) in corresponding runtimes. For a more thorough discussion of this reduction, see Appendix B.

²To avoid overloading the definition of "system solving" (i.e. with respect to our Problem 1), we use inversion to mean applying an inverse to a vector, rather than explicitly computing the inverse of a matrix.
The second problem we consider in this work is a statistical variant of linear regression.
Problem 2 (Statistical regression).
Given full column rank $\mathbf{A} \in \mathbb{R}^{n \times d}$ and $b \in \mathbb{R}^n$ produced via
$$b = \mathbf{A} x^\star + \delta, \qquad \delta \sim \mathcal{N}(0, \boldsymbol{\Sigma}), \quad (2)$$
produce $\hat{x}$ so that the mean-squared error $\mathbb{E}\|\hat{x} - x^\star\|_2^2$ is small (expectations are taken over $\delta$).
Problem 2 is a strict generalization of Problem 1, since we can set $\boldsymbol{\Sigma} = \mathbf{0}$. Although our regression algorithms inherit the computational speedups of their Problem 1 counterparts, the minimizer of the regression problem changes under reweightings of $\mathbf{A}$ and $b$. Nonetheless, we demonstrate that under well-studied noise models (choices of $\boldsymbol{\Sigma}$ in (2)), our reweighted regression solutions obtain mean-squared error guarantees as good as or better than their unweighted counterparts, consistent with our empirical findings; we note that our risk guarantees are via surrogate upper bounds for mean-squared error, which may be loose in some settings (cf. Section 5 for a discussion). This motivates the statistical notion of linear regression we study (used in the definition of Problem 2).
1.2 Semi-random noise models
The semi-random noise framework was first studied in the context of graph coloring [BS95]. It is a well-studied model in statistics and theoretical computer science, and aims to bridge the average-case and worst-case guarantees of algorithms. A typical semi-random model consists of an (unknown) planted instance which a classical algorithm performs well against, augmented by additional information given by a "monotone" or "helpful" adversary masking the planted instance. In particular, if the algorithm fails when given additional, helpful information, then perhaps that algorithm has overfit to its original problem specification. This model has been studied for e.g. clique finding [JER92, FK00], community detection [FK01, MPW16], and graph partitioning [MMV12] in random graph models, where the monotone adversary removes edges between clusters and adds edges within clusters. Recent work [CG18] also defined a semi-random noise model for matrix completion, where entries of a matrix are revealed with probability at least that of the classical setting.
Our semi-random adversary takes a well-conditioned planted set of rows and appends additional rows to form a larger (possibly ill-conditioned) matrix. This adversary for linear systems and regression is motivated through the lenses of model misspecification and average-case complexity. As data is typically assumed to be randomly generated from a distribution (which may not be uniformly well-conditioned), it is desirable for algorithms to attain good guarantees even when only a portion of the data is well-behaved. Existing methods have dependences on parameters such as those in (1), which are sensitive to outliers: corrupted rows can arbitrarily affect these parameters. Our setting differs from most prior work ([CG18] being a notable exception) in that semi-random noise has typically been studied for computationally hard optimization problems, where the goal is to prove polynomial-time tractability. For our problem, polynomial-time algorithms unconditionally exist, and the goal is to match the performance of the fastest algorithms in the uncorrupted, well-conditioned setting. Our guarantees fit into a broader program of giving tight runtime bounds for algorithmic tasks under data corruption (e.g. [CG18, CDG19b, CDG+19a, DHL19]).

Finally, we note there are first-order iterative methods which can handle a few large eigenvalue directions if the rest of the spectrum is well-conditioned, such as the conjugate gradient method (see discussion in [MNS+18]). Our results apply broadly to an arbitrary semi-random adversary.
1.3 Our results
In the following, the notation $\mathbf{A}' \subseteq \mathbf{A}$ denotes that matrix $\mathbf{A}'$ is formed by vertically concatenating a row subset of $\mathbf{A}$. We also define the following quantities, analogous to (1), denoting the conditioning of the "best row subset" of a matrix:
$$\kappa^\star(\mathbf{A}) := \min_{\mathbf{A}' \subseteq \mathbf{A}} \kappa(\mathbf{A}'), \qquad \bar{\kappa}^\star(\mathbf{A}) := \min_{\mathbf{A}' \subseteq \mathbf{A}} \bar{\kappa}(\mathbf{A}'). \quad (3)$$
We give full algorithm statements in Appendix A, but describe them in Sections 3 and 4.
Theorem 1.
Algorithm 2 takes as input a full column rank matrix $\mathbf{A}$ and an estimate of $\kappa^\star(\mathbf{A})$ within a polynomial multiplicative factor in problem parameters, and returns a nonnegative diagonal weight matrix $\mathbf{W}$ such that, with high probability in $n$, the reweighted matrix $\mathbf{W}^{1/2} \mathbf{A}$ has condition number $\widetilde{O}(\kappa^\star(\mathbf{A}))$.³ Thus, there is a (randomized) algorithm which solves Problem 1 in time $\widetilde{O}(\mathrm{nnz}(\mathbf{A}) \cdot \sqrt{\kappa^\star(\mathbf{A})})$ with high probability in $n$.

³We say $f = \widetilde{O}(g)$ if $f = O(g \log^c g)$ for some constant $c > 0$.
Compared to Proposition 1, we obtain a runtime equivalent (up to logarithmic factors) to applying accelerated gradient descent on $\mathbf{A}$, if $\mathbf{A}$ were as well-conditioned as its best subset of rows, without a priori knowledge of the best subset. We give the algorithm and its analysis in Section 3.
Theorem 2.
Algorithm 3 takes as input a full column rank matrix $\mathbf{A}$ and an estimate of $\bar{\kappa}^\star(\mathbf{A})$ within a polynomial multiplicative factor in problem parameters, and returns a nonnegative diagonal weight matrix $\mathbf{W}$ such that, with high probability in $n$, the reweighted matrix satisfies $\bar{\kappa}(\mathbf{W}^{1/2} \mathbf{A}) = \widetilde{O}(\bar{\kappa}^\star(\mathbf{A}))$. Thus, there is a (randomized) algorithm which solves Problem 1 in time comparable (up to logarithmic factors) to running accelerated variance-reduced methods on the best row subset, with high probability in $n$.
Compared to Proposition 2, our runtime is equivalent (up to logarithmic factors) to those of accelerated variance-reduced methods on $\mathbf{A}$, if $\mathbf{A}$ had a trace-to-smallest-eigenvalue bound as good as its best subset of rows. The algorithm and analysis are given in Section 4. We remark that even when there does not exist a well-conditioned subset of rows, the guarantees of Theorems 1 and 2 inherit the conditioning of the most well-conditioned mixture of rows. We state our claims with respect to the best row subsets for simplicity, and consistency with the noise model.
We apply our weight-learning methods to study Problem 2 under several generative models in Section 5. We show in various settings that the minimizer of the reweighted regression problem has risk upper bounds as good as or better than the original regression minimizer; moreover, the regression procedure inherits the reweighted runtime improvements.
2 Preliminaries
General notation. Applied to a vector, $\|\cdot\|_p$ is the $\ell_p$ norm. $\Delta^n$ is the simplex in $n$ dimensions, the subset of $\mathbb{R}^n_{\ge 0}$ with unit $\ell_1$ norm. Applied to matrices, $\|\cdot\|_p$ is the Schatten-$p$ norm. $\mathcal{N}(\mu, \boldsymbol{\Sigma})$ is the multivariate Gaussian with specified mean and covariance. $[n]$ is the set of naturals $\{1, 2, \ldots, n\}$.
Matrices. $\lambda_{\max}$, $\lambda_{\min}$, and $\mathrm{Tr}$ denote the largest eigenvalue, smallest eigenvalue, and trace of a symmetric matrix. The set of $d \times d$ symmetric matrices is $\mathbb{S}^d$, and the positive semidefinite cone is $\mathbb{S}^d_{\ge 0}$. The inner product between matrices is the trace product, $\langle \mathbf{M}, \mathbf{N} \rangle := \mathrm{Tr}(\mathbf{M} \mathbf{N})$. We use the Loewner order on $\mathbb{S}^d$: $\mathbf{M} \preceq \mathbf{N}$ if and only if $\mathbf{N} - \mathbf{M} \in \mathbb{S}^d_{\ge 0}$. $\mathbf{I}$ is the identity of appropriate dimension when clear. $\mathrm{diag}(w)$ for $w \in \mathbb{R}^n$ is the diagonal matrix with diagonal entries $w$; we drop the argument when context is clear. For $\mathbf{M}$ with eigendecomposition $\mathbf{M} = \mathbf{V} \boldsymbol{\Lambda} \mathbf{V}^\top$, $\exp(\mathbf{M}) := \mathbf{V} \exp(\boldsymbol{\Lambda}) \mathbf{V}^\top$, where $\exp$ is applied entrywise to the diagonal.
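The eigendecomposition definition of the matrix exponential above translates directly to code; this small helper (an illustrative utility, not one of the paper's algorithms) is exact for symmetric matrices:

```python
import numpy as np

def sym_expm(M):
    """exp(M) for symmetric M via eigendecomposition:
    if M = V diag(lam) V^T, then exp(M) = V diag(exp(lam)) V^T."""
    lam, V = np.linalg.eigh(M)
    return (V * np.exp(lam)) @ V.T

M = np.array([[0.0, 1.0], [1.0, 0.0]])
E = sym_expm(M)
# Sanity checks: exp(0) = I, and the result stays symmetric.
assert np.allclose(sym_expm(np.zeros((2, 2))), np.eye(2))
assert np.allclose(E, E.T)
```

For this particular $\mathbf{M}$ the closed form is $\begin{pmatrix} \cosh 1 & \sinh 1 \\ \sinh 1 & \cosh 1 \end{pmatrix}$, which the helper reproduces.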
Approximations. We say $\alpha$ is an $(\epsilon, \delta)$-approximation to $\beta$ if $|\alpha - \beta| \le \epsilon \beta + \delta$, for $\epsilon, \delta \ge 0$. An $(\epsilon, 0)$-approximation is an "$\epsilon$-multiplicative approximation" and a $(0, \delta)$-approximation is a "$\delta$-additive approximation". This notation simplifies bounds in the remainder of this section.
Our algorithms instantiate the recent mixed positive SDP solver of [JLL+20], with careful arguments about per-iteration costs via instance structure. The solver of [JLL+20] gives width-independent (multiplicative) guarantees for a family of structured SDP instances with positivity constraints (cf. Appendix E for a discussion of width-independence). The specialization known as "pure packing" SDPs has found great utility in the robust statistics literature [CG18, CDG19b], but for our purposes we require a full mixed packing-covering solver. We give a variant of the main claim of [JLL+20].
Definition 1 (MPC feasibility problem).
Given sets of packing and covering matrices $\{\mathbf{P}_i\}_{i \in [n]}, \{\mathbf{C}_i\}_{i \in [n]} \subset \mathbb{S}^d_{\ge 0}$ and error tolerance $\epsilon > 0$, the mixed packing-covering (MPC) feasibility problem asks to return weights $w \in \mathbb{R}^n_{\ge 0}$ such that
$$\lambda_{\max}\Big(\sum_{i \in [n]} w_i \mathbf{P}_i\Big) \le (1 + \epsilon) \, \lambda_{\min}\Big(\sum_{i \in [n]} w_i \mathbf{C}_i\Big), \quad (4)$$
or conclude that the following is infeasible for $w \in \mathbb{R}^n_{\ge 0}$:
$$\lambda_{\max}\Big(\sum_{i \in [n]} w_i \mathbf{P}_i\Big) \le \lambda_{\min}\Big(\sum_{i \in [n]} w_i \mathbf{C}_i\Big). \quad (5)$$
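The feasibility condition is easy to check for given weights. The sketch below assumes the standard relaxed MPC form, $\lambda_{\max}(\sum_i w_i \mathbf{P}_i) \le (1+\epsilon)\,\lambda_{\min}(\sum_i w_i \mathbf{C}_i)$; it only *verifies* candidate weights, whereas the solver of [JLL+20] finds them:

```python
import numpy as np

def satisfies_relaxed_mpc(w, packing, covering, eps):
    """Check whether nonnegative weights w satisfy the relaxed MPC condition:
    lambda_max(sum_i w_i P_i) <= (1 + eps) * lambda_min(sum_i w_i C_i)."""
    P = sum(wi * Pi for wi, Pi in zip(w, packing))
    C = sum(wi * Ci for wi, Ci in zip(w, covering))
    return np.linalg.eigvalsh(P)[-1] <= (1 + eps) * np.linalg.eigvalsh(C)[0]

I = np.eye(2)
# Identical packing and covering matrices: any weights are feasible.
assert satisfies_relaxed_mpc([0.3, 0.7], [I, I], [I, I], eps=0.1)
# A covering matrix with a near-zero direction makes the condition fail.
C_bad = np.diag([1.0, 1e-6])
assert not satisfies_relaxed_mpc([1.0], [I], [C_bad], eps=0.1)
```

The second check illustrates why covering constraints capture conditioning: a direction poorly covered by $\sum_i w_i \mathbf{C}_i$ drives $\lambda_{\min}$ toward zero.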
Proposition 3 ([JLL+20]).
3 Full gradient methods
We prove Theorem 1, the guarantees of our full gradient-based method, in this section. The proof relies on a number of computational building blocks for implementing steps of Proposition 3, which we state as Lemmas 1 and 2 during the course of the proof of Theorem 1; we defer the proofs of these helper lemmas to Appendix C.
Proof of Theorem 1.
Assume for now that we know $\kappa^\star := \kappa^\star(\mathbf{A})$. Throughout, let the rows of $\mathbf{A}$ be $\{a_i\}_{i \in [n]}$. We instantiate Proposition 3 with sufficiently small constant $\epsilon$, and
(7) 
This instance has the size parameters required by Proposition 3. Moreover, as (5) is feasible (by setting $w$ to be the indicator of the best row subset), the algorithm cannot conclude infeasibility, and thus will return weights satisfying (4) within the iteration bound. The resulting implication is the conditioning guarantee stated in Theorem 1.
Approximating matrix exponential-vector products.
We now state Lemmas 1 and 2, helper lemmas which bound the complexity of computing quantities in (6) to sufficient accuracy throughout the algorithm. Lemma 1 is a standard computational subroutine in prior positive SDP algorithms [PTZ16, ALO16a]. However, obtaining Lemma 2 requires further insight. In its proof, we exploit the existence of low-degree polynomial approximations to the matrix exponential. Since the relevant matrices remain well-conditioned throughout our algorithm, we can apply them efficiently via Proposition 1. This exploits the structure of our instance (7), and the fact that our guess is an underestimate, to certify cheap iterate implementation. In the following, we assume for simplicity that Proposition 1 solves a linear system exactly in the stated time. We defer proofs to Appendix C, and discuss precision issues for solvers in Appendix D; the resulting runtime loss is no more than a logarithmic factor.
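The polynomial-approximation idea can be sketched concretely: a truncated series applies (an approximation to) the matrix exponential to a vector using only matrix-vector products, never forming the exponential. The degree and scaling below are illustrative assumptions, not the paper's tuned rational approximation:

```python
import numpy as np

def expm_mv_taylor(matvec, v, degree=30):
    """Approximate exp(M) v with a truncated Taylor polynomial, using only
    matrix-vector products with M (each of which could itself be implemented
    by a fast solver when M involves an inverse)."""
    term, out = v.copy(), v.copy()
    for k in range(1, degree + 1):
        term = matvec(term) / k  # next Taylor term M^k v / k!
        out += term
    return out

rng = np.random.default_rng(3)
B = rng.standard_normal((6, 6))
M = -0.5 * (B @ B.T) / 6          # negative semidefinite: exp(M) is well-behaved
v = rng.standard_normal(6)

lam, V = np.linalg.eigh(M)
exact = (V * np.exp(lam)) @ V.T @ v
approx = expm_mv_taylor(lambda u: M @ u, v)
assert np.linalg.norm(approx - exact) < 1e-8
```

Because the spectrum here is bounded, a modest degree already achieves high accuracy; this is the mechanism that lets each solver iteration stay cheap.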
Lemma 1.
Define by (7), , and constant . We can compute approximations to and all in with high probability.
Lemma 2.
Define by (7), , , and constant . We can compute an approximation to and approximations to all in with high probability.
Packing inner products. By the assumption in Proposition 3, the packing quantities are bounded in each iteration, where we recall $\epsilon$ is a constant. Therefore, by Lemma 1 we compute multiplicative approximations to the required packing inner products. Combining these guarantees yields the desired approximations.

Covering inner products. Similarly, by assumption, the covering quantities are bounded in each iteration. Therefore, by Lemma 2 we compute the required ratios to the desired approximation.⁴

⁴Composing an $\epsilon$-approximation with a $\delta$-approximation yields an $O(\epsilon + \delta)$-approximation for their product, for sufficiently small $\epsilon, \delta$.
Searching for the conditioning parameter. We now lift the assumption of knowledge of $\kappa^\star$. Recall that we assume access to a polynomially bounded estimate of this quantity. By starting at the lower bound of our estimate and iteratively incrementing by constant multiplicative factors, we incur only a logarithmic overall overhead in runtime. Moreover, each run will never have a worse runtime by more than a constant factor (as the parameter is never significantly overestimated), and the definition of feasibility implies the quality of the output will not be affected by more than a constant. We remark that in practice, this search procedure was not needed; see Appendix G for a discussion.
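The incremental search step above is a generic pattern. In this sketch, `feasible` is a hypothetical callback standing in for one run of the solver with a guessed parameter; the names and the toy threshold are illustrative, not from the paper:

```python
def geometric_search(lo, hi, feasible, ratio=2.0):
    """Search for an unknown scale parameter by multiplicative stepping:
    start at a lower bound and grow by a constant factor until a run succeeds.
    Only logarithmically many runs are made between lo and hi."""
    guess = lo
    while guess <= hi * ratio:
        if feasible(guess):
            return guess
        guess *= ratio
    return None  # no guess in the allowed range succeeded

# Toy example: the true threshold is 37; a guess works once it reaches it.
found = geometric_search(1.0, 1024.0, lambda g: g >= 37)
assert found == 64.0  # first power of 2 at or above 37
```

Since each successful guess overshoots the true parameter by at most the step ratio, the final run's cost and output quality degrade by at most a constant factor, matching the argument in the proof.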
4 Row sampling methods
We prove Theorem 2, the analog of Theorem 1 when the matrix contains a subset of rows satisfying a good trace-to-bottom-eigenvalue bound. In its course, we give helper Lemma 3 (which again exploits SDP structure to certify fast system solves in gradient computations), whose proof is deferred to Appendix C. Solver precision issues are discussed in Appendix D.
Proof of Theorem 2.
Assume for simplicity that we know $\bar{\kappa}^\star := \bar{\kappa}^\star(\mathbf{A})$. Throughout, as in Theorem 1, let the rows of $\mathbf{A}$ be $\{a_i\}_{i \in [n]}$. We instantiate Proposition 3 with a sufficiently small constant $\epsilon$, and
(8) 
In this instance, the size parameters are bounded, and all covering matrices are uniformly controlled. By definition of $\bar{\kappa}^\star$, (5) is feasible (by letting $w$ indicate the best subset), so the algorithm returns weights satisfying (4) within the iteration bound. The resulting implication is the guarantee stated in Theorem 2.
Inner product computations. We use the following helper claim.
Lemma 3.
Define by (8), constant , , and . We can compute approximations to all and an approximation to in with high probability.
We can precompute the (constant) inner products, as they are scalar quantities. Next, by assumption, the required bounds hold in each iteration. By the structure of our covering matrices,
Lemma 3 thus computes the required ratios to the desired quality in each iteration.
Cleaning up. The remaining details (incrementally searching for $\bar{\kappa}^\star$ and solving the linear system in the reweighted matrix) follow analogously to Theorem 1, and we omit them for brevity. ∎
5 Linear regression
We derive guarantees for applying our algorithms to statistical linear regression, Problem 2. Our approach is to return the minimizer⁵ of a weighted regression problem $\min_x \|\mathbf{W}^{1/2}(\mathbf{A} x - b)\|_2^2$. As in Section 1.3, we obtain computational wins by learning the weights via Theorems 1 and 2. We further bound incurred risk under the generative model for the noise $\delta \sim \mathcal{N}(0, \boldsymbol{\Sigma})$, where $x^\star$ is the true underlying feature vector. When $\boldsymbol{\Sigma}$ is a multiple of the identity, this is the well-studied homoskedastic setting pervasive in statistical modeling. When $\boldsymbol{\Sigma}$ varies with the data $\mathbf{A}$, the model is "heteroskedastic" (cf. [GRE90]). In most cases, we do not directly give guarantees on exact mean-squared errors via our preprocessing, but rather certify (possibly loose) upper bound surrogates. Preliminary experiments (cf. Section 6) corroborate our predictions of improved risk in planted instances, albeit not to the degree implied by the conservative surrogate bounds; we leave direct certification of conditioning and risk simultaneously, without going through a surrogate bound, as an interesting future direction.

⁵We assume in this section for simplicity that we can solve all regression problems exactly; this is not prohibitive, since in all our settings algorithms depend logarithmically on the target accuracy.
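The weighted regression step itself is a standard rescaled least-squares solve. This sketch assumes the weighted objective $\|\mathbf{W}^{1/2}(\mathbf{A} x - b)\|_2^2$ and uses a direct solver for clarity; the paper's point is that fast iterative solvers apply to the reweighted matrix:

```python
import numpy as np

def weighted_regression(A, b, w):
    """Minimizer of ||W^(1/2)(Ax - b)||_2^2 with W = diag(w), via scaling
    each row (and label) by sqrt(w_i) and solving ordinary least squares."""
    sw = np.sqrt(np.asarray(w))
    return np.linalg.lstsq(A * sw[:, None], b * sw, rcond=None)[0]

rng = np.random.default_rng(5)
A = rng.standard_normal((50, 3))
x_star = rng.standard_normal(3)
b = A @ x_star  # consistent labels (no noise)

# With consistent labels, any positive reweighting recovers the same solution.
w = rng.uniform(0.1, 2.0, size=50)
assert np.allclose(weighted_regression(A, b, w), x_star)
```

The noiseless check mirrors the remark in Section 1.1: for consistent systems, reweighting changes conditioning but not the solution; only under noise does the minimizer move, which is what the risk bounds in this section address.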
5.1 Heteroskedastic statistical guarantees
“Noisy features”. Consider the setting where the covariance in the model (2) factors through a noise matrix applied to the features. Under this assumption, we can rewrite (2) as a consistent system in perturbed features; intuitively, this corresponds to exact observations under noisy features. As the system remains exactly consistent in this case, regression is equivalent to linear system solving (Problem 1), and thus the solution to regression has zero residual. We then compute the risk
This guarantee is invariant under the specific linear system solved, so we can obtain computational speedups via Theorems 1 and 2 without statistical loss.
“Row norm noise”. Consider the setting where the covariance in the model (2) has the form
$$\boldsymbol{\Sigma} = \sigma^2 \, \mathrm{diag}\left(\left\{\|a_i\|_2^2\right\}_{i \in [n]}\right). \quad (9)$$
Intuitively, this corresponds to the setting where noise is independent across examples, and its size scales linearly with the squared row norm. We prove the following in Appendix F.
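Generating data from this model is straightforward; the sketch below assumes the covariance form $\boldsymbol{\Sigma} = \sigma^2 \mathrm{diag}(\{\|a_i\|_2^2\})$ with an illustrative noise scale, and checks that ordinary least squares remains a consistent (if suboptimal) estimator:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, sigma2 = 500, 4, 0.5
# Rows with heterogeneous norms, so per-example noise levels genuinely differ.
A = rng.standard_normal((n, d)) * rng.uniform(0.5, 3.0, size=(n, 1))

# "Row norm noise": independent per-example noise with variance
# proportional to the squared row norm (sigma2 is an assumed scale).
row_norms2 = np.sum(A * A, axis=1)
delta = rng.normal(scale=np.sqrt(sigma2 * row_norms2))

x_star = rng.standard_normal(d)
b = A @ x_star + delta

x_hat = np.linalg.lstsq(A, b, rcond=None)[0]
assert np.linalg.norm(x_hat - x_star) < 1.0  # noisy, but close to the truth
```

Large-norm rows carry proportionally more noise here, which is exactly why downweighting them (as the reweighted regression does) can improve the surrogate risk.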
It is immediate that if we knew rows with an improved conditioning measure, solving the restricted instance would yield improved (surrogate) risk guarantees. We now consider the statistical guarantees of reweighted regression, via a helper calculation.
Lemma 5.
Corollary 1.
5.2 Homoskedastic statistical guarantees
We consider the homoskedastic setting where $\boldsymbol{\Sigma} = \sigma^2 \mathbf{I}$ in (2), which yields the bound (cf. Lemma 8)
$$\mathbb{E} \|\hat{x} - x^\star\|_2^2 \le \sigma^2 \, \mathrm{Tr}\left((\mathbf{A}^\top \mathbf{A})^{-1}\right). \quad (10)$$
In this setting, the notion of adversarial semi-random noise is at odds in the computational and statistical senses; by giving additional rows, the bound (10) can only improve, since $\mathbf{A}^\top \mathbf{A}$ is monotonically increasing (in the Loewner order) as rows are added to the data matrix. To address this, we show that anywhere along the "computational-statistical tradeoff curve", we can efficiently learn weights matching the performance of the best planted instance.

Lemma 6.
6 Experiments
We perform preliminary experiments to validate the utility of our method and its guarantees; all were done on a 2015 laptop using an Intel Core i5 CPU with 2.9 GHz clock frequency and 8 GB RAM. All code for these experiments, as well as a README containing instructions, can be found at the following web address: https://github.com/kjtian/semirandomregressionexperiments.
Problem instance. We consider the following "hard planted instance", parameterized by nonnegative integer block sizes and a scaling factor. The matrix $\mathbf{A}$ is composed of matrices $\mathbf{A}_1$, $\mathbf{A}_2$, $\mathbf{A}_3$, vertically concatenated, with the rows randomly shuffled.

- $\mathbf{A}_1$ is a set of random orthogonal basis vectors, scaled by the scaling factor.
- $\mathbf{A}_2$ is a set of random unit vectors, all of which are orthogonalized against $\mathbf{A}_1$.
- $\mathbf{A}_3$ is a set of random unit vectors, scaled by the scaling factor.

Note that this is a hard planted instance, since the matrix consisting of a downweighted $\mathbf{A}_1$ and all of $\mathbf{A}_2$ will be very well-conditioned, but it is difficult to identify the rows of $\mathbf{A}_1$ due to the presence of $\mathbf{A}_3$.
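A generator for this instance can be sketched as follows. The exact scalings in the paper's construction are elided in the text, so which blocks receive `scale` here, and the block sizes, are assumptions for illustration:

```python
import numpy as np

def planted_instance(n1, n2, n3, d, scale, seed=0):
    """Sketch of the 'hard planted instance': A1 = scaled random orthogonal
    directions, A2 = random unit vectors orthogonalized against span(A1),
    A3 = scaled random unit vectors; the rows are then shuffled."""
    rng = np.random.default_rng(seed)

    def unit_rows(k):
        X = rng.standard_normal((k, d))
        return X / np.linalg.norm(X, axis=1, keepdims=True)

    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthonormal basis
    A1 = scale * Q[:n1]                                # scaled orthogonal rows

    A2 = unit_rows(n2)
    A2 = A2 - A2 @ Q[:n1].T @ Q[:n1]                   # project out span(A1)
    A2 = A2 / np.linalg.norm(A2, axis=1, keepdims=True)

    A3 = scale * unit_rows(n3)                         # scaled random unit rows

    A = np.vstack([A1, A2, A3])
    return A[rng.permutation(len(A))]                  # shuffle the rows

A = planted_instance(n1=3, n2=20, n3=5, d=8, scale=100.0)
assert A.shape == (28, 8)
```

The heavy $\mathbf{A}_3$ rows mimic the heavy $\mathbf{A}_1$ rows in scale, which is what makes separating the planted well-conditioned subset nontrivial for a naïve defense.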
Parameters. We used Algorithm 3 for preprocessing in each case. We found that the method was quite robust to aggressive parameter choices. We used a fixed-degree rational approximation, a fixed number of Johnson-Lindenstrauss sketch directions (see Algorithm 3), a fixed step size, and a fixed iteration count; cf. Appendix G for further discussion.
Robustifying SAG. The scikit-learn linear_model implementation of SAG [SRB17] in their ridge regression solver was very brittle to instances as described above, and did not converge for Problem 1. Algorithm 3 was able to learn a reweighted matrix for which SAG rapidly converged to the correct solution, while only calling SAG to implement our rational approximation within the algorithm. The end-to-end runtime of our procedure, using SAG for every linear system solve, was on the order of seconds. We defer additional details regarding this experiment to Appendix G.

Improving mean-squared error for heteroskedastic noise. We considered the "row norm noise" setting of Section 5.1, for the instance above with varying noise levels at even intervals. The results are presented in Figure 1. Our preprocessing stabilized and decreased mean-squared errors, with stronger performance as the number of "clean" rows in our model increased. This is in line with what our theory predicts, as more clean rows improve concentration of the planted instance (resembling the identity), which is the setting where our surrogate risk bounds are roughly tight; conversely, bounds for the total instance were too pessimistic, but our method consistently obtained better risk. For both solving the (unweighted) regression problem and for each linear system solve in our procedure, we used the default "lsqr" ridge regression solver in scikit-learn.
For the parameter settings used in these experiments, our end-to-end runtime using the "lsqr" solver in subroutines was slower than directly using the "lsqr" solver by a somewhat significant factor. However, for linear regression in much larger dimensions, our runtime was more competitive with "lsqr". Our code was optimized only to a point, and the scale of problems we considered was limited; we leave for future investigation whether high-performance implementations can obtain more competitive runtimes in practice for broader parameter ranges.
Figure 1: Mean-squared errors, averaged over 15 runs. Vertical error bars represent 1 standard deviation. Unweighted regression (vanilla "lsqr") errors are in blue; reweighted regression (after preprocessing by our algorithm) errors are in orange. Our algorithm achieves better error and is more consistent; as the number of clean rows increases, the performance gap also widens.

Harder planted instance and further experiments. In Appendix G, we present a variety of additional experimental findings. Specifically, we consider a slightly more involved "hard problem instance" than the model presented in this section, which obtains a dramatically larger conditioning value under the initial reweighting of our Algorithm 3, scaling polynomially in the problem size. This allows for very poorly-behaved instances under "naïve defenses" when the number of appended rows is comparable to the planted set.
We provide empirical results regarding the stability of the decaying conditioning values across iterations of our preprocessing procedure for this harder instance. We also replicate Figure 1 for this instance under the heteroskedastic noise model, and show how the mean-squared error decays as the dimension increases. In all cases, our preprocessing both decreases and stabilizes the risk under this noise model, in line with our theoretical predictions.
Acknowledgments
We thank Arun Jambulapati for helpful discussions. This research was supported in part by NSF CAREER Award CCF-1844855 and a PayPal research gift award.
References
 [ACH03] (2003) Database-friendly random projections: Johnson-Lindenstrauss with binary coins. J. Comput. Syst. Sci. 66 (4), pp. 671–687. Cited by: Fact 1.
 [AKK+20] (2020) Leverage score sampling for faster accelerated regression and ERM. In International Conference on Algorithmic Learning Theory 2020, Cited by: Appendix B, Proposition 2.
 [ALO16a] (2016) Using optimization to obtain a widthindependent, parallel, simpler, and faster positive SDP solver. In Proceedings of the TwentySeventh Annual ACMSIAM Symposium on Discrete Algorithms, SODA 2016, Arlington, VA, USA, January 1012, 2016, pp. 1824–1831. Cited by: §E.1, §3.

 [AQR+16b] (2016) Even faster accelerated coordinate descent using non-uniform sampling. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pp. 1110–1119. Cited by: Appendix B.
 [ALL17] (2017) Katyusha: the first direct acceleration of stochastic gradient methods. J. Mach. Learn. Res. 18, pp. 221:1–221:51. Cited by: §1.1.
 [ATK88] (1988) An introduction to numerical analysis. John Wiley and Sons. Cited by: §1.
 [BBN13] (2013) A randomized mirror-prox method for solving structured large-scale matrix saddle-point problems. SIAM J. Optimization 23 (2), pp. 934–962. Cited by: §E.2.
 [BS95] (1995) Coloring random and semi-random k-colorable graphs. J. Algorithms 19 (2), pp. 204–234. Cited by: §1.2.
 [CDS+19] (2019) A rank-1 sketch for matrix multiplicative weights. In Conference on Learning Theory, COLT 2019, 25-28 June 2019, Phoenix, AZ, USA, pp. 589–623. Cited by: §E.2.
 [CDG+19a] (2019) Faster algorithms for high-dimensional robust covariance estimation. In Conference on Learning Theory, COLT 2019, 25-28 June 2019, Phoenix, AZ, USA, pp. 727–757. Cited by: §1.2.
 [CDG19b] (2019) High-dimensional robust mean estimation in nearly-linear time. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2019, San Diego, California, USA, January 6-9, 2019, pp. 2755–2771. Cited by: §1.2, §2.
 [CG18] (2018) Non-convex matrix completion against a semi-random adversary. In Conference On Learning Theory, COLT 2018, Stockholm, Sweden, 6-9 July 2018, pp. 1362–1394. Cited by: §1.2, §1.2, §2.

 [CW13] (2013) Low rank approximation and regression in input sparsity time. In Symposium on Theory of Computing Conference, STOC'13, Palo Alto, CA, USA, June 1-4, 2013, pp. 81–90. Cited by: footnote 1.
 [CLM+15] (2015) Uniform sampling for matrix approximation. In Proceedings of the 2015 Conference on Innovations in Theoretical Computer Science, ITCS 2015, Rehovot, Israel, January 11-13, 2015, pp. 181–190. Cited by: footnote 1.
 [DBL14] (2014) SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pp. 1646–1654. Cited by: Appendix B.

 [DHL19] (2019) Quantum entropy scoring for fast robust mean estimation and improved outlier detection. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pp. 6065–6075. Cited by: §1.2.
 [FK01] (2001) Heuristics for semi-random graph problems. J. Comput. Syst. Sci. 63 (4), pp. 639–671. Cited by: §1.2.
 [FK00] (2000) Finding and certifying a large hidden clique in a semi-random graph. Random Struct. Algorithms 16 (2), pp. 195–208. Cited by: §1.2.
 [FGK+15] (2015) Un-regularizing: approximate proximal point and faster stochastic algorithms for empirical risk minimization. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pp. 2540–2548. Cited by: Appendix B.

 [GAL14] (2014) Powers of tensors and fast matrix multiplication. In International Symposium on Symbolic and Algebraic Computation, ISSAC '14, Kobe, Japan, July 23-25, 2014, pp. 296–303. Cited by: footnote 1.
 [GRE90] (1990) Econometric analysis. Prentice Hall. Cited by: §5.
 [JLL+20] (2020) Positive semidefinite programming: mixed, parallel, and width-independent. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, STOC 2020. Cited by: Appendix A, §E.1, §E.1, §2, Proposition 3.
 [JER92] (1992) Large cliques elude the metropolis process. Random Struct. Algorithms 3 (4), pp. 347–360. Cited by: §1.2.

 [JZ13] (2013) Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, pp. 315–323. Cited by: Proposition 2.
 [LS13] (2013) Efficient accelerated coordinate descent methods and faster algorithms for solving linear systems. In 54th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2013, 26-29 October, 2013, Berkeley, CA, USA, pp. 147–156. Cited by: Proposition 4.
 [LMP13] (2013) Iterative row sampling. In 54th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2013, 26-29 October, 2013, Berkeley, CA, USA, pp. 127–136. Cited by: footnote 1.
 [LMH15] (2015) A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pp. 3384–3392. Cited by: Appendix B.
 [MMV12] (2012) Approximation algorithms for semi-random partitioning problems. In Proceedings of the 44th Symposium on Theory of Computing Conference, STOC 2012, New York, NY, USA, May 19-22, 2012, pp. 367–384. Cited by: §1.2.
 [MPW16] (2016) How robust are reconstruction thresholds for community detection?. In Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2016, Cambridge, MA, USA, June 18-21, 2016, pp. 828–841. Cited by: §1.2.
 [MNS+18] (2018) Spectrum approximation beyond fast matrix multiplication: algorithms and hardness. In 9th Innovations in Theoretical Computer Science Conference, ITCS 2018, January 11-14, 2018, Cambridge, MA, USA, pp. 8:1–8:21. Cited by: §1.2.
 [NN13] (2013) OSNAP: faster numerical linear algebra algorithms via sparser subspace embeddings. In 54th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2013, 26-29 October, 2013, Berkeley, CA, USA, pp. 117–126. Cited by: footnote 1.
 [NES83] (1983) A method for solving a convex programming problem with convergence rate $O(1/k^2)$. Doklady AN SSSR 269, pp. 543–547. Cited by: Appendix B, Proposition 1.
 [PTZ16] (2016) Faster and simpler widthindependent parallel algorithms for positive semidefinite programming. CoRR abs/1201.5135v3. Cited by: §E.1, §3.
 [SV14] (2014) Faster algorithms via approximation theory. Foundations and Trends in Theoretical Computer Science 9 (2), pp. 125–210. Cited by: Fact 2, Fact 3.
 [SRB17] (2017) Minimizing finite sums with the stochastic average gradient. Math. Program. 162 (1-2), pp. 83–112. Cited by: Appendix B, §6.
 [SZ16] (2016) Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Math. Program. 155 (1-2), pp. 105–145. Cited by: Appendix B.

 [SV06] (2006) A randomized solver for linear systems with exponential convergence. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, 9th International Workshop on Approximation Algorithms for Combinatorial Optimization Problems, APPROX 2006 and 10th International Workshop on Randomization and Computation, RANDOM 2006, Barcelona, Spain, August 28-30 2006, Proceedings, pp. 499–507. Cited by: Proposition 4.
 [WIL12] (2012) Multiplying matrices faster than Coppersmith-Winograd. In Proceedings of the 44th Symposium on Theory of Computing Conference, STOC 2012, New York, NY, USA, May 19-22, 2012, pp. 887–898. Cited by: footnote 1.