1 Introduction
Matrix algorithms lie at the heart of many applications, both historically in areas such as signal processing and scientific computing, as well as more recently in areas such as machine learning and data analysis. Essentially, the reason is that matrices provide a convenient mathematical structure with which to model data arising in a broad range of applications: an $m \times n$ real-valued matrix $A$ provides a natural structure for encoding information about $m$ objects, each of which is described by $n$ features. Alternatively, an $n \times n$ real-valued matrix $A$ can be used to describe the correlations between all pairs of $n$ data points, or the weighted edge-edge adjacency matrix structure of an $n$-node graph. In astronomy, for example, very small angular regions of the sky imaged at a range of electromagnetic frequency bands can be represented as a matrix: in that case, an object is a region and the features are the elements of the frequency bands. Similarly, in genetics, DNA SNP (Single Nucleotide Polymorphism) or DNA microarray expression data can be represented in such a framework, with $A_{ij}$ representing the expression level of the $i$th gene or SNP in the $j$th experimental condition or individual. Similarly, term-document matrices can be constructed in many Internet applications, with $A_{ij}$ indicating the frequency of the $j$th term in the $i$th document.
Most traditional algorithms for matrix problems are designed to run on a single machine, focusing on minimizing the number of floating-point operations (FLOPs). On the other hand, motivated by the ability to generate very large quantities of data in relatively automated ways, analyzing data sets of billions of records or more has now become a regular task in many companies and institutions. In a distributed computational environment, which is typical in these applications, communication costs, e.g., between different machines, are often much more important than computational costs. What is more, if the data cannot fit into memory on a single machine, then one must scan the records from secondary storage, e.g., hard disk, which makes each pass through the data incur enormous I/O costs. Given that, in many of these large-scale applications, regression, low-rank approximation, and related matrix problems are ubiquitous, the fast computation of their solutions on large-scale data platforms is of interest.
In this paper, we will provide an overview of recent work in Randomized Numerical Linear Algebra (RandNLA) on implementing randomized matrix algorithms in large-scale parallel and distributed computational environments. RandNLA is a large area that applies randomization as an algorithmic resource to develop improved algorithms for regression, low-rank matrix approximation, and related problems [1]. To limit the presentation, here we will be most interested in very large, very rectangular linear regression problems on up to terabyte-sized data: in particular, in the $\ell_2$ regression (also known as least squares, or LS) problem and its robust alternative, the $\ell_1$ regression (also known as least absolute deviations, LAD, or least absolute errors, LAE) problem, with strongly rectangular "tall" data. Although our main focus is on $\ell_2$ and $\ell_1$ regression, much of the underlying theory holds for $\ell_p$ regression, either for particular values of $p$ or for all $p \in [1, \infty)$, and thus for simplicity we formulate many of our results in $\ell_p$. Several important conclusions will emerge from our presentation.

First, many of the basic ideas from RandNLA in RAM extend to RandNLA in parallel/distributed environments in a relatively straightforward manner, assuming that one is more concerned about communication than computation. This is important from an algorithm design perspective, as it highlights which aspects of these RandNLA algorithms are peculiar to the use of randomization and which aspects are peculiar to parallel/distributed environments.

Second, with appropriate engineering of random sampling and random projection algorithms, it is possible to compute good approximate solutions, whether to low precision (e.g., one or two digits of precision), medium precision (e.g., three or four digits of precision), or high precision (e.g., up to machine precision), to several common matrix problems in only a few passes over the original matrix on up to terabyte-sized data. While low precision is certainly appropriate for many data analysis and machine learning applications involving noisy input data, the appropriate level of precision is a choice for the user of an algorithm to make; and there are obvious advantages to having the developer of an algorithm provide control to the user on the quality of the answer returned by the algorithm.

Third, the design principles for developing high-quality RandNLA matrix algorithms depend strongly on whether one is interested in low, medium, or high precision. (An example of this is whether to solve the randomized subproblem with a traditional method or to use the randomized subproblem to create a preconditioned version of the original problem.) Understanding these principles, the connections between them, and how they relate to traditional principles of NLA algorithm design is important for providing high-quality implementations of recent theoretical developments in the RandNLA literature.
Table 1: Examples of strongly rectangular datasets.

name            larger dimension               smaller dimension
SNP             number of SNPs                 number of subjects
TinyImages      number of images               number of pixels in each image
PDE             number of degrees of freedom   number of time steps
sensor network  size of sensing data           number of sensors
NLP             number of words and n-grams    number of documents
tick data       number of ticks                number of stocks
Although many of the ideas we will discuss can be extended to related matrix problems such as low-rank matrix approximation, there are two main reasons for restricting attention to strongly rectangular data. The first, most obvious, reason is that strongly rectangular data arises in many fields to which machine learning and data analysis methods are routinely applied. Consider, e.g., Table 1, which lists a few examples.

In genetics, single nucleotide polymorphisms (SNPs) are important in the study of human health. There are roughly 10 million SNPs in the human genome. However, there are typically at most a few thousand subjects for a study of a certain type of disease, due to the high cost of genotyping and the limited number of target subjects.

In Internet applications, strongly rectangular datasets are common. For example, the image dataset called TinyImages [2] contains 80 million images of size 32 x 32 pixels collected from the Internet.

In spatial discretizations of high-dimensional partial differential equations (PDEs), the number of degrees of freedom grows exponentially as the dimension increases. For 3D problems, it is common for the number of degrees of freedom to reach the order of a billion, for example, with a 1000 x 1000 x 1000 discretization of a cubic domain. However, for a time-dependent problem, time remains one-dimensional. Though it depends on the spatial discretization (e.g., via the Courant-Friedrichs-Lewy condition for hyperbolic PDEs), the number of time steps is usually much smaller than the number of degrees of freedom in the spatial discretization.
In geophysical applications, especially in seismology, the number of sensors is much smaller than the number of data points each sensor collects. For example, Werner-Allen et al. [3] deployed three wireless sensors to monitor volcanic eruptions. In 54 hours, each sensor sent back on the order of a million packets.

In natural language processing (NLP), the number of documents is much smaller than the number of words and $n$-grams, which grows geometrically as $n$ increases. For example, the webspam dataset (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html) contains 350,000 documents and 254 unigrams, but 680,715 trigrams.

In high-frequency trading, the number of relevant stocks is much smaller than the number of ticks (changes to the best bid and ask). For example, the 2012 ISE Historical Options Tick Data (http://www.ise.com/hotdata) has daily files with average size greater than 100 GB uncompressed.
A second, less obvious, reason for restricting attention to strongly rectangular data is that many of the algorithmic methods that are developed for them (both the RandNLA methods we will review as well as deterministic NLA methods that have been used traditionally) have extensions to low-rank matrix approximation and to related problems on more general "fat" matrices. For example, many of the methods for SVD-based low-rank approximation and related rank-revealing QR decompositions of general matrices have strong connections to QR decomposition methods for rectangular matrices; and, similarly, many of the methods for more general linear and convex programming arise in special (e.g., $\ell_1$ regression) linear programming problems. Thus, strongly rectangular matrices are a good problem class in which to consider the development of matrix algorithms (either in general or for RandNLA algorithms) in parallel and distributed environments.
It is worth emphasizing that the phrase "parallel and distributed" can mean quite different things to different research communities, in particular to what might be termed HPC (high performance computing) or scientific computing researchers versus data analytics or database or distributed data systems researchers. There are important technical and cultural differences here, but there are also some important similarities. For example, to achieve parallelism, one can use multithreading on a shared-memory machine, or one can use message passing on a multi-node cluster. Alternatively, to process massive data on large commodity clusters, Google's MapReduce [4] describes a computational framework for distributed computation with fault tolerance. For computation not requiring any inter-node communication, one can achieve even better parallelism. We don't want to dwell on many of these important details here: this is a complicated and evolving space; and no doubt the details of the implementation of many widely used algorithms will evolve as the space evolves. To give the interested reader a quick sense of some of these issues, though, here we provide a very high-level representative description of parallel environments and how they scale. As one goes down this list, one tends to get larger and larger.
Shared memory: multithreading on a single shared-memory machine (see, e.g., http://www.sgi.com/pdfs/4358.pdf).

Message passing: message passing on a multi-node cluster, possibly with GPU accelerators contributing CUDA cores and GPU memory (see, e.g., http://www.top500.org/list/2011/11/100 and http://i.top500.org/site/50310).

MapReduce: commodity clusters with large distributed storage (see, e.g., http://www.cloudera.com/blog/2010/04/pushingthelimitsofdistributedprocessing/ and http://hortonworks.com/blog/anintroductiontohdfsfederation/).

Distributed computing: loosely coupled volunteer machines, e.g., Folding@home (http://folding.stanford.edu/).
In addition, it is also worth emphasizing that there is a great deal of related work in parallel and distributed computing, both in numerical linear algebra as well as more generally in scientific computing. For example, Valiant has provided a widely used model for parallel computation [5]; Aggarwal et al. have analyzed the communication complexity of PRAMs [6]; Lint and Agerwala have highlighted communication issues that arise in the design of parallel algorithms [7]; Heller has surveyed parallel algorithms in numerical linear algebra [8]; Toledo has provided a survey of out-of-core algorithms in numerical linear algebra [9]; Ballard et al. have focused on developing algorithms for minimizing communication in numerical linear algebra [10]; and Bertsekas and Tsitsiklis have surveyed parallel and distributed iterative algorithms [11]. We expect that some of the most interesting developments in upcoming years will involve coupling the ideas for implementing RandNLA algorithms in parallel and distributed environments that we describe in this review with these more traditional ideas for performing parallel and distributed computation.
In the next section, Section 2, we will review the basic ideas underlying RandNLA methods, as they have been developed in the special case of $\ell_2$ regression in the RAM model. Then, in Section 3, we will provide notation, some background and preliminaries on $\ell_p$ and more general regression problems, as well as traditional methods for their solution. Then, in Section 4, we will describe rounding and embedding methods that are used in a critical manner by RandNLA algorithms; and in Section 5, we will review recent empirical results on implementing these ideas to solve up to terabyte-sized $\ell_2$ and $\ell_1$ regression problems. Finally, in Section 6, we will provide a brief discussion and conclusion. An overview of the general RandNLA area has been provided [1], and we refer the interested reader to this overview. In addition, two other reviews are available to the interested reader: an overview of how RandNLA methods can be coupled with traditional NLA algorithms for low-rank matrix approximation [12]; and an overview of how data-oblivious subspace embedding methods are used in RandNLA [13].
2 RandNLA in RAM
In this section, we will highlight several core ideas that have been central to prior work in RandNLA in (theory and/or practice in) RAM that we will see are also important as design principles for extending RandNLA methods to larger-scale parallel and distributed environments. We start in Section 2.1 by describing a prototypical example of a RandNLA algorithm for the very overdetermined LS problem; then we describe in Section 2.2 two problem-specific complexity measures that are important for low-precision and high-precision solutions to matrix problems, respectively, as well as two complementary ways in which randomization can be used by RandNLA algorithms; and we conclude in Section 2.3 with a brief discussion of running time considerations.
2.1 A metaalgorithm for RandNLA
A prototypical example of the RandNLA approach is given by the following meta-algorithm for very overdetermined LS problems [14, 1, 15, 16]. In particular, the problem of interest is to solve:

(1)    $x^* = \arg\min_{x \in \mathbb{R}^n} \|Ax - b\|_2.$

The following meta-algorithm takes as input an $m \times n$ matrix $A$, where $m \gg n$, a vector $b \in \mathbb{R}^m$, and a probability distribution $\{p_i\}_{i=1}^m$, and it returns as output an approximate solution $\hat{x}$, which is an estimate of the exact answer $x^*$ of Problem (1).
Random sampling. Randomly sample $r > n$ constraints, i.e., rows of $A$ and the corresponding elements of $b$, using $\{p_i\}_{i=1}^m$ as an importance sampling distribution.

Subproblem construction. Rescale each sampled row/element by $1/\sqrt{r p_i}$ to form a weighted LS subproblem.

Solving the subproblem. Solve the weighted LS subproblem, formally given in (2) below, and then return the solution $\hat{x}$.
It is convenient to describe this meta-algorithm in terms of a random "sampling matrix" $S$, in the following manner. If we draw $r$ samples (rows or constraints or data points) with replacement, then define an $r \times m$ sampling matrix, $S$, where each of the $r$ rows of $S$ has one nonzero element indicating which row of $A$ (and element of $b$) is chosen in a given random trial. In this case, the $(i, j)$th element of $S$ equals $1/\sqrt{r p_j}$ if the $j$th data point is chosen in the $i$th random trial (meaning, in particular, that every nonzero element of $S$ equals $\sqrt{m/r}$ for sampling uniformly at random). With this notation, this meta-algorithm constructs and solves the weighted LS estimator:

(2)    $\hat{x} = \arg\min_{x \in \mathbb{R}^n} \|S A x - S b\|_2.$
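The three sampling-rescaling-solving steps above can be sketched in NumPy as follows. This is a minimal illustration, not an efficient implementation: the leverage scores (used here as the importance sampling distribution) are computed exactly via a thin QR factorization, which is as expensive as solving the problem directly, and the sample size $r = 500$ is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 10000, 10, 500

A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

# Exact leverage scores via a thin QR factorization (for exposition only).
Q, _ = np.linalg.qr(A)
lev = np.sum(Q**2, axis=1)
p = lev / lev.sum()                 # importance sampling distribution

# Step 1 (random sampling): draw r rows with replacement according to p.
idx = rng.choice(m, size=r, p=p)
# Step 2 (subproblem construction): rescale by 1/sqrt(r * p_i).
w = 1.0 / np.sqrt(r * p[idx])
SA, Sb = w[:, None] * A[idx], w * b[idx]
# Step 3 (solve the subproblem): the weighted LS estimator of Eq. (2).
x_hat = np.linalg.lstsq(SA, Sb, rcond=None)[0]

# For comparison: the exact solution of the full problem.
x_star = np.linalg.lstsq(A, b, rcond=None)[0]
```

With this many samples the subsampled solution typically has a residual within a few percent of the optimal residual.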
Since this meta-algorithm samples constraints and not variables, the dimensionality of the vector $\hat{x}$ that solves the (still overconstrained, but smaller) weighted LS subproblem is the same as that of the vector $x^*$ that solves the original LS problem. The former may thus be taken as an approximation of the latter, where, of course, the quality of the approximation depends critically on the choice of $\{p_i\}_{i=1}^m$. Although uniform subsampling (with or without replacement) is very simple to implement, it is easy to construct examples where it will perform very poorly [14, 1, 16]. On the other hand, it has been shown that, for a parameter $\beta \in (0, 1]$ that can be tuned, if

(3)    $p_i \ge \beta \, \frac{\ell_i}{\sum_{j=1}^m \ell_j}, \quad i = 1, \ldots, m,$

where the so-called statistical leverage scores $\ell_i$ are defined in (6) below (and where $\sum_j \ell_j = n$ when $A$ has full column rank), i.e., if one draws the sample according to an importance sampling distribution that is proportional to (or approximates, up to the factor $\beta$) the leverage scores of $A$, then with constant probability (that can be easily boosted to probability $1 - \delta$, for any $\delta > 0$) the following relative-error bounds hold:

(4)    $\|A\hat{x} - b\|_2 \le (1 + \epsilon)\,\|Ax^* - b\|_2$
(5)    $\|x^* - \hat{x}\|_2 \le \sqrt{\epsilon}\,\left(\kappa(A)\sqrt{\gamma^{-2} - 1}\right)\|x^*\|_2,$

where $\kappa(A)$ is the condition number of $A$ and where $\gamma$ is a parameter defining the amount of the mass of $b$ inside the column space of $A$ [14, 1, 15].
Due to the crucial role of the statistical leverage scores in (3), this canonical RandNLA procedure has been referred to as the algorithmic leveraging approach to approximating LS approximation [16]. In addition, although this meta-algorithm has been described here only for very overdetermined LS problems, it generalizes to other linear regression problems and low-rank matrix approximation problems on less rectangular matrices (for a matrix of dimension $m$ by $n$ with $m \ge n$, a less rectangular matrix is one with a smaller ratio $m/n$) [17, 18, 19, 20, 21].
2.2 Leveraging, conditioning, and using randomization
Leveraging and conditioning refer to two types of problemspecific complexity measures, i.e., quantities that can be computed for any problem instance that characterize how difficult that problem instance is for a particular class of algorithms. Understanding these, as well as different uses of randomization in algorithm design, is important for designing RandNLA algorithms, both in theory and/or practice in RAM as well as in larger parallel and distributed environments. For now, we describe these in the context of very overdetermined LS problems.

Statistical leverage.
(Related to eigenvectors; important for obtaining low-precision solutions.) If we let $H = A (A^T A)^{-1} A^T$, where the inverse can be replaced with the Moore-Penrose pseudoinverse if $A$ is rank deficient, be the projection matrix onto the column span of $A$, then the $i$th diagonal element of $H$,

(6)    $\ell_i = h_{ii} = a_i^T (A^T A)^{-1} a_i,$

where $a_i$ is the $i$th row of $A$, is the statistical leverage of the $i$th observation or sample. Since $H$ can alternatively be expressed as $H = Q Q^T$, where $Q$ is any orthogonal basis for the column space of $A$, e.g., the matrix $Q$ from a QR decomposition or the matrix of left singular vectors from the thin SVD, the leverage of the $i$th observation can also be expressed as

(7)    $\ell_i = h_{ii} = \|q_i\|_2^2,$

where $q_i$ is the $i$th row of $Q$. Leverage scores provide a notion of "coherence" or "outlierness," in that they measure how well-correlated the singular vectors are with the canonical basis [18, 15, 22] as well as which rows/constraints have largest "influence" on the LS fit [23, 24, 25, 26]. Computing the leverage scores exactly is generally as hard as solving the original LS problem (but approximations to them can be computed more quickly, for arbitrary input matrices [15]). Leverage scores are important from an algorithm design perspective since they define the key nonuniformity structure needed to control the complexity of high-quality random sampling algorithms. In particular, naive uniform random sampling algorithms perform poorly when the leverage scores are very nonuniform, while randomly sampling in a manner that depends on the leverage scores leads to high-quality solutions. Thus, in designing RandNLA algorithms, whether in RAM or in parallel/distributed environments, one must either quickly compute approximations to the leverage scores or quickly preprocess the input matrix so they are nearly uniformized, in which case uniform random sampling on the preprocessed matrix performs well.
Informally, the leverage scores characterize where in the high-dimensional Euclidean space the (singular value) information in $A$ is being sent, i.e., how the quadratic well (with aspect ratio implicitly defined by the matrix $A$) "sits" with respect to the canonical axes of the high-dimensional Euclidean space. If one is interested in obtaining low-precision solutions, e.g., one or two digits of precision, which can be obtained by an algorithm that provides relative-error approximations for a fixed value of $\epsilon$ but whose running time dependence on $\epsilon$ is polynomial in $1/\epsilon$, then the key quantities that must be dealt with are the statistical leverage scores of the input data.
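The agreement between the two expressions (6) and (7) for the leverage scores, and the fact that they sum to the rank of the matrix, can be checked numerically; in this small sketch, one row is rescaled by 100 purely to manufacture a high-leverage (outlying) observation.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 200, 5
A = rng.standard_normal((m, n))
A[0] *= 100.0        # row 0 becomes an outlying, high-leverage observation

# Eq. (6): diagonal of the hat matrix H = A (A^T A)^{-1} A^T.
G_inv = np.linalg.inv(A.T @ A)
lev_hat = np.einsum('ij,jk,ik->i', A, G_inv, A)

# Eq. (7): ell_i = ||q_i||_2^2 for any orthonormal basis Q of range(A).
Q, _ = np.linalg.qr(A)
lev_qr = np.sum(Q**2, axis=1)
```

Both computations agree, the scores sum to $n = \mathrm{rank}(A)$, and the rescaled row carries by far the largest leverage, so uniform sampling would be very likely to miss it.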
Condition number.
(Related to eigenvalues; important for obtaining high-precision solutions.) If we let $\sigma_{\max}(A)$ and $\sigma_{\min}(A)$ denote the largest and smallest nonzero singular values of $A$, respectively, then $\kappa(A) = \sigma_{\max}(A)/\sigma_{\min}(A)$ is the $\ell_2$-norm condition number of $A$, which is formally defined in Definition 3. Computing $\kappa(A)$ exactly is generally as hard as solving the original LS problem. The condition number is important from an algorithm design perspective since it defines the key nonuniformity structure needed to control the complexity of high-precision iterative algorithms, i.e., it bounds the number of iterations needed for iterative methods to converge. In particular, for ill-conditioned problems, e.g., if $\kappa(A) \gg 1$, the convergence speed of iterative methods is very slow, while if $\kappa(A) = O(1)$ then iterative algorithms converge very quickly. Informally, $\kappa(A)$ defines the aspect ratio of the quadratic well implicitly defined by $A$ in the high-dimensional Euclidean space. If one is interested in obtaining high-precision solutions, e.g., $\epsilon$ near machine precision, which can be obtained by iterating a low-precision solution to high precision with an iterative algorithm that converges as $\log(1/\epsilon)$, then the key quantity that must be dealt with is the condition number of the input data.
Monte Carlo versus Las Vegas uses of randomization. Note that the guarantee provided by the meta-algorithm, as stated above, is of the following form: the algorithm runs in no more than a specified time $T$, and with high probability it returns a solution that is a good approximation to the exact solution. Randomized algorithms that provide guarantees of this form, i.e., with running time that is deterministic, but whose output may be incorrect with a certain small probability, are known as Monte Carlo algorithms [27]. A related class of randomized algorithms, known as Las Vegas algorithms, provide a different type of guarantee: they always produce the correct answer, but the amount of time they take varies randomly [27]. In many applications of RandNLA algorithms, guarantees of this latter form are preferable.
The notions of condition number and leverage scores have been described here only for very overdetermined $\ell_2$ regression problems. However, as discussed in Section 3 below (as well as previously [17, 19]), these notions generalize to very overdetermined $\ell_p$, for $p \in [1, \infty)$, regression problems [19] as well as to low-rank approximation for less rectangular matrices, as long as one specifies a rank parameter $k$ [17]. Understanding these generalizations, as well as the associated trade-offs, will be important for developing RandNLA algorithms in parallel and distributed environments.
2.3 Running Time Considerations in RAM
As presented, the meta-algorithm of the previous subsection has a running time that depends on both the time to construct the probability distribution $\{p_i\}_{i=1}^m$ and the time to solve the subsampled problem. For uniform sampling, the former is trivial and the latter depends on the size of the subproblem. For estimators that depend on the exact or approximate (recall the flexibility provided by the tunable parameter in (3)) leverage scores, the running time is dominated by the exact or approximate computation of those scores. A naive algorithm involves using a QR decomposition or the thin SVD of $A$ to obtain the exact leverage scores. This naive implementation of the meta-algorithm takes roughly $O(mn^2)$ time and is thus no faster (in the RAM model) than solving the original LS problem exactly [14, 17]. There are two other potential problems with practical implementations of the meta-algorithm: the running time scales polynomially with $1/\epsilon$, which is prohibitive if one is interested in moderately small to very small values of $\epsilon$; and, since this is a randomized Monte Carlo algorithm, with some probability the algorithm might completely fail.
Importantly, all three of these potential problems can be solved to yield improved variants of the metaalgorithm.

Making the algorithm fast: improving the dependence on $m$ and $n$. We can make this meta-algorithm "fast" in worst-case theory in RAM [28, 29, 14, 15, 20]. In particular, this meta-algorithm runs in $o(mn^2)$ time in RAM if one does either of the following: if one performs a Hadamard-based random projection and then performs uniform sampling in the randomly rotated basis [28, 29] (which, recall, is basically what random projection algorithms do when applied to vectors in a Euclidean space [1]); or if one quickly computes approximations to the statistical leverage scores (using the algorithm of [15], the running time bottleneck of which is applying a random projection to the input data) and then uses those approximate scores as an importance sampling distribution [14, 15]. In addition, by using carefully-constructed extremely-sparse random projections, both of these approaches can be made to run in so-called "input sparsity time," i.e., in time proportional to the number of nonzeros in the input data, plus lower-order terms that depend on the lower dimension of the input matrix [20].
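The "uniformizing" effect of a Hadamard-based random projection on the leverage scores can be sketched as follows. This illustration uses SciPy's dense `hadamard` matrix rather than a fast $O(m \log m)$ transform, and a deliberately coherent input matrix whose leverage is concentrated on its first few rows.

```python
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(3)
m, n = 1024, 4                  # hadamard() requires m to be a power of 2
# A coherent matrix: nearly all leverage sits on the first n rows.
A = 1e-3 * rng.standard_normal((m, n))
A[:n, :n] += np.eye(n)

def leverage(M):
    Q, _ = np.linalg.qr(M)
    return np.sum(Q**2, axis=1)

# Randomized Hadamard preprocessing: multiply by H D, where D is a random
# diagonal sign matrix and H is the normalized Hadamard matrix.
D = rng.choice([-1.0, 1.0], size=m)
HDA = (hadamard(m) / np.sqrt(m)) @ (D[:, None] * A)

lev_before = leverage(A)
lev_after = leverage(HDA)
```

Before preprocessing, a handful of rows carry leverage close to 1; afterwards, every leverage score is close to the uniform value $n/m$, which is exactly why uniform sampling works on the rotated problem.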

Making the algorithm high-precision: improving the dependence on $\epsilon$. We can make this meta-algorithm "fast" in practice, e.g., in "high precision" numerical implementation in RAM [30, 31, 32, 33]. In particular, this meta-algorithm converges in a number of iterations depending only on $\log(1/\epsilon)$ in RAM if one uses the subsampled problem constructed by the random projection/sampling process to construct a preconditioner, using it as a preconditioner for a traditional iterative algorithm on the original full problem [30, 31, 32]. This is important since, although the worst-case theory holds for any fixed $\epsilon$, it is quite coarse in the sense that the sampling complexity depends on $\epsilon$ as $1/\epsilon$ and not $\log(1/\epsilon)$. In particular, this means that obtaining high precision by solving the subproblem alone is not practically possible. In this iterative use case, there are several trade-offs: e.g., one could construct a very high-quality preconditioner (e.g., using a number of samples that would yield a relative-error approximation if one solved the LS problem on the subproblem) and perform fewer iterations, or one could construct a lower-quality preconditioner by drawing many fewer samples and perform a few extra iterations. Here too, the input sparsity time algorithm of [20] could be used to improve the running time still further.
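A minimal sketch of this sketch-and-precondition strategy, assuming a Gaussian sketching matrix and SciPy's LSQR as the iterative solver; the sketch size $4n$ and the tolerances are illustrative, untuned choices.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, lsqr

rng = np.random.default_rng(4)
m, n = 20000, 50
# An ill-conditioned tall LS problem (singular values span 6 decades).
A = rng.standard_normal((m, n)) * np.logspace(0, -6, n)
b = rng.standard_normal(m)

# Sketch A down to 4n rows with a Gaussian projection, then factor it.
S = rng.standard_normal((4 * n, m)) / np.sqrt(4 * n)
_, R = np.linalg.qr(S @ A)

# A R^{-1} is well-conditioned with high probability, so LSQR on the
# preconditioned operator converges in a few dozen iterations.
AR = LinearOperator(
    (m, n),
    matvec=lambda v: A @ np.linalg.solve(R, v),
    rmatvec=lambda u: np.linalg.solve(R.T, A.T @ u),
)
y, istop, itn = lsqr(AR, b, atol=1e-12, btol=1e-12, iter_lim=500)[:3]
x = np.linalg.solve(R, y)          # undo the preconditioning

x_star = np.linalg.lstsq(A, b, rcond=None)[0]
```

Without the preconditioner, LSQR on this matrix (with $\kappa(A) \approx 10^6$) would need vastly more iterations to reach comparable accuracy; with it, the iteration count is independent of $\kappa(A)$.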

Dealing with the failure probability. Although fixing a failure probability is convenient for theoretical analysis, in certain applications having even a very small probability that the algorithm might return a completely meaningless answer is undesirable. In this case, one is interested in converting a Monte Carlo algorithm into a Las Vegas algorithm. Fortuitously, those application areas, e.g., scientific computing, are often more interested in moderate to high precision solutions than in low precision solutions. In these cases, using the subsampled problem to create a preconditioner for iterative algorithms on the original problem has the side effect that one changes a "fixed running time but might fail" algorithm into an "expected running time but will never fail" algorithm.
From the above, we can draw the following conclusions. The "fast in worst-case theory" variants of our meta-algorithm ([28, 29, 14, 15, 20]) represent qualitative improvements to the worst-case asymptotic running time of traditional algorithms for the LS problem going back to Gaussian elimination. The "fast in numerical implementation" variants of the meta-algorithm ([30, 31, 32]) have been shown to beat LAPACK's direct dense least-squares solver by a large margin on essentially any dense tall matrix, illustrating that the worst-case asymptotic theory holds for matrices as small as several thousand by several hundred [31].
While these results are a remarkable success for RandNLA in RAM, they leave open the question of how these RandNLA methods perform in larger-scale parallel/distributed environments, and they raise the question of whether the same RandNLA principles can be extended to other common regression problems. In the remainder of this paper, we will review recent work showing that if one wants to solve $\ell_2$ regression problems in parallel/distributed environments, and if one wants to solve $\ell_1$ regression problems in theory, in RAM, or in parallel/distributed environments, then one can use the same RandNLA meta-algorithm and design principles. Importantly, though, depending on the exact situation, one must instantiate the same algorithmic principles in different ways, e.g., one must worry much more about communication than FLOPs.
3 Preliminaries on $\ell_p$ regression problems
In this section, we will start in Section 3.1 with a brief review of notation that we will use in the remainder of the paper. Then, in Sections 3.2, 3.3, and 3.4, we will review $\ell_p$ regression problems and the notions of condition number and preconditioning for these problems. Finally, in Sections 3.5 and 3.6, we will review traditional deterministic solvers for $\ell_2$ as well as $\ell_1$ and more general $\ell_p$ regression problems.
3.1 Notation conventions
We briefly list the notation conventions we follow in this work:

We use uppercase letters to denote matrices and constants, e.g., $A$, $R$, $C$, etc.

We use lowercase letters to denote vectors and scalars, e.g., $x$, $b$, $m$, $n$, $p$, etc.

We use $\|\cdot\|_2$ to denote the $\ell_2$ norm of a vector or the spectral norm of a matrix, $\|\cdot\|_F$ to denote the Frobenius norm of a matrix, and $\|\cdot\|_p$ to denote the $\ell_p$ norm of a vector or the elementwise $\ell_p$ norm of a matrix.

We use uppercase calligraphic letters to denote point sets, e.g., $\mathcal{A}$ for the linear subspace spanned by $A$'s columns, $\mathcal{C}$ for a convex set, and $\mathcal{E}$ for an ellipsoid, except that $\mathcal{O}$ is used for big-O notation.

The "~" accent is used for sketches of matrices, e.g., $\tilde{A}$; the "*" superscript is used for indicating optimal solutions, e.g., $x^*$; and the "^" accent is used for estimates of solutions, e.g., $\hat{x}$.
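As a quick numerical check of the norm conventions above, using NumPy's corresponding norm arguments (a small sketch; the $3$-$4$-$5$ values are chosen so the answers are easy to verify by hand):

```python
import numpy as np

A = np.array([[3.0, 0.0],
              [0.0, 4.0]])

spectral = np.linalg.norm(A, 2)        # largest singular value of A
frobenius = np.linalg.norm(A, 'fro')   # sqrt of the sum of squared entries
elementwise_l1 = np.abs(A).sum()       # elementwise l1 norm of the matrix
l2_of_vector = np.linalg.norm(np.array([3.0, 4.0]))   # vector l2 norm
```

For this diagonal matrix the spectral norm is 4, the Frobenius norm is 5, and the elementwise $\ell_1$ norm is 7, so the three matrix norms genuinely differ.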
3.2 regression problems
In this work, a parameterized family of linear regression problems that is of particular interest is the $\ell_p$ regression problem.
Definition 1 ($\ell_p$ regression)
Given a matrix $A \in \mathbb{R}^{m \times n}$, a vector $b \in \mathbb{R}^m$, and $p \in [1, \infty)$, the $\ell_p$ regression problem specified by $A$, $b$, and $p$ is the following optimization problem:

(8)    $\min_{x \in \mathbb{R}^n} \|Ax - b\|_p,$

where the $\ell_p$ norm of a vector $x$ is $\|x\|_p = \left(\sum_i |x_i|^p\right)^{1/p}$ for $p \in [1, \infty)$. We call the problem strongly overdetermined if $m \gg n$, and strongly underdetermined if $m \ll n$.
Important special cases include the $\ell_2$ regression problem, also known as linear least squares (LS), and the $\ell_1$ regression problem, also known as least absolute deviations (LAD) or least absolute errors (LAE). The former is ubiquitous; and the latter is of particular interest as a robust regression technique, in that it is less sensitive to the presence of outliers than the former.
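The robustness contrast between least squares and least absolute deviations can be seen on a toy line-fitting problem with a few gross outliers. Here the LAD fit is computed via its standard linear-programming reformulation, $\min \mathbf{1}^T t$ subject to $-t \le Ax - b \le t$, using `scipy.optimize.linprog`; the data and outlier magnitudes are arbitrary illustrative choices.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(5)
m, n = 100, 2
# Points on the line y = 1 + 2 t, with small noise and 5 gross outliers.
A = np.column_stack([np.ones(m), np.linspace(0.0, 1.0, m)])
b = A @ np.array([1.0, 2.0]) + 0.01 * rng.standard_normal(m)
b[:5] += 100.0

# l2 (least squares) fit.
x_ls = np.linalg.lstsq(A, b, rcond=None)[0]

# l1 (LAD) fit via the LP:  min 1^T t  s.t.  -t <= A x - b <= t.
c = np.concatenate([np.zeros(n), np.ones(m)])
A_ub = np.block([[A, -np.eye(m)], [-A, -np.eye(m)]])
b_ub = np.concatenate([b, -b])
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * (n + m))
x_lad = res.x[:n]
```

The LAD coefficients recover the true line to within the noise level, while the LS coefficients are pulled far off by the five contaminated observations.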
For general $p \in [1, \infty)$, let $\mathcal{X}^*$ denote the set of optimal solutions to (8). Let $x^* \in \mathcal{X}^*$ be an arbitrary optimal solution, and let $f^* = \|Ax^* - b\|_p$ be the optimal objective value. We will be particularly interested in finding a relative-error approximation, in terms of the objective value, to the general $\ell_p$ regression problem (8).
Definition 2 (Relative-error approximation)
Given an error parameter $\epsilon > 0$, $\hat{x}$ is a $(1 + \epsilon)$-approximate solution to the $\ell_p$ regression problem (8) if and only if $\|A\hat{x} - b\|_p \le (1 + \epsilon) f^*$.
In order to make our theory and our algorithms for general $\ell_p$ regression simpler and more concise, we can use an equivalent formulation of (8) in our discussion:

(9)    $\min_{x \in \mathbb{R}^{n+1}} \|\bar{A} x\|_p \quad \text{subject to} \quad c^T x = -1.$

Above, the "new" design matrix $\bar{A}$ is $A$ concatenated with $b$, i.e., $\bar{A} = [A \;\; b]$, and $c$ is a vector with a $1$ at the last coordinate and zeros elsewhere, i.e., $c = e_{n+1}$, to force the last element of any feasible solution to be $-1$. We note that the same formulation is also used by [34] for solving unconstrained convex problems in relative scale. This formulation of $\ell_p$ regression, which consists of a homogeneous objective and an affine constraint, can be shown to be equivalent to the formulation of (8).
Consider, next, the special case $p = 2$. If, in the LS problem

(10)    $\min_{x \in \mathbb{R}^n} \|Ax - b\|_2,$

we let $r = \operatorname{rank}(A)$, then recall that if $r < n$ (i.e., the LS problem is underdetermined or rank-deficient), then (10) has an infinite number of minimizers. In that case, the set of all minimizers is convex and hence has a unique element having minimum length. On the other hand, if $r = n$, so that the problem has full column rank, there exists only one minimizer to (10), and hence it must have the minimum length. In either case, we denote this unique min-length solution to (10) by $x^*$, and we are interested in computing $x^*$ in this work. This was defined in Problem (1) above. In this case, we will also be interested in bounding $\|x^* - \hat{x}\|_2$, for arbitrary or worst-case input, where $\hat{x}$ was defined in Problem (2) above and is an approximation to $x^*$.
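A small numerical illustration of the min-length solution for a rank-deficient LS problem, assuming NumPy's SVD-based `pinv` and `lstsq` (both of which return the minimum-norm minimizer):

```python
import numpy as np

rng = np.random.default_rng(6)
m, n = 50, 4
# Rank-deficient A: last column duplicates the first, so rank(A) = 3 < n.
A = rng.standard_normal((m, n))
A[:, -1] = A[:, 0]
b = rng.standard_normal(m)

# The min-length solution x* is given by the pseudoinverse.
x_star = np.linalg.pinv(A) @ b
# np.linalg.lstsq also returns the min-length minimizer.
x_lstsq = np.linalg.lstsq(A, b, rcond=None)[0]

# Any other minimizer (a shift along the null space) is strictly longer.
null = np.zeros(n)
null[0], null[-1] = 1.0, -1.0       # A @ null = 0 by construction
x_other = x_star + 0.5 * null
```

The shifted vector `x_other` achieves exactly the same residual but has a strictly larger norm, which is the sense in which $x^*$ is the unique min-length solution.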
3.3 ℓp-norm condition number
An important concept in ℓ2 and more general ℓp regression problems, and in developing efficient algorithms for their solution, is the concept of condition number. For linear systems and LS problems, the ℓ2-norm condition number is already a well-established term.
Definition 3 (ℓ2-norm condition number)
Given a matrix A ∈ R^{n×d} with full column rank, let σ_2^max(A) be the largest singular value and σ_2^min(A) be the smallest singular value of A. The ℓ2-norm condition number of A is defined as κ_2(A) = σ_2^max(A)/σ_2^min(A). For simplicity, we use κ_2, σ_2^max, and σ_2^min when the underlying matrix is clear from context.
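For p = 2 this quantity is directly computable from the singular values; a minimal numpy sketch (the function name is our own, not from the text):

```python
import numpy as np

def l2_condition_number(A):
    """kappa_2(A) = sigma_max(A) / sigma_min(A), for A with full column rank."""
    s = np.linalg.svd(A, compute_uv=False)  # singular values, in descending order
    return s[0] / s[-1]

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5))
kappa = l2_condition_number(A)
```

This agrees with `np.linalg.cond(A)`, whose default is exactly the 2-norm ratio of extreme singular values.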
For the general ℓp norm and general ℓp regression problems, we state here two related notions of condition number and then a lemma that characterizes the relationship between them.
Definition 4 (ℓp-norm condition number (Clarkson et al. [19]))
Given a matrix A ∈ R^{n×d} and p ∈ [1, ∞], let
σ_p^max(A) = max_{‖x‖₂ = 1} ‖Ax‖_p and σ_p^min(A) = min_{‖x‖₂ = 1} ‖Ax‖_p.
Then, we denote by κ_p(A) the ℓp-norm condition number of A, defined to be:
κ_p(A) = σ_p^max(A)/σ_p^min(A).
For simplicity, we use κ_p, σ_p^max, and σ_p^min when the underlying matrix is clear.
Definition 5 ((α, β, p)-conditioning (Dasgupta et al. [35]))
Given a matrix A ∈ R^{n×d} and p ∈ [1, ∞], let q be the dual norm of p, i.e., 1/p + 1/q = 1. Then A is (α, β, p)-conditioned if (1) ‖A‖_p ≤ α, and (2) for all z ∈ R^d, ‖z‖_q ≤ β‖Az‖_p. Define κ̄_p(A), the (α, β, p)-condition number of A, as the minimum value of the product αβ such that A is (α, β, p)-conditioned. We use κ̄_p for simplicity if the underlying matrix is clear.
Lemma 1 (Equivalence of κ_p and κ̄_p (Clarkson et al. [19]))
Given a matrix A ∈ R^{n×d} and p ∈ [1, ∞], we always have
d^{−|1/2 − 1/p|} κ̄_p(A) ≤ κ_p(A) ≤ d^{max{1/2, 1/p}} κ̄_p(A).
That is, by Lemma 1, if n ≫ d, then the notions of condition number provided by Definition 4 and Definition 5 are equivalent, up to low-dimensional factors. These low-dimensional factors typically do not matter in theoretical formulations of the problem, but they can matter in practical implementations.
The ℓp-norm condition number of a matrix can be arbitrarily large. Given the equivalence established by Lemma 1, we say that a matrix A is well-conditioned in the ℓp norm if κ_p(A) or κ̄_p(A) is bounded by a polynomial in d, independent of the high dimension n. We will see in the following sections that the condition number plays a very important part in the analysis of traditional algorithms.
3.4 Preconditioning ℓp regression problems
Preconditioning refers to the application of a transformation, called the preconditioner, to a given problem instance such that the transformed instance is moreeasily solved by a given class of algorithms. Most commonly, the preconditioned problem is solved with an iterative algorithm, the complexity of which depends on the condition number of the preconditioned problem.
To start, consider p = 2, and recall that for a square linear system Ax = b of full rank, preconditioning usually takes one of the following forms:
left preconditioning:  M⁻¹Ax = M⁻¹b,
right preconditioning:  ANy = b, with x = Ny,
left and right preconditioning:  M⁻¹ANy = M⁻¹b, with x = Ny.
Clearly, the preconditioned system is consistent with the original one, i.e., it has the same x* as its unique solution, if the preconditioners M and N are nonsingular.
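A small numeric sanity check, not from the text, that left and right preconditioning leave the solution of a nonsingular square system unchanged; here we use a simple Jacobi-style diagonal preconditioner purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4)) + 4.0 * np.eye(4)  # diagonally dominant, nonsingular
b = rng.standard_normal(4)
x = np.linalg.solve(A, b)                          # solution of the original system

# Left preconditioning: solve (M^{-1} A) x = M^{-1} b with M = diag(A).
M = np.diag(np.diag(A))
x_left = np.linalg.solve(np.linalg.solve(M, A), np.linalg.solve(M, b))

# Right preconditioning: solve (A N) y = b with N = diag(A)^{-1}, then x = N y.
N = np.diag(1.0 / np.diag(A))
y = np.linalg.solve(A @ N, b)
x_right = N @ y
```

Both preconditioned solves recover the same x, as the text asserts for nonsingular M and N.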
For the general LS Problem (1), more care should be taken so that the preconditioned system has the same min-length solution as the original one. In particular, if we apply left preconditioning to the LS problem min_x ‖Ax − b‖₂, then the preconditioned system becomes min_x ‖M⁻¹Ax − M⁻¹b‖₂, and its min-length solution is given by
x*_left = (M⁻¹A)⁺M⁻¹b.
Similarly, the min-length solution to the right preconditioned system min_y ‖ANy − b‖₂ is given by
x*_right = N(AN)⁺b.
The following lemma states necessary and sufficient conditions for x*_left = x* or x*_right = x* to hold.
Lemma 2 (Left and right preconditioning (Meng et al. [32]))
Given , and , we have

if and only if ,

if and only if .
Given this preconditioned problem, (13) (see below) bounds the number of iterations for certain iterative algorithms for the LS problem.
Just as with p = 2, for more general ℓp regression problems with a matrix A with full column rank, although its condition numbers κ_p(A) and κ̄_p(A) can be arbitrarily large, we can often find a matrix N such that AN is well-conditioned. (For p ≠ 2, this N is not the inverse of the “R” matrix from a QR decomposition of A, but some other matrix.) In this case, the ℓp regression Problem (9) is equivalent to the following well-conditioned problem:
(11)  minimize ‖ANy‖_p
subject to cᵀNy = 1.
Clearly, if y* is an optimal solution to (11), then x* = Ny* is an optimal solution to (9), and vice versa; however, (11) may be easier to solve than (9) because of better conditioning.
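For p = 2, one concrete instance of such an N is the inverse of the “R” factor from a thin QR decomposition of A, which makes AN orthonormal. A minimal numeric sketch (dense numpy arrays, variable names our own):

```python
import numpy as np

rng = np.random.default_rng(0)
# Ill-conditioned matrix: columns scaled over eight orders of magnitude.
A = rng.standard_normal((200, 5)) * np.array([1e4, 1.0, 1.0, 1.0, 1e-4])

Q, R = np.linalg.qr(A)            # thin QR: A = Q R, R invertible for full column rank
N = np.linalg.inv(R)              # right preconditioner
kappa_before = np.linalg.cond(A)
kappa_after = np.linalg.cond(A @ N)   # A N = Q has orthonormal columns, so ~1
```

Preconditioning collapses the condition number from roughly 1e8 to essentially 1, up to roundoff.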
Since we want to reduce the condition number of a problem instance via preconditioning, it is natural to ask what the best possible outcome would be in theory. For p = 2, an orthogonal matrix, e.g., the matrix Q computed from a QR decomposition A = QR, has κ₂(Q) = 1; that is, choosing N = R⁻¹ gives κ₂(AN) = 1. More generally, for the ℓp-norm condition number κ_p, we have the following existence result.
Lemma 3
Given a matrix A ∈ R^{n×d} with full column rank and p ∈ [1, ∞], there exists a matrix N ∈ R^{d×d} such that κ_p(AN) ≤ d^{1/2}.
This is a direct consequence of John’s theorem [36] on ellipsoidal rounding of centrally symmetric convex sets. For the (α, β, p)-condition number κ̄_p, we have the following lemma.
Lemma 4
Given a matrix A ∈ R^{n×d} with full column rank and p ∈ [1, ∞], there exists a matrix N ∈ R^{d×d} such that κ̄_p(AN) is bounded by a polynomial in d, independent of n.
Note that Lemmas 3 and 4 are both existential results. Unfortunately, except for the case p = 2, no polynomial-time algorithm is known that can provide such preconditioning for general matrices. Below, in Section 4, we will discuss two practical approaches for ℓp-norm preconditioning, via ellipsoidal rounding and via subspace embedding, as well as subspace-preserving sampling algorithms built on top of them.
3.5 Traditional solvers for ℓ2 regression
Least squares is a classic problem in linear algebra. It has a long history, tracing back to Gauss, and it arises in numerous applications. A detailed survey of numerical algorithms for least squares is certainly beyond the scope of this work. In this section, we briefly describe some wellknown direct methods and iterative methods that compute the minlength solution to a possibly rankdeficient least squares problem, and we refer readers to Björck [37] for additional details.
Direct methods
It is well known that the min-length solution of a least squares problem can be computed using the singular value decomposition (SVD). Let
A = UΣVᵀ
be the compact SVD, where U ∈ R^{n×r}, Σ ∈ R^{r×r}, and V ∈ R^{d×r}, i.e., only the singular vectors corresponding to the nonzero singular values are calculated. We have x* = VΣ⁻¹Uᵀb. The matrix VΣ⁻¹Uᵀ is the Moore–Penrose pseudoinverse of A, denoted by A⁺, which is defined and unique for any matrix. Hence we can simply write x* = A⁺b. The SVD approach is accurate and robust to rank-deficiency.
Another way to solve a least squares problem is using complete orthogonal factorization. If we can find orthonormal matrices Q₁ and Q₂, and a (block) triangular matrix T, such that A = Q₁TQ₂ᵀ, then the min-length solution is given by x* = Q₂T⁺Q₁ᵀb. We can treat SVD as a special case of complete orthogonal factorization. In practice, complete orthogonal factorization is usually computed via rank-revealing QR factorizations, making T a triangular matrix. The QR approach is less expensive than SVD, but it is slightly less robust at determining the rank.
A third way to solve a least squares problem is by computing the min-length solution to the normal equation AᵀAx = Aᵀb, namely
(12)  x* = (AᵀA)⁺Aᵀb.
It is easy to verify the correctness of (12) by replacing A by its compact SVD UΣVᵀ. If r = min(n, d), a Cholesky factorization of either AᵀA (if n ≥ d) or AAᵀ (if n < d) solves (12). If r < min(n, d), we need the eigensystem of AᵀA or AAᵀ to compute x*. The normal equation approach is the least expensive among the three direct approaches we have mentioned, but it is also the least accurate one, especially on ill-conditioned problems. See Chapter 5 of Golub and Van Loan [38] for a detailed analysis. A closely related direct solver is the semi-normal equation method. It is often useful when the R factor of the QR decomposition is known; see [39] for more details.
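The direct approaches can be compared numerically on a rank-deficient instance. A small sketch, not from the text, using numpy, where `lstsq` and `pinv` are both SVD-based and both return the min-length solution:

```python
import numpy as np

rng = np.random.default_rng(0)
# 50 x 6 matrix of rank 4 (product of 50x4 and 4x6 factors): rank-deficient LS.
A = rng.standard_normal((50, 4)) @ rng.standard_normal((4, 6))
b = rng.standard_normal(50)

x_svd = np.linalg.pinv(A) @ b                         # x* = A^+ b via the SVD
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)       # also the min-length solution
x_normal = np.linalg.pinv(A.T @ A) @ (A.T @ b)        # normal-equation form (12)
```

All three routes produce the same min-length solution here, though the normal-equation route squares the condition number and is the least accurate in ill-conditioned cases, as noted above.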
For sparse least squares problems, by pivoting A’s columns and rows, we may find a sparse factorization of A, which is preferred to a dense factorization for more efficient storage. For sparse direct methods, we refer readers to Davis [40].
Iterative methods
Instead of direct methods, we can use iterative methods to solve (10). If all the iterates x^(k) lie in range(Aᵀ) and if x^(k) converges to a minimizer, it must be the minimizer having minimum length, i.e., the solution to Problem (1). This is the case when we use a Krylov subspace method starting with a zero vector. For example, the conjugate gradient (CG) method on the normal equation leads to the min-length solution (see Paige and Saunders [41]). In practice, CGLS [42] and LSQR [43] are preferable because they are equivalent to applying CG to the normal equation in exact arithmetic but they are numerically more stable. Other Krylov subspace methods such as LSMR [44] can solve (10) as well. The Chebyshev semi-iterative method [45] can also be modified to solve LS problems.
Importantly, however, it is in general hard to predict the number of iterations for CG-like methods. The convergence rate is affected by the condition number of AᵀA. A classical result [46, p. 187] states that
(13)  ‖x^(k) − x*‖_{AᵀA} ≤ 2 ((κ(AᵀA)^{1/2} − 1)/(κ(AᵀA)^{1/2} + 1))^k ‖x^(0) − x*‖_{AᵀA},
where ‖z‖²_{AᵀA} = zᵀAᵀAz for any z ∈ R^d, and where κ(AᵀA) is the condition number of AᵀA under the ℓ2 norm. Estimating κ(AᵀA) is generally as hard as solving the LS problem itself, and in practice the bound does not hold in any case unless reorthogonalization is used. Thus, the computational cost of CG-like methods remains unpredictable in general, except when AᵀA is very well-conditioned and the condition number can be well estimated.
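For concreteness, here is a bare-bones CGLS sketch (our own simplified implementation, not the referenced codes): CG applied implicitly to the normal equation AᵀAx = Aᵀb, started at zero so that the iterates stay in range(Aᵀ) and the limit is the min-length solution:

```python
import numpy as np

def cgls(A, b, iters=100, tol=1e-14):
    """CGLS: CG on A^T A x = A^T b without forming A^T A explicitly."""
    x = np.zeros(A.shape[1])
    r = b.copy()                 # residual b - A x, with x0 = 0
    s = A.T @ r                  # normal-equation residual A^T r
    p = s.copy()
    gamma = s @ s
    for _ in range(iters):
        q = A @ p
        alpha = gamma / (q @ q)
        x = x + alpha * p
        r = r - alpha * q
        s = A.T @ r
        gamma_new = s @ s
        if gamma_new <= tol:     # stop when ||A^T r||^2 is negligible
            break
        p = s + (gamma_new / gamma) * p
        gamma = gamma_new
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 6))
b = rng.standard_normal(100)
x_cgls = cgls(A, b)
```

On this well-conditioned random instance CGLS matches the direct LS solution after a handful of iterations, consistent with the κ-dependent bound (13).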
3.6 Traditional solvers for ℓ1 and more general ℓp regression
While ℓ2 regression can be solved with direct methods such as SVD and QR, the solution of general ℓp regression has to rely on iterative methods due to the lack of an analytical solution. In particular, ℓ1 and ℓ∞ regression problems can be formulated as linear programs and solved by linear programming solvers, and general ℓp regression problems can be formulated as convex programs and hence solved by general convex solvers. This, however, comes at the cost of increased complexity, compared to the ℓ2 case. For example, it is easy to see that all ℓp regression problems are convex due to the convexity of vector norms. Therefore, standard convex solvers, e.g., gradient-based methods [47], interior-point methods (IPMs) [48], and interior-point cutting-plane methods (IPCPMs) [49], can be used to solve ℓp regression problems. Discussing those convex solvers is beyond the scope of this work. We refer readers to the monographs mentioned above or Boyd and Vandenberghe [50] for a general introduction.
When p = 1 or p = ∞, the problem is still convex but not smooth. Subgradient methods [51] or gradient methods with smoothing [52] can be used to handle non-smoothness, while another solution is via linear programming. In particular, an ℓ1 regression problem specified by A and b is equivalent to the following linear program:
minimize  1ᵀt
subject to  −t ≤ Ax − b ≤ t,
and an ℓ∞ regression problem specified by A and b is equivalent to the following:
minimize  t
subject to  −t·1 ≤ Ax − b ≤ t·1,
where 1 indicates a vector of length n with all ones. As a linear programming problem, an ℓ1 or ℓ∞ regression problem can be solved by any linear programming solver, using the simplex method [53] or IPMs. Similar to the case for least squares, the condition number affects the performance of ℓp regression solvers, e.g., the convergence rate for subgradient [51] or gradient methods [54], the search for an initial feasible point for IPMs [55], and the initial search region for ellipsoid methods and IPCPMs [49]. Generally speaking, a smaller condition number makes the problem easier to solve.
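To make the ℓ1 reduction concrete, the following sketch (variable names our own) assembles the LP in standard inequality form min cᵀz s.t. A_ub·z ≤ b_ub, with stacked variables z = [x; t] and the constraints −t ≤ Ax − b ≤ t split into two one-sided blocks:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 3
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

# minimize 1^T t  subject to  A x - t <= b  and  -A x - t <= -b.
c = np.concatenate([np.zeros(d), np.ones(n)])          # objective: sum of t
A_ub = np.block([[A, -np.eye(n)], [-A, -np.eye(n)]])   # both one-sided blocks
b_ub = np.concatenate([b, -b])

# Any (x, t) with t >= |Ax - b| is feasible; e.g., x = 0, t = |b|:
z0 = np.concatenate([np.zeros(d), np.abs(b)])
```

The triple (c, A_ub, b_ub) can then be handed to any off-the-shelf LP solver (e.g., a simplex or interior-point code).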
Another popular way to solve ℓp regression problems is via iteratively reweighted least squares (IRLS) [56], which solves a sequence of weighted least squares problems whose solutions converge to an optimal solution of the original ℓp regression problem. At step k, it solves the following weighted least squares problem:
x^(k+1) = argmin_x ‖(W^(k))^{1/2}(Ax − b)‖₂,
where W^(k) is a diagonal matrix with positive diagonal entries w_i^(k), i = 1, …, n. Let W^(0) be an identity matrix and choose
w_i^(k) = |a_iᵀx^(k) − b_i|^{p−2}, i = 1, …, n,
until x^(k) converges. The choice of w_i^(k) is often smoothed to avoid dividing by zero in practice. It is not hard to show that if x^(k) converges, it converges to an optimal solution of the ℓp regression problem. However, the convergence theory of IRLS exists only under certain assumptions, and the convergence rate is much harder to derive. See Burrus [57] for a survey of related work.
4 Rounding, embedding, and sampling ℓp regression problems
Preconditioning, ellipsoidal rounding, and low-distortion subspace embedding are three core technical tools underlying RandNLA regression algorithms. In this section, we will describe in detail how these methods are used for ℓp regression problems, with an emphasis on trade-offs that arise when applying these methods in parallel and distributed environments. Recall that, for any matrix A with full column rank, Lemmas 3 and 4 above show that there always exists a preconditioner matrix N such that AN is well-conditioned, for ℓp regression, for general p ∈ [1, ∞]. For p = 2, such a matrix can be computed in O(nd²) time from the “R” matrix of a QR decomposition, although it is of interest to compute other such preconditioner matrices that are nearly as good more quickly; and for p = 1 and other values of p, it is of interest to compute a preconditioner matrix in time that is linear in n and low-degree polynomial in d. In this section, we will discuss these and related issues.
In particular, in Sections 4.1 and 4.2, we discuss practical algorithms to find such matrices, and we describe the trade-offs between speed (e.g., FLOPS, number of passes, additional space/time, etc.) and conditioning quality. The algorithms fall into two general families: ellipsoidal rounding (Section 4.1) and subspace embedding (Section 4.2). We present them roughly in the order of speed (in the RAM model), from slower ones to faster ones. We will discuss practical trade-offs in Section 5. For simplicity, here we assume n ≫ d; and if A is sparse, we assume that nnz(A) ≥ n. The degree of the polynomial in d depends on the underlying algorithm.
Before diving into the details, it is worth mentioning a few high-level considerations about subspace embedding methods. (Similar considerations apply to ellipsoidal rounding methods.) Subspace embedding algorithms involve mapping a set of points, e.g., all the points in the column space of an n×d matrix A with n ≫ d, to a lower-dimensional space in such a way that some property of the points, e.g., geometric properties of the point set, is approximately preserved; see Definition 7 for the definition of a low-distortion subspace embedding matrix. As such, they are critical building blocks for developing improved random sampling and random projection algorithms for common linear algebra problems more generally, and they are one of the main technical tools for RandNLA algorithms. There are several properties of subspace embedding algorithms that are important in order to optimize their performance in theory and/or in practice. For example, given a subspace embedding algorithm, we may want to know:

whether it is dataoblivious (i.e., independent of the input subspace) or dataaware (i.e., dependent on some property of the input matrix or input space),

the time and storage it needs to construct an embedding,

the time and storage to apply the embedding to an input matrix,

the failure rate, if the construction of the embedding is randomized,

the dimension of the embedding, i.e., the number of dimensions being sampled by sampling algorithms or being projected onto by projection algorithms,

the distortion of the embedding, and

how to balance the tradeoffs among those properties.
Some of these considerations may not be important for typical theoretical analysis but still affect the practical performance of implementations of these algorithms.
After the discussion of rounding and embedding methods, we will then show in Section 4.3 that ellipsoidal rounding and subspace embedding methods (which show that the norms of an entire subspace of vectors can be well preserved) can be used in one of two complementary ways: one can solve an ℓp regression problem on the rounded/embedded subproblem, or one can use the rounding/embedding to construct a preconditioner for the original problem. (We loosely refer to these two complementary types of approaches as low-precision methods and high-precision methods, respectively. The reason is that the running-time complexity with respect to the error parameter ε is poly(1/ε) for the former, while it is log(1/ε) for the latter.) We also discuss various ways to combine these two types of approaches to improve their performance in practice.
Since we will introduce several important and distinct but closelyrelated concepts in this long section, in Figure 1 we provide an overview of these relations as well as of the structure of this section.
4.1 Ellipsoidal rounding and fast ellipsoid rounding
In this subsection, we will describe ellipsoidal rounding methods. In particular, we are interested in the ellipsoidal rounding of a centrally symmetric convex set and its application to ℓp-norm preconditioning. We start with a definition.
Definition 6 (Ellipsoidal rounding)
Let C ⊆ R^d be a convex set that is full-dimensional, closed, bounded, and centrally symmetric with respect to the origin. An ellipsoid E is a κ-rounding of C if it satisfies E/κ ⊆ C ⊆ E, for some κ ≥ 1, where E/κ means shrinking E by a factor of 1/κ.
Finding an ellipsoidal rounding with a small κ factor for a given convex set has many applications, such as in computational geometry [58], convex optimization [59], and computer graphics [60]. In addition, the ℓp-norm condition number naturally connects to ellipsoidal rounding. To see this, let C = {x ∈ R^d : ‖Ax‖_p ≤ 1} and assume that we have a κ-rounding of C: E = {x : ‖Rx‖₂ ≤ 1}. This implies
‖Rx‖₂ ≤ ‖Ax‖_p ≤ κ‖Rx‖₂, for all x ∈ R^d.
If we let y = Rx, then we get
‖y‖₂ ≤ ‖AR⁻¹y‖_p ≤ κ‖y‖₂, for all y ∈ R^d.
Therefore, we have κ_p(AR⁻¹) ≤ κ. So a κ-rounding of C leads to a κ-preconditioning of A.
Recall the well-known result due to John [36] that for a centrally symmetric convex set there exists a d^{1/2}-rounding. It is known that this result is sharp and that such a rounding is given by the Löwner–John (LJ) ellipsoid of C, i.e., the minimal-volume ellipsoid containing C. This leads to Lemma 3 above. Unfortunately, finding a d^{1/2}-rounding is a hard problem. No constant-factor approximation in polynomial time is known for general centrally symmetric convex sets, and hardness results have been shown [59].
To state algorithmic results, suppose that C is described by a separation oracle and that we are provided an ellipsoid E₀ that gives an L-rounding of C for some L ≥ 1. In this case, we can find a rounding (with a somewhat larger factor) in polynomial time, in particular, in a number of calls to the oracle that is polynomial in d and log L; see Lovász [59, Theorem 2.4.1]. (Polynomial-time algorithms with better rounding factors have been proposed for special convex sets, e.g., the convex hull of a finite point set [61] and the convex set specified by the matrix ℓp norm [54].) This algorithmic result was used by Clarkson [51] and then by Dasgupta et al. [35] for ℓp regression. Note that, in these works, only a rounding with a factor polynomial in d is actually needed, rather than the optimal d^{1/2}-rounding.
Table 3 summarizes ellipsoidal rounding methods — classical ER [51, 35], fast ER [19], and single-pass ER [62] — in terms of running time, number of passes over the data, and number of calls to the separation oracle.
Recent work has focused on constructing ellipsoidal rounding methods that are much faster than these more classical techniques but that lead to only slight degradation in preconditioning quality. See Table 3 for a summary of these results. In particular, Clarkson et al. [19] follow the same construction as in the proof of Lovász [59] but show that it is much faster (in terms of the number of calls to the oracle) to find a (slightly worse) rounding of a centrally symmetric convex set in R^d that is described by a separation oracle.
Lemma 5 (Fast ellipsoidal rounding (Clarkson et al. [19]))
Given a centrally symmetric convex set , which is centered at the origin and described by a separation oracle, and an ellipsoid centered at the origin such that for some , it takes at most calls to the oracle and additional time to find a rounding of .
By applying Lemma 5 to the convex set C = {x : ‖Ax‖_p ≤ 1}, with the separation oracle described via a subgradient of ‖Ax‖_p and the initial rounding provided by the “R” matrix from the QR decomposition of A, one immediately improves the running time of the algorithm used by Clarkson [51] and by Dasgupta et al. [35] while maintaining the conditioning quality.
Corollary 1
Given a matrix with full column rank, it takes at most time to find a matrix such that .
Unfortunately, even this improvement for computing a conditioning is not immediately applicable to very large matrices. The reason is that such matrices are usually stored in a distributed fashion on secondary storage, and each call to the oracle requires a pass through the data. We could group multiple calls together within a single pass, but this would still require many passes. Instead, Meng and Mahoney [62] present a deterministic single-pass conditioning algorithm that balances the cost-performance trade-off [62]. This algorithm essentially invokes the fast ellipsoidal rounding method (Lemma 5) on a smaller problem, which is constructed via a single pass over the original data set. Their main algorithm is stated in Algorithm 1, and the main result for Algorithm 1 is the following.
Lemma 6 (Onepass conditioning (Meng and Mahoney [62]))
Algorithm 1 is a conditioning algorithm. It needs to compute a rounding on a problem of reduced size, which requires a number of calls to the separation oracle on that smaller problem.
Remark 1
Solving the rounding problem of reduced size in Algorithm 1 requires an amount of RAM that might be too much for very large-scale problems. In such cases, one can increase the block size. A modification to the proof of Lemma 6 shows that this gives us a conditioning algorithm that only needs a modest amount of RAM and FLOPS for the rounding problem.
Remark 2
One can replace the SVD computation in Algorithm 1 by a fast randomized subspace embedding (i.e., a fast low-rank approximation algorithm as described in [12, 1] and that we describe below). This reduces the overall running time, and it is an improvement in terms of FLOPS; but it would lead to a nondeterministic result with additional variability due to the randomization (which in our experience substantially degrades the embedding/conditioning quality in practice). How to balance these trade-offs in real applications and implementations depends on the underlying application and problem details.
4.2 Lowdistortion subspace embedding and subspacepreserving embedding
In this subsection, we will describe in detail subspace embedding methods. Subspace embedding methods were first used in RandNLA by Drineas et al. in their relativeerror approximation algorithm for regression (basically, the metaalgorithm described in Section 2.1)