1 Introduction
Quantile regression is a method to estimate the quantiles of the conditional distribution of a response variable, expressed as functions of observed covariates [8], in a manner analogous to the way in which Leastsquares regression estimates the conditional mean. The Least Absolute Deviations regression (i.e., regression) is a special case of quantile regression that involves computing the median of the conditional distribution. In contrast with regression and the more popular or Leastsquares regression, quantile regression involves minimizing asymmetricallyweighted absolute residuals. Doing so, however, permits a much more accurate portrayal of the relationship between the response variable and observed covariates, and it is more appropriate in certain nonGaussian settings. For these reasons, quantile regression has found applications in many areas, e.g., survival analysis and economics [2, 10, 3]. As with regression, the quantile regression problem can be formulated as a linear programming problem, and thus simplex or interiorpoint methods can be applied [9, 15, 14]. Most of these methods are efficient only for problems of small to moderate size, and thus to solve very largescale quantile regression problems more reliably and efficiently, we need new computational techniques.
In this paper, we provide a fast algorithm to compute a relativeerror approximate solution to the overconstrained quantile regression problem. Our algorithm constructs a lowdistortion subspace embedding of the form that has been used in recent developments in randomized algorithms for matrices and largescale data problems, and our algorithm runs in time that is nearly linear in the number of nonzeros in the input data.
In more detail, recall that a quantile regression problem can be specified by a (design) matrix
, a (response) vector
, and a parameter , in which case the quantile regression problem can be solved via the optimization problem(1) 
where , for , where
(2) 
for , is the corresponding loss function. In the remainder of this paper, we will use to denote the augmented matrix , and we will consider . With this notation, the quantile regression problem of Eqn. (1) can equivalently be expressed as a constrained optimization problem with a single linear constraint,
(3) 
where and is a unit vector with the first coordinate set to be . The reasons we want to switch from Eqn. (1) to Eqn. (3) are as follows. Firstly, it is for notational simplicity in the presentation of our theorems and algorithms. Secondly, all the results about lowdistortion or subspace embedding in this paper holds for any ,
In particular, we can consider in some specific subspace of . For example, in our case, . Then, the equation above is equivalent to the following,
Therefore, using notation with in some constraint is a more general form of expression. We will focus on very overconstrained problems with size .
Our main algorithm depends on a technical result, presented as Lemma 9, which is of independent interest. Let be an input matrix, and let be a random sampling matrix constructed based on the following importance sampling probabilities,
where is the elementwise norm, and where is the th row of an wellconditioned basis for the range of (see Definition 3 and Proposition 1). Then, Lemma 9 states that, for a sampling complexity that depends on but is independent of ,
will be satisfied for every .
Although one could use, e.g., the algorithm of [6] to compute such a wellconditioned basis and then “read off” the norm of the rows of , doing so would be much slower than the time allotted by our main algorithm. As Lemma 9 enables us to leverage the fast quantile regression theory and the algorithms developed for regression, we provide two sets of additional results, most of which are built from the previous work. First, we describe three algorithms (Algorithm 1, Algorithm 2, and Algorithm 3) for computing an implicit representation of a wellconditioned basis; and second, we describe an algorithm (Algorithm 4) for approximating the norm of the rows of the wellconditioned basis from that implicit representation. For each of these algorithms, we prove qualityofapproximation bounds in quantile regression problems, and we show that they run in nearly “inputsparsity” time, i.e., in time, where is the number of nonzero elements of , plus lowerorder terms. These lowerorder terms depend on the time to solve the subproblem we construct; and they depend on the smaller dimension but not on the larger dimension . Although of less interest in theory, these lowerorder terms can be important in practice, as our empirical evaluation will demonstrate.
We should note that, of the three algorithms for computing a wellconditioned basis, the first two appear in [13] and are stated here for completeness; and the third algorithm, which is new to this paper, is not uniformly better than either of the two previous algorithms with respect to either condition number or the running time. (In particular, Algorithm 1 has slightly better running time, and Algorithm 2 has slightly better conditioning properties.) Our new conditioning algorithm is, however, only slightly worse than the better of the two previous algorithms with respect to each of those two measures. Because of the tradeoffs involved in implementing quantile regression algorithms in practical settings, our empirical results show that by using a conditioning algorithm that is only slightly worse than the best previous conditioning algorithms for each of these two criteria, our new conditioning algorithm can lead to better results than either of the previous algorithms that was superior by only one of those criteria.
Given these results, our main algorithm for quantile regression is presented as Algorithm 5. Our main theorem for this algorithm, Theorem 1, states that, with constant probability, this algorithm returns a approximate solution to the quantile regression problem; and that this solution can be obtained in time, plus the time for solving the subproblem (whose size is , where , independent of , when ).
We also provide a detailed empirical evaluation of our main algorithm for quantile regression, including characterizing the quality of the solution as well as the running time, as a function of the high dimension , the lower dimension , the sampling complexity , and the quantile parameter . Among other things, our empirical evaluation demonstrates that the output of our algorithm is highly accurate, in terms of not only objective function value, but also the actual solution quality (by the latter, we mean a norm of the difference between the exact solution to the full problem and the solution to the subproblem constructed by our algorithm), when compared with the exact quantile regression, as measured in three different norms. More specifically, our algorithm yields 2digit accuracy solution by sampling only, e.g., about of a problem with size .^{2}^{2}2We use this notation throughout; e.g., by , we mean that and . Our new conditioning algorithm outperforms other conditioningbased methods, and it permits much larger small dimension than previous conditioning algorithms. In addition to evaluating our algorithm on moderatelylarge data that can fit in RAM, we also show that our algorithm can be implemented in MapReducelike environments and applied to computing the solution of terabytesized quantile regression problems.
The best previous algorithm for moderately large quantile regression problems is due to [15] and [14]. Their algorithm uses an interiorpoint method on a smaller problem that has been preprocessed by randomly sampling a subset of the data. Their preprocessing step involves predicting the sign of each , where and are the th row of the input matrix and the th element of the response vector, respectively, and is an optimal solution to the original problem. When compared with our approach, they compute an optimal solution, while we compute an approximate solution; but in worstcase analysis it can be shown that with high probability our algorithm is guaranteed to work, while their algorithm do not come with such guarantees. Also, the sampling complexity of their algorithm depends on the higher dimension , while the number of samples required by our algorithm depends only on the lower dimension ; but our sampling is with respect to a carefullyconstructed nonuniform distribution, while they sample uniformly at random.
For a detailed overview of recent work on using randomized algorithms to compute approximate solutions for leastsquares regression and related problems, see the recent review [12]. Most relevant for our work is the algorithm of [6] that constructs a wellconditioned basis by ellipsoid rounding and a subspacepreserving sampling matrix in order to approximate the solution of general regression problems, for , in roughly ; the algorithms of [16] and [5] that use the “slow” and “fast’ versions of Cauchy Transform to obtain a lowdistortion embedding matrix and solve the overconstrained regression problem in and time, respectively; and the algorithm of [13] that constructs lowdistortion embeddings in “input sparsity” time and uses those embeddings to construct a wellconditioned basis and approximate the solution of the overconstrained regression problem in time. In particular, we will use the two conditioning methods in [13], as well as our “improvement” of those two methods, for constructing norm wellconditioned basis matrices in nearly inputsparsity time. In this work, we also demonstrate that such wellconditioned basis in sense can be used to solve overconstrained quantile regression problem.
2 Background and Overview of Conditioning Methods
2.1 Preliminaries
We use to denote the elementwise norm for both vectors and matrices; and we use to denote the set . For any matrix , and denote the th row and the th column of , respectively; and denotes the column space of . For simplicity, we assume has full column rank; and we always assume that . All the results hold for by simply switching the positions of and .
Although , defined in Eqn. 2, is not a norm, since the loss function does not have the positive linearity, it satisfies some “good” properties, as stated in the following lemma:
Lemma 1.
Suppose that . Then, for any , the following hold:

;

;

; and

.
Proof.
It is trivial to prove every equality or inequality for in one dimension. Then by the definition of for vectors, the inequalities and equalities hold for general and . ∎
To make our subsequent presentation selfcontained, here we will provide a brief review of recent work on subspace embedding algorithms. We start with the definition of a lowdistortion embedding matrix for in terms of , see e.g., [13].
Definition 1 (Lowdistortion Subspace Embedding).
Given , is a lowdistortion embedding of if and for all ,
where and are lowdegree polynomials of .
The following stronger notion of a distortion subspacepreserving embedding will be crucial for our method. In this paper, the “measure functions” we will consider are and .
Definition 2 (distortion Subspacepreserving Embedding).
Given and a measure function of vectors , is a distortion subspacepreserving matrix of if and for all ,
Furthermore, if is a sampling matrix (one nonzero element per row in ), we call it a distortion subspacepreserving sampling matrix.
In addition, the following notion, originally introduced by [4], and stated more precisely in [6], of a basis that is wellconditioned for the norm will also be crucial for our method.
Definition 3 (conditioning and wellconditioned basis).
Given , is conditioned if and for all . Define as the minimum value of such that is conditioned. We will say that a basis of is a wellconditioned basis if is a polynomial in , independent of .
For a lowdistortion embedding matrix for , we next state a fast construction algorithm that runs in “inputsparsity” time by applying the Sparse Cauchy Transform. This was originally proposed as Theorem 2 in [13].
Lemma 2 (Fast construction of Lowdistortion Subspace Embedding Matrix, from [13]).
Given with full column rank, let , where has each column chosen independently and uniformly from the standard basis vector of , and where
is a diagonal matrix with diagonals chosen independently from Cauchy distribution. Set
with sufficiently large. Then, with a constant probability, we have(4) 
In addition, can be computed in time.
Remark. This result has very recently been improved. In [17], the authors show that one can achieve a distortion subspace embedding matrix with embedding dimension in time by replacing Cauchy variables in the above lemma with exponential variables. Our theory can also be easily improved by using this improved lemma.
Next, we state a result for the fast construction of a distortion subspacepreserving sampling matrix for , from Theorem 5.4 in [5], with , as follows.
Lemma 3 (Fast construction of Sampling Matrix, from Theorem 5.4 in [5]).
Given a matrix and a matrix such that is a wellconditioned basis for with condition number , it takes time to compute a sampling matrix with such that with a constant probability, for any ,
We also cite the following lemma for finding a matrix , such that is a wellcondition basis, which is based on ellipsoidal rounding proposed in [5].
Lemma 4 (Fast Ellipsoid Rounding, from [5]).
Given an matrix , by applying an ellipsoid rounding method, it takes at most time to find a matrix such that .
Finally, two important ingredients for proving subspace preservation are nets and tail inequalities. Suppose that is a point set and is a metric on . A subset is called a net for some if for every there is a such that . It is wellknown that the unit ball of a dimensional subspace has a net with size at most [1]
. Also, we will use the standard Bernstein inequality to prove concentration results for the sum of independent random variables.
Lemma 5 (Bernstein inequality, [1]).
Let be independent random variables with zeromean. Suppose that , for , then for any positive number , we have
2.2 Conditioning methods for regression problems
Before presenting our main results, we start here by outlining the theory for conditioning for overconstrained (and ) regression problems.
The concept of a wellconditioned basis (recall Definition 3) plays an important role in our algorithms, and thus in this subsection we will discuss several related conditioning methods. By a conditioning method, we mean an algorithm for finding, for an input matrix , a wellconditioned basis, i.e., either finding a wellconditioned matrix or finding a matrix such that is wellconditioned. There exist many approaches that have been proposed for conditioning. The two most important properties of these methods for our subsequent analysis are: (1) the condition number ; and (2) the running time to construct (or ). The importance of the running time should be obvious; but the condition number directly determines the number of rows that we need to select, and thus it has an indirect effect on running time (via the time required to solve the subproblem). See Table 1 for a summary of the basic properties of the conditioning methods that will be discussed in this subsection.
name  running time  type  

SC [16]  QR  
FC [5]  QR  
Ellipsoid rounding [4]  ER  
Fast ellipsoid rounding [5]  ER  
SPC1 [13]  QR  
SPC2 [13]  + ER_small  QR+ER  
SPC3 (proposed in this article)  + QR_small  QR+QR 
In general, there are three basic ways for finding a matrix such that is wellconditioned: those based on the QR factorization; those based on Ellipsoid Rounding; and those based on combining the two basic methods.

Via QR Factorization (QR). To obtain a wellconditioned basis, one can first construct a lowdistortion embedding matrix. By Definition 1, this means finding a , such that for any ,
(5) where and is independent of and the factors and here will be lowdegree polynomials of (and related to and of Definition 3). For example, could be the Sparse Cauchy Transform described in Lemma 2. After obtaining , by calculating a matrix such that has orthonormal columns, the matrix is a wellconditioned basis with . See Theorem 4.1 in [13] for more details. Here, the matrix
can be obtained by a QR factorization (or, alternately, the Singular Value Decomposition). As the choice of
varies, the condition number of , i.e., , and the corresponding running time will also vary, and there is in general a tradeoff among these.For simplicity, the acronyms for these types of conditioning methods will come from the name of the corresponding transformations: SC stands for Slow Cauchy Transform from [16]; FC stands for Fast Cauchy Transform from [5]; and SPC1 (see Algorithm 1) will be the first method based on the Sparse Cauchy Transform (see Lemma 2). We will call the methods derived from this scheme QRtype methods.

Via Ellipsoid Rounding (ER). Alternatively, one can compute a wellconditioned basis by applying ellipsoid rounding. This is a deterministic algorithm that computes a rounding of a centrally symmetric convex set . By rounding here we mean finding an ellipsoid , satisfying , which implies , . With a transformation of the coordinates, it is not hard to show the following,
(6) From this, it is not hard to show the following inequalities,
This directly leads to a wellconditioned matrix with . Hence, the problem boils down to finding a rounding with small in a reasonable time.
By Theorem 2.4.1 in [11], one can find a rounding in polynomial time. This result was used by [4] and [6]. As we mentioned in the previous section, Lemma 4, in [5], a new fast ellipsoid rounding algorithm was proposed. For an matrix with full rank, it takes at most time to find a matrix such that is a wellconditioned basis with . We will call the methods derived from this scheme ERtype methods.

Via Combined QR+ER Methods. Finally, one can construct a wellconditioned basis by combining QRlike and ERlike methods. For example, after we obtain such that is a wellconditioned basis, as described in Lemma 3, one can then construct a distortion subspacepreserving sampling matrix in time. We may view that the price we pay for obtaining is very low in terms of running time. Since is a sampling matrix with constant distortion factor and since the dimension of is independent of , we can apply additional operations on that smaller matrix in order to obtain a better condition number, without much additional running time, in theory at least, if , for some lowdegree .
Since the bottleneck for ellipsoid rounding is its running time, when compared to QRtype methods, one possibility is to apply ellipsoid rounding on . Since the bigger dimension of only depends on , the running time for computing via ellipsoid rounding will be acceptable if . As for the condition number, for any general subspace embedding satisfying Eqn. (5), i.e., which preserves the norm up to some factor determined by , including , if we apply ellipsoid rounding on , then the resulting may still satisfy Eqn. (6) with some . In detail, viewing as a vector in , from Eqn. (5), we have
In Eqn. (6), replace with , combining the inequalities above, we have
With appropriate scaling, one can show that a wellconditioned matrix with . Especially, when has constant distortion, say , the condition number is preserved at sampling complexity , while the running time has been reduced a lot, when compared to the vanilla ellipsoid rounding method. (See Algorithm 2 (SPC2) below for a version of this method.)
A second possibility is to view as a sampling matrix satisfying Eqn. (5) with . Then, according to our discussion of the QRtype methods, if we compute the QR factorization of , we may expect the resulting to be a wellconditioned basis with lower condition number . As for the running time, QR factorization on a smaller matrix will be inconsequential, in theory at least. (See Algorithm 3 (SPC3) below for a version of this method.)
In the remainder of this subsection, we will describe three related methods for computing a wellconditioned basis that we will use in our empirical evaluations. Recall that Table 1 provides a summary of these three methods and the other methods that we will use.
We start with the algorithm obtained when we use Sparse Cauchy Transform from [13] as the random projection in a vanilla QRtype method. We call it SPC1 since we will describe two of its variants below. Our main result for Algorithm 1 is given in Lemma 6. Since the proof is quite straightforward, we omit it here.
Lemma 6.
Given with full rank, Algorithm 1 takes time to compute a matrix such that with a constant probability, is a wellconditioned basis for with .
Next, we summarize the two Combined Methods described above in Algorithm 2 and Algorithm 3. Since they are variants of SPC1, we call them SPC2 and SPC3, respectively. Algorithm 2 originally appeared as first four steps of Algorithm 2 in [13]. Our main result for Algorithm 2 is given in Lemma 7; since the proof of this lemma is very similar to the proof of Theorem 7 in [13], we omit it here. Algorithm 3 is new to this paper. Our main result for Algorithm 3 is given in Lemma 8.
Lemma 7.
Given with full rank, Algorithm 2 takes time to compute a matrix such that with a constant probability, is a wellconditioned basis for with .
Lemma 8.
Given with full rank, Algorithm 3 takes time to compute a matrix such that with a constant probability, is a wellconditioned basis for with .
Proof.
By Lemma 2, in Step 1, is a lowdistortion embedding satisfying Eqn. (5) with , and . As a matter of fact, as we discussed in Section 2.2, the resulting in Step 2 is a wellconditioned basis with . In Step 3, by Lemma 3, the sampling complexity required for obtaining a distortion sampling matrix is . Finally, if we view as a lowdistortion embedding matrix with and , then the resulting in Step 4 will satisfy that is a wellconditioned basis with .
For the running time, it takes time for completing Step 1. In Step 2, the running time is . As Lemma 3 points out, the running time for constructing in Step 3 is . Since the large dimension of is a lowdegree polynomial of , the QR factorization of it costs time in Step 4. Overall, the running time of Algorithm 3 is . ∎
Both Algorithm 2 and Algorithm 3 have additional steps (Steps 3 & 4), when compared with Algorithm 1, and this leads to some improvements, at the cost of additional computation time. For example, in Algorithm 3 (SPC3), we obtain a wellconditioned basis with smaller when comparing to Algorithm 1 (SPC1). As for the running time, it will be still , since the additional time is for constructing sampling matrix and solving a QR factorization of a matrix whose dimensions are determined by . Note that when we summarize these results in Table 1, we explicitly list the additional running time for SPC2 and SPC3, in order to show the tradeoff between these SPCderived methods. We will evaluate the performance of all these methods on quantile regression problems in Section 4 (except for FC, since it is similar to but worse than SPC1, and ellipsoid rounding, since on the full problem it is too expensive).
Remark. For all the methods we described above, the output is not the wellconditioned matrix , but instead it is the matrix , the inverse of which transforms into .
Remark. As we can see in Table 1, with respect to conditioning quality, SPC2 has the lowest condition number
, followed by SPC3 and then SPC1, which has the worst condition number. On the other hand, with respect to running time, SPC1 is the fastest, followed by SPC3, and then SPC1, which is the slowest. (The reason for this ordering of the running time is that SPC2 and SPC3 need additional steps and ellipsoid rounding takes longer running time that doing a QR decomposition.)
3 Main Theoretical Results
In this section, we present our main theoretical results on distortion subspacepreserving embeddings and our fast randomized algorithm for quantile regression.
3.1 Main technical ingredients
In this subsection, we present the main technical ingredients underlying our main algorithm for quantile regression. We start with a result which says that if we sample sufficiently many (but still only ) rows according to an appropriatelydefined nonuniform importance sampling distribution (of the form given in Eqn. (7) below), then we obtain a distortion embedding matrix with respect to the loss function of quantile regression. Note that the form of this lemma is based on ideas from [6, 5].
Lemma 9 (Subspacepreserving Sampling Lemma).
Given , let be a wellconditioned basis for with condition number . For , define
(7) 
and let be a random diagonal matrix with with probability , and 0 otherwise. Then when and
with probability at least , for every ,
(8) 
Proof.
Since is a wellconditioned basis for the range space of , to prove Eqn. (8) it is equivalent to prove the following holds for all ,
(9) 
To prove that Eqn. (9) holds for any , firstly, we show that Eqn. (9) holds for any fixed ; and, secondly, we apply a standard net argument to show that (9) holds for every .
Assume that is conditioned with . For , let . Then since . Let be a random variable, and we have
Therefore, . Note here we only consider such that since otherwise we have
, and the corresponding term will not contribute to the variance. According to our definition,
. Consider the following,Hence,
Also,
Applying the Bernstein inequality to the zeromean random variables gives
Since , setting to and plugging all the results we derive above, we have
Let’s simplify the exponential term on the right hand side of the above expression:
Therefore, when , with probability at least ,
(10) 
where will be specified later.
We will show that, for all ,
(11) 
By the positive linearity of , it suffices to show the bound above holds for all with .
Next, let and construct a net of , denoted by , such that for any , there exists a that satisfies . By [1], the number of elements in is at most . Hence, with probability at least , Eqn. (10) holds for all .
We claim that, with suitable choice , with probability at least , will be a distortion embedding matrix of . To show this, firstly, we state a similar result for from Theorem 6 in [6] with as follows.
Lemma 10 ( Subspacepreserving Sampling Lemma).
Given , let be an conditioned basis for . For , define
and let be a random diagonal matrix with with probability , and 0 otherwise. Then when and
with probability at least , for every ,
(12) 
Note here we change the constraint and the original theorem to above. One can easily show that the result still holds with such setting. If we set and the failure probability to be at most , the construction of defined above satisfies conditions of Lemma 10 when the expected sampling complexity . Then our claim for holds. Hence we only need to make sure with suitable choice of , we have .
For any with , we have
where we take , and the expected sampling size becomes
When , we will have . Hence the claim for holds and Eqn. (11) holds for every .
Since the proof is involved with two random events with failure probability at most , by a simple union bound, Eqn. (9) holds with probability at least . Our results follows since . ∎
Remark. It is not hard to see that for any matrix satisfying Eqn. (8), the rank of is preserved.
Remark. Given such a subspacepreserving sampling matrix, it is not hard to show that, by solving the subsampled problem induced by , i.e., by solving , then one obtains a approximate solution to the
Comments
There are no comments yet.