Quantile Regression for Large-scale Applications

05/01/2013 ∙ by Jiyan Yang, et al.

Quantile regression is a method to estimate the quantiles of the conditional distribution of a response variable, and as such it permits a much more accurate portrayal of the relationship between the response variable and observed covariates than methods such as Least-squares or Least Absolute Deviations regression. It can be expressed as a linear program, and, with appropriate preprocessing, interior-point methods can be used to find a solution for moderately large problems. Dealing with very large problems, e.g., involving data up to and beyond the terabyte regime, remains a challenge. Here, we present a randomized algorithm that runs in nearly linear time in the size of the input and that, with constant probability, computes a $(1+\epsilon)$-approximate solution to an arbitrary quantile regression problem. As a key step, our algorithm computes a low-distortion subspace-preserving embedding with respect to the loss function of quantile regression. Our empirical evaluation illustrates that our algorithm is competitive with the best previous work on small to medium-sized problems, and that in addition it can be implemented in MapReduce-like environments and applied to terabyte-sized problems.


1 Introduction

Quantile regression is a method to estimate the quantiles of the conditional distribution of a response variable, expressed as functions of observed covariates [8], in a manner analogous to the way in which Least-squares regression estimates the conditional mean. The Least Absolute Deviations regression (i.e., $\ell_1$ regression) is a special case of quantile regression that involves computing the median of the conditional distribution. In contrast with $\ell_1$ regression and the more popular $\ell_2$ or Least-squares regression, quantile regression involves minimizing asymmetrically-weighted absolute residuals. Doing so, however, permits a much more accurate portrayal of the relationship between the response variable and observed covariates, and it is more appropriate in certain non-Gaussian settings. For these reasons, quantile regression has found applications in many areas, e.g., survival analysis and economics [2, 10, 3]. As with $\ell_1$ regression, the quantile regression problem can be formulated as a linear programming problem, and thus simplex or interior-point methods can be applied [9, 15, 14]. Most of these methods are efficient only for problems of small to moderate size, and thus to solve very large-scale quantile regression problems more reliably and efficiently, we need new computational techniques.

In this paper, we provide a fast algorithm to compute a relative-error approximate solution to the over-constrained quantile regression problem. Our algorithm constructs a low-distortion subspace embedding of the form that has been used in recent developments in randomized algorithms for matrices and large-scale data problems, and our algorithm runs in time that is nearly linear in the number of nonzeros in the input data.

In more detail, recall that a quantile regression problem can be specified by a (design) matrix $A \in \mathbb{R}^{n \times d}$, a (response) vector $b \in \mathbb{R}^{n}$, and a parameter $\tau \in (0,1)$, in which case the quantile regression problem can be solved via the optimization problem

$$\min_{x \in \mathbb{R}^{d}} \; \rho_{\tau}(b - Ax), \qquad (1)$$

where $\rho_{\tau}(y) = \sum_{i=1}^{n} \rho_{\tau}(y_i)$, for $y \in \mathbb{R}^{n}$, where

$$\rho_{\tau}(z) = \begin{cases} \tau z, & z \ge 0; \\ (\tau - 1)\, z, & z < 0, \end{cases} \qquad (2)$$

for $z \in \mathbb{R}$, is the corresponding loss function. In the remainder of this paper, we will use $A$ to denote the augmented matrix $[\,b \;\; -A\,]$, and we will consider $A \in \mathbb{R}^{n \times d}$. With this notation, the quantile regression problem of Eqn. (1) can equivalently be expressed as a constrained optimization problem with a single linear constraint,

$$\min_{x \in \mathcal{C}} \; \rho_{\tau}(Ax), \qquad (3)$$

where $\mathcal{C} = \{ x \in \mathbb{R}^{d} \mid c^{T} x = 1 \}$ and $c$ is a unit vector with the first coordinate set to be $1$. The reasons we want to switch from Eqn. (1) to Eqn. (3) are as follows. Firstly, it is for notational simplicity in the presentation of our theorems and algorithms. Secondly, all the results about low-distortion or $(1\pm\epsilon)$-distortion subspace embeddings in this paper hold for any $x \in \mathbb{R}^{d}$.

In particular, they hold for any $x$ in some specific subspace of $\mathbb{R}^{d}$; for example, in our case, $x \in \mathcal{C}$. Then, a statement of the form

$$(1-\epsilon)\, \rho_{\tau}(Ax) \le \rho_{\tau}(SAx) \le (1+\epsilon)\, \rho_{\tau}(Ax), \quad \forall x \in \mathbb{R}^{d},$$

is equivalent to the same statement restricted to all $x \in \mathcal{C}$. Therefore, using the notation with $x \in \mathbb{R}^{d}$ rather than with $x$ in some constraint set is a more general form of expression. We will focus on very over-constrained problems with size $n \gg d$.
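To make the loss concrete, the following is a minimal NumPy sketch (our own code and naming, not the paper's implementation) of the check loss in Eqn. (2) and the objective in Eqn. (1):

```python
import numpy as np

def rho_tau(z, tau):
    """Quantile (check) loss of Eqn. (2), applied element-wise and summed."""
    z = np.asarray(z, dtype=float)
    return float(np.sum(np.where(z >= 0, tau * z, (tau - 1.0) * z)))

def quantile_objective(A, b, x, tau):
    """Objective of Eqn. (1): rho_tau(b - A x)."""
    return rho_tau(b - A @ x, tau)

# Tiny usage example with synthetic data.
rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 5))
b = A @ rng.standard_normal(5) + rng.standard_normal(1000)
print(quantile_objective(A, b, np.zeros(5), tau=0.75))
```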

Our main algorithm depends on a technical result, presented as Lemma 9, which is of independent interest. Let $A \in \mathbb{R}^{n \times d}$ be an input matrix, and let $S$ be a random sampling matrix constructed based on the following importance sampling probabilities,

$$p_i \ge \min\left\{ 1, \; s \cdot \frac{\|U_{i\cdot}\|_1}{\|U\|_1} \right\},$$

where $\|\cdot\|_1$ is the element-wise $\ell_1$-norm, and where $U_{i\cdot}$ is the $i$-th row of an $\ell_1$ well-conditioned basis $U$ for the range of $A$ (see Definition 3 and Proposition 1). Then, Lemma 9 states that, for a sampling complexity $s$ that depends on $d$ but is independent of $n$,

$$(1-\epsilon)\, \rho_{\tau}(Ax) \le \rho_{\tau}(SAx) \le (1+\epsilon)\, \rho_{\tau}(Ax)$$

will be satisfied for every $x \in \mathbb{R}^{d}$.
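As an illustration of how such a sampling matrix can be built from the row $\ell_1$-norms of a basis $U$, here is a small sketch (our own code and naming); it returns the sampled-and-rescaled rows rather than an explicit $n \times n$ matrix:

```python
import numpy as np

def l1_sampling(A, U, s, rng=None):
    """Sample rows of A with probabilities proportional to the l1 norms of the
    rows of a (well-conditioned) basis U for range(A), and rescale each kept
    row by 1/p_i, following the importance sampling scheme above."""
    rng = np.random.default_rng() if rng is None else rng
    row_norms = np.abs(U).sum(axis=1)                 # ||U_{i,.}||_1
    p = np.minimum(1.0, s * row_norms / row_norms.sum())
    keep = rng.random(A.shape[0]) < p                 # Bernoulli sampling
    SA = A[keep] / p[keep, None]                      # rescale kept rows by 1/p_i
    return SA, keep, p

# Usage: with U a well-conditioned basis (e.g., produced by one of the SPC
# methods below), SA, keep, p = l1_sampling(A, U, s=500).
```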

Although one could use, e.g., the algorithm of [6] to compute such a well-conditioned basis and then “read off” the $\ell_1$-norms of its rows, doing so would be much slower than the time allotted by our main algorithm. As Lemma 9 enables us to leverage, for quantile regression, the theory and the algorithms developed for fast $\ell_1$ regression, we provide two sets of additional results, most of which are built from previous work. First, we describe three algorithms (Algorithm 1, Algorithm 2, and Algorithm 3) for computing an implicit representation of a well-conditioned basis; and second, we describe an algorithm (Algorithm 4) for approximating the $\ell_1$-norm of the rows of the well-conditioned basis from that implicit representation. For each of these algorithms, we prove quality-of-approximation bounds for quantile regression problems, and we show that they run in nearly “input-sparsity” time, i.e., in $O(\mathrm{nnz}(A) \cdot \log n)$ time, where $\mathrm{nnz}(A)$ is the number of nonzero elements of $A$, plus lower-order terms. These lower-order terms depend on the time to solve the subproblem we construct; they depend on the smaller dimension $d$ but not on the larger dimension $n$. Although of less interest in theory, these lower-order terms can be important in practice, as our empirical evaluation will demonstrate.

We should note that, of the three algorithms for computing a well-conditioned basis, the first two appear in [13] and are stated here for completeness; and the third algorithm, which is new to this paper, is not uniformly better than either of the two previous algorithms with respect to either the condition number or the running time. (In particular, Algorithm 1 has a slightly better running time, and Algorithm 2 has slightly better conditioning properties.) Our new conditioning algorithm is, however, only slightly worse than the better of the two previous algorithms with respect to each of those two measures. Because of the trade-offs involved in implementing quantile regression algorithms in practical settings, our empirical results show that our new conditioning algorithm, which is only slightly worse than the best previous conditioning algorithm on each of the two criteria, can lead to better overall results than either previous algorithm, each of which is superior with respect to only one of the criteria.

Given these results, our main algorithm for quantile regression is presented as Algorithm 5. Our main theorem for this algorithm, Theorem 1, states that, with constant probability, this algorithm returns a $(1+\epsilon)$-approximate solution to the quantile regression problem; and that this solution can be obtained in $O(\mathrm{nnz}(A) \cdot \log n)$ time, plus the time for solving the subproblem, whose size depends polynomially on $d$, $1/\epsilon$, and $\tau$, but is independent of $n$.

We also provide a detailed empirical evaluation of our main algorithm for quantile regression, including a characterization of the quality of the solution as well as the running time, as a function of the higher dimension $n$, the lower dimension $d$, the sampling complexity $s$, and the quantile parameter $\tau$. Among other things, our empirical evaluation demonstrates that the output of our algorithm is highly accurate, in terms of not only the objective function value but also the actual solution quality (by the latter, we mean a norm of the difference between the exact solution to the full problem and the solution to the subproblem constructed by our algorithm), when compared with the exact quantile regression, as measured in three different norms. More specifically, our algorithm yields a solution with 2-digit accuracy while sampling only a very small fraction of the rows of the input. Our new conditioning algorithm outperforms other conditioning-based methods, and it permits a much larger lower dimension $d$ than previous conditioning algorithms. In addition to evaluating our algorithm on moderately large data that can fit in RAM, we also show that our algorithm can be implemented in MapReduce-like environments and applied to computing the solution of terabyte-sized quantile regression problems.

The best previous algorithm for moderately large quantile regression problems is due to [15] and [14]. Their algorithm uses an interior-point method on a smaller problem that has been preprocessed by randomly sampling a subset of the data. Their preprocessing step involves predicting the sign of each residual $b_i - A_{i\cdot}\hat{x}$, where $A_{i\cdot}$ and $b_i$ are the $i$-th row of the input matrix and the $i$-th element of the response vector, respectively, and $\hat{x}$ is an optimal solution to the original problem. When compared with our approach, they compute an optimal solution, while we compute an approximate solution; but in worst-case analysis it can be shown that, with high probability, our algorithm is guaranteed to work, while their algorithm does not come with such a guarantee. Also, the sampling complexity of their algorithm depends on the higher dimension $n$, while the number of samples required by our algorithm depends only on the lower dimension $d$; but our sampling is with respect to a carefully-constructed nonuniform distribution, while they sample uniformly at random.

For a detailed overview of recent work on using randomized algorithms to compute approximate solutions for least-squares regression and related problems, see the recent review [12]. Most relevant for our work are: the algorithm of [6], which constructs a well-conditioned basis by ellipsoid rounding and a subspace-preserving sampling matrix in order to approximate the solution of general $\ell_p$ regression problems, for $p \in [1, \infty)$, in roughly $O(nd^5 \log n)$ time; the algorithms of [16] and [5], which use the “slow” and “fast” versions of the Cauchy Transform to obtain a low-distortion $\ell_1$ embedding matrix and solve the over-constrained $\ell_1$ regression problem in successively faster running times; and the algorithm of [13], which constructs low-distortion embeddings in “input-sparsity” time and uses those embeddings to construct a well-conditioned basis and approximate the solution of the over-constrained $\ell_1$ regression problem in $O(\mathrm{nnz}(A) \cdot \log n)$ time. In particular, we will use the two conditioning methods of [13], as well as our “improvement” of those two methods, for constructing $\ell_1$-norm well-conditioned basis matrices in nearly input-sparsity time. In this work, we also demonstrate that such a well-conditioned basis, in the $\ell_1$ sense, can be used to solve the over-constrained quantile regression problem.

2 Background and Overview of Conditioning Methods

2.1 Preliminaries

We use $\|\cdot\|_1$ to denote the element-wise $\ell_1$ norm for both vectors and matrices; and we use $[n]$ to denote the set $\{1, 2, \ldots, n\}$. For any matrix $A$, $A_{i\cdot}$ and $A_{\cdot j}$ denote the $i$-th row and the $j$-th column of $A$, respectively; and $\mathcal{A}$ denotes the column space of $A$. For simplicity, we assume $A$ has full column rank; and we always assume that $\tau \ge 1/2$. All the results hold for $\tau < 1/2$ by simply switching the positions of $\tau$ and $1-\tau$.

Although $\rho_{\tau}(\cdot)$, defined in Eqn. (2), is not a norm, since the loss function is not absolutely homogeneous (in general $\rho_{\tau}(-z) \ne \rho_{\tau}(z)$), it satisfies some “good” properties, as stated in the following lemma:

Lemma 1.

Suppose that $1/2 \le \tau < 1$. Then, for any $x, y \in \mathbb{R}^{n}$ and any scalar $a \ge 0$, the following hold:

  1. $\rho_{\tau}(x + y) \le \rho_{\tau}(x) + \rho_{\tau}(y)$;

  2. $\rho_{\tau}(a x) = a\, \rho_{\tau}(x)$;

  3. $(1-\tau)\, \|x\|_1 \le \rho_{\tau}(x) \le \tau\, \|x\|_1$; and

  4. $|\rho_{\tau}(x) - \rho_{\tau}(y)| \le \tau\, \|x - y\|_1$.

Proof.

It is trivial to prove every equality or inequality for $\rho_{\tau}$ in one dimension. Then, by the definition of $\rho_{\tau}$ for vectors, the inequalities and equalities hold for general $x$ and $y$. ∎
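As a quick sanity check (our own illustrative code, not from the paper), the properties in Lemma 1 can be verified numerically for random vectors:

```python
import numpy as np

def rho_tau(z, tau):
    z = np.asarray(z, dtype=float)
    return float(np.sum(np.where(z >= 0, tau * z, (tau - 1.0) * z)))

rng = np.random.default_rng(1)
tau = 0.8
x, y, a = rng.standard_normal(100), rng.standard_normal(100), 2.5

assert rho_tau(x + y, tau) <= rho_tau(x, tau) + rho_tau(y, tau) + 1e-12   # item 1
assert abs(rho_tau(a * x, tau) - a * rho_tau(x, tau)) < 1e-9              # item 2
l1 = np.abs(x).sum()
assert (1 - tau) * l1 <= rho_tau(x, tau) <= tau * l1 + 1e-12              # item 3
assert abs(rho_tau(x, tau) - rho_tau(y, tau)) <= tau * np.abs(x - y).sum() + 1e-12  # item 4
```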

To make our subsequent presentation self-contained, here we will provide a brief review of recent work on subspace embedding algorithms. We start with the definition of a low-distortion embedding matrix for $\mathcal{A}$ in terms of $\|\cdot\|_1$; see, e.g., [13].

Definition 1 (Low-distortion Subspace Embedding).

Given $A \in \mathbb{R}^{n \times d}$, $\Pi \in \mathbb{R}^{r \times n}$ is a low-distortion embedding of $A$ if $r = \mathrm{poly}(d)$ and, for all $x \in \mathbb{R}^{d}$,

$$\frac{1}{\kappa_1} \|Ax\|_1 \le \|\Pi A x\|_1 \le \kappa_2\, \|Ax\|_1,$$

where $\kappa_1$ and $\kappa_2$ are low-degree polynomials of $d$.

The following stronger notion of a $(1\pm\epsilon)$-distortion subspace-preserving embedding will be crucial for our method. In this paper, the “measure functions” we will consider are $\|\cdot\|_1$ and $\rho_{\tau}(\cdot)$.

Definition 2 ($(1\pm\epsilon)$-distortion Subspace-preserving Embedding).

Given $A \in \mathbb{R}^{n \times d}$ and a measure function of vectors $f(\cdot)$, $S \in \mathbb{R}^{s \times n}$ is a $(1\pm\epsilon)$-distortion subspace-preserving matrix of $(A, f)$ if $s = \mathrm{poly}(d)$ and, for all $x \in \mathbb{R}^{d}$,

$$(1-\epsilon)\, f(Ax) \le f(SAx) \le (1+\epsilon)\, f(Ax).$$

Furthermore, if $S$ is a sampling matrix (one nonzero element per row in $S$), we call it a $(1\pm\epsilon)$-distortion subspace-preserving sampling matrix.
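For intuition, the following small helper (ours, for illustration only) estimates the worst empirical distortion of a candidate embedding over random directions; it can only falsify the $(1\pm\epsilon)$ property on the sampled directions, not prove it:

```python
import numpy as np

def empirical_distortion(A, SA, f, n_trials=1000, rng=None):
    """Return the min and max of f(SA x) / f(A x) over random directions x.
    Ratios far outside [1 - eps, 1 + eps] falsify the (1 +/- eps) property."""
    rng = np.random.default_rng() if rng is None else rng
    ratios = []
    for _ in range(n_trials):
        x = rng.standard_normal(A.shape[1])
        ratios.append(f(SA @ x) / f(A @ x))
    return min(ratios), max(ratios)

# Example measure functions: f = lambda y: np.abs(y).sum() (the l1 norm), or the
# check loss rho_tau sketched earlier (with tau fixed via a lambda).
```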

In addition, the following notion, originally introduced by [4], and stated more precisely in [6], of a basis that is well-conditioned for the $\ell_1$ norm will also be crucial for our method.

Definition 3 ($(\alpha, \beta)$-conditioning and well-conditioned basis).

Given $A \in \mathbb{R}^{n \times d}$, $A$ is $(\alpha, \beta)$-conditioned if $\|A\|_1 \le \alpha$ and, for all $x \in \mathbb{R}^{d}$, $\|x\|_\infty \le \beta\, \|Ax\|_1$. Define $\kappa(A)$ as the minimum value of $\alpha\beta$ such that $A$ is $(\alpha, \beta)$-conditioned. We will say that a basis $U$ of $\mathcal{A}$ is a well-conditioned basis if $\kappa = \kappa(U)$ is a polynomial in $d$, independent of $n$.
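The quantity $\alpha$ is easy to compute, while $\beta$ involves a maximization over $x$. The sketch below (our own illustration) computes $\alpha$ exactly and a Monte Carlo lower bound on the smallest feasible $\beta$, which together give a lower bound on $\kappa(U)$:

```python
import numpy as np

def conditioning_lower_bound(U, n_trials=2000, rng=None):
    """Return alpha = ||U||_1 (element-wise) and a Monte Carlo lower bound on
    the smallest feasible beta in Definition 3, i.e. max_x ||x||_inf / ||Ux||_1."""
    rng = np.random.default_rng() if rng is None else rng
    alpha = np.abs(U).sum()
    beta_lb = 0.0
    for _ in range(n_trials):
        x = rng.standard_normal(U.shape[1])
        beta_lb = max(beta_lb, np.abs(x).max() / np.abs(U @ x).sum())
    return alpha, beta_lb   # alpha * beta_lb is only a lower bound on kappa(U)
```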

For a low-distortion embedding matrix with respect to $\|\cdot\|_1$, we next state a fast construction algorithm that runs in “input-sparsity” time by applying the Sparse Cauchy Transform. This was originally proposed as Theorem 2 in [13].

Lemma 2 (Fast construction of Low-distortion Subspace Embedding Matrix, from [13]).

Given $A \in \mathbb{R}^{n \times d}$ with full column rank, let $\Pi_1 = S C \in \mathbb{R}^{r_1 \times n}$, where $S \in \mathbb{R}^{r_1 \times n}$ has each column chosen independently and uniformly from the $r_1$ standard basis vectors of $\mathbb{R}^{r_1}$, and where $C \in \mathbb{R}^{n \times n}$

is a diagonal matrix with diagonals chosen independently from a standard Cauchy distribution. Set $r_1$ to a sufficiently large low-degree polynomial of $d$. Then, with a constant probability, we have

$$\frac{1}{\kappa_1} \|Ax\|_1 \le \|\Pi_1 A x\|_1 \le \kappa_2\, \|Ax\|_1, \quad \forall x \in \mathbb{R}^{d}, \qquad (4)$$

where $\kappa_1$ and $\kappa_2$ are low-degree polynomials of $d$. In addition, $\Pi_1 A$ can be computed in $O(\mathrm{nnz}(A))$ time.
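Here is a compact sketch (ours) of applying $\Pi_1 = SC$ to $A$ in $O(\mathrm{nnz}(A))$ time: scale each row of $A$ by an independent Cauchy variable and add it into a uniformly chosen one of the $r_1$ output rows.

```python
import numpy as np
import scipy.sparse as sp

def sparse_cauchy_transform(A, r1, rng=None):
    """Compute Pi_1 @ A where Pi_1 = S C: C is diagonal with standard Cauchy
    entries and S hashes each of the n rows of A into one of r1 buckets."""
    rng = np.random.default_rng() if rng is None else rng
    n = A.shape[0]
    c = rng.standard_cauchy(n)                    # diagonal of C
    buckets = rng.integers(0, r1, size=n)         # column choices defining S
    SC = sp.csr_matrix((c, (buckets, np.arange(n))), shape=(r1, n))  # S C fused
    return np.asarray(SC @ A)                     # r1 x d sketch, O(nnz(A)) work

# Usage: PA = sparse_cauchy_transform(A, r1=2000)
```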

Remark. This result has very recently been improved. In [17], the authors show that one can achieve an improved distortion with a smaller embedding dimension, still in $O(\mathrm{nnz}(A))$ time, by replacing the Cauchy variables in the above lemma with exponential variables. Our theory can also easily be improved by using this improved lemma.

Next, we state a result for the fast construction of a $(1\pm\epsilon)$-distortion subspace-preserving sampling matrix for $(A, \|\cdot\|_1)$, from Theorem 5.4 in [5], specialized to $p = 1$, as follows.

Lemma 3 (Fast construction of Sampling Matrix, from Theorem 5.4 in [5]).

Given a matrix $A \in \mathbb{R}^{n \times d}$ and a matrix $R \in \mathbb{R}^{d \times d}$ such that $A R^{-1}$ is a well-conditioned basis for $\mathcal{A}$ with condition number $\kappa$, it takes $O(\mathrm{nnz}(A) \cdot \log n)$ time to compute a sampling matrix $S \in \mathbb{R}^{s \times n}$, with $s$ depending on $\kappa$, $d$, and $1/\epsilon$ but not on $n$, such that, with a constant probability, for any $x \in \mathbb{R}^{d}$,

$$(1-\epsilon)\, \|Ax\|_1 \le \|SAx\|_1 \le (1+\epsilon)\, \|Ax\|_1.$$
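The key computational step behind such constructions is estimating the row $\ell_1$-norms of the implicit basis $A R^{-1}$ without forming it. A standard trick, sketched below under our own naming and with an illustrative number of projections, post-multiplies by a small Cauchy matrix and takes medians of absolute values (the median of the absolute value of a standard Cauchy variable is 1, and Cauchy variables are 1-stable):

```python
import numpy as np

def estimate_row_l1_norms(A, R_inv, k=30, rng=None):
    """Estimate ||(A R^{-1})_{i,.}||_1 for every row i without forming A R^{-1}:
    for a standard Cauchy vector g, (A R^{-1} g)_i is Cauchy-distributed with
    scale ||(A R^{-1})_{i,.}||_1, so medians of |.| over k draws estimate it."""
    rng = np.random.default_rng() if rng is None else rng
    d = R_inv.shape[1]
    G = rng.standard_cauchy((d, k))        # small d x k Cauchy matrix
    P = A @ (R_inv @ G)                    # n x k, costs O(nnz(A) * k) plus O(d^2 k)
    return np.median(np.abs(P), axis=1)    # median of |Cauchy(scale)| ~ scale
```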

We also cite the following lemma, for finding a matrix $R \in \mathbb{R}^{d \times d}$ such that $A R^{-1}$ is a well-conditioned basis, which is based on the ellipsoid rounding method proposed in [5].

Lemma 4 (Fast Ellipsoid Rounding, from [5]).

Given an $n \times d$ matrix $A$ with full column rank, by applying an ellipsoid rounding method, it takes at most $O(n d^3 \log n)$ time to find a matrix $R \in \mathbb{R}^{d \times d}$ such that $A R^{-1}$ is a well-conditioned basis with a condition number that is a low-degree polynomial of $d$, independent of $n$.

Finally, two important ingredients for proving subspace preservation are $\epsilon$-nets and tail inequalities. Suppose that $Z$ is a point set and $\|\cdot\|$ is a metric on $Z$. A subset $Z_{\epsilon} \subseteq Z$ is called an $\epsilon$-net of $Z$ for some $\epsilon > 0$ if for every $x \in Z$ there is a $y \in Z_{\epsilon}$ such that $\|x - y\| \le \epsilon$. It is well-known that the unit ball of a $d$-dimensional subspace has an $\epsilon$-net with size at most $(3/\epsilon)^d$ [1]. Also, we will use the standard Bernstein inequality to prove concentration results for sums of independent random variables.

Lemma 5 (Bernstein inequality, [1]).

Let $X_1, \ldots, X_n$ be independent random variables with zero mean. Suppose that $|X_i| \le M$ for $i \in [n]$. Then, for any positive number $t$, we have

$$\Pr\left[ \sum_{i=1}^{n} X_i > t \right] \le \exp\left( \frac{-t^2/2}{\sum_{j=1}^{n} \mathbb{E}[X_j^2] + M t / 3} \right).$$

2.2 Conditioning methods for $\ell_1$ regression problems

Before presenting our main results, we start here by outlining the theory of conditioning for over-constrained $\ell_1$ (and, more generally, $\ell_p$) regression problems.

The concept of a well-conditioned basis (recall Definition 3) plays an important role in our algorithms, and thus in this subsection we will discuss several related conditioning methods. By a conditioning method, we mean an algorithm for finding, for an input matrix $A$, a well-conditioned basis, i.e., either finding a well-conditioned matrix $U$ or finding a matrix $R$ such that $A R^{-1}$ is well-conditioned. Many approaches have been proposed for conditioning. The two most important properties of these methods for our subsequent analysis are: (1) the condition number $\kappa = \alpha\beta$; and (2) the running time to construct $U$ (or $R$). The importance of the running time should be obvious; but the condition number directly determines the number of rows that we need to select, and thus it has an indirect effect on running time (via the time required to solve the subproblem). See Table 1 for a summary of the basic properties of the conditioning methods that will be discussed in this subsection.

name | running time | type
SC [16] | $O(nd^2 \log d)$ | QR
FC [5] | $O(nd \log d)$ | QR
Ellipsoid rounding [4] | poly($n$, $d$) | ER
Fast ellipsoid rounding [5] | $O(nd^3 \log n)$ | ER
SPC1 [13] | $O(\mathrm{nnz}(A))$ | QR
SPC2 [13] | $O(\mathrm{nnz}(A) \cdot \log n)$ + ER_small | QR+ER
SPC3 (proposed in this article) | $O(\mathrm{nnz}(A) \cdot \log n)$ + QR_small | QR+QR
Table 1: Summary of the running time and the type of conditioning methods proposed recently; each method also has an associated condition number $\kappa$ that is a low-degree polynomial of $d$, independent of $n$ (see the corresponding references and the lemmas below). QR and ER refer, respectively, to methods based on the QR factorization and methods based on Ellipsoid Rounding, as discussed in the text. QR_small and ER_small denote the running time for applying a QR factorization and Ellipsoid Rounding, respectively, on a small matrix with size independent of $n$.

In general, there are three basic ways for finding a matrix $R$ such that $A R^{-1}$ is well-conditioned: those based on the QR factorization; those based on Ellipsoid Rounding; and those based on combining the two basic methods.

  • Via QR Factorization (QR). To obtain a well-conditioned basis, one can first construct a low-distortion embedding matrix. By Definition 1, this means finding a $\Pi \in \mathbb{R}^{r \times n}$ such that, for any $x \in \mathbb{R}^{d}$,

    $$\sigma \cdot \|Ax\|_1 \le \|\Pi A x\|_1 \le \kappa_{\Pi}\, \sigma \cdot \|Ax\|_1, \qquad (5)$$

    where $\sigma > 0$ is independent of $x$, and the factors $\sigma$ and $\kappa_{\Pi}$ here will be low-degree polynomials of $d$ (and related to $\alpha$ and $\beta$ of Definition 3). For example, $\Pi$ could be the Sparse Cauchy Transform described in Lemma 2. After obtaining $\Pi$, by calculating a matrix $R$ such that $\Pi A R^{-1}$ has orthonormal columns, the matrix $A R^{-1}$ is a well-conditioned basis with a condition number determined by $r$ and the distortion of $\Pi$. See Theorem 4.1 in [13] for more details. Here, the matrix $R$

    can be obtained by a QR factorization (or, alternatively, the Singular Value Decomposition). As the choice of $\Pi$

    varies, the condition number of $A R^{-1}$, i.e., $\kappa(A R^{-1})$, and the corresponding running time will also vary, and there is in general a trade-off among these.

    For simplicity, the acronyms for these types of conditioning methods will come from the name of the corresponding transformations: SC stands for Slow Cauchy Transform from [16]; FC stands for Fast Cauchy Transform from [5]; and SPC1 (see Algorithm 1) will be the first method based on the Sparse Cauchy Transform (see Lemma 2). We will call the methods derived from this scheme QR-type methods.

  • Via Ellipsoid Rounding (ER). Alternatively, one can compute a well-conditioned basis by applying ellipsoid rounding. This is a deterministic algorithm that computes a $\kappa$-rounding of the centrally symmetric convex set $\mathcal{C} = \{ x \in \mathbb{R}^{d} \mid \|Ax\|_1 \le 1 \}$. By $\kappa$-rounding here we mean finding an ellipsoid $\mathcal{E} = \{ x \in \mathbb{R}^{d} \mid \|Rx\|_2 \le 1 \}$, satisfying $\mathcal{E}/\kappa \subseteq \mathcal{C} \subseteq \mathcal{E}$, which implies $\|Rx\|_2 \le \|Ax\|_1 \le \kappa\, \|Rx\|_2$, $\forall x \in \mathbb{R}^{d}$. With a transformation of the coordinates, it is not hard to show the following,

    $$\|x\|_2 \le \|A R^{-1} x\|_1 \le \kappa\, \|x\|_2, \quad \forall x \in \mathbb{R}^{d}. \qquad (6)$$

    From this, it is not hard to show the following inequalities,

    $$\|A R^{-1}\|_1 \le d\, \kappa \quad \text{and} \quad \|x\|_\infty \le \|A R^{-1} x\|_1, \quad \forall x \in \mathbb{R}^{d}.$$

    This directly leads to a well-conditioned matrix $A R^{-1}$ with $\kappa(A R^{-1}) \le d\, \kappa$. Hence, the problem boils down to finding a $\kappa$-rounding with small $\kappa$ in a reasonable time.

    By Theorem 2.4.1 in [11], one can find a $\kappa$-rounding with $\kappa = O(d)$ in polynomial time. This result was used by [4] and [6]. As we mentioned in the previous section (Lemma 4), a new fast ellipsoid rounding algorithm was proposed in [5]. For an $n \times d$ matrix $A$ with full rank, it takes at most $O(n d^3 \log n)$ time to find a matrix $R$ such that $A R^{-1}$ is a well-conditioned basis with a condition number that is a low-degree polynomial of $d$. We will call the methods derived from this scheme ER-type methods.

  • Via Combined QR+ER Methods. Finally, one can construct a well-conditioned basis by combining QR-like and ER-like methods. For example, after we obtain $R_1$ such that $A R_1^{-1}$ is a well-conditioned basis, one can then construct a $(1\pm\epsilon)$-distortion subspace-preserving sampling matrix $S$ in $O(\mathrm{nnz}(A) \cdot \log n)$ time, as described in Lemma 3. We may view the price we pay for obtaining $S$ as very low in terms of running time. Since $S$ is a sampling matrix with a constant distortion factor and since the larger dimension of $S A$ is independent of $n$, we can apply additional operations on that smaller matrix in order to obtain a better condition number, without much additional running time, in theory at least, as long as those operations cost only a low-degree polynomial of $d$.

    Since the bottleneck for ellipsoid rounding is its running time, when compared to QR-type methods, one possibility is to apply ellipsoid rounding on $S A$. Since the bigger dimension of $S A$ only depends on $d$, the running time for computing $R$ via ellipsoid rounding will be acceptable if that dimension is much smaller than $n$. As for the condition number, for any general subspace embedding $\Pi$ satisfying Eqn. (5), i.e., one which preserves the $\ell_1$ norm up to factors determined by $d$, including $S$, if we apply ellipsoid rounding on $\Pi A$, then the resulting $R$ may still satisfy Eqn. (6) with some factor. In detail, applying Eqn. (5) with $x$ replaced by $R^{-1} x$, we have

    $$\sigma\, \|A R^{-1} x\|_1 \le \|\Pi A R^{-1} x\|_1 \le \kappa_{\Pi}\, \sigma\, \|A R^{-1} x\|_1.$$

    In Eqn. (6), replacing $A$ with $\Pi A$ and combining the inequalities above, we have

    $$\frac{1}{\kappa_{\Pi}\, \sigma}\, \|x\|_2 \le \|A R^{-1} x\|_1 \le \frac{\kappa}{\sigma}\, \|x\|_2, \quad \forall x \in \mathbb{R}^{d}.$$

    With appropriate scaling, one can show that $A R^{-1}$ is a well-conditioned matrix whose condition number is inflated by at most the distortion $\kappa_{\Pi}$ of $\Pi$. In particular, when $\Pi$ has constant distortion, say $S$ with $\epsilon = 1/2$, the condition number is preserved up to a constant factor, while the running time is reduced considerably when compared to the vanilla ellipsoid rounding method. (See Algorithm 2 (SPC2) below for a version of this method.)

    A second possibility is to view $S$ as a sampling matrix satisfying Eqn. (5) with constant distortion factors. Then, according to our discussion of the QR-type methods, if we compute the QR factorization of $S A$, we may expect the resulting $A R^{-1}$ to be a well-conditioned basis with a lower condition number $\kappa$. As for the running time, the QR factorization of the smaller matrix $S A$ will be inconsequential, in theory at least. (See Algorithm 3 (SPC3) below for a version of this method.)

In the remainder of this subsection, we will describe three related methods for computing a well-conditioned basis that we will use in our empirical evaluations. Recall that Table 1 provides a summary of these three methods and the other methods that we will use.

We start with the algorithm obtained when we use Sparse Cauchy Transform from [13] as the random projection in a vanilla QR-type method. We call it SPC1 since we will describe two of its variants below. Our main result for Algorithm 1 is given in Lemma 6. Since the proof is quite straightforward, we omit it here.

Input: $A \in \mathbb{R}^{n \times d}$ with full column rank.
Output: $R \in \mathbb{R}^{d \times d}$ such that $A R^{-1}$ is a well-conditioned basis with $\kappa(A R^{-1}) = \mathrm{poly}(d)$.
1: Construct a low-distortion embedding matrix $\Pi_1 \in \mathbb{R}^{r_1 \times n}$ of $\mathcal{A}$ via Lemma 2.
2: Compute $R \in \mathbb{R}^{d \times d}$ such that $A R^{-1}$ is a well-conditioned basis for $\mathcal{A}$ via the QR factorization of $\Pi_1 A$.
Algorithm 1 SPC1: vanilla QR-type method with Sparse Cauchy Transform
Lemma 6.

Given $A \in \mathbb{R}^{n \times d}$ with full rank, Algorithm 1 takes $O(\mathrm{nnz}(A))$ time, plus lower-order terms depending only on $d$, to compute a matrix $R$ such that, with a constant probability, $A R^{-1}$ is a well-conditioned basis for $\mathcal{A}$ with $\kappa(A R^{-1})$ a low-degree polynomial of $d$.
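A minimal sketch of SPC1 (Algorithm 1), reusing the sparse_cauchy_transform helper sketched after Lemma 2; the sketch size below is an illustrative assumption, and the well-conditioned basis $A R^{-1}$ is never formed explicitly:

```python
import numpy as np

def spc1(A, r1=None, rng=None):
    """SPC1-style conditioning sketch: sketch A with the Sparse Cauchy Transform,
    then return R from a QR factorization of the sketch, so that A @ inv(R) is
    (with constant probability) a well-conditioned basis."""
    rng = np.random.default_rng() if rng is None else rng
    d = A.shape[1]
    r1 = 20 * d if r1 is None else r1          # illustrative sketch size (assumption)
    PA = sparse_cauchy_transform(A, r1, rng)   # r1 x d sketch of A
    _, R = np.linalg.qr(PA)                    # Pi_1 A = Q R; only R is needed
    return R
```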

Next, we summarize the two Combined Methods described above in Algorithm 2 and Algorithm 3. Since they are variants of SPC1, we call them SPC2 and SPC3, respectively. Algorithm 2 originally appeared as the first four steps of Algorithm 2 in [13]. Our main result for Algorithm 2 is given in Lemma 7; since the proof of this lemma is very similar to the proof of Theorem 7 in [13], we omit it here. Algorithm 3 is new to this paper. Our main result for Algorithm 3 is given in Lemma 8.

Input: $A \in \mathbb{R}^{n \times d}$ with full column rank.
Output: $R \in \mathbb{R}^{d \times d}$ such that $A R^{-1}$ is a well-conditioned basis with $\kappa(A R^{-1}) = \mathrm{poly}(d)$.
1: Construct a low-distortion embedding matrix $\Pi_1 \in \mathbb{R}^{r_1 \times n}$ of $\mathcal{A}$ via Lemma 2.
2: Construct $R_1 \in \mathbb{R}^{d \times d}$ such that $A R_1^{-1}$ is a well-conditioned basis for $\mathcal{A}$ via the QR factorization of $\Pi_1 A$.
3: Compute a $(1\pm\epsilon)$-distortion sampling matrix $S$ of $(\mathcal{A}, \|\cdot\|_1)$ via Lemma 3.
4: Compute $R \in \mathbb{R}^{d \times d}$ by ellipsoid rounding for $S A$ via Lemma 4.
Algorithm 2 SPC2: QR + ER type method with Sparse Cauchy Transform
Lemma 7.

Given $A \in \mathbb{R}^{n \times d}$ with full rank, Algorithm 2 takes $O(\mathrm{nnz}(A) \cdot \log n)$ time, plus lower-order terms depending only on $d$, to compute a matrix $R$ such that, with a constant probability, $A R^{-1}$ is a well-conditioned basis for $\mathcal{A}$ with $\kappa(A R^{-1})$ a low-degree polynomial of $d$.

Input: $A \in \mathbb{R}^{n \times d}$ with full column rank.
Output: $R \in \mathbb{R}^{d \times d}$ such that $A R^{-1}$ is a well-conditioned basis with $\kappa(A R^{-1}) = \mathrm{poly}(d)$.
1: Construct a low-distortion embedding matrix $\Pi_1 \in \mathbb{R}^{r_1 \times n}$ of $\mathcal{A}$ via Lemma 2.
2: Construct $R_1 \in \mathbb{R}^{d \times d}$ such that $A R_1^{-1}$ is a well-conditioned basis for $\mathcal{A}$ via the QR factorization of $\Pi_1 A$.
3: Compute a $(1\pm\epsilon)$-distortion sampling matrix $S$ of $(\mathcal{A}, \|\cdot\|_1)$ via Lemma 3.
4: Compute $R \in \mathbb{R}^{d \times d}$ via the QR factorization of $S A$.
Algorithm 3 SPC3: QR + QR type method with Sparse Cauchy Transform
Lemma 8.

Given $A \in \mathbb{R}^{n \times d}$ with full rank, Algorithm 3 takes $O(\mathrm{nnz}(A) \cdot \log n)$ time, plus lower-order terms depending only on $d$, to compute a matrix $R$ such that, with a constant probability, $A R^{-1}$ is a well-conditioned basis for $\mathcal{A}$ with $\kappa(A R^{-1})$ a low-degree polynomial of $d$.

Proof.

By Lemma 2, in Step 1, $\Pi_1$ is a low-distortion embedding satisfying Eqn. (5) with distortion factors that are low-degree polynomials of $d$. As a matter of fact, as we discussed in Section 2.2, the resulting $R_1$ in Step 2 is such that $A R_1^{-1}$ is a well-conditioned basis with $\kappa(A R_1^{-1}) = \mathrm{poly}(d)$. In Step 3, by Lemma 3, the sampling complexity required for obtaining a $(1\pm\epsilon)$-distortion sampling matrix $S$ depends only on $d$ and $1/\epsilon$. Finally, if we view $S$ as a low-distortion embedding matrix satisfying Eqn. (5) with constant distortion factors, then the resulting $R$ in Step 4 will satisfy that $A R^{-1}$ is a well-conditioned basis with $\kappa(A R^{-1}) = \mathrm{poly}(d)$.

For the running time, it takes $O(\mathrm{nnz}(A))$ time to complete Step 1. In Step 2, the running time is $O(r_1 d^2)$, which depends only on $d$. As Lemma 3 points out, the running time for constructing $S$ in Step 3 is $O(\mathrm{nnz}(A) \cdot \log n)$. Since the larger dimension of $S A$ is a low-degree polynomial of $d$, the QR factorization of it costs time depending only on $d$ in Step 4. Overall, the running time of Algorithm 3 is $O(\mathrm{nnz}(A) \cdot \log n)$, plus lower-order terms depending only on $d$. ∎
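Putting the pieces together, here is a sketch of SPC3 (Algorithm 3), reusing the sparse_cauchy_transform, estimate_row_l1_norms, and l1_sampling helpers sketched earlier; the sketch and sampling sizes are illustrative assumptions, not values from the paper:

```python
import numpy as np

def spc3(A, r1=None, s=None, rng=None):
    """SPC3-style conditioning sketch: an SPC1 pass gives a first R1, then
    l1 row-norm based sampling of A, then a second QR on the small sampled matrix."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = A.shape
    r1 = 20 * d if r1 is None else r1          # illustrative sketch size (assumption)
    s = 50 * d if s is None else s             # illustrative sampling size (assumption)
    # Steps 1-2: first conditioning pass, exactly as in SPC1.
    _, R1 = np.linalg.qr(sparse_cauchy_transform(A, r1, rng))
    # Step 3: sample rows of A with probabilities proportional to the estimated
    # row l1 norms of the implicit basis A @ inv(R1).
    lev = estimate_row_l1_norms(A, np.linalg.inv(R1), rng=rng)
    SA, _, _ = l1_sampling(A, lev[:, None], s, rng)
    # Step 4: a second, cheap QR on the small sampled matrix gives the final R.
    _, R = np.linalg.qr(SA)
    return R
```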

Both Algorithm 2 and Algorithm 3 have two additional steps (Steps 3 and 4) when compared with Algorithm 1, and this leads to some improvements, at the cost of additional computation time. For example, in Algorithm 3 (SPC3), we obtain a well-conditioned basis with a smaller condition number than with Algorithm 1 (SPC1). As for the running time, it will still be of nearly input-sparsity order, since the additional time is spent constructing the sampling matrix and computing a QR factorization of a matrix whose dimensions are determined by $d$. Note that when we summarize these results in Table 1, we explicitly list the additional running time for SPC2 and SPC3, in order to show the trade-off among these SPC-derived methods. We will evaluate the performance of all these methods on quantile regression problems in Section 4 (except for FC, since it is similar to but worse than SPC1, and ellipsoid rounding, since on the full problem it is too expensive).

Remark. For all the methods we described above, the output is not the well-conditioned matrix $U$ itself, but instead it is the matrix $R$, the inverse of which transforms $A$ into $U = A R^{-1}$.

Remark. As we can see in Table 1, with respect to conditioning quality, SPC2 has the lowest condition number, followed by SPC3 and then SPC1, which has the worst condition number. On the other hand, with respect to running time, SPC1 is the fastest, followed by SPC3, and then SPC2, which is the slowest. (The reason for this ordering of the running time is that SPC2 and SPC3 need additional steps, and that ellipsoid rounding takes a longer running time than doing a QR decomposition.)

3 Main Theoretical Results

In this section, we present our main theoretical results on $(1\pm\epsilon)$-distortion subspace-preserving embeddings and our fast randomized algorithm for quantile regression.

3.1 Main technical ingredients

In this subsection, we present the main technical ingredients underlying our main algorithm for quantile regression. We start with a result which says that if we sample sufficiently many (but still only $\mathrm{poly}(d)$) rows according to an appropriately-defined non-uniform importance sampling distribution (of the form given in Eqn. (7) below), then we obtain a $(1\pm\epsilon)$-distortion embedding matrix with respect to the loss function of quantile regression. Note that the form of this lemma is based on ideas from [6, 5].

Lemma 9 (Subspace-preserving Sampling Lemma).

Given $A \in \mathbb{R}^{n \times d}$, let $U \in \mathbb{R}^{n \times d}$ be a well-conditioned basis for $\mathcal{A}$ with condition number $\kappa$. For $i \in [n]$, define

$$p_i \ge \min\left\{ 1, \; s \cdot \frac{\|U_{i\cdot}\|_1}{\|U\|_1} \right\}, \qquad (7)$$

and let $S \in \mathbb{R}^{n \times n}$ be a random diagonal matrix with $S_{ii} = 1/p_i$ with probability $p_i$, and $0$ otherwise. Then, when the expected sampling complexity $s$ is larger than a quantity that depends polynomially on $\kappa$, $d$, $\tau$, and $1/\epsilon$, but that is independent of $n$,

with constant probability, for every $x \in \mathbb{R}^{d}$,

$$(1-\epsilon)\, \rho_{\tau}(Ax) \le \rho_{\tau}(SAx) \le (1+\epsilon)\, \rho_{\tau}(Ax). \qquad (8)$$
Proof.

Since $U$ is a well-conditioned basis for the range space of $A$, to prove Eqn. (8) it is equivalent to prove that the following holds for all $y \in \mathcal{A}$,

$$(1-\epsilon)\, \rho_{\tau}(y) \le \rho_{\tau}(Sy) \le (1+\epsilon)\, \rho_{\tau}(y). \qquad (9)$$

To prove that Eqn. (9) holds for any $y \in \mathcal{A}$, firstly, we show that Eqn. (9) holds for any fixed $y \in \mathcal{A}$; and, secondly, we apply a standard $\epsilon$-net argument to show that (9) holds for every $y \in \mathcal{A}$.

Assume that $U$ is $(\alpha, \beta)$-conditioned with $\alpha\beta \le \kappa$. For a fixed $y \in \mathcal{A}$, write $y = Ux$; then $|y_i| \le \|U_{i\cdot}\|_1 \|x\|_\infty \le \beta\, \|U_{i\cdot}\|_1 \|y\|_1$ since $\|x\|_\infty \le \beta \|Ux\|_1$. Let $X_i = (S_{ii} - 1)\, \rho_{\tau}(y_i)$ be a random variable, and we have $\mathbb{E}[X_i] = 0$ and $\sum_i X_i = \rho_{\tau}(Sy) - \rho_{\tau}(y)$, by the positive homogeneity of $\rho_{\tau}$.

Therefore, it suffices to bound $\sum_i X_i$. Note here we only consider indices $i$ such that $p_i < 1$, since otherwise we have $S_{ii} = 1$ deterministically,

and the corresponding term will not contribute to the variance. According to our definition, for such indices,

$p_i \ge s\, \|U_{i\cdot}\|_1 / \|U\|_1$. Consider the following,

$$|X_i| \le \frac{\rho_{\tau}(y_i)}{p_i} \le \frac{\tau\, |y_i|}{p_i} \le \frac{\tau \beta\, \|U_{i\cdot}\|_1 \|y\|_1}{p_i} \le \frac{\tau \alpha \beta}{s}\, \|y\|_1.$$

Hence, using $\|y\|_1 \le \rho_{\tau}(y)/(1-\tau)$ from Lemma 1,

$$|X_i| \le \frac{\tau \alpha \beta}{(1-\tau)\, s}\, \rho_{\tau}(y) =: M.$$

Also,

$$\sum_i \mathbb{E}[X_i^2] \le \sum_i \frac{\rho_{\tau}(y_i)^2}{p_i} \le \max_i \frac{\rho_{\tau}(y_i)}{p_i} \cdot \sum_i \rho_{\tau}(y_i) \le \frac{\tau \alpha \beta}{(1-\tau)\, s}\, \rho_{\tau}(y)^2.$$

Applying the Bernstein inequality (Lemma 5) to the zero-mean random variables $\pm X_i$ gives

$$\Pr\left[ \left| \rho_{\tau}(Sy) - \rho_{\tau}(y) \right| > t \right] \le 2 \exp\left( \frac{-t^2/2}{\sum_i \mathbb{E}[X_i^2] + M t / 3} \right).$$

Setting $t$ to $\epsilon_1\, \rho_{\tau}(y)$ and plugging in all the results we derived above, and then

simplifying the exponential term on the right-hand side of the above expression,

we obtain that, for the fixed $y$, with probability at least $1 - 2\exp\left( -c\, \frac{(1-\tau)\, \epsilon_1^2\, s}{\tau\, \alpha\beta} \right)$ for an absolute constant $c$,

$$(1-\epsilon_1)\, \rho_{\tau}(y) \le \rho_{\tau}(Sy) \le (1+\epsilon_1)\, \rho_{\tau}(y), \qquad (10)$$

where $\epsilon_1$ will be specified later.

We will show that, for all $y \in \mathcal{A}$,

$$(1-\epsilon)\, \rho_{\tau}(y) \le \rho_{\tau}(Sy) \le (1+\epsilon)\, \rho_{\tau}(y). \qquad (11)$$

By the positive linearity of $\rho_{\tau}$, it suffices to show the bound above holds for all $y \in \mathcal{A}$ with $\|y\|_1 = 1$.

Next, let $Z = \{ y \in \mathcal{A} \mid \|y\|_1 \le 1 \}$ and construct an $\epsilon_2$-net of $Z$, denoted by $Z_{\epsilon_2}$, such that for any $y \in Z$, there exists a $y' \in Z_{\epsilon_2}$ that satisfies $\|y - y'\|_1 \le \epsilon_2$. By [1], the number of elements in $Z_{\epsilon_2}$ is at most $(3/\epsilon_2)^d$. Hence, by a union bound over the net, for a suitable choice of $s$, with high probability Eqn. (10) holds for all $y' \in Z_{\epsilon_2}$.

We claim that, with a suitable choice of $\epsilon_3$, with constant probability, $S$ will be a $(1\pm\epsilon_3)$-distortion embedding matrix of $(A, \|\cdot\|_1)$. To show this, firstly, we state a similar result for $\|\cdot\|_1$ from Theorem 6 in [6], with $p = 1$, as follows.

Lemma 10 ($\ell_1$ Subspace-preserving Sampling Lemma).

Given $A \in \mathbb{R}^{n \times d}$, let $U \in \mathbb{R}^{n \times d}$ be an $(\alpha, \beta)$-conditioned basis for $\mathcal{A}$. For $i \in [n]$, define

$$p_i \ge \min\left\{ 1, \; s \cdot \frac{\|U_{i\cdot}\|_1}{\|U\|_1} \right\},$$

and let $S \in \mathbb{R}^{n \times n}$ be a random diagonal matrix with $S_{ii} = 1/p_i$ with probability $p_i$, and $0$ otherwise. Then, when $s$ is at least a constant multiple of $\alpha\beta \left( d \log(1/\epsilon_3) + \log(1/\delta) \right) / \epsilon_3^2$, with probability at least $1 - \delta$, for every $x \in \mathbb{R}^{d}$,

$$(1-\epsilon_3)\, \|Ax\|_1 \le \|SAx\|_1 \le (1+\epsilon_3)\, \|Ax\|_1. \qquad (12)$$

Note here we state the original theorem with the constraint $x \in \mathbb{R}^{d}$ rather than the constrained form used there. One can easily show that the result still holds in this setting. If we set $\epsilon_3$ and the failure probability to be suitable constants, the construction of $S$ defined above satisfies the conditions of Lemma 10 once the expected sampling complexity $s$ is large enough (still depending only on $d$, $\kappa$, $\tau$, and $1/\epsilon$). Then our claim for $S$ holds. Hence, we only need to make sure that, with suitable choices of $\epsilon_1$, $\epsilon_2$, and $\epsilon_3$, Eqn. (11) follows.

For any $y \in \mathcal{A}$ with $\|y\|_1 = 1$ and its nearest net point $y' \in Z_{\epsilon_2}$, we have

$$|\rho_{\tau}(Sy) - \rho_{\tau}(y)| \le |\rho_{\tau}(Sy) - \rho_{\tau}(Sy')| + |\rho_{\tau}(Sy') - \rho_{\tau}(y')| + |\rho_{\tau}(y') - \rho_{\tau}(y)| \le \tau\, \|S(y - y')\|_1 + \epsilon_1\, \rho_{\tau}(y') + \tau\, \|y - y'\|_1,$$

where we used Lemma 1 and Eqn. (10), and where, by the claim for $S$, $\|S(y - y')\|_1 \le (1+\epsilon_3)\, \epsilon_2$. We take $\epsilon_1$, $\epsilon_2$, and $\epsilon_3$ to be suitable constant multiples of $\epsilon$ (depending on $\tau$), and the expected sampling size becomes a quantity depending only on $d$, $\kappa$, $\tau$, and $1/\epsilon$.

With these choices, we will have $|\rho_{\tau}(Sy) - \rho_{\tau}(y)| \le \epsilon\, \rho_{\tau}(y)$ for all $y$ with $\|y\|_1 = 1$. Hence the claim for $S$ holds and Eqn. (11) holds for every $y \in \mathcal{A}$.

Since the proof involves two random events, each with a small constant failure probability, by a simple union bound, Eqn. (9) holds with constant probability. Our result follows since $\alpha\beta \le \kappa$. ∎

Remark. It is not hard to see that, for any matrix $S$ satisfying Eqn. (8), the rank of $A$ is preserved, i.e., $\mathrm{rank}(SA) = \mathrm{rank}(A)$.

Remark. Given such a subspace-preserving sampling matrix, it is not hard to show that, by solving the sub-sampled problem induced by $S$, i.e., by solving $\min_{x \in \mathcal{C}} \rho_{\tau}(SAx)$, one obtains a $\frac{1+\epsilon}{1-\epsilon}$-approximate solution to the original problem.
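To make the last remark concrete, the following sketch (our own code, with scipy.optimize.linprog standing in for a generic LP solver and all names ours) solves a weighted quantile regression problem of the form $\min_x \sum_i w_i\, \rho_{\tau}(b_i - A_{i\cdot} x)$, which covers both the full problem ($w_i = 1$) and the sub-sampled problem induced by $S$ ($w_i = 1/p_i$ on the sampled rows):

```python
import numpy as np
from scipy.optimize import linprog

def quantile_regression_lp(A, b, tau, w=None):
    """Solve min_x sum_i w_i * rho_tau(b_i - A_i x) via its standard LP form:
    minimize tau * w^T u + (1 - tau) * w^T v  s.t.  A x + u - v = b,  u, v >= 0.
    (Dense formulation, for illustration only.)"""
    n, d = A.shape
    w = np.ones(n) if w is None else np.asarray(w, dtype=float)
    c = np.concatenate([np.zeros(d), tau * w, (1.0 - tau) * w])
    A_eq = np.hstack([A, np.eye(n), -np.eye(n)])
    bounds = [(None, None)] * d + [(0, None)] * (2 * n)
    res = linprog(c, A_eq=A_eq, b_eq=b, bounds=bounds, method="highs")
    return res.x[:d]

# Sub-sampled problem: keep the rows selected by the sampling step and pass the
# weights 1/p_i, e.g. x_hat = quantile_regression_lp(A[keep], b[keep], tau,
#                                                    w=1.0 / p[keep]).
```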