FrequentDirectionsRidgeRegression
Details on using Frequent Directions for Ridge Regression
view repo
We provide a deterministic space-efficient algorithm for estimating ridge regression. For n data points with d features and a large enough regularization parameter, we provide a solution within ε L_2 error using only O(d/ε) space. This is the first o(d^2) space algorithm for this classic problem. The algorithm sketches the covariance matrix by variants of Frequent Directions, which implies it can operate in insertion-only streams and a variety of distributed data settings. In comparisons to randomized sketching algorithms on synthetic and real-world datasets, our algorithm has less empirical error using less space and similar time.
READ FULL TEXT VIEW PDFDetails on using Frequent Directions for Ridge Regression
Linear regression is one of the canonical problems in machine learning. Given pairs with each and , we can accumulate them into a matrix
and vector
. The goal is to findIt has a simple solution where is the pseudoinverse of .
The most common robust variant, ridge regression, uses a regularization parameter to add a squared regularizer on . Its goal ^{1}^{1}1We hereafter drop the from when is apparent, and we only use norms, so . is
This also has simple solutions as
where is the
-dimensional identity matrix. The regularization and using
makes regression robust to noise, improves generalization, and avoids ill-conditioning so the solution does not require the psuedoinverse.However, this problem is difficult under very large data settings because the inverse operation and standard matrix multiplication will take time, which is under our assumption . And this can also be problematic if the size of , at space, exceeds memory.
As a central task in data analysis, significant effort has gone into improving the running time of least squares regression and linear ridge regression. Most improvements are in the form of sketching methods using projection or sampling.
Sarlós [15] initiated the formal study of using Random Projections (RP) for regression to efficiently reduce dimensions to dimensions (still ) preserving the norm of the
dimension subspace vectors with high probability. Clarkson and Woodruff
[3] extended this technique to runtime depending on the number-of-non-zeroes, for sparse inputs, with CountSketch (CS).In non-streaming settings, there are other extensions. This includes Lu et. al. [12, 2]
who used different choices of the random linear transforms (e.g. SRHT). McCurdy
[14] and Cohen et. al.[4] proposed deterministic and streaming (ridge) leverage score sampling (LSS, RLSS) respectively. While these approaches may offer more compact sketches, the computation of leverage score depends on , which is also the key part for the solution of ridge regression. So to make these efficient, or potentially streaming, they rely on sketching techniques or analysis. Recently, Wang et. al. [16] re-analyzed the quality of these previous linear ridge regression sketches from two related views: the optimization view (errors on objective function) and the statistics view (bias and variance of the solutions
).Although some of these sketches can be implemented for streaming settings, none of the above methods have guarantee for the case . Thus they do not provide meaningful space improvements since ridge regression can be solved in a stream using space exactly (simply maintain the covariance matrix and -dimensional vector ). However, modern data sets can be huge with both and very large. In contrast, our methods are the first to use space for this core data analysis problem, under some assumptions. Furthermore, our methods can provide solution in the streaming and distributed models without affecting the total running time.
We make the observation, that if the goal is to approximate the solution to ridge regression, instead of ordinary least squares regression, and the regularization parameter is large enough, then a Frequent Directions based sketch can preserve
-relative error with only roughly rows. We formalize and prove this (see Theorems 3 and 4), argue this cannot be improved, and demonstrate empirically that indeed the FD-based sketch can significantly outperform random projection based sketches.Liberty introduced Frequent Directions (FD) in 2013, then together with Ghashami et. al. improved the analysis [11, 6].
It considers a tall matrix (with ) row by row in a stream. It uses limited space to compute a short sketch matrix , such that the covariance error is relatively small compared to the optimal solution :
(1) |
The algorithm maintains a sketch matrix
representing the approximate right singular values of
, scaled by the singular values. Specifically, it appends a batch of new rows to , computes the SVD of , subtracts the squared th singular value from all squared singular values (or marks down to ), and then updates as the reduced first singular values and right singular vectors. After each update, has at most rows. After all rows of , for allThe running time is and required space is . By setting , it achieves the relative covariance error, in time and in space .
Recently, Luo et. al. [13] proposed Robust Frequent Direction (RFD). They slightly extend FD by maintaining an extra value , which is half of the sum of all squared th singular values. Adding back to the covariance matrix results in a more robust solution and less error. For all :
It has same running time and running space with FD in terms of . To guarantee the same error, RFD needs almost a factor fewer rows .
Huang [9] proposed a more complicated variety to separate from in the running time. The idea is two level sketching: not only sketch , but also sketch the removed part into via sampling. Note that for a fixed , has a fixed number of rows, only increases the number of rows to reduce the error bound, and the computation of is faster and more coarse than that of . With high probability, for a fixed , the sketch achieves the error in (1) in time using space . By setting , the running time is and the space is
The Frequent Directions sketch has other nice properties. It can be extended to have runtime depend only on the number of nonzeros for sparse inputs [7, 9]. Moreover, it applies to distributed settings where data is captured from multiple locations or streams. Then these sketches can be “merged” together [6] without accumulating any more error than the single stream setting. These properties apply directly to the sketches we propose.
We know of no previous applications of the FD sketch to linear regression problems. The main challenge is that FD approximates the high norm directions of (i.e., measured with direction/unit vector as , but drops the low norm directions. However, linear regression needs to recover times the inverse of . So if is aligned with the low norm part of , then FD provides a poor approximation.
Ridge regression, however, with regularizer ensures that all directions of have norm at least in all directions, regardless of or its sketch .
Figure 1 Illustrates the effect on the eigenvalue distribution for some , and how it is affected by a ridge term and FD. The ridge term increases the values everywhere, and FD decreases the values everywhere. In principle, if these effects are balanced just right they should cancel out – at least for the high rank part of . In particular, robust frequent directions attempts to do this implicitly – it automatically picks a good choice of regularizer as half of the amount of the shrinkage induced by FD.
Recall that rows of and elements of are given in pairs in the stream, we want to approximate for a given within space , where . Let , which can be exactly maintained using space . But needs space , so we use Frequent Directions (FD) or Robust Frequent Directions (RFD) to approximate it by a sketch (which is a matrix and possibly also some auxiliary information). Then the optimal solution and its approximation of is
Algorithm 1 shows the general algorithm framework. It processes a consecutive batch of rows of (denoted ) and elements of (denoted ) each step. xFD refers to a sketching step of some variant of Frequent Directions. Line 5 computes on the fly, it is not a part of Frequent Directions. Line 7 computes the solution coefficients using only the sketch of and at the end. This simply supplements the Frequent Directions algorithm with information to compute the ridge regression solution.
The main part of our analysis is the upper bound of the coefficients error
Lemma 1 shows the key structural result, translating the sketch covariance error to the upper bound of ridge regression coefficients error.
For any , , consider an optimal solution , and an approximate solution . Then
To simplify the equations, let , , then , and so .
The third equality can be validated backwards by simple algebra.
Plug in the definitions of , , and use , we can get
(2) |
By the extreme value condition of the general Rayleigh Quotient, we have
(3) |
Note that for all unitary , we have
Here refer to the minimal eigenvalue of a matrix. Thus
(4) |
Note that Lemma 1 is tight. We show this by an example settings of and , which achieve equality.
With the same settings as those in Lemma 1, if , and for any , then
In the proof of Lemma 1, we have shown that
Using the definitions , , and ,
Similarly for the right hand side
Now we consider Algorithm 2 (FDrr), using FD as xFD in Algorithm 1. Specifically, it uses the Fast Frequent Directions algorithm in [6]. We explicitly store the first singular values and singular vectors instead of to be able to compute the the solution efficiently. Note that in the original FD algorithm, .
Line 4 and 5 are what FD actually does in each step. It appends new rows to the current sketch , calls svd to calculate the singular values and right singular vectors , then reduces the rank to .
Line 8 and 9 are how we compute the solution . Explicitly inverting that matrix is not only expensive but also would use space, which exceeds the space limitation . The good news is that contains the eigenvectors of , the corresponding eigenvalues for , and the remaining eigenvalues are . So we can separately compute in the subspace spanned by and its null space.
Let and be output of Algorithm 2 . If
then
It also holds that for any , and for any . The running time is and requires space .
Line 6 computes in time using space . Thus
Line 8 and 9 compute the solution in time using space . Let be a set of orthonormal basis of the null space of . Then
Thus
The rest of Algorithm 2 is equivalent to a normal FD algorithm with . Thus
And satisfies
Together with Lemma 1 and , we have
By setting and solving or , we get the guarantee for coefficients error.
The running time and required space of a FD algorithm is and . Therefore the total running time is , and the running space is . ∎
Note that the only two approximations in the analysis of Theorem 3 arise from Lemma 1 and in the Frequent Directions bound. Both bounds are individually tight (see Lemma 2, and Theorem 4.1 in [6]), so while this is not a complete lower bound, it indicates this analysis approach cannot be asymptotically improved, and it is unlikely any algorithm can asymptotically improve upon these results.
We can also write the space directly for this algorithm to achieve as . Note that this holds for all choices of , so the space is actually . So when (for an identified best choice of ) then this uses space, and if this holds for a constant , then the space is . This identifies the “regularizer larger than tail” case as when this algorithm is in theory appropriate. Empirically we will see next that it works well more generally.
If we use RFD instead of FD, we have in addition to . Then the approximation of is
We approximate by
Line 6 in Algorithm 3 is added to maintain . The remainder of the algorithm is the same as Algorithm 2.
Theorem 4 and its proof is also very similar.
Let and be output of Algorithm 3 with input , if
then
It also holds that for any , and for any . The running time is and requires space .
In this section, we compare new algorithms FDrr and RFDrr with other FD-based algorithms and some randomized algorithms on synthetic and real-world datasets. We focus only on algorithms which work in a stream.
This uses generic (scaled) {-1,+1} random projections [15]
. For each batch of data, construct a random matrix
, set and . Output at the end.This is the sparse version of RPrr using the CountSketch [3]. The random matrix are all zeros except for one -1 or 1 in each column with a random location.
This is the naive streaming ridge regression which computes and cumulatively (a batch size of ). In each step it computes where is an outer product of row vectors, and . Then it outputs at the end. This algorithm use space and has no error in or . We use this algorithm to computer the best ridge parameter for all sketching algorithms and the optimal ridge coefficients . The best is then fixed for each dataset, and acts as a known hyper-parameter for all the other sketching algorithm. The ridge coefficients is then used to compute the coefficients error of all sketching algorithms.
We use three main datasets that all have dimension , training data size , and test data size .
Two synthetic data-sets are low rank (LR) and high rank (HR), determined by an effective rank parameter . It is set and respectively, which is 10 percentage plus 50 percentage of dimension , so that it is at least one. This effective rank parameter is then used as the number of non-zero coefficients
and the number of major standard deviations of a multivariate normal distribution for generating input points
. Each row vector of are generated by normal distribution with standard deviations for , so the maximal standard deviation is . Figure 2 shows the singular value distributions of the low rank and high rank synthetic datasets’ inputs, these singular values have been normalized by their first singular values, and the indices has been normalized by . The linear model coefficients have first entries non-zero, they are generated by another standard normal distribution, then normalized to unit vector, such that the gradient of the linear model is 1. A relative large Gaussian noise with standard deviation 4 is added to the outputs, i.e. . Finally, we rotate data points by a discrete cosine transform.This is derived from the temperature sequence recorded hourly from 1997 to 2019 at an international airport^{2}^{2}2https://www.ncdc.noaa.gov/ details withheld for anonymity.. We compute the difference sequence between hourly temperatures, and then shingle this data, so is consecutive differences starting at the th difference, and is the next (the th) difference between temperatures. Then the TEMP dataset matrix is a set of randomly chosen (without replacement) such shingles. Its singular value distribution is shown in Figure 2.
We first run rr on training datasets with different s, then choose the ones which best minimize using a held out test dataset . The best s for the low rank LR and high rank HR datasets are 4096 and 32768 respectively, the best for the TEMP dataset is 32768. These values are used for the further experiments.
Since the value is only used to compute the solution or (storing separate from in RFDrr), so this choice could also be made at the time of calculating the solution using a stored test set. To avoid this extra level of error into the evaluation process, we simply use this pre-computed value.
We run all these 6 algorithms with different choices of on these three main datasets. They are implemented in python using numpy, and are relatively straightforward. For completeness, we will release de-anonymized code for reproducibility after double-blind peer review. We first train them on the training sets, query their coefficients, then compute the coefficients errors with rr and prediction errors with outputs. We repeat all these experiments 10 times and show the mean results.
In Figure 3 we show the running time of the high rank dataset by training time, solution query (computation of the coefficients) time, their sum, and training time + query time simulating making a query every batch. The other datasets are the same size, and have the same runtimes. We can see that FD based algorithms are slower then randomized algorithms during training, but much faster during query solutions since the sketch sizes are smaller and more processed. They maintain the SVD results of the sketch so the matrix inversion is mostly precomputed. Note that this precomputation is not available in the two-level 2LFDrr either, hence this also suffers from higher query time. When we add the training time and queries, then iSVDrr, RFDrr, and FDrr are the fastest for below about (past ). Note that in this plot the number of batches and hence queries decreases as increases, and as a result for small the algorithms with cost dominated by queries (CSrr, RPrr, and 2LFDrr) have their runtime initially decrease. All algorithms are generally faster than rr – the exception is the random projection algorithms (CSrr and RPrr) which are a bit slower for query time, and these becomes worse as becomes greater than .
Let be the coefficients solutions of RR and one of these six sketching algorithms; we compute the coefficients error by
Let be the optimal output values and the predicted values by sketching algorithm; we compute the prediction error by
Figure 4 shows these errors versus space in terms of , and (training + query) time in seconds.
For the high rank data (top row), all FD-based algorithms (FDrr, RFDrr, 2LFDrr, as well as iSVDrr) have far less error than the random projection algorithms (RPrr and CSrr). For very small size RFDrr does slightly worse than the other FD variants, likely because it adds too much bias because the “tail” is too large with small .
For the low rank data and real-world TEMP data the errors are more spread out, yet the FD-based algorithms still do significantly better as a function of space (). Among these RFDrr (almost) always has the least error (for small ) or matches the best error (for larger ). The only one that sometimes improves upon RFDrr, and is otherwise the next best is the huersistic iSVDrr which has no guarantees, and likely will fail for adversarial data [5]. In terms of the time, the random projection algorithms can be a bit faster (say seconds instead of seconds), but then achieve more coefficient error. In particular, RFDrr always can achieve the least coefficient error, and usually the least coefficient error for any given allotment of time. For prediction error as a function of time (the rightmost column of Figure 4), the results are more muddled. Many algorithms can achieve the minimum error (nearly matched by rr) in the nearly best runtime (about seconds). The FD-based algorithms are roughly at this optimal points for all parameters tried above , and hence consistently achieves these results in small space and time.
The previous experiments are run on fixed size matrices, because it is very expensive to calculate the optimal values and (and to measure error) using rr as becomes too large.
In Figure 5 we show the runtime of the algorithms as both and increase to large sizes. We fix . When we vary we fix , and when we vary we fix . As expected, the runtimes all scale linearly as grows, or the sum of two linear times for (training+query) time. As grows, FD-based algorithms (not including 2LFDrr) overcome RP-based algorithms (as well as rr and 2LFDrr) even with one query. The query time for the latter increase too fast, cubic on , but is linear for FD-based algorithms.
We provide the first streaming sketch algorithms that can apply the provably and practically efficient Frequent Directions sketch towards regression. This only works for ridge regression, and when the regularizer is sufficiently large. Although FD only approximates the high-rank terms of the data, the ridge regularization ensures that the low-rank parts cannot be of too much importance. So in addition to the robustness to noise and better generalization, the ridge term also makes regression easier to sketch, similar to prior observations [14, 4]. This results in the first streamable sketch with space. Moreover, our experiments demonstrate that while these FD-based algorithms have larger training time than random projection ones, their query time is far more efficient, and for reasonable space and time constraints, our new algorithms have less empirical error.
Principal Component Regression (PCR) is a related to our proposed algorithms, but lacks the stability guarantees we provide. In particular, PCR identifies the top principal components of and performs regression using, [, ], the projection onto the span of . For this to be effective, these components must include the only directions meaningfully correlated with . However, when the top singular vectors of are all similar, which of the corresponding top singular vectors are in the top is not stable. If one of the meaningful directions (perhaps there is a single meaningful direction) among the top- is not retained in an approximation of the top-, then while the norms of is preserved using a sketch , the result of the regression may be quite different. Hence, this algorithm is not stable in the same was as rr, and cannot have approximation guarantees in the strong form similar to ours.
Incremental singular value decomposition of uncertain data with missing values
. In Computer Vision — ECCV 2002, pp. 707–720. Cited by: §4.1.Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence
, UAI’15, pp. 201–210. Cited by: §1.1.Proceedings of the Forty-fifth Annual ACM Symposium on Theory of Computing
, STOC ’13, pp. 81–90. Cited by: §1.1, §4.1.Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2016)
, Leibniz International Proceedings in Informatics (LIPIcs), Vol. 60, pp. 7:1–7:18. Cited by: §1.1, §5.