Ridge Regression (RR) is one of the most widely applied penalized regression algorithms in machine learning problems. Supposeis the design matrix and is the
response vector, ridge regression tries to solve the problem
which has an explicit solution
However, for modern problems with huge design matrix , computing (2) costs FLOPS. When one can consider the dual formulation of (1) which also has an explicit solution as mentioned in (Lu et al., 2013; Saunders et al., 1998) and the cost is FLOPS. In summary, trying to solve (1) exactly costs FLOPS which can be very slow.
There are faster ways to approximate (2) when computational cost is the concern. One can view RR as an optimization problem and use Gradient Descent (GD) which takes FLOPS in every iteration. However, the convergence speed for GD depends on the spectral of and . When is ill conditioned, GD requires a huge number of iterations to converge which makes it very slow. For huge datasets, one can also apply stochastic gradient descent (SGD) (Zhang, 2004; Johnson and Zhang, 2013; Bottou, 2010), a powerful tool for solving large scale optimization problems.
Another alternative for regression on huge datasets is Principle Component Regression (PCR) as mentioned in (Artemiou and Li, 2009; Jolliffe, 2005), which runs regression only on the top principal components of the matrix. PCA for huge can be computed efficiently by randomized algorithms like (Halko et al., 2011a, b). The cost for computing top PCs of is FLOPS. The connection between RR and PCR is well studied by (Dhillon et al., 2013). The problem of PCR is that when a large proportion of signal sits on the bottom PCs, it has to regress on a lot of PCs which makes it both slow and inaccurate.
In this paper, we propose a two stage algorithm LING111LING is the Chinese of ridge which is a faster way of computing the RR solution (2). A detailed description of the algorithm is given in section 2. In section 3, we prove that LING has the same risk as RR under the fixed design setting. In section 4, we compare the performance of PCR, GD, SGD and LING in terms of prediction accuracy and computational efficiency on both simulated and real data sets.
2 The Algorithm
2.1 Description of the Algorithm
LING is a two stage algorithm. The intuition of LING is quite straight forward. Note that regressing on is essentially projecting onto the span of . Let denotes the top PCs (singular vectors) of and let denote the remaining PCs. The projection of onto the span of can be decomposed into two orthogonal parts, the projection onto and the projection onto . In the first stage, we pick a and the projection onto can be computed directly by which is exactly the same as running a PCR on top PCs. For huge , computing the top PCs exactly is very slow, so we use a faster randomized SVD algorithm for computing which is proposed by (Halko et al., 2011a) and described below. In the second stage, we first compute and which are the residual of and after projecting on . Then we compute the projection of onto the span of by solving the optimization problem with GD (Algorithm 3). Finally, since RR shrinks the projection of onto
via regularization, we also shrink the projections in both stages accordingly. Shrinkage in the first stage is performed directly on the estimated regression coefficients and shrinkage in the second stage is performed by adding a regularization term to the optimization problem mentioned above. Detailed description of LING is shown in Algorithm1.
LING can be regarded as a combination of PCR and GD. The first stage of LING is a very crude estimation of and the second stage adds a correction to the first stage estimator. Since we don’t need a very accurate estimator in the first stage it suffices to pick a very small in contrast with the PCs needed for PCR. In the second stage, the design matrix is a much better conditioned matrix than the original since the directions with largest singular value have been removed. As introduced in section 2.2, Algorithm 3 converges much faster with a better conditioned matrix. Hence GD in the second stage of LING converges faster than directly applying GD for solving equation 1. The above property guarantees that LING is both fast and accurate compared with PCR and GD. More details about on the computational cost will be discussed in section 2.2.
Algorithm 2 is essentially approximating the subspace of top left singular vectors by random projection. It provides a fast approximation of the top singular values and vectors for large when computing the exact SVD is very slow. Theoretical guarantees and more detailed explanations can be found in (Halko et al., 2011a). Empirically we find in the experiments, Algorithm 2 may occasionally generate a bad subspace estimator due to randomness which makes PCR perform badly. On the other hand, LING is much more robust since in the second stage it compensate for the signal missing in the first stage. In all the experiments, we set .
The shrinkage step (step 5) in Algorithm 1 is only necessary for theoretical purposes since the goal is to approximate Ridge Regression which shrinks the Least Square estimator over all directions. In practice shrinkage over the top PCs is not necessary. Usually the number of PCs selected (
) is very small. From the bias variance trade off perspective, the variance reduction gained from this shrinkage step is at mostunder the fixed design setting (Dhillon et al., 2013) which is a tiny number. Moreover, since the top singular values of are usually very large compared with (since large will introduce large bias), the shrinkage factor will be pretty close to for top singular values. We use shrinkage in Algorithm 1 because the risk of the shrinkage version of LING is exactly the same as RR as proved in section 3.
Algorithm 2 can be further simplified if we skip the shrinkage step mentioned in previous paragraph. Instead of computing the top PCs, the only thing we need to know is the subspace spanned by these PCs since the first stage is essentially projecting onto this subspace. In other words, we can replace in step 1, 2, 3 of Algorithm 1 with obtained in step 3 of Algorithm 2 and directly let . In the experiments of section 4 we use this simplified version.
2.2 Computational Cost
We claim that the cost of LING is where is the number of PCs used in the first stage and is the number of iterations of GD in the second stage. According to (Halko et al., 2011a), the dominating step in Algorithm 2 is computing and which costs FLOPS. Computing and cost less than . Computing costs . So the computational cost before the GD step is . For the GD stage, note that in every iteration never need to be constructed explicitly. While computing and , always multiply matrix and vector first gives a cost of for every iteration. So the cost for GD stage is . Add all pieces together the cost of LING is FLOPS.
Let be the number of iterations required for solving (1) directly by GD and be the number of PCs used for PCR. Easy to check that the cost for GD is FLOPS and the cost for PCR is . As mentioned in remark 1, the advantage of LING over GD and PCR is that and might have to be really large to achieve high accuracy but much smaller values of the pair , will work for LING.
Consider the case when the signal is widely spread among all PCs (the projection of onto the bottom PCs of is large) instead of concentrating on the top ones, needs to be large to make PCR perform well since all the signals on bottom PCs are discarded by PCR. But LING doesn’t need to include all the signals in the first stage regression since the signal left over will be estimated in the second GD stage. Therefore LING is able to recover most of the signal even with a small .
In order to understand the connection between accuracy and number of iterations in Algorithm 3 , we state the following theorem in A.Epelman (2007):
Let be a quadratic function where is a PSD matrix. Suppose achieves minimum at . Apply Algorithm 3 to solve the minimization problem. Let be the value after iterations, then the gap between and , the minimum of the objective function satisfies
where are the largest and smallest eigenvalue of the
are the largest and smallest eigenvalue of thematrix.
Theorem 1 shows that the sub optimality of the target function decays exponentially as the number of iterations increases and the speed of decay depends on the largest and smallest singular value of the PSD matrix that defines the quadratic objective function. If we directly apply GD to solve (1), Let . Assume reaches its minimal at . Let be the coefficient after iterations and let denote the singular value of . Apply theorem 1 we have
Similarly for the second stage of LING, Let . Assume reaches its minimal at . We have
In most real problems, the top few singular values of are much larger than the other singular values and . Therefore the constant obtained in (4) can be very close to 1 which makes GD algorithm converges very slowly. On the other hand, removing the top few PCs will make in (5) significantly smaller than 1. In other words, GD may take a lot of iterations to converge when solving (1) directly while the second stage of LING takes much less iterations to converge. This can also be seen in the experiments of section 4.
In this section we compute the risk of LING estimator under the fixed design setting. For simplicity, assume generated by Algorithm 2 give exactly the top left singular vectors and singular values of and GD in step 4 of Algorithm 1 converges to the optimal solution. Let be the SVD of where and . Here are top singular vectors and are bottom singular vectors. Let where contains top singular values denoted by and contains bottom singular values. Let (replace in
by a zero matrix of the same size).
3.1 The Fixed Design Model
Assume , comes from the fixed design model where are i.i.d Gaussian noise with variance . Note that , the fixed design model can also be written as
where and . We use the distance between (the best possible prediction given ) and
(the prediction by LING) as the loss function. The risk of LING can be written as
We can further decompose the risk into two terms:
because . Note that here the expectation is taken with respect to .
Let’s calculate the two terms in (6) separately. For the first term we have:
Here is the element of .
Let be the diagonal matrix with . So we have , .
Now consider the second term in (6).
The residual after the first stage can be represented by
and the optimal coefficient obtained in the second GD stage is
For simplicity, let .
where is the element of
(replace the top
block of the identity matrix with 0),
Add the two terms together finishes the proof. ∎
The risk of LING algorithm under fixed design setting is
This risk is the same as the risk of ridge regression provided by Lemma 1 in (Dhillon et al., 2013). Actually, LING gets exactly the same prediction as RR on the training dataset. This is very intuitive since on the training set LING is essentially decomposing the RR solution into the first stage shrinkage PCR predictor on top PCs and the second stage GD predictor over the residual spaces as explained in section 2.
In this section we compare the accuracy and computational cost (evaluated in terms of FLOPS) of 3 different algorithms for solving Ridge Regression: Gradient Descent with Optimal step size (GD), Stochastic Variance Reduction Gradient (SVRG) (Johnson and Zhang, 2013) and LING. Here SVRG is an improved version of stochastic gradient descent which achieves exponential convergence with constant step size. We also consider Principle Component Regression (PCR) (Artemiou and Li, 2009; Jolliffe, 2005) which is another common way for running large scale regression. Experiments are performed on 3 simulated models and 2 real datasets. In general, LING performs well on all 3 simulated datasets while GD, SVRG and PCR fails in some cases. For two real datasets, all algorithms give reasonable performance while SVRG and LING are the best. Moreover, both stages of LING only requires only a moderate amount of matrix multiplications each cost , much faster to run on matlab compared with SVRG which contains a lot of loops.
4.1 Simulated Data
Three different datasets are constructed based on the fixed design model where is of size . In the three experiments and are generated randomly in different ways (more details in following sections) and i.i.d Gaussian noise is added to to get . Then GD, SVRG, PCR and LING are performed on the dataset. For GD, we try different number of iterations . For SVRG, we vary the number of passes through data denoted by . The number of iterations SVRG takes equals the number of passes through data times sample size and each iteration takes FLOPS. The step size of SVRG is chosen by cross validation but this cost is not considered when evaluating the total computational cost. Note that one advantage of GD and LING is that due to the simple quadratic form of the target function, their step size can be computed directly from the data without cross validation which introduces extra cost. For PCR we pick different number of PCs (). For LING we pick top PCs in the first stage and try different number of iterations in the second stage. The computational cost and the risk of the four algorithms are computed. The above procedure is repeated over 20 random generation of , and . The risk and computational cost of the traditional RR solution (2) for every dataset is also computed as a benchmark.
The parameter set up for the three datasets are listed in table 1.
|Model 1||Model 2||Model 3|
4.1.1 Model 1
In this model the design matrix has a steep spectrum. The top 30 singular values of decay exponentially as where . The spectrum of is shown in figure 4. To generate , we fix the diagonal matrix with the designed spectrum and construct by where , are two random orthonormal matrices. The elements of are sampled uniformly from interval . Under this set up, most of the energy of the matrix lies in top PCs since the top singular values are much larger than the remaining ones so PCR works well. But as indicated by (4), the convergence of GD is very slow.
The computational cost and average risk of the four algorithms and also the RR solution (2) over 20 repeats are shown in figure 1. As shown in figure 1 both PCR and LING work well by achieving risk close to RR at less computational cost. SVRG is worse than PCR and LING but much better than GD.
4.1.2 Model 2
In this model the design matrix has a flat spectrum. The singular values of are sampled uniformly from . The spectrum of is shown in figure 6. To generate , we fix the diagonal matrix with the designed spectrum and construct by where , are two random orthonormal matrices. The elements of are sampled uniformly from interval . Under this set up, the signal are widely spread among all PCs since the spectrum of is relatively flat. PCR breaks down because it fails to catch the signal on bottom PCs. As indicated by (4), GD converges relatively fast due to the flat spectrum of .
The computational cost and average risk of the four algorithms and also the RR solution (2) over 20 repeats are shown in figure 2. As shown by the figure GD works best since it approaches the risk of RR at the the lowest computational cost. LING and SVRG also works by achieving reasonably low risk with less computational cost. PCR works poorly as explained before.
4.1.3 Model 3
This model presented a special case where both PCR and GD will break down. The singular values of are constructed by first uniformly sample from . The top sampled values are then multiplied by . The top sigular values of are shown in figure 6. To generate , we fix the diagonal matrix with the designed spectrum and construct by where is a random orthonormal matrix. The first and last elements of the coefficient vector are sampled uniformly from interval and other elements of remains . In this set up, has orthogonal columns which are the PCs, and the signal lies only on the top and bottom PCs. PCR won’t work since a large proportion of signal lies on the bottom PCs. On the other hand, GD won’t work as well since the top few spectral values are too large compared with other singular values, which makes GD converges very slowly.
The computational cost and risk of the four algorithms and also the RR solution (2) over 20 repeats are shown in figure 3. As shown by the figure LING works best in this set up. SVRG is slightly worse than LING but still approaching RR with less cost. In this case, GD converges slowly and PCR is completely off target as explained before.
4.2 Real Data
In this section we compare PCR, GD, SVRG and LING with the RR solution (2) on two real datasets.
4.2.1 Gisette Dataset
The first is the gisette data set (Guyon, 2003) from the UCI repository which is a bi-class classification task. Every row of the design matrix consists of pixel features of a single digit "4" or "9" and gives the class label. Among the samples, we use for training and for testing. The classification error rate for RR solution (2) is . Since the goal is to compare different algorithms for regression, we don’t care about achieving the state of the art accuracy for this dataset as long as regression works reasonably well. When running PCR, we pick top PCs and in GD we iterate times. For SVRG we try passes through the data. For LING we pick PCs in the first stage and try iterations in the second stage. The computational cost and average classification error of the four algorithms and also the RR solution (2) on test set over 6 different train test splits are shown in figure 7. The top singular values of are shown in figure 10. As shown in the figure, SVRG gets close to the RR error very fast. The two curves of LING with are slower than SVRG since some initial FLOPS are spent on computing top PCs but after that they approach RR error very fast. GD also converges to RR but cost more than the previous two algorithms. PCR performs worst in terms of error and computational cost.
4.2.2 Buzz in Social Media
The second dataset is the UCI buzz in social media dataset which is a regression task. The goal is to predict popularity (evaluated by the mean number of active discussion) of a certain topic on Twitter over a period. The original feature matrix contains some statistics about this topic over that period like number of discussions created and new authors interacting at the topic. The original feature dimension is 77. We add quadratic interactions to make it 3080. To save time, we only used a subset of samples. The samples are split into train and test. We use MSE on the test data set as the error measure. For PCR we pick PCs and in GD we iterate times. For SVRG we try passes through the dataset and for LING we pick in the first stage and iterations in the second stage. The computational cost and average MSE on test set over 5 different train test splits are shown in figure 8. The top singular values of are shown in figure 10. As shown in the figure, SVRG approaches MSE of RR very fast. LING spent some initial FLOPS for computing top PCs but after that converges fast. GD and PCR also achieves reasonable performance but suboptimal compared with SVRG and LING. The MSE of PCR first decays when we add more PCs into regression but finally goes up due to overfit.
In this paper we present a two stage algorithm LING for computing large scale Ridge Regression which is both fast and robust in contrast to the well known approaches GD and PCR. We show that under the fixed design setting LING actually has the same risk as Ridge Regression assuming convergence. In the experiments, LING achieves good performances on all datasets when compare with three other large scale regression algorithms.
We conjecture that same strategy can be also used to accelerate the convergence of stochastic gradient descent when solving Ridge Regression since the first stage in LING essentially removes the high variance directions of , which will lead to variance reduction for the random gradient direction generated by SGD.
- A.Epelman (2007) Marina A.Epelman. Rate of convergence of steepest descent algorithm. 2007.
- Artemiou and Li (2009) Andreas Artemiou and Bing Li. On principal components and regression: a statistical explanation of a natural phenomenon. Statistica Sinica, 19(4):1557–1565, 2009. ISSN 1017-0405.
- Bottou (2010) Léon Bottou. Large-Scale Machine Learning with Stochastic Gradient Descent. In Yves Lechevallier and Gilbert Saporta, editors, Proceedings of the 19th International Conference on Computational Statistics (COMPSTAT’2010), pages 177–187, Paris, France, August 2010. Springer.
Dhillon et al. (2013)
Paramveer S. Dhillon, Dean P. Foster, Sham M. Kakade, and Lyle H. Ungar.
A risk comparison of ordinary least squares vs ridge regression.Journal of Machine Learning Research, 14:1505–1511, 2013.
- Guyon (2003) Isabelle Guyon. Design of experiments for the nips 2003 variable selection benchmark. 2003.
- Halko et al. (2011a) N. Halko, P. G. Martinsson, and J. A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Rev., 53(2):217–288, May 2011a. ISSN 0036-1445.
- Halko et al. (2011b) Nathan Halko, Per-Gunnar Martinsson, Yoel Shkolnisky, and Mark Tygert. An algorithm for the principal component analysis of large data sets. SIAM J. Scientific Computing, 33(5):2580–2594, 2011b.
- Johnson and Zhang (2013) Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. Advances in Neural Information Processing Systems (NIPS), 2013.
- Jolliffe (2005) Ian Jolliffe. Principal Component Analysis. Encyclopedia of Statistics in Behavioral Science. John Wiley & Sons, 2005.
- Lu et al. (2013) Yichao Lu, Paramveer S. Dhillon, Dean Foster, and Lyle Ungar. Faster ridge regression via the subsampled randomized hadamard transform. In Advances in Neural Information Processing Systems (NIPS), 2013.
- Saunders et al. (1998) G. Saunders, A. Gammerman, and V. Vovk. Ridge regression learning algorithm in dual variables. In Proc. 15th International Conf. on Machine Learning, pages 515–521. Morgan Kaufmann, San Francisco, CA, 1998.
- Zhang (2004) Tong Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In ICML 2004: Proceedings of the twenty-first International Conference on Machine Learning. OMNIPRESS, pages 919–926, 2004.