Consider the linear model
with the response vector, the design matrix and the noise term . The unknown parameter can be efficiently estimated by the method of least squares:
which is of computational complexity . When is fixed but , the least squares method would become intractable even though it is linearly fast.
Faster least squares approximation can be achieved by the randomized sketch proposed by Drineas et al. (2011)
. It relies on a proper random matrixwith . The ways to generate random matrix are divided into three main categories: subsampling, random projection and their hybrid. A widely known subsampling method is the algorithmic leveraging (LEV) proposed by Ma et al. (2015). The random projection approach includes the subsampled randomized Hadamard transformation (SRHT) (Drineas et al., 2011, Boutsidis and Gittens, 2013) and the Clarkson–Woodruff sketch (Clarkson and Woodruff, 2013). The hybrid approach is to combine subsampling and random projection, see e.g. McWilliams et al. (2014).
When a sketching matrix is fixed, there exist different sketching schemes, including the classical sketch (CS), the Hessian sketch (HS) and the iterative Hessian sketch (IHS). The widely adopted CS uses the sketched data pair for approximating by
Pilanci and Wainwright (2016) showed that is suboptimal in the sense that it has a substantially larger error than with respect to the ground truth . They introduced the HS estimator based on the partially sketched data ,
and furthermore the IHS estimator based on iterative sketched data ,
where , with the initial being provided. Unlike the CS and HS estimators, the IHS estimator is guaranteed to converge to upon some good event conditions (Pilanci and Wainwright, 2016).
The IHS can be interpreted as a first-order gradient descent method with a series of random preconditoners subject to the unit step length. The preconditioner is widely used to boost optimization algorithms (Knyazev and Lashuk, 2007, Gonen et al., 2016). However, the IHS is computationally intensive since in every iteration has to be evaluated for a new random sketching matrix . To speed up the IHS, Wang and Xu (2018) proposed the pwGradient method to improve the IHS by a fixed well-designed sketching matrix, in which case the IHS reduces to a first order method with a constant preconditioner. Meanwhile, Wang et al. (2017) proposed another accelerated version of IHS with the fixed sketching matrix, while adopting conjugate gradient descent. Note that all these sketch methods are based on randomized sketching matrices.
In this paper we propose to reformulate the IHS estimator by a linear combination of the initial and the full data estimator . Such reformulation enables us to find a sufficient isometric condition on the sketching matrices so that is guaranteed to converge to with an exponential rate. It then motivates us to propose a deterministic approach for improving the IHS based on -optimal subsampling and adaptive step lengths, which modifies the original second-order IHS method to be an adaptive first-order method. In summary, we improve the IHS method with the following three contributions:
Good initialization. A good initialization scheme can reduce the inner iteration rounds of IHS while still delivering the same precision. We suggest to initialize the IHS with the classical sketch based on the -optimal deterministic subsampling matrix. To our best knowledge, this is the first attempt to take initialization into account for improving the IHS method.
Improved preconditioner. It is critical to find a well-designed preconditioner so that it may be fixed during the iterative sketch process. We propose to construct the preconditioner from -optimal subsample and refine it by adding a ridge term. Unlike complicated random projection based methods, we obtain our preconditioner at a low cost by recycling the subsamples in initialization.
Adaptive step lengths. We modify the IHS to be an adaptive first-order method by using a fixed preconditioner subject to variable step lengths. The step lengths at each iteration are determined by the exact line search, which ensures the algorithm to enjoy the guaranteed convergence.
Through extensive experiments, our proposed method is shown to achieve the state-of-art performance in terms of both precision and speed when approximating both the ground truth and the full data estimator .
2 Reformulation of IHS
The original IHS algorithm is displayed in Algorithm 1. For simplicity, we omit the superscript from . As for the sketch matrix, we use SRHT as it is widely adopted in the literature. Following Tropp (2011) and Lu et al. (2013), when where is a positive integer, the SRHT sketch matrix is given by
where is an matrix with rows chosen uniformly without replacement from the standard bases of , is a normalized Walsh–Hadamard matrix of size , and is an diagonal matrix with i.i.d. Rademacher entries.
In what follows, we present a novel reformulation of IHS. At each step, the update formula in Algorithm 1 can be rewritten as
By (4), we can derive the following lemma by mathematical induction.
Given an initializer and a series of independent sketch matrices , for any positive integer , we have
where and .
Lemma 1 reveals that is a linear combination of the initial and the full data estimator , with the re-weighting matrices during the iterations. We can therefore study the convergence properties of by looking into or . The following theorem provides a sufficient isometric condition for the convergence guarantee.
Given an initial estimator and a set of sketch matrices , the IHS estimator converges to with the exponential rate,
provided that for any and and for any , we have that
Theorem 1 is meaningful in different ways. Let and restrict . It is easy to check that the good event described in Pilanci and Wainwright (2016) can be formulated to our condition. It also mimics the Johnson–Lindenstrauss lemma (Johnson and Lindenstrauss, 1984) when . Moreover, the inequality (6) implies that controls the convergent rate. Therefore the sketch matrix should nearly preserve the Euclidean distance to the origin for all points in .
The original IHS method requires calculating with time complexity repeatedly for each . Theorem 1 indicates that a fixed sketch matrix such can also ensure the convergence if the condition (7) is satisfied. This result provides enables us to reduce the computational cost for the original IHS method. For example, Wang and Xu (2018) proposed the pwGradient method to improve the IHS by constructing a well-designed sketch algorithm, which can be viewed as an application of Theorem 1 by letting , and .
In the meanwhile, we can also interpret the IHS approach based on a transformed space. Let the preconditioner , multiply to the both sides of (2), and denote and for . Then, we have
which corresponds to the gradient descent update when minimizing the following least squares objective
Based on this observation, Wang et al. (2017) proposed the acc-IHS method by fixing the preconditioner and replacing the gradient descent with the conjugate counterpart in the transformed parameter space. It is worth mentioning that both the pwGradient and acc-IHS methods belong to the randomized approach based on random projections, while the SRHT sketching needs to operate on all the entries of . We hence seek to improve the IHS method in a more efficient and deterministic way.
3 Adaptive IHS with A-Optimal Subsampling
We extend the concept of sketch matrix from randomized settings to deterministic settings, by introducing to indicate where an observation is selected, i.e., if sample or is included, otherwise. It is assumed that , which implies that the corresponding sketch matrix satisfies . It is our objective to find a good subject to certain optimality criterion.
3.1 -Optimal Classical Sketch
The convergence of also depends on as shown in (6), which motivates us to find a good initializer. We achieve this goal by proposing an -optimal estimator under the classical sketch scheme. Suppose that we select a subset of observations. The least squares estimator based on the subdata and the corresponding covariance matrix are given by
Let . Following the -optimality criterion in experimental design (Pukelsheim, 1993), we seek the subdata as indicated by
that minimizes the averaged variance of, which is proportional to the trace of . Formally, our goal can be formulated as the following discrete optimization problem,
However, it is NP-hard to solve it exactly, so we turn to derive a lower bound of , which leads to Algorithm 2.
For a given ,
where the equality holds if the eigenvalues of matrix are all equal.
Let denote positive eigenvalues of . According to the Geometric-Harmonic mean inequality, we have
. According to the Geometric-Harmonic mean inequality, we have
where the inequality holds if . As , the proof is complete.
By Theorem 2, we can seek an approximately -optimal design by maximizing . Our subsampling approach is described as follows.
The time for norm calculation is , while sorting requires on average operations Martınez (2004). For step 3, it costs to obtain . In total, the time complexity of Algorithm 2 is . The time can be further reduced to when it comes to . Such a scenario is common for massive data. From the time complexity perspective, our algorithm is as efficient as the -optimality based method in Wang et al. (2018), while ours is more suitable for parallel computation.
In practice, the data matrix is centered and scaled initially. Therefore, our algorithm tends to choose the extreme samples with the farthest distances to the center, which is consistent with the conclusion in Wang et al. (2018).
As shown in Figure 1, the performance of the -optimal method uniformly dominates that of randomized sketch matrices and classical sketching scheme. Furthermore, the MSE of our approach decreases with the increase of . In this regard, our -optimal estimator serves as a good initialization for further enhancements.
3.2 Improved Preconditioner
Note that is mainly determined by a set of matrices . Rather than specifying ’s by repeatedly sketching in IHS or fixing in pwGradient and acc-IHS, we make a compromise by defining , that is,
where . We consider the -optimal Hessian sketch in order to specify a deterministic preconditioner . Similarly, we can obtain the estimator of Hessian sketch and its covariance matrix based on an optimality criterion,
where . The next theorem provides a lower bound for the trace of the covariance matrix.
For a given ,
where the equality holds if the eigenvalues of matrix are all equal and hat matrix is of full rank.
Let , . From the Geometric–Harmonic mean inequality of the matrix eigenvalues, we have
where the inequality (3) is resulted from in the Lowener ordering.
Theorem 3 shows that the samples selected for initialization can be recycled for constructing the preconditioner. We also find that adding a ridge term can further improve our preconditioner
The rationale is demonstrated by Figure 2. Note that the effectiveness of the precondtioner is measured by Benzi (2002) which should be close to 1. And can be visualized by the contour plot of in the two-dimensional cases. Specifically, the flat degree of the contour plot indicates the size of the condition number. The larger the condition number, the more circular the contour plot. The optimal transformation is as shown by Figure 2(b). Observing Figures 2(a) and 2(c), is still close to the , i.e., the preconditioner fails. After ridging, our preconditioner shrinks the major radius in Figure 2(c) and render the transformation to reach optimality as shown by Figure 2(d). Empirically, can neither be too large nor too small, and we suggest it to be in a range around . As in acc-IHS and pwGradient, we also fix our preconditioner to reduce the computational cost.
3.3 Exact Line Search
for . To determine adaptively, we note that it actually corresponds to the learning rate of the first order method in the transformed space. Recall that the learning rate in (9) is fixed as unit 1, which may be suboptimal and does not lead to the largest descent for every iteration. We hence fulfill the potential of IHS by exact line search. Specifically, let
be the update direction at the th iteration, the optimal step lengths are determined by adaptively minimizing the univariate function . It is easy to show that
where . Moreover, with the above , the first order method becomes steepest descent in the transformed space with the guaranteed convergence Nocedal and Wright (2006), and thus it also converges in the original space.
In summary, we put forward the adaptive approach as -optimal IHS in Algorithm 3.
4 Numerical Experiments
4.1 Data Generation
We generate data by following the experimental setups of Wang et al. (2018). All data are generated from the linear model with the true parameter being i.i.d variates and . Let be the covariance matrix where its entry is for , and is the indicator function. We consider four multivariate distributions for covariates .
A multivariate normal distribution.
A multivariate log-normal distribution.
A multivariate distribution with degrees of freedom .
A mixture distribution composed of five different distributions , , , , with equal proportions, where represents
elements from independent uniform distributions.
To remove the effect of the intercept, we center the data. Unless particularly stated, we set and present the results averaged over replications.
4.2 Preconditioner Evaluation
To assess the quality of the preconditioner, we define a function for any positive definite matrix , where measures the ratio of improvement with respect to the condition number.
Ideally, for a good preconditioner, the value should be close to 1. Letting be or , we compare the ridged preconditioner to the SRHT scheme. In practice, we suggest to be chosen around . Table 1 presents the averaged results for different . Regarding , our preconditioner uniformly dominates in all cases with a suitable value of . Such superiority is significant especially for the cases with , i.e., is relatively large.
4.3 Comparative Study
We compare our deterministic -optimal IHS not only to the original IHS Pilanci and Wainwright (2016), but also two other improved IHS methods: acc-IHS Wang et al. (2017) and pwGradient Wang and Xu (2018). Two MSE criteria based on Monte Carlo simulations are used:
where denotes the estimator of the th iteration in the th round. In the experiments, we set the original IHS as the upper bound of the precision by conducting the -dimensional sketch in every iteration, i.e., a total sketch size is , and keep the aggregate sketch size of other methods as .
Figures 3 and 4 illustrate the advantages of our method in detail. Firstly, the superiority of initialization can be thoroughly reflected by our flat curve in Figure 3 and a significantly lower intercept in Figure 4. The dominance of our preconditioner can also be concluded from Figure 4. Specifically, as an indicator of the convergence rate, the slope corresponding to our method is at least at the same level of pwGradient and at best achieves the upper bound represented by IHS. To further validate our speed advantage, we present the total time for each method to achieve the precision in and the corresponding number of iterations in Table 2. The time to calculate the product in the update formula is omitted in all methods. Table 2 illustrates that our method can attain the same precision with shorter time and fewer iterations.
We reformulate the IHS method as an adaptive first-order optimization method, by using the idea of optimal design of experiments for subdata selection. To our best knowledge, this is the first attempt to improve the IHS method in a deterministic manner while maintaining a descent speed and precision. The numerical experiments confirm the superiority of the proposed approach.
There are several open problems worth of further investigation, for example, the theoretical properties of the ridged preconditioner according to the conditions derived in Theorem 1. Other than using the -optimal design, it is also interesting to investigate the -optimal or other types of optimal designs for the purpose of subdata selection, in particular when the data are heterogeneously distributed.
- Benzi (2002) Benzi, M. (2002). Preconditioning Techniques for Large Linear Systems: A Survey. Journal of Computational Physics, 182(2):418–477.
- Boutsidis and Gittens (2013) Boutsidis, C. and Gittens, A. (2013). Improved Matrix Algorithms via the Subsampled Randomized Hadamard Transform. SIAM Journal on Matrix Analysis and Applications, 34(3):1301–1340.
Clarkson and Woodruff (2013)
Clarkson, K. L. and Woodruff, D. P. (2013).
Low Rank Approximation and Regression in input Sparsity Time.
Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing, pages 81–90. ACM.
- Drineas et al. (2011) Drineas, P., Mahoney, M. W., Muthukrishnan, S., and Sarlós, T. (2011). Faster Least Squares Approximation. Numerische Mathematik, 117(2):219–249.
Gonen et al. (2016)
Gonen, A., Orabona, F., and Shalev-Shwartz, S. (2016).
Solving Ridge Regression Using Sketched Preconditioned SVRG.In
International Conference on Machine Learning, pages 1397–1405.
- Johnson and Lindenstrauss (1984) Johnson, W. B. and Lindenstrauss, J. (1984). Extensions of Lipschitz Mappings into a Hilbert Space. Contemporary Mathematics, 26(189-206):1.
- Knyazev and Lashuk (2007) Knyazev, A. V. and Lashuk, I. (2007). Steepest Descent and Conjugate Gradient Methods with Variable Preconditioning. SIAM Journal on Matrix Analysis and Applications, 29(4):1267–1280.
- Lu et al. (2013) Lu, Y., Dhillon, P., Foster, D. P., and Ungar, L. (2013). Faster Ridge Regression via the Subsampled Randomized Hadamard Transform. In Advances in Neural Information Processing Systems, pages 369–377.
- Ma et al. (2015) Ma, P., Mahoney, M. W., and Yu, B. (2015). A statistical perspective on algorithmic leveraging. The Journal of Machine Learning Research, 16(1):861–911.
- Martınez (2004) Martınez, C. (2004). Partial Quicksort. In Proc. 6th ACMSIAM Workshop on Algorithm Engineering and Experiments and 1st ACM-SIAM Workshop on Analytic Algorithmics and Combinatorics, pages 224–228.
- McWilliams et al. (2014) McWilliams, B., Krummenacher, G., Lucic, M., and Buhmann, J. M. (2014). Fast and Robust Least Squares Estimation in Corrupted Linear Models. In Advances in Neural Information Processing Systems, pages 415–423.
- Nocedal and Wright (2006) Nocedal, J. and Wright, S. J. (2006). Numerical Optimization. Springer.
- Pilanci and Wainwright (2016) Pilanci, M. and Wainwright, M. J. (2016). Iterative Hessian sketch: Fast and Accurate Solution Approximation for Constrained Least-Squares. The Journal of Machine Learning Research, 17(1):1842–1879.
- Pukelsheim (1993) Pukelsheim, F. (1993). Optimal Design of Experiments, volume 50. SIAM.
- Tropp (2011) Tropp, J. A. (2011). Improved Analysis of the Subsampled Randomized Hadamard Transform. Advances in Adaptive Data Analysis, 3(1–2):115–126.
- Wang and Xu (2018) Wang, D. and Xu, J. (2018). Large Scale Constrained Linear Regression Revisited: Faster Algorithms via Preconditioning. arXiv preprint arXiv:1802.03337.
Wang et al. (2018)
Wang, H., Yang, M., and Stufken, J. (2018).
Information-Based Optimal Subdata Selection for Big Data Linear Regression.Journal of the American Statistical Association, pages 1–13.
Wang et al. (2017)
Wang, J., Lee, J. D., Mahdavi, M., Kolar, M., Srebro, N., et al. (2017).
Sketching Meets Random Projection in the Dual: A Provable Recovery Algorithm for Big and High-Dimensional Data.Electronic Journal of Statistics, 11(2):4896–4944.