1 Introduction
Given $n$ data instances $\mathbf{x}_1, \dots, \mathbf{x}_n$, each of dimension $d$, and responses $y_1, \dots, y_n$, it is of interest to find a model $\mathbf{w} \in \mathbb{R}^d$ such that $\mathbf{x}_i^T \mathbf{w} \approx y_i$. If $n > d$, there will not in general exist an exact solution to the linear system $X\mathbf{w} = \mathbf{y}$, where $X = [\mathbf{x}_1, \dots, \mathbf{x}_n]^T \in \mathbb{R}^{n\times d}$ and $\mathbf{y} = [y_1, \dots, y_n]^T$, so we instead seek a model $\mathbf{w}$ that makes the residual $\|X\mathbf{w} - \mathbf{y}\|_2$ as small as possible. This can be formulated as the least squares regression (LSR) problem:
$$\mathbf{w}^\star \;=\; \mathop{\mathrm{argmin}}_{\mathbf{w}} \|X\mathbf{w} - \mathbf{y}\|_2^2. \qquad (1)$$
Suppose $n \gg d$. In general it takes $O(nd^2)$ time and $O(nd)$ space to compute $\mathbf{w}^\star$ using the Cholesky decomposition, the QR decomposition, or the singular value decomposition (SVD) [9]. LSR is perhaps one of the most widely used methods in data processing; however, solving LSR for big data is very expensive in both time and space. In big-data problems where $n$ and $d$ are both large, this time and space complexity makes LSR prohibitive, so it is of great interest to find efficient solutions to the LSR problem. Fortunately, when $n \gg d$, one can use a small portion of the instances instead of the full data to approximately compute $\mathbf{w}^\star$, and the computation cost can thereby be significantly reduced. Random sampling based methods [5, 11, 12] and random projection based methods [2, 6] have been applied to solve LSR more efficiently.
Formally speaking, let $S \in \mathbb{R}^{c\times n}$ with $c \ll n$ be a random sampling/projection matrix. We solve the following problem instead of (1):
$$\tilde{\mathbf{w}} \;=\; \mathop{\mathrm{argmin}}_{\mathbf{w}} \|S X\mathbf{w} - S\mathbf{y}\|_2^2. \qquad (2)$$
This smaller problem can be solved in only $O(cd^2)$ time and $O(cd)$ space. If the random sampling/projection matrix $S$ is constructed using certain special techniques, then it is theoretically ensured that $\tilde{\mathbf{w}}$ is close to $\mathbf{w}^\star$ and that the relative-error bound

$$\|X\tilde{\mathbf{w}} - \mathbf{y}\|_2 \;\le\; (1+\epsilon)\,\|X\mathbf{w}^\star - \mathbf{y}\|_2 \qquad (3)$$
hold with high probability, where $\epsilon \in (0,1)$ is an error parameter. There are two criteria for evaluating random sampling/projection based LSR; an illustrative code sketch of the sketch-and-solve scheme is given after the list below.
- Running Time. That is, the total time complexity of constructing $S$ and computing $\tilde{\mathbf{w}}$.
- Accuracy. That is, the number of sampled/projected rows $c$ that is required for error bounds such as (3) to hold with high probability.
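To make the sketch-and-solve scheme concrete, here is a minimal Python/NumPy sketch (not from the paper; the synthetic data, sample size, and the choice of uniform row sampling are illustrative assumptions). It compares the exact solution of (1) with the approximate solution of (2) obtained from a small sampled subproblem.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, c = 100_000, 50, 2_000          # n >> d; c rows are sampled (c << n)

X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true + 0.1 * rng.standard_normal(n)

# Exact LSR (1): O(n d^2) time (SVD-based solver).
w_star, *_ = np.linalg.lstsq(X, y, rcond=None)

# Approximate LSR (2): sample c rows uniformly without replacement,
# then solve the much smaller c x d problem in O(c d^2) time.
idx = rng.choice(n, size=c, replace=False)
SX, Sy = X[idx], y[idx]               # S X and S y, up to a common scaling that
w_tilde, *_ = np.linalg.lstsq(SX, Sy, rcond=None)  # does not affect the minimizer

# Relative-error check in the spirit of (3).
res_star = np.linalg.norm(X @ w_star - y)
res_tilde = np.linalg.norm(X @ w_tilde - y)
print("residual ratio:", res_tilde / res_star)   # close to 1 when c is large enough
```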
1.1 Contributions
The leverage scores based sampling is an important random sampling technique that has been widely studied and empirically evaluated in the literature [2, 4, 5, 8, 11, 12, 13, 15, 16]. When it is applied to accelerate LSR, error analyses of the leverage scores based sampling are available in the literature. It was shown in [4] that, by using the leverage scores based sampling with replacement, the inequality (3) holds with high probability once the number of sampled rows $c$ is large enough. Later on, [5] showed that, by using the leverage scores based sampling without replacement, a certain sample size is sufficient to make (3) hold with high probability. In Theorem 3.1 we show that (3) holds with high probability under an improved sample size requirement, using the same leverage scores based sampling without replacement. Our proof techniques are based on the previous work [5, 6], and our proof is self-contained.
Though uniform sampling can have very bad worst-case performance when applied to the LSR problem [11], it is still the simplest and most efficient strategy for row sampling. Moreover, although uniform sampling is uniformly worse than the leverage score based sampling from an algorithmic perspective, this is not true from a statistical perspective [11]. Furthermore, when the leverage scores of $X$ are very homogeneous, uniform sampling has virtually the same performance as the leverage scores based sampling [11]. So uniform sampling is still worth studying. We provide in Theorem 3.2 an error bound for the uniform sampling based LSR: when the sample size $c$ is sufficiently large, depending on the matrix coherence $\mu(X)$, the uniform sampling based LSR attains a relative-error bound with positive probability. Here $\mu(X)$ denotes the matrix coherence of $X$.
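For reference, a standard definition of the (row) matrix coherence, which we assume matches the quantity intended in Theorem 3.2, is the following (the leverage scores $\ell_i$ and the rank $\rho$ are formally defined in Section 2):

$$\mu(X) \;=\; \frac{n}{\rho}\,\max_{1 \le i \le n} \ell_i \;=\; \frac{n}{\rho}\,\max_{1 \le i \le n} \|\mathbf{u}_{i:}\|_2^2, \qquad 1 \;\le\; \mu(X) \;\le\; \frac{n}{\rho}.$$

The coherence equals $1$ when the leverage scores are perfectly homogeneous and approaches $n/\rho$ when a single row dominates, which is exactly the regime where uniform sampling performs poorly.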
2 Preliminaries and Previous Work
For a matrix $A$, we let $\mathbf{a}_{i:}$ be its $i$-th row, $\mathbf{a}_{:j}$ be its $j$-th column, $\|A\|_F$ be its Frobenius norm, and $\|A\|_2$ be its spectral norm. We let $I_n$ be the $n\times n$ identity matrix and let $\mathbf{0}$ be an all-zero matrix of proper size.
We let the thin singular value decomposition of $X \in \mathbb{R}^{n\times d}$, of rank $\rho$, be $X = U \Sigma V^T$. Here $U$, $\Sigma$, and $V$ are of sizes $n\times\rho$, $\rho\times\rho$, and $d\times\rho$, respectively, and the singular values $\sigma_1 \ge \cdots \ge \sigma_\rho > 0$ are in non-increasing order. We let $U^\perp$ be an $n\times(n-\rho)$ column orthogonal matrix such that $U^T U^\perp = \mathbf{0}$. The condition number of $X$ is defined by $\kappa(X) = \sigma_1/\sigma_\rho$. Based on the SVD, the (row) statistical leverage scores of $X$ are defined by
$$\ell_i \;=\; \|\mathbf{u}_{i:}\|_2^2, \qquad i = 1, \dots, n,$$
where $\mathbf{u}_{i:}$ is the $i$-th row of $U$. It is obvious that $\sum_{i=1}^n \ell_i = \rho$. Exactly computing the leverage scores costs $O(nd^2)$ time, which is as expensive as exactly solving the LSR problem (1). Fortunately, if $X$ is a skinny matrix ($n \gg d$), the leverage scores can be computed highly efficiently to within arbitrary accuracy using the techniques of [2, 3].
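As a concrete illustration (not taken from the paper, and not the paper's Algorithm 1; the sizes and the rescaling convention are assumptions), the leverage scores and a leverage-score-based row sample can be computed as follows. The exact SVD route shown here costs $O(nd^2)$; the methods of [2, 3] would replace it with a fast approximation.

```python
import numpy as np

def leverage_scores(X):
    """Row leverage scores via the thin SVD: l_i = ||U[i, :]||_2^2."""
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return np.sum(U**2, axis=1)          # the scores sum to rank(X)

def leverage_sample(X, y, c, rng):
    """Sample c rows (without replacement) with probabilities proportional to the leverage scores."""
    scores = leverage_scores(X)
    p = scores / scores.sum()
    idx = rng.choice(X.shape[0], size=c, replace=False, p=p)
    # One common convention rescales each kept row by 1/sqrt(c * p_i).
    scale = 1.0 / np.sqrt(c * p[idx])
    return X[idx] * scale[:, None], y[idx] * scale

rng = np.random.default_rng(0)
X = rng.standard_normal((10_000, 20))
y = X @ rng.standard_normal(20) + 0.1 * rng.standard_normal(10_000)
SX, Sy = leverage_sample(X, y, c=500, rng=rng)
w_tilde, *_ = np.linalg.lstsq(SX, Sy, rcond=None)   # solve the sketched problem (2)
```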
There are many ways to construct the random sampling/projection matrix $S$; below we describe some of them.
- Uniform Sampling. The sampling matrix $S$ is constructed by sampling $c$ rows of the $n\times n$ identity matrix uniformly at random. This method is the simplest and fastest, but in the worst case its performance is very bad [11].
- Subsampled Randomized Hadamard Transform (SRHT). The random projection matrix $S = \sqrt{\tfrac{n}{c}}\, P H D$ is called an SRHT matrix [1, 6, 14] if
  - $P \in \mathbb{R}^{c\times n}$ consists of a subset of $c$ rows of the $n\times n$ identity matrix, where the rows are chosen uniformly at random and without replacement;
  - $H \in \mathbb{R}^{n\times n}$ is a normalized Walsh–Hadamard matrix;
  - $D \in \mathbb{R}^{n\times n}$ is a random diagonal matrix with each diagonal entry independently chosen to be $+1$ or $-1$ with equal probability.

  The SRHT is a fast version of the Johnson–Lindenstrauss transform. The performance of SRHT based LSR is analyzed in [6].
- Sparse Embedding Matrices. A sparse embedding matrix $S \in \mathbb{R}^{c\times n}$ enables the random projection to be applied in time only linear in the number of nonzero entries of $X$ [2]. The random linear map $S = \Phi D$ is defined by
  - $h: \{1,\dots,n\} \to \{1,\dots,c\}$ is a random map such that for each $i \in \{1,\dots,n\}$, $h(i) = j$ with probability $1/c$ for every $j \in \{1,\dots,c\}$;
  - $\Phi \in \{0,1\}^{c\times n}$ is a binary matrix with $\Phi_{h(i),\,i} = 1$ for each $i$, and all remaining entries $0$;
  - $D$ is the same random $\pm 1$ diagonal matrix as in the SRHT.

  Sparse embedding matrices based LSR is theoretically guaranteed in [2]. (Both the SRHT and the sparse embedding construction are illustrated in the code sketch following this list.)
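The following minimal NumPy sketch (the sizes and helper names are illustrative assumptions, not the paper's code) constructs the two projections described above. For simplicity, the SRHT-like variant builds an explicit normalized Hadamard matrix via `scipy.linalg.hadamard`, which requires $n$ to be a power of two; a fast $O(n\log n)$ Hadamard transform would be used in practice.

```python
import numpy as np
from scipy.linalg import hadamard

def srht_sketch(A, c, rng):
    """SRHT-like sketch S A with S = sqrt(n/c) * P * H * D (n must be a power of 2)."""
    n = A.shape[0]
    d = rng.choice([-1.0, 1.0], size=n)            # random sign diagonal D
    H = hadamard(n) / np.sqrt(n)                   # normalized Walsh-Hadamard matrix
    rows = rng.choice(n, size=c, replace=False)    # P: uniform rows w/o replacement
    return np.sqrt(n / c) * (H @ (d[:, None] * A))[rows]

def countsketch(A, c, rng):
    """Sparse embedding (CountSketch) S A, computed in time O(nnz(A))."""
    n = A.shape[0]
    h = rng.integers(0, c, size=n)                 # random map h: [n] -> [c]
    s = rng.choice([-1.0, 1.0], size=n)            # random sign diagonal D
    SA = np.zeros((c, A.shape[1]))
    np.add.at(SA, h, s[:, None] * A)               # add signed row i into bucket h(i)
    return SA

rng = np.random.default_rng(0)
n, d, c = 1024, 8, 128                             # n is a power of two for hadamard()
A = rng.standard_normal((n, d))
print(srht_sketch(A, c, rng).shape, countsketch(A, c, rng).shape)   # (128, 8) (128, 8)
```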
3 Main Results
We provide in Theorem 3.1 an improved error bound for the leverage scores sampling based LSR.
Theorem 3.1 (The Leverage Score Based Sampling)
We show in Theorem 3.2 an error bound for the uniform sampling based LSR.
Theorem 3.2 (Uniform Sampling)
Use uniform sampling without replacement to sample a sufficient number $c$ of
rows of $X$ and solve the approximate LSR problem (2) to obtain $\tilde{\mathbf{w}}$. Then with positive probability the following inequalities hold:
where $\mu$ denotes the matrix coherence of $X$.
Since computing $\tilde{\mathbf{w}}$ costs only $O(cd^2)$ time, one can repeat the procedure several times and choose the solution that attains the minimal residual $\|X\tilde{\mathbf{w}} - \mathbf{y}\|_2$. In this way, the error bounds hold with a probability that can be made arbitrarily high.
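A hedged sketch of this repetition strategy follows (the helper `uniform_sketch_solve` and the number of repetitions are illustrative assumptions): each trial solves the small sketched problem, and the candidate with the smallest full-data residual is kept. Evaluating each candidate's residual costs $O(nd)$ time per trial, which is still far cheaper than the $O(nd^2)$ exact solve.

```python
import numpy as np

def uniform_sketch_solve(X, y, c, rng):
    """One trial: uniformly sample c rows without replacement and solve (2)."""
    idx = rng.choice(X.shape[0], size=c, replace=False)
    w, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    return w

def best_of_t(X, y, c, t, rng):
    """Repeat t independent trials and keep the candidate with the smallest residual."""
    candidates = (uniform_sketch_solve(X, y, c, rng) for _ in range(t))
    return min(candidates, key=lambda w: np.linalg.norm(X @ w - y))

rng = np.random.default_rng(0)
X = rng.standard_normal((50_000, 30))
y = X @ rng.standard_normal(30) + 0.1 * rng.standard_normal(50_000)
w_best = best_of_t(X, y, c=600, t=10, rng=rng)
```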
4 Proof
In Section 4.1 we list some previous results that will be used in our proofs. In Section 4.2 we prove Theorem 3.1, using the techniques in the proofs of Lemmas 1 and 2 of [5] and Lemmas 1 and 2 of [6]. To keep the paper self-contained, we repeat some of the arguments of [5] in our proof. In Section 4.3 we prove Theorem 3.2 using the techniques in [6, 7, 14].
4.1 Key Lemmas
Lemma 1 is a deterministic error bound for the sampling/projection based LSR, which will be used to prove both Theorem 3.1 and Theorem 3.2. The randomized matrix multiplication bounds in Lemma 2 will be used to prove Theorem 3.1. The tail bounds for sums of random matrices in Lemma 3 will be used to prove Theorem 3.2.

Lemma 1 (Deterministic Error Bound, Lemmas 1 and 2 of [6])
Proof
Lemma 2 (Theorem 7 of [5])
Suppose , , and , and let be the sampling matrix computed by Algorithm 1 taking and as input. Then
Lemma 3 (Theorem 2.2 of [14])
Let be a finite set of positive semidefinite matrices with dimension , and suppose that
Sample uniformly at random from without replacement. We define and . Then for any and , the following inequalities hold:
4.2 Proof of Theorem 3.1
Proof
We first bound the term as follows. Applying a singular value inequality in [10], we have that for all
Since the leverage scores of are also the leverage scores of , it follows from Lemma 2 that
It then follows from Markov’s inequality that the inequality
holds with probability at least . When
(4)
the inequality
(5)
holds with probability at least .
Now we bound the term . Since , we have that
Since the leverage scores of are also the leverage scores of , it follows from Lemma 2 that
It follows from Markov’s inequality that the following inequality holds with probability at least :
(6)
Thus when
(7)
it follows from (5), (6), and the union bound that the inequality
(8)
holds with probability at least . We let , , , and let be defined in Lemma 1. When
it follows from (4), (7), (8), and Lemma 1 that with probability at least the following inequality holds:
Since , the theorem follows directly from Lemma 1.
4.3 Proof of Theorem 3.2
Proof
We first follow some of the techniques of [7] to bound the two terms
We let be the -th column of , and let be matrices sampled from uniformly at random without replacement. Obviously, . We accordingly define
where is the row matrix coherence of , and define
Then we apply Lemma 3 and obtain the following inequality:
where , , and are arbitrary real numbers. We set
(9)
it then follows that with probability at least , both of the following two inequalities hold:
(10)
Now we seek to bound the term . Let be an index set of cardinality , with each element chosen from uniformly at random without replacement, and let . Then we have that
Thus with probability at least the following inequality holds:
(11)
Finally, it follows from inequalities (10) and (11) and Lemma 1 that
Here the first two inequalities hold deterministically, and the third inequality holds with probability at least .
We set , , , and . Since when is close to , it follows from (9) that when , the inequality
holds with probability at least .
Setting , , , and , we conclude that when , the inequality holds with probability at least . Then the theorem follows directly from Lemma 1.
References
- [1] C. Boutsidis and A. Gittens. Improved matrix algorithms via the subsampled randomized Hadamard transform. SIAM Journal on Matrix Analysis and Applications, 34(3):1301–1340, 2013.
- [2] K. L. Clarkson and D. P. Woodruff. Low rank approximation and regression in input sparsity time. In Annual ACM Symposium on Theory of Computing (STOC). ACM, 2013.
- [3] P. Drineas, M. Magdon-Ismail, M. W. Mahoney, and D. P. Woodruff. Fast approximation of matrix coherence and statistical leverage. Journal of Machine Learning Research, 13:3441–3472, 2012.
- [4] P. Drineas, M. W. Mahoney, and S. Muthukrishnan. Sampling algorithms for $\ell_2$ regression and applications. In Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms. ACM, 2006.
- [5] P. Drineas, M. W. Mahoney, and S. Muthukrishnan. Relative-error CUR matrix decompositions. SIAM Journal on Matrix Analysis and Applications, 30(2):844–881, Sept. 2008.
- [6] P. Drineas, M. W. Mahoney, S. Muthukrishnan, and T. Sarlós. Faster least squares approximation. Numerische Mathematik, 117(2):219–249, 2011.
- [7] A. Gittens. The spectral norm error of the naive Nyström extension. arXiv preprint arXiv:1110.5305, 2011.
- [8] A. Gittens and M. W. Mahoney. Revisiting the Nyström method for improved large-scale machine learning. In International Conference on Machine Learning (ICML), 2013.
- [9] G. H. Golub and C. F. van Loan. Matrix computations. The Johns Hopkins University Press, 1996.
- [10] R. A. Horn and C. R. Johnson. Topics in matrix analysis. Cambridge University Press, Cambridge, 1991.
- [11] P. Ma, M. W. Mahoney, and B. Yu. A statistical perspective on algorithmic leveraging. In International Conference on Machine Learning (ICML), 2014.
- [12] M. W. Mahoney. Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning, 3(2):123–224, 2011.
- [13] M. W. Mahoney and P. Drineas. CUR matrix decompositions for improved data analysis. Proceedings of the National Academy of Sciences, 106(3):697–702, 2009.
- [14] J. A. Tropp. Improved analysis of the subsampled randomized Hadamard transform. Advances in Adaptive Data Analysis, 3(01–02):115–126, 2011.
- [15] S. Wang and Z. Zhang. Improving CUR matrix decomposition and the Nyström approximation via adaptive sampling. Journal of Machine Learning Research, 14:2729–2769, 2013.
- [16] S. Wang and Z. Zhang. Efficient algorithms and error analysis for the modified Nyström method. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2014.