). Read: you have a small sample of data points and those data points have missing features. This is a situation that one is confronted with all too often in machine learning. For example, with user-recommendation data, one does not have all the ratings of any given user. Or in a privacy preserving setting, a client may not want to give you all entries in the data matrix. In such a setting, our goal is to show that if the samples that you do get are chosen carefully, the top-PCA features of the data can be recovered within some provable error bounds.
More formally, the data matrix is ( data points in dimensions). Often, real data matrices have low effective rank, so let be the best rank- approximation to with being small. is obtained by projecting onto the subspace spanned by its top- principal components. In order to approximate this top- principal subspace, we adopt the following strategy. Select a small number, , of elements from and produce a sparse sketch ; use the sparse sketch to approximate the top- singular subspace. In Section 4, we give the details of the algorithm and the theoretical guarantees on how well we recover the top- principal subspace. The key quantity that one must control to recover a close approximation to PCA is how well the sparse sketch approximates the data in the operator norm. That is, if is small then you can recover PCA effectively.
Problem: sparse sampling of data elements
Given and ,
sample a small number of elements to obtain a
sparse sketch for which
Our main result addresses the problem above. In a nutshell, with only partially observed data that have been carefully selected, one can recover an approximation to the top- principal subspace. An additional benefit is that computing our approximation to the top- subspace using iterated multiplication can benefit computationally from sparsity. To construct , we use a general randomized approach which independently samples (and rescales) elements from using probability to sample element . We analyze in detail the case to get a bound on . We now make our discussion precise, starting with our notation.
We use bold uppercase (e.g., ) for matrices and bold lowercase (e.g.,
) for column vectors. The-th row of is , and the -th column of is . Let denote the set .
is the expectation of a random variable; for a matrix, denotes the element-wise expectation. For a matrix , the Frobenius norm is , and the spectral (operator) norm is . We also have the and norms: and (the number of non-zero entries in ). The
-th largest singular value ofis . For symmetric matrices , if and only if is positive semi-definite. is the identity and is the natural logarithm of . We use to denote standard basis vectors whose dimensions will be clear from the context.
Two popular sampling schemes are ( Achlioptas and McSherry (2001); Achlioptas et al. (2013)) and ( Achlioptas and McSherry (2001); Drineas and Zouzias (2011)). We construct as follows: if the -th entry is not sampled; sampled elements are rescaled to which makes the sketch
an unbiased estimator of, so . The sketch is sparse if the number of sampled elements is sublinear, . Sampling according to element magnitudes is natural in many applications, for example in a recommendation system users tend to rate a product they either like (high positive) or dislike (high negative).
Our main sparsification algorithm (Algorithm 1) receives as input a matrix and an accuracy parameter , and samples elements from in independent, identically distributed trials with replacement, according to a hybrid-probability distribution specified in equation (2). The algorithm returns , a sparse and unbiased estimator of , as a solution to (1).
1.2 Prior work
Achlioptas and McSherry (2001, 2007) pioneered the idea of sampling for element-wise sparsification. However, sampling on its own is not enough for provably accurate bounds for . As a matter of fact Achlioptas and McSherry (2001, 2007) observed that “small” entries need to be sampled with probabilities that depend on their absolute values only, thus also introducing the notion of sampling. The underlying reason for the need of sampling is the fact that if a small element is sampled and rescaled using sampling, this would result in a huge entry in
(because of the rescaling). As a result, the variance ofsampling is quite high, resulting in poor theoretical and experimental behavior. sampling of small entries rectifies this issue by reducing the variance of the overall approach.
Arora et al. (2006) proposed a sparsification algorithm that deterministically keeps large entries, i.e., entries of such that and randomly rounds the remaining entries using sampling. Formally, entries of that are smaller than are set to with probability and to zero otherwise. They used an -net argument to show that was bounded with high probability. Drineas and Zouzias (2011) bypassed the need for sampling by zeroing-out the small entries of (e.g., all entries such that for a matrix ) and then use sampling on the remaining entries in order to sparsify the matrix. This simple modification improves Achlioptas and McSherry (2007) and Arora et al. (2006), and comes with an elegant proof using the matrix-Bernstein inequality of Recht (2011). Note that all these approaches need truncation of small entries. Recently, Achlioptas et al. (2013) showed that sampling in isolation could be done without any truncation, and argued that (under certain assumptions) sampling would be better than sampling, even using the truncation. Their proof is also based on the matrix-valued Bernstein inequality of Recht (2011).
1.3 Our Contributions
We introduce an intuitive hybrid approach to element-wise matrix sparsification, by combining and sampling. We propose to use sampling probabilities of the form
for all 111combining and probabilities to avoid zeroing out step of sampling has recently been observed by Kundu and Drineas (2014).. We essentially retain the good properties of sampling that bias us towards data elements in the presence of small noise, while regularizing smaller entries using sampling. The proof of the quality-of-approximation result of Algorithm 1 (i.e. Theorem 1) uses the matrix-Bernstein Lemma 1. We summarize the main contributions below:
We give a parameterized sampling distribution in the variable that controls the balance between sampling and regularization. This greater flexibility allows us to achieve greater accuracy.
We derive the optimal hybrid- distribution, using Lemma 1 for arbitrary , by computing the optimal parameter which produces the desired accuracy with smallest sample size according to our theoretical bound.
Our result generalizes the existing results because setting in our bounds reproduces the result of Achlioptas et al. (2013) who claim that sampling is almost always better than sampling. Our results show that which means that the hybrid approach is best.
We give a provable algorithm (Algorithm 2) to implement hybrid- sampling without knowing a priori, i.e., we need not ‘fix’ the distribution using some predetermined value of at the beginning of the sampling process. We can set at a later stage, yet we can realize hybrid- sampling. We use Algorithm 2 to propose a pass-efficient element-wise sampling model using only one pass over the elements of the data , using memory. Moreover, Algorithm 3
gives us a heuristic to estimatein one-pass over the data using memory.
Finally, we propose the Algorithm 4 which provably recovers PCA by constructing a sparse unbiased estimator of (centered) data using our optimal hybrid- sampling.
Experimental results suggest that our optimal hybrid distribution (using ) requires strictly smaller sample size than and sampling (with or without truncation) to solve (1). Also, we achieve significant speed up of PCA on sparsified synthetic and real data while maintaining high quality approximation.
1.3.1 A Motivating Example for Hybrid- Sampling
The main motivation for introducing the idea of hybrid- sampling on elements of comes from achieving a tighter bound on using a simple and intuitive probability distribution on elements of . For this, we observe certain good properties of both and sampling for sparsification of noisy data (in practice, we experience data that are noisy, and it is perhaps impossible to separate “true” data from noise). We illustrate the behavior of and sampling on noisy data using the following synthetic example. We construct a binary data (Figure 1), and then perturb it by a random Gaussian matrix N whose elements.
We denote this perturbed data matrix by . First, we note that and sampling work identically on binary data . However, Figure 2 depicts the change in behavior of and sampling sparsifying . Data elements and noise in are the elements with non-zero and zero values in , respectively. We sample indices in i.i.d. trials according to and probabilities separately to produce sparse sketch . Figure 2 shows that elements of , produced by sampling, have controlled variance but most of them are noise. On the other hand, sampling is biased towards data elements, although small number of sampled noisy elements create large variance due to rescaling. Our hybrid- sampling benefits from this bias of towards data elements, as well as, regularization properties of .
We parameterize our distribution using the variable that controls the balance between sampling and regularization. We derive an expression to compute , the optimal , corresponding to the smallest sample size that we need in order to achieve a given accuracy in (1). Setting , we reproduce the result of Achlioptas et al. (2013). However, may be smaller than 1, and the bound on sample size , using , is guaranteed to be tighter than that of Achlioptas et al. (2013).
2 Main Result
We present the quality-of-approximation result of our main algorithm (Algorithm 1). We define the sampling operator in (3) that extracts elements from a given matrix . Let be a multi-set of sampled indices , for . Then,
Algorithm 1 randomly samples (in i.i.d. trials) elements of a given matrix , according to a probability distribution over the elements of . Let the ’s be as in eqn. (2). Then, we can prove the following theorem.
is the smallest singular value of . Moreover, we can find (optimal corresponding to the smallest ) and (the smallest ), by solving the following optimization problem in (6):
The functional form in (5) comes from the Matrix-Bernstein inequality in Lemma 1, with and being functions of and . This gives us a flexibility to optimize the sample size with respect to in (5), which is how we get the optimal . For a given matrix , we can easily compute and for various values of . Given an accuracy and failure probability , we can compute corresponding to the tightest bound on . Note that, for we reproduce the results of Achlioptas et al. (2013) (which was expressed using various matrix metrics). However, may be smaller than 1, and is guaranteed to produce tighter comparing to extreme choices of (e.g. for sampling). We illustrate this by the plot in Figure 3.
2.1 Proof of Theorem 1
In this section we provide a proof of Theorem 1 following the proof outline of Drineas and Zouzias (2011); Achlioptas et al. (2013). We use the following non-commutative matrix-valued Bernstein bound of Recht (2011) as our main tool to prove Theorem 1. Using our notation we rephrase the matrix Bernstein bound.
[Theorem 3.2 of Recht (2011)] Let be independent, zero-mean random matrices in . Suppose
and for all . Then, for any ,
holds, subject to a failure probability at most
For all we define the matrix as follows:
It now follows that
We can bound for all . We define the following quantity:
Using our notation, and using probabilities of the form (2), for all ,
Proof: Using probabilities of the form (2), and because is never sampled,
Using (8), we obtain the bound.
Next we bound the spectral norm of the expectation of .
Using our notation, and using probabilities of the form (2), for all ,
Proof: Recall that and to derive
Sampling according to probabilities of eqn. (2), and because is never sampled, we get, for ,
Note that, is a diagonal matrix with all entries non-negative, and is a postive semi-definite matrix. Therefore,
Similarly, we can obtain
We can now apply Theorem 1 with
to conclude that holds subject to a failure probability at most
Bounding the failure probability by , and setting we complete the proof.
3 One-pass Hybrid- Sampling
Here we discuss the implementation of -hybrid sampling in one pass over the input matrix using memory, that is, a streaming model. We know that both and sampling can be done in one pass using memory (see Algorithm SELECT p. 137 of Drineas et al. (2006) ). In our hybrid sampling, we want parameter to depend on data elements, i.e., we do not want to ‘fix’ it prior to the arrival of data stream. Here we give an algorithm (Algorithm 2) to implement a one-pass version of the hybrid sampling without knowing a priori.
We note that steps 2-5 of Algorithm 2 access the elements of only once, in parallel, to form independent multisets , , , and . Step 6 computes and in parallel in one pass over . Subsequent steps do not need to access anymore. Interestingly, we set in step 7 when the data stream is gone. Steps 10-16 sample elements from and based on the in step 7, and produce sparse matrix based on the sampled entries in random multiset . Theorem 2 shows that Algorithm 2 indeed samples elements from according to the hybrid- probabilities in eqn (2).
Using the notations in Algorithm 2, for , ,
where and .
Proof: Here we use the notations in Theorem 2. Note that -th elements of and are sampled independently with and probabilities, respectively. We consider the following disjoint events:
Let us denote the events and . Clearly, . Since the elements and are sampled independently, we have
We note that may be dependent on the elements of and (in Algorithm 3), but is independent of elements of and . Therefore, events and are independent of the events , . Thus,
Note that, Theorem 2 holds for any arbitrary in line 7 of Algorithm 2, i.e., Algorithm 3 is not essential for correctness of Theorem 2. We only need to be independent of elements of and . However, we use Algorithm 3 to get an iterative estimate of (Section 3.1) in one pass over . In this case, we need additional independent multisets and to ‘learn’ the parameter . Algorithm 2 (without Algorithm 3) requires a memory twice as large required by or sampling. Using Algorithm 3 this requirement is four times as large. However, in both the cases the asymptotic memory requirement remains the same .
3.1 Iterative Estimate of
We obtain independent random multiset of triples and , each containing elements from in one pass, in Algorithm 2
. We can create a sparse random matrix, as shown in step 11 in Algorithm 3, that is an unbiased estimator of . We use this as a proxy for to estimate the quantities we need in order to solve the optimization problem in (9).
where, for all
We note that . We can compute the quantities and , for a fixed , using memory. We consider to be the given accuracy.
4 Fast Approximation of PCA
Here, we discuss a provable algorithm (Algorithm 4) to speed up computation of PCA applying element-wise sampling. We sparsify a given centered data to produce a sparse unbiased estimator by sampling elements in i.i.d. trials according to our hybrid- distribution in (2). Computation of rank-truncated SVD on sparse data is fast, and we consider the right singular vectors of as the approximate principal components of . Naturally, more samples produce better approximation. However, this reduces sparsity, and consequently we lose the speed advantage.
The first inequality of Theorem 3 bounds the approximation of projected data onto the space spanned by top approximate PCA’s. The second and third inequalities measure the quality of as a surrogate for and the quality of projection of sparsified data onto approximate PCA’s, respectively.
Proofs of first two inequalities of Theorem 3 follow from Theorem 5 and Theorem 8 of Achlioptas and McSherry (2001), respectively. The last inequality follows from the triangle inequality. The last two inequalities above are particularly useful in cases where is inherently low-rank and we choose an appropriate for approximation, for which is small.
In this section we perform various element-wise sampling experiments on synthetic and real data to show how well the sparse sketches preserve the structure of the original data, in spectral norm. Also, we show results on the quality of the PCA’s derived from sparse sketches.
5.1 Algorithms for Sparse Sketches
We use Algorithm 1 as a prototypical algorithm to produce sparse sketches from a given matrix via various sampling methods. Note that, we can plug-in any element-wise probability distribution in Algorithm 1 to produce (unbiased) sparse matrices. We construct sparse sketches via our optimal hybrid-() sampling, along with other sampling methods related to extreme choices of , such as, sampling for . Also, we use element-wise leverage scores (Chen et al. (2014)) for sparsification of low-rank data. Element-wise leverage scores are used in the context of low-rank matrix completion by Chen et al. (2014). Let be a matrix of rank , and its SVD if given by . Then, we define (row leverage scores), (column leverage scores), and element-wise leverage scores as follows:
Note that is a probability distribution on the elements of . Leverage scores become uniform if the matrix is full rank. We use in Algorithm 1 to produce sparse sketch of a low-rank data .
5.1.1 Experimental Design for Sparse Sketches
We compute the theoretical optimal mixing parameter by solving eqn (6) for various datasets. We compare this with the theoretical condition derived by Achlioptas et al. (2013) (for cases when sampling outperforms sampling). We verify the accuracy of by measuring the quality of the sparse sketches , for various sampling distributions. Let , , and denote the quality of sparse sketches produced via optimal hybrid sampling, sampling, and element-wise leverage scores , respectively. We compare , , and for various sample sizes for real and synthetic datasets.
5.2 Algorithms for Fast PCA
We compare three algorithms for computing PCA of the centered data. Let the actual PCA of the original data be . We use Algorithm 4 to compute approximate PCA via our optimal hybrid- sampling. Let us denote this approximate PCA by . Also, we compute PCA of a Gaussian random projection of the original data to compare the quality of . Let , where