Near-Optimal Entrywise Sampling for Data Matrices

11/19/2013 · Dimitris Achlioptas, et al. · Yahoo! Inc.

We consider the problem of selecting non-zero entries of a matrix A in order to produce a sparse sketch of it, B, that minimizes ‖A − B‖_2. For large m × n matrices with n ≫ m (for example, representing n observations over m attributes), we give sampling distributions that exhibit four important properties. First, they have closed forms computable from minimal information regarding A. Second, they allow sketching of matrices whose non-zeros are presented to the algorithm in arbitrary order as a stream, with O(1) computation per non-zero. Third, the resulting sketch matrices are not only sparse, but their non-zero entries are highly compressible. Lastly, and most importantly, under mild assumptions, our distributions are provably competitive with the optimal offline distribution. Note that the probabilities in the optimal offline distribution may be complex functions of all the entries in the matrix. Therefore, regardless of computational complexity, the optimal distribution might be impossible to compute in the streaming model.


1 Introduction

Given an m × n matrix A, it is often desirable to find a sparser matrix B that is a good proxy for A. Besides being a natural mathematical question, such sparsification has become a ubiquitous preprocessing step in a number of data analysis operations including approximate eigenvector computations [AM01, AHK06, AM07], semi-definite programming [AHK05, d’A08], and matrix completion problems [CR09, CT10].

A fruitful measure for the approximation of A by B is the spectral norm of A − B, where for any matrix C the spectral norm is defined as ‖C‖_2 = max_{‖x‖_2 = 1} ‖Cx‖_2. Randomization has been central in the context of matrix approximations and the overall problem is typically cast as follows: given a matrix A and a budget s, devise a distribution over matrices B such that the (expected) number of non-zero entries in B is at most s and ‖A − B‖_2 is as small as possible.

Our work is motivated by big data matrices that are generated by measurement processes. Each of the n matrix columns corresponds to an observation of m attributes. Thus, we expect n ≫ m. We also expect the total number of non-zero entries in A to exceed available memory. We assume that the original data matrix A is accessed in the streaming model, where we know only very basic features of A a priori and the actual non-zero entries are presented to us one at a time, in arbitrary order. The streaming model is especially important for tasks like recommendation engines, where user-item preferences become available one by one in an arbitrary order. But it is also important in cases where A exists in durable storage and random access to its entries is prohibitively expensive.

We establish that for such matrices the following approach gives provably near-optimal sparsification. Assign to each element A_ij of the matrix a weight |A_ij|/‖A_i‖_1 that depends only on the elements in its row A_i. Take p = (p_1, …, p_m) to be an (appropriate) distribution over the rows. Sample s i.i.d. entries of A using the distribution p_ij = p_i · |A_ij|/‖A_i‖_1. Return B, the mean of s matrices, each containing a single non-zero entry at a selected location (i_t, j_t).
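For concreteness, here is a minimal numpy sketch of this scheme (names such as sparse_sketch are ours, not the paper's); it takes an arbitrary entrywise distribution P supported on the non-zeros of A and returns the averaged single-entry estimator described above.

```python
import numpy as np

def sparse_sketch(A, P, s, rng=None):
    """Entrywise sampling sketch: draw s entries of A i.i.d. from the
    distribution P (P[i, j] = probability of picking entry (i, j), summing
    to 1 over the non-zeros of A) and return the average of s matrices,
    each with the single non-zero A[i, j] / P[i, j] at the sampled location.
    The result B is an unbiased estimator of A with at most s non-zeros."""
    rng = np.random.default_rng(rng)
    m, n = A.shape
    flat_p = P.ravel()
    idx = rng.choice(m * n, size=s, replace=True, p=flat_p)
    B = np.zeros_like(A, dtype=float)
    for k in idx:
        i, j = divmod(int(k), n)
        B[i, j] += A[i, j] / (s * P[i, j])
    return B
```

Any distribution supported on the non-zeros of A yields an unbiased B; Algorithm 1 in Section 4 specifies the particular row-based choice of P analyzed here.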

As we will see, this simple form of the probabilities falls out naturally from generic optimization considerations. The fact that each entry is kept with probability proportional to its magnitude, besides being interesting in its own right, has a remarkably practical implication. Every non-zero in the i-th row of B takes the form c_ij · sign(A_ij) · ‖A_i‖_1/(s · p_i), where c_ij is the number of times entry (i, j) was sampled. Note that since we sample with replacement, c_ij might, on rare occasions, be greater than 1. The result is a matrix B that is representable very compactly: there is no reason to store floating point matrix entry values, since it suffices to store the per-row scales, the counts, and the non-zero index offsets.¹ (¹ It is harmless to assume any value in the matrix is kept using a bounded number of bits of precision; truncating the trailing bits can be shown to be negligible.) Note that some of these offsets might be zero. In a simple experiment, we measured the average number of bits per sample (total size of the sketch divided by the number of samples s). The results were between and bits per sample, depending on the matrix and on s. It is important to note that the number of bits per sample is usually less than the minimal number of bits required to represent an index pair (i, j) explicitly. Our experiments show a reduction of disk space by a factor of between and relative to the compressed size of the file representing the sample matrix in the standard row-column-value list format.
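The sketch below (illustrative only; the paper's exact bit-level encoding is not reproduced) shows why this representation is so compact: all sampled entries of row i share the magnitude ‖A_i‖_1/(s · p_i), so per row it suffices to store that scale once plus small integers, namely column-offset gaps, counts c_ij, and signs.

```python
from collections import Counter
import numpy as np

def encode_row_samples(A, p_row, samples, s):
    """Group sampled (i, j) pairs by row. For each row i store the common
    magnitude ||A_i||_1 / (s * p_i) once, plus a list of
    (column-gap, count, sign) triples -- small integers that compress well."""
    per_row = {}
    for i, j in samples:
        per_row.setdefault(i, Counter())[j] += 1
    encoded = {}
    for i, counts in per_row.items():
        scale = np.abs(A[i]).sum() / (s * p_row[i])   # shared by all entries of row i
        cols = sorted(counts)
        gaps = [cols[0]] + [b - a for a, b in zip(cols, cols[1:])]
        triples = [(g, counts[c], int(np.sign(A[i, c])))
                   for g, c in zip(gaps, cols)]
        encoded[i] = (scale, triples)
    return encoded
```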

Another insight of our work is that the distributions we propose are combinations of two L1-based distributions, and which of the two dominates is determined by the sampling budget. When the number of samples is small, the row probabilities are nearly linear in the rows' L1 norms, resulting in p_ij ∝ |A_ij|. However, as the number of samples grows, the row probabilities tend towards a different limit, resulting in a distribution we refer to as Row-L1 sampling. The dependence of the preferred distribution on the sample budget is also borne out in experiments, with sampling based on appropriately mixed distributions being consistently best. This highlights that the need to adapt the sampling distribution to the sample budget is a genuine phenomenon.
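The snippet below shows the shared structure of these distributions in numpy form: a distribution over rows composed with L1 weights inside each row. Choosing the row distribution proportional to the rows' L1 norms recovers p_ij ∝ |A_ij|; Algorithm 1 instead picks a budget-dependent row distribution (the Row-L1 limit and the exact mixture are derived in Section 5).

```python
import numpy as np

def entrywise_probs(A, p_row):
    """Compose a distribution p_row over rows with L1 weights within each row:
    p_ij = p_row[i] * |A_ij| / ||A_i||_1 (rows assumed to have a non-zero)."""
    W = np.abs(A)
    return p_row[:, None] * W / W.sum(axis=1, keepdims=True)

def plain_l1_row_weights(A):
    """Row weights proportional to the rows' L1 norms; composing these with
    the intra-row L1 weights gives plain L1 sampling, p_ij = |A_ij| / ||A||_1."""
    row_l1 = np.abs(A).sum(axis=1)
    return row_l1 / row_l1.sum()
```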

2 Measure of Error and Related Work

We measure the difference between A and B with respect to the L2 (spectral) norm as it is highly revealing in the context of data analysis. Let us define a linear trend in the data of A as any tendency of the rows to align with a particular unit vector x. To examine the presence of such a trend, we need only multiply A with x: the i-th coordinate of Ax is the projection of the i-th row of A onto x. Thus, ‖Ax‖ measures the strength of the linear trend x in A, and ‖A‖_2 measures the strongest linear trend in A. Consequently, minimizing ‖A − B‖_2 minimizes the strength of the strongest linear trend of A not captured by B. In contrast, measuring the difference using any entry-wise norm, e.g., the Frobenius norm, can be completely uninformative. This is because the best strategy would be to always pick the largest matrix entries from A, a strategy that can easily be “fooled”. As a stark example, when the matrix entries are ±1, the quality of the approximation is completely independent of which elements of A we keep. This is clearly bad; as long as A contains even a modicum of structure, certain approximations will be far better than others.

By using the spectral norm to measure error we get a natural and sophisticated target: to minimize ‖A − B‖_2 is to make the error matrix A − B a near-rotation, having only small variations in the amount by which it stretches different vectors. This idea that the error matrix should be isotropic, thus packing as much Frobenius norm as possible for its L2 norm, motivated the first work on element-wise sampling of matrices by Achlioptas and McSherry [AM07]. Concretely, to minimize ‖A − B‖_2 it is natural to aim for an error matrix A − B that is both zero-mean, i.e., B is an unbiased estimator of A, and whose entries are formed by sampling the entries of A (and, thus, of A − B) independently. In the work of [AM07], A − B is a matrix of i.i.d. zero-mean random variables. The study of the spectral characteristics of such matrices goes back all the way to Wigner’s famous semi-circle law [Wig58]. Specifically, to bound ‖A − B‖_2, [AM07] used a bound due to Alon, Krivelevich and Vu [AKV02], a refinement of a bound by Juhász [Juh81] and Füredi and Komlós [FK81]. The most salient feature of that bound is that it depends on the maximum entry-wise variance of A − B, and therefore the distribution optimizing the bound is the one in which the variance of all entries in A − B is the same. In turn, this means keeping each entry of A independently with probability proportional to A_ij² (up to a small wrinkle discussed below).

Several papers have since analyzed L2-sampling and variants [NDT09, NDT10, DZ11, GT09, AM07]. An inherent difficulty of L2-sampling based strategies is the need for special handling of small entries. This is because when each entry A_ij is kept with probability p_ij ∝ A_ij², the resulting entry in the sample matrix has magnitude A_ij/p_ij ∝ 1/|A_ij|. Thus, if an extremely small element is accidentally picked, the largest entry of the sample matrix “blows up”. In [AM07] this was addressed by sampling small entries with probability proportional to |A_ij| rather than A_ij². In the work of Gittens and Tropp [GT09], small entries are not handled separately and the bound derived depends on the ratio between the largest and the smallest non-zero magnitude.
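A tiny numeric illustration of the blow-up (hypothetical values): with p_ij ∝ A_ij², a sampled entry is rescaled by 1/p_ij, so its magnitude in the sketch grows like 1/|A_ij| and explodes for very small entries.

```python
import numpy as np

entries = np.array([1.0, 1.0, 1.0, 1e-6])   # one extremely small entry
p = entries ** 2 / (entries ** 2).sum()     # pure L2 sampling probabilities
rescaled = np.abs(entries) / p              # magnitude placed in the sketch if picked
print(rescaled)  # the tiny entry, if ever sampled, contributes ~3e6: it "blows up"
```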

Random matrix theory has witnessed dramatic progress in the last few years and [AW02, RV07, Tro12a, Rec11] provide a good overview of the results. This progress motivated Drineas and Zouzias in [DZ11] to revisit L2-sampling, but now using concentration results for sums of random matrices [Rec11], as we do here. (Note that this is somewhat different from the original setting of [AM07], since now B − A is not one random matrix with independent entries but a sum of many independent matrices, as the entries are chosen with replacement.) Their work improved upon all previous L2-based sampling results and also upon the L1-sampling result of Arora, Hazan and Kale [AHK06], discussed below, while admitting a remarkably compact proof. The issue of small entries was handled in [DZ11] by deterministically discarding all sufficiently small entries, a strategy that gives the strongest mathematical guarantee (but see the discussion regarding deterministic truncation in the experimental section).

A completely different tack, avoiding random matrix theory altogether, was taken by Arora et al. [AHK06]. Their approximation keeps the largest entries in A deterministically (specifically, all entries whose magnitude exceeds a threshold that needs to be known a priori) and randomly rounds the remaining smaller entries. They exploit the fact that ‖A − B‖_2 = max_{‖x‖ = ‖y‖ = 1} x^⊤(A − B)y, noting that for fixed x and y this is a scalar quantity whose concentration around its expectation can be established by standard Bernstein-Bennet type inequalities. A union bound then allows them to prove that, with high probability, x^⊤(A − B)y is small for every x and y. The result of [AHK06] admits a relatively simple proof. However, it also requires a truncation that depends on the desired approximation error. Rather interestingly, this time the truncation amounts to keeping every entry larger than some threshold.

3 Our Approach

Following the discussion in Section 2 and in line with previous works, we: (i) measure the quality of B by ‖A − B‖_2, (ii) sample the entries of A independently, and (iii) require B to be an unbiased estimator of A. We are therefore left with the task of determining a good probability distribution p from which to sample the entries of A in order to get B. As discussed in Section 2, prior art makes heavy use of beautiful results in the theory of random matrices. Specifically, each work proposes a specific sampling distribution and then uses results from random matrix theory to demonstrate that it has good properties. In this work we reverse the approach, taking it to its logical conclusion. We start from a cornerstone result in random matrix theory and work backwards to reverse-engineer near-optimal distributions with respect to the notion of probabilistic deviations captured by the inequality. The inequality we use is the Matrix Bernstein inequality for sums of independent random matrices (see, e.g., [Tro12b], Theorem 1.6).

Theorem 3.1 (Matrix Bernstein inequality).

Consider a finite sequence {X_t} of i.i.d. random m × n matrices, where E[X_t] = 0 and ‖X_t‖_2 ≤ R. Let σ² = max{ ‖E[X_t X_t^⊤]‖_2 , ‖E[X_t^⊤ X_t]‖_2 }.

For some fixed s, let X = (1/s) Σ_{t=1}^{s} X_t. For all ε > 0,

Pr[ ‖X‖_2 ≥ ε ] ≤ (m + n) · exp( −(s ε² / 2) / (σ² + Rε/3) ).

To get a feeling for our approach, fix any probability distribution p over the non-zero elements of A. Let Q be a random matrix with exactly one non-zero element, formed by sampling an element (i, j) of A according to p and letting Q_ij = A_ij/p_ij. Observe that for every (i, j), regardless of the choice of p, we have E[Q_ij] = A_ij, and thus Q is always an unbiased estimator of A. Clearly, the same is true if we repeat this s times, taking i.i.d. samples Q_1, …, Q_s, and let our matrix B be their average. With this approach in mind, the goal is now to find a distribution p minimizing ‖B − A‖_2. Writing B − A = (1/s) Σ_t (Q_t − A) we see that ‖B − A‖_2 is the operator norm of the average of i.i.d. zero-mean random matrices Q_t − A, i.e., exactly the setting of Theorem 3.1. The relevant parameters are

(1)
(2)

Equations (1) and (2) mark the starting point of our work. Our goal is to find probability distributions over the elements of A that optimize (1) and (2) simultaneously with respect to their functional form in Theorem 3.1, thus yielding the strongest possible bound on ‖B − A‖_2. A conceptual contribution of our work is the discovery that these distributions depend on the sample budget s, a fact also borne out in experiments. The fact that minimizing the deviation metric of Theorem 3.1 suffices to bring out this non-linearity can be viewed as testament to the theorem’s sharpness.
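For reference, one plausible reading of (1) and (2) is sketched below (an assumption on our part, not a verbatim restatement): applying Theorem 3.1 to X_t = Q_t − A, the variance and range parameters can be bounded using only Q and A as

```latex
\sigma^{2}(p) \;\le\; \max\Bigl\{\bigl\|\mathbb{E}[QQ^{\top}]\bigr\|_{2},\;\bigl\|\mathbb{E}[Q^{\top}Q]\bigr\|_{2}\Bigr\}
             \;=\;  \max\Bigl\{\max_{i}\sum_{j}\frac{A_{ij}^{2}}{p_{ij}},\;\max_{j}\sum_{i}\frac{A_{ij}^{2}}{p_{ij}}\Bigr\},
\qquad
R(p) \;\le\; \max_{(i,j)\colon A_{ij}\neq 0}\frac{|A_{ij}|}{p_{ij}} \;+\; \|A\|_{2}.
```

The diagonal form of E[QQ^⊤] and E[Q^⊤Q] used in the middle equality is established in the proof of Lemma 5.3.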

Theorem 3.1 is stated as a bound on the probability that the norm of the error matrix is greater than some target error ε, given the number of samples s. Nevertheless, in practice the target error is not known in advance, but rather is the quantity to minimize given the matrix A, the number of samples s, and the target confidence δ. Specifically, for any given distribution p on the elements of A, define

(3)

Our goal in the rest of the paper is to seek the distribution p minimizing (3). Our result is an easily computable distribution which comes within a factor of 3 of the optimum and, as a result, within a factor of 9 in terms of sample complexity (in practice we expect this to be even smaller, as the factor of 3 comes from consolidating bounds for a number of different worst-case matrices). To put this in perspective, note that the definition of (3) does not place any restriction either on the access model for A while computing the optimal distribution, or on the amount of time needed to compute it. In other words, we are competing against an oracle which, in order to determine the optimal distribution, has all of A in its purview at once and can spend an unbounded amount of computation.

In contrast, the only global information regarding A we will require is the set of ratios between the L1 norms of the rows of the matrix. Trivially, the exact L1 norms of the rows (and therefore their ratios) can be computed in a single pass over the matrix, yielding a 2-pass algorithm. Moreover, standard concentration of measure arguments imply that these ratios can be estimated very well by sampling only a small number of columns. In our setting, it is in fact reasonable to expect that good estimates of these ratios are available a priori. This is because different rows correspond to different attributes and the ratios between the row norms reflect the ratios between the average absolute values of these features. For example, if the matrix corresponds to text documents, knowing the ratios amounts to knowing global word frequencies. Moreover, these ratios do not need to be known exactly to apply the algorithm, as even rough estimates of them give highly competitive results. Indeed, even disregarding this issue completely and simply assuming that all rows have equal L1 norm yields an algorithm that appears quite competitive in practice, as demonstrated by our experiments.
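A minimal sketch of that estimation step (function name and pilot sample size are ours): sample a handful of columns uniformly and normalize the per-row L1 mass observed in the sample.

```python
import numpy as np

def estimate_row_l1_ratios(A, num_cols=100, rng=None):
    """Estimate rho_i = ||A_i||_1 / ||A||_1 from a small uniform sample of
    columns. Concentration of measure makes these ratios accurate long
    before the full matrix has been read."""
    rng = np.random.default_rng(rng)
    n = A.shape[1]
    cols = rng.choice(n, size=min(num_cols, n), replace=False)
    partial = np.abs(A[:, cols]).sum(axis=1)   # per-row L1 mass on the sampled columns
    return partial / partial.sum()             # normalize to a distribution over rows
```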

4 Data Matrices and Statement of Results

Throughout, A_i and A^(j) will denote the i-th row and j-th column of A, respectively. Also, we write ‖A‖_1 = Σ_{i,j} |A_ij| for the entrywise L1 norm of A, and ‖A_i‖_1 for that of row i. Before we formally state our result we introduce a definition that expresses the class of matrices for which our results hold.

Definition 4.1.

An m × n matrix A is a Data matrix if:

  1. .

  2. .

  3. .

Regarding Condition 1, recall that we think of A as being generated by a measurement process over a fixed number of attributes (rows), each column corresponding to an observation. As a result, columns have L1 norm bounded by a constant. While this constant may depend on the type of object and its dimensionality, it is independent of the number of objects. On the other hand, ‖A‖_1 grows linearly with the number of columns (objects). As a result, we can expect Definition 4.1 to hold for all large enough data sets. Regarding Condition 2, it is easy to verify that it holds unless the values of the entries of A exhibit unbounded variance. Condition 3 is trivial. Out of the three conditions the essential one is Condition 1. The other two are merely technical and hold in all non-trivial cases where Condition 1 applies.

1:Input: Data matrix , sampling budget , acceptable failure probability
2:Set ComputeRowDistribution(, , )
3:Sample elements of with replacement, each having probability
4:For each sample , let entry of be and zero otherwise.
5:Output: .
6:
7:function ComputeRowDistribution(, , )
8:     Obtain such that for
9:     Set     and    
10:     Define
11:     Find such that
12:     return such that for
Algorithm 1 Construct a sketch for a data matrix

To simplify the exposition of Algorithm 1, we describe it in the non-streaming setting. Regarding the complexity of Algorithm 1, steps 7–12 compute a distribution over the rows. Assuming step 8 can be implemented efficiently (or skipped altogether; see the discussion at the end of Section 3), the running time of ComputeRowDistribution is independent of n. Finding the root in step 11 can be done very efficiently by binary search because the relevant function is strictly decreasing. Conceptually, we see that the probability assigned to each element in Step 3 is simply the probability of its row times its intra-row weight |A_ij|/‖A_i‖_1.
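The following sketch mirrors that structure (the closed form of the per-row expression in line 10 and the constants of line 9 are not reproduced here, so the function xi is passed in as a callback, and the root variable is called z purely for exposition): a geometric binary search finds the root of a strictly decreasing function, and the returned row distribution is then composed with the intra-row L1 weights exactly as in Step 3.

```python
import numpy as np

def compute_row_distribution(rho, xi, z_lo=1e-12, z_hi=1e12, iters=200):
    """Structural sketch of ComputeRowDistribution: rho holds the row L1-norm
    ratios (line 8); xi(rho_i, z) stands in for the per-row expression of
    line 10 and must be strictly decreasing in z. Binary search finds the z
    at which the row probabilities sum to one (line 11) and returns
    p_i = xi(rho_i, z) (line 12)."""
    total = lambda z: sum(xi(r, z) for r in rho)
    for _ in range(iters):
        z_mid = np.sqrt(z_lo * z_hi)      # geometric bisection over a wide bracket
        if total(z_mid) > 1.0:
            z_lo = z_mid                  # probabilities sum to more than 1: increase z
        else:
            z_hi = z_mid
    z = np.sqrt(z_lo * z_hi)
    return np.array([xi(r, z) for r in rho])
```

Steps 3–5 then sample s entries with probabilities p_i · |A_ij|/‖A_i‖_1 and average the single-entry matrices, as in the sketch of Section 1.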

Note that to apply Algorithm 1 the entries of must be sampled with replacement in the streaming model. A simple way to achieve this using operations per matrix element and active memory was presented in [DKM06]. In fact, though, it is possible to implement such sampling far more efficiently.
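For illustration, the simple scheme can be realized with s independent weighted "reservoir" slots, each replacing its current item with probability weight/total-weight-so-far; this uses O(s) active memory and O(s) work per non-zero, and is a sketch of the idea rather than the more efficient implementation promised by Theorem 4.2 below.

```python
import random

class StreamWithReplacementSampler:
    """s independent weighted reservoir slots. After the stream ends, each slot
    holds one item drawn (independently) with probability proportional to its
    weight, so together the slots form a with-replacement sample of size s."""
    def __init__(self, s, seed=None):
        self.slots = [None] * s
        self.total = 0.0
        self.rng = random.Random(seed)

    def offer(self, item, weight):
        self.total += weight
        for k in range(len(self.slots)):
            if self.rng.random() < weight / self.total:
                self.slots[k] = item

    def sample(self):
        return list(self.slots)
```

Feeding each streamed non-zero (i, j, A_ij) with weight p_ij into such a sampler and reading out the s slots at the end yields the with-replacement sample required by Step 3.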

Theorem 4.2.

For any matrix A, steps 3–5 in Algorithm 1 can be accomplished in the streaming model using limited active memory and space, and O(1) operations per non-zero element of A.

We are now able to state our main result.

Theorem 4.3.

If A is a Data matrix (per Definition 4.1) and p is the probability distribution defined in Algorithm 1, then the value of (3) under p is at most 3 times its value under p*, where p* is the minimizer of (3).

The proof of Theorem 4.3 is outlined in Section 5. To understand the implications of Theorem 4.3 and to compare our result with previous ones we must first define several matrix metrics.

Stable rank: Defined as ‖A‖_F² / ‖A‖_2². This is a smooth analog of the algebraic rank, always bounded by it from above, and resilient to small perturbations of the matrix. For data matrices we expect it to be small (even constant) and to capture the “inherent dimensionality” of the data.

Numeric density: Defined as ‖A‖_1² / ‖A‖_F² (with ‖A‖_1 the entrywise L1 norm), this is a smooth analog of the number of non-zero entries nnz(A). For 0-1 matrices it equals nnz(A), but when there is variance in the magnitude of the entries it is smaller.

Numeric row density: Denoted as and defined as . In practice, it is often close to the average numeric density of a single row, a quantity typically much smaller than .
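In numpy these metrics read as follows; the first two follow the descriptions above, while the numeric row density formula shown is only one plausible reading and should be treated as an assumption.

```python
import numpy as np

def stable_rank(A):
    """||A||_F^2 / ||A||_2^2: a smooth quantity that is at most the algebraic rank."""
    return (np.linalg.norm(A, 'fro') ** 2) / (np.linalg.norm(A, 2) ** 2)

def numeric_density(A):
    """||A||_1^2 / ||A||_F^2 with ||A||_1 the entrywise L1 norm; equals nnz(A)
    for 0-1 matrices and is smaller when entry magnitudes vary."""
    return (np.abs(A).sum() ** 2) / (np.linalg.norm(A, 'fro') ** 2)

def numeric_row_density(A):
    """Assumed form: sum_i ||A_i||_1^2 / ||A||_F^2, a Frobenius-weighted average
    of the rows' numeric densities; close to the average numeric density of a
    single row when rows carry comparable Frobenius mass."""
    row_l1 = np.abs(A).sum(axis=1)
    return (row_l1 ** 2).sum() / (np.linalg.norm(A, 'fro') ** 2)
```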

Theorem 4.4.

Let A be a data matrix meeting the conditions of Definition 4.1, and let B be the matrix returned by Algorithm 1 for failure probability δ and a sampling budget s of at least the size stated in the table below.

Then the corresponding error bound on ‖A − B‖_2 holds with probability at least 1 − δ.

The table below shows the corresponding number of samples in previous works for constant success probability, in terms of the matrix metrics defined above. The fourth column presents the ratio of the samples needed by previous results divided by the samples needed by our method. To simplify the expressions, we present the ratio between our bound and [AHK06] only when the result of [AHK06] gives superior bounds to [DZ11]. That is, we always compare our bound to the stronger of the two bounds implied by these works.

Citation Method Number of samples needed Improvement ratio of Theorem 4.4
[AM07] L1, L2
[DZ11] L2
[AHK06] L1
This paper Bernstein

Holding and the stable rank constant we readily see that our method requires roughly the samples needed by [AHK06]. In the comparison with [DZ11], the key parameter is the ratio . This quantity is typically much smaller than for data matrices but independent of . As a point of reference for the assumptions, in the experimental Section 6 we provide the values of all relevant matrix metrics for all the real data matrices we worked with, wherein the ratio is typically around . Considering this, one would expect that L2-sampling should experimentally fare better than L1-sampling. As we will see, quite the opposite is true. A potential explanation for this phenomenon is the relative looseness of the bound of [AHK06] for the performance of L1-sampling.

5 Proof of Theorem 4.4

We start by iteratively replacing the objective functions (1) and (2) with increasingly simpler functions. Each replacement will incur a (small) loss in accuracy but will bring us closer to a function for which we can give a closed-form solution. Recalling the definitions from Algorithm 1 and rewriting the requirement in (3) as a quadratic form gives the first objective. Our first step is to observe that for any distribution, this quadratic equation has one negative and one positive solution, and that the latter is bounded above and below by simple expressions.² (² Here and in the following, to lighten notation, we omit all arguments from the objective functions we seek to optimize, as they are readily understood from context.) This lets us replace the original objective by a simpler surrogate. Our next simplification encompasses Conditions 2 and 3 of Definition 4.1, and introduces a further simplified objective.

Lemma 5.1.

For every matrix satisfying Conditions 3 and 2 of Definition 4.1, for every probability distribution on the elements of , .

Proof.

We start by providing two auxiliary lemmas.

Lemma 5.2.

For any , if and , then and , with equality holding in both cases if and only if .

Proof.

To prove we note that if , then changing to such that and can only reduce the maximum. In order for all to be equal it must be that for all , in which case .

The second claim follows from applying Jensen’s inequality to the convex function . Specifically, Jensen’s inequality shows that for any ,

This inequality is met for . ∎
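Written out, the content of Lemma 5.2 amounts to the following (a reconstruction consistent with how the lemma is applied in Lemma 5.3 and in (6); the exact original formulation may differ): for non-negative a_1, …, a_n and any probability vector q,

```latex
\max_{j}\ \frac{a_j}{q_j}\;\ge\;\sum_{j} a_j
\qquad\text{and}\qquad
\sum_{j}\frac{a_j^{2}}{q_j}\;\ge\;\Bigl(\sum_{j} a_j\Bigr)^{2},
```

with equality in both cases if and only if q_j = a_j / Σ_k a_k.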

Lemma 5.3.

For any matrix and any probability distribution on the elements of , we have and .

Proof.

Recall that contains one non-zero element , while all its other entries are 0. Therefore, and are both diagonal matrices where

Since the operator norm of a diagonal matrix equals its largest entry we see that

We will need to bound this quantity from below. The first inequality below is trivial; defining the relevant quantities, the second and third follow from Lemma 5.2:

(4)

On the other hand, , where . Since ,

and, analogously, . Therefore, by the triangle inequality, and the claim now follows from (4).

Recall that contains one non-zero entry and that is the maximum of over all possible realizations of , i.e., choices of . Thus by the triangle inequality,

Since has one non-zero entry, we see that and, thus, . Applying Lemma 5.2 to with distribution yields . ∎

We are now ready to prove Lemma 5.1. It suffices to prove that both quantities are bounded by the simplified objective. Lemma 5.3 yields the first inequality below and Condition 2 of Definition 4.1 the second. The third inequality holds for every matrix, with equality occurring when all rows have the same L1 norm.

Lemma 5.3 yields the first inequality below. The second inequality follows from rearranging the factors in the second inequality above. Condition 3 of Definition 4.1, i.e., , implies the third.

This allows us to optimize with respect to instead of . In minimizing we see that there is freedom to use different rows to optimize and . At a cost of a factor of 2, we will couple the two minimizations by minimizing where

(5)

Note that the maximization of in (and ) is coupled with that of the -related term by constraining the optimization to consider only one row (column) at a time. Clearly, .

Next we focus on , the first term in the maximization of . We first present a lemma analyzing the distribution minimizing it. The lemma provides two important insights. First, it leads to an efficient algorithm for finding a distribution minimizing and second, it is key in proving that for all data matrices satisfying Condition 1 of Definition 4.1, by minimizing we also minimize .

Lemma 5.4.

A minimizer of the function can be found, to any desired precision, in time logarithmic in the inverse precision. Specifically, the function is minimized by the row distribution whose entries are given by the per-row expression of Algorithm 1, evaluated at the unique solution of the normalization equation.³ (³ Notice that the function in question is monotonically decreasing, hence the solution is indeed unique and can be found via binary search.)

Proof.

To find the probability distribution that minimizes the objective, we start by writing it in product form, without loss of generality. That is, we decompose the distribution into a distribution over the rows of the matrix and a distribution within each row. We first prove that (surprisingly) the optimal within-row distributions have a closed-form solution, while the optimal row distribution is efficiently computable.

For any , writing  in terms of we see that is the maximum, over rows , of

(6)

Observe that since is fixed, the only variables in the above expression for each row are the . Lemma 5.2 implies that setting simultaneously minimizes both terms in (6). This means that for every fixed probability distribution , the minimizer of satisfies . Thus, we are left to determine

Unlike the intra-row optimization, the two summands achieve their respective minima at different row distributions. To get some insight into the tradeoff, let us first consider the two extreme cases. In one extreme, minimizing the maximum over rows requires equating the variance-type terms across rows, leading to the distribution we call Row-L1. In the other extreme, equating the range-type terms leads to the plain L1 distribution. Nevertheless, since we wish to minimize the maximum over several functions, we can seek a row distribution under which all of them are equal, i.e., one for which there exists a common value such that, for all rows,

Solving the resulting quadratic equation and selecting the positive root yields equation (7), i.e.,

(7)

Since the quantities under the square root in (7) are all positive we see that it is always possible to find such that all equalities hold, and thus (7) does minimize for every matrix . Moreover, since the right hand side of (7) is strictly decreasing in , binary search finds the unique value of such that . ∎

We now prove that in order to minimize and thus approximately minimize the original function , it suffices, under appropriate conditions, to minimize .

Lemma 5.5.

For every matrix satisfying Condition 1 of Definition 4.1, .

Proof.

We begin with an auxiliary lemma.

Lemma 5.6.

For any two functions , if and , then .

Proof.

Thus, it suffices to evaluate at the distribution minimizing and check that . We know that is of the form for some distribution . Substituting this form into gives (5). Condition 1 of Definition 4.1, i.e., , allows us to pass from (9) to (10). Finally, to pass from (10) to (5) we note that the two maximizations over in (10) involve the same expression, thus externalizing the maximization has no effect.

(9)
(10)

We are now ready to prove our main Theorem.

Proof of Theorem 4.3.

Recall that above we proved that

Let be a minimizer of . According to Lemma 5.5, it is also a minimizer of . Hence, it holds that

as required. ∎

We finish with the proof of Theorem 4.4 analyzing the value of .

Proof of Theorem 4.4.

We start by computing the value of as a function of , for the probability distribution minimizing . Recall that in deriving (7) we established that , where is such that , i.e.,

(12)

This yields the following quadratic equation in

(13)

Treating (13) as an equality and bounding the larger root of the resulting quadratic equation we get

(14)

The second equality is obtained by replacing the relevant quantities with their corresponding expressions from Algorithm 1, line 9. Recall that to prove Theorem 4.3 we proved that if A meets the conditions of Definition 4.1, then

It follows that for ,