Asymmetric Random Projections

06/22/2019 · Nick Ryder, et al.

Random projections (RP) are a popular tool for reducing dimensionality while preserving local geometry. In many applications the data set to be projected is given to us in advance, yet current RP techniques do not make use of information about the data. In this paper, we provide a computationally light way to extract statistics from the data that allows us to design a data-dependent RP with performance superior to data-oblivious RPs. We tackle scenarios such as matrix multiplication and linear regression/classification in which we wish to estimate inner products between pairs of vectors from two possibly different sources. Our technique takes advantage of the difference between the sources and is provably superior to oblivious RPs. Additionally, we provide extensive experiments comparing oblivious RPs with our approach, showing significant performance lifts in fast matrix multiplication, regression, and classification problems.


1 Introduction

The use of random projections (RPs) as a method for data compression has been well studied for the past few decades. RPs provide a theoretical backing for oblivious compression techniques and are employed in multiple scenarios such as sketching [AMS99], linear regression and other classification tasks [FM03], k-nearest neighbors [AI06], fast approximate linear algebra [CW09], and more. One important property of RPs is that they are oblivious, meaning they can be determined without observing the data. While this is useful for many applications, this restriction is unnecessary in several realistic scenarios.

In this paper we tackle the problem of obtaining a data-dependent random projection. We measure the quality of a random projection by its ability to preserve the inner product of a vector pair $x, y$. Specifically, we view a random projection as a tool to obtain a random estimate of $\langle x, y \rangle$ by taking the inner product of the projected versions of $x$ and $y$. Under the requirement of unbiasedness we aim to minimize variance, as is common when dealing with estimators. An oblivious approach must handle the worst-case scenario for $x, y$. However, with access to data we find ourselves in a setting where $x$ and $y$ are random vectors coming from some distribution. This occurs in matrix multiplication, where $x$ and $y$ are columns of two matrices. It also occurs in linear regression or classification, where $x$ is a random data point and $y$ is a regressor chosen at random from some prior distribution.

With $x$ and $y$ being random, the variance associated with the random projection is now over the randomness of both the projection itself and of $x, y$. In what follows we analyze the optimal linear pre-processing for $x$ and $y$ that improves the performance of an oblivious random projection applied to the processed versions. By choosing this black-box approach we have the flexibility to use any of the well-known random projection methods, including sparse random projections [NN13] or the FJLT [AC09], as long as they come with JL-lemma guarantees [JL84]. In addition to the optimal pre-processing transformation we provide a linear-time (in the input size) counterpart. We analyze its performance and show that it is never inferior and often strictly better than performing no pre-processing.

We apply our technique to applications of oblivious random projections. We show how our methods can be used for approximate fast matrix multiplication in a straightforward way. Another application of our methods is linear regression and classification on high dimensional data. Our data-dependent random projection gives rise to a novel technique that includes the standard way of using oblivious random projections for classification/regression, yet can tune itself to the input data and improve the quality of the model, with negligible computation overhead. We empirically test our algorithm with an extensive set of experiments. For approximate matrix multiplication tasks we achieve a reduction of 50% to 60% in MSE with near-zero computational overhead compared with oblivious random projections, on multiple real datasets. For linear regression and classification, when compared to oblivious random projections we achieve a lift of 4% in accuracy for binary classification on the RCV1 dataset, and a decrease of 61.2% in MSE for the regression task on the Slice Localization dataset.

2 Comparison with Previous Results

Random projections (RPs) were originally proposed by Johnson and Lindenstrauss [JL84]. The projection they offer is linear, which in turn is useful for several applications such as sketching [AMS99], linear regression and other classification tasks [FM03], k-nearest neighbors [AI06], fast approximate linear algebra [CW09], and more. These projections were simplified and improved over time [DG03, Ach03, AC09, NN13].

Applications of random projections typically use the fact that for a collection of input vectors, the collection of low dimensional vectors obtained via the random projection approximately preserves the pairwise distances, or inner products. For linear regression one can use the guarantee that $\langle x, w^* \rangle$ is preserved for every data point $x$ and the optimal (unknown) regressor $w^*$. This was used to develop fast algorithms for linear regression [FM03, MM12]. The same property was used for kernel-based linear classification in [RR08], which chooses random features from the kernel space, thereby removing the quadratic dependence on the number of examples from the run-time of SVM. Another application example is fast matrix multiplication. Given two matrices, we can view their matrix product as encoding all the pairwise inner products of the columns of one with the columns of the other. From this perspective, coming up with approximate fast matrix multiplication algorithms is equivalent to quickly estimating the inner products between these columns. One approach is to treat the columns as two separate data sets and compress them with the objective of minimizing the distortion of their inner products [Sar06, Woo14].

The RP methods mentioned above are oblivious to the data. Other techniques provide data-dependent solutions with improved guarantees. A classic example is PCA, or CCA when inner products are taken between vectors from two different sets. These methods provide a deterministic guarantee, but come with a heavy computational cost; even the approximate versions of PCA and CCA (see e.g. [KL15]) are never (quasi-)linear in the input dimension (as opposed to the FJLT [AC09]). Additionally, the bias of the error may be problematic when the objective is not minimizing the mean error. There are several other deterministic methods for dimensionality reduction, with objectives other than preserving inner products, listed in the survey [CG15].

Another data-dependent approach consists of storing the norms of the original vectors in addition to their random projections. In [LHC06] the authors compute the MLE of the inner product or distance based on the RP approximation and the norms. In [KH17] the norms are used to reduce the variance of the RP estimate. These methods are complementary to ours, given that we modify the random projection rather than store additional information. One advantage of our technique is that it can be applied based on a sample of the data, rather than having a hard requirement of observing the entire data in advance. This is key in the application to linear regression / classification (Section 4.2). In [CEM15] the authors provide non-oblivious random projections obtained by distorting the space according to the covariance matrix of the data. Specifically, they propose to multiply the data matrix with a random projection, then orthogonalize the result. The estimates of inner products are no longer unbiased, but the authors show that this non-oblivious projection provides better guarantees for k-means clustering and low-rank approximation. The authors of [SK13] use a mixture of PCA and random projections to obtain a data-dependent projection that can potentially have superior guarantees to oblivious random projections.

3 Data Dependent Random Projections

In what follows we consider the following setup. There are two distributions $\mathcal{D}_x, \mathcal{D}_y$ over vectors of dimension $d$ that are known to us, either completely or via oracle access to i.i.d. samples. We wish to estimate inner products of the form $\langle x, y \rangle$ where $x \sim \mathcal{D}_x$, $y \sim \mathcal{D}_y$. We do so via a linear dimension reduction operator that transforms $x$ and $y$ into $k$-dimensional vectors in a way that the inner product of the reduced vectors approximates $\langle x, y \rangle$.

3.1 Oblivious Random Projections

Consider the case of an oblivious random projection. Here $\Pi$ is a random $k \times d$ matrix and we set $\hat{x} = \Pi x$, $\hat{y} = \Pi y$. We consider the random variable $\langle \hat{x}, \hat{y} \rangle$ as an estimate of $\langle x, y \rangle$. In what follows we provide an asymmetric pre-processing step for $x$ and $y$, applied before the random projection. We analyze its guarantee for any random projection giving an unbiased estimate with a specific variance bound. This is formally defined here:

Definition 3.1.

A valid random projection $\Pi$ mapping dimension $d$ into dimension $k$ is one for which, for some constant $C$ independent of the data or its dimension, and for every pair of vectors $x, y$,
$\mathbb{E}\left[\langle \Pi x, \Pi y \rangle\right] = \langle x, y \rangle \qquad \text{and} \qquad \mathrm{Var}\left(\langle \Pi x, \Pi y \rangle\right) \le \frac{C}{k}\left(\|x\|^2\|y\|^2 + \langle x, y \rangle^2\right).$

As a sanity check we mention that standard random projections are indeed valid random projections. For completeness we prove this for $\Pi$ obtained from i.i.d. entries (the proof is deferred to Appendix A). A similar statement can be made for other random projections, yet this is outside the scope of this paper.

Lemma 3.2.

Let $\Pi$ be a $k \times d$ matrix of i.i.d. entries whose first 4 moments are $0, 1, 0, B$, with $B \le 3$. We have that $\frac{1}{k}\langle \Pi x, \Pi y \rangle$ is an unbiased estimator of $\langle x, y \rangle$ whose variance is at most $\frac{1}{k}\left(\|x\|^2\|y\|^2 + \langle x, y \rangle^2\right)$.
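
For intuition, the following sketch (our own code, not from the paper) checks the lemma empirically for the i.i.d. sign projection scaled by $1/\sqrt{k}$: the estimate is unbiased and its variance is of the order of $(\|x\|^2\|y\|^2 + \langle x, y \rangle^2)/k$.

import numpy as np

rng = np.random.default_rng(0)
d = 300
x, y = rng.normal(size=d), rng.normal(size=d)
exact = x @ y
for k in (5, 50, 500):
    ests = []
    for _ in range(200):
        Pi = rng.choice([-1.0, 1.0], size=(k, d)) / np.sqrt(k)   # i.i.d. signs, scaled by 1/sqrt(k)
        ests.append((Pi @ x) @ (Pi @ y))
    ests = np.array(ests)
    bound = ((x @ x) * (y @ y) + exact ** 2) / k                 # variance bound of Lemma 3.2
    print(k, exact, ests.mean(), ests.var(), bound)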

3.2 Our Solution

Our technique follows from a simple observation. Consider an invertible matrix $M \in \mathbb{R}^{d \times d}$. We choose a different projection for the vectors $x$ and $y$. Specifically, we set
$\hat{x} = \Pi M x, \qquad \hat{y} = \Pi M^{-\top} y,$
where $M^{-\top}$ is the inverse transpose of $M$. For these estimates it is easy to observe that $\langle \hat{x}, \hat{y} \rangle$ remains an unbiased estimate of $\langle x, y \rangle$, since $\langle M x, M^{-\top} y \rangle = x^\top M^\top M^{-\top} y = \langle x, y \rangle$. However, when we use the variance bound for valid random projections we get
$\mathrm{Var}\left(\langle \hat{x}, \hat{y} \rangle\right) \le \frac{C}{k}\left(\|M x\|^2 \|M^{-\top} y\|^2 + \langle x, y \rangle^2\right),$
meaning we replaced the term $\|x\|^2\|y\|^2$ with $\|M x\|^2\|M^{-\top} y\|^2$. Notice that unless $x$ and $y$ have very close directions, the $\|x\|^2\|y\|^2$ term is the dominant one in the variance bound. Now, since our vectors are drawn from known distributions, we can consider a matrix $M$ that minimizes that quantity when averaged over the possible draws. Specifically, we aim to minimize the function
$\Psi(M) = \mathbb{E}_{x \sim \mathcal{D}_x,\, y \sim \mathcal{D}_y}\left[\|M x\|^2 \|M^{-\top} y\|^2\right].$
It turns out that $\Psi$ can be efficiently minimized by applying the technique of CCA (Canonical Correlation Analysis) to the covariance matrices.
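
A minimal sketch (our own, with an arbitrary invertible $M$) of the observation above: the pre-processed pair preserves the inner product exactly, while the quantity $\|Mx\|^2\|M^{-\top}y\|^2$ that drives the variance bound changes and can therefore be optimized.

import numpy as np

rng = np.random.default_rng(1)
d = 50
x, y = rng.normal(size=d), rng.normal(size=d)

M = rng.normal(size=(d, d)) + 5 * np.eye(d)   # some invertible pre-processing matrix
M_inv_T = np.linalg.inv(M).T

xh, yh = M @ x, M_inv_T @ y
print(np.allclose(xh @ yh, x @ y))            # the inner product is preserved exactly

# The dominant term in the variance bound, before and after pre-processing:
print((x @ x) * (y @ y), (xh @ xh) * (yh @ yh))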

Theorem 3.3.

Let $\mathcal{D}_x, \mathcal{D}_y$ be independent distributions over $\mathbb{R}^d$ with second moments $\Sigma_x = \mathbb{E}[x x^\top]$, $\Sigma_y = \mathbb{E}[y y^\top]$. If we decompose $\Sigma_x = L_x L_x^\top$ and $\Sigma_y = L_y L_y^\top$, with $U S V^\top$ the singular value decomposition of $L_x^\top L_y$, the minimizer of $\Psi$ is
$M^* = S^{1/2} U^\top L_x^{-1} .$
Letting $\sigma$ be the vector of the square roots of the eigenvalues of $\Sigma_x \Sigma_y$, we have
$\Psi(M^*) = \min_M \Psi(M) = \Big(\sum_i \sigma_i\Big)^2 .$
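
Since the statement above is compact, the following numerical sketch (our own code, following the CCA-style construction spelled out in the proof in Appendix A) checks the two claims: the constructed $M$ balances the two second moments, and its objective value equals $(\sum_i \sigma_i)^2$, which is never larger than the value obtained with the identity.

import numpy as np

rng = np.random.default_rng(2)
d = 6

def random_psd(dim):
    A = rng.normal(size=(dim, dim))
    return A @ A.T + 0.1 * np.eye(dim)

Sx, Sy = random_psd(d), random_psd(d)
Lx, Ly = np.linalg.cholesky(Sx), np.linalg.cholesky(Sy)
U, s, Vt = np.linalg.svd(Lx.T @ Ly)
M = np.diag(np.sqrt(s)) @ U.T @ np.linalg.inv(Lx)      # the CCA-style minimizer from Appendix A

def psi(M, Sx, Sy):
    # Psi(M) = tr(M Sx M^T) * tr(M^{-T} Sy M^{-1}): the averaged variance term.
    Minv = np.linalg.inv(M)
    return np.trace(M @ Sx @ M.T) * np.trace(Minv.T @ Sy @ Minv)

Minv = np.linalg.inv(M)
print(np.allclose(M @ Sx @ M.T, Minv.T @ Sy @ Minv))   # both second moments are balanced
print(psi(M, Sx, Sy), np.sum(s) ** 2, psi(np.eye(d), Sx, Sy))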

The above theorem provides the optimal solution to the problem of minimizing $\Psi$, but its computational cost may be too steep. Although it is solvable in polynomial time, or, in an approximate version, even in time nearly bilinear in the input and output dimensions, we could have a scenario where the input dimension is quite large and we cannot afford multiple passes over the data. For this scenario we provide an alternative technique that does not achieve the global minimum of $\Psi$ but admits a much simpler solution, and has guarantees that in many settings are sufficiently good. The high-level idea is to ignore all off-diagonal values of $\Sigma_x, \Sigma_y$ and solve the problem assuming they are zero. A very similar idea has proven itself in the field of optimization [DHS11], where the expensive step of normalizing via the covariance matrix is replaced with the analogous step w.r.t. the diagonal. Collecting these statistics can easily be done using a single pass, and the decomposition becomes trivial.

Theorem 3.4.

Let $\mathcal{D}_x, \mathcal{D}_y$ be distributions over $\mathbb{R}^d$ with second moments $\Sigma_x, \Sigma_y$. If we restrict the pre-processing matrix to be diagonal, then we can minimize $\Psi$ with the following: let $u, v$ be the element-wise square roots of the diagonals of $\Sigma_x, \Sigma_y$ respectively, and let $D$ be the diagonal matrix whose $i$'th entry is¹ $\sqrt{v_i / u_i}$. It holds that
$\Psi(D) = \min_{D' \text{ diagonal}} \Psi(D') = \Big(\sum_i u_i v_i\Big)^2 .$

¹ If the diagonal has a zero value, it means we can ignore that entry in the original data. Hence, we assume w.l.o.g. that all diagonal entries are strictly positive.

The above theorem, coupled with the Cauchy-Schwarz inequality, shows that the diagonal approach of Theorem 3.4 can only be better than taking the identity matrix, i.e. using an oblivious random projection: $\Psi(D) = (\sum_i u_i v_i)^2 \le (\sum_i u_i^2)(\sum_i v_i^2) = \Psi(I)$. Although a pathological case can be constructed in which $\Psi(D) = \Psi(I)$, in the sections below we experiment with real data and see that this is rarely the case, meaning that there is a significant gap between $\Psi(D)$ and $\Psi(I)$. The proofs of Theorems 3.3 and 3.4 are given in Appendix A.
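
The following sketch (our own) illustrates the linear-time variant: it estimates only the per-coordinate second moments in a single pass over a sample, builds the diagonal matrix of Theorem 3.4, and compares the empirical objective $\Psi$ for this diagonal against the identity (no pre-processing).

import numpy as np

rng = np.random.default_rng(3)
d, n = 100, 5000

# Two sources with very different per-coordinate scales (synthetic placeholder data).
scales_x = np.exp(rng.normal(size=d))
scales_y = np.exp(rng.normal(size=d))
X = rng.normal(size=(n, d)) * scales_x       # samples from D_x (rows)
Y = rng.normal(size=(n, d)) * scales_y       # samples from D_y (rows)

# Single pass: per-coordinate second moments only.
u = np.sqrt((X ** 2).mean(axis=0))           # sqrt of the diagonal of Sigma_x
v = np.sqrt((Y ** 2).mean(axis=0))           # sqrt of the diagonal of Sigma_y
D = np.sqrt(v / u)                           # the diagonal pre-processing of Theorem 3.4

def psi_diag(D):
    # Empirical Psi for a diagonal pre-processing: E||Dx||^2 * E||D^{-1}y||^2.
    return ((X * D) ** 2).sum(axis=1).mean() * ((Y / D) ** 2).sum(axis=1).mean()

print(psi_diag(D), psi_diag(np.ones(d)))     # Psi(D) is never larger than Psi(I)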

4 Applications

In this section we show how the developed tools can be used to speed up approximate matrix multiplications and improve the quality of linear regression or classification.

4.1 Fast Matrix Multiplication

One natural application in which we want to compress data from two different distributions arises in fast matrix multiplication (FMM). In this context, given two matrices $A \in \mathbb{R}^{d \times n}$ and $B \in \mathbb{R}^{d \times m}$, the $(i,j)$'th entry of $A^\top B$ is $\langle a_i, b_j \rangle$, where $a_i$ is the $i$'th column of $A$ and $b_j$ the $j$'th column of $B$. It follows that in order to compress the matrices for FMM it is sensible to compress their columns. We get the simple re-scaling algorithm for FMM presented in Algorithm 1. Despite the simplicity of this variance scaling trick, in practice we see notable decreases in mean squared error compared with unscaled random projections on a variety of datasets. Details are in §5.

  Input: Two matrices $A \in \mathbb{R}^{d \times n}$, $B \in \mathbb{R}^{d \times m}$
  $D_A \leftarrow$ diagonal matrix with $(D_A)_{ii} = \sqrt{(A A^\top)_{ii}}$
  $D_B \leftarrow$ diagonal matrix with $(D_B)_{ii} = \sqrt{(B B^\top)_{ii}}$
  $D \leftarrow (D_B D_A^{-1})^{1/2}$
  $A \leftarrow D A$, $B \leftarrow D^{-1} B$
  With Random Projection $\Pi$, project the columns of $A$ and $B$ to dimension $k$
  Output: $(\Pi A)^\top (\Pi B)$
Algorithm 1 Fast Variance Scaling FMM
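
Below is a Python sketch of the quick variance-scaled FMM described above (our own code: the helper name scaled_fmm, the eps guard, and the synthetic matrices are ours; the scaling is the diagonal of Theorem 3.4, which is only defined up to a constant factor that cancels in the estimate).

import numpy as np

def scaled_fmm(A, B, k, rng, eps=1e-12):
    """Approximate A.T @ B with the 'quick' diagonal variance scaling.

    A and B have shapes (d, n) and (d, m); we estimate all inner products
    between columns of A and columns of B after projecting to dimension k.
    """
    d = A.shape[0]
    u = np.sqrt((A ** 2).mean(axis=1)) + eps        # sqrt of the diagonal second moment of A's columns
    v = np.sqrt((B ** 2).mean(axis=1)) + eps        # same for B's columns
    scale = np.sqrt(v / u)                          # diagonal D of Theorem 3.4
    Pi = rng.choice([-1.0, 1.0], size=(k, d)) / np.sqrt(k)
    PA = Pi @ (A * scale[:, None])                  # project the columns of D A
    PB = Pi @ (B / scale[:, None])                  # project the columns of D^{-1} B
    return PA.T @ PB                                # unbiased estimate of A.T @ B

rng = np.random.default_rng(4)
d, n, m, k = 2000, 50, 60, 200
A = rng.normal(size=(d, n)) * np.exp(rng.normal(size=(d, 1)))
B = rng.normal(size=(d, m)) * np.exp(rng.normal(size=(d, 1)))
exact = A.T @ B
approx = scaled_fmm(A, B, k, rng)
print(np.mean((approx - exact) ** 2))               # mean squared error of the estimate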

4.2 Linear regression and classification

Commonly in linear learning, either for regression or classification, the input dimension is quite large, possibly larger than the number of examples. A common approach for handling such cases, in order to mitigate both the danger of over-fitting and the large run-time, is to apply a random projection to the data and solve the regression problem on the lower dimensional projected data. We note that these techniques are somewhat different from regularization based techniques, or from methods aiming to find a sparse regressor. The advantage of this method has been established in previous works and, in a nutshell, comes both from having a small number of parameters to learn to begin with, hence obtaining a faster run-time, and from handling settings where the regressor is not necessarily sparse.

The analysis of this approach follows from observing that for a random projection $\Pi$, the optimal regressor $w^*$ and any data point $x$, we have
$\langle \Pi x, \Pi w^* \rangle \approx \langle x, w^* \rangle .$
It follows that by solving the problem on the projected data, our loss is upper bounded by the loss of $\Pi w^*$, which in turn is bounded due to the approximation guarantees of the random projection.

We cannot apply the asymmetric approach naively, as we do not have access to the distribution of the regressor $w^*$. That being said, in most solutions to the problem one typically assumes an isotropic prior (translating to ridge regression), meaning that $\mathbb{E}[w^* w^{*\top}] = cI$ for some scalar $c$. Taking this approach exactly dictates that we pre-process the inputs by multiplying them by $\Sigma_x^{-1/4}$ (up to rotation and scale), or, taking the more practical approach, by $D^{-1/2}$, where $D$ is the diagonal matrix whose $i$'th entry is $\sqrt{d_i}$ and $d_i$ is the expected value of $x_i^2$.

This approach, however, depends too heavily on the prior assumption on $w^*$, which may not be correct. Taking this into account, we consider a more flexible approach by adding a hyper-parameter $\alpha$ and performing a pre-processing step of multiplying the data by $D^{\alpha}$. Setting $\alpha = -1/2$ recovers the above approach, and $\alpha = 0$ recovers the approach of oblivious random projections. For the optimization procedure, it is possible to treat $\alpha$ as a hyper-parameter and use a solver for the linear learning problem. Another option is to use any gradient based solver on the joint space of the low dimensional regressor and $\alpha$. Specifically, we draw a fixed random projection $\Pi$ mapping the input of dimension $d$ into a $k$-dimensional space with $k \ll d$, then minimize
$\sum_i \ell\left(\langle w, \Pi D^{\alpha} x_i \rangle,\; y_i\right).$
Here $x_i, y_i$ are the $i$'th datapoint and label, $\Pi$ and $D$ are fixed as detailed above, $w \in \mathbb{R}^k$ and $\alpha \in \mathbb{R}$ are the parameters to be optimized, and $\ell$ is the loss function (e.g. logistic loss).

Our experiments in Section 5 show that a suitable value for this parameter can significantly improve the performance of regression and classification tasks. Prior to this work, we are aware of two commonly used values of $\alpha$. Oblivious random projections correspond to fixing $\alpha = 0$. Applying the random projection to normalized data corresponds to setting $\alpha = -1$. Our experiments show that often a third, different value of $\alpha$ is far better than these two options, demonstrating the effectiveness of this approach.
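
As a concrete illustration of the pipeline, the sketch below (our own code, on synthetic placeholder data) scales each coordinate by $d_i^{\alpha/2}$, i.e. multiplies by $D^\alpha$ with $D$ the diagonal matrix of per-coordinate standard deviations estimated from the data, projects with a fixed sign matrix, and fits a ridge regressor in the projected space for a few values of $\alpha$; the helper name ridge_mse and the regularization constant are ours.

import numpy as np

rng = np.random.default_rng(5)
n, d, k = 2000, 1000, 64

# Synthetic regression data with heterogeneous coordinate scales (placeholder for a real dataset).
scales = np.exp(rng.normal(size=d))
X = rng.normal(size=(n, d)) * scales
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

second_moments = (X ** 2).mean(axis=0)                    # d_i = E[x_i^2], estimated from the data
Pi = rng.choice([-1.0, 1.0], size=(k, d)) / np.sqrt(k)    # fixed oblivious sign projection

def ridge_mse(alpha, lam=1e-3):
    D_alpha = second_moments ** (alpha / 2.0)             # multiply coordinate i by d_i^{alpha/2}
    Z = (X * D_alpha) @ Pi.T                              # pre-processed, projected data
    w = np.linalg.solve(Z.T @ Z + lam * np.eye(k), Z.T @ y)
    return np.mean((Z @ w - y) ** 2)                      # training MSE in the projected space

for alpha in (-1.0, -0.5, 0.0, 0.5, 1.0):
    print(alpha, ridge_mse(alpha))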

5 Experiments

We proceed to experiments with real and synthetic data. In all of our experiments, in order to reduce the noise coming from the randomness of random projections, we report the empirical mean of 100 trials. Throughout, we use a random projection matrix of i.i.d. signs [Ach03].

5.1 Fast Matrix Multiplication

For FMM we created a collection of matrix pairs obtained either from synthetic data or from real-world public datasets. Due to space restrictions we defer the experiments on synthetic data to Appendix B.

For real data matrices we consider two dense and two sparse datasets obtained from the UCI Machine Learning Repository [Guy08, CF94, DKS18]. The first dense dataset, ARCENE, was obtained by merging three mass-spectrometry datasets which indicate the abundance of proteins in human sera having a given mass value. This dataset has 10000 features and 700 data points. The second dense dataset, Isolet, consists of 1559 data points with 616 features. In this dataset, 150 subjects spoke the name of each letter of the alphabet twice. The features include spectral coefficients, contour features, sonorant features, pre-sonorant features, and post-sonorant features. The first sparse dataset, TW_OC, consists of tweets with geolocation from the Orange County, CA area. The coordinates correspond to the number of times a user visits a certain place; the dataset has 5000 data points with 11346 features. The second sparse dataset, Go_SF, is similarly formatted and consists of check-ins from the app Gowalla, from the San Francisco area. It has 2593 data points with 7706 features.

We convert each of these four datasets into two matrix pairs by splitting their feature sets in half. Thus, if a dataset has $n$ datapoints and $f$ features, we end up with two matrices $X_1, X_2$ of size $n \times f/2$. We then compare either $X_1 X_2^\top$ or $X_1^\top X_2$. This corresponds either to compressing in order to compare the inner products of the data points represented by disjoint feature sets, or to comparing half of the feature vectors against the other half. We refer to these as data and feature. We end up with eight matrix pairs, one for each (dataset, data/feature) tuple. An example to make things concrete: the ARCENE-data matrix pair refers to the pair whose product is $X_1 X_2^\top$ (rather than $X_1^\top X_2$), obtained from the ARCENE dataset.

For each of the pairs we compute the exact product, and 3 approximate matrix products corresponding to three random projections. The first, oblivious, is an oblivious random projection with i.i.d. signs. The second, quick, contains a pre-processing component based only on the diagonal entries of the column covariances of the two matrices. The third, optimal, contains the optimal pre-processing component based on CCA.

For each approximate matrix product we compute the squared error, namely the squared Frobenius norm of the difference between the approximate and the exact product, for multiple values of the target dimension. We repeated the experiment 100 times to account for the noise coming from the randomness of the projection.

Dense Data

For the dense data, we proceed as detailed with the two sets ARCENE and Isolet and the corresponding four matrix pairs. In both the ARCENE-feature and Isolet-feature cases all 3 methods are nearly identical, hence we do not report the exact numbers of the experiment. This occurs because the corresponding covariance matrices are very similar. For ARCENE-data and Isolet-data the 3 methods provide different results. For the ARCENE-data pair, quick random projections yield a 2.33x decrease in MSE, while optimal projections yield a 157.9x decrease in MSE. For the Isolet-data pair, quick random projections yield a 1.123x decrease in MSE, and optimal projections yield a 2.47x decrease in MSE. The plot of MSE against target dimension is given in Figure 1. For better visibility, the y-axis is the log of the MSE. The dotted lines represent one standard deviation. Recalling that the numbers reported are the mean of 100 trials, we expect a Gaussian-like distribution of the measurement error.

Figure 1: Dense Data FMM. Left: ARCENE-data. Right: Isolet-data. The x-axis is the target dimension, the y-axis is the log MSE, and the dotted lines represent lower and upper confidence bounds corresponding to a single standard deviation.
Sparse Data

For sparse datasets we see a significant advantage to our methods for both the data and feature matrix pairs. For Go_SF-data, quick projections yield 2% of the MSE of oblivious projections, while optimal projections yield 0.9% of the MSE of oblivious projections. For Go_SF-feature, quick projections yield 50.1% of the MSE of oblivious projections and optimal projections yield 41.4%. For both TW_OC-data and TW_OC-feature the quick and optimal distortions are nearly indistinguishable. For TW_OC-data the MSE is 0.08% of that of oblivious projections, and for TW_OC-feature it is 2.2%. The plots are given in Figure 2 in a format analogous to Figure 1.

Figure 2: Sparse Data FMM. Top left: Go_SF-data. Top right: Go_SF-feature. Bottom left: TW_OC-data. Bottom right: TW_OC-feature. The x-axis is the target dimension, the y-axis is the log MSE, and the dotted lines represent lower and upper confidence bounds corresponding to a single standard deviation.

5.2 Regression

5.2.1 Linear Regression

We used two data sets from the UCI Machine Learning Repository [Smi06, FG11]. Slice Localization was retrieved from a set of 53500 CT images from 74 different patients. The feature vector consists of two histograms in polar space describing the bone structure and the air inclusions inside the body; the dataset has 53500 samples with 384 (dense) features. The E2006 dataset consists of 10-K reports from thousands of publicly traded U.S. companies, published in 1996–2006, together with stock return volatility measurements in the twelve-month periods before and after each report. The data is encoded using term frequency-inverse document frequency, resulting in a sparse dataset with 16087 samples and 150630 features.

In our experiments we apply the projection, with a preprocessing step parameterized by a scalar $\alpha$ as described above, to the datasets. Once the data is projected we solve the regression problem in the low dimensional space. Table 1 shows the mean square error (MSE) ± one standard deviation for different values of $\alpha$ and of the projection target dimension $k$. For the Slice Localization dataset we observe near-optimal performance around $\alpha \in [-0.75, -0.5]$, giving significantly better results than the classic random projection corresponding to $\alpha = 0$ and cutting the MSE by a factor of more than 2. Even compared to $\alpha = -1$, corresponding to the strategy of normalizing the data before applying the RP, for some values of $k$ there is a statistically significant advantage for the best value of $\alpha$ (see, e.g., the largest target dimension in Table 1).

For E2006 the optimal value of $\alpha$ is, interestingly, around $1.75$, giving empirical justification to setting $\alpha$ as a (possibly trainable) parameter rather than fixing it as a constant. For negative values of $\alpha$ the results were worse than those of positive values and are not reported. The improvement over oblivious RPs ($\alpha = 0$) is mild compared to the Slice dataset, giving e.g. a 1.3% reduction in MSE for the largest target dimension, as opposed to cutting it by half. Nevertheless, considering the standard deviation, the improvement remains statistically significant.

E2006
α: 0.0 0.25 0.5 0.75 1.0 1.25 1.5 1.75 2.0
266.5 265.4 265.5 265.5 265.3 265.5 265.5 265.6 265.5
1.0 0.3 0.3 0.3 0.4 0.2 0.3 0.3 0.3
265.7 265.1 265.0 264.6 264.6 264.7 264.6 264.9 264.7
0.5 0.4 0.4 0.4 0.6 0.4 0.6 0.6 0.6
265.4 264.6 264.0 263.8 264.1 263.7 263.4 263.5 263.1
0.6 0.6 0.5 0.6 0.6 0.5 0.6 0.7 0.5
265.3 264.4 263.7 263.3 263.2 263.1 262.9 262.7 262.6
0.7 0.6 0.7 0.5 0.7 0.6 0.6 0.7 0.5
265.2 264.0 263.7 262.8 262.6 262.4 262.3 261.8 261.9
0.7 0.5 0.9 0.7 0.7 0.6 0.6 0.5 0.5
Slice Localization
α: -2.0 -1.75 -1.5 -1.25 -1.0 -0.75 -0.5 -0.25 0.0
23.9 24.5 21.8 22.9 23.2 21.4 22.9 39.8 66.1
1.0 0.9 1.0 1.4 1.7 1.0 2.5 7.5 18.6
21.3 20.7 19.6 19.9 18.7 19.2 18.9 24.8 45.4
1.3 1.2 1.2 1.4 2.1 1.6 2.8 4.9 8.7
18.9 19.2 18.4 17.6 16.3 15.5 13.5 25.1 38.7
0.9 0.8 0.7 1.0 1.5 1.9 1.0 2.1 4.2
18.1 17.1 17.5 17.2 15.5 14.1 13.1 19.5 33.6
0.5 0.2 0.6 0.7 1.2 1.9 1.2 3.8 7.9
17.0 17.0 16.2 15.0 12.9 11.5 14.0 15.5 29.7
0.3 0.3 1.1 0.6 0.7 0.5 1.2 1.7 7.8
Table 1: Linear Regression. Detailed in §5.2.1. Results contain the mean square error of the trained model for different target dimensions $k$ and different values of $\alpha$, for the E2006 and Slice Localization datasets. Each pair of rows corresponds to one target dimension (in increasing order of $k$): the first row of the pair is the mean MSE over 100 trials and the second row is one standard deviation.

5.2.2 Logistic Regression

Here we used two datasets with a classification task solved via logistic regression [Kri09, AG13]. The first dataset is Cifar10, a well known dataset used for image classification. It consists of 32×32 color images, resulting in a 3072-dimensional data space. We take 1979 samples and project each of them 100 times. The second is the Reuters Corpus Volume I (RCV1), an archive of over 800,000 manually categorized newswire stories made available by Reuters, Ltd. for research purposes. In our experiments we used 20242 stories, each encoded using term frequency-inverse document frequency, resulting in 47236 dimensions.

Our experiments are the same as in the linear regression case. We project each dataset to several target dimensions, with several preprocessing steps corresponding to different values of $\alpha$. The results are given in Table 2, where for every target dimension and $\alpha$ value we report the accuracy score ± a single standard deviation. For Cifar10 the result is not very sensitive to the value of $\alpha$ as long as it is roughly in the range $[-1.25, 0]$; tuning $\alpha$ does not provide a statistically significant improvement over the oblivious RP. For RCV1, however, we see a clear gain from tuning $\alpha$, providing, e.g. for the largest target dimensions, a lift of roughly $4\%$ in accuracy compared to oblivious RPs.

As with linear regression, we see that different datasets and tasks correspond to different optimal values of $\alpha$, justifying our proposal to tune it as part of the learning process.

α: -1.5 -1.25 -1.0 -0.75 -0.5 -0.25 0.0 0.25 0.5 0.75 1.0 1.25 1.5
Cifar10
49.5 67.2 68.5 67.9 67.9 69.3 68.2 64.2 64.3 63.7 65.2 64.5 63.6
2.5 4.4 3.9 4.6 4.0 3.3 4.2 4.8 5.0 5.0 4.9 4.6 5.0
48.7 71.8 72.2 72.0 71.8 71.9 72.5 68.5 68.6 68.5 69.5 68.5 68.8
1.1 2.9 3.2 3.4 3.4 3.2 3.2 4.2 3.9 3.6 3.7 4.2 3.7
48.6 73.0 74.0 74.1 73.9 74.5 74.5 71.3 71.1 71.8 71.7 71.6 71.1
0.9 2.5 2.9 2.7 2.9 2.4 2.9 3.5 3.7 3.4 3.4 3.3 3.3
48.4 73.8 75.3 75.9 75.2 76.0 76.0 73.3 73.1 73.5 73.4 73.8 73.6
0.4 2.2 2.4 2.2 2.3 2.5 2.6 2.8 3.0 3.1 3.0 2.8 3.4
48.4 74.4 76.7 76.6 76.4 76.3 76.6 74.3 74.8 74.8 74.9 74.9 75.4
0.4 1.8 2.5 2.3 2.3 2.1 2.6 2.8 3.0 3.2 2.8 2.8 2.4
RCV1
50.2 50.4 49.9 56.6 56.0 55.4 56.7 57.9 58.6 56.9 56.9 56.9 56.9
1.5 1.6 1.6 0.7 0.9 1.3 1.8 2.0 1.6 0.0 0.1 0.1 0.0
49.8 50.0 50.2 57.1 56.6 57.2 58.2 61.7 60.9 56.9 56.9 56.9 56.9
1.0 1.4 1.3 0.8 0.8 1.3 2.0 2.7 2.5 0.0 0.1 0.1 0.1
50.3 50.1 50.8 56.5 56.2 58.4 61.2 62.6 60.8 56.9 56.9 56.9 56.9
1.4 1.1 1.1 1.0 0.8 1.5 1.9 2.1 2.0 0.1 0.0 0.0 0.0
50.6 51.0 50.5 56.7 56.7 59.1 61.2 65.2 60.9 56.9 56.9 56.9 56.9
1.2 1.7 1.1 0.9 1.1 1.6 1.2 1.6 2.0 0.0 0.0 0.0 0.1
50.2 50.6 51.2 56.9 57.4 59.0 62.3 65.9 60.8 56.9 56.9 56.9 56.9
1.5 1.3 1.6 1.0 1.2 1.7 2.3 1.2 1.5 0.0 0.0 0.1 0.1
Table 2: Logistic Regression. Detailed in §5.2.2. Results contain the accuracy of the trained model for different target dimensions $k$ and different values of $\alpha$, for the Cifar10 and RCV1 datasets. Each pair of rows corresponds to one target dimension (in increasing order of $k$): the first row of the pair is the mean accuracy over 100 trials and the second row is one standard deviation.

6 Future Directions

In this paper we explore the simplest first step in the study of data-dependent unbiased random projections. In doing so we restrict ourselves to linear projections in which the output coordinates are independent. An interesting direction to explore is what can be achieved if the output dimensions are dependent. Can we obtain stronger results with a non-linear pre-processing step? Can we achieve stronger results with a non-linear projection? Beyond improved guarantees, the motivation for these methods comes from their being applicable to the symmetric setting where both distributions are the same, a setting in which our techniques fall back to standard random projections.

References

  • [AC09] Nir Ailon and Bernard Chazelle. The fast johnson–lindenstrauss transform and approximate nearest neighbors. SIAM Journal on computing, 39(1):302–322, 2009.
  • [Ach03] Dimitris Achlioptas. Database-friendly random projections: Johnson-lindenstrauss with binary coins. Journal of computer and System Sciences, 66(4):671–687, 2003.
  • [AG13] Massih-Reza Amini and Cyril Goutte. Reuters rcv1 rcv2 multilingual, multiview text categorization test collection data set. UCI Machine Learning Repository, 2013.
  • [AI06] Alexandr Andoni and Piotr Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In Foundations of Computer Science, 2006. FOCS’06. 47th Annual IEEE Symposium on, pages 459–468. IEEE, 2006.
  • [AMS99] Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. Journal of Computer and system sciences, 58(1):137–147, 1999.
  • [CEM15] Michael B Cohen, Sam Elder, Cameron Musco, Christopher Musco, and Madalina Persu. Dimensionality reduction for k-means clustering and low rank approximation. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing, pages 163–172. ACM, 2015.
  • [CF94] Ron Cole and Mark Fanty. Isolet data set. UCI Machine Learning Repository, 1994.
  • [CG15] John P Cunningham and Zoubin Ghahramani. Linear dimensionality reduction: Survey, insights, and generalizations. The Journal of Machine Learning Research, 16(1):2859–2900, 2015.
  • [CW09] Kenneth L Clarkson and David P Woodruff. Numerical linear algebra in the streaming model. In Proceedings of the forty-first annual ACM symposium on Theory of computing, pages 205–214. ACM, 2009.
  • [DG03] Sanjoy Dasgupta and Anupam Gupta. An elementary proof of a theorem of johnson and lindenstrauss. Random Structures & Algorithms, 22(1):60–65, 2003.
  • [DHS11] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
  • [DKS18] Dimitrios Kotzias, Moshe Lichman, and Padhraic Smyth. Repeat consumption matrices data set. UCI Machine Learning Repository, 2018.
  • [FG11] F. Graf, H.P. Kriegel, M. Schubert, S. Poelsterl, and A. Cavallaro. Relative location of CT slices on axial axis data set. UCI Machine Learning Repository, 2011.
  • [FM03] Dmitriy Fradkin and David Madigan. Experiments with random projections for machine learning. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 517–522. ACM, 2003.
  • [Guy08] Isabelle Guyon. Arcene data set. UCI Machine Learning Repository, 2008.
  • [JL84] William B Johnson and Joram Lindenstrauss. Extensions of lipschitz mappings into a hilbert space. Contemporary mathematics, 26(189-206):1, 1984.
  • [KH17] Keegan Kang and Giles Hooker. Control variates as a variance reduction technique for random projections. In International Conference on Pattern Recognition Applications and Methods, pages 1–20. Springer, 2017.
  • [KL15] Zohar Karnin and Edo Liberty. Online pca with spectral bounds. In Conference on Learning Theory, pages 1129–1140, 2015.
  • [Kri09] Alex Krizhevsky. The cifar-10 dataset. Alex Krizhevsky’s Personal Webpage, 2009.
  • [LHC06] Ping Li, Trevor J Hastie, and Kenneth W Church. Improving random projections using marginal information. In International Conference on Computational Learning Theory, pages 635–649. Springer, 2006.
  • [MM12] Odalric-Ambrym Maillard and Rémi Munos. Linear regression with random projections. Journal of Machine Learning Research, 13(Sep):2735–2772, 2012.
  • [NN13] Jelani Nelson and Huy L Nguyên. Osnap: Faster numerical linear algebra algorithms via sparser subspace embeddings. In Foundations of Computer Science (FOCS), 2013 IEEE 54th Annual Symposium on, pages 117–126. IEEE, 2013.
  • [RR08] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in neural information processing systems, pages 1177–1184, 2008.
  • [Sar06] Tamas Sarlos. Improved approximation algorithms for large matrices via random projections. FOCS, pages 143—152, 2006.
  • [SK13] Jaydeep Sen and Harish Karnick. Informed weighted random projection for dimension reduction. In International Conference on Advanced Data Mining and Applications, pages 433–442. Springer, 2013.
  • [Smi06] Noah Smith. 10-k corpus. UCI Machine Learning Repository, 2006.
  • [Woo14] D. Woodruff. Sketching as a tool for numerical linear algebra. Found. Trends Theor. Comput. Sci., 10(1–2):1—157, 2014.

Appendix A Proofs of Section 3

Proof of Lemma 3.2.

We first observe that the estimate is indeed unbiased. We then compute its variance. To that end, we begin with the case of target dimension $k = 1$, where the dimension reduction is performed with a $d$-dimensional vector $s$ of i.i.d. entries, and denote by $s_i$ its $i$'th element. We compute the second moment of the estimate explicitly, using the assumptions on the first four moments of the entries. From it we get the variance of the estimate of $\langle x, y \rangle$.

For the case of $\Pi$ being a matrix with target dimension $k$, since everything is i.i.d., the estimate is simply an average of $k$ independent estimates of target dimension 1, and the variance is the same as above, divided by $k$. ∎
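
For concreteness, one way to carry out the moment computation in the $k = 1$ case (our own derivation, under the moment assumptions $\mathbb{E}[s_i] = \mathbb{E}[s_i^3] = 0$, $\mathbb{E}[s_i^2] = 1$, $\mathbb{E}[s_i^4] = B$ stated above) is:

\begin{align*}
\mathbb{E}\big[(s^\top x)(s^\top y)\big]
  &= \sum_{i,j} \mathbb{E}[s_i s_j]\, x_i y_j = \sum_i x_i y_i = \langle x, y\rangle, \\
\mathbb{E}\big[(s^\top x)^2 (s^\top y)^2\big]
  &= \sum_{i,j,k,l} \mathbb{E}[s_i s_j s_k s_l]\, x_i x_j y_k y_l
   = B \sum_i x_i^2 y_i^2 + \sum_{i \neq k} x_i^2 y_k^2 + 2 \sum_{i \neq j} x_i y_i x_j y_j \\
  &= \|x\|^2 \|y\|^2 + 2\langle x, y\rangle^2 + (B - 3)\sum_i x_i^2 y_i^2, \\
\operatorname{Var}\big[(s^\top x)(s^\top y)\big]
  &= \|x\|^2 \|y\|^2 + \langle x, y\rangle^2 + (B - 3)\sum_i x_i^2 y_i^2
   \;\le\; \|x\|^2 \|y\|^2 + \langle x, y\rangle^2 \qquad (B \le 3).
\end{align*}

Averaging $k$ independent such estimates then divides the variance by $k$, matching the bound stated in Lemma 3.2.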

Proof of Theorem 3.3.

Let $\Sigma_x = L_x L_x^\top$ and $\Sigma_y = L_y L_y^\top$ be arbitrary factorizations. Using the independence of $x$ and $y$ we get
$\Psi(M) = \mathbb{E}\|M x\|^2 \cdot \mathbb{E}\|M^{-\top} y\|^2 = \mathrm{tr}(M \Sigma_x M^\top)\,\mathrm{tr}(M^{-\top} \Sigma_y M^{-1}) = \|M L_x\|_F^2\, \|M^{-\top} L_y\|_F^2 .$

We use the Frobenius inner product and the Cauchy-Schwarz inequality stating that $\langle A, B \rangle_F \le \|A\|_F \|B\|_F$. It implies the universal lower bound
$\mathrm{tr}(L_x^\top L_y)^2 = \langle M L_x, M^{-\top} L_y \rangle_F^2 \le \|M L_x\|_F^2\, \|M^{-\top} L_y\|_F^2 = \Psi(M).$

Notice that this inequality holds for any factorization matrices $L_x, L_y$ and any invertible matrix $M$. Furthermore, the left-hand side is independent of $M$ and the right-hand side is independent of the factorization. It follows that if we find a specific triplet $(L_x, L_y, M)$ for which equality holds, we get that the factorization maximizes the left expression and $M$ minimizes the right expression. Now, for two matrices it holds that $\langle A, B \rangle_F = \|A\|_F \|B\|_F$ only if $A = cB$ for some scalar $c$. It follows that w.l.o.g. our matrix $M$ is such that
$M L_x = c\, M^{-\top} L_y ,$
or conversely
$M \Sigma_x M^\top = c^2\, M^{-\top} \Sigma_y M^{-1} .$
Such a matrix can be found via CCA.

Lemma A.1 (CCA).

Given two positive definite matrices $\Sigma_x, \Sigma_y$, we can pick a matrix $M$ such that
$M \Sigma_x M^\top = M^{-\top} \Sigma_y M^{-1} .$
If we decompose $\Sigma_x = L_x L_x^\top$, $\Sigma_y = L_y L_y^\top$, and $L_x^\top L_y = U S V^\top$, then we can set
$M = S^{1/2} U^\top L_x^{-1} .$
Furthermore, this choice of $M$ is independent of the decomposition of $\Sigma_x$ or $\Sigma_y$.

We choose $L_x$ arbitrarily and set $L_y$ in a way that $L_x^\top L_y$ is symmetric and positive definite. This can be done by choosing an arbitrary factorization $\tilde{L}_y$, decomposing $L_x^\top \tilde{L}_y = U S V^\top$ and setting $L_y = \tilde{L}_y V U^\top$. Since $U, V$ are orthogonal matrices we still have $L_y L_y^\top = \Sigma_y$. We also get that
$L_x^\top L_y = L_x^\top \tilde{L}_y V U^\top = U S V^\top V U^\top = U S U^\top .$

Using the terms above we can plug in the equation for $M$, the optimizer of the CCA problem, and obtain
$M L_x = S^{1/2} U^\top , \qquad M^{-\top} L_y = S^{-1/2} U^\top L_x^\top L_y = S^{-1/2} U^\top U S U^\top = S^{1/2} U^\top .$

Since we chose $L_y$ in a way that $L_x^\top L_y$ is psd, we get that $M L_x = M^{-\top} L_y$ and therefore
$\langle M L_x, M^{-\top} L_y \rangle_F = \|M L_x\|_F\, \|M^{-\top} L_y\|_F$
as required.

Now that we have obtained the minimizer we can compute the value of the expression. $\mathrm{tr}(L_x^\top L_y)$ equals the sum of the eigenvalues of the matrix $L_x^\top L_y$, which are the same as its singular values since it is psd. That is, the value of $\Psi(M)$ is the square of the sum of the elements of $S$. Notice that these elements do not depend on our choice of $L_x$, as that choice only affects the rotations $U, V$. With that in mind we can compute the values of $S$ by considering the decompositions $\Sigma_x = W_x \Lambda_x W_x^\top$, $\Sigma_y = W_y \Lambda_y W_y^\top$, where these are the singular value decompositions of $\Sigma_x, \Sigma_y$; we get that these values are exactly the singular values of $\Lambda_x^{1/2} W_x^\top W_y \Lambda_y^{1/2}$, i.e. the square roots of the eigenvalues of $\Sigma_x \Sigma_y$, as required. ∎

Proof of Theorem 3.4.

In what follows we use the fact that the Frobenius inner product of a matrix $A$ with a diagonal matrix only depends on the diagonal entries of $A$. That is, if $D$ is diagonal then for any matrix $A$ we have $\langle A, D \rangle_F = \langle \bar{A}, D \rangle_F$, where $\bar{A}$ is the matrix that is equal to $A$ on the diagonal and zero elsewhere. Using the fact that $M$ is diagonal we get
$\Psi(M) = \mathrm{tr}(M \Sigma_x M)\,\mathrm{tr}(M^{-1} \Sigma_y M^{-1}) = \Big(\sum_i M_{ii}^2 (\Sigma_x)_{ii}\Big)\Big(\sum_i M_{ii}^{-2} (\Sigma_y)_{ii}\Big).$

This simple calculation shows us that restricting to diagonal pre-processing is equivalent to throwing away all of the off-diagonal information in our covariance matrices and proceeding with the result of Theorem 3.3. The claim trivially follows. ∎
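
Spelling out that last step (our own phrasing), write $t_i = M_{ii}^2$, $a_i = (\Sigma_x)_{ii} = u_i^2$ and $b_i = (\Sigma_y)_{ii} = v_i^2$; then

\begin{align*}
\Psi(M) = \Big(\sum_i t_i a_i\Big)\Big(\sum_i \frac{b_i}{t_i}\Big)
 \;\ge\; \Big(\sum_i \sqrt{a_i b_i}\Big)^2 = \Big(\sum_i u_i v_i\Big)^2
\end{align*}

by Cauchy-Schwarz, with equality when $t_i a_i \propto b_i / t_i$, i.e. $M_{ii} = (b_i / a_i)^{1/4} = \sqrt{v_i / u_i}$, matching the matrix $D$ of Theorem 3.4 (the proportionality constant is irrelevant since $\Psi(cM) = \Psi(M)$).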

Appendix B Additional Experiments

B.1 FMM Synthetic Data

For synthetic matrices we define a few distributions over matrices. In the first, called diag, we first sample a $d$-dimensional vector of i.i.d. Laplace variables. This determines a diagonal covariance matrix. We then sample i.i.d. rows with this covariance to construct the matrix. In the second distribution, called uniform, we sample i.i.d. uniform variables as the eigenvalues of the covariance and a random rotation for its eigenvectors. With the covariance matrix ready, we sample i.i.d. rows. The third type, called unifskew, is obtained by averaging two independent matrices, one drawn from diag and one from uniform. The pairs of synthetic matrices we consider are diag-diag, uniform-diag, uniform-unifskew, and uniform-uniform. Every pair of matrices consists of two independently drawn matrices from the mentioned distributions.

For each of the pairs we compute the exact product, and 3 approximate matrix products corresponding to three random projections. The first, oblivious, is an oblivious random projection with i.i.d. signs. The second, quick, contains a pre-processing component based only on the diagonal entries of the column covariances of the two matrices. The third, optimal, contains the optimal pre-processing component based on CCA. For each approximate matrix product we compute the squared error, namely the squared Frobenius norm of the difference between the approximate and the exact product, for multiple values of the target dimension. We repeated the experiment 100 times to account for the noise coming from the randomness of the projection.

For each of the three distributions detailed above (diag, uniform, unifskew) we draw 1000 samples. We then form matrices using these samples as columns, and look at the squared error over 100 random projections. The first experiment looks at the mean squared error for two matrices, both with columns drawn from the diag distribution. For this experiment quick and optimal are equivalent (since the covariance matrix is diagonal) and net around a 3x decrease in MSE, regardless of the target dimension. When comparing diag against unif, we use the same methodology. Here we see quick yields approximately a 1.5x decrease in MSE over oblivious, and optimal yields another 1.5x decrease in MSE over quick. When comparing unif against unifskew we get the same decreases as in the previous experiment across all target dimensions. Finally we compare unif against unif. Here the quick projections yield the same MSE as oblivious projections, while optimal projections yield a 2x decrease in MSE.

These results demonstrate what we expect: when drawing from distributions whose covariance is mostly concentrated on the diagonal, quick random projections perform almost as well as optimal ones while requiring significantly less preprocessing time. In Figure 3 we plot the results of these experiments, plotting the mean squared error against different target dimensions. For each plot we also include dashed lines to show the standard deviation among the 100 random projections.

Figure 3: Synthetic Data FMM. Detailed in §B.1. Top left compares diag to diag. Top right compares unif to diag. Bottom left compares unif to unifskew. Bottom right compares unif to unif.