Effective and Sparse Count-Sketch via k-means clustering

11/24/2020 ∙ by Yuhan Wang, et al. ∙ Hong Kong Baptist University

Count-sketch is a popular matrix sketching algorithm that can produce a sketch of an input data matrix X in O(nnz(X)) time, where nnz(X) denotes the number of non-zero entries in X. The sketched matrix is much smaller than X while preserving most of its properties. Therefore, count-sketch is widely used for addressing the high-dimensionality challenge in machine learning. However, count-sketch has two main limitations: (1) the sketching matrix used in count-sketch is generated randomly and does not consider any intrinsic properties of the data X. This data-oblivious matrix sketching can produce a poor sketched matrix, which results in low accuracy for subsequent machine learning tasks (e.g., classification); (2) for highly sparse input data, count-sketch can produce a dense sketched data matrix. This dense sketch can make the subsequent machine learning tasks more computationally expensive than working on the original sparse data X. To address these two limitations, we first show an interesting connection between count-sketch and k-means clustering by analyzing the reconstruction error of the count-sketch method. Based on our analysis, we propose to reduce the reconstruction error of count-sketch by using the k-means clustering algorithm to obtain the low-dimensional sketched matrix. In addition, we propose to solve k-means clustering using gradient descent with L1-ball projection to produce a sparse sketched matrix. Our experimental results on six real-life classification datasets demonstrate that our proposed method achieves higher accuracy than the original count-sketch and other popular matrix sketching algorithms. Our results also demonstrate that our method produces a sparser sketched data matrix than other methods, and therefore its prediction cost is smaller than that of other matrix sketching methods.


Introduction

Matrix sketching Woodruff (2014) is a powerful dimensionality reduction method that can efficiently find a small matrix to replace the original large matrix while preserving most of its properties. For a large input data matrix X ∈ ℝ^{n×d}, where n is the number of samples and d is the number of features, matrix sketching methods generate a sketch of X by multiplying it with a random sketching matrix S ∈ ℝ^{d×r} (r ≪ d) with certain properties. Compared with traditional dimensionality reduction methods (e.g., Principal Component Analysis (PCA) Jolliffe (2011)), matrix sketching methods can obtain the sketched matrix very efficiently with certain theoretical guarantees Woodruff (2014). Therefore, matrix sketching has gained significant research attention and has been widely used for handling high-dimensional data in machine learning Mahoney (2011); Ailon and Chazelle (2006); Bojarski et al. (2017); Choromanski et al. (2017).

A typical way of applying matrix sketching to machine learning problems is "sketch and solve" Dahiya et al. (2018). For example, in a linear classification problem with training data (X, y), where X ∈ ℝ^{n×d} is a large input feature matrix and y ∈ ℝ^{n} is the corresponding label vector, a classification model can be trained by solving min_w Σ_{i=1}^{n} ℓ(w^⊤x_i, y_i), where ℓ(·,·) denotes a loss function (e.g., hinge loss). By using matrix sketching, we can first obtain a sketched data matrix X' = XS ∈ ℝ^{n×r} and then solve the much smaller problem min_{w'} Σ_{i=1}^{n} ℓ(w'^⊤x'_i, y_i). In this way, the expensive computation on the original large matrix X is replaced by computation on the small matrix X'. This sketch-and-solve approach has also been used to speed up other machine learning tasks, such as least squares regression Dobriban and Liu (2019), low-rank approximation Tropp et al. (2017); Clarkson and Woodruff (2017) and k-means clustering Boutsidis et al. (2010); Liu et al. (2017).
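
To make the sketch-and-solve idea concrete, the following minimal Python sketch (our illustration, not the paper's code) draws a Gaussian sketching matrix S, forms X' = XS, and fits a small ridge-regression model on the sketched features; the data, dimensions and regularization value are arbitrary placeholders.

import numpy as np

rng = np.random.default_rng(0)
n, d, r, lam = 5000, 2000, 128, 1.0           # samples, features, sketch size, ridge

X = rng.standard_normal((n, d))               # stand-in for a large feature matrix
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

S = rng.standard_normal((d, r)) / np.sqrt(r)  # data-oblivious Gaussian sketching matrix
X_sk = X @ S                                  # sketched data: n x r instead of n x d

# "Solve" on the much smaller problem: ridge regression in the sketched space.
w = np.linalg.solve(X_sk.T @ X_sk + lam * np.eye(r), X_sk.T @ y)
print("relative training error:",
      np.linalg.norm(X_sk @ w - y) / np.linalg.norm(y))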

Recent advances in randomized numerical linear algebra Martinsson and Tropp (2020) have provided a solid theoretical foundation for matrix sketching. Various methods have been proposed to construct the random matrix S. The early method of Dasgupta and Gupta (1999) constructs a dense random Gaussian matrix in which each element of S is generated from a Gaussian distribution. This dense random Gaussian matrix requires O(ndr) time for computing the sketched matrix XS. Achlioptas (2003) proposed to generate a sparser random matrix in which each element of S is drawn from {−1, 0, 1} following a discrete distribution, which reduces the computational cost by roughly a factor of three since two thirds of the entries of S are zero in expectation. In recent years, two well-known fast random projection methods were proposed for efficiently computing the projection XS. The first is the Subsampled Randomized Hadamard Transform (SRHT), which computes XS in O(nd log d) time Tropp (2011); Ailon and Liberty (2009). The second is count-sketch Clarkson and Woodruff (2017), which computes XS in O(nnz(X)) time for any input X, making count-sketch particularly suitable for sparse input data. In this paper, we focus on improving the count-sketch algorithm in the context of classification.

Count-sketch constructs the random matrix S as the product of two matrices D and Φ, i.e., S = DΦ, where D ∈ ℝ^{d×d} is a random diagonal matrix whose diagonal values are uniformly chosen from {+1, −1} and Φ ∈ {0,1}^{d×r} is a very sparse matrix in which each row has only one randomly selected entry equal to 1 and all other entries are 0. Paul et al. (2014) applied count-sketch to linear SVM classification and showed that a linear SVM trained on the sketched data matrix achieves generalization ability comparable to training in the original space. However, count-sketch has two main limitations: (1) it is a data-oblivious method in which the generation of the sketching matrix S is totally independent of the input data matrix X, and therefore the sketched matrix may not be effective for the subsequent classification algorithm; (2) the sketched data matrix XS does not maintain the same sparsity rate as the original input data X, which can make the subsequent classification algorithm on the sketched data more computationally expensive than on the original data X. Even though data-oblivious matrix sketching has been extensively studied, few studies focus on efficient data-dependent matrix sketching. Recently, Xu et al. (2017) proposed to use an approximate singular value decomposition (SVD) as the projection subspace, and Lei and Lan (2020) proposed to improve SRHT by non-uniform sampling that exploits data properties. However, both of these methods produce a dense sketched matrix for sparse input data.

In this paper, we focus on addressing the aforementioned two limitations of count-sketch. To address the first limitation, we show an interesting connection between count-sketch and k-means clustering by analyzing the reconstruction error of count-sketch. Based on our analysis, we propose to reduce the reconstruction error of count-sketch by using k-means clustering to obtain the low-dimensional sketched data matrix. To address the second limitation, we propose to obtain sparse cluster centers by optimizing the k-means objective function using gradient descent with ℓ1-ball projection in each iteration. Finally, we compare our proposed methods with five other popular matrix sketching algorithms on six real-life datasets. Our experimental results clearly demonstrate that our proposed data-dependent matrix sketching methods achieve higher accuracy than count-sketch and other random matrix sketching algorithms. Our results also show that our method produces a sparser sketched data matrix than count-sketch and the other matrix sketching methods, and that its prediction cost is correspondingly smaller.

Preliminaries

Randomized Matrix Sketching

Given a data matrix X ∈ ℝ^{n×d} and a random sketching matrix S ∈ ℝ^{d×r} with r ≪ d, a sketched matrix X' is produced by

X' = XS.    (1)

Note that the matrix S is randomly generated and is independent of the input data X. As shown in the following Johnson-Lindenstrauss Lemma (JL lemma), randomized matrix sketching can preserve the pairwise distances of all data points in the sketched data matrix X'.

Lemma 1 (Johnson-Lindenstrauss Lemma (JL lemma) Johnson and Lindenstrauss (1984)).

For any 0 < ε < 1 and any integer n, let r = O(ε⁻² log n) and let S ∈ ℝ^{d×r} be a (suitably scaled) random orthonormal matrix. Then for any set of n points in ℝ^d, the following inequality about the pairwise distance between any two data points x_i and x_j in the set holds with high probability:

(1 − ε)‖x_i − x_j‖² ≤ ‖x_i^⊤S − x_j^⊤S‖² ≤ (1 + ε)‖x_i − x_j‖².
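
As a quick empirical sanity check of this distance-preserving behavior (an illustration only, with arbitrary sizes), one can sketch random points with a scaled Gaussian matrix and compare pairwise distances before and after sketching:

import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
n, d, r = 200, 10000, 400
X = rng.standard_normal((n, d))
S = rng.standard_normal((d, r)) / np.sqrt(r)   # scaled Gaussian sketching matrix

ratios = pdist(X @ S) / pdist(X)               # sketched / original pairwise distances
print(ratios.min(), ratios.max())              # typically within a few percent of 1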

Count-Sketch

Among various methods for constructing the sketching matrix S, count-sketch (also called sparse embedding) is well suited for sparse input data since it computes XS in O(nnz(X)) time. Count-sketch Clarkson and Woodruff (2017) constructs the random matrix S as S = DΦ, where D and Φ are defined as follows:

  • D ∈ ℝ^{d×d} is a diagonal matrix with each diagonal entry independently chosen to be +1 or −1 with probability 0.5.

  • Φ ∈ {0,1}^{d×r} is a binary matrix with Φ_{i,h(i)} = 1 and all remaining entries 0, where h : [d] → [r] is a random map such that for any i ∈ [d], h(i) = j for j ∈ [r] with probability 1/r.

Note that the random sketching matrix S = DΦ in count-sketch is a very sparse matrix in which each row has only one nonzero entry. The position of this nonzero entry is chosen uniformly at random and its value is either +1 or −1 with probability 0.5. XS can be computed in O(nnz(X)) time because each nonzero entry in X is multiplied by at most one nonzero entry in S.
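
The construction above is straightforward to implement. The following Python sketch (ours, not the authors') builds S = DΦ as a sparse matrix from a random sign vector and a random hash map h, so that XS costs O(nnz(X)):

import numpy as np
import scipy.sparse as sp

def count_sketch(X, r, seed=0):
    """Compute X @ (D @ Phi) for an n x d sparse matrix X in O(nnz(X)) time."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    signs = rng.choice([-1.0, 1.0], size=d)    # diagonal of D
    h = rng.integers(0, r, size=d)             # h(i): feature i -> bucket h(i)
    # Row i of S = D @ Phi has a single entry signs[i] in column h(i).
    S = sp.csr_matrix((signs, (np.arange(d), h)), shape=(d, r))
    return X @ S

X = sp.random(1000, 5000, density=0.01, format="csr", random_state=0)
X_sk = count_sketch(X, r=256)
print(X_sk.shape, X_sk.nnz)                    # the sketch can be much denser than X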

Methodology

Even though count-sketch has been successfully used for dimensionality reduction in linear SVM classification Paul et al. (2014), we argue that this data-oblivious method has two limitations: (1) the sketching matrix S is randomly generated, which can result in a bad sketched matrix when some important columns of X are not properly captured by Φ; (2) count-sketch does not preserve the sparsity rate of the original data.

When applying count-sketch for data classification, the first limitation can result in a bad low-dimensional embedding and therefore a classification model with low accuracy. To illustrate this limitation, we show in Figure 1 the classification accuracy of using count-sketch for dimensionality reduction on the mnist dataset over ten different runs. As shown in Figure 1, count-sketch (the blue line with triangle markers) can produce low classification accuracy in some runs, and its accuracy is unstable. We also show in this figure the classification accuracy of our proposed method, which will be introduced later (the red line with circle markers). We can see that our proposed method produces significantly better accuracy than count-sketch.

Figure 1: Classification Accuracy of Using Count-sketch on Different Runs

The second limitation of count-sketch is that, for sparse input data, the sketched matrix can be much denser than the original data. We checked the sparsity rate of the mnist data before and after count-sketch. The original sparsity rate of the mnist data is 80.78%, and it drops dramatically to 1.72% in the sketched data. Therefore, the sketched data can contain more nonzero values than the original data and make the subsequent classification algorithm slower. More examples can be found in the experiments section.

Connection between Count-Sketch and k-means clustering

Since the construction of the matrices D and Φ in count-sketch is oblivious to the input data matrix X, it can produce a bad sketched matrix (e.g., some important columns of X may not be properly captured by Φ) and therefore result in low classification accuracy. In this paper, we seek to develop a data-dependent count-sketch method that addresses the two limitations of count-sketch. To motivate our method, we start by analyzing the reconstruction error of the original count-sketch method and show an interesting connection between count-sketch and k-means clustering.

Let us define a diagonal scaling matrix B ∈ ℝ^{r×r} as

B = (Φ^⊤Φ)^{-1}, i.e., B_{jj} = 1 / |{i : h(i) = j}|.    (2)

Note that Φ^⊤ΦB equals an identity matrix of size r. The reconstruction error of count-sketch (i.e., the error of reconstructing X from the sketch XS = XDΦ) can be represented as

‖XDΦBΦ^⊤D − X‖²_F,    (3)

where ‖A‖_F denotes the Frobenius norm of a matrix A, defined as the square root of the sum of the squares of all elements of A. Note that ‖A‖²_F = tr(AA^⊤), where the operator tr(·) returns the sum of the diagonal entries of an input matrix. As shown in the following Proposition 1, the reconstruction error of count-sketch in (3) is equivalent to the objective function of applying k-means clustering to cluster the columns of XD into r clusters.

Proposition 1.

The reconstruction error of count-sketch is equivalent to the objective function of k-means clustering on the columns of the matrix product XD if we treat Φ as a learnable variable that denotes the cluster membership of each column of XD.

Proof.

We first rewrite the reconstruction error as ‖(XDΦBΦ^⊤ − XD)D‖²_F. Note that D is a diagonal matrix with each diagonal entry either +1 or −1, so right-multiplication by D only flips the signs of columns and does not change the Frobenius norm. Let us use Z to denote XD; we then have

‖XDΦBΦ^⊤D − X‖²_F = ‖ZΦBΦ^⊤ − Z‖²_F.    (4)

Next, we show that the i-th column of ZΦBΦ^⊤ equals the h(i)-th column of ZΦB, which is the mean of the columns of Z assigned to bucket h(i):

(ZΦBΦ^⊤)_{:,i} = (ZΦB)_{:,h(i)} = (1 / |{j : h(j) = h(i)}|) Σ_{j : h(j) = h(i)} z_j,    (5)

where z_j denotes the j-th column of Z. Combining (4) and (5), the reconstruction error of count-sketch can be rewritten as

‖ZΦBΦ^⊤ − Z‖²_F = Σ_{i=1}^{d} ‖z_i − (ZΦB)_{:,h(i)}‖².    (6)

Based on the definition of the matrix Φ, Φ is a binary indicator matrix in which each row has only one non-zero entry. Therefore, Φ can be viewed as a cluster membership indicator matrix that corresponds to randomly assigning the d columns of Z = XD into r clusters: the non-zero element in the i-th row of Φ denotes that the i-th column of Z is assigned to cluster h(i). Note that the j-th column of the matrix product ZΦB is the centroid of cluster j. Therefore,

‖XDΦBΦ^⊤D − X‖²_F = Σ_{i=1}^{d} ‖z_i − μ_{π(i)}‖²,    (7)

where π(i) returns the index of the cluster that the i-th column of Z belongs to and μ_{π(i)} is the centroid of that cluster. By treating Φ as a learnable variable denoting the cluster membership, the reconstruction error of count-sketch is the same as the objective function of the k-means algorithm on the columns of Z = XD, as shown in (7).

Our Proposition 1 provides an interesting connection between count-sketch and k-means clustering. In the count-sketch algorithm, the cluster membership indicator matrix Φ is randomly generated without considering intrinsic data properties, which can result in a bad embedding with high reconstruction error.
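
A small numerical experiment (ours, with arbitrary dimensions) makes this equivalence concrete: for a random count-sketch assignment h, the reconstruction error in (3) coincides with the k-means objective obtained by treating the buckets of h as clusters of the columns of Z = XD.

import numpy as np

rng = np.random.default_rng(1)
n, d, r = 50, 400, 16
X = rng.standard_normal((n, d))
signs = rng.choice([-1.0, 1.0], size=d)        # diagonal of D
h = rng.integers(0, r, size=d)                 # random count-sketch bucket assignment

Z = X * signs                                   # Z = X D
Phi = np.zeros((d, r)); Phi[np.arange(d), h] = 1.0
counts = np.maximum(Phi.sum(axis=0), 1.0)       # guard against empty buckets
B = np.diag(1.0 / counts)                       # B = (Phi^T Phi)^(-1)

recon_err = np.linalg.norm(Z @ Phi @ B @ Phi.T - Z, "fro") ** 2

centers = (Z @ Phi) / counts                    # column j = centroid of bucket j
kmeans_obj = sum(np.linalg.norm(Z[:, i] - centers[:, h[i]]) ** 2 for i in range(d))

print(recon_err, kmeans_obj)                    # the two numbers agree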

  Input: vector c ∈ ℝⁿ, ball radius λ, tolerance parameter ε
  Output: projected sparse vector c
1:  if ‖c‖₁ ≤ λ then return c
2:  lower = 0; upper = max_i |c_i|; current = ‖c‖₁
3:  while current > λ + ε or current < λ do  # bisection to find θ
4:     θ = (upper + lower) / 2
5:     current = Σ_i max(0, |c_i| − θ)
6:     if current ≤ λ then upper = θ else lower = θ
7:  end while
8:  for i = 1 to n do  # ℓ1-ball projection
9:     c_i = sign(c_i) · max(0, |c_i| − θ)
10:  end for
Algorithm 1: ℓ1-ball projection Sculley (2010)
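
The following is a Python rendering of Algorithm 1 (a sketch under the notation above; variable names are ours). Bisection finds the threshold θ, and soft-thresholding then zeroes out every entry whose magnitude is below θ, which is what produces the sparsity:

import numpy as np

def l1_ball_projection(c, lam, eps=0.01):
    """Approximately project vector c onto the L1 ball of radius lam (Algorithm 1)."""
    c = np.asarray(c, dtype=float)
    if np.abs(c).sum() <= lam:                 # already inside the ball
        return c
    lower, upper = 0.0, np.abs(c).max()
    current = np.abs(c).sum()
    # Bisection for theta so that lam <= sum_i max(0, |c_i| - theta) <= lam + eps
    while current > lam + eps or current < lam:
        theta = (upper + lower) / 2.0
        current = np.maximum(np.abs(c) - theta, 0.0).sum()
        if current <= lam:
            upper = theta
        else:
            lower = theta
    # Soft-threshold: entries with |c_i| <= theta become exactly zero
    return np.sign(c) * np.maximum(np.abs(c) - theta, 0.0)

v = np.random.default_rng(0).standard_normal(1000)
p = l1_ball_projection(v, lam=5.0)
print(np.abs(p).sum(), np.count_nonzero(p))    # L1 norm near 5, most entries zero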

Improved count-sketch by k-means and ℓ1-ball projection

As shown in (7), the reconstruction error of count-sketch can be reduced by replacing the random cluster membership indicator matrix Φ in the original count-sketch algorithm with a cluster membership indicator matrix produced by the k-means algorithm on the columns of Z = XD. Motivated by this observation, we propose to use the k-means algorithm to learn the cluster membership indicator matrix from data so as to obtain a lower reconstruction error. The new cluster centers returned by k-means, i.e., the columns of ZΦB with the learned Φ, can then be used as the new low-dimensional feature representation. This new method results in lower reconstruction error than the original count-sketch method.

Apart from the reconstruction error, as mentioned earlier, another limitation of count-sketch is that it may not preserve the sparsity rate of the input data X. In other words, the new data representation can be dense even if the original data is highly sparse. This limitation can make the subsequent algorithm on the projected data even slower than simply using the original data without count-sketch. Therefore, instead of using Lloyd's classic k-means algorithm Lloyd (1982), we develop a new method to obtain very sparse cluster centers. We propose to obtain sparse cluster centers by optimizing the k-means objective in (7) using gradient descent together with an ℓ1-ball projection Duchi et al. (2008) in each update.

The gradient of the k-means objective function (7) with respect to the j-th cluster center μ_j is

∇_{μ_j} Σ_{i=1}^{d} ‖z_i − μ_{π(i)}‖² = 2 Σ_{i=1}^{d} 1[π(i) = j] (μ_j − z_i),    (8)

where 1[π(i) = j] is a binary function that returns 1 if π(i) equals j (i.e., the i-th column of Z belongs to the j-th cluster) and returns 0 otherwise. In other words, the computation of the gradient only depends on the columns that belong to the j-th cluster in the current iteration.

Using gradient descent, in each iteration the cluster center μ_j can be updated as

μ_j ← μ_j − η · 2 Σ_{i : π(i) = j} (μ_j − z_i),    (9)

where η is the learning rate. However, directly using (9) will not produce sparse cluster centers.

To obtain sparse cluster centers, we use an approximate ℓ1-ball projection to make each μ_j a sparse vector. This approximate projection was proposed in Sculley (2010) as an efficient approximation of the exact ℓ1-ball projection of Duchi et al. (2008), and it is very effective at producing sparse cluster centers, as shown in Sculley (2010). The basic idea is to use bisection to find a threshold θ that projects a dense vector onto an ℓ1 ball whose radius lies between λ and λ + ε. After θ is found, the projection maps the i-th entry of μ_j (denoted μ_{j,i}) to

μ_{j,i} ← sign(μ_{j,i}) · max(0, |μ_{j,i}| − θ).    (10)

As shown in (10), the resulting cluster centers μ_j are sparse vectors, since any entry whose absolute value is smaller than θ is set to 0. The whole procedure of the approximate ℓ1-ball projection is described in Algorithm 1.

By using Algorithm 1 in each iteration of optimizing the k-means objective function by gradient descent, we obtain sparse cluster centers.

  Input: data matrix X ∈ ℝ^{n×d}, reduced dimension r, number of iterations T, parameters λ and ε for the ℓ1-ball projection
  Output: low-dimensional data representation X' and the learnt cluster membership indicator matrix Φ
1:  Generate a diagonal random sign matrix D ∈ ℝ^{d×d}
2:  Compute Z = XD
3:  Randomly pick r columns from Z as the cluster centers μ_1, ..., μ_r
4:  for t = 1 to T do
5:     Create an all-zero matrix Φ ∈ {0,1}^{d×r}
6:     for i = 1 to d do
7:        j* = argmin_j ‖z_i − μ_j‖²
8:        Φ_{i,j*} = 1
9:     end for
10:     Update each cluster center μ_j using (9)
11:     Obtain sparse cluster centers using Algorithm 1
12:  end for
13:  return X' = [μ_1, ..., μ_r] and Φ
Algorithm 2: Effective and Sparse Count-Sketch (ESCK)
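
A condensed Python sketch of Algorithm 2 is given below (our illustration of the procedure, reusing the l1_ball_projection helper above; the learning-rate scaling by the cluster size and the empty-cluster handling are our own choices, not specified in the text):

import numpy as np

def esck(X, r, T=10, lam=5.0, eps=0.01, eta=0.5, seed=0):
    """Return (centers, Phi): sparse cluster centers (the sketched data X') and membership."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    signs = rng.choice([-1.0, 1.0], size=d)                      # step 1: diagonal of D
    Z = X * signs                                                # step 2: Z = X D
    centers = Z[:, rng.choice(d, size=r, replace=False)].copy()  # step 3

    for _ in range(T):                                           # step 4
        # steps 5-9: assign every column of Z to its nearest center
        d2 = ((Z ** 2).sum(axis=0)[:, None]
              - 2.0 * Z.T @ centers
              + (centers ** 2).sum(axis=0)[None, :])             # d x r squared distances
        assign = d2.argmin(axis=1)
        Phi = np.zeros((d, r)); Phi[np.arange(d), assign] = 1.0
        for j in range(r):
            members = Z[:, assign == j]
            if members.shape[1] == 0:                            # skip empty clusters
                continue
            # step 10: gradient step on the k-means objective, cf. (8)-(9)
            grad = 2.0 * (members.shape[1] * centers[:, j] - members.sum(axis=1))
            centers[:, j] -= (eta / members.shape[1]) * grad
            # step 11: sparsify the center with the approximate L1-ball projection
            centers[:, j] = l1_ball_projection(centers[:, j], lam, eps)
    return centers, Phi                                          # step 13

X = np.random.default_rng(0).standard_normal((200, 1000))
X_sk, Phi = esck(X, r=64)
print(X_sk.shape, np.mean(X_sk == 0))                            # n x r, mostly zeros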

Algorithm Implementation and Analysis

We summarize our proposed algorithm for improving the original count-sketch in Algorithm 2 and name it Effective and Sparse Count-Sketch (ESCK). Our proposed algorithm first obtains Z = XD in steps 1-2, which is the same as in the original count-sketch. The contribution of our proposed algorithm is to replace the randomly generated cluster membership indicator matrix Φ of count-sketch with a learned cluster membership indicator matrix Φ. The sparse cluster centers are used as the low-dimensional data representation. As shown in steps 3 to 12, sparse cluster centers are obtained by using gradient descent with ℓ1-ball projection to cluster the d columns of Z into r groups.

With respect to time complexity, step 2 only needs O(nnz(X)) time because D is a diagonal matrix. The time complexity of steps 3 to 12 is upper bounded by O(T·n·d·r), where T is the number of iterations. For sparse input data, the cost per iteration for updating the cluster centers is smaller than this bound since both the data and the cluster centers are sparse. Empirically, the gradient-descent k-means converges very fast and only a few iterations are needed. In our experiments, we show that our proposed method is only several times slower than count-sketch, while the classification accuracy obtained by our method is much higher than that of count-sketch and the other methods. Note that our proposed method also returns the learned cluster membership indicator matrix Φ; therefore, our algorithm can be extended to the inductive setting and can generate the feature mapping for new unseen data by using D and Φ, which enjoys the same low computational complexity as the original count-sketch.
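
For the inductive setting mentioned above, one natural mapping for an unseen sample is to reuse the learned sign vector (the diagonal of D) and the learned Φ, exactly as count-sketch would; whether an additional rescaling by B is applied is our assumption and is not specified in the text. A possible sketch:

import numpy as np
import scipy.sparse as sp

def esck_transform(X_new, signs, Phi):
    """Map unseen rows into the learned r-dimensional space in O(nnz(X_new)) time."""
    Phi_sp = sp.csr_matrix(Phi)                         # d x r, one nonzero per row
    if sp.issparse(X_new):
        return X_new.multiply(signs.reshape(1, -1)) @ Phi_sp
    return (X_new * signs) @ Phi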

Dataset       # of samples   # of features   # of classes   sparsity rate
usps          9,298          256             10             0%
mnist         60,000         780             10             87.78%
gisette       7,000          5,000           2              0.85%
real-sim      72,309         20,958          2              99.75%
rcv1-binary   20,242         47,236          2              99.84%
rcv1-multi    15,564         47,236          53             99.86%
Table 1: Summary of Experimental Datasets

Method           Performance            usps (r=30)    mnist (r=100)   gisette (r=256)   real-sim (r=256)   rcv1-binary (r=256)   rcv1-multi (r=256)
Gaussian         Accuracy (%)           90.60 ± 0.01   88.92 ± 0.03    90.70 ± 0.01      79.25 ± 0.06       81.63 ± 0.01          69.71 ± 0.01
                 Sparsity rate          0%             0%              0%                0%                 0%                    0%
                 Prediction time (ms)   2.89           20.84           3.01              31.15              7.95                  22.42
Achlioptas       Accuracy (%)           90.17 ± 0.03   87.70 ± 0.02    89.80 ± 0.03      76.85 ± 0.06       81.86 ± 0.01          67.56 ± 0.07
                 Sparsity rate          0%             0.01%           0%                1.07%              0.03%                 0.03%
                 Prediction time (ms)   3.12           54.98           3.98              41.92              11.71                 88.76
Count-Sketch     Accuracy (%)           90.74 ± 0.01   87.66 ± 0.01    90.37 ± 0.02      77.21 ± 0.06       80.19 ± 0.02          69.38 ± 0.06
                 Sparsity rate          0%             1.72%           0%                73.36%             73.65%                74.47%
                 Prediction time (ms)   2.98           50.86           4.46              11.96              4.46                  24.93
SRHT             Accuracy (%)           89.86 ± 1.66   87.14 ± 0.84    90.45 ± 0.87      78.37 ± 0.20       80.29 ± 0.69          68.50 ± 0.29
                 Sparsity rate          0%             0%              0%                0%                 0%                    0%
                 Prediction time (ms)   3.3            79.35           3.6               22.6               6.96                  103
SRHT-topr        Accuracy (%)           90.68 ± 1.51   88.15 ± 0.77    92.45 ± 0.55      82.48 ± 0.19       82.14 ± 0.24          71.01 ± 0.77
                 Sparsity rate          0%             0%              0%                0%                 0%                    0%
                 Prediction time (ms)   4.22           75.23           3.3               46.6               6.14                  101
ESCK-full        Accuracy (%)           90.90 ± 0.02   90.60 ± 0.02    93.25 ± 0.02      88.68 ± 0.07       92.91 ± 0.01          78.99 ± 0.01
                 Sparsity rate          50.81%         43.10%          66.09%            89.57%             87.61%                88.44%
                 Prediction time (ms)   0.97           15.81           0.99              3.99               1.01                  13.51
ESCK-miniBatch   Accuracy (%)           91.90 ± 0.01   90.50 ± 0.02    94.45 ± 0.03      88.25 ± 0.08       90.01 ± 0.02          77.13 ± 0.01
                 Sparsity rate          46.78%         40.29%          37.58%            97.47%             94.78%                95.37%
                 Prediction time (ms)   1.21           16.86           2.26              3.28               0.99                  5.98
Table 2: Experimental Results of Different Random Matrix Sketching Methods

                 usps    mnist   gisette   real-sim   rcv1-binary   rcv1-multi
Count-Sketch     1.2s    5.1s    2.6s      2.6s       1.9s          1.1s
ESCK-full        5.3s    26.4s   10.2s     38.5s      12.6s         8.5s
ESCK-miniBatch   1.1s    6.5s    4.5s      13.5s      5.3s          4.1s
Table 3: Comparison of the embedding time during the training stage

Experiments

In this section, we compare our proposed algorithm with several commonly used random dimensionality reduction algorithms on six real-life datasets. These datasets were downloaded from the LIBSVM website Chang and Lin (2011). A summary of the six datasets is shown in Table 1. The sparsity rate shown in the last column is the fraction of zeros in each input data matrix X. As shown in Table 1, there are four sparse datasets (mnist, real-sim, rcv1-binary, rcv1-multi) and two dense datasets (usps, gisette).

We evaluate the performance of the following seven matrix sketching methods:

  • Gaussian: the sketching matrix is a random Gaussian matrix Dasgupta and Gupta (1999).

  • Achlioptas: the sketching matrix is randomly generated from a discrete distribution and is sparser than a Gaussian matrix Achlioptas (2003).

  • Count-Sketch: the original data-oblivious count-sketch method Clarkson and Woodruff (2017).

  • SRHT: the sketching matrix is generated by the Subsampled Randomized Hadamard Transform (SRHT) Tropp (2011).

  • SRHT-topr: an improved, data-dependent variant of SRHT Lei and Lan (2020).

  • ESCK-full: our proposed method that uses full-batch gradient descent with ℓ1-ball projection to obtain the k-means centers.

  • ESCK-miniBatch: our proposed method that uses mini-batch gradient descent with ℓ1-ball projection to obtain the k-means centers. It is more efficient than ESCK-full but has slightly lower accuracy.

Experimental Setting. For the two dense datasets (usps and gisette), the feature values are scaled to [0, 1] using min-max normalization. We use five-fold cross-validation to evaluate the accuracy. The regularization parameter of the SVM is chosen by cross-validation from a set of candidate values. The parameter ε of the ℓ1-ball projection is fixed, and the parameter λ is chosen from a set of candidate values. Our experiments are performed on a desktop with an Intel(R) Core(TM) i7-9700 CPU @ 3.00GHz and 16.0 GB RAM.

Experimental Results. We report the classification accuracy, the sparsity rate of the sketched matrix, and the prediction time of the different algorithms in Table 2. The projected dimension r for each dataset is given in the first row of the table; results for other settings of the projected dimension are discussed later. The first four methods are data-oblivious random projection methods and the last three are data-dependent. The best accuracy for each dataset is in bold and the second best is in italic.

As shown in this table, the data-dependent matrix sketching methods (i.e., SRHT-topr, ESCK-full and ESCK-miniBatch) achieve higher accuracy than the data-oblivious methods. Among the six datasets, the proposed ESCK-full algorithm achieves the best accuracy on four datasets (mnist, real-sim, rcv1-binary, rcv1-multi) and the second best accuracy on the other two (usps and gisette). Overall, ESCK-full obtains the best accuracy. ESCK-miniBatch obtains slightly lower accuracy than ESCK-full but higher accuracy than the other five matrix sketching methods. The results in Table 2 demonstrate that our proposed methods achieve better accuracy than the other methods.

Algorithms       Projection Cost   Classification Cost
Gaussian
Achlioptas
Count-Sketch
SRHT
SRHT-topr
ESCK
Table 4: Prediction Costs for Different Algorithms on a Single Input Sample

Figure 2: Impact of Projection Dimension r
Figure 3: Accuracies with Different Sparsity Rates ((a) real-sim, (b) rcv1-binary, (c) mnist)

With respect to the sparsity rate of the sketched data, as expected, Gaussian, Achlioptas, SRHT and SRHT-topr produce dense sketched data even if the input data is sparse. The original count-sketch method and our proposed methods can produce sparse embeddings for highly sparse input data, and the sparsity rate of the sketched data produced by our proposed methods is higher than that of count-sketch. Furthermore, our proposed method can produce sparse embeddings even for dense input data (e.g., usps and gisette). With respect to the prediction time, the prediction time of our methods is lower than that of the other methods. The prediction cost of the different algorithms is summarized in Table 4; both count-sketch and our proposed ESCK are very efficient for prediction.

We also compare the embedding time of our proposed methods with that of the original count-sketch during the training stage. The results are shown in Table 3. As expected, our proposed methods are several times slower than the original count-sketch, since we need to perform k-means clustering on the columns of Z = XD. ESCK-miniBatch is faster than ESCK-full.

Impact of Projection Dimension r. In Figure 2, we show the experimental results of all algorithms with different projection dimensions r. As shown in this figure, our proposed method ESCK-full consistently obtains better accuracy than the other matrix sketching methods. The other two data-dependent matrix sketching methods, ESCK-miniBatch and SRHT-topr, also obtain better accuracy than the four data-oblivious matrix sketching methods. When the parameter r is small, the accuracy improvement of our proposed method is large on the real-sim, rcv1-binary and rcv1-multi datasets.

Impact of Sparse Sketched Matrix. By tuning the parameter λ of the ℓ1-ball projection, our proposed method can produce a very sparse sketched matrix X'. In this section, we explore how the sparsity rate of the sketched matrix affects the classification accuracy. In Figure 3, we show the sparsity rate and accuracy for count-sketch and ESCK-full. The blue dashed line shows the accuracy of count-sketch, and its sparsity rate is annotated by the text above this line. The red line shows the accuracies of ESCK-full at different sparsity rates of the sketched matrix. As shown in Figure 3, our proposed method obtains better accuracy than count-sketch while having a higher sparsity rate. As the sparsity rate increases, the accuracy can decrease slightly but remains higher than that of count-sketch. On the mnist dataset, the count-sketch method generates a dense sketched matrix with a sparsity rate of 1.72% and the accuracy of the subsequent classifier is 87.65%. In comparison, ESCK-full can generate a sparse sketched matrix with higher classification accuracy.

Conclusion

In this paper, we propose a novel data-dependent count-sketch algorithm that produces a more effective and sparser subspace embedding than the original data-oblivious count-sketch algorithm. Our new method applies the k-means clustering algorithm to obtain the sketched data matrix, and a sparse sketched data matrix is obtained by using gradient descent with ℓ1-ball projection to optimize the k-means clustering objective function. We compared our proposed algorithm with five other matrix sketching algorithms. Our experimental results on six real-life datasets demonstrate that our proposed methods achieve higher classification accuracy than count-sketch and the other matrix sketching methods. Moreover, our proposed methods produce a sketched matrix with a higher sparsity rate than the other methods, which makes the subsequent classification model more efficient.

References

  • D. Achlioptas (2003) Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences 66 (4), pp. 671–687.
  • N. Ailon and B. Chazelle (2006) Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform. In Proceedings of the Thirty-Eighth Annual ACM Symposium on Theory of Computing, pp. 557–563.
  • N. Ailon and E. Liberty (2009) Fast dimension reduction using Rademacher series on dual BCH codes. Discrete & Computational Geometry 42 (4), pp. 615.
  • M. Bojarski, A. Choromanska, K. Choromanski, F. Fagan, C. Gouy-Pailler, A. Morvan, N. Sakr, T. Sarlos, and J. Atif (2017) Structured adaptive and random spinners for fast machine learning computations. In Artificial Intelligence and Statistics, pp. 1020–1029.
  • C. Boutsidis, A. Zouzias, and P. Drineas (2010) Random projections for k-means clustering. In Advances in Neural Information Processing Systems, pp. 298–306.
  • C. Chang and C. Lin (2011) LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST) 2 (3), pp. 1–27.
  • K. M. Choromanski, M. Rowland, and A. Weller (2017) The unreasonable effectiveness of structured random orthogonal embeddings. In Advances in Neural Information Processing Systems, pp. 219–228.
  • K. L. Clarkson and D. P. Woodruff (2017) Low-rank approximation and regression in input sparsity time. Journal of the ACM (JACM) 63 (6), pp. 1–45.
  • Y. Dahiya, D. Konomis, and D. P. Woodruff (2018) An empirical evaluation of sketching for numerical linear algebra. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1292–1300.
  • S. Dasgupta and A. Gupta (1999) An elementary proof of the Johnson-Lindenstrauss lemma. International Computer Science Institute, Technical Report 22 (1), pp. 1–5.
  • E. Dobriban and S. Liu (2019) Asymptotics for sketching in least squares regression. In Advances in Neural Information Processing Systems, pp. 3675–3685.
  • J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra (2008) Efficient projections onto the ℓ1-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, pp. 272–279.
  • W. B. Johnson and J. Lindenstrauss (1984) Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics 26, pp. 189–206.
  • I. Jolliffe (2011) Principal component analysis. Springer.
  • Z. Lei and L. Lan (2020) Improved subsampled randomized Hadamard transform for linear SVM. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, pp. 4519–4526.
  • W. Liu, X. Shen, and I. Tsang (2017) Sparse embedded k-means clustering. In Advances in Neural Information Processing Systems, pp. 3319–3327.
  • S. Lloyd (1982) Least squares quantization in PCM. IEEE Transactions on Information Theory 28 (2), pp. 129–137.
  • M. W. Mahoney (2011) Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning 3 (2), pp. 123–224.
  • P. Martinsson and J. Tropp (2020) Randomized numerical linear algebra: foundations & algorithms. arXiv preprint arXiv:2002.01387.
  • S. Paul, C. Boutsidis, M. Magdon-Ismail, and P. Drineas (2014) Random projections for linear support vector machines. ACM Transactions on Knowledge Discovery from Data (TKDD) 8 (4), pp. 1–25.
  • D. Sculley (2010) Web-scale k-means clustering. In Proceedings of the 19th International Conference on World Wide Web, pp. 1177–1178.
  • J. A. Tropp, A. Yurtsever, M. Udell, and V. Cevher (2017) Practical sketching algorithms for low-rank matrix approximation. SIAM Journal on Matrix Analysis and Applications 38 (4), pp. 1454–1485.
  • J. A. Tropp (2011) Improved analysis of the subsampled randomized Hadamard transform. Advances in Adaptive Data Analysis 3 (1–2), pp. 115–126.
  • D. P. Woodruff (2014) Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science 10 (1–2), pp. 1–157.
  • Y. Xu, H. Yang, L. Zhang, and T. Yang (2017) Efficient non-oblivious randomized reduction for risk minimization with improved excess risk guarantee. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, pp. 2796–2802.