1 Introduction
Modern computer science tasks face enormous dataset sizes; for example, modern machine learning models may require millions of data examples in order to train. It is therefore crucial to decrease the size of the data to save computational power, and a coreset is one data structure for this task. Given a set $P$ of points, a coreset is a data structure consuming a much smaller amount of memory than $P$, which can be used as a substitute for $P$ for any query on $P$. For example, in the $k$-median problem the query can be a set $Q$ of $k$ points, and we want a coreset from which we can obtain a $(1+\varepsilon)$-approximation to $\sum_{x \in P} d(x, Q)$, where $d(x, Q)$ is the distance from $x$ to the closest point to $x$ in $Q$. Often, we want to construct a strong coreset, meaning that with high probability the coreset
can be used in place of $P$ simultaneously for all possible queries $Q$. If this is the case, then we can throw away the original dataset $P$, which saves us not only computational power but also storage.

There is a long line of work focused on constructing coresets for subspace approximation and $k$-means; see, e.g., [deshpande2006matrix, deshpande2007sampling, feldman2011unified, feldman2010coresets, feldman2013turning, varadarajan2012sensitivity, shyamalkumar2007efficient, badoiu2002approximate, chen2009coresets, feldman2012data, frahling2005coresets, frahling2008fast, har2007smaller, har2004coresets, langberg2010universal]. The work of [feldman2013turning] gave the first coresets of size independent of the dimension $d$, providing strong coresets both for subspace approximation and for $k$-means. Later, [cohen2015dimensionality] improved the result and provided an $O(\mathrm{nnz}(A))$-time algorithm, where $\mathrm{nnz}(A)$ is the number of nonzero entries in the input matrix $A$. Later still, [sohler2018strong] provided a strong coreset of size independent of $n$ and $d$ for the $k$-median problem, and also for subspace approximation with sum-of-distances loss, building upon a long line of earlier work on $k$-median; their algorithm's running time, however, contains a term exponential in $\mathrm{poly}(k/\varepsilon)$. Recent work [makarychev2019performance] provided an oblivious dimensionality reduction for $k$-median to a low-dimensional space while preserving the cost of every clustering. This dimension reduction result can be used to construct a strong coreset of size independent of $n$ and $d$. We remark that our method works not only for $k$-median but also for the subspace approximation problem, and can potentially be modified to provide coresets for other shape-fitting problems where the query centers have a low-dimensional nature.
Despite obtaining the first coresets of size independent of $n$ and $d$ for the fundamental problems of $k$-median and subspace approximation, a glaring drawback of the work of [sohler2018strong] is that the running time to build the coreset is exponential in $\mathrm{poly}(k/\varepsilon)$. This is due to the requirement that their algorithm solve subspace approximation to within a $(1+\varepsilon)$ factor in order to build the coreset. This does not seem ideal, as one motivation for building a coreset in the first place might be to use it to solve subspace approximation. Moreover, for the $k$-means problem, the strong coreset construction of [feldman2013turning] runs in fully polynomial time. We thus consider the main open question of this line of work: can we obtain a strong coreset of size independent of $n$ and $d$ for $k$-median and subspace approximation in polynomial time?
1.1 Our Results
Our main contribution is that, for sums of $p$-th powers of Euclidean distances with $1 \le p < 2$, we provide the first nearly linear time algorithm for constructing strong coresets of size independent of $n$ and $d$ for the subspace approximation problem. Previously, the best algorithm that found strong coresets of size independent of $n$ and $d$ had a running time exponential in $\mathrm{poly}(k/\varepsilon)$. In this paper we remove the exponential term.
Theorem 1 (Informal version of Theorems 15 and 16).
For any $k$, any $p \in [1, 2)$, and any $\varepsilon > 0$, there is a nearly linear time algorithm that finds strong coresets of size independent of $n$ and $d$ for the subspace approximation and $k$-median problems.
When $A$ is dense, i.e., $\mathrm{nnz}(A) = \Theta(nd)$, the running time above may be too large to afford. In this case, we also provide a faster coreset construction algorithm.
Theorem 2 (Informal version of Theorem 26).
For any $k$, any $p \in [1, 2)$, and any $\varepsilon > 0$, there is a fast algorithm for dense inputs that finds strong coresets of size independent of $n$ and $d$ for the subspace approximation and $k$-median problems.
For $p > 2$ there is no oblivious subspace embedding, so the non-adaptive sampling in [clarkson2015input], a key tool that we build upon for $p \in [1, 2)$, is no longer valid. For $p > 2$ we instead provide an adaptive sampling algorithm for our coreset construction.
1.2 Technical Overview
When the input matrix $A$ is sparse, we first find a constant-factor approximate subspace: a subspace such that, for any rank-$k$ projection matrix, the cost of projecting $A$ onto our subspace is at most a constant factor times the cost of the best rank-$k$ projection.
To achieve this, we first right multiply $A$ by a so-called lopsided embedding so that the resulting matrix has much smaller dimensions, and then we sample its rows by left multiplying by a sampling matrix based on the leverage scores of the sketched matrix. Finally, we find an orthonormal basis of the row space of the sample, which gives a constant-factor approximation to the problem.
Based on the constant-factor approximation we obtain, we perform residual sampling, which is shown in [clarkson2015input] to give a $(1+\varepsilon)$-approximate subspace.
Starting with the constant-factor approximation, we thus obtain a $(1+\varepsilon)$-approximate subspace $V$. But the coreset construction requires another condition on $V$: adding any $k$-dimensional subspace to $V$ must not decrease the cost by a lot. To obtain this guarantee, we run the approximation algorithm adaptively over several iterations. In each iteration we store all the previously found subspaces and run the next iteration on the part of $A$ orthogonal to them. The final subspace $V$ has the property that we cannot expand it by any $k$-dimensional subspace to get much better performance.
We can then project $A$ onto $V$ and use a canonical coreset construction in a much smaller dimension.
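The residual-sample-and-expand loop described above can be sketched as follows. This is a simplified illustration, not the paper's algorithm: we sample rows with probability proportional to their $p$-th power residual norms and expand the basis with plain QR; the function name is ours.

```python
import numpy as np

def residual_sample_extend(A, V, s, p=2, rng=None):
    """One round of residual sampling: pick s rows with probability
    proportional to the p-th power of their distance to span(V), then
    extend the orthonormal basis V with the sampled residual rows."""
    rng = np.random.default_rng(rng)
    resid = A - (A @ V) @ V.T if V.shape[1] else A.copy()
    w = np.linalg.norm(resid, axis=1) ** p
    q = w / w.sum()
    idx = rng.choice(A.shape[0], size=s, p=q)
    # Orthonormalize [V | sampled residual rows] to get the new basis.
    Vnew, _ = np.linalg.qr(np.hstack([V, resid[idx].T]))
    return Vnew

rng = np.random.default_rng(6)
A = rng.standard_normal((300, 10))
V = np.zeros((10, 0))
for t in range(4):                       # adaptively expand the subspace
    V = residual_sample_extend(A, V, s=2, p=2, rng=t)
cost = np.linalg.norm(A - (A @ V) @ V.T) ** 2
```

Each round samples from the part of $A$ orthogonal to everything found so far, mirroring the "project away and rerun" idea above.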
For dense inputs $A$, the running time of the first algorithm is dominated by the computation of leverage scores for sampling. We propose a novel alternative sampling scheme which computes leverage scores for only a subset of the rows in each iteration, compared to all rows as in the naive algorithm. We achieve this by dividing the matrix into nearly equal-sized partitions and computing the sum of the sampling probabilities of the rows in each partition without computing the individual probabilities. We also show that we need not look inside many of the partitions, and hence can save the time of computing those leverage scores. We show that a similar sampling scheme works for Algorithm DimReduce too. This lets us shave a factor off the running time.
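The two-level structure of this partition sampling can be illustrated as follows. This sketch is ours and shows only why the decomposition samples from the correct distribution: it still computes every partition's total directly, whereas the paper's scheme avoids computing per-row probabilities inside unsampled partitions.

```python
import numpy as np

def two_level_sample(A, V, num_blocks, s, rng=None):
    """Split the rows of A into partitions, sample a partition by its
    total residual mass, then compute individual row probabilities only
    inside the sampled partition. The overall law equals direct row
    sampling, since P(row i) = (W_j / W) * (w_i / W_j) = w_i / W."""
    rng = np.random.default_rng(rng)
    blocks = np.array_split(np.arange(A.shape[0]), num_blocks)
    resid = lambda M: M - (M @ V) @ V.T          # residual w.r.t. span(V)
    mass = np.array([np.linalg.norm(resid(A[b])) ** 2 for b in blocks])
    bq = mass / mass.sum()                       # partition-level probabilities
    picked = []
    for _ in range(s):
        j = rng.choice(num_blocks, p=bq)         # level 1: pick a partition
        w = np.linalg.norm(resid(A[blocks[j]]), axis=1) ** 2
        picked.append(int(blocks[j][rng.choice(len(w), p=w / w.sum())]))
    return picked

rng = np.random.default_rng(8)
A = rng.standard_normal((100, 6))
V, _ = np.linalg.qr(rng.standard_normal((6, 2)))
idx = two_level_sample(A, V, num_blocks=10, s=20, rng=9)
```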
For sums of $p$-th powers of Euclidean distances with $p > 2$, we run the adaptive sampling algorithm iteratively; each time we store all the rows of $A$ picked so far, and sample a new subspace on top.
1.3 Outline
In Section 2 we discuss the preliminaries and notation for our paper. In Section 3 we discuss how to find a constant-factor approximate subspace. In Section 4 we show how to build a $(1+\varepsilon)$-approximate subspace based on the constant-factor approximation. In Section 5 we present how to iteratively run the algorithm of Section 4 to get a coreset for subspace approximation and $k$-median. In Section 6 we make our algorithm efficient when the input matrix is dense. Lastly, in Section 7 we give an algorithm for $p > 2$.
2 Preliminaries and Notation
We use $A \in \mathbb{R}^{n \times d}$ as our input matrix; it can be interpreted as a set of $n$ points in $\mathbb{R}^d$. Throughout the paper, we write $a_i$ for the $i$-th row of $A$, and $A_{*j}$ for the $j$-th column.
Let $S \in \mathbb{R}^{s \times n}$ be a sampling matrix. We can write $S = D\,\Omega^\top$, where $\Omega \in \mathbb{R}^{n \times s}$ and $D \in \mathbb{R}^{s \times s}$ is diagonal. For each column $j$ of $\Omega$ and $D$, we sample with replacement an independent row index $r_j$ with probability $q_{r_j}$, and set $\Omega_{r_j, j} = 1$ and $D_{jj} = 1/(s\,q_{r_j})^{1/p}$. In expectation, $\|SA\|_p^p = \|A\|_p^p$.
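A sampling-and-rescaling matrix of this form can be built as in the following sketch; the function name is ours, and the check at the end verifies only the $p = 2$ unbiasedness empirically.

```python
import numpy as np

def sampling_matrix(q, s, p=2, rng=None):
    """Build S = D @ Omega.T from row probabilities q: each of the s
    columns independently picks row r with probability q[r], and the
    rescaling 1/(s*q[r])^(1/p) makes ||S A||_p^p unbiased for ||A||_p^p."""
    rng = np.random.default_rng(rng)
    n = len(q)
    rows = rng.choice(n, size=s, p=q)           # sample with replacement
    Omega = np.zeros((n, s))
    Omega[rows, np.arange(s)] = 1.0
    D = np.diag(1.0 / (s * np.asarray(q)[rows]) ** (1.0 / p))
    return D @ Omega.T                          # shape (s, n)

# Empirical unbiasedness check for p = 2 with uniform probabilities.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 5))
q = np.full(50, 1.0 / 50)
est = np.mean([np.linalg.norm(sampling_matrix(q, 200, rng=i) @ A) ** 2
               for i in range(200)])
```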
Given a vector $x$ and a subspace $V$, we let $d(x, V)$ denote the distance between $x$ and $V$. Given a subspace $V$, we denote by $P_V$ the projection onto $V$, and by $V^\perp$ the orthogonal complement of $V$. For any matrix $M$, we write $M'$ to represent the matrix where we attach an all-zeros column to $M$, and we write $M^\dagger$ for its pseudoinverse.
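The projection and distance notation above can be made concrete with a small numpy sketch (the function name is ours); the test below checks the Pythagorean identity $\|x\|^2 = \|P_V x\|^2 + d(x, V)^2$.

```python
import numpy as np

def proj_dist(x, V):
    """Distance d(x, span(V)) for V with orthonormal columns,
    via the projection P_V x = V (V^T x)."""
    xV = V @ (V.T @ x)
    return np.linalg.norm(x - xV)

rng = np.random.default_rng(1)
# Orthonormal basis of a random 3-dimensional subspace of R^6.
V, _ = np.linalg.qr(rng.standard_normal((6, 3)))
x = rng.standard_normal(6)
d = proj_dist(x, V)
```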
Definition 2.1.
For a matrix $A$ and a subspace $V$, define $\mathrm{cost}(A, V) = \sum_i d(a_i, V)^p$, where $p$ should be clear from the context.
Definition 2.2.
Define $\mathrm{OPT} = \min_{\dim(V) = k} \mathrm{cost}(A, V)$, the optimal cost over all $k$-dimensional subspaces.
Definition 2.3.
For a matrix $A$ with $n$ rows and a subset $T \subseteq [n]$, we define $A_T$ as the submatrix obtained by taking the rows of $A$ with indices in $T$. Throughout the paper, we consider only contiguous submatrices.
3 Finding a constant-factor approximation
To get a coreset of size independent of $n$ or $d$, we want to construct a subspace $V$ of low rank, project $A$ onto $V$ to reduce the dimension, and then construct a coreset. An important property we need for $V$ is that we cannot expand $V$ by any other $k$-dimensional subspace and get a much better approximation.
Given an algorithm that finds a good approximation, intuitively we can run this algorithm iteratively. In each iteration we project away from the subspace we have found so far and run the same algorithm again to expand it. Our first step is to find a constant-factor approximation, holding against any projection matrix of appropriate dimension.
To achieve this, we first perform a dimension reduction by a "lopsided embedding" to get a sketched matrix. Then we construct a sampling matrix using the leverage scores of the sketched matrix. The row space of the sample is then a constant-factor approximation.
Definition 3.1 (lopsided embedding).
A matrix $R$ is a lopsided $\varepsilon$-embedding for $(A, B)$ with respect to a matrix norm $\|\cdot\|$ and constraint set $\mathcal{C}$ if:

1. for all $X \in \mathcal{C}$, $\|XAR - BR\| \geq (1 - \varepsilon)\,\|XA - B\|$;

2. letting $X^* = \operatorname{argmin}_{X \in \mathcal{C}} \|XA - B\|$, we have $\|X^*AR - BR\| \leq (1 + \varepsilon)\,\|X^*A - B\|$.
Many sparse embedding matrices, including the CountSketch matrix, are lopsided embeddings, and they satisfy the following property:
Theorem 4 (Theorem 8 in [clarkson2015input]).
Given $\varepsilon > 0$, if $R$ is a sparse embedding matrix [bourgain2015toward, clarkson2017low, meng2013low, nelson2013osnap] with a suitable sparsity parameter and number of columns, then with constant probability $R$ is a lopsided $\varepsilon$-embedding for matrices of appropriate dimension.
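For intuition, the CountSketch construction mentioned above hashes each row of $A$ to one of $m$ buckets with a random sign, so applying it costs $O(\mathrm{nnz}(A))$ time. The following is a minimal sketch of that idea (our own rendering, not the paper's construction).

```python
import numpy as np

def countsketch(A, m, rng=None):
    """Apply an m-row CountSketch to A: row i of A is added, with a
    random sign s[i], into bucket h[i] of the output. One pass over
    the nonzeros of A suffices."""
    rng = np.random.default_rng(rng)
    n = A.shape[0]
    h = rng.integers(0, m, size=n)       # bucket for each row
    s = rng.choice([-1.0, 1.0], size=n)  # random sign for each row
    SA = np.zeros((m, A.shape[1]))
    for i in range(n):
        SA[h[i]] += s[i] * A[i]
    return SA

rng = np.random.default_rng(2)
A = rng.standard_normal((500, 4))
SA = countsketch(A, m=200, rng=3)
```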
Lemma 5.
Given a subspace $V$, for any vector $x$, let $x_V$ be the projection of $x$ onto $V$. We have:
Proof.
Let $x_{V^\perp}$ be the projection of $x$ onto $V^\perp$. Then we can write $x = x_V + x_{V^\perp}$, hence:
∎
The sampling matrix based on the leverage scores of the sketched matrix can also be computed efficiently:
Definition 3.2 (Definition 13 in [clarkson2015input], well-conditioned basis for the $\ell_p$ norm).
An $n \times d$ matrix $U$ is an $(\alpha, \beta, p)$-well-conditioned basis for the column space of $A$ if $\|U\|_p \le \alpha$ and, for all $z \in \mathbb{R}^d$, $\|z\|_q \le \beta\,\|Uz\|_p$, where $1/p + 1/q = 1$.
Theorem 6 (Theorem 14 in [clarkson2015input]).
Suppose $S$ is a subspace embedding for the column space of $A$, i.e., for all $x$, $\|SAx\| = \Theta(1)\,\|Ax\|$. Suppose we compute a factorization $SA = QR$, where $Q$ has orthonormal columns. Then $AR^{-1}$ is a well-conditioned basis for the column space of $A$. There are subspace embeddings with a small number of rows that can be applied in $O(\mathrm{nnz}(A))$ time, so that $R^{-1}$ can be computed in $O(\mathrm{nnz}(A)) + \mathrm{poly}(d)$ time.
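The sketch-then-QR recipe can be illustrated as follows. For simplicity this sketch uses a dense Gaussian subspace embedding rather than a fast sparse one, and works with $p = 2$, so it shows the shape of the computation only, not the $\mathrm{nnz}(A)$ running time.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((1000, 6))

# Dense Gaussian stand-in for a fast subspace embedding S.
S = rng.standard_normal((60, 1000)) / np.sqrt(60)
Q, R = np.linalg.qr(S @ A)          # S A = Q R, Q has orthonormal columns
W = A @ np.linalg.inv(R)            # W = A R^{-1}: well-conditioned basis
scores = np.linalg.norm(W, axis=1) ** 2   # approximate leverage scores
probs = scores / scores.sum()             # row-sampling probabilities
```

Since $S$ only approximately preserves norms on the column space, the row norms of $W$ approximate the true leverage scores up to a constant factor, which is enough for sampling.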
Lemma 7.
Given a matrix $A$ and a projection matrix onto a low-dimensional subspace, Algorithm 2 (ConstApprox) returns, in time nearly linear in $\mathrm{nnz}(A)$, a projection matrix onto a low-dimensional subspace orthogonal to the input one, such that
(1) 
Proof.
The correctness of ConstApprox is given by the following theorem from [clarkson2015input]:
Theorem 8 (Theorem 47 in [clarkson2015input]).
With constant probability, the matrix $U$ output by ConstApprox satisfies:
We apply this theorem, and the constant-factor approximation follows. As proven in [clarkson2015input], the sample has the claimed number of rows with high probability.
We now turn to showing the claimed running time. To compute the sampling probabilities, we need to estimate the row norms of the well-conditioned basis. As mentioned in Theorem 6, we can compute a well-conditioned basis for the column space of $A$ by doing a QR factorization of $SA$, using a subspace embedding $S$; we can calculate $SA$ in $O(\mathrm{nnz}(A))$ time. For the leverage score estimation, we also right multiply $R^{-1}$ by a Gaussian vector $g$ and compute $AR^{-1}g$; this multiplication can be done from right to left in $O(\mathrm{nnz}(A))$ time. Applying the sampling matrix is cheap, since it simply selects and rescales rows, and the projection involved has low rank. The row span of the sample can be computed with a standard orthogonalization, and the final projection matrix is given implicitly (we do not compute this matrix multiplication).

Combining everything, we obtain the claimed total running time. ∎
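The Gaussian-vector trick above, estimating the row norms of $AR^{-1}$ without ever forming that product, can be sketched as follows; here we use a block of $t$ Gaussian vectors, and the choice $t = 32$ is ours.

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((2000, 8))
Rinv = rng.standard_normal((8, 8))  # stand-in for R^{-1} from the QR step

# Multiply right-to-left by a few scaled Gaussian vectors (a JL-style
# estimate): cost is O(nnz(A) * t) rather than forming A @ Rinv.
t = 32
G = rng.standard_normal((8, t)) / np.sqrt(t)
est = np.linalg.norm(A @ (Rinv @ G), axis=1)    # right-to-left
exact = np.linalg.norm(A @ Rinv, axis=1)
```

With $t$ Gaussian columns, each estimated row norm concentrates around the true value with relative error roughly $1/\sqrt{t}$, which suffices for constant-factor sampling probabilities.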
4 $(1+\varepsilon)$-approximation
Based on the constant-factor approximation constructed in Section 3, in this section we show how to construct a $(1+\varepsilon)$-approximation.
Theorem 9.
Given a matrix $A$, a projection matrix $P = UU^\top$ with $U$ having orthonormal columns, and a projection matrix $Q = WW^\top$ with $W$ having orthonormal columns, such that
(2) 
then DimReduce returns a projection matrix, whose orthonormal factor has a bounded number of columns in expectation, which projects onto a subspace such that
(3) 
in time nearly linear in $\mathrm{nnz}(A)$.
Proof.
Let $B$ be the matrix with orthonormal columns computed in step 8 of DimReduce, and let $BB^\top$ be the corresponding projection matrix. From Theorem 46 of [clarkson2015input], we obtain that
(4) 
We also have
Thus
(5) 
We also have that the matrix computed in step 9 of DimReduce is an orthonormal basis for the output subspace and hence satisfies the conditions of the theorem.
In step 5 of DimReduce, the sampling probability of each row of $A$ with at least one nonzero entry can be computed efficiently; hence the probabilities of all such rows can be computed in $O(\mathrm{nnz}(A))$ time, and the sampling matrix can then be formed quickly. In expectation, the algorithm samples a bounded number of rows, and hence an orthonormal basis for the row span of the sample can be computed using any standard orthogonalization technique, after which the projection can be computed. Thus, the total time required for DimReduce is as claimed. ∎
Theorem 10.
For a matrix $A$ and any projection matrix corresponding to a subspace $V$, we get a projection matrix corresponding to a subspace $V'$ such that
(6) 
This holds for any such subspace, and an optimal one is among the candidates. Hence, we obtain a subspace of bounded dimension in expectation, together with its projection matrix, such that
(7) 
and such a subspace can be obtained in the stated running time.
Proof.
From Lemma 7, we obtain a projection matrix such that
(8) 
Hence, DimReduce returns a projection matrix which projects onto a subspace such that
(9) 
We note that the projection matrix is never explicitly computed. We have the following
(10) 
and
(11) 
and hence we obtain
(12) 
∎
5 Coresets for Subspace Approximation and $k$-Median
Now we show how to construct coresets for subspace approximation and $k$-median using DimReduce. The key property we need from the approximate subspace $V$ is that for any $k$-dimensional subspace $W$, attaching $W$ to $V$ does not improve the cost too much. To achieve this, we run DimReduce several times, expanding $V$ in each iteration. The details of our algorithm are described in Algorithm 4 (DimensionReduction).
Lemma 11.
With at least constant probability, DimensionReduction finds a low-dimensional subspace $V$ such that, for all $k$-dimensional subspaces $W$, we have
Proof.
With high probability, after the iterations of the for-loop in Algorithm 4 (DimensionReduction), by the guarantee of Algorithm 3 (DimReduce), the accumulated subspace contains a subspace that is a good approximation of OPT. For each iteration $i$, let $V_i$ be the subspace that DimensionReduction has produced so far. Consider the telescoping sum:
We can see that a large fraction of the summands must be small. Letting $i$ be the index sampled by the algorithm, with at least constant probability we have:
By Theorem 10, for any $k$-dimensional subspace $W$, the complementary event occurs with vanishing probability.
So for any $k$-dimensional subspace $W$:
∎
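The telescoping step in the proof above can be reconstructed as follows; we write $c_i$ for the cost against the subspace found after $i$ iterations and $T$ for the number of iterations, names we introduce here because the original symbols were lost.

```latex
\sum_{i=0}^{T-1} \left( c_i - c_{i+1} \right) \;=\; c_0 - c_T \;\le\; c_0 .
```

Hence at most a $1/10$ fraction of the indices $i \in \{0, \dots, T-1\}$ can have $c_i - c_{i+1} > 10\,c_0 / T$, so a uniformly random index satisfies $c_i - c_{i+1} \le 10\,c_0 / T$ with probability at least $9/10$; taking $T$ large enough makes this increment a small fraction of the initial cost.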
Given $V$ with the property in Lemma 11, we adopt the technique in [sohler2018strong] for constructing coresets: we project each row of $A$ onto $V$ and attach its distance to $V$ as an extra coordinate.
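This lifting can be sketched as follows (our simplified rendering of the [sohler2018strong]-style technique, not the paper's code): each row keeps its coordinates inside $V$ plus one extra coordinate for its distance to $V$, and distances to any center lying in $V$ are then preserved exactly by the Pythagorean theorem.

```python
import numpy as np

rng = np.random.default_rng(7)
d, t = 12, 3
V, _ = np.linalg.qr(rng.standard_normal((d, t)))   # orthonormal basis of V
A = rng.standard_normal((40, d))

# Lift each row to t+1 coordinates: its position inside V, plus its
# distance to V as one extra coordinate.
coords = A @ V                                     # coordinates within V
dist = np.linalg.norm(A - coords @ V.T, axis=1)    # distance to V
A_lift = np.hstack([coords, dist[:, None]])        # shape (40, t+1)

# A center c lying in V lifts to (its coordinates in V, 0).
c = V @ rng.standard_normal(t)
c_lift = np.append(V.T @ c, 0.0)
```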
We cannot afford to calculate these distances exactly, but we can use the following lemma for a fast estimation:
Lemma 12 (Lemma 14 in [sohler2018strong]).
Given the subspace guaranteed by Lemma 11, we can compute efficiently a matrix of low rank such that, with high probability, for every set contained in a $k$-dimensional subspace we have:
We remark that CoresetConstruction satisfies:
Theorem 13 (Theorem 8 in [sohler2018strong]).
Let $A$ be the input matrix and let $A'$ be the low-rank matrix output by CoresetConstruction. Let $C$ be any nonempty set that is contained in a $k$-dimensional subspace. Let $B$ and $B'$ be the matrices whose rows are the closest points in the closure of $C$ to the rows of $A$ and $A'$ respectively. Then we have:
We can now present our coreset construction result:
Lemma 14 (Lemma 16 in [sohler2018strong]).
Given $A$, it is possible to efficiently find a sampling and rescaling matrix $S$ with a bounded number of rows such that, for all rank-$k$ orthogonal projection matrices $P$:
Let $S$ be the output of Algorithm 5; then the resulting coreset has a bounded number of rows.
Theorem 15 (Strong coresets for subspace approximation, modified Theorem 17 in [sohler2018strong]).
There exists a coreset such that, for any orthogonal projection $P$ of rank $k$, we have: