# Strong Coresets for Subspace Approximation and k-Median in Nearly Linear Time

Recently, the first (1+ϵ)-approximate strong coresets for k-median and subspace approximation of size poly(k/ϵ) were obtained by Sohler and Woodruff (2018). Importantly, given n points in R^d, these were the first coresets whose size is independent of both n and d. Unfortunately, their construction has a running time exponential in poly(k/ϵ). Here we give the first polynomial time, and in fact nearly linear time, algorithms for constructing such coresets. Our first algorithm runs in nnz(A)/ϵ + (n+d)·poly(k/ϵ) time and our second runs in nd log²(nd) + (n+d)·poly(k/ϵ) time.


## 1 Introduction

Modern computer science tasks face enormous dataset sizes. For example, modern machine learning models may require millions of examples to train. It is crucial that we can decrease the size of the data to save computational power. A coreset is one such data structure for this task. Given a set A of n points in R^d, a coreset C is a data structure consuming much less memory than A, which can be used as a substitute for A for any query on A. For example, in the k-median problem, a query can be a set Q of k points, and we want a coreset that yields a (1+ϵ)-approximation to the sum over all points a in A of d(a, Q), where d(a, Q) is the distance from a to the closest point in Q. Often, we want to construct a strong coreset, meaning that with high probability, C can be used in place of A simultaneously for all possible queries Q. If this is the case, then we can throw away the original dataset A, which saves us not only computational power but also storage.
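To make the k-median query concrete, the following sketch (a minimal illustration, not the construction of this paper; the point set, the query, and the helper name are all invented for the example) evaluates the weighted k-median cost that a strong coreset must preserve for every query simultaneously:

```python
import numpy as np

def kmedian_cost(points, centers, weights=None):
    """Weighted k-median cost: sum_i w_i * d(a_i, Q), where d(a_i, Q)
    is the Euclidean distance from a_i to its nearest center in Q."""
    if weights is None:
        weights = np.ones(len(points))
    # (n, k) matrix of pairwise distances, then minimize over the k centers.
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return float(np.sum(weights * dists.min(axis=1)))

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 3))   # n = 500 points in R^3
Q = rng.standard_normal((4, 3))     # one k-median query: k = 4 centers

full_cost = kmedian_cost(A, Q)

# A strong coreset C with weights w would satisfy, for EVERY query Q,
#   |kmedian_cost(C, Q, w) - kmedian_cost(A, Q)| <= eps * kmedian_cost(A, Q),
# using far fewer (reweighted) points.  As a sanity check, the data itself
# with unit weights preserves the cost exactly.
assert np.isclose(kmedian_cost(A, Q, np.ones(len(A))), full_cost)
```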

There is a long line of work focusing on constructing coresets for subspace approximation and k-means, see, e.g., [deshpande2006matrix, deshpande2007sampling, feldman2011unified, feldman2010coresets, feldman2013turning, varadarajan2012sensitivity, shyamalkumar2007efficient, badoiu2002approximate, chen2009coresets, feldman2012data, frahling2005coresets, frahling2008fast, har2007smaller, har2004coresets, langberg2010universal]. The work of [feldman2013turning] gave the first coresets, for both subspace approximation and k-means, whose size is independent of the dimension d. Later, [cohen2015dimensionality] improved this result and provided an algorithm running in time nearly linear in nnz(A), where nnz(A) is the number of non-zero entries in A. Later, [sohler2018strong] provided a strong coreset of size poly(k/ϵ) for the k-median problem, and also for subspace approximation with sum-of-distances loss, building upon a long line of earlier work on k-median. Their algorithm runs in time exponential in poly(k/ϵ). Recent work [makarychev2019performance] provided an oblivious dimensionality reduction for k-median to a low-dimensional space while preserving the cost of every clustering. This dimension-reduction result can be used to construct a strong coreset of size poly(k/ϵ). We remark that our method works not only for k-median but also for the subspace approximation problem. The method we propose can potentially be modified to provide coresets for other shape-fitting problems where the query centers are low-dimensional in nature.

Despite obtaining the first coresets for the fundamental problems of k-median and subspace approximation of size independent of n and d, a glaring drawback of the work of [sohler2018strong] is that the running time to build the coreset is exponential in poly(k/ϵ). This is due to the requirement that their algorithm solve a (1+ϵ)-approximate subspace approximation problem in order to build the coreset. This does not seem ideal, as one motivation for building a coreset in the first place might be to use it for solving subspace approximation. Moreover, for the k-means problem, the strong coreset construction of [feldman2013turning] runs in fully polynomial time. We thus consider the main open question of this line of work: can we compute a poly(k/ϵ)-size strong coreset for k-median and subspace approximation in polynomial time?

### 1.1 Our Results

Our main contribution is that, when considering the sum of p-th powers of Euclidean distances with p ∈ [1,2), we provide the first nearly linear time algorithms for constructing poly(k/ϵ)-sized strong coresets for the subspace approximation and k-median problems. Previously, the best algorithm that found strong coresets with size independent of n and d ran in time exponential in poly(k/ϵ). In this paper we remove the exponential term.

###### Theorem 1 (Informal version of Theorems 15 and 16).

For any A ∈ R^{n×d}, k ≥ 1, and ϵ ∈ (0,1), there is an nnz(A)/ϵ + (n+d)·poly(k/ϵ) time algorithm that finds poly(k/ϵ)-sized strong coresets for the subspace approximation and k-median problems.

When A is dense, i.e., nnz(A) = Θ(nd), the quantity nnz(A)/ϵ may be too large to afford. In this case, we also provide a fast coreset construction algorithm which runs in nd log²(nd) + (n+d)·poly(k/ϵ) time.

###### Theorem 2 (Informal version of Theorem 26).

For any A ∈ R^{n×d}, k ≥ 1, and ϵ ∈ (0,1), there is an nd log²(nd) + (n+d)·poly(k/ϵ) time algorithm that finds poly(k/ϵ)-sized strong coresets for the subspace approximation and k-median problems.

For p > 2, there is no oblivious subspace embedding, so the non-adaptive sampling of [clarkson2015input], a key tool that we build upon for p ∈ [1,2), is no longer valid. For p > 2, we instead provide an adaptive sampling algorithm for our coreset construction.

###### Theorem 3 (Informal version of Theorems 30 and 31).

For any A ∈ R^{n×d}, k ≥ 1, ϵ ∈ (0,1), and p > 2, there is a polynomial time algorithm that finds a poly(k/ϵ)-sized strong coreset for the subspace approximation and k-median problems.

### 1.2 Technical Overview

When the input matrix A is sparse, we first find a poly(k/ϵ)-approximate subspace, given by a projection matrix X̂, such that for any projection matrix P of appropriate rank, the following inequality holds:

$$\|A(I-P)(I-\hat{X})\|_v \le \operatorname{poly}(k/\epsilon)\cdot\min_{\operatorname{rank}\text{-}k\ X}\|A(I-P)(I-X)\|_v$$

To achieve this, we first right multiply by a so-called lopsided embedding to obtain a matrix of much smaller dimension, and then we sample its rows by left multiplying by a sampling matrix based on its leverage scores. Finally, we find an orthonormal basis of the rowspace of the sampled matrix, which gives a poly(k/ϵ)-approximation to the problem.

Based on the poly(k/ϵ)-approximation we obtain, we perform residual sampling, which is shown to give a (1+ϵ)-approximate subspace in [clarkson2015input]. We obtain a subspace Ŝ such that

$$\|A(I-P)(I-\hat{S})\|_v \le (1+\epsilon)\min_{\operatorname{rank}\text{-}k\ X}\|A(I-P)(I-X)\|_v$$

Starting with P = 0, we obtain a (1+ϵ)-approximate subspace S. But the coreset construction requires another condition on S: adding any k-dimensional subspace to S should not decrease the cost by much. To obtain this guarantee, we run this approximation algorithm adaptively Θ(1/τ) times, where τ is a small accuracy parameter. In each iteration we store all the previously found subspaces, and run the next iteration after projecting away from their span. The final subspace S has the property that we cannot expand S by any k-dimensional subspace to get much better performance. More precisely,

$$\forall\, k\text{-dimensional } W:\quad \|A(I-P_S)\|_v - \|A(I-P_{S\cup W})\|_v \le \Theta\!\big(\epsilon^{\max(2/p,1)}\big)\cdot\min_{\operatorname{rank}\text{-}k\ X}\|A(I-X)\|_v$$

We can then project A onto S and use a canonical coreset construction in a much smaller dimension.

For dense inputs A, the first algorithm has a running time dominated by the nnz(A)/ϵ term. We observe that most of the computing time is taken by the computation of leverage scores for sampling. We propose a novel alternative sampling scheme which computes the leverage scores of only a small subset of the rows in each iteration, compared to all the rows for the naive algorithm. We achieve this by dividing the matrix into nearly equal-sized partitions and computing the sum of the sampling probabilities of the rows in each partition without computing the individual probabilities. We also show that we need not look inside many of the partitions, and hence can save the time of computing those leverage scores. We show that a similar sampling scheme works for Algorithm DimReduce too. This lets us shave the 1/ϵ factor off the nnz(A) term in the running time.
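The partition idea can be illustrated with a small sketch. Here the per-row probabilities are taken proportional to powers of row norms as a stand-in for the actual leverage-score-based probabilities, and the partition count is arbitrary; this is a minimal sketch of the two-stage scheme, not the paper's algorithm:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((1000, 8))
p = 1

# Stand-in per-row sampling measure: p-th power of the row norm.  (The real
# scheme uses leverage-score-based probabilities, which are costlier.)
row_mass = np.linalg.norm(A, axis=1) ** p
probs = row_mass / row_mass.sum()

# Split the rows into nearly equal contiguous partitions and compute only
# the TOTAL mass of each partition, not the individual probabilities.
parts = np.array_split(np.arange(len(A)), 10)
part_mass = np.array([row_mass[idx].sum() for idx in parts])
part_probs = part_mass / part_mass.sum()

# Two-stage draw: pick a partition by its total mass, then look inside only
# that partition to pick a row.  Rows in unchosen partitions are never touched.
j = rng.choice(len(parts), p=part_probs)
i = parts[j][rng.choice(len(parts[j]), p=row_mass[parts[j]] / part_mass[j])]

# The induced row distribution is exactly the original one:
# P(j) * P(i | j) = (m_j / M) * (m_i / m_j) = m_i / M.
induced = np.concatenate([part_probs[jj] * row_mass[idx] / part_mass[jj]
                          for jj, idx in enumerate(parts)])
assert np.allclose(induced, probs)
```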

For the sum of p-th powers of Euclidean distances with p > 2, we run the adaptive sampling algorithm iteratively; each time we store all the rows of A picked so far, and sample a new subspace on top.

### 1.3 Outline

In Section 2 we discuss the preliminaries and notation for our paper. In Section 3 we discuss how to find a poly(k/ϵ)-approximate subspace. In Section 4 we show how to build a (1+ϵ)-approximate subspace based on the poly(k/ϵ)-approximation. In Section 5 we present how to iteratively run the algorithm of Section 4 to get a coreset for subspace approximation and k-median. In Section 6 we make our algorithm efficient when the input matrix is dense. Lastly, in Section 7 we give an algorithm for p > 2.

## 2 Preliminaries and Notation

We use A ∈ R^{n×d} as our input matrix. It can be interpreted as a set of n points in R^d. Throughout the paper, we write A_{i*} for the i-th row of A, and A_{*j} for the j-th column.

Let S ∈ R^{s×n} be a sampling matrix. We can write S = D·Ω^T, where D ∈ R^{s×s} is diagonal and Ω ∈ R^{n×s}. For each column j of Ω and D, we independently sample, with replacement, a row index i ∈ [n] with probability q_i, and set Ω_{ij} = 1 and D_{jj} = 1/(q_i s)^{1/p}. In expectation, ∥SA∥_v^p = ∥A∥_v^p.
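A standard choice rescales each sampled row by 1/(q_i s)^{1/p}, which makes the p-th power of the row-norm sum an unbiased estimator. The following sketch (with hypothetical probabilities proportional to row norms) verifies this exactly by enumerating all outcomes of a single sample (s = 1):

```python
import numpy as np

def v_norm_p(A, p):
    # ||A||_v^p = sum_i ||A_{i*}||_2^p  (the p-th power of the v-norm)
    return float(np.sum(np.linalg.norm(A, axis=1) ** p))

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 3))
p = 1.5
q = np.linalg.norm(A, axis=1) ** p
q /= q.sum()                       # sampling probabilities, sum to 1
s = 1                              # one sample suffices to check unbiasedness

# With s = 1 the sketch SA is a single rescaled row: picking row i gives
# SA = A_{i*} / (q_i * s)^{1/p}.  The expectation of ||SA||_v^p over the
# random choice of i can be computed exactly by enumeration:
expectation = sum(q[i] * v_norm_p(A[i:i + 1] / (q[i] * s) ** (1 / p), p)
                  for i in range(len(A)))
assert np.isclose(expectation, v_norm_p(A, p))
```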

Given a vector a and a subspace V, we let d(a, V) denote the distance between a and V. Given a subspace B, we write P_B for the projection onto B, and B^⊥ for the orthogonal complement of B.

For a matrix M, we also consider the matrix formed by attaching an all-zeros column to M. We write M^+ for its pseudo-inverse.

###### Definition 2.1.

Define ∥·∥_v as follows: for any A ∈ R^{n×d},

$$\|A\|_v = \Big(\sum_i \|A_{i*}\|_2^p\Big)^{1/p},$$

where p should be clear from the context.

###### Definition 2.2.

Define ∥·∥_h as

$$\|A\|_h = \|A^T\|_v$$
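Both norms are straightforward to compute; a minimal sketch (for p = 2 the v-norm coincides with the Frobenius norm):

```python
import numpy as np

def norm_v(A, p):
    """||A||_v = (sum_i ||A_{i*}||_2^p)^{1/p}: l_p norm of the row l_2 norms."""
    return float(np.sum(np.linalg.norm(A, axis=1) ** p) ** (1 / p))

def norm_h(A, p):
    """||A||_h = ||A^T||_v: l_p norm of the column l_2 norms."""
    return norm_v(A.T, p)

A = np.array([[3.0, 4.0],
              [0.0, 0.0],
              [5.0, 12.0]])
assert np.isclose(norm_v(A, 1), 5.0 + 0.0 + 13.0)    # row norms are 5, 0, 13
assert np.isclose(norm_v(A, 2), np.linalg.norm(A))   # p = 2 is Frobenius
assert np.isclose(norm_h(A, 2), np.linalg.norm(A))
```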
###### Definition 2.3.

For a matrix A with n rows and a subset I ⊆ [n], we define A_I as the sub-matrix obtained by taking the rows of A with indices in I. Throughout the paper, we consider only contiguous sub-matrices.

## 3 Finding a poly(k/ϵ)-approximation

Let OPT = min over rank-k X of ∥A(I−X)∥_v. To get a coreset of size independent of n or d, we want to construct a subspace S of dimension poly(k/ϵ), project A onto S to reduce the dimension, and then construct a coreset. An important property we need for S is that we cannot expand S by any other k-dimensional subspace and get a much better approximation.

Given an algorithm that finds a good approximation to OPT, intuitively we can run this algorithm iteratively. In each iteration we project away from the subspace we have found so far and run the same algorithm again to expand it. Our first step is to find a poly(k/ϵ)-approximation to min over rank-k X of ∥A(I−P)(I−X)∥_v, for any projection matrix P of appropriate dimension.

To achieve this, we first perform a dimension reduction by a "lopsided embedding" to reduce the number of columns. Then we construct a sampling matrix using the leverage scores of the reduced matrix. The rowspace of the resulting sample is then a poly(k/ϵ)-approximation.

###### Definition 3.1 (lopsided embedding).

A matrix S is a lopsided ϵ-embedding for (A, B) with respect to a matrix norm ∥·∥ and a constraint set C if:

1. $\forall X \in C$ of appropriate dimension: $\|S(AX-B)\| \ge (1-\epsilon)\|AX-B\|$
2. Let $B^* = AX^* - B$, where $X^* = \operatorname{argmin}_{X \in C}\|AX - B\|$; we have:

$$\|SB^*\| \le (1+\epsilon)\|B^*\|$$

Many sparse embedding matrices, including the CountSketch matrix, are lopsided -embeddings, and they satisfy the following property:

###### Theorem 4 (Theorem 8 in [clarkson2015input]).

Given A ∈ R^{n×d}, if R^T is a sparse embedding matrix [bourgain2015toward, clarkson2017low, meng2013low, nelson2013osnap] with an appropriate sparsity parameter, there are m = poly(k/ϵ) and R with m columns such that with constant probability:

$$\min_{\operatorname{rank}\text{-}k\ X}\|ARX - A\|_v^p \le (1+\epsilon)\|A - A_k\|_v^p$$

for X of appropriate dimension. Here A_k denotes the best rank-k approximation to A under ∥·∥_v.
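As an illustration of why sparse embeddings are fast to apply, here is a minimal CountSketch sketch (bucket count and dimensions chosen arbitrarily for the example); each row of A touches exactly one row of the sketch, giving an application time proportional to nnz(A):

```python
import numpy as np

def countsketch(A, m, rng):
    """Apply an m x n CountSketch S to A in O(nnz(A)) time: each row of A
    is hashed to one of m buckets and added in with a random sign."""
    n = A.shape[0]
    bucket = rng.integers(0, m, size=n)    # hash h(i): one nonzero per row
    sign = rng.choice([-1.0, 1.0], size=n)
    SA = np.zeros((m, A.shape[1]))
    for i in range(n):
        SA[bucket[i]] += sign[i] * A[i]    # one update per (nonzero) row
    return SA

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 5))
SA = countsketch(A, m=50, rng=rng)
assert SA.shape == (50, 5)   # the sketch is much shorter than A
```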

###### Lemma 5.

Given a subspace B and a subspace V orthogonal to B, let a be a vector, let a_B be a's projection onto B, and let a_{B⊥} = a − a_B. Then:

$$d(a, V\cup B) = d(a_{B^\perp}, V)$$

###### Proof.

We can write a = a_B + a_{B⊥}, hence:

$$d(a, V\cup B) = d(a_B + a_{B^\perp}, V\cup B) = d(a_{B^\perp}, V\cup B) = d(a_{B^\perp}, V),$$

where the second equality holds because a_B lies in the span of V ∪ B, and the third because the projection of a_{B⊥} onto B is zero. ∎
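The identity is easy to verify numerically in the orthogonal setting; the ambient dimension and subspace sizes below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8

def dist_to_subspace(a, M):
    """Distance from vector a to the column span of M."""
    Q, _ = np.linalg.qr(M)
    return float(np.linalg.norm(a - Q @ (Q.T @ a)))

# Build B and V with V orthogonal to B (the setting where the lemma applies).
Q, _ = np.linalg.qr(rng.standard_normal((d, 5)))
B, V = Q[:, :2], Q[:, 2:5]           # B is 2-dim, V is 3-dim, V ⟂ B

a = rng.standard_normal(d)
a_B = B @ (B.T @ a)                  # projection of a onto B
a_perp = a - a_B                     # component of a orthogonal to B

lhs = dist_to_subspace(a, np.hstack([V, B]))   # d(a, V ∪ B)
rhs = dist_to_subspace(a_perp, V)              # d(a_{B⊥}, V)
assert np.isclose(lhs, rhs)
```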

The sampling matrix based on the leverage scores of the sketched matrix can also be computed efficiently:

###### Definition 3.2 (Definition 13 in [clarkson2015input], well-conditioned basis for the p-norm).

An n×d matrix U is an (α, β, p)-well-conditioned basis for the column space of A if:

1. $\left(\sum_{i\in[n]}\sum_{j\in[d]} |U_{ij}|^p\right)^{1/p} \le \alpha$
2. $\forall x \in \mathbb{R}^d,\ \|x\|_q \le \beta\|Ux\|_p$, where $1/p + 1/q = 1$
###### Theorem 6 (Theorem 14 in [clarkson2015input]).

Suppose p ∈ [1,2). Suppose S is a subspace embedding for the column space of A, i.e., for all x, ∥SAx∥_p = (1±Θ(1))∥Ax∥_p. Suppose we compute a QR-factorization SA = QR, where Q has orthonormal columns. Then U = AR^{-1} is an (α, β, p)-well-conditioned basis for the column space of A. There are subspace embeddings with αβ = poly(d) for p ∈ [1,2) that can be applied in O(nnz(A)) time, so that R^{-1} can be computed in O(nnz(A)) + poly(d) time.
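The following sketch contrasts exact ℓ₂ leverage scores (a thin QR of A itself) with the fast route through a sketched QR; the dense Gaussian embedding and its size are illustrative stand-ins for the nnz-time embeddings referenced above:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 6))

# Exact l2 leverage scores via a thin QR of A: the i-th score is the squared
# l2 norm of the i-th row of Q.
Q, R = np.linalg.qr(A)
lev = np.linalg.norm(Q, axis=1) ** 2
assert np.isclose(lev.sum(), 6)                    # scores sum to rank(A)
assert (lev >= 0).all() and (lev <= 1 + 1e-9).all()

# Sketch-based well-conditioned basis: QR of the small matrix SA, then
# U = A R^{-1} -- the fast route, since SA has far fewer rows than A.
S = rng.standard_normal((60, 500)) / np.sqrt(60)   # illustrative Gaussian OSE
_, R_s = np.linalg.qr(S @ A)
U = A @ np.linalg.inv(R_s)
approx = np.linalg.norm(U, axis=1) ** 2            # fast row-norm estimates
assert approx.shape == (500,)
```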

###### Lemma 7.

Given a matrix A ∈ R^{n×d} and a projection matrix P_B onto a subspace B, Algorithm 2 ConstApprox returns, in nnz(A) + (n+d)·poly(k/ϵ) time, a projection matrix P_S onto a poly(k/ϵ)-dimensional subspace S orthogonal to B such that

$$\|A(I-P_B)(I-P_S)\|_v \le \operatorname{poly}(k/\epsilon)\min_{H:\dim(H)=k}\|A(I-P_B)(I-P_H)\|_v \tag{1}$$
###### Proof.

The correctness of ConstApprox is given by the following theorem from [clarkson2015input]:

###### Theorem 8 (Theorem 47 in [clarkson2015input]).

With constant probability, the matrix U output by ConstApprox satisfies:

$$\|A(I - UU^T)\|_v \le \operatorname{poly}(k/\epsilon)\min_{\operatorname{rank}\text{-}k\ X}\|A(I-X)\|_v$$

We apply this theorem with A(I−P_B) in place of A, and the approximation guarantee follows. As proven in [clarkson2015input], the sampling matrix has poly(k/ϵ) rows with high probability.

We now turn to showing that this algorithm runs in the claimed time. For computing the sampling probabilities, we need to estimate the row norms of the well-conditioned basis. As mentioned in Theorem 6, we can compute a well-conditioned basis by doing a QR factorization of the sketched matrix SA, using a subspace embedding S, in O(nnz(A)) plus lower-order time. For the leverage score estimation, we also right multiply by a Gaussian vector g and compute AR^{-1}g. This multiplication can be done from right to left in O(nnz(A)) time.

Calculating the sampled matrix is fast since the left factor is a sampling matrix and P_B is a projection matrix of low rank. An orthonormal basis for its row span can then be computed in d·poly(k/ϵ) time. Finally, the projection matrix P_S is given implicitly (we do not compute this matrix multiplication).

Combining everything, the total running time is as claimed. ∎

## 4 (1+ϵ)-approximation

Based on the poly(k/ϵ)-approximation constructed in Section 3, in this section we show how to construct a (1+ϵ)-approximation.

###### Theorem 9.

Given a matrix A ∈ R^{n×d}, a projection matrix P_B onto a subspace B, and a projection matrix X̂ satisfying

$$\|A(I-P_B)(I-\hat{X})\|_v \le K\min_{\operatorname{rank}\text{-}k\ X}\|A(I-P_B)(I-X)\|_v \tag{2}$$

DimReduce returns a projection matrix P_S, given as SS^T with S having poly(Kk/ϵ) columns in expectation, which projects onto a subspace such that

$$\|A(I-P_B)(I-P_S)\|_v \le (1+\epsilon)\min_{\operatorname{rank}\text{-}k\ X}\|A(I-P_B)(I-X)\|_v \tag{3}$$

in nnz(A) + (n+d)·poly(Kk/ϵ) time.

###### Proof.

Let U be the matrix with orthonormal columns computed in step 8 of DimReduce. Let S′ be the column span of U, and let P_{S′} be the corresponding projection matrix. We can write P_{S′} = P_{S′∩B} + P_{S′∩B^⊥}. From Theorem 46 of [clarkson2015input], we obtain that

 ||A(I−PB)(I−PS′)||v≤(1+ϵ)minrank−k X||A(I−PB)(I−X)||v (4)

We also have

$$\begin{aligned}
A(I-P_B)(I-P_{S'}) &= A(I-P_B)\big(I-(P_{S'\cap B}+P_{S'\cap B^\perp})\big)\\
&= A(I-P_B) - A(I-P_B)P_{S'\cap B} - A(I-P_B)P_{S'\cap B^\perp}\\
&= A(I-P_B) - 0 - A(I-P_B)P_{S'\cap B^\perp} \quad (\text{since } P_B P_{S'\cap B} = P_{S'\cap B})\\
&= A(I-P_B)(I-P_{S'\cap B^\perp})
\end{aligned}$$

Thus

$$\|A(I-P_B)(I-P_{S'\cap B^\perp})\|_v \le (1+\epsilon)\min_{\operatorname{rank}\text{-}k\ X}\|A(I-P_B)(I-X)\|_v \tag{5}$$

We also have that the matrix S computed in step 9 of DimReduce is an orthonormal basis for S′∩B^⊥, and hence P_S = SS^T satisfies the conditions of the theorem.

In step 5 of DimReduce, the sampling probabilities for all i such that A_{i*}(I−P_B) has at least one non-zero entry can be computed in time proportional to the number of such non-zeros, and hence all of them can be computed in O(nnz(A)) plus lower-order time. The sampling matrix can then be formed in time linear in n. In expectation, it samples poly(k/ϵ) rows, and hence an orthonormal basis for the rowspan of the sampled matrix can be computed in d·poly(k/ϵ) time using any standard orthogonalization technique, and S can then be computed in d·poly(k/ϵ) time. Thus, the total time required for DimReduce is as claimed. ∎
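The residual sampling step can be sketched as follows; probabilities proportional to powers of the residual row norms are a simplified stand-in for the distribution DimReduce uses:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, p = 400, 10, 1.5
A = rng.standard_normal((n, d))

# Current subspace B (as an orthonormal basis).  Residual sampling picks rows
# with probability proportional to the p-th power of the residual norm after
# projecting away from B, so rows already captured by B are never drawn.
B, _ = np.linalg.qr(rng.standard_normal((d, 3)))
residual = A - (A @ B) @ B.T                 # rows of A(I - P_B)
mass = np.linalg.norm(residual, axis=1) ** p
probs = mass / mass.sum()
assert np.isclose(probs.sum(), 1.0)

# A row lying inside span(B) has residual 0, hence sampling probability 0.
A2 = np.vstack([A, (B @ rng.standard_normal(3))[None, :]])
residual2 = A2 - (A2 @ B) @ B.T
mass2 = np.linalg.norm(residual2, axis=1) ** p
assert np.isclose(mass2[-1], 0.0)
```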

###### Theorem 10.

For a matrix A ∈ R^{n×d} and any projection matrix P_B corresponding to a subspace B, we get a projection matrix P_S corresponding to a subspace S such that

$$\|A(I-P_B)(I-P_S)\|_v \le (1+\epsilon)\min_{\dim\text{-}k\ H_k}\|A(I-P_B)(I-P_{H_k})\|_v \tag{6}$$

We have d(a(I−P_B), S) = d(a, B∪S) for any vector a, and an optimal H_k can be taken to lie in B^⊥. Hence, we obtain a subspace S with poly(k/ϵ) dimensions in expectation and its projection matrix such that

$$\|A(I-P_{B\cup S})\|_v \le (1+\epsilon)\min_{\dim\text{-}k\ H_k}\|A(I-P_{B\cup H_k})\|_v \tag{7}$$

and such an S can be obtained in nnz(A) + (n+d)·poly(k/ϵ) time.

###### Proof.

From Lemma 7, we obtain a projection matrix X̂ such that

$$\|A(I-P_B)(I-\hat{X})\|_v \le \operatorname{poly}(k/\epsilon)\min_{\operatorname{rank}\text{-}k\ X}\|A(I-P_B)(I-X)\|_v \tag{8}$$

Hence, DimReduce returns a projection matrix P_S which projects onto a subspace S with

$$\|A(I-P_B)(I-P_S)\|_v \le (1+\epsilon)\min_{\operatorname{rank}\text{-}k\ X}\|A(I-P_B)(I-X)\|_v \tag{9}$$

We note that P_S is never explicitly computed. We have the following:

$$\|A(I-P_B)(I-P_S)\|_v = \Big(\sum_i d(a_i(I-P_B), S)^p\Big)^{1/p} = \Big(\sum_i d(a_i, B\cup S)^p\Big)^{1/p} \tag{10}$$

and similarly

$$\|A(I-P_B)(I-P_{H_k})\|_v = \Big(\sum_i d(a_i, B\cup H_k)^p\Big)^{1/p} \tag{11}$$

and hence we obtain

$$\|A(I-P_{B\cup S})\|_v \le (1+\epsilon)\min_{\dim\text{-}k\ H_k}\|A(I-P_{B\cup H_k})\|_v \tag{12}$$

∎

## 5 Coresets for Subspace Approximation and k-Median

Now we show how to construct coresets for subspace approximation and k-median using DimReduce. The key property we need from the approximate subspace S is that for any k-dimensional subspace W, attaching W to S does not decrease the cost by much. To achieve this, we run DimReduce O(1/τ) times, where in each iteration we keep expanding S. The details of our algorithm are described in Algorithm 4 DimensionReduction.

###### Lemma 11.

Let τ = ϵ^{max(2/p,1)}. With probability at least 9/10, DimensionReduction finds a poly(k/ϵ)-dimensional subspace S such that for all k-dimensional subspaces W, we have

$$\|A(I-P_S)\|_v - \|A(I-P_{S\cup W})\|_v \le \Theta(\tau)\,\mathrm{OPT}$$
###### Proof.

With constant probability, after 10/τ + 1 iterations of the for-loop in Algorithm 4 DimensionReduction, by the guarantee of Algorithm 3 DimReduce, the current subspace contains a k-dimensional subspace whose cost is a (1+ϵ)-approximation to OPT. For each i, let S_i be the subspace that DimensionReduction has produced after i iterations, so that S_1 ⊆ S_2 ⊆ ⋯ ⊆ S_{10/τ+1}. Consider the telescoping sum:

$$(1+\epsilon)\,\mathrm{OPT} - \|A(I-P_{S_{10/\tau+1}})\|_v \;\ge\; \|A(I-P_{S_1})\|_v - \|A(I-P_{S_{10/\tau+1}})\|_v \;=\; \sum_{i=2}^{10/\tau+1}\Big(\|A(I-P_{S_{i-1}})\|_v - \|A(I-P_{S_i})\|_v\Big) \;\ge\; 0$$

Since the summands are nonnegative and sum to at most (1+ϵ)OPT, fewer than a 1/10 fraction of them can exceed τ(1+ϵ)OPT. Let i* be the index sampled by the algorithm; with probability at least 9/10, we have:

$$\|A(I-P_{S_{i^*}})\|_v - \|A(I-P_{S_{i^*+1}})\|_v \le \tau(1+\epsilon)\,\mathrm{OPT}$$

By Theorem 10, for any k-dimensional subspace W, the event ∥A(I−P_{S_{i*+1}})∥_v > (1+τ)∥A(I−P_{S_{i*}∪W})∥_v happens with vanishing probability.

So for any k-dimensional W:

$$\begin{aligned}
\|A(I-P_{S_{i^*}})\|_v - \|A(I-P_{S_{i^*}\cup W})\|_v
&\le \|A(I-P_{S_{i^*}})\|_v - \tfrac{1}{1+\tau}\|A(I-P_{S_{i^*+1}})\|_v\\
&\le \|A(I-P_{S_{i^*}})\|_v - (1-\tau)\|A(I-P_{S_{i^*+1}})\|_v \quad \text{(by Bernoulli's inequality)}\\
&\le O(\tau)\,\mathrm{OPT}
\end{aligned}$$

∎
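The counting step in this proof is a simple Markov-style argument, which the following sketch checks on an arbitrary nonnegative gap sequence (the sequence itself is synthetic, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
tau = 0.05
T = int(10 / tau)                  # number of adaptive iterations (gaps)

# Any nonincreasing cost sequence yields T nonnegative gaps with a bounded
# total; by a Markov / pigeonhole argument, fewer than 1/tau of them can
# exceed tau * total, i.e. at most a 1/10 fraction of the T = 10/tau gaps --
# so a uniformly random index hits a small gap with probability >= 9/10.
gaps = rng.exponential(size=T)     # synthetic nonnegative gaps
total = gaps.sum()
big = np.sum(gaps > tau * total)   # how many gaps are "large"
assert big < 1 / tau               # Markov: fewer than 1/tau large gaps
assert big / T <= 1 / 10           # hence at most a 1/10 fraction
```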

Given S with the property in Lemma 11, we adopt the technique of [sohler2018strong] for constructing coresets, attaching the distance to S as an extra coordinate to every row of the projection of A onto S.

We cannot afford to calculate these projections directly, but we can use the following lemma for a fast estimation:

###### Lemma 12 (Lemma 14 in [sohler2018strong]).

Given the subspace S guaranteed by Lemma 11, we can efficiently compute a matrix B̃ of rank poly(k/ϵ) such that, with constant probability, for every set Q contained in a k-dimensional subspace:

$$\Big|\,\|B - B'_I\|_v - \|\tilde{B} - \tilde{B}'_I\|_v\,\Big| \le \epsilon\|A - A'\|_v$$

We remark that CoresetConstruction satisfies:

###### Theorem 13 (Theorem 8 in [sohler2018strong]).

Let p ∈ [1,2). Let A be the input matrix and let B be the rank-poly(k/ϵ) matrix output by CoresetConstruction. Let Q be any non-empty set that is contained in a k-dimensional subspace. Let A′ and B′ be the matrices whose rows are the closest points in the closure of Q to the rows of A and B respectively. Then we have:

$$\Big|\,\|A - A'\|_v - \|B - B'_I\|_v\,\Big| \le \epsilon\|A - A'\|_v$$

We can now present our coreset construction result:

###### Lemma 14 (Lemma 16 in [sohler2018strong]).

Given B, it is possible to efficiently find a sampling and rescaling matrix T with poly(k/ϵ) rows such that for all rank-k orthogonal projection matrices P:

$$\|TB - TB_I P_I^T\|_{1,2} = (1\pm\epsilon)\|B - B_I P_I^T\|_{1,2}$$

Let T be the output of Algorithm 5; then TB has poly(k/ϵ) rows.

###### Theorem 15 (Strong coresets for subspace approximation, modified Theorem 17 in [sohler2018strong], p∈[1,2)).

For p ∈ [1,2), there exists a coreset TB such that for any rank-k orthogonal projection P, we have:

$$\Big|\,\|A - AP\|_v - \|TB - TB_I P_I^T\|_v\,\Big| \le \epsilon\|A - AP\|_v$$