A PTAS for ℓ_p-Low Rank Approximation

07/16/2018 ∙ by Frank Ban, et al. ∙ UC Berkeley ∙ Max Planck Society ∙ Carnegie Mellon University

A number of recent works have studied algorithms for entrywise ℓ_p-low rank approximation, namely algorithms which, given an n × d matrix A (with n ≥ d), output a rank-k matrix B minimizing ‖A − B‖_p^p = ∑_{i,j} |A_{i,j} − B_{i,j}|^p. We show the following: On the algorithmic side, for p ∈ (0,2), we give the first n^{poly(k/ϵ)} time (1+ϵ)-approximation algorithm. For p = 0, there are various problem formulations, a common one being the binary setting for which A ∈ {0,1}^{n×d} and B = U · V, where U ∈ {0,1}^{n×k} and V ∈ {0,1}^{k×d}. There are also various notions of multiplication U · V, such as a matrix product over the reals, over a finite field, or over a Boolean semiring. We give the first PTAS for what we call the Generalized Binary ℓ_0-Rank-k Approximation problem, for which these variants are special cases. Our algorithm runs in time (1/ϵ)^{2^{O(k)}/ϵ^2} · nd · log^{2^k} d. For the specific case of finite fields of constant size, we obtain an alternate algorithm with running time n · d^{poly(k/ϵ)}. On the hardness front, for p ∈ (1,2), we show under the Small Set Expansion Hypothesis and the Exponential Time Hypothesis (ETH) that there is no constant factor approximation algorithm running in time 2^{k^δ} for a constant δ > 0, showing that an exponential dependence on k is necessary. For p = 0, we observe that there is no approximation algorithm for the Generalized Binary ℓ_0-Rank-k Approximation problem running in time 2^{2^{δk}} for a constant δ > 0. We also show, for finite fields of constant size and under the ETH, that any fixed constant factor approximation algorithm requires 2^{k^δ} time for a constant δ > 0.

1 Introduction

Low rank approximation is a common way of compressing a matrix via dimensionality reduction. The goal is to replace a given n × d matrix A by a rank-k matrix B that approximates A well, in the sense that ‖A − B‖ is small for some error measure ‖·‖. Since we can write the rank-k matrix B as U · V, where U is n × k and V is k × d, it suffices to store the (n + d)k entries of U and V, which is a significant reduction compared to the nd entries of A. Furthermore, computing a matrix-vector product Bx = U(Vx) takes time O((n + d)k), which is much less than the O(nd) time for computing Ax.
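As a minimal illustration of these savings (a toy numpy sketch, not part of the algorithms of this paper), one can keep the factors and never materialize B:

```python
import numpy as np

n, d, k = 10_000, 2_000, 5
rng = np.random.default_rng(0)

# Rank-k matrix stored in factored form B = U @ V.
U = rng.standard_normal((n, k))   # n x k
V = rng.standard_normal((k, d))   # k x d
x = rng.standard_normal(d)

# Multiplying in factored form costs O((n + d)k) instead of O(nd).
y = U @ (V @ x)                   # never forms the n x d matrix B

# Storage: (n + d)k numbers instead of n*d.
print((n + d) * k, "vs", n * d)
```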

Low rank approximation is extremely well studied, see the surveys [36, 46, 71] and the many references therein. In this paper, we study the following two variants of entrywise ℓ_p-low rank approximation. Given a matrix A ∈ ℝ^{n×d} and an integer k, one seeks to find a rank-k matrix B minimizing ‖A − B‖_p^p = ∑_{i,j} |A_{i,j} − B_{i,j}|^p when p > 0; and ‖A − B‖_0 = ∑_{i,j} [A_{i,j} ≠ B_{i,j}] for p = 0, where [·] is the Iverson bracket, that is, ‖A − B‖_0 denotes the number of entries (i,j) for which A_{i,j} ≠ B_{i,j}.

When p = 2, this coincides with the Frobenius norm error measure, which can be minimized in polynomial time using the singular value decomposition (SVD); see also [71] for a survey of more efficient algorithms based on the technique of linear sketching.
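For concreteness, the p = 2 case is solved by truncating the SVD (the Eckart-Young theorem); a minimal numpy sketch:

```python
import numpy as np

def best_rank_k_frobenius(A, k):
    """Frobenius-optimal rank-k approximation via the truncated SVD
    (Eckart-Young); returns factors U (n x k) and V (k x d)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] * s[:k], Vt[:k, :]

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 30))
Uk, Vk = best_rank_k_frobenius(A, 3)
print(np.linalg.norm(A - Uk @ Vk, "fro") ** 2)  # minimal Frobenius error
```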

Recently there has been considerable interest in obtaining algorithms for p ≠ 2. For p < 2, this error measure is often considered more robust than the SVD, since one pays less attention to noisy entries: one does not square the differences, but instead raises them to a smaller power. Conversely, for p > 2, this error measure pays more attention to outliers, and p = ∞ corresponds to a guarantee on each entry. The problem was shown to be NP-hard for p = 1 and p = 0 [27, 21, 53].

ℓ_p-Low Rank Approximation for p > 0. A number of initial algorithms for ℓ_1-low rank approximation were given in [37, 38, 41, 45, 75, 14, 12, 13, 51, 49, 50, 48, 58]. There is also related work on robust PCA [72, 15, 55, 56, 17, 74] and on measures which minimize the sum of Euclidean norms of rows [23, 25, 24, 65, 20], though neither directly gives an algorithm for ℓ_1-low rank approximation. Song et al. [67] gave the first approximation algorithms with provable guarantees for entrywise ℓ_1-low rank approximation. Their algorithm provides a poly(k log(nd))-approximation and runs in polynomial time, that is, the algorithm outputs a rank-k matrix B for which ‖A − B‖_1 ≤ poly(k log(nd)) · min_{rank-k B'} ‖A − B'‖_1. This was generalized by Chierichetti et al. [18] to ℓ_p-low rank approximation, for every p ≥ 1, where the authors also obtained a poly(k log(nd))-approximation in polynomial time.

In Song et al. [67] it is also shown that if A has entries bounded by poly(n), then a constant-factor approximation can be achieved, albeit in n^{poly(k)} time. This algorithm depends inherently on the triangle inequality, and as a result the constant factor of approximation is bounded away from 1. Improving this constant requires techniques that break the triangle inequality barrier. This is a real barrier, since the algorithm of [67] is based on a row subset selection algorithm, and there exist matrices for which the span of any small subset of rows contains at best a c-approximation, for an absolute constant c > 1 (Theorem G.8 of [67]), which we discuss more below.

ℓ_0-Low Rank Approximation. When p = 0, one seeks a rank-k matrix B for which ‖A − B‖_0 is as small as possible, where for a matrix C, ‖C‖_0 denotes the number of non-zero entries of C. Thus, in this case, we are trying to minimize the number of disagreements between A and B. Since B has rank k, we can write it as B = U · V and we seek to minimize ‖A − U · V‖_0. This was studied by Bringmann et al. [11] when U and V are matrices over the reals and U · V denotes the standard matrix product, and the work of [11] provides a bicriteria approximation algorithm. See also earlier work for k = 1 giving a 2-approximation [64, 34]. ℓ_0-low rank approximation is also well-studied when A, U, and V are each required to be binary matrices. In this case, there are a number of choices for the ground field (or, more generally, semiring). Specifically, for B = U · V we can write the entry B_{i,j} as the inner product of the i-th row of U with the j-th column of V – and the specific inner product function depends on the ground field. We consider both (1) the ground field GF(2) with inner product ⟨u, v⟩ = ∑_ℓ u_ℓ v_ℓ mod 2 [73, 30, 21, 57], and (2) the Boolean semiring, in which the inner product becomes ⟨u, v⟩ = ⋁_ℓ (u_ℓ ∧ v_ℓ) [8, 21, 54, 63, 66, 70]. Besides the abovementioned upper bounds, which coincide with all of these models when k = 1, the only other algorithms we are aware of are by Dan et al. [21], who for arbitrary k presented polynomial-time approximation algorithms over GF(2) and over the Boolean semiring, with approximation factors growing with k.

Although ℓ_p-low rank approximation is NP-hard for p ∈ {0, 1}, a central open question is whether a (1+ε)-approximation is possible, namely: Does ℓ_p-low rank approximation have a polynomial time approximation scheme (PTAS) for every constant k and every constant ε > 0?

1.1 Our Results

We give the first PTAS for ℓ_p-low rank approximation for p ∈ (0,2) in the unit cost RAM model of computation. For p = 0, our algorithms work for both the finite field and the Boolean semiring models. We also give time lower bounds, assuming the Exponential Time Hypothesis (ETH) [33] and in some cases the Small Set Expansion Hypothesis [59], providing evidence that an exponential dependence on k, for p ∈ (1,2), and a doubly-exponential dependence on k, for p = 0, may be necessary.

1.1.1 Algorithms

We first formally define the problem we consider for p > 0. We may assume w.l.o.g. that n ≥ d, and thus the input size is nd.

Definition 1.

(Entrywise ℓ_p-Rank-k Approximation) Given an n × d matrix A with integer entries bounded in absolute value by poly(n), and a positive integer k, output matrices U ∈ ℝ^{n×k} and V ∈ ℝ^{k×d} minimizing ‖A − U · V‖_p^p. An algorithm for Entrywise ℓ_p-Rank-k Approximation is an α-approximation if it outputs matrices U and V for which ‖A − U · V‖_p^p ≤ α · min_{U' ∈ ℝ^{n×k}, V' ∈ ℝ^{k×d}} ‖A − U' · V'‖_p^p.
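For concreteness, the objective of Definition 1 is straightforward to evaluate for given factors; a tiny helper sketch (illustrative names, not from the paper):

```python
import numpy as np

def lp_cost(A, U, V, p):
    """Entrywise l_p cost ||A - U V||_p^p = sum_{i,j} |A_ij - (UV)_ij|^p."""
    return float(np.sum(np.abs(A - U @ V) ** p))
```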

Our main result for p ∈ (0,2) is as follows.

Theorem 1 (PTAS for p ∈ (0,2)).

For any p ∈ (0,2) and ε > 0, there is a (1+ε)-approximation algorithm to Entrywise ℓ_p-Rank-k Approximation running in time n^{poly(k/ε)}.

For any constants k, ε > 0, and p ∈ (0,2), Theorem 1 computes in polynomial time a (1+ε)-approximate solution to Entrywise ℓ_p-Rank-k Approximation. This significantly strengthens the approximation guarantees in [67, 18].

We next consider the case p = 0. In order to study the GF(2) and Boolean semiring settings in a unified way, we introduce the following more general problem.

Definition 2.

(Generalized Binary ℓ_0-Rank-k) Given a matrix A ∈ {0,1}^{n×d} with n ≥ d, an integer k, and an inner product function ⟨·,·⟩ : {0,1}^k × {0,1}^k → {0,1}, compute matrices U ∈ {0,1}^{n×k} and V ∈ {0,1}^{k×d} minimizing ‖A − U · V‖_0, where the product U · V is defined entrywise by (U · V)_{i,j} = ⟨U_{i,:}, V_{:,j}⟩. An algorithm for the Generalized Binary ℓ_0-Rank-k problem is an α-approximation if it outputs matrices U and V such that ‖A − U · V‖_0 ≤ α · min_{U', V'} ‖A − U' · V'‖_0.

Our first result for p = 0 is as follows.

Theorem 2 (PTAS for p = 0).

For any ε > 0, there is a (1+ε)-approximation algorithm for the Generalized Binary ℓ_0-Rank-k problem running in time Õ((1/ε)^{2^{O(k)}/ε^2} · nd) and succeeding with constant probability,¹ where Õ hides a factor log^{2^k} d.

¹ The success probability can be amplified to 1 − δ for any δ > 0 by running O(log(1/δ)) independent trials of the algorithm.

Hence, we obtain the first almost-linear time approximation scheme for the Generalized Binary ℓ_0-Rank-k problem, for any constant k. In particular, this yields the first polynomial time (1+ε)-approximation for constant k for ℓ_0-low rank approximation of binary matrices, both when the underlying field is GF(2) and when the product is taken over the Boolean semiring. Prior to this work, no PTAS was known even for small constant k.

The running time of Theorem 2 is doubly-exponential in k, and we show below that this is necessary for any approximation algorithm for Generalized Binary ℓ_0-Rank-k. However, in the special case when the base field is GF(2), or more generally a finite field GF(q) of constant size, and U and V have entries belonging to GF(q), it is possible to obtain an algorithm running in time n · d^{poly(k/ε)}, which is an improvement for certain super-constant values of k and 1/ε. We formally define the problem and state our result next.

Definition 3.

(Entrywise ℓ_0-Rank-k Approximation over GF(q)) Given an n × d matrix A with entries in GF(q) for a constant q, and a positive integer k, output matrices U ∈ GF(q)^{n×k} and V ∈ GF(q)^{k×d} minimizing ‖A − U · V‖_0, where the product is taken over GF(q). An algorithm for Entrywise ℓ_0-Rank-k Approximation over GF(q) is an α-approximation if it outputs matrices U and V such that ‖A − U · V‖_0 ≤ α · min_{U', V'} ‖A − U' · V'‖_0.

Our main result for Entrywise ℓ_0-Rank-k Approximation over GF(q) is the following:

Theorem 3 (Alternate PTAS for p = 0).

For any ε > 0, there is a (1+ε)-approximation algorithm to Entrywise ℓ_0-Rank-k Approximation over GF(q) running in time n · d^{poly(k/ε)}.

1.1.2 Hardness

We first obtain conditional time lower bounds for Entrywise ℓ_p-Rank-k Approximation for p ∈ (1,2). Our results assume the Small Set Expansion Hypothesis (SSEH). Originally conjectured by Raghavendra and Steurer [59], it is still the only assumption that implies strong hardness results for various graph problems such as Uniform Sparsest Cut [61] and Bipartite Clique [47]. Assuming this hypothesis, we rule out any constant factor approximation running in polynomial time.

Theorem 4 (Hardness for Entrywise ℓ_p-Rank-k Approximation).

Fix p ∈ (1,2) and a constant α ≥ 1. Assuming the Small Set Expansion Hypothesis, there is no α-approximation algorithm for Entrywise ℓ_p-Rank-k Approximation that runs in polynomial time.

Consequently, additionally assuming the Exponential Time Hypothesis, there exists a constant δ > 0 such that there is no α-approximation algorithm for Entrywise ℓ_p-Rank-k Approximation that runs in time 2^{k^δ}.

This shows that, assuming the SSEH and the ETH, any constant factor approximation algorithm needs time exponential in a fixed power of k. We also prove hardness of approximation results for p > 2 (see Theorem 19) without the SSEH. These are the first hardness results for Entrywise ℓ_p-Rank-k Approximation other than for p ∈ {0, 1}.

We next show that our running time for Generalized Binary ℓ_0-Rank-k is close to optimal, in the sense that the running time of any PTAS for Generalized Binary ℓ_0-Rank-k must depend exponentially on 1/ε and doubly exponentially on k, assuming the Exponential Time Hypothesis.

Theorem 5 (Hardness for Generalized Binary ℓ_0-Rank-k).

Assuming the Exponential Time Hypothesis, there is a constant δ > 0 such that Generalized Binary ℓ_0-Rank-k has no approximation algorithm, with any finite approximation factor, running in time 2^{2^{δk}}. Further, even for constant k, the dependence of the running time of any (1+ε)-approximation algorithm on ε must be exponential in a fixed power of 1/ε.

Next we obtain conditional lower bounds for Entrywise ℓ_0-Rank-k Approximation over GF(q) for any fixed constant q:

Theorem 6 (Hardness for Entrywise ℓ_0-Rank-k Approximation over GF(q)).

Let GF(q) be a finite field of constant size and let α ≥ 1 be a constant. Assuming P ≠ NP, there is no α-approximation algorithm for Entrywise ℓ_0-Rank-k Approximation over GF(q) that runs in polynomial time.

Consequently, assuming the Exponential Time Hypothesis, there exists a constant δ > 0 such that there is no α-approximation algorithm for Entrywise ℓ_0-Rank-k Approximation over GF(q) that runs in time 2^{k^δ}.

This shows that, assuming the ETH, any constant factor approximation algorithm needs time exponential in a fixed power of k.

1.1.3 Additional Results

We obtain several additional results on ℓ_p-low rank approximation. We summarize them below and defer the details to Section 6.

ℓ_p-low rank approximation for p > 2

Let g be a standard Gaussian random variable and let γ_p = (E|g|^p)^{1/p}. We note that γ_p > 1 for any p > 2. Then, under the ETH, no approximation algorithm with a factor substantially smaller than a threshold determined by γ_p runs in time 2^{k^δ}, for a constant δ > 0. On the algorithmic side, we give a simple approximation algorithm achieving a comparable factor in polynomial time.
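For reference, γ_p has a closed form via the Gaussian absolute moments, E|g|^p = 2^{p/2} Γ((p+1)/2) / √π (a standard fact, not specific to this paper); a small sanity-check sketch:

```python
import math
import numpy as np

def gamma_p(p):
    """gamma_p = (E|g|^p)^(1/p) for a standard Gaussian g, via the
    absolute-moment formula E|g|^p = 2^(p/2) * Gamma((p+1)/2) / sqrt(pi)."""
    moment = 2 ** (p / 2) * math.gamma((p + 1) / 2) / math.sqrt(math.pi)
    return moment ** (1 / p)

# Compare the closed form with an empirical estimate for p = 3.
rng = np.random.default_rng(0)
g = rng.standard_normal(1_000_000)
p = 3.0
print(gamma_p(p), np.mean(np.abs(g) ** p) ** (1 / p))
```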

Weighted ℓ_p-low rank approximation for p ∈ (0,2)

We also generalize Theorem 1 to the following weighted setting. Given a matrix A ∈ ℝ^{n×d}, an integer k, and a rank-r weight matrix W, we seek to find a rank-k matrix B approximately minimizing

‖W ∘ (A − B)‖_p^p,

where ∘ denotes the entrywise (Hadamard) product. Our algorithm runs in time n^{poly(kr/ε)}. We defer the details to Theorem 25.

Related Work

Our results, in particular Theorem 2 and Theorem 5, had been in submission as of April 2018. Shortly after posting this manuscript [6] to arXiv on 16 July 2018, we became aware that in an unpublished work Fomin et al. had independently obtained a very similar PTAS for Binary ℓ_0-Rank-k. Their manuscript [26] was posted to arXiv on 18 July 2018. Interestingly, [6, 26] have independently discovered i) a reduction between the Binary ℓ_0-Rank-k problem and a clustering problem with constrained centers; ii) a structural sampling theorem extending [2] which yields a simple but inefficient deterministic PTAS; and iii) an efficient sampling procedure, building on ideas from [43, 3, 1], which gives an efficient randomized PTAS. Notably, by establishing an additional structural result, Fomin et al. [26] design a faster sampling procedure which yields a randomized PTAS for the Binary ℓ_0-Rank-k problem that runs in time linear in nd, with a multiplicative overhead depending only on k and 1/ε.

1.2 Our Techniques

We give an overview of our techniques, separating them into those for our algorithms for p ∈ (0,2), those for our algorithms for p = 0, and those for our hardness proofs.

1.2.1 Algorithms for p ∈ (0,2)

We illustrate the techniques for p = 1; the algorithms for other p ∈ (0,2) follow similarly. Consider a target rank k. One of the surprising aspects of our (1+ε)-approximation result is that, for p = 1, it breaks a potential lower bound from [67]. Indeed, in Theorem G.8, they construct matrices A such that the closest rank-k matrix in the row span of A provides at best a c-approximation, for an absolute constant c > 1, to the optimal rank-k approximation of A!

This should be contrasted with p = 2, for which it is well known that for any matrix A there exists a subset of poly(k/ε) rows of A containing a k-dimensional subspace in its span which gives a (1+ε)-approximation (these are called column subset selection algorithms; see [71] for a survey). In fact, for p = 1, all known algorithms [67, 18] find a best k-dimensional subspace in either the span of the rows or of the columns of A, and thus provably cannot give better than a c-approximation. To bypass this barrier, we therefore critically need to leave the row space and column space of A.

Our starting point is the “guess a sketch” technique of [62], which was used in the context of weighted low rank approximation. Let us consider the optimization problem min_V ‖U*V − A‖_1, where U* is a left factor of an optimal ℓ_1-low rank approximation of A. Suppose we could choose a sketching matrix S with a small number of rows for which ‖S(U*V − A)‖_1 = (1 ± ε)‖U*V − A‖_1 for all V. Then, if we somehow knew SU*, we could optimize over V in the sketched space to find a good right factor V.

Of course we do not know SU*, but if S had a small number r of rows, then we could consider instead the ℓ_{1,2}-norm optimization problem min_V ‖SU*V − SA‖_{1,2}, where for a matrix C, ‖C‖_{1,2} is defined as the sum of the ℓ_2-norms of its columns. The solution to this relaxed problem is an O(√r)-approximation to the sketched ℓ_1 problem min_V ‖SU*V − SA‖_1, since the ℓ_1 and ℓ_2 norms of an r-dimensional vector differ by at most a √r factor.

In the ℓ_{1,2} norm, the solution can be written in terms of the so-called normal equations for regression, namely, V = (SU*)^† SA, where (·)^† denotes the Moore-Penrose pseudoinverse. The key property exploited in [67] is then that, although we do not know SU*, each row of (SU*)^† SA lies in the row span of SA, a low-dimensional subspace which one does know. This line of reasoning ultimately leads to a poly(k log(nd))-approximation.
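A minimal numpy illustration of this ℓ_2-style reasoning (variable names are illustrative; as explained next, this route by itself does not give a (1+ε)-approximation for ℓ_1):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, k, r = 200, 100, 3, 10

A = rng.standard_normal((n, k)) @ rng.standard_normal((k, d)) \
    + 0.01 * rng.standard_normal((n, d))
U_star = rng.standard_normal((n, k))   # stand-in for a left factor U*
S = rng.standard_normal((r, n))        # sketch with r << n rows

# Normal equations in the sketched space: V = (S U*)^+ (S A).
V = np.linalg.pinv(S @ U_star) @ (S @ A)

# Each row of V is a linear combination of the rows of S A,
# i.e. the solution stays inside the (known) row span of S A.
print(V.shape)  # (k, d)
```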

The approach above fails to give a (1+ε)-approximation for multiple reasons: (1) we may not be able to find a (1+ε)-approximation in the row span of SA, and (2) we lose a poly(r) factor when we switch to the ℓ_{1,2} norm.

Instead, suppose we were just to guess all the values of SU*. These values might be arbitrary real numbers, but observe that we can assume there is an optimal solution U*V* for which U* is a so-called ℓ_1-well-conditioned basis, which loosely speaking means that ‖U*z‖_1 is within a poly(k) factor of ‖z‖_1 for any coefficient vector z ∈ ℝ^k. Also, we can show that if the optimal cost is non-zero, then it is at least 1/poly(n). Furthermore, we can assume that the entries of SU* are bounded in absolute value by poly(n). These three facts allow us to round the entries of SU* to integer multiples of 1/poly(n) of absolute value at most poly(n). Now suppose we could also discretize the entries of S to multiples of 1/poly(n) and of absolute value at most poly(n). Then we would actually be able to guess the correct (rounded) SU* after n^{O(rk)} tries, where recall r is the number of rows of S. We will show below that r can be poly(k/ε), so this is within our desired running time of n^{poly(k/ε)}.

In general, if ‖Sy‖_1 = (1 ± ε)‖y‖_1 for all vectors y in the column span of U*, then we say that S defines a subspace embedding. At this point, we can use the triangle inequality to get a constant factor approximation. If S is a subspace embedding, then for every V,

‖S(U*V − A)‖_1 ≥ ‖SU*(V − V*)‖_1 − ‖S(U*V* − A)‖_1 ≥ (1 − ε)‖U*(V − V*)‖_1 − ‖S(U*V* − A)‖_1,

and ‖S(U*V* − A)‖_1 can be bounded by a constant times the optimal cost ‖U*V* − A‖_1, so by taking V to be a minimizer of ‖S(U*V − A)‖_1 we can get a constant approximation factor. The triangle inequality was useful here because S had small distortion on the subspace defined by U*. To improve this result, we would need a mapping that has small distortion on the affine space {U*V − A : V ∈ ℝ^{k×d}}, as V varies.

Given SA and SU*, if in fact S has the property that ‖S(U*V − A)‖_1 = (1 ± ε)‖U*V − A‖_1 for all V, then we will be in good shape. At this point we can solve for the optimal V in min_V ‖SU*V − SA‖_1 by solving an ℓ_1-regression problem using linear programming. Notice that, unlike [62], the approach described above does not create “unknowns” to represent the entries of SU* and set up a polynomial system of inequalities. For Frobenius norm error, that approach is feasible because ‖SU*V − SA‖_F^2 can be minimized over each column using the normal equations for regression. However, we do not know how to set up a polynomial system of inequalities for ℓ_1-error (which would define V in terms of the variables representing SU*).
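Each column of the sketched ℓ_1 problem is an ℓ_1-regression, which can be written as a linear program; a minimal scipy sketch (a generic LP formulation, not the paper's specific routine):

```python
import numpy as np
from scipy.optimize import linprog

def l1_regression(M, b):
    """Solve min_v ||M v - b||_1 as a linear program:
    minimize sum(t) subject to -t <= M v - b <= t."""
    r, k = M.shape
    c = np.concatenate([np.zeros(k), np.ones(r)])      # objective: sum of t
    A_ub = np.block([[ M, -np.eye(r)],
                     [-M, -np.eye(r)]])
    b_ub = np.concatenate([b, -b])
    bounds = [(None, None)] * k + [(0, None)] * r       # v free, t >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.x[:k]

rng = np.random.default_rng(3)
M = rng.standard_normal((40, 3))
b = M @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(40)
print(l1_regression(M, b))
```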

Unfortunately, the approach above is fatally flawed; there is no known sketching matrix S with a small number of rows for which ‖Sy‖_1 = (1 ± ε)‖y‖_1 for all y. Instead, we adapt a “median-based” embedding with a non-standard subspace embedding analysis that appeared in the context of sparse recovery [5]. In Lemma F.1 of that paper, it is shown that if W is a k-dimensional subspace of ℝ^n, and S is an r × n matrix of i.i.d. standard Cauchy random variables with r = poly(k/ε) rows, then with constant probability, med(Sy) = (1 ± ε)‖y‖_1 simultaneously for all y ∈ W. Here, for a vector z, med(z) denotes the median of the absolute values of its entries, and for a matrix C, med(C) denotes the sum of the medians of its columns.

In our context, this gives us that for a fixed column A_j of A and the corresponding column V_j of V, if S is an i.i.d. Cauchy matrix with poly(k/ε) rows, then with constant probability med(S(U*V_j − A_j)) = (1 ± ε)‖U*V_j − A_j‖_1 for all vectors V_j. Since V_j is only k-dimensional, and one can show that its entries can be taken to be integer multiples of 1/poly(n) bounded in absolute value by poly(n), we can enumerate over all V_j and find the best solution. We need, however, to adapt the argument in [5] to argue that if, rather than taking the median, we take a (1/2 ± Θ(ε))-quantile, we still obtain a subspace embedding. We do this in Lemma 7 and explain below why this modification is crucial for the argument.
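A toy numpy illustration of the median-based estimate from [5], for a single fixed vector (the subspace statement used in the paper is stronger):

```python
import numpy as np

rng = np.random.default_rng(4)
n, r = 5_000, 400

x = rng.standard_normal(n) * (rng.random(n) < 0.1)   # a test vector
S = rng.standard_cauchy((r, n))                      # i.i.d. Cauchy sketch

# By 1-stability, each coordinate of S x is Cauchy with scale ||x||_1, and
# the median of |Cauchy(scale s)| equals s, so med(|S x|) estimates ||x||_1.
est = np.median(np.abs(S @ x))
print(est, np.abs(x).sum())
```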

Unfortunately, this still does not work. The issue is that S succeeds only with constant probability in achieving med(S(U*V_j − A_j)) = (1 ± ε)‖U*V_j − A_j‖_1 for all vectors V_j. Call an index j with this property good. A naïve amplification of the success probability to 1 − 1/d would allow us to union bound over all columns j, but this would require S to have a number of rows growing with log d. At this point, though, we would not obtain a PTAS, since enumerating the entries of SU* would take quasi-polynomial (n^{Ω(log d)}) time. Nor can we use a different S for different columns of A, since we may then guess a different SU* for different columns j and not obtain a consistent solution.

Before proceeding, we first relax the requirement that med(S(U*V_j − A_j)) = (1 ± ε)‖U*V_j − A_j‖_1 for all V_j. We only need med(S(U*V_j − A_j)) ≥ (1 − ε)‖U*V_j − A_j‖_1 for all V_j, together with med(S(U*V*_j − A_j)) ≤ (1 + ε)‖U*V*_j − A_j‖_1 for the fixed optimum V*_j. We can prove the latter by using tail bounds for a Cauchy random variable; we do so in Lemma 6.

Moreover, we next argue that it suffices to have the properties: i) a (1 − ε)-fraction of columns are good, and ii) the error introduced by bad columns is small. We can achieve (i) by increasing the number of rows of S by a poly(1/ε) factor, which still allows for an enumeration in time n^{poly(k/ε)}. The main issue is to control the error from bad columns. In particular, it is possible to have a matrix V and a column j such that ‖U*V_j − A_j‖_1 is large and yet med(S(U*V_j − A_j)) is small, which results in accepting a bad solution V. While for a fixed matrix V the expected contribution of bad columns is small, we need to argue that this holds simultaneously for every matrix V.

In order to control the error from bad columns, we first show that med(S(U*V*_j − A_j)) = O(‖U*V*_j − A_j‖_1) for the fixed matrix V*, and then we demonstrate that the total contribution to the sketched cost from bad columns is small. We show the latter using Markov's bound for the fixed matrix V*. Combining this with the former yields that the total contribution of the bad columns to ‖U*V* − A‖_1 (in the original, unsketched space) is small.

We convert the preceding argument for bad columns of the fixed matrix V* into an argument for bad columns of a general matrix V. Inspired by ideas for the ℓ_{1,2} norm established in [20], we partition the bad columns of a given matrix V into two classes, using the following measurement, which differs substantially from [20]. We look at quantiles to handle the median operator, and we say that a bad column j is large if, for suitable constants C = O(1) and θ = Θ(ε),

(1)   ‖U*V_j − A_j‖_1 ≥ C · (‖U*V*_j − A_j‖_1 + q_θ(S(U*V* − A)_j)),

where q_θ(S(U*V* − A)_j) is the θ-th quantile of the coordinates of column j of S(U*V* − A) arranged in order of non-increasing absolute values. Otherwise, a bad column is small.

We show that small bad columns can be handled by applying the preceding argument for the fixed matrix V*, since, intuitively, the error they introduce is dominated by the contribution of the corresponding columns of S(U*V* − A), and we can control this contribution.

Our analysis for the large bad columns uses a different approach, which we summarize in Claim 2. The key insight is to use the additivity (linearity) of the sketch matrix S, and to write

(2)   S(U*V_j − A_j) = SU*(V_j − V*_j) + S(U*V*_j − A_j).

Then, by applying our “robust” version (Lemma 5) of the median-based subspace embedding of [5], it follows that at least a (1/2 + Θ(ε))-fraction of the entries of the column vector SU*(V_j − V*_j) have absolute value at least (1 − ε)‖U*(V_j − V*_j)‖_1, which is at least 2 · q_θ(S(U*V* − A)_j), where (a) we use the triangle inequality to lower bound ‖U*(V_j − V*_j)‖_1 by ‖U*V_j − A_j‖_1 − ‖U*V*_j − A_j‖_1, and (b) we apply (1) since the bad column is large. Thus, at least a (1/2 + Θ(ε))-fraction of the entries of SU*(V_j − V*_j) have absolute value at least

(3)   2 · q_θ(S(U*V* − A)_j).

Since at most a θ-fraction of the entries of S(U*V* − A)_j have absolute value at least q_θ(S(U*V* − A)_j), by definition of the quantile, it follows by (3) that in equation (2) at most a θ-fraction of the entries of SU*(V_j − V*_j) can have their absolute value reduced to less than q_θ(S(U*V* − A)_j). Furthermore, by (3) at least a (1/2 + Θ(ε))-fraction of the entries of SU*(V_j − V*_j) have absolute value at least 2 · q_θ(S(U*V* − A)_j). Therefore, the median of the absolute values of the entries of S(U*V_j − A_j) is at least q_θ(S(U*V* − A)_j), as desired.

Our analysis for general p ∈ (0,2) uses similar arguments, but relies on p-stable random variables instead of Cauchy random variables. In the case when p < 1, special care is needed since the triangle inequality does not hold.
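A toy illustration of the p-stable analogue of the median estimator (a generic sketch using scipy's stable distribution with an empirical calibration constant; not the paper's algorithm):

```python
import numpy as np
from scipy.stats import levy_stable

def lp_norm_estimate(x, p, r=801, seed=0):
    """Estimate ||x||_p with an r-row sketch of i.i.d. symmetric p-stable
    entries: each coordinate of S x is p-stable with scale ||x||_p, so the
    median of absolute values, rescaled by the median magnitude of a single
    standard p-stable variable, recovers ||x||_p (for p in (0,2))."""
    rng = np.random.default_rng(seed)
    S = levy_stable.rvs(alpha=p, beta=0.0, size=(r, len(x)), random_state=rng)
    calib = np.median(np.abs(levy_stable.rvs(alpha=p, beta=0.0,
                                             size=20_000, random_state=rng)))
    return np.median(np.abs(S @ x)) / calib

x = np.array([3.0, -1.0, 0.0, 2.0])
p = 1.5
print(lp_norm_estimate(x, p), (np.abs(x) ** p).sum() ** (1 / p))
```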

1.2.2 Algorithms for p = 0

In the case when p = 0 and the entries of the matrix A belong to a finite field GF(q) for constant q, we use similar arguments as in the case p ∈ (0,2). Here, instead of p-stable random variables, we apply a linear sketch for estimating the number of distinct elements, established in [35]. We show that it suffices to set the number of rows of the sketching matrix to poly(k/ε) · log d. Further, since each entry of SU* has only q possible values, it is possible to guess the matrix SU* by enumeration in time d^{poly(k/ε)}, which leads to a total running time of n · d^{poly(k/ε)}. This yields a PTAS for every constant k. We defer the details to Section 3.

We now consider the binary setting, where the input matrix A has entries in {0,1} and any solution U, V is also restricted to have entries in {0,1}. In this case, the Generalized Binary ℓ_0-Rank-k problem can be rephrased as a clustering problem with constrained centers, whose goal is to choose a set of centers satisfying a certain system of linear equations, in order to minimize the total ℓ_0-distance of all columns of A to their closest center. The main difference to usual clustering problems is that the centers cannot be chosen independently.

We view the choice of the matrix U as picking a set of 2^k “cluster centers” {U · y : y ∈ {0,1}^k}. Observe that any column of U · V is in this set, and thus we view the choice of the column V_{:,j} as picking one of the constrained centers for the column A_{:,j}. Formally, we rephrase the Generalized Binary ℓ_0-Rank-k problem as

(4)   min_{U ∈ {0,1}^{n×k}} ∑_{j=1}^{d} min_{y ∈ {0,1}^k} ‖A_{:,j} − U · y‖_0.

Any matrix U gives rise to a “clustering”, namely a partitioning {C_y}_{y ∈ {0,1}^k} of the columns of A, where C_y contains the columns whose closest center is U · y. If we knew an optimal clustering {C*_y}_{y ∈ {0,1}^k}, for some optimal matrix U*, we could compute an optimal matrix U as the best response to {C*_y}_y. Note that

∑_{y ∈ {0,1}^k} ∑_{j ∈ C*_y} ‖A_{:,j} − U · y‖_0 = ∑_{i=1}^{n} ∑_{y ∈ {0,1}^k} ∑_{j ∈ C*_y} [A_{i,j} ≠ ⟨U_{i,:}, y⟩].

Therefore, given {C*_y}_y, we can compute independently for each row index i the optimal row U_{i,:}, by enumerating over all 2^k possible binary vectors of dimension k and selecting the one that minimizes the summation ∑_{y ∈ {0,1}^k} ∑_{j ∈ C*_y} [A_{i,j} ≠ ⟨U_{i,:}, y⟩].
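A small illustrative sketch of this best-response step (names and data layout are hypothetical; `clusters` maps each y ∈ {0,1}^k, as a tuple, to the indices of the columns assigned to center U · y, and `inner` is the given inner product function):

```python
import numpy as np
from itertools import product

def best_response_U(A, clusters, inner):
    """Row-wise optimal U for a fixed clustering: for each row i, try all
    2^k candidate rows u and count disagreements with the assigned columns."""
    n = A.shape[0]
    k = len(next(iter(clusters)))
    candidates = [np.array(u) for u in product([0, 1], repeat=k)]
    U = np.zeros((n, k), dtype=int)
    for i in range(n):
        best = min(candidates, key=lambda u: sum(
            np.sum(A[i, list(cols)] != inner(u, np.array(y)))
            for y, cols in clusters.items() if len(cols) > 0))
        U[i] = best
    return U

# Example inner product over GF(2): <u, y> = sum_l u_l * y_l mod 2.
gf2_inner = lambda u, y: int(np.dot(u, y) % 2)
```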

What if instead we could only sample from the optimal clusters? That is, suppose that we are allowed to draw a constant number of samples from each of the optimal clusters C*_y uniformly at random. Denote by S_y the samples drawn from C*_y. A natural approach is to replace the exact cost above by the following unbiased estimator:

∑_{y ∈ {0,1}^k} (|C*_y| / |S_y|) ∑_{j ∈ S_y} ‖A_{:,j} − U · y‖_0.

We show that, with good probability, any matrix U minimizing the estimated cost is close to an optimal solution. In particular, we prove for any matrix U that, in expectation over the samples,

(5)   |est(U) − cost(U)| ≤ ε · (cost(U) + cost(U*)),

where est(U) denotes the estimated cost, cost(U) = ∑_{y} ∑_{j ∈ C*_y} ‖A_{:,j} − U · y‖_0 denotes the cost of U with respect to the optimal clustering, and U* is an optimal solution.

The biggest issue in proving statement (5) is that the number of samples is independent of the ambient space dimension n. A key prior probabilistic result, established by Alon and Sudakov [2], gives an additive ε·nd approximation for the maximization version of a clustering problem with unconstrained centers, known as Hypercube Segmentation. Since the optimum value of this maximization problem is always at least nd/2, a multiplicative (1+ε)-approximation is obtained. Our contribution is twofold. First, we generalize their analysis to clustering problems with constrained centers, and second, we prove a multiplicative (1+ε)-approximation guarantee for the minimization version. The proof of (5) takes a significant fraction of this chapter.

We combine the sampling result (5) with the following observations to obtain a deterministic polynomial time approximation scheme (PTAS) running in time n · d^{2^{O(k)}/ε^2}. We later discuss how to further improve this running time. Let U*, V* be an optimal solution to the Generalized Binary ℓ_0-Rank-k problem.

  1. To evaluate the estimated cost, we need the sizes |C*_y| of an optimal clustering. We can guess these sizes with an overhead of d^{O(2^k)} in the running time. In fact, it suffices to know these cardinalities approximately, see Lemma 23, and thus this overhead can be reduced to (O(ε^{-1} log d))^{2^k}. (In Section 4.4, we establish an efficient sampling procedure, see Algorithm 3, that further reduces the total overhead for guessing the sizes of an optimal clustering.)

  2. Using the (approximate) sizes |C*_y| and the samples S_y drawn u.a.r. from C*_y, for all y ∈ {0,1}^k, we can compute, in time 2^{O(k)} · n · s where s is the total number of samples, a matrix U minimizing the estimated cost, since the estimator can be split into a sum over the rows of U and each row is chosen independently as a minimizer among all 2^k possible binary vectors of dimension k; see the sketch after this list.

  3. Given U, we can compute a best response matrix V, which has cost ∑_j min_y ‖A_{:,j} − U · y‖_0, and thus by (5) expected cost at most (1 + O(ε)) · OPT.

  4. The only remaining step is to draw the samples from the optimal clustering. However, in time d^{2^{O(k)}/ε^2} we can enumerate all possible families {S_y}_{y ∈ {0,1}^k} of samples, and the best such family yields a solution that is at least as good as a random sample. In total, we obtain a PTAS running in time n · d^{2^{O(k)}/ε^2}.
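A small sketch of Step (2), minimizing the estimated cost row by row from per-cluster samples and guessed sizes (hypothetical names, mirroring the best-response routine above; `samples` maps y to sampled column indices and `sizes` maps y to the guessed |C*_y|):

```python
import numpy as np
from itertools import product

def minimize_estimated_cost(A, samples, sizes, inner):
    """Pick each row of U to minimize the estimator
    sum_y (|C*_y| / |S_y|) * sum_{j in S_y} [A_ij != <U_i, y>]."""
    n = A.shape[0]
    k = len(next(iter(samples)))
    candidates = [np.array(u) for u in product([0, 1], repeat=k)]
    U = np.zeros((n, k), dtype=int)
    for i in range(n):
        best = min(candidates, key=lambda u: sum(
            sizes[y] / len(cols) *
            np.sum(A[i, list(cols)] != inner(u, np.array(y)))
            for y, cols in samples.items() if len(cols) > 0))
        U[i] = best
    return U
```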

The largest part of this chapter is devoted to making the above PTAS efficient, i.e., to reduce the running time from n · d^{2^{O(k)}/ε^2} to Õ((1/ε)^{2^{O(k)}/ε^2} · nd), where Õ hides a factor log^{2^k} d. By the preceding outline, it suffices to speed up Steps (1) and (4), i.e., to design a fast algorithm that guesses approximate cluster sizes and samples from the optimal clusters.

The standard sampling approach for clustering problems such as k-means [43] is as follows. At least one of the clusters of the optimal solution is “large”, say of size at least d/2^k. Sample a constant number t of columns uniformly at random from the set of all columns. Then with probability at least 2^{-O(tk)} all samples lie in this large cluster, and in this case they form a uniform sample from it. In the usual situation without restrictions on the cluster centers, the samples from the cluster allow us to determine an approximate cluster center. Do this as long as large clusters exist (recall that we have guessed approximate cluster sizes in Step (1), so we know which clusters are large). When all remaining clusters are small, remove the half of the columns that are closest to the approximate cluster centers determined so far, and estimate the cost of these columns using those centers. As there are no restrictions on the cluster centers, this yields a good cost estimation of the removed columns, and since the ℓ_0-distance is additive, the algorithm recurses on the remaining columns, i.e. on an instance of half the size. We continue this process until each cluster has been sampled. This approach has been used to obtain linear time approximation schemes for k-means and k-median in a variety of ambient spaces [43, 44, 1].

The issue in our situation is that we cannot fix a cluster center by looking only at the samples from that cluster, since we have dependencies among cluster centers. We nevertheless make this approach work, by showing that a uniformly random column of a cluster is a good “representative” of the cluster with not-too-small probability. In the case when all remaining clusters are small, we then simply remove the columns that are closest to the representatives of the clusters that we already sampled from. Although these representatives can be far from the optimal cluster centers, due to the linear restrictions on the latter, we show in Section 4.4 that this algorithm nevertheless yields samples from the optimal clusters.

We prove that the preceding algorithm succeeds with probability bounded below by a quantity depending only on k and ε. Further, we show that the approximate cluster sizes of an optimal clustering can be guessed with an overhead of (O(ε^{-1} log d))^{2^k}. In contrast to the standard clustering approach, the representatives do not yield a good cost estimation of the removed columns. We overcome this issue by first collecting all samples from the optimal clusters, and then computing approximate cluster centers that satisfy certain linear constraints, i.e. a matrix U and its best response matrix V. The latter computation runs in time linear in the size of the original instance, and this, in combination with the guessing overhead, yields the total running time of Õ((1/ε)^{2^{O(k)}/ε^2} · nd). For further details, we refer the reader to Algorithm 3 in Subsection 4.4.2.

Our algorithm achieves a substantial generalization of the standard clustering approach and applies to the situation with constrained centers. This yields the first randomized almost-linear time approximation scheme for the Generalized Binary ℓ_0-Rank-k problem.

1.2.3 Hardness

Our hardness results for the ℓ_p norm, for p ∈ (1,2) in Theorem 4 and for p > 2 in Theorem 19, are established via a connection to the matrix p→q norm problem and its variants. Given a matrix ,