Optimal Analysis of Subset-Selection Based L_p Low Rank Approximation

10/30/2019 ∙ by Chen Dan, et al.

We study the low rank approximation problem of any given matrix A over R^{n×m} and C^{n×m} in entry-wise ℓ_p loss, that is, finding a rank-k matrix X such that ‖A-X‖_p is minimized. Unlike the traditional ℓ_2 setting, this particular variant is NP-hard. We show that the algorithm of column subset selection, which was an algorithmic foundation of many existing algorithms, enjoys approximation ratio (k+1)^{1/p} for 1 ≤ p ≤ 2 and (k+1)^{1-1/p} for p ≥ 2. This improves upon the previous O(k+1) bound for p ≥ 1 (chierichetti2017algorithms). We complement our analysis with lower bounds; these bounds match our upper bounds up to constant 1 when p ≥ 2. At the core of our techniques is an application of the Riesz-Thorin interpolation theorem from harmonic analysis, which might be of independent interest to other algorithmic designs and analyses more broadly. As a consequence of our analysis, we provide better approximation guarantees for several other algorithms with various time complexities. For example, to make the algorithm of column subset selection computationally efficient, we analyze a polynomial time bi-criteria algorithm which selects O(k log m) columns. We show that this algorithm has an approximation ratio of O((k+1)^{1/p}) for 1 ≤ p ≤ 2 and O((k+1)^{1-1/p}) for p ≥ 2. This improves over the best-known bound with an O(k+1) approximation ratio. Our bi-criteria algorithm also implies an exact-rank method in polynomial time with a slightly larger approximation ratio.


1 Introduction

Low rank approximation has wide applications in compressed sensing, numerical linear algebra, machine learning, and many other domains. In compressed sensing, low rank approximation serves as an indispensable building block for data compression. In numerical linear algebra and machine learning, low rank approximation is the foundation of many data processing algorithms, such as PCA. Given a data matrix A ∈ F^{n×m}, low rank approximation aims at finding a low-rank matrix X such that

‖A - X‖ is minimized subject to rank(X) ≤ k.    (1)

Here the field F can be either R or C. The focus of this work is on the case when ‖·‖ is the entry-wise ℓ_p norm, ‖M‖_p = (Σ_{i,j} |M_{ij}|^p)^{1/p}, and we are interested in an estimate X̂ of rank at most k with a tight approximation ratio α, so that we have the guarantee:

‖A - X̂‖_p ≤ α · min_{rank(X) ≤ k} ‖A - X‖_p.

As noted earlier, such low-rank approximation is a fundamental workhorse of machine learning. The key reason to focus on approximations with respect to general ℓ_p norms, in contrast to the typical ℓ_2 norm, is that these general norms are better able to capture a broader range of realistic noise in complex datasets. For example, it is well-known that the ℓ_1 norm is more robust to sparse outliers (candes2011robust; huber2011robust; xu1995robust). So the ℓ_1 low-rank approximation problem is a robust version of classic PCA, which uses the ℓ_2 norm, and it has received tremendous attention in machine learning, computer vision and data mining (meng2013robust; wang2013bayesian; xiong2011direct). A related problem, ℓ_p linear regression, has also been studied extensively in the statistics community, and the two problems share similar motivation. In particular, if we assume a statistical model A = X* + E, where X* is a low rank matrix and the entries E_{ij} are i.i.d. noise, then different values of p correspond to the MLE under different noise distributions, say p = 1 for Laplacian noise and p = 2 for Gaussian noise.
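
To make the MLE correspondence concrete, here is the standard one-line calculation (a sketch; the scale parameters b and σ below are generic placeholders, not quantities from the paper):

\[
-\log \Pr(A \mid X) = \sum_{i,j} \frac{|A_{ij} - X_{ij}|}{b} + \mathrm{const} \;\propto\; \|A - X\|_1 \qquad (E_{ij} \sim \mathrm{Laplace}(0, b)),
\]
\[
-\log \Pr(A \mid X) = \sum_{i,j} \frac{(A_{ij} - X_{ij})^2}{2\sigma^2} + \mathrm{const} \;\propto\; \|A - X\|_2^2 \qquad (E_{ij} \sim \mathcal{N}(0, \sigma^2)),
\]

so maximizing the likelihood over rank-k matrices X is exactly entry-wise ℓ_1 (respectively ℓ_2) low rank approximation.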

While it has better empirical and statistical properties, the key bottleneck to solving the problem in (1) is computational: the problem is known to be NP-hard in general. For example, the ℓ_1 low-rank approximation is NP-hard to solve exactly even when k = 1 (gillis2018complexity), and is even hard to approximate with large error under the Exponential Time Hypothesis (song2017low). (gillis2017low) proved the NP-hardness of the problem when p = ∞. A recent work (ban2019ptas) proves that the problem has no constant factor approximation algorithm running in time 2^{k^δ} for a constant δ > 0, assuming the correctness of the Small Set Expansion Hypothesis and the Exponential Time Hypothesis. The authors also proposed a PTAS (Polynomial Time Approximation Scheme) with approximation ratio (1 + ε) when 1 ≤ p < 2. However, the running time is as large as n^{poly(k/ε)}.

Many other efforts have been devoted to designing approximation algorithms in order to alleviate the computational issues of ℓ_p low-rank approximation. One promising approach is to apply subgradient descent based methods or alternating minimization (kyrillidis2018simple). Unfortunately, the loss surface of problem (1) suffers from saddle points even in the simplest case p = 2 (baldi1989neural), and such stationary points might be arbitrarily worse than the global optimum. Therefore, these local search algorithms may not work well for the ℓ_p low-rank approximation problem, as they may easily get stuck at bad stationary points without any guarantee.

Instead, we consider another line of research: the heuristic algorithm of column subset selection (CSS). Here, the algorithm proceeds by choosing the best k columns of A as an estimate of the column space of X and then solving an ℓ_p linear regression problem in order to obtain the optimal row space of X. See Algorithm 1 for the detailed procedure. Although the vanilla form of the subset selection based algorithm has time complexity exponential in the rank k, it can be slightly modified into polynomial time bi-criteria algorithms which select more than k columns (chierichetti2017algorithms). Most importantly, these algorithms are easy to implement and run fast with nice empirical performance. Thus, subset selection based algorithms might seem to effectively alleviate the computational issues of problem (1). The caveat, however, is that CSS might seem like a simple heuristic, with a potentially very large worst-case approximation ratio.

In this paper, we show that CSS yields surprisingly reasonable approximation ratios, which we also show to be tight by providing corresponding lower bounds, thus providing a strong theoretical backing for the empirical observations underlying CSS.

Due in part to its importance, there has been a burgeoning set of recent analyses of column subset selection. In the traditional low rank approximation problem with Frobenius norm error (the case p = 2 in our setting), deshpande2006matrix showed that CSS achieves a √(k+1) approximation ratio. The authors also showed that this bound is tight (both the upper and lower bounds can be recovered by our analysis). frieze2004fast; deshpande2006adaptive; boutsidis2009improved; deshpande2010efficient improved the running time of CSS with different sampling schemes while preserving similar approximation bounds. The CSS algorithm and its variants have also been applied and analyzed under various other settings. For instance, drineas2008relative and boutsidis2017optimal studied the CUR decomposition with the Frobenius norm. wang2015column studied the CSS problem in the missing-data case. bhaskara2018non studied CSS for non-negative matrices under entry-wise ℓ_p error. dan_et_al:LIPIcs:2018:9623 gave tight approximation bounds for CSS in the finite-field binary matrix setting. Furthermore, song2017low considered low rank tensor approximation with the Frobenius norm.

Despite a large amount of work on the subset-selection algorithm and the ℓ_p low rank approximation problem, many fundamental questions remain unresolved. Probably one of the most important open questions is: what is the tight approximation ratio for the subset-selection algorithm in the ℓ_p low rank approximation problem, up to a constant factor? In (chierichetti2017algorithms), the approximation ratio is shown to be upper bounded by k+1, together with a lower bound that does not match. The problem becomes even more challenging when one requires the approximation ratio to be tight up to a constant factor, as little was known about a direct tool to achieve this goal in general. In this work, we improve both the upper and lower bounds in (chierichetti2017algorithms) to optimal when p ≥ 2. Our bounds are still applicable and improve over (chierichetti2017algorithms) when 1 ≤ p < 2, but in that regime there remains a gap between the upper and lower bounds.

1.1 Our Results

The best-known approximation ratio of subset selection based algorithms for ℓ_p low-rank approximation is O(k+1) (chierichetti2017algorithms). In this work, we give an improved analysis of this algorithm. In particular, we show that the column subset selection procedure in Algorithm 1 is an α(k, p)-approximation, where

α(k, p) = (k+1)^{1/p} for 1 ≤ p ≤ 2,   and   α(k, p) = (k+1)^{1-1/p} for p ≥ 2.

This improves over Theorem 4 in (chierichetti2017algorithms), which proved that the algorithm is an O(k+1)-approximation for all p ≥ 1. Below, we state our main theorem formally:

[Upper bound] The subset selection algorithm in Algorithm 1 is an α(k, p)-approximation. Our proof of Theorem 1.1 is built upon a novel application of the Riesz-Thorin interpolation theorem. In particular, after proving the special cases p = 1, 2, ∞, we are able to interpolate the approximation ratio for all intermediate values of p. Our techniques might be of independent interest to other ℓ_p norm or Schatten-p norm related problems more broadly. See Section 1.2 for more discussion.

We also complement our positive result on the subset selection algorithm with a negative result. Surprisingly, our upper bound matches our lower bound exactly (up to constant 1) for p ≥ 2. Below, we state our negative result formally: [Lower bound] There exist infinitely many different values of k such that the approximation ratio of any k-subset-selection based algorithm is at least (k+1)^{1-1/p} for rank-k approximation. Note that our lower bound strictly improves the bound in (chierichetti2017algorithms). The main idea of the proof can be found in Section 1.2, and we give the complete proof in Appendix B.

One drawback of Algorithm 1 is that its running time scales exponentially with the rank k. However, it serves as an algorithmic foundation for many existing computationally efficient algorithms. For example, a bi-criteria variant of this algorithm runs in polynomial time, only requiring the rank parameter to be a little over-parameterized. Our new analysis applies to this algorithm as well. Below, we state our result informally: [Informal statement of Theorem 4] There is a bi-criteria algorithm which runs in polynomial time and selects O(k log m) columns of A. The algorithm is an O(α(k, p))-approximation algorithm.

Our next result is a computationally efficient, exact-rank algorithm with a slightly larger approximation ratio. Below, we state our result informally:

[Informal statement of Theorem 5] There is an algorithm which solves problem (1) exactly at rank k, runs in polynomial time, and achieves an approximation ratio only slightly larger than that of the bi-criteria algorithm above, provided a mild condition on the parameters holds (see Section 4 and the appendix for the precise statement).

1:  Input: Data matrix A ∈ R^{n×m} and rank parameter k.
2:  Output: U ∈ R^{n×k} and V ∈ R^{k×m} such that rank(UV) ≤ k and ‖A - UV‖_p is small.
3:  for every subset S ⊆ [m] with |S| = k do
4:     U_S ← A_S, the columns of A indexed by S.
5:     Run ℓ_p linear regression over V that minimizes the loss ‖A - U_S V‖_p.
6:     Let V_S be the minimizer.
7:  end for
8:  return (U_S, V_S) which minimizes ‖A - U_S V_S‖_p over all subsets S.
Algorithm 1 An α(k, p)-approximation to problem (1) by column subset selection.
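
For concreteness, the following is a minimal Python sketch of Algorithm 1. The exact ℓ_p regression subroutine is replaced here by a generic numerical minimizer warm-started at the least-squares solution; this solver choice, the helper names, and the numpy/scipy dependencies are our own illustrative assumptions, not part of the paper.

import itertools
import numpy as np
from scipy.optimize import minimize

def lp_regression(U, A, p):
    # Approximately minimize ||A - U V||_p over V (entry-wise l_p loss),
    # warm-started at the least-squares solution.
    n, k = U.shape
    m = A.shape[1]
    V0 = np.linalg.lstsq(U, A, rcond=None)[0]
    loss = lambda v: np.sum(np.abs(A - U @ v.reshape(k, m)) ** p)
    res = minimize(loss, V0.ravel(), method="L-BFGS-B")
    return res.x.reshape(k, m), res.fun ** (1.0 / p)

def css_low_rank(A, k, p):
    # Algorithm 1: try every k-column subset and keep the best one.
    n, m = A.shape
    best = None
    for S in itertools.combinations(range(m), k):
        U = A[:, list(S)]                      # candidate column space
        V, err = lp_regression(U, A, p)        # best row space for this subset
        if best is None or err < best[2]:
            best = (U, V, err)
    return best                                # (U, V, error); A ~= U @ V with rank <= k

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 8))
U, V, err = css_low_rank(A, k=2, p=1.5)
print("entry-wise l_p error:", err)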

1.2 Our Techniques

In this section, we give a detailed discussion of the techniques used in our proofs. We start with the analysis of the approximation ratio of the column subset selection algorithm.

Remark: Throughout this paper, we state the theorems for real matrices. The results can be naturally generalized to complex matrices as well.

Notations: We denote by A ∈ R^{n×m} the input matrix, and by A_i the i-th column of A. A* is the optimal rank-k approximation, where A* = argmin_{rank(X) ≤ k} ‖A - X‖_p, and we write OPT = ‖A - A*‖_p.

Δ_i = A_i - A*_i is the error vector on the i-th column, and Δ_{j,i} is the j-th element of the vector Δ_i. For any set S of columns, define the error of projecting A onto A_S by err(S) = min_V ‖A - A_S V‖_p. Let S = {i_1, …, i_k} be a subset of [m] with cardinality k. We denote by A_S the following column subset in matrix A: A_S = [A_{i_1}, …, A_{i_k}]. Similarly, we denote by Δ_{j,S} the following column subset in the j-th row of the error matrix Δ = A - A*: Δ_{j,S} = (Δ_{j,i_1}, …, Δ_{j,i_k}). Denote by S* the column subset which gives the smallest approximation error, i.e., S* = argmin_{|S| = k} err(S).

Analysis in Previous Work: In order to show that the column subset selection algorithm gives an α-approximation, we need to prove that

err(S*) ≤ α · OPT.    (2)

Directly bounding err(S*) is prohibitive. In (chierichetti2017algorithms), the authors proved an upper bound of α = k+1 in two steps. First, the authors constructed a specific subset S_0, and upper bounded err(S*) by err(S_0). Their construction chooses S_0 as the minimizer of a carefully designed surrogate objective over all k-subsets.

In the second step, chierichetti2017algorithms upper bounded err(S_0) by considering the approximation error on each column A_i, and upper bounded the distance from A_i to the subspace spanned by A_{S_0} using the triangle inequality of the ℓ_p distance. They showed that this distance is at most (k+1) times ‖Δ_i‖_p, uniformly for all columns i. Therefore, the approximation ratio is bounded by k+1. Our approach differs from the above analysis in both steps.

Weighted Average: In the first step, we use a so-called weighted average technique, inspired by the approach in (deshpande2006matrix; dan_et_al:LIPIcs:2018:9623). Instead of using the error of one specific column subset as an upper bound, we use a weighted average over all possible column subsets, i.e.,

err(S*) ≤ Σ_S w_S · err(S),

where the non-negative weights w_S sum to one and are carefully chosen for each column subset S. This weighted average technique captures more information from all possible column subsets, rather than only from one specific subset, and leads to a tighter bound.

Riesz-Thorin Interpolation Theorem: In the second step, unlike (chierichetti2017algorithms), which simply used the triangle inequality to prove the upper bound, our technique leads to a more refined analysis of the approximation error for each subset S. With the weighted average technique of the first step, proving a technical inequality (Lemma 2) concerning determinants suffices to complete the analysis of the approximation ratio. In the proof of this lemma, we introduce powerful tools from harmonic analysis, namely the theory of interpolating linear operators. The Riesz-Thorin theorem is a classical result in interpolation theory that gives bounds for the L_p to L_q operator norm. In general, it is easier to prove estimates in spaces like L_1, L_2, and L_∞. Interpolation theory enables us to generalize results in those spaces to the L_p and L_q spaces in between, with an explicit operator norm. By the Riesz-Thorin interpolation theorem, we are able to prove the lemma by just checking the special cases p = 1, 2, ∞, and then interpolating the inequality for all intermediate values of p.
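
To see how the exponents in α(k, p) arise, here is the interpolation bookkeeping (a sketch, taking as given the endpoint constants (k+1), (k+1)^{1/2}, and (k+1) at p = 1, 2, ∞, which are exactly the values of α(k, p) at those endpoints):

\[
\frac{1}{p} = \frac{1-\theta}{1} + \frac{\theta}{2}, \qquad
(k+1)^{1-\theta}\,\big((k+1)^{1/2}\big)^{\theta} = (k+1)^{1-\theta/2} = (k+1)^{1/p}
\qquad (1 \le p \le 2),
\]
\[
\frac{1}{p} = \frac{1-\theta}{2} + \frac{\theta}{\infty}, \qquad
\big((k+1)^{1/2}\big)^{1-\theta}\,(k+1)^{\theta} = (k+1)^{(1+\theta)/2} = (k+1)^{1-1/p}
\qquad (2 \le p \le \infty).
\]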

Lower Bounds: We now discuss the techniques for proving the lower bounds. Our proof is a generalization of (deshpande2006matrix), which shows that for the special case p = 2, √(k+1) is the best possible approximation ratio. Their proof of the lower bound is constructive: they constructed a matrix A such that using any k-subset of columns leads to a sub-optimal solution by a factor no less than √(k+1). However, since the ℓ_p norm is not rotationally invariant in general, it is tricky to generalize their analysis to other values of p. To resolve the problem, we use a specialized version of their construction, the perturbed Hadamard matrices (see Section 3 for details), as they have nice symmetry and are much easier to analyze. We give an example of the special case k = 3 for better intuition:

A = [ ε   ε   ε   ε
      1  -1   1  -1
      1   1  -1  -1
      1  -1  -1   1 ]

Here ε is a positive constant close to 0. We note that A is very close to a rank-3 matrix: if we replace the first row by four zeros, then it becomes rank 3. Thus, the optimal rank-3 approximation error is at most (4 ε^p)^{1/p} = 4^{1/p} ε. Now we consider the column subset selection algorithm. For example, suppose we use the first three columns to approximate the whole matrix: the error then only comes from the fourth column. We can show that when ε is small, the best approximation of A_4 from span(A_1, A_2, A_3) is very close to

-(A_1 + A_2 + A_3) = (-3ε, -1, -1, 1)^T,

which agrees with A_4 = (ε, -1, -1, 1)^T in every coordinate except the first. Therefore, the column subset selection algorithm achieves roughly 4ε error on this matrix, which is a factor 4^{1-1/p} from being optimal. A similar construction works for any k with k+1 a power of two, where the lower bound 4^{1-1/p} is replaced by (k+1)^{1-1/p}; this matches our upper bound exactly when p ≥ 2.
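
The claim can be checked numerically. The sketch below (our own illustration; the exact ℓ_p regression is replaced by a generic smooth optimizer warm-started at the least-squares solution) reproduces the ratio 4^{1-1/p} for p = 4:

import numpy as np
from scipy.optimize import minimize

p, eps = 4.0, 1e-3
A = np.array([[eps, eps, eps, eps],
              [1.0, -1.0,  1.0, -1.0],
              [1.0,  1.0, -1.0, -1.0],
              [1.0, -1.0, -1.0,  1.0]])

# Use columns {1, 2, 3} to approximate the matrix: only column 4 contributes error.
C, b = A[:, :3], A[:, 3]
loss = lambda x: np.sum(np.abs(b - C @ x) ** p)          # entry-wise l_p^p loss
x0 = np.linalg.lstsq(C, b, rcond=None)[0]                # least-squares warm start
css_err = loss(minimize(loss, x0, method="BFGS").x) ** (1 / p)

opt_upper = (4 * eps ** p) ** (1 / p)                    # zero out the first row -> rank 3
print("CSS error  ~", css_err)                           # about 4 * eps
print("OPT bound  =", opt_upper)                         # 4^{1/p} * eps
print("ratio      ~", css_err / opt_upper, "vs 4^(1-1/p) =", 4 ** (1 - 1 / p))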

2 Analysis of Approximation Ratio

In this section, we will prove Theorem 1.1. Recall that our goal is to bound err(S*) in terms of OPT. We first introduce two useful lemmas: the first gives an upper bound on the approximation error obtained by choosing a single arbitrary column subset S; the second is our main technical lemma.

If satisfies , then the approximation error of can be upper bounded by

Let be a complex matrix, be -dimensional complex vector, and then we have

where

We first show that Theorem 1.1 has a clean proof using the two lemmas, as stated below.

Proof.

of Theorem 1.1: We can assume WLOG that rank(A) > k. In fact, if rank(A) ≤ k, then of course err(S*) = 0 and there is nothing to prove. Otherwise, if rank(A) > k, then by the definition of OPT, we know that OPT > 0.

We will upper bound the approximation error of the best column subset by a weighted average of the errors of all subsets. In other words, we are going to choose a set of non-negative weights w_S such that Σ_S w_S = 1, and upper bound err(S*) by Σ_S w_S · err(S).

In the following analysis, our choice of the weights w_S will be

Since , are well-defined. We first prove

(3)

where we denote .

In fact, when , of course LHS of (3) = 0 ≤ RHS. When , we know that is invertible. By Lemma 2,

The second-to-last equality follows from Schur's determinant identity. Therefore (3) holds, and

By Lemma 2,

Therefore,

which means

Therefore, we only need to prove the two lemmas. Lemma 2 is relatively easy to prove.

Proof.

of Lemma 2: Recall that by definition of , ,

The main difficulty in our analysis comes from Lemma 2. The proof is based on Riesz-Thorin interpolation theorem from harmonic analysis. Although the technical details in verifying a key inequality (4) are quite complicated, the remaining part which connects Lemma 2 to the Riesz-Thorin interpolation theorem is not that difficult to understand. Below we give a proof to Lemma 2 without verifying (4), and leave the complete proof of (4) in the appendix.

Proof.

of Lemma 2: We first state a simplified version of the Riesz-Thorin interpolation theorem, which is the most convenient-to-use version for our proof. The general version can be found in the Appendix. [Simplified version of Riesz-Thorin] Let T be a multi-linear operator such that the following inequalities

‖T(f_1, …, f_d)‖_{p_0} ≤ M_0 · ‖f_1‖_{p_0} ⋯ ‖f_d‖_{p_0}   and   ‖T(f_1, …, f_d)‖_{p_1} ≤ M_1 · ‖f_1‖_{p_1} ⋯ ‖f_d‖_{p_1}

hold for all f_1, …, f_d. Then

‖T(f_1, …, f_d)‖_{p_θ} ≤ M_0^{1-θ} M_1^{θ} · ‖f_1‖_{p_θ} ⋯ ‖f_d‖_{p_θ}

holds for all f_1, …, f_d and every θ ∈ (0, 1), where 1/p_θ = (1-θ)/p_0 + θ/p_1.

The Riesz-Thorin theorem is a classical result in interpolation theory that gives bounds for the L_p to L_q operator norm. In general, it is easier to prove estimates in spaces like L_1, L_2, and L_∞. Interpolation theory enables us to generalize results in those spaces to the L_p and L_q spaces in between, with an explicit operator norm. In our application, the underlying measure space is a finite set equipped with the counting measure, so each L_p space is simply ℓ_p, the space of functions on finitely many elements.
Now we prove Lemma 2. In fact, by symmetry, Lemma 2 is equivalent to

Here, denotes the -subsets of .

Taking the p-th power on both sides, we have the following equivalent form

By Laplace expansion on the first row of , we have for every

Here, .

This motivates us to define the following multilinear map : for all , and index set , is defined as

Now, by letting , the inequality can be written as

(4)

Let , the inequality can be rewritten as . We denote

here, when , we choose ; when , we choose . Then, we can observe the following nice property about :

(5)

This is exactly the same form as the Riesz-Thorin theorem! Hence, we only need to show that (4) holds for p = 1, 2, ∞; applying Riesz-Thorin then proves all the intermediate cases immediately.

We leave the complete proof of (4) in the appendix. ∎

3 Lower Bounds

In this section, we give a proof sketch of our lower bound theorem. The proof is constructive: we show that for every ε > 0, we can construct a matrix A such that selecting any k columns of A leads to an approximation ratio of at least (1 - O(ε)) · (k+1)^{1-1/p}. Then, the theorem follows by letting ε → 0. Our choice of A is a perturbation of Hadamard matrices.

Throughout the proof, we assume that k + 1 = 2^t for some positive integer t, and that ε > 0 is an arbitrarily small constant. We consider the well-known Hadamard matrix H_{k+1} of order 2^t, given by Sylvester's construction: H_1 = [1] and H_{2s} = [[H_s, H_s], [H_s, -H_s]].

Now we can define A, the construction of the lower bound instance: it is a perturbation of H_{k+1} obtained by replacing all the entries on the first row by ε, i.e.,

A_{1j} = ε for all j ∈ [k+1], and A_{ij} = (H_{k+1})_{ij} for all i ≥ 2.    (6)
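
The construction is easy to generate in code. The helper below (a sketch for experimentation; the function name and the use of scipy.linalg.hadamard are our own choices, and the paper only uses the construction analytically) builds the instance of Eq. (6):

import numpy as np
from scipy.linalg import hadamard

def lower_bound_instance(k, eps=1e-3):
    # Hadamard matrix of order k+1 (a power of two) with its first row set to eps.
    n = k + 1
    assert n & (n - 1) == 0, "k + 1 must be a power of two"
    A = hadamard(n).astype(float)   # Sylvester construction; first row is all ones
    A[0, :] = eps                   # perturb the first row, as in Eq. (6)
    return A

A = lower_bound_instance(7, eps=0.01)
p = 3.0
print("OPT upper bound, Eq. (7):", (A.shape[0] * 0.01 ** p) ** (1 / p))  # (k+1)^{1/p} * eps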

We can see that A is close to a rank-k matrix. In fact, the matrix obtained from A by replacing its first row with zeros has rank at most k. Therefore, we can upper bound OPT by

OPT ≤ ((k+1) · ε^p)^{1/p} = (k+1)^{1/p} · ε.    (7)

The remaining work is to give a lower bound on the approximation error using any k columns. For simplicity of notation, we write H as shorthand for H_{k+1} when it is clear from context. Say we are using all columns except the i-th, i.e., the column subset is S = [k+1] \ {i}. Obviously, we achieve zero error on all the columns other than the i-th. Therefore, the approximation error is essentially the ℓ_p distance from A_i to span(A_S). We can show that the best approximation of A_i from span(A_S) is very close to -Σ_{j ≠ i} A_j, which agrees with A_i on every coordinate except the first; in other words,

dist_p(A_i, span(A_S)) ≥ (1 - O(ε)) · (k+1) · ε.    (8)

The theorem follows by combining (7) and (8). The complete proof can be found in the appendix.

4 Analysis of Efficient Algorithms

One drawback of the column subset selection algorithm is its time complexity: it requires enumerating all (m choose k) column subsets, which is not desirable since this is exponential in k. However, several more efficient algorithms (chierichetti2017algorithms) are designed based on it. Our tighter analysis of Algorithm 1 implies better approximation guarantees for these algorithms as well. The improved bounds can be stated as follows:

1:  Input: Data matrix A and rank parameter k.
2:  Output: O(k log m) columns of A.
3:  if the number of columns of A is at most 2k then
4:     return all the columns of A
5:  else
6:     repeat
7:        Let R be 2k columns of A chosen uniformly at random
8:     until at least a constant fraction of the columns of A are approximately covered by R
9:     Let A' be the columns of A not approximately covered by R
10:     return R ∪ SelectColumns (A', k)
11:  end if
Algorithm 2 (chierichetti2017algorithms) SelectColumns (A, k): selecting O(k log m) columns of A.
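
The following Python sketch illustrates the recursive structure of Algorithm 2. The coverage test (an ℓ_p regression error compared against a caller-supplied threshold), the constants 2k and 1/2, and the helper names are illustrative assumptions rather than the exact quantities of (chierichetti2017algorithms), where the threshold is tied to OPT and guarantees termination with high probability.

import numpy as np
from scipy.optimize import minimize

def lp_fit_error(R, b, p):
    # Approximate min_x ||b - R x||_p via a generic numerical minimizer.
    x0 = np.linalg.lstsq(R, b, rcond=None)[0]
    res = minimize(lambda x: np.sum(np.abs(b - R @ x) ** p), x0, method="L-BFGS-B")
    return res.fun ** (1.0 / p)

def select_columns(A, k, p, threshold, rng):
    m = A.shape[1]
    if m <= 2 * k:
        return list(range(m))
    while True:
        R = rng.choice(m, size=2 * k, replace=False)      # 2k random columns
        covered = np.array([lp_fit_error(A[:, R], A[:, i], p) <= threshold
                            for i in range(m)])
        if covered.mean() >= 0.5:                         # "enough columns covered"
            break
    uncovered = np.where(~covered)[0]
    rest = select_columns(A[:, uncovered], k, p, threshold, rng)
    return list(R) + [int(uncovered[j]) for j in rest]    # indices w.r.t. the original A

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5)) @ rng.standard_normal((5, 40)) + 0.01 * rng.standard_normal((20, 40))
cols = select_columns(A, k=5, p=1.5, threshold=0.2, rng=rng)
print(len(cols), "columns selected out of", A.shape[1])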

Algorithm 2, which runs in polynomial time and selects O(k log m) columns, is an O(α(k, p))-approximation bi-criteria algorithm.

1:  Input: ,
2:  Output: ,
3:  Apply Lemma E.1 to to obtain matrix
4:  Run linear regression over , s.t. is minimized
5:  Apply Algorithm 1 with input and to obtain and
6:  Set
7:  Set
8:  Output and
Algorithm 3 (chierichetti2017algorithms) An algorithm that transforms a rank-O(k log m) matrix factorization into a rank-k matrix factorization without inflating the error too much.

Algorithm 3 runs in polynomial time under a mild condition on the parameters, and is an approximation algorithm whose ratio is only slightly larger than that of Algorithm 2 (see the appendix for the precise statement).

These results improve the corresponding previous bounds of (chierichetti2017algorithms), respectively. We include the analysis of Algorithm 2 and Algorithm 3 in the appendix for completeness.

Acknowledgments

C.D. and P.R. acknowledge the support of Rakuten Inc., and NSF via IIS1909816. The authors would also like to acknowledge two MathOverflow users, known to us only by their usernames 'fedja' and 'Mahdi', for informing us of the Riesz-Thorin interpolation theorem.

References

Appendix A Riesz-Thorin Interpolation Theorem

[Riesz-Thorin interpolation theorem, see Lemma 8.5 in mashreghi2009representation]

Let (Ω_1, μ_1), …, (Ω_d, μ_d) and (Ω, μ) be measure spaces. Let S_i represent the complex vector space of simple functions on Ω_i. Suppose that

T : S_1 × ⋯ × S_d → L_{q_0}(Ω) + L_{q_1}(Ω)

is a multi-linear operator of types (p_0; q_0) and (p_1; q_1), where p_0, p_1, q_0, q_1 ∈ [1, ∞], with constants M_0 and M_1, respectively, i.e.,

‖T(f_1, …, f_d)‖_{q_j} ≤ M_j · ‖f_1‖_{p_j} ⋯ ‖f_d‖_{p_j}

for j = 0, 1 and all simple functions f_1, …, f_d. Let θ ∈ (0, 1) and define

1/p_θ = (1 - θ)/p_0 + θ/p_1,    1/q_θ = (1 - θ)/q_0 + θ/q_1.

Then, T is of type (p_θ; q_θ) with constant M_0^{1-θ} M_1^{θ}, that is,

‖T(f_1, …, f_d)‖_{q_θ} ≤ M_0^{1-θ} M_1^{θ} · ‖f_1‖_{p_θ} ⋯ ‖f_d‖_{p_θ}.

Lemma 2 is a direct corollary of this theorem.

Appendix B Lower Bounds

In this section, we will prove our lower bound theorem in full. The proof is constructive: we show that for every ε > 0, we can construct a matrix A such that selecting any k columns of A leads to an approximation ratio of at least (1 - O(ε)) · (k+1)^{1-1/p}. Then, the theorem follows by letting ε → 0. Our choice of A is a perturbation of Hadamard matrices, defined below.

Throughout the proof, we assume that k + 1 = 2^t for some positive integer t, and that ε > 0 is an arbitrarily small constant.

Proof.

of the lower bound theorem: We consider the well-known Hadamard matrix H_{k+1} of order 2^t, given by Sylvester's construction: H_1 = [1] and H_{2s} = [[H_s, H_s], [H_s, -H_s]].

The Hadamard matrix has the following properties (we will use H to represent H_{k+1} when it is clear from context):

  • Every entry of H is +1 or -1.

  • All entries on the first row are ones, i.e., H_{1j} = 1 for all j ∈ [k+1].

  • The columns of H are pairwise orthogonal, i.e., ⟨H_i, H_j⟩ = 0

    holds when i ≠ j.

Now we can define A: it is a perturbation of H obtained by replacing all the entries on the first row by ε, i.e.,

A_{1j} = ε for all j ∈ [k+1], and A_{ij} = H_{ij} for all i ≥ 2.    (9)

We can see that A is close to a rank-k matrix. In fact, the matrix Ã, obtained from A by replacing its first row with zeros, has rank at most k. Also, Ã is a (k+1) × (k+1), or equivalently, 2^t × 2^t, matrix, and it has all zeros on the first row. Therefore, we can upper bound OPT by

OPT ≤ ‖A - Ã‖_p = ((k+1) · ε^p)^{1/p} = (k+1)^{1/p} · ε.

The remaining work is to give a lower bound on the approximation error using any k columns. For simplicity of notation, we use H as shorthand for H_{k+1} when it is clear from context. Say we are using all columns except the i-th, i.e., the column subset is S = [k+1] \ {i}. Obviously, we achieve zero error on all the columns other than the i-th. Therefore, the approximation error is essentially the ℓ_p distance from A_i to span(A_S). Thus,

By Hölder’s inequality,

where q is the conjugate exponent, i.e., 1/p + 1/q = 1.

We can actually show that .

Using the fact that and ,

Now we can finally bound the approximation error