A deterministic theory of low rank matrix completion

10/02/2019 · Sourav Chatterjee · Stanford University

The problem of completing a large low rank matrix using a subset of revealed entries has received much attention in the last ten years. Yet, a characterization of missing patterns that allow completion has remained an open question. The main result of this paper gives such a characterization in the language of graph limit theory. It is then shown that a modification of the Candès–Recht nuclear norm minimization algorithm succeeds in completing the matrix whenever completion is possible. The theory is fully deterministic, with no assumption of randomness.


1. Introduction

The problem of reconstructing a large low rank matrix from a subset of revealed entries has attracted widespread attention in the statistics and machine learning literatures in the last ten years. For a very recent survey of this vast body of work, see [13]. Notice that the problem itself is a problem in linear algebra, with nothing random in it. However, matrix completion in classical linear algebra is restricted to matrices with special structure, such as positive definite matrices [8]. In the literature on low rank matrix completion, randomness enters into the picture through the assumption that the set of missing entries is random. In most papers, the randomness is uniform over all subsets of a given size. This assumption, while unrealistic, allows researchers to prove many beautiful theorems. There are a handful of papers that strive to work with deterministic missing patterns or missing patterns that depend on the matrix, using spectral gap conditions [1], rigidity theory [15], algebraic geometry [10] and other methods [14, 11, 16]. However, a complete characterization of missing patterns that allow approximate completion of large low rank matrices has remained an open question. The aim of this paper is to give such a characterization.

Right away, it is important to note that not all patterns of revealed entries allow low rank matrix completion, even if a substantial fraction of entries are revealed. For example, if we have a large square matrix of order $n$, and only the top $n/2$ rows are revealed, the matrix cannot be completed even if it is known to have rank $1$.

This example suggests that the set of revealed entries has to be in some sense 'dense' in the set of all entries for the matrix to be recoverable. However, one has to be cautious about this intuition. Consider a second counterexample: Let $n$ be even, and consider an $n\times n$ matrix whose $(i,j)$-th entry is revealed if and only if $i$ and $j$ have the same parity (that is, both even or both odd). This set of revealed entries looks sufficiently 'dense'. Yet, recovery is not possible even if the rank of the matrix is as small as $2$. To see this, note that the rows and columns can be relabeled such that the even numbered rows and columns in the original matrix are renumbered from $1$ to $n/2$, and the odd numbered rows and columns are renumbered from $n/2+1$ to $n$. Then in this new arrangement of rows and columns, the $(i,j)$-th entry is revealed if and only if either both $i$ and $j$ are between $1$ and $n/2$, or both $i$ and $j$ are between $n/2+1$ and $n$. In other words, the matrix is a $2\times 2$ block matrix with blocks of order $n/2$, where only the top-left and bottom-right blocks are revealed. Clearly, the other two blocks cannot be recovered using this information if the rank is $2$ or higher.
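This relabeling is easy to visualize numerically. The following sketch (our illustration, not from the paper) builds the parity mask and applies the permutation just described:

```python
import numpy as np

n = 6                                    # any even n
i, j = np.indices((n, n))
mask = ((i % 2) == (j % 2)).astype(int)  # revealed iff row and column share parity

# Relabel rows and columns: even indices first, then odd indices.
order = np.concatenate([np.arange(0, n, 2), np.arange(1, n, 2)])
print(mask[np.ix_(order, order)])        # ones only in the two diagonal blocks
```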

The problem with the above counterexample is that the rows and columns could be relabeled so that the pattern of revealed entries is no longer ‘dense’. This suggests that for recoverability of low rank matrices, it is necessary that the pattern of revealed entries remains ‘dense’ under any relabeling of rows and columns. It turns out that this condition is also sufficient. This is the main theorem of this paper (Theorem 4.1). The precise statement is given in the language of graph limit theory [12]. It is then proved that a modification of a popular method of low rank matrix completion by nuclear norm minimization [3, 4, 2] succeeds in approximately recovering the full matrix whenever the above condition holds (Theorem 4.3). In other words, this algorithm does the job whenever the job is doable.

2. Notations

All our matrices will have real entries. We will denote the $(i,j)$-th entry of a matrix $A$ by $a_{ij}$, of $B$ by $b_{ij}$, and so on. The transpose of a matrix $A$ will be denoted by $A^T$, the trace by $\operatorname{tr}(A)$, and the rank by $\operatorname{rank}(A)$. Vectors will be treated as matrices with one column.

Let $A$ be an $m\times n$ matrix. We will have occasion to use the following matrix norms. The Frobenius norm of $A$ is defined as
$$\|A\|_F := \biggl(\sum_{i=1}^m\sum_{j=1}^n a_{ij}^2\biggr)^{1/2}.$$

More frequently, we will use the following averaged version of the Frobenius norm:
$$\|A\|_{\bar F} := \biggl(\frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n a_{ij}^2\biggr)^{1/2} = \frac{\|A\|_F}{\sqrt{mn}}.$$

For us, the average Frobenius norm will be more useful than the usual Frobenius norm because it is a measure of the size of a typical entry of $A$.

If $\sigma_1,\ldots,\sigma_r$ are the non-zero singular values of $A$, the nuclear norm of $A$ is defined as
$$\|A\|_* := \sum_{i=1}^r \sigma_i.$$

The $L^\infty$ norm of $A$ is simply
$$\|A\|_\infty := \max_{1\le i\le m,\,1\le j\le n}|a_{ij}|.$$

We will also use a somewhat non-standard matrix norm, called the cut norm, defined as
$$\|A\|_\square := \frac{1}{mn}\max_{S\subseteq\{1,\ldots,m\},\ T\subseteq\{1,\ldots,n\}}\biggl|\sum_{i\in S}\sum_{j\in T}a_{ij}\biggr|.$$

In the usual definition of the cut norm for matrices, the maximum is not divided by $mn$. We divide by $mn$ because it will be more convenient for us to work with this version, and also because this is the custom in graph limit theory.
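Since the maximum ranges over exponentially many pairs of subsets (and exact computation of the cut norm is NP-hard in general), the definition is best digested on tiny matrices. The brute-force helper below is our own illustration of the averaged version used here:

```python
import numpy as np
from itertools import chain, combinations

def cut_norm(A):
    """Brute-force evaluation of the averaged cut norm
    (1/(mn)) * max_{S,T} |sum_{i in S, j in T} a_ij|, for small matrices only."""
    m, n = A.shape
    subsets = chain.from_iterable(combinations(range(m), r) for r in range(1, m + 1))
    best = 0.0
    for S in subsets:
        row_sums = A[list(S), :].sum(axis=0)
        # Given S, the optimal T collects the columns whose partial sums share
        # a sign, so T need not be enumerated explicitly.
        best = max(best, row_sums[row_sums > 0].sum(), -row_sums[row_sums < 0].sum())
    return best / (m * n)

A = np.array([[1.0, -1.0], [-1.0, 1.0]])
print(cut_norm(A))  # 0.25: the maximizing S, T pick out a single entry here
```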

For each $n$, let $S_n$ be the group of all permutations of $\{1,\ldots,n\}$. For $\sigma\in S_m$ and $\tau\in S_n$, let $A_{\sigma,\tau}$ be the matrix whose $(i,j)$-th entry is $a_{\sigma(i)\tau(j)}$. The cut norm is used to define the cut distance between two $m\times n$ matrices $A$ and $B$ as
$$d_\square(A,B) := \min_{\sigma\in S_m,\ \tau\in S_n}\|A_{\sigma,\tau}-B\|_\square. \tag{2.1}$$

We will say that a matrix is a binary matrix if each of its entries is either $0$ or $1$. We will use binary matrices to denote the locations of revealed entries in matrix completion problems.

If $A$ and $B$ are two $m\times n$ matrices, the Hadamard product of $A$ and $B$, denoted by $A\circ B$, is the $m\times n$ matrix whose $(i,j)$-th entry is $a_{ij}b_{ij}$. Hadamard products will be useful for us in the following way. If $A$ is a matrix which is partially revealed, and $M$ is a binary matrix indicating the locations of the revealed entries, then $M\circ A$ is the matrix whose entries equal the entries of $A$ wherever they are revealed, and zero elsewhere.

3. Definitions

As mentioned earlier, certain patterns of revealed entries may not suffice for recovering the full matrix, whereas other patterns may suffice. While this makes intuitive sense, we need to give a precise mathematical definition of the notion of recoverability before proceeding further with this. Roughly speaking, recoverability should mean that if two low rank matrices are approximately equal on the revealed entries, they should also be approximately equal everywhere. To make this fully precise, we need to state it in terms of sequences of matrices rather than a single matrix.

Definition 3.1.

Let $\{M_n\}_{n\ge1}$ be a sequence of binary matrices. We will say that this sequence admits stable recovery of low rank matrices if it has the following property. Take any two sequences of matrices $\{A_n\}_{n\ge1}$ and $\{B_n\}_{n\ge1}$, where $A_n$ and $B_n$ have the same dimensions as $M_n$. Suppose that there are numbers $\alpha$ and $\beta$ such that $\|A_n\|_\infty$ and $\|B_n\|_\infty$ are bounded by $\alpha$, and $\operatorname{rank}(A_n)$ and $\operatorname{rank}(B_n)$ are bounded by $\beta$, for each $n$. Then for any $\epsilon>0$ there is some $\delta>0$, depending only on $\epsilon$, $\alpha$ and $\beta$, such that if $\limsup_{n\to\infty}\|M_n\circ(A_n-B_n)\|_{\bar F}\le\delta$, then $\limsup_{n\to\infty}\|A_n-B_n\|_{\bar F}\le\epsilon$.

The word ‘stable’ is added in the above definition to emphasize that we only need approximate equality of the revealed entries, rather than exact equality.

To understand the essence of the above definition, it is probably helpful to revisit the counterexample mentioned earlier. For each even $n$, let $M_n$ be the $n\times n$ binary matrix whose entries are $1$ in the first $n/2$ rows, and $0$ elsewhere. Let $A_n$ be the $n\times n$ matrix of all zeros, and let $B_n$ be the $n\times n$ matrix whose entries are $0$ in the top $n/2$ rows and $1$ elsewhere. Then $\|A_n\|_\infty$ and $\|B_n\|_\infty$ are bounded by $1$ for all $n$, and $\operatorname{rank}(A_n)$ and $\operatorname{rank}(B_n)$ are bounded by $1$ for all $n$. Now clearly $M_n\circ(A_n-B_n)=0$, but a simple calculation shows that
$$\|A_n-B_n\|_{\bar F} = \frac{1}{\sqrt 2}$$
for every $n$. Thus, the sequence $\{M_n\}$ does not admit stable recovery of low rank matrices.
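The 'simple calculation' is easy to confirm numerically; the following sketch (ours) mirrors the construction above:

```python
import numpy as np

n = 100                                # any even n
M = np.zeros((n, n)); M[: n // 2] = 1  # reveal the top n/2 rows
A = np.zeros((n, n))                   # rank 0
B = np.zeros((n, n)); B[n // 2 :] = 1  # rank 1, differs from A only where hidden

avg_frob = lambda X: np.sqrt((X ** 2).mean())
print(avg_frob(M * (A - B)))           # 0.0: the revealed entries agree exactly
print(avg_frob(A - B))                 # 0.70710... = 1/sqrt(2), for every n
```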

To verify that a sequence $\{M_n\}_{n\ge1}$ admits stable recovery of low rank matrices according to Definition 3.1, one needs to verify the stated condition for all sequences $\{A_n\}_{n\ge1}$ and $\{B_n\}_{n\ge1}$. It would however be much more desirable to have an equivalent criterion in terms of some intrinsic property of the sequence $\{M_n\}_{n\ge1}$. The main result of this paper gives such a criterion. To state this result, we need to introduce some further definitions.

In graph limit theory [12], a graphon is a Borel measurable function from $[0,1]^2$ into $[0,1]$, which is symmetric in its two arguments. Since we are dealing with matrices that need not be symmetric, we need to generalize this definition by dropping the symmetry condition.

Definition 3.2.

An asymmetric graphon is a Borel measurable function from $[0,1]^2$ into $[0,1]$.

If $W$ is an asymmetric graphon and $m$ and $n$ are two positive integers, we define the discrete approximation of $W$ to be the $m\times n$ matrix $W_{m\times n}$ whose $(i,j)$-th entry is the average value of $W$ in the rectangle $[\frac{i-1}{m},\frac{i}{m}]\times[\frac{j-1}{n},\frac{j}{n}]$, that is,
$$(W_{m\times n})_{ij} := mn\int_{(i-1)/m}^{i/m}\int_{(j-1)/n}^{j/n}W(x,y)\,dy\,dx.$$

If $A$ is an $m\times n$ matrix and $W$ is an asymmetric graphon, we define the cut distance between $A$ and $W$ to be
$$d_\square(A,W) := d_\square(A, W_{m\times n}),$$
where the right side is as defined in equation (2.1).
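As a quick illustration (ours, not from the paper), the discrete approximation can be computed numerically by averaging $W$ over a fine grid; the function name and the midpoint-sampling shortcut are our own choices:

```python
import numpy as np

def discrete_approximation(W, m, n, grid=200):
    """Numerically approximate the m x n discretization of an asymmetric
    graphon W: entry (i, j) is the average of W over the rectangle
    [(i-1)/m, i/m] x [(j-1)/n, j/n], estimated here by midpoint sampling."""
    x = (np.arange(m * grid) + 0.5) / (m * grid)
    y = (np.arange(n * grid) + 0.5) / (n * grid)
    values = W(x[:, None], y[None, :])          # W on a fine product grid
    return values.reshape(m, grid, n, grid).mean(axis=(1, 3))

# The limit of the 'top half of the rows revealed' masks discussed below:
W_half = lambda x, y: (x <= 0.5) * np.ones_like(y)
print(discrete_approximation(W_half, 4, 4))     # rows of ones, then rows of zeros
```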

Definition 3.3.

We will say that a sequence of matrices $\{A_n\}_{n\ge1}$ converges to an asymmetric graphon $W$ if $d_\square(A_n, W)\to 0$ as $n\to\infty$.

Note that the limit defined in the above sense may not be unique. The same sequence may converge to many different limits. In graph limit theory, all of these different limits are considered to be equivalent by defining an equivalence relation on the space of graphons. It is possible to do a similar thing for asymmetric graphons, but that is not needed for this paper.

We will use asymmetric graphons to represent limits of binary matrices. Not every sequence has a limit, but subsequential limits always exist.

Theorem 3.4.

Any sequence of binary matrices with dimensions tending to infinity has a subsequence that converges to an asymmetric graphon.

The above theorem is the asymmetric analog of a fundamental compactness theorem in graph limit theory [12, Theorem 9.23]. It is probable that the asymmetric version already exists in the literature, but since the proof is not difficult, it is presented in Section 8.

4. Results

Our main objective is to give a necessary and sufficient condition for a sequence of binary matrices to admit stable recovery of low rank matrices. Because of Theorem 3.4, it suffices to only consider convergent sequences.

Theorem 4.1.

A sequence of binary matrices with dimensions tending to infinity and converging to an asymmetric graphon $W$ admits stable recovery of low rank matrices (in the sense of Definition 3.1) if and only if $W$ is nonzero almost everywhere.

To understand this result, first consider the familiar case of entries missing at random. Suppose that each entry is revealed with probability $p$, independently of all other entries. Then, with probability one, the corresponding sequence of binary matrices converges to the graphon that is identically equal to $p$ on $[0,1]^2$. If $p>0$, Theorem 4.1 tells us that this sequence of revelation patterns admits stable recovery of low rank matrices. On the other hand, consider our running counterexample, where only the top half of the rows are revealed. The corresponding sequence of binary matrices converges to the graphon that is $1$ on $[0,\frac12]\times[0,1]$ and $0$ on $(\frac12,1]\times[0,1]$. Therefore this sequence does not admit stable recovery of low rank matrices, as we observed before.
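For small matrices, this convergence can be watched directly. The sketch below is ours; the brute-force cut-norm helper from Section 2 is restated so that the snippet runs on its own:

```python
import numpy as np
from itertools import chain, combinations

def cut_norm(A):  # the same brute-force helper sketched in Section 2
    m, n = A.shape
    subs = chain.from_iterable(combinations(range(m), r) for r in range(1, m + 1))
    sums = (A[list(S), :].sum(axis=0) for S in subs)
    return max(max(v[v > 0].sum(), -v[v < 0].sum()) for v in sums) / (m * n)

rng = np.random.default_rng(2)
for n in (4, 8, 12):
    M = (rng.random((n, n)) < 0.5).astype(float)  # Bernoulli(1/2) revelation pattern
    # The comparison matrix is constant, so row/column permutations are irrelevant
    # and this cut norm is exactly the cut distance to the graphon p = 1/2.
    print(n, cut_norm(M - 0.5))                   # tends to 0 as n grows
```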

At this point the reader may be slightly puzzled by the fact that Theorem 4.1 implies that recovery is impossible if the set of revealed entries is sparse (because then the limit graphon is identically zero), whereas there are many existing results about recoverability of low rank matrices from a sparse set of revealed entries. The reason is that we are not assuming randomness and at the same time demanding that the recovery is 'stable'. Suppose that most entries are the same for two matrices, but the entries that differ are the only ones that are revealed. Then there is no way to tell that the matrices are mostly the same. Thus, stable recovery is impossible from a small set of revealed entries if there is no assumption of randomness.

Theorem 4.1 succeeds in giving an intrinsic characterization of recoverability in terms of the locations of revealed entries. However, it does not tell us how to actually recover a matrix from a set of revealed entries when recovery is possible. Fortunately, it turns out that this is doable by a small modification of an algorithm that is already used in practice, namely, the Candès–Recht algorithm for matrix completion by nuclear norm minimization [3, 4, 2]. The Candès–Recht estimator of a partially revealed matrix $A$ is the matrix with minimum nuclear norm among all matrices that agree with $A$ at the revealed entries. The modified estimator is the following.

Definition 4.2.

Let $A$ be a matrix whose entries are partially revealed. Suppose that $\|A\|_\infty\le\alpha$ for some known constant $\alpha$. We define the modified Candès–Recht estimator of $A$ to be the matrix $\hat A$ that minimizes the nuclear norm among all matrices $B$ that agree with $A$ at the revealed entries and satisfy $\|B\|_\infty\le\alpha$.

The assumption of a known upper bound on the $L^\infty$ norm is not unrealistic. Usually such upper bounds are known, for example in recommender systems, where ratings lie in a fixed range. The modified estimator is the solution of a convex optimization problem, just like the original estimator, and should therefore be computable in practice if the dimensions are not too large. The following theorem shows that this algorithm is able to approximately recover the full matrix whenever the pattern of revealed entries allows stable recovery.
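Because both the nuclear norm objective and the entrywise bound are convex, the modified estimator can be prototyped in a few lines with an off-the-shelf solver. The following CVXPY sketch is our own illustration of Definition 4.2, not the authors' code; the function name and the toy revelation pattern are ours:

```python
import cvxpy as cp
import numpy as np

def modified_candes_recht(A_obs, M, alpha):
    """Prototype of the estimator in Definition 4.2: minimize the nuclear norm
    over matrices that agree with the revealed entries and obey the known
    entrywise bound alpha."""
    X = cp.Variable(A_obs.shape)
    constraints = [
        cp.multiply(M, X) == M * A_obs,  # agree wherever entries are revealed
        cp.abs(X) <= alpha,              # the L-infinity constraint of Definition 4.2
    ]
    cp.Problem(cp.Minimize(cp.normNuc(X)), constraints).solve(solver=cp.SCS)
    return X.value

rng = np.random.default_rng(0)
u, v = rng.uniform(-1, 1, 20), rng.uniform(-1, 1, 20)
A = np.outer(u, v)                             # a rank-1 matrix with |entries| <= 1
M = (rng.random(A.shape) < 0.5).astype(float)  # reveal roughly half the entries
A_hat = modified_candes_recht(M * A, M, alpha=1.0)
print(np.sqrt(((A_hat - A) ** 2).mean()))      # average Frobenius error
```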

Theorem 4.3.

Let $\{M_n\}_{n\ge1}$ be a sequence of binary matrices with dimensions tending to infinity that admits stable recovery of low rank matrices. Let $\{A_n\}_{n\ge1}$ be a sequence of matrices such that for each $n$, $A_n$ has the same dimensions as $M_n$. Suppose that $\|A_n\|_\infty$ and $\operatorname{rank}(A_n)$ are uniformly bounded over $n$. Let $\hat A_n$ be the modified Candès–Recht estimate of $A_n$ (as defined in Definition 4.2, with $\alpha$ taken to be the uniform bound on $\|A_n\|_\infty$) when the locations of the revealed entries are given by $M_n$. Then $\|\hat A_n - A_n\|_{\bar F}\to 0$ as $n\to\infty$.

The modified Candès–Recht estimator, just like the original estimator, will run into computational cost issues for very large matrices. It would be interesting to figure out if there is a faster algorithm (for example, by some kind of singular value thresholding [9, 5, 7]) that also has the above ‘universal recovery’ feature.

Another interesting and important problem is to develop an analog of the above theory when the set of revealed entries is sparse. As noted before, the problem is unsolvable in this setting if we demand that the recovery be stable. However, dropping the stability requirement may render it possible to recover the full matrix from a sparse set of revealed entries even in the absence of randomness. In particular, Theorem 4.3 may have an extension to the sparse setting under appropriate assumptions. The methods of this paper would need to be significantly extended to make this possible.

This concludes the statements of results. The rest of the paper is devoted to proofs. The proof of Theorem 4.1 is divided between Sections 5 and 6. Theorem 4.3 is proved in Section 7, and Theorem 3.4 is proved in Section 8.

5. Towards the proof of Theorem 4.1

The goal of this section is to prove a quantitative result (Theorem 5.4 below) that underlies the proof of Theorem 4.1. We need to prove a number of lemmas before arriving at this result.

Lemma 5.1.

Let $A$ be an $m\times n$ matrix with $\|A\|_\infty\le\alpha$ and singular value decomposition
$$A = \sum_{i=1}^r \sigma_i x_i y_i^T,$$
where $\sigma_1,\ldots,\sigma_r>0$, the vectors $x_1,\ldots,x_r\in\mathbb{R}^m$ are orthonormal, and the vectors $y_1,\ldots,y_r\in\mathbb{R}^n$ are orthonormal. Then for each $i$,
$$\|x_i\|_\infty\le\frac{\alpha\sqrt n}{\sigma_i},\qquad \|y_i\|_\infty\le\frac{\alpha\sqrt m}{\sigma_i},\qquad \sigma_i\le\alpha\sqrt{mn}.$$

Proof.

Let $x_{ik}$ denote the $k$-th component of $x_i$. Since $Ay_i = \sigma_i x_i$ and $\sum_{l=1}^n y_{il}^2 = 1$, we get
$$\sigma_i|x_{ik}| = \biggl|\sum_{l=1}^n a_{kl}y_{il}\biggr| \le \alpha\sum_{l=1}^n|y_{il}| \le \alpha\sqrt n.$$
Dividing throughout by $\sigma_i$ and maximizing over $k$, we get the required bound for $\|x_i\|_\infty$. The bound for $\|y_i\|_\infty$ is obtained similarly. For the bound on $\sigma_i$, notice that since $\|x_i\|=1$, there is some $k$ such that $|x_{ik}|\ge 1/\sqrt m$, and use this information in the above display. ∎
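The three bounds are straightforward to sanity-check numerically; the snippet below (ours) verifies them on a random matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 30, 50
A = rng.uniform(-1.0, 1.0, (m, n))
alpha = np.abs(A).max()
U, s, Vt = np.linalg.svd(A, full_matrices=False)  # columns of U are the x_i,
                                                  # rows of Vt are the y_i
for i, sigma in enumerate(s):
    assert np.abs(U[:, i]).max() <= alpha * np.sqrt(n) / sigma + 1e-12
    assert np.abs(Vt[i]).max() <= alpha * np.sqrt(m) / sigma + 1e-12
    assert sigma <= alpha * np.sqrt(m * n) + 1e-12
print("all three bounds of Lemma 5.1 hold")
```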

Recall that a matrix is called a block matrix if its entries are constant in rectangular blocks — in other words, if the matrix can be expressed as an array of constant matrices. We will say that two matrices and have a simultaneous block structure if they are both block matrices and the rows and columns defining the blocks are the same. Note that block structures may not be uniquely defined, but that will not be a problem for us.

Lemma 5.2.

Let $A$ and $B$ be $m\times n$ matrices with $\|A\|_\infty\le\alpha$ and $\|B\|_\infty\le\alpha$. Let $\beta$ be a number such that $\|A\|_*$ and $\|B\|_*$ are bounded by $\beta\sqrt{mn}$. Take any $\epsilon\in(0,\min\{\alpha,\beta\}]$. Then, there exist matrices $\bar A$ and $\bar B$ with a simultaneous block structure with at most $(192\alpha\beta^2/\epsilon^3)^{64\beta^2/\epsilon^2}$ blocks, and permutations $\sigma\in S_m$ and $\tau\in S_n$, such that $\|A_{\sigma,\tau}-\bar A\|_{\bar F}\le\epsilon$ and $\|B_{\sigma,\tau}-\bar B\|_{\bar F}\le\epsilon$. Moreover, it can be ensured that $\|\bar A\|_\infty\le\alpha$ and $\|\bar B\|_\infty\le\alpha$.

Proof.

Fix some $\epsilon\in(0,\min\{\alpha,\beta\}]$. Let $\delta$, $\eta$ and $\theta$ be three other positive numbers, to be chosen later. Let
$$A = \sum_{i=1}^p \sigma_i x_i y_i^T,\qquad B = \sum_{i=1}^q \mu_i u_i v_i^T$$
be the singular value decompositions of $A$ and $B$, with $\sigma_1\ge\cdots\ge\sigma_p>0$ and $\mu_1\ge\cdots\ge\mu_q>0$. Choose two numbers $k$ and $l$ such that $\sigma_i\ge\delta\sqrt{mn}$ for $i\le k$ and $\sigma_i<\delta\sqrt{mn}$ for $i>k$, and $\mu_i\ge\delta\sqrt{mn}$ for $i\le l$ and $\mu_i<\delta\sqrt{mn}$ for $i>l$. If $\sigma_i\ge\delta\sqrt{mn}$ for all $i$, let $k=p$, and if $\sigma_i<\delta\sqrt{mn}$ for all $i$, let $k=0$. Similarly, if $\mu_i\ge\delta\sqrt{mn}$ for all $i$, let $l=q$, and if $\mu_i<\delta\sqrt{mn}$ for all $i$, let $l=0$. Let
$$A' := \sum_{i=1}^k \sigma_i x_i y_i^T,\qquad B' := \sum_{i=1}^l \mu_i u_i v_i^T.$$
Then by the definition of $k$,
$$\|A-A'\|_{\bar F}^2 = \frac{1}{mn}\sum_{i=k+1}^p\sigma_i^2 \le \frac{\delta\sqrt{mn}}{mn}\sum_{i=k+1}^p\sigma_i \le \delta\beta. \tag{5.1}$$
Similarly, the same bound holds for $\|B-B'\|_{\bar F}^2$.

For $1\le i\le k$ and $1\le j\le m$, let $x_{ij}$ denote the $j$-th component of $x_i$. Define $\bar x_{ij}$ to be the integer multiple of $\eta/\sqrt m$ that is closest to $x_{ij}$, under the constraint that $|\bar x_{ij}|\le\alpha/(\delta\sqrt m)$. Then note that $|\bar x_{ij}-x_{ij}|\le\eta/\sqrt m$. Let $\bar x_i$ be the vector whose $j$-th component is $\bar x_{ij}$. Let $\bar y_i$ be defined similarly, with $\eta/\sqrt n$ in place of $\eta/\sqrt m$. Define $\bar u_i$ and $\bar v_i$ the same way, but using $\theta$ instead of $\eta$. Let
$$\bar A := \sum_{i=1}^k \sigma_i \bar x_i \bar y_i^T,\qquad \bar B := \sum_{i=1}^l \mu_i \bar u_i \bar v_i^T.$$

Now take any $1\le i\le k$. By Lemma 5.1 and the choice of $k$, we have
$$\|x_i\|_\infty \le \frac{\alpha\sqrt n}{\sigma_i} \le \frac{\alpha}{\delta\sqrt m}.$$
Therefore for any $j$, the set of possible values of $\bar x_{ij}$ has size at most
$$\frac{2\alpha}{\delta\eta}+1 \le \frac{3\alpha}{\delta\eta},$$
where the inequality was obtained under the assumption that
$$\eta \le \frac{\alpha}{\delta}. \tag{5.2}$$
We will later choose $\delta$, $\eta$ and $\theta$ such that this assumption is valid, as well as the analogous assumption $\theta\le\alpha/\delta$. We can give similar bounds on the sizes of the sets of possible values of the components of $\bar y_i$, $\bar u_i$ and $\bar v_i$.

Declare that two rows $j$ and $j'$ are 'equivalent' if $\bar x_{ij}=\bar x_{ij'}$ for all $1\le i\le k$ and $\bar u_{ij}=\bar u_{ij'}$ for all $1\le i\le l$. Similarly, declare that two columns $j$ and $j'$ are equivalent if $\bar y_{ij}=\bar y_{ij'}$ for all $1\le i\le k$ and $\bar v_{ij}=\bar v_{ij'}$ for all $1\le i\le l$. Clearly, these define equivalence relations. By the previous paragraph, there are at most $(3\alpha/(\delta\eta))^k(3\alpha/(\delta\theta))^l$ equivalence classes of rows, and at most $(3\alpha/(\delta\eta))^k(3\alpha/(\delta\theta))^l$ equivalence classes of columns.

Let $\sigma$ be a permutation of the rows that 'clumps together' equivalent rows, and let $\tau$ be a permutation of the columns that clumps together equivalent columns. Then it is clear that $\bar A_{\sigma,\tau}$ and $\bar B_{\sigma,\tau}$ are block matrices with a simultaneous block structure. By the previous paragraph, the number of blocks is at most $(3\alpha/(\delta\eta))^{2k}(3\alpha/(\delta\theta))^{2l}$.

Now note that
$$k\,\delta\sqrt{mn} \le \sum_{i=1}^k\sigma_i \le \|A\|_*.$$
But $\|A\|_*\le\beta\sqrt{mn}$. Thus,
$$k \le \frac{\beta}{\delta}.$$
Similarly, $l$ is also bounded by the same quantity. Thus, the number of blocks is at most
$$\Bigl(\frac{3\alpha}{\delta\eta}\Bigr)^{2\beta/\delta}\Bigl(\frac{3\alpha}{\delta\theta}\Bigr)^{2\beta/\delta}.$$
Now notice that by Lemma 5.1 and the definition of $\bar A$,
$$\|A'-\bar A\|_{\bar F} \le \frac{1}{\sqrt{mn}}\sum_{i=1}^k\sigma_i\bigl(\|x_i-\bar x_i\|\,\|y_i\| + \|\bar x_i\|\,\|y_i-\bar y_i\|\bigr) \le 3\eta\beta,$$
provided that $\eta\le1$, since $\|x_i-\bar x_i\|\le\eta$, $\|y_i-\bar y_i\|\le\eta$ and $\|\bar x_i\|\le1+\eta\le2$. By a similar argument, the bound $3\theta\beta$ holds for $\|B'-\bar B\|_{\bar F}$. Combining with (5.1), we see that if $\eta\le\min\{1,\alpha/\delta\}$ and $\theta\le\min\{1,\alpha/\delta\}$, then $\bar A_{\sigma,\tau}$ and $\bar B_{\sigma,\tau}$ have a simultaneous block structure with at most $(3\alpha/(\delta\eta))^{2\beta/\delta}(3\alpha/(\delta\theta))^{2\beta/\delta}$ blocks, and $\|A_{\sigma,\tau}-\bar A_{\sigma,\tau}\|_{\bar F}$ and $\|B_{\sigma,\tau}-\bar B_{\sigma,\tau}\|_{\bar F}$ are bounded by
$$\sqrt{\delta\beta}+3\eta\beta \quad\text{and}\quad \sqrt{\delta\beta}+3\theta\beta,$$
respectively.

Now take any $\gamma>0$ and define
$$\delta := \frac{\gamma^2}{\beta},\qquad \eta=\theta := \frac{\gamma}{\beta}.$$
Plugging these values into the previous display gives
$$\sqrt{\delta\beta}+3\eta\beta = \gamma+3\gamma = 4\gamma.$$
Choose $\gamma := \epsilon/4$ to make this equal to $\epsilon$, which ensures that $\|A_{\sigma,\tau}-\bar A_{\sigma,\tau}\|_{\bar F}$ and $\|B_{\sigma,\tau}-\bar B_{\sigma,\tau}\|_{\bar F}$ are bounded by $\epsilon$. With these choices of $\delta$, $\eta$ and $\theta$, an easy calculation gives the bound $(192\alpha\beta^2/\epsilon^3)^{64\beta^2/\epsilon^2}$ on the number of blocks. Also, it is easy to check (using $\epsilon\le\alpha$ and $\epsilon\le\beta$) that with these choices of $\delta$, $\eta$ and $\theta$, the inequality (5.2) holds, as do the conditions $\eta\le1$ and $\theta\le1$. Thus, after renaming $\bar A_{\sigma,\tau}$ and $\bar B_{\sigma,\tau}$ as $\bar A$ and $\bar B$, the proof is complete except that we have not ensured that $\|\bar A\|_\infty\le\alpha$ and $\|\bar B\|_\infty\le\alpha$ in our construction. To force this, just take any element of either matrix; if it is bigger than $\alpha$, replace it by $\alpha$; if it is less than $-\alpha$, replace it by $-\alpha$. This retains the block structures of the matrices, and it cannot increase $\|A_{\sigma,\tau}-\bar A\|_{\bar F}$ or $\|B_{\sigma,\tau}-\bar B\|_{\bar F}$, since $\|A\|_\infty\le\alpha$ and $\|B\|_\infty\le\alpha$. ∎

Lemma 5.3.

Let $\bar A$ and $\bar B$ be $m\times n$ matrices with a simultaneous block structure. Let $K$ be the number of blocks. Let $M$ and $M'$ be $m\times n$ matrices such that $M$ is binary and the entries of $M'$ are all in $[0,1]$. Then
$$\|M'\circ(\bar A-\bar B)\|_{\bar F}^2 \le \|M\circ(\bar A-\bar B)\|_{\bar F}^2 + K\,\|\bar A-\bar B\|_\infty^2\,\|M-M'\|_\square.$$

Proof.

Let each block be represented by the set of pairs of indices that belong to the block. Let $\mathcal{R}$ be the set of all blocks. Take any block $R\in\mathcal{R}$. Since $R$ is a product of a set of rows and a set of columns, the definition of the cut norm gives
$$\biggl|\sum_{(i,j)\in R}(m'_{ij}-m_{ij})\biggr| \le mn\,\|M-M'\|_\square. \tag{5.3}$$

Recall that $\bar a_{ij}$ is the same for all $(i,j)$ in a block, and the same holds for $\bar b_{ij}$. Let $\bar a_R$ and $\bar b_R$ denote the values of $\bar a_{ij}$ and $\bar b_{ij}$ in a block $R$. Since $(m'_{ij})^2\le m'_{ij}$ for all $i$, $j$,
$$\sum_{(i,j)\in R}(m'_{ij})^2(\bar a_R-\bar b_R)^2 \le \sum_{(i,j)\in R}m'_{ij}(\bar a_R-\bar b_R)^2.$$
Therefore by (5.3),
$$\sum_{(i,j)\in R}(m'_{ij})^2(\bar a_R-\bar b_R)^2 \le \sum_{(i,j)\in R}m_{ij}(\bar a_R-\bar b_R)^2 + (\bar a_R-\bar b_R)^2\,mn\,\|M-M'\|_\square.$$
Since $M$ is binary, $m_{ij}^2=m_{ij}$ for all $i$, $j$. Thus, summing over all blocks and dividing by $mn$, we get
$$\|M'\circ(\bar A-\bar B)\|_{\bar F}^2 \le \|M\circ(\bar A-\bar B)\|_{\bar F}^2 + \|M-M'\|_\square\sum_{R\in\mathcal{R}}(\bar a_R-\bar b_R)^2.$$
The proof is now completed by applying the inequality $(\bar a_R-\bar b_R)^2\le\|\bar A-\bar B\|_\infty^2$ to the right side. ∎

We are now ready to prove the main result of this section. The result roughly says the following. Let $A$ and $B$ be matrices with relatively small nuclear norms (of the same order as those of low rank matrices with bounded entries). Let $M$ be a binary matrix and $M'$ be a matrix with entries in $[0,1]$, such that $M'$ is close to $M$ in the cut norm. Then the closeness of $M\circ A$ to $M\circ B$ in average Frobenius norm implies the closeness of $M'\circ A$ to $M'\circ B$ in average Frobenius norm.

Theorem 5.4.

Let $A$ and $B$ be $m\times n$ matrices with $L^\infty$ norms bounded by $\alpha$. Let $\beta$ be a number such that the nuclear norms of $A$ and $B$ are bounded by $\beta\sqrt{mn}$. Let $M$ and $M'$ be $m\times n$ matrices such that $M$ is binary and the entries of $M'$ are all in $[0,1]$. Then
$$\|M'\circ(A-B)\|_{\bar F} \le \|M\circ(A-B)\|_{\bar F} + \frac{C}{\bigl(1+\log(1/\|M-M'\|_\square)\bigr)^{1/4}},$$
where $C$ depends only on $\alpha$ and $\beta$, and the second term on the right is interpreted as zero when $\|M-M'\|_\square=0$.

Proof.

Without loss of generality, assume that $\alpha\le\beta$ (if not, replace $\beta$ by $\alpha$, which only weakens the hypothesis). Take any $\epsilon\in(0,\alpha]$. Let $\bar A$, $\bar B$, $\sigma$ and $\tau$ be as in Lemma 5.2. Let
$$K := \Bigl(\frac{192\alpha\beta^2}{\epsilon^3}\Bigr)^{64\beta^2/\epsilon^2}$$
be the upper bound on the number of blocks given by Lemma 5.2. All the norms appearing below are unchanged if the rows and columns of $A$, $B$, $M$ and $M'$ are permuted simultaneously, so we may assume that $\sigma$ and $\tau$ are identity permutations. Note that since the entries of $M'$ are in $[0,1]$, Lemma 5.2 gives
$$\|M'\circ(A-B)\|_{\bar F} \le \|M'\circ(\bar A-\bar B)\|_{\bar F} + 2\epsilon.$$
By Lemma 5.3,
$$\|M'\circ(\bar A-\bar B)\|_{\bar F}^2 \le \|M\circ(\bar A-\bar B)\|_{\bar F}^2 + 4\alpha^2 K\,\|M-M'\|_\square.$$
But
$$\|M\circ(\bar A-\bar B)\|_{\bar F} \le \|M\circ(A-B)\|_{\bar F} + 2\epsilon.$$
Adding up, we get
$$\|M'\circ(A-B)\|_{\bar F} \le \|M\circ(A-B)\|_{\bar F} + C_1\epsilon + 2\alpha\sqrt{K\,\|M-M'\|_\square},$$
where $C_1$ is a universal constant. This bound holds for $\epsilon\le\alpha$, but it also holds for $\epsilon>\alpha$ due to the presence of the $C_1\epsilon$ term. The required bound is now obtained by choosing
$$\epsilon := \frac{C_2}{\bigl(1+\log(1/\|M-M'\|_\square)\bigr)^{1/4}}$$
for some sufficiently large constant $C_2$ depending only on $\alpha$ and $\beta$. ∎

6. Proof of Theorem 4.1

Let $\{M_n\}_{n\ge1}$ be a sequence of binary matrices converging to an asymmetric graphon $W$. Suppose that $W$ is nonzero almost everywhere. Let $m_n$ and $n_n$ be the number of rows and the number of columns in $M_n$. Suppose that $m_n$ and $n_n$ tend to infinity as $n\to\infty$. We will first prove the following generalization of the 'if' part of Theorem 4.1.

Theorem 6.1.

Let $\{M_n\}_{n\ge1}$ and $W$ be as above. Take any two sequences of matrices $\{A_n\}_{n\ge1}$ and $\{B_n\}_{n\ge1}$, where $A_n$ and $B_n$ have the same dimensions as $M_n$. Suppose that there are numbers $\alpha$ and $\beta$ such that $\|A_n\|_\infty$ and $\|B_n\|_\infty$ are bounded by $\alpha$, and $\|A_n\|_*$ and $\|B_n\|_*$ are bounded by $\beta\sqrt{m_nn_n}$, for each $n$. Then for any $\epsilon>0$ there is some $\delta>0$, depending only on $\epsilon$, $\alpha$ and $\beta$, such that if $\limsup_{n\to\infty}\|M_n\circ(A_n-B_n)\|_{\bar F}\le\delta$, then $\limsup_{n\to\infty}\|A_n-B_n\|_{\bar F}\le\epsilon$.

Proof.

For simplicity of notation, let us denote the matrix $W_{m_n\times n_n}$ by $W_n$. The convergence of $M_n$ to $W$ means that for each $n$, there are permutations $\sigma_n\in S_{m_n}$ and $\tau_n\in S_{n_n}$ such that
$$\lim_{n\to\infty}\|(M_n)_{\sigma_n,\tau_n}-W_n\|_\square = 0.$$
Rearranging the rows and columns of $M_n$, $A_n$ and $B_n$, we may assume without loss of generality that $\sigma_n$ and $\tau_n$ are the identity permutations, so that
$$\lim_{n\to\infty}\|M_n-W_n\|_\square = 0. \tag{6.1}$$

Let $\beta$ be a number such that
$$\|A_n\|_*\le\beta\sqrt{m_nn_n}\ \text{ and }\ \|B_n\|_*\le\beta\sqrt{m_nn_n}\ \text{ for each }n, \tag{6.2}$$
as in the statement of the theorem. Without loss of generality, $\beta\ge\alpha$. Suppose that $\limsup_{n\to\infty}\|M_n\circ(A_n-B_n)\|_{\bar F}\le\delta$, where $\delta$ will be chosen later. Then by (6.1), (6.2) and Theorem 5.4,
$$\limsup_{n\to\infty}\|W_n\circ(A_n-B_n)\|_{\bar F} \le \delta. \tag{6.3}$$

Now take any $\theta\in(0,1)$. Define two functions $f,g:[0,1]^2\to\{0,1\}$ as
$$f := 1_{\{W\ge\theta\}}$$
and $g := 1-f$. Let $P_n$ be the $m_n\times n_n$ matrix whose $(i,j)$-th element is the average value of $f$ in the rectangle $[\frac{i-1}{m_n},\frac{i}{m_n}]\times[\frac{j-1}{n_n},\frac{j}{n_n}]$, and let $Q_n$ be the matrix whose $(i,j)$-th element is the average value of $g$ in the same rectangle. Since $W(x,y)\ge\theta f(x,y)$ for all $(x,y)$,
$$\|W_n\circ(A_n-B_n)\|_{\bar F} \ge \theta\,\|P_n\circ(A_n-B_n)\|_{\bar F}.$$
Therefore by (6.3),
$$\limsup_{n\to\infty}\|P_n\circ(A_n-B_n)\|_{\bar F} \le \frac{\delta}{\theta}. \tag{6.4}$$

Since $P_n+Q_n$ is the matrix of all ones and the entries of $A_n-B_n$ are bounded by $2\alpha$,
$$\|A_n-B_n\|_{\bar F} \le \|P_n\circ(A_n-B_n)\|_{\bar F} + 2\alpha\,\|Q_n\|_{\bar F},
\qquad\text{with}\qquad
\|Q_n\|_{\bar F}^2 = \int_0^1\int_0^1 g_n(x,y)^2\,dx\,dy,$$
where now $g_n$ denotes the function which equals the $(i,j)$-th entry of $Q_n$ for all $(x,y)$ in the rectangle $[\frac{i-1}{m_n},\frac{i}{m_n}]\times[\frac{j-1}{n_n},\frac{j}{n_n}]$. In other words, $g_n$ is obtained by averaging $g$ within each such rectangle. Since $m_n$ and $n_n$ tend to $\infty$ and $g$ is measurable, it follows by a standard result from analysis (see, for example, [12, Proposition 9.8]) that $g_n(x,y)\to g(x,y)$ as $n\to\infty$ for almost every $(x,y)$. Since $t\mapsto t^2$ is a continuous function, this shows that