Data matrices exhibiting both row and column cluster structure, arise in many applications, such as collaborative filtering, gene expression analysis, and text mining. For example, in recommender systems, a rating matrix can be formed with rows corresponding to users and columns corresponding to items, and similar users and items form clusters. In DNA microarrays, a gene expression matrix can be formed with rows corresponding to patients and columns corresponding to genes, and similar patients and genes form clusters. Such row and column cluster structure is of great scientific interest and practical importance. For instance, the user and movie cluster structure is crucial for predicting user preferences and making accurate item recommendations . The patient and gene cluster structure reveals functional relations among genes and helps disease detection [2, 3]. In practice, we usually only observe a very small fraction of entries in these data matrices, possibly contaminated with noise, which obscures the intrinsic cluster structure. For example, in Netflix movie dataset, about of movie ratings are missing and the observed ratings are noisy .
In this paper, we study the problem of inferring hidden row and column cluster structure in binary data matrices from a few noisy observations. We consider a simple model introduced in [5, 6] for generating binary data matrix from underlying row and column clusters. In the context of movie recommender systems, our model assumes that users and movies each form equal-sized clusters. Users in the same cluster give the same rating to movies in the same cluster, where ratings are either or with being “like” and
being “dislike”. Each rating is flipped independently with a fixed flipping probability less than, modeling the noisy user behavior and the fact that users (movies) in the same cluster do not necessarily give (receive) identical ratings. Each rating is further erased independently with a erasure probability, modeling the fact that some ratings are not observed. Then, from the observed noisy ratings, we aim to exactly recover the underlying user and movie clusters, i.e., jointly cluster the rows and columns of the observed rating matrix.
The binary assumption on data matrices is of practical interest. Firstly, in many real datasets like Netflix dataset and DNA microarrays, estimation of entry values appears to be very unreliable, but the task of determining whether an entry isor can be done more reliably . Secondly, in recommender systems like rating music on Pandora or rating posts on sites such as Facebook and MathOverflow, the user ratings are indeed binary . The equal-sized assumption on cluster size is just for ease of presentation and can be relaxed to allow for different cluster sizes.
The hardness of our cluster recovery problem is governed by the erasure probability and cluster size. Intuitively, recovery becomes harder when the erasure probability increases, meaning fewer observations, and the cluster size decreases, meaning that clusters are harder to detect. The first goal of this paper is to understand when exact cluster recovery is possible or fundamentally impossible. Furthermore, our cluster recovery problem poses a computational challenge: An algorithm exhaustively searching over all the possible row and column cluster structures would have a time complexity exponentially increasing with the matrix dimension. The second goal of this paper is to understand how the computational complexity of our cluster recovery problem changes when increasingly more observations are available.
In this paper, our contributions are as follows. We first derive a lower bound on the minimum number of observations needed for exact cluster recovery as a function of matrix dimension and cluster size. Then we propose three algorithms with different runtimes and compare the number of observations needed by them for successful cluster recovery.
The first algorithm directly searches for the optimal clustering of rows and columns separately; it is combinatorial in nature and takes exponential-time but achieves the best statistical performance among the three algorithms in the noiseless setting.
By noticing that the underlying true rating matrix is a specific type of low-rank matrix, the second algorithm recovers the clusters by solving a nuclear norm regularized convex optimization problem, which is a popular heuristic for low rank matrix completion problems; it takes polynomial-time but has less powerful statistical performance than the first algorithm.
The third algorithm applies spectral clustering to the rows and columns separately and then performs a joint clean-up step; it has lower computational complexity than the previous two algorithms, but less powerful statistical performance. We believe that this is the first such performance guarantee for exact cluster recovery, with a growing number of clusters, using spectral clustering.
These algorithms are then compared with a simple nearest-neighbor clustering algorithm proposed in . Our analytical results show smooth time-data trade-offs: when increasingly more observations are available, one can gradually reduce the computational complexity by applying simpler algorithms while still achieving the desired performance. Such time-data trade-offs is of great practical interest for statistical learning problems involving large datasets .
The rest of the paper is organized as follows. In Section II, we discuss related work. In Section III, we formally introduce our model and main results. The lower bound is presented in Section IV. The combinatorial method, convex method, spectral method are studied in Section V, Section VI and Section VII, respectively. The proofs are given in Section VIII. The simulation results are presented in Section IX. Section X concludes the paper with remarks.
Ii Related work
In this section, we point out some connections of our model and results to prior work. There is a vast literature on clustering and we only focus on theoretical works with rigorous performance analysis. More detailed comparisons are provided after we present the theorems.
Ii-a Graph clustering
Much of the prior work on graph clustering, as surveyed in , focuses on graphs with a single node type, where nodes in the same cluster are more likely to have edges among them. A low-rank plus sparse matrix decomposition approach is proved to exactly recover the clusters with the best known performance guarantee in . The same approach is used to recover the clusters from a partially observed graph in . A spectral method for exact cluster recovery is proposed and analyzed in  with the number of clusters fixed. More recently,  proved an upper bound on the number of nodes “mis-clustered” by a spectral clustering algorithm in the high dimensional setting with a growing number of clusters.
In contrast to the above works, in our model, we have a labeled bipartite graph with two types of nodes (rows and columns). Notice that there are no edges among nodes of the same type and cluster structure is defined for the two types separately. In this sense, our cluster recovery problem can be viewed as a natural generalization of graph clustering problem to labeled bipartite graphs. In fact, our second algorithm via convex programming is inspired by the work [10, 11].
A model similar to ours but with a fixed number of clusters has been considered in , where the spectral method plus majority voting is shown to approximately predict the rating matrix. However, our third algorithm via spectral method is shown to achieve exact cluster and rating matrix recovery with a growing number of clusters. This is the first theoretical result on spectral method for exact cluster recovery in with a growing number of clusters to our knowledge.
Biclustering [15, 16, 2, 17] tries to find sub-matrices (which may overlap) with particular patterns in a data matrix. Many of the proposed algorithms are based on heuristic searches without provable performance guarantees. Our cluster recovery problem can be viewed as a special case where the data matrix consists of non-overlapping sub-matrices with constant binary entries, and our paper provides a thorough study of this special biclustering problem. Recently, there is a line of work studying another special case of biclustering problem, which tries to detect a single small submatrix with elevated mean in a large fully observed noisy matrix . Interesting statistical and computational trade-offs are summarized in .
Ii-C Low-rank matrix completion
Under our model, the underlying true data matrix is a specific type of low-rank matrix. If we recover the true data matrix, we immediately get the user (or movie) clusters by assigning the identical rows (or columns) of the matrix to the same cluster. In the noiseless setting with no flipping, the nuclear norm minimization approach [20, 21, 22] can be directly applied to recover the true data matrix and further recover the row and column clusters. Alternate minimization is another popular and empirically successful approach for low-matrix completion . However, it is harder to analyze and the performance guarantee is weaker than nuclear norm minimization . In the low noise setting with the flipping probability restricting to be a small constant, the low-rank plus sparse matrix decomposition approach [24, 25, 26] can be applied to exactly recover data matrix and further recover the row and column clusters.
The performance guarantee for our second algorithm via convex programming is better than these previous approaches and it allows the flipping probability to be any constant less than . Moreover, our proof turns out to be much simpler. The recovery of our true data matrix can also be viewed as a specific type of one-bit matrix completion problem recently studied in . However,  only focuses on approximate matrix recovery and the results there cannot be used to recover row and column clusters.
|parameter regime||of observations||runtime||remark|
|combinatorial method||exponential||assuming noiseless, i.e.,|
|convex method||polynomial||assuming Conjecture 1 holds|
Iii Model and Main Results
In this section, we formally state our model and main results.
Our model is described in the context of movie recommender systems, but it is applicable to other systems with binary data matrices having row and column cluster structure. Consider a movie recommender system with users and movies. Let be the rating matrix of size where is the rating user gives to movie . Assume both users and movies form clusters of size . Users in the same cluster give the same rating to movies in the same cluster. The set of ratings corresponding to a user cluster and a movie cluster is called a block. Let be the block rating matrix of size where is the block rating user cluster gives to movie cluster . Then the rating if user is in user cluster and movie is in movie cluster . Further assume that entries of
are independent random variables which areor with equal probability. Thus, we can imagine the rating matrix as a block-constant matrix with all the entries in each block being either or . Observe that if is a fixed constant, then users from two different clusters have the same ratings for all movies with some positive probability, in which case it is impossible to differentiate between these two clusters. To avoid such situations, assume is at least .
Suppose each entry of goes through an independent binary symmetric channel with flipping probability , representing noisy user behavior, and an independent erasure channel with erasure probability , modeling the fact that some entries are not observed. The expected number of observed ratings is . We assume that is a constant throughout the paper and could converge to as . Let denote the output of the binary symmetric channel and denote the set of non-erased entries. Let if and otherwise. The goal is to exactly recover the row and column clusters from the observation .
Iii-B Main Results
The main results are summarized in Table I. Note that these results do not explicitly depend on . In fact, as is assumed to be a constant strictly less than , it affects the results by constant factors.
The parameter regime where exact cluster recovery is fundamentally impossible for any algorithm is proved in Section IV. The combinatorial method, convex method and spectral method are studied in Section V, Section VI and Section VII, respectively. We only analyze the combinatorial method in the noiseless case where , but we believe similar result is true for the noisy case as well. The parameter regime in which the convex method succeeds is obtained by assuming that a technical conjecture holds, which is justified through extensive simulation. The parameter regime in which the spectral method succeeds is obtained for the first time for exact cluster recovery with a growing number of clusters. The nearest-neighbor clustering algorithm was proposed in . It clusters the users by finding the most similar neighbors for each user. The similarity between user and is measured by the number of movies with the same observed rating, i.e.,
where is an indicator function. Movies are clustered similarly. It is shown in  that the nearest-neighbor clustering algorithm exactly recovers user and movie clusters when for a constant .
The number of observations needed for successful cluster recovery can be derived from the corresponding parameter regime using the identity as shown in Table I. For better illustration, we visualize our results in Figure 1. In particular, we take as -axis and as -axis and normalize both axes by . Since exact cluster recovery becomes easy when the number of observations and cluster size increase, we expect that exact cluster recovery is easy near and hard near .
From Figure 1
, we can observe interesting trade-offs between algorithmic runtime and statistical performance. In terms of the runtime, the combinatorial method is exponential, while the other three algorithms are polynomial. In particular, the convex method can be casted as a semidefinite programming and solved in polynomial-time. For the spectral method, the most computationally expensive step is the singular value decomposition of the observed data matrix which can always be done in timeand more efficiently when the observed data matrix is sparse. It is not hard to see that the time complexity for the nearest-neighbor clustering algorithm is and more careful analysis reveals that its time complexity is . On the other hand, in terms of statistical performance, the combinatorial method needs strictly fewer observations than the other three algorithms when there is no noise, and the convex method always needs fewer observations than the spectral method. It is somewhat surprising to see that the simple nearest-neighbor clustering algorithm needs fewer observations than the more sophisticated convex method when the cluster size is .
In summary, we see that when more observations available, one can apply algorithms with less runtime while still achieving exact cluster recovery. For example, consider the noiseless case with cluster size , the number of observations per user required for cluster recovery by the combinatorial method, convex method, spectral method and nearest-neighbor clustering algorithm are , , and , respectively. Therefore, when the number of observations per user increases from to , one can gradually reduces the computational complexity from exponential-time to polynomial-time as low as .
The main results in this paper can be easily extended to the more general case with rows and columns and row clusters and column clusters. The sizes of different clusters could vary as long as they are of the same order. Likewise, the flipping probability and the erasure probability could also vary for different entries of the data matrix as long as they are of the same order. Due to space constraints, such generalizations are omitted in this paper.
A variety of norms on matrices will be used. The spectral norm of a matrix is denoted by , which is equal to the largest singular value. Let denote the inner product between two matrices. The nuclear norm is denoted by which is equal to the sum of singular values and is a convex function of . Let denote the norm and denote the norm. Let denotes the singular value decomposition such that . The best rank approximation of is defined as . For vectors, let denote the inner product between two vectors and the only norm that will be used is the usual norm, denoted as .
Throughout the paper, we say that an event occurs “a.a.s.” or “asymptotically almost surely” when it occurs with a probability which tends to one as goes to infinity.
Iv Lower Bound
In this section, we derive a lower bound for any algorithm to reliably recover the user and movie clusters. The lower bound is constructed by considering a genie-aided scenario where the set of flipped entries is revealed as side information, which is equivalent to saying that we are in the noiseless setting with . Hence, the true rating matrix agrees with on all non-erased entries. We construct another rating matrix with the same movie cluster structure as but different user cluster structure by swapping two users in two different user clusters. We show that if , then agrees with on all non-erased entries with positive probability, which implies that no algorithm can reliably distinguish between and and thus recover user clusters.
Fix . If , then with probability at least , it is impossible for any algorithms to recover the user clusters or movie clusters.
Intuitively, Theorem 1 says that when the erasure probability is high and the cluster size is small that , the observed rating matrix does not carry enough information to distinguish between different possible cluster structures.
V Combinatorial Method
In this section, we study a combinatorial method which clusters users or movies by searching for a partition with the least total number of “disagreements”. We describe the method in Algorithm 1 for clustering users only. Movies are clustered similarly. The number of disagreements between a pair of users is defined as the number of movies satisfying that: The two ratings given by users are both observed and the observed two ratings are different. In particular, if for every movie, the two ratings given by users are not observed simultaneously, then .
The idea of Algorithm 1 is to reduce the problem of clustering both users and movies to a standard user clustering problem without movie cluster structure. In fact, this algorithm looks for the optimal partition of the users which has the minimum total in-cluster distance, where the distance between two users is measured by the number of disagreements between them. The following theorem shows that such simple reduction does not achieve the lower bound given in Theorem 1. The optimal algorithm for our cluster recovery problem might need to explicitly make use of both user and movie cluster structures.
If , then with probability at least , Algorithm 1 cannot recover user and movie clusters.
Next we show that the above necessary condition for the combinatorial method is also sufficient up to a logarithmic factor when there is no noise, i.e., . We suspect that the theorem holds for the noisy setting as well, but we have not yet been able to prove this.
If and for some constant , then a.a.s. Algorithm 1 exactly recovers user and movie clusters.
This theorem is proved by considering a conceptually simpler greedy algorithm that does not need to know . After computing the number of disagreements for every pair of users, we search for a largest set of users which have no disagreement between each other, and assign them to a new cluster. We then remove these users and repeat the searching process until there is no user left. In the noiseless setting, the users from the same true cluster have no disagreement between each other. Therefore, it is sufficient to show that, for any set of users consisting of users from more than one cluster, they have more than one disagreement with high probability under our assumption.
Vi Convex Method
In this section, we show that the rating matrix can be exactly recovered by a convex program, which is a relaxation of the maximum likelihood (ML) estimation. When is known, we immediately get the user (or movie) clusters by assigning the identical rows (or columns) of to the same cluster.
Let denote the set of binary block-constant rating matrix with blocks of equal size. As the flipping probability , Maximum Likelihood (ML) estimation of is equivalent to finding a which best matches the observation :
Since , solving (1) via exhaustive search takes exponential-time. Observe that implies that is of rank at most . Therefore, a natural relaxation of the constraint that is to replace it with a rank constraint on , which gives the following problem:
Further by relaxing the integer constraint and replacing the rank constraint with the nuclear norm regularization, which is a standard technique for low-rank matrix completion, we get the desired convex program:
The clustering algorithm based on the above convex program is given in Algorithm 2
The convex program (2) can be casted as a semidefinite program and solved in polynomial-time. Thus, Algorithm 2 takes polynomial-time. Our performance guarantee for Algorithm 2 is stated in terms of the incoherence parameter defined below. Since the rating matrix has rank , the singular vector decomposition (SVD) is , where are matrices with orthonormal columns and is a diagonal matrix with nonnegative entries. Define such that .
Denote the SVD of the block rating matrix by . The next lemma shows that
and thus is upper bounded by .
The following theorem provides a sufficient condition under which Algorithm 2 exactly recovers the rating matrix and thus the row and column clusters as well.
If for some constant , and
where is a constant and is the incoherence parameter for , then a.a.s. the rating matrix is the unique maximizer to the convex program (2) with .
Note that Algorithm 2 is easy to implement as only depends on the erasure probability , which can be reliably estimated from . Moreover, the particular choice of in the theorem is just to simplify notations. It is straightforward to generalize our proof to show that the above theorem holds with for any constant .
Using Lemma 1, we immediately conclude from the above theorem that the convex program succeeds when for some constant . However, based on extensive simulation in Fig 2, we conjecture that the following result is true.
Assuming this conjecture holds, Theorem 4 implies that
for some constant is sufficient to recover the rating matrix, which is better than the previous condition by a factor of . We do not have a proof for the conjecture at this time.
Comparison to previous work In the noiseless setting with , the nuclear norm minimization approach [20, 21, 22] can be directly applied to recover data matrix and further recover the row and column clusters. It is shown in  that the nuclear norm minimization approach exactly recovers the matrix with high probability if . The performance guarantee for Algorithm 2 given in (4) is better by at least a factor of . Theorem 3 shows that the combinatorial method exactly recovers the row and column clusters if , which is substantially better than the two previous conditions by at least a factor of . This suggests that a large performance gap might exist between exponential-time algorithms and polynomial-time algorithms. Similar performance gap due to computational complexity constraint has also been observed in other inference problems like Sparse PCA [27, 28, 29] and sparse submatrix detection [18, 19, 30].
In the low noise setting with restricting to be a small constant, the low-rank plus sparse matrix decomposition approach [24, 25, 26] can be applied to exactly recover data matrix and further recover the row and column clusters. It is shown in  that a weighted nuclear norm and norm minimization succeeds with high probability if and for two constants and . The performance guarantee for Algorithm 2 given in (4) is better by several factors and we allow the fraction of noisy entries to be any constant less than . Moreover, our proof turns out to be much simpler.
Vii Spectral Method
In this section, we study a polynomial-time clustering algorithm based on the spectral projection of the observed rating matrix . The description is given in Algorithm 3.
Step 1 of the algorithm produces two subsets, and of such that: for each rating is observed in with probability independently of other elements; and is independent of The purpose of Step 1 is to remove dependency between Step 2 and Steps 3, 4 in our proof. In particular, to establish our theoretical results, we identify the initial clustering of users and movies using , and then majority voting and reclustering are done using . In practice, one can simply use the same set of observations, i.e., .
The following theorem shows that the spectral method exactly recovers the user and movie clusters under a condition stronger than (4). In particular, we show that Step 3 exactly recovers the block rating matrix and Step 4 cleans up clustering errors made in Step 2.
for a positive constant , then Algorithm 3 with a.a.s. exactly recovers user and movie clusters, and the rating matrix .
Algorithm 3 is also easy to implement as only depends on parameters and . As mentioned before, the erasure probability can be reliably estimated from using empirical statistics. The number of clusters can be reliably estimated by searching for the largest eigen-gap in the spectrum of (See Algorithm 2 and Theorem 3 in  for justification). We further note that the threshold used in the theorem can be replaced by for any constant .
Comparison to previous work Variants of spectral method are widely used for clustering nodes in a graph. Step 2 of Algorithm 3 for approximate clustering has been previously proposed and it is analyzed in . In , an adaptation of Step 1 is shown to exactly recover a fixed number of clusters under the planted partition model. More recently,  proves an upper bound on the number of nodes “mis-clustered” by spectral method under the stochastic block model with a growing number of clusters.
Compared to previous work, the main novelty of Algorithm 3 is Steps 1, 3, and 4 which allow for exact cluster recovery even with a growing number of clusters. To our knowledge, Theorem 5 provides the first theoretical result on spectral method for exact cluster recovery with a growing number of clusters.
Viii-a Proof of Theorem 1
Without loss of generality, suppose that users are in cluster and users are in cluster . We construct a block-constant matrix with the same movie cluster structure as but a different user cluster structure. In particular, under , user forms a new cluster with users and user forms a new cluster with users .
Let -th row of be identical to the -th row of for all . Consider all movies in movie cluster . If the ratings of user to movies in movie cluster are all erased, then let and for ; otherwise let and for . If the ratings of user to movies in movie cluster are all erased, then let and for ; otherwise let and for . From the above procedure, it follows that the first row of is identical to the -th row of for all , and the second row of is identical the -th row of for all .
We show that agrees with on all non-erased entries. We say that movie cluster is conflicting between user and user cluster if (1) user cluster and have different block rating on movie cluster ; and (2) the ratings of user to movies in movie cluster are not all erased; and (3) the block corresponding to user cluster and movie cluster is not totally erased. Therefore, the probability that movie cluster is conflicting between user and user cluster equals to . By the union bound,
where the third inequality follows because for and and the last inequality follows from the assumption. Similarly, the probability that there exists a conflicting movie cluster between user and cluster is also upper bounded by . Hence, with probability at least , there is no conflicting movie cluster between user and cluster as well as between user and cluster , and thus agrees with on all non-erased entries.
Viii-B Proof of Theorem 2
Consider a genie-aided scenario where the set of flipped entries is revealed as side information, which is equivalent to saying that we are in the noiseless setting with . Then the true partition corresponding to the true user cluster structure has zero disagreement. Suppose users are in true cluster and users are in true cluster . We construct a new partition different from the true partition by swapping user and user . In particular, under the new partition, user forms a new cluster with users , user forms a new cluster with users . It suffices to show that for , any two users in has zero disagreement with probability at least , in which case the new partition has zero agreement and Algorithm 1 cannot distinguish between the true partition and the new one.
For , we lower bound the probability that any two users in has zero disagreement.
By union bound, the probability that for , any two users in has zero disagreement is at least .
Viii-C Proof of Theorem 3
Consider a compatibility graph with vertices representing users. Two vertices are connected if users have zero disagreement, i.e., . In the noiseless setting, each user cluster forms a clique of size in the compatibility graph. We call a clique of size in the compatibility graph a bad clique if it is formed by users from more than one cluster. Then to prove the theorem, it suffices to show that there is no bad clique a.a.s. Since the probability that bad cliques exist increases in , without loss of generality, we assume .
Recall that is or with equal probability. Define for . As , by Chernoff bound, we get that a.a.s., for any
Assume this condition holds throughout the proof.
Fix a set of users that consists of users from different clusters. Without loss of generality, assume these users are from cluster . Let denote the number of users from the cluster and define . By definition, , and . For any movie in cluster , among the ratings given by these users, there are ratings being and ratings being . Let denote the event that the observed ratings for movie by these users are the same. Then,
Let be the probability that users, out of which are from cluster , form a bad clique. Because are independent and there are movies in each movie cluster,
for some constant . For a large enough constant in the assumption regarding in the statement of the theorem and using the fact that , we have
Viii-D Proof of Theorem 4
We first introduce some notations. Let be the normalized characteristic vector of user cluster , i.e., if user is in cluster and otherwise. Thus, . Let . Similarly, let be the normalized characteristic vector of movie cluster and . It is not hard to see that the rating matrix can be written as . Denote the SVD of the block rating matrix by , then the SVD of is simply , where and . When , has full rank almost surely . We will assume is full rank in the following proofs, which implies that and . Note that and .
We now briefly recall the subgradient of the nuclear norm . Define to be the subspace spanned by all matrices of the form or for any . The orthogonal projection of any matrix onto the space is given by . The projection of onto the complement space is . Then is a subgradient of at if and only if and .
of Lemma 1.
Assume user is from user cluster and movie is in movie cluster , then
where the inequality follows from the Cauchy-Schwartz inequality. By definition . ∎
Next we establish the concentration property of . By definition the conditional expectation of is given by
Furthermore, the variance is given by
The following corollary applies Theorem 1.4 in  to bound the spectral norm .
If for a constant , then conditioned on ,
We adopt the trick called dilations . In particular, define as
Observe that , so it is sufficient to prove the theorem for . Conditioned on , is a random symmetric matrix with each entry bounded by , and are independent random variables with mean and variance at most . By Theorem 1.4 in  , if , then conditioned on a.a.s.
where the last inequality holds when is sufficiently large. ∎
of Theorem 4.
For any feasible that , we have to show that Rewrite as
The first term in (13) can be written as
where the second equality follows from the fact that and . Define the normalized noise matrix . Note that and . The second term in (13) becomes . By Corollary 1, almost surely. Thus is a subgradient of at . Hence, for the third term in (13), . Therefore,
where the last inequality follows from definition of the incoherence parameter . Below we bound the term . From the definition of and the fact that and ,
We first bound . To bound the term , assume user belongs to user cluster and let be the set of users in user cluster . Recall that is the normalized characteristic vector of user cluster . Then
which is the average of independent random variables. By Bernstein’s inequality (stated in the supplementary material), with probability at least ,