Statistical-Computational Tradeoffs in Planted Problems and Submatrix Localization with a Growing Number of Clusters and Submatrices

02/06/2014 · by Yudong Chen, et al. · University of Illinois at Urbana-Champaign · UC Berkeley

We consider two closely related problems: planted clustering and submatrix localization. The planted clustering problem assumes that a random graph is generated based on some underlying clusters of the nodes; the task is to recover these clusters given the graph. The submatrix localization problem concerns locating hidden submatrices with elevated means inside a large real-valued random matrix. Of particular interest is the setting where the number of clusters/submatrices is allowed to grow unbounded with the problem size. These formulations cover several classical models such as planted clique, planted densest subgraph, planted partition, planted coloring, and the stochastic block model, which are widely used for studying community detection and clustering/bi-clustering. For both problems, we show that the space of the model parameters (cluster/submatrix size, cluster density, and submatrix mean) can be partitioned into four disjoint regions corresponding to decreasing statistical and computational complexities: (1) the impossible regime, where all algorithms fail; (2) the hard regime, where the computationally expensive Maximum Likelihood Estimator (MLE) succeeds; (3) the easy regime, where the polynomial-time convexified MLE succeeds; (4) the simple regime, where a simple counting/thresholding procedure succeeds. Moreover, we show that each of these algorithms provably fails in the preceding, harder regimes. Our theorems establish the minimax recovery limits, which are tight up to constants and hold with a growing number of clusters/submatrices, and provide stronger performance guarantees than previously known for polynomial-time algorithms. Our study demonstrates the tradeoffs between statistical and computational considerations, and suggests that the minimax recovery limit may not be achievable by polynomial-time algorithms.


1 Introduction

In this paper we consider two closely related problems: planted clustering and submatrix localization, both concerning the recovery of hidden structures from a noisy random graph or matrix.

  • Planted Clustering: Suppose that out of a total of $n$ nodes, $rK$ of them are partitioned into $r$ clusters of size $K$, and the remaining $n - rK$ nodes do not belong to any cluster; each pair of nodes is connected by an edge with probability $p$ if they are in the same cluster, and with probability $q$ otherwise. Given the adjacency matrix $A$ of the graph, the goal is to recover the underlying clusters (up to a permutation of cluster indices). By varying the values of the model parameters, this formulation covers several classical models including planted clique, planted coloring, planted densest subgraph, planted partition, and the stochastic block model (cf. Definition 1 and the discussion thereafter).

  • Submatrix Localization: Suppose $W \in \mathbb{R}^{n_1 \times n_2}$ is a random matrix with independent Gaussian entries of unit variance, where there are $r$ submatrices of size $K_1 \times K_2$ with disjoint row and column supports, such that the entries inside these submatrices have mean $\mu > 0$ and the entries outside have mean zero. The goal is to identify the locations of these hidden submatrices given $W$. This formulation generalizes the submatrix detection and bi-clustering models with a single submatrix/cluster that are studied in previous work (cf. Definition 2 and the discussion thereafter). (Samplers for both models are sketched in code after this list.)

We are particularly interested in the setting where the number $r$ of clusters or submatrices may grow unbounded with the problem dimensions $n$, $n_1$, and $n_2$ at an arbitrary rate. We may call this the high-rank setting because $r$ equals the rank of a matrix representation of the clusters and submatrices (cf. Definitions 1 and 2). The other parameters $K$, $K_1$, $K_2$, $p$, $q$, and $\mu$ are also allowed to scale with the problem dimensions.

These two problems have been studied under various names such as community detection, graph clustering/bi-clustering, and reconstruction in stochastic block models, and have a broad range of applications. They are used as generative models for approximating real-world networks and data arrays with natural cluster/community structures, such as social networks [40], gene expressions [69], and online ratings [77]. They serve as benchmarks in the evaluation of algorithms for clustering [58], bi-clustering [15], community detection [65], and other network inference problems. They also provide a venue for studying the average-case behaviors of many graph-theoretic problems including max-clique, max-cut, graph partitioning, and coloring [21, 33]. The importance of these two problems is well recognized in many areas across computer science, statistics, and physics [67, 14, 64, 34, 62, 53, 12, 20, 10].

The planted clustering and submatrix localization problems exhibit an interplay between statistical and computational considerations. From a statistical point of view, we are interested in identifying the range of the model parameters for which the hidden structures—in this case the clusters and submatrices—can be recovered from the noisy data ($A$ or $W$). The values of the parameters govern the statistical hardness of the problems: the problems become more difficult with smaller values of $K$, $p - q$, and $\mu$ and with larger $r$, because the observations are noisier and the sought-after structures are more complicated. A statistically powerful algorithm is one that can recover the hidden structures in a large region of the model parameter space.

From a computational point of view, we are concerned with the running time of different recovery algorithms. An exhaustive search over the solution space (i.e., all possible clusterings or locations of the submatrices) may make for a statistically powerful algorithm, but is computationally intractable. A simpler algorithm with lower running time is computationally more desirable, but may succeed only in a smaller region of the model parameter space and thus has weaker statistical power.

Therefore, it is important to take a joint statistical-computational view of the planted clustering and submatrix localization problems, and to understand the tradeoffs between these two considerations. How do algorithms with different computational complexities achieve different statistical performance? For these two problems, what is the information limit (under what conditions on the model parameters does recovery become infeasible for any algorithm), and what is the computational limit (when does it become infeasible for computationally tractable algorithms)?

The results in this paper shed light on the above questions. For both problems, our results demonstrate, in a precise and quantitative way, the following phenomenon: the parameter space can be partitioned into four disjoint regions, such that each region corresponds to statistically easier instances of the problem than the previous one, and recovery can be achieved by simpler algorithms with lower running time. Significantly, there may exist a large gap between the statistical performance of computationally intractable algorithms and that of computationally efficient ones. We elaborate in the next two subsections.

1.1 Planted Clustering: The Four Regimes

For concreteness, we first consider the planted clustering problem in a representative setting with $p > q$. This covers the standard planted bisection/partition/$r$-disjoint-clique models.

The statistical hardness of cluster recovery is captured by a single SNR-type quantity in the model parameters, which is essentially a measure of the Signal-to-Noise Ratio (SNR). Our main theorems identify the following four regimes of the problem, defined by the value of this quantity. Here, for simplicity, the results use the notation $\gtrsim$ and $\lesssim$, which ignores constant and logarithmic factors; our main theorems do capture the logarithmic factors.

  • The Impossible Regime. In this regime, there is no algorithm, regardless of its computational complexity, that can recover the clusters with a vanishing probability of error.

  • The Hard Regime. There exists a computationally expensive algorithm—specifically the Maximum Likelihood Estimator (MLE)—that recovers the clusters with high probability in this regime (as well as in the next two easier regimes; we omit such implications in the sequel). No polynomial-time algorithm is known to succeed in this regime.

  • The Easy Regime. There exists a polynomial-time algorithm—specifically a convex relaxation of the MLE—that recovers the clusters with high probability in this regime. Moreover, this algorithm provably fails in the hard regime above.

  • The Simple Regime. A simple algorithm based on counting node degrees and common neighbors recovers the clusters with high probability in this regime, and provably fails outside it (i.e., in the hard and easy regimes).

We illustrate these four regimes in Figure 1, assuming that $q$ and $K$ scale as powers of $n$. Cluster recovery becomes harder as the graph becomes sparser and the clusters smaller. In this setting, the four regimes correspond to four disjoint and non-empty regions of the parameter space. Therefore, a computationally more expensive algorithm leads to an order-wise (polynomial in $n$) enhancement in statistical power: for a given graph density, the simple, polynomial-time, and computationally intractable algorithms succeed for successively smaller cluster sizes, and there is a similar hierarchy for the allowable sparsity of the graph at a given cluster size.

Figure 1: Illustration of the four regimes (impossible, hard, easy, simple). The figure applies to the planted clustering problem as well as to the submatrix localization problem under the power-law scalings of the parameters described in the text.

The results in the impossible and hard regimes together establish the minimax recovery boundary of the planted clustering problem, and show that the MLE is statistically order-optimal. These two regimes are separated by an “information barrier”: in the impossible regime the graph does not carry enough information to distinguish different cluster structures, so recovery is statistically impossible.

Our performance guarantees for the convexified MLE improve upon the best known results for polynomial-time algorithms in terms of scaling, particularly when the number of clusters is allowed to grow with $n$. We conjecture that no polynomial-time algorithm can perform significantly better and succeed in the hard regime, i.e., that the convexified MLE achieves the computational limit order-wise. While we do not prove the conjecture, there is substantial supporting evidence; cf. Section 2.3. For instance, there is a “spectral barrier”, determined by the spectrum of an appropriately defined noise matrix, that prevents the convexified MLE and spectral clustering algorithms from succeeding in the hard regime. In the special setting with a single cluster, the work in [55, 43] proves that no polynomial-time algorithm can reliably recover the cluster in part of the hard regime, conditioned on the planted clique hardness hypothesis.

The simple counting algorithm fails outside the simple regime due to a “variance barrier” which is associated with the fluctuations of the node degrees and the numbers of common neighbors. The simple algorithm is statistically order-wise weaker than the convexified MLE in separating different clusters.

General results

Our main theorems apply beyond the special setting above and allow for general values of $K$, $r$, $p$, and $q$. The four regimes and the statistical-computational tradeoffs can be observed for a broad spectrum of planted problems, including the planted partition, planted coloring, planted $r$-disjoint-clique, and planted densest subgraph models. Table 1 summarizes the implications of our results for some of these models. More precise and general results are given in Section 2.

                    Reference             Planted r-Disjoint-Clique   Planted Partition   Planted Coloring
Impossible          Thm 2.1, Cor 2.2
MLE                 Thm 2.3, Cor 2.4
Convexified MLE     Thm 2.5
Simple Counting     Thm 2.9, Rem 2.10

Table 1: Our results specialized to different planted models. Here the notation $\gtrsim$ and $\lesssim$ ignores constant factors. This table shows the necessary conditions for any algorithm to succeed under a mild assumption, as well as the sufficient conditions under which the algorithms in this paper succeed, corresponding to the four regimes described in Section 1.1. The relevant theorems/corollaries are also listed. The conditions for the convexified MLE and simple counting can further be shown to be necessary in a broad range of settings; cf. Theorems 2.7 and 2.11. The results in this table are not the strongest possible; see the referenced theorems for more precise statements.

1.2 Submatrix Localization: The Four Regimes

Similar results hold for the submatrix localization problem. Consider the setting $n_1 = n_2 = n$ and $K_1 = K_2 = K$. The statistical hardness of submatrix localization is captured by an SNR-type quantity in $\mu$, $K$, and $n$. In the high SNR setting where $\mu$ grows faster than $\sqrt{\log n}$, the submatrices can be trivially identified by element-wise thresholding, since the maximum of the unit-variance noise entries is of order $\sqrt{\log n}$ (a thresholding sketch is given after the list below). In the more interesting low SNR setting, our main theorems identify the following four regimes, which have the same meanings as before:

  • The Impossible Regime. All algorithms fail in this regime.

  • The Hard Regime. The computationally expensive MLE succeeds, and it is conjectured that no polynomial-time algorithm succeeds here.

  • The Easy Regime. The polynomial-time convexified MLE succeeds, and it provably fails in the hard regime.

  • The Simple Regime. A simple thresholding algorithm succeeds, and it provably fails outside this regime.

We illustrate these four regimes in Figure 1, assuming that $\mu$ and $K$ scale as powers of $n$. In fact, the results above hold in the more general setting where the entries of $W$ are sub-Gaussian.
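As a concrete illustration of the trivial high SNR regime, the following sketch localizes the submatrices by element-wise thresholding; the threshold $\mu/2$ and the helper name are our choices for illustration, valid only when $\mu$ is well above $\sqrt{\log(n_1 n_2)}$ so that signal and noise entries separate.

```python
import numpy as np

def localize_by_thresholding(W, mu):
    """High-SNR localization: an entry is declared 'signal' if it exceeds mu / 2.
    With mu >> sqrt(log(n1 * n2)), no N(0, 1) noise entry crosses the threshold,
    every N(mu, 1) signal entry does (w.h.p.), and the surviving entries reveal
    the row/column supports directly."""
    S = W > mu / 2.0
    rows = np.flatnonzero(S.sum(axis=1) > 0)   # rows touching some submatrix
    cols = np.flatnonzero(S.sum(axis=0) > 0)   # columns touching some submatrix
    return rows, cols, S
```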

1.3 Discussions

This paper presents a systematic study of planted clustering and submatrix localization with a growing number of clusters/submatrices. We provide sharp characterizations of the minimax recovery boundary, with the lower and upper bounds matching up to constants. We also give improved performance guarantees for convex optimization approaches and the simple counting/thresholding algorithms. In addition, complementary results are given on the failure conditions for these algorithms, hence characterizing their performance limits. Our analysis addresses several challenges that arise in the high-rank setting. The results in this paper highlight the similarity between planted clustering and submatrix localization, and place under a unified framework several classical problems such as planted clique, planted partition, planted coloring, and planted densest subgraph.

The central theme of our investigation is the interaction between the statistical and computational aspects of the problems, i.e., how to handle more noise and more complicated structures using more computation. Our study parallels a recent line of work that takes a joint statistical and computational view of inference problems [15, 66, 18, 24, 55]; several of these works are closely related to special cases of the planted clustering and bi-clustering models. In this sense, we investigate two specific but fundamental problems, and we expect that the phenomena and principles described in this paper are relevant more generally. Below we provide additional discussion and comment on relations to existing work.

High rank vs. rank one.

Several recent works investigate the problems of single-submatrix detection/localization [50, 13], planted densest subgraph detection [14], and sparse principal component analysis (PCA) [11] (cf. Section 1.4 for a literature review). Even earlier is the extensive study of the statistical/computational hardness of Planted Clique. The majority of these works focus on the rank-one setting with a single clique, cluster, submatrix, or principal component. This paper considers the more general high-rank setting where the number of clusters/submatrices may grow quickly with the problem size. This setting is important in many empirical networks [54, 67], and poses significant challenges to the analysis. Moreover, there are qualitative differences between these two settings; we discuss one such difference in the next paragraph.

The power of convex relaxations.

In the previous work on the rank-one case of the submatrix detection/localization problem [55, 15] and the sparse PCA problem [51], it is shown that simple algorithms based on averaging/thresholding have order-wise similar statistical performance to that of more sophisticated convex optimization approaches. In contrast, for the problems of finding multiple clusters/submatrices, we show that convex relaxation approaches are statistically much more powerful than the simple counting/thresholding algorithm. Our analysis reveals that the power of convex relaxations lies in separating different clusters/submatrices, but not in identifying a single cluster/submatrix. Our results thus provide one explanation for the (somewhat curious) observation in previous work regarding the lack of benefit from using sophisticated methods, and demonstrate a finer spectrum of computational-statistical tradeoffs.

Detection vs. estimation.

Several recent works on planted densest subgraph and submatrix detection have focused on the detection, or hypothesis testing, version of the problems, i.e., detecting the existence of a dense cluster or an elevated submatrix (cf. Section 1.4 for a literature review). In this paper, we study the (support) estimation version of the problems, where the goal is to find the precise locations of the clusters/submatrices. In general, estimation appears to be harder than detection. For example, comparing Figure 1 of this paper with Figure 1 in [55], which studies submatrix detection, we see a gap between the minimax localization boundary and the minimax detection boundary. For the planted densest subgraph problem, we see a similar gap between the minimax detection and estimation boundaries when comparing our results with those in [14, 43]. In addition, it is shown in [55, 43] that above a certain threshold the planted submatrix or densest subgraph can be detected in linear time, while below it no polynomial-time test exists, assuming the hardness of the planted clique detection problem. For estimation, we prove a sufficient condition that gives the best known performance guarantee for polynomial-time algorithms—again a gap between detection and estimation. For detecting a sparse principal component, see the seminal work [18], which proves computational lower bounds conditioned on the hardness of Planted Clique.

Extensions.

It is a simple exercise to extend our results to a variant of the planted clustering model where the graph adjacency matrix has sub-Gaussian entries instead of Bernoulli ones, corresponding to a weighted graph clustering problem. Similarly, we can extend the submatrix localization problem to the setting with Bernoulli entries, which is the bi-clustering problem on an unweighted graph and covers the planted bi-clique problem [39, 9] as a special case.

1.4 Related Work

There is a large body of literature, from the physics, computer science, and statistics communities, on models and algorithms for graph clustering and bi-clustering, as well as on their various extensions and applications. A complete survey is beyond the scope of this paper. Here we focus on theoretical work on planted clustering/submatrix localization concerning exact recovery of the clusters/submatrices. Detailed comparisons of existing results with ours are provided after we present each of our theorems in Sections 2 and 3. We emphasize that our results are non-asymptotic and applicable to finite problem sizes, whereas some of the results below require the problem size to tend to infinity.

Planted Clique, Planted Densest Subgraph

The planted clique model ($r = 1$, $p = 1$, $q = 1/2$) is the most widely studied planted model. If the clique has size $K \le 2\log_2 n$, recovery is impossible, as the random graph itself contains a clique of at least the same size almost surely; if $K \ge c \log n$ for a sufficiently large constant $c$, an exhaustive search succeeds [6]; if $K = \Omega(\sqrt{n})$, various polynomial-time algorithms work [6, 35, 36]; if $K = \Omega(\sqrt{n \log n})$, the nodes in the clique can be easily identified by counting degrees [52]. It is an open problem to find polynomial-time algorithms that succeed in the regime $K = o(\sqrt{n})$, and it is believed that this cannot be done [45, 48, 4, 39]. The four regimes above can be considered a special case of our results for the general planted clustering model. The planted densest subgraph model generalizes the planted clique model by allowing general values of $p$ and $q$. The detection version of this problem is studied in [14, 73], and conditional computational hardness results are obtained in [43].

Planted -Disjoint-Cliques, Partition, and Coloring

Subsequent work considers the setting with multiple planted cliques [60], as well as the planted partition model (a.k.a. stochastic block model) with general values of $p$ and $q$ [33, 46]. A subset of these results allow for a growing number of clusters $r$. Most existing work focuses on the recovery performance of specific polynomial-time algorithms. The state-of-the-art recovery results for planted $r$-disjoint-clique are given in [60, 29, 8], and for planted partition in [29, 12, 23]; see [30] for a survey of these results. The setting with $p < q$ is sometimes called the heterophily case, with the planted coloring model ($p = 0$) as an important special case [5, 32]. Our performance guarantees for the convexified MLE (cf. Table 1) improve upon previously known results for polynomial-time algorithms. Also, particularly when the number of clusters is allowed to scale arbitrarily with $n$, matching upper and lower bounds for the information-theoretic limits were previously unknown. This paper identifies the minimax recovery thresholds for general values of the model parameters, and shows that they are achieved by the MLE. Our results also suggest that polynomial-time algorithms may not be able to achieve these thresholds when $r$ grows and the cluster size is sublinear in $n$.

Converse Results for Planted Problems

Complementary to the achievability results, another line of work focuses on converse results, i.e., identifying necessary conditions for recovery, either for any algorithm or for any algorithm in a specific class. For the planted partition model, necessary conditions for any algorithm to succeed are obtained in [26, 29, 16, 1] using information-theoretic tools. For spectral clustering algorithms and convex optimization approaches, more stringent conditions are shown to be needed [64, 74]. We generalize and improve upon the existing work above.

Sharp Exact Recovery Thresholds with a Constant Number of Clusters

Since the conference version of this paper was published [31], a number of papers have appeared on the information-theoretic limits of exact recovery under the stochastic block model. Under the special setting with two equal-sized clusters and edge probabilities of order $\log n / n$, the recovery threshold with sharp constants is identified in [1], and in [63] for more general scalings. Very recently, [2] proved the sharp recovery threshold for the more general case where the number of clusters is bounded and the in-cluster and cross-cluster edge probabilities are heterogeneous and scale as $\Theta(\log n / n)$. Notably, when the number of clusters is bounded, sharp recovery thresholds may be achieved by polynomial-time algorithms, in particular by the semidefinite programming relaxation of the maximum likelihood estimator [42, 44]. Our results are optimal only up to absolute constant factors, but they are non-asymptotic and apply to a growing number of clusters/submatrices of size sublinear in $n$.

Approximate Recovery

While not the focus of this paper, approximate cluster recovery (under various criteria) has also been studied, e.g., for the planted partition model in [61, 62, 56, 78, 34]. These results are not directly comparable to ours, but the approximate recovery conditions often differ from the exact recovery conditions by a logarithmic factor. Where constant factors are concerned, the existence of a hard regime is also conjectured in [34, 61].

Submatrix Localization

The statistical and computational tradeoffs in locating a single submatrix (i.e., $r = 1$) are studied in [15, 50], where the information limit is shown to be achieved order-wise by a computationally intractable algorithm. Success and failure conditions for various polynomial-time procedures are also derived there. The work [7] focuses on success conditions for a convex relaxation approach; we improve these results, particularly in the high-rank setting. The single-submatrix detection problem is studied in [22, 69, 70, 13, 19], and the recent work [55] establishes the conditional hardness of this problem.

1.5 Paper Organization and Notation

The remainder of this paper is organized as follows. In Section 2 we set up the planted clustering model and present our main theorems for the impossible, hard, easy, and simple regimes. In Section 3 we turn to the submatrix localization problem and provide the corresponding theorems for the four regimes. Section 4 provides a brief summary with a discussion of future work. We prove the main theorems for planted clustering and submatrix localization in Sections 5 and 6, respectively.

Notation

Let $a \vee b := \max\{a, b\}$ and $a \wedge b := \min\{a, b\}$, and let $[m] := \{1, 2, \ldots, m\}$ for any positive integer $m$. We use $c_1, c_2, \ldots$ to denote absolute numerical constants whose values can be made explicit and are independent of the model parameters. We use the standard big-O notation: for two sequences $\{a_n\}$ and $\{b_n\}$, we write $a_n = O(b_n)$ or $a_n \lesssim b_n$ to mean $a_n \le c\, b_n$ for an absolute constant $c$ and all $n$. Similarly, $a_n = \Omega(b_n)$ or $a_n \gtrsim b_n$ means $b_n = O(a_n)$, and $a_n = \Theta(b_n)$ or $a_n \asymp b_n$ means that both hold.

2 Main Results for Planted Clustering

The planted clustering problem is defined by five parameters $n$, $r$, $K$, $p$, and $q$, where $n$, $r$, and $K$ are positive integers with $rK \le n$, and $p, q \in [0, 1]$.

Definition 1 (Planted Clustering).

Suppose $n$ nodes (which are identified with the set $[n]$) are divided into two subsets $V_1$ and $V_2$ with $|V_1| = rK$ and $|V_2| = n - rK$. The nodes in $V_1$ are partitioned into $r$ disjoint clusters $C_1^*, \ldots, C_r^*$ (called true clusters), where $|C_m^*| = K$ for each $m \in [r]$. Nodes in $V_2$ do not belong to any of the clusters and are called isolated nodes. A random graph is generated based on the cluster structure: each pair of nodes, independently of all other pairs, is connected by an edge with probability $p$ (called the in-cluster edge density) if the two nodes are in the same cluster, and otherwise with probability $q$ (called the cross-cluster edge density).

We emphasize again that the values of $K$, $r$, $p$, and $q$ are allowed to be functions of $n$. The goal is to exactly recover the true clusters (up to a permutation of cluster indices) given the random graph.

The model parameters are assumed to be known to the algorithms. This assumption is often not necessary and can be relaxed [29, 14]. It is also possible to allow for non-uniform cluster sizes [3], and heterogeneous edge probabilities [23] and node degrees [26, 29]. These extensions are certainly important in practical applications; we do not delve into them, and point to the referenced papers above and the references therein for work in this direction.

To facilitate the subsequent discussion, we introduce a matrix representation of the planted clustering problem. We represent the true clusters by a cluster matrix $Y^* \in \{0, 1\}^{n \times n}$, where $Y^*_{ii} = 1$ for $i \in V_1$, $Y^*_{ii} = 0$ for $i \in V_2$, and, for $i \neq j$, $Y^*_{ij} = 1$ if and only if nodes $i$ and $j$ are in the same true cluster. Note that the rank of $Y^*$ equals $r$, hence the name of the high-rank setting. The adjacency matrix of the graph is denoted by $A$. Under the planted clustering model, we have $\mathbb{E}[A_{ij}] = p$ if $Y^*_{ij} = 1$ and $\mathbb{E}[A_{ij}] = q$ if $Y^*_{ij} = 0$, for all $i \neq j$. The problem reduces to recovering $Y^*$ given $A$.
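In code, the cluster matrix can be built directly from per-node cluster labels, following the convention of the sampler sketched in Section 1 (isolated nodes carry the label −1); this small helper is ours, for illustration.

```python
import numpy as np

def cluster_matrix(labels):
    """Y[i, j] = 1 iff nodes i and j lie in the same true cluster (so Y[i, i] = 1
    for clustered nodes and 0 for isolated ones); rank(Y) equals the number of
    clusters, matching the high-rank terminology."""
    same = (labels[:, None] == labels[None, :]) & (labels[:, None] >= 0)
    return same.astype(int)
```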

The planted clustering model generalizes several classical planted models.

  • Planted $r$-Disjoint-Clique [60]. Here $p = 1$ and $q < 1$, so $r$ cliques of size $K$ are planted into an Erdős-Rényi random graph $G(n, q)$. The special case with $r = 1$ is known as the planted clique problem [6].

  • Planted Densest Subgraph [14]. Here $r = 1$ and $p > q$, so there is a subgraph of size $K$ and edge density $p$ planted into a $G(n, q)$ graph.

  • Planted Partition [33]. Also known as the stochastic blockmodel [46]. Here $n = rK$ and $p \neq q$. The special case with $r = 2$ can be called planted bisection [33]. The case with $p < q$ is sometimes called planted noisy coloring or planted $r$-cut [34, 21].

  • Planted $r$-Coloring [5]. Here $n = rK$ and $p = 0$, so each cluster corresponds to a group of disconnected nodes that are assigned the same color.

Reduction to the case $p > q$.

For clarity we shall focus on the homophily setting with $p > q$; results for the case $p < q$ are similar. In fact, any achievability or converse result for the case $p > q$ immediately implies a corresponding result for $p < q$. To see this, observe that if the graph $A$ is generated from the planted clustering model with $p < q$, then the flipped graph $J - I - A$ (where $J$ is the all-one matrix and $I$ is the identity matrix) can be considered as generated with in/cross-cluster edge densities $p' = 1 - p$ and $q' = 1 - q$, where $p' > q'$. Therefore, a problem with $p < q$ can be reduced to one with $p > q$, and clearly the reduction can also be done in the other direction.
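In code, the reduction is a one-line complement of the off-diagonal entries; this sketch assumes the adjacency matrix has a zero diagonal, as in the sampler above.

```python
import numpy as np

def flip_graph(A):
    """Map a heterophilic instance (p < q) to a homophilic one with densities
    p' = 1 - p > q' = 1 - q by taking the complement graph J - I - A."""
    n = A.shape[0]
    return np.ones((n, n), dtype=int) - np.eye(n, dtype=int) - A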

2.1 The Impossible Regime: Minimax Lower Bounds

In this section, we characterize the necessary conditions for cluster recovery. Let $\mathcal{Y}$ be the set of cluster matrices corresponding to $r$ clusters of size $K$; i.e.,
$$\mathcal{Y} := \left\{ Y : Y \text{ is the cluster matrix of } r \text{ disjoint size-}K \text{ clusters in } [n] \right\}.$$
We use $\hat{Y}$ to denote an estimator that takes as input the graph $A$ and outputs an element of $\mathcal{Y}$ as an estimate of the true $Y^*$. Our results are stated in terms of the Kullback-Leibler (KL) divergence between two Bernoulli distributions with means $u$ and $v$, denoted by $D(u \| v)$. The following theorem gives a lower bound on the minimax error probability of recovering $Y^*$.

Theorem 2.1 (Impossible).

Under the planted clustering model with $p > q$, if one of the following two conditions holds:

(1)
(2)

then the minimax error probability of recovery is bounded away from zero,
$$\inf_{\hat{Y}} \sup_{Y^* \in \mathcal{Y}} \mathbb{P}\big[\hat{Y}(A) \neq Y^*\big] \geq c_0,$$
where the infimum ranges over all measurable functions of the graph and $c_0 > 0$ is a universal constant.

The theorem shows that it is fundamentally impossible to recover the clusters with success probability close to 1 in the regime where (1) or (2) holds, which is thus called the impossible regime. This regime arises from an information/statistical barrier: the KL divergence on the LHS of (1) and (2) determines how much information about $Y^*$ is contained in the data $A$. If the in-cluster and cross-cluster edge distributions are close (as measured by the KL divergence) or the cluster size is small, then $A$ does not carry enough information to distinguish different cluster matrices.

It is sometimes more convenient to use the following corollary, derived by upper-bounding the KL divergence in (1) and (2) using its Taylor expansion. This corollary was used when we overviewed our results in Section 1.1. See Table 1 for its implications for specific planted models.

Corollary 2.2.

Under the planted clustering model with $p > q$, if any one of the following three conditions holds:

(3)
(4)
(5)

then the same minimax lower bound as in Theorem 2.1 holds.

Note the asymmetry between the roles of $p$ and $q$ in the conditions (1) and (2); this is made more apparent in Corollary 2.2. To see why the asymmetry is natural, recall that by a classical result of [41], the largest clique in a $G(n, q)$ random graph has size $(2 + o(1)) \log_{1/q} n$ almost surely. Such a clique cannot be distinguished from a true cluster if $K$ is below this size, even when $p = 1$; this is predicted by the condition (5). Moreover, cluster recovery requires $p \gtrsim \frac{\log K}{K}$ to ensure that all true clusters are internally connected, matching the condition (4). The last term on the RHS of (1) and (4) is relevant only in part of the parameter space; potential improvement on this term is left to future work.

Comparison to previous work

When $p = 1$ and $q = 1/2$, our results recover the $K \asymp \log n$ threshold for the classical planted clique problem. For the planted partition model, the work in [26, 28] establishes a necessary condition for recovery; our result is stronger by a logarithmic factor. The work in [1] also considers planted partition and focuses on the special case with edge probabilities of order $\log n / n$; the condition established there is consistent with our results up to constants in this regime. Compared to previous work, we handle the more general setting where $r$, $p$, and $q$ may scale arbitrarily with $n$.

2.2 The Hard Regime: Optimal Algorithm

In this subsection, we characterize sufficient conditions for cluster recovery that match the necessary conditions given in Theorem 2.1 up to constant factors. We consider the Maximum Likelihood Estimator of $Y^*$ under the planted clustering model, which we now derive. The log-likelihood of observing the graph $A$ given a cluster matrix $Y \in \mathcal{Y}$ is

$$\log \mathbb{P}[A \mid Y] = \log\frac{p(1-q)}{q(1-p)} \sum_{i<j} A_{ij} Y_{ij} + \log\frac{1-p}{1-q} \sum_{i<j} Y_{ij} + \log\frac{q}{1-q} \sum_{i<j} A_{ij} + \binom{n}{2} \log(1-q). \tag{6}$$

Given $A$, the MLE maximizes the log-likelihood over the set $\mathcal{Y}$ of all possible cluster matrices. Note that $\sum_{i<j} Y_{ij} = r\binom{K}{2}$ for all $Y \in \mathcal{Y}$, so the last three terms in (6) are independent of $Y$. Therefore, the MLE for the case $p > q$ is given as in Algorithm 1.

$$\hat{Y} \in \arg\max_{Y} \ \sum_{i<j} A_{ij} Y_{ij} \tag{7}$$
$$\text{s.t.} \quad Y \in \mathcal{Y} \tag{8}$$
Algorithm 1 Maximum Likelihood Estimator ($p > q$)

Algorithm 1 is equivalent to finding $r$ disjoint clusters of size $K$ that maximize the number of edges inside the clusters (similar to Densest $k$-Subgraph), or minimize the number of edges outside the clusters (similar to Balanced Cut) or the number of disagreements between $A$ and $Y$ (similar to Correlation Clustering in [17]). Therefore, while Algorithm 1 is derived from the planted clustering model, it is in fact quite general and not tied to the modeling assumptions. Enumerating over the set $\mathcal{Y}$ is computationally intractable in general, since $|\mathcal{Y}|$ grows super-exponentially with the number of clustered nodes.
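For intuition, here is a toy brute-force version of Algorithm 1; the helper names are ours, and the exhaustive enumeration of $\mathcal{Y}$ makes it usable only for very small $n$.

```python
import numpy as np
from itertools import combinations

def mle_clusters(A, r, K):
    """Brute-force MLE: over all choices of r disjoint size-K clusters (the
    remaining nodes are isolated), maximize the number of within-cluster edges."""
    n = A.shape[0]

    def partitions(avail, chosen):
        # enumerate unordered partitions of `avail` into clusters of size K;
        # pinning the smallest node avoids permutations of cluster labels
        if len(chosen) == r:
            yield tuple(chosen)
            return
        first = min(avail)
        for others in combinations(sorted(avail - {first}), K - 1):
            c = (first,) + others
            yield from partitions(avail - set(c), chosen + [c])

    def in_cluster_edges(clusters):
        return sum(A[i, j] for c in clusters for i, j in combinations(c, 2))

    candidates = (cl
                  for nodes in combinations(range(n), r * K)
                  for cl in partitions(set(nodes), []))
    return max(candidates, key=in_cluster_edges)
```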

The following theorem provides a success condition for the MLE.

Theorem 2.3 (Hard).

Under the planted clustering model with $p > q$, there exists a universal constant $c_1$ such that the optimal solution to the problem (7)–(8) is unique and equal to $Y^*$ with high probability if both of the following hold:

(9)

We refer to the regime in which the condition (9) holds but (14) below fails as the hard regime, as clustering there is statistically possible but conjectured to be computationally hard (cf. Conjecture 2.8). The condition (9) above and the conditions (1)–(2) in Theorem 2.1 match up to a constant factor under a mild assumption. This establishes the minimax recovery boundary for planted clustering and the minimax optimality of the MLE up to constant factors.

By lower bounding the KL divergence, we obtain the following corollary, which is sometimes more convenient to use. See Table 1 for its implications for specific planted models.

Corollary 2.4.

For planted clustering with $p > q$, there exists a universal constant $c_2$ such that the optimal solution to the problem (7)–(8) is unique and equal to $Y^*$ with high probability provided

(10)

The condition (10) can be simplified in the complementary regimes of the parameters, and the simplified conditions match the converse conditions in Corollary 2.2 up to constants.

Comparison to previous work

Theorem 2.3 provides the first minimax results that are tight up to constant factors when the number of clusters is allowed to grow, potentially at a nearly-linear rate in $n$. Interestingly, for a fixed cluster size, the recovery boundary (9) depends only weakly on the number of clusters $r$, through the logarithmic term. For $p = 1$ and $q = 1/2$, we recover the $K \asymp \log n$ recovery boundary for planted clique. For the planted densest subgraph model, the minimax detection boundary is derived in [14]; our results show that the minimax recovery boundary is strictly above the detection boundary. For the planted bisection model with two equal-sized clusters, the sharp recovery boundary found in [1] and [63] is consistent with our results up to constants, and the correlated recovery limit shown in [61, 56, 62] is consistent with our results up to a logarithmic factor.

2.3 The Easy Regime: Polynomial-Time Algorithms

In this subsection, we present a polynomial-time algorithm for the planted clustering problem and show that it succeeds in the easy regime described in the introduction.

Our algorithm is based on a convex relaxation of the MLE in Algorithm 1. Note that the objective function (7) of the MLE is linear in $Y$, but the constraint set $\mathcal{Y}$ is discrete, non-convex, and exponentially large. We replace this non-convex constraint with a trace norm (a.k.a. nuclear norm) constraint and a set of linear constraints. This leads to the convexified MLE given in Algorithm 2. Here the trace norm $\|Y\|_*$ is defined as the sum of the singular values of $Y$. Note that the true $Y^*$ is feasible for the optimization problem (11)–(13) since $\|Y^*\|_* = rK$.

$$\max_{Y \in \mathbb{R}^{n \times n}} \ \sum_{i,j} A_{ij} Y_{ij} \tag{11}$$
$$\text{s.t.} \quad \|Y\|_* \le rK, \quad 0 \le Y_{ij} \le 1 \ \text{ for all } i, j \tag{12}$$
$$\sum_{i,j} Y_{ij} = rK^2 \tag{13}$$
Algorithm 2 Convexified Maximum Likelihood Estimator ($p > q$)

The optimization problem in Algorithm 2 is a semidefinite program (SDP) and can be solved in polynomial time by standard interior point methods or by various fast specialized algorithms such as ADMM; e.g., see [47, 7]. Similarly to Algorithm 1, this algorithm is not strictly tied to the planted clustering model, as it can also be considered a relaxation of Correlation Clustering or Balanced Cut. In the case where the values of $r$ and $K$ are unknown, one may replace the hard constraints (12) and (13) with an appropriately weighted penalty in the objective function; cf. [29].
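The sketch below solves a relaxation of this form with cvxpy; the exact constraints follow our reading of (11)–(13) above, and the rounding step is our addition, so the snippet should be treated as illustrative rather than as the paper's implementation.

```python
import cvxpy as cp
import numpy as np

def convexified_mle(A, r, K):
    """Trace-norm relaxation of the MLE: maximize <A, Y> subject to
    ||Y||_* <= r*K, entry-wise 0 <= Y <= 1, and sum(Y) = r*K^2."""
    n = A.shape[0]
    Y = cp.Variable((n, n))
    constraints = [
        cp.normNuc(Y) <= r * K,      # trace-norm budget: ||Y*||_* = r*K
        Y >= 0, Y <= 1,              # box relaxation of the 0/1 entries
        cp.sum(Y) == r * K ** 2,     # Y* has exactly r*K^2 ones
    ]
    prob = cp.Problem(cp.Maximize(cp.sum(cp.multiply(A, Y))), constraints)
    prob.solve()
    return (Y.value > 0.5).astype(int)   # round the fractional solution
```

In the easy regime the optimal solution is already integral (equal to $Y^*$) with high probability, so the rounding is a no-op; the threshold $1/2$ is an arbitrary tie-breaker otherwise.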

The following theorem provides a sufficient condition for the success of the convexified MLE. See Table 1 for its implications for specific planted models.

Theorem 2.5 (Easy).

Under the planted clustering model with $p > q$, there exists a universal constant $c_3$ such that, with high probability, the optimal solution to the problem (11)–(13) is unique and equal to $Y^*$ provided

(14)

When $r = 1$, we refer to the regime where the condition (14) holds and (17) below fails as the easy regime. When $r \ge 2$, the easy regime is where (14) holds and (17) or (18) below fails.

It is easy to see that the smallest possible cluster size allowed by (14) is of order $\sqrt{n}$ and the largest number of clusters is of order $\sqrt{n}$, both of which are achieved when $K = \Theta(\sqrt{n})$. This generalizes the $K = \Theta(\sqrt{n})$ tractability threshold of the classic planted clique problem. In the high SNR setting, the condition (14) becomes less restrictive, and it is possible to go beyond the $\sqrt{n}$ limit on the cluster size; in particular, the smallest possible cluster size can then be much smaller than $\sqrt{n}$.

Remark 2.6.

Theorem 2.5 immediately implies guarantees for other, tighter convex relaxations. The constraint set in Algorithm 2 is defined by the trace norm and box constraints in (12), while the standard SDP relaxation instead imposes the positive semidefinite constraint $Y \succeq 0$ together with the linear constraints; the latter feasible set is contained in the former. Therefore, if we replace the constraint (12) with the SDP constraint, we obtain a tighter relaxation of the MLE, and Theorem 2.5 guarantees that it also succeeds in recovering $Y^*$ under the condition (14). The same is true for other tighter relaxations, such as those involving the triangle inequalities [58], the row-wise constraints [7], the max norm [47], or the Fantope constraint [76]. For the purpose of this work, these variants of the convex formulation make no significant difference, and we focus on (11)–(13) for generality.

Converse for the trace norm relaxation approach

We have a partial converse to the achievability result in Theorem 2.5. The following theorem characterizes conditions under which the trace norm relaxation (11)–(13) provably fails with high probability; we suspect that the standard SDP relaxation with the constraint $Y \succeq 0$ also fails with high probability under the same conditions, but we do not have a proof.

Theorem 2.7 (Easy, Converse).

Under the planted clustering model with $p > q$, there exist positive universal constants for which the following holds. Suppose the parameters satisfy certain mild technical conditions. If the condition (15) below is violated by a sufficiently large constant factor, then with high probability $Y^*$ is not an optimal solution of the program (11)–(13).

Theorem 2.7 proves the failure of our trace norm relaxation even when it has access to the exact number and sizes of the clusters. Consequently, replacing the constraints (12) and (13) with a Lagrangian penalty term in the objective would not help, for any value of the Lagrangian multipliers. Under the assumptions of Theorems 2.5 and 2.7, and ignoring log factors, the sufficient and necessary condition for the success of our convexified MLE is

(15)

whereas the success condition (10) for the MLE simplifies to a strictly weaker condition without the extra term.

We see that the convexified MLE is statistically sub-optimal due to the extra second term in (15). This term is responsible for the $\sqrt{n}$ threshold on the cluster size for the tractability of planted clique, and it has an interesting interpretation. Let $\bar{A} := A - \mathbb{E}[A]$ be the centered adjacency matrix. The matrix $\bar{A} \circ (J - Y^*)$, where $\circ$ denotes the element-wise product, i.e., the deviation restricted to the cross-cluster node pairs, can be viewed as the “cross-cluster noise matrix”. Note that the squared largest singular value of the signal matrix $(p - q)Y^*$ is $K^2 (p - q)^2$, whereas the squared largest singular value of the noise matrix concentrates around a constant times $nq(1 - q)$ (see, e.g., [25]). Therefore, the second term in (15) is the “spectral noise-to-signal ratio” that determines the performance of the convexified MLE. In fact, our proofs of Theorems 2.5 and 2.7 build on this intuition.
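The spectral intuition is easy to check numerically. The sketch below (reusing the `planted_clustering` and `cluster_matrix` helpers sketched in earlier sections) compares the top singular value $(p - q)K$ of the signal part against the spectral norm of the centered adjacency matrix, which is of order $\sqrt{nq(1-q)}$; the specific parameter values are arbitrary.

```python
import numpy as np

n, r, K, p, q = 1200, 4, 120, 0.10, 0.05
A, labels = planted_clustering(n, r, K, p, q)
Y = cluster_matrix(labels)

P = np.where(Y > 0, p, q)                 # E[A] off the diagonal
np.fill_diagonal(P, 0)
signal = (p - q) * K                      # top singular value of (p - q) * Y
noise = np.linalg.norm(A - P, ord=2)      # spectral norm of the deviation A - E[A]
print(f"signal {signal:.1f} vs noise {noise:.1f}")
# The convex method needs the signal to dominate the noise; here signal ~ 6
# while noise ~ 15, i.e., this instance sits below the spectral barrier.
```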

Comparison to previous work

We refer to [29] for a survey of the performance of state-of-the-art polynomial-time algorithms under various planted models. Theorem 2.5 matches, and in many cases improves upon, existing results in terms of scaling. For example, for planted partition, the previous best results are those in [29] and [12]; Theorem 2.5 removes some extra logarithmic factors and is also order-wise better in the high SNR case. For planted $r$-disjoint-clique, existing results [60, 8, 29] require stronger conditions on the cluster size, which we improve.

Our converse result in Theorem 2.7 is inspired by, and improves upon, the recent work in [74], which focuses on a special case and considers a convex relaxation that is equivalent to our relaxation (11)–(13) but without the additional equality constraint in (13). That approach is shown to fail in part of the parameter space. Our result is stronger in the sense that it applies to a tighter relaxation and to a larger region of the parameter space.

Limits of polynomial-time algorithms

By comparing the recovery limit established in Theorems 2.1 and 2.3 with the performance limit of our convex method established in Theorem 2.5, we make two strikingly different observations. On one hand, when the cluster size $K$ is linear in $n$, the recovery limit and the performance limit of our convex method coincide up to constant factors. Thus, the convex relaxation is tight and the hard regime disappears up to constants, even though a hard regime may still exist where constant factors are concerned [61, 34]. In this case, we obtain a computationally efficient and statistically order-optimal estimator. On the other hand, when $K$ is sublinear in $n$, there exists a substantial gap between the information limit and the performance limit of our convex method. We conjecture that no polynomial-time algorithm has order-wise better statistical performance than the convexified MLE or succeeds significantly beyond the condition (14).

Conjecture 2.8.

There is no algorithm with running time polynomial in $n$ that, for all $n$ and with non-vanishing probability, outputs the true $Y^*$ of the planted clustering problem with $p > q$ whenever

(16)

If the conjecture is true, then in the asymptotic regime the computational limit for cluster recovery is given by the spectral barrier condition, i.e., the boundary between the easy and hard regimes in Figure 1.

A rigorous proof of Conjecture 2.8 seems difficult with current techniques. There are other possible convex formulations for planted clustering, and the space of possible polynomial-time algorithms is larger still; it is infeasible to study each of them separately and obtain a converse result as in Theorem 2.7. There are, however, several pieces of evidence that support the conjecture:

  • The special case with $p = 1$, $q = 1/2$, and $r = 1$ corresponds to the $K = o(\sqrt{n})$ regime of the classical Planted Clique problem, which is conjectured to be computationally hard [4, 68, 39] and has been used as an assumption for proving other hardness results [45, 48, 49]. Conjecture 2.8 can be considered a generalization of the Planted Clique conjecture to the setting with multiple clusters and general values of $p$ and $q$, and may be used to study the computational hardness of other problems [27].

  • It is shown in [43] that, in the special setting with a single cluster, no polynomial-time algorithm can reliably recover the planted cluster in part of the hard regime, conditioned on the planted clique hardness hypothesis. Here the planted clique hardness hypothesis refers to the statement that for any fixed constant $\epsilon > 0$, there exist no randomized polynomial-time tests that distinguish an Erdős-Rényi random graph $G(n, 1/2)$ from a planted clique model obtained by adding edges among $K \le n^{1/2 - \epsilon}$ vertices chosen uniformly at random to form a clique.

  • As discussed earlier, if (16) holds, then the graph spectrum is dominated by noise and fails to reveal the underlying cluster structure. The condition (16) therefore represents a “spectral barrier” for clustering. The work in [64] uses a similar spectral barrier argument to prove the failure of a large class of algorithms that rely on the graph spectrum; our Theorem 2.7 shows that the convexified MLE fails for a similar reason.

  • In the sparse graph case with $p, q = \Theta(1/n)$, it is argued in [34], using non-rigorous but deep arguments from statistical physics, that it is intractable even to achieve correlated recovery under the condition (16).

2.4 The Simple Regime: A Counting Algorithm

In this subsection, we consider a simple recovery procedure in Algorithm 3, which is based on counting node degrees and common neighbors.

  1. (Identify isolated nodes) For each node $i$, compute its degree $d_i$. Declare $i$ as isolated if $d_i$ falls below a threshold set between the expected degree $(n-1)q$ of an isolated node and the expected degree $(K-1)p + (n-K)q$ of a clustered node.

  2. (Identify clusters when $r \ge 2$) For every pair of non-isolated nodes $i$ and $j$, compute the number of common neighbors $s_{ij}$, and assign them to the same cluster if $s_{ij}$ exceeds a threshold. Declare an error if any inconsistency is found.

Algorithm 3 A Simple Counting Algorithm

We note that Steps 1 and 2 of Algorithm 3 are considered in [52] and [38], respectively, for the special cases of recovering a single planted clique or two planted clusters. Let $E$ be the set of edges. It is not hard to see that Step 1 runs in time linear in $|E|$ and Step 2 runs in time proportional to the number of length-two paths, since each node only needs to look up its local neighborhood up to distance two. It is possible to achieve an even smaller expected running time using clever data structures. (A compact implementation sketch is given below.)
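A compact dense-matrix sketch of Algorithm 3; the thresholds are placed halfway between the relevant expected values (the paper's exact thresholds differ by constants), and the greedy grouping stands in for the paper's consistency check.

```python
import numpy as np

def counting_algorithm(A, n, K, r, p, q):
    """Step 1: classify nodes by degree. Step 2: group surviving nodes by
    common-neighbor counts."""
    deg = A.sum(axis=1)
    # isolated nodes: expected degree (n-1)q; clustered: (K-1)p + (n-K)q
    t1 = ((n - 1) * q + (K - 1) * p + (n - K) * q) / 2.0
    clustered = np.flatnonzero(deg >= t1)

    S = A @ A                                   # S[i, j] = # common neighbors
    # same-cluster pairs: ~ (K-2)p^2 + (n-K)q^2 common neighbors in expectation;
    # cross-cluster pairs: ~ 2(K-1)pq + (n-2K)q^2
    same_mean = (K - 2) * p ** 2 + (n - K) * q ** 2
    diff_mean = 2 * (K - 1) * p * q + (n - 2 * K) * q ** 2
    t2 = (same_mean + diff_mean) / 2.0

    clusters, unassigned = [], set(clustered.tolist())
    while unassigned:
        i = unassigned.pop()
        mates = {j for j in unassigned if S[i, j] >= t2}
        clusters.append({i} | mates)
        unassigned -= mates
    return clusters
```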

The following theorem provides sufficient conditions for the simple counting algorithm to succeed. Compared to the previous work in [52, 38], our results apply to general values of $K$, $r$, $p$, and $q$. See Table 1 for its implications for specific planted models.

Theorem 2.9 (Simple).

For planted clustering with $p > q$, there exist universal constants such that Algorithm 3 correctly finds the isolated nodes with high probability if

(17)

and finds the clusters with high probability if, in addition,

(18)
Remark 2.10.

If $q \to 1$ as $n \to \infty$, we can obtain slightly better performance by counting common non-neighbors in Step 2, which succeeds under the condition (18) with $p$ and $q$ replaced by $1 - p$ and $1 - q$, respectively.

In the case with a single cluster ($r = 1$), we refer to the regime where the condition (17) holds as the simple regime; in the case with $r \ge 2$, the simple regime is where both conditions (17) and (18) hold. It is instructive to compare these conditions with the success condition (14) for the convexified MLE. The condition (17) has only an additional logarithmic factor on the RHS. This means that when $r = 1$ and the only task is to find the isolated nodes, the counting algorithm performs nearly as well as the sophisticated convexified MLE. On the other hand, when $r \ge 2$ and one needs to distinguish between different clusters, the convexified MLE order-wise outperforms the counting algorithm, as the condition (18) is order-wise more restrictive than (14). Nevertheless, there are regimes in which both algorithms can recover clusters of the same order of size, making the simple counting algorithm a legitimate candidate in such settings and a benchmark against which other algorithms can be compared.

In the high SNR case, the counting algorithm can recover clusters of size much smaller than $\sqrt{n}$.

Converse for the counting algorithm

We have a (nearly-)matching converse to Theorem 2.9. The following theorem characterizes when the counting algorithm provably fails.

Theorem 2.11 (Simple, Converse).

Under the planted clustering model with $p > q$, there exist universal constants for which the following holds. Suppose certain mild technical conditions hold (cf. Remark 2.12). Then Algorithm 3 fails to correctly identify all the isolated nodes with constant probability if

(19)

and fails to correctly recover all the clusters with constant probability if

(20)
Remark 2.12.

Theorem 2.11 requires a mild technical condition, which is actually not too restrictive: if the expected number of common neighbors is too small, then two nodes from the same cluster will have no common neighbor with substantial probability, so Algorithm 3 cannot succeed with the probability specified in Theorem 2.9.

Apart from some technical conditions, Theorems 2.9 and 2.11 show that the conditions (17) and (18) are both sufficient and necessary. In particular, the counting algorithm cannot succeed outside the simple regime, and it is indeed strictly weaker than the convexified MLE in separating different clusters. Our proof reveals that the performance of the counting algorithm is limited by a variance barrier: the RHSs of (17) and (18) are associated with the variance of the node degrees and of the numbers of common neighbors (i.e., $d_i$ and $s_{ij}$ in Algorithm 3), respectively. There exist nodes whose degrees deviate from their expected values by an amount on the order of the standard deviation, and if the condition (17) does not hold, this deviation outweighs the difference between the expected degrees of the isolated nodes and those of the non-isolated nodes. A similar argument applies to the numbers of common neighbors.

3 Main Results for Submatrix Localization

In this section, we turn to the submatrix localization problem, sometimes known as bi-clustering [15]. We consider the following specific setting, defined by six parameters $n_1$, $n_2$, $K_1$, $K_2$, $r$, and $\mu > 0$, such that $rK_1 \le n_1$ and $rK_2 \le n_2$.

Definition 2 (Submatrix Localization).

A random matrix $W \in \mathbb{R}^{n_1 \times n_2}$ is generated as follows. Suppose that $rK_1$ of the rows of $W$ are partitioned into $r$ disjoint subsets $R_1, \ldots, R_r$ of equal size $K_1$, and $rK_2$ of the columns of $W$ are partitioned into $r$ disjoint subsets $C_1, \ldots, C_r$ of equal size $K_2$. For each pair $(i, j)$, we have $W_{ij} \sim N(\mu, 1)$ if $i \in R_m$ and $j \in C_m$ for some $m \in [r]$, and $W_{ij} \sim N(0, 1)$ otherwise.