Minimax Rates in Network Analysis: Graphon Estimation, Community Detection and Hypothesis Testing

11/14/2018, by Chao Gao, et al.

This paper surveys some recent developments in fundamental limits and optimal algorithms for network analysis. We focus on minimax optimal rates in three fundamental problems of network analysis: graphon estimation, community detection, and hypothesis testing. For each problem, we review state-of-the-art results in the literature followed by general principles behind the optimal procedures that lead to minimax estimation and testing. This allows us to connect problems in network analysis to other statistical inference problems from a general perspective.


1 Introduction

Network analysis [52] has gained considerable research interest in both theory [12] and applications [51, 101]. In this survey, we review recent developments that establish the fundamental limits and lead to optimal algorithms in some of the most important statistical inference tasks. Consider a stochastic network represented by an adjacency matrix $A\in\{0,1\}^{n\times n}$. In this paper, we restrict ourselves to the setting where the network is an undirected graph without self loops. To be specific, we assume that $A_{ij}=A_{ji}\in\{0,1\}$ for all $i\ne j$ and $A_{ii}=0$ for all $i\in[n]$. The symmetric matrix $\theta=(\theta_{ij})$ with $\theta_{ij}=\mathbb{E}[A_{ij}]$ models the connectivity pattern of a social network and fully characterizes the data generating process. The statistical problem we are interested in is to learn structural information of the network encoded in the matrix $\theta$. We focus on the following three problems:

  1. Graphon estimation. The celebrated Aldous–Hoover theorem [5, 61] asserts that the exchangeability of $\{A_{ij}\}$ implies the representation $\theta_{ij}=f(\xi_i,\xi_j)$ for some nonparametric function $f$. Here, the $\xi_i$'s are i.i.d. random variables uniformly distributed in the unit interval $[0,1]$. The function $f$ is referred to as the graphon of the network. The problem of graphon estimation is to estimate $f$ with the observed adjacency matrix.

  2. Community detection.

    Many social networks such as collaboration networks and political networks exhibit clustering structure. This means that the connectivity pattern is determined by the clustering labels of the network nodes. In general, for an assortative network, one expects that two network nodes are more likely to be connected if they are from the same cluster. For a disassortative network, the opposite pattern is expected. The task of community detection is to learn the clustering structure, and is also referred to as the problem of graph partition or network cluster analysis.

  3. Hypothesis testing.

    Perhaps the most fundamental question in network analysis is whether a network has any structure at all. For example, an Erdős–Rényi graph has a constant connectivity probability for all edges, and is regarded as having no interesting structure. In comparison, a stochastic block model has a clustering structure that governs the connectivity pattern. Therefore, before conducting any specific network analysis, one should first test whether a network has some structure or not. The test between an Erdős–Rényi graph and a stochastic block model is one of the simplest examples.

This survey emphasizes the development of minimax rates for these problems. The state-of-the-art of the three problems listed above will be reviewed in Section 2, Section 3, and Section 4, respectively. In each section, we will introduce the critical mathematical techniques that are used to derive optimal solutions. When appropriate, we will also discuss the general principles behind the problems. This allows us to connect the results of network analysis to some other interesting statistical inference problems.

Real social networks are often sparse, which means that the number of edges is of a smaller order compared with the number of nodes squared. How to model sparse networks is a longstanding topic of debate [73, 12, 33, 21]. In this paper, we adopt the notion of network sparsity $\rho=\max_{i\ne j}\theta_{ij}\to0$, proposed by [12]. Theoretical foundations of this sparsity notion were investigated by [13, 17]. There are other, perhaps more natural, notions of network sparsity, and we will discuss potential open problems in Section 5.

We close this section by introducing some notation that will be used in the paper. For an integer $d$, we use $[d]$ to denote the set $\{1,2,\ldots,d\}$. Given two numbers $a,b\in\mathbb{R}$, we use $a\vee b=\max(a,b)$ and $a\wedge b=\min(a,b)$. For two positive sequences $\{a_n\}$ and $\{b_n\}$, $a_n\lesssim b_n$ means $a_n\le Cb_n$ for some constant $C>0$ independent of $n$, and $a_n\asymp b_n$ means $a_n\lesssim b_n$ and $b_n\lesssim a_n$. We write $a_n=o(b_n)$ if $a_n/b_n\to0$. For a set $S$, we use $\mathbb{I}\{S\}$ to denote its indicator function and $|S|$ to denote its cardinality. For a vector $v=(v_1,\ldots,v_d)\in\mathbb{R}^d$, its norms are defined by $\|v\|_1=\sum_{i=1}^d|v_i|$, $\|v\|^2=\sum_{i=1}^dv_i^2$, and $\|v\|_\infty=\max_{1\le i\le d}|v_i|$. For two matrices $A,B\in\mathbb{R}^{d_1\times d_2}$, their trace inner product is defined as $\langle A,B\rangle=\sum_{i\in[d_1]}\sum_{j\in[d_2]}A_{ij}B_{ij}$. The Frobenius norm and the operator norm of $A$ are defined by $\|A\|_{\rm F}=\sqrt{\langle A,A\rangle}$ and $\|A\|_{\rm op}=s_{\max}(A)$, where $s_{\max}(A)$ denotes the largest singular value.

2 Graphon estimation

2.1 Problem settings

A graphon is a nonparametric object that determines the data generating process of a random network. The concept is from the literature of exchangeable arrays [5, 61, 66] and graph limits [74, 37]. We consider a random graph with adjacency matrix $A\in\{0,1\}^{n\times n}$, whose sampling procedure is determined by

$$(\xi_1,\ldots,\xi_n)\sim\mathbb{P}_\xi,\qquad A_{ij}\mid(\xi_i,\xi_j)\sim\mathrm{Bernoulli}(\theta_{ij}),\qquad\text{where }\theta_{ij}=f(\xi_i,\xi_j). \qquad (1)$$

For $i\ne j$, we have $A_{ij}=A_{ji}$, and we set $A_{ii}=0$ for all $i\in[n]$. Conditioning on $(\xi_1,\ldots,\xi_n)$, the $A_{ij}$'s are mutually independent across all $i<j$. The function $f$ on $[0,1]^2$, which is assumed to be symmetric, is called the graphon. The graphon offers a flexible nonparametric way of modeling stochastic networks. We note that exchangeability leads to i.i.d. random variables $\xi_1,\ldots,\xi_n$ sampled from $\mathrm{Uniform}[0,1]$, but for the purpose of estimating $\theta=(\theta_{ij})$, we do not require this assumption.
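To make the sampling scheme (1) concrete, below is a minimal simulation sketch in Python/NumPy. The function name and the particular graphon used in the example are illustrative choices of ours, not objects defined in the paper.

```python
import numpy as np

def sample_graphon_network(n, f, rng=None):
    """Sample an undirected adjacency matrix A following the scheme (1).

    f : symmetric function on [0,1]^2 giving connection probabilities.
    """
    rng = np.random.default_rng(rng)
    xi = rng.uniform(0.0, 1.0, size=n)                 # latent positions xi_1, ..., xi_n
    theta = f(xi[:, None], xi[None, :])                # theta_ij = f(xi_i, xi_j)
    upper = np.triu(rng.random((n, n)) < theta, k=1)   # Bernoulli(theta_ij) for i < j
    A = (upper | upper.T).astype(int)                  # symmetrize; diagonal stays 0
    return A, theta, xi

# Example with an illustrative smooth graphon (not one from the paper):
A, theta, xi = sample_graphon_network(200, lambda x, y: 0.3 + 0.2 * np.cos(np.pi * (x + y)))
```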

We point out an interesting connection between graphon estimation and nonparametric regression. In the formulation of (1), suppose we observe both the adjacency matrix $A$ and the latent variables $\xi_1,\ldots,\xi_n$; then $f$ can simply be regarded as a regression function that maps $(\xi_i,\xi_j)$ to the mean of $A_{ij}$. However, in the setting of network analysis, we only observe the adjacency matrix $A$. The latent variables $\xi_1,\ldots,\xi_n$ are usually used to model latent features of the network nodes [59, 75], and are not always available in practice. Therefore, graphon estimation is essentially a nonparametric regression problem without observing the covariates, which leads to a new phenomenon in the minimax rate that we will present below.

In the literature, various estimators have been proposed. For example, a singular value thresholding method is analyzed by [23] and later improved by [103]. The paper [73] considers a Bayesian nonparametric approach. Another popular procedure is to estimate the graphon via histograms or stochastic block model approximation [102, 22, 4, 89, 14, 15]. Minimax rates of graphon estimation are investigated by [43, 46, 68].

2.2 Optimal rates

Before discussing the minimax rate of estimating a nonparametric graphon, we first consider graphons that are block-wise constant functions. This is equivalently recognized as stochastic block models (SBMs) [60, 88]. Consider the mean matrix $\theta_{ij}=\mathbb{E}[A_{ij}]$ for all $i\ne j$. The class of SBMs with $k$ clusters is defined as

$$\Theta_k=\left\{\theta\in[0,1]^{n\times n}:\ \theta_{ij}=Q_{z(i)z(j)}\ \text{for some symmetric }Q\in[0,1]^{k\times k}\ \text{and some }z\in[k]^n\right\}. \qquad (2)$$

In other words, the network nodes are divided into $k$ clusters that are determined by the cluster labels $z(1),\ldots,z(n)\in[k]$. The subsets $\{i\in[n]:z(i)=a\}$ with $a\in[k]$ form a partition of $[n]$. The mean matrix $\theta$ is piecewise constant with respect to the blocks $\{(i,j):z(i)=a,\,z(j)=b\}$.

In this setting, graphon estimation is the same as estimating the mean matrix $\theta$. If we know the clustering labels $z$, then we can simply calculate the sample averages of $\{A_{ij}\}$ in each block $\{(i,j):z(i)=a,\,z(j)=b\}$. Without the knowledge of $z$, a least-squares estimator proposed by [43] is

$$\hat\theta=\mathop{\rm argmin}_{\theta'\in\Theta_k}\|A-\theta'\|_{\rm F}^2, \qquad (3)$$

which can be understood as the sample averages of $\{A_{ij}\}$ over the estimated blocks $\{(i,j):\hat z(i)=a,\,\hat z(j)=b\}$.
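For intuition, the inner step of (3) is simple: given candidate labels $z$, the optimal block matrix is just the block-wise average of $A$, and the least-squares objective follows; the full estimator then searches over all $k^n$ label assignments, which is what makes (3) computationally hard. The sketch below, with hypothetical helper names of ours, implements only this inner step.

```python
import numpy as np

def block_average_fit(A, z, k):
    """Given labels z in {0,...,k-1}^n, return the block-constant fit theta_hat
    whose blocks are the sample averages of A (off-diagonal entries only)."""
    n = A.shape[0]
    Q = np.zeros((k, k))
    off_diag = ~np.eye(n, dtype=bool)            # exclude the diagonal A_ii = 0
    for a in range(k):
        for b in range(k):
            block = np.outer(z == a, z == b) & off_diag
            if block.any():
                Q[a, b] = A[block].mean()
    theta_hat = Q[z][:, z]                       # theta_hat_ij = Q_{z(i) z(j)}
    np.fill_diagonal(theta_hat, 0.0)
    return theta_hat, Q

def least_squares_objective(A, theta_hat):
    """Sum of squared residuals over off-diagonal entries, as in (3)."""
    off_diag = ~np.eye(A.shape[0], dtype=bool)
    return ((A - theta_hat)[off_diag] ** 2).sum()
```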

To study the performance of the least-squares estimator $\hat\theta$, we need to introduce some additional notation. Since $\hat\theta\in\Theta_k$, the estimator can be written as $\hat\theta_{ij}=\hat Q_{\hat z(i)\hat z(j)}$ for some symmetric $\hat Q\in[0,1]^{k\times k}$ and some $\hat z\in[k]^n$. The true matrix that generates $A$ is denoted by $\theta^*\in\Theta_k$. Then, we define

$$\widetilde\theta=\mathop{\rm argmin}_{\theta\in\Theta_k(\hat z)}\|\theta-\theta^*\|_{\rm F}^2,\qquad\text{where }\Theta_k(\hat z)=\left\{\theta:\ \theta_{ij}=Q_{\hat z(i)\hat z(j)}\ \text{for some symmetric }Q\in[0,1]^{k\times k}\right\}.$$

Here, the class $\Theta_k(\hat z)$ consists of all SBMs with clustering structures determined by $\hat z$. Then, we immediately have the Pythagorean identity

$$\|\hat\theta-\theta^*\|_{\rm F}^2=\|\hat\theta-\widetilde\theta\|_{\rm F}^2+\|\widetilde\theta-\theta^*\|_{\rm F}^2. \qquad (4)$$

By the definition of $\hat\theta$, we have the basic inequality $\|A-\hat\theta\|_{\rm F}^2\le\|A-\theta^*\|_{\rm F}^2$. After a simple rearrangement, we have

$$\|\hat\theta-\theta^*\|_{\rm F}^2\le2\langle A-\theta^*,\hat\theta-\widetilde\theta\rangle+2\langle A-\theta^*,\widetilde\theta-\theta^*\rangle\le2\|\hat\theta-\theta^*\|_{\rm F}\left(\left|\left\langle A-\theta^*,\frac{\hat\theta-\widetilde\theta}{\|\hat\theta-\widetilde\theta\|_{\rm F}}\right\rangle\right|+\left|\left\langle A-\theta^*,\frac{\widetilde\theta-\theta^*}{\|\widetilde\theta-\theta^*\|_{\rm F}}\right\rangle\right|\right),$$

where the last inequality is by Cauchy–Schwarz and (4). Therefore, we have

$$\|\hat\theta-\theta^*\|_{\rm F}\le2\left|\langle A-\theta^*,B_1\rangle\right|+2\left|\langle A-\theta^*,B_2\rangle\right|,$$

where $B_1=\frac{\hat\theta-\widetilde\theta}{\|\hat\theta-\widetilde\theta\|_{\rm F}}$ and $B_2=\frac{\widetilde\theta-\theta^*}{\|\widetilde\theta-\theta^*\|_{\rm F}}$ are matrices with Frobenius norm $1$. To understand how these two terms are controlled, observe that, for each fixed $\hat z$, the matrix $B_1$ belongs to a subspace of matrices that are block-wise constant with respect to $\hat z$, whose dimension is at most $k^2$, and has Frobenius norm $1$, while the matrix $B_2$ is determined by $\hat z$ and thus takes at most $k^n$ different values. Finally, an empirical process argument and a union bound over the at most $k^n$ clustering configurations lead to the inequalities

$$\left|\langle A-\theta^*,B_1\rangle\right|\lesssim\sqrt{k^2+n\log k}\qquad\text{and}\qquad\left|\langle A-\theta^*,B_2\rangle\right|\lesssim\sqrt{n\log k}$$

with high probability, which then implies the bound

$$\frac{1}{n^2}\|\hat\theta-\theta^*\|_{\rm F}^2\lesssim\frac{k^2}{n^2}+\frac{\log k}{n}. \qquad (5)$$

The upper bound (5) consists of two terms. The first term $\frac{k^2}{n^2}$ corresponds to the number of parameters we need to estimate in an SBM with $k$ clusters. The second term $\frac{\log k}{n}$ results from not knowing the exact clustering structure. Since there are in total $k^n$ possible clustering configurations, the complexity $\log k^n=n\log k$ enters the error bound. Even though the bound (5) is achieved by an estimator that knows the value of $k$, a penalized version of the least-squares estimator with a penalty proportional to $k^2+n\log k$ can achieve the same bound (5) without the knowledge of $k$.

The paper [43] also shows that the upper bound (5) is sharp by proving a matching minimax lower bound. While it is easy to see that the first term $\frac{k^2}{n^2}$ cannot be avoided by a classical lower bound argument of parametric estimation, the necessity of the second term $\frac{\log k}{n}$ requires a very delicate lower bound construction. It was proved by [43] that it is possible to construct a $Q^*\in[0,1]^{k\times k}$, such that the set $\{\theta:\theta_{ij}=Q^*_{z(i)z(j)},\ z\in[k]^n\}$ has a packing number bounded below by $\exp\left(c\,n\log k\right)$ with respect to the norm $\|\cdot\|_{\rm F}$ and a radius at the order of $\sqrt{n\log k}$. This fact, together with a standard Fano inequality argument, leads to the desired minimax lower bound.

We summarize the above discussion into the following theorem.

Theorem 2.1 (Gao, Lu and Zhou [43]).

For the loss function

$$\ell(\hat\theta,\theta)=\frac{1}{n^2}\sum_{i,j}\left(\hat\theta_{ij}-\theta_{ij}\right)^2,$$

we have

$$\inf_{\hat\theta}\sup_{\theta\in\Theta_k}\mathbb{E}\,\ell(\hat\theta,\theta)\asymp\frac{k^2}{n^2}+\frac{\log k}{n},$$

for all $1\le k\le n$.

Having understood minimax rates of estimating mean matrices of SBMs, we are ready to discuss minimax rates of estimating general nonparametric graphons. We consider the following loss function that is widely used in the literature of nonparametric regression,

$$\ell(\hat\theta,f)=\frac{1}{n^2}\sum_{i,j}\left(\hat\theta_{ij}-f(\xi_i,\xi_j)\right)^2.$$

Note that this reduces to the loss $\ell(\hat\theta,\theta)$ of Theorem 2.1 if we let $\theta_{ij}=f(\xi_i,\xi_j)$ and $\theta=(\theta_{ij})$. Then, the minimax risk is defined as

$$\inf_{\hat\theta}\sup_{f\in\mathcal{H}_\alpha(M)}\sup_{\mathbb{P}_\xi}\mathbb{E}\,\ell(\hat\theta,f).$$

Here, the supremum is taken over both the function class $\mathcal{H}_\alpha(M)$ and the distribution $\mathbb{P}_\xi$ that the latent variables $\xi_1,\ldots,\xi_n$ are sampled from. While $\mathbb{P}_\xi$ is allowed to range over the class of all distributions on $[0,1]$, the Hölder class is defined as

$$\mathcal{H}_\alpha(M)=\left\{f:[0,1]^2\to[0,1]\ \text{symmetric}:\ \|f\|_{\mathcal{H}_\alpha}\le M\right\},$$

where $\alpha>0$ is the smoothness parameter and $M>0$ is the size of the class. Both are assumed to be constants. In the above definition, $\|f\|_{\mathcal{H}_\alpha}$ is the Hölder norm of the function $f$ (see [43] for the details).

The following theorem gives the minimax rate of the problem.

Theorem 2.2 (Gao, Lu and Zhou [43]).

We have

$$\inf_{\hat\theta}\sup_{f\in\mathcal{H}_\alpha(M)}\sup_{\mathbb{P}_\xi}\mathbb{E}\,\ell(\hat\theta,f)\asymp\begin{cases}n^{-\frac{2\alpha}{\alpha+1}}, & 0<\alpha<1,\\[2pt] \frac{\log n}{n}, & \alpha\ge1,\end{cases}$$

where the expectation is jointly over the adjacency matrix $A$ and the latent variables $\xi_1,\ldots,\xi_n$.

The minimax rate in Theorem 2.2 exhibits different behaviors in the two regimes depending on whether $\alpha<1$ or not. For $\alpha\in(0,1)$, we obtain the classical minimax rate for nonparametric regression. To see this, one can relate the graphon estimation problem to a two-dimensional nonparametric regression problem with sample size $N\asymp n^2$, and then it is easy to see that $N^{-\frac{2\alpha}{2\alpha+2}}=n^{-\frac{2\alpha}{\alpha+1}}$. This means that, for a nonparametric graphon that is not so smooth, whether or not the latent variables $\xi_1,\ldots,\xi_n$ are observed does not affect the minimax rate. In contrast, when $\alpha\ge1$, the minimax rate scales as $\frac{\log n}{n}$, which does not depend on the value of $\alpha$ anymore. In this regime, there is a significant difference between the graphon estimation problem and the regression problem.

Both the upper and lower bounds in Theorem 2.2 can be derived by an SBM approximation. The minimax rate given by Theorem 2.2 can be equivalently written as

$$\min_{1\le k\le n}\left\{\frac{k^2}{n^2}+\frac{\log k}{n}+k^{-2(\alpha\wedge1)}\right\},$$

where $\frac{k^2}{n^2}+\frac{\log k}{n}$ is the optimal rate of estimating a $k$-cluster SBM in Theorem 2.1, and $k^{-2(\alpha\wedge1)}$ is the approximation error for an $\alpha$-smooth graphon by a $k$-cluster SBM. As a consequence, the least-squares estimator (3) is rate-optimal with the choice $k\asymp n^{\frac{1}{(\alpha\wedge1)+1}}$. The result justifies the strategies of estimating a nonparametric graphon by network histograms in the literature [102, 22, 4, 89].
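Spelling out the balance behind this choice of $k$ (a worked sketch of the arithmetic, under the equivalent form of the rate displayed above): for $\alpha<1$,

$$\frac{k^2}{n^2}\asymp k^{-2\alpha}\ \Longleftrightarrow\ k\asymp n^{\frac{1}{\alpha+1}},\qquad\text{giving}\qquad\frac{k^2}{n^2}\asymp k^{-2\alpha}\asymp n^{-\frac{2\alpha}{\alpha+1}},$$

which dominates $\frac{\log k}{n}$; for $\alpha\ge1$, the approximation error $k^{-2}$ balances $\frac{k^2}{n^2}$ at $k\asymp\sqrt{n}$, after which the term $\frac{\log k}{n}\asymp\frac{\log n}{n}$ dominates.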

Despite its rate-optimality, a disadvantage of the least-squares estimator (3) is its computational intractability. A naive algorithm requires an exhaustive search over all $k^n$ possible clustering structures. Although a two-way $k$-means algorithm in [46] works well in practice, there is no theoretical guarantee that the algorithm can find the global optimum in polynomial time. An alternative strategy is to relax the constraint in the least-squares optimization. For instance, let $\Theta_k'$ be the set of all symmetric matrices that have rank at most $k$. It is easy to see that $\Theta_k\subset\Theta_k'$. Moreover, the relaxed estimator

$$\hat\theta'=\mathop{\rm argmin}_{\theta\in\Theta_k'}\|A-\theta\|_{\rm F}^2$$

can be computed efficiently through a simple eigenvalue decomposition. This is closely related to the procedures discussed in [23]. However, such an estimator can only achieve the rate $\frac{k}{n}$, which can be much slower than the minimax rate $\frac{k^2}{n^2}+\frac{\log k}{n}$. To the best of our knowledge, $\frac{k}{n}$ is the best known rate that can be achieved by a polynomial-time algorithm so far. We refer the readers to [103] for more details on this topic.
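A minimal sketch of this spectral relaxation is given below: it keeps the top-$k$ eigencomponents of $A$ in magnitude, which solves the rank-constrained least-squares problem by the Eckart–Young theorem. The final clipping to $[0,1]$ is a convenience we add for interpretability, not part of the formal analysis.

```python
import numpy as np

def rank_k_spectral_estimator(A, k):
    """Best rank-k symmetric approximation of A in Frobenius norm,
    obtained from the k eigenvalues of largest magnitude."""
    evals, evecs = np.linalg.eigh(A.astype(float))        # A is symmetric
    top = np.argsort(np.abs(evals))[-k:]                   # indices of top-k |eigenvalues|
    theta_hat = (evecs[:, top] * evals[top]) @ evecs[:, top].T
    return np.clip(theta_hat, 0.0, 1.0)                    # probabilities live in [0,1]
```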

2.3 Extensions to sparse networks

In many practical situations, sparse networks are more useful. A network is sparse if the maximum connection probability $\rho=\max_{i\ne j}\theta_{ij}$ tends to zero as $n$ tends to infinity. A sparse graphon is a symmetric nonnegative function $f$ on $[0,1]^2$ that satisfies $\|f\|_\infty\le\rho$ [12, 13, 17]. Analogously, a sparse SBM is characterized by the space $\Theta_k(\rho)=\{\theta\in\Theta_k:\max_{i\ne j}\theta_{ij}\le\rho\}$. An extension of Theorem 2.1 is given by the following result.

Theorem 2.3 (Gao, Ma, Lu and Zhou [46]).

We have

$$\inf_{\hat\theta}\sup_{\theta\in\Theta_k(\rho)}\mathbb{E}\,\ell(\hat\theta,\theta)\asymp\left(\rho\left(\frac{k^2}{n^2}+\frac{\log k}{n}\right)\right)\wedge\rho^2,$$

for all $1\le k\le n$.

Theorem 2.3 recovers the minimax rate of Theorem 2.1 if we set $\rho=1$. The same result is also obtained by [69] in an independent paper. To achieve the minimax rate, one can consider the constrained least-squares estimator

$$\hat\theta=\mathop{\rm argmin}_{\theta\in\Theta_k(\rho)}\|A-\theta\|_{\rm F}^2 \qquad (6)$$

when $\rho\gtrsim\frac{k^2}{n^2}+\frac{\log k}{n}$. In the situation when $\rho\lesssim\frac{k^2}{n^2}+\frac{\log k}{n}$, the minimax rate is $\rho^2$, and it can be trivially achieved by the zero estimator $\hat\theta=0$.

Theorem 2.3 also leads to optimal rates of nonparametric sparse graphon estimation in a Hölder space [46, 69]. In addition, sparse graphon estimation in a privacy-aware setting [14] and in a heavy-tailed setting [15] has also been considered in the literature.

2.4 Biclustering and related problems

The SBM can be understood as a special case of biclustering. A matrix has a biclustering structure if it is block-wise constant with respect to both row and column clustering structures. The biclustering model was first proposed by [57], and has been widely used in modern gene expression data analysis [27, 78]. Mathematically, we consider the following parameter space

$$\Theta_{k_1,k_2}=\left\{\theta\in\mathbb{R}^{n_1\times n_2}:\ \theta_{ij}=Q_{z_1(i)z_2(j)}\ \text{for some }Q\in\mathbb{R}^{k_1\times k_2},\ z_1\in[k_1]^{n_1},\ z_2\in[k_2]^{n_2}\right\}.$$

Then, for the loss function $\ell(\hat\theta,\theta)=\frac{1}{n_1n_2}\sum_{i,j}(\hat\theta_{ij}-\theta_{ij})^2$, it has been shown in [43, 46] that

$$\inf_{\hat\theta}\sup_{\theta\in\Theta_{k_1,k_2}}\mathbb{E}\,\ell(\hat\theta,\theta)\asymp\frac{k_1k_2}{n_1n_2}+\frac{\log k_1}{n_2}+\frac{\log k_2}{n_1}, \qquad (7)$$

as long as $k_1\le n_1$ and $k_2\le n_2$. The minimax rate (7) holds under both Bernoulli and Gaussian observations. When $n_1=n_2=n$ and $k_1=k_2=k$, the result (7) recovers Theorem 2.1.

The minimax rate (7) reveals a very important principle of sample complexity. In fact, for a large collection of popular problems in high-dimensional statistics, the minimax rate is often in the form of

$$\frac{\#\{\text{parameters}\}+\log\#\{\text{structures}\}}{\#\{\text{samples}\}}. \qquad (8)$$

For the biclustering problem, $n_1n_2$ is the sample size and $k_1k_2$ is the number of parameters. Since the number of biclustering structures is $k_1^{n_1}k_2^{n_2}$, the formula (8) gives (7).
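Spelling out this arithmetic with the counts just stated:

$$\frac{\#\{\text{parameters}\}+\log\#\{\text{structures}\}}{\#\{\text{samples}\}}=\frac{k_1k_2+\log\left(k_1^{n_1}k_2^{n_2}\right)}{n_1n_2}=\frac{k_1k_2}{n_1n_2}+\frac{\log k_1}{n_2}+\frac{\log k_2}{n_1},$$

which matches (7).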

To understand the general principle (8), we need to discuss the structured linear model introduced by [45]. In the framework of structured linear models, the data can be written as

$$Y=\mathcal{X}_S(\eta)+Z,$$

where $\mathcal{X}_S(\eta)$ is the signal to be recovered and $Z$ is a mean-zero noise. The signal part consists of a linear operator $\mathcal{X}_S(\cdot)$ indexed by the model/structure $S$ and parameters that are organized as $\eta$. The structure $S$ is in some discrete space $\mathcal{S}_\tau$, which is further indexed by $\tau\in\mathcal{T}$ for some finite set $\mathcal{T}$. We introduce a function $l(\cdot)$ that determines the dimension of the parameter vector. In other words, we have $\eta\in\mathbb{R}^{l(\tau)}$. Then, the optimal rate that recovers the signal $\mathcal{X}_S(\eta)$ with respect to the (normalized) squared error loss is given by

$$\frac{l(\tau)+\log|\mathcal{S}_\tau|}{n}, \qquad (9)$$

where $n$ denotes the sample size, i.e., the dimension of the data vector $Y$.

We note that (9) is a mathematically rigorous version of (8). In [45], a Bayesian nonparametric procedure was proposed to achieve the rate (9). Minimax lower bounds in the form of (9) have been investigated by [68] under a slightly different framework. Below we present a few important examples of the structured linear models.

Biclustering. In this model, it is convenient to organize the parameters as a matrix $\eta=Q\in\mathbb{R}^{k_1\times k_2}$, and then the structure is $S=(z_1,z_2)$. The linear operator $\mathcal{X}_S(\cdot)$ is determined by $[\mathcal{X}_S(Q)]_{ij}=Q_{z_1(i)z_2(j)}$ with $(i,j)\in[n_1]\times[n_2]$. With the relations $\tau=(k_1,k_2)$, $l(\tau)=k_1k_2$, $|\mathcal{S}_\tau|=k_1^{n_1}k_2^{n_2}$, we get $n=n_1n_2$ and $\log|\mathcal{S}_\tau|=n_1\log k_1+n_2\log k_2$, and the rate (7) can be derived from (9).

Sparse linear regression. The linear model $y=X\beta+z$ with a sparse $\beta\in\mathbb{R}^p$ can also be written as a structured linear model. To do this, note that an $s$-sparse $\beta$ implies a representation $X\beta=X_S\beta_S$ for some subset $S\subset[p]$ with $|S|=s$. Then, $\mathcal{X}_S(\eta)=X_S\eta$, with the relations $\tau=s$, $\mathcal{S}_\tau=\{S\subset[p]:|S|=s\}$, $l(\tau)=s$, $|\mathcal{S}_\tau|=\binom{p}{s}$, and $n$ the sample size. Since $\log\binom{p}{s}\le s\log\frac{ep}{s}$, the numerator of (9) becomes $s+s\log\frac{ep}{s}\asymp s\log\frac{ep}{s}$, which leads to $\frac{s\log(ep/s)}{n}$, the well-known minimax rate of sparse linear regression [38, 104, 91]. The principle (9) also applies to a more general row and column sparsity structure in matrix denoising [77].

Dictionary learning. Consider the model $Y=QU+Z$ for some dictionary matrix $Q$ and some coefficient matrix $U$. Each column of $U$ is assumed to be sparse. Therefore, dictionary learning can be viewed as sparse linear regression without knowing the design matrix. With the structure given by the sparsity patterns of the columns of $U$, the parameters given by $Q$ and the nonzero entries of $U$, and the sample size given by the dimension of $Y$, the formula (9) yields the minimax rate of the problem [68].

The principle (8) or (9) actually holds beyond the framework of structured linear models. We give an example of sparse principal component analysis (PCA). Consider i.i.d. observations $X_1,\ldots,X_n\sim N(0,\Sigma)$, where $\Sigma$ belongs to the following space of covariance matrices

$$\mathcal{F}(s,r,\lambda)=\left\{\Sigma=V\Lambda V^T+I_p:\ V\in O(p,r),\ |\mathrm{supp}(V)|\le s,\ \lambda\le\lambda_r\le\cdots\le\lambda_1\le M\lambda\right\},$$

where $M\ge1$ is a fixed constant. The goal of sparse PCA is to estimate the subspace spanned by the leading eigenvectors, namely the column space of $V$. Here, the notation $O(p,r)$ means the set of orthonormal matrices of size $p\times r$, $\mathrm{supp}(V)$ is the set of nonzero rows of $V$, and $\Lambda$ is a diagonal matrix with entries $\lambda_1\ge\cdots\ge\lambda_r$. It is clear that sparse PCA is a covariance model and does not belong to the class of structured linear models. Despite that, it has been proved in [20] that the minimax rate of the problem, measured in the squared subspace distance $\|\hat V\hat V^T-VV^T\|_{\rm F}^2$, is given by

$$\frac{\lambda+1}{\lambda^2}\cdot\frac{r(s-r)+s\log\frac{ep}{s}}{n}. \qquad (10)$$

The minimax rate (10) can be understood as the product of $\frac{\lambda+1}{\lambda^2}$ and $\frac{r(s-r)+s\log(ep/s)}{n}$. The second term is clearly a special case of (8). The first term can be understood as the modulus of continuity between the squared subspace distance used in (10) and the intrinsic loss function of the problem (e.g. Kullback–Leibler), because the principle (8) generally holds for an intrinsic loss function. In addition to the sparse PCA problem, minimax rates that exhibit the form of (8) or (9) can also be found in sparse canonical correlation analysis (sparse CCA) [44, 49].

3 Community detection

3.1 Problem settings

The problem of community detection is to recover the clustering labels $z(1),\ldots,z(n)$ from the observed adjacency matrix $A$ in the setting of the SBM (2). It has wide applications in various scientific areas. Community detection has received growing interest in the past several decades. Early contributions to this area focused on various cost functions to find graph clusters, in particular those based on graph cuts or modularity [51, 87, 86]. Recent research has put more emphasis on fundamental limits and provably efficient algorithms.

In order for the clustering labels to be identifiable, we impose the following condition in addition to (2),

$$\min_{1\le a\le k}Q_{aa}\ge p>q\ge\max_{a\ne b}Q_{ab}. \qquad (11)$$

This is referred to as the assortative condition, which implies that it is more likely for two nodes in the same cluster to share an edge compared with the situation where they are from two different clusters. Relaxation of the condition (11) is possible, but will not be discussed in this survey. Given an estimator $\hat z$, we consider the following loss function

$$\ell(\hat z,z)=\min_{\pi\in S_k}\frac{1}{n}\sum_{i=1}^n\mathbb{I}\{\hat z(i)\ne\pi(z(i))\},$$

where $S_k$ denotes the set of all permutations of $[k]$. The loss function measures the misclassification proportion of $\hat z$. Since permutations of the labels correspond to the same clustering structure, it is necessary to take the minimum over $\pi\in S_k$ in the definition of $\ell(\hat z,z)$.
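A direct implementation of this loss (feasible for small $k$, since it enumerates all $k!$ relabelings) might look as follows; for large $k$ one would typically switch to an assignment-problem solver instead.

```python
import numpy as np
from itertools import permutations

def misclassification_proportion(z_hat, z, k):
    """ell(z_hat, z): Hamming error rate minimized over relabelings of {0,...,k-1}."""
    z_hat, z = np.asarray(z_hat), np.asarray(z)
    best = 1.0
    for perm in permutations(range(k)):       # candidate relabeling pi
        relabeled = np.array(perm)[z]         # pi(z(i)) for every node i
        best = min(best, float(np.mean(z_hat != relabeled)))
    return best
```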

In ground-breaking works by [85, 83, 79], it is shown that the necessary and sufficient condition to find a $\hat z$ that is positively correlated with $z$ (i.e. $\ell(\hat z,z)\le\frac{1}{2}-\epsilon$ for some constant $\epsilon>0$) when $p=\frac{a}{n}$, $q=\frac{b}{n}$ and $k=2$ is $(a-b)^2>2(a+b)$. Moreover, the necessary and sufficient condition for weak consistency ($\ell(\hat z,z)\to0$) when $k=O(1)$ is $\frac{n(p-q)^2}{p}\to\infty$ [84]. Optimal conditions for strong consistency ($\ell(\hat z,z)=0$ with probability tending to one) were studied by [84, 3]. When $p=\frac{a\log n}{n}$, $q=\frac{b\log n}{n}$ and $k=2$, it is possible to construct a strongly consistent $\hat z$ if and only if $\sqrt{a}-\sqrt{b}>\sqrt{2}$, and extensions to more general SBM settings were investigated in [2]. We refer the readers to a thorough and comprehensive review by [1] for those modern developments.

Here we will concentrate on the minimax rates and algorithms that can achieve them. We favor the framework of statistical decision theory to derive minimax rates of the problem because the results automatically imply optimal thresholds for both weak and strong consistency. To be specific, the necessary and sufficient condition for weak consistency is that the minimax rate converges to zero, and the necessary and sufficient condition for strong consistency is that the minimax rate is smaller than $\frac{1}{n}$, because of the equivalence between $\ell(\hat z,z)<\frac{1}{n}$ and $\ell(\hat z,z)=0$. In addition, the minimax framework is very flexible and it allows us to naturally extend the results to more general degree corrected block models (DCBMs).

3.2 Results for SBMs

We first formally define the parameter space that we will work with,

$$\Theta_k(p,q,\beta)=\left\{(Q,z):\ z\in[k]^n,\ \min_{a}Q_{aa}\ge p,\ \max_{a\ne b}Q_{ab}\le q,\ \frac{n}{\beta k}\le n_a(z)\le\frac{\beta n}{k}\ \text{for all }a\in[k]\right\},$$

where the notation $n_a(z)$ stands for the size of the $a$th cluster, defined as $n_a(z)=|\{i\in[n]:z(i)=a\}|$, and $\beta\ge1$ is a constant that controls how balanced the cluster sizes are. We introduce a fundamental quantity that determines the signal-to-noise ratio of the community detection problem,

$$I=-2\log\left(\sqrt{pq}+\sqrt{(1-p)(1-q)}\right).$$

This is the Rényi divergence of order $\frac{1}{2}$ between $\mathrm{Bernoulli}(p)$ and $\mathrm{Bernoulli}(q)$. The next theorem gives the minimax rate for $\Theta_k(p,q,\beta)$ under the loss function $\ell(\hat z,z)$.

Theorem 3.1 (Zhang and Zhou [107]).

Assume $\frac{nI}{k\log k}\to\infty$; then

$$\inf_{\hat z}\sup_{\Theta_k(p,q,\beta)}\mathbb{E}\,\ell(\hat z,z)=\begin{cases}\exp\left(-(1+o(1))\frac{nI}{2}\right), & k=2,\\[2pt] \exp\left(-(1+o(1))\frac{nI}{\beta k}\right), & k\ge3.\end{cases} \qquad (12)$$

In addition, if $\frac{nI}{\beta k}=O(1)$, then we have $\inf_{\hat z}\sup_{\Theta_k(p,q,\beta)}\mathbb{E}\,\ell(\hat z,z)\asymp1$.

Theorem 3.1 recovers some of the optimal thresholds for weak and strong consistency results in the literature. When $k=O(1)$, weak consistency is possible if and only if $nI\to\infty$, which is equivalent to the condition $\frac{n(p-q)^2}{p}\to\infty$ [84]. Similarly, strong consistency is possible if and only if $nI\ge(2+\epsilon)\log n$ when $k=2$, and if and only if $\frac{nI}{\beta k}\ge(1+\epsilon)\log n$ when $k$ is not growing too fast [84, 3]. Between the weak and strong consistency regimes, the minimax misclassification proportion converges to zero with an exponential rate.
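As a quick sanity check of the strong consistency threshold for $k=2$ (a worked special case, not taken verbatim from [107]), plug the sparse parametrization $p=\frac{a\log n}{n}$ and $q=\frac{b\log n}{n}$, with $p,q\to0$, into the exponent of (12):

$$I=-2\log\left(\sqrt{pq}+\sqrt{(1-p)(1-q)}\right)=(1+o(1))\left(\sqrt{p}-\sqrt{q}\right)^2=(1+o(1))\frac{(\sqrt{a}-\sqrt{b})^2\log n}{n},$$

so the rate $\exp\left(-(1+o(1))\frac{nI}{2}\right)$ drops below $\frac{1}{n}$, the threshold for strong consistency, exactly when $(\sqrt{a}-\sqrt{b})^2>2$, recovering the condition $\sqrt{a}-\sqrt{b}>\sqrt{2}$ mentioned above.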

To understand why Theorem 3.1 gives a minimax rate in an exponential form, we start with a simple argument that relates the minimax lower bound to a hypothesis testing problem. We only consider the case where $k\ge3$ and $\beta>1$ are satisfied, and refer the readers to [107] for the more general argument. We choose a sequence $\delta_n$ that satisfies $\delta_n\to0$ and $\log\frac{k}{\delta_n}=o\!\left(\frac{nI}{\beta k}\right)$. Then, we choose a $z^*\in[k]^n$ such that $\frac{n}{\beta k}\le n_a(z^*)\le\frac{\beta n}{k}$ for any $a\in[k]$ and $n_u(z^*)=n_v(z^*)=\lceil(1+\delta_n)\frac{n}{\beta k}\rceil$ for a fixed pair $u\ne v\in[k]$. Recall the notation $n_a(z)=|\{i\in[n]:z(i)=a\}|$. Then, we choose some $Q^*$ with $Q^*_{aa}=p$ for all $a$ and $Q^*_{ab}=q$ for all $a\ne b$, and a subset $S\subset\{i:z^*(i)=u\}$ such that $|S|=\lceil\delta_n\frac{n}{\beta k}\rceil$. Define

$$\mathcal{Z}^*=\left\{z\in[k]^n:\ z(i)=z^*(i)\ \text{for all }i\notin S,\ z(i)\in\{u,v\}\ \text{for all }i\in S\right\}.$$

The set $\mathcal{Z}^*$ corresponds to a sub-problem in which we only need to estimate the clustering labels $\{z(i)\}_{i\in S}$. Given any $z\in\mathcal{Z}^*$, the values of $\{z(i)\}_{i\notin S}$ are known, and for each $i\in S$, there are only two possibilities that $z(i)=u$ or $z(i)=v$. The idea is that this sub-problem is simple enough to analyze but it still captures the hardness of the original community detection problem. Now, we define the subspace

$$\Theta^*=\left\{(Q^*,z):\ z\in\mathcal{Z}^*\right\}.$$

We have $\Theta^*\subset\Theta_k(p,q,\beta)$ by the construction of $z^*$ and $S$. This gives the lower bound

$$\inf_{\hat z}\sup_{\Theta_k(p,q,\beta)}\mathbb{E}\,\ell(\hat z,z)\ \ge\ \inf_{\hat z}\sup_{z\in\mathcal{Z}^*}\mathbb{E}\,\ell(\hat z,z)\ \ge\ \inf_{\hat z}\sup_{z\in\mathcal{Z}^*}\mathbb{E}\,\frac{1}{n}\sum_{i\in S}\mathbb{I}\{\hat z(i)\ne z(i)\}. \qquad (13)$$

The last inequality above holds because for any $z,z'\in\mathcal{Z}^*$, we have $\frac{1}{n}\sum_{i=1}^n\mathbb{I}\{z(i)\ne z'(i)\}\le\frac{|S|}{n}=o\!\left(\min_a\frac{n_a(z^*)}{n}\right)$, so that the minimizing permutation in the definition of $\ell$ is the identity (and we may restrict attention to estimators taking values in $\mathcal{Z}^*$). Continuing from (13), we have

$$\inf_{\hat z}\sup_{z\in\mathcal{Z}^*}\mathbb{E}\,\frac{1}{n}\sum_{i\in S}\mathbb{I}\{\hat z(i)\ne z(i)\}\ \ge\ \frac{|S|}{n}\,\min_{i\in S}\,\inf_{\hat z(i)}\frac{1}{2}\Big[\mathbb{P}_{z(i)=u}\big(\hat z(i)=v\big)+\mathbb{P}_{z(i)=v}\big(\hat z(i)=u\big)\Big], \qquad (14)$$

where, for each $i\in S$, the probabilities are evaluated with the labels of all the other nodes held fixed. Note that for each $i\in S$,

$$\inf_{\hat z(i)}\frac{1}{2}\Big[\mathbb{P}_{z(i)=u}\big(\hat z(i)=v\big)+\mathbb{P}_{z(i)=v}\big(\hat z(i)=u\big)\Big] \qquad (15)$$

is the optimal average error of testing between the two hypotheses $z(i)=u$ and $z(i)=v$. Thus, it is sufficient to lower bound the testing error between each pair of hypotheses by the desired minimax rate in (12). Note that $\frac{|S|}{n}\asymp\frac{\delta_n}{\beta k}$ with a $\delta_n$ that satisfies $\log\frac{k}{\delta_n}=o\!\left(\frac{nI}{\beta k}\right)$. So the ratio $\frac{|S|}{n}$ in (14) can be absorbed into the $o(1)$ in the exponent of the minimax rate.

The above argument leading to (15) implies that we need to study the fundamental testing problem between the pair of hypotheses $z(i)=u$ and $z(i)=v$. That is, given the whole vector $z$ but its $i$th entry, we need to test whether $z(i)=u$ or $z(i)=v$. This simple vs simple testing problem can be equivalently written as

$$H_0:\ X_1,\ldots,X_m\overset{iid}{\sim}\mathrm{Bernoulli}(p),\ Y_1,\ldots,Y_m\overset{iid}{\sim}\mathrm{Bernoulli}(q)\qquad\text{vs}\qquad H_1:\ X_1,\ldots,X_m\overset{iid}{\sim}\mathrm{Bernoulli}(q),\ Y_1,\ldots,Y_m\overset{iid}{\sim}\mathrm{Bernoulli}(p), \qquad (16)$$

where $X_1,\ldots,X_m$ are the edges from node $i$ to the nodes in cluster $u$, $Y_1,\ldots,Y_m$ are the edges from node $i$ to the nodes in cluster $v$, and $m$ denotes the common size of the two clusters.
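A small Monte Carlo sketch of the testing problem (16) and its likelihood ratio test is given below; it is useful for checking the exponential decay of the testing error against $e^{-mI}$, which is the exponent appearing in the lemma that follows. The function names and parameter values are illustrative choices of ours.

```python
import numpy as np

def renyi_half(p, q):
    """I = -2 log( sqrt(pq) + sqrt((1-p)(1-q)) ), the Renyi divergence of order 1/2."""
    return -2.0 * np.log(np.sqrt(p * q) + np.sqrt((1 - p) * (1 - q)))

def lrt_average_error(m, p, q, reps=20000, rng=None):
    """Monte Carlo average error of the likelihood ratio test for (16):
    H0: X ~ Bern(p)^m, Y ~ Bern(q)^m   vs   H1: X ~ Bern(q)^m, Y ~ Bern(p)^m."""
    rng = np.random.default_rng(rng)
    # The log-likelihood ratio depends only on the two edge counts sum(X) and sum(Y).
    w = np.log(p / q) - np.log((1 - p) / (1 - q))   # weight per edge count difference
    errs = 0.0
    for hyp in (0, 1):
        pp, qq = (p, q) if hyp == 0 else (q, p)
        X = rng.binomial(m, pp, size=reps)           # sum of the X_j's
        Y = rng.binomial(m, qq, size=reps)           # sum of the Y_j's
        stat = w * (X - Y)                           # > 0 favors H0, < 0 favors H1
        errs += np.mean(stat < 0) if hyp == 0 else np.mean(stat > 0)
    return errs / 2.0                                # ties are ignored for simplicity

m, p, q = 200, 0.10, 0.05
print(lrt_average_error(m, p, q), np.exp(-m * renyi_half(p, q)))
```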

The optimal testing error of (16) is given by the following lemma.

Lemma 3.1 (Gao, Ma, Zhang and Zhou [50]).

Suppose that as $m\to\infty$, $p,q\to0$, $\frac{p}{q}=O(1)$, and $mI\to\infty$; then we have

$$\inf_{\phi}\left\{\mathbb{P}_{H_0}(\phi=1)+\mathbb{P}_{H_1}(\phi=0)\right\}=\exp\left(-(1+o(1))\,mI\right),$$

where the infimum is over all testing procedures $\phi\in\{0,1\}$.
Lemma 3.1 is an extension of the classical Chernoff–Stein theory of hypothesis testing for constant $p$ and $q$ (see Chapter 11 of [31]). The error exponent $mI$ is a consequence of calculating the Chernoff information between the two hypotheses in (16). In the setting of (15), we have $m=(1+o(1))\frac{n}{\beta k}$, which implies the desired minimax lower bound for $k\ge3$ in (12). For $k=2$, we can slightly modify the result of Lemma 3.1 with asymptotically different numbers of $X$'s and $Y$'s that are of the same order. In this case, one obtains $\exp\left(-(1+o(1))\frac{nI}{2}\right)$ as the optimal testing error, which explains why the minimax rate in (12) for $k=2$ does not depend on $\beta$.