# Strong consistency of Krichevsky-Trofimov estimator for the number of communities in the Stochastic Block Model

In this paper we introduce the Krichevsky-Trofimov estimator for the number of communities in the Stochastic Block Model (SBM) and prove its eventual almost sure convergence to the underlying number of communities, without assuming a known upper bound on that quantity. Our results apply to both the dense and the sparse regimes. To our knowledge this is the first strong consistency result for the estimation of the number of communities in the SBM, even in the bounded case.


## 1 Introduction

In this paper we address the model selection problem for the Stochastic Block Model (SBM); that is, the estimation of the number of communities given a sample of the adjacency matrix. The SBM was introduced by Holland et al.  (1983)

and has rapidly gained popularity in the literature as a model for random networks exhibiting blocks or communities among their nodes. In this model, each node in the network has an associated latent discrete random variable describing its community label, and given two nodes, the probability of a connection between them depends only on the values of the nodes' latent variables.

From a statistical point of view, several methods have been proposed to address the problem of parameter estimation or label recovery for the SBM. Some examples include maximum likelihood estimation (Bickel & Chen, 2009; Amini et al., 2013), variational methods (Daudin et al., 2008; Latouche et al., 2012), spectral clustering (Rohe et al., 2011) and Bayesian approaches (van der Pas et al., 2017). The asymptotic properties of these estimators have also been considered in subsequent works such as Bickel et al. (2013) or Su et al. (2017). All these approaches assume the number of communities is known a priori.

The model selection problem for the SBM, that is, the estimation of the number of communities, has also been addressed before; see for example the recent work of Le & Levina (2015) and references therein. But to our knowledge it was not until Wang et al. (2017) that a consistency result was obtained for a penalized estimator. In that work, the authors propose a penalized likelihood criterion and show its convergence in probability (weak consistency) to the true number of communities. Their proof only applies to the case where the number of candidate values for the estimator is finite (it is upper bounded by a known constant) and the network average degree grows at least as a polylog function of the number of nodes. From a practical point of view, the computation of the log-likelihood function and its supremum is not a simple task due to the hidden nature of the nodes' labels.

Wang et al. (2017) propose a variational method as described in Bickel et al. (2013) using the EM algorithm of Daudin et al. (2008), a profile maximum likelihood criterion as in Bickel & Chen (2009), or the pseudo-likelihood algorithm in Amini et al. (2013). The method introduced in Wang et al. (2017) has been subsequently studied in Hu et al. (2016), where the authors propose a modification of the penalty term. However, in practice, the computation of the suggested estimator remains a demanding task since it depends on the profile maximum likelihood function.

In this paper we take an information-theoretic perspective and introduce the Krichevsky-Trofimov (KT) estimator, see Krichevsky & Trofimov (1981), in order to determine the number of communities of a SBM based on a sample of the adjacency matrix of the network. We prove the strong consistency of this estimator, in the sense that the estimated value equals the correct number of communities in the model with probability one, as long as the number of nodes in the network is sufficiently large. The strong consistency is proved in the dense regime, where the probability of having an edge is considered constant, and in the sparse regime, where this probability goes to zero as the number of nodes grows. The study of the second regime is more interesting in the sense that it is necessary to control how much information is available (in terms of the number of edges in the network) to estimate the parameters of the model. We prove that consistency in the sparse case is guaranteed when the expected degree of a randomly selected node grows to infinity sufficiently fast, weakening the assumption in Wang et al. (2017), which proves consistency when this degree grows at least as a polylog function of the number of nodes. We also consider a smaller order penalty function and we do not assume a known upper bound on the true number of communities. To our knowledge this is the first strong consistency result for an estimator of the number of communities, even in the bounded case.

The paper is organized as follows. In Section 2 we define the model and the notation used in the paper; in Section 3 we introduce the KT estimator for the number of communities and state the main result. The proof of the consistency of the estimator is presented in Section 4.

## 2 The Stochastic Block Model

Consider a non-oriented random network with node set $\{1,\dots,n\}$, specified by its adjacency matrix $x^{n\times n}$, which is symmetric and has diagonal entries equal to zero. Each node $i$ has an associated latent (non-observed) variable $z_i$, the community label of node $i$.

The SBM with $k$ communities is a probability model for a random network as above, where the latent variables $z_1,\dots,z_n$ are independent and identically distributed random variables over $[k]=\{1,\dots,k\}$ and the law of the adjacency matrix $x^{n\times n}$, conditioned on the value of the latent variables $z^n=(z_1,\dots,z_n)$, is a product measure of Bernoulli random variables whose parameters depend only on the nodes' labels. More formally, there exists a probability distribution over $[k]$, denoted by $\pi$, and a symmetric probability matrix $P\in[0,1]^{k\times k}$ such that the distribution of the pair $(z^n,x^{n\times n})$ is given by

P_{\pi,P}(z^n,x^{n\times n}) \;=\; \prod_{a=1}^{k}\pi_a^{n_a}\,\prod_{a,b=1}^{k}P_{a,b}^{O_{a,b}/2}\,(1-P_{a,b})^{(n_{a,b}-O_{a,b})/2}, \qquad (2.1)

where the counters $n_a$, $n_{a,b}$ and $O_{a,b}$ are given by

n_a(z^n) = \sum_{i=1}^{n}\mathbf{1}\{z_i=a\}, \quad 1\le a\le k, \qquad n_{a,b}(z^n) = \begin{cases} n_a(z^n)\,n_b(z^n), & 1\le a,b\le k,\ a\neq b;\\ n_a(z^n)\big(n_a(z^n)-1\big), & 1\le a,b\le k,\ a=b, \end{cases}

and

O_{a,b}(z^n,x^{n\times n}) = \sum_{i,j=1}^{n}\mathbf{1}\{z_i=a,\,z_j=b\}\,x_{ij}, \quad 1\le a,b\le k.

As is usual in the definition of likelihood functions, by convention we set $0^0=1$ in (2.1) when some of the parameters are 0.
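The likelihood (2.1) and its counters can be sketched in a few lines of code. The following is a minimal illustration with hypothetical inputs; for simplicity it assumes all entries of $P$ lie strictly in $(0,1)$, so the boundary convention is not needed:

```python
import math

def sbm_log_likelihood(z, x, pi, P):
    """log P_{pi,P}(z, x) following (2.1); z lists labels in 0..k-1,
    x is a symmetric 0/1 matrix with zero diagonal, and 0 < P[a][b] < 1."""
    n, k = len(z), len(pi)
    na = [z.count(a) for a in range(k)]                            # n_a
    ll = sum(na[a] * math.log(pi[a]) for a in range(k))
    for a in range(k):
        for b in range(k):
            # n_{a,b}: ordered within-block or cross-block pair counts
            nab = na[a] * na[b] if a != b else na[a] * (na[a] - 1)
            # O_{a,b}: edge counts over ordered label pairs
            Oab = sum(x[i][j] for i in range(n) for j in range(n)
                      if z[i] == a and z[j] == b)
            ll += (Oab / 2) * math.log(P[a][b]) \
                  + ((nab - Oab) / 2) * math.log(1 - P[a][b])
    return ll
```

Note that the factors of $1/2$ in the exponents compensate for the double counting over ordered pairs, so summing the resulting probabilities over all graphs gives 1.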

We denote by $\Theta_k$ the parametric space for a model with $k$ communities, given by

\Theta_k = \Big\{(\pi,P)\colon \pi\in(0,1]^k,\ \sum_{a=1}^{k}\pi_a=1,\ P\in[0,1]^{k\times k},\ P \text{ symmetric}\Big\}.

The order of the SBM is defined as the smallest $k$ for which the equality (2.1) holds for a pair of parameters in $\Theta_k$, and will be denoted by $k_0$. If a SBM has order $k_0$ then it cannot be reduced to a model with fewer communities than $k_0$; this specifically means that $P$ does not have two identical columns.

When $P$ is fixed and does not depend on $n$, the mean degree of a given node grows linearly in $n$ and this regime produces highly connected (dense) graphs. For this reason in this paper we also consider the regime producing sparse graphs (with fewer edges); that is, we allow $P$ to decrease with $n$ to the zero matrix. In this case we write $P = \rho_n S$, where $S$ does not depend on $n$ and $\rho_n$ is a function decreasing to 0.

## 3 The KT order estimator

The Krichevsky-Trofimov order estimator in the context of a SBM is a regularized estimator based on a mixture distribution for the adjacency matrix $x^{n\times n}$. Given a sample from the distribution (2.1) with parameters $(\pi^0,P^0)$, where we assume we only observed the network $x^{n\times n}$ (and not the labels $z^n$), the estimator of the number of communities is defined by

\hat{k}_{\mathrm{KT}}(x^{n\times n}) = \operatorname*{arg\,max}_{k}\big\{\log \mathrm{KT}_k(x^{n\times n}) - \mathrm{pen}(k,n)\big\}, \qquad (3.1)

where $\mathrm{KT}_k$ is the mixture distribution for a SBM with $k$ communities and $\mathrm{pen}(k,n)$ is a penalizing function that will be specified later.

As is usual for KT distributions, we choose as “prior” for the pair $(\pi,P)$ a product measure obtained from a Dirichlet$(1/2,\dots,1/2)$ distribution (the prior distribution for $\pi$) and a product of Beta$(1/2,1/2)$ distributions (the prior for the entries of the symmetric matrix $P$). In other words, we define the distribution $\nu_k$ on $\Theta_k$ by

\nu_k(\pi,P) = \frac{\Gamma(k/2)}{\Gamma(1/2)^{k}}\prod_{a=1}^{k}\pi_a^{-1/2}\;\prod_{1\le a\le b\le k}\frac{1}{\mathrm{B}(1/2,1/2)}\,P_{a,b}^{-1/2}(1-P_{a,b})^{-1/2} \qquad (3.2)

and we construct the mixture distribution for $x^{n\times n}$, based on $\nu_k$, given by

\mathrm{KT}_k(x^{n\times n}) = \mathbb{E}_{\nu_k}\big[P_{\pi,P}(x^{n\times n})\big] = \int_{\Theta_k} P_{\pi,P}(x^{n\times n})\,\nu_k(\pi,P)\,d\pi\,dP, \qquad (3.3)

where $P_{\pi,P}(x^{n\times n})$ stands for the marginal distribution obtained from (2.1), given by

P_{\pi,P}(x^{n\times n}) = \sum_{z^n\in[k]^n} P_{\pi,P}(z^n,x^{n\times n}). \qquad (3.4)
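Although the marginal (3.4) sums over $k^n$ labelings and is intractable in general, for tiny networks $\mathrm{KT}_k$ can be evaluated exactly, since for each labeling the Dirichlet and Beta integrals have closed forms. A minimal sketch, assuming Dirichlet$(1/2,\dots,1/2)$ and Beta$(1/2,1/2)$ priors (the usual KT choice; the exact prior parameters are an assumption here):

```python
import itertools, math
from math import lgamma

def log_kt_mixture(x, k):
    """Exact log KT_k(x) for a tiny symmetric 0/1 adjacency matrix x,
    enumerating all k^n label assignments (feasible only for small n)."""
    n = len(x)
    log_terms = []
    for z in itertools.product(range(k), repeat=n):
        na = [sum(1 for zi in z if zi == a) for a in range(k)]
        # Dirichlet(1/2,...,1/2) integral of prod_a pi_a^{n_a}
        lp = (lgamma(k / 2) - k * lgamma(0.5)
              + sum(lgamma(c + 0.5) for c in na) - lgamma(n + k / 2))
        # Beta(1/2,1/2) integral of P^E (1-P)^{M-E}, one factor per pair a <= b
        for a in range(k):
            for b in range(a, k):
                if a == b:
                    M = na[a] * (na[a] - 1) // 2   # unordered pairs inside block a
                    E = sum(x[i][j] for i in range(n) for j in range(i + 1, n)
                            if z[i] == a and z[j] == a)
                else:
                    M = na[a] * na[b]              # cross pairs between blocks a, b
                    E = sum(x[i][j] for i in range(n) for j in range(n)
                            if z[i] == a and z[j] == b)
                lp += (lgamma(E + 0.5) + lgamma(M - E + 0.5) - lgamma(M + 1)
                       - math.log(math.pi))        # B(1/2,1/2) = pi
        log_terms.append(lp)
    m = max(log_terms)   # log-sum-exp over labelings
    return m + math.log(sum(math.exp(t - m) for t in log_terms))
```

Comparing `log_kt_mixture(x, k) - pen(k, n)` across candidate values of $k$ sketches the criterion (3.1) on toy data.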

As in other model selection problems where the KT approach has proved very useful, for example in the case of Context Tree Models (Csiszar & Talata, 2006) and Hidden Markov Models (Gassiat & Boucheron, 2003), in the case of the SBM there is a close relationship between the KT mixture distribution and the maximum likelihood function. The following proposition gives a non-asymptotic uniform upper bound for the log-ratio between these two functions. Its proof is postponed to the Appendix.

###### Proposition 3.1.

For all $k\ge 1$ and all $x^{n\times n}$ we have

\log\frac{\sup_{(\pi,P)\in\Theta_k} P_{\pi,P}(x^{n\times n})}{\mathrm{KT}_k(x^{n\times n})} \;\le\; \frac{k(k+2)-1}{2}\,\log n + c_{k,n},

where

c_{k,n} = \frac{k(k+1)}{2}\log\Gamma(\tfrac12) + \frac{k(k-1)}{4n} + \frac{1}{12n} + \log\frac{\Gamma(\tfrac12)}{\Gamma(k/2)} + \frac{7k(k+1)}{12}.

Proposition 3.1 is at the core of the proof of the strong consistency of $\hat{k}_{\mathrm{KT}}$ defined by (3.1). By strong consistency we mean that the estimator equals the order of the SBM with probability one, for all sufficiently large $n$ (that may depend on the sample $x^{n\times n}$). In order to derive the strong consistency result for the KT order estimator, we need a penalty function in (3.1) with a suitable rate of growth as a function of $k$ and $n$. Although there is a range of possibilities for this penalty function, the specific form we use in this paper is

\mathrm{pen}(k,n) = \sum_{i=1}^{k-1}\frac{i(i+2)+3+\epsilon}{2}\,\log n = \Big[\frac{k(k-1)(2k-1)}{12} + \frac{k(k-1)}{2} + \frac{(3+\epsilon)(k-1)}{2}\Big]\log n \qquad (3.5)

for some $\epsilon>0$. The convenience of the expression above will be made clear in the proof of the consistency result. Observe that the penalty function defined by (3.5) is dominated by a term of order $k^3\log n$ and is therefore of smaller order than the penalty function used in Wang et al. (2017), so our results also apply in that case. It remains an open question what the smallest penalty function yielding a strongly consistent estimator is.
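The identity in (3.5) between the sum over $i$ and its closed form is easy to verify numerically; a small sketch (the value $\epsilon = 0.1$ is an arbitrary illustration):

```python
import math

def pen_sum(k, n, eps=0.1):
    # penalty as the sum over i = 1, ..., k-1 in (3.5)
    return sum((i * (i + 2) + 3 + eps) / 2 for i in range(1, k)) * math.log(n)

def pen_closed(k, n, eps=0.1):
    # closed form on the right-hand side of (3.5)
    return (k * (k - 1) * (2 * k - 1) / 12
            + k * (k - 1) / 2
            + (3 + eps) * (k - 1) / 2) * math.log(n)
```

Both forms grow like $k^3\log n$ in $k$, which is the growth used to rule out large overestimates in Lemma 4.3.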

We finish this section by stating the main theoretical result in this paper.

###### Theorem 3.2 (Consistency Theorem).

Suppose the SBM has order $k_0$ with parameters $(\pi^0,P^0)$. Then, for a penalty function of the form (3.5) we have that

\hat{k}_{\mathrm{KT}}(x^{n\times n}) = k_0

eventually almost surely as $n\to\infty$.

The proof of this and other auxiliary results are given in the next section and in the Appendix.

## 4 Proof of the Consistency Theorem

The proof of Theorem 3.2 is divided into two main parts. The first one, presented in Subsection 4.1, proves that $\hat{k}_{\mathrm{KT}}$ does not overestimate the true order $k_0$, eventually almost surely when $n\to\infty$, even without assuming a known upper bound on $k_0$. The second part of the proof, presented in Subsection 4.2, shows that $\hat{k}_{\mathrm{KT}}$ does not underestimate $k_0$, eventually almost surely when $n\to\infty$. By combining these two results we prove that $\hat{k}_{\mathrm{KT}}(x^{n\times n}) = k_0$ eventually almost surely as $n\to\infty$.

### 4.1 Non-overestimation

The main result in this subsection is given by the following proposition.

###### Proposition 4.1.

Let $x^{n\times n}$ be a sample of size $n$ from a SBM of order $k_0$, with parameters $\pi^0$ and $P^0$. Then, the order estimator $\hat{k}_{\mathrm{KT}}$ defined in (3.1) with penalty function given by (3.5) does not overestimate $k_0$, eventually almost surely when $n\to\infty$.

The proof of Proposition 4.1 follows directly from Lemmas 4.2, 4.3 and 4.4 presented below. These lemmas are inspired by the work of Gassiat & Boucheron (2003), which proves consistency for an order estimator of a Hidden Markov Model.

###### Lemma 4.2.

Under the hypotheses of Proposition 4.1 we have that

\hat{k}_{\mathrm{KT}}(x^{n\times n}) \notin (k_0,\log n]

eventually almost surely when $n\to\infty$.

###### Proof.

First observe that

P_{\pi^0,P^0}\big(\hat{k}_{\mathrm{KT}}(x^{n\times n})\in(k_0,\log n]\big) = \sum_{k=k_0+1}^{\log n} P_{\pi^0,P^0}\big(\hat{k}_{\mathrm{KT}}(x^{n\times n})=k\big). \qquad (4.1)

Using Lemma A.2 we can bound the sum on the right-hand side by

\sum_{k=k_0+1}^{\log n}\exp\Big\{\frac{k_0(k_0+2)-1}{2}\log n + c_{k_0,n} + \mathrm{pen}(k_0,n) - \mathrm{pen}(k,n)\Big\} \;\le\; e^{c_{k_0,n}}\,\log n\,\exp\Big\{\frac{k_0(k_0+2)-1}{2}\log n + \mathrm{pen}(k_0,n) - \mathrm{pen}(k_0+1,n)\Big\},

where the last inequality follows from the fact that $\mathrm{pen}(k,n)$ is increasing in $k$. Moreover, a simple calculation using the specific form in (3.5) gives

\frac{k_0(k_0+2)-1}{2}\log n + \mathrm{pen}(k_0,n) - \mathrm{pen}(k_0+1,n) = \Big(\frac{k_0(k_0+2)-1}{2} - \frac{k_0(k_0+2)+3+\epsilon}{2}\Big)\log n = -(2+\epsilon/2)\log n.

By using this expression in the right-hand side of the last inequality to bound (4.1) we obtain that

\sum_{n=1}^{\infty} P_{\pi^0,P^0}\big(\hat{k}_{\mathrm{KT}}(x^{n\times n})\in(k_0,\log n]\big) \;\le\; C_{k_0}\sum_{n=1}^{\infty}\frac{\log n}{n^{2+\epsilon/2}} \;<\; \infty,

where $C_{k_0}$ denotes an upper bound on $e^{c_{k_0,n}}$. Now the result follows by the first Borel-Cantelli lemma. ∎

###### Lemma 4.3.

Under the hypotheses of Proposition 4.1 we have that

\hat{k}_{\mathrm{KT}}(x^{n\times n}) \notin (\log n, n]

eventually almost surely when $n\to\infty$.

###### Proof.

As in the proof of Lemma 4.2 we write

P_{\pi^0,P^0}\big(\hat{k}_{\mathrm{KT}}(x^{n\times n})\in(\log n, n]\big) = \sum_{k=\log n}^{n} P_{\pi^0,P^0}\big(\hat{k}_{\mathrm{KT}}(x^{n\times n})=k\big)

and we use again Lemma A.2 to bound the sum on the right-hand side by

\sum_{k=\log n}^{n}\exp\Big\{\frac{k_0(k_0+2)-1}{2}\log n + c_{k_0,n} + \mathrm{pen}(k_0,n) - \mathrm{pen}(k,n)\Big\} \;\le\; e^{c_{k_0,n}}\,n\,\exp\Big\{-\log n\Big[-\frac{k_0(k_0+2)-1}{2} - \frac{\mathrm{pen}(k_0,n)}{\log n} + \frac{\mathrm{pen}(\log n,n)}{\log n}\Big]\Big\}.

Since $\mathrm{pen}(k_0,n)/\log n$ does not depend on $n$ and $\mathrm{pen}(k,n)/\log n$ increases cubically in $k$, we have that

\liminf_{n\to\infty}\; \frac{\mathrm{pen}(\log n,n)}{\log n} - \frac{k_0(k_0+2)-1}{2} - \frac{\mathrm{pen}(k_0,n)}{\log n} \;>\; 3

and thus

\sum_{n=1}^{\infty} n\exp\Big\{-\log n\Big[-\frac{k_0(k_0+2)-1}{2} - \frac{\mathrm{pen}(k_0,n)}{\log n} + \frac{\mathrm{pen}(\log n,n)}{\log n}\Big]\Big\} \;<\; \infty.

Using the fact that $c_{k_0,n}$ is decreasing in $n$, the result follows from the first Borel-Cantelli lemma. ∎

###### Lemma 4.4.

Under the hypotheses of Proposition 4.1 we have that

\hat{k}_{\mathrm{KT}}(x^{n\times n}) \notin (n,\infty)

eventually almost surely when $n\to\infty$.

###### Proof.

Observe that it is enough to prove that

\log \mathrm{KT}_{n+m}(x^{n\times n}) - \mathrm{pen}(n+m,n) \;\le\; \log \mathrm{KT}_{n}(x^{n\times n}) - \mathrm{pen}(n,n)

for all $m\ge 1$. By using Proposition 3.1 we have that

-\log \mathrm{KT}_n(x^{n\times n}) \;\le\; -\log\sup_{(\pi,P)\in\Theta_n} P_{\pi,P}(x^{n\times n}) + \Big(\frac{n(n+2)}{2}-\frac12\Big)\log n + c_{n,n}

and by (3.3) we obtain

\mathrm{KT}_{n+m}(x^{n\times n}) \;\le\; \sup_{(\pi,P)\in\Theta_{n+m}} P_{\pi,P}(x^{n\times n}).

Thus, as a sample with $n$ nodes can occupy at most $n$ communities,

\sup_{(\pi,P)\in\Theta_{n+m}} P_{\pi,P}(x^{n\times n}) = \sup_{(\pi,P)\in\Theta_{n}} P_{\pi,P}(x^{n\times n})

we obtain

\log \mathrm{KT}_{n+m}(x^{n\times n}) - \log \mathrm{KT}_{n}(x^{n\times n}) \;\le\; \Big(\frac{n(n+2)}{2}-\frac12\Big)\log n + c_{n,n}
\;\le\; \frac{n(n+2)-1}{2}\log n + n(n+1)\Big(\frac{\log\Gamma(\tfrac12)}{2} + \frac{7}{12}\Big) + \frac{n(n-1)}{4n} + \frac{1}{12n} - \log\frac{\Gamma(n/2)}{\Gamma(1/2)}
\;\le\; \mathrm{pen}(n+m,n) - \mathrm{pen}(n,n),

where the last inequality holds for $n$ big enough. ∎

### 4.2 Non-underestimation

In this subsection we deal with the proof of the non-underestimation of $\hat{k}_{\mathrm{KT}}$. The main result of this section is the following proposition.

###### Proposition 4.5.

Let $x^{n\times n}$ be a sample of size $n$ from a SBM of order $k_0$ with parameters $(\pi^0,P^0)$. Then, the order estimator $\hat{k}_{\mathrm{KT}}$ defined in (3.1) with penalty function given by (3.5) does not underestimate $k_0$, eventually almost surely when $n\to\infty$.

In order to prove this result we need Lemmas 4.6 and 4.7 below, which explore limiting properties of the under-fitted model. That is, we handle the problem of fitting a SBM of order $k_0$ in the parameter space $\Theta_{k_0-1}$.

An intuitive construction of a $(k-1)$-block model from a $k$-block model is obtained by merging two given blocks. This merging can be implemented in several ways, but here we consider the construction given in Wang et al. (2017), with the difference that instead of using the sample block proportions we use the limiting distribution of the original $k$-block model.

Given $a<b$, we define the merging operation that combines the blocks with labels $a$ and $b$. For ease of exposition we only show the explicit definition for the case $a=k-1$ and $b=k$. In this case, the merged distribution $\pi^*$ is given by

\pi^*_i = \pi_i \quad\text{for } 1\le i\le k-2, \qquad \pi^*_{k-1} = \pi_{k-1} + \pi_k. \qquad (4.2)

On the other hand, the merged matrix $P^*$ is obtained as

P^*_{l,r} = P_{l,r} \quad\text{for } 1\le l,r\le k-2,
\qquad P^*_{l,k-1} = \frac{\pi_l\pi_{k-1}P_{l,k-1} + \pi_l\pi_k P_{l,k}}{\pi_l\pi_{k-1} + \pi_l\pi_k} \quad\text{for } 1\le l\le k-2, \qquad (4.3)
\qquad P^*_{k-1,k-1} = \frac{\pi_{k-1}^2 P_{k-1,k-1} + 2\pi_{k-1}\pi_k P_{k-1,k} + \pi_k^2 P_{k,k}}{\pi_{k-1}^2 + 2\pi_{k-1}\pi_k + \pi_k^2}.

For arbitrary $a$ and $b$ the definition is obtained by permuting the labels.
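The merging (4.2)-(4.3) for the case $a=k-1$, $b=k$ can be sketched directly (0-indexed lists; the numeric inputs below are hypothetical):

```python
def merge_last_two(pi, P):
    """Merge the last two blocks of a k-block SBM into one,
    following (4.2)-(4.3); pi is a length-k list, P a symmetric k x k list."""
    k = len(pi)
    # (4.2): keep the first k-2 proportions, sum the last two
    pi_star = pi[:k - 2] + [pi[k - 2] + pi[k - 1]]
    P_star = [[0.0] * (k - 1) for _ in range(k - 1)]
    for l in range(k - 2):
        for r in range(k - 2):
            P_star[l][r] = P[l][r]
        # (4.3): pi-weighted average of the two merged columns
        num = pi[l] * pi[k - 2] * P[l][k - 2] + pi[l] * pi[k - 1] * P[l][k - 1]
        den = pi[l] * pi[k - 2] + pi[l] * pi[k - 1]
        P_star[l][k - 2] = P_star[k - 2][l] = num / den
    # (4.3): within-merged-block probability
    num = (pi[k - 2] ** 2 * P[k - 2][k - 2]
           + 2 * pi[k - 2] * pi[k - 1] * P[k - 2][k - 1]
           + pi[k - 1] ** 2 * P[k - 1][k - 1])
    den = pi[k - 2] ** 2 + 2 * pi[k - 2] * pi[k - 1] + pi[k - 1] ** 2
    P_star[k - 2][k - 2] = num / den
    return pi_star, P_star
```

In particular, if the last two columns of $P$ are identical, merging recovers exactly the common values, which mirrors the identifiability argument at the end of the proof of Lemma 4.6.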

Given $x^{n\times n}$ originated from the SBM of order $k_0$ and parameters $(\pi^0,P^0)$, we define the profile likelihood estimator of the label assignment under the $(k_0-1)$-block model as

z^{\star n} = \operatorname*{arg\,max}_{z^n\in[k_0-1]^n}\;\sup_{(\pi,P)\in\Theta_{k_0-1}} P_{\pi,P}(z^n,x^{n\times n}). \qquad (4.4)

The next lemmas show that the logarithm of the ratio between the maximum likelihood under the true order and the maximum profile likelihood under the under-fitted order model is bounded from below by a function growing faster than $\log n$, eventually almost surely when $n\to\infty$. Each lemma considers one of the two possible regimes: $P^0$ fixed (dense regime) or $P^0=\rho_n S^0$ with $\rho_n\to 0$ (sparse regime).

###### Lemma 4.6 (dense regime).

Let $x^{n\times n}$ be a sample of size $n$ from a SBM of order $k_0$ with parameters $(\pi^0,P^0)$, with $P^0$ not depending on $n$. Then there exist merged parameters $(\pi^*,P^*)\in\Theta_{k_0-1}$, obtained via (4.2) and (4.3), such that we have almost surely

\liminf_{n\to\infty}\ \frac{1}{n^2}\log\frac{\sup_{(\pi,P)\in\Theta_{k_0}} P_{\pi,P}(z^n,x^{n\times n})}{\sup_{(\pi,P)\in\Theta_{k_0-1}} P_{\pi,P}(z^{\star n},x^{n\times n})} \;\ge\; \frac12\Big[\sum_{a,b=1}^{k_0}\pi^0_a\pi^0_b\,\gamma(P^0_{a,b}) - \sum_{a,b=1}^{k_0-1}\pi^*_a\pi^*_b\,\gamma(P^*_{a,b})\Big] \;>\; 0, \qquad (4.5)

where $\gamma(x) = x\log x + (1-x)\log(1-x)$.

###### Proof.

Given $\bar z^n\in[k]^n$ and $x^{n\times n}$, define the empirical probabilities

\hat\pi_a(\bar z^n) = \frac{n_a(\bar z^n)}{n}, \quad 1\le a\le k, \qquad \hat P_{a,b}(\bar z^n,x^{n\times n}) = \frac{O_{a,b}(\bar z^n,x^{n\times n})}{n_{a,b}(\bar z^n)}, \quad 1\le a,b\le k. \qquad (4.6)

Then the maximum likelihood function is given by

\log\sup_{(\pi,P)\in\Theta_{k_0}} P_{\pi,P}(z^n,x^{n\times n}) = n\sum_{a=1}^{k_0}\hat\pi_a(z^n)\log\hat\pi_a(z^n) + \frac12\sum_{a,b=1}^{k_0} n_{a,b}(z^n)\,\gamma\big(\hat P_{a,b}(z^n,x^{n\times n})\big).

Using that $n_{a,b}(z^n) = n^2\hat\pi_a(z^n)\hat\pi_b(z^n)$ for $a\neq b$ and $n_{a,a}(z^n) = n^2\hat\pi_a(z^n)^2 - n\hat\pi_a(z^n)$, the last expression is equal to

n\sum_{a=1}^{k_0}\hat\pi_a(z^n)\log\hat\pi_a(z^n) - \frac{n}{2}\sum_{a=1}^{k_0}\hat\pi_a(z^n)\,\gamma\big(\hat P_{a,a}(z^n,x^{n\times n})\big) + \frac{n^2}{2}\sum_{a,b=1}^{k_0}\hat\pi_a(z^n)\hat\pi_b(z^n)\,\gamma\big(\hat P_{a,b}(z^n,x^{n\times n})\big). \qquad (4.7)
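The rewriting leading to (4.7) is a purely algebraic use of the counter identities, and can be checked numerically. A sketch with hypothetical labels and a random adjacency matrix, assuming $\gamma(x)=x\log x+(1-x)\log(1-x)$:

```python
import math, random

def gamma_fn(p):
    # gamma(x) = x log x + (1-x) log(1-x), with 0 log 0 = 0
    xlogx = lambda t: 0.0 if t <= 0 else t * math.log(t)
    return xlogx(p) + xlogx(1 - p)

# hypothetical data: n = 8 nodes, k = 2 blocks, random symmetric 0/1 matrix
random.seed(0)
n, k = 8, 2
z = [0, 0, 0, 0, 1, 1, 1, 1]
x = [[0] * n for _ in range(n)]
for i in range(n):
    for j in range(i + 1, n):
        x[i][j] = x[j][i] = random.randint(0, 1)

na = [z.count(a) for a in range(k)]
pihat = [c / n for c in na]
nab = [[na[a] * na[b] if a != b else na[a] * (na[a] - 1) for b in range(k)]
       for a in range(k)]
O = [[sum(x[i][j] for i in range(n) for j in range(n) if z[i] == a and z[j] == b)
      for b in range(k)] for a in range(k)]
Phat = [[O[a][b] / nab[a][b] for b in range(k)] for a in range(k)]

# left: n sum pihat log pihat + (1/2) sum n_ab gamma(Phat_ab)
lhs = (n * sum(p * math.log(p) for p in pihat)
       + 0.5 * sum(nab[a][b] * gamma_fn(Phat[a][b])
                   for a in range(k) for b in range(k)))
# right: the three-term decomposition in (4.7)
rhs = (n * sum(p * math.log(p) for p in pihat)
       - (n / 2) * sum(pihat[a] * gamma_fn(Phat[a][a]) for a in range(k))
       + (n * n / 2) * sum(pihat[a] * pihat[b] * gamma_fn(Phat[a][b])
                           for a in range(k) for b in range(k)))
```

Both expressions agree exactly, for any labeling and adjacency matrix with non-empty blocks.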

The first two terms in (4.7) are of smaller order compared to $n^2$, so by the Strong Law of Large Numbers we have that almost surely

\lim_{n\to\infty}\frac{1}{n^2}\log\sup_{(\pi,P)\in\Theta_{k_0}} P_{\pi,P}(z^n,x^{n\times n}) = \frac12\sum_{a,b=1}^{k_0}\pi^0_a\pi^0_b\,\gamma(P^0_{a,b}). \qquad (4.8)

Similarly, for $z^{\star n}$ and $\Theta_{k_0-1}$ we have that almost surely

\limsup_{n\to\infty}\frac{1}{n^2}\log\sup_{(\pi,P)\in\Theta_{k_0-1}} P_{\pi,P}(z^{\star n},x^{n\times n}) = \frac12\sum_{a,b=1}^{k_0-1}\tilde\pi_a\tilde\pi_b\,\gamma(\tilde P_{a,b}), \qquad (4.9)

for some $(\tilde\pi,\tilde P)\in\Theta_{k_0-1}$. Combining (4.8) and (4.9) we have that almost surely

\liminf_{n\to\infty}\frac{1}{n^2}\log\frac{\sup_{(\pi,P)\in\Theta_{k_0}} P_{\pi,P}(z^n,x^{n\times n})}{\sup_{(\pi,P)\in\Theta_{k_0-1}} P_{\pi,P}(z^{\star n},x^{n\times n})} = \frac12\sum_{a,b=1}^{k_0}\pi^0_a\pi^0_b\,\gamma(P^0_{a,b}) - \frac12\sum_{a,b=1}^{k_0-1}\tilde\pi_a\tilde\pi_b\,\gamma(\tilde P_{a,b}). \qquad (4.10)

To obtain a lower bound for (4.10) we need to compute the pair $(\tilde\pi,\tilde P)$ that minimizes the right-hand side. This is equivalent to obtaining the pair that maximizes the second term

\sum_{a,b=1}^{k_0-1}\tilde\pi_a\tilde\pi_b\,\gamma(\tilde P_{a,b}). \qquad (4.11)

Denote by $(\tilde Z^n,\tilde X^{n\times n})$ a $(k_0-1)$-order SBM with distribution $P_{\tilde\pi,\tilde P}$. By definition

\tilde P_{\tilde a,\tilde b} = \frac{P(\tilde X_{i,j}=1,\ \tilde Z_i=\tilde a,\ \tilde Z_j=\tilde b)}{P(\tilde Z_i=\tilde a,\ \tilde Z_j=\tilde b)}.

Observe that when $(\tilde Z^n,\tilde X^{n\times n})$ is coupled with the original model $(Z^n,X^{n\times n})$, the numerator equals

\sum_{a,b=1}^{k_0} P(X_{i,j}=1\mid Z_i=a, Z_j=b)\,P(Z_i=a, Z_j=b, \tilde Z_i=\tilde a, \tilde Z_j=\tilde b) = \sum_{a,b=1}^{k_0} P(Z_i=a,\tilde Z_i=\tilde a)\,P^0_{a,b}\,P(Z_j=b,\tilde Z_j=\tilde b) = (QP^0Q^T)_{\tilde a,\tilde b},

where $Q_{a,\tilde a} = P(Z_i=a,\tilde Z_i=\tilde a)$ denotes a joint distribution on $[k_0]\times[k_0-1]$ (a coupling) with marginals $\pi^0$ and $\tilde\pi$, respectively. Similarly, the denominator can be written as

\sum_{a,b=1}^{k_0} P(Z_i=a,\tilde Z_i=\tilde a)\,P(Z_j=b,\tilde Z_j=\tilde b) = \big(Q(\mathbf{1}\mathbf{1}^T)Q^T\big)_{\tilde a,\tilde b},

where $\mathbf{1}$ denotes the column vector of dimension $k_0$ with all entries equal to 1. Then we can rewrite (4.11) as

\sum_{a,b=1}^{k_0-1}\big(Q(\mathbf{1}\mathbf{1}^T)Q^T\big)_{a,b}\;\gamma\!\left[\frac{(QP^0Q^T)_{a,b}}{\big(Q(\mathbf{1}\mathbf{1}^T)Q^T\big)_{a,b}}\right]. \qquad (4.12)

Therefore, finding a pair $(\tilde\pi,\tilde P)$ maximizing (4.11) is equivalent to finding an optimal coupling $Q$ maximizing (4.12). Wang et al. (2017) proved that there exist $a<b$ such that (4.12) achieves its maximum at the merged parameters $(\pi^*,P^*)$; see Lemma A.2 there. This concludes the proof of the first inequality in (4.5). In order to prove the second, strict inequality in (4.5), we consider for convenience and without loss of generality $a=k_0-1$ and $b=k_0$ (the other cases can be handled by a permutation of the labels). Notice that in the right-hand side of (4.10), with $(\tilde\pi,\tilde P)$ substituted by the optimal value $(\pi^*,P^*)$ defined by (4.2) and (4.3), all the terms with $1\le a,b\le k_0-2$ cancel. Moreover, as $\gamma$ is a convex function, Jensen's inequality implies that

\pi^*_a\pi^*_{k_0-1}\,\gamma(P^*_{a,k_0-1}) \;\le\; \pi^0_a\pi^0_{k_0-1}\,\gamma(P^0_{a,k_0-1}) + \pi^0_a\pi^0_{k_0}\,\gamma(P^0_{a,k_0}) \qquad (4.13)

for all $1\le a\le k_0-2$, and similarly

(\pi^*_{k_0-1})^2\,\gamma(P^*_{k_0-1,k_0-1}) \;\le\; \sum_{a,b=k_0-1}^{k_0}\pi^0_a\pi^0_b\,\gamma(P^0_{a,b}). \qquad (4.14)

The equality holds for all $a$ in (4.13) and in (4.14) simultaneously if and only if

P^0_{a,k_0} = P^0_{a,k_0-1} \quad\text{for all } a=1,\dots,k_0,

in which case the matrix $P^0$ would have two identical columns, contradicting the fact that the sample originated from a SBM of order $k_0$. Therefore the strict inequality must hold in (4.13) for at least one $a$, or in (4.14), showing that the second inequality in (4.5) holds. ∎

###### Lemma 4.7 (sparse regime).

Let $x^{n\times n}$ be a sample of size $n$ from a SBM of order $k_0$ with parameters $(\pi^0,\rho_n S^0)$, where $\rho_n\to 0$. Then there exist merged parameters $(\pi^*,P^*)$, obtained via (4.2) and (4.3), such that we have almost surely

\liminf_{n\to\infty}\ \frac{1}{\rho_n n^2}\log\frac{\sup_{(\pi,P)\in\Theta_{k_0}} P_{\pi,P}(z^n,x^{n\times n})}{\sup_{(\pi,P)\in\Theta_{k_0-1}} P_{\pi,P}(z^{\star n},x^{n\times n})} \;\ge\; \frac12\Big[\sum_{a,b=1}^{k_0}\pi^0_a\pi^0_b\,\tau(S^0_{a,b}) - \sum_{a,b=1}^{k_0-1}\pi^*_a\pi^*_b\,\tau(P^*_{a,b})\Big] \;>\; 0, \qquad (4.15)

where $\tau(x) = x\log x - x$.
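The appearance of $\tau$ in place of $\gamma$ in the sparse regime can be motivated numerically. Assuming $\gamma(x)=x\log x+(1-x)\log(1-x)$ and $\tau(x)=x\log x-x$ (our reading of the elided definition), the rescaled quantity $(\gamma(\rho x)-\rho x\log\rho)/\rho$ converges to $\tau(x)$ as $\rho\to 0$; the $\rho x\log\rho$ terms are what cancel between the two sums in (4.15). A quick check (sketch, not a proof):

```python
import math

def gamma_fn(p):
    # gamma(x) = x log x + (1-x) log(1-x)
    return p * math.log(p) + (1 - p) * math.log(1 - p)

def tau(x):
    # candidate sparse-regime function tau(x) = x log x - x
    return x * math.log(x) - x

# as rho -> 0, (gamma(rho*x) - rho*x*log(rho)) / rho approaches tau(x)
x = 0.7
approximations = [(gamma_fn(rho * x) - rho * x * math.log(rho)) / rho
                  for rho in (1e-2, 1e-4, 1e-6)]
```

The successive values approach $\tau(0.7)\approx -0.9497$ as $\rho$ shrinks.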

###### Proof.

This proof follows the same arguments used in the proof of Lemma 4.6, but as in this case $\rho_n$ decreases to 0, some limits must be handled differently. As shown in (4.7) we have that

\log\sup_{(\pi,P)\in\Theta_{k_0}} P_{\pi,P}(z^n,x^{n\times n}) = n\sum_{a=1}^{k_0}\hat\pi_a(z^n)\log\hat\pi_a(z^n) - \frac{n}{2}\sum_{a=1}^{k_0}\hat\pi_a(z^n)\,\gamma\big(\hat P_{a,a}(z^n,x^{n\times n})\big) \qquad (4.16)
+ \frac{n^2}{2}