## 1 Introduction

In this paper we address the model selection problem for the Stochastic Block Model (SBM); that is, the estimation of the number of communities given a sample of the adjacency matrix. The SBM was introduced by Holland et al. (1983) and has rapidly become popular in the literature as a model for random networks exhibiting blocks or communities among their nodes. In this model, each node in the network is associated with a latent discrete random variable describing its community label, and given two nodes, the probability of a connection between them depends only on the values of the nodes’ latent variables.

From a statistical point of view, several methods have been proposed to address the problem of parameter estimation or label recovery for the SBM. Some examples include maximum likelihood estimation (Bickel & Chen, 2009; Amini et al., 2013), variational methods (Daudin et al., 2008; Latouche et al., 2012), spectral clustering (Rohe et al., 2011) and Bayesian inference (van der Pas et al., 2017). The asymptotic properties of these estimators have also been considered in subsequent works such as Bickel et al. (2013) or Su et al. (2017). All these approaches assume the number of communities is known *a priori*.

The model selection problem for the SBM, that is, the estimation of the number of communities, has also been addressed before; see for example the recent work of Le & Levina (2015) and references therein. But to our knowledge it was not until Wang et al. (2017) that a consistency result was obtained for such a penalized estimator. In the latter, the authors propose a penalized likelihood criterion and show its convergence in probability (weak consistency) to the true number of communities. Their proof only applies to the case where the number of candidate values for the estimator is finite (it is upper bounded by a known constant) and the network average degree grows at least as a polylog function of the number of nodes. From a practical point of view, the computation of the log-likelihood function and its supremum is not a simple task due to the hidden nature of the nodes’ labels.

Wang et al. (2017) propose a variational method as described in Bickel et al. (2013) using the EM algorithm of Daudin et al. (2008), a profile maximum likelihood criterion as in Bickel & Chen (2009) or the pseudo-likelihood algorithm in Amini et al. (2013). The method introduced in Wang et al. (2017) has been subsequently studied in Hu et al. (2016), where the authors propose a modification of the penalty term. However, in practice, the computation of the suggested estimator still remains a demanding task since it depends on the profile maximum likelihood function.

In this paper we take an information-theoretic perspective and introduce the Krichevsky-Trofimov (KT) estimator, see Krichevsky & Trofimov (1981), in order to determine the number of communities of an SBM based on a sample of the adjacency matrix of the network. We prove the strong consistency of this estimator, in the sense that the empirical value is equal to the correct number of communities in the model with probability one, as long as the number of nodes in the network is sufficiently large. The strong consistency is proved in the dense regime, where the probability of having an edge is considered to be constant, and in the sparse regime where this probability goes to zero with having order . The study of the second regime is more interesting in the sense that it is necessary to control how much information is required (in the sense of the number of edges in the network) to estimate the parameters of the model. We prove that the consistency in the sparse case is guaranteed when the expected degree of a randomly selected node grows to infinity as a function of order , weakening the assumption in Wang et al. (2017) that proves consistency in the regime . We also consider a smaller order penalty function and we do not assume a known upper bound on the true number of communities. To our knowledge this is the first strong consistency result for an estimator of the number of communities, even in the bounded case.

## 2 The Stochastic Block Model

Consider an undirected random network with nodes , specified by its adjacency matrix that is symmetric and has diagonal entries equal to zero. Each node is associated with a latent (unobserved) variable on , the *community* label of node .

The SBM with communities is a probability model for a random network as above, where the latent variables are independent and identically distributed random variables over and the law of the adjacency matrix , conditioned on the value of the latent variables , is a product measure of Bernoulli random variables whose parameters depend only on the nodes’ labels. More formally, there exists a probability distribution over , denoted by , and a symmetric probability matrix such that the distribution of the pair is given by

(2.1) |

where the counters , and are given by

and

As is usual in the definition of likelihood functions, by convention we define in (2.1) when some of the parameters are 0.
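The model just described can be illustrated with a short Python sketch. Since the display of (2.1) is not reproduced here, the code assumes the standard form of the SBM joint law, $\prod_a \pi_a^{n_a} \prod_{a \le b} P_{ab}^{o_{ab}} (1-P_{ab})^{n_{ab}-o_{ab}}$, where $n_a$ counts the nodes in block $a$, $n_{ab}$ the node pairs between blocks $a$ and $b$, and $o_{ab}$ the edges among those pairs. The function names (`sample_sbm`, `counters`, `log_likelihood`) are hypothetical and labels are 0-based.

```python
import math
import random
from itertools import combinations

def sample_sbm(n, pi, P, rng):
    """Draw labels i.i.d. from pi, then edges independently given the labels."""
    k = len(pi)
    z = rng.choices(range(k), weights=pi, k=n)
    a = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        a[i][j] = a[j][i] = int(rng.random() < P[z[i]][z[j]])
    return z, a

def counters(z, a, k):
    """Block sizes n_a, pair counts n_ab and edge counts o_ab (keys with a <= b)."""
    sizes = [z.count(c) for c in range(k)]
    pairs = {(c, d): 0 for c in range(k) for d in range(c, k)}
    edges = dict(pairs)
    for i, j in combinations(range(len(z)), 2):
        key = (min(z[i], z[j]), max(z[i], z[j]))
        pairs[key] += 1
        edges[key] += a[i][j]
    return sizes, pairs, edges

def xlogy(x, y):
    """x * log(y) with the convention 0 * log(0) = 0 (assumed reading of the text)."""
    if x == 0:
        return 0.0
    return x * math.log(y) if y > 0 else -math.inf

def log_likelihood(z, a, pi, P):
    """Log of the assumed joint law of (labels, adjacency matrix)."""
    k = len(pi)
    sizes, pairs, edges = counters(z, a, k)
    ll = sum(xlogy(sizes[c], pi[c]) for c in range(k))
    for (c, d), m in pairs.items():
        o = edges[(c, d)]
        ll += xlogy(o, P[c][d]) + xlogy(m - o, 1.0 - P[c][d])
    return ll
```

The helper `xlogy` is where the zero-parameter convention enters: a block pair with probability 0 and no observed edges contributes nothing to the log-likelihood instead of producing an undefined value.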

We denote by the parametric space for a model with communities, given by

The *order* of the SBM is defined as the smallest for which
the equality (2.1) holds for a pair of parameters
and will be denoted by .
If an SBM has order then it cannot be reduced to a model with fewer communities than ; this specifically means that does not have two identical columns.

When is fixed and does not depend on , the mean degree of a given node grows linearly in and this regime produces very connected (dense) graphs. For this reason, in this paper we also consider the regime producing sparse graphs (with fewer edges); that is, we allow to decrease with to the zero matrix. In this case we write , where does not depend on and is a function decreasing to 0 at a rate .

## 3 The KT order estimator

The Krichevsky-Trofimov order estimator in the context of an SBM is a regularized estimator based on a mixture distribution for the adjacency matrix . Given a sample from the distribution (2.1) with parameters , where we assume we only observe the network , the estimator of the number of communities is defined by

(3.1) |

where is the mixture distribution for an SBM with communities and is a penalizing function that will be specified later.

As is usual for KT distributions, we choose as “prior” for the pair a product measure obtained by a Dirichlet() distribution (the prior distribution for ) and a product of Beta() distributions (the prior for the symmetric matrix ). In other words, we define the distribution on

(3.2) |

and we construct the mixture distribution for , based on , given by

(3.3) |

where stands for the marginal distribution obtained from (2.1), and given by

(3.4) |
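To make the construction concrete, the following Python sketch evaluates the mixture by brute force on very small networks. It rests on assumptions not confirmed by the text above: the prior parameters are taken to be 1/2, as is customary for KT distributions, so that for fixed labels the integral over the parameters has the closed form $\frac{\Gamma(k/2)}{\Gamma(1/2)^k}\frac{\prod_a \Gamma(n_a+1/2)}{\Gamma(n+k/2)} \prod_{a\le b} \frac{\Gamma(o_{ab}+1/2)\,\Gamma(n_{ab}-o_{ab}+1/2)}{\pi\,\Gamma(n_{ab}+1)}$, and the estimator is read as the maximizer of the log-mixture minus the penalty. The penalty `pen` and the search cap `max_k` are hypothetical arguments supplied by the caller; the cap is only a computational convenience.

```python
import math
from itertools import combinations, product

def log_kt_complete(z, a, k):
    """Log of the integrated likelihood of (z, a) under the assumed
    Dirichlet(1/2, ..., 1/2) and Beta(1/2, 1/2) priors."""
    n = len(z)
    sizes = [z.count(c) for c in range(k)]
    out = math.lgamma(k / 2) - k * math.lgamma(0.5) - math.lgamma(n + k / 2)
    out += sum(math.lgamma(s + 0.5) for s in sizes)
    pairs, edges = {}, {}
    for i, j in combinations(range(n), 2):
        key = (min(z[i], z[j]), max(z[i], z[j]))
        pairs[key] = pairs.get(key, 0) + 1
        edges[key] = edges.get(key, 0) + a[i][j]
    for key, m in pairs.items():
        o = edges[key]
        out += (math.lgamma(o + 0.5) + math.lgamma(m - o + 0.5)
                - math.log(math.pi) - math.lgamma(m + 1))
    return out

def log_kt_mixture(a, k):
    """Log of the mixture for the adjacency matrix alone: sum over all k**n
    label vectors (log-sum-exp). Only feasible for very small n."""
    n = len(a)
    terms = [log_kt_complete(z, a, k) for z in product(range(k), repeat=n)]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

def kt_order_estimate(a, max_k, pen):
    """Maximize log KT_k(a) - pen(k, n) over k = 1, ..., max_k.
    `pen` is a user-supplied penalty; `max_k` is a computational cap."""
    n = len(a)
    return max(range(1, max_k + 1),
               key=lambda k: log_kt_mixture(a, k) - pen(k, n))
```

Because the sum over label vectors has $k^n$ terms, this sketch is useful only as a sanity check of the definitions, not as a practical algorithm.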

As in other model selection problems where the KT approach has proved to be very useful, as for example in the case of Context Tree Models (Csiszar & Talata, 2006) and Hidden Markov Models (Gassiat & Boucheron, 2003), in the case of the SBM there is a close relationship between the KT mixture distribution and the maximum likelihood function. The following proposition shows a non-asymptotic uniform upper bound for the log ratio between these two functions. Its proof is postponed to the Appendix.

###### Proposition 3.1.

For all and all we have

where

Proposition 3.1 is at the core of the proof of the strong consistency of defined by (3.1). By strong consistency we mean that the estimator equals the order of the SBM with probability one, for all sufficiently large (that may depend on the sample ). In order to derive the strong consistency result for the KT order estimator, we need a penalty function in (3.1) with a given rate of convergence when grows to infinity. Although there is a range of possibilities for this penalty function, the specific form we use in this paper is

(3.5) |

for some . The convenience of the expression above will be made clear in the proof of the consistency result. Observe that the penalty function defined by (3.5) is dominated by a term of order and then it is of smaller order than the function used in Wang et al. (2017), so our results also apply in this case. It remains an open question which is the smallest penalty function for a strongly consistent estimator.

We finish this section by stating the main theoretical result in this paper.

###### Theorem 3.2 (Consistency Theorem).

Suppose the SBM has order with parameters . Then, for a penalty function of the form (3.5) we have that

eventually almost surely as .

The proofs of this and other auxiliary results are given in the next section and in the Appendix.

## 4 Proof of the Consistency Theorem

The proof of Theorem 3.2 is divided into two main parts. The first one, presented in Subsection 4.1, proves that does not overestimate the true order , eventually almost surely when , even without assuming a known upper bound on . The second part of the proof, presented in Subsection 4.2, shows that does not underestimate , eventually almost surely when . By combining these two results we prove that eventually almost surely as .

### 4.1 Non-overestimation

The main result in this subsection is given by the following proposition.

###### Proposition 4.1.

The proof of Proposition 4.1 follows directly from Lemmas 4.2, 4.3 and 4.4 presented below. These lemmas are inspired by the work of Gassiat & Boucheron (2003), which proves consistency for an order estimator of a Hidden Markov Model.

###### Lemma 4.2.

###### Proof.

First observe that

(4.1) |

Using Lemma A.2 we can bound the sum in the right-hand side by

where the last inequality follows from the fact that is an increasing function in . Moreover, a simple calculation using the specific form in (3.5) gives

By using this expression in the right-hand side of the last inequality to bound (4.1) we obtain that

where denotes an upper bound on . Now the result follows by the first Borel-Cantelli lemma. ∎

###### Lemma 4.3.

###### Proof.

###### Lemma 4.4.

### 4.2 Non-underestimation

In this subsection we deal with the proof of the non-underestimation of . The main result of this subsection is the following proposition.

###### Proposition 4.5.

In order to prove this result we need Lemmas 4.6 and 4.7 below, which explore limiting properties of the under-fitted model. That is, we deal with the problem of fitting an SBM of order in the parameter space .

An intuitive construction of a ()-block model from a -block model is obtained by merging two given blocks. This merging can be implemented in several ways, but here we consider the construction given in Wang et al. (2017), with the difference that instead of using the sample block proportions we use the limiting distribution of the original -block model.

Given we define the merging operation which combines blocks with labels and . For ease of exposition we only show the explicit definition for the case and . In this case, the merged distribution is given by

(4.2) | ||||

On the other hand, the merged matrix is obtained as

(4.3) | ||||

For arbitrary and the definition is obtained by permuting the labels.
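As an illustration, the merging operation can be sketched in Python. Because the displays (4.2) and (4.3) are not reproduced here, the code assumes the usual construction: the two merged blocks (taken here to be the last two labels) have their proportions added, and the connection probabilities involving the merged block are the corresponding averages weighted by the original block proportions. The function name `merge_last_two` and the 0-based indexing are choices made for this sketch only.

```python
def merge_last_two(pi, P):
    """Merge the last two blocks of a k-block model into one block.

    Assumed construction: proportions are added, and probabilities that
    involve the merged block are averages weighted by the original
    (limiting) block proportions.
    """
    k = len(pi)
    u, v = k - 2, k - 1          # the two blocks being merged
    w = pi[u] + pi[v]            # proportion of the merged block
    pi_m = list(pi[:u]) + [w]
    # Copy the untouched part of P and reserve a row/column for the merged block.
    P_m = [[P[c][d] for d in range(u)] + [0.0] for c in range(u)]
    P_m.append([0.0] * (k - 1))
    for c in range(u):
        mixed = (pi[u] * P[c][u] + pi[v] * P[c][v]) / w
        P_m[c][u] = P_m[u][c] = mixed
    P_m[u][u] = (pi[u] ** 2 * P[u][u]
                 + 2 * pi[u] * pi[v] * P[u][v]
                 + pi[v] ** 2 * P[v][v]) / w ** 2
    return pi_m, P_m

def edge_density(pi, P):
    """Probability that two independently labelled nodes are connected."""
    k = len(pi)
    return sum(pi[c] * pi[d] * P[c][d] for c in range(k) for d in range(k))
```

Under this construction the overall edge density is unchanged by the merge, which is one way to check that the weighted averages are set up consistently.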

Given originated from the SBM of order and parameters , we define the profile likelihood estimator of the label assignment under the ()-block model as

(4.4) |
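For very small networks the profile likelihood label assignment can be computed by exhaustive search, as in the Python sketch below. It assumes that (4.4) maximizes, over label vectors, the supremum of the joint likelihood over the parameters, which is attained at the empirical block proportions and empirical edge frequencies. The names `profile_log_likelihood` and `best_labels` are hypothetical, and labels are 0-based.

```python
import math
from itertools import combinations, product

def xlog_ratio(x, total):
    """x * log(x / total), with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x / total)

def profile_log_likelihood(z, a, k):
    """Supremum over the parameters of the log-likelihood of (z, a),
    obtained by plugging in n_a / n and o_ab / n_ab."""
    n = len(z)
    ll = sum(xlog_ratio(z.count(c), n) for c in range(k))
    pairs, edges = {}, {}
    for i, j in combinations(range(n), 2):
        key = (min(z[i], z[j]), max(z[i], z[j]))
        pairs[key] = pairs.get(key, 0) + 1
        edges[key] = edges.get(key, 0) + a[i][j]
    for key, m in pairs.items():
        o = edges[key]
        ll += xlog_ratio(o, m) + xlog_ratio(m - o, m)
    return ll

def best_labels(a, k):
    """Exhaustive search over all k**n label vectors (tiny n only)."""
    n = len(a)
    return max(product(range(k), repeat=n),
               key=lambda z: profile_log_likelihood(z, a, k))
```

On a network made of two disjoint triangles, for instance, the search with two blocks separates the triangles (up to a relabelling of the blocks).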

The next lemmas show that the logarithm of the ratio between the maximum likelihood under the true order and the maximum profile likelihood under the under-fitting order model is bounded from below by a function growing faster than , eventually almost surely when . Each lemma considers one of the two possible regimes (dense regime) or at a rate (sparse regime).

###### Lemma 4.6 (dense regime).

Let be a sample of size from a SBM of order with parameters , with not depending on . Then there exist such that for we have that almost surely

(4.5) | ||||

where .

###### Proof.

Given and define the empirical probabilities

(4.6) |

Then the maximum likelihood function is given by

Using that for and the last expression is equal to

(4.7) |

The first two terms in (4.7) are of smaller order compared to , so by the Strong Law of Large Numbers we have that almost surely

(4.8) |

Similarly for and we have that almost surely

(4.9) |

for some . Combining (4.8) and (4.9) we have that almost surely

(4.10) |

To obtain a lower bound for (4.10) we need to compute that minimizes the right-hand side. This is equivalent to obtaining that maximizes the second term

(4.11) |

Denote by a -order SBM with distribution . By definition

Observe that when , the numerator equals

where denotes a joint distribution on (a coupling) with marginals and , respectively. Similarly, the denominator can be written as

where denotes the matrix with dimension and all entries equal to 1. Then we can rewrite (4.11) as

(4.12) |

Therefore, finding a pair maximizing (4.11) is equivalent to finding an optimal coupling maximizing (4.12). Wang et al. (2017) proved that there exist such that (4.12) achieves its maximum at , see Lemma A.2 there. This concludes the proof of the first inequality in (4.5). In order to prove the second strict inequality in (4.5), we consider for convenience and without loss of generality, and (the other cases can be handled by a permutation of the labels). Notice that in the right-hand side of (4.10), with substituted by the optimal value defined by (4.2) and (4.3), all the terms with cancel. Moreover, as is a convex function, Jensen’s inequality implies that

(4.13) |

for all and similarly

(4.14) |

The equality holds for all in (4.13) and in (4.14) simultaneously if and only if

in which case the matrix would have two identical columns, contradicting the fact that the sample originated from an SBM with order . Therefore the strict inequality must hold in (4.13) for at least one or in (4.14), showing that the second inequality in (4.5) holds. ∎
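The role of Jensen’s inequality in the argument above can be checked numerically with a small Python sketch. It assumes that the convex function in question is the negative binary entropy $x \log x + (1-x)\log(1-x)$, named `gamma` here; the symbol and the helper `jensen_gap` are choices made for this illustration.

```python
import math

def gamma(x):
    """Assumed convex function: x*log(x) + (1-x)*log(1-x), with 0*log(0) = 0."""
    def xlogx(t):
        return 0.0 if t == 0 else t * math.log(t)
    return xlogx(x) + xlogx(1.0 - x)

def jensen_gap(w, p, q):
    """w*gamma(p) + (1-w)*gamma(q) - gamma(w*p + (1-w)*q) for a weight w in [0, 1].

    Non-negative by convexity, and zero for 0 < w < 1 only when p == q.
    """
    return w * gamma(p) + (1.0 - w) * gamma(q) - gamma(w * p + (1.0 - w) * q)
```

With a weight strictly between 0 and 1, the gap is strictly positive whenever the two probabilities being averaged differ. This is the mechanism behind the strict inequality: merging two blocks whose columns differ in at least one entry loses a positive amount of likelihood.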

###### Lemma 4.7 (sparse regime).

Let be a sample of size from a SBM of order with parameters , where at a rate . Then there exist such that for we have that almost surely

(4.15) | ||||

where .