For the last decade, statistical network analysis has been a very active research topic, and the statistical modeling of networks has found many applications in social sciences and biology; see, for example, Aicher et al. (2014), Barbillon et al. (2015), Mariadassou et al. (2010), Wasserman and Faust (1994) and Zachary (1977).
Many random graph models have been widely studied, either from a theoretical or an empirical point of view. The first model studied was the Erdős-Rényi model (Erdős and Rényi, 1959), which assumes that each pair of nodes (dyad) is connected independently of the others with the same probability. This model assumes homogeneity of all nodes across the network. In order to alleviate this constraint, many families of models have been introduced. Most are endowed with a latent structure (reviewed in Matias and Robin, 2014) to capture heterogeneity across nodes. Among those, the Stochastic Block Model (in short SBM, see Frank and Harary, 1982; Holland et al., 1983) is one of the oldest and most studied, as it is highly flexible and can capture a large variety of structures (affiliation, hub, bipartite and many others). In order to estimate this model, Bayesian approaches were first proposed (Snijders and Nowicki, 1997; Nowicki and Snijders, 2001) but have been superseded by variational methods (Daudin et al., 2008; Latouche et al., 2012). The former class of approaches is exact but lacks the computational efficiency and scalability that the latter offers.
Establishing theoretical guarantees for maximum likelihood estimators (in short MLE) and variational methods in the binary SBM is not an easy task and has been widely studied. In Celisse et al. (2012), consistency of the MLE and variational estimates is proven, but asymptotic normality requires the estimators to converge at a fast enough rate, which is not proven in the paper, although some results were available for some particular cases (affiliation for example). Ambroise and Matias (2012) tackle the specific case of the affiliation model with equal group proportions and prove the consistency and asymptotic normality of parameter estimates. Bickel et al. (2013) extend those results to arbitrary binary SBM graphs and improve on Celisse et al. (2012) by removing the condition on the convergence rate. Following along the path of Bickel et al. (2013), Brault et al. (2017) extended consistency and asymptotic normality of estimators (MLE and variational) to weighted Latent Block Models where the weight distribution belongs to a regular one-dimensional exponential family. In particular, considering non-bounded edge values invalidates several parts of the proofs for binary graphs and requires substantial adaptations and additional results, notably concentration inequalities for sums of unbounded, non-Gaussian random variables.
Some results are also available for the related semi-parametric problem of assignment reconstruction. Mariadassou and Matias (2015) show that the conditional distribution of the (latent) assignments converges to a degenerate distribution, and Rohe et al. (2010) prove that, when the data are generated according to an SBM, spectral methods are consistent. Choi et al. (2012) extend those results to settings where the density of the graph goes to $0$ and/or the number of groups goes to infinity, at controlled rates. Finally, Wang and Bickel (2017) and Hu et al. (2017) show that model selection for the number of groups is consistent for dense graphs; they suggest using a penalized likelihood criterion whose penalty involves a tuning parameter.
In this paper we consider a simple setting with a fixed number of groups and fixed density but weighted edges and missing values. In most network studies, there is a strong asymmetry between the presence of an edge and its absence: the lack of proof that an edge exists is taken as proof that the edge does not exist, and edges with uncertain status are considered as non-existent in the graph. This is the strategy adopted in most sparse asymptotic settings, where the density of edges goes to $0$ asymptotically (Bickel et al., 2013). We adopt a different point of view where edges with uncertain status are considered as missing, rather than absent, and their missing nature is explicitly accounted for. We use the framework of Rubin (1976) and its application to network data (see Kolaczyk, 2009, and Handcock and Gile, 2010) for parameter inference in the presence of missing values, and more specifically its application to SBM (Tabouy et al., 2019). We prove that, in the MCAR setting where each dyad is missing independently and with the same probability, the MLE and variational estimates are still consistent and asymptotically normal.
The article is organized as follows. We first present the model and missing data theory applied to our context with some examples of sampling designs. We then posit some definitions and discuss the assumptions required for our results in Section 2. In Section 3 we establish asymptotic normality for the complete-observed model (i.e. observed SBM where latent variables are known). Section 4 contains the main result of this paper, which states that the observed likelihood behaves like the complete-observed likelihood (i.e. joint likelihood of the observed data and latent variables) close to its maximum. The proof is sketched in Section 5. Consequences for the MLE and variational estimator, as well as a comparison to existing results, are discussed in Section 6. Technical lemmas and details of the proofs are available in the appendices.
2 Statistical framework
2.1 Stochastic Block Model
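In SBM, nodes from a set $\mathcal{N} = \{1, \dots, n\}$ are distributed among a set $\mathcal{Q} = \{1, \dots, Q\}$ of hidden blocks that model the latent structure of the graph. The block-memberships are encoded by $Z = (Z_i)_{i \in \mathcal{N}}$ where the $Z_i$ are independent random variables with prior probabilities $\alpha = (\alpha_1, \dots, \alpha_Q)$, such that $\mathbb{P}(Z_i = q) = \alpha_q$, for all $q \in \mathcal{Q}$. The value $Y_{ij}$ of any dyad $(i, j)$ in $\mathcal{D} = \mathcal{N} \times \mathcal{N}$, with $i \neq j$, only depends on the blocks $i$ and $j$ belong to. The variables $Y_{ij}$ are thus independent conditionally on the $Z_i$:
\[ Y_{ij} \mid Z_i = q, Z_j = \ell \sim \varphi(\cdot, \pi_{q\ell}). \]
In the following, $Y = (Y_{ij})$ is the $n \times n$ adjacency matrix of the random graph and $Z = (Z_1, \dots, Z_n)$ the $n$-vector of the latent blocks. With a slight abuse of notation, we associate to $Z_i$ a binary vector $(Z_{i1}, \dots, Z_{iQ})$ such that $Z_{iq} = 1 \Leftrightarrow Z_i = q$, for all $q \in \mathcal{Q}$. In this case $Z$ is an $n \times Q$ matrix.
We note the complete parameter set as $\theta = (\alpha, \pi) \in \Theta$, where $\Theta$ stands for the parameter space. When performing inference from data, we note $\theta^\star = (\alpha^\star, \pi^\star)$ the true parameter set, i.e. the parameter values used to generate the data, and $Z^\star$ the true (and usually unobserved) memberships of nodes. For any assignment $z$, we also note:
$z_{+q} = \sum_{i} z_{iq}$ the size of block $q$ for membership $z$,
$z^\star_{+q}$ its counterpart for $z^\star$.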
2.2 Missing data for SBM
Regarding SBM inference, a missing value corresponds to a missing entry in the adjacency matrix $Y$, typically denoted by NA's. We rely on the $n \times n$ sampling matrix $R$ to record the missing state of each entry:
\[ R_{ij} = 1 \text{ if } Y_{ij} \text{ is observed}, \qquad R_{ij} = 0 \text{ otherwise}. \]
As a shortcut, we use $Y^{o} = \{Y_{ij} : R_{ij} = 1\}$ and $Y^{m} = \{Y_{ij} : R_{ij} = 0\}$ to respectively denote the observed and missing dyads. The sampling design is the description of the stochastic process that generates $R$. It is assumed that the network exists before the sampling design acts upon it. The design is fully characterized by the conditional distribution $p(R \mid Y)$, whose parameters live, jointly with the SBM parameters $\theta$, in a product space. In this paper we focus on a specific type of missingness, called missing completely at random (MCAR), for which $p(R \mid Y) = p(R)$, and leave aside more complex forms of dependencies such as missing at random (MAR) and not missing at random (NMAR).
for missing data and define the joint probability density function as
2.3 Sampling design examples
We present here some examples of sampling designs to illustrate differences between notions of MCAR, MAR and NMAR.
Definition 2.2 (Random dyad sampling).
Each dyad has the same probability of being observed, independently of the others. This design is MCAR.
Definition 2.3 (Random node sampling).
Random node sampling consists in selecting a set of nodes, each independently with probability $\rho$, and then observing the corresponding rows and columns of matrix $Y$.
The major point in both examples is that the probability of observing a dyad ($\rho$ in random dyad sampling and $1 - (1 - \rho)^2$ in random node sampling, since a dyad is observed as soon as one of its two nodes is selected) does not depend on its value. In contrast, the following dyad-centered sampling design adapted to binary networks is NMAR since the probability of observing a dyad depends on its value:
Definition 2.4 (Double standard sampling).
Each dyad is observed, independently of other dyads, with a probability depending on its value: $\mathbb{P}(R_{ij} = 1 \mid Y_{ij} = 1) = \rho_1$ and $\mathbb{P}(R_{ij} = 1 \mid Y_{ij} = 0) = \rho_0$.
For non-binary networks, specifying the sampling design is more involved and requires defining the sampling density for every possible value of $Y_{ij}$, e.g. for every non-negative integer in the case of Poisson-valued edges.
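The three designs above can be illustrated on a simulated binary network. The following is a minimal sketch, not the authors' code: the function names, the directed loop-free convention and the parameter values are assumptions chosen for illustration.

```python
import numpy as np

def simulate_sbm(n, alpha, pi, rng):
    """Draw a directed binary SBM without self-loops (assumed convention)."""
    z = rng.choice(len(alpha), size=n, p=alpha)   # block label of each node
    probs = pi[z][:, z]                           # probs[i, j] = pi[z_i, z_j]
    y = (rng.random((n, n)) < probs).astype(float)
    np.fill_diagonal(y, 0.0)
    return y, z

def random_dyad_sampling(y, rho, rng):
    """MCAR: each dyad is observed independently with probability rho."""
    return rng.random(y.shape) < rho

def random_node_sampling(y, rho, rng):
    """Select nodes with probability rho, observe their rows and columns."""
    selected = rng.random(y.shape[0]) < rho
    return selected[:, None] | selected[None, :]

def double_standard_sampling(y, rho1, rho0, rng):
    """NMAR: the observation probability depends on the dyad value."""
    return rng.random(y.shape) < np.where(y == 1, rho1, rho0)

def mask(y, r):
    """Replace unobserved dyads by NaN (the NA's of the text)."""
    return np.where(r, y, np.nan)

rng = np.random.default_rng(0)
alpha = np.array([0.5, 0.5])                      # hypothetical block proportions
pi = np.array([[0.8, 0.1], [0.1, 0.6]])           # hypothetical connectivities
y, z = simulate_sbm(200, alpha, pi, rng)
r_dyad = random_dyad_sampling(y, 0.7, rng)
r_node = random_node_sampling(y, 0.7, rng)
r_ds = double_standard_sampling(y, 0.9, 0.3, rng)
y_obs = mask(y, r_dyad)
```

Under random dyad sampling the fraction of observed entries concentrates around `rho`, whereas under double standard sampling the observed part over-represents edges whenever `rho1 > rho0`, which is why ignoring the design biases naive estimates in the NMAR case.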
When the labels are known, the complete-observed log-likelihood is given by:
But the labels are usually unobserved, and the observed log-likelihood is obtained by integration over all memberships:
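As a hedged sketch of these two quantities under random dyad sampling, one possible way to write them is given below. The symbols are assumptions made for illustration ($Y$ the adjacency matrix, $R$ the binary sampling matrix, $z$ a one-hot assignment, $\theta = (\alpha, \pi)$ the SBM parameters, $\rho$ the sampling probability, $\varphi$ the emission density, directed network without self-loops); the exact form used by the authors may differ, for instance in which terms are included.

```latex
\begin{align*}
% complete-observed log-likelihood: labels z are known
\log p(Y^{o}, R, z; \theta, \rho)
  &= \sum_{i} \sum_{q} z_{iq} \log \alpha_q
   + \sum_{i \neq j} R_{ij} \sum_{q, \ell} z_{iq} z_{j\ell}
       \log \varphi(Y_{ij}, \pi_{q\ell}) \\
  &\quad + \sum_{i \neq j} \bigl[ R_{ij} \log \rho
       + (1 - R_{ij}) \log (1 - \rho) \bigr], \\
% observed log-likelihood: sum over all Q^n assignments
\log p(Y^{o}, R; \theta, \rho)
  &= \log \sum_{z} p(Y^{o}, R, z; \theta, \rho).
\end{align*}
```

The sum over assignments has $Q^n$ terms, which is what makes the observed likelihood intractable and motivates the comparison with the complete-observed likelihood.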
2.5 Models and Assumptions
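We focus here on parametric models where $\varphi$ belongs to a regular one-dimensional exponential family in canonical form:
\[ \varphi(x, \pi) = b(x) \exp\bigl(\pi x - \psi(\pi)\bigr), \]
where $\pi$ belongs to the space $\mathcal{A}$, so that $\varphi(\cdot, \pi)$ is well defined for all $\pi \in \mathcal{A}$. Classical properties of exponential families ensure that $\psi$ is convex and infinitely differentiable on the interior $\mathring{\mathcal{A}}$ of $\mathcal{A}$, and that $(\psi')^{-1}$ is well defined on $\psi'(\mathring{\mathcal{A}})$. Furthermore, when $Y_\pi \sim \varphi(\cdot, \pi)$, $\mathbb{E}[Y_\pi] = \psi'(\pi)$ and $\mathbb{V}[Y_\pi] = \psi''(\pi)$.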
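The moment identities of canonical exponential families (mean equal to the first derivative of the log-partition function, variance equal to its second derivative) can be checked numerically. The sketch below does so for the Bernoulli and Poisson families using finite differences; the log-partition functions are standard facts, while the function names and the step size are illustrative choices.

```python
import numpy as np

# Log-partition functions psi in canonical (natural) parametrization:
#   Bernoulli, natural parameter pi = log(p / (1 - p)): psi(pi) = log(1 + exp(pi))
#   Poisson,   natural parameter pi = log(lam):         psi(pi) = exp(pi)
def psi_bernoulli(pi):
    return np.log1p(np.exp(pi))

def psi_poisson(pi):
    return np.exp(pi)

def derivatives(psi, pi, h=1e-4):
    """Central finite differences approximating psi'(pi) and psi''(pi)."""
    d1 = (psi(pi + h) - psi(pi - h)) / (2 * h)
    d2 = (psi(pi + h) - 2 * psi(pi) + psi(pi - h)) / h**2
    return d1, d2

p = 0.3
mean_b, var_b = derivatives(psi_bernoulli, np.log(p / (1 - p)))
# mean_b is close to p, var_b is close to p * (1 - p)

lam = 2.5
mean_p, var_p = derivatives(psi_poisson, np.log(lam))
# mean_p and var_p are both close to lam
```

Since the second derivative is a variance, it is positive, so the first derivative is strictly increasing; this is the monotonicity that makes the map from natural parameter to mean invertible.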
In the following, we assume that missing data are produced according to random dyad sampling with parameter $\rho$.
Moreover, we make the following assumptions on the parameter space $\Theta$:
$(H_1)$: There exists a positive constant $c$, and a compact interval $C_\pi$ such that $\alpha_q \in [c, 1 - c]$ for all $q$, $\rho \in [c, 1 - c]$ and $\pi_{q\ell} \in C_\pi \subset \mathring{\mathcal{A}}$ for all $(q, \ell)$.
$(H_2)$: The true parameter $\theta^\star$ lies in the interior of $\Theta$.
$(H_3)$: The map $\pi \mapsto \varphi(\cdot, \pi)$ is injective.
$(H_4)$: The coordinates of $\psi'(\pi^\star) \alpha^\star$, where $\psi'$ is applied component-wise, are pairwise distinct.
The previous assumptions are standard. Assumption $(H_1)$ ensures that the group proportions and the sampling parameter are bounded away from $0$ and $1$, so that no group disappears when $n$ goes to infinity. It also ensures that $\pi$ is bounded away from the boundaries of $\mathcal{A}$. This is essential for the subexponential properties of Propositions 2.8 and 2.9. $(H_3)$ and $(H_4)$ are necessary for identifiability purposes: the model is trivially not identifiable if the map $\pi \mapsto \varphi(\cdot, \pi)$ is not injective. $(H_4)$ states the identifiability of SBM parameters under random dyad sampling. Note that, combined with $(H_3)$, it implies that all columns and all rows of $\pi^\star$ are distinct and therefore that no two groups have identical connectivity profiles. In the following, we consider that $Q$, the number of classes (or groups), is known.
Since $R$ is independent of $Y$, the identifiability of the SBM with emission law in a one-dimensional exponential family under random dyad sampling can be established in two steps: first for the sampling parameter $\rho$, then for the SBM parameters $\theta$ given $\rho$.
The sampling parameter $\rho$ of random dyad sampling is identifiable w.r.t. the sampling distribution.
Let $n \geq 2Q$ and assume that $\alpha_q > 0$ for any $q \in \mathcal{Q}$, and that the coordinates of $\psi'(\pi) \alpha$, where $\psi'$ is applied component-wise, are pairwise distinct. Then, under random dyad sampling, SBM parameters are identifiable w.r.t. the distribution of the observed part of the SBM up to label switching.
The proof is nearly identical to the one written in Tabouy et al. (2019), itself inspired by Celisse et al. (2012), for the binary SBM under random dyad sampling. However, substituting $\psi'(\pi)$ for $\pi$ in the proof ensures that $\psi'(\pi)$ is identifiable. Finally, the fact that $\psi'$ is a one-to-one map ensures that $\pi$ is identifiable. ∎
Note that asymptotically, the assumption $n \geq 2Q$ is always satisfied since $Q$ is fixed and $n$ grows to infinity.
2.7 Subexponential variables
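Since we restricted $\pi$ to a bounded subset $C_\pi$ of $\mathring{\mathcal{A}}$, the variance of $Y_\pi$ is bounded away from $0$ and $+\infty$. We note
\[ \sup_{\pi \in C_\pi} \psi''(\pi) = \bar{\sigma}^2 < +\infty \quad \text{and} \quad \inf_{\pi \in C_\pi} \psi''(\pi) = \underline{\sigma}^2 > 0. \]
Similarly, since $\pi$ belongs to a bounded subset of an open interval, there exists a constant $\kappa > 0$ such that $[\pi - \kappa, \pi + \kappa] \subset \mathring{\mathcal{A}}$ uniformly over all $\pi \in C_\pi$.
With the previous notations, if $\pi \in C_\pi$ and $Y_\pi \sim \varphi(\cdot, \pi)$, then $Y_\pi$ is subexponential with parameters $(\bar{\sigma}^2, \kappa^{-1})$.
Consider now the product of a dyad value and its sampling indicator, the indicator being independent of the dyad value and bounded. There are two non-negative numbers such that this product is subexponential with these numbers as parameters.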
These results derive directly from theorem C.1 (statement 2.). ∎
We now introduce the concepts of assignment and parameter symmetries, which must be accounted for when studying the asymptotic properties of the MLE. Complications stemming from symmetries are related to, but not equivalent to, the problem of label-switching in mixture models.
Definition 2.10 (permutation).
Let be a permutation on . If is a matrix with columns and rows, we define as the matrix obtained by permuting the columns of according to , i.e. for any row and column of , . If is a matrix with rows and columns, is defined similarly:
Definition 2.11 (equivalence).
We define the following equivalence relationships:
Two assignments and are equivalent, noted , if they are equal up to label permutation, i.e. there exists a permutation such that .
Two parameters and are equivalent, noted , if they are equal up to label permutation, i.e. there exists a permutation such that .
and are equivalent, noted , if they are equal up to label permutation on and , i.e. there exists a permutation such that . This is label-switching.
Definition 2.12 (symmetry).
We say that the parameter exhibits symmetry for the permutation if
exhibits symmetry if it exhibits symmetry for at least one non-trivial permutation . Finally the set of permutations for which exhibits symmetry is noted .
The set of parameters that exhibit symmetry is a manifold of null Lebesgue measure in . The notion of symmetry allows us to deal with a notion of non-identifiability of the class labels that is subtler than and different from label switching. More precisely
In particular, in label-switching, and have the same likelihood but under equivalent yet different parameters s. In contrast, in the presence of symmetry, multiple assignments can have exactly the same likelihood under .
The issue of symmetry forces us to use a notion of distance between assignments that is invariant to label permutation.
Definition 2.14 (distance).
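We define the following distance, up to equivalence, between configurations $z$ and $z^\star$:
\[ \| z - z^\star \|_{0, \sim} = \inf_{z' \sim z} \| z' - z^\star \|_0, \]
where, for any matrix $A$, we use the Hamming norm defined by
\[ \| A \|_0 = \sum_{i, j} \mathbb{1}\{A_{ij} \neq 0\}. \]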
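A permutation-invariant distance of this kind can be computed by brute force when the number of blocks is small. The sketch below assumes that assignments are stored as one-hot matrices, that the Hamming norm counts non-zero entries, and that the distance is the minimum of that norm over all relabelings of one assignment; the function names are hypothetical.

```python
from itertools import permutations

import numpy as np

def one_hot(labels, n_blocks):
    """Encode integer labels as an n x n_blocks binary matrix."""
    z = np.zeros((len(labels), n_blocks), dtype=int)
    z[np.arange(len(labels)), labels] = 1
    return z

def hamming_norm(a):
    """Number of non-zero entries of a matrix."""
    return int(np.count_nonzero(a))

def distance_up_to_equivalence(z, z_prime):
    """Minimum Hamming norm of z - z_prime over all column permutations."""
    n_blocks = z.shape[1]
    return min(
        hamming_norm(z - z_prime[:, list(perm)])
        for perm in permutations(range(n_blocks))
    )

z_a = one_hot([0, 0, 1, 1, 2], 3)
z_b = one_hot([2, 2, 0, 0, 1], 3)   # same partition, labels permuted
z_c = one_hot([1, 1, 0, 0, 0], 3)   # last node moved to another block

d_ab = distance_up_to_equivalence(z_a, z_b)
d_ac = distance_up_to_equivalence(z_a, z_c)
```

With one-hot encoding, each misclassified node contributes two non-zero entries to the difference, so the distance is twice the minimal number of misclassified nodes. The loop visits all permutations of the blocks, which is only practical for a small number of blocks.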
Definition 2.15 (Set of local assignments).
We note the set of configurations that have a representative (for ) within relative radius of :
2.9 Other definitions
We finally introduce a few useful notions that will be instrumental in the proofs. The first is “regular” assignments, for which each group has “enough” nodes:
Definition 2.16 ($c$-regular assignments).
Let . For any , we say that is c-regular if
Class distinctness captures the differences between groups: lower values mean that at least two classes are very similar. It is intrinsically linked to the convergence rate of several estimates.
Definition 2.17 (class distinctness).
For . We define:
with the Kullback divergence between and , when comes from an exponential family.
Since all have distinct rows and columns, .
Finally, the confusion matrix allows us to compare groups between assignments:
Definition 2.19 (confusion matrix).
For given assignments and , we define the confusion matrix between and , noted , as follows:
For more conciseness, we define
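As an illustration of how a confusion matrix compares two assignments, the sketch below uses a commonly used normalized form: entry $(q, q')$ is the fraction of nodes placed in block $q$ by the first assignment and in block $q'$ by the second. This specific normalization and the function name are assumptions, not necessarily the definition used here.

```python
import numpy as np

def confusion_matrix(z_star, z):
    """Fraction of nodes in block q of z_star and block q' of z.

    Both arguments are n x Q one-hot matrices (assumed encoding).
    """
    n = z_star.shape[0]
    return z_star.T @ z / n

z_star = np.eye(2)[[0, 0, 1, 1]]    # reference assignment
z = np.eye(2)[[0, 1, 1, 1]]         # assignment to compare

conf = confusion_matrix(z_star, z)
# conf equals [[0.25, 0.25], [0.0, 0.5]]
```

Row sums give the block proportions of the first assignment and column sums those of the second; the two assignments are equivalent exactly when the matrix has a single non-zero entry in each row and each column.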
3 Complete-observed Model
In the following we study the asymptotic properties of the complete-observed data model, i.e. when the true assignment is known.
Under random dyad sampling, define and the set of nodes with at least one observed dyad. Then
This proposition is a direct consequence of Borel-Cantelli’s theorem. Details are available in appendix A. ∎
This result shows that, with high probability, the network has no unobserved node. In the remainder, we work conditionally on .
Let be the MLE of in the complete-observed data model. Simple manipulations of Equation (2.3) yield:
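When the assignment is known, estimators of this kind typically reduce to empirical proportions and block-wise averages over observed dyads. The sketch below computes them under assumptions made for illustration: a directed network without self-loops, one-hot assignments, and a connectivity estimate expressed through the mean of each block pair (in a canonical exponential family, the natural parameter is then recovered by inverting the mean map). The function name is hypothetical.

```python
import numpy as np

def complete_observed_mle(y, r, z):
    """Estimates when the assignment z is known.

    y: n x n matrix of dyad values (entries at unobserved dyads are ignored)
    r: n x n boolean matrix, True where the dyad is observed
    z: n x Q one-hot assignment matrix
    Returns block proportions, sampling rate and block-pair means.
    """
    n = y.shape[0]
    off_diag = ~np.eye(n, dtype=bool)
    obs = (r & off_diag).astype(float)
    alpha_hat = z.mean(axis=0)                 # block proportions
    rho_hat = obs.sum() / off_diag.sum()       # fraction of observed dyads
    y_obs = np.where(obs > 0, y, 0.0)          # drop unobserved values (NaN-safe)
    counts = z.T @ obs @ z                     # observed dyads per block pair
    sums = z.T @ y_obs @ z                     # sum of observed values per pair
    with np.errstate(invalid="ignore", divide="ignore"):
        mean_hat = sums / counts               # NaN if a block pair is unobserved
    return alpha_hat, rho_hat, mean_hat

y = np.array([[0, 1, 0, 0],
              [1, 0, 1, np.nan],
              [0, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
r = ~np.isnan(y)
z = np.eye(2)[[0, 0, 1, 1]]
alpha_hat, rho_hat, mean_hat = complete_observed_mle(y, r, z)
```

In this toy example one of the twelve off-diagonal dyads is missing, so the block pair containing it is averaged over three observed dyads instead of four.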
Since there are missing values in the adjacency matrix, we need the following technical lemma to prove asymptotic normality of ’s in the complete data model.
The proof of this lemma is based on Hoeffding’s decomposition for U-statistics and on the proof of Hoeffding’s concentration inequality. Details are postponed to appendix A. ∎
Let . is positive semi-definite, of rank , and is asymptotically normal:
Similarly, let be the matrix defined by and . Then the estimates are independent and asymptotically Gaussian with limit distribution:
Proposition 3.5 (Local asymptotic normality).
Let be the complete likelihood function defined on by . For any , and in a compact set, we have:
where denotes the Hadamard product of two matrices (element-wise product) and and are defined in Proposition 3.4. is asymptotically Gaussian with zero mean and variance matrix . is a random matrix with independent entries that are asymptotically Gaussian with zero mean and variance .
This result is based on a Taylor expansion of in a neighborhood of . Details are available in appendix A. ∎
4 Main Result
Our main result compares the observed likelihood ratio with the complete likelihood to show that they have the same argmax. To ease the comparison, we work only on the high-probability set of $c$-regular configurations, i.e. configurations that have enough nodes in each group, as defined in Section 2.
Define as the subset of made of -regular assignments, with defined in assumption . Note the event , then:
This proposition is a consequence of Hoeffding’s inequality. See appendix A for more details. ∎
We can now state our main result:
Theorem 4.2 (complete-observed).
Assume that $(H_1)$ to $(H_4)$ hold, with random dyad sampling, for the Stochastic Block Model of known order with observations coming from a univariate exponential family, and define as the set of permutations for which exhibits symmetry. Then, for tending to infinity, the observed likelihood ratio behaves like the complete likelihood ratio, up to a bounded multiplicative factor:
where the is uniform over all .
The maximum over all that are equivalent to stems from the fact that because of label-switching, is only identifiable up to its -equivalence class from the observed likelihood, whereas it is completely identifiable from the complete likelihood. The multiplicative factor arises from the fact that equivalent assignments have exactly the same complete likelihood and contribute equally to the observed likelihood.
If contains only parameters with no symmetry:
where the is uniform over all .
5 Proof Sketch
The proof of Theorem 4.2 relies on controlling deviations of the log-likelihood ratios from their expectations. We introduce a few notations for those quantities.
5.1 Log-likelihood ratios
We define the conditional log-likelihood ratio and its expectation as:
We also define the profile ratio and its counterpart as:
Conditionally on , we have
with for such that or .
Note the absence of the random variable in .
The following decomposition of highlights the importance of :
Since , the profile ratio is useful to remove the dependency on and reduce the study to a series of problems depending only on . The following propositions show when those quantities reach their maximum values and what the corresponding values are.
Proposition 5.4 (maximum of and in ).
The functions and are maximum respectively in for and defined by:
Proposition 5.5 (Local upperbound for ).
Conditionally upon , there exists a positive constant such that for all :
Proposition 5.6 (maximum of and in ).
can be written:
Conditionally on the set of regular assignments and for ,
is maximized at and its equivalence class and .
is maximized at and its equivalence class and .
The maximum of (and hence the maximum of ) is well separated.
5.2 High level view of the proof
The proof proceeds with an examination of the asymptotic behavior of on three types of configurations that partition :
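global control: Proposition 5.8 shows that assignments far from the true assignment contribute negligibly to the sum.
local control: Proposition 5.10 shows that assignments close to, but not equivalent to, the true assignment also contribute negligibly to the sum.
equivalent assignments: Proposition 5.11 examines which of the remaining assignments, all equivalent to the true assignment, contribute to the sum.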
These results are presented in Section 5.3 and their proofs are postponed to Appendix B. They are then put together in Section 5.4 to prove our main result. The remainder of the section is devoted to the asymptotics of the ML and variational estimators as a consequence of the main result.
5.3 Different asymptotic behaviors
5.3.1 Global Control
Proposition 5.7 (large deviations of ).
Let . For all and large enough that
Proposition 5.8 (contribution of global assignments).
Choose decreasing to such that . Then conditionally on and for large enough that , we have:
5.3.2 Local Control
Proposition 5.9 (small deviations ).
Conditionally upon ,
Proposition 5.10 (contribution of local assignments).
With the previous notations and the positive constant defined in Proposition 5.5:
5.3.3 Equivalent assignments
It remains to study the contribution of equivalent assignments.
Proposition 5.11 (contribution of equivalent assignments).
For all , we have
where the is uniform in .
5.4 Proof of the main result
We work conditionally on . Choose and a sequence decreasing to but satisfying . According to Proposition 5.8,
And therefore the observed likelihood ratio reduces as:
And Proposition 5.11 allows us to conclude
6 Variational and Maximum Likelihood Estimates
This section is devoted to the asymptotics of the ML and variational estimators in the incomplete data model, as a consequence of the main result (Theorem 4.2). Note that, with high probability, ML and variational estimators have no symmetry since the set of parameters that exhibit symmetry is a manifold of null Lebesgue measure in the parameter space.
6.1 ML estimator
Corollary 6.1 (Asymptotic behavior of ).
Denote the maximum likelihood estimator and use the notations of Proposition 3.4. There exist permutations of such that
Hence, the maximum likelihood estimator for the SBM under random-dyad sampling condition is consistent and asymptotically normal, with the same behavior as the maximum likelihood estimator in the complete data model. The proof is postponed to appendix B.10.
6.2 Variational estimator
Due to the complex dependency structure of the observations, the maximum likelihood estimator of the SBM is not numerically tractable, even with the Expectation Maximisation algorithm. In practice, a variational approximation is often used (see Daudin et al., 2008): for any joint distribution on the latent assignments, a lower bound of the observed log-likelihood is given by
where . Choosing to be the set of product distributions, such that for all
allows us to obtain tractable expressions of . The variational estimate of is defined as
The following theorem states that the variational estimate has the same asymptotic properties as the maximum likelihood estimators discussed above; in particular, it is consistent and asymptotically normal.
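For reference, the variational lower bound and the mean-field family used in this kind of approach (Daudin et al., 2008) are usually written as below. The symbols are assumptions made for illustration: $\mathbb{Q}$ is a distribution on the assignments $Z$, $\tau_{iq}$ its marginal probabilities, $\mathcal{H}$ the entropy, and $Y^{o}$ the observed dyads.

```latex
\begin{align*}
J(\mathbb{Q}, \theta)
  &= \log p(Y^{o}; \theta)
   - \mathrm{KL}\bigl(\mathbb{Q} \,\|\, p(\cdot \mid Y^{o}; \theta)\bigr)
   = \mathbb{E}_{\mathbb{Q}}\bigl[\log p(Y^{o}, Z; \theta)\bigr]
   + \mathcal{H}(\mathbb{Q}), \\
\mathbb{Q}(Z)
  &= \prod_{i} \prod_{q} \tau_{iq}^{Z_{iq}},
  \qquad \sum_{q} \tau_{iq} = 1 \text{ for all } i, \\
\hat{\theta}^{\mathrm{var}}
  &= \arg\max_{\theta} \, \max_{\mathbb{Q}} \, J(\mathbb{Q}, \theta).
\end{align*}
```

Because the Kullback-Leibler divergence is non-negative, $J(\mathbb{Q}, \theta)$ never exceeds the observed log-likelihood, with equality only when $\mathbb{Q}$ is the exact conditional distribution of the assignments; restricting $\mathbb{Q}$ to product distributions trades this exactness for tractable expectations.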
Theorem 6.2 (Variational estimate).
Under the assumptions of Theorem 4.2, there exist permutations of such that
Close examination of the different proofs, especially of Prop. 5.10, reveals that the quantities driving convergence of the estimates are , which must go to with to ensure validity of Prop. 5.8, and , which must be larger than while , to ensure validity of Prop. 5.10. Both conditions are met as soon as , allowing for a large fraction of missing edges. Note that this limiting rate for missingness is the same as the one found for graph density in sparse settings to achieve consistency and local asymptotic normality of (Bickel et al., 2013).
In this paper, we focused on data sampled according to random dyad sampling. However, as described in Section 2.3, there are many other ways to sample a network. In the case of node-centered sampling designs, like random node sampling, the main difficulty in proving consistency and asymptotic normality is the dependency between the variables. Indeed, in random node sampling, the variable depends on all and (for all ). As a consequence, many results proved in this paper are not valid under random node sampling. NMAR sampling designs raise problems of their own: each design requires its own estimation procedure (Tabouy et al., 2019) and therefore its own analysis. For example, even parameter estimation under the double standard sampling for binary networks mentioned in Section 2.3 is still an unsolved problem: numerical experiments suggest that