On identifying unobserved heterogeneity in stochastic blockmodel graphs with vertex covariates

07/04/2020
by   Cong Mu, et al.
University of Pittsburgh

Both observed and unobserved vertex heterogeneity can influence block structure in graphs. To assess these effects on block recovery, we present a comparative analysis of two model-based spectral algorithms for clustering vertices in stochastic blockmodel graphs with vertex covariates. The first algorithm directly estimates the induced block assignments by investigating the estimated block connectivity probability matrix including the vertex covariate effect. The second algorithm estimates the vertex covariate effect and then estimates the induced block assignments after accounting for this effect. We employ Chernoff information to analytically compare the algorithms' performance and derive the Chernoff ratio formula for some special models of interest. Analytic results and simulations suggest that, in general, the second algorithm is preferred: we can better estimate the induced block assignments by first estimating the vertex covariate effect. In addition, real data experiments on a diffusion MRI connectome data set indicate that the second algorithm has the advantages of revealing underlying block structure and taking observed vertex heterogeneity into account in real applications. Our findings emphasize the importance of distinguishing between observed and unobserved factors that can affect block structure in graphs.



I Introduction

In network inference applications, it is important to distinguish different factors such as vertex covariates and underlying vertex block assignments that can lead to networks with different latent communities. As a special case of random graph models, stochastic blockmodel (SBM) graphs are popular in the literature for community detection [Abbe2018, Holland1983, Karrer2011]. Inference in SBMs extended to include vertex covariates relies on either variational methods [Choi2012, Roy2019, Sweet2015] or spectral approaches that promise applicability to large graphs [Binkiewicz2017, Huang2018, Mele2019]. Spectral methods [Von2007] have been widely used in random graph models for a variety of subsequent inference tasks such as community detection [Lyzinski2014, Lyzinski2016, McSherry2001, Rohe2011], vertex nomination [Lyzinski2019], nonparametric hypothesis testing [Tang2017], and multiple graph inference [Wang2019]. Two particular spectral embedding methods, adjacency spectral embedding (ASE) and Laplacian spectral embedding (LSE), are popular since they enjoy nice properties including consistency [Sussman2012] and asymptotic normality [Athreya2016, Tang2018]. To compare the performance of these two embedding methods, the concept of Chernoff information was first employed for SBMs [Tang2018] and then extended to consider the underlying graph structure [Cape2019].

One problem of interest in the hypothesis testing framework is to assess the influence of unobserved vertex heterogeneity on outcome variables while controlling for the vertex covariate effect [Hao2020, Shalizi2016]. In a -block SBM, this amounts to testing whether for given , where are outcome variables and is the induced block assignment for vertex . To achieve this goal, it is crucial to estimate the block structure after accounting for the vertex covariate effect. Here we use “induced block assignment” to refer to the block assignment after accounting for the vertex covariate effect, since the number of blocks can change: for example, an induced 2-block SBM, with each of its two blocks split into two via the effect of a binary vertex covariate, becomes a 4-block SBM. We shall address this concept in detail in Section II.

In this article, we investigate two model-based spectral algorithms for clustering vertices in stochastic blockmodel graphs with vertex covariates. Analytically, we compare the algorithms’ performance via Chernoff information and derive the Chernoff ratio formula for special models of interest. We shall address the notion of Chernoff information for comparing algorithms in detail in Section IV. Practically, we compare the algorithms’ actual clustering performance by simulations and real data experiments on a diffusion MRI connectome data set.

The structure of this article is summarized as follows. Section II reviews relevant models for random graphs and the basic idea of spectral methods. Section III introduces our model-based spectral algorithms for clustering vertices in stochastic blockmodel graphs with vertex covariates. Section IV analytically compares the algorithms’ performance via Chernoff information and derives the Chernoff ratio formula for special models of interest. Section V provides simulations and real data experiments on a diffusion MRI connectome data set to compare the algorithms’ performance in terms of actual clustering performance. Section VI discusses the findings and presents some open questions for further investigation. Appendix A and Appendix B provide technical details for latent position geometry and analytic derivations of the Chernoff ratio.

II Models and Spectral Methods

We consider the latent position model [Hoff2002, Handcock2007] for edge-independent random graphs in which each vertex is associated with a latent position where is some latent space such as , and edges between vertices arise independently with probability for some kernel function . In particular, we focus on the generalized random dot product graph (GRDPG) where the kernel function is taken to be the (indefinite) inner product, which can include more flexible SBMs as special cases.

Definition 1 (Generalized Random Dot Product Graph [Rubin-Delanchy2017]).

Let with and . Let be a -dimensional inner product distribution with on satisfying for all . Let be an adjacency matrix and where , i.i.d. for all . Then we say if where for any .

As a special case of the GRDPG model, the SBM can be used to model block structure in edge-independent random graphs.

Definition 2 (-block Stochastic Blockmodel Graph [Holland1983]).

The -block stochastic blockmodel (SBM) graph is an edge-independent random graph with each vertex belonging to one of blocks. It can be parametrized by a block connectivity probability matrix and a nonnegative vector of block assignment probabilities summing to unity. Let be an adjacency matrix and be a vector of block assignments with if vertex is in block (occurring with probability ). We say if where for any .

Let as in Definition 2 where with strictly positive eigenvalues and strictly negative eigenvalues. To represent this SBM in the GRDPG model, we can choose where such that for all . For example, we can take where is the spectral decomposition of after re-ordering. Then we have the latent position of vertex as if . As an illustration, consider the prototypical 2-block SBM with rank one block connectivity probability matrix where with . Let be the latent position of vertex where if and if . Then we can represent this SBM in the GRDPG model with latent positions as

(1)
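As a numerical sketch of this rank one construction (the probability values and variable names below are our own assumptions for illustration, not fixed by the text), the block connectivity probability matrix is the outer product of a two-entry latent vector with itself:

```python
import numpy as np

# Hypothetical instantiation of the prototypical 2-block rank one SBM:
# block connectivity probabilities B[k, l] = nu[k] * nu[l], so a single
# scalar latent position per block suffices.
p, q = 0.7, 0.3  # assumed block-level latent positions, not from the text
nu = np.array([p, q])
B = np.outer(nu, nu)  # [[p*p, p*q], [q*p, q*q]]
```

A vertex in block k then has latent position nu[k], and every edge probability is the product of its endpoints' latent positions, which is exactly the GRDPG representation with one positive dimension.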

An extension of GRDPG taking vertex covariates into consideration is available.

Definition 3 (GRDPG with Vertex Covariates [Mele2019]).

Consider GRDPG as in Definition 1. Let denote observed vertex covariates. Then we say if where for any with link functions and .

Remark 1.

A special case of the model in Definition 3 is to use the indicator function as and the identity function as with one binary covariate. That is, for any or with . In the case of an SBM, we have .
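A minimal sampler for this special case can be sketched as follows (all parameter values are assumptions for illustration): under the identity link, an edge probability is the block connectivity probability plus the covariate effect whenever the two endpoints share the covariate value.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-block SBM with one binary vertex covariate under the
# identity link of Remark 1: P_ij = B[tau_i, tau_j] + beta * 1{z_i = z_j}.
n, beta = 200, 0.1
B = np.array([[0.3, 0.1],
              [0.1, 0.25]])
tau = rng.integers(0, 2, size=n)   # induced block assignments
z = rng.integers(0, 2, size=n)     # observed binary vertex covariates

P = B[np.ix_(tau, tau)] + beta * (z[:, None] == z[None, :])
A = (rng.random((n, n)) < P).astype(int)
A = np.triu(A, 1)
A = A + A.T  # symmetric, hollow adjacency matrix
```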

Example 1 (2-block Rank One Model with One Binary Covariate).

As an illustration, consider the rank one matrix in Eq. (1) and the SBM model in Remark 1. Let denote the observed binary covariate. Assume with . Then we have the block connectivity probability matrix with the vertex covariate effect as

(2)
Example 2 (2-block Homogeneous Model with One Binary Covariate).

As a second illustration, consider the rank two matrix where with . The SBMs parametrized by this lead to the notion of the homogeneous model [Abbe2018, Cape2019]. For the -block homogeneous model, we have for and for . Assume with . We then have the block connectivity probability matrix with the vertex covariate effect as

(3)

Note that in both of these examples, an induced 2-block SBM becomes a 4-block SBM via the effect of a binary vertex covariate. The goal is to cluster each vertex into one of the two induced blocks after accounting for the vertex covariate effect.
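The block-splitting described above can be made concrete with a short sketch (B and beta are assumed illustrative values): indexing the four induced blocks by pairs of (induced block, covariate value), the 4-block connectivity matrix adds the covariate effect exactly on the matching-covariate pairs.

```python
import numpy as np

# Under the identity link with one binary covariate, the 4-block matrix is
# B4[(k, a), (l, b)] = B[k, l] + beta * 1{a == b}.
B = np.array([[0.3, 0.1],
              [0.1, 0.25]])
beta = 0.1

pairs = [(k, a) for k in range(2) for a in range(2)]
B4 = np.array([[B[k, l] + beta * (a == b) for (l, b) in pairs]
               for (k, a) in pairs])
```

The goal of the algorithms below is to recover the two induced blocks (the k index) from a graph whose observed block structure is B4.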

Definition 4 (Adjacency Spectral Embedding).

Let be an adjacency matrix with eigendecomposition where are the magnitude-ordered eigenvalues and are the corresponding orthonormal eigenvectors. Given the embedding dimension , the adjacency spectral embedding (ASE) of into is the matrix where and .

Remark 2.

There are different methods for choosing the embedding dimension [Hastie2009, Jolliffe2016]; we adopt the simple and efficient profile likelihood method [Zhu2006] to automatically identify the “elbow,” i.e., the cut-off between the signal dimensions and the noise dimensions in the scree plot.

In this article, we will focus on applying ASE for our inference task. The adaptation of our algorithms and analytic derivations to the Laplacian spectral embedding can be a valuable future contribution.
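A minimal sketch of the ASE of Definition 4 (our own helper, written for a dense symmetric adjacency matrix; it keeps the d eigenvalues largest in magnitude, since the GRDPG model allows negative eigenvalues):

```python
import numpy as np

def ase(A, d):
    """Adjacency spectral embedding (sketch of Definition 4)."""
    # Full eigendecomposition of the symmetric adjacency matrix.
    evals, evecs = np.linalg.eigh(A)
    # Keep the d eigenvalues largest in magnitude.
    idx = np.argsort(np.abs(evals))[::-1][:d]
    # Scale the eigenvectors by the square root of the eigenvalue magnitudes.
    return evecs[:, idx] * np.sqrt(np.abs(evals[idx]))
```

For example, `ase(A, 2)` returns an n-by-2 matrix whose rows estimate the latent positions up to an indefinite-orthogonal transformation.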

III Model-based Subsequent Inference via Spectral Methods

We are interested in the inference task of estimating the induced block assignments in a SBM with vertex covariates. To that end, we also consider algorithms for estimating the vertex covariate effect, which can be further used to estimate the induced block assignments. For simplicity, we consider all algorithms with identity link and one binary covariate as in Remark 1. Generalization to the case with other link functions and more than one covariate can be a valuable future contribution.

Input: Adjacency matrix .
Output: Block assignments including the vertex covariate effect as ; induced block assignments after accounting for the vertex covariate effect as .
1. Estimate the latent positions including the vertex covariate effect as using the ASE of , where is chosen as in Remark 2.
2. Cluster using Gaussian mixture modeling (GMM) to estimate the block assignments including the vertex covariate effect as , where is chosen via the Bayesian Information Criterion (BIC).
3. Compute the estimated block connectivity probability matrix including the vertex covariate effect as , where is the matrix of estimated cluster means.
4. Cluster the diagonal of using GMM to estimate the cluster assignments of the diagonal as .
5. Estimate the induced block assignments as by for and .
Algorithm 1: Estimation of induced block assignments including the vertex covariate effect

Note that in Algorithm 1, the estimation of the induced block assignments, i.e., , depends heavily on the estimated block connectivity probability matrix . This suggests that we may not obtain an accurate estimate of the induced block assignments if is not well-structured, which is often the case in real applications. Thus we propose a modified algorithm that uses additional information from the vertex covariates to estimate the induced block assignments along with the vertex covariate effect.
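A minimal end-to-end sketch of Algorithm 1 (function and variable names are our own; we use scikit-learn's GaussianMixture, fix the numbers of clusters rather than selecting them by BIC, and assume a positive semidefinite model so that the estimated connectivity matrix can be formed directly from the cluster means):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def algorithm1(A, d, K4, K, seed=0):
    # Step 1: ASE of A with magnitude-ordered eigenvalues (Definition 4).
    evals, evecs = np.linalg.eigh(A)
    idx = np.argsort(np.abs(evals))[::-1][:d]
    Xhat = evecs[:, idx] * np.sqrt(np.abs(evals[idx]))
    # Step 2: GMM on the embedding; K4 plays the role of the number of
    # blocks including the covariate effect (chosen by BIC in the paper).
    gmm = GaussianMixture(n_components=K4, random_state=seed).fit(Xhat)
    tau_hat = gmm.predict(Xhat)
    # Step 3: estimated block connectivity matrix from the cluster means
    # (valid as written only in the positive semidefinite case).
    mu = gmm.means_
    B_hat = mu @ mu.T
    # Steps 4-5: cluster the diagonal of B_hat into K induced blocks and
    # map each vertex through its cluster's diagonal label.
    diag_labels = GaussianMixture(n_components=K, random_state=seed).fit_predict(
        np.diag(B_hat).reshape(-1, 1))
    xi_hat = diag_labels[tau_hat]
    return tau_hat, xi_hat
```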

Input: Adjacency matrix ; observed vertex covariates .
Output: Block assignments including the vertex covariate effect as ; induced block assignments after accounting for the vertex covariate effect as ; estimated vertex covariate effect as .
1. Run Steps 1–4 of Algorithm 1.
2. Estimate the vertex covariate effect as using one of the following procedures [Mele2019]:
(a) Assign the block covariates as for each block using the mode, i.e., where . Construct the pair set . Estimate the vertex covariate effect as .
(b) Compute the probability that two entries from form a pair as where . Construct the pair set . Estimate the vertex covariate effect as .
3. Account for the vertex covariate effect by , where is either or .
4. Estimate the latent positions after accounting for the vertex covariate effect as using the ASE of , where is chosen as in Remark 2.
5. Cluster using GMM to estimate the induced block assignments after accounting for the vertex covariate effect as .
Algorithm 2: Estimation of induced block assignments after accounting for the vertex covariate effect

As an illustration of estimating (Step 2 in Algorithm 2), consider the block connectivity probability matrix as in Eq. (3). To get , we can subtract two specific entries of . For example,

(4)

Then we can get by subtracting two specific entries of . However, ASE and GMM under the GRDPG model can lead to a re-ordering of . Thus we need to identify pairs first so that we subtract the correct entries.

In Step 2(a), we find pairs in by first assigning common covariates to each block using the mode. However, it is possible that we cannot find any pairs using this approach, especially in unbalanced cases where the block sizes differ and/or the distribution of the vertex covariate differs across blocks; for example, when one block is much larger than the others and/or the vertex covariates are all the same within one block.

In Step 2(b), instead of first finding pairs using the mode, we only compute the probability that two entries of form a pair. This makes the estimation more robust to extreme cases or special structures.
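The adjustment in Step 3 of Algorithm 2 can be sketched as follows under the identity link with one binary covariate (a hypothetical helper of our own; beta_hat would come from Step 2(a) or 2(b)):

```python
import numpy as np

def adjust_for_covariate(A, z, beta_hat):
    # Step 3 of Algorithm 2 (sketch): subtract the estimated covariate
    # effect from entries whose endpoints share the same covariate value,
    # so the re-embedded matrix reflects only the induced block structure.
    same = (z[:, None] == z[None, :]).astype(float)
    np.fill_diagonal(same, 0.0)  # leave the hollow diagonal untouched
    return A - beta_hat * same
```

The returned matrix is then re-embedded via ASE and clustered with GMM into the induced blocks (Steps 4-5).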

IV Spectral Inference Performance

IV-A Chernoff Ratio

There are different metrics for comparing spectral inference performance, such as within-class covariance and Chernoff information [Athreya2017, Karrer2011, Tang2018]. The within-class covariance depends on which clustering procedure is used, specifically -means, whereas Chernoff information is independent of the clustering procedure and is intrinsically related to the Bayes risk. We employ Chernoff information to compare the performance of Algorithm 1 and Algorithm 2 for estimating the induced block assignments in SBMs with vertex covariates. Let and be two continuous multivariate distributions on with density functions and . The Chernoff information [Chernoff1952, Chernoff1956] is defined as

(5)

Consider the special case where we take and ; then the corresponding Chernoff information is

(6)

where . For a given embedding method such as ASE in Algorithm 1 and Algorithm 2, comparison via Chernoff information is based on the statistical information between the limiting distributions of the blocks; smaller statistical information implies less ability to discriminate between different blocks of the SBM. To that end, we also review the limiting results of ASE for the SBM, which are essential for investigating Chernoff information.
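When the two distributions are multivariate Gaussians, as in the limiting results below, the supremum defining the Chernoff information is one-dimensional and easy to evaluate numerically. A sketch (our own helper, using the standard closed form for the Chernoff divergence between Gaussians):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def chernoff_gaussians(mu1, Sigma1, mu2, Sigma2):
    # Chernoff information between N(mu1, Sigma1) and N(mu2, Sigma2):
    # sup over t in (0, 1) of
    #   t(1-t)/2 * (mu1-mu2)^T Sigma_t^{-1} (mu1-mu2)
    #   + 1/2 * log( |Sigma_t| / (|Sigma1|^{1-t} |Sigma2|^t) ),
    # where Sigma_t = (1-t) Sigma1 + t Sigma2.
    d = np.asarray(mu1) - np.asarray(mu2)

    def divergence(t):
        St = (1 - t) * Sigma1 + t * Sigma2
        quad = t * (1 - t) / 2 * d @ np.linalg.solve(St, d)
        logdet = 0.5 * (np.linalg.slogdet(St)[1]
                        - (1 - t) * np.linalg.slogdet(Sigma1)[1]
                        - t * np.linalg.slogdet(Sigma2)[1])
        return quad + logdet

    res = minimize_scalar(lambda t: -divergence(t),
                          bounds=(1e-6, 1 - 1e-6), method="bounded")
    return -res.fun
```

With equal covariances the optimum is attained at t = 1/2 and reduces to one eighth of the squared Mahalanobis distance between the means.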

Theorem 1 (CLT of ASE for SBM [Rubin-Delanchy2017]).

Let be a sequence of adjacency matrices and associated latent positions of a -dimensional GRDPG as in Definition 1 from an inner product distribution where is a mixture of point masses in , i.e.,

(7)

where is the Dirac delta measure at . Let denote the cumulative distribution function (CDF) of a multivariate Gaussian distribution with mean and covariance matrix , evaluated at . Let be the ASE of with as the -th row (same for ). Then there exists a sequence of matrices satisfying such that for all and fixed index i,

(8)

where for

(9)
Remark 3.

If the adjacency matrix is sampled from an SBM parameterized by the block connectivity probability matrix in Eq. (1) and block assignment probabilities with , then as a special case of Theorem 1 [Athreya2017, Tang2018], we have for each fixed index ,

(10)

where

(11)

Now for a -block SBM, let be the block connectivity probability matrix and be the vector of block assignment probabilities. Given an -vertex instantiation of the SBM parameterized by and , for sufficiently large , the large-sample optimal error rate for estimating the block assignments using ASE can be measured via Chernoff information as [Athreya2017, Tang2018]

(12)

where , and are defined as in Eq. (9). Also note that as , the logarithm term in Eq. (12) will be dominated by the other term. Then we have the Chernoff ratio as

(13)

Here and are associated with Algorithm 1 and Algorithm 2, respectively. If , then Algorithm 1 is preferred; otherwise Algorithm 2 is preferred.

IV-B 2-block Rank One Model with One Binary Covariate

As an illustration of using Chernoff ratio in Eq. (13) to compare the performance of Algorithm 1 and Algorithm 2 for estimating the induced block assignments, we consider the 2-block SBM with one binary covariate parametrized by the block connectivity probability matrix as in Eq. (2). In addition, we consider the balanced case where and with the assumption that and for and . Via the idea of Cholesky decomposition, we can re-write as

(14)

where . Elementary calculations yield the canonical latent positions as

(15)

For this model, the block connectivity probability matrix as in Eq. (2) is positive semidefinite with . Then we have and we can omit it in our analytic derivations. With the canonical latent positions in Eq. (15), the only remaining term to derive for Chernoff ratio is in Eq. (13). For , define

(16)

where

(17)

Then we can re-write in Eq. (9) as

(18)

and from Eq. (13) as

(19)

To evaluate the Chernoff ratio, we also define for

(20)

By the symmetric structure of as in Eq. (2) and the balanced assumption, we observe that . Thus we need only to evaluate . Subsequent calculations and simplification yield

(21)

where for

(22)

Then we have the approximate Chernoff information for Algorithm 1 as

(23)

where for are defined as in Eq. (21). For this model, there is no tractable closed form for and , but numerical experiments can be used to obtain . By Remark 3 and similar calculations [Athreya2017, Tang2018], we have the approximate Chernoff information for Algorithm 2 as

(24)

where are defined as in Eq. (11) and are defined as in Eq. (22).

Figure 1 shows the Chernoff ratio when we fix and take in the 2-block rank one model with one binary covariate. We can see that for most of the region, and only when and are relatively large. Recall that the performance of Algorithm 1 depends heavily on the estimated block connectivity probability matrix . Large and lead to a relatively well-structured , and thus Algorithm 1 can perform better in this region.

Fig. 1: Chernoff ratio as in Eq. (13) for 2-block rank one model, .

IV-C 2-block Homogeneous Model with One Binary Covariate

Now we consider the 2-block SBM with one binary covariate parametrized by the block connectivity probability matrix as in Eq. (3). We also consider the balanced case where and with the assumption that and for and . Similarly, the idea of Cholesky decomposition and elementary calculations yield the canonical latent positions as

(25)

Observe that for this model, the block connectivity probability matrix as in Eq. (3) is also positive semidefinite with . Then we have and we can omit it in the derivations, as for the 2-block rank one model. To evaluate the Chernoff ratio, we again investigate the quantities defined in Eq. (20). Similar observations suggest that . Thus we only need to evaluate . Subsequent calculations and simplification yield

(26)

where for and

(27)

Then we have the approximate Chernoff information for Algorithm 1 as

(28)

where for are defined as in Eq. (26). Also observe that

(29)

where

(30)

Then we can further simplify as

(31)

By the same derivations [Cape2019], we have the approximate Chernoff information for Algorithm 2 as

(32)

where and are defined as in Eq. (27). We then have the general Chernoff ratio formula as follows.

Corollary 1.

For 2-block homogeneous balanced model with one binary covariate parametrized by as in Eq. (3) and , the Chernoff ratio as in Eq. (13) can be derived analytically as