Restricted Boltzmann Stochastic Block Model: A Generative Model for Networks with Attributes

11/11/2019 ∙ by Shubham Gupta, et al. ∙ Indian Institute of Science

In most practical contexts, network-indexed data consists not only of a description of the presence/absence of links, but also of attributes and information about the nodes and/or links. Building on the success of Stochastic Block Models (SBM), we propose a simple yet powerful generalization of SBM for networks with node attributes. In a standard SBM the rows of the latent community membership matrix are sampled from a multinomial. In RB-SBM, our proposed model, these rows are sampled from a Restricted Boltzmann Machine (RBM) that models a joint distribution over observed attributes and latent community membership. This model has the advantage of being simple while combining connectivity and attribute information, and it has very few tuning parameters. Furthermore, we show that inference can be done efficiently in linear time and that the model can be naturally extended to accommodate, for instance, overlapping communities. We demonstrate the performance of our model on multiple synthetic and real world networks with node attributes, where we obtain state-of-the-art results on the task of community detection.

1 Introduction

In this paper we propose a communities based statistical model for networks that have observable attributes associated with each node (also known as node covariates). We use the phrase networks with attributes to refer to such networks. Many real world networks can be modeled in this way. For example, consider a social network: the nodes represent people and the edges represent friendship. Additionally, each person has observable attributes like age, current affiliation, gender, hometown, and so on.

The notion of community relies (at least partially) on the similarity between nodes of a network. If one makes the reasonable assumption that similar nodes have a higher tendency to connect to each other, then the observed connections offer a surrogate for this similarity measure. Traditionally, community detection algorithms like spectral clustering [Luxburg:2007:ATutorialOnSpectralClustering] exploit the observed connectivity pattern in a network to discover communities. However, networks are usually sparse, and thus looking only at the adjacency matrix of a network provides a very noisy signal about node similarities. For example, for the Cora dataset that we use in our experiments, a spectral clustering method will always yield poor performance irrespective of how the number of communities is chosen (performance is reported in Table 2). This is mainly due to the presence of many nodes that only have degree one (i.e., many isolated links).

On the other hand, node attributes often provide additional information about similarity between nodes [YangEtAl:2013:CommunityDetectionInNetworksWithNodeAttributes]. A community detection method that takes these observed attributes into account is thus likely to find more meaningful communities as opposed to a method that only considers the underlying connectivity pattern. Hence, development of models and algorithms tailored towards networks with attributes is important. There are at least two approaches for incorporating node attributes while detecting communities: (i) modifying the cost function (like modularity score [Newman:2003:TheStructureAndFunctionOfComplexNetworks]) of an existing community detection algorithm to include information about node attributes, and (ii) posing the problem of community detection as the problem of performing inference in an appropriate statistical model that can incorporate domain knowledge (see Section 2 for more details).

In this paper, we consider the second approach, but rather than making domain specific assumptions, we propose a simple and flexible generative model that can be used to model networks of different types. Our approach combines the well known Stochastic Block Model (SBM) [HollandEtAl:1983:StochasticBlockmodelsFirstSteps] for modeling network connections with a variant of Restricted Boltzmann Machines (RBM) [FischerEtAl:2012:AnIntroductionToRestrictedBoltzmannMachines] for modeling the relationship between community membership of a node and its observable attributes. We call this model Restricted Boltzmann Stochastic Block Model (RB-SBM).

While other generative models that extend SBMs to networks with attributes have been proposed [BinkiewiczEtAl:2017:CovariateAssistedSpectralClustering], these approaches usually employ a very simple model of attributes that limits model flexibility. Owing to the use of RBMs, our approach is more flexible in terms of modeling the joint distribution between node attributes and community membership. Another class of models assumes that node attributes are fixed and known and hence these models are not truly generative in nature [NewmanEtAl:2016:StructureAndInferenceInAnnotatedNetworks, ZhangEtAl:2018:NodeFeaturesAdjustedStochasticBlockModel]. Finally, there are multiple application specific generative models like the models for document networks [LiuEtAl:2009:TopicLinkLDA, ChangBlei:2009:RelationalTopicModelingForDocumentNetworks] as we discuss in Section 2.

Our model applies to both directed and undirected networks. We assume that the edges are unweighted, the node attributes are binary and there are no self loops or multiple edges. The proposed model can be very naturally extended beyond this (albeit at a higher computational cost), but for the sake of concreteness and clarity we do not consider these extensions here and only briefly mention them in Appendix D. Furthermore, many real world problems can be modeled using networks that satisfy these constraints.

Our model can be used to generate networks with attributes along with ground truth communities that can help in testing the performance of new and existing community detection algorithms that take node attributes into account. Furthermore, we derive an inference procedure for our model using the Variational Expectation Maximization strategy (Section 4). This procedure runs in linear time in the number of nodes and edges, and is therefore suitable for application in large networks. We demonstrate the utility of our model on several synthetic and real world networks with attributes in Section 5.

Our contributions: (i) We propose a simple and flexible generative model for networks with attributes (Section 3) that may provide a stepping stone for development of more sophisticated models (Section 6, Appendix D), (ii) We derive an efficient approximate inference method for the proposed model (Section 4), (iii) We empirically validate the proposed model on the task of community detection and demonstrate through a qualitative case study that it can provide interpretable insights about the data (Section 5). Our approach outperforms existing approaches on Cora and Citeseer networks in terms of NMI score with respect to known ground truth community memberships.

2 Related Work

Approaches for community detection in networks with attributes can be divided into two main categories - those that are more algorithmic in nature and those based on probabilistic models. This paper belongs to the second category.

Algorithmic approaches usually modify existing algorithms for community detection in networks without attributes to make them suitable for networks with attributes. As an example, [ZhouEtAl:2010:ClusteringLargeAttributedGraphs, RuanFuhryEtAl:2013:EfficientCommunityDetectionInLargeNetworksUsingContentAndLinks] use node attributes to augment the set of edges in the network. Existing community detection algorithms are then used on this augmented network. Covariate Assisted Spectral Clustering (CASC) [BinkiewiczEtAl:2017:CovariateAssistedSpectralClustering] uses spectral clustering algorithm [Luxburg:2007:ATutorialOnSpectralClustering] on a similarity matrix that combines both node attributes and connectivity information. Spectral clustering has been adapted to networks with attributes in many different ways, see references within [BinkiewiczEtAl:2017:CovariateAssistedSpectralClustering] for more details.

Some other approaches modify the objective function of an existing algorithm to include node attributes. For example, [ZhangEtAl:2016:CommunityDetectionInNetworksWithNodeFeatures] propose joint community detection criteria, which can be seen as a modification of modularity score for networks with attributes. See references within [LiEtAl:2011:GeneralizedLatentFactorModelsForSocialNetworkAnalysis, AkogluEtAl:2012:PICS, ZhangEtAl:2016:CommunityDetectionInNetworksWithNodeFeatures] for other similar approaches. Along similar lines, [LiEtAl:2018:CommunityDetectionInAttributedGraphsAnEmbeddingsApproach] proposed Community Detection in attributed graphs: an Embedding approach (CDE) that relies on simultaneous non-negative matrix factorization of node-attribute and adjacency matrices.

In general, approaches based on probabilistic models offer more insights as compared to algorithmic approaches. For example, using our approach one can naturally find the role played by different node attributes in characterizing various communities. Both discriminative [YangEtAl:2009:CombiningLinkAndContentForCommunityDetection] as well as generative probabilistic models [CohnHofmann:2001:TheMissingLinkAProbabilisticModelOfDocumentContentAndHypertextConnectivity, EroshevaEtAl:2004:MixedMembershipModelsOfScientificPublications, ChangBlei:2009:RelationalTopicModelingForDocumentNetworks, LiuEtAl:2009:TopicLinkLDA, BalasubramanyanEtAl:2011:BlockLDA, XuEtAl:2012:AModelBasedApproachToAttributedGraphClustering, YangEtAl:2013:CommunityDetectionInNetworksWithNodeAttributes] have been proposed.

However, statistical models are usually more domain specific. For instance, most of the probabilistic models for this task have been tailored towards document networks. In this context, each node represents a document and these documents are connected to each other via hyperlinks or citations. These methods usually employ variants of Latent Dirichlet Allocation (LDA) [BleiNgJordan:2003:LatentDirichletAllocation] to model textual node attributes. For example, [EroshevaEtAl:2004:MixedMembershipModelsOfScientificPublications] proposed a model that we will call LDA-Link-Word (LLW) following [YangEtAl:2009:CombiningLinkAndContentForCommunityDetection]. The model uses LDA for modeling documents and uses the notion of communities for modeling links. [YangEtAl:2009:CombiningLinkAndContentForCommunityDetection] use a node popularity based conditional link model (PCL) and combine it with PLSA (Probabilistic Latent Semantic Analysis, which is similar to LDA) to model documents. They call this model PCL-PLSA. They also have a discriminative variant of the model which replaces PLSA with a discriminative content model to obtain PCL-DC.

Not all probabilistic models are specific to document networks. As an example, [YangEtAl:2013:CommunityDetectionInNetworksWithNodeAttributes] proposed Community Detection in Networks with Node Attributes (CESNA) for networks where node attributes are binary and communities can overlap.

All of these models either assume that community memberships determine attributes or the other way round. In contrast, we model the joint distribution between node attributes and community memberships directly, thereby avoiding this assumption. Also, we believe that the use of an RBM makes our approach applicable across multiple domains as opposed to being restricted to, for example, document networks.

3 Proposed Model

The RB-SBM model is generative and describes a network with nodes represented by a simple graph, where each node/vertex is endowed with binary attributes. Furthermore, there is a latent structure that specifies the community membership of each of the nodes, namely, each node belongs to exactly one of the communities.

Let be the binary adjacency matrix of the network (no self loops are allowed), be the binary attribute matrix and be the community membership matrix that satisfies for all . Each node belongs to exactly one community. We will use to denote the parameters of our statistical model (these will be specified later).

Our model combines a Restricted Boltzmann Machine (RBM) [FischerEtAl:2012:AnIntroductionToRestrictedBoltzmannMachines] to model the joint distribution over and and a Stochastic Block Model (SBM) [HollandEtAl:1983:StochasticBlockmodelsFirstSteps] to model the connections in given and . A high level graphical description of the model is provided in Figure 1.

3.1 Modeling Interaction between and

We assume that the joint probability mass function factorizes into where and represent the row of and respectively. We model the joint distribution over and using a RBM. However, due to the restriction that each node belongs to exactly one community, will be a one-hot encoded vector and hence the usual formulation of RBM which uses binary values for both visible and hidden units cannot be used directly. A trivial change in the computation of the partition function solves this issue. With , , and as the parameters of RBM, the joint distribution of and is given by:

(1)

Here is the normalization constant (also known as partition function). Unlike the commonly used RBM, in our case does not involve a sum over all possible binary values of , therefore

(2)

where, is treated as a row vector and and are the terms of the vectors and respectively. Further, unlike in a usual RBM, we can efficiently compute in time (see Appendix A.2). It is also easy to see that (see Appendix A.1):

(3)
(4)
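
For reference, one form of the joint distribution, its normalizer, and the two conditionals that is consistent with the description above can be written as follows. The symbol names are assumptions made for this sketch and are not confirmed by the text: $W \in \mathbb{R}^{K \times M}$ for the weights, $\mathbf{b} \in \mathbb{R}^{M}$ and $\mathbf{c} \in \mathbb{R}^{K}$ for the biases, $\mathbf{y} \in \{0,1\}^{M}$ for a node's attribute row, and $\mathbf{z}$ for its one-hot membership row over $K$ communities. The sign convention of the exponent is also an assumption.

```latex
\begin{align*}
P(\mathbf{y}, \mathbf{z})
  &= \frac{1}{C}\exp\!\big(\mathbf{z} W \mathbf{y}^{\top}
     + \mathbf{b}\,\mathbf{y}^{\top} + \mathbf{c}\,\mathbf{z}^{\top}\big), \\
C &= \sum_{k=1}^{K} e^{c_k} \prod_{j=1}^{M}\big(1 + e^{\,b_j + W_{kj}}\big), \\
P(y_j = 1 \mid z_k = 1)
  &= \frac{1}{1 + e^{-(b_j + W_{kj})}}, \\
P(z_k = 1 \mid \mathbf{y})
  &= \frac{\exp\!\big(c_k + \sum_{j} W_{kj}\, y_j\big)}
          {\sum_{l=1}^{K}\exp\!\big(c_l + \sum_{j} W_{lj}\, y_j\big)}.
\end{align*}
```

Because the sum defining $C$ runs over only $K$ one-hot states, and the sum over $\mathbf{y}$ factorizes across the $M$ attributes, the normalizer costs a number of operations proportional to $KM$ in this form.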

One can use (3) and (4) for Gibbs sampling to draw a sample from the joint distribution over attributes and community membership that is modeled by the RBM.
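
The following is a minimal NumPy sketch of an RBM with a one-hot hidden layer, showing the normalizer computed over the K one-hot states and a Gibbs sampler that alternates between the two conditionals. The names `W`, `b`, `c`, the function names, and the sign convention of the exponent are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def log_partition(W, b, c):
    """Log normalizer; sums over the K one-hot hidden states only."""
    # State k contributes exp(c_k) * prod_j (1 + exp(b_j + W[k, j])).
    per_state = c + np.logaddexp(0.0, b + W).sum(axis=1)
    return np.logaddexp.reduce(per_state)

def log_joint(y, k, W, b, c):
    """log P(y, z = e_k) for the assumed exponent W[k] @ y + b @ y + c[k]."""
    return W[k] @ y + b @ y + c[k] - log_partition(W, b, c)

def p_y_given_z(k, W, b):
    """Independent Bernoulli probability of each attribute given community k."""
    return 1.0 / (1.0 + np.exp(-(b + W[k])))

def p_z_given_y(y, W, c):
    """Softmax over the K communities given the attribute vector y."""
    logits = c + W @ y
    p = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return p / p.sum()

def gibbs_sample(W, b, c, n_steps, rng):
    """Alternate the two conditionals for n_steps and return the final (y, k)."""
    K, M = W.shape
    k = rng.integers(K)
    for _ in range(n_steps):
        y = (rng.random(M) < p_y_given_z(k, W, b)).astype(float)
        k = rng.choice(K, p=p_z_given_y(y, W, c))
    return y, k
```

A call such as `gibbs_sample(W, b, c, n_steps=100, rng=np.random.default_rng(0))` returns one attribute vector and one community index; the number of steps needed for the chain to mix is not specified here and would have to be chosen empirically.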

3.2 Modeling Connections in the Network

We use the Stochastic Block Model to model the presence/absence of edges in the network. Usually, in a SBM it is assumed that the community memberships of nodes are sampled independently and identically distributed (i.i.d.) from a multinomial distribution over communities. Then, conditioned on the community membership of the two end points and , an edge is sampled with probability independently for all values of and . Here, is the block matrix that is used in the SBM. In our case, rather than obtaining the community membership of nodes from , we sample them from the RBM using (3) and (4).

The block matrix plays a crucial role in determining the properties of the network that is being modeled. For instance, if one is looking for the traditional assortative communities, then entries on the leading diagonal of should be higher than other entries. One can similarly impose different structures on to capture different types of properties like hierarchical communities, disassortative communities and so on. Although different formulations are possible, we take a Bayesian point of view and impose a Beta prior on all entries of the matrix. The user can specify the and hyperparameters of the prior on by using domain specific knowledge about the problem.

3.3 Full description of RB-SBM

The generative process of our model can be summarized as follows:

(i) Sample for

(ii) Sample for all from the RBM in (1)

(iii) Sample for all .
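
The three sampling steps above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: the names `W`, `b`, `c`, `alpha`, `beta`, the per-node Gibbs chain of length `n_gibbs`, and the directed, dense edge sampling are all choices made here for brevity.

```python
import numpy as np

def sample_rb_sbm(N, W, b, c, alpha, beta, n_gibbs=50, seed=0):
    """Draw (A, Y, z, B) from a sketch of the generative process."""
    rng = np.random.default_rng(seed)
    K, M = W.shape

    # (i) Block matrix entries from independent Beta priors.
    #     alpha and beta may be scalars or (K, K) arrays.
    B = rng.beta(alpha, beta, size=(K, K))

    # (ii) One (attribute row, community) pair per node from the RBM,
    #      using a short Gibbs chain per node (assumed chain length).
    Y = np.zeros((N, M), dtype=int)
    z = np.zeros(N, dtype=int)
    for i in range(N):
        k = rng.integers(K)
        for _ in range(n_gibbs):
            prob_y = 1.0 / (1.0 + np.exp(-(b + W[k])))
            y = (rng.random(M) < prob_y).astype(float)
            logits = c + W @ y
            p = np.exp(logits - logits.max())
            k = rng.choice(K, p=p / p.sum())
        Y[i], z[i] = y, k

    # (iii) Edges: A[i, j] ~ Bernoulli(B[z_i, z_j]) for i != j.
    #       Dense O(N^2) sampling, for illustration only.
    edge_prob = B[np.ix_(z, z)]
    A = (rng.random((N, N)) < edge_prob).astype(int)
    np.fill_diagonal(A, 0)  # no self loops
    return A, Y, z, B
```

For an undirected network one would sample only the upper triangle and mirror it; that variant is omitted here.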

Using the independence assumptions implied by the graphical model given in Fig. 1, one can write the probability density function of as:

(5)

(Formally, this is the Radon-Nikodym derivative with respect to a dominating measure consisting of the product of counting measures for the first three arguments and Lebesgue measure for the last argument.)

The parameters are . These parameters are used by the RBM. Additionally, the model uses three hyperparameters - , and . The first two describe the prior on whereas is the number of communities. Although (5) depends on the hyperparameters as well, we have suppressed this in the notation to avoid clutter. While performing inference in the model, we consider , and to be fixed constants that are provided by the user. Note that both RBM and SBM are known to have good representation power and hence our model can be used to generate a large class of networks with various properties. In addition, as we will see in Section 4, this way of combining a RBM and a SBM allows for an efficient inference procedure. Next we highlight some noteworthy features of this model.

Figure 1: Graphical model for RB-SBM: are the model parameters that are used by the RBM whereas , and are hyperparameters. and represent the row of and respectively.

Directly modeling : While in certain settings it is reasonable to assume a direction on the edge that connects and in the graphical model, such an assumption need not always hold in practice. For example, in a friendship network, while people form communities based on shared attributes, they also acquire attributes through membership of different communities, due to the influence of peers. Modeling the joint distribution directly naturally captures both types of interactions between and .

Reduction to SBM: RB-SBM reduces to SBM if one sets and , where is the multinomial distribution from which community memberships are sampled i.i.d. in a SBM. It can be verified that in such a setting the marginal distribution for the RBM is and hence if one ignores the observed attributes, the generated network will come from an SBM with parameters (see Appendix A.3).
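
The reduction can be checked with a short calculation. As an assumption for this sketch, take the RBM weights to be $W$ and the biases on attributes and communities to be $\mathbf{b}$ and $\mathbf{c}$ (names not confirmed by the text), and suppose the settings referred to above are $W = 0$ and $c_k = \log \pi_k$, where $\pi$ is the multinomial over communities. Marginalizing the attributes out of the RBM gives:

```latex
\begin{align*}
P(z_k = 1)
  &= \sum_{\mathbf{y} \in \{0,1\}^{M}} P(\mathbf{y}, z_k = 1)
   = \frac{e^{c_k}\prod_{j=1}^{M}\big(1 + e^{\,b_j + W_{kj}}\big)}
          {\sum_{l=1}^{K} e^{c_l}\prod_{j=1}^{M}\big(1 + e^{\,b_j + W_{lj}}\big)} \\
  &= \frac{\pi_k \prod_{j=1}^{M}\big(1 + e^{\,b_j}\big)}
          {\sum_{l=1}^{K} \pi_l \prod_{j=1}^{M}\big(1 + e^{\,b_j}\big)}
   = \frac{\pi_k}{\sum_{l=1}^{K}\pi_l} = \pi_k .
\end{align*}
```

With $W = 0$ the product over attributes no longer depends on the community, so it cancels between numerator and denominator, and the memberships are i.i.d. draws from $\pi$ as in a standard SBM.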

Constraints on and : We assume that attributes are binary since a large class of attributes can be encoded using binary vectors. Even continuous valued attributes can be discretized and represented as binary attributes. Although continuous relaxations of RBMs do exist, we opted for the binary variant in favor of simplicity and ease of inference (also see Appendix D).

Utility of the model: Note that in our model, while the attributes and community membership come from a joint distribution, the observed edges in the network are only a function of community membership. This is a reasonable assumption to make considering the traditional definition of a community. If the observed network is dense enough (or equivalently, the entries of are sufficiently large) one can ignore the attributes and use a traditional community detection algorithm like Spectral Clustering [Luxburg:2007:ATutorialOnSpectralClustering] to recover the communities. However, most real world networks are very sparse. In such cases, information from the node attributes can also be exploited to recover the communities, and in these cases the model becomes especially useful. Note that there are other approaches that explicitly augment the observed networks based on to alleviate the sparsity problem [RuanFuhryEtAl:2013:EfficientCommunityDetectionInLargeNetworksUsingContentAndLinks]. However, in our approach we do not use any explicit augmentation (refer to Table 2 for a comparison against [RuanFuhryEtAl:2013:EfficientCommunityDetectionInLargeNetworksUsingContentAndLinks] on community detection task). Our approach naturally combines the node attributes and network connectivity to recover the underlying communities as we will show in the next section.

4 Inference

In practice, one observes only the connectivity structure and node attributes (i.e. and ), while the community membership and block structure is hidden from us. The main objective of inference is to discover the underlying community structure (i.e. and ) that summarizes the network. Inference is concerned with the computation of , which can then be used for tasks like assigning each node to the most probable community. For RB-SBM, exact computation of is intractable since while calculating one needs to perform a summation over choices. Thus, we resort to approximate inference techniques and use Variational Inference [BleiEtAl:2017:VariationalInferenceAReviewForStatisticians].

Given and we would like to find a distribution over the unobserved variables and along with a point estimate of the parameters and . We use a variational EM algorithm [Bishop:2006:PRML] that alternates between the following two steps:

(i) E-step: Find an approximation to the posterior distribution over and keeping and fixed.

(ii) M-step: Find point estimates of and while assuming the distribution over and found in step (i) fixed.

To understand this in more detail, let be the probability of the observed data for fixed and . It is intractable to compute and hence rather than maximizing directly, we will maximize a lower bound on that can be computed easily. We use to refer to the distribution that will approximate the true posterior . One can write

(6)

where, . The term is also known as ELBO [BleiEtAl:2017:VariationalInferenceAReviewForStatisticians]. In the E-step, is held constant and ELBO is maximized over , while in the M-step, is held constant and ELBO is maximized over . Next we describe these two steps.
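
The decomposition behind the lower bound is the standard one from variational inference. Writing the observed variables as $A$ (adjacency) and $Y$ (attributes), the hidden variables as $Z$ (memberships) and $B$ (block matrix), and the parameters as $\theta$ (all symbol names are assumptions made for this sketch), it reads:

```latex
\begin{align*}
\log P(A, Y; \theta)
  &= \mathcal{L}(q, \theta)
   + \mathrm{KL}\big(q(Z, B)\,\big\|\,P(Z, B \mid A, Y; \theta)\big), \\
\mathcal{L}(q, \theta)
  &= \mathbb{E}_{q}\big[\log P(A, Y, Z, B; \theta)\big]
   - \mathbb{E}_{q}\big[\log q(Z, B)\big].
\end{align*}
```

Since the KL divergence is non-negative, $\mathcal{L}(q, \theta)$ never exceeds the log-likelihood, and the bound is tight exactly when $q$ equals the true posterior. Maximizing over $q$ with $\theta$ fixed therefore shrinks the KL term, while maximizing over $\theta$ with $q$ fixed raises the bound itself.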

4.1 E-step

We assume that the distribution belongs to the mean field family of distributions [BleiEtAl:2017:VariationalInferenceAReviewForStatisticians], i.e.,

(7)

Under this assumption, one can show that a coordinate ascent technique can be used to get the distributions and , for and , that approximately maximize for a fixed by setting [BleiEtAl:2017:VariationalInferenceAReviewForStatisticians]:

(8)

where represents expectation with respect to all distributions on the right hand side of (7) except for the given value of . Similarly represents the expectation with respect to all distributions except . One can iterate over and in some specific order and update the corresponding distributions using (8).

The expectations given in (8) can be solved in closed form. These expressions have been derived in Appendix B. This relies on conjugacy arguments for exponential families and shows that is a Beta distribution for all and .

In a single E-step, we update only for a randomly chosen subset of nodes. One would want to be small enough so that E-step can be completed in a reasonable amount of time while at the same time being large enough to make sufficient amount of progress before the next M-step. We experimentally found that the value of works well in all our experiments even in the case when is as large as (Section 5.1). Since is a fixed constant which is not dependent on the number of nodes in the network, the overall complexity of the E-step is , where is the number of edges in the network. Details about complexity analysis have been given in Appendix B.
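
To illustrate why the block-matrix factors stay in the Beta family, the sketch below applies the standard conjugate update for a Beta prior with Bernoulli edges under a mean-field posterior over memberships. It is not taken from the paper's Appendix B; the names `Q`, `alpha`, `beta`, and the directed-pair convention (every ordered pair with i != j is one Bernoulli trial) are assumptions.

```python
import numpy as np

def update_block_posterior(A, Q, alpha, beta):
    """Return Beta parameters (a, b), each of shape (K, K).

    A     : (N, N) binary adjacency matrix with a zero diagonal.
    Q     : (N, K) matrix, Q[i, k] = current probability that node i is in k.
    alpha : prior Beta shape parameter(s), scalar or (K, K).
    beta  : prior Beta shape parameter(s), scalar or (K, K).
    """
    # Expected number of edges going from community k to community l.
    edges = Q.T @ A @ Q

    # Expected number of ordered pairs (i, j), i != j, with i in k and j in l:
    # the product of expected community sizes minus the i == j terms.
    sizes = Q.sum(axis=0)
    pairs = np.outer(sizes, sizes) - Q.T @ Q

    # Conjugacy: add expected edges to alpha and expected non-edges to beta.
    return alpha + edges, beta + (pairs - edges)
```

If `A` is stored as a sparse matrix, the product `Q.T @ A @ Q` only touches existing edges and the remaining terms depend on the nodes alone, which is in line with a cost that grows with the number of nodes and edges rather than with the number of node pairs.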

4.2 M-step

In the M-step, the distribution is held constant and is maximized over and . We do this by computing the gradient of with respect to these parameters and performing gradient ascent. In a single M-step, we perform gradient ascent updates to approximately maximize . We empirically observed that suffices for all our experiments and hence we use that value. Larger values of can be used, but without any significant gains.

The gradients can be expressed as (see Appendix C):

(9)

Above denotes the joint probability distribution over attributes and community memberships that is encoded by the RBM learned so far, and has been used as a shorthand notation for evaluated at a one hot vector for which .

One can either explicitly compute the expectation with respect to in the second term of (9) for all parameters or approximate it using Monte-Carlo estimation. We present the computation of gradients using both strategies in Appendix C. While the exact gradient computation is slightly faster, the gradient approximation via Gibbs sampling is numerically more stable.

When Monte-Carlo estimation is used, as is the standard practice with RBMs [FischerEtAl:2012:AnIntroductionToRestrictedBoltzmannMachines], we use persistent Gibbs chains which sample from using (3) and (4) to obtain samples at each M-step. These samples are then used to approximate the gradients in (9). Once the gradients have been computed, the standard gradient ascent updates can be made with learning rate , which is a user specified hyperparameter. The choice of learning rate is important, but there is a wide range of values for which the optimization procedure is stable and converges at a reasonable pace. Namely, we used in all experiments (as the ELBO scales linearly with ). The complexity of a single M-step is (see Appendix C). Thus, each iteration of E and M steps runs in time that is linear in the number of nodes and edges, which makes the method scalable to large networks. Algorithm 1 outlines the inference procedure.
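
A sketch of the exact-gradient variant of the M-step is given below. It differentiates the RBM part of the ELBO, which for a one-hot hidden layer has the usual "data term minus model expectation" form, with the model expectation available in closed form because there are only K hidden states. The names `W`, `b`, `c`, `Q`, `lr`, `n_updates`, `clip`, and the sign convention of the exponent are assumptions; the paper's own expressions are in its Appendix C.

```python
import numpy as np

def rbm_objective(W, b, c, Y, Q):
    """RBM part of the ELBO: sum_i E_q[log P(y_i, z_i)]; rows of Q sum to one."""
    log_C = np.logaddexp.reduce(c + np.logaddexp(0.0, b + W).sum(axis=1))
    scores = Y @ W.T + (Y @ b)[:, None] + c[None, :]   # (N, K) unnormalized log-probs
    return (Q * scores).sum() - Y.shape[0] * log_C

def rbm_gradients(W, b, c, Y, Q):
    """Exact gradients of rbm_objective with respect to W, b and c."""
    N = Y.shape[0]
    # Model marginal P(z = e_k), computed over the K one-hot states.
    log_pz = c + np.logaddexp(0.0, b + W).sum(axis=1)
    pz = np.exp(log_pz - np.logaddexp.reduce(log_pz))
    # Model expectation E[z_k * y_j] = P(z = e_k) * P(y_j = 1 | z = e_k).
    model_zy = pz[:, None] / (1.0 + np.exp(-(b + W)))
    grad_W = Q.T @ Y - N * model_zy
    grad_b = Y.sum(axis=0) - N * model_zy.sum(axis=0)
    grad_c = Q.sum(axis=0) - N * pz
    return grad_W, grad_b, grad_c

def m_step(W, b, c, Y, Q, lr, n_updates=1, clip=None):
    """A few gradient ascent updates, with optional element-wise clipping."""
    for _ in range(n_updates):
        gW, gb, gc = rbm_gradients(W, b, c, Y, Q)
        if clip is not None:
            gW, gb, gc = (np.clip(g, -clip, clip) for g in (gW, gb, gc))
        W, b, c = W + lr * gW, b + lr * gb, c + lr * gc
    return W, b, c
```

Replacing `model_zy` and `pz` with averages over samples from persistent Gibbs chains would give the Monte-Carlo variant described above.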

  Input: , , , , , , and
  Initialize , , and
  for  iterations do
     E-step:
     Use (8) to update for all community pairs and a random subset of nodes
     M-step:
     for  iterations do
        Obtain gradients in and update and
     end for
  end for
Algorithm 1 Inference in RB-SBM

As shown in Algorithm 1, we run the E and M steps for a fixed number of iterations that we denote by . One can also use other stopping criteria like a minimum improvement in the value of ELBO. In our experiments with real world networks, we fixed and observed that ELBO stabilizes well before this upper limit is reached.

Note that each iteration of Algorithm 1 will ensure that we approach a local optimum of . When the algorithm stops we will have an approximation to the true posterior and estimates of and . One can now set where ranges over all possible one-hot vectors of dimension to assign each node to a community. The values of and the posterior over can be used to gain additional insights about the network like the interaction pattern among various communities and the defining attributes of a community.

5 Experiments

Figure 2: Average time taken by one iteration of E and M steps as a function of the number of edges on the x-axis. The running time is linear in the number of edges.

5.1 Synthetic Networks

We generated synthetic networks with attributes using our model following the sampling procedure outlined in Section 3 for different values of . We used for all and if and otherwise. This choice roughly implies that the sparsity of sampled networks will be . The factor of ensures that as becomes large, within community edges are roughly times more likely as compared to across community edges. To choose the parameters of the RBM we suppose each of the attributes might have an assortative role (two nodes with the same attribute are more likely in the same community), a disassortative role, or a neutral role. For each community , each attribute was assigned the corresponding role with probabilities respectively , and (where ). The value of was set to , or respectively. We used the value of and for these experiments. The vector was set to for all elements and was set to .

Figure 3: NMI scores of the detected communities against the number of iterations for the first iterations. Different curves correspond to different values of .

We used , , and for these experiments. Gradients were computed by using exact calculations for the expectation in (9) as given in Appendix C.2. We also used gradient clipping where all the gradients were clipped in the range to increase the numerical stability of the inference procedure. We have also experimented with the Gibbs sampling approach here and the results were qualitatively the same. All the experiments were executed on an Intel Core i7-6700 machine with 4 GB of usable main memory.

Figure 2 shows the average total time taken by the inference procedure to execute one iteration of E and M steps as a function of the number of edges in the network. We have also indicated the number of nodes in the network for each data point in Fig. 2. It can be seen that the running time scales linearly with the number of edges.

Fig. 3 shows the Normalized Mutual Information (NMI) scores [DanonEtAl:2005:ComparingCommunityStructureIdentification] between detected community memberships and ground truth communities as a function of the number of iterations for the first iterations. The NMI takes values between 0 and 1, where larger values correspond to better performance. Since the batch size , when is large the NMI increases slowly but steadily. For example, when , after iterations each node's posterior over community membership was updated less than three times on average.
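
For concreteness, NMI between two hard partitions can be computed as in the sketch below. It uses the normalization 2 I(a; b) / (H(a) + H(b)), which is one common choice; whether it matches the exact variant used for the reported scores is an assumption. The function name `nmi` is introduced here for illustration.

```python
import numpy as np

def nmi(labels_a, labels_b):
    """Normalized mutual information between two label vectors of equal length."""
    # Map arbitrary labels to 0..n-1 so they can index a contingency table.
    a = np.unique(np.asarray(labels_a).ravel(), return_inverse=True)[1]
    b = np.unique(np.asarray(labels_b).ravel(), return_inverse=True)[1]

    # Joint distribution over (community in a, community in b).
    counts = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(counts, (a, b), 1)
    p = counts / counts.sum()
    pa, pb = p.sum(axis=1), p.sum(axis=0)

    # Mutual information, skipping empty cells (0 * log 0 is taken as 0).
    nz = p > 0
    mutual_info = (p[nz] * np.log(p[nz] / np.outer(pa, pb)[nz])).sum()

    # Entropies of the two partitions; every marginal entry is positive.
    h_a = -(pa * np.log(pa)).sum()
    h_b = -(pb * np.log(pb)).sum()
    if h_a + h_b == 0:
        return 1.0  # both partitions put all nodes in a single community
    return 2.0 * mutual_info / (h_a + h_b)
```

The score is invariant to relabeling of communities, so detected communities do not need to be matched to ground truth labels before evaluation.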

Dataset
Cora
Citeseer
Philosophers Unknown
Table 1: Details about real world network datasets used in our experiments. Citations: [LuGetoor:2003:LinkBasedClassification], [YangEtAl:2013:CommunityDetectionInNetworksWithNodeAttributes]
Method Cora Citeseer
SC (only network)
SC (only attributes)
CASC
CODICIL
CESNA
LLW
PCL-PLSA
PCL-DC
CDE
RB-SBM (Gibbs)
RB-SBM (Exact)
Table 2: Performance of RB-SBM on community detection. We have reported the mean NMI scores with standard deviation. For all other approaches the figures have been taken from the respective papers and it is not clear if the reported scores are mean-scores or max-scores. Cosine similarity kernel is used for spectral clustering when only node attributes are considered. Generative models have been prefixed with . It can be seen that our approach outperforms all other approaches that we have compared against. Citations: Spectral Clustering [Luxburg:2007:ATutorialOnSpectralClustering], [BinkiewiczEtAl:2017:CovariateAssistedSpectralClustering], [RuanFuhryEtAl:2013:EfficientCommunityDetectionInLargeNetworksUsingContentAndLinks], [YangEtAl:2013:CommunityDetectionInNetworksWithNodeAttributes], [EroshevaEtAl:2004:MixedMembershipModelsOfScientificPublications], [YangEtAl:2009:CombiningLinkAndContentForCommunityDetection], [LiEtAl:2018:CommunityDetectionInAttributedGraphsAnEmbeddingsApproach].

5.2 Real World Networks

Figure 4: Prominent members and attributes for two communities: panels (a) Members and (b) Attributes regard a community that can be interpreted as “Islamic Philosophers” and panels (c) Members and (d) Attributes correspond to a community that can be interpreted as “Legal Philosophers”.

We performed community detection on datasets given in Table 1 (more details are given in Appendix E) to partition the set of nodes into non-overlapping communities using RB-SBM. For all datasets, we initialized and if and otherwise. While in Section 5.1, and were set to generate sparse networks, here we use the values given above as a generic prior that indicates that we are looking for assortative communities. The initial value of was set to for all and all one hot vectors . For Cora and Citeseer, the number of ground truth communities is known and hence we fixed to that value for these datasets. For Philosophers network we selected . For all datasets, the inference procedure was run for iterations. The final values of were used to infer community memberships of all nodes by setting where .

For these experiments, gradients in (9) were approximated by using Gibbs sampling with persistent chains by accepting every sample. These values of and were chosen because beyond these chosen values, the running time of the inference procedure increases without affecting the final performance significantly. The performance of our method is not very sensitive to these choices.

The order in which the factors of in (7) are updated while performing coordinate ascent decides whether one should initialize or at the beginning of the inference procedure. A bad initialization of can lead to numerical overflow problems while computing the exponential in (8) for updating (see Appendix B). Thus, in all our experiments we update for all before updating any of the ’s. The initialization of ’s that has been discussed above is generic and has been empirically observed to work on multiple datasets.

If becomes very small for all and a particular for which during the first few iterations, then community effectively dies. We observed that: (i) this was a common occurrence because of numerical issues and (ii) a dead community never comes back to life. Thus, we would like to avoid very small values in during the initial stages when and the RBM parameters are essentially random. To do so we use the following simulated annealing heuristic. After the E-step, we apply the following transformation:

(10)

to for all one hot vectors and , where is the set of indices of nodes that are being updated in the current E-step. After applying the transformation , is re-normalized so that its entries sum up to one. Note that at , and for , if and if . For , this achieves the regularization effect that was desired above. In our experiments, we start with and increase it linearly to as the number of iterations increases. Empirically, we observed that this gives us better results on all datasets.

Table 2 shows the NMI scores obtained by RB-SBM on Cora and Citeseer datasets where the ground truth communities are available. It can be seen that our approach outperforms all existing approaches that we have compared against with a much simpler model. A brief description for each of the competing approaches has been given in Section 2.

For the Philosophers network, the ground truth communities are not available and hence we present a qualitative analysis of the discovered communities. Due to space constraints, we only present two of these communities. Figure 4 shows the members and attributes of two communities that can be interpreted as “Islamic Philosophers” and “Legal Philosophers”. The most relevant attributes for each community were selected based on the weights of the RBM for these two communities. It can be seen that the model was able to discover meaningful communities while at the same time highlighting the importance of various attributes for these communities.

6 Conclusion

In this paper, we presented a generative model for networks with attributes. Our model combines an RBM for modeling node attributes with an SBM for modeling node relationships. As both of these models are fairly expressive, the resultant model is simple yet flexible. One of the most attractive features of our approach is that the derived inference procedure runs in time that is linear in the number of nodes and edges in the network.

We believe that our proposed model serves as a stepping-stone for generalization of SBMs for networks with attributes and one can consider many extensions (Appendix D). For instance, it is possible to use other variants of SBM like Degree-Corrected SBM [KarrerNewman:2011:StochasticBlockmodelsAndCommunityStructureInNetworks] or Mixed-Membership SBM [Airoldi:2008:MixedMembershipStochasticBlockmodels] instead of the basic SBM currently used.

References

Appendix A Supplementary Material for Section 3

In this section, we will present proofs for the claims made in Section 3.

A.1 Conditional Probabilities in RBM

Here, we will derive the expressions presented in (3) and (4). In RBMs, conditioned on the value of (hidden units), the elements of (visible units) are independent of each other. Hence, one can write:

Here is a random vector that has all entries from except the entry. Similarly is a vector of binary values that represents an arbitrary value that can be taken by .

In our case, we have an RBM that has binary units on one side (node attributes) and one categorical unit on the other side (community membership). We can derive the conditional probability of community membership given the node attributes as:

Note that given the value of , all the elements of can be sampled independently in parallel. These conditional probabilities are used by the Gibbs sampling procedure for obtaining samples from the joint distribution that is modeled by the RBM.
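The alternating sampling described above can be sketched as follows. This is only an illustration under stated assumptions: the RBM is taken to have the standard energy with visible biases `b`, hidden biases `c` and interaction weights `W` (these names, and the function names, are hypothetical and not taken from the paper), so that the categorical unit follows a softmax and each binary unit follows a logistic sigmoid.

```python
import math
import random


def sample_z_given_x(x, W, c, rng):
    """Sample the categorical unit.

    Assumed form: P(z = k | x) is proportional to exp(c[k] + sum_j W[j][k] * x[j]).
    """
    num_communities = len(c)
    logits = [
        c[k] + sum(W[j][k] * x[j] for j in range(len(x)))
        for k in range(num_communities)
    ]
    shift = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(value - shift) for value in logits]
    return rng.choices(range(num_communities), weights=weights)[0]


def sample_x_given_z(k, W, b, rng):
    """Sample all binary units independently given community k.

    Assumed form: P(x[j] = 1 | z = k) = sigmoid(b[j] + W[j][k]).
    """
    sample = []
    for j in range(len(b)):
        prob_one = 1.0 / (1.0 + math.exp(-(b[j] + W[j][k])))
        sample.append(1 if rng.random() < prob_one else 0)
    return sample


def gibbs_chain(x0, W, b, c, steps, seed=0):
    """Run `steps` rounds of block Gibbs sampling starting from attributes x0."""
    rng = random.Random(seed)
    x = list(x0)
    k = None
    for _ in range(steps):
        k = sample_z_given_x(x, W, c, rng)
        x = sample_x_given_z(k, W, b, rng)
    return x, k
```

A persistent chain, as used in the experiments, would keep the returned `x` and pass it back in as `x0` at the next gradient step instead of restarting from the data.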

A.2 Computation of Partition Function

Suppose that the RBM models a joint distribution over and such that as described in Section 3.1. First, we find the marginal distribution over :

(11)

One can use the fact that to conclude that

(12)

A.3 Reduction to SBM

Consider the case when and for . Using these values in (A.2) and cancelling out the common terms in numerator and denominator one gets . Thus, if one samples from this distribution, then the sampled community membership vectors will follow . Following the generative process outlined in Section 3 and ignoring the node attributes, the generated will be as if it has been sampled from an SBM with parameters .
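The cancellation can be checked numerically. The sketch below assumes, as one reading of the special case, that the interaction weights are all zero and that the bias of the categorical unit for community k is the log of a target proportion; the names `W`, `b`, `c` and `pi` are illustrative and not the paper's notation. Under the standard RBM energy, summing out the binary units gives a marginal over communities proportional to exp(c[k]) times the product over j of (1 + exp(b[j] + W[j][k])).

```python
import math


def marginal_z(W, b, c):
    """Marginal over the categorical unit after summing out the binary units."""
    num_communities = len(c)
    log_unnormalized = [
        c[k] + sum(math.log1p(math.exp(b[j] + W[j][k])) for j in range(len(b)))
        for k in range(num_communities)
    ]
    shift = max(log_unnormalized)  # stabilise the exponentials
    weights = [math.exp(value - shift) for value in log_unnormalized]
    total = sum(weights)
    return [w / total for w in weights]


pi = [0.5, 0.3, 0.2]               # target community proportions (illustrative)
b = [0.7, -1.2]                    # arbitrary biases of the binary units
W = [[0.0] * len(pi) for _ in b]   # zero weights decouple attributes from communities
c = [math.log(p) for p in pi]      # categorical biases set to log proportions

recovered = marginal_z(W, b, c)
```

With zero weights the product over attributes is the same for every community, so it cancels in the normalization and `recovered` matches `pi` up to floating point error.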

Appendix B Supplementary Material for E-step

In this section, we will derive the expressions for updating and that are used in Section 4. Using the independence assumptions implied by the graphical model given in Fig. 1, we can write the following expression for :

(13)

Here, is the Beta function.

B.1 Updating

The optimal value of is given by (4.1). In this section, we evaluate . Using (B), the fact that belongs to mean field family of distributions and by linearity of expectation, one arrives at the following expression:

(14)

Note that and are observed quantities and hence they are fixed. All the terms that do not depend on have been absorbed in the constant. We have used to denote . Using (B.1) in (4.1) and by omitting the constants, we get:

(15)

This is the exponential family form of a Beta distribution. Thus, we can conclude that , where

(16)
(17)

Naively computing and will incur cost which is prohibitively large for big networks, but there is an efficient way to compute these terms, namely

(18)

where is the set of edges in the network. Now both (16) and (17) can be computed in time. One needs to compute (16) and (17) for all . Thus, the total cost of updating for all community pairs is . However, the computation over all community pairs can be done in parallel and hence the effective time needed is . Note that all of these computations are exact and we have not used any approximation.
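The saving comes from never enumerating non-edges. The sketch below illustrates the idea under assumptions that are not spelled out in the text above: the two sums are taken to be the expected number of edges and of non-edges between a community pair (k, l), accumulated over ordered node pairs with i different from j, with `q[i][k]` the current variational probability that node i belongs to community k. The function names are hypothetical.

```python
def pair_sums_naive(q, edges, k, l):
    """Quadratic reference: loop over every ordered pair of distinct nodes."""
    num_nodes = len(q)
    edge_set = set(edges)
    edge_sum = 0.0
    non_edge_sum = 0.0
    for i in range(num_nodes):
        for j in range(num_nodes):
            if i == j:
                continue
            weight = q[i][k] * q[j][l]
            if (i, j) in edge_set:
                edge_sum += weight
            else:
                non_edge_sum += weight
    return edge_sum, non_edge_sum


def pair_sums_fast(q, edges, k, l):
    """Linear version: one pass over the edges and one pass over the nodes."""
    edge_sum = sum(q[i][k] * q[j][l] for i, j in edges)
    mass_k = sum(row[k] for row in q)
    mass_l = sum(row[l] for row in q)
    self_pairs = sum(row[k] * row[l] for row in q)
    # All ordered pairs with i != j, minus the pairs that are edges.
    non_edge_sum = mass_k * mass_l - self_pairs - edge_sum
    return edge_sum, non_edge_sum


q = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5], [0.3, 0.7]]
# An undirected graph stored as ordered pairs in both directions.
edges = [(0, 1), (1, 0), (2, 3), (3, 2), (0, 2), (2, 0)]
```

Both functions return the same pair of numbers for every (k, l); only the second avoids the loop over all node pairs.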

B.2 Updating

Proceeding as in the case of we get

(19)

All the terms that do not depend on have been absorbed in the constant.

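Here, $\psi$ is the digamma function and we have used the fact that if a random variable $X \sim \mathrm{Beta}(a, b)$, then $\mathbb{E}[\log X] = \psi(a) - \psi(a + b)$. Also, note that if $X \sim \mathrm{Beta}(a, b)$, then $1 - X \sim \mathrm{Beta}(b, a)$.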
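The expectation used here is the standard identity E[log X] = psi(a) - psi(a + b) for X distributed as Beta(a, b). The following sketch checks it against a Monte Carlo estimate. Since the Python standard library has no digamma function, the helper `digamma` below is a hand-written approximation (recurrence plus a truncated asymptotic series) and all names are illustrative.

```python
import math
import random


def digamma(x):
    """Approximate psi(x) for x > 0.

    Uses psi(x) = psi(x + 1) - 1/x to shift x up to at least 6, then the
    asymptotic series ln x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6).
    """
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv_sq = 1.0 / (x * x)
    series = inv_sq * (1.0 / 12.0 - inv_sq * (1.0 / 120.0 - inv_sq / 252.0))
    return result + math.log(x) - 0.5 / x - series


def expected_log_beta(a, b):
    """E[log X] for X ~ Beta(a, b)."""
    return digamma(a) - digamma(a + b)


def monte_carlo_log_mean(a, b, num_samples=20000, seed=0):
    """Empirical mean of log X over samples X ~ Beta(a, b)."""
    rng = random.Random(seed)
    total = sum(math.log(rng.betavariate(a, b)) for _ in range(num_samples))
    return total / num_samples
```

By the second fact above, E[log(1 - X)] is obtained from the same function by swapping the arguments, that is, `expected_log_beta(b, a)`.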

Using (B.2) in (4.1) and by omitting the constants, one can write

(20)

One can compute and for all once at the beginning of the E-step. This can be done in time. The inner summations over in (B.2) can also be computed once at the beginning of the E-step in time. These values can be reused while computing the outer summations over and in (B.2). Thus, given these quantities, the unnormalized value of can be computed in time. Since we need to compute for all values of , the total time needed to update is .

As before, the computations over all community pairs can be done in parallel. Also, the unnormalized value of can be computed in parallel for all values of . Thus the effective running time for these updates is where is the number of nodes in the random subset for which will be updated during the current E-step. Since is a constant independent of , we can omit it. Using all of this information, it can be seen that, as claimed in Section 4.1, one iteration of the E-step takes time.

Appendix C Supplementary Material for M-step

We will first derive the expression for gradients given in (4.2) and then present two ways of computing these gradients. Recall that

(21)

The second term has been treated as a constant since it does not depend on or . As done in Appendix B, using the mean field assumption on and linearity of expectation one can compute

(22)

All the terms that are independent of and have been absorbed in the constant. Differentiating (C) with respect to we get: