Theoretical and Computational Guarantees of Mean Field Variational Inference for Community Detection

10/30/2017, by Anderson Y. Zhang, et al., Yale University

The mean field variational Bayes method is becoming increasingly popular in statistics and machine learning. Its iterative Coordinate Ascent Variational Inference (CAVI) algorithm has been widely applied to large scale Bayesian inference; see Blei et al. (2017) for a recent comprehensive review. Despite the popularity of the mean field method, there exists remarkably little fundamental theoretical justification for it. To the best of our knowledge, the iterative algorithm has never been investigated for any high dimensional and complex model. In this paper, we study the mean field method for community detection under the Stochastic Block Model. For an iterative Batch Coordinate Ascent Variational Inference algorithm, we show that it has a linear convergence rate and converges to the minimax rate within $\log n$ iterations. This complements the results of Bickel et al. (2013), which studied the global minimum of the mean field variational Bayes and obtained asymptotically normal estimation of global model parameters. In addition, we obtain similar optimality results for Gibbs sampling and an iterative procedure for computing the maximum likelihood estimate, which can be of independent interest.

1 Introduction

A major challenge of large scale Bayesian inference is the computation of the posterior distribution. For high dimensional and complex models, exact calculation of the posterior is often computationally intractable. To address this challenge, the mean field variational method [2, 19, 30] is used to approximate posterior distributions in a wide range of applications in many fields, including natural language processing [6, 22], computational neuroscience [14, 26], and network science [1, 8, 17]. This method is different from Markov chain Monte Carlo (MCMC) [13, 28], another popular approximation algorithm. Each variational inference update is deterministic, while MCMC is a randomized sampling algorithm; as a result, for large-scale data analysis the mean field variational Bayes usually converges faster than MCMC [7], which is particularly attractive in the big data era.

In spite of a wide range of successful applications of the mean field variational Bayes, its fundamental theoretical properties are rarely investigated. The existing literature [3, 8, 31, 33, 34] mostly concerns low dimensional parameter estimation and the global minimum of the variational Bayes method. For example, in a recent inspiring paper, Wang and Blei [32] studied the frequentist consistency of the variational method for a general class of latent variable models. They obtained consistency for low dimensional global parameters and further showed asymptotic normality, assuming the global minimum of the variational Bayes method can be achieved. However, it is often computationally infeasible to attain the global minimum when the model is high dimensional or complex. This motivates us to investigate the statistical properties of the mean field method in high dimensional settings, and more importantly, to understand the statistical and computational guarantees of iterative variational inference algorithms.

The success and popularity of the mean field method in Bayesian inference lie mainly in the success of its iterative algorithm, Coordinate Ascent Variational Inference (CAVI) [7], which provides a computationally efficient way to approximate the posterior distribution. It is important to understand what statistical properties CAVI has and how they compare to the optimal statistical accuracy. In addition, we want to investigate how fast CAVI converges for the purpose of implementation. With the ambition of establishing a universal theory of the mean field iterative algorithm for general models in mind, in this paper we consider the community detection problem [4, 24, 25, 1, 12, 35] under the Stochastic Block Model (SBM) [18, 4, 29, 21] as our first step.

Community detection has been an active research area in recent years, with the SBM as a popular choice of model. The Bayesian framework and variational inference for community detection are considered in [3, 11, 1, 8, 17, 27]. For high dimensional settings, Celisse et al. [8] and Bickel et al. [3] are arguably the first to study the statistical properties of the mean field method for SBMs. The authors built an interesting connection between the full likelihood and the variational likelihood, and then studied the closeness of the maximum likelihood and the maximum variational likelihood estimators, from which they obtained consistency and asymptotic normality for global parameter estimation. From a personal communication with the authors of Bickel et al. [3], an implication of their results is that the variational method achieves exact community recovery under a strong signal-to-noise ratio (SNR) condition. Their analysis idea is fascinating, but it is not clear whether it can be extended to other SNR conditions under which exact recovery may never be possible. More importantly, it may not be computationally feasible to maximize the variational likelihood for the SBM, as seen from Theorem 2.1.

In this paper, we consider the statistical and computational guarantees of the iterative variational inference algorithm for community detection. The primary goal of community detection is to recover the community membership in a network. We measure the performance of the iterative variational inference algorithm by comparing its output with the ground truth. Denote the underlying ground truth by $Z^*$. For a network of $n$ nodes and $k$ communities, $Z^*$ is an $n \times k$ matrix with each row a standard Euclidean basis vector in $\mathbb{R}^k$; the index of the non-zero coordinate of each row gives the community assignment of the corresponding node. We propose an iterative algorithm called Batch Coordinate Ascent Variational Inference (BCAVI), a slight modification of CAVI with batch updates, to make parallel and distributed computing possible. Let $\pi^{(s)}$ denote the output of the $s$-th iteration, an $n \times k$ matrix with nonnegative entries. Each row of $\pi^{(s)}$ sums to 1 and is interpreted as an approximate posterior probability of assigning the corresponding node to each of the $k$ communities. The performance of $\pi^{(s)}$ is measured by an $\ell_1$ loss compared with $Z^*$.

An Informal Statement of Main Result: Let $\pi^{(s)}$ be the estimate of community membership from the iterative algorithm BCAVI after $s$ iterations. Under weak regularity conditions, for some coefficient $c$, with high probability, we have for all $s \geq 1$,

(1)

The main contribution of this paper is Equation (1). The coefficient $c$ is $o(1)$ and is independent of the iterate, which implies that $\ell(\pi^{(s)}, Z^*)$ decreases at a fast linear rate. In addition, we show that BCAVI converges to the statistical optimality [35]. It is worth mentioning that after $\log n$ iterations BCAVI attains the minimax rate, up to a negligible error term. The conditions required for the analysis of BCAVI are relatively mild. We allow the number of communities $k$ to grow. The sizes of the communities are not assumed to be of the same order. The separation condition on the global parameters covers a wide range of settings, from consistent community detection to exact recovery.

To the best of our knowledge, this provides arguably the first theoretical justification for the iterative algorithm of the mean field variational method in a high dimensional and complex setting. Though we focus on the problem of community detection in this paper, we hope the analysis will shed some light on analyzing other models, which may eventually lead to a general framework for understanding the mean field theory.

The techniques used to analyze the mean field method can be extended to provide theoretical guarantees for other iterative algorithms, including Gibbs sampling and an iterative procedure for maximum likelihood estimation, which can be of independent interest. Results similar to Equation (1) are obtained for both methods under the SBM.

Organization

The paper is organized as follows. In Section 2 we introduce the mean field theory and the implementation of the BCAVI algorithm for community detection. All the theoretical justifications for the mean field method are in Section 3. Discussions on the convergence of the global minimizer and on other iterative algorithms are presented in Section 4. The proofs of the theorems are in Section 5. We include all the auxiliary lemmas and propositions and their corresponding proofs in the supplemental material.

Notation

Throughout this paper, for any matrix $M$, its $\ell_1$ norm is defined analogously to that of a vector, that is, $\|M\|_1 = \sum_{i,j} |M_{ij}|$. We use the notation $M_{i\cdot}$ and $M_{\cdot j}$ to indicate its $i$-th row and $j$-th column respectively. For matrices $M, M'$ of the same dimension, their inner product is defined as $\langle M, M' \rangle = \sum_{i,j} M_{ij} M'_{ij}$. For any set $S$, we use $|S|$ for its cardinality. We write $\mathrm{Bern}(p)$ for a Bernoulli random variable with success probability $p$. For two positive sequences $a_n$ and $b_n$, $a_n \lesssim b_n$ means $a_n \leq C b_n$ for some constant $C$ not depending on $n$, and we adopt the notation $a_n \asymp b_n$ if $a_n \lesssim b_n$ and $b_n \lesssim a_n$. To distinguish from the probabilities $p$ and $q$, we use bold $\mathbf{P}$ and $\mathbf{Q}$ to indicate distributions. The Kullback–Leibler divergence between two distributions is defined as $D(\mathbf{Q} \,\|\, \mathbf{P}) = \int \log \frac{\mathrm{d}\mathbf{Q}}{\mathrm{d}\mathbf{P}} \,\mathrm{d}\mathbf{Q}$. We use $\psi(\cdot)$ for the digamma function, which is defined as the logarithmic derivative of the Gamma function, i.e., $\psi(x) = \frac{\mathrm{d}}{\mathrm{d}x} \log \Gamma(x)$. In $\mathbb{R}^k$, we denote by $e_u$ the standard Euclidean basis vector whose $u$-th coordinate equals 1. We let $1_n$ be a vector of length $n$ whose entries are all 1. We use $[n]$ to indicate the set $\{1, 2, \ldots, n\}$. Throughout this paper, the superscript “pri” (e.g., $\pi^{\mathrm{pri}}$) indicates a hyperparameter of a prior distribution.

2 Mean Field Method for Community Detection

In this section, we first give a brief introduction to the variational inference method in Section 2.1. Then we introduce the community detection problem and the Stochastic Block Model in Section 2.2. The Bayesian framework is presented in Section 2.3. Its mean field approximation and CAVI updates are given in Section 2.4 and Section 2.5 respectively. The BCAVI algorithm is introduced in Section 2.6.

2.1 Mean Field Variational Inference

We first present the mean field method in a general setting and then consider its application to the community detection problem. Let $p(z \mid x)$ be an arbitrary posterior distribution for $z$, given observation $x$. Here $z$ can be a vector of latent variables, with coordinates $z = (z_1, z_2, \ldots, z_m)$. It may be difficult to compute the posterior exactly. The variational Bayes method ignores the dependence among $z_1, \ldots, z_m$ by simply taking a product measure $q(z) = \prod_{i=1}^m q_i(z_i)$ to approximate it. Usually each $q_i$ is simple and easy to compute. The best approximation is obtained by minimizing the Kullback–Leibler divergence between $q$ and $p(\cdot \mid x)$:

$$\hat{q} \;=\; \operatorname*{argmin}_{q = \prod_{i=1}^m q_i} D\big(q \,\big\|\, p(\cdot \mid x)\big). \qquad (2)$$

Despite the fact that every measure $q$ in this variational class has a simple product structure, the global minimizer $\hat{q}$ remains computationally intractable in general.

To address this issue, the iterative Coordinate Ascent Variational Inference (CAVI) algorithm is widely used to approximate the global minimum. It is a greedy algorithm: the value of $D(q \,\|\, p(\cdot \mid x))$ decreases in each coordinate update,

$$q_j^{\text{new}} \;=\; \operatorname*{argmin}_{q_j} D\Big(q_j \textstyle\prod_{i \neq j} q_i \,\Big\|\, p(\cdot \mid x)\Big). \qquad (3)$$

The coordinate update has an explicit formula

$$q_j^{\text{new}}(z_j) \;\propto\; \exp\Big\{ \mathbb{E}_{q_{-j}} \log p\big(z_j \mid z_{-j}, x\big) \Big\}, \qquad (4)$$

where $z_{-j}$ indicates all the coordinates of $z$ except $z_j$, and the expectation is over $q_{-j} = \prod_{i \neq j} q_i$. Equation (4) is usually easy to compute, which makes CAVI computationally attractive, although CAVI is only guaranteed to reach a local minimum.

In summary, in the mean field variational inference methodology the global minimizer $\hat{q}$ serves mainly as an intermediate step: what is implemented in practice to approximate the global minimum is an iterative algorithm like CAVI. This motivates us to consider directly the theoretical guarantees of the iterative algorithm in this paper.

We refer the readers to the nice review and tutorial by Blei et al. [7] for more details on variational inference and CAVI. The derivation from Equation (3) to Equation (4) can be found in many works on variational inference [7, 5]. We include it in Appendix D in the supplemental material for completeness.
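To make the coordinate update in Equation (4) concrete, here is a minimal sketch (ours, not from the paper) that runs CAVI on a toy joint distribution of two dependent binary variables, approximating it by a product of two marginal factors; the joint table and variable names are purely illustrative.

```python
import numpy as np

# Hypothetical joint distribution p(z1, z2) over {0, 1}^2.
joint = np.array([[0.35, 0.05],
                  [0.20, 0.40]])
log_p = np.log(joint)

# Mean field factors q1(z1) and q2(z2), initialized uniformly.
q1 = np.array([0.5, 0.5])
q2 = np.array([0.5, 0.5])

for _ in range(100):
    # Equation (4): q1(z1) is proportional to exp{ E_{q2} log p(z1, z2) }.
    q1 = np.exp(log_p @ q2)
    q1 /= q1.sum()
    # The same coordinate update applied to q2(z2).
    q2 = np.exp(log_p.T @ q1)
    q2 /= q2.sum()

# KL(q1 x q2 || p), the objective of Equation (2); CAVI decreases it monotonically.
q = np.outer(q1, q2)
print(q1, q2, float(np.sum(q * (np.log(q) - log_p))))
```

Each pass through the two coordinates decreases the Kullback–Leibler objective, mirroring the greedy updates described above.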

2.2 Community Detection and Stochastic Block Model

The Stochastic Block Model (SBM) has been a popular model for community detection.

Consider an $n$-node network with its adjacency matrix denoted by $A \in \{0, 1\}^{n \times n}$. It is an unweighted and undirected network without self-loops, with $A_{ij} = A_{ji}$ and $A_{ii} = 0$. Each edge $A_{ij}$ ($i < j$) is an independent Bernoulli random variable with $\mathbb{P}(A_{ij} = 1) = P_{ij}$. In the SBM, the value of the connectivity probability $P_{ij}$ depends on the communities the two endpoints $i$ and $j$ belong to. We assume $P_{ij} = p$ if both nodes come from the same community and $P_{ij} = q$ otherwise. There are $k$ communities in the network. We denote $z = (z_1, \ldots, z_n) \in [k]^n$ as the assignment vector, with $z_i$ indicating the index of the community the $i$-th node belongs to. Thus, the connectivity probabilities can be written as $P_{ij} = B_{z_i z_j}$, where $B \in [0, 1]^{k \times k}$ has diagonal entries equal to $p$ and off-diagonal entries equal to $q$; that is, $B = q 1_k 1_k^\top + (p - q) I_k$. Let $Z \in \{0, 1\}^{n \times k}$ be the assignment matrix: in each row there is only one 1, with all the other coordinates equal to 0, indicating the community assignment of the corresponding node. Then the connectivity probabilities can be equivalently written as $P_{ij} = Z_{i\cdot} B Z_{j\cdot}^\top$, or in matrix form, $P = Z B Z^\top$.

The goal of community detection is to recover the assignment vector $z$, or equivalently, the assignment matrix $Z$. The equivalence can be seen by observing that there is a bijection between $z$ and $Z$, defined as follows:

$$Z_{iu} = \mathbb{1}\{z_i = u\}, \quad \text{for all } i \in [n],\ u \in [k]. \qquad (5)$$

Since they are uniquely determined by each other, in this paper we may use $Z$ directly without explicitly defining $z$ (or vice versa) when there is no ambiguity.
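As a quick sanity check on the generative model just described, the following sketch (illustrative; the function name and parameter values are ours) simulates a symmetric, hollow adjacency matrix from an SBM with assignment vector $z$, within-community probability $p$, and between-community probability $q$.

```python
import numpy as np

def sample_sbm(z, p, q, rng=None):
    """Simulate an undirected SBM adjacency matrix A without self-loops,
    with P(A_ij = 1) = p if z_i = z_j and q otherwise."""
    rng = np.random.default_rng() if rng is None else rng
    z = np.asarray(z)
    n = len(z)
    same = (z[:, None] == z[None, :])          # indicator of shared community
    probs = np.where(same, p, q)               # connectivity probability matrix
    upper = np.triu(rng.random((n, n)) < probs, k=1)
    return (upper | upper.T).astype(int)       # symmetrize, zero diagonal

# Example: n = 200 nodes split into k = 2 equal communities.
z = np.repeat([0, 1], 100)
A = sample_sbm(z, p=0.10, q=0.02, rng=np.random.default_rng(0))
```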

2.3 A Bayesian Framework

Throughout the whole paper, we assume $k$, the number of communities, is known. We observe the adjacency matrix $A$. The global parameters $p$ and $q$ and the community assignment $Z$ are unknown. From the description of the model in Section 2.2, we can write down the distribution of $A$ given $(Z, p, q)$ as follows:

$$\mathbb{P}(A \mid Z, p, q) = \prod_{1 \le i < j \le n} B_{z_i z_j}^{A_{ij}} \big(1 - B_{z_i z_j}\big)^{1 - A_{ij}}, \qquad (6)$$

with $B_{z_i z_j} = p$ when $z_i = z_j$ and $B_{z_i z_j} = q$ otherwise. We are interested in Bayesian inference for estimating $Z$, with priors given on both $Z$ and $(p, q)$.

We assume that $z_1, \ldots, z_n$ have independent categorical (a.k.a. multinomial with size one) priors with hyperparameters $\pi^{\mathrm{pri}}_1, \ldots, \pi^{\mathrm{pri}}_n$, where each $\pi^{\mathrm{pri}}_i = (\pi^{\mathrm{pri}}_{i1}, \ldots, \pi^{\mathrm{pri}}_{ik})$ sums to 1. In other words, the rows $Z_{1\cdot}, \ldots, Z_{n\cdot}$ are independently distributed with $\mathbb{P}(Z_{i\cdot} = e_u) = \pi^{\mathrm{pri}}_{iu}$, where $e_1, \ldots, e_k$ are the coordinate vectors. Here we allow the priors for $z_i$ to be different for different $i$. If additionally $\pi^{\mathrm{pri}}_i = \pi^{\mathrm{pri}}$ for all $i$ is assumed, then this reduces to the usual case of i.i.d. priors.

Since the edges $A_{ij}$ are Bernoulli, it is natural to consider conjugate Beta priors for $p$ and $q$. Let $p \sim \mathrm{Beta}(\alpha_p^{\mathrm{pri}}, \beta_p^{\mathrm{pri}})$ and $q \sim \mathrm{Beta}(\alpha_q^{\mathrm{pri}}, \beta_q^{\mathrm{pri}})$. Then the joint distribution factorizes as

$$\mathbb{P}(A, Z, p, q) = \mathbb{P}(p)\, \mathbb{P}(q) \Big[\prod_{i=1}^n \mathbb{P}(Z_{i\cdot})\Big]\, \mathbb{P}(A \mid Z, p, q). \qquad (7)$$

Our main interest is to infer $Z$ from the posterior distribution $\mathbb{P}(Z, p, q \mid A)$. However, the exact calculation of this posterior is computationally intractable.

2.4 Mean Field Approximation

Since the posterior distribution is computationally intractable, we apply the mean field method to approximate it by a product measure

$$Q(Z, p, q) = Q_p(p)\, Q_q(q) \prod_{i=1}^n Q_i(Z_{i\cdot}),$$

where under $Q$ the rows $Z_{1\cdot}, \ldots, Z_{n\cdot}$ are independent categorical variables with parameters $\pi_1, \ldots, \pi_n$, i.e., $Q_i(Z_{i\cdot} = e_u) = \pi_{iu}$ with $\sum_{u \in [k]} \pi_{iu} = 1$, and $p$ and $q$ are Beta distributed with parameters $(\alpha_p, \beta_p)$ and $(\alpha_q, \beta_q)$ due to conjugacy. See Figure 1 for the graphical presentation of $Q$.

Figure 1: Graphical model presentations of full Bayesian inference (left panel) and the mean field approximation (right panel) for community detection. The edges show the dependence among variables.

Note that the distribution class of $Q$ is fully captured by the parameters $(\pi, \alpha_p, \beta_p, \alpha_q, \beta_q)$, and then the optimization in Equation (2) is equivalent to minimizing over these parameters as

(8)

Here $\pi$ can be viewed as a relaxation of $Z$: it uses an $\ell_1$ constraint on each row (each row lies in the probability simplex) instead of the constraint, used in $Z$, that each row be a coordinate vector. The global minimizer $\hat{\pi}$ gives approximate probabilities for classifying every node into each community. The optimization in Equation (8) can be shown to be equivalent to a more explicit optimization as follows. Recall that $\psi(\cdot)$ is the digamma function with $\psi(x) = \frac{\mathrm{d}}{\mathrm{d}x} \log \Gamma(x)$.

Theorem 2.1.

The mean field estimator defined in Equation (8) is equivalent to

where

and

(9)
(10)

The explicit formulation in Theorem 2.1 is helpful for understanding the global minimizer of the mean field method. However, the global minimizer remains computationally infeasible, as the objective function is not convex. Fortunately, there is a practically useful algorithm to approximate it.

2.5 Coordinate Ascent Variational Inference

CAVI is possibly the most popular algorithm for approximating the global minimum in mean field variational Bayes. It is an iterative algorithm. In Equation (8), the latent variables are $Z_{1\cdot}, \ldots, Z_{n\cdot}$, $p$ and $q$, and CAVI updates them one by one. Since the distribution class of $Q$ is uniquely determined by the parameters $(\pi, \alpha_p, \beta_p, \alpha_q, \beta_q)$, equivalently we are updating those parameters iteratively. Theorem 2.2 gives explicit formulas for the coordinate updates.

Theorem 2.2.

Starting from some initial value, the CAVI update for each coordinate (i.e., Equations (3) and (4)) has the following explicit expressions:

  • Update on :

  • Update on :

  • Update on :

    where the two coefficients are defined in Equations (9) and (10) respectively, and the normalization satisfies $\sum_{u \in [k]} \pi_{iu} = 1$.

All coordinate updates in Theorem 2.2 have explicit formulas, which makes CAVI a computationally attractive way to approximate the global optimum for the community detection problem.

2.6 Batch Coordinate Ascent Variational Inference

The Batch Coordinate Ascent Variational Inference (BCAVI) algorithm is a batch version of CAVI. The difference lies in that CAVI updates the rows of $\pi$ sequentially, one by one, while BCAVI uses the current value $\pi^{(s)}$ to update all the rows at once according to Theorem 2.2. This makes BCAVI especially suitable for parallel and distributed computing, a nice feature for large scale network analysis.

We define a mapping as follows. For any , we have

(11)

with two parameters. For BCAVI, we update $\pi^{(s)}$ in each batch iteration, with the parameters defined in Equations (14) and (15). See Algorithm 1 for the detailed implementation of the BCAVI algorithm.

Input: Adjacency matrix $A$, number of communities $k$, prior hyperparameters, initializer $\pi^{(0)}$, number of iterations.
Output: Mean field variational Bayes approximation.
for  do
      Update by
(12)
(13)
Define
(14)
(15)
where $\psi(\cdot)$ is the digamma function. Then update $\pi$ with
where the mapping is defined as in Equation (11).
end for
We have the output $\hat{\pi}$ equal to the final iterate of $\pi$.
Algorithm 1 Batch Coordinate Ascent Variational Inference (BCAVI)
Remark 2.1.

The definitions of the coefficients in Equations (14) and (15) involve the digamma function, which costs non-negligible computational resources each time it is called. Note that we have $\log x - 1/x \le \psi(x) \le \log x$ for all $x > 0$. For computational purposes, we propose to use the logarithmic function instead of the digamma function in Algorithm 1, i.e., Equations (14) and (15) are replaced by

(16)

Later we show that the Beta parameters $\alpha_p, \beta_p, \alpha_q, \beta_q$ are all at least of an order that goes to infinity, and thus the error caused by using the logarithmic function to replace the digamma function is negligible. All theoretical guarantees obtained in Section 3 for Algorithm 1 (i.e., Theorem 3.1 and Theorem 3.2) still hold if we use Equation (16) to replace Equations (14) and (15).
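For illustration, here is a minimal sketch of one batch update of $\pi$ in the spirit of Algorithm 1 and Remark 2.1, but under the simplification that $p$ and $q$ are treated as known (as in Section 4.1), so the Beta-posterior updates of Equations (12)–(15) are bypassed; the function name, the variables `t` and `lam`, and the softmax normalization are our implementation choices, not the paper's notation.

```python
import numpy as np
from scipy.special import softmax

def bcavi_step(A, pi, p, q, pi_prior=None):
    """One batch mean field update of pi for the SBM, with p and q treated as
    known (a simplification of Algorithm 1).  Each row of pi is an approximate
    posterior distribution over the k communities."""
    A = np.asarray(A, dtype=float)
    t = np.log(p / q)                      # log-odds weight of an observed edge
    lam = np.log((1.0 - p) / (1.0 - q))    # weight of a non-edge
    S = t * A + lam * (1.0 - A)
    np.fill_diagonal(S, 0.0)               # drop the j = i term
    scores = S @ pi                        # scores[i, u] = sum_{j != i} S_ij * pi[j, u]
    if pi_prior is not None:               # optional log-prior term
        scores = scores + np.log(pi_prior)
    return softmax(scores, axis=1)         # renormalize each row to sum to one

# Usage sketch: start from an initializer pi, then iterate
#   for _ in range(num_iters): pi = bcavi_step(A, pi, p_hat, q_hat)
# with p_hat, q_hat current estimates of the connectivity probabilities.
```

Because every row is updated from the previous iterate, the whole step is a single matrix product, which is what makes the batch scheme easy to parallelize.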

3 Theoretical Justifications

In this section, we establish theoretical justifications for BCAVI for community detection under the Stochastic Block Model. Though $Z^*$, $p$ and $q$ are all unknown, the main interest of community detection is the recovery of the assignment matrix $Z^*$, while $p$ and $q$ are nuisance parameters. As a result, our main focus is on developing the convergence rate of BCAVI for estimating $Z^*$.

3.1 Loss Function

We use the $\ell_1$ norm to measure the performance of recovering $Z^*$. Let $\Phi$ be the set of all bijections from $[k]$ to $[k]$. Then for any $\pi$, the loss function is defined as

$$\ell(\pi, Z^*) = \inf_{\phi \in \Phi} \sum_{i \in [n]} \sum_{u \in [k]} \big| \pi_{i \phi(u)} - Z^*_{iu} \big|. \qquad (17)$$

Note that the infimum over $\Phi$ addresses the issue of identifiability of the labels. For instance, in the case of $k = 2$, the assignment vectors $(1, \ldots, 1, 2, \ldots, 2)$ and $(2, \ldots, 2, 1, \ldots, 1)$ give the same partition. In Equation (17), two equivalent assignments give the same loss.

There are a few reasons for the choice of the $\ell_1$ norm. When both arguments are assignment matrices, the $\ell_1$ distance between them is equal to twice the Hamming distance between the corresponding assignment vectors $z$ and $z'$, which is the default metric used in the community detection literature [12, 35]. The other reason is related to the interpretation of $\pi$. Since each row of $\pi$ corresponds to a categorical distribution, it is natural to use the $\ell_1$ norm, which is proportional to the total variation distance, to measure their difference.
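As an illustration, the sketch below (function name ours) evaluates the $\ell_1$ loss of Equation (17) by brute force over all $k!$ label bijections, which is feasible for small $k$.

```python
import numpy as np
from itertools import permutations

def l1_loss(pi, z_star, k):
    """ell_1 loss of Equation (17): the minimum, over all bijections of the
    k labels, of the entrywise L1 distance between pi and the permuted
    one-hot ground-truth matrix Z*."""
    z_star = np.asarray(z_star)
    n = len(z_star)
    Z = np.zeros((n, k))
    Z[np.arange(n), z_star] = 1.0                    # one-hot encoding of z*
    return min(np.abs(pi - Z[:, list(perm)]).sum()   # brute force over k! bijections
               for perm in permutations(range(k)))
```

For hard assignments the value equals twice the Hamming distance between the label vectors, consistent with the discussion above.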

3.2 Ground Truth

We use the superscript asterisk to indicate the ground truth. The ground truth of the connectivity matrix is $B^* = q^* 1_k 1_k^\top + (p^* - q^*) I_k$, where $p^*$ is the within-community connection probability and $q^*$ is the between-community connection probability. Throughout the paper, we assume $p^* > q^*$, so that the network satisfies the so-called “assortative” property, with the within-community connectivity probability larger than the between-community connectivity probability.

We further assume the network is generated by the true assignment matrix $Z^*$, in the sense that $\mathbb{P}(A_{ij} = 1) = B^*_{z^*_i z^*_j}$ for all $i < j$. We are interested in deriving a statistical guarantee for the BCAVI iterates $\pi^{(s)}$. Throughout this section we consider ground truths lying in a parameter space defined as a subset of assignment matrices with all the community sizes bounded between a prescribed lower bound and upper bound. It is worth mentioning that these size bounds are not necessarily constants. We allow the community sizes not to be of the same order in the theoretical analysis.

3.3 Theoretical Justifications for BCAVI

In Theorem 3.1, we present theoretical guarantees on the convergence rate of BCAVI when it is initialized properly. Define

When the defined quantities take their extreme values, the priors for the $z_i$ are i.i.d. and there exist only two communities, respectively. The following quantity plays a key role in the minimax theory [35]:

$$I = -2 \log \Big( \sqrt{p^* q^*} + \sqrt{(1 - p^*)(1 - q^*)} \Big),$$

which is the Rényi divergence of order $1/2$ between the two Bernoulli distributions $\mathrm{Bern}(p^*)$ and $\mathrm{Bern}(q^*)$. The proof of Theorem 3.1 is deferred to Section 5.3.

Theorem 3.1.

Let . Let be any constant. Assume ,

(18)

Under the assumption that the initializer satisfies for some sufficiently small constant , with probability at least , there exist some constant and some such that in each iteration of the BCAVI algorithm, we have

holds uniformly with probability at least .

Theorem 3.1 establishes a linear convergence rate for the BCAVI algorithm. The contraction coefficient is independent of the iterate and goes to 0 asymptotically. The following theorem is an immediate consequence of Theorem 3.1.

Theorem 3.2.

Under the same condition as in Theorem 3.1, for any , we have

with probability at least .

Theorem 3.2 shows that BCAVI provably attains the statistical optimality given by the minimax lower bound in Theorem 3.3 after sufficiently many iterations. When the network is sparse, i.e., $p^*$ and $q^*$ are at most of the order $\log n / n$, the exponent in the minimax rate can be shown to be $O(\log n)$, and then BCAVI converges to the minimax rate within $\log n$ iterations. When the network is dense, i.e., $p^*$ and $q^*$ are far bigger than $\log n / n$, $\log n$ iterations are not enough to attain the minimax rate. However, the remaining loss is then already vanishingly small, and thus all the nodes can be correctly clustered with high probability by assigning each node to the community with the highest assignment probability. Therefore, it is enough to pick the number of iterations to be of the order $\log n$ in implementing BCAVI.
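As a small illustration of the last point, hard community labels are obtained from the BCAVI output by a row-wise argmax; the function and variable names below are ours.

```python
import numpy as np

def hard_labels(pi):
    """Assign each node to the community with the highest approximate
    posterior probability, i.e., the row-wise argmax of pi."""
    return np.asarray(pi).argmax(axis=1)

# Example usage with a BCAVI output pi_hat (hypothetical variable):
#   labels = hard_labels(pi_hat)
# Exact recovery means these labels agree with the ground truth up to a
# relabeling of the communities, cf. the bijections in Equation (17).
```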

Theorem 3.3.

Under the assumption , we have

Theorem 3.3 gives the minimax lower bound for the community detection problem with respect to the $\ell_1$ loss. Combined with the additional assumption in Theorem 3.2, it immediately reveals that BCAVI converges to the minimax rate after sufficiently many iterations. As a consequence, BCAVI is not only computationally efficient, but also achieves statistical optimality. The minimax lower bound in Theorem 3.3 is almost identical to the minimax rate established in [35]. The only difference is that [35] considers a Hamming loss function. The proof of Theorem 3.3 is just a routine extension of that in [35], and is therefore omitted.

To help understand Theorem 3.1, we add a remark on conditions on model parameters and priors, and a remark on initialization.


Remark 1 (Conditions on model parameters and priors). The community sizes are not necessarily of the same order in Theorem 3.1. If we further assume that $k$ is a constant, the community sizes are of the same order, and the prior is uniform (for example), then the first condition in Equation (18) is equivalent to

This condition is necessary for consistent community detection [35] when $k$ is finite. The assumption in Equation (18) is slightly stronger than the assumption in [23], which essentially requires the corresponding quantity to be bounded below by a sufficiently large constant.

Under the assumption above, it can be shown that the quantities involved are far bigger than the corresponding thresholds, and then the second part of Equation (18) can also be easily satisfied. For instance, we can simply set all the Beta hyperparameters equal to 1, i.e., consider non-informative priors.


Remark 2 (Initialization). The requirement on the initializers for BCAVI in Theorem 3.1 is relatively weak. When $k$ is a constant and the community sizes are of the same order, the condition needed is that $\ell(\pi^{(0)}, Z^*)$ be at most a small constant multiple of $n$. Many existing methodologies in the community detection literature can be used. One popular choice is spectral clustering, as illustrated in the sketch below. Established in [21, 12, 9], spectral clustering has a mis-clustering error bound which, combined with Equation (18), implies that the initialization condition required by Theorem 3.1 is satisfied. Semidefinite programming (SDP), another popular method for community detection, also enjoys satisfactory theoretical guarantees [16, 10], and is suitable as an initializer.
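A minimal sketch of such a spectral initializer is given below; it is one common variant (k-means on the leading eigenvectors of the adjacency matrix), offered as an illustration under our own assumptions rather than the specific procedures analyzed in [21, 12, 9].

```python
import numpy as np
from scipy.sparse.linalg import eigsh
from sklearn.cluster import KMeans

def spectral_init(A, k, seed=0):
    """Spectral clustering initializer: k-means on the top-k eigenvectors of
    the adjacency matrix, returned as a one-hot n x k membership matrix."""
    _, vecs = eigsh(np.asarray(A, dtype=float), k=k, which='LA')
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(vecs)
    n = A.shape[0]
    pi0 = np.zeros((n, k))
    pi0[np.arange(n), labels] = 1.0
    return pi0
```

The returned one-hot matrix can be fed directly to BCAVI as the initializer $\pi^{(0)}$.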

4 Discussion

4.1 Statistical Guarantee of Global Minimizer

Though it is often challenging to obtain the global minimizer of the mean field method, it is still interesting to understand the statistical properties of the global minimizer $\hat{\pi}$. Assuming that both $p$ and $q$ are known, the optimization problem stated in Theorem 2.1 can be further simplified. The posterior distribution becomes $\mathbb{P}(Z \mid A, p, q)$. We use a product measure for the approximation, and denote the resulting global minimizer again by $\hat{\pi}$. Theorem 4.1 reveals that $\hat{\pi}$ is rate-optimal, which is not surprising given the theoretical results obtained for BCAVI, an approximation of $\hat{\pi}$.

Theorem 4.1.

Assume $p$ and $q$ are known. Under the assumption , there exist some constant and such that