Heron Inference for Bayesian Graphical Models

02/19/2018 ∙ by Daniel Rugeles, et al. ∙ 0

Bayesian graphical models have been shown to be a powerful tool for discovering uncertainty and causal structure from real-world data in many application fields. Current inference methods primarily follow different kinds of trade-offs between computational complexity and predictive accuracy. At one end of the spectrum, variational inference approaches perform well in computational efficiency, while at the other end, Gibbs sampling approaches are known to be relatively accurate for prediction in practice. In this paper, we extend an existing Gibbs sampling method, and propose a new deterministic Heron inference (Heron) for a family of Bayesian graphical models. In addition to the support for nontrivial distributability, one more benefit of Heron is that it is able to not only allow us to easily assess the convergence status but also largely improve the running efficiency. We evaluate Heron against the standard collapsed Gibbs sampler and state-of-the-art state augmentation method in inference for well-known graphical models. Experimental results using publicly available real-life data have demonstrated that Heron significantly outperforms the baseline methods for inferring Bayesian graphical models.



page 1

page 2

page 3

page 4

Code Repositories

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1. Introduction

Bayesian graphical models refer to a family of probabilistic unifying models in which nodes represent random variables, and edges represent possible dependency between individual pairs of nodes. The graphical models have become a powerful tool for discovering uncertainty and causal structure from real-world data. They have been widely used in computer vision

(Cao and Fei-Fei, 2007), text data mining (Blei et al., 2003), (Chang and Blei, 2009), and (Mcauliffe and Blei, 2008)

, natural language processing

(Hai et al., 2017), music information retrieval (Hu and Saul, 2009), and computational biology (Chen et al., 2012). Given a graphical model, one of the most fundamental tasks could be inference, simply, computing the marginal distribution of one or a couple of random variables in the model.

Generally, two categories of approaches are mainly adopted for inferring Bayesian graphical models, i.e., variational inference and Gibbs sampling methods. Variational approaches mainly rely on stochastic optimization to fit an approximate model out of a postulated family of models. The choice for the family is known to facilitate efficiency as well as online learning process of the algorithms (Hoffman et al., 2010). However, the approximation for the targeted true model via a different model inevitably sacrifices accuracy. In contrast, based on the theory of stationary distribution that guarantees a convergence to the desired distributions, Gibbs sampling methods have been shown to result in decent predictive accuracy. Unfortunately, their convergence is often costly and difficult to assess. The Gibbs sampling methods are typically slow compared to variational inference, due to the fact that the posterior distribution must be computed and sampled for every single observation. Moreover, both collapsed Gibbs sampling and variational inference are known to be sequential algorithms, meaning that every step in the methods depends on the results of the previous step. Then, these algorithms are basically not distributable.

Recently, a state augmentation inference method, called state augmentation for marginal estimation (SAME)

(Zhao et al., 2015), has shown that by drawing a number of samples as opposed to extracting one sample from each conditional distribution, it is possible to improve the predictive perplexity of the Gibbs sampler and reduce the negative impact on perplexity caused by distributing the datasets into batches. Therefore, a higher sampling replication rate is able to yield better predictive performance. Unfortunately, SAME’s computational complexity depends on the number of replications. As the number of replications increases, its computational cost will eventually surpass the computational cost of the Gibbs sampler. To keep the computational cost reasonable, SAME typically performs replications for sampling from the conditional distribution. Arguably, SAME could be the fastest and most accurate inference algorithm under the Gibbs sampling framework.

In this paper, we aim at improving the computational cost and predictive power in inference for Bayesian graphical models. We extend SAME, and develop a new Heron inference approach. In particular, we develop a new learning framework that drives the number of replications of latent state up to infinity. By doing so, we then yield a deterministic inference algorithm even within the Gibbs framework. Determinism is typically a major source for computational speed-up. Though modern GPUs support hardware implementations of some samplers, sampling still limits the performance of randomized inference algorithms. Our proposed heron Inference method does not require sampling, which leads to much better efficiency than sampling counterparts. In addition, maximizing the number of replications of the sampling procedure naturally leads to an improvement for the predictive power of the algorithm. In addition, the easy assessment for convergence is possible thanks to the determinism.

With analyzing the convergence properties, we found that the proposed deterministic inference method is ultimately solving a fixed-point system of equations. This is perhaps one of the major benefits of Heron inference, as fixed-point methods have been well studied, and their iterative solvers can be clearly distributable. An additional advantage from the fixed-point system is that the computation of repeated document-word tuples is unnecessary, which actually leads to an improvement for the efficiency. Although recently proposed inference algorithms also avoid this cost (Teh et al., 2007), the fixed-point system provides a theoretical explanation within the Gibbs framework.

The proposed Heron inference can be applied to a family of graphical models that satisfy the following conditions: 1) The posterior distributions of the model need to have an analytical form, 2) The posterior distributions need to be discrete distributions,, and 3) The latent random variables need to be independent of each other, given the data and related parameters of the model. Note that we coin the term Heron Inference in honor of Heron of Alexandria. Heron is perhaps thought to be the first person who solved iteratively a fixed-point function with the objective of computing the square root of a number.

We have made the following main contributions in this work:

  1. We propose a novel inference method, which is applicable to a family of probabilistic models, with the same convergence guarantees as Gibbs sampling but easier convergence assessment.

  2. Our proposed method is an order of magnitude faster method than the state-of-the-art inference method available within the Gibbs framework, and at the same time achieving an improved perplexity.

  3. We demonstrate theoretical support for the distributability of the proposed algorithm.

  4. We transform the Gibbs sampling inference approach into a new deterministic domain where optimization techniques can be further explored.

2. Related Work

Along rich history of Bayesian graphical models, several main categories of inference algorithms have been developed, i.e., variational Bayesian inference

(Blei et al., 2003), expectation propagation (Minka and Lafferty, 2002), Gibbs sampling (Griffiths and Steyvers, 2004), and belief propagation (Zheng, 2012). Among these, Gibbs sampling and variational inference are perhaps the most popular algorithms, possibly due to their efficacy or efficiency. Specifically, Gibbs sampling methods have been known to guarantee to converge to the true posterior via a sampling scheme, while the variational inference basically relies on a theoretically-backed optimization approach to approximate the true posterior.

Existing approaches in the category of variational inference aim to improve the predictive perplexity. These methods approximate intractable integrals, which then results in inaccuracy. Collapsed variational Bayesian schemes (Teh et al., 2007) use a second-order Taylor expansion to approximate the integrals, while other variational approaches use the zero-order information in order to obtain more accurate inference (Sato and Nakagawa, [n. d.]). With respect to the distributability of these methods, empirical work has demonstrated effective approximations that avoid a major impact on the predictable performance (Hoffman et al., 2010).

Previous methods in the category of Gibbs sampling are mainly concerned with improving efficiency. This is natural given their powerful predictive capability. FastLDA was shown to be times faster than the latent Dirichlet allocation via collapsed Gibbs sampling (LDA-CGS), while maintaining the exact same predictive power (Porteous et al., 2008). The high computational cost of LDA-CGS stems from the time complexity incurred at every sampling step (K: number of latent topics). One key observation is that many words are often assigned to only a small number of different topics. Hence, by keeping track of these words, FastLDA requires significantly less than operations per sample on average.

SparseLDA (Yao et al., 2009)

factorizes the posterior equation into a sum of three factors. The sampling scheme is then replaced by a uniform sampling, where it is observed that the probability mass falls in one of the buckets 90% of the time. By making an appropriate data structure, the computation of the mass can be optimized. As a result, the model achieves an order of magnitude improvement in computational complexity without affecting its predictive power, when compared to collapsed Gibbs sampling. A faster approach, which exploits sparsity, relis on a variant of the Fenwick tree to encode the posterior, so that sampling can be performed in

time (F+LDA) (Yu et al., 2015).

Based on efficient Metropolis-Hastings, AliasLDA (Li et al., 2014) and LightLDA (Yuan et al., 2015) can reduce the complexity of sampling to via Walker’s alias method (Walker, 1977). Unfortunately, these methods result in problems of frequent cache misses caused by random accesses to the parameter matrices. WarpLDA deals with the Metropolis-Hasting’s problem by fitting the use of randomly accessed memory per-document in the cache. Remarkably, WarpLDA is consistently times faster than LightLDA, and it also outperforms F+LDA (Chen et al., 2016).

The sparsity exploited by recent algorithms has been used to develop GPU implementations of inference algorithms (Li et al., 2017)

. Alternatively, the SAME method is the inference algorithm that leverages a Poisson distribution and efficiently replaces sampling several times a categorical distribution with sampling a single time a Poisson distribution

(Zhao et al., 2015). Their GPU implementation is much faster than the GPU algorithm first introduced in (Yan et al., 2009). Different from other GPU based methods, Poisson sampling is done via a hardware implementation available on NVIDIA GPUs, which makes SAME a very efficient inference algorithm.

Existing inference methods reside in the Pareto optima, where variational approaches are optimal in efficiency, while sampling approaches are optimal in predictive power. Other inference methods for Bayesian graphical models have different trade-offs between efficiency and accuracy. Recently, SAME has been shown to improve the optima using the state augmentation technique. The resulting algorithm is more accurate and efficient than existing CPU or GPU implementations (Zhao et al., 2015). In this work, we propose to maximize the augmentation of the latent state, and then develop a deterministic algorithm whose convergence properties can be exploited to further improve the efficiency and efficacy for inferring graphical models in practice.

3. Preliminaries

In this section, we present preliminaries about existing well-known graphical models and the unified collapsed Gibbs sampling algorithm, followed by the state-of-the-art state augmentation inference method.

Latent Dirichlet allocation (LDA), one of the most popular graphical models, has been widely used to discover latent topical structure of a given text collection of documents. Simply, LDA assumes the following generative process:


where corresponds to per-document topic distribution of document (), corresponds to per-topic word distributions for topic (), corresponds to latent topic assignment for word (, number of word tokens in document ), and and are hyper-parameters.

Relational topic model (RTM) (Chang and Blei, 2009) extends LDA by taking into account the citation links between pairwise documents. RTM adds one more step to the generative process of LDA:

where is an indicator variable of the citation relationship between documents and , and its probability is given by the following function:

where and are hyper-parameters, (: one-hot based topic assignment), and means element-wise product.

Supervised LDA (sLDA) (Mcauliffe and Blei, 2008) is a statistical model of labeled documents, and it extends LDA by adding an observed variable that ndicates the label of a document, for example, the rating given to a movie. The sLDA adds the following step to the generative process of LDA:


where is the label of document , and both and are hyper-parameters.

1:Input: Dataset (size: ), latent assignment
2:Initialize , = computeStats()
4:   for  to  do
5:      updateStats(, )
6:       from Table 1
7:      =sample()
8:      updateStats(, ) from Equation 2
9:   end for
10:until convergence
Algorithm 1 Collapsed Gibbs Sampling (CGS)

The inference for these graphical models, can be typically performed using collapsed Gibbs sampling (CGS). Algorithm 1 describes the unified sampling method by sequentially processing each of the records in the dataset. Table 1 lists the posterior density of each graphical model as a function of the statistics . Some main statistics are defined as follow: : the number of times that assignment and document co-occur in the training data. : the number of counts that latent assignment and word co-occur in the data. : the number of counts that latent assignment k occurs in the given data. Note that “-j” in a quantity means that the contribution of the record is not considered in the quantity.

Model Unnormalized posterior density
Table 1. Posterior density as a function of the statistics

The unified CGS algorithm shares some commonality across several graphical models. It first computes the statistics for each model, which are required to compute the posterior equation for each record in the dataset. Rather than recomputing all the parameters of the statistics for every record in the dataset, it is more desired and more efficient to just update the statistics as required. In step 5, CGS subtracts the contribution of the record from each of the statistics before computing the conditional distribution . Then, at step 8, it computes the statistics with the updated assignment .

Equation 2 shows the update for the statistics of the LDA model at the record, corresponding to the function updateStats in Algorithm 1:


This algorithm explicitly reveals problems caused by Gibbs sampling. First, the statistics used to compute the posterior in Line 6 are updated after every tuple in the dataset. As a consequence, the algorithm is non-distributable. Though approximate methods have been proposed to solve the distributability of the inference, they are only applicable to some specific graphical models, and sacrifice the predictive power of the collapsed Gibbs sampler (Smola and Narayanamurthy, 2010; Yan et al., 2009). In addition, Line 6 and 7 show that the posterior must be computed and sampled for every row in the dataset, even when the content of multiple observed tuples is repeated. For example, the same word dould appear multiple times in a document in reality. The repetition of tuples can considerably increase the computational complexity. For instance, the Cora dataset contains 23.01% repeated tuples, while the Diggs repeats 18.17% of its tuples.

Figure 1. Approximation of the posterior ) via sampling. The figure shows examples of draws from the posterior. As the number of samples grows, we obtain better estimate of the distribution. The samples approach the posterior itself when grows up to infinity.

To cope with these problems, a state augmentation inference method, called state augmentation for marginal estimaton (SAME), introduces the idea of combining multiple independent samples of each posterior (Zhao et al., 2015). This effectively corresponds to “cooling” the posterior, allowing an annealed search of the MAP parameters. As a result, the inference often yields higher-quality estimates than the single sample based approaches. To deal with the higher cost of obtaining additional samples, SAME employs a coordinated-factored approximation, in which it replaces sampling multinomial distributions with a single Poisson sample per category, hence the label of coordinated approximation. The efficiency of a Poisson sampler remains similar to sampling from a multinomial distribution as long as takes a value of or less. The sampling scheme replaces Line 7 and 8 in Algorithm 1, as shown below:


We note that the statistics of the method take into account the replication of the latent state as follows:


4. Heron Inference

Following SAME, the proposed Heron inference method leverages multiple samples to “cool” the posterior. Differently, Heron introduces a new mathematical manipulation to drive the number of samples up to infinity, yielding a deterministic fixed-point algorithm.

We follow three main steps to derive Heron inference method. Firstly, we show that by maximizing the replication of samples, we obtain an update that is no longer random, i.e., transforming Gibbs sampling into a deterministic algorithm. Secondly, we study the convergence properties of the novel deterministic algorithm, and realize that the method is implicitly solving a fixed-point system of equations, which presents a new framework for doing inference for probabilistic graphical models. Lastly, we observe that repeated document-word tuples converge to the same fixed-point equations, hence they only need to be solved once by the algorithm.

First, the two steps in Equation 3

can be reduced to one step by deriving the distribution from the linear transformation of the Poisson distribution when multiplied by

. Let be a random variable, and let represent a linear transformation of such that . Then, the probability mass function of can be expressed in terms of any outcome as :


As we increase

, we observe that the variance decreases. In effect the variance tends to zero as

grows up to infinity. Instead, the mean for each outcome remains as:


The probability mass starts to concentrate in the mean value . Figure 2 illustrates this effect.

Figure 2. Behavior of as for

As a result, by maximizing , we obtain an update that does not require any sampling. Then, Line 7 and 8 of Algorithm 1 can be replaced by:




Second, we examine the convergence properties of Gibbs sampling. As shown by (Casella and George, 1992)

, Gibbs sampling converges to the stationary distribution of a Markov chain, where the transition matrix corresponds to the multiplication of the conditional distributions

. Hence, labeling as the stationary distribution, we have:


Estimation of these marginal distributions is based merely on the topic assignments , . Hence, upon convergence, there must exist a set of values that satisfy or approximate the system of Equations 9. Furthermore, the solution to the system of equations presented in Equation 9 as necessary condition approximates the conditional distributions to the marginal distributions . Therefore, we obtain:


where is the posterior equation derived for the model of interest (see Table 1). Note that is computed based on the topic assignments . To transform Equation 10 into a fixed-point equation, we need to find a representation such that the posterior is written in terms of deterministic updates as opposed of the random topic assignments of the collapsed Gibbs sampler.

To that end, we use the deterministic Equation 7 obtained above. This equation explicitly expresses how the update of is given by the parameters of the posterior equation. As a result, we obtain the fixed-point condition that must hold at convergence of the given model. To simplify the notation, we define , where indexes the document-word tuple composed by .



We have transformed the inference problem into a system of non-linear equations whose solutions correspond to all possible outcomes of an infinitely replicated latent state augmentation independent of the initialization.

Finally, we can further simplify this system of equations by observing that the same document-word tuples will also converge to the same values. Thus, we have:


This is easy to check using Equation 11. Upon convergence, repeated tuples will have the same fixed-point equations. Now the problem has been transformed into solving a system of non-linear equations. As a solution, we leverage the fixed-point iterative method. The fixed-point method finds the solution to a system of equations as long as is a continuous function.

The algorithm is similar to CGS presented in Algorithm 1 with the main differences: 1) There is no need to iterate over steps, and 2) The algorithm is distributable since we only need to update the statistics until the end of each iteration. Theoretically, we can partition the dataset into as many pieces as . The end result will be the same as long as we combine the results at the end of each iteration. We present Heron inference in Algorithm 2. Note that the definition of in computeStats is overridden with .

1:Input: data , latent assignment , dataset size , number of repeated tuples
3:= computeStats(,) from Equation 8
5:   for to do in parallel
6:      for to do in parallel
7:          from Table 1
8:      end for
9:   end for
10:   = computeStats(,) from Equation 8
11:until convergence (Equation 13)
Algorithm 2 Heron Inference

An additional advantage of the deterministic Heron method is that the assessment for convergence is now possible within the Gibbs framework. We are allowed to explore the assessment for convergence using the average Kullback-Leibler divergence (KL) or the average Chi-Squared distance on the learned

and . We find that the latter metric oscillates and it does not have a monotonic behavior. We find that it is often more practical to measure the change in either or . Equation 13 presents the metric used to assess convergence of Heron at the iteration.


Finally, the SAME based methods are under the premise of independence among topics, therefore the methods such as hidden Markov model

(Baum and Petrie, 1966) and stochastic block mixture model (Airoldi et al., 2008) are not guaranteed to work using these inference approaches.

5. Experiments

We evaluate our proposed Heron inference method against two well-established baseline methods. In this section we aim to answer the following research questions (RQs)


Is Heron able to fit better a test set than the state-of-the-art inference method SAME?


What is the effect on perplexity as we increase the number of batches, and does the moving average estimate of improve Heron’s learning?


How sensitive is Heron to the setting of the hyperparameters

and ?


How does the efficiency of Heron compare against the arguably fastest GPU-based inference method within the Gibbs framework?


How does the efficiency of Heron inference get affected by the batch size?


Does Heron help extract topics with acceptable coherence?

Datasets We used three popular and publicly available datasets for our experimental evaluation.

  • Cora dateset contains research papers and their citation networks. After preprocessing, we obtained documents with a vocabulary composed of 1,400 words. The number of citation links is 5,212. The dataset was first used for testing the RTM model (Chang and Blei, 2009) 111https://people.cs.umass.edu/~mccallum/data.html.

  • Diggs dataset consists of document-rating pairs with a vocabulary composed of 3,145 words after preprocessing. Note that, following (Mcauliffe and Blei, 2008), we normalized the ratings into the range to avoid numerical overflows when calculating sLDA’s posterior. The dataset was originally used for evaluating sLDA (Mcauliffe and Blei, 2008) 222http://www.cs.columbia.edu/~blei/papers/digg-data.tgz.

  • Movielens is a popular benchmark for rating prediction and recommendation. We used the MovieLens 20M dataset 333https://grouplens.org/datasets/movielens/, and focused on the most popular 100,000 users who rated 25,646 movies. We treated the movies and users as pseudo documents and words, respectively. The average rating of all the related users on a movie was used as the rating of the movie (document).

Baselines We compared Heron inference with two strong baselines, i.e., standard collapsed Gibbs sampling and SAME inference method (Zhao et al., 2015). Collapsed Gibbs sampling is perhaps one of the most commonly used methods for inferrring graphical models, while SAME has been shown to attain the best predictive perplexity for inferring LDA model.

Topic Models To test the performance of Heron for inferring graphical models, we used three well-known topic models, i.e., LDA (Latent Dirichlet Allocation) (Blei et al., 2003), RTM (Relational Topic Model) (Chang and Blei, 2009) and sLDA (Supervised Latent Dirichlet Allocation) (Mcauliffe and Blei, 2008). For LDA, all the datasets can be used to evaluate the inference methods. Inference for RTM can be only tested on the Cora dataset, as only Cora has a citation networks. We used Movielens and Diggs to test the inference for sLDA, since in addition to documents/movies, corresponding ratings are also available.

Implementation Details All the algorithms were evaluated on a single PC equipped with a single 2-core CPU (Intel Xeon @ 2.20 GHz) and a Nvidia tesla K80 GPU with dual stream processors. Each GPU processor comes with cores and GB of GDDR5 clocked at GHz. Only one processor of the GPU was used in the benchmark. The standard Gibbs sampler was implemented using CPU resources. In all our experiments we fixed both the grid size and block size.

We assess our proposed inference method in terms of running time, predictability, as well as coherence of the extracted topics. In particular, we conduct the following three experiments.

Cora 0.4 0.3 0.75 N/A
Diggs 0.6 0.5 0.5 0.25
Movielens 0.6 0.6 0.5 0.4
Table 2. Hyperparameter setting per dataset.
Model Inference Perplexity
K=10 K=25 K=50 K=100


LDA CGS 1709.63 1837.46 1999.25 2623.86
SAME 1706.27 1821.20 1977.26 2549.43
Heron 1683.03 1770.66 1977.51 2550.21
RTM CGS 1700.12 1830.02 1995.37 2245.96
SAME 1699.20 1811.96 1991.12 2043.35
Heron 1682.32 1806.09 1991.02 1971.07


LDA CGS 952.18 1166.93 1500.51 2075.65
SAME 948.57 1161.40 1492.53 2067.24
Heron 947.27 1160.35 1489.43 2062.12
sLDA CGS 940.09 1125.85 1452.94 2007.00
SAME 919.31 1123.82 1446.41 2005.28
Heron 914.77 1121.96 1442.54 2002.61


LDA CGS 1085.47 1022.25 1011.62 1065.02
SAME 924.12 932.43 993.81 1045.75
Heron 922.20 931.34 989.92 1045.09
sLDA CGS 926.55 905.56 924.75 963.45
SAME 923.28 903.63 921.26 963.65
Heron 920.22 899.68 918.51 954.28
Table 3. Predictive perplexity versus the nunber of topics.

5.1. Predictive Perplexity

The problem defined is given as follows: given a fraction of the original data, we evaluate our model’s ability to generate the withheld portion. As such, the evaluation metric adopted in our experiments is the predictive perplexity. The number of topics

is tuned amongst . The training-testing split is , and we run all models for iterations. Furthermore, we conducted five trials for each setting in the spirit of robust experimentation, and we reported the average results. The hyperparameters are given in Table 2. Models applied to the same dataset use the same hyperparameter settings.

Figure 7. Predictive perplexity on the GPU based inference methods versus the number of batches.

Table 3 shows the comparison results of all the inference methods when applied to the three models. Answering RQ1, our proposed Heron inference outperforms CGS and SAME method, i.e., both the best and second best performance on each configuration comes from Heron model across all values. As expected, SAME method consistently outperforms CGS. This is due to the fact that SAME has been derived from CGS by applying a finite replication of the latent states, and Heron maximizes this effect using a mathematical manipulation. SAME method improves Gibbs sampling because the estimation for the parameters is done based on the document-topic and word-topic counts. Hence, the estimation is based on discrete values. By using multiple replications of the latent state, it is possible to make this estimation more granular so that the counts can be thought to reside in the space of real numbers. As we increase the number of replications, the estimation becomes more granular, and Heron becomes the most flexible approach approximating the local optima closer than the other methods.

The benefit of the distributed methods lies in their ability to split the dataset into batches, so that several GPUs can be used to boost the speed of inference. As such, we propose RQ2 and solve it by studying the effect of splitting the datasets into different batch sizes. We fix and split the dataset into exponentially larger number of batches. Similar to SAME method, Heron inference may treat the batches as an infinite stream, and it is not necessary to denote passes over the dataset. In addition, we also study SAME’s moving average estimate for (Zhao et al., 2015), which has shown improvements when applied to other learning algorithms (Hoffman et al., 2010). We label the results of applying this technique as Heron-LDA, Heron-sLDA, and Heron-RTM.

We answer RQ2 using Figure 7. We conclude that the batched versions converge to a better estimation as measured by the predictive perplexity. As the dataset is split into smaller pieces, the predictive perplexity improves. This may be due to the fact that splitting the dataset into batches allows the inference to reach a better local optima with the use of a moving average estimate for (Zhao et al., 2015; Hoffman et al., 2010). Nonetheless, after a certain number of batches the predictive perplexity is affected, indicating that the learning has become compromised. Note that Heron can be distributed into as many batches as the number of document-word tuples. The final inference will result in the same estimation as if we partitioned the dataset into any number of pieces or if we did not partition the dataset. This is because Heron inference is ultimately solving a fixed-point system of equations, given by Equation 11. The results show that Heron can yield better local optima if we decide to use the moving average updates. In that case, the learning of Heron may also be compromised if we split the dataset into a large number of batches.

LDA - Cora


2118.8 1974.42 1766.27 1741.98 1667.46
1954.09 1831.44 1751.84 1725.55 1677.72
1880.51 1806.89 1747.28 1735.68 1748.78
1912.38 1851.29 1784.29 1784.60 1784.55
1916.95 1882.72 1845.47 1834.62 1845.15


2102.48 1916.14 1721.83 1649.48 1656.24
1942.4 1775.68 1713.5 1664.11 1656.55
1824.89 1794.71 1735.07 1707.69 1719.56
1840.68 1779.86 1796.25 1787.89 1798.68
1898.44 1850.68 1845.86 1860.4 1866.48


1858.19 1738.36 1666.94 1647.12 1635.79
1751.62 1704.31 1656.84 1661.41 1656.67
1708.78 1655.16 1671.78 1692.64 1702.59
1745.76 1739.17 1737.66 1750.46 1775.52
1801.18 1767.2 1782.23 1831.43 1847.63
Table 4. Predictive perplexity versus the hyperparameters and .

As proposed in RQ3, we study the hyperparameter sensitivity of the different inference methods. We made a grid search on and starting at 0.1 and stopping at 0.5 with steps of 0.1. We applied the inference methods to LDA on the Cora dataset.

Table 4 shows that the predictive perplexity of Heron is superior to that of SAME, and as expected, SAME has more predictive power than CGS. The results are solid, Heron lower bounds SAME, and SAME lower bounds CGS. Based on the range of the perplexity across different hyperparameters, we conclude that Heron is more robust to the hyperparameter settings, while CGS and SAME have similar sensitivity.

Model Inference Running time ()
K=10 K=25 K=50 K=100


LDA CGS 1350 1390 1460 1728
SAME 190 214 231 291
Heron 3 7 11 18
RTM CGS 1540 1620 1690 1720
SAME 210 218 222 240
Heron 5 12 19 33


LDA CGS 2590 3140 3030 3300
SAME 260 290 270 330
Heron 13 15 18 31
sLDA CGS 4060 4090 4470 5120
SAME 270 280 280 320
Heron 15 23 30 50


LDA CGS 16405 17356 18511 21286
SAME 780 790 813 833
Heron 19 44 62 95
sLDA CGS 7053 7818 8954 10367
SAME 954 976 997 1028
Heron 23 46 67 102
Table 5. Average running time per iteration versus the number of topics.

5.2. Learning Time

In this section, we report the runtime to convergence of the inference algorithms normalized by the number of iterations.

First, we study the effect of chaning the number of topics on running time, as shown in Table 5. Remarkably, Heron is about one order of magnitude faster than the state-of-the-art SAME method across all the values of . The main reason lies in Heron does not rely on the sampling operations for inference. In contrast, SAME method not only incurs the sampling costs but also is limited at every clock cycle by the available number of samplers within the GPU. Benefitting from purely arithmetic operations, Heron does not suffer from the limitations and sampling costs. We also observe that the running time increases linearly as we increase for the three inference methods. While this increment is larger for Heron, Heron can also benefit from tuning the grid size and block size (GPU Kernel parameters) to favor different values of . The flexibility of tuning GPU parameters is reduced for SAME, since at its optimal configuration, the system will be still limited by the availability of hardware based samplers. This answers RQ4.

In Figure 12, we show the effect of splitting the datasets into batches. The results show that both inference methods increase its running time logarithmically to the number of batches. This is outstanding given that Heron’s convergence is invariant to the number of batches, when the moving average estimate is not applied. It means that Heron can converge to accurate inference results dramatically fast, given that we have enough cores running simultaneously. Answering RQ5, Heron efficiency reduces logarithmically to the number of batches slightly.

Figure 12. Average running time per iteration versus the number of batches.

5.3. Automatic Topic Coherence

Besides assessing how the methods fit the data, it is important to evaluate the quality of the topics inferred by the methods. Following the evaluation introduced by Newman et al. (Newman et al., 2010), we computed the automatic topic coherence based on the top N most likely words for each of the extracted topics. Specifically, we relied on word co-occurrence metrics on the corpus to compute the average coherence of each topic. The automatic topic coherence has been studied extensively, and the normalized point-wise mutual information (NPMI), the probabilistic mutual information (PMI), and the pairwise log-conditional probability (LCP) are co-occurrence metrics that have been shown to correlate positively to the human coherence evaluation (Lau et al., 2014). Thus, we use these metrics to assess the coherence of the extracted topics.

For the experimental settings, we fixed and to 0.5, and the number of topics to 25. We varied the value of to study the effect of defining the coherence with different sets of topics. Finally, we normalized the measurements to be invariant to changes in , so that it facilitates the comparison of the results.

mean std mean std mean std


CGS 20 -6.496 0.031 0.043 0.003 0.641 0.120
SAME -6.083 0.024 0.058 0.006 0.793 0.056
Heron -6.032 0.033 0.055 0.002 0.823 0.070
CGS 50 -6.819 0.029 0.030 0.003 0.477 0.051
SAME -6.678 0.039 0.037 0.001 0.503 0.019
Heron -6.619 0.021 0.030 0.002 0.548 0.021


CGS 20 -8.693 0.001 0.046 0.001 0.941 0.002
SAME -8.524 0.066 0.051 0.006 1.062 0.121
Heron -7.483 0.040 0.159 0.004 3.047 0.069
CGS 50 -8.593 0.001 0.052 0.001 1.025 0.001
SAME -8.176 0.024 0.085 0.002 1.681 0.004
Heron -8.165 0.014 0.109 0.002 2.118 0.042


CGS 20 -6.889 0.121 0.247 0.007 4.865 0.136
SAME -6.527 0.000 0.295 0.000 4.905 0.000
Heron -6.797 0.107 0.249 0.009 5.838 0.178
CGS 50 -6.853 0.033 0.247 0.003 4.855 0.064
SAME -7.157 0.000 0.231 0.000 4.543 0.000
Heron -6.812 0.029 0.250 0.002 4.921 0.043
Table 6. Automatic topic coherence of LDA inferred by CGS, SAME, and Heron. Best results per dataset use bold font. From top to bottom, the datasets are Cora, Diggs, and Movielens, respectively.

Results presented in Table 6 show that Heron is able to infer latent topics with much better or comparable coherence compared to that by CGS or SAME. Interestingly, SAME achieves the best topic coherence in Movielens for . Although, simultaneously, it also obtains the worst results for . Apparently, SAME assigns a high ranking to the most frequent words in the dataset which leads to an improved coherence as measured by the co-occurrence based metrics.

In all datasets, a smaller value of results in better coherence. This can be justified by the fact that as we increase the number of words inside a topic, the harder it is to maintain the coherence of the topic. In other words, the top ranked words per topics typically result in better representation for the topic.

Given that Heron is able to improve CGS in all datasets, we conclude that Heron is able to infer coherent latent topics, answering RQ6.

6. Conclusion

In this paper, we have presented a probabilistic inference method within Gibbs framework named Heron. The derivation for Heron begins with maximizing the replications of the latent state, which is known to approximate better local optima. As such, our method achieves a higher predictive power as measured by the predictive perplexity. We studied the properties of convergence of the proposed algorithm and we found that ultimately we are solving a non-linear system of equations. This is beneficial since the fixed-point iterative algorithm that solves the system of equations allow the inference to be distributed into as many processes as the number of observed records in the dataset without affecting the estimation of the parameters. Our proposed Heron is deterministic which reduces the learning process to the pure computation of arithmetic operations and as result, our method runs faster than the state of the art GPU implementation within the Gibbs framework. GPU and CPU implementations of Heron are available at https://github.com/danrugeles/Heron/.


  • (1)
  • Airoldi et al. (2008) Edoardo M. Airoldi, David M. Blei, Stephen E. Fienberg, and Eric P. Xing. 2008. Mixed Membership Stochastic Blockmodels. J. Mach. Learn. Res. 9 (June 2008), 1981–2014. http://jmlr.csail.mit.edu/papers/volume9/airoldi08a/airoldi08a.pdf
  • Baum and Petrie (1966) Leonard. E. Baum and Ted Petrie. 1966. Statistical inference for probabilistic functions of finite state Markov chains. Annals of Mathematical Statistics 37 (1966), 1554–1563.
  • Blei et al. (2003) David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet Allocation. J. Mach. Learn. Res. 3 (March 2003), 993–1022. http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf
  • Cao and Fei-Fei (2007) Liangliang Cao and Li Fei-Fei. 2007. Spatially Coherent Latent Topic Model for Concurrent Segmentation and Classification of Objects and Scenes. IEEE 11th International Conference on Computer Vision (2007), 1–8.
  • Casella and George (1992) George Casella and Edward I. George. 1992. Explaining the Gibbs Sampler. The American Statistician 46, 3 (1992), 167–174.
  • Chang and Blei (2009) Jonathan Chang and David M. Blei. 2009. Relational Topic Models for Document Networks.. In AISTATS (JMLR Proceedings), David A. Van Dyk and Max Welling (Eds.), Vol. 5. JMLR.org, 81–88. http://dblp.uni-trier.de/db/journals/jmlr/jmlrp5.html#ChangB09
  • Chen et al. (2016) Jianfei Chen, Kaiwei Li, Jun Zhu, and Wenguang Chen. 2016. WarpLDA: A Cache Efficient O(1) Algorithm for Latent Dirichlet Allocation. Proc. VLDB Endow. 9, 10 (June 2016), 744–755. https://doi.org/10.14778/2977797.2977801
  • Chen et al. (2012) Xin Chen, Xiaohua Hu, Tze Yee Lim, Xiajiong Shen, E. K. Park, and Gail L. Rosen. 2012. Exploiting the Functional and Taxonomic Structure of Genomic Data by Probabilistic Topic Modeling. IEEE/ACM Trans. Comput. Biol. Bioinformatics 9, 4 (July 2012), 980–991. https://doi.org/10.1109/TCBB.2011.113
  • Griffiths and Steyvers (2004) Thomas L. Griffiths and Mark Steyvers. 2004. Finding scientific topics. Proceedings of the National Academy of Sciences of the United States of America 101 (2004), 5228–5235. https://doi.org/10.1073/pnas.0307752101
  • Hai et al. (2017) Zhen Hai, Gao Cong, Kuiyu Chang, Peng Cheng, and Chunyan Miao. 2017. Analyzing Sentiments in One Go: A Supervised Joint Topic Modeling Approach. IEEE Trans. on Knowl. and Data Eng. 29, 6 (June 2017), 1172–1185. https://doi.org/10.1109/TKDE.2017.2669027
  • Hoffman et al. (2010) Matthew Hoffman, Francis R. Bach, and David M. Blei. 2010. Online Learning for Latent Dirichlet Allocation. In Advances in Neural Information Processing Systems 23, J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta (Eds.). Curran Associates, Inc., 856–864. http://papers.nips.cc/paper/3902-online-learning-for-latent-dirichlet-allocation.pdf
  • Hu and Saul (2009) Diane Hu and Lawrence K. Saul. 2009.

    A Probabilistic Topic Model for Unsupervised Learning of Musical Key-Profiles.. In

    ISMIR, Keiji Hirata, George Tzanetakis, and Kazuyoshi Yoshii (Eds.). International Society for Music Information Retrieval, 441–446.
  • Lau et al. (2014) Jey Han Lau, David Newman, and Timothy Baldwin. 2014. Machine Reading Tea Leaves: Automatically Evaluating Topic Coherence and Topic Model Quality. In EACL. The Association for Computer Linguistics, 530–539.
  • Li et al. (2014) Aaron Q. Li, Amr Ahmed, Sujith Ravi, and Alexander J. Smola. 2014. Reducing the Sampling Complexity of Topic Models. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’14). ACM, New York, NY, USA, 891–900. https://doi.org/10.1145/2623330.2623756
  • Li et al. (2017) Kaiwei Li, Jianfei Chen, Wenguang Chen, and Jun Zhu. 2017. SaberLDA: Sparsity-Aware Learning of Topic Models on GPUs. SIGOPS Oper. Syst. Rev. 51, 2 (April 2017), 497–509. https://doi.org/10.1145/3093315.3037740
  • Mcauliffe and Blei (2008) Jon D. Mcauliffe and David M. Blei. 2008. Supervised Topic Models. In Advances in Neural Information Processing Systems 20, J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis (Eds.). Curran Associates, Inc., 121–128. http://papers.nips.cc/paper/3328-supervised-topic-models.pdf
  • Minka and Lafferty (2002) Thomas Minka and John Lafferty. 2002. Expectation-propagation for the Generative Aspect Model. In

    Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence

    (UAI’02). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 352–359.
  • Newman et al. (2010) David Newman, Jey Han Lau, Karl Grieser, and Timothy Baldwin. 2010. Automatic Evaluation of Topic Coherence. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (HLT ’10). Association for Computational Linguistics, Stroudsburg, PA, USA, 100–108. http://dl.acm.org/citation.cfm?id=1857999.1858011
  • Porteous et al. (2008) Ian Porteous, David Newman, Alexander Ihler, Arthur Asuncion, Padhraic Smyth, and Max Welling. 2008. Fast Collapsed Gibbs Sampling for Latent Dirichlet Allocation. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’08). ACM, New York, NY, USA, 569–577. https://doi.org/10.1145/1401890.1401960
  • Sato and Nakagawa ([n. d.]) Issei Sato and Hiroshi Nakagawa. [n. d.]. Rethinking Collapsed Variational Bayes Inference for LDA. In

    Proceedings of the 29th International Conference on Machine Learning

    . icml.cc / Omnipress, 101.
  • Smola and Narayanamurthy (2010) Alexander Smola and Shravan Narayanamurthy. 2010. An Architecture for Parallel Topic Models. Proc. VLDB Endow. 3, 1-2 (Sept. 2010), 703–710. https://doi.org/10.14778/1920841.1920931
  • Teh et al. (2007) Yee W. Teh, David Newman, and Max Welling. 2007. A Collapsed Variational Bayesian Inference Algorithm for Latent Dirichlet Allocation. In Advances in Neural Information Processing Systems 19. MIT Press, 1353–1360.
  • Walker (1977) Alastair J. Walker. 1977.

    An Efficient Method for Generating Discrete Random Variables with General Distributions.

    ACM Trans. Math. Softw. 3, 3 (Sept. 1977), 253–256. https://doi.org/10.1145/355744.355749
  • Yan et al. (2009) Feng Yan, Ningyi Xu, and Yuan Alan Qi. 2009. Parallel Inference for Latent Dirichlet Allocation on Graphics Processing Units. In Proceedings of the 22Nd International Conference on Neural Information Processing Systems (NIPS’09). Curran Associates Inc., USA, 2134–2142. http://dl.acm.org/citation.cfm?id=2984093.2984332
  • Yao et al. (2009) Limin Yao, David Mimno, and Andrew McCallum. 2009. Efficient Methods for Topic Model Inference on Streaming Document Collections. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’09). ACM, New York, NY, USA, 937–946. https://doi.org/10.1145/1557019.1557121
  • Yu et al. (2015) Hsiang-Fu Yu, Cho-Jui Hsieh, Hyokun Yun, S.V.N. Vishwanathan, and Inderjit S. Dhillon. 2015. A Scalable Asynchronous Distributed Algorithm for Topic Modeling. In Proceedings of the 24th International Conference on World Wide Web (WWW ’15). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, 1340–1350. https://doi.org/10.1145/2736277.2741682
  • Yuan et al. (2015) Jinhui Yuan, Fei Gao, Qirong Ho, Wei Dai, Jinliang Wei, Xun Zheng, Eric Po Xing, Tie-Yan Liu, and Wei-Ying Ma. 2015. LightLDA: Big Topic Models on Modest Computer Clusters. In Proceedings of the 24th International Conference on World Wide Web (WWW ’15). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, 1351–1361. https://doi.org/10.1145/2736277.2741115
  • Zhao et al. (2015) Huasha Zhao, Biye Jiang, John F. Canny, and Bobby Jaros. 2015. SAME but Different: Fast and High Quality Gibbs Parameter Estimation. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’15). ACM, New York, NY, USA, 1495–1502. https://doi.org/10.1145/2783258.2783416
  • Zheng (2012) Jia Zheng. 2012. A topic modeling toolbox using belief propagation. J. Mach. Learn. Res. 13 (Jan. 2012), 2233–2236. http://www.jmlr.org/papers/volume13/zeng12a/zeng12a.pdf