Gamma Belief Networks
To infer a multilayer representation of high-dimensional count vectors, we propose the Poisson gamma belief network (PGBN) that factorizes each of its layers into the product of a connection weight matrix and the nonnegative real hidden units of the next layer. The PGBN's hidden layers are jointly trained with an upward-downward Gibbs sampler, each iteration of which upward samples Dirichlet distributed connection weight vectors starting from the first layer (bottom data layer), and then downward samples gamma distributed hidden units starting from the top hidden layer. The gamma-negative binomial process combined with a layer-wise training strategy allows the PGBN to infer the width of each layer given a fixed budget on the width of the first layer. The PGBN with a single hidden layer reduces to Poisson factor analysis. Example results on text analysis illustrate interesting relationships between the width of the first layer and the inferred network structure, and demonstrate that the PGBN, whose hidden units are imposed with correlated gamma priors, can add more layers to increase its performance gains over Poisson factor analysis, given the same limit on the width of the first layer.READ FULL TEXT VIEW PDF
To infer multilayer deep representations of high-dimensional discrete an...
For text analysis, one often resorts to a lossy representation that eith...
The Poisson-Gamma model is a generalization of the Poisson model, which ...
A common approach to analyze a covariate-sample count matrix, an element...
Deep belief networks are a powerful way to model complex probability
Combining Bayesian nonparametrics and a forward model selection strategy...
We investigate solving discrete optimisation problems using the estimati...
Gamma Belief Networks
There has been significant recent interest in deep learning. Despite its tremendous success in supervised learning, inferring a multilayer data representation in an unsupervised manner remains a challenging problem[1, 2, 3]
. The sigmoid belief network (SBN), which connects the binary units of adjacent layers via the sigmoid functions, infers a deep representation of multivariate binary vectors[4, 5]
. The deep belief network (DBN)
is a SBN whose top hidden layer is replaced by the restricted Boltzmann machine (RBM) that is undirected. The deep Boltzmann machine (DBM) is an undirected deep network that connects the binary units of adjacent layers using the RBMs . All these deep networks are designed to model binary observations. Although one may modify the bottom layer to model Gaussian and multinomial observations, the hidden units of these networks are still typically restricted to be binary [8, 9, 10]. One may further consider the exponential family harmoniums [11, 12] to construct more general networks with non-binary hidden units, but often at the expense of noticeably increased complexity in training and data fitting.
Moving beyond conventional deep networks using binary hidden units, we construct a deep directed network with gamma distributed nonnegative real hidden units to unsupervisedly infer a multilayer representation of multivariate count vectors, with a simple but powerful mechanism to capture the correlations among the visible/hidden features across all layers and handle highly overdispersed counts. The proposed model is called the Poisson gamma belief network (PGBN), which factorizes the observed count vectors under the Poisson likelihood into the product of a factor loading matrix and the gamma distributed hidden units (factor scores) of layer one; and further factorizes the shape parameters of the gamma hidden units of each layer into the product of a connection weight matrix and the gamma hidden units of the next layer. Distinct from previous deep networks that often utilize binary units for tractable inference and require tuning both the width (number of hidden units) of each layer and the network depth (number of layers), the PGBN employs nonnegative real hidden units and automatically infers the widths of subsequent layers given a fixed budget on the width of its first layer. Note that the budget could be infinite and hence the whole network can grow without bound as more data are being observed. When the budget is finite and hence the ultimate capacity of the network is limited, we find that the PGBN equipped with a narrower first layer could increase its depth to match or even outperform a shallower network with a substantially wider first layer.
The gamma distribution density function has the highly desired strong non-linearity for deep learning, but the existence of neither a conjugate prior nor a closed-form maximum likelihood estimate for its shape parameter makes a deep network with gamma hidden units appear unattractive. Despite seemingly difficult, we discover that, by generalizing the data augmentation and marginalization techniques for discrete data, one may propagate latent counts one layer at a time from the bottom data layer to the top hidden layer, with which one may derive an efficient upward-downward Gibbs sampler that, one layer at a time in each iteration, upward samples Dirichlet distributed connection weight vectors and then downward samples gamma distributed hidden units.
In addition to constructing a new deep network that well fits multivariate count data and developing an efficient upward-downward Gibbs sampler, other contributions of the paper include: 1) combining the gamma-negative binomial process [13, 14] with a layer-wise training strategy to automatically infer the network structure; 2) revealing the relationship between the upper bound imposed on the width of the first layer and the inferred widths of subsequent layers; 3) revealing the relationship between the network depth and the model’s ability to model overdispersed counts; 4) and generating a multivariate high-dimensional random count vector, whose distribution is governed by the PGBN, by propagating the gamma hidden units of the top hidden layer back to the bottom data layer.
Let the Chinese restaurant table (CRT) distribution represent the distribution of a random count generated as
Its probability mass function (PMF) can be expressed aswhere , , and are unsigned Stirling numbers of the first kind. Let denote the logarithmic distribution with PMF where . Let denote the negative binomial (NB) distribution with PMF where . The NB distribution
can be generated as a gamma mixed Poisson distribution aswhere is the gamma scale parameter. As shown in 
, the joint distribution ofand given and in where and , is the same as that in which is called the Poisson-logarithmic bivariate distribution, with PMF .
Assuming the observations are multivariate count vectors , the generative model of the Poisson gamma belief network (PGBN) with hidden layers, from top to bottom, is expressed as
The PGBN factorizes the count observation into the product of the factor loading and hidden units of layer one under the Poisson likelihood, where , and for , factorizes the shape parameters of the gamma distributed hidden units of layer into the product of the connection weight matrix and the hidden units of layer ; the top layer’s hidden units share the same vector as their gamma shape parameters; and the are probability parameters and are gamma scale parameters, with .
For scale identifiabilty and ease of inference, each column of is restricted to have a unit norm. To complete the hierarchical model, for , we let
and impose and ; and for , we let
We expect the correlations between the rows (features) of to be captured by the columns of , and the correlations between the rows (latent features) of to be captured by the columns of . Even if all for are identity matrices, indicating no correlations between latent features, our analysis will show that a deep structure with could still benefit data fitting by better modeling the variability of the latent features .
Sigmoid and deep belief networks. Under the hierarchical model in (1), given the connection weight matrices, the joint distribution of the count observations and gamma hidden units of the PGBN can be expressed, similar to those of the sigmoid and deep belief networks , as
With representing the th row , for the gamma hidden units we have
which are highly nonlinear functions that are strongly desired in deep learning. By contrast, with the sigmoid function and bias terms , a sigmoid/deep belief network would connect the binary hidden units of layer (for deep belief networks, ) to the product of the connection weights and binary hidden units of the next layer with
) clearly shows the differences between the gamma nonnegative hidden units and the sigmoid link based binary hidden units. Note that the rectified linear units have emerged as powerful alternatives of sigmoid units to introduce nonlinearity. It would be interesting to use the gamma units to introduce nonlinearity in the positive region of the rectified linear units.
Deep Poisson factor analysis. With , the PGBN specified by (1)-(3) reduces to Poisson factor analysis (PFA) using the (truncated) gamma-negative binomial process , which is also related to latent Dirichlet allocation  if the Dirichlet priors are imposed on both and . With
, the PGBN is related to the gamma Markov chain hinted by Corollary 2 of and realized in , the deep exponential family of , and the deep PFA of . Different from the PGBN, in , it is the gamma scale but not shape parameters that are chained and factorized; in , it is the correlations between binary topic usage indicators but not the full connection weights that are captured; and neither  nor  provide a principled way to learn the network structure. Below we break the PGBN of layers into related submodels that are solved with the same subroutine.
for , one may connect the observed (if ) or some latent (if ) counts to the product at layer under the Poisson likelihood as
where . With representing the number of times that factor of layer appears in observation and , since , we can marginalize out as in , leading to
Further marginalizing out the gamma distributed from the above Poisson likelihood leads to
The th element of can be augmented under its compound Poisson representation as
Thus if (7) is true for layer , then it is also true for layer . ∎
As and is in the same order as , the total count of layer , expressed as , would often be much smaller than that of layer , expressed as . Thus the PGBN may use as a simple criterion to decide whether to add more layers.
In comparison to a single-layer shallow model with that assumes the hidden units of layer one to be independent in the prior, the multilayer deep model with captures the correlations between them. Note that for the extreme case that for are all identity matrices, which indicates that there are no correlations between the features of left to be captured, the deep structure could still provide benefits as it helps model latent counts that may be highly overdispersed. For example, supposing for all , then from (1) and (9) we have
For simplicity, let us further assume for all
. Using the laws of total expectation and total variance, we haveand and hence
In comparison to PFA with , with a variance-to-mean ratio of , the PGBN with hidden layers, which mixes the shape of
with a chain of gamma random variables, increases the variance-to-mean ratio of the latent countgiven by a factor of , and hence could better model highly overdispersed counts.
With Lemma 1 and Corollary 2 and the width of the first layer being bounded by , we develop an upward-downward Gibbs sampler for the PGBN, each iteration of which proceeds as follows: Sample . We can sample for all layers using (10). But for the first hidden layer, we may treat each observed count as a sequence of word tokens at the th term (in a vocabulary of size ) in the th document, and assign the words one after another to the latent factors (topics), with both the topics and topic weights marginalized out, as
where is the topic index for and counts the number of times that term appears in document ; we use the symbol to represent summing over the corresponding index, e.g., , and use to denote the count calculated without considering word in document . The collapsed Gibbs sampling update equation shown above is related to the one developed in  for latent Dirichlet allocation, and the one developed in  for PFA using the beta-negative binomial process. When , we would replace the terms with for PFA built on the gamma-negative binomial process  (or with for the hierarchical Dirichlet process latent Dirichlet allocation, see  and  for details), and add an additional term to account for the possibility of creating an additional topic . For simplicity, in this paper, we truncate the nonparametric Bayesian model with factors and let if .
Sample . Given these latent counts, we sample the factors/topics as
Sample . Both and are sampled using related equations in . We sample as
Sample . With for and , we sample and as
and calculate and with (6).
As jointly training all layers together is often difficult, existing deep networks are typically trained using a greedy layer-wise unsupervised training algorithm, such as the one proposed in  to train the deep belief networks. The effectiveness of this training strategy is further analyzed in . By contrast, the PGBN has a simple Gibbs sampler to jointly train all its hidden layers, as described in Section 2.3, and hence does not require greedy layer-wise training. Yet the same as commonly used deep learning algorithms, it still needs to specify the number of layers and the width of each layer.
In this paper, we adopt the idea of layer-wise training for the PGBN, not because of the lack of an effective joint-training algorithm, but for the purpose of learning the width of each hidden layer in a greedy layer-wise manner, given a fixed budget on the width of the first layer. The proposed layer-wise training strategy is summarized in Algorithm 1. With a PGBN of layers that has already been trained, the key idea is to use a truncated gamma-negative binomial process  to model the latent count matrix for the newly added top layer as , and rely on that stochastic process’s shrinkage mechanism to prune inactive factors (connection weight vectors) of layer , and hence the inferred would be smaller than if is sufficiently large. The newly added layer and the layers below it would be jointly trained, but with the structure below the newly added layer kept unchanged. Note that when , the PGBN would infer the number of active factors if is set large enough, otherwise, it would still assign the factors with different weights , but may not be able to prune any of them.
We apply the PGBNs for topic modeling of text corpora, each document of which is represented as a term-frequency count vector. Note that the PGBN with a single hidden layer is identical to the (truncated) gamma-negative binomial process PFA of , which is a nonparametric Bayesian algorithm that performs similarly to the hierarchical Dirichlet process latent Dirichlet allocation  for text analysis, and is considered as a strong baseline that outperforms a large number of topic modeling algorithms. Thus we will focus on making comparison to the PGBN with a single layer, with its layer width set to be large to approximate the performance of the gamma-negative binomial process PFA. We evaluate the PGBNs’ performance by examining both how well they unsupervisedly extract low-dimensional features for document classification, and how well they predict heldout word tokens. Matlab code will be available in http://mingyuanzhou.github.io/.
We use Algorithm 1 to learn, in a layer-wise manner, from the training data the weight matrices and the top-layer hidden units’ gamma shape parameters : to add layer to a previously trained network with layers, we use iterations to jointly train and together with , prune the inactive factors of layer , and continue the joint training with another iterations. We set the hyper-parameters as and . Given the trained network, we apply the upward-downward Gibbs sampler to collect 500 MCMC samples after 500 burnins to estimate the posterior mean of the feature usage proportion vector at the first hidden layer, for every document in both the training and testing sets.
Feature learning for binary classification. We consider the 20 newsgroups dataset (http://qwone.com/jason/20Newsgroups/) that consists of 18,774 documents from 20 different news groups, with a vocabulary of size 61,188. It is partitioned into a training set of 11,269 documents and a testing set of 7,505 ones. We first consider two binary classification tasks that distinguish between the and , and between the and news groups. For each binary classification task, we remove a standard list of stop words and only consider the terms that appear at least five times, and report the classification accuracies based on 12 independent random trials. With the upper bound of the first layer’s width set as , and and for all , we use Algorithm 1 to train a network with layers. Denote as the estimated dimensional feature vector for document , where is the inferred number of active factors of the first layer that is bounded by the pre-specified truncation level . We use the
regularized logistic regression provided by the LIBLINEAR package
to train a linear classifier onin the training set and use it to classify in the test set, where the regularization parameter is five-folder cross-validated on the training set from .
As shown in Fig. 2, modifying the PGBN from a single-layer shallow network to a multilayer deep one clearly improves the qualities of the unsupervisedly extracted feature vectors. In a random trial, with , we infer a network structure of for the first binary classification task, and for the second one. Figs. 2(c)-(d) also show that increasing the network depth in general improves the performance, but the first-layer width clearly plays an important role in controlling the ultimate network capacity. This insight is further illustrated below.
Feature learning for multi-class classification. We test the PGBNs for multi-class classification on 20newsgroups. After removing a standard list of stopwords and the terms that appear less than five times, we obtain a vocabulary with . We set and for all . If , we set for all , otherwise we set and for . We use all 11,269 training documents to infer a set of networks with and , and mimic the same testing procedure used for binary classification to extract low-dimensional feature vectors, with which each testing document is classified to one of the 20 news groups using the regularized logistic regression. Fig. 2 shows a clear trend of improvement in classification accuracy by increasing the network depth with a limited first-layer width, or by increasing the upper bound of the width of the first layer with the depth fixed. For example, a single-layer PGBN with could add one or more layers to slightly outperform a single-layer PGBN with , and a single-layer PGBN with could add layers to clearly outperform a single-layer PGBN with as large as . We also note that each iteration of jointly training multiple layers costs moderately more than that of training a single layer, e.g., with , a training iteration on a single core of an Intel Xeon 2.7 GHz CPU on average takes about , , seconds for the PGBN with , , and layers, respectively.
Examining the inferred network structure also reveals interesting details. For example, in a random trial with Algorithm 1, the inferred network widths are , , , and , for , and , respectively. This indicates that for a network with an insufficient budget on its first-layer width, as the network depth increases, its inferred layer widths decay more slowly than a network with a sufficient or surplus budget on its first-layer width; and a network with a surplus budget on its first-layer width may only need relatively small widths for its higher hidden layers. In the Appendix, we provide comparisons of accuracies between the PGBN and other related algorithms, including these of  and , on similar multi-class document classification tasks.
Perplexities for holdout words. In addition to examining the performance of the PGBN for unsupervised feature learning, we also consider a more direct approach that we randomly choose 30% of the word tokens in each document as training, and use the remaining ones to calculate per-heldout-word perplexity. We consider the NIPS12 (http://www.cs.nyu.edu/roweis/data.html) corpus, limiting the vocabulary to the 2000 most frequent terms. We set and for all , set and for , and consider five random trials. Among the Gibbs sampling iterations used to train layer , we collect one sample per five iterations during the last iterations, for each of which we draw the topics and topics weights , to compute the per-heldout-word perplexity using Equation (34) of . As shown in Fig. 3, we observe a clear trend of improvement by increasing both and .
Qualitative analysis and document simulation. In addition to these quantitative experiments, we have also examined the topics learned at each layer. We use to project topic of layer as a -dimensional word probability vector. Generally speaking, the topics at lower layers are more specific, whereas those at higher layers are more general. E.g., examining the results used to produce Fig. 3, with and , the PGBN infers a network with . The ranks (by popularity) and top five words of three example topics for layer are “6 network units input learning training,” “15 data model learning set image,” and “34 network learning model input neural;” while these of five example topics of layer are “19 likelihood em mixture parameters data,” “37 bayesian posterior prior log evidence,” “62 variables belief networks conditional inference,” “126 boltzmann binary machine energy hinton,” and “127 speech speaker acoustic vowel phonetic.” We have also tried drawing and downward passing it through the -layer network to generate synthetic documents, which are found to be quite interpretable and reflect various general aspects of the corpus used to train the network. We provide in the Appendix a number of synthetic documents generated from a PGBN trained on the 20newsgroups corpus, whose inferred structure is .
The Poisson gamma belief network is proposed to extract a multilayer deep representation for high-dimensional count vectors, with an efficient upward-downward Gibbs sampler to jointly train all its layers and a layer-wise training strategy to automatically infer the network structure. Example results clearly demonstrate the advantages of deep topic models. For big data problems, in practice one may rarely has a sufficient budget to allow the first-layer width to grow without bound, thus it is natural to consider a belief network that can use a deep representation to not only enhance its representation power, but also better allocate its computational resource. Our algorithm achieves a good compromise between the widths of hidden layers and the depth of the network.
Acknowledgements. M. Zhou thanks TACC for computational support. B. Chen thanks the support of the Thousand Young Talent Program of China, NSC-China (61372132), and NCET-13-0945.
Training products of experts by minimizing contrastive divergence.Neural computation, pages 1771–1800, 2002.
For comparison, we consider the same regularized logistic regression multi-class classifier, trained either on the raw word counts or normalized term-frequencies of the 20newsgroups training documents using five-folder cross-validation. As summarized in Tab. 1, when using the raw term-frequency word counts as covariates, the same classifier achieves () accuracy on the 20newsgroups test documents if using the top 2000 terms that exclude (include) a standard list of stopwords, achieves if using all the terms in the vocabulary, and achieves if using the terms remained after removing a standard list of stopwords and the terms that appear less than five times; and when using the normalized term-frequencies as covariates, the corresponding accuracies are () if using the top 2000 terms excluding (including) stopwords, with all the terms, and with the selected terms.
|with stopwords||with stopwords||remove stopwords||remove stopwords|
|with rare words||with rare words||remove rare words||remove rare words|
|raw word counts||term frequencies||raw word counts||term frequencies|
|with stopwords||with stopwords||remove stopwords||remove stopwords|
|raw counts||term frequencies||raw counts||term frequencies|
As summarized in Tab. 2, for multi-class classification on the same dataset, with a vocabulary size of 2000 that consisits of the 2000 most frequent terms after removing stopwords and stemming, the DocNADE  and the over-replicated softmax  provide the accuracies of and , respectively, for a feature dimension of , and provide the accuracies of and , respectively, for a feature dimension of .
|remove stopwords, stemming||remove stopwords, stemming|
As summarized in Tab. 3, with the same vocabulary size of 2000 (but different terms due to different preprocessing), the proposed PGBN provides () with () for , and ( with ( for , which may be further improved if we also consider the stemming step, as done in the these two algorithms, for word preprocessing, or if we set the values of to be smaller than 0.05. We also summarize in Tab. 3 the classification accuracies of the PGBNs learned with , as shown in Fig. 2.
|remove stopwords||remove stopwords||remove stopwords|
|remove stopwords||remove stopwords||remove stopwords|
|remove rare words||remove rare words||remove rare words|
Below we provide several synthetic documents generated from the PGBN with