Gamma Belief Networks

12/09/2015 ∙ by Mingyuan Zhou, et al. ∙ The University of Texas at Austin ∙ Xidian University ∙ NetEase, Inc.

To infer multilayer deep representations of high-dimensional discrete and nonnegative real vectors, we propose an augmentable gamma belief network (GBN) that factorizes each of its hidden layers into the product of a sparse connection weight matrix and the nonnegative real hidden units of the next layer. The GBN's hidden layers are jointly trained with an upward-downward Gibbs sampler that solves each layer with the same subroutine. The gamma-negative binomial process combined with a layer-wise training strategy allows inferring the width of each layer given a fixed budget on the width of the first layer. Example results illustrate interesting relationships between the width of the first layer and the inferred network structure, and demonstrate that the GBN can add more layers to improve its performance in both unsupervisedly extracting features and predicting heldout data. For exploratory data analysis, we extract trees and subnetworks from the learned deep network to visualize how the very specific factors discovered at the first hidden layer and the increasingly more general factors discovered at deeper hidden layers are related to each other, and we generate synthetic data by propagating random variables through the deep network from the top hidden layer back to the bottom data layer.


1 Introduction

There has been significant recent interest in deep learning. Despite its tremendous success in supervised learning, inferring a multilayer data representation in an unsupervised manner remains a challenging problem (Bengio and LeCun, 2007; Ranzato et al., 2007; Bengio et al., 2015). To generate data with a deep network, it is often unclear how to set the structure of the network, including the depth (number of layers) of the network and the width (number of hidden units) of each layer. In addition, for some commonly used deep generative models, including the sigmoid belief network (SBN), deep belief network (DBN), and deep Boltzmann machine (DBM), the hidden units are often restricted to be binary. More specifically, the SBN, which connects the binary units of adjacent layers via sigmoid functions, infers a deep representation of multivariate binary vectors (Neal, 1992; Saul et al., 1996); the DBN (Hinton et al., 2006) is an SBN whose top hidden layer is replaced by the restricted Boltzmann machine (RBM) (Hinton, 2002), which is undirected; and the DBM is an undirected deep network that connects the binary units of adjacent layers using RBMs (Salakhutdinov and Hinton, 2009). All three of these deep networks are designed to model binary observations, without principled ways to set the network structure. Although one may modify the bottom layer to model Gaussian and multinomial observations, the hidden units of these networks are still typically restricted to be binary (Salakhutdinov and Hinton, 2009; Larochelle and Lauly, 2012; Salakhutdinov et al., 2013). To generalize these models, one may consider exponential family harmoniums (Welling et al., 2004; Xing et al., 2005) to construct more general networks with non-binary hidden units, but often at the expense of noticeably increased complexity in training and data fitting. To model real-valued data without restricting the hidden units to be binary, one may consider the general framework of nonlinear Gaussian belief networks (Frey and Hinton, 1999), which constructs continuous hidden units by nonlinearly transforming Gaussian distributed latent variables and includes as special cases both the continuous SBN of Frey (1997a, b) and the rectified Gaussian nets of Hinton and Ghahramani (1997). More recent scalable generalizations under that framework include variational auto-encoders (Kingma and Welling, 2014) and deep latent Gaussian models (Rezende et al., 2014).

Moving beyond conventional deep generative models that use binary or nonlinearly transformed Gaussian hidden units and set the network structure in a heuristic manner, we construct deep networks using gamma distributed nonnegative real hidden units, and combine the gamma-negative binomial process (Zhou and Carin, 2015; Zhou et al., 2015b) with a greedy layer-wise training strategy to automatically infer the network structure. The proposed model, called the augmentable gamma belief network and referred to hereafter for brevity as the GBN, factorizes the observed or latent count vectors under the Poisson likelihood into the product of a factor loading matrix and the gamma distributed hidden units (factor scores) of layer one, and further factorizes the shape parameters of the gamma hidden units of each layer into the product of a connection weight matrix and the gamma hidden units of the next layer. The GBN together with Poisson factor analysis can unsupervisedly infer a multilayer representation from multivariate count vectors, with a simple but powerful mechanism to capture the correlations among the visible/hidden features across all layers and to handle highly overdispersed counts. With the Bernoulli-Poisson link function (Zhou, 2015), the GBN is further applied to high-dimensional sparse binary vectors by truncating latent counts, and with a Poisson randomized gamma distribution, the GBN is further applied to high-dimensional sparse nonnegative real data by randomizing the gamma shape parameters with latent counts.

For tractable inference of a deep generative model, one often applies either a sampling based procedure (Neal, 1992; Frey, 1997a) or variational inference (Saul et al., 1996; Frey, 1997b; Ranganath et al., 2014b; Kingma and Welling, 2014). However, conjugate priors on the model parameters that connect adjacent layers are often unknown, making it difficult to develop fully Bayesian inference that infers the posterior distributions of these parameters. It was not until recently that a Gibbs sampling algorithm, imposing priors on the network connection weights and sampling from their conditional posteriors, was developed for the SBN by Gan et al. (2015b), using the Polya-Gamma data augmentation technique developed for logistic models (Polson et al., 2012). In this paper, we develop a data augmentation technique unique to the augmentable GBN, allowing us to derive a fully Bayesian upward-downward Gibbs sampling algorithm that infers the posterior distributions of not only the hidden units, but also the connection weights between adjacent layers.

Distinct from previous deep networks that often require tuning both the width (number of hidden units) of each layer and the network depth (number of layers), the GBN employs nonnegative real hidden units and automatically infers the widths of subsequent layers given a fixed budget on the width of its first layer. Note that the budget could be infinite and hence the whole network can grow without bound as more data are being observed. Similar to other belief networks that can often be improved by adding more hidden layers (Hinton et al., 2006; Sutskever and Hinton, 2008; Bengio et al., 2015), for the proposed model, when the budget on the first layer is finite and hence the ultimate capacity of the network could be limited, our experimental results also show that a GBN equipped with a narrower first layer could increase its depth to match or even outperform a shallower one with a substantially wider first layer.

The gamma distribution density function has the highly desired strong nonlinearity for deep learning, but the existence of neither a conjugate prior nor a closed-form maximum likelihood estimate (Choi and Wette, 1969) for its shape parameter makes a deep network with gamma hidden units appear unattractive. Although this appears difficult, we discover that, by generalizing the data augmentation and marginalization techniques for discrete data modeled with the Poisson, gamma, and negative binomial distributions (Zhou and Carin, 2015), one may propagate latent counts one layer at a time from the bottom data layer to the top hidden layer. With these latent counts, one may derive an efficient upward-downward Gibbs sampler that, one layer at a time in each iteration, upward samples Dirichlet distributed connection weight vectors and then downward samples gamma distributed hidden units, with the latent parameters of each layer solved with the same subroutine.

With extensive experiments in text and image analysis, we demonstrate that the deep GBN with two or more hidden layers clearly outperforms the shallow GBN with a single hidden layer in both unsupervisedly extracting latent features for classification and predicting heldout data. Moreover, we demonstrate the excellent ability of the GBN in exploratory data analysis: by extracting trees and subnetworks from the learned deep network, we can follow the paths of each tree to visualize various aspects of the data, from very general to very specific, and understand how they are related to each other.

In addition to constructing a new deep network that fits high-dimensional sparse binary, count, and nonnegative real data well, developing an efficient upward-downward Gibbs sampler, and applying the learned deep network for exploratory data analysis, the other contributions of the paper include: 1) proposing novel link functions; 2) combining the gamma-negative binomial process (Zhou and Carin, 2015; Zhou et al., 2015b) with a layer-wise training strategy to automatically infer the network structure; 3) revealing the relationship between the upper bound imposed on the width of the first layer and the inferred widths of subsequent layers; 4) revealing the relationship between the depth of the network and the model’s ability to model overdispersed counts; and 5) generating multivariate high-dimensional discrete or nonnegative real vectors, whose distributions are governed by the GBN, by propagating the gamma hidden units of the top hidden layer back to the bottom data layer. We note that this paper significantly extends our recent conference publication (Zhou et al., 2015a), which proposed the Poisson GBN.

2 Augmentable Gamma Belief Networks

Denoting as the hidden units of sample at layer , where , the generative model of the augmentable gamma belief network (GBN) with hidden layers, from top to bottom, is expressed as

(1)

where represents a gamma distribution with mean and variance . For , the GBN factorizes the shape parameters of the gamma distributed hidden units of layer into the product of the connection weight matrix and the hidden units of layer ; the top layer’s hidden units share the same vector as their gamma shape parameters; and the are probability parameters and are gamma scale parameters, with . We will discuss later how to measure the connection strengths between the nodes of adjacent layers and the overall popularity of a factor at a particular hidden layer.

For scale identifiability and ease of inference and interpretation, each column of is restricted to have a unit norm and hence . To complete the hierarchical model, for , we let

(2)

where is the th column of ; we impose and ; and for , we let

(3)

We expect the correlations between the rows (latent features) of to be captured by the columns of . Even if for are all identity matrices, indicating no correlations between the latent features to be captured, our analysis in Section 3.2 will show that a deep structure with could still benefit data fitting by better modeling the variability of the latent features . Before further examining the network structure, below we first introduce a set of distributions that will be used to either model different types of data or augment the model for simple inference.
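
To make the hierarchy above concrete, the following sketch draws the hidden units of a GBN from the top layer down, following the text's description that the gamma shape parameters of each layer are the product of a connection weight matrix and the hidden units of the layer above, with the top layer sharing a common shape vector. All names (Phi, r, c) and the simplification of a single rate per layer (in place of per-sample scale parameters) are our own assumptions.

```python
import numpy as np

def sample_gbn_hidden_units(Phi, r, c, n_samples, rng):
    """Draw hidden units of a T-layer GBN from the top layer down (a sketch).

    Phi : list [Phi^(2), ..., Phi^(T)]; Phi^(t+1) has shape (K_t, K_{t+1})
          with nonnegative columns summing to one.
    r   : length-K_T gamma shape vector shared by the top layer's hidden units.
    c   : array of positive rates; c[t] stands in for the per-sample
          scale-related parameters of layer t - 1 (indices 2..T+1 are used).
    Returns [theta^(1), ..., theta^(T)], each of shape (K_t, n_samples).
    """
    T = len(Phi) + 1
    theta = rng.gamma(shape=r[:, None], scale=1.0 / c[T + 1], size=(len(r), n_samples))
    thetas = [theta]                                   # top layer theta^(T)
    for t in range(T - 1, 0, -1):                      # layers T-1, ..., 1
        shape = Phi[t - 1] @ thetas[0]                 # Phi^(t+1) theta^(t+1)
        thetas.insert(0, rng.gamma(shape=shape, scale=1.0 / c[t + 1]))
    return thetas

rng = np.random.default_rng(0)
K = [8, 4, 2]                                          # widths K_1, K_2, K_3 (arbitrary)
Phi = [rng.dirichlet(np.ones(K[t]), size=K[t + 1]).T for t in range(len(K) - 1)]
thetas = sample_gbn_hidden_units(Phi, r=0.5 * np.ones(K[-1]), c=np.ones(5), n_samples=3, rng=rng)
print([th.shape for th in thetas])                     # [(8, 3), (4, 3), (2, 3)]
```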

2.1 Distributions for Count, Binary, and Nonnegative Real Data

Below we first describe some useful count distributions that will be used later.

2.1.1 Useful Count Distributions and Their Relationships

Let the Chinese restaurant table (CRT) distribution represent the random number of tables seated by customers in a Chinese restaurant process (Blackwell and MacQueen, 1973; Antoniak, 1974; Aldous, 1985; Pitman, 2006) with concentration parameter . Its probability mass function (PMF) can be expressed as

where , , and are unsigned Stirling numbers of the first kind. A CRT distributed sample can be generated by taking the summation of independent Bernoulli random variables as
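
As a concrete illustration, the Bernoulli-sum construction of a CRT draw can be coded in a few lines; the function name and parameter values below are ours, assuming the standard convention that customer n starts a new table with probability r / (n - 1 + r).

```python
import numpy as np

def sample_crt(m, r, rng):
    """Sample l ~ CRT(m, r): the number of occupied tables after m customers enter a
    Chinese restaurant process with concentration parameter r, obtained as a sum of
    independent Bernoulli(r / (r + n - 1)) draws for n = 1..m."""
    if m == 0:
        return 0
    n = np.arange(1, m + 1)
    return int(rng.binomial(1, r / (r + n - 1)).sum())

rng = np.random.default_rng(1)
print(np.mean([sample_crt(50, 2.0, rng) for _ in range(20000)]))  # grows roughly like r*log(1 + m/r)
```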

Let denote the logarithmic distribution (Fisher et al., 1943; Anscombe, 1950; Johnson et al., 1997) with PMF

where , and let denote the negative binomial (NB) distribution (Greenwood and Yule, 1920; Bliss and Fisher, 1953) with PMF

where . The NB distribution can be generated as a gamma mixed Poisson distribution as , where is the gamma scale parameter.
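
The gamma-mixed-Poisson construction of the NB distribution can be verified numerically; the sketch below assumes the parameterization NB(r, p) with mean r p / (1 - p), so the mixing gamma has shape r and scale p / (1 - p). Parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
r, p, n = 3.0, 0.4, 200000

# gamma mixed Poisson: lam ~ Gamma(r, scale = p / (1 - p)), m | lam ~ Pois(lam)
lam = rng.gamma(shape=r, scale=p / (1 - p), size=n)
m = rng.poisson(lam)

# direct NB draws; NumPy parameterizes by the complementary probability 1 - p
m_nb = rng.negative_binomial(r, 1 - p, size=n)

print(m.mean(), m_nb.mean(), r * p / (1 - p))        # all close to the NB mean
print(m.var(), m_nb.var(), r * p / (1 - p) ** 2)     # all close to the NB variance
```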

As shown in (Zhou and Carin, 2015), the joint distribution of and given and in , where and , is the same as that in

(4)

which is called the Poisson-logarithmic bivariate distribution, with PMF

We will exploit these relationships to derive efficient inference for the proposed models.

2.1.2 Bernoulli-Poisson Link and Truncated Poisson Distribution

As in Zhou (2015), the Bernoulli-Poisson (BerPo) link thresholds a random count at one to obtain a binary variable as

(5)

where if and if . If is marginalized out from (5), then given , one obtains a Bernoulli random variable. The conditional posterior of the latent count can be expressed as

where follows a truncated Poisson distribution, with for . Thus if , then almost surely (a.s.), and if , then , which can be simulated with a rejection sampler that has a minimal acceptance rate of 63.2% at (Zhou, 2015). Given the latent count and a gamma prior on , one can then update using the gamma-Poisson conjugacy. The BerPo link shares some similarities with the probit link that thresholds a normal random variable at zero, and the logistic link that lets . We advocate the BerPo link as an alternative to the probit and logistic links since if , then a.s., which could lead to significant computational savings if the binary vectors are sparse. In addition, the conjugacy between the gamma and Poisson distributions makes it convenient to construct hierarchical Bayesian models amenable to posterior simulation.
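
A small sketch of the BerPo link and its conditional truncated Poisson sampler follows; the two-branch rejection scheme is our reading of the 63.2% minimal-acceptance-rate remark above (attained when the rate parameter equals one), and all names are ours.

```python
import numpy as np

def berpo_forward(lam, rng):
    """b = 1(n >= 1) with n ~ Pois(lam); marginally b ~ Bernoulli(1 - exp(-lam))."""
    return (rng.poisson(lam) >= 1).astype(int)

def sample_truncated_poisson(lam, rng):
    """Draw from a Poisson(lam) truncated to the positive integers."""
    if lam >= 1.0:
        while True:                      # plain rejection: acceptance 1 - exp(-lam)
            n = rng.poisson(lam)
            if n >= 1:
                return n
    while True:                          # propose n = 1 + Pois(lam), accept w.p. 1/n
        n = 1 + rng.poisson(lam)
        if rng.random() < 1.0 / n:
            return n

def berpo_latent_count(b, lam, rng):
    """Conditional posterior of the latent count: 0 a.s. if b = 0, truncated Poisson if b = 1."""
    return 0 if b == 0 else sample_truncated_poisson(lam, rng)

rng = np.random.default_rng(3)
b = berpo_forward(np.full(100000, 0.3), rng)
print(b.mean(), 1 - np.exp(-0.3))        # empirical vs. theoretical P(b = 1)
```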

2.1.3 Poisson Randomized Gamma and Truncated Bessel Distributions

To model nonnegative data that include both zeros and positive observations, we introduce the Poisson randomized gamma (PRG) distribution as

whose distribution has a point mass at and is continuous for . The PRG distribution is generated as a Poisson mixed gamma distribution as

in which we define a.s. and hence if and only if . Thus the PMF of can be expressed as

(6)

where is the modified Bessel function of the first kind with fixed at . Using the laws of total expectation and total variance, or using the PMF directly, one may show that

Thus the variance-to-mean ratio of the PRG distribution is , as controlled by .
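
The Poisson-mixed-gamma construction of the PRG distribution can likewise be simulated directly. Under the parameterization below (Poisson rate lam and gamma scale 1/c, both our own notation), the total-expectation/total-variance argument gives, by our derivation, mean lam/c and a variance-to-mean ratio of 2/c, which the simulation reproduces.

```python
import numpy as np

def sample_prg(lam, c, size, rng):
    """x ~ PRG(lam, c): draw n ~ Pois(lam), then x ~ Gamma(n, scale = 1/c),
    with x = 0 exactly when n = 0 (the point mass at zero)."""
    n = rng.poisson(lam, size=size)
    x = rng.gamma(np.maximum(n, 1), scale=1.0 / c, size=size)
    return np.where(n > 0, x, 0.0)

rng = np.random.default_rng(4)
lam, c = 2.5, 0.5
x = sample_prg(lam, c, size=300000, rng=rng)
print((x == 0).mean(), np.exp(-lam))          # point mass at zero
print(x.mean(), lam / c)                      # mean
print(x.var() / x.mean(), 2.0 / c)            # variance-to-mean ratio
```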

The conditional posterior of given , , and can be expressed as

(7)

where we define as the truncated Bessel distribution, with PMF

Thus if and only if , and is a positive integer drawn from a truncated Bessel distribution if . In Appendix A, we plot the probability distribution functions of the proposed PRG and truncated Bessel distributions and show how they differ from the randomized gamma and Bessel distributions (Yuan and Kalbfleisch, 2000), respectively.

2.2 Link Functions for Three Different Types of Observations

If the observations are multivariate count vectors , where , then we link the integer-valued visible units to the nonnegative real hidden units at layer one using Poisson factor analysis (PFA) as

(8)

Under this construction, the correlations between the rows (features) of are captured by the columns of . Detailed descriptions of how PFA is related to a wide variety of discrete latent variable models, including nonnegative matrix factorization (Lee and Seung, 2001), latent Dirichlet allocation (Blei et al., 2003), the gamma-Poisson model (Canny, 2004), discrete principal component analysis (Buntine and Jakulin, 2006), and the focused topic model (Williamson et al., 2010), can be found in Zhou et al. (2012) and Zhou and Carin (2015).

We call PFA using the GBN in (1) as the prior on its factor scores the Poisson gamma belief network (PGBN), as proposed in Zhou et al. (2015a). The PGBN can be naturally applied to factorize the term-document frequency count matrix of a text corpus, not only extracting semantically meaningful topics at multiple layers, but also capturing the relationships between the topics of different layers using the deep network, as discussed below in both Sections 2.3 and 4.

If the observations are high-dimensional sparse binary vectors , then we factorize them using Bernoulli-Poisson factor analysis (Ber-PFA) as

(9)

We call Ber-PFA with the augmentable GBN as the prior on its factor scores the Bernoulli-Poisson gamma belief network (BerPo-GBN).

If the observations are high-dimensional sparse nonnegative real-valued vectors , then we factorize them using Poisson randomized gamma (PRG) factor analysis as

(10)

We call PRG factor analysis with the augmentable GBN as the prior on its factor scores the PRG gamma belief network (PRG-GBN).
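
The three link functions share the same data-layer rate Phi^(1) theta^(1) and differ only in how the latent Poisson counts are turned into observations; a minimal sketch follows, in which the PRG scale parameter c and all function names are our own notation.

```python
import numpy as np

def pfa_counts(Phi1, theta1, rng):
    """Eq. (8): count observations x ~ Pois(Phi^(1) theta^(1))."""
    return rng.poisson(Phi1 @ theta1)

def berpo_binary(Phi1, theta1, rng):
    """Eq. (9): sparse binary observations, b = 1(latent count >= 1)."""
    return (rng.poisson(Phi1 @ theta1) >= 1).astype(int)

def prg_nonnegative(Phi1, theta1, c, rng):
    """Eq. (10): sparse nonnegative real observations via the PRG distribution."""
    n = rng.poisson(Phi1 @ theta1)
    return np.where(n > 0, rng.gamma(np.maximum(n, 1), scale=1.0 / c), 0.0)

rng = np.random.default_rng(5)
Phi1 = rng.dirichlet(np.ones(20), size=5).T          # (V=20, K_1=5), unit-norm columns
theta1 = rng.gamma(1.0, size=(5, 3))                 # hidden units for 3 samples
print(pfa_counts(Phi1, theta1, rng).shape)           # (20, 3)
```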

Figure 1: An example directed network of five hidden layers, with visible units, , and sparse connections between the units of adjacent layers.

We show in Figure 1 an example directed belief network of five hidden layers, with visible units, with , , , , and hidden units for layers one, two, three, four, and five, respectively, and with sparse connections between the units of adjacent layers.

2.3 Exploratory Data Analysis

To interpret the network structure of the GBN, we notice that

(11)
(12)

Thus for visualization, it is straightforward to project the topics/hidden units/factor loadings/nodes of layer to the bottom data layer as the columns of the matrix

(13)

and rank their popularities using the dimensional nonnegative weight vector

(14)

To measure the connection strength between node of layer and node of layer , we use the value of

which is also expressed as or .
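
As a sketch of the projection in (13), the layer-t topics can be pushed down to the data layer by multiplying the loading matrices of all layers below; the popularity proxy below (average activation across samples) is our own stand-in for the weight vector in (14), whose exact form is not reproduced here.

```python
import numpy as np

def project_topics(Phis, t):
    """Columns of Phi^(1) Phi^(2) ... Phi^(t): the layer-t nodes projected to the
    bottom data layer, as in (13).  Phis = [Phi^(1), ..., Phi^(T)]."""
    return Phis[0] if t == 1 else np.linalg.multi_dot(Phis[:t])

def node_popularity(theta_t):
    """A simple popularity proxy for the nodes of a layer: the average activation
    of each hidden unit across samples (a stand-in for the weights in (14))."""
    return theta_t.mean(axis=1)

# rank the nodes of layer t from most to least popular:
# order = np.argsort(-node_popularity(theta_t))
```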

Figure 2: Extracted from the network shown in Figure 1, the left plot is a tree rooted at node , the middle plot is a tree rooted at node , and the right plot is a subnetwork consisting of both the tree rooted at node and the tree rooted at node .

Our intuition is that examining the nodes of the hidden layers, via their projections to the bottom data layer, from the top to bottom layers will gradually reveal less general and more specific aspects of the data. To verify this intuition and further understand the relationships between the general and specific aspects of the data, we consider extracting a tree for each node of layer , where , to help visualize the inferred multilayer deep structure. To be more specific, to construct a tree rooted at a node of layer , we grow the tree downward by linking the root node (if at layer ) or each leaf node of the tree (if at a layer below layer ) to all the nodes at the layer below that are connected to the root/leaf node with non-negligible weights. Note that a tree in our definition permits a node to have more than one parent, which means that different branches of the tree can overlap with each other. In addition, we also consider extracting subnetworks, each of which consists of multiple related trees from the full deep network. For example, shown in the left of Figure 2 is the tree extracted from the network in Figure 1 using node as the root, shown in the middle is the tree using node as the root, and shown in the right is a subnetwork consisting of two related trees that are rooted at nodes and , respectively.

2.3.1 Visualizing Nodes of Different Layers

Before presenting the technical details, we first provide some example results obtained with the PGBN on extracting multilayer representations from the 11,269 training documents of the 20newsgroups data set (http://qwone.com/jason/20Newsgroups/). Given a fixed budget of on the width of the first layer, with for all , a five-layer deep network inferred by the PGBN has a network structure as , meaning that there are , , , , and nodes at layers one to five, respectively.

Figure 3: Example topics of layer one of the PGBN trained on the 20newsgroups corpus.
Figure 4: The top 30 topics of layer three of the PGBN trained on the 20newsgroups corpus.
Figure 5: The top 30 topics of layer five of the PGBN trained on the 20newsgroups corpus.

For visualization, we first relabel the nodes at each layer based on their weights , calculated as in (14), with a more popular (larger weight) node assigned with a smaller label. We visualize node of layer by displaying its top 12 words ranked according to their probabilities in , the th column of the projected representation calculated as in (13). We set the font size of node of layer proportional to in each subplot, and color the outside border of a text box as red, green, orange, blue, or black for a node of layer five, four, three, two, or one, respectively. For better interpretation, we also exclude from the vocabulary the top 30 words of node 1 of layer one: “don just like people think know time good make way does writes edu ve want say really article use right did things point going better thing need sure used little,” and the top 20 words of node 2 of layer one: “edu writes article com apr cs ca just know don like think news cc david university john org wrote world.” These 50 words are not in the standard list of stopwords but can be considered as stopwords specific to the 20newsgroups corpus discovered by the PGBN.

For the PGBN learned on the 20newsgroups corpus, we plot 54 example topics of layer one in Figure 3, the top 30 topics of layer three in Figure 4, and the top 30 topics of layer five in Figure 5. Figure 3 clearly shows that the topics of layer one, except for topics 1-3, which mainly consist of common functional words of the corpus, are all very specific. For example, topics 71 and 81 shown in the first row are about “candida yeast symptoms” and “sex,” respectively; topics 53, 73, 83, and 84 shown in the second row are about “printer,” “msg,” “police radar detector,” and “Canadian health care system,” respectively; and topics 46 and 76 shown in the third row are about “ice hockey” and “second amendment,” respectively. By contrast, the topics of layers three and five, shown in Figures 4 and 5, respectively, are much less specific and can in general be matched to one or two news groups out of the 20 news groups, including comp.{graphics, os.ms-windows.misc, sys.ibm.pc.hardware, sys.mac.hardware, windows.x}, rec.{autos, motorcycles}, rec.sport.{baseball, hockey}, sci.{crypt, electronics, med, space}, misc.forsale, talk.politics.{misc, guns, mideast}, and {talk.religion.misc, alt.atheism, soc.religion.christian}.

2.3.2 Visualizing Trees Rooted at The Top-Layer Hidden Units

Figure 6: A tree that includes all the lower-layer nodes (directly or indirectly) linked with non-negligible weights to the top ranked node of the top layer, taken from the full network inferred by the PGBN on the 11,269 training documents of the 20newsgroups corpus, with for all . A line from node at layer to node at layer indicates that , with the width of the line proportional to . For each node, the rank (in terms of popularity) at the corresponding layer and the top 12 words of the corresponding topic are displayed inside the text box, where the text font size monotonically decreases as the popularity of the node decreases, and the outside border of the text box is colored as red, green, orange, blue, or black if the node is at layer five, four, three, two, or one, respectively.
Figure 7: Analogous plot to Figure 6 for a tree on “religion,” rooted at node 2 of the top-layer.

While it is interesting to examine the topics of different layers to understand the general and specific aspects of the corpus used to train the PGBN, it would be more informative to further illustrate how the topics of different layers are related to each other. Thus we consider constructing trees to visualize the PGBN. We first pick a node as the root of a tree and grow the tree downward by drawing a line from node at layer , the root or a leaf node of the tree, to node at layer for all in the set , where we set the width of the line connecting node of layer to node of layer to be proportional to and use to adjust the complexity of a tree. In general, increasing would discard more weak connections and hence make the tree simpler and easier to visualize.
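
A sketch of this tree-extraction procedure is given below; it keeps an edge from node k of layer t to node k' of layer t - 1 whenever the corresponding entry of Phi^(t) exceeds the threshold tau, so raising tau discards weak links exactly as described above. Function and variable names are ours.

```python
import numpy as np

def grow_tree(Phis, root_layer, root_node, tau=0.0):
    """Extract the tree rooted at `root_node` of layer `root_layer`.

    Phis = [Phi^(1), ..., Phi^(T)]; column k of Phi^(t) holds the connection
    weights from node k of layer t down to the nodes of layer t - 1 (or to the
    data features when t = 1).  Returns a list of weighted edges
    ((t, k), (t - 1, k_prime), weight).  A node may be reached through several
    parents, so different branches of the tree can overlap."""
    edges, frontier = [], {root_node}
    for t in range(root_layer, 1, -1):               # grow downward one layer at a time
        Phi_t = Phis[t - 1]                          # connects layer t to layer t - 1
        next_frontier = set()
        for k in frontier:
            children = np.flatnonzero(Phi_t[:, k] > tau)
            edges += [((t, k), (t - 1, kk), float(Phi_t[kk, k])) for kk in children]
            next_frontier.update(int(kk) for kk in children)
        frontier = next_frontier
    return edges
```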

We set for all to visualize both a five-layer tree rooted at the top ranked node of the top hidden layer, as shown in Figure 6, and a five-layer tree rooted at the second ranked node of the top hidden layer, as shown in Figure 7. For the tree in Figure 6, while it is somewhat difficult to determine the actual meanings of both node 1 of layer five and node 1 of layer four based on their top words, examining the more specific topics of layers three and two within the tree clearly indicates that this tree is primarily about “windows,” “window system,” “graphics,” “information,” and “software,” which are relatively specific concepts that are all closely related to each other. The similarities and differences between the five nodes of layer two can be further understood by examining the nodes of layer one that are connected to them. For example, while nodes 26 and 16 of layer two share their connections to multiple nodes of layer one, node 27 of layer one on “image” is strongly connected to node 26 of layer two but not to node 16 of layer two, and node 17 of layer one on “video” is strongly connected to node 16 of layer two but not to node 26 of layer two.

Following the branches of each tree shown in both figures, it is clear that the topics become more and more specific when moving along the tree from the top to bottom. Taking the tree on “religion” shown in Figure 7 for example, the root node splits into two nodes when moving from layers five to four: while the left node is still mainly about “religion,” the right node is on “objective morality.” When moving from layers four to three, node 5 of layer four splits into a node about “Christian” and another node about “Islamic.” When moving from layers three to two, node 3 of layer three splits into a node about “God, Jesus, & Christian,” and another node about “science, atheism, & question of the existence of God.” When moving from layers two to one, all four nodes of layer two split into multiple topics, and they are all strongly connected to both topics 1 and 2 of layer one, whose top words are those that appear frequently in the 20newsgroups corpus.

2.3.3 Visualizing Subnetworks Consisting of Related Trees

Examining the top-layer topics shown in Figure 5, one may find that some of the nodes seem to be closely related to each other. For example, topics 3 and 11 share eleven words out of the top twelve ones; topics 15 and 23 both have “Israel” and “Jews” as their top two words; topics 16 and 18 are both related to “gun;” and topics 7, 13, and 26 all share “team(s),” “game(s),” “player(s),” “season,” and “league.”

To understand the relationships and distinctions between these related nodes, we construct subnetworks that include the trees rooted at them, as shown in Figures 18-20 in Appendix C. It is clear from Figure 18 that the top-layer topic 3 differs from topic 11 in that it is not only strongly connected to topic 2 of layer four on “car & bike,” but also has a non-negligible connection to topic 27 of layer four on “sales.” It is clear from Figure 18 that topic 15 differs from topic 23 in that it is not only about “Israel & Arabs,” but also about “Israel, Armenia, & Turkey.” It is clear from Figure 20 that topic 16 differs from topic 18 in that it is mainly about the Waco siege that happened in 1993, involving David Koresh, the Federal Bureau of Investigation (FBI), and the Bureau of Alcohol, Tobacco, Firearms and Explosives (BATF). It is clear from Figure 20 that topics 7 and 13 are mainly about “ice hockey” and “baseball,” respectively, and topic 26 is a mixture of both.

2.3.4 Capturing Correlations Between Nodes

For the augmentable GBN, as in (18), given the weight vector , we have

(15)

A distinction between a shallow augmentable GBN with hidden layer and a deep augmentable GBN with hidden layers is that the prior for changes from for to for . For the GBN with , given the shared weight vector , we have

(16)

for the GBN with , given the shared weight vector , we have

(17)

and for the GBN with , given the weight vector , we have

(18)

Thus in the prior, the co-occurrence patterns of the columns of are modeled by only a single vector when , but are captured in the columns of when . Similarly, in the prior, if , the co-occurrence patterns of the columns of the projected topics will be captured in the columns of the matrix .

To be more specific, we show in Figure 21 in Appendix C three example trees rooted at three different nodes of layer three, where we lower the threshold to to reveal more weak links between the nodes of adjacent layers. The top subplot reveals that, in addition to strongly co-occurring with the top two topics of layer one, topic 21 of layer one on “medicine” tends to co-occur not only with topics 7, 21, and 26, which are all common topics that frequently appear, but also with some much less common topics that are related to very specific diseases or symptoms, such as topic 67 on “msg” and “Chinese restaurant syndrome,” topic 73 on “candida yeast symptoms,” and topic 180 on “acidophilous” and “astemizole (hismanal).”

The middle subplot reveals that topic 31 of layer two on “encryption & cryptography” tends to co-occur with topic 13 of layer two on “government & encryption,” and it also indicates that topic 31 of layer one is more purely about “encryption” and more isolated from “government” in comparison to the other topics of layer one.

The bottom subplot reveals that in layer one, topic 14 on “law & government,” topic 32 on “Israel & Lebanon,” topic 34 on “Turkey, Armenia, Soviet Union, & Russian,” topic 132 on “Greece, Turkey, & Cyprus,” topic 98 on “Bosnia, Serbs, & Muslims,” topic 143 on “Armenia, Azeris, Cyprus, Turkey, & Karabakh,” and several other very specific topics related to Turkey and/or Armenia all tend to co-occur with each other.

We note that capturing the co-occurrence patterns between the topics not only helps exploratory data analysis, but also helps extract better features for classification in an unsupervised manner and improves prediction for held-out data, as will be demonstrated in detail in Section 4.

2.4 Related Models

The structure of the augmentable GBN resembles the sigmoid belief network and the recently proposed deep exponential family model (Ranganath et al., 2014b). This kind of gamma distribution based network and its inference procedure were vaguely hinted at in Corollary 2 of Zhou and Carin (2015), and had been exploited by Acharya et al. (2015) to develop a gamma Markov chain to model the temporal evolution of the factor scores of a dynamic count matrix, but have not yet been investigated for extracting multilayer data representations. The proposed augmentable GBN may also be considered as an exponential family harmonium (Welling et al., 2004; Xing et al., 2005).

2.4.1 Sigmoid and Deep Belief Networks

Under the hierarchical model in (1), given the connection weight matrices, the joint distribution of the observed/latent counts and gamma hidden units of the GBN can be expressed, similar to those of the sigmoid and deep belief networks (Bengio et al., 2015), as

With representing the th row , for the gamma hidden units we have

(19)

which are highly nonlinear functions that are strongly desired in deep learning. By contrast, with the sigmoid function and bias terms , a sigmoid/deep belief network would connect the binary hidden units of layer (for deep belief networks, ) to the product of the connection weights and binary hidden units of the next layer with

(20)

Comparing (19) with (20) clearly shows the distinctions between the gamma distributed nonnegative hidden units and the sigmoid link function based binary hidden units. The limitation of binary units in capturing the approximately linear data structure over small ranges is a key motivation for Frey and Hinton (1999) to investigate nonlinear Gaussian belief networks with real-valued units. As a new alternative to binary units, it would be interesting to further investigate whether the gamma distributed nonnegative real units can in theory carry richer information and model more complex nonlinearities given the same network structure. Note that rectified linear units have emerged as powerful alternatives to sigmoid units for introducing nonlinearity (Nair and Hinton, 2010). It would be interesting to investigate whether the gamma units can be used to introduce nonlinearity into the positive region of the rectified linear units.
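
The contrast between (19) and (20) can be summarized by the two conditional samplers below: a sigmoid/deep belief network layer draws binary units through a sigmoid of a linear activation, while a GBN layer draws nonnegative real units from a gamma distribution whose shape is the linear activation. This is a schematic sketch; the omission of per-sample scale parameters and all names are our own simplifications.

```python
import numpy as np

def sbn_layer(W, b, h_above, rng):
    """Sigmoid belief network layer: p(h_k = 1 | h_above) = sigmoid(w_k . h_above + b_k)."""
    p = 1.0 / (1.0 + np.exp(-(W @ h_above + b)))
    return (rng.random(p.shape) < p).astype(int)

def gbn_layer(Phi, theta_above, c, rng):
    """Gamma belief network layer: theta ~ Gamma(Phi @ theta_above, 1/c), nonnegative real."""
    return rng.gamma(shape=Phi @ theta_above, scale=1.0 / c)
```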

2.4.2 Deep Poisson Factor Analysis

With , the PGBN specified by (1)-(3) and (8) reduces to Poisson factor analysis (PFA) using the (truncated) gamma-negative binomial process (Zhou and Carin, 2015), with a truncation level of . As discussed in (Zhou et al., 2012; Zhou and Carin, 2015), with priors imposed on neither nor , PFA is related to nonnegative matrix factorization (Lee and Seung, 2001), and with the Dirichlet priors imposed on both and , PFA is related to latent Dirichlet allocation (Blei et al., 2003).

Related to the PGBN and the dynamic model in (Acharya et al., 2015), the deep exponential family model of Ranganath et al. (2014b) also considers a gamma chain under Poisson observations, but it is the gamma scale parameters that are chained and factorized, which allows learning the network parameters using black box variational inference (Ranganath et al., 2014a). In the proposed PGBN, we instead chain the gamma random variables via their shape parameters. Both strategies are worth thorough investigation. We prefer chaining the shape parameters in this paper, as it leads to efficient upward-downward Gibbs sampling via data augmentation and makes it clear how the latent counts are propagated across layers, as discussed in detail in the following sections. The sigmoid belief network has also been recently incorporated into PFA for deep factorization of count data (Gan et al., 2015a); however, that deep structure captures only the correlations between binary factor usage patterns, not the full connection weights. In addition, neither Ranganath et al. (2014b) nor Gan et al. (2015a) provide a principled way to learn the network structure, whereas the proposed GBN uses the gamma-negative binomial process together with a greedy layer-wise training strategy to automatically infer the widths of the hidden layers, as described in Section 3.3.

2.4.3 Correlated and Tree-Structured Topic Models

The PGBN with can also be related to correlated topic models (Blei and Lafferty, 2006; Paisley et al., 2012; Chen et al., 2013; Ranganath and Blei, 2015; Linderman et al., 2015), which typically use logistic normal distributions to replace the topic-proportion Dirichlet distributions used in latent Dirichlet allocation (Blei et al., 2003), capturing the co-occurrence patterns between the topics in the latent Gaussian space using a covariance matrix. By contrast, the PGBN factorizes the topic usage weights (not proportions) under the gamma likelihood, capturing the co-occurrence patterns between the topics of the first layer (i.e., the columns of ) in the columns of , the latent weight matrix connecting the hidden units of layers two and one. For the PGBN, the computation does not involve matrix inversion, which is often necessary for correlated topic models without specially structured covariance matrices, and scales linearly with the number of topics; hence it is well suited to capturing the correlations between hundreds or thousands of topics.

As in Figures 6, 7, and 18-21, trees and subnetworks can be extracted from the inferred deep network to visualize the data. Tree-structured topic models have also been proposed before, such as those in Blei et al. (2010), Adams et al. (2010), and Paisley et al. (2015), but they usually artificially impose the tree structures to be learned, whereas the PGBN learns a directed network, from which trees and subnetworks can be extracted for visualization, without the need to specify the number of nodes per layer, restrict the number of branches per node, or forbid a node from having multiple parents.

3 Model Properties and Inference

Inference for the GBN shown in (1) appears challenging, because not only is the conjugate prior unknown for the shape parameter of a gamma distribution, but the gradients are also difficult to evaluate for the parameters of the (log) gamma probability density function, which, as in (19), includes the parameters inside the (log) gamma function. To address these challenges, we consider data augmentation (van Dyk and Meng, 2001), which introduces auxiliary variables to make it simple to compute the conditional posteriors of model parameters via the joint distribution of the auxiliary and existing random variables. We will first show that each gamma hidden unit can be linked to a Poisson distributed latent count variable, leading to a negative binomial likelihood for the parameters of the gamma hidden unit once it is marginalized out from the Poisson distribution; we then introduce an auxiliary count variable, sampled from the CRT distribution parametrized by the negative binomial latent count and shape parameter, to make the joint likelihood of the auxiliary CRT count and the latent negative binomial count given the parameters of the gamma hidden unit amenable to posterior simulation. More specifically, under the proposed augmentation scheme, the gamma shape parameters will be linked to auxiliary counts under Poisson likelihoods, making posterior simulation straightforward, as described below in detail.
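
The augmentation chain just described can be checked by simulation: a gamma hidden unit with a Poisson count attached yields a negative binomial count once the gamma variable is marginalized out, and the CRT auxiliary count drawn from that NB count is marginally Poisson with a rate involving the gamma shape, by the Poisson-logarithmic bivariate distribution of Section 2.1.1. The sketch below assumes NB(a, p) with p = 1/(1 + c) for a Gamma(a, 1/c) hidden unit; parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(7)
a, c, n = 2.0, 1.5, 200000                     # gamma shape a and rate c of one hidden unit

theta = rng.gamma(shape=a, scale=1.0 / c, size=n)
x = rng.poisson(theta)                         # latent count attached to the hidden unit

p = 1.0 / (1.0 + c)                            # marginally x ~ NB(a, p)
x_nb = rng.negative_binomial(a, 1.0 - p, size=n)
print(x.mean(), x_nb.mean(), a * p / (1 - p))  # the three means agree

def sample_crt(m, r, rng):                     # CRT auxiliary count l ~ CRT(x, a)
    if m == 0:
        return 0
    idx = np.arange(1, m + 1)
    return int(rng.binomial(1, r / (r + idx - 1)).sum())

l = np.array([sample_crt(int(xi), a, rng) for xi in x[:20000]])
print(l.mean(), -a * np.log(1 - p))            # marginally l ~ Pois(-a * log(1 - p))
```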

3.1 The Upward Propagation of Latent Counts

We break the inference of the GBN of hidden layers into related subproblems, each of which is solved with the same subroutine. Thus for implementation, it is straightforward for the GBN to adjust its depth . Let us denote as the observed or latent count vector of layer , and as its th element, where .

Lemma 1 (Augment-and-Conquer The Gamma Belief Network)

With and

(21)

for , one may connect the observed or latent counts to the product at layer under the Poisson likelihood as

(22)

By definition, (22) is true for layer . Suppose that (22) is also true for layer ; then we can augment each count , where , into the summation of latent counts, which are smaller than or equal to as

(23)

Let the symbol represent summing over the corresponding index and let represent the number of times that factor of layer appears in observation  and . Since , we can marginalize out as in (Zhou et al., 2012), leading to

Further marginalizing out the gamma distributed from the Poisson likelihood leads to

(24)

Element of can be augmented under its compound Poisson representation as

Thus if (22) is true for layer , then it is also true for layer .

Corollary 2 (Propagate the latent counts upward)

Using Lemma 4.1 of (Zhou et al., 2012) on (23) and Theorem 1 of (Zhou and Carin, 2015) on (24), we can propagate the latent counts of layer upward to layer as

(25)
(26)

We provide a set of graphical representations in Figure 8 to describe the GBN model and illustrate the augment-and-conquer inference scheme. We provide the upward-downward Gibbs sampler in Appendix B.

Figure 8: Graphical representations of the model and of the data augmentation and marginalization based inference scheme. (a) Graphical representation of the GBN hierarchical model. (b) An augmented representation of the Poisson factor model of layer , corresponding to (23) with . (c) An alternative representation using the relationships between the Poisson and multinomial distributions, obtained by applying Lemma 4.1 of (Zhou et al., 2012) on (23) for . (d) A negative binomial distribution based representation that marginalizes out the gamma from the Poisson distributions, corresponding to (24) for . (e) An equivalent representation that introduces CRT distributed auxiliary variables, corresponding to (26) with . (f) An equivalent representation using Theorem 1 of (Zhou and Carin, 2015) on (24) and (26) for . (g) A representation obtained by repeating the same augmentation-marginalization steps described in (b)-(f) one layer at a time from layers to . (h) A representation of the top hidden layer.

Note that , and since in a Chinese restaurant process the number of tables occupied by the customers is in the same order as the logarithm of the number of customers, is in the same order as . Thus the total count of layer as would often be much smaller than that of layer as (though in general not as small as a count that is in the same order as the logarithm of ), and hence one may use the total count as a simple criterion to decide whether it is necessary to add more layers to the GBN. In addition, if the latent count becomes close to or equal to zero, then the posterior mean of could become so small that node of layer can be considered to be disconnected from node of layer .
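
The following sketch implements the two augmentation steps of Lemma 1 and Corollary 2 for a single layer: each observed or latent count is split across the factors with a multinomial draw, and the aggregated per-factor counts are then compressed by CRT draws into the counts passed to the layer above. It is deliberately simplified (looping over nonzero entries and omitting the Dirichlet and gamma downward updates as well as the scale/probability parameters); all names are ours.

```python
import numpy as np

def upward_step(X, Phi, Theta, shape_above, rng):
    """One upward pass at a single layer.

    X           : (V, N) integer counts entering this layer.
    Phi, Theta  : (V, K) loading matrix and (K, N) hidden units of this layer.
    shape_above : (K, N) gamma shapes Phi^(t+1) theta^(t+1) (or r for the top layer).
    Returns the factor-feature counts (for resampling Phi), the per-factor counts M
    (for resampling Theta), and the CRT counts propagated to the layer above."""
    V, N = X.shape
    K = Phi.shape[1]
    factor_feature = np.zeros((V, K))
    M = np.zeros((K, N), dtype=int)
    for j in range(N):
        for v in np.flatnonzero(X[:, j]):
            probs = Phi[v] * Theta[:, j]
            counts = rng.multinomial(X[v, j], probs / probs.sum())  # split x_vj over factors
            factor_feature[v] += counts
            M[:, j] += counts
    X_above = np.zeros((K, N), dtype=int)
    for j in range(N):
        for k in np.flatnonzero(M[:, j]):
            idx = np.arange(1, M[k, j] + 1)
            p_new = shape_above[k, j] / (shape_above[k, j] + idx - 1)
            X_above[k, j] = rng.binomial(1, p_new).sum()            # CRT(M_kj, shape)
    return factor_feature, M, X_above
```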

3.2 Modeling Data Variability With Distributed Representation

In comparison to a single-layer model with , which assumes that the hidden units of layer one are independent in the prior, the multilayer model with captures the correlations between them. Note that in the extreme case where for are all identity matrices, indicating that there are no correlations between the features of left to be captured, the deep structure could still provide benefits, as it helps model latent counts that may be highly overdispersed. For example, let us assume for all ; then from (1) and (24) we have

Using the laws of total expectation and total variance, we have

Further applying the same laws, we have

Thus the variance-to-mean ratio (VMR) of the count given can be expressed as

(27)

In comparison to PFA with given , with a VMR of , the GBN with hidden layers, which mixes the shape of with a chain of gamma random variables, increases by a factor of

which is equal to

if we further assume for all . Therefore, by increasing the depth of the network to distribute the variability into more layers, the multilayer structure could increase its capacity to model data variability.
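
The claim that depth increases the capacity to model overdispersion can be illustrated with a scalar simulation: chaining more gamma layers (with identity connections and a common rate, as assumed above) before the Poisson observation steadily raises the variance-to-mean ratio of the resulting count. The parameter values below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(9)
r, c, n = 2.0, 1.0, 500000

def count_through_chain(depth):
    """x ~ Pois(theta^(1)), theta^(t) ~ Gamma(theta^(t+1), 1/c), theta^(depth) ~ Gamma(r, 1/c)."""
    theta = rng.gamma(shape=r, scale=1.0 / c, size=n)
    for _ in range(depth - 1):
        theta = rng.gamma(shape=theta, scale=1.0 / c)
    return rng.poisson(theta)

for T in (1, 2, 3, 4):
    x = count_through_chain(T)
    print(T, round(x.var() / x.mean(), 2))   # the variance-to-mean ratio grows with depth
```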

3.3 Learning The Network Structure With Layer-Wise Training

As jointly training all layers together is often difficult, existing deep networks are typically trained using a greedy layer-wise unsupervised training algorithm, such as the one proposed in (Hinton et al., 2006) to train deep belief networks. The effectiveness of this training strategy is further analyzed in (Bengio et al., 2007). By contrast, the augmentable GBN has a simple Gibbs sampler that jointly trains all its hidden layers, as described in Appendix B, and hence does not necessarily require greedy layer-wise training; however, like these commonly used deep learning algorithms, it still requires the number of layers and the width of each layer to be specified.

In this paper, we adopt the idea of layer-wise training for the GBN, not because of the lack of an effective joint-training algorithm that trains all layers together in each iteration, but for the purpose of learning the width of each hidden layer in a greedy layer-wise manner, given a fixed budget on the width of the first layer. The basic idea is to first train a GBN with a single hidden layer, , , for which we know how to use the gamma-negative binomial process (Zhou and Carin, 2015; Zhou et al., 2015b) to infer the posterior distribution of the number of active factors; we fix the width of the first layer to the number of active factors inferred at iteration , prune all inactive factors of the first layer, and continue Gibbs sampling for another iterations. We now describe the proposed recursive procedure to build a GBN with layers. With a GBN of hidden layers that has already been inferred, for which the hidden units of the top layer are distributed as , where , we add another layer by letting , where and is redefined as . The key idea is that, with latent counts upward propagated from the bottom data layer, one may marginalize out , leading to , and hence one can again rely on the shrinkage mechanism of a truncated gamma-negative binomial process to prune inactive factors (connection weight vectors, columns of ) of layer , making , the inferred width of the newly added layer, smaller than if is set to be sufficiently large. The newly added layer and all the layers below it are jointly trained, but with the structure below the newly added layer kept unchanged. Note that when , the GBN infers the number of active factors if is set large enough; otherwise, it still assigns the factors different weights , but may not be able to prune any of them. The details of the proposed layer-wise training strategy are summarized in Algorithm 1 for multivariate count data, and in Algorithm 2 for multivariate binary and nonnegative real data.
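
The structural-learning step reduces, at the newly added top layer, to discarding the factors whose total latent count across all samples has shrunk to zero; a minimal sketch of that pruning step is given below. The full layer-wise schedule of Algorithms 1 and 2, including the Gibbs sweeps before and after pruning, is not reproduced here, and all names are ours.

```python
import numpy as np

def prune_inactive_factors(Phi_top, Theta_top, M_top):
    """Keep only the factors of the newly added top layer whose total latent count is nonzero.

    Phi_top   : (K_prev, K_max) candidate loading matrix of the newly added layer.
    Theta_top : (K_max, N) its hidden units.
    M_top     : (K_max, N) latent counts assigned to each factor and sample.
    Returns the pruned Phi, Theta, and the inferred width of the new layer."""
    active = np.flatnonzero(M_top.sum(axis=1) > 0)
    return Phi_top[:, active], Theta_top[active, :], active.size
```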

4 Experimental Results

In this section, we present experimental results for count, binary, and nonnegative real data.

4.1 Deep Topic Modeling

We first analyze multivariate count data with the Poisson gamma belief network (PGBN). We apply the PGBN to topic modeling of text corpora, where each document is represented as a term-frequency count vector. Note that the PGBN with a single hidden layer is identical to the (truncated) gamma-negative binomial process PFA of Zhou and Carin (2015), a nonparametric Bayesian algorithm that performs similarly to the hierarchical Dirichlet process latent Dirichlet allocation of Teh et al. (2006) for text analysis and is considered a strong baseline. Thus we focus on comparing against the PGBN with a single hidden layer, with its layer width set large enough to approximate the performance of the gamma-negative binomial process PFA. We evaluate the PGBNs’ performance by examining both how well they unsupervisedly extract low-dimensional features for document classification and how well they predict heldout word tokens. Matlab code will be available at http://mingyuanzhou.github.io/.

We use Algorithm 1 to learn, in a layer-wise manner, from the training data the connection weight matrices and the top-layer hidden units’ gamma shape parameters : to add layer to a previously trained network with layers, we use iterations to jointly train and together with , prune the inactive factors of layer , and continue the joint training with another iterations. We set the hyper-parameters as and . Given the trained network, we apply the upward-downward Gibbs sampler to collect 500 MCMC samples after 500 burnins to estimate the posterior mean of the feature usage proportion vector