Deep Poisson gamma dynamical systems

10/26/2018 ∙ by Dandan Guo, et al. ∙ The University of Texas at Austin Xidian University NetEase, Inc 0

We develop deep Poisson-gamma dynamical systems (DPGDS) to model sequentially observed multivariate count data, improving previously proposed models by not only mining deep hierarchical latent structure from the data, but also capturing both first-order and long-range temporal dependencies. Using sophisticated but simple-to-implement data augmentation techniques, we derived closed-form Gibbs sampling update equations by first backward and upward propagating auxiliary latent counts, and then forward and downward sampling latent variables. Moreover, we develop stochastic gradient MCMC inference that is scalable to very long multivariate count time series. Experiments on both synthetic and a variety of real-world data demonstrate that the proposed model not only has excellent predictive performance, but also provides highly interpretable multilayer latent structure to represent hierarchical and temporal information propagation.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 16

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

The need to model time-varying count vectors

appears in a wide variety of settings, such as text analysis, international relation study, social interaction understanding, and natural language processing

Sutskever2007Learning ; wang2008continuous ; hermans2013training ; gan2015deep ; GP-DPFA2015 ; charlin2015dynamic ; Schein2016Poisson ; Gong2017Deep ; Rabiee2017Recurrent

. To model these count data, it is important to not only consider the sparsity of high-dimensional data and robustness to over-dispersed temporal patterns, but also capture complex dependencies both within and across time steps. In order to move beyond linear dynamical systems (LDS)

ghahramani1999learning and its nonlinear generalization wang2006gaussian that often make the Gaussian assumption kalman1963mathematical , the gamma process dynamic Poisson factor analysis (GP-DPFA) GP-DPFA2015 factorizes the observed time-varying count vectors under the Poisson likelihood as

, and transmit temporal information smoothly by evolving the factor scores with a gamma Markov chain as

, which has highly desired strong non-linearity. To further capture cross-factor temporal dependence, a transition matrix is further used in Poisson–gamma dynamical system (PGDS) Schein2016Poisson as . However, these shallow models may still have shortcomings in capturing long-range temporal dependencies Gong2017Deep . For example, if given , then no longer depends on for all .

Deep probabilistic models are widely used to capture the relationships between latent variables across multiple stochastic layers Gong2017Deep ; gan2015deep ; neal1992connectionist ; ranganath2014deep ; zhou2015the ; henao2015deep . For example, deep dynamic Poisson factor analysis (DDPFA) Gong2017Deep

utilizes recurrent neural networks (RNN)

hermans2013training to capture long-range temporal dependencies of the factor scores. The latent variables and RNN parameters, however, are separately inferred. Deep temporal sigmoid belief network (DTSBN) gan2015deep is a deep dynamic generative model defined as a sequential stack of sigmoid belief networks (SBNs), whose hidden units are typically restricted to be binary. Although a deep structure is designed to describe complex long-range temporal dependencies, how the layers in DTSBN are related to each other lacks an intuitive interpretation, which is of paramount interest for a multilayer probabilistic model zhou2015the .

In this paper, we present deep Poisson gamma dynamical systems (DPGDS), a deep probabilistic dynamical model that takes the advantage of the hierarchical structure to efficiently incorporate both between-layer and temporal dependencies, while providing rich interpretation. Moving beyond DTSBN using binary hidden units, we build a deep dynamic directed network with gamma distributed nonnegative real hidden units, inferring a multilayer contextual representation of multivariate time-varying count vectors. Consequently, DPGDS can handle highly overdispersed counts, capturing the correlations between the visible/hidden features across layers and over times using the gamma belief network

zhou2015the . Combing the deep and temporal structures shown in Fig. 1, DPGDS breaks the assumption that given , no longer depends on for , suggesting that it may better capture long-range temporal dependencies. As a result, the model can allow more specific information, which are also more likely to exhibit fast temporal changing, to transmit through lower layers, while allowing more general information, which are also more likely to slowly evolve over time, to transmit through higher layers. For example, as shown in Fig. 1 that is learned from GDELT2003 with DPGDS, when analyzing these international events, the factors at lower layers are more specific to discover the relationships between the different countries, whereas those at higher layers are more general to reflect the conflicts between the different areas consisting of several related countries, or the ones occurring simultaneously, and the latent representation at a lower layer varies more intensely than that at a higher layer.

Distinct from DDPFA Gong2017Deep that adopts a two-stage inference, the latent variables of DPGDS can be jointly trained with both a Backward-Upward–Forward-Downward (BUFD) Gibbs sampler and a sophisticated stochastic gradient MCMC (SGMCMC) algorithm that is scalable to very long multivariate time series ma2015a ; welling2011bayesian ; patterson2013stochastic ; ding2014bayesian ; Li2016Preconditioned . Furthermore, the factors learned at each layer can refine the understanding and analysis of sequentially observed multivariate count data, which, to the best of our knowledge, may be very challenging for existing methods. Finally, based on a diverse range of real-world data sets, we show that DPGDS exhibits excellent predictive performance, inferring interpretable latent structure with well captured long-range temporal dependencies.

2 Deep Poisson gamma dynamic systems

Shown in Fig. 1 is the graphical representation of a three-hidden-layer DPGDS. Let us denote

as a gamma random variable with mean

and variance

. Given a set of -dimensional sequentially observed multivariate count vectors , represented as a matrix , the generative process of a -hidden-layer DPGDS, from top to bottom, is expressed as

(1)

where is the factor loading matrix at layer , the hidden units of layer  at time , and a transition matrix of layer  that captures cross-factor temporal dependencies. We denote as a scaling factor, reflecting the scale of the counts at time ; one may also set for . We denote

as a scaling hyperparameter that controls the temporal variation of the hidden units. The multilayer time-varying hidden units

are well suited for downstream analysis, as will be shown below.

Figure 1: Graphical model and illustration for a three-hidden-layer deep Poisson Gamma Dynamical System (DPGDS). (a) The generative model; (b) Visualization of data and latent factors learned from GDELT2003, with the black, red, blue and green lines denoting the observed data, temporal trajectories of example latent factors at layer 1, 2, 3, respectively.

DPGDS factorizes the count observation into the product of , , and under the Poisson likelihood. It further factorizes the shape parameters of the gamma distributed of layer  at time into the sum of , capturing the dependence between different layers, and , capturing the temporal dependence at the same layer. At the top layer, is only dependent on , and at , for and . To complete the hierarchical model, we introduce factor weights in layer  to model the strength of each factor, and for , we let

(2)

Note that is the column of and

can be interpreted as the probability of transiting from topic

of the previous time to topic of the current time at layer .

Finally, we place Dirichlet priors on the factor loadings and draw other parameters from a noninformative gamma prior: , and . Note that imposing Dirichlet distributions on the columns of and not only makes the latent representation more identifiable and interpretable, but also facilitates inference, as will be shown in the next section. Clearly when , DPGDS reduces to PGDS Schein2016Poisson . In real-world applications, a binary observation can be linked to a latent count using the Bernoulli-Poisson link as Zhou2015Infinite . Nonnegative-real-valued matrix can also be linked to a latent count matrix via a Poisson randomized gamma distribution as JMLR:v17:15-633 .

Hierarchical structure: To interpret the hierarchical structure of (2), we notice that if the temporal structure is ignored. Thus it is straightforward to interpret by projecting them to the bottom data layer as , which are often quite specific at the bottom layer and become increasingly more general when moving upwards, as will be shown below in Fig. 6.

Long-range temporal dependencies: Using the law of total expectations on (2), for a three-hidden-layer DPGDS shown in Fig. 1, we have

(3)

which suggests that play the role of transiting the latent representation across time and, different from most existing dynamic models, DPGDS can capture and transmit long-range temporal information (often general and change slowly over time) through its higher hidden layers.

3 Scalable MCMC inference

In this paper, in each iteration, across layers and times, we first exploit a variety of data augmentation techniques for count data to “backward” and “upward” propagate auxiliary latent counts, with which we then “downward” and “forward” sample latent variables, leading to a Backward-Upward–Forward-Downward Gibbs sampling (BUFD) Gibbs sampling algorithm.

3.1 Backward and upward propagation of latent counts

Figure 2: Graphical representation of the model and data augmentation and marginalization based inference scheme. (a) An alternative representation of layer

using the relationships between the Poisson and multinomial distributions; (b) A negative binomial distribution based representation that marginalizes out the gamma from the Poisson distributions, corresponding to (

4) for ; (c) An equivalent representation that introduces CRT distributed auxiliary variables, corresponding to (5); (d) An equivalent representation using P3, corresponding to (6); (e) An equivalent representation obtained by using P1, corresponding to (7); (f) A representation obtained by repeating the same augmentation-marginalization steps described in (a).

Different from PGDS that has only backward propagation for latent counts, DPGDS have both backward and upward ones due to its deep hierarchical structure. To derive closed-form Gibbs sampling update equations, we exploit three useful properties for count data, denoted as P1, P2, and P3zhou2015negative ; Schein2016Poisson , respectively, as presented in the Appendix. Let us denote as the negative binomial distribution with probability mass function , where . First, we can augment each count in (2) into the summation of latent counts that are smaller or equal as , with  . Since by construction, we also have , as shown in Fig. 2. We start with at the last time point , as none of the other time-step factors depend on it in their priors. Via P2, as shown in Fig. 2, we can marginalize out to obtain

(4)

where and .

In order to marginalize out , as shown in Fig. 2, we introduce an auxiliary variable following the Chines restaurant table (CRT) distribution zhou2015negative as

(5)

As shown in Fig. 2

, we re-express the joint distribution over

and according to P3 as

(6)

where the sum-logarithmic (SumLog) distribution is defined as in Zhou and Carin zhou2015negative . Via P1, as in Fig. 2, the Poisson random variable in (6) can be augmented as , where

(7)

It is obvious that due to the deep dynamic structure, the count at layer two is divided into two parts: one from time at layer one, while the other from time at layer two. Furthermore, is the scaling factor at layer two, which is propagated from the one at layer one . Repeating the process all the way back to , and from up to , we are able to marginalize out all gamma latent variables and provide closed-form conditional posteriors for all of them.

3.2 Backward-upward–forward-downward Gibbs sampling

Sampling auxiliary counts: This step is about the “backward” and “upward” pass. Let us denote ,  , and . Working backward for and upward for , we draw

(8)
(9)

Note that via the deep structure, the latent counts will be influenced by the effects from both of time at layer  and time at layer . With and , we can sample the latent counts at layer  and by

(10)

and then draw

(11)

Sampling hidden units and calculating : Given the augmented latent count variables, working forward for and downward for , we can sample

(12)

where and . Note if for , then we may let , where the function is the lower real part of the Lambert W function corless1996on ; Schein2016Poisson . From (3.2), we can find that the conditional posterior of is parameterized by not only both and , which represent the information from layer (downward) and time (forward), respectively, but also both and , which record the message from layer (upward) in (8) and time (backward) in (11), respectively. We describe the BUFD Gibbs sampling algorithm for DPGDS in Algorithm 1 and provide more details in the Appendix.

3.3 Stochastic gradient MCMC inference

Although the proposed BUFD Gibbs sampling algorithm for DPGDS has closed-form update equations, it requires processing all time-varying vectors at each iteration and hence has limited scalability Zhang2018WHAI . To allow for scalable inference, we apply the topic-layer-adaptive stochastic gradient Riemannian (TLASGR) MCMC algorithm described in Cong et al. cong2017deep and Zhang et al. Zhang2018WHAI , which can be used to sample simplex-constrained global parameters cong2017fast in a mini-batch based manner. It improves its sampling efficiency via the use of the Fisher information matrix (FIM) girolami2011riemann , with adaptive step-sizes for the latent factors and transition matrices of different layers. More specifically, for , column of the transition matrix of layer , its sampling can be efficiently realized as

(13)

where

is calculated using the estimated FIM, both

and come from the augmented latent counts , denotes a simplex constraint, and denotes the prior of . The update of is the same with Cong et al. cong2017deep , and all the other global parameters are sampled using SGNHT ding2014bayesian . We provide the details of the SGMCMC for DPGDS in Algorithm 2 in the Appendix.

4 Experiments

Figure 3: Results on the bouncing ball data set. (a) Shown in the first to third columns are the top fifteen latent factors learned by a three-hidden-layer DPGDS at layers 1, 2, and 3, respectively; (b) The average prediction errors as a function of the sequence length for various algorithms.

In this section, we present experimental results on a synthetic dataset and five real-world datasets. For a fair comparison, we consider PGDS Schein2016Poisson , GP-DPFA GP-DPFA2015 , DTSBN gan2015deep , and GPDM wang2006gaussian that can be considered as a dynamic generalization of the Gaussian process latent variable model of Lawrence lawrence2005probabilistic , using the code provided by the authors. Note that as shown Schein et al. Schein2016Poisson and Gan et al. gan2015deep , PGDS and DTSBN are state-of-the-art count time series modeling algorithms that outperform a wide variety of previously proposed ones, such as LDS kalman1963mathematical and DRFM han2014dynamic . The hyperparameter settings of PGDS, GP-DPFA, GPDM, TSBN, and DTSBN are the same as their original settings Schein2016Poisson ; GP-DPFA2015 ; wang2006gaussian ; gan2015deep . For DPGDS, we set and . We use for both DPGDS and DTSBN and for PGDS, GP-DPFA, GPDM, and TSBN. For PGDS, GP-DPFA, GPDM, and DPGDS, we run 2000 Gibbs sampling as burn-in and collect 3000 samples for evaluation. We also use SGMCMC to infer DPGDS, with 5000 collection samples after 5000 burn-in steps, and use 10000 SGMCMC iterations for both TSBN and DTSBN to evaluate their performance.

4.1 Synthetic dataset

Following the literature Sutskever2007Learning ; gan2015deep , we consider sequences of different lengths, including and , and generate 50 synthetic bouncing ball videos for training, and 30 ones for testing. Each video frame is a binary-valued image with size 30 30, describing the location of three balls within the image. Both TSBN and DTSBN model it with the Bernoulli likelihood, while both PGDS and DPGDS use the Bernoulli-Poisson link Zhou2015Infinite .

As shown in Fig. 3, the average prediction errors of all algorithms decrease as the training sequence length increases. A higher-order TSBN, TSBN-4, performs much better than the first-order TSBN does, suggesting that using high-order messages can help TSBN better pass useful information. As discussed above, since a deep structure provides a natural way to propagate high-order information for prediction, it is not surprising to find that both DTSBN and DPGDS, which are both multi-layer models, have exhibited superior performance. Moreover, it is clear that the proposed DPGDS consistently outperforms DTSBN under all settings.

Another advantage of DPGDS is that its inferred deep latent structure often has meaningful interpretation. As shown in Fig. 3, for the bouncing ball data, the inferred factors at layer one represent points or pixels, those at layer two cover larger spatial contiguous regions, some of which exhibit the shape of a single bouncing ball, and those at layer three are able to capture multiple bouncing balls. In addition, we show in Appendix B the one-step prediction frames of different models.

4.2 Real-world datasets

Model Top- GDELT () ICEWS () SOTU () DBLP () NIPS ()
GPDPFA MP 0.611 0.001 0.607 0.002 0.379 0.002 0.435 0.009 0.843 0.005
MR 0.145 0.002 0.235 0.005 0.369 0.002 0.254 0.005 0.050 0.001
PP 0.447 0.014 0.465 0.008 0.617 0.013 0.581 0.011 0.807 0.006
PGDS MP 0.679 0.001 0.658 0.001 0.375 0.002 0.419 0.004 0.864 0.004
MR 0.150 0.001 0.245 0.005 0.373 0.002 0.252 0.004 0.050 0.001
PP 0.420 0.017 0.455 0.008 0.612 0.018 0.566 0.008 0.802 0.020
GPDM MP 0.5200.001 0.530 0.002 0.274 0.001 0.388 0.004 0.355 0.008
MR 0.141 0.001 0.234 0.001 0.261 0.002 0.146 0.005 0.050 0.001
PP 0.3620.021 0.1850.017 0.587 0.016 0.509 0.008 0.384 0.028
TSBN MP 0.594 0.007 0.471 0.001 0.360 0.001 0.403 0.012 0.788 0.005
MR 0.124 0.001 0.158 0.001 0.275 0.001 0.194 0.001 0.050 0.001
PP 0.418 0.019 0.445 0.031 0.611 0.001 0.527 0.003 0.692 0.017
DTSBN-2 MP 0.439 0.001 0.475 0.002 0.370 0.004 0.407 0.003 0.756 0.001
MR 0.134 0.001 0.208 0.001 0.361 0.001 0.248 0.007 0.050 0.001
PP 0.391 0.001 0.446 0.001 0.587 0.027 0.522 0.005 0.737 0.004
DTSBN-3 MP 0.411 0.001 0.431 0.001 0.450 0.008 0.390 0.002 0.774 0.002
MR 0.141 0.001 0.189 0.001 0.274 0.001 0.252 0.004 0.050 0.001
PP 0.367 0.011 0.451 0.026 0.548 0.013 0.510 0.006 0.715 0.009
DPGDS-2 MP 0.688 0.002 0.659 0.001 0.379 0.002 0.430 0.009 0.867 0.008
MR 0.149 0.001 0.242 0.007 0.373 0.001 0.254 0.005 0.050 0.001
PP 0.443 0.025 0.473 0.012 0.622 0.014 0.582 0.007 0.814 0.035
DPGDS-3 MP 0.689 0.002 0.660 0.001 0.380 0.001 0.431 0.012 0.887 0.002
MR 0.150 0.001 0.244 0.003 0.374 0.002 0.255 0.004 0.050 0.001
PP 0.456 0.015 0.478 0.024 0.628 0.021 0.600 0.001 0.839 0.007
Table 1: Top- results on real-world text data

Besides the binary-valued synthetic bouncing ball dataset, we quantitatively and qualitatively evaluate all algorithms on the following real-world datasets used in Schein et al. Schein2016Poisson . The State-of-the-Union (SOTU) dataset consists of the text of the annual SOTU speech transcripts from 1790 to 2014. The Global Database of Events, Language, and Tone (GDELT) and Integrated Crisis Early Warning System (ICEWS) are both datasets for international relations extracted from news corpora. Note that ICEWS consists of undirected pairs, while GDELT consists of directed pairs of countries. The NIPS corpus contains the text of every NIPS conference paper from 1987 to 2003. The DBLP corpus is a database of computer science research papers. Each of these datasets is summarized as a count matrix, as shown in Tab. 1. Unless specified otherwise, we choose the top 1000 most frequently used terms to form the vocabulary, which means we set for all real-data experiments.

4.2.1 Quantitative comparison

For a fair and comprehensive comparison, we calculate the precision and recall at top-

gopalan2014bayesian ; gan2015deep ; han2014dynamic ; GP-DPFA2015 , which are calculated by the fraction of the top- words that match the true ranking of the words and appear in the top- ranking, respectively, with . We also use the Mean Precision (MP) and Mean Recall (MR) over all the years appearing in the training set to evaluate different models. As another criterion, the Predictive Precision (PP) shows the predictive precision for the final year, for which all the observations are held out. Similar as previous methods gan2015deep ; GP-DPFA2015

, for each corpus, the entire data of the last year is held out, and for the documents in the previous years we randomly partition the words of each document into 80% / 20% in each trial, and we conduct five random trials to report the sample mean and standard deviation. Note that to apply GPDM, we have used Anscombe transform

Anscombe to preprocess the count data to mitigate the mismatch between the data and model assumption. The results on all five datasets are summarized in Tab. 1, which clearly show that the proposed DPGDS has achieved the best performance on most of the evaluation criteria, and again a deep model often improves its performance by increasing its number of layers. To add more empirical study

Figure 4: MP as a function of time for GDELT.

on scalability, we have also tested the efficiency of our model on a GDELT data (from 2001 to 2005, temporal granularity of 24 hrs, with a total of 1825 time points), which is not too large so that we can still run DPGDS-Gibbs and GPDM. As shown in Fig. 4, we present how various algorithms progress over time, evaluated with MP. It takes about 1000s for DTSBN and DPGDS-SGMCMC to converge, 3.5 hrs for DPGDS-Gibbs, 5 hrs for GPDM. Clearly, our DPGDS-SGMCMC is scalable and clearly outperforms both DTSBN and GPDM. We also present in Appendix C the results of DPGDS-SGMCMC on a very long time series, on which it becomes too expensive to run a batch learning algorithm.

4.2.2 Exploratory data analysis

Compared to previously proposed dynamic systems, the proposed DPGDS, whose inferred latent structure is simple to visualize, provides much richer interpretation. More specifically, we may not only exhibit the content of each factor (topic), but also explore both the hierarchical relationships between them at different layers, and the temporal relationships between them at the same layer. Based on the results inferred on ICEWS 2001-2003 via a three hidden layer DPGDS, with the size of 200-100-50, we show in Fig. 6 how some example topics are hierarchically and temporally related to each other, and how their corresponding latent representations evolve over time.

In Fig. 6, we select two large-weighted topics at the top hidden layer and move down the network to include any lower-layer topics that are connected to them with sufficiently large weights. For each topic, we list all its terms whose values are larger than 1% of the largest element of the topic. It is interesting to note that topic 2 at layer three is connected to three topics at layer two, which are characterized mainly by the interactions of Israel (ISR)-Palestinian Territory (PSE), Iraq (IRQ)-USA-Iran (IRN), and North Korea (PRK)-South Korea (KOR)-USA-China (CHN)-Japan (JPN), respectively. The activation strength of one of these three interactions, known to be dominant in general during 2001-2003, can be contributed not only by a large activation of topic 2 at layer three, but also by a large activation of some other topic of the same layer (layer two) at the previous time. For example, topic 41 of layer two on “ISR-PSE, IND-PAK, RUS-UKR, GEO-RUS, AFG-PAK, SYR-USA, MNE-SRB” could be associated with the activation of topic 46 of layer two on “IND-PAK, RUS-TUR, ISR-PSE, BLR-RUS” at the previous time; and topic 99 of layer two on “PRK-KOR, JPN-USA, CHN-USA, CHN-KOR, CHN-JPN, USA-RUS” could be associated with the activation of topic 63 of layer two on “IRN-USA, CHN-USA, AUS-CHN, CHN-KOR” at the previous time.

Another instructive observation is that topic 140 of layer one on “IRQ-USA, IRQ-GBR, IRN-IRQ, IRQ-KWT, AUS-IRQ” is related not only in hierarchy to topic 34 of the higher layer on “IRQ-USA, IRQ-GBR, GBR-USA, IRQ-KWT, IRN-IRQ, SYR-USA,” but also in time to topic 166 of the same layer on “ESP-USA, ESP-GBR, FRA-GBR, POR-USA,” which are interactions between the member states of the North Atlantic Treaty Organization (NATO). Based on the transitions from topic 13 on “PRK-KOR” to both topic 140 on “IRQ-USA” and 77 on “ISR-PSE,” we can find that the ongoing Iraq war and Israeli–Palestinian relations regain attention after the six-party talks Schein2016Poisson .

To get an insight of the benefits attributed to the deep structure, how the latent representations of several representative topics evolve over days are shown in Fig. 6. It is clear that relative to these temporal factor trajectories at the bottom layer, which are specific for the bilateral interactions between two countries, these from higher layers vary more smoothly, whose corresponding high-layer topics capture the multilateral interactions between multiple closely related countries. Similar phenomena have also been demonstrated in Fig. 1 on GDELT2003. Moreover, we find that a spike of the temporal trajectory of topic 166 (NATO) appears right before a one of topic 140 (Iraq war), matching the above description in Fig. 6. Also, topic 14 of layer three and its descendants, including topic 23 of layer two and topic 48 at layer one are mainly about a breakthrough between RUS and Azerbaijan (AZE), coinciding with Putin’s visit in January 2001. Additional example results for the topics and their hierarchical and temporal relationships, inferred by DPGDS on different datasets, are provided in the Appendix.

Figure 5: Topics and their temporal trajectories inferred by a three-hidden-layer DPGDS from the ICEWS 2001-2003 dataset (best viewed in color). (a) Some example topics that are hierarchically or temporally related; (b) The temporal trajectories of some inferred latent topics.
Figure 6: Learned transition structure on ICEWS 2001-2003 from the same DPGDS depicted in Fig. 6. Shown in (a)-(c) are transition matrices for layers 1, 2 and 3, respectively, with a darker color indicating a larger transition weight (between 0 and 1).
Figure 5: Topics and their temporal trajectories inferred by a three-hidden-layer DPGDS from the ICEWS 2001-2003 dataset (best viewed in color). (a) Some example topics that are hierarchically or temporally related; (b) The temporal trajectories of some inferred latent topics.

In Fig. 6, we also present a subset of the transition matrix in each layer, corresponding to the top ten topics, some of which have been displayed in Fig. 6. The transition matrix captures the cross-topic temporal dependence at layer . From Fig. 6, besides the temporal transitions between the topics at the same layer, we can also see that with the increase of the layer index , the transition matrix more closely approaches a diagonal matrix, meaning that the feature factors become more likely to transit to themselves, which matches the characteristic of DPGDS that the topics in higher layers have the ability to cover longer-range temporal dependencies and contain more general information, as shown in Fig. 6. With both the hierarchical connections between layers and dynamic transitions at the same layer, distinct from the shallow PGDS, DPGDS is equipped with a larger capacity to model diverse temporal patterns with the help of its deep structure.

5 Conclusions

We propose deep Poisson gamma dynamical systems (DPGDS) that take the advantage of a probabilistic deep hierarchical structure to efficiently capture both across-layer and temporal dependencies. The inferred latent structure provides rich interpretation for both hierarchical and temporal information propagation. For Bayesian inference, we develop both a Backward-Upward–Forward-Downward Gibbs sampler and a stochastic gradient MCMC (SGMCMC) that is scalable to long multivariate count/binary time series. Experimental results on a variety of datasets show that DPGDS not only exhibits excellent predictive performance, but also provides highly interpretable latent structure.

Acknowledgements

D. Guo, B. Chen, and H. Zhang acknowledge the support of the Program for Young Thousand Talent by Chinese Central Government, the 111 Project (No. B18039), NSFC (61771361), NSFC for Distinguished Young Scholars (61525105) and the Innovation Fund of Xidian University. M. Zhou acknowledges the support of Award IIS-1812699 from the U.S. National Science Foundation.

References

  • (1)

    I. Sutskever and G. E. Hinton, “Learning multilevel distributed representations for high-dimensional sequences,” in

    AISTATS, 2007.
  • (2) C. Wang, D. Blei, and D. Heckerman, “Continuous time dynamic topic models,” in UAI, 2008, pp. 579–586.
  • (3) M. Hermans and B. Schrauwen, “Training and analysing deep recurrent neural networks,” in NIPS, 2013, pp. 190–198.
  • (4) Z. Gan, C. Li, R. Henao, D. E. Carlson, and L. Carin, “Deep temporal sigmoid belief networks for sequence modeling,” in NIPS, 2015, pp. 2467–2475.
  • (5) A. Acharya, J. Ghosh, and M. Zhou, “Nonparametric Bayesian factor analysis for dynamic count matrices,” in AISTATS, 2015.
  • (6) L. Charlin, R. Ranganath, J. Mcinerney, and D. M. Blei, “Dynamic Poisson factorization,” in ACM, 2015, pp. 155–162.
  • (7) A. Schein, M. Zhou, and H. Wallach, “Poisson–gamma dynamical systems,” in NIPS, 2016.
  • (8) C. Y. Gong and W. Huang, “Deep dynamic Poisson factorization model,” in NIPS, 2017.
  • (9) H. R. Rabiee, H. R. Rabiee, H. R. Rabiee, H. R. Rabiee, H. R. Rabiee, H. R. Rabiee, and H. R. Rabiee, “Recurrent Poisson factorization for temporal recommendation,” in KDD, 2017, pp. 847–855.
  • (10) Z. Ghahramani and S. T. Roweis, “Learning nonlinear dynamical systems using an EM algorithm,” in NIPS, 1999, pp. 431–437.
  • (11) J. M. Wang, A. Hertzmann, and D. M. Blei, “Gaussian process dynamical models,” in NIPS, 2006.
  • (12) R. E. Kalman, “Mathematical description of linear dynamical systems,” Journal of The Society for Industrial and Applied Mathematics, Series A: Control, vol. 1, no. 2, pp. 152–192, 1963.
  • (13) R. M. Neal, “Connectionist learning of belief networks,” Artificial Intelligence, vol. 56, no. 1, pp. 71–113, 1992.
  • (14) R. Ranganath, L. Tang, L. Charlin, and D. M. Blei, “Deep exponential families,” in AISTATS, 2014, pp. 762–771.
  • (15) M. Zhou, Y. Cong, and B. Chen, “The Poisson gamma belief network,” in NIPS, 2015, pp. 3043–3051.
  • (16) R. Henao, Z. Gan, J. T. Lu, and L. Carin, “Deep Poisson factor modeling,” in NIPS, 2015, pp. 2800–2808.
  • (17) Y. A. Ma, T. Chen, and E. B. Fox, “A complete recipe for stochastic gradient MCMC,” in NIPS, 2015, pp. 2917–2925.
  • (18) M. Welling and Y. W. Teh, “Bayesian learning via stochastic gradient Langevin dynamics,” in ICML, 2011, pp. 681–688.
  • (19) S. Patterson and Y. W. Teh, “Stochastic gradient Riemannian Langevin dynamics on the probability simplex,” in NIPS, 2013, pp. 3102–3110.
  • (20) N. Ding, Y. Fang, R. Babbush, C. Chen, R. D. Skeel, and H. Neven, “Bayesian sampling using stochastic gradient thermostats,” in NIPS, 2014, pp. 3203–3211.
  • (21) C. Li, C. Chen, D. Carlson, and L. Carin, “Preconditioned stochastic gradient Langevin dynamics for deep neural networks,” in AAAI, 2016, pp. 1788–1794.
  • (22) M. Zhou, “Infinite edge partition models for overlapping community detection and link prediction,” in AISTATS, 2015, pp. 1135–1143.
  • (23) M. Zhou, Y. Cong, and B. Chen, “Augmentable gamma belief networks,”

    Journal of Machine Learning Research

    , vol. 17, no. 163, pp. 1–44, 2016.
  • (24) M. Zhou and L. Carin, “Negative binomial process count and mixture modeling,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 2, pp. 307–320, 2015.
  • (25) R. M. Corless, G. H. Gonnet, D. E. G. Hare, D. J. Jeffrey, and D. E. Knuth, “On the LambertW function,” Advances in Computational Mathematics, vol. 5, no. 1, pp. 329–359, 1996.
  • (26)

    H. Zhang, B. Chen, D. Guo, and M. Zhou, “WHAI: Weibull hybrid autoencoding inference for deep topic modeling,” in

    ICLR, 2018.
  • (27) Y. Cong, B. Chen, H. Liu, and M. Zhou, “Deep latent Dirichlet allocation with topic-layer-adaptive stochastic gradient Riemannian MCMC,” in ICML, 2017.
  • (28)

    Y. Cong, B. Chen, and M. Zhou, “Fast simulation of hyperplane-truncated multivariate normal distributions,”

    Bayesian Anal., vol. 12, no. 4, pp. 1017–1037, 2017.
  • (29) M. A. Girolami and B. Calderhead, “Riemann manifold Langevin and Hamiltonian Monte Carlo methods,” Journal of The Royal Statistical Society Series B-statistical Methodology, vol. 73, no. 2, pp. 123–214, 2011.
  • (30)

    N. D. Lawrence, “Probabilistic non-linear principal component analysis with gaussian process latent variable models,”

    Journal of Machine Learning Research, vol. 6, pp. 1783–1816, 2005.
  • (31) S. Han, L. Du, E. Salazar, and L. Carin, “Dynamic rank factor model for text streams,” in NIPS, 2014, pp. 2663–2671.
  • (32) P. Gopalan, F. J. R. Ruiz, R. Ranganath, and D. M. Blei, “Bayesian nonparametric Poisson factorization for recommendation systems,” in AISTATS, 2014.
  • (33) F. J. Anscombe, “The transformation of Poisson, binomial and negative-binomial data,” Biometrika, vol. 35, no. 3/4, pp. 246–254, 1948.
  • (34) D. B. Dunson and A. H. Herring, “Bayesian latent variable models for mixed discrete outcomes,” Biostatistics, vol. 6, no. 1, pp. 11–25, 2005.
  • (35) M. Zhou, L. Hannah, D. B. Dunson, and L. Carin, “Beta-negative binomial process and Poisson factor analysis,” in AISTATS, 2012, pp. 1462–1471.
  • (36) M. Zhou, “Nonparametric Bayesian negative binomial factor analysis,” Bayesian Anal., vol. 13, no. 4, pp. 1061–1089, 2018.

Appendix A Details of inference via Gibbs sampling for DPGDS

Inference for the DPGDS shown in (2

) is challenging, as neither the conjugate prior nor closed-form maximum likelihood estimate is known for the shape parameter of a gamma distribution. Although seemingly difficult, by generalizing the data augmentation and marginalization techniques, we are able to derive a backward-upward and then forward-downward Gibbs sampling algorithm, making it simple to draw random samples to represent the posteriors of model parameters. We marginalize over

by performing a “ackward” and “upward” filters, starting with . We repeatedly exploit the following three properties:

Property 1 (P1): if , where are independent Poisson-distributed random variables, then and dunson2005bayesian ; zhou2012beta .

Property 2 (P2): , where c is a constant, and then is a negative binomial–distributed random variable. We can equivalently parameterize it as , where is the Bernoulli–Poisson link Zhou2015Infinite and .

Property 3 (P3): if and is a Chinese restaurant table-distributed random variable, then and are equivalently jointly distributed as and zhou2015negative .

a.1 Forward-downward sampling

Sampling transition matrix : The alternative model specification, with marginalized out, assumes that . Therefore, via the Dirichlet-multinomial conjugacy, we have

(14)

Sampling loading factor matrix : Given these latent counts, via the Dirichlet-multinomial conjugacy, we have

(15)

Sampling : Via the gamma-Poisson conjugacy, we have

(16)
(17)

Sampling :

(18)

Sampling and :

(19)

To obtain closed-form conditional posterior for and , we start with

(20)

where and . Following Zhou zhou_bayesian

, we draw a beta-distributed auxiliary variable:

(21)

Consequently, we have

(22)

for