1 Introduction
A central task in text analysis and language modeling is to effectively represent documents in a way that captures their underlying semantic structures. A basic idea is to represent the words appearing in a document with a sequence of one-hot vectors, where the vector dimension is the size of the vocabulary. This preserves all textual information but results in a collection of extremely large and sparse matrices for a text corpus. Given memory and computation constraints, it is very challenging to directly model this lossless representation. Thus existing methods often resort to simplified lossy representations that either completely ignore word order (Blei et al., 2003), or embed the words into a lower-dimensional feature space (Mikolov et al., 2013).
Ignoring word order, each document is simplified as a bag-of-words count vector, each element of which records how many times the corresponding vocabulary term appears in that document. With a text corpus simplified as a term-document frequency count matrix, a wide array of latent variable models (LVMs) have been proposed for text analysis (Deerwester et al., 1990; Papadimitriou et al., 2000; Lee & Seung, 2001; Blei et al., 2003; Hinton & Salakhutdinov, 2009; Zhou et al., 2012). Extending “shallow” probabilistic topic models such as latent Dirichlet allocation (LDA) (Blei et al., 2003) and Poisson factor analysis (PFA) (Zhou et al., 2012), steady progress has been made in inferring multi-stochastic-layer deep latent representations for text analysis (Gan et al., 2015; Zhou et al., 2016; Ranganath et al., 2015; Zhang et al., 2018). Despite this progress, completely ignoring word order can still be particularly problematic on some common text-analysis tasks, such as spam detection and sentiment analysis
(Pang et al., 2002; Tang et al., 2014). To preserve word order, a common practice is to first convert each word in the vocabulary from a high-dimensional sparse one-hot vector into a low-dimensional dense word-embedding vector. The word-embedding vectors can be either trained as part of the learning (Kim, 2014; Kalchbrenner et al., 2014), or pretrained by some other method on an additional large corpus (Mikolov et al., 2013). Sequentially ordered word-embedding vectors have been successfully combined with deep neural networks to address various problems in text analysis and language modeling. A typical combination method is to use the word-embedding layer as part of a recurrent neural network (RNN), especially long short-term memory (LSTM) and its variants
(Hochreiter & Schmidhuber, 1997; Chung et al., 2014), achieving great success in numerous tasks that rely heavily on high-quality sentence representations. Another popular combination method is to apply a convolutional neural network (CNN)
(Lecun et al., 1998) directly to the embedding representation, treating the word-embedding layer as an image input; this approach has been widely used in systems for entity search, sentence modeling, product feature mining, and so on (Xu & Sarikaya, 2013; Weston et al., 2014). In this paper, we first propose convolutional PFA (CPFA), which directly models documents, each of which is represented without information loss as a sequence of one-hot vectors. We then boost its performance by coupling it with the gamma belief network (GBN) of Zhou et al. (2016), a multi-stochastic-hidden-layer deep generative model, via a novel probabilistic document-level pooling layer. We refer to the coupled CPFA and GBN model as the convolutional Poisson GBN (CPGBN). To the best of our knowledge, CPGBN is the first unsupervised probabilistic convolutional model that infers multi-stochastic-layer latent variables for documents represented without information loss. Its hidden layers can be jointly trained with an upward-downward Gibbs sampler; this makes its inference different from greedy layer-wise training (Lee et al., 2009; Chen et al., 2013). In each Gibbs sampling iteration, the main computation is embarrassingly parallel and hence can be accelerated with graphics processing units (GPUs). We also develop a Weibull distribution based convolutional variational autoencoder to provide amortized variational inference, which further accelerates both training and testing for large corpora. Exploiting the multilayer structure of CPGBN, we further propose a supervised CPGBN (sCPGBN), which combines the representation power of CPGBN for topic modeling and the discriminative power of deep neural networks (NNs) under a principled probabilistic framework. We show that the proposed models achieve state-of-the-art results in a variety of text-analysis tasks.
2 Convolutional Models for Text Analysis
Below we introduce CPFA and then develop a probabilistic document-level pooling method to couple CPFA with GBN, which further serves as the decoder of a Weibull distribution based convolutional variational autoencoder (VAE).
2.1 Convolutional Poisson Factor Analysis
Consider a fixed vocabulary, and represent the sequentially ordered words of each document as a sequence of one-hot vectors whose dimension is the vocabulary size. For example, with vocabulary (“don’t”, “hate”, “I”, “it”, “like”), the document (“I”, “like”, “it”) can be represented as a sequence of three one-hot column vectors that select the third, fifth, and fourth vocabulary terms, respectively. Each binary entry of this representation equals one if and only if the corresponding word of the document matches the corresponding term of the vocabulary.
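As a concrete illustration of this lossless representation, the toy example above can be sketched as follows (a minimal sketch, not the authors' code; the function and variable names are our own):

```python
import numpy as np

vocab = ["don't", "hate", "I", "it", "like"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot_document(words):
    """Return a (vocab_size, doc_length) binary matrix whose j-th column is
    the one-hot encoding of the j-th word of the document."""
    B = np.zeros((len(vocab), len(words)), dtype=np.int64)
    for j, w in enumerate(words):
        B[word_to_idx[w], j] = 1
    return B

B = one_hot_document(["I", "like", "it"])
# Entry (v, j) is 1 iff the j-th word of the document matches the v-th term.
```

Each column sums to one, so the matrix is extremely sparse, which the inference procedures below exploit.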
To exploit a rich set of tools developed for count data analysis (Zhou et al., 2012, 2016), we first link these sequential binary vectors to sequential count vectors via the Bernoulli-Poisson link (Zhou, 2015). More specifically, we link each binary observation to a latent count, and factorize the resulting latent count matrix under the Poisson likelihood. Distinct from vanilla PFA (Zhou et al., 2012), where the columns of the matrix are treated as conditionally independent, here we introduce convolution into the hierarchical model to capture the sequential dependence between the columns. We construct the hierarchical model of CPFA as
(1) 
where the latent count matrix is factorized into the summation of equal-sized latent count matrices, the Poisson rates of each of which are obtained by convolving a convolutional filter/factor/topic of fixed filter width with its corresponding gamma distributed feature representation. To complete the hierarchical model, we place gamma priors on the remaining variables. Note that, as in Zhou et al. (2016), we may view the number of factors as the truncation level of a gamma process, which allows the number of needed factors to be inferred from the data as long as the truncation level is set sufficiently large. Each filter element can be interpreted as the probability that a given vocabulary term appears at a given temporal location of the corresponding latent topic, and we expect each filter to extract both global co-occurrence patterns, such as common topics, and local temporal structures, such as common phrases, from the text corpus. Note that the convolution layer of CPFA converts text regions of the filter width (e.g., “am so happy” with filter width three) to feature vectors, directly learning the embedding of text regions without going through a separate learning step for word embeddings. Thus CPFA provides a potential solution for distinguishing polysemous words according to their neighboring words. The length of the representation weight vector in our model varies with the document length. This differentiates CPFA from traditional convolutional models with a fixed feature-map size (Zhang et al., 2017; Miao et al., 2018; Min et al., 2019), which require either heuristic cropping or zero-padding.
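The varying-length feature map can be made concrete with a minimal numerical sketch (toy sizes of our own choosing, not the paper's settings): sliding a filter of width F over a V x L one-hot matrix yields a feature map of length L - F + 1.

```python
import numpy as np

V, L, F = 5, 9, 3  # vocabulary size, document length, filter width (toy values)
rng = np.random.default_rng(1)

doc = np.eye(V)[:, rng.integers(0, V, size=L)]  # V x L one-hot document matrix
filt = rng.random((V, F))                       # one convolutional filter/topic

# "Valid" 2-D correlation: slide the width-F filter over the L word positions.
feature_map = np.array([
    np.sum(filt * doc[:, t:t + F]) for t in range(L - F + 1)
])
# feature_map has length L - F + 1 = 7, which grows with the document length.
```

A longer document simply produces a longer feature map, so no cropping or zero-padding of the input is needed.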
2.2 Convolutional Poisson Gamma Belief Network
There has been significant recent interest in inferring multi-stochastic-layer deep latent representations for text analysis in an unsupervised manner (Gan et al., 2015; Zhou et al., 2016; Ranganath et al., 2015; Wang et al., 2018; Zhang et al., 2018), where word order is ignored. The key intuition behind these models, such as GBN (Zhou et al., 2016), is that words that frequently co-occur in the same document can form specific word-level topics in shallow layers; as the depth of the network increases, frequently co-occurring topics can form more general ones. Here, we propose a model that preserves word order without losing the nice hierarchical topical interpretation provided by a deep topic model. The intuition is that by preserving word order, words can first form short phrases; frequently co-occurring short phrases can then be combined to form specific phrase-level topics; and these specific phrase-level topics can form increasingly more general phrase-level topics when moving towards deeper layers.
As in Fig. 1, we couple CPFA in (1) with GBN to construct CPGBN, whose generative model with multiple hidden layers is, from top to bottom, expressed as
(2) 
where superscripts indicate layers. Note that CPGBN first factorizes the latent count matrix under the Poisson likelihood into a summation of convolutions between the convolutional filters and their corresponding weight vectors. Using the relationship between the gamma and Dirichlet distributions (see Lemma IV.3 of Zhou & Carin (2012)), the first-layer hidden units in (2) can be equivalently generated as
(3) 
which can be seen as a specific probabilistic document-level pooling operation on the gamma shape parameter. For the intermediate layers, the shape parameters of the gamma distributed hidden units are factorized into the product of the connection weight matrix and the hidden units of the layer above; the top layer’s hidden units share the same gamma shape parameters, and the remaining parameters act as gamma scale parameters. For scale identifiability and ease of inference, the columns of the connection weight matrices are restricted to have unit norm. To complete the hierarchical model, we place gamma priors on the remaining global parameters.
Examining (3) shows that CPGBN provides a probabilistic document-level pooling layer, which summarizes the content coefficients across all word positions into a single document-level representation; the hierarchical structure above this layer can be flexibly modified according to the deep model (not restricted to GBN) to be combined with. The proposed pooling layer can be trained jointly with all the other layers, making it distinct from a usual pooling layer that often cuts off the message passing from deeper layers (Lee et al., 2009; Chen et al., 2013). We note that pooling on the first hidden layer is related to shallow text CNNs that use document-level pooling directly after a single convolutional layer (Kim, 2014; Johnson & Zhang, 2015a), which often contributes to improved efficiency (Boureau et al., 2010; Wang et al., 2010).
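The pooling idea can be sketched numerically as follows (an illustrative sketch with assumed variable names, not the exact generative sampling): per-position, per-topic gamma distributed weights are summed over word positions to give one document-level coefficient per topic.

```python
import numpy as np

rng = np.random.default_rng(2)
K, L = 4, 10  # number of topics, document length (toy sizes)

# Per-position, per-topic gamma distributed weights, one column per position.
W = rng.gamma(shape=0.5, scale=1.0, size=(K, L))

# Document-level pooling: sum each topic's weights over all word positions,
# giving a fixed-length representation regardless of the document length L.
theta = W.sum(axis=1)
```

Because `theta` has a fixed length K for any document length, it can be passed to deeper layers (such as a GBN) without cropping or padding.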
2.3 Convolutional Inference Network for CPGBN
To make our model both scalable to big corpora in training and fast in out-of-sample prediction, below we introduce a convolutional inference network, which will be used in the hybrid MCMC/variational inference described in Section 3.2. Note that the usual strategy of autoencoding variational inference is to construct an inference network that maps the observations directly to their latent representations, and to optimize the encoder and decoder by minimizing the negative evidence lower bound (ELBO) as
, where
(4)
Following Zhang et al. (2018), we use the Weibull distribution to approximate the gamma distributed conditional posterior of the hidden units, as it is reparameterizable, resembles the gamma distribution, and the Kullback–Leibler (KL) divergence from the Weibull to the gamma distribution is analytic. As in Fig. 1, we construct the autoencoding variational distribution as
(5) 
The Weibull parameters are deterministically transformed from the observation using CNNs, where the first convolutional layer is applied to the one-hot document matrix with zero-padding; the parameters of the deeper layers are then transformed from the hidden features of the layer below.
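For concreteness, the Weibull reparameterization and the analytic KL term used in the ELBO can be sketched as follows (a standalone sketch following the formulas in Zhang et al. (2018); the function names and toy parameter values are our own):

```python
import numpy as np
from math import lgamma, log, exp

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def sample_weibull(k, lam, rng):
    """Reparameterized Weibull draw: x = lam * (-log(1 - u))**(1/k), u ~ U(0,1),
    which is differentiable with respect to (k, lam)."""
    u = rng.uniform()
    return lam * (-np.log1p(-u)) ** (1.0 / k)

def kl_weibull_gamma(k, lam, alpha, beta):
    """Analytic KL( Weibull(k, lam) || Gamma(alpha, beta) )."""
    return (EULER_GAMMA * alpha / k - alpha * log(lam) + log(k)
            + beta * lam * exp(lgamma(1.0 + 1.0 / k))
            - EULER_GAMMA - 1.0 - alpha * log(beta) + lgamma(alpha))

rng = np.random.default_rng(0)
x = sample_weibull(2.0, 1.5, rng)          # a positive, reparameterized sample
kl = kl_weibull_gamma(2.0, 1.5, 1.0, 1.0)  # closed-form KL term of the ELBO
```

Since Weibull(1, 1) and Gamma(1, 1) are both the unit-rate exponential distribution, the KL evaluates to exactly zero in that case, a handy sanity check.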
Further, we develop sCPGBN, a supervised generalization of CPGBN, for text categorization tasks: by adding a softmax classifier on the concatenation of the hidden representations across layers, the loss function of the entire framework is modified to a weighted combination of the negative ELBO and a cross-entropy loss, with a coefficient used to balance generation and discrimination (Higgins et al., 2017).
3 Inference
Below we describe the key inference equations for CPFA shown in (1), a single-hidden-layer version of CPGBN shown in (2), and provide more details in the Appendix. How the inference of CPFA, including Gibbs sampling and hybrid MCMC/autoencoding variational inference, generalizes to that of CPGBN is similar to how the inference of PFA generalizes to that of PGBN, as described in detail in Zhou et al. (2016) and Zhang et al. (2018), and is omitted here for brevity.
3.1 Gibbs Sampling
Directly dealing with the whole matrix by expanding the convolution operation via Toeplitz conversion (Bojanczyk et al., 1995) provides a straightforward solution for the inference of convolutional models: it transforms each observation matrix into a vector, to which the inference methods for sparse factor analysis (Carvalho et al., 2008; James et al., 2010) could then be applied. However, since the document matrices consisting of one-hot vectors are sparse, directly processing them without exploiting sparsity would impose an unnecessary computational and storage burden. Instead, we apply data augmentation under the Poisson likelihood (Zhou et al., 2012, 2016) to upward propagate latent count matrices as
Note that we only need to focus on the nonzero elements of the latent count matrix. We rewrite the likelihood function by expanding the convolution operation along the temporal dimension, so that each nonzero element can be augmented as
(6) 
We can now decouple the factors in (1) by marginalizing out the augmented counts. Here the symbol “·” in a subscript denotes summing over the corresponding index; using the gamma-Poisson conjugacy, we obtain a closed-form conditional posterior for the gamma distributed feature representation.
Similarly, we can expand the convolution along the other direction, and, further applying the relationship between the Poisson and multinomial distributions together with the Dirichlet-multinomial conjugacy, we obtain a closed-form conditional posterior for the convolutional filters.
Exploiting the properties of the Poisson and multinomial distributions helps CPFA take full advantage of the sparsity of the one-hot vectors, making its complexity comparable to that of a regular bag-of-words topic model that uses Gibbs sampling for inference. Note that since the multinomial-related sampling steps inside each iteration are embarrassingly parallel, they are accelerated with GPUs in our experiments.
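The augmentation at the heart of the sampler rests on the standard Poisson-multinomial relationship, which can be sketched as follows (toy rates of our own choosing): a total count is split among components with probabilities proportional to their rates, and zero counts need no splitting at all.

```python
import numpy as np

rng = np.random.default_rng(4)
lam = np.array([0.2, 1.0, 0.5])  # toy component Poisson rates

# A sum of independent Poisson counts is Poisson with the summed rate...
total = rng.poisson(lam.sum())

# ...and, given the total, the components are multinomial with probabilities
# proportional to the rates; zero totals need no splitting at all.
if total > 0:
    parts = rng.multinomial(total, lam / lam.sum())
else:
    parts = np.zeros(lam.shape, dtype=np.int64)
```

Because most entries of a one-hot document matrix are zero, only the few nonzero latent counts ever reach the multinomial split, which is what keeps the sampler's cost comparable to that of a bag-of-words model.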
3.2 Hybrid MCMC/Variational Inference
While having closed-form update equations, the Gibbs sampler requires processing all documents in each iteration and hence has limited scalability. Fortunately, there have been several related studies on scalable inference for discrete LVMs (Ma et al., 2015; Patterson & Teh, 2013). Specifically, TLASGR-MCMC (Cong et al., 2017), which respects the simplex constraints of the topics and increases sampling efficiency via the Fisher information matrix (FIM), with adaptive step sizes for the topics of different layers, can be naturally extended to our model. The efficient TLASGR-MCMC update of the global parameters in CPFA can be described as
(7) 
where the update is indexed by the number of mini-batches processed so far; the symbol “·” in a subscript denotes summing over the data in a mini-batch; and the definitions of the remaining terms are analogous to those in Cong et al. (2017) and omitted here for brevity.
Similar to Zhang et al. (2018), combining TLASGR-MCMC with the convolutional inference network described in Section 2.3, we can construct a hybrid stochastic-gradient-MCMC/autoencoding variational inference algorithm for CPFA. More specifically, in each mini-batch based iteration, we draw a random sample of the CPFA global parameters via TLASGR-MCMC; given the sampled global parameters, we optimize the parameters of the convolutional inference network using the ELBO in (4), which for CPFA simplifies to
(8) 
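To make the alternating scheme concrete, the following toy sketch (a one-parameter stand-in of our own, not the CPFA updates) alternates an SGLD-style draw of a global parameter with a gradient step on a surrogate-ELBO encoder parameter:

```python
import numpy as np

rng = np.random.default_rng(5)

def sgmcmc_step(phi, batch, eps=1e-3):
    """Toy SGLD-style update for a single global parameter: a noisy step
    along a mini-batch estimate of the log-posterior gradient."""
    grad = batch.size * (batch.mean() - phi)
    return phi + eps * grad + np.sqrt(2.0 * eps) * rng.normal()

def encoder_grad(w, phi):
    """Toy surrogate-ELBO gradient for a one-parameter 'encoder'."""
    return -(w - phi) - 0.1 * w

phi, w = 0.0, 0.0
for _ in range(500):
    batch = rng.normal(loc=3.0, size=32)  # draw a mini-batch of data
    phi = sgmcmc_step(phi, batch)         # (i) SG-MCMC draw of the globals
    w += 0.05 * encoder_grad(w, phi)      # (ii) gradient step on the encoder
# phi fluctuates around the data mean; w tracks phi through the surrogate ELBO.
```

The real algorithm has the same two-phase shape per mini-batch, with the SGLD step replaced by the TLASGR-MCMC topic update and the surrogate gradient replaced by the ELBO gradient of the convolutional inference network.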
We describe the proposed hybrid stochastic-gradient-MCMC/autoencoding variational inference procedure in Algorithm 1, which is implemented in TensorFlow (Abadi et al., 2016), combined with pyCUDA (Klockner et al., 2012) for more efficient computation.
4 Related Work
With the bag-of-words representation that ignores word order information, a diverse set of deep topic models have been proposed to infer a multilayer data representation in an unsupervised manner. Their main mechanism is to connect adjacent layers by specific factorizations, which usually boosts performance (Gan et al., 2015; Zhou et al., 2016; Zhang et al., 2018). However, limited by the bag-of-words representation, they usually perform poorly on sentiment analysis tasks, which rely heavily on word order information (Xu & Sarikaya, 2013; Weston et al., 2014). In this paper, the proposed CPGBN can be seen as a novel convolutional extension that not only clearly remedies the loss of word order, but also inherits various virtues of deep topic models.
Benefiting from advances in word-embedding methods, CNN-based architectures have been leveraged as encoders for various natural language processing tasks (
Kim, 2014; Kalchbrenner et al., 2014). In general, they directly apply a single convolution layer to the word-embedding layer, which, given a convolution filter window of a certain size, essentially acts as a detector of typical n-grams. More complex deep neural networks taking CNNs as their encoders and RNNs as decoders have also been studied for text generation (Zhang et al., 2016; Semeniuta et al., 2017). However, for unsupervised sentence modeling, language decoders other than RNNs are less well studied; it was not until recently that Zhang et al. (2017) proposed a simple yet powerful, purely convolutional framework for learning sentence representations in an unsupervised manner, which is the first to force the encoded latent representation to capture the information of the entire sentence via a multilayer CNN specification. But there still exists the limitation of requiring an additional large corpus for training word embeddings, and it is also difficult to visualize and explain the semantic meanings learned by black-box deep networks.
For text categorization, bigrams (or a combination of bigrams and unigrams) have been confirmed to provide more discriminative power than unigrams alone (Tan et al., 2002; Glorot et al., 2011). Motivated by this observation, Johnson & Zhang (2015a)
tackle document categorization tasks by directly applying shallow CNNs, with filter width three, to one-hot encoded document matrices, outperforming both traditional n-gram and word-embedding based methods without the aid of additional training data. In addition, the shallow CNN serves as an important building block in many other supervised applications, helping achieve state-of-the-art results (Johnson & Zhang, 2015b, 2017).
5 Experimental Results
Data                  MR      TREC    SUBJ    ELEC    IMDB
Classes               2       6       2       2       2
Average length        20      10      23      123     266
Dataset size          10662   5952    10000   50000   50000
Vocabulary size       20277   8678    22636   52248   95212
Selected vocabulary   20000   8000    20000   30000   30000
Test set size         CV      500     CV      25000   25000
Table 1. Summary statistics of the benchmark datasets.
Model     Size         Accuracy (MR / TREC / SUBJ)   Time (MR / TREC / SUBJ)
LDA       200          54.4 / 45.5 / 68.2            3.93 / 0.92 / 3.81
DocNADE   200          54.2 / 62.0 / 72.9            – / – / –
DPFA      200          55.2 / 51.4 / 74.5            6.61 / 1.88 / 6.53
DPFA      200-100      55.4 / 52.0 / 74.4            6.74 / 1.92 / 6.62
DPFA      200-100-50   56.1 / 62.0 / 78.5            6.92 / 1.95 / 6.80
PGBN      200          56.3 / 66.7 / 76.2            3.97 / 1.01 / 3.56
PGBN      200-100      56.7 / 67.3 / 77.3            5.09 / 1.72 / 4.39
PGBN      200-100-50   57.0 / 67.9 / 78.3            5.67 / 1.87 / 4.91
WHAI      200          55.6 / 60.4 / 75.4            – / – / –
WHAI      200-100      56.2 / 63.5 / 76.0            – / – / –
WHAI      200-100-50   56.4 / 65.6 / 76.5            – / – / –
CPGBN     200          61.5 / 68.4 / 77.4            3.58 / 0.98 / 3.53
CPGBN     200-100      62.4 / 73.4 / 81.2            8.19 / 1.99 / 6.56
CPGBN     200-100-50   63.6 / 74.4 / 81.5            10.44 / 2.59 / 7.87
Table 2. Classification accuracy (%) and average per-epoch run time of various algorithms on MR, TREC, and SUBJ.
Kernel Index   1st Column    2nd Column   3rd Column   Visualized Phrases
192nd Kernel   how           do           you          how do you, how many years, how much degrees
               cocktail      many         years
               stadium       much         miles
               run           long         degrees
80th Kernel    microsoft     –            address      microsoft email address, virtual ip address
               virtual       –            addresses
               answers.com   ip           floods
               softball      brothers     score
177th Kernel   who           created      maria        who created snoopy, who fired caesar, who wrote angela
               willy         wrote        angela
               bar           fired        snoopy
               hydrogen      are          caesar
47th Kernel    dist          how          far          dist how far, dist how high, dist how tall
               alltime       stock        high
               wheel         1976         tall
               saltpepter    westview     exchange
Table 3. Visualization of example convolutional kernels of filter width 3 learned from TREC, showing the top 4 most probable words in each column and example phrases they compose.
5.1 Datasets and Preprocessing
We test the proposed CPGBN and its supervised extension (sCPGBN) on various benchmarks, including:
MR: Movie reviews with one sentence per review, where the task is to classify a review as positive or negative (Pang & Lee, 2005).
TREC: TREC question dataset, where the task is to classify a question into one of six question types (whether the question is about abbreviation, entity, description, human, location, or numeric) (Li & Roth, 2002).
SUBJ: Subjectivity dataset, where the task is to classify a sentence as subjective or objective (Pang & Lee, 2004).
ELEC: The ELEC dataset (Mcauley & Leskovec, 2013) consists of electronic product reviews and is part of a large Amazon review dataset.
IMDB: The IMDB dataset (Maas et al., 2011) is a benchmark for sentiment analysis, where the task is to determine whether a movie review is positive or negative.
We follow the steps listed in Johnson & Zhang (2015a) to tokenize the text, where emojis such as “:)” are treated as tokens and all characters are converted to lower case. We then select the most frequent words to construct the vocabulary, without dropping stopwords; we map the words not included in the vocabulary to the same special token to keep all sentences structurally intact. The summary statistics of all benchmark datasets are listed in Table 1.
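A minimal sketch of this preprocessing (the regular expression, the helper names, and the OOV token are our own assumptions, not the exact pipeline of Johnson & Zhang (2015a)):

```python
import re
from collections import Counter

OOV = "<oov>"
TOKEN_RE = re.compile(r":\)|:\(|[a-z0-9']+")  # keep emoticons such as ":)"

def tokenize(text):
    """Lower-case the text and split it into word and emoticon tokens."""
    return TOKEN_RE.findall(text.lower())

def build_vocab(docs, top_k):
    """Keep the top_k most frequent tokens; stopwords are NOT dropped."""
    counts = Counter(tok for d in docs for tok in tokenize(d))
    return {w for w, _ in counts.most_common(top_k)}

def encode(doc, vocab):
    """Keep every position, mapping unknown words to the same OOV token so
    that sentences stay structurally intact."""
    return [tok if tok in vocab else OOV for tok in tokenize(doc)]

docs = ["I like it :)", "I hate it", "like like like"]
vocab = build_vocab(docs, top_k=3)
encoded = encode("I like spam :)", vocab)
```

Keeping out-of-vocabulary positions (rather than dropping them) preserves the word order that the convolutional layers rely on.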
5.2 Inference Efficiency
In this section we show the results of the proposed CPGBN on TREC. First, to demonstrate the advantages of increasing the depth of the network, we construct three networks of increasing depth, from a single hidden layer to three hidden layers. Under the same filter width and the same hyperparameter settings, the networks are trained with the proposed Gibbs sampler. The trace plots of model likelihoods are shown in Fig. 2. It is worth noting that increasing the network depth in general improves the quality of data fitting, but as the complexity of the model increases, the model tends to converge more slowly in time.
Considering that data fitting and generation ability are not necessarily strongly correlated with the performance on specific tasks, we also evaluate the proposed models on document classification. Using the same experimental settings as above, we investigate how the classification accuracy is impacted by the network structure. On each network, we apply the Gibbs sampler to collect 200 MCMC samples after 500 burn-in iterations to estimate the posterior mean of the feature usage weight vector for every document in both the training and testing sets. To make a fair comparison, a linear support vector machine (SVM) (Cortes & Vapnik, 1995) is applied to the first-hidden-layer features, as given in (2), where each result listed in Table 2 is the average accuracy of five independent runs. Fig. 3 shows a clear trend of improvement in classification accuracy, by increasing the network depth given a limited first-layer width, or by increasing the hidden-layer width given a fixed depth.
5.3 Unsupervised Models
In our second set of experiments, we evaluate the performance of different unsupervised algorithms on the MR, TREC, and SUBJ datasets by comparing the discriminative ability of the latent features they extract in an unsupervised manner. We consider LDA (Blei et al., 2003) and its deep extensions, including DPFA (Gan et al., 2015) and PGBN (Zhou et al., 2016), which are trained with batch Gibbs sampling. We also consider WHAI (Zhang et al., 2018) and DocNADE (Lauly et al., 2017), which are trained with stochastic gradient descent.
To make a fair comparison, we let CPGBNs have the same hidden-layer widths as the other methods, and set the filter width to 3 for the convolutional layer. Listed in Table 2 are the results of various algorithms, where the means and error bars are obtained from five independent runs, using the code provided by the original authors. For all batch learning algorithms, we also report in Table 2 their average run time per epoch (i.e., processing all training documents once). Clearly, given the same generative network structure, CPGBN performs the best in terms of classification accuracy, which can be attributed to its ability to utilize word order information. The performance of CPGBN shows a clear trend of improvement as the generative network becomes deeper, which is also observed on other deep generative models including DPFA, PGBN, and WHAI. In terms of running time, the shallow LDA could be the most efficient model compared with these more sophisticated ones, while a single-layer CPGBN achieves comparable efficiency thanks to its use of the GPU for parallelizing its computation inside each iteration. Note that all running times are reported based on an Nvidia GTX 1080Ti GPU.
In addition to quantitative evaluations, we have also visually inspected the inferred convolutional kernels of CPGBN, which is distinct from many existing convolutional models that build nonlinearity via “black-box” neural networks. As shown in Table 3, we list several convolutional kernel elements of filter width 3 learned from TREC, using a single-hidden-layer CPGBN of size 200. We exhibit the top 4 most probable words in each column of the corresponding kernel element. It is particularly interesting to note that the words in different columns can be combined into a variety of interpretable phrases with similar semantics. CPGBN explicitly takes word order information into consideration to extract phrases, which are then combined into a hierarchy of phrase-level topics, helping clearly improve the quality of the features extracted in an unsupervised manner.
Take the 177th convolutional kernel as an example: the top word of its 1st topic is “who,” its 2nd topic is a verb topic: “created, wrote, fired, are,” and its 3rd topic is a noun topic: “maria/angela/snoopy/caesar.” These word-level topics can be combined to construct phrases such as “who, created/wrote/fired/are, maria/angela/snoopy/caesar,” resulting in a phrase-level topic about “human,” one of the six question types in TREC. Note that these shallow phrase-level topics become more general in the deeper layers of CPGBN. We provide two example phrase-level topic hierarchies in the Appendix to enhance interpretability.
5.4 Supervised Models
Table 4 lists the comparison of various supervised algorithms on three common benchmarks: SUBJ, ELEC, and IMDB. The results listed there are either quoted from published papers, or reproduced with the code provided by the original authors. We consider bag-of-words representation based supervised topic models, including sAVITM (Srivastava & Sutton, 2017), MedLDA (Zhu et al., 2014), and sWHAI (Zhang et al., 2018). We also consider bag-of-n-grams models (Johnson & Zhang, 2015a) based on unigrams, bigrams, and trigrams, as well as word-embedding based methods, indicated with the suffix “wv,” including SVM-wv (Zhang & Wallace, 2017), RNN-wv, and LSTM-wv (Johnson & Zhang, 2016). In addition, we consider several related CNN based methods, including three variants of Text CNN (Kim, 2014), namely CNN-rand, CNN-static, and CNN-nonstatic, and CNN-onehot (Johnson & Zhang, 2015a), which is based on one-hot encoding.
We construct three different sCPGBNs with one, two, and three stochastic hidden layers, as described in Section 2.3. As shown in Table 4, the word-embedding based methods generally outperform the methods based on bag-of-words representations, which is not surprising as the latter completely ignore word order. Among all bag-of-words representation based methods, sWHAI performs the best and even achieves performance comparable to some word-embedding based methods, which illustrates the benefits of having multi-stochastic-layer latent representations. As for the n-grams based models, although they achieve performance comparable to word-embedding based methods, we find via experiments that both their performance and computation are sensitive to the vocabulary size. Among the CNN-related algorithms, CNN-onehot tends to perform better on classifying longer texts than Text CNN does, which agrees with the observations of Zhang & Wallace (2017); a possible explanation for this phenomenon is that CNN-onehot is prone to overfitting on short documents. Moving beyond CNN-onehot, sCPGBN helps capture the underlying high-order statistics to alleviate overfitting, as commonly observed in deep generative models (DGMs) (Li et al., 2015), and improves its performance by increasing its number of stochastic hidden layers.
Model                                  SUBJ   ELEC   IMDB
sAVITM (Srivastava & Sutton, 2017)     85.7   83.7   84.9
MedLDA (Zhu et al., 2014)              86.5   84.6   85.7
sWHAI-layer1 (Zhang et al., 2018)      90.6   86.8   87.2
sWHAI-layer2 (Zhang et al., 2018)      91.7   87.5   88.0
sWHAI-layer3 (Zhang et al., 2018)      92.0   87.8   88.2
SVM-unigrams (Tan et al., 2002)        88.5   86.3   87.7
SVM-bigrams (Tan et al., 2002)         89.4   87.2   88.2
SVM-trigrams (Tan et al., 2002)        89.7   87.4   88.5
SVM-wv (Zhang & Wallace, 2017)         90.1   85.9   86.5
RNN-wv (Johnson & Zhang, 2016)         88.9   87.5   88.3
LSTM-wv (Johnson & Zhang, 2016)        89.8   88.3   89.0
CNN-rand (Kim, 2014)                   89.6   86.8   86.3
CNN-static (Kim, 2014)                 93.0   87.8   88.9
CNN-nonstatic (Kim, 2014)              93.4   88.6   89.5
CNN-onehot (Johnson & Zhang, 2015a)    91.1   91.3   91.6
sCPGBN-layer1                          93.4   91.6   91.8
sCPGBN-layer2                          93.7   92.0   92.4
sCPGBN-layer3                          93.8   92.2   92.6
Table 4. Comparison of classification accuracy on supervised feature extraction tasks on three different datasets.
6 Conclusion
We propose convolutional Poisson factor analysis (CPFA), a hierarchical Bayesian model that represents each word in a document as a one-hot vector, and captures word order information by performing convolution on sequentially ordered one-hot word vectors. By developing a principled document-level stochastic pooling layer, we further couple CPFA with a multi-stochastic-layer deep topic model to construct the convolutional Poisson gamma belief network (CPGBN). We develop a Gibbs sampler to jointly train all the layers of CPGBN. For more scalable training and faster testing, we further introduce a mini-batch based stochastic inference algorithm that combines stochastic-gradient MCMC and a Weibull distribution based convolutional variational autoencoder. In addition, we provide a supervised extension of CPGBN. Example results on both unsupervised and supervised feature extraction tasks show that CPGBN combines the virtues of both convolutional operations and deep topic models, providing not only state-of-the-art classification performance, but also highly interpretable phrase-level deep latent representations.
Acknowledgements
B. Chen acknowledges the support of the Program for Young Thousand Talent by Chinese Central Government, the 111 Project (No. B18039), NSFC (61771361), NSFC for Distinguished Young Scholars (61525105), and the Innovation Fund of Xidian University. M. Zhou acknowledges the support of Award IIS1812699 from the U.S. National Science Foundation and the McCombs Research Excellence Grant.
References
 Abadi et al. (2016) Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
 Blei et al. (2003) Blei, D. M., Ng, A. Y., and Jordan, M. I. Latent Dirichlet allocation. JMLR, 3:993–1022, 2003.
 Bojanczyk et al. (1995) Bojanczyk, A. W., Brent, R. P., De Hoog, F. R., and Sweet, D. R. On the stability of the Bareiss and related Toeplitz factorization algorithms. SIAM Journal on Matrix Analysis and Applications, 16(1):40–57, 1995.
 Boureau et al. (2010) Boureau, Y., Ponce, J., and Lecun, Y. A theoretical analysis of feature pooling in visual recognition. In ICML, pp. 111–118, 2010.
 Carvalho et al. (2008) Carvalho, C. M., Chang, J. T., Lucas, J. E., Nevins, J. R., Wang, Q., and West, M. Highdimensional sparse factor modeling: Applications in gene expression genomics. J. Amer. Statist. Assoc., 103(484):1438–1456, 2008.
 Chen et al. (2013) Chen, B., Polatkan, G., Sapiro, G., Blei, D. M., Dunson, D. B., and Carin, L. Deep learning with hierarchical convolutional factor analysis. IEEE Trans. Pattern Anal. Mach. Intell., 35(8):1887–1901, 2013.
 Chung et al. (2014) Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
 Cong et al. (2017) Cong, Y., Chen, B., Liu, H., and Zhou, M. Deep latent Dirichlet allocation with topic-layer-adaptive stochastic gradient Riemannian MCMC. In ICML, pp. 864–873, 2017.
 Cortes & Vapnik (1995) Cortes, C. and Vapnik, V. Support-vector networks. Machine Learning, 20(3):273–297, 1995.
 Deerwester et al. (1990) Deerwester, S., Dumais, S., Furnas, G., Landauer, T., and Harshman, R. Indexing by latent semantic analysis. J. Amer. Soc. Inf. Sci., 1990.
 Gan et al. (2015) Gan, Z., Chen, C., Henao, R., Carlson, D. E., and Carin, L. Scalable deep Poisson factor analysis for topic modeling. In ICML, pp. 1823–1832, 2015.
 Glorot et al. (2011) Glorot, X., Bordes, A., and Bengio, Y. Domain adaptation for largescale sentiment classification: A deep learning approach. In ICML, pp. 513–520, 2011.
 Higgins et al. (2017) Higgins, I., Matthey, L., Pal, A., Burgess, C. P., Glorot, X., Botvinick, M. M., Mohamed, S., and Lerchner, A. beta-VAE: Learning basic visual concepts with a constrained variational framework. In ICLR, 2017.
 Hinton & Salakhutdinov (2009) Hinton, G. E. and Salakhutdinov, R. Replicated softmax: An undirected topic model. In NIPS, pp. 1607–1614, 2009.
 Hochreiter & Schmidhuber (1997) Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
 James et al. (2010) James, G. M., Sabatti, C., Zhou, N., and Zhu, J. Sparse regulatory networks. AOAS, 4(2):663–686, 2010.
 Johnson & Zhang (2015a) Johnson, R. and Zhang, T. Effective use of word order for text categorization with convolutional neural networks. NAACL, pp. 103–112, 2015a.
 Johnson & Zhang (2015b) Johnson, R. and Zhang, T. Semi-supervised convolutional neural networks for text categorization via region embedding. In NIPS, pp. 919–927, 2015b.
 Johnson & Zhang (2016) Johnson, R. and Zhang, T. Supervised and semi-supervised text categorization using LSTM for region embeddings. In ICML, pp. 526–534, 2016.
 Johnson & Zhang (2017) Johnson, R. and Zhang, T. Deep pyramid convolutional neural networks for text categorization. In ACL, pp. 562–570, 2017.
 Kalchbrenner et al. (2014) Kalchbrenner, N., Grefenstette, E., and Blunsom, P. A convolutional neural network for modelling sentences. In ACL, pp. 655–665, 2014.
 Kim (2014) Kim, Y. Convolutional neural networks for sentence classification. In EMNLP, pp. 1746–1751, 2014.
 Klöckner et al. (2012) Klöckner, A., Pinto, N., Lee, Y., Catanzaro, B., Ivanov, P., and Fasih, A. R. PyCUDA and PyOpenCL: A scripting-based approach to GPU run-time code generation. Parallel Computing, 38(3):157–174, 2012.
 Lauly et al. (2017) Lauly, S., Zheng, Y., Allauzen, A., and Larochelle, H. Document neural autoregressive distribution estimation. JMLR, 18(113):1–24, 2017.
 LeCun et al. (1998) LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 Lee & Seung (2001) Lee, D. D. and Seung, H. S. Algorithms for non-negative matrix factorization. In NIPS, 2001.
 Lee et al. (2009) Lee, H., Pham, P. T., Largman, Y., and Ng, A. Y. Unsupervised feature learning for audio classification using convolutional deep belief networks. In NIPS, pp. 1096–1104, 2009.
 Li et al. (2015) Li, C., Zhu, J., Shi, T., and Zhang, B. Max-margin deep generative models. In NIPS, pp. 1837–1845, 2015.
 Li & Roth (2002) Li, X. and Roth, D. Learning question classifiers. In International Conference on Computational Linguistics, pp. 1–7, 2002.
 Ma et al. (2015) Ma, Y. A., Chen, T., and Fox, E. B. A complete recipe for stochastic gradient MCMC. In NIPS, pp. 2917–2925, 2015.
 Maas et al. (2011) Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., and Potts, C. Learning word vectors for sentiment analysis. In ACL, pp. 142–150, 2011.
 McAuley & Leskovec (2013) McAuley, J. and Leskovec, J. Hidden factors and hidden topics: Understanding rating dimensions with review text. In ACM RecSys, pp. 165–172, 2013.
 Miao et al. (2018) Miao, X., Zhen, X., Liu, X., Deng, C., Athitsos, V., and Huang, H. Direct shape regression networks for end-to-end face alignment. In CVPR, pp. 5040–5049, 2018.
 Mikolov et al. (2013) Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. Distributed representations of words and phrases and their compositionality. In NIPS, pp. 3111–3119, 2013.
 Min et al. (2019) Min, S., Chen, X., Zha, Z., Wu, F., and Zhang, Y. A two-stream mutual attention network for semi-supervised biomedical segmentation with noisy labels. In AAAI, 2019.
 Pang & Lee (2004) Pang, B. and Lee, L. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In ACL, pp. 271–278, 2004.
 Pang & Lee (2005) Pang, B. and Lee, L. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In ACL, pp. 115–124, 2005.
 Pang et al. (2002) Pang, B., Lee, L., and Vaithyanathan, S. Thumbs up? sentiment classification using machine learning techniques. In ACL, pp. 79–86, 2002.
 Papadimitriou et al. (2000) Papadimitriou, C., Raghavan, P., Tamaki, H., and Vempala, S. Latent semantic indexing: A probabilistic analysis. J. Computer and System Sci., 2000.
 Patterson & Teh (2013) Patterson, S. and Teh, Y. W. Stochastic gradient Riemannian Langevin dynamics on the probability simplex. In NIPS, pp. 3102–3110, 2013.
 Ranganath et al. (2015) Ranganath, R., Tang, L., Charlin, L., and Blei, D. Deep exponential families. In AISTATS, pp. 762–771, 2015.
 Semeniuta et al. (2017) Semeniuta, S., Severyn, A., and Barth, E. A hybrid convolutional variational autoencoder for text generation. EMNLP, pp. 627–637, 2017.
 Srivastava & Sutton (2017) Srivastava, A. and Sutton, C. A. Autoencoding variational inference for topic models. ICLR, 2017.
 Tan et al. (2002) Tan, C., Wang, Y., and Lee, C. The use of bigrams to enhance text categorization. Information Processing and Management, 38(4):529–546, 2002.
 Tang et al. (2014) Tang, D., Wei, F., Yang, N., Zhou, M., Liu, T., and Qin, B. Learning sentimentspecific word embedding for twitter sentiment classification. In ACL, pp. 1555–1565, 2014.
 Wang et al. (2018) Wang, C., Chen, B., and Zhou, M. Multimodal Poisson gamma belief network. In AAAI, 2018.
 Wang et al. (2010) Wang, J., Yang, J., Yu, K., Lv, F., Huang, T. S., and Gong, Y. Locality-constrained linear coding for image classification. In CVPR, pp. 3360–3367, 2010.
 Weston et al. (2014) Weston, J., Chopra, S., and Adams, K. TagSpace: Semantic embeddings from hashtags. In EMNLP, pp. 1822–1827, 2014.
 Xu & Sarikaya (2013) Xu, P. and Sarikaya, R. Convolutional neural network based triangular CRF for joint intent detection and slot filling. In IEEE Workshop on Automatic Speech Recognition and Understanding, pp. 78–83, 2013.
 Zhang et al. (2018) Zhang, H., Chen, B., Guo, D., and Zhou, M. WHAI: Weibull hybrid autoencoding inference for deep topic modeling. ICLR, 2018.
 Zhang & Wallace (2017) Zhang, Y. and Wallace, B. A sensitivity analysis of (and practitioners’ guide to) convolutional neural networks for sentence classification. In IJCNLP, pp. 253–263, 2017.
 Zhang et al. (2016) Zhang, Y., Gan, Z., and Carin, L. Generating text via adversarial training. In NIPS workshop on Adversarial Training, volume 21, 2016.
 Zhang et al. (2017) Zhang, Y., Shen, D., Wang, G., Gan, Z., Henao, R., and Carin, L. Deconvolutional paragraph representation learning. In NIPS, pp. 4169–4179, 2017.
 Zhou (2015) Zhou, M. Infinite edge partition models for overlapping community detection and link prediction. In AISTATS, pp. 1135–1143, 2015.
 Zhou & Carin (2012) Zhou, M. and Carin, L. Negative binomial process count and mixture modeling. arXiv preprint arXiv:1209.3442v1, 2012.
 Zhou et al. (2012) Zhou, M., Hannah, L., Dunson, D., and Carin, L. Betanegative binomial process and Poisson factor analysis. In AISTATS, pp. 1462–1471, 2012.
 Zhou et al. (2016) Zhou, M., Cong, Y., and Chen, B. Augmentable gamma belief networks. JMLR, 17(1):5656–5699, 2016.
 Zhu et al. (2014) Zhu, J., Chen, N., Perkins, H., and Zhang, B. Gibbs max-margin topic models with data augmentation. JMLR, 15(1):1073–1110, 2014.
Appendix A Inference for CPGBN
Here we describe in detail the derivation for the convolutional Poisson gamma belief network (CPGBN) with multiple hidden layers, expressed as
(9) 
Note that, using the relationship between the gamma and Dirichlet distributions (see Lemma IV.3 of Zhou & Carin (2012)), the elements of the first hidden layer can be equivalently generated as
(10) 
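The gamma–Dirichlet relationship invoked above states that independent gamma random variables with a shared rate, once normalized by their sum, are Dirichlet distributed (and the sum is an independent gamma variable). A minimal NumPy sketch checks this empirically; the shape parameters and rate below are illustrative, not values from the model:

```python
import numpy as np

rng = np.random.default_rng(0)

a = np.array([2.0, 0.5, 1.0, 3.0])  # hypothetical shape parameters for 4 components
c = 1.7                             # shared rate; it cancels after normalization
n = 200_000                         # Monte Carlo sample size

# Draw independent gamma variables with a common rate ...
theta = rng.gamma(shape=a, scale=1.0 / c, size=(n, len(a)))
# ... and normalize each row: the normalized vector is Dirichlet(a) distributed.
props = theta / theta.sum(axis=1, keepdims=True)

# The Dirichlet(a) mean is a / sum(a); compare against the empirical mean.
print(np.round(props.mean(axis=0), 3))
print(np.round(a / a.sum(), 3))
```

This augmentation is what allows the weights of a layer to be split into a normalized (Dirichlet) part and an overall (gamma) scale, which the derivation exploits.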
Note that the random variable which pools the random weights of all the words in a document follows
(11) 
As described in Section 3.1, we have
(12) 
(13) 
leading to the following conditional posteriors:
(14) 
Appendix B Sensitivity to Filter Width
To investigate the effect of the filter width of the convolutional kernel, we evaluated the performance of CPFA (i.e., CPGBN with a single hidden layer) on the SUBJ dataset with a variety of filter widths (unsupervised feature extraction followed by a linear SVM for classification). We use the same CPFA code and vary only the filter width. Averaging over five independent runs, the accuracies for filter widths 1, 2, 3, 4, 5, 6, and 7 are , , , , , , and , respectively. Note that when the filter width reduces to 1, CPFA reduces to PFA (i.e., no convolution). These results suggest that the performance of CPFA has low sensitivity to the filter width. While setting the filter width to three may not be the optimal choice, it is a common practice for existing text CNNs (Kim, 2014; Johnson & Zhang, 2015a).
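To make the role of the filter width concrete, the following minimal NumPy sketch convolves a V×F filter over a sequence of one-hot word vectors. The vocabulary, document, and random filters are toy stand-ins, not learned CPFA factors; the sketch shows that with width 1 the responses are invariant to word order, mirroring how CPFA then degenerates to an order-insensitive model:

```python
import numpy as np

rng = np.random.default_rng(1)
V, N, F = 6, 5, 3  # toy vocabulary size, document length, filter width

# A document as a sequence of one-hot column vectors (V x N).
word_ids = np.array([2, 0, 4, 0, 3])
X = np.zeros((V, N))
X[word_ids, np.arange(N)] = 1.0

def conv_response(X, D):
    """Valid-mode sliding-window response of a V x F filter D over X."""
    F = D.shape[1]
    return np.array([np.sum(D * X[:, i:i + F]) for i in range(X.shape[1] - F + 1)])

D = rng.random((V, F))
print(conv_response(X, D))  # one score per length-F window: word order matters

# With filter width 1 the response at position i depends only on word i, so
# permuting the words merely permutes the responses (bag-of-words behavior).
D1 = rng.random((V, 1))
r = conv_response(X, D1)
perm = rng.permutation(N)
r_perm = conv_response(X[:, perm], D1)
print(np.allclose(np.sort(r), np.sort(r_perm)))
```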
Appendix C Hierarchical Visualization
Distinct from the word-level topics learned by traditional topic models (Deerwester et al., 1990; Papadimitriou et al., 2000; Lee & Seung, 2001; Blei et al., 2003; Hinton & Salakhutdinov, 2009; Zhou et al., 2012), we propose novel phrase-level topics that preserve word order, as shown in Table 3, where each phrase-level topic typically combines several frequently co-occurring short phrases. To explore the connections between the phrase-level topics of different layers learned by CPGBN, we follow Zhou et al. (2016) to construct trees that reveal the general and specific aspects of the corpus. More specifically, we construct trees learned from the TREC dataset, with the network structure set as . We pick a node at the top layer as the root of a tree and grow the tree downward by drawing lines from each node at a given layer to its top relevant nodes at the layer below.
As shown in Fig. 5, we select the top 3 relevant nodes at the second layer linked to the selected root node, and the top 2 relevant nodes at the first layer linked to each selected node at the second layer. Considering that the TREC corpus consists solely of questions (about abbreviation, entity, description, human, location, or numeric values), most of the topics learned by CPGBN focus on short phrases asking specific questions, as shown in Table 3. Following the branches of the tree in Fig. 5, the root node covers very general question types such as “how many, how long, what, when, why,” and the topics clearly become more and more specific when moving along the tree from top to bottom, where the shallow topics of the first layer tend to focus on a single question type, e.g., one bottom-layer node queries “how many” and another queries “how long.”
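The tree-growing procedure above can be sketched as follows. The connection-weight matrices here are random stand-ins for the learned inter-layer loadings, the layer sizes are arbitrary, and the function names are our own:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical nonnegative connection weights between adjacent layers
# (rows: lower-layer nodes, columns: upper-layer nodes) for a 3-layer network.
Phi2 = rng.random((50, 30))  # layer 1 <- layer 2
Phi3 = rng.random((30, 10))  # layer 2 <- layer 3

def top_children(Phi, node, k):
    """Indices of the k lower-layer nodes most strongly linked to `node`."""
    return np.argsort(Phi[:, node])[::-1][:k]

root = 0                                # pick a top-layer node as the tree root
layer2 = top_children(Phi3, root, k=3)  # top-3 relevant second-layer nodes
# For each second-layer node, attach its top-2 relevant first-layer nodes.
tree = {root: {int(j): [int(i) for i in top_children(Phi2, j, k=2)] for j in layer2}}
print(tree)
```

Ranking children by their connection weight is one natural notion of "top relevant"; the construction in the paper may weight relevance differently.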