Convolutional Poisson Gamma Belief Network

05/14/2019, by Chaojie Wang, et al.

For text analysis, one often resorts to a lossy representation that either completely ignores word order or embeds each word as a low-dimensional dense feature vector. In this paper, we propose convolutional Poisson factor analysis (CPFA) that directly operates on a lossless representation that processes the words in each document as a sequence of high-dimensional one-hot vectors. To boost its performance, we further propose the convolutional Poisson gamma belief network (CPGBN) that couples CPFA with the gamma belief network via a novel probabilistic pooling layer. CPFA forms words into phrases and captures very specific phrase-level topics, and CPGBN further builds a hierarchy of increasingly more general phrase-level topics. For efficient inference, we develop both a Gibbs sampler and a Weibull distribution based convolutional variational auto-encoder. Experimental results demonstrate that CPGBN can extract high-quality text latent representations that capture the word order information, and hence can be leveraged as a building block to enrich a wide variety of existing latent variable models that ignore word order.


1 Introduction

A central task in text analysis and language modeling is to effectively represent the documents to capture their underlying semantic structures. A basic idea is to represent the words appearing in a document with a sequence of one-hot vectors, where the vector dimension is the size of the vocabulary. This preserves all textual information but results in a collection of extremely large and sparse matrices for a text corpus. Given the memory and computation constraints, it is very challenging to directly model this lossless representation. Thus existing methods often resort to simplified lossy representations that either completely ignore word order (Blei et al., 2003), or embed the words into a lower dimensional feature space (Mikolov et al., 2013).

Ignoring word order, each document is simplified as a bag-of-words count vector, whose elements record how many times each vocabulary term appears in that document. With a text corpus simplified as a term-document frequency count matrix, a wide array of latent variable models (LVMs) have been proposed for text analysis (Deerwester et al., 1990; Papadimitriou et al., 2000; Lee & Seung, 2001; Blei et al., 2003; Hinton & Salakhutdinov, 2009; Zhou et al., 2012). Extending “shallow” probabilistic topic models such as latent Dirichlet allocation (LDA) (Blei et al., 2003) and Poisson factor analysis (PFA) (Zhou et al., 2012), steady progress has been made in inferring multi-stochastic-layer deep latent representations for text analysis (Gan et al., 2015; Zhou et al., 2016; Ranganath et al., 2015; Zhang et al., 2018). Despite this progress, completely ignoring word order can still be particularly problematic on some common text-analysis tasks, such as spam detection and sentiment analysis (Pang et al., 2002; Tang et al., 2014).

To preserve word order, a common practice is to first convert each word in the vocabulary from a high-dimensional sparse one-hot vector into a low-dimensional dense word-embedding vector. The word-embedding vectors can be either trained as part of the learning (Kim, 2014; Kalchbrenner et al., 2014), or pre-trained by some other method on an additional large corpus (Mikolov et al., 2013). Sequentially ordered word-embedding vectors have been successfully combined with deep neural networks to address various problems in text analysis and language modeling. A typical combination is to use the word-embedding layer as part of a recurrent neural network (RNN), especially long short-term memory (LSTM) and its variants (Hochreiter & Schmidhuber, 1997; Chung et al., 2014), achieving great success in numerous tasks that rely heavily on high-quality sentence representations. Another popular combination is to apply a convolutional neural network (CNN) (Lecun et al., 1998) directly to the embedding representation, treating the word-embedding layer as an image input; this has been widely used in systems for entity search, sentence modeling, product feature mining, and so on (Xu & Sarikaya, 2013; Weston et al., 2014).

In this paper, we first propose convolutional PFA (CPFA) that directly models the documents, each of which is represented without information loss as a sequence of one-hot vectors. We then boost its performance by coupling it with the gamma belief network (GBN) of Zhou et al. (2016), a multi-stochastic-hidden-layer deep generative model, via a novel probabilistic document-level pooling layer. We refer to the CPFA and GBN coupled model as convolutional Poisson GBN (CPGBN). To the best of our knowledge, CPGBN is the first unsupervised probabilistic convolutional model that infers multi-stochastic-layer latent variables for documents represented without information loss. Its hidden layers can be jointly trained with an upward-downward Gibbs sampler; this makes its inference different from greedy layer-wise training (Lee et al., 2009; Chen et al., 2013). In each Gibbs sampling iteration, the main computation is embarrassingly parallel and hence can be accelerated with Graphics Processing Units (GPUs). We also develop a Weibull distribution based convolutional variational auto-encoder to provide amortized variational inference, which further accelerates both training and testing on large corpora. Exploiting the multi-layer structure of CPGBN, we further propose a supervised CPGBN (sCPGBN), which combines the representation power of CPGBN for topic modeling and the discriminative power of deep neural networks (NNs) under a principled probabilistic framework. We show that the proposed models achieve state-of-the-art results in a variety of text-analysis tasks.

2 Convolutional Models for Text Analysis

Below we introduce CPFA and then develop a probabilistic document-level pooling method to couple CPFA with GBN, which further serves as the decoder of a Weibull distribution based convolutional variational auto-encoder (VAE).

2.1 Convolutional Poisson Factor Analysis

Denote as the vocabulary and let represent the sequentially ordered words of the th document, which can be represented as a sequence of one-hot vectors. For example, with vocabulary “don’t”,“hate”,“I”,“it”,“like”, document (“I”,“like”,“it”) can be represented as , where , , and are one-hot column vectors. Let us denote , which is one if and only if word of document matches term of the vocabulary.
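
To make the lossless representation concrete, the following minimal sketch builds the one-hot matrix for the toy example above; the vocabulary, variable names, and helper function are ours for illustration, not part of the paper.

```python
import numpy as np

vocab = ["don't", "hate", "I", "it", "like"]          # toy vocabulary, V = 5 terms
word2id = {w: v for v, w in enumerate(vocab)}

def one_hot_doc(words, word2id):
    """Return a V x L binary matrix whose i-th column is the one-hot
    vector of the i-th word in the document."""
    V, L = len(word2id), len(words)
    B = np.zeros((V, L), dtype=np.int64)
    for i, w in enumerate(words):
        B[word2id[w], i] = 1
    return B

B = one_hot_doc(["I", "like", "it"], word2id)
print(B)   # each column sums to one; B[v, i] = 1 iff word i matches term v
```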

To exploit a rich set of tools developed for count data analysis (Zhou et al., 2012, 2016), we first link these sequential binary vectors to sequential count vectors via the Bernoulli-Poisson link (Zhou, 2015). More specifically, we link each to a latent count as , where , and factorize the matrix under the Poisson likelihood. Distinct from vanilla PFA (Zhou et al., 2012) where the columns of the matrix are treated as conditionally independent, here we introduce convolution into the hierarchical model to capture the sequential dependence between the columns. We construct the hierarchical model of CPFA as

(1)

where denotes a convolution operator, , is the th convolutional filter/factor/topic whose filter width is , , and ; the latent count matrix is factorized into the summation of equal-sized latent count matrices, the Poisson rates of the th of which are obtained by convolving the corresponding filter with its gamma distributed feature representation, where . To complete the hierarchical model, we let and . Note that, as in Zhou et al. (2016), we may consider the number of filters as the truncation level of a gamma process, which allows the number of needed factors to be inferred from the data as long as the truncation level is set sufficiently large.
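
The sketch below illustrates, under toy dimensions and hyperparameters of our own choosing, how a CPFA-style generative draw could proceed: Dirichlet-distributed filter columns, gamma-distributed per-document weights, a sum of convolutions as the Poisson rate, and the Bernoulli-Poisson thresholding described above. It is an illustrative reading of (1), not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

V, K, F, L = 5, 3, 3, 8            # vocab size, # topics, filter width, doc length (assumed)
S = L - F + 1                      # length of the per-document weight vector

# Convolutional topics: each column of D_k lies on the V-dimensional simplex.
D = rng.dirichlet(alpha=np.ones(V), size=(K, F)).transpose(0, 2, 1)   # K x V x F
# Gamma-distributed per-document feature representations.
W = rng.gamma(shape=0.5, scale=1.0, size=(K, S))                      # K x S

# Poisson rate: sum over topics of the (full) convolution of D_k and w_k.
rate = np.zeros((V, L))
for k in range(K):
    for v in range(V):
        rate[v] += np.convolve(D[k, v], W[k])      # length F + S - 1 = L

M = rng.poisson(rate)          # latent count matrix
B = (M >= 1).astype(np.int64)  # Bernoulli-Poisson link: observed binary matrix
```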

We can interpret as the probability that the th term in the vocabulary appears at the th temporal location for the th latent topic, and expect each to extract both global co-occurrence patterns, such as common topics, and local temporal structures, such as common n-gram phrases, from the text corpus. Note the convolution layers of CPFA convert text regions of size (e.g., “am so happy” with ) to feature vectors, directly learning the embedding of text regions without a separate word-embedding learning step. Thus CPFA provides a potential solution for distinguishing polysemous words according to their neighboring words. The length of the representation weight vector in our model is , which varies with the document length . This distinguishes CPFA from traditional convolutional models with a fixed feature map size (Zhang et al., 2017; Miao et al., 2018; Min et al., 2019), which require either heuristic cropping or zero-padding.

2.2 Convolutional Poisson Gamma Belief Network

There has been significant recent interest in inferring multi-stochastic-layer deep latent representations for text analysis in an unsupervised manner (Gan et al., 2015; Zhou et al., 2016; Ranganath et al., 2015; Wang et al., 2018; Zhang et al., 2018), where word order is ignored. The key intuition behind these models, such as GBN (Zhou et al., 2016), is that words that frequently co-occur in the same document can form specific word-level topics in shallow layers; as the depth of the network increases, frequently co-occurring topics can form more general ones. Here, we propose a model that preserves word order without losing the nice hierarchical topical interpretation provided by a deep topic model. The intuition is that by preserving word order, words can first form short phrases; frequently co-occurring short phrases can then be combined to form specific phrase-level topics; and these specific phrase-level topics can form increasingly more general phrase-level topics when moving towards deeper layers.

Figure 1: The proposed CPGBN (upper part) and its corresponding convolutional variational inference network (lower part).

As in Fig. 1, we couple CPFA in (1) with GBN to construct CPGBN, whose generative model with hidden layers, from top to bottom, is expressed as

(2)

where is the th row of and superscripts indicate layers. Note CPGBN first factorizes the latent count matrix under the Poisson likelihood into the summation of convolutions, the th of which is between and weight vector . Using the relationship between the gamma and Dirichlet distributions (see Lemma IV.3 of Zhou & Carin (2012)), in (2) can be equivalently generated as

(3)

which could be seen as a specific probabilistic document-level pooling algorithm on the gamma shape parameter. For , the shape parameters of the gamma distributed hidden units are factorized into the product of the connection weight matrix and hidden units of layer ; the top layer’s hidden units share the same as their gamma shape parameters; and are gamma scale parameters. For scale identifiability and ease of inference, the columns of and are restricted to have unit norm. To complete the hierarchical model, we let , , , and .

Examining (3) shows CPGBN provides a probabilistic document-level pooling layer, which summarizes the content coefficients across all word positions into ; the hierarchical structure after can be flexibly modified according to the deep models (not restricted to GBN) to be combined with. The proposed pooling layer can be trained jointly with all the other layers, making it distinct from a usual one that often cuts off the message passing from deeper layers (Lee et al., 2009; Chen et al., 2013). We note using pooling on the first hidden layer is related to shallow text CNNs that use document-level pooling directly after a single convolutional layer (Kim, 2014; Johnson & Zhang, 2015a), which often contributes to improved efficiency (Boureau et al., 2010; Wang et al., 2010).
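
The following sketch illustrates the pooling relationship suggested by (2)-(3): deeper GBN layers are generated top-down, the pooled first-layer unit is gamma distributed with shape given by the layer above, and the per-position weights arise from a Dirichlet split of the pooled mass, so that summing over positions recovers the pooled unit. Layer widths, scale parameters, and variable names are illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(1)

K1, K2, K3 = 8, 4, 2               # assumed layer widths (K3 is the top layer)
S = 6                              # number of word positions at the first layer

# Top-down generation of the deeper (order-free) layers, as in the GBN:
r = rng.gamma(0.1, 1.0, size=K3)                      # top-layer gamma shape
theta3 = rng.gamma(r, 1.0)                            # layer-3 hidden units
Phi3 = rng.dirichlet(np.ones(K2), size=K3).T          # K2 x K3, columns on the simplex
theta2 = rng.gamma(Phi3 @ theta3, 1.0)                # layer-2 hidden units

# First-layer pooled weights: gamma shape given by the layer above, then a
# Dirichlet split of the pooled mass over word positions.
Phi2 = rng.dirichlet(np.ones(K1), size=K2).T          # K1 x K2
theta1_pooled = rng.gamma(Phi2 @ theta2, 1.0)         # K1, document-level summary
split = rng.dirichlet(np.ones(S), size=K1)            # K1 x S, each row sums to 1
W1 = theta1_pooled[:, None] * split                   # K1 x S positionwise weights

# Pooling in the upward direction is then simply a sum over positions:
assert np.allclose(W1.sum(axis=1), theta1_pooled)
```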

2.3 Convolutional Inference Network for CPGBN

To make our model both scalable to big corpora in training and fast in out-of-sample prediction, below we introduce a convolutional inference network, which will be used in the hybrid MCMC/variational inference described in Section 3.2. Note the usual strategy of autoencoding variational inference is to construct an inference network that maps the observations directly to their latent representations, and to optimize the encoder and decoder by minimizing the negative evidence lower bound (ELBO), where

(4)

Following Zhang et al. (2018), we use the Weibull distribution to approximate the gamma distributed conditional posterior of , as it is reparameterizable, resembles the gamma distribution, and the Kullback–Leibler (KL) divergence from the Weibull to the gamma distribution is analytic; as in Fig. 1, we construct the autoencoding variational distribution as , where

(5)

The parameters of are deterministically transformed from the observation using CNNs specified as

where , , , , and is obtained with zero-padding; the parameters and are transformed from specified as

where , , and for .
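
As a concrete reference for the Weibull-based construction, the sketch below shows the standard Weibull reparameterization and the analytic KL divergence from a Weibull to a gamma distribution in the form used by WHAI (Zhang et al., 2018); the function names and parameter values are ours.

```python
import numpy as np
from scipy.special import gammaln, gamma as gamma_fn

EULER_GAMMA = 0.5772156649015329

def sample_weibull(k, lam, rng):
    """Reparameterized Weibull(k, lam) draw: lam * (-log(1 - u))**(1/k), u ~ U(0, 1)."""
    u = rng.uniform(size=np.shape(k))
    return lam * (-np.log1p(-u)) ** (1.0 / k)

def kl_weibull_gamma(k, lam, alpha, beta):
    """Analytic KL( Weibull(k, lam) || Gamma(alpha, beta) )."""
    return (EULER_GAMMA * alpha / k - alpha * np.log(lam) + np.log(k)
            + beta * lam * gamma_fn(1.0 + 1.0 / k)
            - EULER_GAMMA - 1.0 - alpha * np.log(beta) + gammaln(alpha))

rng = np.random.default_rng(2)
w = sample_weibull(k=2.0, lam=1.5, rng=rng)          # differentiable w.r.t. (k, lam)
kl = kl_weibull_gamma(k=2.0, lam=1.5, alpha=1.0, beta=1.0)
```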

Further, we develop sCPGBN, a supervised generalization of CPGBN, for text categorization tasks: by adding a softmax classifier on the concatenation of the hidden units across layers, the loss function of the entire framework is modified as

where denotes the cross-entropy loss and is used to balance generation and discrimination (Higgins et al., 2017).
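
A minimal sketch of how such a combined objective could be assembled, assuming a softmax classifier on the concatenated hidden units and a precomputed negative ELBO; all names are illustrative, and `balance` stands in for the trade-off coefficient mentioned above.

```python
import numpy as np

def softmax_cross_entropy(logits, label):
    """Cross-entropy of a single example given unnormalized class scores."""
    z = logits - logits.max()
    logp = z - np.log(np.exp(z).sum())
    return -logp[label]

def scpgbn_loss(thetas, W_cls, b_cls, label, neg_elbo, balance=1.0):
    """Hedged sketch: classify on the concatenation of the layerwise hidden
    units and add a weighted negative ELBO term (names are ours)."""
    h = np.concatenate(thetas)              # concatenate theta^(1), ..., theta^(T)
    logits = W_cls @ h + b_cls
    return softmax_cross_entropy(logits, label) + balance * neg_elbo
```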

3 Inference

Below we describe the key inference equations for CPFA shown in (1), a single hidden-layer version of CPGBN shown in (2), and provide more details in the Appendix. How the inference of CPFA, including Gibbs sampling and hybrid MCMC/autoencoding variational inference, is generalized to that of CPGBN is similar to how the inference of PFA is generalized to that of PGBN, as described in detail in Zhou et al. (2016) and Zhang et al. (2018) and omitted here for brevity.

3.1 Gibbs Sampling

Directly dealing with the whole matrix by expanding the convolution operation via a Toeplitz conversion (Bojanczyk et al., 1995) provides a straightforward solution for the inference of convolutional models: it transforms each observation matrix into a vector, to which inference methods for sparse factor analysis (Carvalho et al., 2008; James et al., 2010) could then be applied. However, since the document matrices consisting of one-hot vectors are extremely sparse, processing them in this dense form would incur an unnecessary burden in computation and storage. Instead, we apply data augmentation under the Poisson likelihood (Zhou et al., 2012, 2016) to upward propagate latent count matrices as

where . Note we only need to focus on nonzero elements of . We rewrite the likelihood function by expanding the convolution operation along the dimension of as

where if . Thus each nonzero element could be augmented as

(6)

where and . We can now decouple in (1) by marginalizing out , leading to

where the symbol “” denotes summing over the corresponding index and hence . Using the gamma-Poisson conjugacy, we have

Similarly, we can expand the convolution along the other direction as , where if , and obtain , where and . Further applying the relationship between the Poisson and multinomial distributions, we have

With the Dirichlet-multinomial conjugacy, we have

Exploiting the properties of the Poisson and multinomial distributions helps CPFA take full advantage of the sparsity of the one-hot vectors, making its complexity comparable to that of a regular bag-of-words topic model that uses Gibbs sampling for inference. Note that, as the multinomial-related sampling steps inside each iteration are embarrassingly parallel, they are accelerated with GPUs in our experiments.
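
The sketch below illustrates the augmentation step on a toy latent count matrix: each nonzero count is split across (topic, filter offset) pairs with probabilities proportional to the product of the filter entry and the corresponding weight, visiting only nonzero entries, in the spirit of (6). Shapes and names follow the toy convention of the earlier sketches and are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

def augment_counts(M, D, W, rng):
    """Split each nonzero latent count M[v, i] across (topic k, offset f) with
    probabilities proportional to D[k, v, f] * W[k, i - f], touching only the
    nonzero entries. Shapes: M is V x L, D is K x V x F, W is K x S, S = L - F + 1."""
    K, V, F = D.shape
    S = W.shape[1]
    counts_DW = np.zeros_like(D)            # sufficient statistics for updating the filters
    counts_W = np.zeros_like(W)             # sufficient statistics for updating the weights
    for v, i in zip(*np.nonzero(M)):
        probs, pairs = [], []
        for k in range(K):
            for f in range(F):
                s = i - f
                if 0 <= s < S:              # position i is covered by offset f
                    probs.append(D[k, v, f] * W[k, s])
                    pairs.append((k, f, s))
        probs = np.asarray(probs)
        split = rng.multinomial(M[v, i], probs / probs.sum())
        for n, (k, f, s) in zip(split, pairs):
            counts_DW[k, v, f] += n
            counts_W[k, s] += n
    return counts_DW, counts_W

# toy usage with the shapes from the earlier generative sketch (placeholders)
K, V, F, S = 3, 5, 3, 6
D = rng.dirichlet(np.ones(V), size=(K, F)).transpose(0, 2, 1)
W = rng.gamma(0.5, 1.0, size=(K, S))
M = rng.poisson(1.0, size=(V, F + S - 1))
counts_DW, counts_W = augment_counts(M, D, W, rng)
```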

  Set mini-batch size and number of dictionaries ;
  Initialize encoder parameter and model parameter ;
  for  do
     Randomly select a mini-batch of documents to form a subset ;
     for  do
         Draw random noise from a uniform distribution;
         Calculate according to (8) and update ;
     end for
     Sample from (5) given ;
     Process, in parallel, each nonzero element in to obtain according to (6);
     Update according to (7);
  end for
Algorithm 1 Hybrid stochastic-gradient MCMC and autoencoding variational inference for CPGBN

3.2 Hybrid MCMC/Variational Inference

While having closed-form update equations, the Gibbs sampler requires processing all documents in each iteration and hence has limited scalability. Fortunately, there has been a body of related research on scalable inference for discrete LVMs (Ma et al., 2015; Patterson & Teh, 2013). In particular, TLASGR-MCMC of Cong et al. (2017), which uses an elegant simplex constraint and improves sampling efficiency via the Fisher information matrix (FIM), with adaptive step sizes for the topics of different layers, can be naturally extended to our model. The efficient TLASGR-MCMC update of in CPFA can be described as

(7)

where denotes the number of mini-batches processed so far; the symbol in the subscript denotes summing over the data in a mini-batch; and the definitions of , , , and are analogous to those in Cong et al. (2017) and omitted here for brevity.

Similar to Zhang et al. (2018), combining TLASGR-MCMC with the convolutional inference network described in Section 2.3, we can construct a hybrid stochastic-gradient-MCMC/autoencoding variational inference algorithm for CPFA. More specifically, in each mini-batch based iteration, we draw a random sample of the CPFA global parameters via TLASGR-MCMC; given the sampled global parameters, we optimize the parameters of the convolutional inference network, denoted as , using the ELBO in (4), which for CPFA is simplified as

(8)

We describe the proposed hybrid stochastic-gradient-MCMC/autoencoding variational inference algorithm in Algorithm 1, which is implemented in TensorFlow (Abadi et al., 2016), combined with pyCUDA (Klockner et al., 2012) for more efficient computation.
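
A skeleton of the training loop in Algorithm 1, written as a Python function that alternates an encoder gradient step with a TLASGR-MCMC update on each mini-batch; every callable passed in is a placeholder to be supplied by the user, and none of these names come from the paper's code.

```python
import numpy as np

def hybrid_train(docs, encoder_step, sample_theta, augment_counts, tlasgr_update,
                 num_iters=1000, batch_size=64, seed=0):
    """Hedged sketch of Algorithm 1: alternate (i) a stochastic-gradient step on the
    convolutional inference network using the ELBO in (8) and (ii) a TLASGR-MCMC
    update of the global parameters on the same mini-batch."""
    rng = np.random.default_rng(seed)
    for t in range(num_iters):
        idx = rng.choice(len(docs), size=batch_size, replace=False)
        batch = [docs[i] for i in idx]
        encoder_step(batch, rng)                  # update encoder parameters via the ELBO gradient
        theta = sample_theta(batch, rng)          # draw theta from the Weibull inference network (5)
        stats = augment_counts(batch, theta, rng) # parallel multinomial augmentation, as in (6)
        tlasgr_update(stats, t)                   # stochastic-gradient MCMC update of the topics, as in (7)
```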

4 Related Work

With the bag-of-words representation that ignores word order information, a diverse set of deep topic models have been proposed to infer a multilayer data representation in an unsupervised manner. Their main mechanism is to connect adjacent layers by specific factorizations, which usually boosts performance (Gan et al., 2015; Zhou et al., 2016; Zhang et al., 2018). However, limited by the bag-of-words representation, they usually perform poorly on sentiment analysis tasks, which rely heavily on word order information (Xu & Sarikaya, 2013; Weston et al., 2014). In this paper, the proposed CPGBN can be seen as a novel convolutional extension, which not only remedies the loss of word order information, but also inherits various virtues of deep topic models.

Benefiting from advances in word-embedding methods, CNN-based architectures have been leveraged as encoders for various natural language processing tasks (Kim, 2014; Kalchbrenner et al., 2014). They generally apply a single convolution layer directly to the word-embedding layer, which, given a convolution filter window of size , essentially acts as a detector of typical n-grams. More complex deep neural networks taking CNNs as encoders and RNNs as decoders have also been studied for text generation (Zhang et al., 2016; Semeniuta et al., 2017). However, for unsupervised sentence modeling, language decoders other than RNNs are less well studied; it was not until recently that Zhang et al. (2017) proposed a simple yet powerful, purely convolutional framework for unsupervisedly learning sentence representations, which is the first to force the encoded latent representation to capture the information of the entire sentence via a multi-layer CNN specification. However, such approaches still require an additional large corpus for training word embeddings, and it remains difficult to visualize and explain the semantic meanings learned by black-box deep networks.

For text categorization, bigrams (or a combination of bigrams and unigrams) have been shown to provide more discriminative power than unigrams (Tan et al., 2002; Glorot et al., 2011). Motivated by this observation, Johnson & Zhang (2015a) tackle document categorization tasks by directly applying shallow CNNs, with filter width three, to one-hot encoded document matrices, outperforming both traditional n-grams and word-embedding based methods without the aid of additional training data. In addition, the shallow CNN serves as an important building block in many other supervised applications, helping achieve state-of-the-art results (Johnson & Zhang, 2015b, 2017).

5 Experimental Results

Figure 2: Point likelihood of CPGBNs on TREC as a function of time with various structural settings.
Figure 3: Classification accuracy of the CPGBNs on TREC as a function of the depth with various structural settings.
Data                               MR     TREC   SUBJ   ELEC   IMDB
Number of target classes          2      6      2      2      2
Average sentence length           20     10     23     123    266
Dataset size                      10662  5952   10000  50000  50000
Vocabulary size                   20277  8678   22636  52248  95212
Words in pre-trained word vectors 20000  8000   20000  30000  30000
Test set size                     CV     500    CV     25000  25000
Table 1: Summary statistics for the datasets after tokenization (CV means 10-fold cross validation).
Model Size Accuracy (MR / TREC / SUBJ) Time (MR / TREC / SUBJ)
LDA 200 54.4 45.5 68.2 3.93 0.92 3.81
DocNADE 200 54.2 62.0 72.9 - - -
DPFA 200 55.2 51.4 74.5 6.61 1.88 6.53
DPFA 200-100 55.4 52.0 74.4 6.74 1.92 6.62
DPFA 200-100-50 56.1 62.0 78.5 6.92 1.95 6.80
PGBN 200 56.3 66.7 76.2 3.97 1.01 3.56
PGBN 200-100 56.7 67.3 77.3 5.09 1.72 4.39
PGBN 200-100-50 57.0 67.9 78.3 5.67 1.87 4.91
WHAI 200 55.6 60.4 75.4 - - -
WHAI 200-100 56.2 63.5 76.0 - - -
WHAI 200-100-50 56.4 65.6 76.5 - - -
CPGBN 200 61.5 68.4 77.4 3.58 0.98 3.53
CPGBN 200-100 62.4 73.4 81.2 8.19 1.99 6.56
CPGBN 200-100-50 63.6 74.4 81.5 10.44 2.59 7.87
Table 2: Comparison of classification accuracy on unsupervisedly extracted feature vectors and average training time (seconds per Gibbs sampling iteration across all documents) on three different datasets.
Kernel Index    Visualized Topic (1st / 2nd / 3rd Column)    Visualized Phrases
192nd Kernel    how / do / you                               how do you, how many years, how much degrees
                cocktail / many / years
                stadium / much / miles
                run / long / degrees
80th Kernel     microsoft / e-mail / address                 microsoft e-mail address, microsoft email address, virtual ip address
                virtual / email / addresses
                answers.com / ip / floods
                softball / brothers / score
177th Kernel    who / created / maria                        who created snoopy, who fired caesar, who wrote angela
                willy / wrote / angela
                bar / fired / snoopy
                hydrogen / are / caesar
47th Kernel     dist / how / far                             dist how far, dist how high, dist how tall
                all-time / stock / high
                wheel / 1976 / tall
                saltpepter / westview / exchange
Table 3: Example phrases learned from TREC by CPGBN.

5.1 Datasets and Preprocessing

We test the proposed CPGBN and its supervised extension (sCPGBN) on various benchmarks, including:

MR: Movie reviews with one sentence per review, where the task is to classify a review as being positive or negative (Pang & Lee, 2005).

TREC: TREC question dataset, where the task is to classify a question into one of six question types (whether the question is about abbreviation, entity, description, human, location, or numeric) (Li & Roth, 2002).

SUBJ: Subjectivity dataset, where the task is to classify a sentence as being subjective or objective (Pang & Lee, 2004).

ELEC: ELEC dataset (Mcauley & Leskovec, 2013) consists of electronic product reviews and is part of a large Amazon review dataset.

IMDB: IMDB dataset (Maas et al., 2011) is a benchmark dataset for sentiment analysis, where the task is to determine whether a movie review is positive or negative.

We follow the steps listed in Johnson & Zhang (2015a) to tokenize the text, where emojis such as “:-)” are treated as tokens and all characters are converted to lower case. We then select the most frequent words to construct the vocabulary, without dropping stopwords, and map the words not included in the vocabulary to the same special token to keep all sentences structurally intact. The summary statistics of all benchmark datasets are listed in Table 1.
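
A minimal sketch of this preprocessing, assuming already-tokenized documents; the function and token names are ours.

```python
from collections import Counter

def build_vocab(tokenized_docs, vocab_size, unk_token="<unk>"):
    """Keep the most frequent tokens (no stopword removal) and map every other
    word to a single special token so sentences remain structurally intact."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    vocab = [w for w, _ in counts.most_common(vocab_size)]
    word2id = {w: i for i, w in enumerate(vocab)}
    unk_id = len(word2id)                      # id reserved for out-of-vocabulary words
    ids = [[word2id.get(tok, unk_id) for tok in doc] for doc in tokenized_docs]
    return word2id, ids

docs = [["i", "like", "it", ":-)"], ["i", "don't", "hate", "it"]]
word2id, ids = build_vocab(docs, vocab_size=5)
```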

5.2 Inference Efficiency

In this section we show the results of the proposed CPGBN on TREC. First, to demonstrate the advantages of increasing the depth of the network, we construct three networks of different depths. Under the same configuration of filter width and the same hyperparameter setting, the networks are trained with the proposed Gibbs sampler. The trace plots of model likelihoods are shown in Fig. 2. It is worth noting that increasing the network depth in general improves the quality of data fitting, but as the complexity of the model increases, the model tends to converge more slowly in time.

Considering that data fitting and generation ability are not necessarily strongly correlated with the performance on specific tasks, we also evaluate the proposed models on document classification. Using the same experimental settings as above, we investigate how the classification accuracy is affected by the network structure. On each network, we apply the Gibbs sampler to collect 200 MCMC samples after 500 burn-ins to estimate the posterior mean of the feature usage weight vector for every document in both the training and testing sets. A linear support vector machine (SVM) (Cortes & Vapnik, 1995) is used as the classifier on the first hidden layer, denoted as in (2), to make a fair comparison, where each result listed in Table 2 is the average accuracy of five independent runs. Fig. 3 shows a clear trend of improvement in classification accuracy, by increasing the network depth given a limited first-layer width, or by increasing the hidden-layer width given a fixed depth.
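
The evaluation protocol can be sketched as follows, with random placeholder features standing in for the posterior-mean representations; only the linear-SVM step reflects the setup described above.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(4)
# placeholders for the unsupervisedly extracted first-layer features (200-dimensional)
theta_train, y_train = rng.gamma(1.0, 1.0, size=(500, 200)), rng.integers(0, 6, 500)
theta_test, y_test = rng.gamma(1.0, 1.0, size=(100, 200)), rng.integers(0, 6, 100)

clf = LinearSVC(C=1.0).fit(theta_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(theta_test)))
```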

5.3 Unsupervised Models

In our second set of experiments, we evaluate the performance of different unsupervised algorithms on the MR, TREC, and SUBJ datasets by comparing the discriminative ability of their unsupervisedly extracted latent features. We consider LDA (Blei et al., 2003) and its deep extensions, including DPFA (Gan et al., 2015) and PGBN (Zhou et al., 2016), which are trained with batch Gibbs sampling. We also consider WHAI (Zhang et al., 2018) and DocNADE (Lauly et al., 2017), which are trained with stochastic gradient descent.

To make a fair comparison, we let CPGBNs have the same hidden-layer widths as the other methods, and set the filter width to 3 for the convolutional layer. Listed in Table 2 are the results of the various algorithms, where the means and error bars are obtained from five independent runs, using the code provided by the original authors. For all batch learning algorithms, we also report in Table 2 their average run time for an epoch (i.e., processing all training documents once). Clearly, given the same generative network structure, CPGBN performs the best in terms of classification accuracy, which can be attributed to its ability to utilize word order information. The performance of CPGBN shows a clear trend of improvement as the generative network becomes deeper, which is also observed for other deep generative models, including DPFA, PGBN, and WHAI. In terms of running time, the shallow LDA is the most efficient among these models, while a single-layer CPGBN achieves comparable efficiency thanks to its use of GPUs to parallelize the computation inside each iteration. Note all running times are reported based on an Nvidia GTX 1080Ti GPU.

In addition to quantitative evaluations, we have also visually inspected the inferred convolutional kernels of CPGBN, which is distinct from many existing convolutional models that build nonlinearity via “black-box” neural networks. In Table 3, we list several convolutional kernels of filter width 3 learned from TREC using a single-hidden-layer CPGBN of size 200, exhibiting the top 4 most probable words in each column of the corresponding kernel. It is particularly interesting that the words in different columns can be combined into a variety of interpretable phrases with similar semantics. CPGBN explicitly takes word order information into consideration to extract phrases, which are then combined into a hierarchy of phrase-level topics, clearly improving the quality of the unsupervisedly extracted features. Take the 177th convolutional kernel for example: the top word of its 1st column is “who,” its 2nd column is a verb topic (“created, wrote, fired, are”), and its 3rd column is a noun topic (“maria/angela/snoopy/caesar”). These word-level topics can be combined to construct phrases such as “who, created/wrote/fired/are, maria/angela/snoopy/caesar,” resulting in a phrase-level topic about “human,” one of the six question types in TREC. Note these shallow phrase-level topics become more general in deeper layers of CPGBN. We provide two example phrase-level topic hierarchies in the Appendix to enhance interpretability.
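
The visualization in Table 3 can be reproduced, in outline, by listing the most probable vocabulary terms in each column of a learned kernel; the sketch below uses a random kernel and a toy vocabulary as placeholders.

```python
import numpy as np

def top_words_per_column(D_k, id2word, topn=4):
    """For one learned convolutional kernel D_k (shape V x F), list the topn most
    probable vocabulary terms in each of its F columns; reading one word per
    column yields candidate phrases."""
    V, F = D_k.shape
    return [[id2word[v] for v in np.argsort(D_k[:, f])[::-1][:topn]] for f in range(F)]

# toy example with a random kernel and a placeholder id-to-word map
rng = np.random.default_rng(5)
D_k = rng.dirichlet(np.ones(10), size=3).T            # V = 10, F = 3
id2word = {v: f"word{v}" for v in range(10)}
print(top_words_per_column(D_k, id2word))
```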

5.4 Supervised Models

Table 4 compares various supervised algorithms on three common benchmarks: SUBJ, ELEC, and IMDB. The results listed there are either quoted from published papers or reproduced with the code provided by the original authors. We consider bag-of-words representation based supervised topic models, including sAVITM (Srivastava & Sutton, 2017), MedLDA (Zhu et al., 2014), and sWHAI (Zhang et al., 2018). We also consider three bag-of-n-gram models (Johnson & Zhang, 2015a), with n = 1, 2, 3, and word-embedding based methods, indicated with the suffix “-wv,” including SVM-wv (Zhang & Wallace, 2017), RNN-wv, and LSTM-wv (Johnson & Zhang, 2016). In addition, we consider several related CNN based methods, including three different variants of Text CNN (Kim, 2014), namely CNN-rand, CNN-static, and CNN-non-static, and CNN-one-hot (Johnson & Zhang, 2015a), which is based on one-hot encoding.

We construct three different sCPGBNs, as described in Section 2.3. As shown in Table 4, the word-embedding based methods generally outperform the methods based on bag-of-words, which is not surprising as the latter completely ignore word order. Among all bag-of-words representation based methods, sWHAI performs the best and even achieves performance comparable to some word-embedding based methods, which illustrates the benefits of having multi-stochastic-layer latent representations. As for the n-gram based models, although they achieve performance comparable to word-embedding based methods, we find experimentally that both their performance and computation are sensitive to the vocabulary size. Among the CNN related algorithms, CNN-one-hot tends to perform better on classifying longer texts than Text CNN does, which agrees with the observations of Zhang & Wallace (2017); a possible explanation for this phenomenon is that CNN-one-hot is prone to overfitting on short documents. Moving beyond CNN-one-hot, sCPGBN helps capture the underlying high-order statistics to alleviate overfitting, as commonly observed in deep generative models (DGMs) (Li et al., 2015), and its performance improves as the number of stochastic hidden layers increases.

Model SUBJ ELEC IMDB
sAVITM (Srivastava & Sutton, 2017) 85.7 83.7 84.9
MedLDA (Zhu et al., 2014) 86.5 84.6 85.7
sWHAI-layer1 (Zhang et al., 2018) 90.6 86.8 87.2
sWHAI-layer2 (Zhang et al., 2018) 91.7 87.5 88.0
sWHAI-layer3 (Zhang et al., 2018) 92.0 87.8 88.2
SVM-unigrams (Tan et al., 2002) 88.5 86.3 87.7
SVM-bigrams (Tan et al., 2002) 89.4 87.2 88.2
SVM-trigrams (Tan et al., 2002) 89.7 87.4 88.5
SVM-wv (Zhang & Wallace, 2017) 90.1 85.9 86.5
RNN-wv (Johnson & Zhang, 2016) 88.9 87.5 88.3
LSTM-wv (Johnson & Zhang, 2016) 89.8 88.3 89.0
CNN-rand (Kim, 2014) 89.6 86.8 86.3
CNN-static (Kim, 2014) 93.0 87.8 88.9
CNN-non-static (Kim, 2014) 93.4 88.6 89.5
CNN-one-hot (Johnson & Zhang, 2015a) 91.1 91.3 91.6
sCPGBN-layer1 93.4 91.6 91.8
sCPGBN-layer2 93.7 92.0 92.4
sCPGBN-layer3 93.8 92.2 92.6
Table 4: Comparison of classification accuracy on supervised feature extraction tasks on three different datasets.

6 Conclusion

We propose convolutional Poisson factor analysis (CPFA), a hierarchical Bayesian model that represents each word in a document as a one-hot vector and captures word order information by performing convolution on sequentially ordered one-hot word vectors. By developing a principled document-level stochastic pooling layer, we further couple CPFA with a multi-stochastic-layer deep topic model to construct the convolutional Poisson gamma belief network (CPGBN). We develop a Gibbs sampler to jointly train all the layers of CPGBN. For more scalable training and fast testing, we further introduce a mini-batch based stochastic inference algorithm that combines stochastic-gradient MCMC and a Weibull distribution based convolutional variational auto-encoder. In addition, we provide a supervised extension of CPGBN. Example results on both unsupervised and supervised feature extraction tasks show that CPGBN combines the virtues of convolutional operations and deep topic models, providing not only state-of-the-art classification performance but also highly interpretable phrase-level deep latent representations.

Acknowledgements

B. Chen acknowledges the support of the Program for Young Thousand Talent by Chinese Central Government, the 111 Project (No. B18039), NSFC (61771361), NSFC for Distinguished Young Scholars (61525105), and the Innovation Fund of Xidian University. M. Zhou acknowledges the support of Award IIS-1812699 from the U.S. National Science Foundation and the McCombs Research Excellence Grant.

References

  • Abadi et al. (2016) Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
  • Blei et al. (2003) Blei, D. M., Ng, A. Y., and Jordan, M. I. Latent Dirichlet allocation. JMLR, 3:993–1022, 2003.
  • Bojanczyk et al. (1995) Bojanczyk, A. W., Brent, R. P., De Hoog, F. R., and Sweet, D. R. On the stability of the Bareiss and related Toeplitz factorization algorithms. SIAM Journal on Matrix Analysis and Applications, 16(1):40–57, 1995.
  • Boureau et al. (2010) Boureau, Y., Ponce, J., and Lecun, Y. A theoretical analysis of feature pooling in visual recognition. In ICML, pp. 111–118, 2010.
  • Carvalho et al. (2008) Carvalho, C. M., Chang, J. T., Lucas, J. E., Nevins, J. R., Wang, Q., and West, M. High-dimensional sparse factor modeling: Applications in gene expression genomics. J. Amer. Statist. Assoc., 103(484):1438–1456, 2008.
  • Chen et al. (2013) Chen, B., Polatkan, G., Sapiro, G., Blei, D. M., Dunson, D. B., and Carin, L. Deep learning with hierarchical convolutional factor analysis. IEEE Trans. Pattern Anal. Mach. Intell., 35(8):1887–1901, 2013.
  • Chung et al. (2014) Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
  • Cong et al. (2017) Cong, Y., Chen, B., Liu, H., and Zhou, M. Deep latent Dirichlet allocation with topic-layer-adaptive stochastic gradient Riemannian MCMC. In ICML, pp. 864–873, 2017.
  • Cortes & Vapnik (1995) Cortes, C. and Vapnik, V. Support-vector networks. Machine Learning, 20(3):273–297, 1995.
  • Deerwester et al. (1990) Deerwester, S., Dumais, S., Furnas, G., Landauer, T., and Harshman, R. Indexing by latent semantic analysis. J. Amer. Soc. Inf. Sci., 1990.
  • Gan et al. (2015) Gan, Z., Chen, C., Henao, R., Carlson, D. E., and Carin, L. Scalable deep Poisson factor analysis for topic modeling. In ICML, pp. 1823–1832, 2015.
  • Glorot et al. (2011) Glorot, X., Bordes, A., and Bengio, Y. Domain adaptation for large-scale sentiment classification: A deep learning approach. In ICML, pp. 513–520, 2011.
  • Higgins et al. (2017) Higgins, I., Matthey, L., Pal, A., Burgess, C. P., Glorot, X., Botvinick, M. M., Mohamed, S., and Lerchner, A. beta-VAE: Learning basic visual concepts with a constrained variational framework. In ICLR, volume 3, 2017.
  • Hinton & Salakhutdinov (2009) Hinton, G. E. and Salakhutdinov, R. Replicated softmax: An undirected topic model. In NIPS, pp. 1607–1614, 2009.
  • Hochreiter & Schmidhuber (1997) Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
  • James et al. (2010) James, G. M., Sabatti, C., Zhou, N., and Zhu, J. Sparse regulatory networks. AOAS, 4(2):663–686, 2010.
  • Johnson & Zhang (2015a) Johnson, R. and Zhang, T. Effective use of word order for text categorization with convolutional neural networks. NAACL, pp. 103–112, 2015a.
  • Johnson & Zhang (2015b) Johnson, R. and Zhang, T. Semi-supervised convolutional neural networks for text categorization via region embedding. In NIPS, pp. 919–927, 2015b.
  • Johnson & Zhang (2016) Johnson, R. and Zhang, T. Supervised and semi-supervised text categorization using LSTM for region embeddings. ICML, pp. 526–534, 2016.
  • Johnson & Zhang (2017) Johnson, R. and Zhang, T. Deep pyramid convolutional neural networks for text categorization. In ACL, pp. 562–570, 2017.
  • Kalchbrenner et al. (2014) Kalchbrenner, N., Grefenstette, E., and Blunsom, P. A convolutional neural network for modelling sentences. In ACL, pp. 655–665, 2014.
  • Kim (2014) Kim, Y. Convolutional neural networks for sentence classification. In EMNLP, pp. 1746–1751, 2014.
  • Klockner et al. (2012) Klockner, A., Pinto, N., Lee, Y., Catanzaro, B., Ivanov, P., and Fasih, A. R. PyCUDA and PyOpenCL: A scripting-based approach to GPU run-time code generation. Parallel Computing, 38(3):157–174, 2012.
  • Lauly et al. (2017) Lauly, S., Zheng, Y., Allauzen, A., and Larochelle, H. Document neural autoregressive distribution estimation. JMLR, 18(113):1–24, 2017.
  • Lecun et al. (1998) Lecun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • Lee & Seung (2001) Lee, D. D. and Seung, H. S. Algorithms for non-negative matrix factorization. In NIPS, 2001.
  • Lee et al. (2009) Lee, H., Pham, P. T., Largman, Y., and Ng, A. Y. Unsupervised feature learning for audio classification using convolutional deep belief networks. In NIPS, pp. 1096–1104, 2009.
  • Li et al. (2015) Li, C., Zhu, J., Shi, T., and Zhang, B. Max-margin deep generative models. In NIPS, pp. 1837–1845, 2015.
  • Li & Roth (2002) Li, X. and Roth, D. Learning question classifiers. In International Conference on Computational Linguistics, pp. 1–7, 2002.
  • Ma et al. (2015) Ma, Y. A., Chen, T., and Fox, E. B. A complete recipe for stochastic gradient MCMC. In NIPS, pp. 2917–2925, 2015.
  • Maas et al. (2011) Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., and Potts, C. Learning word vectors for sentiment analysis. In ACL, pp. 142–150, 2011.
  • Mcauley & Leskovec (2013) Mcauley, J. and Leskovec, J. Hidden factors and hidden topics: understanding rating dimensions with review text. In ACM RecSys, pp. 165–172, 2013.
  • Miao et al. (2018) Miao, X., Zhen, X., Liu, X., Deng, C., Athitsos, V., and Huang, H. Direct shape regression networks for end-to-end face alignment. In CVPR, pp. 5040–5049, 2018.
  • Mikolov et al. (2013) Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. Distributed representations of words and phrases and their compositionality. In NIPS, pp. 3111–3119, 2013.
  • Min et al. (2019) Min, S., Chen, X., Zha, Z., Wu, F., and Zhang, Y. A two-stream mutual attention network for semi-supervised biomedical segmentation with noisy labels. AAAI, 2019.
  • Pang & Lee (2004) Pang, B. and Lee, L. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In ACL, pp. 271–278, 2004.
  • Pang & Lee (2005) Pang, B. and Lee, L. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In ACL, pp. 115–124, 2005.
  • Pang et al. (2002) Pang, B., Lee, L., and Vaithyanathan, S. Thumbs up? sentiment classification using machine learning techniques. In ACL, pp. 79–86, 2002.
  • Papadimitriou et al. (2000) Papadimitriou, C., Raghavan, P., Tamaki, H., and Vempala, S. Latent semantic indexing: A probabilistic analysis. J. Computer and System Sci., 2000.
  • Patterson & Teh (2013) Patterson, S. and Teh, Y. W. Stochastic gradient Riemannian Langevin dynamics on the probability simplex. In NIPS, pp. 3102–3110, 2013.
  • Ranganath et al. (2015) Ranganath, R., Tang, L., Charlin, L., and Blei, D. Deep exponential families. In AISTATS, pp. 762–771, 2015.
  • Semeniuta et al. (2017) Semeniuta, S., Severyn, A., and Barth, E. A hybrid convolutional variational autoencoder for text generation. EMNLP, pp. 627–637, 2017.
  • Srivastava & Sutton (2017) Srivastava, A. and Sutton, C. A. Autoencoding variational inference for topic models. ICLR, 2017.
  • Tan et al. (2002) Tan, C., Wang, Y., and Lee, C. The use of bigrams to enhance text categorization. Information Processing and Management, 38(4):529–546, 2002.
  • Tang et al. (2014) Tang, D., Wei, F., Yang, N., Zhou, M., Liu, T., and Qin, B. Learning sentiment-specific word embedding for twitter sentiment classification. In ACL, pp. 1555–1565, 2014.
  • Wang et al. (2018) Wang, C., Chen, B., and Zhou, M. Multimodal Poisson gamma belief network. In AAAI, 2018.
  • Wang et al. (2010) Wang, J., Yang, J., Yu, K., Lv, F., Huang, T. S., and Gong, Y. Locality-constrained linear coding for image classification. In CVPR, pp. 3360–3367, 2010.
  • Weston et al. (2014) Weston, J., Chopra, S., and Adams, K. Tagspace: Semantic embeddings from hashtags. In EMNLP, pp. 1822–1827, 2014.
  • Xu & Sarikaya (2013) Xu, P. and Sarikaya, R. Convolutional neural network based triangular crf for joint intent detection and slot filling. In IEEE Workshop on Automatic Speech Recognition and Understanding, pp. 78–83, 2013.
  • Zhang et al. (2018) Zhang, H., Chen, B., Guo, D., and Zhou, M. WHAI: Weibull hybrid autoencoding inference for deep topic modeling. ICLR, 2018.
  • Zhang & Wallace (2017) Zhang, Y. and Wallace, B. A sensitivity analysis of (and practitioners’ guide to) convolutional neural networks for sentence classification. In IJCNLP, pp. 253–263, 2017.
  • Zhang et al. (2016) Zhang, Y., Gan, Z., and Carin, L. Generating text via adversarial training. In NIPS workshop on Adversarial Training, volume 21, 2016.
  • Zhang et al. (2017) Zhang, Y., Shen, D., Wang, G., Gan, Z., Henao, R., and Carin, L. Deconvolutional paragraph representation learning. In NIPS, pp. 4169–4179, 2017.
  • Zhou (2015) Zhou, M. Infinite edge partition models for overlapping community detection and link prediction. In AISTATS, pp. 1135–1143, 2015.
  • Zhou & Carin (2012) Zhou, M. and Carin, L. Negative binomial process count and mixture modeling. arXiv preprint arXiv:1209.3442v1, 2012.
  • Zhou et al. (2012) Zhou, M., Hannah, L., Dunson, D., and Carin, L. Beta-negative binomial process and Poisson factor analysis. In AISTATS, pp. 1462–1471, 2012.
  • Zhou et al. (2016) Zhou, M., Cong, Y., and Chen, B. Augmentable gamma belief networks. JMLR, 17(1):5656–5699, 2016.
  • Zhu et al. (2014) Zhu, J., Chen, N., Perkins, H., and Zhang, B. Gibbs max-margin topic models with data augmentation. JMLR, 15(1):1073–1110, 2014.

Appendix A Inference for CPGBN

Here we describe the derivation in detail for convolutional Poisson gamma belief network (CPGBN) with hidden layers, expressed as

(9)

Note using the relationship between the gamma and Dirichlet distributions (see Lemma IV.3 of Zhou & Carin (2012)), the elements of in the first hidden layer can be equivalently generated as

(10)

Note the random variable , which pools the random weights of all words in document , follows

(11)

As described in Section 3.1, we have

(12)
(13)

leading to the following conditional posteriors:

(14)

Since , from (12) we have

(15)
(16)

Since by construction, we have

(17)

and hence the following conditional posteriors:

(18)
(19)
(20)

The derivation for the parameters of layer is the same as that of gamma belief network (GBN) (Zhou et al., 2016), omitted here for brevity.

Appendix B Sensitivity to Filter Width

To investigate the effect of the filter width of the convolutional kernel, we have evaluated the performance of CPFA (i.e., CPGBN with a single hidden layer) on the SUBJ dataset with a variety of filter widths (unsupervised feature extraction plus a linear SVM for classification). We use the same CPFA code but vary its setting of the filter width. Averaging over five independent runs, the accuracies for filter widths 1, 2, 3, 4, 5, 6, and 7 are , , , , , , and , respectively. Note that when the filter width reduces to 1, CPFA reduces to PFA (i.e., no convolution). These results suggest that the performance of CPFA has low sensitivity to the filter width. While setting the filter width to three may not be the optimal choice, it is common practice in existing text CNNs (Kim, 2014; Johnson & Zhang, 2015a).

Appendix C Hierarchical Visualization

Figure 4: The phrase-level tree that includes all the lower-layer nodes (directly or indirectly) linked to the node of the top layer, taken from the full network inferred by CPGBN on the TREC dataset.
Figure 5: The phrase-level tree that includes all the lower-layer nodes (directly or indirectly) linked to the node of the top layer, taken from the full network inferred by CPGBN on the TREC dataset.

Distinct from the word-level topics learned by traditional topic models (Deerwester et al., 1990; Papadimitriou et al., 2000; Lee & Seung, 2001; Blei et al., 2003; Hinton & Salakhutdinov, 2009; Zhou et al., 2012), we propose novel phrase-level topics that preserve word order, as shown in Table 3, where each phrase-level topic typically combines several frequently co-occurring short phrases. To explore the connections between phrase-level topics of different layers learned by CPGBN, we follow Zhou et al. (2016) in constructing trees to understand the general and specific aspects of the corpus. More specifically, we construct trees learned from the TREC dataset, with the network structure set as . We pick a node at the top layer as the root of a tree and grow the tree downward by drawing a line from a node at one layer to the top relevant nodes at the layer below.

As shown in Fig. 5, we select the top 3 relevant nodes at the second layer linked to the selected root node, and the top 2 relevant nodes at the third layer linked to the selected nodes at the second layer. Considering that the TREC corpus only consists of questions (about abbreviation, entity, description, human, location, or numeric), most of the topics learned by CPGBN focus on short phrases asking specific questions, as shown in Table 3. Following the branches of the tree in Fig. 5, the root node covers very general question types on “how many, how long, what, when, why,” and it is clear that the topics become more and more specific when moving along the tree from top to bottom, where the shallow topics of the first layer tend to focus on a single question type, e.g., one bottom-layer node queries “how many” and another queries “how long.”