Probabilistic topic models focus on discovering the abstract “topics” that occur in a collection of documents, and represent a document as a weighted mixture of the discovered topics. Classical topic models have achieved success in a range of applications [40, 4, 32, 34]. A major challenge of topic models is that the inference of the distribution over topics does not have a closed-form solution and must be approximated, using either MCMC sampling or variational inference. Even a small change to the model requires the inference algorithm to be re-derived. In contrast, black-box inference methods [31, 26, 18, 33] require only limited model-specific analysis and can be flexibly applied to new models.
Among the black-box inference methods, Auto-Encoding Variational Bayes (AEVB) [18] is a promising one for topic models. AEVB contains an inference network that maps a document directly to a variational posterior without the need for further local variational updates on test data, and the Stochastic Gradient Variational Bayes (SGVB) estimator allows efficient approximate inference for a broad class of posteriors, which makes topic models more flexible. Hence, an increasing number of models have recently been proposed to combine topic models with AEVB, such as [24, 37, 7, 25].
Although these AEVB-based topic models achieve promising performance, the number of topics, which is important to the performance of these models, has to be specified manually with model selection methods. Nonparametric models, however, can adapt the number of topics to data. For example, Teh et al. [38] proposed the Hierarchical Dirichlet Process (HDP), which models each document with a Dirichlet Process (DP), where all DPs for the documents in a corpus share a base distribution that is itself sampled from a DP. HDP has a potentially infinite number of topics and allows the number to grow as more documents are observed. It is appealing to equip nonparametric topic models with AEVB techniques as well, so that they enjoy the benefits of neural black-box inference. We make progress on this problem by proposing an infinite Topic Model with Variational Auto-Encoders (iTM-VAE), which is a nonparametric topic model with AEVB.
For nonparametric topic models with a stick-breaking prior, the concentration parameter $\alpha$ plays an important role in deciding the growth of the number of topics (please refer to Section 3.1 for more details about the concentration parameter). The larger $\alpha$ is, the more topics the model tends to discover. Hence, people can place a hyper-prior over $\alpha$ such that the model can adapt it to data [9, 38, 5]. Moreover, the AEVB framework suffers from the problem that the latent representation tends to collapse to the prior [6, 36, 8], which means that the prior parameter $\alpha$ will control the number of discovered topics tightly in our case, especially when the decoder is strong. Common heuristic tricks to alleviate this issue are 1) KL-annealing and 2) decoder regularization. Introducing a hyper-prior into the AEVB framework is nontrivial and has not been well studied in the community. In this paper, we show that introducing a hyper-prior can increase the adaptive capability of the model, and also alleviate the collapse-to-prior issue in the training process. (The hyper-prior technique can also alleviate the collapse-to-prior issue in other scenarios; an example is demonstrated in Appendix 1.2.)
To further increase the flexibility of iTM-VAE, we propose HiTM-VAE, which models the document-specific topic distribution in a hierarchical manner. This hierarchical construction helps to generate topic distributions with better variability, which is more suitable for handling heterogeneous documents.
The main contributions of the paper are:
- We propose iTM-VAE and iTM-VAE-Prod, which are two novel nonparametric topic models equipped with AEVB, and outperform the state-of-the-art models on the benchmarks.
- We propose iTM-VAE-HP, in which a hyper-prior helps the model to adapt the prior parameter to data. We also show that this technique can help other AEVB-based models to alleviate the collapse-to-prior problem elegantly.
- We propose HiTM-VAE, which is a hierarchical extension of iTM-VAE. This construction and its corresponding AEVB-based inference method can help the model to learn more topics and produce topic proportions with higher variability and sparsity.
2 Related Work
Topic models have been studied extensively in a variety of applications such as document modeling, information retrieval, computer vision and bioinformatics [3, 4, 40, 30, 32, 34]. Recently, with the impressive success of deep learning, neural topic models [11, 20, 26] have been proposed and achieve encouraging performance in document modeling tasks. Although these models achieve competitive performance, they do not explicitly model the generative story of documents, and hence are less explainable.
Several recent works propose to model the generative procedure explicitly, and the inference of the topic distributions in these models is computed by deep neural networks, which makes these models explainable, powerful and easily extendable. For example, Srivastava and Sutton [37] proposed AVITM, which embeds the original LDA [4] formulation with AEVB. By utilizing a Laplace approximation for the Dirichlet distribution, AVITM can be optimized by the SGVB estimator efficiently. AVITM achieves state-of-the-art performance on the topic coherence metric, which indicates that the topics learned match human judgment closely.
Nonparametric topic models [38, 15, 1, 23] potentially have infinite topic capacity and can adapt the topic number to data. Nalisnick and Smyth [28] proposed Stick-Breaking VAE (SB-VAE), which is a Bayesian nonparametric version of the traditional VAE with a stochastic dimensionality. iTM-VAE differs from SB-VAE in 3 aspects: 1) iTM-VAE is a kind of topic model for discrete text data. 2) A hyper-prior is introduced into the AEVB framework to increase the adaptive capability. 3) A hierarchical extension of iTM-VAE is proposed to further increase the flexibility. Miao et al. [25] proposed GSM, GSB, RSB and RSB-TF to model documents. RSB-TF uses a heuristic indicator to guide the growth of the topic numbers, and can adapt the topic number to data.
3 The iTM-VAE Model
3.1 The Generative Procedure of iTM-VAE
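Suppose the atom weights $\pi = \{\pi_k\}_{k=1}^{\infty}$ are drawn from a GEM distribution, i.e. $\pi \sim \mathrm{GEM}(\alpha)$, where the GEM distribution is defined as:
$$\nu_k \sim \mathrm{Beta}(1, \alpha), \qquad \pi_k = \nu_k \prod_{l=1}^{k-1} (1 - \nu_l) = \nu_k \Big(1 - \sum_{l=1}^{k-1} \pi_l\Big).$$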
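The stick-breaking construction of the GEM distribution can be illustrated with a minimal sketch. It assumes the standard construction (break fractions drawn from Beta(1, alpha), each weight being the fraction taken from the remaining stick); the function name `sample_gem_weights` and the number of sticks drawn are illustrative choices, not part of the paper.

```python
import random


def sample_gem_weights(alpha, num_sticks, rng):
    """Draw the first `num_sticks` weights of a GEM(alpha) sequence.

    Returns the weights and the length of the stick that is still unbroken.
    """
    weights = []
    remaining = 1.0  # length of the stick not yet assigned to any atom
    for _ in range(num_sticks):
        nu = rng.betavariate(1.0, alpha)  # break fraction nu_k ~ Beta(1, alpha)
        weights.append(nu * remaining)    # pi_k = nu_k * prod_{l<k} (1 - nu_l)
        remaining *= 1.0 - nu
    return weights, remaining


if __name__ == "__main__":
    rng = random.Random(0)
    for alpha in (1.0, 10.0):
        weights, leftover = sample_gem_weights(alpha, num_sticks=20, rng=rng)
        # A larger alpha tends to leave more mass for later sticks.
        print(alpha, [round(w, 3) for w in weights[:5]], round(leftover, 4))
```

The weights plus the leftover stick always sum to one, which is why a truncated version can assign the leftover to the last atom.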
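Let $\theta_k = \sigma(\phi_k)$ denote the $k$th topic, which is a multinomial distribution over the vocabulary, where $\phi_k \in \mathbb{R}^{V}$ is the parameter of $\theta_k$, $\sigma(\cdot)$ is the softmax function and $V$ is the vocabulary size. In iTM-VAE, there is an unlimited number of topics, and we denote $\Theta = \{\theta_k\}_{k=1}^{\infty}$ and $\Phi = \{\phi_k\}_{k=1}^{\infty}$ as the collections of these countably infinite topics and the corresponding parameters. The generation of a document $x^{(j)} = w_{1:N^{(j)}}$ by iTM-VAE can then be mathematically described as:
- Get the document-specific $G^{(j)}(\theta; \pi^{(j)}, \Theta) = \sum_{k=1}^{\infty} \pi_k^{(j)} \delta_{\theta_k}(\theta)$, where $\pi^{(j)} \sim \mathrm{GEM}(\alpha)$
- For each word $w_n$ in $x^{(j)}$: 1) draw a topic $\hat{\theta}_n \sim G^{(j)}(\theta; \pi^{(j)}, \Theta)$; 2) $w_n \sim \mathrm{Cat}(\hat{\theta}_n)$
where $\alpha$ is the concentration parameter, $\mathrm{Cat}(\hat{\theta}_n)$ is a categorical distribution parameterized by $\hat{\theta}_n$, and $\delta_{\theta_k}(\theta)$ is a discrete Dirac function, which equals 1 when $\theta = \theta_k$ and 0 otherwise. In the following, we remove the superscript $j$ for simplicity.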
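The generative procedure above can be sketched as a small sampler. This is only an illustration under stated assumptions: the infinite topic collection is truncated to the number of rows in `topic_params` (the prior in the paper is not truncated), the topic parameters are made-up numbers, and the helper names are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_document(alpha, topic_params, num_words, rng):
    """Sample topic weights and word ids for one document (truncated sketch)."""
    num_topics = len(topic_params)
    weights, remaining = [], 1.0
    for k in range(num_topics):
        # Assumption: the last break takes the whole remaining stick.
        nu = 1.0 if k == num_topics - 1 else rng.betavariate(1.0, alpha)
        weights.append(nu * remaining)
        remaining *= 1.0 - nu
    topics = [softmax(params) for params in topic_params]  # topic k = softmax of its parameters
    vocab = range(len(topic_params[0]))
    words = []
    for _ in range(num_words):
        k = rng.choices(range(num_topics), weights=weights)[0]  # 1) draw a topic
        words.append(rng.choices(vocab, weights=topics[k])[0])  # 2) draw a word from it
    return weights, words


if __name__ == "__main__":
    toy_params = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
    print(sample_document(1.0, toy_params, num_words=10, rng=random.Random(0)))
```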
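Thus, the joint probability of $w_{1:N} = \{w_n\}_{n=1}^{N}$, $\hat{\theta}_{1:N} = \{\hat{\theta}_n\}_{n=1}^{N}$ and $\pi$ can be written as:
$$p(w_{1:N}, \pi, \hat{\theta}_{1:N} \mid \alpha, \Theta) = p(\pi \mid \alpha) \prod_{n=1}^{N} p(w_n \mid \hat{\theta}_n)\, p(\hat{\theta}_n \mid \pi, \Theta)$$
where $p(\pi \mid \alpha) = \mathrm{GEM}(\alpha)$, $p(\hat{\theta}_n \mid \pi, \Theta) = G(\hat{\theta}_n; \pi, \Theta)$ and $p(w_n \mid \hat{\theta}_n) = \mathrm{Cat}(\hat{\theta}_n)$.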
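In Equation 3, the per-word distribution $\bar{\theta} = \sum_k \pi_k \theta_k$, i.e. the topics $\theta_k$ weighted by the atom weights $\pi_k$ once the per-word topic draws are collapsed, is a mixture of multinomials. This formulation cannot make any predictions that are sharper than the distributions being mixed, which may result in some topics that are of poor quality. Replacing the mixture of multinomials with a weighted product of experts is one method to make sharper predictions [10, 37]. Hence, a products-of-experts version of iTM-VAE (i.e. iTM-VAE-Prod) can be obtained by simply computing $\bar{\theta}$ for each document as $\bar{\theta} = \sigma\big(\sum_k \pi_k \phi_k\big)$, where $\phi_k$ is the parameter of topic $\theta_k$ and $\sigma$ is the softmax function.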
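The difference between the two formulations can be checked numerically. The sketch below assumes the mixture is the weighted sum of softmaxed topic parameters and the product-of-experts version is the softmax of the weighted sum of parameters; the toy weights and parameters are invented for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_of_topics(weights, topic_params):
    """Weighted sum of the topic distributions: a mixture of multinomials."""
    mixed = [0.0] * len(topic_params[0])
    for weight, params in zip(weights, topic_params):
        for v, prob in enumerate(softmax(params)):
            mixed[v] += weight * prob
    return mixed


def product_of_topics(weights, topic_params):
    """Softmax of the weighted sum of topic parameters: a weighted product of experts."""
    vocab_size = len(topic_params[0])
    combined = [
        sum(weight * params[v] for weight, params in zip(weights, topic_params))
        for v in range(vocab_size)
    ]
    return softmax(combined)


if __name__ == "__main__":
    weights = [0.5, 0.5]
    # Both toy topics like word 0; each also likes one other word.
    topic_params = [[4.0, 4.0, 0.0], [4.0, 0.0, 4.0]]
    print("mixture:", mixture_of_topics(weights, topic_params))
    print("product:", product_of_topics(weights, topic_params))
```

In this toy case the product concentrates more mass on the word both topics agree on than either topic does alone, while the mixture can never exceed the largest probability among the mixed topics.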
3.2 The Inference Procedure of iTM-VAE
In this section, we describe the inference procedure of iTM-VAE, i.e. how to draw $\pi$ given a document $x$. To elaborate, suppose $\nu = [\nu_1, \nu_2, \dots, \nu_{K-1}]$ is a $(K-1)$-dimensional vector, where $\nu_k$ is a random variable sampled from a Kumaraswamy distribution $\kappa(\nu; a_k, b_k)$ parameterized by $a_k$ and $b_k$ [19, 28]. iTM-VAE models the joint distribution $q_\psi(\nu \mid x)$ as:
$$[a_1, \dots, a_{K-1};\ b_1, \dots, b_{K-1}] = g(x; \psi), \qquad q_\psi(\nu \mid x) = \prod_{k=1}^{K-1} \kappa(\nu_k; a_k, b_k)$$
where $g(x; \psi)$ is a neural network with parameters $\psi$. (Ideally, the Beta distribution is the most suitable candidate, since iTM-VAE assumes $\pi$ is drawn from a GEM distribution in the generative procedure. However, as Beta does not satisfy the differentiable, non-centered parameterization (DNCP) requirement of SGVB, we use the Kumaraswamy distribution.) Then, $\pi$ can be drawn by:
$$\nu \sim q_\psi(\nu \mid x), \qquad \pi = (\pi_1, \pi_2, \dots, \pi_{K-1}, \pi_K) = \Big(\nu_1,\ \nu_2 (1 - \nu_1),\ \dots,\ \nu_{K-1} \prod_{l=1}^{K-2} (1 - \nu_l),\ \prod_{l=1}^{K-1} (1 - \nu_l)\Big)$$
In the above procedure, we truncate the infinite sequence of mixture weights to $K$ elements, and $\nu_K$ is always set to 1 to ensure $\sum_{k=1}^{K} \pi_k = 1$. Notably, as is discussed in , the truncation of the variational posterior does not indicate that we are using a finite dimensional prior, since we never truncate the GEM prior. Hence, iTM-VAE still has the ability to model the uncertainty of the number of topics and adapt it to data .
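The sampling path from Kumaraswamy parameters to truncated topic weights can be sketched as follows. The sketch assumes the Kumaraswamy CDF F(x) = 1 - (1 - x^a)^b, whose closed-form inverse is what makes the draw a differentiable function of uniform noise; the parameter values are invented, whereas in the model they would come from the inference network.

```python
import random


def sample_kumaraswamy(a, b, rng):
    """Inverse-CDF draw from Kumaraswamy(a, b) on (0, 1)."""
    u = rng.random()  # uniform noise, independent of a and b
    return (1.0 - (1.0 - u) ** (1.0 / b)) ** (1.0 / a)


def sample_topic_weights(a_params, b_params, rng):
    """Map K-1 Kumaraswamy draws to K stick-breaking weights that sum to one."""
    weights, remaining = [], 1.0
    for a, b in zip(a_params, b_params):
        nu = sample_kumaraswamy(a, b, rng)
        weights.append(nu * remaining)
        remaining *= 1.0 - nu
    weights.append(remaining)  # the last break fraction is fixed to 1
    return weights


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical posterior parameters for a truncation level of 4.
    print(sample_topic_weights([1.0, 2.0, 0.5], [1.0, 3.0, 2.0], rng))
```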
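iTM-VAE can be optimized by maximizing the Evidence Lower Bound (ELBO):
$$\mathcal{L}(w_{1:N} \mid \Phi, \psi) = \mathbb{E}_{q_\psi(\nu \mid x)}\big[\log p(w_{1:N} \mid \pi, \Phi)\big] - \mathrm{KL}\big(q_\psi(\nu \mid x)\,\|\,p(\nu \mid \alpha)\big)$$
where $q_\psi(\nu \mid x)$ is the variational posterior over the stick-breaking fractions $\nu$ produced by the inference network, $\pi$ is obtained from $\nu$ by stick breaking, and $p(\nu \mid \alpha)$ is the product of $K-1$ $\mathrm{Beta}(1, \alpha)$ probability density functions. The details of the optimization can be found in Appendix 1.3.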
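To make the KL term of the ELBO concrete, the following sketch estimates the divergence between one Kumaraswamy factor and its Beta(1, alpha) prior by plain Monte Carlo. This is a substitute chosen for brevity, not the optimization described in Appendix 1.3; the function names and the clamping constant are assumptions of the sketch.

```python
import math
import random


def log_kumaraswamy_pdf(x, a, b):
    """log of a * b * x**(a - 1) * (1 - x**a)**(b - 1)."""
    return (math.log(a) + math.log(b)
            + (a - 1.0) * math.log(x)
            + (b - 1.0) * math.log1p(-(x ** a)))


def log_beta_1_alpha_pdf(x, alpha):
    """log of the Beta(1, alpha) density alpha * (1 - x)**(alpha - 1)."""
    return math.log(alpha) + (alpha - 1.0) * math.log1p(-x)


def estimate_kl(a, b, alpha, num_samples, rng):
    """Monte Carlo estimate of KL(Kumaraswamy(a, b) || Beta(1, alpha))."""
    total = 0.0
    for _ in range(num_samples):
        u = rng.random()
        x = (1.0 - (1.0 - u) ** (1.0 / b)) ** (1.0 / a)  # inverse-CDF draw
        x = min(max(x, 1e-12), 1.0 - 1e-12)              # keep the logs finite
        total += log_kumaraswamy_pdf(x, a, b) - log_beta_1_alpha_pdf(x, alpha)
    return total / num_samples


if __name__ == "__main__":
    rng = random.Random(0)
    print(estimate_kl(a=1.0, b=3.0, alpha=1.0, num_samples=20000, rng=rng))
```

A useful sanity check is that Kumaraswamy(1, b) coincides with Beta(1, b), so the estimate vanishes when a = 1 and b equals the prior concentration.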
3.3 Modeling the Uncertainty of Prior Parameter
In the generative procedure, the concentration parameter $\alpha$ of $\mathrm{GEM}(\alpha)$ can have a significant impact on the growth of the number of topics. The larger $\alpha$ is, the more “breaks” it will create, and consequently, more topics will be used. Hence, it is generally reasonable to consider placing a hyper-prior on $\alpha$ to model its uncertainty [9, 5, 38]. For example, Escobar and West [9] placed a Gamma hyper-prior on $\alpha$ for the urn-based samplers and implemented the corresponding Gibbs updates with auxiliary variable methods. Blei et al. [5] also placed a Gamma prior on $\alpha$ and derived a closed-form update for the variational parameters. Different from previous work, we introduce the hyper-prior into the AEVB framework and propose to optimize the model by stochastic gradient descent (SGD) methods.
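Concretely, since the Gamma distribution is conjugate to $\mathrm{Beta}(1, \alpha)$, we place a $\mathrm{Gamma}(s_1, s_2)$ prior on $\alpha$. Then the ELBO of iTM-VAE-HP over a corpus of documents $x^{(j)}$ can be written as:
$$\mathcal{L} = \sum_{j} \Big\{ \mathbb{E}_{q_\psi(\nu^{(j)} \mid x^{(j)})}\big[\log p(x^{(j)} \mid \pi^{(j)}, \Phi)\big] + \mathbb{E}_{q_\psi(\nu^{(j)} \mid x^{(j)})\, q(\alpha \mid \gamma_1, \gamma_2)}\big[\log p(\nu^{(j)} \mid \alpha)\big] - \mathbb{E}_{q_\psi(\nu^{(j)} \mid x^{(j)})}\big[\log q_\psi(\nu^{(j)} \mid x^{(j)})\big] \Big\} - \mathrm{KL}\big(q(\alpha \mid \gamma_1, \gamma_2)\,\|\,p(\alpha \mid s_1, s_2)\big) \quad (9)$$
where $p(\alpha \mid s_1, s_2) = \mathrm{Gamma}(s_1, s_2)$, $p(\nu_k \mid \alpha) = \mathrm{Beta}(1, \alpha)$, and $q(\alpha \mid \gamma_1, \gamma_2) = \mathrm{Gamma}(\gamma_1, \gamma_2)$ is the corpus-level variational posterior for $\alpha$. The derivation for Equation 9 can be found in Appendix 1.4. In our experiments, we find iTM-VAE-Prod always performs better than iTM-VAE, therefore we only report the performance of iTM-VAE-Prod with hyper-prior, and refer to this variant as iTM-VAE-HP. Actually, as discussed in Section 1, the hyper-prior technique can also be applied to other AEVB-based models to alleviate the collapse-to-prior problem. In Appendix 1.2, we show that by introducing a hyper-prior to SB-VAE, more latent units can be activated and the model achieves better performance.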
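The expectation of the Beta log-density under a Gamma variational posterior has a closed form, which is one reason the hyper-prior fits into gradient-based training. The short derivation below is a sketch that assumes a shape-rate parameterization of the Gamma posterior and uses the symbols $\gamma_1$, $\gamma_2$ for its parameters and $\psi_0$ for the digamma function; these naming choices are assumptions of the sketch.

```latex
% Assumption: q(\alpha) = \mathrm{Gamma}(\gamma_1, \gamma_2) with shape \gamma_1 and rate \gamma_2.
\begin{align*}
\log \mathrm{Beta}(\nu_k; 1, \alpha) &= \log \alpha + (\alpha - 1) \log (1 - \nu_k), \\
\mathbb{E}_{q(\alpha)}[\alpha] &= \frac{\gamma_1}{\gamma_2}, \qquad
\mathbb{E}_{q(\alpha)}[\log \alpha] = \psi_0(\gamma_1) - \log \gamma_2, \\
\mathbb{E}_{q(\alpha)}\big[\log \mathrm{Beta}(\nu_k; 1, \alpha)\big]
  &= \psi_0(\gamma_1) - \log \gamma_2 + \Big(\frac{\gamma_1}{\gamma_2} - 1\Big) \log (1 - \nu_k).
\end{align*}
```

Under this parameterization, the posterior mean $\gamma_1 / \gamma_2$ plays the role that the fixed concentration parameter plays in the model without a hyper-prior.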
4 Hierarchical iTM-VAE
4.1 The Generative Procedure of HiTM-VAE
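The generation of a document by HiTM-VAE is described as follows:
- Get the corpus-level base distribution $G^{(0)}$: $\beta \sim \mathrm{GEM}(\gamma)$; $G^{(0)}(\theta; \beta, \Theta) = \sum_{i=1}^{\infty} \beta_i \delta_{\theta_i}(\theta)$
- For each document $x^{(j)}$ in the corpus:
  - Draw the document-level stick-breaking weights $\pi^{(j)} \sim \mathrm{GEM}(\alpha)$
  - Draw document-level atoms $\zeta_k^{(j)} \sim G^{(0)}$, $k = 1, \dots, \infty$; then we get a document-specific distribution $G^{(j)}(\theta; \pi^{(j)}, \{\zeta_k^{(j)}\}) = \sum_{k=1}^{\infty} \pi_k^{(j)} \delta_{\zeta_k^{(j)}}(\theta)$
  - For each word $w_n$ in the document: 1) draw a topic $\hat{\theta}_n \sim G^{(j)}$; 2) $w_n \sim \mathrm{Cat}(\hat{\theta}_n)$
To sample the document-level atoms $\zeta^{(j)} = \{\zeta_k^{(j)}\}_{k=1}^{\infty}$, a series of indicator variables $c^{(j)} = \{c_k^{(j)}\}_{k=1}^{\infty}$ are drawn i.i.d.: $c_k^{(j)} \sim \mathrm{Cat}(\beta)$. Then, the document-level atoms are $\zeta_k^{(j)} = \theta_{c_k^{(j)}}$.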
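The hierarchical construction can be sketched as a two-level sampler: corpus-level weights over shared topics, then per-document sticks whose atoms point back to shared topics through indicator variables. Both levels are truncated here for illustration, and the concentration values and function names are hypothetical.

```python
import random


def truncated_stick_breaking(concentration, size, rng):
    """Return `size` stick-breaking weights that sum to one."""
    weights, remaining = [], 1.0
    for k in range(size):
        # Assumption: the last break takes the whole remaining stick.
        frac = 1.0 if k == size - 1 else rng.betavariate(1.0, concentration)
        weights.append(frac * remaining)
        remaining *= 1.0 - frac
    return weights


def sample_document_topic_weights(corpus_weights, doc_concentration, doc_truncation, rng):
    """Draw document-level sticks and indicators, then aggregate per shared topic."""
    doc_sticks = truncated_stick_breaking(doc_concentration, doc_truncation, rng)
    num_topics = len(corpus_weights)
    # Each indicator picks the shared topic that a document-level atom points to.
    indicators = rng.choices(range(num_topics), weights=corpus_weights, k=doc_truncation)
    topic_weights = [0.0] * num_topics
    for stick, topic in zip(doc_sticks, indicators):
        topic_weights[topic] += stick  # several atoms may point to the same topic
    return indicators, topic_weights


if __name__ == "__main__":
    rng = random.Random(0)
    corpus_weights = truncated_stick_breaking(5.0, 10, rng)
    for _ in range(3):
        indicators, topic_weights = sample_document_topic_weights(corpus_weights, 1.0, 6, rng)
        print(indicators, [round(w, 3) for w in topic_weights])
```

Because a document only places mass on the topics its indicators select, its topic proportions contain exact zeros, which is consistent with the higher sparsity attributed to the hierarchical construction.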
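Let $D$ and $N^{(j)}$ denote the size of the dataset and the number of words in each document $x^{(j)}$, respectively. After collapsing the per-word assignment random variables $\hat{\theta}_{1:N^{(j)}}$, the joint probability of the corpus-level atom weights $\beta$, the documents $X = \{x^{(j)}\}_{j=1}^{D}$, the stick-breaking weights $\Pi = \{\pi^{(j)}\}_{j=1}^{D}$ and the indicator variables $C = \{c^{(j)}\}_{j=1}^{D}$ can be written as:
$$p(\beta, X, \Pi, C \mid \gamma, \alpha, \Theta) = p(\beta \mid \gamma) \prod_{j=1}^{D} p(\pi^{(j)} \mid \alpha)\, p(c^{(j)} \mid \beta)\, p(x^{(j)} \mid \pi^{(j)}, c^{(j)}, \Theta)$$
where $p(\beta \mid \gamma) = \mathrm{GEM}(\gamma)$, $p(\pi^{(j)} \mid \alpha) = \mathrm{GEM}(\alpha)$, $p(c^{(j)} \mid \beta) = \prod_{k} \mathrm{Cat}(c_k^{(j)}; \beta)$, $p(x^{(j)} \mid \pi^{(j)}, c^{(j)}, \Theta) = \prod_{n=1}^{N^{(j)}} \mathrm{Cat}\big(w_n; \sum_{k} \pi_k^{(j)} \theta_{c_k^{(j)}}\big)$.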
4.2 The Inference Procedure of HiTM-VAE
Setting the truncation level of the corpus-level and document-level GEM to and , HiTM-VAE models the per-document posterior for every document as:
where is a neural network with parameters , and are the multinomial variational parameters for each document-level indicator variable . Then, can be constructed by the stick breaking process using .
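As shown in Section 4.1, the corpus-level atom weights $\beta$ are drawn from a GEM distribution, i.e. they are generated as follows:
$$\beta'_i \sim \mathrm{Beta}(1, \gamma), \qquad \beta_i = \beta'_i \prod_{l=1}^{i-1} (1 - \beta'_l)$$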
The corpus-level variational posterior for with truncation level is , where are the corpus-level variational parameters.
The ELBO of the training dataset can be written as:
where , , , . The details of the derivation of the ELBO can be found in Appendix 1.5.
In iTM-VAE, we get the document-specific topic distribution by sampling the atom weights from a GEM. Instead of being drawn from a continuous base distribution, the atoms are modeled as trainable parameters as in [4, 37, 25]. Thus, the atoms are naturally shared by all documents without the need for a hierarchical construction like HDP [38]. The hierarchical extension, HiTM-VAE, which models the document-specific topic distribution in a hierarchical manner, is more flexible and can generate topic distributions with better variability. A detailed comparison is given in Section 5.3.
5 Experiments
In this section, we evaluate the performance of iTM-VAE and its variants on two public benchmarks: 20News and RCV1-V2, and demonstrate the advantage brought by the variants of iTM-VAE. To make a fair comparison, we use exactly the same data and vocabulary as .
The configuration of the experiments is as follows. We use a two-layer fully-connected neural network for the inference network of Equation 11, and the number of hidden units is set to and for 20News and RCV1-V2, respectively. The truncation level in Equation 7 is set to so that the maximum topic numbers will never exceed the ones used by baselines. (In these baselines, at most topics are used. Please refer to Table 1 for details.) The concentration parameter for the GEM distribution is cross-validated on the validation set from for iTM-VAE and iTM-VAE-Prod. Batch-Renormalization [13] is used to stabilize the training procedure. Adam [16] is used to optimize the model and the learning rate is set to . The code of iTM-VAE and its variants is available at http://anonymous.
5.1 Perplexity and Topic Coherence
Perplexity is widely used by topic models to measure goodness-of-fit, and is defined as $\exp\big(-\frac{1}{D} \sum_{j=1}^{D} \frac{1}{N^{(j)}} \log p(x^{(j)})\big)$, where $D$ is the number of documents, and $N^{(j)}$ is the number of words in the $j$-th document $x^{(j)}$. Following previous work, the variational lower bound is used to estimate the perplexity.
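The perplexity computation can be sketched in a few lines. The sketch assumes the per-document form used above (the exponential of the negative average per-word log-likelihood, averaged over documents) and takes log-likelihood estimates, such as variational lower bounds, as input; the function name is illustrative.

```python
import math


def perplexity(doc_log_likelihoods, doc_lengths):
    """exp(-(1/D) * sum_j log p(x_j) / N_j) for D documents."""
    if len(doc_log_likelihoods) != len(doc_lengths) or not doc_lengths:
        raise ValueError("need one log-likelihood and one length per document")
    per_word = [ll / n for ll, n in zip(doc_log_likelihoods, doc_lengths)]
    return math.exp(-sum(per_word) / len(per_word))


if __name__ == "__main__":
    # A model that spreads probability uniformly over 100 words has perplexity 100.
    lengths = [10, 25]
    log_likelihoods = [n * math.log(1.0 / 100.0) for n in lengths]
    print(perplexity(log_likelihoods, lengths))
```

When a lower bound on the log-likelihood is plugged in, the result is an upper bound on the true perplexity.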
As the quality of the learned topics is not directly reflected by perplexity, topic coherence is designed to match human judgment. We adopt NPMI as the measurement of topic coherence, as is adopted by [25, 37]. (We use the code provided by [22] at https://github.com/jhlau/topic_interpretability/) We define a topic to be an Effective Topic if it becomes the top- significant topic of a sample among the training set more than times, where is the training set size and is a ratio. We set to in our experiments. Following , we use an average over topic coherence computed by top-5 and top-10 words across five random runs, which is more robust .
Table 1 shows the perplexity and topic coherence of different topic models on the 20News and RCV1-V2 datasets. We can clearly see that our models outperform the baselines, which indicates that our models have better goodness-of-fit and can discover topics that match human judgment more closely. We can also see that HiTM-VAE achieves better perplexity than , in which a similar hierarchical construction is used. Note that comparing the ELBO-estimated perplexity of HiTM-VAE with other models directly is not suitable, as it has a lot more random variables, which usually leads to a higher ELBO. The possible reasons for the good coherence achieved by our models are: 1) the “Product-of-Experts” enables the model to represent sharper distributions; 2) the nonparametric characteristic means the models can adapt the number of topics to data, thus topics can be sufficiently trained and are highly diverse. Table 1 in Appendix 1.1 illustrates the topics learned by iTM-VAE-Prod. Please refer to Appendix 1.1 in the supplementary material for more details.
5.2 The Effect of Hyper-Prior on iTM-VAE
In this section, we provide quantitative evaluations on the effect of the hyper-prior for iTM-VAE. Specifically, a relatively non-informative hyper-prior is imposed on $\alpha$, and we initialize the global variational parameters of Equation 9 to be the same as the non-informative Gamma prior. Thus the expectation of $\alpha$ given the variational posterior is before training. An SGD optimizer with a learning rate of is used to optimize the global variational parameters. Neither KL annealing nor decoder regularization is used for iTM-VAE-HP.
Table 2 reports the learned global variational parameter , and the expectation of $\alpha$ given the variational posterior on several subsets of the 20News dataset, which contain , , , and classes, respectively. (Since there are no labels for the 20News dataset provided by , we preprocess the dataset ourselves in this illustrative experiment.) We can see that, once the training is done, the variational posterior is very confident, and the expectation of $\alpha$ given the variational posterior is adjusted to the training set. For example, if the training set contains only class of documents, the expectation after training is , whereas when the training set consists of classes of documents, the expectation after training is . This indicates that iTM-VAE-HP can learn to adjust $\alpha$ to data, thus the number of discovered topics will adapt to data better. In contrast, for iTM-VAE-Prod (without the hyper-prior), when the decoder is strong, no matter how many classes the dataset contains, the number of topics will be constrained tightly due to the collapse-to-prior problem of AEVB, and the KL-annealing and decoder regularization tricks do not help much.
Figure 1 illustrates the training set coverage w.r.t. the number of used topics when the training set contains , , , and classes, respectively. Specifically, we compute the average weight of every topic on the training dataset, and sort the topics according to their average weights. The topic coverage is then defined as the cumulative sum of these weights. Figure 1 shows that, as the number of classes increases, more topics are utilized by iTM-VAE-HP to reach the same level of topic coverage, which indicates that the model has the ability to adapt to data.
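The topic coverage computation described above can be sketched as follows. The sketch assumes that topics are sorted from the largest average weight to the smallest before the cumulative sum is taken; the function names and the toy topic proportions are illustrative.

```python
def topic_coverage(doc_topic_weights):
    """Cumulative sum of the per-topic average weights, largest topics first."""
    num_docs = len(doc_topic_weights)
    num_topics = len(doc_topic_weights[0])
    averages = [
        sum(doc[k] for doc in doc_topic_weights) / num_docs
        for k in range(num_topics)
    ]
    averages.sort(reverse=True)  # assumption: descending order of average weight
    coverage, running = [], 0.0
    for weight in averages:
        running += weight
        coverage.append(running)
    return coverage


def topics_needed(coverage, level):
    """Smallest number of topics whose cumulative weight reaches `level`."""
    for count, covered in enumerate(coverage, start=1):
        if covered >= level:
            return count
    return len(coverage)


if __name__ == "__main__":
    toy_weights = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
    curve = topic_coverage(toy_weights)
    print(curve, topics_needed(curve, 0.85))
```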
5.3 The Evaluation of HiTM-VAE
In this section, by comparing the topic coverage and sparsity of iTM-VAE-Prod and HiTM-VAE, we show that the hierarchical construction can help the model to learn more topics, and produce posterior topic proportions with higher sparsity. (To compare the sparsity of the posterior topic proportions of each model, we sort the topic weights of every training document and average across the dataset. Then, the logarithm of the average weights is plotted w.r.t. the topic index.)
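The sparsity curve described in this section can be sketched as below. The sketch assumes the per-document weights are sorted in descending order before averaging, and it floors the averages at a small constant so the logarithm stays finite; both choices, and the function name, are assumptions.

```python
import math


def sorted_average_log_weights(doc_topic_weights, floor=1e-12):
    """Sort each document's topic weights, average by rank, and take the log."""
    num_docs = len(doc_topic_weights)
    num_topics = len(doc_topic_weights[0])
    totals = [0.0] * num_topics
    for doc in doc_topic_weights:
        for rank, weight in enumerate(sorted(doc, reverse=True)):
            totals[rank] += weight
    return [math.log(max(total / num_docs, floor)) for total in totals]


if __name__ == "__main__":
    toy_weights = [[0.1, 0.9, 0.0], [0.8, 0.2, 0.0]]
    print(sorted_average_log_weights(toy_weights))
```

A curve that drops quickly with the rank index indicates that each document concentrates its mass on a few topics, i.e. sparser topic proportions.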
The model configurations are the same for iTM-VAE-Prod and HiTM-VAE, except that is set to and for iTM-VAE-Prod, and for HiTM-VAE. For HiTM-VAE, the corpus-level updates are done every epochs on 20News, and epochs on RCV1-V2.
As shown in Figure 2, HiTM-VAE can learn more topics than iTM-VAE-Prod (), and the sparsity of its posterior topic proportions is significantly higher. iTM-VAE-Prod () has higher sparsity than iTM-VAE-Prod (). However, its sparsity is still lower than HiTM-VAE with the same document-level concentration parameter , and it can only learn a small number of topics, which means that there might exist rare topics that are not learned by the model. The comparison of HiTM-VAE and iTM-VAE-Prod () shows that the superior sparsity not only comes from a smaller per-document concentration hyper-parameter , but also from the hierarchical construction itself.
6 Conclusion
In this paper, we propose iTM-VAE and iTM-VAE-Prod, which are nonparametric topic models built on Variational Auto-Encoders. Specifically, a stick-breaking prior is used to generate the atom weights of countably infinite shared topics, and the Kumaraswamy distribution is exploited such that the model can be optimized by the AEVB algorithm. We also propose iTM-VAE-HP, which introduces a hyper-prior into the VAE framework such that the model can adapt better to data. This technique is general and can be incorporated into other VAE-based models to alleviate the collapse-to-prior problem. To further diversify the document-specific topic distributions, we use a hierarchical construction in the generative procedure, and we show that the proposed model HiTM-VAE can learn more topics and produce sparser posterior topic proportions. The advantage of iTM-VAE and its variants over traditional nonparametric topic models is that the inference is performed by feed-forward neural networks, which have rich representation capacity and require only limited knowledge of the data. Hence, it is flexible to incorporate more information sources into the model, and we leave this to future work. Experimental results on two public benchmarks show that iTM-VAE and its variants outperform the state-of-the-art baselines.
References
- Archambeau et al.  Cedric Archambeau, Balaji Lakshminarayanan, and Guillaume Bouchard. Latent ibp compound dirichlet allocation. IEEE transactions on pattern analysis and machine intelligence, 37(2):321–333, 2015.
- Bernardo and Smith  José M Bernardo and Adrian FM Smith. Bayesian theory, 2001.
- Blei  David M Blei. Probabilistic topic models. Communications of the ACM, 55(4):77–84, 2012.
- Blei et al.  David M Blei, Andrew Y Ng, and Michael I Jordan. Latent dirichlet allocation. JMLR, 2003.
- Blei et al.  David M Blei, Michael I Jordan, et al. Variational inference for dirichlet process mixtures. Bayesian analysis, 1(1):121–143, 2006.
- Bowman et al.  Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Józefowicz, and Samy Bengio. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349.
- Card et al.  Dallas Card, Chenhao Tan, and Noah A Smith. A neural framework for generalized topic models. arXiv preprint arXiv:1705.09296, 2017.
- Chen et al.  Xi Chen, Diederik P. Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman, Ilya Sutskever, and Pieter Abbeel. Variational lossy autoencoder. 2017.
- Escobar and West  Michael D Escobar and Mike West. Bayesian density estimation and inference using mixtures. Journal of the american statistical association, 90(430):577–588, 1995.
- Hinton  Geoffrey E Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8), 2002.
- Hinton and Salakhutdinov  Geoffrey E Hinton and Ruslan R Salakhutdinov. Replicated softmax: an undirected topic model. In NIPS, pages 1607–1614, 2009.
- Hoffman et al.  Matthew Hoffman, Francis R. Bach, and David M. Blei. Online learning for latent dirichlet allocation. In NIPS. 2010.
- Ioffe  Sergey Ioffe. Batch renormalization: Towards reducing minibatch dependence in batch-normalized models. 2017. URL http://arxiv.org/abs/1702.03275.
- Jang et al.  Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. 2017. URL https://arxiv.org/abs/1611.01144.
- Kim and Sudderth  Dae I Kim and Erik B Sudderth. The doubly correlated nonparametric topic model. In Advances in Neural Information Processing Systems, pages 1980–1988, 2011.
- Kingma and Ba  Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
- Kingma and Welling [2014a] Diederik Kingma and Max Welling. Efficient gradient-based inference through transformations between bayes nets and neural nets. In ICML, pages 1782–1790, 2014a.
- Kingma and Welling [2014b] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In ICLR, 2014b.
- Kumaraswamy  Ponnambalam Kumaraswamy. A generalized probability density function for double-bounded random processes. Journal of Hydrology, 46(1-2):79–88, 1980.
- Larochelle and Lauly  Hugo Larochelle and Stanislas Lauly. A neural autoregressive topic model. In NIPS, 2012.
- Lau and Baldwin  Jey Han Lau and Timothy Baldwin. The sensitivity of topic coherence evaluation to topic cardinality. In NAACL HLT, pages 483–487, 2016.
- Lau et al.  Jey Han Lau, David Newman, and Timothy Baldwin. Machine reading tea leaves: Automatically evaluating topic coherence and topic model quality. In EACL, pages 530–539, 2014.
- Lim et al.  Kar Wai Lim, Wray Buntine, Changyou Chen, and Lan Du. Nonparametric bayesian topic modelling with the hierarchical pitman–yor processes. International Journal of Approximate Reasoning, 78:172–191, 2016.
- Miao et al.  Yishu Miao, Lei Yu, and Phil Blunsom. Neural variational inference for text processing. In ICML, pages 1727–1736, 2016.
- Miao et al.  Yishu Miao, Edward Grefenstette, and Phil Blunsom. Discovering discrete latent topics with neural variational inference. In ICML, 2017.
- Mnih and Gregor  Andriy Mnih and Karol Gregor. Neural variational inference and learning in belief networks. In ICML, 2014.
- Murphy  Kevin P Murphy. Machine learning: a probabilistic perspective. MIT press, 2012.
- Nalisnick and Smyth  Eric Nalisnick and Padhraic Smyth. Stick-breaking variational autoencoders. In ICLR, 2017.
- Newman et al.  David Newman, Jey Han Lau, Karl Grieser, and Timothy Baldwin. Automatic evaluation of topic coherence. In NAACL HLT, pages 100–108, 2010.
- Putthividhy et al.  Duangmanee Putthividhy, Hagai T Attias, and Srikantan S Nagarajan. Topic regression multi-modal latent dirichlet allocation for image annotation. In CVPR, pages 3408–3415. IEEE, 2010.
- Ranganath et al.  Rajesh Ranganath, Sean Gerrish, and David Blei. Black box variational inference. In AISTATS, 2014.
- Rasiwasia and Vasconcelos  Nikhil Rasiwasia and Nuno Vasconcelos. Latent dirichlet allocation models for image classification. IEEE transactions on pattern analysis and machine intelligence, 35(11):2665–2679, 2013.
- Rezende et al.  Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and variational inference in deep latent gaussian models. In ICML, 2014.
- Rogers et al.  Simon Rogers, Mark Girolami, Colin Campbell, and Rainer Breitling. The latent process decomposition of cdna microarray data sets. IEEE/ACM TCBB, 2(2):143–156, 2005.
- Sethuraman  Jayaram Sethuraman. A constructive definition of dirichlet priors. Statistica sinica, 1994.
- Sønderby et al.  Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther. Ladder variational autoencoders. In Advances in Neural Information Processing Systems, pages 3738–3746, 2016.
- Srivastava and Sutton  Akash Srivastava and Charles Sutton. Autoencoding variational inference for topic models. In ICLR, 2017.
- Teh et al.  Yee Whye Teh, Michael I Jordan, Matthew J Beal, and David M Blei. Hierarchical dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581, 2006.
- Wang et al.  Chong Wang, John Paisley, and David Blei. Online variational inference for the hierarchical dirichlet process. In AISTATS, pages 752–760, 2011.
- Wei and Croft  Xing Wei and W Bruce Croft. Lda-based document models for ad-hoc retrieval. In SIGIR, 2006.