1 Introduction
Probabilistic topic models focus on discovering the abstract “topics” that occur in a collection of documents, and represent a document as a weighted mixture of the discovered topics. Classical topic models [4] have achieved success in a range of applications [40, 4, 32, 34]. A major challenge of topic models is that the inference of the distribution over topics does not have a closed-form solution and must be approximated, using either MCMC sampling or variational inference. Whenever even small changes are made to the model, the inference algorithm must be re-derived. In contrast, black-box inference methods [31, 26, 18, 33] require only limited model-specific analysis and can be flexibly applied to new models.
Among the black-box inference methods, Auto-Encoding Variational Bayes (AEVB) [18, 33] is a promising one for topic models. AEVB contains an inference network that can map a document directly to a variational posterior without the need for further local variational updates on test data, and the Stochastic Gradient Variational Bayes (SGVB) estimator allows efficient approximate inference for a broad class of posteriors, which makes topic models more flexible. Hence, an increasing number of models combining topic models with AEVB have recently been proposed, such as
[24, 37, 7, 25]. Although these AEVB-based topic models achieve promising performance, the number of topics, which is important to their performance, has to be specified manually via model-selection methods. Nonparametric models, however, have the ability to adapt the number of topics to data. For example, Teh et al. [38] proposed the Hierarchical Dirichlet Process (HDP), which models each document with a Dirichlet Process (DP), and all DPs for the documents in a corpus share a base distribution that is itself sampled from a DP. HDP has a potentially infinite number of topics and allows the number to grow as more documents are observed. It is appealing for nonparametric topic models to also be equipped with AEVB techniques so as to enjoy the benefits of neural black-box inference. We make progress on this problem by proposing the infinite Topic Model with Variational Auto-Encoders (iTMVAE), a nonparametric topic model with AEVB.
For nonparametric topic models with a stick-breaking prior [35], the concentration parameter \alpha plays an important role in deciding the growth of the number of topics (please refer to Section 3.1 for more details about the concentration parameter). The larger \alpha is, the more topics the model tends to discover. Hence, one can place a hyperprior [2] over \alpha such that the model can adapt it to data [9, 38, 5]. Moreover, the AEVB framework suffers from the problem that the latent representation tends to collapse to the prior [6, 36, 8], which means that the prior parameter \alpha will tightly control the number of discovered topics in our case, especially when the decoder is strong. Common heuristic tricks to alleviate this issue are 1) KL-annealing [36] and 2) decoder regularization [6]. Introducing a hyperprior into the AEVB framework is nontrivial and has not been well explored in the community. In this paper, we show that introducing a hyperprior can increase the adaptive capability of the model and also alleviate the collapse-to-prior issue during training. (The hyperprior technique can also alleviate the collapse-to-prior issue in other scenarios; an example is demonstrated in Appendix 1.2.) To further increase the flexibility of iTMVAE, we propose HiTMVAE, which models the document-specific topic distribution in a hierarchical manner. This hierarchical construction helps to generate topic distributions with higher variability, which is more suitable for handling heterogeneous documents.
The main contributions of the paper are:

We propose iTMVAE and iTMVAEProd, two novel nonparametric topic models equipped with AEVB, which outperform the state-of-the-art models on the benchmarks.

We propose iTMVAEHP, in which a hyperprior helps the model to adapt the prior parameter to data. We also show that this technique can help other AEVB-based models to alleviate the collapse-to-prior problem elegantly.

We propose HiTMVAE, a hierarchical extension of iTMVAE. This construction and its corresponding AEVB-based inference method help the model to learn more topics and produce topic proportions with higher variability and sparsity.
2 Related Work
Topic models have been studied extensively in a variety of applications such as document modeling, information retrieval, computer vision and bioinformatics [3, 4, 40, 30, 32, 34]. Recently, with the impressive success of deep learning, neural topic models [11, 20, 26] have achieved encouraging performance on document-modeling tasks. Although these models achieve competitive performance, they do not explicitly model the generative story of documents and are hence less explainable. Several recent works model the generative procedure explicitly and compute the inference of the topic distributions with deep neural networks, which makes these models explainable, powerful and easily extendable. For example, Srivastava and Sutton [37] proposed AVITM, which embeds the original LDA [4] formulation within AEVB. By utilizing a Laplace approximation for the Dirichlet distribution, AVITM can be optimized efficiently by the SGVB estimator. AVITM achieves state-of-the-art performance on the topic coherence metric [22], which indicates that the learned topics match human judgment closely.
Nonparametric topic models [38, 15, 1, 23] have potentially infinite topic capacity and can adapt the number of topics to data. Nalisnick and Smyth [28] proposed the Stick-Breaking VAE (SBVAE), which is a Bayesian nonparametric version of the traditional VAE with a stochastic dimensionality. iTMVAE differs from SBVAE in three aspects: 1) iTMVAE is a topic model for discrete text data; 2) a hyperprior is introduced into the AEVB framework to increase the adaptive capability; 3) a hierarchical extension of iTMVAE is proposed to further increase the flexibility. Miao et al. [25] proposed GSM, GSB, RSB and RSBTF to model documents. RSBTF uses a heuristic indicator to guide the growth of the number of topics, and can thus adapt the number of topics to data.
3 The iTMVAE Model
In this section, we describe the generative and inference procedures of iTMVAE and iTMVAEProd in Sections 3.1 and 3.2. Then, Section 3.3 describes the hyperprior extension, iTMVAEHP.
3.1 The Generative Procedure of iTMVAE
Suppose the atom weights \pi = (\pi_1, \pi_2, \ldots) are drawn from a GEM distribution [27], i.e. \pi \sim \mathrm{GEM}(\alpha), where the GEM (stick-breaking) distribution is defined as:

(1)  \nu_k \sim \mathrm{Beta}(1, \alpha), \quad \pi_k = \nu_k \prod_{j=1}^{k-1} (1 - \nu_j), \quad k = 1, 2, \ldots
Let t_k = \mathrm{softmax}(\phi_k) denote the k-th topic, which is a multinomial distribution over the vocabulary, where \phi_k \in \mathbb{R}^V is the parameter of t_k, softmax is the softmax function and V is the vocabulary size. In iTMVAE, there is an unlimited number of topics, and we denote T = \{t_k\}_{k=1}^{\infty} and \Phi = \{\phi_k\}_{k=1}^{\infty} as the collections of these countably infinite topics and the corresponding parameters. The generation of a document by iTMVAE can then be described mathematically as:

- Get the document-specific topic distribution G = \sum_{k=1}^{\infty} \pi_k \delta_{t_k}, where \pi \sim \mathrm{GEM}(\alpha)  (2)
- For each word w_n in the document: 1) draw a topic t^{(n)} \sim G; 2) draw the word w_n \sim \mathrm{Cat}(t^{(n)})  (3)

where \alpha is the concentration parameter, \mathrm{Cat}(t^{(n)}) is a categorical distribution parameterized by t^{(n)}, and \delta_{t_k}(t) is a discrete Dirac function, which equals 1 when t = t_k and 0 otherwise. In the following, we remove the superscript (n) of t^{(n)} for simplicity.
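The generative procedure above can be sketched with a truncated stick-breaking sampler. This is a minimal plain-Python illustration, not part of the model itself; the truncation level, the toy topics and the function names are our own choices:

```python
import random

def sample_gem(alpha, truncation=50, rng=random):
    """Draw truncated stick-breaking weights pi ~ GEM(alpha)."""
    weights, remaining = [], 1.0
    for _ in range(truncation - 1):
        nu = rng.betavariate(1.0, alpha)   # stick fraction nu_k ~ Beta(1, alpha)
        weights.append(nu * remaining)
        remaining *= 1.0 - nu
    weights.append(remaining)              # last weight absorbs the leftover mass
    return weights

def generate_document(pi, topics, n_words, rng=random):
    """Draw each word by first drawing a topic z ~ Cat(pi), then w ~ Cat(topic_z)."""
    vocab = range(len(topics[0]))
    words = []
    for _ in range(n_words):
        z = rng.choices(range(len(pi)), weights=pi)[0]
        words.append(rng.choices(vocab, weights=topics[z])[0])
    return words
```

A larger concentration parameter makes the early stick fractions smaller on average, so the sampled weights spread over more topics.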
In Equation 3, the resulting word distribution \sum_k \pi_k t_k is a mixture of multinomials. This formulation cannot make any predictions that are sharper than the distributions being mixed [11], which may result in some topics of poor quality. Replacing the mixture of multinomials with a weighted product of experts is one way to make sharper predictions [10, 37]. Hence, a products-of-experts version of iTMVAE (i.e. iTMVAEProd) can be obtained by simply computing the word distribution for each document as t = \mathrm{softmax}\big(\sum_k \pi_k \phi_k\big).
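A toy contrast between the two formulations can be sketched as follows. This plain-Python sketch assumes the products-of-experts combination takes a softmax of the \pi-weighted sum of topic logits (as in AVITM-style models); all function names are ours:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mixture_word_dist(pi, topic_probs):
    """Mixture of multinomials: p(w) = sum_k pi_k * t_k(w)."""
    V = len(topic_probs[0])
    return [sum(p * t[w] for p, t in zip(pi, topic_probs)) for w in range(V)]

def poe_word_dist(pi, topic_logits):
    """Products-of-experts: p(w) = softmax(sum_k pi_k * phi_k)(w),
    which can be sharper than any individual expert."""
    V = len(topic_logits[0])
    return softmax([sum(p * phi[w] for p, phi in zip(pi, topic_logits))
                    for w in range(V)])
```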
3.2 The Inference Procedure of iTMVAE
In this section, we describe the inference procedure of iTMVAE, i.e. how to draw \pi given a document x. To elaborate, suppose \nu = (\nu_1, \ldots, \nu_{K-1}) is a (K-1)-dimensional vector, where each \nu_k is a random variable sampled from a Kumaraswamy distribution parameterized by a_k and b_k [19, 28]. iTMVAE models the variational posterior q(\nu \mid x) as: (Ideally, the Beta distribution is the most suitable candidate, since \nu is drawn from a GEM distribution in the generative procedure. However, as the Beta distribution does not satisfy the differentiable, non-centered parameterization (DNCP) [17] requirement of SGVB [18], we use the Kumaraswamy distribution instead.)

(4)  q(\nu \mid x) = \prod_{k=1}^{K-1} \mathrm{Kumaraswamy}(\nu_k; a_k, b_k)
(5)  [a; b] = g(x; \theta)

where g(x; \theta) is a neural network with parameters \theta. Then, \pi can be drawn by:

(6)  \pi_k = \nu_k \prod_{j=1}^{k-1} (1 - \nu_j), \quad k = 1, \ldots, K-1
(7)  \pi_K = 1 - \sum_{k=1}^{K-1} \pi_k
In the above procedure, we truncate the infinite sequence of mixture weights to K elements, and \pi_K is always set to 1 - \sum_{k=1}^{K-1} \pi_k to ensure the weights sum to one. Notably, as discussed in [5], the truncation of the variational posterior does not mean that we are using a finite-dimensional prior, since we never truncate the GEM prior. Hence, iTMVAE still has the ability to model the uncertainty of the number of topics and adapt it to data [28].
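The Kumaraswamy draw and the truncated stick-breaking construction can be sketched as follows. The inverse-CDF form of the Kumaraswamy sample is standard; the function names are ours:

```python
import random

def kumaraswamy_sample(a, b, rng=random):
    """Reparameterized draw nu = (1 - (1 - u)^(1/b))^(1/a) with u ~ U(0, 1).
    Given u, the draw is a deterministic, differentiable function of (a, b),
    which is the DNCP property SGVB needs (and which a Beta draw lacks)."""
    u = rng.random()
    return (1.0 - (1.0 - u) ** (1.0 / b)) ** (1.0 / a)

def stick_break(nus):
    """Equations (6)-(7): map K-1 stick fractions to K weights on the simplex."""
    pis, remaining = [], 1.0
    for nu in nus:
        pis.append(nu * remaining)
        remaining *= 1.0 - nu
    pis.append(1.0 - sum(pis))   # pi_K closes the simplex
    return pis
```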
iTMVAE can be optimized by maximizing the Evidence Lower Bound (ELBO):

(8)  \mathcal{L}(x) = \mathbb{E}_{q(\nu \mid x)}\big[\log p(x \mid \pi(\nu))\big] - \mathrm{KL}\big(q(\nu \mid x) \,\|\, p(\nu)\big)

where p(\nu) = \prod_{k=1}^{K-1} \mathrm{Beta}(\nu_k; 1, \alpha) is the product of the prior probability density functions. The details of the optimization can be found in Appendix 1.3.
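The KL term in the ELBO, between the Kumaraswamy posterior and the Beta(1, \alpha) prior, has a closed form via an infinite series [28]. For intuition, a simple Monte Carlo estimate is sketched below (our own illustrative code, not the training estimator; note that Kumaraswamy(1, b) coincides exactly with Beta(1, b), which provides a handy sanity check):

```python
import math
import random

def log_kumaraswamy_pdf(x, a, b):
    # Kumaraswamy density: a * b * x^(a-1) * (1 - x^a)^(b-1)
    return (math.log(a) + math.log(b)
            + (a - 1.0) * math.log(x)
            + (b - 1.0) * math.log(1.0 - x ** a))

def log_beta_1_alpha_pdf(x, alpha):
    # Beta(1, alpha) density: alpha * (1 - x)^(alpha - 1)
    return math.log(alpha) + (alpha - 1.0) * math.log(1.0 - x)

def mc_kl_kumaraswamy_beta(a, b, alpha, n_samples=2000, rng=random):
    """Monte Carlo estimate of KL(Kumaraswamy(a, b) || Beta(1, alpha))
    using reparameterized draws x = (1 - (1 - u)^(1/b))^(1/a)."""
    total = 0.0
    for _ in range(n_samples):
        u = rng.random()
        x = (1.0 - (1.0 - u) ** (1.0 / b)) ** (1.0 / a)
        total += log_kumaraswamy_pdf(x, a, b) - log_beta_1_alpha_pdf(x, alpha)
    return total / n_samples
```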
3.3 Modeling the Uncertainty of Prior Parameter
In the generative procedure, the concentration parameter \alpha of \mathrm{GEM}(\alpha) can have a significant impact on the growth of the number of topics. The larger \alpha is, the more “breaks” it will create, and consequently more topics will be used. Hence, it is generally reasonable to place a hyperprior on \alpha to model its uncertainty [9, 5, 38]. For example, Escobar and West [9] placed a Gamma hyperprior on \alpha for urn-based samplers and implemented the corresponding Gibbs updates with auxiliary-variable methods. Blei et al. [5] also placed a Gamma prior on \alpha and derived a closed-form update for the variational parameters. Different from previous work, we introduce the hyperprior into the AEVB framework and propose to optimize the model by stochastic gradient descent (SGD) methods.
Concretely, since the Gamma distribution is conjugate to \mathrm{Beta}(1, \alpha), we place a \mathrm{Gamma}(s_1, s_2) prior on \alpha. Then the ELBO of iTMVAEHP can be written as:

(9)  \mathcal{L}(x) = \mathbb{E}_{q(\nu \mid x)\, q(\alpha)}\big[\log p(x \mid \pi(\nu))\big] - \mathbb{E}_{q(\alpha)}\big[\mathrm{KL}\big(q(\nu \mid x) \,\|\, p(\nu \mid \alpha)\big)\big] - \mathrm{KL}\big(q(\alpha) \,\|\, p(\alpha)\big)

where p(\alpha) = \mathrm{Gamma}(\alpha; s_1, s_2) is the hyperprior and q(\alpha) = \mathrm{Gamma}(\alpha; \gamma_1, \gamma_2) is the corpus-level variational posterior for \alpha. The derivation of Equation 9 can be found in Appendix 1.4. In our experiments, we find that iTMVAEProd always performs better than iTMVAE; therefore we only report the performance of iTMVAEProd with the hyperprior, and refer to this variant as iTMVAEHP. As discussed in Section 1, the hyperprior technique can also be applied to other AEVB-based models to alleviate the collapse-to-prior problem. In Appendix 1.2, we show that by introducing a hyperprior into SBVAE, more latent units can be activated and the model achieves better performance.
4 Hierarchical iTMVAE
In this section, we describe the generative and inference procedures of HiTMVAE in Section 4.1 and Section 4.2. The relationship between iTMVAE and HiTMVAE is discussed in Section 4.3.
4.1 The Generative Procedure of HiTMVAE
The generation of a document by HiTMVAE is described as follows:

- Get the corpus-level base distribution G_0 = \sum_{k=1}^{\infty} \pi^*_k \delta_{t^*_k}, where \pi^* \sim \mathrm{GEM}(\gamma) and the t^*_k are the shared topics.
- For each document in the corpus:
  - Draw the document-level stick-breaking weights \beta \sim \mathrm{GEM}(\alpha).
  - Draw document-level atoms t_k \sim G_0, k = 1, 2, \ldots; then we get a document-specific distribution G_d = \sum_{k=1}^{\infty} \beta_k \delta_{t_k}.
  - For each word w_n in the document: 1) draw a topic t^{(n)} \sim G_d; 2) draw the word w_n \sim \mathrm{Cat}(t^{(n)}).

To sample the document-level atoms, a series of indicator variables c_k are drawn i.i.d.: c_k \sim \mathrm{Cat}(\pi^*). Then, the document-level atoms are t_k = t^*_{c_k}.
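The indicator-based sampling of document-level atoms can be sketched as follows (illustrative plain Python; names are ours):

```python
import random

def draw_document_atoms(corpus_weights, corpus_atoms, K, rng=random):
    """Sample K document-level atoms by drawing indicators c_k ~ Cat(pi*)
    i.i.d. and copying the pointed-to corpus-level atoms; every document
    therefore reuses atoms from the same shared pool."""
    indicators = rng.choices(range(len(corpus_weights)),
                             weights=corpus_weights, k=K)
    return indicators, [corpus_atoms[c] for c in indicators]
```

Because the indicators point back into the corpus-level pool, two documents can assign very different weights to the same shared topic, which is the source of the extra variability.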
Let D and N_d denote the size of the dataset and the number of words in each document x_d, respectively. After collapsing the per-word topic-assignment random variables, the joint probability of the corpus-level atom weights \pi^*, the documents X, the stick-breaking weights \beta and the indicator variables c can be written as:

(10)  p(X, \beta, c, \pi^*) = p(\pi^* \mid \gamma) \prod_{d=1}^{D} p(\beta_d \mid \alpha)\, p(c_d \mid \pi^*) \prod_{n=1}^{N_d} p(w_{d,n} \mid \beta_d, c_d, \Phi^*)

where X = \{x_d\}_{d=1}^{D}, \beta = \{\beta_d\}_{d=1}^{D}, c = \{c_d\}_{d=1}^{D}, and \Phi^* denotes the parameters of the shared topics.
4.2 The Inference Procedure of HiTMVAE
Setting the truncation levels of the corpus-level and document-level GEMs to T and K, HiTMVAE models the per-document posterior q(\nu, c \mid x) for every document x as:

(11)  q(\nu, c \mid x) = \prod_{k=1}^{K-1} q(\nu_k \mid x) \prod_{k=1}^{K} q(c_k \mid x)
(12)  q(\nu_k \mid x) = \mathrm{Kumaraswamy}(\nu_k; a_k, b_k), \quad [a; b] = g(x; \theta)
(13)  q(c_k \mid x) = \mathrm{Cat}(c_k; \varphi_k)

where g(x; \theta) is a neural network with parameters \theta, and the \varphi_k are the multinomial variational parameters for each document-level indicator variable c_k. Then, \beta can be constructed from \nu by the stick-breaking process.
As shown in Section 4.1, the corpus-level atom weights \pi^* are generated as follows:

(14)  \nu^*_k \sim \mathrm{Beta}(1, \gamma), \quad \pi^*_k = \nu^*_k \prod_{j=1}^{k-1} (1 - \nu^*_j)

The corpus-level variational posterior for \nu^* with truncation level T is q(\nu^*) = \prod_{k=1}^{T-1} \mathrm{Beta}(\nu^*_k; u_k, v_k), where (u, v) are the corpus-level variational parameters.
The ELBO of the training dataset can be written as:

(15)  \mathcal{L} = \mathbb{E}_{q(\nu^*)}\Big[\sum_{d=1}^{D}\Big(\mathbb{E}_{q(\nu_d, c_d \mid x_d)}\big[\log p(x_d \mid \beta_d, c_d, \Phi^*)\big] - \mathrm{KL}\big(q(\nu_d \mid x_d) \,\|\, p(\nu_d \mid \alpha)\big) - \mathrm{KL}\big(q(c_d \mid x_d) \,\|\, p(c_d \mid \pi^*)\big)\Big)\Big] - \mathrm{KL}\big(q(\nu^*) \,\|\, p(\nu^* \mid \gamma)\big)

where \beta_d is constructed from \nu_d by the stick-breaking process, q(\nu_d \mid x_d) and q(c_d \mid x_d) are given by Equations 12 and 13, and \pi^* is constructed from \nu^*. The details of the derivation of the ELBO can be found in Appendix 1.5.
The Gumbel-Softmax estimator [14] is used for backpropagating through the categorical random variables c. Instead of jointly training them with the neural network parameters, mean-field updates are used to learn the corpus-level variational parameters (u, v):

(16)  u_k = 1 + \sum_{d=1}^{D} \sum_{j=1}^{K} \varphi_{d,j,k}, \quad v_k = \gamma + \sum_{d=1}^{D} \sum_{j=1}^{K} \sum_{l=k+1}^{T} \varphi_{d,j,l}
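A minimal sketch of a Gumbel-Softmax sample (the temperature value and the function name are our own choices; during training, the relaxed sample replaces the hard categorical draw so gradients can flow to the logits):

```python
import math
import random

def gumbel_softmax_sample(logits, temperature=0.5, rng=random):
    """Relaxed one-hot sample from Cat(softmax(logits)); as the temperature
    goes to zero the sample approaches a one-hot vector, while gradients
    with respect to the logits stay well-defined."""
    gumbels = [-math.log(-math.log(rng.random() + 1e-20) + 1e-20)
               for _ in logits]
    scores = [(l + g) / temperature for l, g in zip(logits, gumbels)]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```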
4.3 Discussion
In iTMVAE, we get the document-specific topic distribution by sampling the atom weights \pi from a GEM distribution. Instead of being drawn from a continuous base distribution, the atoms (topics) are modeled as trainable parameters as in [4, 37, 25]. Thus, the atoms are naturally shared by all documents without the need for a hierarchical construction like HDP [38]. The hierarchical extension, HiTMVAE, which models the document-specific topic distribution in a hierarchical manner, is more flexible and can generate topic distributions with higher variability. A detailed comparison is illustrated in Section 5.3.
5 Experiments
In this section, we evaluate the performance of iTMVAE and its variants on two public benchmarks: 20News and RCV1V2, and demonstrate the advantage brought by the variants of iTMVAE. To make a fair comparison, we use exactly the same data and vocabulary as [37].
The configuration of the experiments is as follows. We use a two-layer fully-connected neural network for g of Equation 11, and the number of hidden units is set separately for 20News and RCV1V2. The truncation level K in Equation 7 is set so that the maximum number of topics never exceeds the ones used by the baselines (please refer to Table 1 for the topic numbers used by these baselines). The concentration parameter \alpha of the GEM distribution is cross-validated on the validation set for iTMVAE and iTMVAEProd. Batch Renormalization [13] is used to stabilize the training procedure. Adam [16] is used to optimize the model. The code of iTMVAE and its variants is available at http://anonymous.
5.1 Perplexity and Topic Coherence
Perplexity is widely used by topic models to measure goodness-of-fit, and is defined as \exp\big(-\frac{1}{D}\sum_{d=1}^{D}\frac{1}{N_d}\log p(x_d)\big), where D is the number of documents and N_d is the number of words in the d-th document x_d. Following previous work, the variational lower bound is used to estimate the perplexity.
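Assuming the common per-document-normalized definition above (with the ELBO standing in for the intractable log-likelihood), perplexity can be computed as in the following sketch (function name ours):

```python
import math

def perplexity(doc_log_probs, doc_lengths):
    """exp(-(1/D) * sum_d log p(x_d) / N_d); in practice the per-document
    ELBO stands in for log p(x_d)."""
    D = len(doc_log_probs)
    avg = sum(lp / n for lp, n in zip(doc_log_probs, doc_lengths)) / D
    return math.exp(-avg)
```

For example, a model that assigns every word probability 0.5 gets a perplexity of exactly 2.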
As the quality of the learned topics is not directly reflected by perplexity [29], topic coherence is designed to match human judgment. We adopt NPMI [22] as the measure of topic coherence, as is done by [25, 37] (we use the code provided by [22] at https://github.com/jhlau/topic_interpretability/). We define a topic to be an Effective Topic if it becomes the top significant topic of a sample in the training set more than \tau \cdot D times, where D is the training-set size and \tau is a ratio. Following [25], we report the average of topic coherence computed over the top-5 and top-10 words across five random runs, which is more robust [21].
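An illustrative NPMI computation over a topic's top words is sketched below. This is our own simplified version using document-level co-occurrence counts; the reference implementation linked above estimates co-occurrence with sliding windows over an external corpus:

```python
import math
from itertools import combinations

def npmi_coherence(top_words, doc_word_sets):
    """Average NPMI over all pairs of a topic's top words, with
    co-occurrence probabilities estimated at the document level."""
    D = len(doc_word_sets)
    def prob(*words):
        return sum(1 for d in doc_word_sets
                   if all(w in d for w in words)) / D
    scores = []
    for wi, wj in combinations(top_words, 2):
        p_ij = prob(wi, wj)
        if p_ij == 0.0:
            scores.append(-1.0)   # convention: never co-occurring pair
        elif p_ij == 1.0:
            scores.append(1.0)    # convention: always co-occurring pair
        else:
            pmi = math.log(p_ij / (prob(wi) * prob(wj)))
            scores.append(pmi / -math.log(p_ij))
    return sum(scores) / len(scores)
```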
Table 1 shows the perplexity and topic coherence of different topic models on the 20News and RCV1V2 datasets. We can clearly see that our models outperform the baselines, which indicates that they have better goodness-of-fit capability and can discover topics that match human judgment more closely. We can also see that HiTMVAE achieves better perplexity than [39], in which a similar hierarchical construction is used. Note that directly comparing the ELBO-estimated perplexity of HiTMVAE with that of other models is not entirely fair, as HiTMVAE has many more random variables, which usually leads to a looser variational bound. The possible reasons for the good coherence achieved by our models are: 1) the products-of-experts formulation enables the model to learn sharper distributions; 2) the nonparametric characteristic lets the models adapt the number of topics to data, so topics can be sufficiently trained and of high diversity. Table 1 in Appendix 1.1 illustrates the topics learned by iTMVAEProd; please refer to Appendix 1.1 in the supplementary material for more details.
5.2 The Effect of HyperPrior on iTMVAE
In this section, we provide quantitative evaluations of the effect of the hyperprior on iTMVAE. Specifically, a relatively noninformative Gamma hyperprior is imposed on \alpha, and we initialize the global variational parameters \gamma_1 and \gamma_2 of Equation 9 to match the noninformative Gamma prior, so that before training the expectation of \alpha under the variational posterior equals the prior mean. An SGD optimizer is used to optimize \gamma_1 and \gamma_2. No KL-annealing or decoder regularization is used for iTMVAEHP.

Table 2 reports the learned global variational parameters \gamma_1 and \gamma_2, and the expectation of \alpha given the variational posterior, on several subsets of the 20News dataset containing different numbers of classes. (Since there are no labels for the 20News dataset provided by [37], we preprocess the dataset ourselves in this illustrative experiment.) We can see that, once training is done, the variational posterior is very confident, and the expectation of \alpha given the variational posterior is adjusted to the training set: the fewer classes the training set contains, the smaller the learned expectation of \alpha. This indicates that iTMVAEHP can learn to adjust \alpha to data, and thus the number of discovered topics adapts to the data better. In contrast, for iTMVAEProd (without the hyperprior), when the decoder is strong, the number of topics is constrained tightly no matter how many classes the dataset contains, due to the collapse-to-prior problem of AEVB, and the KL-annealing and decoder-regularization tricks do not help much.
Figure 1 illustrates the training-set topic coverage w.r.t. the number of used topics for training sets containing different numbers of classes. Specifically, we compute the average weight of every topic over the training dataset and sort the topics by their average weights. The topic coverage is then defined as the cumulative sum of these weights. Figure 1 shows that, as the number of classes increases, more topics are utilized by iTMVAEHP to reach the same level of topic coverage, which indicates that the model has the ability to adapt to data.
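The coverage curve just described can be computed as follows (a short sketch; the function name is ours):

```python
def topic_coverage(doc_topic_weights):
    """Average each topic's weight over the corpus, sort topics by that
    average in descending order, and return the cumulative sums
    (the coverage curve)."""
    D = len(doc_topic_weights)
    K = len(doc_topic_weights[0])
    avg = sorted((sum(doc[k] for doc in doc_topic_weights) / D
                  for k in range(K)), reverse=True)
    curve, total = [], 0.0
    for w in avg:
        total += w
        curve.append(total)
    return curve
```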
5.3 The Evaluation of HiTMVAE
In this section, by comparing the topic coverage and the sparsity of the posterior topic proportions of iTMVAEProd and HiTMVAE, we show that the hierarchical construction helps the model to learn more topics and to produce posterior topic proportions with higher sparsity. (To compare the sparsity of the posterior topic proportions of each model, we sort the topic weights of every training document, average across the dataset, and plot the logarithm of the average weights against the topic index.)
The model configurations are the same for iTMVAEProd and HiTMVAE, except that is set to and for iTMVAEProd, and for HiTMVAE. For HiTMVAE, the corpuslevel updates are done every epochs on 20News, and epochs on RCV1V2.
As shown in Figure 2, HiTMVAE can learn more topics than iTMVAEProd, and the sparsity of its posterior topic proportions is significantly higher. iTMVAEProd with a smaller \alpha has higher sparsity than with a larger \alpha, but its sparsity is still lower than that of HiTMVAE with the same document-level concentration parameter \alpha, and it can only learn a small number of topics, which means that rare topics may not be learned by the model. This comparison shows that the superior sparsity comes not only from a smaller per-document concentration hyperparameter \alpha, but also from the hierarchical construction itself.
6 Conclusion
In this paper, we propose iTMVAE and iTMVAEProd, nonparametric topic models whose inference is performed by Variational Auto-Encoders. Specifically, a stick-breaking prior is used to generate the atom weights of countably infinite shared topics, and the Kumaraswamy distribution is exploited so that the model can be optimized by the AEVB algorithm. We also propose iTMVAEHP, which introduces a hyperprior into the VAE framework so that the model can adapt better to data. This technique is general and can be incorporated into other VAE-based models to alleviate the collapse-to-prior problem. To further diversify the document-specific topic distributions, we use a hierarchical construction in the generative procedure, and we show that the proposed HiTMVAE can learn more topics and produce sparser posterior topic proportions. The advantage of iTMVAE and its variants over traditional nonparametric topic models is that inference is performed by feed-forward neural networks, which have rich representation capacity and require only limited knowledge of the data. Hence, it is flexible to incorporate more information sources into the model, which we leave to future work. Experimental results on two public benchmarks show that iTMVAE and its variants outperform state-of-the-art baselines.
References
 Archambeau et al. [2015] Cedric Archambeau, Balaji Lakshminarayanan, and Guillaume Bouchard. Latent IBP compound Dirichlet allocation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2):321–333, 2015.
 Bernardo and Smith [2001] José M Bernardo and Adrian FM Smith. Bayesian theory, 2001.
 Blei [2012] David M Blei. Probabilistic topic models. Communications of the ACM, 55(4):77–84, 2012.
 Blei et al. [2003] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent dirichlet allocation. JMLR, 2003.
 Blei et al. [2006] David M Blei, Michael I Jordan, et al. Variational inference for dirichlet process mixtures. Bayesian analysis, 1(1):121–143, 2006.
 Bowman et al. [2016] Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Józefowicz, and Samy Bengio. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349, 2016.
 Card et al. [2017] Dallas Card, Chenhao Tan, and Noah A Smith. A neural framework for generalized topic models. arXiv preprint arXiv:1705.09296, 2017.

 Chen et al. [2017] Xi Chen, Diederik P. Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman, Ilya Sutskever, and Pieter Abbeel. Variational lossy autoencoder. In ICLR, 2017.
 Escobar and West [1995] Michael D Escobar and Mike West. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90(430):577–588, 1995.

 Hinton [2002] Geoffrey E Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.
 Hinton and Salakhutdinov [2009] Geoffrey E Hinton and Ruslan R Salakhutdinov. Replicated softmax: an undirected topic model. In NIPS, pages 1607–1614, 2009.
 Hoffman et al. [2010] Matthew Hoffman, Francis R. Bach, and David M. Blei. Online learning for latent dirichlet allocation. In NIPS. 2010.
 Ioffe [2017] Sergey Ioffe. Batch renormalization: Towards reducing minibatch dependence in batchnormalized models. 2017. URL http://arxiv.org/abs/1702.03275.
 Jang et al. [2017] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbelsoftmax. 2017. URL https://arxiv.org/abs/1611.01144.
 Kim and Sudderth [2011] Dae I Kim and Erik B Sudderth. The doubly correlated nonparametric topic model. In Advances in Neural Information Processing Systems, pages 1980–1988, 2011.
 Kingma and Ba [2015] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
 Kingma and Welling [2014a] Diederik Kingma and Max Welling. Efficient gradientbased inference through transformations between bayes nets and neural nets. In ICML, pages 1782–1790, 2014a.
 Kingma and Welling [2014b] Diederik P Kingma and Max Welling. Autoencoding variational bayes. In ICLR, 2014b.

 Kumaraswamy [1980] Ponnambalam Kumaraswamy. A generalized probability density function for double-bounded random processes. Journal of Hydrology, 46(1-2):79–88, 1980.
 Larochelle and Lauly [2012] Hugo Larochelle and Stanislas Lauly. A neural autoregressive topic model. In NIPS, 2012.
 Lau and Baldwin [2016] Jey Han Lau and Timothy Baldwin. The sensitivity of topic coherence evaluation to topic cardinality. In NAACL HLT, pages 483–487, 2016.
 Lau et al. [2014] Jey Han Lau, David Newman, and Timothy Baldwin. Machine reading tea leaves: Automatically evaluating topic coherence and topic model quality. In EACL, pages 530–539, 2014.
 Lim et al. [2016] Kar Wai Lim, Wray Buntine, Changyou Chen, and Lan Du. Nonparametric bayesian topic modelling with the hierarchical pitman–yor processes. International Journal of Approximate Reasoning, 78:172–191, 2016.
 Miao et al. [2016] Yishu Miao, Lei Yu, and Phil Blunsom. Neural variational inference for text processing. In ICML, pages 1727–1736, 2016.
 Miao et al. [2017] Yishu Miao, Edward Grefenstette, and Phil Blunsom. Discovering discrete latent topics with neural variational inference. In ICML, 2017.
 Mnih and Gregor [2014] Andriy Mnih and Karol Gregor. Neural variational inference and learning in belief networks. In ICML, 2014.
 Murphy [2012] Kevin P Murphy. Machine learning: a probabilistic perspective. MIT press, 2012.
 Nalisnick and Smyth [2017] Eric Nalisnick and Padhraic Smyth. Stickbreaking variational autoencoders. In ICLR, 2017.
 Newman et al. [2010] David Newman, Jey Han Lau, Karl Grieser, and Timothy Baldwin. Automatic evaluation of topic coherence. In NAACL HLT, pages 100–108, 2010.
 Putthividhy et al. [2010] Duangmanee Putthividhy, Hagai T Attias, and Srikantan S Nagarajan. Topic regression multimodal latent dirichlet allocation for image annotation. In CVPR, pages 3408–3415. IEEE, 2010.
 Ranganath et al. [2014] Rajesh Ranganath, Sean Gerrish, and David Blei. Black box variational inference. In AISTATS, 2014.
 Rasiwasia and Vasconcelos [2013] Nikhil Rasiwasia and Nuno Vasconcelos. Latent dirichlet allocation models for image classification. IEEE transactions on pattern analysis and machine intelligence, 35(11):2665–2679, 2013.
 Rezende et al. [2014] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and variational inference in deep latent gaussian models. In ICML, 2014.
 Rogers et al. [2005] Simon Rogers, Mark Girolami, Colin Campbell, and Rainer Breitling. The latent process decomposition of cdna microarray data sets. IEEE/ACM TCBB, 2(2):143–156, 2005.
 Sethuraman [1994] Jayaram Sethuraman. A constructive definition of dirichlet priors. Statistica sinica, 1994.
 Sønderby et al. [2016] Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther. Ladder variational autoencoders. In Advances in Neural Information Processing Systems, pages 3738–3746, 2016.
 Srivastava and Sutton [2017] Akash Srivastava and Charles Sutton. Autoencoding variational inference for topic models. In ICLR, 2017.
 Teh et al. [2006] Yee Whye Teh, Michael I Jordan, Matthew J Beal, and David M Blei. Hierarchical dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581, 2006.
 Wang et al. [2011] Chong Wang, John Paisley, and David Blei. Online variational inference for the hierarchical dirichlet process. In AISTATS, pages 752–760, 2011.
 Wei and Croft [2006] Xing Wei and W Bruce Croft. Ldabased document models for adhoc retrieval. In SIGIR, 2006.