1 Introduction
Probabilistic topic models, such as latent Dirichlet allocation (LDA) (Blei et al., 2003) and Poisson factor analysis (PFA) (Zhou et al., 2012), can discover the underlying semantic themes in a collection of documents and have achieved great success in text analysis. In general, a topic model represents each document as a mixture of latent topics, each of which describes an interpretable semantic concept. Although widely used, vanilla topic models assume that topics are independent and that there is no structure among them, which limits their ability to explore hierarchical thematic structures. To remove this limitation, a series of hierarchical extensions have been proposed, including topic models based on nonparametric Bayesian hierarchical priors (Blei et al., 2010; Paisley et al., 2014), deep PFA (Gan et al., 2015; Henao et al., 2015), the gamma belief network (GBN) (Zhou et al., 2015; Cong et al., 2017), and the Dirichlet belief network (DirBN) (Zhao et al., 2018a). Commonly, these models learn hierarchical topics structured as a directed acyclic graph (DAG), under the assumption that the topics in the upper layers are more general or abstract than those in the lower layers. Revealing hierarchical relations between topics therefore gives the user an intuitive way to better understand text data.
With the development of deep neural networks (DNNs), there is a growing interest in developing neural topic models (NTMs). Most NTMs are based on variational autoencoders (VAEs) (Kingma and Welling, 2013; Rezende et al., 2014), which employ a variational inference network (encoder) to approximate the posterior distribution and a decoder to reconstruct the document's bag-of-words (BOW) representation (Miao et al., 2016; Srivastava and Sutton, 2017; Card et al., 2017). However, most NTMs rely on Gaussian latent variables, which often fail to approximate well the posterior distributions of sparse and nonnegative document latent representations. To address this limitation, Zhang et al. (2018) develop Weibull hybrid autoencoding inference (WHAI) for deep LDA, which infers posterior samples via a hybrid of stochastic-gradient MCMC and autoencoding variational Bayes. As a hierarchical neural topic model, WHAI shows attractive qualities in multilayer document representation learning and hierarchical explainable topic discovery. Compared with traditional Bayesian probabilistic topic models, these NTMs usually enjoy better flexibility and scalability, which are important for modeling large-scale data and performing downstream tasks (Zhang et al., 2019; Wang et al., 2020c; Chen et al., 2020; Wang et al., 2020a; Duan et al., 2021; Zhao et al., 2021).
Despite their attractive performance, existing hierarchical topic models such as GBN often assume in the prior that the topics at each layer are independently drawn from a Dirichlet distribution, ignoring the dependencies between topics both within the same layer and across different layers. To relax this assumption, we propose the Sawtooth Factorial Topic Embeddings Guided GBN (SawETM), a deep generative model of documents that captures the dependencies and semantic similarities between topics in the embedding space. Specifically, as sketched in Fig. 1, both the words and the hierarchical topics are first mapped into a shared embedding space. We then develop the Sawtooth Connection technique to capture the dependencies between topics at different layers: the factor loading at one layer is the factor score at the adjacent layer, which couples the hierarchical topics together across all layers. Our work is inspired by both GBN (Zhou et al., 2015), a hierarchical topic model with multiple stochastic layers, and embedding topic models (Dieng et al., 2019, 2020), which represent the words and single-layer topics as embedding vectors. The proposed Sawtooth Connection is a novel method that combines the advantages of both models for hierarchical topic modeling.
We further note that previous work on NTMs has been restricted to shallow models with one or three layers of stochastic latent variables, which could limit their modeling ability. Generally, owing to the well-known component collapsing problem of VAEs (Sønderby et al., 2016), constructing a deep latent variable model is challenging. As discussed in Child (2020), hierarchical VAEs with sufficient depth can not only learn arbitrary orderings over observed variables but also learn more effective latent variable distributions, if such distributions exist. Beyond text modeling, recent developments in image generation with such models have shown promising performance and outstanding generation ability (Maaløe et al., 2019; Vahdat and Kautz, 2020; Child, 2020). Inspired by this work, we carefully design the inference network of SawETM within a deep hierarchical VAE framework to improve the model's ability to model textual data. In particular, we propose integrating a skip-connected deterministic upward path with a stochastic path to approximate the posterior of the latent variables and obtain hierarchical representations of a document. We also provide customized training strategies for building deeper neural topic models. To the best of our knowledge, SawETM is the first neural topic model that supports a deep network structure well (e.g., 15 layers).
Our main contributions are summarized as follows:

To move beyond the independence assumption between the topics of two adjacent layers in most hierarchical topic models, the Sawtooth Connection technique is developed to extend GBN by capturing the dependencies and semantic similarities between the topics in the embedding space.

To avoid posterior collapse, we carefully design a residual upward-downward inference network for SawETM to improve the model's ability to model count data and to approximate sparse, nonnegative, and skewed document latent representations.

Overall, SawETM, a novel hierarchical NTM equipped with a flexible training algorithm, is proposed to infer multilayer document representations and discover topic hierarchies in both the embedding space and the vocabulary space. Experiments on big corpora show that our model outperforms other NTMs at extracting deeper interpretable topics and deriving better document representations.
2 Related work
The proposed model marries a hierarchical neural topic model with word embeddings, resulting in a deep generative framework for text modeling. The related work can be roughly divided into two categories: research on constructing neural topic models, and research on leveraging word embeddings for topic models.
Neural topic models
Most existing NTMs can be regarded as extensions of Bayesian topic models such as LDA within the VAE framework for text modeling, where the latent variables can be viewed as topic proportions. NTMs usually use a single-layer network as their decoder (Srivastava and Sutton, 2017), whose learnable weight matrix connects topics and words. Different NTMs may place different distributions on the latent variable, such as Gaussian and Dirichlet distributions (Miao et al., 2016; Burkhardt and Kramer, 2019; Nan et al., 2019; Wang et al., 2020b). Unlike these models, which focus on a single layer of latent variables, Zhang et al. (2018) propose WHAI, a hierarchical neural topic model that employs a Weibull upward-downward variational encoder to infer multilayer document latent representations and uses GBN as its decoder. All of these works focus on relatively shallow models with one or three layers of stochastic latent variables. Our work builds on WHAI but differs from it: we propose a new decoder to capture the dependencies between topics and a new, more powerful encoder to approximate the posteriors, which together yield a deeper neural topic model.
Word embedding topic models
Word embeddings can capture word semantics in a low-dimensional continuous space and are well studied in neural language models (Bengio et al., 2003; Mikolov et al., 2013a, b; Levy and Goldberg, 2014). Owing to their ability to capture semantic information, there is a growing interest in applying word embeddings to topic models. Pretrained word embeddings, such as GloVe (Pennington et al., 2014) and word2vec (Mikolov et al., 2013b), can serve as complementary information to guide topic discovery, which effectively alleviates the sparsity issue in topic models (Zhao et al., 2017, 2018b; Li et al., 2016). Dieng et al. (2020) propose an embedding-based topic model (ETM), which directly models the similarity in its generative process rather than via a Dirichlet distribution. Our model is related to ETM in that it models the correlation between words and topics through their semantic similarities. Unlike ETM, the Sawtooth Connection can be seen as injecting the knowledge learned at a lower layer into a higher layer, which alleviates the sparsity issue in the higher layer.
3 The proposed model
In this section, we develop SawETM for text analysis, which aims at mining multilayer document representations and exploring topic hierarchies. The design of SawETM is motivated by two main challenges: building a hierarchical decoder that models the dependencies between topics, and designing expressive neural networks that approximate the posterior distribution more accurately. Below, we first describe the decoder and encoder of SawETM and then discuss the model's properties. Finally, we provide details of model inference and stable training techniques.
3.1 Document decoder: Sawtooth Factorial Topic Embeddings Guided GBN
To explore the hierarchical thematic structure of a collection of documents, SawETM adapts the GBN of Zhou et al. (2015) as its generative module (decoder). Unlike GBN, where the interactions between the topics of two adjacent layers are modeled with a Dirichlet distribution, SawETM uses the Sawtooth Connection (SC) technique to couple hierarchical topics across all layers in a shared embedding space. Specifically, assuming the observations are multivariate count vectors, the generative model of SawETM, from the top layer to the bottom layer, can be expressed as
(1) 
where the count vector of a document (e.g., its bag-of-words representation) is factorized, under the Poisson likelihood, as the product of the factor loading matrix (topics) and the gamma-distributed factor scores (topic proportions), and the hidden units of each layer are further factorized into the product of the factor loading and the hidden units of the next layer, which yields multilayer latent representations. Each layer has its own number of topics. Each topic at each layer is given a distributed representation, an embedding vector, in the semantic space of words; in particular, the words themselves are represented by a word embedding matrix. The factor loading matrix of a layer captures the relationship between the topics of two adjacent layers and is calculated by the SC technique. In detail, SC first calculates the semantic similarities between the topics of two adjacent layers as the inner products of their embedding vectors and then applies softmax normalization so that each column of the loading matrix sums to one. Note that the factor loading at one layer is the factor score at the adjacent layer, which builds the dependency between the parameters of adjacent layers. Repeating this process couples the hierarchical topics together across all layers.
To verify the role of SC, we also consider two simple variants for an ablation study, formulated as
(2)  
(3) 
where Eq. (2) directly models the dependence via learnable parameters, which is a common choice in most NTMs, and Eq. (3) employs a decomposition similar to SC but does not share topic embeddings between two adjacent layers. We refer to the first variant as a deep neural topic model (DNTM) and to the second as a deep embedding topic model (DETM). Note that when the number of layers is set to one, DETM has the same structure as SawETM.
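The Sawtooth Connection described above can be sketched in a few lines. This is an illustration only, not the released implementation: the function names, the list-of-lists matrix layout, and the toy sizes are assumptions made for the example. What it shows is the sharing pattern, in which the embeddings of each layer enter the loading matrices of two successive layers.

```python
import math
import random

def softmax_columns(scores):
    """Column-wise softmax of a row-major matrix (list of lists)."""
    n_rows, n_cols = len(scores), len(scores[0])
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for c in range(n_cols):
        col = [scores[r][c] for r in range(n_rows)]
        m = max(col)  # subtract the column max for numerical stability
        exps = [math.exp(v - m) for v in col]
        total = sum(exps)
        for r in range(n_rows):
            out[r][c] = exps[r] / total
    return out

def sawtooth_loadings(embeddings):
    """embeddings[0] holds word vectors, embeddings[t] the topic vectors of layer t.

    Returns one loading matrix per pair of adjacent layers; the matrix for
    (lower, upper) has one row per lower-layer item and one column per
    upper-layer topic, and every column sums to one.
    """
    loadings = []
    for lower, upper in zip(embeddings[:-1], embeddings[1:]):
        # Inner products between the embeddings of the two adjacent layers.
        scores = [[sum(a * b for a, b in zip(lo, up)) for up in upper] for lo in lower]
        loadings.append(softmax_columns(scores))
    return loadings

rng = random.Random(0)

def random_vectors(n, dim):
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]

# Hypothetical sizes: 6 words, then 4 and 2 topics, embedding size 3.
embeddings = [random_vectors(6, 3), random_vectors(4, 3), random_vectors(2, 3)]
loadings = sawtooth_loadings(embeddings)
```

In this sketch `embeddings[1]` is used for both `loadings[0]` and `loadings[1]`. Giving every loading matrix its own pair of embedding tables instead would correspond to the non-shared decomposition of the second variant.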
3.2 Document encoder: upward and downward encoder networks
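The goal of the encoder is to reparameterize the variational posterior distribution of the topic proportions. While the gamma distribution, which satisfies the nonnegativity constraint and encourages sparsity, appears to be a natural choice, it is not reparameterizable and therefore not amenable to gradient-descent-based optimization. Here we consider the Weibull distribution because a latent variable $S \sim \mathrm{Weibull}(k, \lambda)$, with shape $k$ and scale $\lambda$, can be easily reparameterized as
$$S = \lambda \left(-\ln(1-\epsilon)\right)^{1/k}, \quad \epsilon \sim \mathrm{Uniform}(0,1). \tag{4}$$
Moreover, the KL divergence from the gamma to the Weibull distribution has an analytic expression (Zhang et al., 2018):
$$\mathrm{KL}\left(\mathrm{Weibull}(k,\lambda)\,\|\,\mathrm{Gamma}(\alpha,\beta)\right) = \frac{\gamma\alpha}{k} - \alpha\ln\lambda + \ln k + \beta\lambda\,\Gamma\!\left(1+\frac{1}{k}\right) - \gamma - 1 - \alpha\ln\beta + \ln\Gamma(\alpha),$$
where $\gamma$ is the Euler-Mascheroni constant and $\beta$ is the rate parameter of the gamma distribution. What's more, designing a capacious inference network is necessary for training deep VAEs. Inspired by a series of studies on deep generative models (Sønderby et al., 2016; Zhang et al., 2018; Vahdat and Kautz, 2020; Child, 2020), we develop an upward-downward inference network, which contains a bottom-up residual deterministic path and a top-down stochastic path.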
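The two Weibull properties used above, reparameterized sampling and the closed-form KL divergence to a gamma distribution, can be checked numerically with a short sketch. It assumes a shape/scale parameterization for the Weibull distribution and a shape/rate parameterization for the gamma distribution; the function names are illustrative and are not taken from the authors' code.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def sample_weibull(shape, scale, rng):
    """Reparameterized draw: scale * (-ln(1 - eps)) ** (1 / shape), eps ~ Uniform(0, 1)."""
    eps = rng.random()  # in [0, 1), so 1 - eps stays in (0, 1]
    return scale * (-math.log(1.0 - eps)) ** (1.0 / shape)

def kl_weibull_gamma(shape, scale, alpha, beta):
    """Closed-form KL( Weibull(shape, scale) || Gamma(alpha, rate=beta) )."""
    return (EULER_GAMMA * alpha / shape
            - alpha * math.log(scale)
            + math.log(shape)
            + beta * scale * math.gamma(1.0 + 1.0 / shape)
            - EULER_GAMMA - 1.0
            - alpha * math.log(beta)
            + math.lgamma(alpha))

rng = random.Random(0)
draws = [sample_weibull(2.0, 1.0, rng) for _ in range(20000)]
sample_mean = sum(draws) / len(draws)

# Weibull(1, 2) and Gamma(1, rate=0.5) are the same exponential distribution,
# so the divergence between them should vanish up to rounding.
kl_same = kl_weibull_gamma(1.0, 2.0, 1.0, 0.5)
```

Because the random noise enters only through `eps`, the sample is a deterministic, differentiable function of the shape and scale, which is what makes gradient-based training of the encoder possible.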
Upward-downward inference model:
As in most neural topic models, the exact posterior distribution of the latent variables is intractable and needs to be approximated. Following VAEs (Kingma and Welling, 2013; Rezende et al., 2014), we define a variational distribution, which needs to be flexible enough to approximate the true posterior distribution well. The variational distribution is factorized with a bottom-up structure as
(5) 
Here we emphasize that this hierarchical structure makes the top stochastic latent variables tend to collapse into the prior, especially when the number of layers is large (Sønderby et al., 2016). To address this problem, SawETM first parameterizes a skip-connected deterministic upward path to obtain latent representations of the input. This path is built from an MLP (a two-layer fully connected network), a Linear module (a single linear layer), and a ReLU nonlinear activation function. SawETM then combines the obtained latent features with the prior from the stochastic up-down path to construct the variational posterior:
(6)  
where the concatenation is taken along the topic dimension, and Softplus applies a nonlinearity to each element to ensure positive Weibull shape and scale parameters. The Weibull distribution is used to approximate the gamma-distributed conditional posterior, and its shape and scale parameters are both inferred by neural networks that combine the bottom-up likelihood information with the prior information from the generative distribution. The inference network is structured as
(7) 
Compared with Eq. (5), the residual upward pass in SawETM gives all the latent variables a deterministic dependency on the input, so the top stochastic latent variables can receive information efficiently and are empirically less likely to collapse. Note that, compared with the inference network in WHAI (Zhang et al., 2018), we construct a more powerful residual network structure to better approximate the true posteriors.
3.3 Model properties
SawETM inherits the good properties of both deep topic models and word embeddings, as described below.
Semantic topic taxonomies:
The loading matrices in Eq. (1) capture the semantic correlations between the topics of adjacent layers. Using the law of total expectation, we have
(8) 
Therefore, the product of the loading matrices from the first layer up to a given layer is naturally interpreted as the projection of the topics at that layer to the vocabulary space, providing us with a way to visualize the topics at different semantic layers. The topics in the bottom layers are more specific and become increasingly general when moving upward, as shown in Fig. 4.
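As a small worked example of this projection, the sketch below multiplies the loading matrices of successive layers and reads off the top words of a higher-layer topic. The vocabulary, the numbers, and the helper names are hypothetical and chosen only so that the arithmetic can be followed by hand.

```python
def matmul(a, b):
    """Product of two row-major matrices given as lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def project_to_vocabulary(loadings, layer):
    """Multiply loadings[0] ... loadings[layer - 1] (layer is 1-based).

    Column k of the result is the word distribution of topic k at that layer.
    """
    projected = loadings[0]
    for phi in loadings[1:layer]:
        projected = matmul(projected, phi)
    return projected

def top_words(projected, vocab, topic, n=3):
    """Return the n words with the largest weight in the given topic column."""
    weights = [row[topic] for row in projected]
    order = sorted(range(len(vocab)), key=lambda v: weights[v], reverse=True)
    return [vocab[v] for v in order[:n]]

# Hypothetical toy loadings whose columns sum to one: 4 words -> 2 topics -> 1 topic.
vocab = ["game", "team", "stock", "bank"]
phi1 = [[0.5, 0.0], [0.4, 0.1], [0.1, 0.4], [0.0, 0.5]]
phi2 = [[0.75], [0.25]]
projected = project_to_vocabulary([phi1, phi2], layer=2)
print(top_words(projected, vocab, topic=0, n=2))  # prints ['game', 'team']
```

Since every loading matrix has columns that sum to one, the projected columns also sum to one, so each higher-layer topic can still be read as a distribution over words.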
Hierarchical topics in the same embedding space:
In SawETM, both words and hierarchical topics are represented by embedding vectors, and each topic is defined through the Sawtooth Connection. The first advantage is that the learned words and hierarchical topics can be projected into the same embedding space, as shown in Fig. 3. The second advantage is that SawETM establishes dependencies between different layers, so that the knowledge learned at a lower layer can be injected into a higher layer. The intuition is that there are semantic relations between topics of the same layer (for example, a topic about ‘basketball’ is strongly related to a topic about ‘game’), and these relations should be taken into account by the higher-layer topics. Note that other hierarchical topic models such as GBN usually assume that the hierarchical topics are independent and ignore this semantic structure (Blei et al., 2007), whereas SawETM tries to capture this structure in the embedding space.
3.4 Inference and estimation
Similar to VAEs, the training objective of SawETM is the maximization of an Evidence Lower Bound (ELBO):
(9)  
where the variational distribution is the Weibull distribution in Eq. (6) and the prior is the gamma distribution in Eq. (1). The first term is the expected log-likelihood or reconstruction error, while the second term is the Kullback–Leibler (KL) divergence that constrains the variational distribution to be close to its prior in the generative model. Thanks to the analytic KL expression and the easy reparameterization of the Weibull distribution, the gradient of the ELBO with respect to the model parameters, including those of the inference network, can be accurately evaluated. As described in Algorithm 1, the encoder parameters and decoder parameters in SawETM are updated by SGD, which makes inference faster than Gibbs sampling at both training and test time. This also distinguishes SawETM from WHAI, which updates the global parameters by SG-MCMC and is limited to updating local and global parameters alternately.
3.5 Stable training
Training a deep hierarchical VAE is a major optimization challenge in practice, owing to the well-known posterior collapse and the unbounded KL divergence in the objective (Razavi et al., 2019; Child, 2020). Here, we propose three approaches for stabilizing training. We emphasize that, for a fair comparison, all of these approaches are also applied to WHAI and the other hierarchical neural topic model variants in our experiments.
Shape parameter skipping of Weibull distribution:
As shown in Eq. (4), when the sampled noise is close to 1 and the Weibull shape parameter is less than 1e-3, the sampled value becomes extremely large, which can destabilize the training process. In practice, we constrain the shape parameter from below to avoid extreme values. A similar setting can be found in Fan et al. (2020), who treat the shape parameter as a hyperparameter and fix its value.
Warm-up:
The variational training criterion in Eq. (10) contains the likelihood term and the variational regularization term. During the early training stage, the variational regularization term causes some of the latent units to become inactive before they learn useful representations (Sønderby et al., 2016). We solve this problem by first training the parameters using only the reconstruction error and then adding the KL loss gradually with a temperature coefficient:
(10) 
where the temperature coefficient is increased from 0 to 1 during the first N training epochs. This idea has been considered in previous VAE-based algorithms (Raiko et al., 2007; Higgins et al., 2016).
Gradient clipping:
Optimizing the unbounded KL loss often causes sharp gradients during training (Child, 2020). We address this by clipping any gradient whose L2 norm exceeds a certain threshold, which we set to 20 in all experiments. This technique is easy to implement and allows the networks to train smoothly.
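The three stabilization techniques of this section can be summarized in a framework-free sketch. Several details here are assumptions rather than the authors' settings: the linear shape of the warm-up schedule, the lower bound passed to `floor_shape`, and all function names are hypothetical. Only the clipping threshold of 20 is taken from the text.

```python
import math

def kl_weight(epoch, warmup_epochs):
    """Warm-up coefficient: 0 at epoch 0, rising linearly to 1 at warmup_epochs."""
    if warmup_epochs <= 0:
        return 1.0
    return min(1.0, max(0.0, epoch / warmup_epochs))

def warmup_loss(reconstruction_loss, kl_loss, epoch, warmup_epochs):
    """Training loss with the KL term scaled by the warm-up coefficient."""
    return reconstruction_loss + kl_weight(epoch, warmup_epochs) * kl_loss

def clip_by_global_norm(grads, max_norm=20.0):
    """Rescale a flat list of gradient values so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    return [g * (max_norm / norm) for g in grads]

def floor_shape(shape, min_shape):
    """Keep a Weibull shape parameter at or above a chosen lower bound."""
    return max(shape, min_shape)

# Example: halfway through a 10-epoch warm-up the KL term counts half.
loss = warmup_loss(reconstruction_loss=100.0, kl_loss=8.0, epoch=5, warmup_epochs=10)
clipped = clip_by_global_norm([30.0, 40.0])  # norm 50 is rescaled to norm 20
```

In a PyTorch training loop the same clipping is usually done with `torch.nn.utils.clip_grad_norm_`, and the shape bound with a clamp on the encoder output.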
4 Experiments
SawETM is proposed for extracting deep latent features and analyzing documents in an unsupervised manner, so in this section we evaluate its effectiveness on unsupervised learning tasks. Specifically, four widely used performance measures of topic models are used in the experiments: perplexity, topic quality, and document clustering accuracy and normalized mutual information. The experiments are conducted on four real-world datasets, including both regular and big corpora. To further understand the proposed model and verify our motivation, we visually inspect the learned word and topic embeddings as well as the topic hierarchies. Our code is available at https://github.com/BoChenGroup/SawETM.
4.1 Experimental settings
Datasets. We run our experiments on four widely used benchmark corpora: R8, 20Newsgroups (20NG), Reuters Corpus Volume I (RCV1), and PG19. The R8 dataset is a subset of the Reuters 21578 dataset and consists of documents from 8 different review groups; it is partitioned into a training set and a testing set. The 20NG dataset consists of documents from different news groups and is likewise split into a training set and a testing set. The RCV1 dataset is described in Lewis et al. (2004). The PG19 dataset is extracted from Project Gutenberg (Rae et al., 2019) and consists of books. We first build a vocabulary from this dataset and then split each book into segments of a fixed number of tokens, each of which is treated as a document. For data processing, we first preprocess all the datasets by cleaning and tokenizing the text, then remove stop words and low-frequency words appearing fewer than 5 times, and finally select the N most frequent words to build a vocabulary. Note that the R8 and 20NG datasets are used for the document clustering experiments, as they have ground-truth labels.
Baselines. We compare SawETM with basic Bayesian topic models and neural topic models: 1. LDA group, single-layer topic models, including latent Dirichlet allocation (LDA) (Blei et al., 2003), which is a basic probabilistic topic model; LDA with products of experts (AVITM) (Srivastava and Sutton, 2017), which replaces the mixture model in LDA with a product of experts and uses variational inference to update the parameters; and LDA with word embeddings (ETM) (Dieng et al., 2020), a generative model of documents that marries traditional topic models with word embeddings. 2. DLDA group, hierarchical topic models, including deep latent Dirichlet allocation inferred by Gibbs sampling (DLDA-Gibbs) (Zhou et al., 2015) and by TLASGR-MCMC (DLDA-TLASGR) (Cong et al., 2017); and the Weibull hybrid autoencoding inference model for deep latent Dirichlet allocation (WHAI) (Zhang et al., 2018), which employs a deep variational encoder to infer hierarchical document representations and updates the parameters by a hybrid of stochastic-gradient MCMC and variational inference. 3. Variants group, variants of SawETM, including the deep neural topic model (DNTM) introduced in Eq. (2), which directly models the dependence via learnable parameters; and the deep embedding topic model (DETM) introduced in Eq. (3), which employs a decomposition similar to SC but does not share topic embeddings between two adjacent layers. These two variants use the same encoder as SawETM. For all the baseline models, we use their official default parameters with the best reported settings.
Setting. The hyperparameter settings used for the GBN group are similar to those used in Zhang et al. (2018). For the hierarchical topic models, the network structure of the 15-layer models is set to [256, 224, 192, 160, 128, 112, 96, 80, 64, 56, 48, 40, 32, 16, 8]. For the embedding-based topic models such as ETM, DETM, and SawETM, we set the embedding size to 100. For the NTMs, we set the hidden size to 256. For optimization, the Adam optimizer (Kingma and Ba, 2014) is used. The mini-batch size is set to 200 in all experiments. All experiments are performed on an Nvidia GTX 8000 GPU and implemented in PyTorch.
4.2 Per-heldout-word perplexity
Per-heldout-word perplexity is a widely used performance measure of topic models. Similar to Zhou et al. (2015), for each corpus we randomly select 80% of the word tokens from each document to form a training matrix and hold out the remaining 20% to form a testing matrix. We use the training matrix to train the model and calculate the per-heldout-word perplexity on the testing matrix, averaging the model's predictions over the total number of collected samples.
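A minimal sketch of this evaluation is given below. It assumes the usual definition from the GBN line of work (Zhou et al., 2015), in which the first-layer Poisson rates are summed over the collected samples, normalized over the vocabulary for each document, and scored against the held-out counts. The argument layout and the function name are hypothetical, and the rates are assumed to be positive wherever a held-out count is positive.

```python
import math

def heldout_perplexity(test_counts, samples):
    """Per-heldout-word perplexity.

    test_counts[v][n]: held-out count of word v in document n.
    samples: list of (phi, theta) pairs, with phi of size V x K (topics over
    words) and theta of size K x N (topic weights per document).
    """
    n_words, n_docs = len(test_counts), len(test_counts[0])
    # Accumulate the unnormalized Poisson rates over all collected samples.
    rates = [[0.0] * n_docs for _ in range(n_words)]
    for phi, theta in samples:
        for v in range(n_words):
            for n in range(n_docs):
                rates[v][n] += sum(phi[v][k] * theta[k][n] for k in range(len(theta)))
    log_lik, total = 0.0, 0
    for n in range(n_docs):
        norm = sum(rates[v][n] for v in range(n_words))  # normalize over the vocabulary
        for v in range(n_words):
            if test_counts[v][n] > 0:
                log_lik += test_counts[v][n] * math.log(rates[v][n] / norm)
                total += test_counts[v][n]
    return math.exp(-log_lik / total)

# Sanity check: a single uniform topic over 4 words gives perplexity 4.
test_counts = [[1, 0], [2, 1], [0, 3], [1, 1]]
uniform_phi = [[0.25], [0.25], [0.25], [0.25]]
theta = [[2.0, 5.0]]
ppl = heldout_perplexity(test_counts, [(uniform_phi, theta)])
```

A model that predicts held-out words no better than a uniform guess over the vocabulary scores a perplexity equal to the vocabulary size, so lower values indicate better predictions.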
Fig. 2(a)-(c) show how the perplexity changes as a function of the number of layers for various models on three different datasets. For RCV1 and PG19, which are too large to run Gibbs sampling, we omit DLDA-Gibbs and only include DLDA-TLASGR for comparison. In the LDA group, ETM performs best, ahead of LDA and AVITM, which can be attributed to its powerful word embedding decoder (Dieng et al., 2020). LDA performs better than AVITM, which is not surprising, as this batch algorithm can sample from the true posterior given enough Gibbs sampling iterations. However, these models are limited to a single shallow layer and cannot benefit from a deep structure, which explains the gap with the second group of models.
Among the DLDA-based models in the second group, DLDA-Gibbs outperforms the other models, which can be attributed to its more accurate posterior estimation, while DLDA-TLASGR, a mini-batch algorithm, performs slightly worse in out-of-sample prediction. As a hierarchical neural topic model, WHAI performs worse than DLDA-Gibbs and DLDA-TLASGR. Meanwhile, WHAI with a single hidden layer clearly outperforms AVITM, indicating that the Weibull distribution is more appropriate than the logistic normal distribution for modeling the document latent representation. Besides, the performance of DLDA-Gibbs and DLDA-TLASGR can be effectively improved by increasing the number of layers, while WHAI fails to improve when the number of layers becomes greater than three. This may be because all layers of a neural topic model are trained jointly by SGD, which makes it difficult to learn a meaningful prior (Wu et al., 2021). As in deep VAEs, this phenomenon is called posterior collapse (Sønderby et al., 2016; Maaløe et al., 2019).
With the powerful word embedding decoder and the effective Weibull upward-downward variational encoder, DETM from the variants group achieves a significant performance improvement with a single layer. However, like WHAI, it shows no clear improvement when the number of layers becomes greater than three. Benefiting from the SC module between layers, the knowledge learned at lower layers can flow to the upper layers, which helps the higher layers learn meaningful topics and results in a better prior learned by SawETM. SawETM further improves as the number of layers grows and achieves performance comparable with DLDA-Gibbs and DLDA-TLASGR. Note that, although SawETM does not improve significantly over DLDA-Gibbs and DLDA-TLASGR, those two methods require iterative sampling to infer latent document representations at test time, whereas SawETM infers latent representations via direct projection, which makes it both scalable to large corpora and fast in out-of-sample prediction. Besides, thanks to SC and the improved encoder, SawETM significantly outperforms the other NTMs.
4.3 Topic quality
A good topic model should provide interpretable topics. In this section, we measure the model's performance in terms of topic interpretability (Dieng et al., 2020). Specifically, topic coherence and topic diversity are combined to evaluate topic quality. Topic coherence is obtained by taking the average normalized pointwise mutual information (NPMI) of the top 20 words of each topic (Aletras and Stevenson, 2013). It provides a quantitative measure of the interpretability of a topic (Mimno et al., 2011). The second metric is topic diversity (Dieng et al., 2020), which denotes the percentage of unique words in the top 20 words of all topics. A diversity close to 1 means more diverse topics. Topic quality is defined as the product of topic coherence and topic diversity.
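The diversity and quality computations can be sketched as follows. The coherence value is taken as an input here, since computing NPMI requires co-occurrence statistics from a reference corpus, and the function names and the toy topic-word matrix are hypothetical.

```python
def top_word_ids(topic_word, n_top):
    """topic_word[k][v] is the weight of word v in topic k; return the top ids per topic."""
    return [sorted(range(len(row)), key=lambda v: row[v], reverse=True)[:n_top]
            for row in topic_word]

def topic_diversity(topic_word, n_top=20):
    """Fraction of unique words among the top words of all topics."""
    tops = top_word_ids(topic_word, n_top)
    unique = {v for ids in tops for v in ids}
    return len(unique) / sum(len(ids) for ids in tops)

def topic_quality(coherence, diversity):
    """Topic quality as the product of coherence and diversity."""
    return coherence * diversity

# Toy example with 2 topics over 4 words and the top 2 words per topic.
toy_topics = [[0.5, 0.3, 0.1, 0.1],
              [0.4, 0.1, 0.4, 0.1]]
diversity = topic_diversity(toy_topics, n_top=2)  # top ids {0, 1} and {0, 2}: 3 unique of 4
quality = topic_quality(coherence=0.2, diversity=diversity)
```

Multiplying the two scores penalizes a model that reaches high coherence by repeating the same few words across many topics.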
Fig. 2(d)-(f) show the topic quality at different layers for various models on three different datasets. Clearly, DLDA performs best in terms of topic quality, especially at the higher layers, which is not surprising, as all its parameters are updated by Gibbs sampling or TLASGR-MCMC (Cong et al., 2017). Thanks to the use of TLASGR-MCMC rather than a simple SGD procedure, WHAI consistently outperforms DNTM and DETM, which update all their parameters by SGD. Although DETM is equipped with a powerful word embedding decoder, its topic quality clearly decreases as the number of layers increases. These results suggest that it is difficult for NTMs to learn meaningful hierarchical topics, probably because NTMs often suffer from the posterior collapse problem of VAEs, which makes it hard to learn deeper semantic structure. SawETM, however, achieves performance comparable to DLDA and clearly outperforms the other deep neural topic models. As discussed in Sec. 4.2, this improvement comes from the SC module. The topic quality results also agree with the perplexity results shown in Fig. 2(a)-(c).
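Table 1: Clustering accuracy (AC) and normalized mutual information (NMI) on 20News and R8.

| Model | Layer | 20News AC | 20News NMI | R8 AC | R8 NMI |
|---|---|---|---|---|---|
| LDA | 1 | 46.52 | 45.15 | 51.41 | 40.47 |
| AVITM | 1 | 48.31 | 46.33 | 52.43 | 41.20 |
| ETM | 1 | 49.79 | 48.40 | 55.34 | 41.28 |
| PGBN | 1 | 46.62 | 45.43 | 51.67 | 40.76 |
| PGBN | 5 | 48.33 | 46.51 | 54.21 | 41.21 |
| WHAI | 1 | 49.43 | 46.56 | 57.86 | 42.31 |
| WHAI | 5 | 49.51 | 46.98 | 60.45 | 43.98 |
| DNTM | 1 | 49.17 | 46.32 | 57.58 | 42.12 |
| DNTM | 5 | 49.25 | 46.79 | 59.93 | 43.90 |
| DETM | 1 | 50.24 | 48.69 | 61.21 | 43.45 |
| DETM | 5 | 50.33 | 48.87 | 61.86 | 44.12 |
| SawETM | 5 | 51.25 | 50.77 | 63.82 | 45.90 |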
4.4 Document clustering
We consider a document clustering task, in which clusters are predicted for the test documents, to evaluate the quality of the latent document representations extracted by these models. In detail, we use the trained topic models to extract the latent representations of the testing documents and then use k-means to predict the clusters. Accuracy (AC) and normalized mutual information (NMI) are used to measure the clustering performance; for both, higher is better. Table 1 shows the clustering results of all the models on the 20NG and R8 datasets. It can be observed that, with the powerful word embedding decoder and the Sawtooth Connection, SawETM extracts more expressive document latent representations and outperforms the other models included for comparison.
4.5 Qualitative analysis
One of the most appealing properties of SawETM is its interpretability: we can visually inspect the inferred topics at different layers and the inferred connection weights between the topics of adjacent layers. Specifically, we conduct an extensive qualitative evaluation of the topics discovered by SawETM, covering word embeddings, topic embeddings, and topic hierarchies. In this section, we use a 15-layer SawETM trained on PG19 to visualize the embedding space and the hierarchical structure.
Visualisation of embedding space
The top 10 words from six topics are visualized in Fig. 3(a) using t-SNE (Van der Maaten and Hinton, 2008). As we can see, the words under the same topic stay close together, and the words under different topics are far apart. Besides, the words under the same topic are semantically more similar, which demonstrates that the learned word embeddings are meaningful. Apart from the word embeddings, we also visualize the topic embeddings. As shown in Fig. 3(b), we select the top 2 subtopics of each topic at the second layer. We can see that the subtopics of the same topic are semantically similar and lie closer together in the embedding space. These experiments show that the proposed SawETM learns not only meaningful word embeddings but also meaningful topic embeddings. More importantly, the learned word and hierarchical topic embeddings can be projected into the same space, which further supports our motivation.
Hierarchical structure of topic model:
Owing to space limitations, we only visualize topics at the top two layers and the bottom two layers. As shown in Fig. 4, the semantic meaning of each topic and the connections between the topics of adjacent layers are highly interpretable. In particular, SawETM can learn meaningful hierarchical topics at higher layers, indicating that it is able to support a deep structure.
5 Conclusion
In this paper, we propose SawETM, a deep neural topic model that captures the dependencies and semantic similarities between the topics at different layers. We design a skip-connected upward-downward inference network to approximate the posterior distribution of a document. With the Sawtooth Connection technique, SawETM provides a different view of deep topic models and further improves the performance of neural deep topic models. As a fully variational deep topic model, SawETM can be optimized by SGD. Extensive experiments show that SawETM achieves performance comparable to the state-of-the-art models on perplexity, document clustering, and topic quality. In addition, with its learned word embeddings, topic embeddings, and topic hierarchies, SawETM can discover interpretable structured topics, which helps to gain a better understanding of text data.
Acknowledgments
Bo Chen acknowledges the support of NSFC (61771361), Shaanxi Youth Innovation Team Project, the 111 Project (No. B18039) and the Program for Oversea Talent by Chinese Central Government.
References
 Evaluating topic coherence using distributional semantics. In Proceedings of the 10th International Conference on Computational Semantics (IWCS 2013)–Long Papers, pp. 13–22. Cited by: §4.3.
 A neural probabilistic language model. Journal of machine learning research 3 (Feb), pp. 1137–1155. Cited by: §2.
 The nested chinese restaurant process and bayesian nonparametric inference of topic hierarchies. Journal of the ACM (JACM) 57 (2), pp. 1–30. Cited by: §1.
 A correlated topic model of science. The Annals of Applied Statistics 1 (1), pp. 17–35. Cited by: §3.3.
 Latent dirichlet allocation. Journal of machine Learning research 3 (Jan), pp. 993–1022. Cited by: §1, §4.1.
 Decoupling sparsity and smoothness in the dirichlet variational autoencoder topic model. Journal of Machine Learning Research 20 (131), pp. 1–27. Cited by: §2.
 A neural framework for generalized topic models. arXiv preprint arXiv:1705.09296. Cited by: §1.
 Bidirectional convolutional poisson gamma dynamical systems. In Advances in Neural Information Processing Systems, Vol. 33. Cited by: §1.

 Very deep vaes generalize autoregressive models and can outperform them on images. arXiv preprint arXiv:2011.10650. Cited by: §1, §3.2, §3.5, §3.5.
 Deep latent dirichlet allocation with topic-layer-adaptive stochastic gradient riemannian mcmc. arXiv preprint arXiv:1706.01724. Cited by: §1, §4.1, §4.3.
 The dynamic embedded topic model. arXiv preprint arXiv:1907.05545. Cited by: §1.
 Topic modeling in embedding spaces. Transactions of the Association for Computational Linguistics 8, pp. 439–453. Cited by: §1, §2, §4.1, §4.2, §4.3.

 EnsLM: ensemble language model for data diversity by semantic clustering. In ACL-IJCNLP 2021: The Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing. Cited by: §1.
 Bayesian attention modules. arXiv preprint arXiv:2010.10604. Cited by: §3.5.
 Scalable deep poisson factor analysis for topic modeling. In International Conference on Machine Learning, pp. 1823–1832. Cited by: §1.
 Deep poisson factor modeling. Advances in Neural Information Processing Systems 28, pp. 2800–2808. Cited by: §1.
 Beta-vae: learning basic visual concepts with a constrained variational framework. Cited by: §3.5.
 Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.1.
 Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §1, §3.2.
 Neural word embedding as implicit matrix factorization. Advances in neural information processing systems 27, pp. 2177–2185. Cited by: §2.
 Rcv1: a new benchmark collection for text categorization research. Journal of machine learning research 5 (Apr), pp. 361–397. Cited by: §4.1.
 Topic modeling for short texts with auxiliary word embeddings. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, pp. 165–174. Cited by: §2.
 Biva: a very deep hierarchy of latent variables for generative modeling. In Advances in neural information processing systems, pp. 6551–6562. Cited by: §1, §4.2.
 Neural variational inference for text processing. In International conference on machine learning, pp. 1727–1736. Cited by: §1, §2.
 Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. Cited by: §2.
 Distributed representations of words and phrases and their compositionality. arXiv preprint arXiv:1310.4546. Cited by: §2.
 Optimizing semantic coherence in topic models. In Proceedings of the 2011 conference on empirical methods in natural language processing, pp. 262–272. Cited by: §4.3.
 Topic modeling with wasserstein autoencoders. arXiv preprint arXiv:1907.12374. Cited by: §2.
 Nested hierarchical dirichlet processes. IEEE transactions on pattern analysis and machine intelligence 37 (2), pp. 256–270. Cited by: §1.
 Glove: global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543. Cited by: §2.
 Compressive transformers for long-range sequence modelling. arXiv preprint. Cited by: §4.1.
 Building blocks for variational bayesian learning of latent variable models. Journal of Machine Learning Research 8 (1). Cited by: §3.5.
 Preventing posterior collapse with delta-vaes. arXiv preprint arXiv:1901.03416. Cited by: §3.5.

 Stochastic backpropagation and approximate inference in deep generative models. In International conference on machine learning, pp. 1278–1286. Cited by: §1, §3.2.
 Ladder variational autoencoders. In Advances in neural information processing systems, pp. 3738–3746. Cited by: §1, §3.2, §3.2, §3.5, §4.2.
 Autoencoding variational inference for topic models. arXiv preprint arXiv:1703.01488. Cited by: §1, §2, §4.1.
 Nvae: a deep hierarchical variational autoencoder. arXiv preprint arXiv:2007.03898. Cited by: §1, §3.2.
 Visualizing data using t-sne. Journal of machine learning research 9 (11). Cited by: §4.5.
 Deep relational topic modeling via graph poisson gamma belief network. Advances in Neural Information Processing Systems 33. Cited by: §1.
 Neural topic modeling with bidirectional adversarial training. arXiv preprint arXiv:2004.12331. Cited by: §2.

 Learning dynamic hierarchical topic graph with graph convolutional network for document classification. In International Conference on Artificial Intelligence and Statistics, pp. 3959–3969. Cited by: §1.
 Greedy hierarchical variational autoencoders for large-scale video prediction. arXiv preprint arXiv:2103.04174. Cited by: §4.2.
 WHAI: weibull hybrid autoencoding inference for deep topic modeling. arXiv preprint arXiv:1803.01328. Cited by: §1, §2, §3.2, §3.2, §4.1, §4.1.
 Variational hetero-encoder randomized gans for joint image-text modeling. arXiv preprint arXiv:1905.08622. Cited by: §1.
 Dirichlet belief networks for topic structure learning. arXiv preprint arXiv:1811.00717. Cited by: §1.
 Inter and intra topic structure learning with word embeddings. In International Conference on Machine Learning, pp. 5892–5901. Cited by: §2.
 A word embeddings informed focused topic model. In Asian Conference on Machine Learning, pp. 423–438. Cited by: §2.
 Topic modelling meets deep neural networks: a survey. arXiv preprint arXiv:2103.00498. Cited by: §1.
 The poisson gamma belief network. Advances in Neural Information Processing Systems 28, pp. 3043–3051. Cited by: §1, §1, §3.1, §4.1, §4.2.
 Beta-negative binomial process and poisson factor analysis. In Artificial Intelligence and Statistics, pp. 1462–1471. Cited by: §1.