With the successes of deep learning models, neural variational inference (NVI) (miao2016neural)—also called variational autoencoders (VAEs)—has emerged as an important tool for neural-based, probabilistic modeling (kingma2013auto; rezende2014stochastic; ranganath2014black). NVI is relatively straightforward when dealing with continuous random variables, but necessitates more complicated approaches for discrete random variables.
These difficulties matter when we use discrete variables to model data, such as capturing both syntactic and semantic/thematic word dynamics in natural language processing (NLP). Short-term memory architectures have enabled recurrent neural networks (RNNs) to capture local, syntactically-driven lexical dependencies, but they can still struggle to capture longer-range, thematic dependencies (sutskever2014sequence; mikolov2014learning; chowdhury-zamparelli-2018-rnn; lakretz-etal-2019-emergence). Topic modeling, with its ability to effectively cluster words into thematically-similar groups, has a rich history in NLP and semantics-oriented applications (blei2003latent; blei2006dynamic; wang2011action; crain2010dialect; wahabzada2016plant; nguyen2012duplicate, i.a.). However, topic models can struggle to capture shorter-range dependencies among the words in a document (Dieng0GP17). This suggests the two approaches are naturally complementary (Srivastava2017AutoencodingVI; zhang2018whai; tan2016learning; samatthiyadikun2018supervised; miao2016neural; gururangan-etal-2019-variational; Dieng0GP17; Wang2017TopicCN; wen2018latent; lau2017topically).
NVI has enabled such combined recurrent topic models to be studied, but with one of two primary modifications: the discrete variables are reparametrized and then sampled, or each word's topic assignment is analytically marginalized out prior to performing any learning. However, previous work has shown that topic models that preserve explicit topics yield higher-quality topics than similar models that do not (may-2015-topic), and recent work has shown that topic models with relatively consistent word-level topic assignments are preferred by end-users (Lund2019AutomaticAH). Together, these suggest that there are benefits to preserving these assignments that are absent from standard RNN-based language models. Specifically, preserving word-level topics within a recurrent neural language model may both improve language prediction and yield higher-quality topics.
To illustrate this idea of thematic vs. syntactic importance, consider the sentence “She received bachelor’s and master’s degrees in electrical engineering from Anytown University.” While a topic model can easily learn to group these “education” words together, an RNN is designed explicitly to capture the sequential dependencies, and thereby predict coordination (“and”), metaphorical (“in”) and transactional (“from”) concepts given the previous thematically-driven tokens.
In this paper we reconsider core modeling decisions made by a previous recurrent topic model (Dieng0GP17), and demonstrate how the discrete topic assignments can be maintained in both learning and inference without resorting to reparametrizing them. In this reconsideration, we present a simple yet efficient mechanism to learn the dynamics between thematic and non-thematic (e.g., syntactic) words. We also argue the design of the model’s priors still provides a key tool for language understanding, even when using neural methods. The main contributions of this work are:
We provide a recurrent topic model and NVI algorithm that (a) explicitly considers the surface dynamics of modeling with thematic and non-thematic words and (b) maintains word-level, discrete topic assignments without relying on reparametrizing or otherwise approximating them.
We analyze this model, both theoretically and empirically, to understand the benefits the above modeling decisions yield. This yields a deeper understanding of certain limiting behavior of this model and inference algorithm, especially as it relates to past efforts.
We show that careful consideration of priors and the probabilistic model still matter when using neural methods, as they can provide greater control over the learned values/distributions. Specifically, we find that a Dirichlet prior for a document’s topic proportions provides fine-grained control over the statistical structure of the learned variational parameters vs. using a Gaussian distribution.
Our code, scripts, and models are available at https://github.com/mmrezaee/VRTM.
Latent Dirichlet Allocation (blei2003latent, LDA) defines an admixture model over the words in documents. Each word in a document is stochastically drawn from one of K topics—discrete distributions over vocabulary words. We use the discrete variable z_n to represent the topic that the n-th word is generated from. We formally define the well-known generative story as drawing document-topic proportions θ ~ Dirichlet(α), then drawing individual topic-word assignments z_n ~ Categorical(θ), and finally generating each word w_n ~ Categorical(β_{z_n}), based on the topics β (β_{k,w} is the conditional probability of word w given topic k).
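As a concrete illustration, the generative story above can be sketched directly in code. This is a minimal, self-contained sketch (function names and the Gamma-draw trick for Dirichlet sampling are our own illustrative choices, not part of the paper's implementation):

```python
import random

def sample_dirichlet(alpha):
    """Draw theta ~ Dirichlet(alpha) by normalizing independent Gamma draws."""
    draws = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_categorical(probs):
    """Draw an index from a discrete distribution given as a list of probabilities."""
    r, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

def generate_document(alpha, beta, n_words):
    """LDA story: theta ~ Dir(alpha); for each word, z_n ~ Cat(theta), w_n ~ Cat(beta[z_n])."""
    theta = sample_dirichlet(alpha)
    doc = []
    for _ in range(n_words):
        z = sample_categorical(theta)      # topic assignment for position n
        w = sample_categorical(beta[z])    # word drawn from that topic's distribution
        doc.append((z, w))
    return doc
```

Here `beta` is a K x V list of topic-word distributions; each generated document records the (z_n, w_n) pairs that standard inference must recover.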
Two common ways of learning an LDA model are Monte Carlo sampling techniques, which iteratively sample states for the latent random variables in the model, and variational/EM-based methods, which minimize the distributional distance between the posterior and an approximation to that posterior controlled by learnable parameters. Commonly, this means minimizing the KL divergence with respect to those variational parameters. We focus on variational inference as it cleanly allows neural components to be used.
Supported by work that has studied VAEs for a variety of distributions (figurnov2018implicit; naesseth2016reparameterization), the use of deep learning models in the VAE framework for topic modeling has received recent attention (Srivastava2017AutoencodingVI; zhang2018whai; tan2016learning; samatthiyadikun2018supervised; miao2016neural; gururangan-etal-2019-variational). This complements other work that focused on extending the concept of topic modeling using undirected graphical models (srivastava2013modeling; hinton2009replicated; larochelle2012neural; mnih2014neural).
Traditionally, stop words have been a nuisance when trying to learn topic models. Many existing algorithms ignore stop-words in initial pre-processing steps, though there have been efforts that try to alleviate this problem. wallach2009rethinking suggested using an asymmetric Dirichlet prior distribution to include stop-words in some isolated topics. background_ref introduced a constant background factor derived from log-frequency of words to avoid the need for latent switching, and paul2015phd studied structured residual models that took into consideration the use of stop words. These approaches do not address syntactic language modeling concerns.
A recurrent neural network (RNN) can address those concerns. A basic RNN cell aims to predict each word w_n given the previously seen words w_1, …, w_{n-1}. Words are represented with embedding vectors x_n, and the model iteratively computes a new representation h_n = f(h_{n-1}, x_n), where f is generally a non-linear function.
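The recurrence h_n = f(h_{n-1}, x_n) can be sketched as a single step with plain lists; the tanh non-linearity and the weight names (W_h, W_x, b) are standard illustrative choices, not the paper's exact cell:

```python
import math

def rnn_step(h_prev, x, W_h, W_x, b):
    """One recurrence h_n = tanh(W_h h_{n-1} + W_x x_n + b).

    h_prev: hidden state (length H); x: input embedding (length E);
    W_h: H x H; W_x: H x E; b: length H.
    """
    h_new = []
    for i in range(len(b)):
        pre = b[i]
        pre += sum(w * h for w, h in zip(W_h[i], h_prev))
        pre += sum(w * xj for w, xj in zip(W_x[i], x))
        h_new.append(math.tanh(pre))
    return h_new
```

In practice an LSTM or GRU cell replaces this plain recurrence (as in the experiments later), but the interface — consume the previous state and the current embedding, emit a new state — is the same.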
There have been a number of attempts at leveraging the combination of RNNs and topic models to capture local and global semantic dependencies (Dieng0GP17; Wang2017TopicCN; wen2018latent; lau2017topically; rezaee2020event). Dieng0GP17 passed the topics to the output of RNN cells through an additive procedure, while lau2017topically proposed using convolutional filters to generate document-level features and then exploited the obtained topic vectors in a gating unit. Wang2017TopicCN provided an analysis of incorporating topic information inside a Long Short-Term Memory (LSTM) cell, followed by a topic diversity regularizer, to make a complete compositional neural language model. nallapati2017sengen introduced sentence-level topics as a way to better guide text generation: from a document's topic proportions, a topic for a sentence is selected and then used to generate the words in that sentence, so the words in a sentence share, and are conditioned on, the assigned topic. li2017recurrent followed the same strategy, but modeled a document as a sequence of sentences, such that recurrent attention windows capture dependencies between successive sentences. wang2019topic recently proposed using a normal distribution for document-level topics and a Gaussian mixture model (GMM) for word-topic assignments. While they incorporate both the information from the previous words and their topics to generate a sentence, their model is based on a mixture of experts, with word-level topics neglected. To cover both short and long sentences, gupta2018texttovec introduced an architecture that combines LSTM hidden layers and pre-trained word embeddings with a neural auto-regressive topic model. One difference between our model and theirs is that their focus is on predicting the words within a context, while our aim is to generate words given a particular topic without marginalizing.
We first formalize the problem and propose our variationally-learned recurrent neural topic model (VRTM) that accounts for each word’s generating topic. We discuss how we learn this model and the theoretical benefits of our approach. Subsequently we demonstrate the empirical benefits of our approach (Sec. 4).
3.1 Generative Model
Fig. 1 provides both an unrolled graphical model and the full generative story. We define each document as a sequence of N words w = (w_1, …, w_N). For simplicity we omit the document index when possible. Each word w_n has a corresponding generating topic z_n, drawn from the document's topic proportion distribution θ. As in standard LDA, we assume there are K topics, each represented as a V-dimensional vector β_k, where β_{k,w} is the probability of word w given topic k. However, we sequentially model each word via an RNN. The RNN computes a hidden state h_n for each token, where h_n is designed to consider the previous words observed in sequence. During training, we define a sequence of observed semantic indicators (one per word) l = (l_1, …, l_N), where l_n = 1 if that token is thematic (and 0 if not); a sequence of latent topic assignments z = (z_1, …, z_N); and a sequence of computed RNN states h = (h_1, …, h_N). We draw each word's topic z_n from an assignment distribution p(z_n | θ, l_n), and generate each word w_n from a decoding/generating distribution p(w_n | z_n, h_n, l_n).
The joint model can be decomposed as

p(w, z, l, θ | h; α) = p(θ; α) ∏_n p(w_n | z_n, h_n, l_n) p(z_n | θ, l_n) p(l_n | h_n),

where α is a vector of positive parameters governing what topics are likely in a document (θ ~ Dirichlet(α)). The corresponding graphical model is shown in Fig. 1(a). Our decoder is

p(w_n = w | z_n = k, h_n, l_n) ∝ exp(v_w^T h_n + l_n β_{k,w}),

where v_w stands for the learned projection vector from the hidden space to the output for word w. When the model faces a non-thematic word (l_n = 0) it just uses the output of the RNN, and otherwise it uses a mixture of LDA and RNN predictions. As will be discussed in Sec. 3.2.2, separating z_n and l_n in this way allows us to marginalize out z_n during inference without reparametrizing it.
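One plausible reading of this decoder, with the topic term entering additively in logit space and switched off by l_n = 0, can be sketched as follows (shapes and names are illustrative assumptions, not the paper's implementation):

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of logits."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def decoder_probs(h, proj, topic_term, l):
    """Word distribution p(w | z = k, h, l) for one position.

    proj: V x H matrix whose rows are the projection vectors v_w;
    topic_term: length-V vector for the chosen topic k (e.g., beta_k);
    l: thematic indicator (0 or 1) that gates the topic contribution.
    """
    logits = [sum(vw_i * h_i for vw_i, h_i in zip(vw, h)) + l * t
              for vw, t in zip(proj, topic_term)]
    return softmax(logits)
```

Setting l = 0 recovers a pure RNN language model distribution, which is exactly the behavior described for non-thematic words.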
A similar type of decoder has been applied by both Dieng0GP17 and wen2018latent to combine the outputs of the RNN and LDA. However, as they marginalize out the topic assignments, their decoders do not condition on the topic assignment z_n: mathematically, they compute p(w_n | θ, h_n, l_n) instead of p(w_n | z_n, h_n, l_n). This new decoder structure helps us preserve the word-level topic information that is neglected in other models. This seemingly small change has out-sized empirical benefits (Tables 2 and 7). Since the model drives learning and inference, this model's similarities and adherence to both foundational (blei2003latent) and recent neural topic models (Dieng0GP17) are intentional. We argue, and show empirically, that this adherence yields strong language modeling performance and allows more exact marginalization during learning.
Note that we explicitly allow the model to trade off between thematic and non-thematic words, much as in Dieng0GP17. We assume these indicators are observable during training, though their occurrence is modeled via the output of a neural factor: p(l_n | h_n) = Bernoulli(σ(u^T h_n)) for a learned vector u (σ is the sigmoid function). Using the recurrent network's representations, we thereby allow the presence of thematic words to be learned—a long-recognized area of interest (ferret-1998-thematically; merlo-stevenson-2001-automatic; yang-etal-2017-detecting; hajicova-mirovsky-2018-discourse). For a K-topic VRTM, we formalize this intuition by defining p(z_n = k | θ, l_n) = θ_k if l_n = 1, and the uniform value 1/K otherwise. The rationale behind this assumption is that in our model the RNN is sufficient to draw non-thematic words, and considering a particular topic for these words makes the model unnecessarily complicated.
Maximizing the data likelihood in this setting is intractable due to the integral over . Thus, we rely on amortized variational inference as an alternative approach. While classical approaches to variational inference have relied on statistical conjugacy and other approximations to make inference efficiently computable, the use of neural factors complicates such strategies. A notable difference between our model and previous methods is that we take care to account for each word and its assignment in both our model and variational approximation.
3.2 Neural Variational Inference
We use a mean field assumption to define the variational distribution as

q(θ, z | w, l) = q(θ | w) ∏_n q(z_n | w_n, l_n),

where q(θ | w) is a Dirichlet distribution parameterized by γ, and we define q(z_n) similarly to p(z_n | θ, l_n): uniform if l_n = 0, and the neurally parametrized output φ_n otherwise. In Sec. 3.2.1 we define γ and φ_n in terms of the output of neural factors. These allow us to "encode" the input into our latent variables, and then "decode" them, or "reconstruct" what we observe.
Variational inference optimizes the evidence lower bound (ELBO). For a single document (subscript omitted for clarity), we write the ELBO as:

ELBO = E_q[log p(w, z, l, θ | h; α)] − E_q[log q(θ, z | w, l)].

This objective can be decomposed into separate loss terms and written as

ELBO = L_rec + L_z + L_l − KL(q(θ | w) ∥ p(θ; α)),

where L_rec = Σ_n E_q[log p(w_n | z_n, h_n, l_n)] is the reconstruction term, L_z = Σ_n (E_q[log p(z_n | θ, l_n)] − E_q[log q(z_n)]) collects the topic-assignment terms, and L_l = Σ_n log p(l_n | h_n) scores the thematic indicators.
The algorithmic complexity, both time and space, will depend on the exact neural cells used. The forward pass complexity for each of the separate summands follows classic variational LDA complexity.
In Fig. 2 we provide a system overview of our approach, which can broadly be viewed as an encoder (the word embedding, bag-of-words, masked word embeddings, and RNN components) and a decoder (everything else). Within this structure, our aim is to maximize the ELBO to find the best variational and model parameters. To shed light on what the ELBO is capturing, we describe each term below.
Our encoder structure maps the documents into variational parameters. Previous methods posit that while an RNN can adequately perform language modeling (LM) using word embeddings for all vocabulary words, a modified, bag-of-words-based embedding representation is an effective approach for capturing thematic associations (Dieng0GP17; Wang2017TopicCN; lau2017topically). We note however that more complex embedding methods can be studied in the future (reisinger2010spherical; das-etal-2015-gaussian; batmanghelich-etal-2016-nonparametric).
To avoid "representation degeneration" (gao2018representation), care must be taken to preserve the implicit semantic relationships captured in word embeddings. ("Representation degeneration" happens when, after training the model, the word embeddings are not well separated in the embedding space and are all highly correlated.) Let V be the overall vocabulary size. For RNN language modeling [LM], we represent each word in the vocabulary with an E-dimensional real-valued vector x_w. As a result of our model definition, the topic modeling [TM] component of VRTM effectively operates with a reduced vocabulary; let V′ be the vocabulary size excluding non-thematic words (V′ ≤ V). Each document is represented by a V′-dimensional integer vector, recording how often each thematic word appeared in the document.
In contrast to a leading approach (wang2019topic), we use the entire vocabulary's embeddings for LM and mask out the non-thematic words to compute the variational parameters for TM: x̃_n = l_n ⊙ x_n, where ⊙ denotes element-wise multiplication and l_n = 1 if w_n is thematic (0 otherwise). We refer to x̃_n as the masked embedding representation. This motivates the model to learn word meaning and the importance of word counts jointly. The token representations are gathered into a matrix X̃, which is fed to a feedforward network with one hidden layer to infer the word-topic parameters φ. Similarly, we use tensordot to produce the Dirichlet parameters γ for each document, defined as γ = tensordot(X̃, W), where W is the weight tensor and the summation is over the common axes.
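The masking and the tensordot contraction can be sketched in a few lines. The shapes here (X̃: N x E tokens-by-embedding, W: N x E x K) and the softplus applied to keep the Dirichlet parameters positive are our illustrative assumptions about the contraction described above:

```python
import math

def masked_embeddings(embeddings, mask):
    """x_tilde_n = l_n * x_n: zero out non-thematic tokens before the TM encoder."""
    return [[m * xi for xi in x] for m, x in zip(mask, embeddings)]

def document_dirichlet_params(masked, W):
    """gamma = softplus(tensordot(X_tilde, W)), contracting over the token (N)
    and embedding (E) axes. masked: N x E; W: N x E x K; returns length-K gamma."""
    K = len(W[0][0])
    gamma = [0.0] * K
    for n, x in enumerate(masked):
        for e, xe in enumerate(x):
            for k in range(K):
                gamma[k] += xe * W[n][e][k]
    # softplus keeps the Dirichlet parameters strictly positive
    return [math.log(1.0 + math.exp(g)) for g in gamma]
```

In a deep learning framework this whole function is a single `tensordot` call followed by an activation; the loops above only make the contraction axes explicit.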
3.2.2 Decoder/Reconstruction Terms
The first term in the objective, L_rec, is the reconstruction part, which simplifies to

L_rec = Σ_n Σ_k q(z_n = k) log p(w_n | z_n = k, h_n, l_n).

For each word w_n, if l_n = 0, then the word is not thematically relevant; because of this, we just use the RNN prediction. Otherwise (l_n = 1) we let the model benefit from both the globally semantically coherent topics β and the locally fluent recurrent states h_n. With our definition of q(z_n), L_rec simplifies to an explicit K-term sum per word. Finally, note that there is an additive separation between h_n and β in the word reconstruction model (eq. 2). The marginalization that occurs per-word can be computed as

Σ_k q(z_n = k) [v_{w_n}^T h_n + l_n β_{k,w_n} − log Z_{n,k}],

where Z_{n,k} represents the per-word normalization factor. This means that the discrete z_n can be marginalized out and do not need to be sampled or approximated.
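Because q(z_n) is a discrete distribution over only K topics, the per-word expectation is an exact finite sum; a minimal sketch of that computation (names are ours):

```python
def expected_log_likelihood(phi_n, log_p_per_topic):
    """E_{q(z_n)}[log p(w_n | z_n, h_n, l_n)] = sum_k phi_{n,k} * log p(w_n | z_n = k, ...).

    phi_n: variational probabilities q(z_n = k) for k = 1..K;
    log_p_per_topic: log p(w_n | z_n = k, h_n, l_n) for each k.
    The discrete z_n is handled by this exact K-term sum; nothing is sampled.
    """
    return sum(phi * lp for phi, lp in zip(phi_n, log_p_per_topic))
```

This is the computational heart of keeping word-level topic assignments: no reparametrization trick (e.g., Gumbel-softmax) is needed, just a length-K dot product per token.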
Given the general encoding architecture used to compute q(θ | w), computing L_z requires computing two expectations. We can exactly compute the expectation over z, but we approximate the expectation over θ with S Monte Carlo samples θ^(s) ~ q(θ | w) from the learned variational approximation. Imposing the basic assumptions from p(z_n | θ, l_n) and q(z_n), we have

E_q[log p(z | θ, l)] ≈ Σ_{n: l_n = 1} Σ_k φ_{n,k} (1/S) Σ_s log θ_k^(s) − N_0 log K,

where N_0 is the number of non-thematic words in the document and θ^(s) is the output of Monte Carlo sampling. Similarly, we can exactly compute the entropy term E_q[log q(z)].
Modeling which words are thematic or not can be seen as a negative binary cross-entropy loss:

L_l = Σ_n l_n log σ(u^T h_n) + (1 − l_n) log(1 − σ(u^T h_n)).

This term is a classifier for l_n that encourages the output of the RNN to discriminate between thematic and non-thematic words. Finally, as a KL divergence between distributions of the same type (Dirichlet), KL(q(θ | w) ∥ p(θ; α)) can be calculated in closed form (joo2019dirichlet).
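The classifier term above can be sketched directly; the scores stand in for the neural factor applied to each h_n (an assumption for illustration):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def thematic_log_likelihood(scores, labels):
    """Sum_n l_n log sigma(s_n) + (1 - l_n) log(1 - sigma(s_n)):
    the (negative binary cross-entropy) classifier term for the indicators l_n.

    scores: per-token real-valued logits (standing in for u . h_n);
    labels: per-token indicators l_n in {0, 1}.
    """
    total = 0.0
    for s, l in zip(scores, labels):
        p = sigmoid(s)
        total += l * math.log(p) + (1 - l) * math.log(1.0 - p)
    return total
```

The value is always non-positive and approaches zero as the RNN states become perfectly separable into thematic vs. non-thematic tokens.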
We conclude with a theorem showing that VRTM is an extension of an RNN under specific assumptions. This demonstrates the flexibility of our proposed model and shows why, in addition to natural language processing considerations, the renewed accounting for thematic vs. non-thematic words in both the model and inference is a key idea and contribution. For the proof, please see the appendix; the central idea is to show that all the terms related to TM either vanish or are constants, which follows intuitively from the ELBO and eq. 2.

Theorem 1. If in the generative story l_n = 0 for every word, and we have a uniform distribution for the prior and variational topic posterior approximation (p(z_n) = q(z_n) = 1/K), then VRTM reduces to a recurrent neural network that just reconstructs the words given previous words.
4 Experimental Results
We test the performance of our algorithm on the publicly available APNEWS, IMDB and BNC datasets (https://github.com/jhlau/topically-driven-language-model). Roughly, there are between 7.7k and 9.8k vocabulary words in each corpus, with between 15M and 20M training tokens each; Table A1 in the appendix details the statistics of these datasets. APNEWS contains 54k newswire articles, IMDB contains 100k movie reviews, and BNC contains 17k assorted texts, such as journals, book excerpts, and newswire. These are the same datasets, with the same train, validation and test splits, as used by prior work, where additional details can be found (Wang2017TopicCN). We use the publicly provided tokenization and, following past work, we lowercase all text and map infrequent words (those at the bottom of the frequency distribution) to a special unk token. Following previous work (Dieng0GP17), to avoid overfitting on the BNC dataset we grouped every 10 documents in the training set into a new pseudo-document.
Following Dieng0GP17, we find that, for basic language modeling and core topic modeling, identifying which words are/are not stopwords is a sufficient indicator of thematic relevance. We define l_n = 0 if w_n is within a standard English stopword list, and l_n = 1 otherwise. (Unlike Asymmetric Latent Dirichlet Allocation (ALDA) (wallach2009rethinking), which assigns a specific topic for stop words, we assume that stop words do not belong to any specific topic.)
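Computing the indicators from a stopword list is a one-liner; the particular stopword list is an external assumption (e.g., a standard English list):

```python
def thematic_indicators(tokens, stopword_list):
    """l_n = 0 if the token is in the stopword list, 1 otherwise."""
    stop = set(stopword_list)
    return [0 if t in stop else 1 for t in tokens]
```

These observed indicators are what the neural factor p(l_n | h_n) is trained to predict.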
We use the softplus function as the activation function, and batch normalization is employed to normalize the inputs of the variational parameters. We use a single-layer recurrent cell; while some may be surprised at the relative straightforwardness of this choice, we argue that it is a benefit. Our models are intentionally trained using only the provided text in the corpora so as to remain comparable with previous work and limit confounding factors; we do not use pre-trained models. While transformer methods are powerful, there are good reasons to research non-transformer approaches, e.g., the data and computational requirements of transformers. We demonstrate the benefits of separating the "semantic" vs. "syntactic" aspects of LM, which could provide a blueprint for future work on transformer-based TM. For additional implementation details, please see appendix C.
Language Modeling Baselines
In our experiments, we compare against the following methods:
basic-LSTM: A single-layer LSTM with the same number of units as our implementation but without any topic modeling, i.e., l_n = 0 for all tokens.
LDA+LSTM (lau2017topically): Topics, from a trained LDA model, are extracted for each word and concatenated with the output of an LSTM to predict next words.
Topic-RNN (Dieng0GP17): This work is the most similar to ours. They use a similar decoder strategy (Eq. 2) but marginalize topic assignments prior to learning.
TDLM (lau2017topically): A convolutional filter is applied over the word embeddings to capture long range dependencies in a document and the extracted features are fed to the LSTM gates in the reconstruction phase.
TCNLM (Wang2017TopicCN): A Gaussian vector defines document topics, then a set of expert networks inside an LSTM cell forms the language model. The distance between topics is maximized as a regularizer to encourage diverse topics.
TGVAE (wang2019topic): The structure is like TCNLM, but with a Gaussian mixture model defining each sentence as a mixture of topics. Inference uses Householder flows to increase the flexibility of the variational distributions.
4.1 Evaluations and Analysis
For the RNN, we used a single-layer LSTM with 600 units in the hidden layer, set the embedding size to 400, and used a fixed/maximum sequence length of 45. We also present experiments demonstrating the performance characteristics of basic RNN, GRU and LSTM cells in Table 0(a). Note that although our embedding size is higher, we do not use pretrained word embeddings, but instead learn them from scratch via end-to-end training. (Perplexity degraded slightly with pretrained embeddings: VRTM-LSTM on APNEWS went from 51.35 (from scratch) to 52.31 (word2vec).) The perplexity values of the baselines and our VRTM across our three heldout evaluation sets are shown in Table 2. We note that TGVAE (wang2019topic) also compared against, and outperformed, all the baselines in this work; in surpassing TGVAE, we surpass those baselines as well. We find that our proposed method achieves lower perplexity than the basic LSTM, which is consistent with Theorem 1. We observe that increasing the number of topics yields better overall perplexity scores. Moreover, VRTM outperforms the other baselines across all the benchmark datasets.
Topic Switch Percent
Automatically evaluating the quality of learned topics is notoriously difficult. While "coherence" (mimno2011optimizing) has been a popular automated metric, it can have peculiar failure points, especially regarding very common words (paul2015phd). To counter this, Lund2019AutomaticAH recently introduced switch percent (SwitchP). SwitchP makes the intuitive yet simple assumption that "good" topics will exhibit a type of inertia: one would not expect adjacent words to use many different topics. It measures the consistency of adjacent word-level topic assignments by computing the fraction of adjacent word pairs that keep the same topic, over the M − 1 adjacent pairs in a document of length M after removing the stop-words. SwitchP is bounded between 0 (more switching) and 1 (less switching), where higher is better. Despite this simplicity, SwitchP was shown to correlate significantly better with human judgments of "good" topics than coherence (e.g., a coefficient of determination of 0.91 for SwitchP vs. 0.49 for coherence). Against a number of other potential methods, SwitchP was consistent in having a high, positive correlation with human judgments. For this reason, we use SwitchP. However, like other evaluations that measure some aspect of meaning (e.g., BLEU attempting to measure meaning-preserving translations), comparisons can only be made in like settings.
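The metric itself is a short computation; a sketch under the description above (stop-words assumed already removed from the assignment sequence):

```python
def switch_percent(assignments):
    """SwitchP over a document's word-level topic assignments.

    Returns the fraction of the M - 1 adjacent pairs that KEEP the same topic:
    1.0 means no switching (most consistent), 0.0 means switching at every step.
    """
    if len(assignments) < 2:
        return 1.0  # degenerate document: nothing can switch
    same = sum(1 for a, b in zip(assignments, assignments[1:]) if a == b)
    return same / (len(assignments) - 1)
```

A corpus-level score would average this quantity over documents.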
In Table 1(a) we compare SwitchP for our proposed algorithm against a variational Bayes implementation of LDA (blei2003latent). We compare to this classic method, which we call LDA VB, since both it and VRTM use variational inference: this lessens the impact of the learning algorithm, which can be substantial (may-2015-topic; ferraro-phd), and allows us to focus on the models themselves. (Given its prominence in the topic modeling community, we initially examined Mallet. It uses collapsed Gibbs sampling, rather than variational inference, to learn the parameters—a large difference that confounds comparisons. Still, VRTM's SwitchP was often competitive with, and in some cases surpassed, Mallet's.) We note that it is difficult to compare to many recent methods: our other baselines do not explicitly track each word's assigned topic, and evaluating them with SwitchP would necessitate non-trivial, core changes to those methods. (For example, evaluating the effect of a recurrent component in a model like LDA+LSTM (lau2017topically) is difficult, since the LDA and LSTM models are trained separately. While the LSTM model does include inferred document-topic proportions, these values are concatenated to the computed LSTM hidden state. The LDA model is frozen, so there is no propagation back to the LDA model: the results would be the same as vanilla LDA.) We push VRTM to associate select words with topics by (stochastically) assigning topics. Because standard LDA VB does not handle stopwords well, we remove them for LDA VB, and to ensure a fair comparison, we mask all stopwords for VRTM. First, notice that LDA VB switches approximately twice as often. Second, as intuition may suggest, we see that with a larger number of topics both methods' SwitchP decreases—though VRTM still switches less often. Third, we note sharp decreases for LDA VB in going from 5 to 10 topics, with a more gradual decline for VRTM.
As Lund2019AutomaticAH argue, these results demonstrate that our model and learning approach yield more “consistent” topic models; this suggests that our approach effectively summarizes thematic words consistently in a way that allows the overall document to be modeled.
Inferred Document Topic Entropy
To better demonstrate characteristics of our learned model, we report the average document topic entropy, H(θ) = −Σ_k θ_k log θ_k, averaged over documents. Table 1(b) presents the results of our approach, along with LDA VB. These results show that VRTM is much more selective about which topics it uses in a document, yielding sparser sets of relevant/likely topics. Together with the SwitchP and perplexity results, this suggests VRTM can provide consistent topic analyses.
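The reported quantity can be sketched as follows (a minimal implementation of the entropy average; lower values indicate more selective, peaked topic usage):

```python
import math

def average_topic_entropy(doc_topic_dists):
    """Mean over documents of H(theta_d) = -sum_k theta_{d,k} log theta_{d,k}.

    doc_topic_dists: one normalized topic-proportion vector per document.
    """
    def entropy(theta):
        # skip zero entries: lim_{p -> 0} p log p = 0
        return -sum(t * math.log(t) for t in theta if t > 0.0)
    return sum(entropy(theta) for theta in doc_topic_dists) / len(doc_topic_dists)
```

As reference points, a uniform distribution over K topics attains the maximum entropy log K, while a one-hot distribution attains 0.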
Topics & Sentences
To demonstrate the effectiveness of our proposed masked embedding representation, we report nine random topics in Table 3 (see Table A2 in the appendix for more examples), and seven randomly generated sentences per dataset in Table 4. During training we observed that the topic-word distribution matrix β initially gives stop words high probability, and then the probability of stop words given topics gradually decreases. This observation is consistent with our hypothesis that stop-words do not belong to any specific topic. It can be attributed to the uniform distributions defined for the variational parameters of non-thematic words, and to controlling the updates of the β matrix when facing stop vs. non-stop words. A rigorous, quantitative examination of the generation capacity deserves its own study and is beyond the scope of this work. We provide these examples so readers may make their own qualitative assessments of the strengths and limitations of our methods.
APNEWS:
- a damaged car and body unk were taken to the county medical center from dinner with one driver.
- the house now has released the unk of $ 100,000 to former unk freedom u.s. postal service and $ unk to the federal government.
- another agency will investigate possible abuse of violations to the police facility .
- not even if it represents everyone under control . we are getting working with other items .
- the oklahoma supreme court also faces a maximum amount of money for a local counties.
- the money had been provided by local workers over how much unk was seen in the spring.
- he did n't offer any evidence and they say he was taken from a home in front of a vehicle.

IMDB:
- the film is very funny and entertaining . while just not cool and all ; the worst one can be expected.
- if you must view this movie , then i 'd watch it again again and enjoy it . this movie surprised me .
- they definitely are living with characters and can be described as vast unk in their parts .
- while obviously not used as a good movie , compared to this ; in terms of performances .
- i love animated shorts , and once again this is the most moving show series i 've ever seen.
- i remember about half the movie as a movie . when it came out on dvd , it did so hard to be particularly silly .
- this flick was kind of convincing . john unk 's character better could n't be ok because he was famous as his sidekick.

BNC:
- she drew into her eyes . she stared at me . molly thought of the young lady , there was lack of same feelings of herself.
- these conditions are needed for understanding better performance and ability and entire response .
- it was interesting to give decisions order if it does not depend on your society , measured for specific thoughts , sometimes at least .
- not a conservative leading male of his life under waste worth many a few months to conform with how it was available .
- their economics , the brothers began its $ unk wealth by potential shareholders to mixed them by tomorrow .
- should they happen in the north by his words , as it is a hero of the heart , then without demand .
- we will remember the same kind of the importance of information or involving agents what we found this time.
We incorporated discrete variables into neural variational inference without analytically integrating them out or reparametrizing them and running stochastic backpropagation on them. Applied to a recurrent neural topic model, our approach maintains the discrete topic assignments, yielding a simple yet effective way to learn thematic vs. non-thematic (e.g., syntactic) word dynamics. Our approach outperforms previous approaches on language modeling and other topic modeling measures.
The model used in this paper is fundamentally an associative language model. While NVI does provide some degree of regularization, a significant component of the training criterion is still a cross-entropy loss, and this paper does not examine adjusting that component. As such, the text the model is trained on can influence the types of implicit biases that are transmitted to the learned syntactic component (the RNN and its representations h), the learned thematic component (the topic matrix β and the topic modeling variables θ and z), and the tradeoff between these two dynamics (the indicators l). For comparability, this work used available datasets that have been previously published on; based upon this work's goals, there was not an in-depth exploration into any biases within those datasets. Note, however, that the thematic vs. non-thematic aspect of this work provides a potential avenue for examining this. While we treated l_n as a binary indicator, future work could take a more nuanced, graded view.
Direct interpretability of the individual components of the model is mixed. While the topic weights can be inspected and analyzed directly, the same is not as easy for the RNN component. Although we lack a direct way to inspect the overall decoding model, our approach does provide insight into the thematic component.
We view the model as capturing thematic vs. non-thematic dynamics, though in keeping with previous work, for evaluation we approximated this with non-stopword vs. stopword dynamics. While stop-word handling within topic modeling is generally considered simply a preprocessing problem (or is obviated by neural networks), we believe that preprocessing is an important element of a downstream user's workflow that is not captured when preprocessing is treated as a stand-alone, perhaps uninteresting step. We argue that future work can examine how different elements of a user's workflow, such as preprocessing, can be handled with our approach.
Acknowledgements and Funding Disclosure
We would like to thank members and affiliates of the UMBC CSEE Department, including Edward Raff, Cynthia Matuszek, Erfan Noury and Ahmad Mousavi. We would also like to thank the anonymous reviewers for their comments, questions, and suggestions. Some experiments were conducted on the UMBC HPCF. This material is based in part upon work supported by the National Science Foundation under Grant No. IIS-1940931. This material is also based on research that is in part supported by the Air Force Research Laboratory (AFRL), DARPA, for the KAIROS program under agreement number FA8750-19-2-1003. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either express or implied, of the Air Force Research Laboratory (AFRL), DARPA, or the U.S. Government.
Appendix A Proof of Theorem 1
Theorem 1 states: if, in the generative story, every token is assigned the non-thematic label, and we have a uniform distribution for both the prior and the variational topic posterior approximation, then VRTM reduces to a recurrent neural network that simply reconstructs each word given the previous words.
The proof follows. If every token is assigned the non-thematic label, then for every token the topic-dependent factors drop out, and the reconstruction term simplifies to the standard RNN log-likelihood of each word given its history. We analyze the remaining terms as follows. The terms over thematic words vanish, since there are no thematic words. Since all tokens are forced to have the same label, the classifier term is constant. The KL term is also zero, since it is the KL divergence between two equal distributions. Overall, the objective reduces to the RNN reconstruction term, indicating that the model is just maximizing the log model evidence based on the RNN output. ∎
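As a minimal LaTeX sketch of this reduction (using generic notation, since the paper's original symbols do not survive in this excerpt: $w_t$ for the $t$-th token, $\theta$ for the topic proportions), the surviving objective is just the RNN term:

```latex
\mathcal{L} \;=\;
\underbrace{\sum_{t=1}^{T} \log p(w_t \mid w_{<t})}_{\text{RNN reconstruction}}
\;+\; \underbrace{\,0\,}_{\text{thematic terms}}
\;-\; \underbrace{\mathrm{KL}\!\left(q(\theta)\,\|\,p(\theta)\right)}_{=\,0\ \text{(equal distributions)}}
```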
Appendix B Dataset Details
Table A1 provides details on the different datasets we use. Note that these are the same datasets as have been used by others [wang2019topic].
Table A1 reports the number of documents (# Docs), sentences (# Sents), and tokens (# Tokens) for each dataset.
Appendix C Implementation Details
For the RNN we used a single-layer LSTM with 600 units in the hidden layer, set the embedding size to 400, and used a fixed/maximum sequence length of 45. We also present experiments demonstrating the performance of basic RNN and GRU cells in Table 0(a). Note that although our embedding size is comparatively large, we train end-to-end without using pre-trained word embeddings. However, as shown in Table 0(a), we also examined the impact of lower-dimensional embeddings and found our results to be fairly consistent. We found that using pretrained word embeddings such as word2vec could result in slight degradations in perplexity, so we opted to learn the embeddings from scratch. We used a single Monte Carlo sample per document per epoch and a dropout rate of 0.4. In all experiments, the remaining model hyperparameter is fixed based on validation set metrics. We optimized our parameters using the Adam optimizer with early stopping (lack of validation performance improvement for three iterations). We implemented VRTM with TensorFlow v1.13 and CUDA v8.0. Models were trained on a single Titan GPU. With a batch size of 200 documents, full training and evaluation runs typically took between 3 and 5 hours, depending on the number of topics.
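The early-stopping rule above can be sketched as follows. This is an illustrative sketch, not the paper's implementation: `patience=3` mirrors the three-iteration criterion, while the training and validation callables are placeholders.

```python
def train_with_early_stopping(train_epoch, validate, max_epochs=100, patience=3):
    """Stop when the validation metric fails to improve for `patience` epochs.

    `train_epoch` runs one pass over the training documents; `validate`
    returns a lower-is-better metric (e.g., validation perplexity).
    """
    best_val = float("inf")
    epochs_without_improvement = 0
    for _ in range(max_epochs):
        train_epoch()
        val_metric = validate()
        if val_metric < best_val:
            best_val = val_metric
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # no improvement for `patience` consecutive iterations
    return best_val
```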
Appendix D Perplexity Calculation
Following previous efforts, we calculate perplexity across all documents as the exponentiated negative average log-likelihood over the total number of tokens,
where the per-token probability factorizes according to eq. 1. We approximate the marginal probability of each token within a document by Monte Carlo sampling:
each sample is drawn from the computed posterior approximation, and we draw a fixed number of samples per document.
Appendix E Text Generation
The overall generating document procedure is illustrated in Algorithm 1. We use <SEP> as a special symbol that we silently prepend to sentences during training.
We limit the concatenation step to the previous 30 words.
Appendix F Generated Topics
See Table A2 for additional, randomly sampled topics from VRTM models learned on APNEWS, IMDB, and BNC (50 topics).
Appendix G Generated Sentences
In this part we provide some sample generated output and explain how we generate text using VRTM. To this end, we begin with the start word and then proceed to predict the next word given all the previous words. It is worth mentioning that, for this task, the stop vs. non-stop word labels are marginalized out, with the model predicting these labels based on the RNN hidden states. This conditional probability is
Computing Eq. 8 exactly is intractable in our context, so we apply Monte Carlo sampling. It is too expensive to resample with each word generated; to alleviate this, we use a “sliding window” of previous words rather than the whole sequence and periodically resample [mikolov2012context, Dieng0GP17]. That is, we maintain a buffer of generated words and resample when the buffer is full (at which point we empty it and continue generating). We found that splitting each training document into blocks of 10 sentences generated more coherent sentences. Table 4 illustrates the quality of some generated sentences.
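The buffer-and-resample loop above can be sketched schematically as below. All names here are placeholders: `resample_topics` stands in for sampling from the model's posterior approximation, and `next_word` for one RNN decoding step; the 30-word window mirrors the concatenation limit mentioned earlier.

```python
def generate(resample_topics, next_word, num_words=50, buffer_size=30,
             start="<SEP>"):
    """Generate text left-to-right, resampling the topic variables
    whenever the buffer of recently generated words fills up."""
    generated = [start]
    buffer = []
    topics = resample_topics([])  # initial sample from the posterior
    while len(generated) < num_words:
        # Condition only on a sliding window of the most recent words.
        w = next_word(generated[-buffer_size:], topics)
        generated.append(w)
        buffer.append(w)
        if len(buffer) >= buffer_size:
            topics = resample_topics(buffer)  # buffer full: resample...
            buffer = []                       # ...then empty it and continue
    return generated
```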
G.1 The Effect of Hyperparameters on Topic Selectivity
Following previous work in language modeling that takes sparsity into consideration [arora2018compressed, mousavi2019survey], we sought a selective token-topic assignment with low entropy. First, we slightly overload the notation of entropy to define the average entropy of the non-stop-word topic assignments and, similarly, the average document entropy, normalizing by the total number of non-stop words in the document and the total number of documents, respectively. Second, the Dirichlet distribution can easily be parameterized to generate sparse samples. We now provide some intuitive explanation. In our setting, an equality holds relating the assignment entropy to the (negative) KL-divergence between the posterior and prior parameters. Moreover, on the document side, samples are controlled by the prior distribution. Overall, the selectivity of the topic assignments strongly depends on the choice of prior parameters. As shown in Fig. A1, we can not only control the assignment entropy but also reduce the document entropy by tuning the prior parameters, without the need for any other regularizer.
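The averaged assignment entropy described above can be computed as follows. This is an illustrative sketch: `token_topic_posteriors` is a hypothetical container holding, for each non-stop token, its posterior distribution over the K topics.

```python
import math

def avg_assignment_entropy(token_topic_posteriors):
    """Average Shannon entropy (in nats) of per-token topic distributions;
    lower values indicate more selective (sparser) topic assignments."""
    def entropy(p):
        # Zero-probability topics contribute nothing (lim p->0 of p*log p).
        return -sum(pk * math.log(pk) for pk in p if pk > 0.0)
    return (sum(entropy(p) for p in token_topic_posteriors)
            / len(token_topic_posteriors))
```

A uniform assignment over K topics attains the maximum entropy log K, while a one-hot (fully selective) assignment attains 0.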