Recurrent Hierarchical Topic-Guided Neural Language Models

12/21/2019 ∙ by Dandan Guo, et al.

To simultaneously capture syntax and global semantics from a text corpus, we propose a new larger-context recurrent neural network (RNN) based language model, which extracts recurrent hierarchical semantic structure via a dynamic deep topic model to guide natural language generation. Moving beyond a conventional RNN based language model that ignores long-range word dependencies and sentence order, the proposed model captures not only intra-sentence word dependencies, but also temporal transitions between sentences and inter-sentence topic dependences. For inference, we develop a hybrid of stochastic-gradient MCMC and recurrent autoencoding variational Bayes. Experimental results on a variety of real-world text corpora demonstrate that the proposed model not only outperforms state-of-the-art larger-context RNN-based language models, but also learns interpretable recurrent multilayer topics and generates diverse sentences and paragraphs that are syntactically correct and semantically coherent.




1 Introduction

Both topic and language models are widely used for text analysis. Topic models, such as latent Dirichlet allocation (LDA) (Blei et al., 2003; Griffiths & Steyvers, 2004; Hoffman et al., 2013) and its nonparametric Bayesian generalizations (Teh et al., 2006; Zhou & Carin, 2015), are well suited to extract document-level word co-occurrence patterns into latent topics from a text corpus. Their modeling power has been further enhanced by introducing multilayer deep representation (Srivastava et al., 2013; Mnih & Gregor, 2014; Gan et al., 2015; Zhou et al., 2016; Zhao et al., 2018; Zhang et al., 2018). While having semantically meaningful latent representation, they typically treat each document as a bag of words (BoW), ignoring word order (Griffiths et al., 2004; Wallach, 2006). Language models have become key components of various natural language processing (NLP) tasks, such as text summarization (Rush et al., 2015; Gehrmann et al., 2018), speech recognition (Mikolov et al., 2010; Graves et al., 2013), machine translation (Sutskever et al., 2014; Cho et al., 2014), and image captioning (Vinyals et al., 2015; Mao et al., 2015; Xu et al., 2015; Gan et al., 2017; Rennie et al., 2017). The primary purpose of a language model is to capture the distribution of a word sequence, commonly with a recurrent neural network (RNN) (Mikolov et al., 2011; Graves, 2013) or a Transformer based neural network (Vaswani et al., 2017; Dai et al., 2019; Devlin et al., 2019; Radford et al., 2018, 2019). In this paper, we focus on improving RNN-based language models, which often have far fewer parameters and are easier to train end-to-end.

While RNN-based language models do not ignore word order, they often assume that the sentences of a document are independent of each other. This simplifies the modeling task to independently assigning probabilities to individual sentences, ignoring their order and document context (Tian & Cho, 2016). Such language models may consequently fail to capture the long-range dependencies and global semantic meaning of a document (Dieng et al., 2017; Wang et al., 2018). To relax the sentence independence assumption in language modeling, Tian & Cho (2016) propose larger-context language models that model the context of a sentence by representing its preceding sentences as either a single BoW vector or a sequence of BoW vectors, which are then fed directly into the sentence modeling RNN. An alternative approach attracting significant recent interest is leveraging topic models to improve RNN-based language models.

Mikolov & Zweig (2012) use pre-trained topic model features as an additional input to the RNN hidden states and/or output. Dieng et al. (2017) and Ahn et al. (2017) combine the predicted word distributions, given by both a topic model and a language model, under a variational autoencoder (Kingma & Welling, 2013). Lau et al. (2017) introduce an attention based convolutional neural network to extract semantic topics, which are used to extend the RNN cell. Wang et al. (2018) learn the global semantic coherence of a document via a neural topic model and use the learned latent topics to build a mixture-of-experts language model. Wang et al. (2019) further specify a Gaussian mixture model as the prior of the latent code in a variational autoencoder, where each mixture component corresponds to a topic.

While clearly improving the performance of the end task, these existing topic-guided methods still have clear limitations. For example, they utilize shallow topic models with only a single stochastic hidden layer in their data generation process. Note that several neural topic models use deep neural networks to construct their variational encoders, but still use shallow generative models (decoders) (Miao et al., 2017; Srivastava & Sutton, 2017). Another key limitation lies in ignoring the sentence order, as they treat each document as a bag of sentences. Thus, once the topic weight vector learned from the document context is given, the task is often reduced to independently assigning probabilities to individual sentences (Lau et al., 2017; Wang et al., 2018, 2019).

In this paper, as depicted in Fig. 1, we propose to use a recurrent gamma belief network (rGBN) to guide a stacked RNN for language modeling. We refer to the model as rGBN-RNN, which integrates rGBN (Guo et al., 2018), a deep recurrent topic model, and a stacked RNN (Graves, 2013; Chung et al., 2017), a neural language model, into a novel larger-context RNN-based language model. It simultaneously learns a deep recurrent topic model, extracting document-level multi-layer word co-occurrence patterns and sequential topic weight vectors for sentences, and an expressive language model, capturing both short- and long-range word sequential dependencies. For inference, we equip rGBN-RNN (decoder) with a novel variational recurrent inference network (encoder), and train it end-to-end by maximizing the evidence lower bound (ELBO). Different from the stacked RNN based language model in Chung et al. (2017), which relies on three types of customized training operations (UPDATE, COPY, FLUSH) to extract multi-scale structures, the language model in rGBN-RNN learns such structures purely under the guidance of the temporally and hierarchically connected stochastic layers of rGBN. The effectiveness of rGBN-RNN as a new larger-context language model is demonstrated both quantitatively, with perplexity and BLEU scores, and qualitatively, with interpretable latent structures and randomly generated sentences and paragraphs. Notably, rGBN-RNN can generate a paragraph consisting of a sequence of semantically coherent sentences.

2 Recurrent hierarchical topic-guided language model

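Denote a document of $J$ sentences as $\mathcal{D} = (S_1, S_2, \ldots, S_J)$, where $S_j = (y_{j,1}, \ldots, y_{j,T_j})$ consists of $T_j$ words from a vocabulary of size $V$. Conventional statistical language models often only focus on the word sequence within a sentence. Assuming that the sentences of a document are independent of each other, they often define $P(\mathcal{D}) \approx \prod_{j=1}^{J} P(S_j) = \prod_{j=1}^{J} \prod_{t=1}^{T_j} p(y_{j,t} \mid y_{j,<t})$. RNN-based neural language models define the conditional probability of each word $y_{j,t}$ given all the previous words $y_{j,<t}$ within the sentence $S_j$, through the softmax function of a hidden state $h_{j,t}$, as

$$p(y_{j,t} \mid y_{j,<t}) = p(y_{j,t} \mid h_{j,t}), \qquad h_{j,t} = f(h_{j,t-1}, y_{j,t-1}), \tag{1}$$

where $f(\cdot)$ is a non-linear function typically defined as an RNN cell, such as long short-term memory (LSTM) (Hochreiter & Schmidhuber, 1997) and gated recurrent unit (GRU) (Cho et al., 2014).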
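As a rough illustration of a sentence-level RNN language model of this kind, the sketch below scores one sentence with a toy Elman-style cell standing in for the RNN cell and a softmax output layer. The weight names (`W_h`, `W_x`, `W_o`), the begin-of-sentence id, and the random initialization are assumptions made for this example only; they are not taken from the paper, which uses LSTM cells.

```python
import math
import random

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def matvec(M, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * v for w, v in zip(row, x)) for row in M]

def rnn_step(h_prev, word_id, W_h, W_x):
    """Toy Elman cell: h_t = tanh(W_h h_{t-1} + W_x[:, y_{t-1}])."""
    rec = matvec(W_h, h_prev)
    return [math.tanh(r + W_x[i][word_id]) for i, r in enumerate(rec)]

def sentence_log_prob(words, W_h, W_x, W_o, bos=0):
    """Sum of log p(y_t | y_<t) over one sentence of word ids."""
    h = [0.0] * len(W_h)          # zero initial hidden state
    prev, total = bos, 0.0
    for y in words:
        h = rnn_step(h, prev, W_h, W_x)
        p = softmax(matvec(W_o, h))   # distribution over the vocabulary
        total += math.log(p[y])
        prev = y
    return total

# Toy parameters (hypothetical sizes: vocabulary of 5 words, 3 hidden units).
rng = random.Random(0)
V, H = 5, 3
W_h = [[rng.uniform(-0.5, 0.5) for _ in range(H)] for _ in range(H)]
W_x = [[rng.uniform(-0.5, 0.5) for _ in range(V)] for _ in range(H)]
W_o = [[rng.uniform(-0.5, 0.5) for _ in range(H)] for _ in range(V)]
log_p = sentence_log_prob([1, 3, 2, 4], W_h, W_x, W_o)
```

Because each sentence is scored from a zero initial state, the score of a sentence in this sketch cannot depend on its neighbours, which is the independence assumption discussed above.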

These RNN-based statistical language models are typically applied only at the word level, without exploiting the document context, and hence often fail to capture long-range dependencies. While Dieng et al. (2017); Lau et al. (2017); Wang et al. (2018, 2019) remedy the issue by guiding the language model with a topic model, they still treat a document as a bag of sentences, ignoring the order of sentences, and lack the ability to extract hierarchical and recurrent topic structures.

Figure 1: (a) The generative model of a three-hidden-layer rGBN-RNN, where the bottom part is the deep recurrent topic model (rGBN), document contexts of consecutive sentences are used as observed data, and the upper part is the language model. (b) Overview of the language model component, where input $x_{j,t}$ denotes the $t$th word in the $j$th sentence of a document, $x_{j,t} = y_{j,t-1}$, $h^l_{j,t}$ is the hidden state of the stacked RNN at time step $t$, and $\theta^l_j$ is the topic weight vector of sentence $j$ at layer $l$. (c) The overall architecture of the proposed model, including the decoder (rGBN and language model) and encoder (variational recurrent inference), where the red arrows denote the inference of latent topic weight vectors and the black ones the data generation.

We introduce rGBN-RNN, as depicted in Fig. 1(a), as a new larger-context language model. It consists of two key components: (i) a hierarchical recurrent topic model (rGBN), and (ii) a stacked RNN based language model. We use rGBN to capture both global semantics across documents and long-range inter-sentence dependencies within a document, and use the language model to learn the local syntactic relationships between the words within a sentence. Similar to Lau et al. (2017) and Wang et al. (2018), we represent a document as a sequence of sentence-context pairs $(\{S_1, d_1\}, \ldots, \{S_J, d_J\})$, where $d_j \in \mathbb{Z}_+^{V_c}$ summarizes the document excluding $S_j$, specifically $(S_1, \ldots, S_{j-1}, S_{j+1}, \ldots, S_J)$, into a BoW count vector, with $V_c$ as the size of the vocabulary excluding stop words. Note that a naive way is to treat each sentence as a document and use a dynamic topic model (Blei & Lafferty, 2006) to capture the temporal dependencies of the latent topic-weight vectors, which are fed to the RNN to model the word sequence of the corresponding sentence. However, the sentences are often too short to be well modeled by a topic model. In our setting, as $d_j$ summarizes the document-level context of $S_j$, it is in general sufficiently long for topic modeling. Note that during testing, we redefine $d_j$ as the BoW vector summarizing only the preceding sentences, $S_{1:j-1}$, which will be further clarified when presenting experimental results.
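The construction of sentence-context pairs can be sketched as follows. This is a minimal illustration, assuming tokenized sentences and a given context vocabulary; the function names and the toy document are hypothetical, and the stop-word filtering is represented simply by leaving stop words out of `vocab`.

```python
from collections import Counter

def bow(sentences, vocab):
    """Count vector over `vocab` for a list of tokenized sentences."""
    counts = Counter(w for s in sentences for w in s)
    return [counts[w] for w in vocab]

def sentence_context_pairs(doc, vocab, training=True):
    """Pair each sentence with a BoW summary of its document context.

    training=True : context is every sentence except the current one.
    training=False: context is only the preceding sentences, so the
                    sentence being predicted cannot leak into its context.
    """
    pairs = []
    for j, sent in enumerate(doc):
        context = doc[:j] + doc[j + 1:] if training else doc[:j]
        pairs.append((sent, bow(context, vocab)))
    return pairs

# Toy document; "the" plays the role of a stop word absent from vocab.
doc = [["budget", "vote"], ["the", "budget", "tax"], ["tax", "vote", "vote"]]
vocab = ["budget", "tax", "vote"]
train_pairs = sentence_context_pairs(doc, vocab, training=True)
test_pairs = sentence_context_pairs(doc, vocab, training=False)
```

In the training setting every context vector covers almost the whole document, which is why it is long enough for topic modeling even when individual sentences are short.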

2.1 Hierarchical recurrent topic model

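As shown in Fig. 1(a), to model the time-varying sentence-context count vectors $d_j$ in document $\mathcal{D}$, the generative process of the rGBN component, from the top to bottom hidden layers, is expressed as

$$\theta^L_j \sim \mathrm{Gam}(\Pi^L \theta^L_{j-1}, \tau_0), \;\ldots,\; \theta^l_j \sim \mathrm{Gam}(\Phi^{l+1}\theta^{l+1}_j + \Pi^l \theta^l_{j-1}, \tau_0), \;\ldots,\; \theta^1_j \sim \mathrm{Gam}(\Phi^2\theta^2_j + \Pi^1 \theta^1_{j-1}, \tau_0), \quad d_j \sim \mathrm{Pois}(\Phi^1 \theta^1_j), \tag{2}$$

where $\theta^l_j \in \mathbb{R}_+^{K_l}$ denotes the gamma distributed topic weight vectors of sentence $j$ at layer $l$, $\Pi^l \in \mathbb{R}_+^{K_l \times K_l}$ the transition matrix of layer $l$ that captures cross-topic temporal dependencies, $\Phi^l \in \mathbb{R}_+^{K_{l-1} \times K_l}$ the loading matrix at layer $l$, $K_l$ the number of topics of layer $l$, and $\tau_0 > 0$ a scaling hyperparameter. At $j = 1$, $\theta^l_1 \sim \mathrm{Gam}(\Phi^{l+1}\theta^{l+1}_1, \tau_0)$ for $l = 1, \ldots, L-1$ and $\theta^L_1 \sim \mathrm{Gam}(\nu, \tau_0)$, where $\nu = \mathbf{1}_{K_L}$. Finally, Dirichlet priors are placed on the columns of $\Pi^l$ and $\Phi^l$, i.e., $\pi^l_k$ and $\phi^l_k$, which not only makes the latent representation more identifiable and interpretable, but also facilitates inference. The count vector $d_j$ can be factorized into the product of $\Phi^1$ and $\theta^1_j$ under the Poisson likelihood. The shape parameters of $\theta^l_j$ can be factorized into the sum of $\Phi^{l+1}\theta^{l+1}_j$, capturing inter-layer hierarchical dependence, and $\Pi^l\theta^l_{j-1}$, capturing intra-layer temporal dependence. rGBN captures not only the document-level word occurrence patterns inside the training text corpus, but also the sequential dependencies of the sentences inside a document. Note that if its recurrent structure is ignored, rGBN reduces to the gamma belief network (GBN) of Zhou et al. (2016), which can be considered as a multi-stochastic-layer deep generalization of LDA (Cong et al., 2017a). If its hierarchical structure is ignored (i.e., $L = 1$), rGBN reduces to Poisson–gamma dynamical systems (Schein et al., 2016). We refer to the rGBN-RNN without its recurrent structure as GBN-RNN, which no longer models sequential sentence dependencies; see Appendix A for more details.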
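To make the top-down, left-to-right structure of this generative process concrete, the following sketch draws a short sequence of context count vectors by ancestral sampling. It is only an illustration under stated assumptions: the layer sizes, the symmetric Dirichlet concentration of 1, the all-ones shape for the top layer at the first sentence, and all function names are choices of this example, and gamma variables are parameterized by shape and rate.

```python
import math
import random

def matvec(M, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def dirichlet_columns(n_rows, n_cols, conc, rng):
    """Matrix whose columns are independent symmetric Dirichlet draws."""
    cols = []
    for _ in range(n_cols):
        g = [rng.gammavariate(conc, 1.0) for _ in range(n_rows)]
        total = sum(g)
        cols.append([v / total for v in g])
    return [[cols[c][r] for c in range(n_cols)] for r in range(n_rows)]

def poisson(rate, rng):
    """Knuth's multiplication method; adequate for the small rates here."""
    limit, k, p = math.exp(-rate), 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k

def gamma_vec(shape, tau0, rng):
    # random.gammavariate takes (shape, scale); a rate of tau0 is scale 1/tau0.
    return [rng.gammavariate(max(s, 1e-8), 1.0 / tau0) for s in shape]

def sample_rgbn(J, V_c, K, tau0=1.0, seed=0):
    """Draw J context count vectors from a len(K)-layer rGBN-style model."""
    rng = random.Random(seed)
    L = len(K)
    dims = [V_c] + list(K)
    Phi = [dirichlet_columns(dims[l], dims[l + 1], 1.0, rng) for l in range(L)]
    Pi = [dirichlet_columns(K[l], K[l], 1.0, rng) for l in range(L)]
    prev, contexts = None, []
    for _ in range(J):
        theta = [None] * L
        for l in reversed(range(L)):                 # top layer first
            if l == L - 1:
                # Assumed all-ones prior shape at the first sentence.
                shape = [1.0] * K[l] if prev is None else [0.0] * K[l]
            else:
                shape = matvec(Phi[l + 1], theta[l + 1])   # hierarchical term
            if prev is not None:                     # temporal term
                shape = [s + t for s, t in zip(shape, matvec(Pi[l], prev[l]))]
            theta[l] = gamma_vec(shape, tau0, rng)
        contexts.append([poisson(r, rng) for r in matvec(Phi[0], theta[0])])
        prev = theta
    return contexts, Phi, Pi

contexts, Phi, Pi = sample_rgbn(J=4, V_c=6, K=[4, 3, 2])
```

Dropping the temporal term in `sample_rgbn` would give a GBN-style sampler, and using a single layer would give a Poisson-gamma dynamical system, mirroring the two special cases mentioned above.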

2.2 Language model

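Different from a conventional RNN-based language model, which predicts the next word only using the preceding words within the sentence, we integrate the hierarchical recurrent topic weight vectors $\theta^l_j$ into the language model to predict the word sequence in the $j$th sentence. Our proposed language model is built upon the stacked RNN proposed in Graves (2013) and Chung et al. (2017), but with the help of rGBN, it no longer requires specialized training heuristics to extract multi-scale structures. As shown in Fig. 1(b), to generate $y_{j,t}$, the $t$th token of sentence $j$ in a document, we construct the hidden states $h^l_{j,t}$ of the language model, from the bottom to top layers, as

$$h^l_{j,t} = \begin{cases} \mathrm{LSTM}^l_{\mathrm{word}}\big(h^l_{j,t-1}, W_e^{\top} x_{j,t}\big), & l = 1, \\ \mathrm{LSTM}^l_{\mathrm{word}}\big(h^l_{j,t-1}, a^{l-1}_{j,t}\big), & 1 < l \le L, \end{cases} \tag{3}$$

where $\mathrm{LSTM}^l_{\mathrm{word}}$ denotes the word-level LSTM at layer $l$, $W_e$ are word embeddings to be learned, and $x_{j,t} = y_{j,t-1}$. Note that $a^l_{j,t}$ denotes the coupling vector, which combines the temporal topic weight vectors $\theta^l_j$ and the hidden output of the word-level LSTM $h^l_{j,t}$ at each time step $t$.

Following Lau et al. (2017), a gating unit similar to GRU (Cho et al., 2014) combines $\theta^l_j$ of sentence $j$ with its hidden state $h^l_{j,t}$ of the word-level LSTM at layer $l$ and time $t$ as $a^l_{j,t} = g^l(h^l_{j,t}, \theta^l_j)$. We defer the details on $g^l$ to Appendix B. Denote $W_o$ as a weight matrix with $V$ rows and $a^{1:L}_{j,t}$ as the concatenation of $a^l_{j,t}$ across all layers; different from (1), the conditional probability of $y_{j,t}$ becomes

$$p\big(y_{j,t} \mid y_{j,<t}, \theta^{1:L}_j\big) = \mathrm{softmax}\big(W_o\, a^{1:L}_{j,t}\big). \tag{4}$$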


There are two main reasons for combining all the latent representations $a^{1:L}_{j,t}$ for language modeling. First, the latent representations exhibit different statistical properties at different stochastic layers of rGBN-RNN, and hence are combined together to enhance their representation power. Second, having “skip connections” from all hidden layers to the output one makes it easier to train the proposed network, reducing the number of processing steps between the bottom of the network and the top and hence mitigating the “vanishing gradient” problem (Graves, 2013).
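The output layer with skip connections from every layer can be sketched in a few lines. The sketch takes the per-layer coupling vectors as given inputs, since the gating unit that produces them is deferred to Appendix B and is not reproduced here; the function names and the toy sizes are assumptions of this example.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def next_word_distribution(coupling_vectors, W_o):
    """Concatenate the coupling vectors of all layers and apply softmax(W_o a).

    coupling_vectors: list of L lists, one coupling vector per layer.
    W_o: list of V rows, each as long as the concatenated vector.
    """
    a = [v for layer in coupling_vectors for v in layer]   # skip connections
    if any(len(row) != len(a) for row in W_o):
        raise ValueError("each row of W_o must match the concatenated length")
    logits = [sum(w * x for w, x in zip(row, a)) for row in W_o]
    return softmax(logits)

# Toy example: three layers with 2, 2 and 1 units, vocabulary of 3 words.
a_layers = [[0.2, -0.1], [0.5, 0.3], [-0.4]]
W_o = [[0.1, 0.0, 0.2, -0.3, 0.5],
       [0.0, 0.4, -0.2, 0.1, 0.0],
       [-0.1, 0.2, 0.0, 0.3, -0.2]]
probs = next_word_distribution(a_layers, W_o)
```

Since every layer contributes columns to `W_o` directly, the gradient of the output loss reaches each layer without passing through the layers above it.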

To sum up, as depicted in Fig. 1(a), the topic weight vector $\theta^l_j$ of sentence $j$ quantifies the topic usage of its document context $d_j$ at layer $l$. It is further used as an additional feature of the language model to guide the word generation inside sentence $S_j$, as shown in Fig. 1(b). It is clear that rGBN-RNN has two temporal structures: a deep recurrent topic model to extract the temporal topic weight vectors from the sequential document contexts, and a language model to estimate the probability of each sentence given its corresponding hierarchical topic weight vector. Characterizing the word-sentence-document hierarchy to incorporate both intra- and inter-sentence information, rGBN-RNN learns more coherent and interpretable topics and increases the generative power of the language model. Distinct from existing topic-guided language models, the temporally related hierarchical topics of rGBN exhibit different statistical properties across layers, which better guides the language model to improve its language generation ability.

2.3 Model likelihood and inference

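For rGBN-RNN, given $\{\Phi^l, \Pi^l\}_{l=1}^{L}$, the marginal likelihood of the sequence of sentence-context pairs $(\{S_1, d_1\}, \ldots, \{S_J, d_J\})$ of document $\mathcal{D}$ is defined as

$$P\big(\mathcal{D} \mid \{\Phi^l, \Pi^l\}_{l=1}^{L}\big) = \int \prod_{j=1}^{J} \Big[ p\big(d_j \mid \Phi^1\theta^1_j\big) \prod_{t=1}^{T_j} p\big(y_{j,t} \mid y_{j,<t}, \theta^{1:L}_j\big) \Big] \Big[ \prod_{l=1}^{L} p\big(\theta^l_j \mid e^l_j, \tau_0\big) \Big]\, d\theta^{1:L}_{1:J}, \tag{5}$$

where $e^l_j := \Phi^{l+1}\theta^{l+1}_j + \Pi^l\theta^l_{j-1}$. The inference task is to learn the parameters of both the topic model and language model components. One naive solution is to alternate the training between these two components in each iteration: first, the topic model is trained using a sampling based iterative algorithm provided in Guo et al. (2018); second, the language model is trained with maximum likelihood estimation under a standard cross-entropy loss. While this naive solution can utilize readily available inference algorithms for both rGBN and the language model, it may suffer from stability and convergence issues. Moreover, the need to perform a sampling based iterative algorithm for rGBN inside each iteration limits the scalability of the model for both training and testing.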

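To this end, we introduce a variational recurrent inference network (encoder) to learn the latent temporal topic weight vectors $\theta^{1:L}_{1:J}$. Denoting $Q = \prod_{j=1}^{J} \prod_{l=1}^{L} q(\theta^l_j \mid d_{\le j})$, the ELBO of the log marginal likelihood shown in (5) can be constructed as

$$L = \sum_{j=1}^{J} \mathbb{E}_Q\big[\ln p(d_j \mid \Phi^1\theta^1_j)\big] + \sum_{j=1}^{J} \sum_{t=1}^{T_j} \mathbb{E}_Q\big[\ln p(y_{j,t} \mid y_{j,<t}, \theta^{1:L}_j)\big] - \sum_{j=1}^{J} \sum_{l=1}^{L} \mathbb{E}_Q\left[\ln \frac{q(\theta^l_j \mid d_{\le j})}{p(\theta^l_j \mid e^l_j, \tau_0)}\right], \tag{6}$$

which unites both the terms that are primarily responsible for training the recurrent hierarchical topic model component, and terms for training the neural language model component. Similar to Zhang et al. (2018), we define $q(\theta^l_j \mid d_{\le j}) = \mathrm{Weibull}(k^l_j, \lambda^l_j)$, a random sample from which can be obtained by transforming standard uniform variables $\epsilon^l_j$ as

$$\theta^l_j = \lambda^l_j \big(-\ln(1 - \epsilon^l_j)\big)^{1/k^l_j}. \tag{7}$$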
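The Weibull reparameterization and the resulting regularization term can be sketched as below. The sampler applies the inverse-CDF transform of a standard uniform variable. The closed-form Kullback-Leibler divergence from a Weibull(k, lambda) posterior to a Gamma(alpha, beta) prior (shape alpha, rate beta) is the standard analytic expression for this pair of distributions; treating it as the per-variable regularizer in the ELBO is an assumption of this sketch, and the function names are hypothetical.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329

def sample_weibull(k, lam, eps):
    """Inverse-CDF transform: eps ~ Uniform(0, 1) -> Weibull(k, lam) draw."""
    return lam * (-math.log(1.0 - eps)) ** (1.0 / k)

def kl_weibull_gamma(k, lam, alpha, beta):
    """KL( Weibull(k, lam) || Gamma(shape=alpha, rate=beta) ), closed form."""
    return (EULER_GAMMA * alpha / k
            - alpha * math.log(lam)
            + math.log(k)
            + beta * lam * math.gamma(1.0 + 1.0 / k)
            - EULER_GAMMA - 1.0
            - alpha * math.log(beta)
            + math.lgamma(alpha))

# Monte Carlo check of the sampler against the known Weibull mean.
rng = random.Random(0)
k, lam = 2.0, 3.0
draws = [sample_weibull(k, lam, rng.random()) for _ in range(20000)]
empirical_mean = sum(draws) / len(draws)
exact_mean = lam * math.gamma(1.0 + 1.0 / k)
```

Because the draw is a deterministic, differentiable function of `k`, `lam` and the noise `eps`, gradients of the ELBO can flow through the sample to the encoder parameters, which is what makes the Weibull family convenient here.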


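To capture the temporal dependencies between the topic weight vectors, both $k^l_j$ and $\lambda^l_j$, from the bottom to top layers, can be expressed as

$$h^{s,l}_j = \mathrm{RNN}^l_{\mathrm{sent}}\big(h^{s,l}_{j-1}, h^{s,l-1}_j\big), \qquad k^l_j = f^l_k\big(h^{s,l}_j\big), \qquad \lambda^l_j = f^l_\lambda\big(h^{s,l}_j\big), \tag{8}$$

where $h^{s,0}_j = d_j$, $h^{s,l}_0 = 0$, $\mathrm{RNN}^l_{\mathrm{sent}}$ denotes the sentence-level recurrent encoder at layer $l$ implemented with a basic RNN cell, capturing the sequential relationship between sentences within a document, $h^{s,l}_j$ denotes the hidden state of $\mathrm{RNN}^l_{\mathrm{sent}}$, and the superscript $s$ in $h^{s,l}_j$ denotes “sentence-level RNN”, used to distinguish it from the hidden state of the language model in (3). Note that both $f^l_k$ and $f^l_\lambda$ are nonlinear functions mapping state $h^{s,l}_j$ to the parameters of $\theta^l_j$, implemented with $f(x) = \ln(1 + \exp(Wx + b))$.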
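A minimal sketch of such a sentence-level recurrent encoder is given below, assuming a basic tanh RNN cell per layer and softplus output heads so that the Weibull shape and scale stay positive. The dictionary keys, the weight initialization, and the toy layer sizes are hypothetical; the hidden size of each layer is set equal to its number of topics.

```python
import math
import random

def softplus(x):
    """Numerically stable ln(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def affine(W, b, x):
    return [sum(w * v for w, v in zip(row, x)) + bi for row, bi in zip(W, b)]

def init_layer(n_in, n_hid, rng):
    """Random toy weights for one encoder layer (illustrative only)."""
    def mat(r, c):
        return [[rng.uniform(-0.3, 0.3) for _ in range(c)] for _ in range(r)]
    zeros = [0.0] * n_hid
    return {"W_in": mat(n_hid, n_in), "W_rec": mat(n_hid, n_hid), "b": zeros,
            "W_k": mat(n_hid, n_hid), "b_k": zeros,
            "W_lam": mat(n_hid, n_hid), "b_lam": zeros}

def encode(contexts, layers):
    """Map context vectors d_1..d_J to per-layer Weibull (k, lam) parameters."""
    h_prev = [[0.0] * len(p["b"]) for p in layers]   # zero initial states
    out = []
    for d in contexts:
        below, h_new, params = [float(c) for c in d], [], []
        for p, h_old in zip(layers, h_prev):          # bottom to top
            inp = affine(p["W_in"], p["b"], below)
            rec = affine(p["W_rec"], [0.0] * len(p["b"]), h_old)
            h = [math.tanh(a + r) for a, r in zip(inp, rec)]
            k = [softplus(v) for v in affine(p["W_k"], p["b_k"], h)]
            lam = [softplus(v) for v in affine(p["W_lam"], p["b_lam"], h)]
            params.append((k, lam))
            h_new.append(h)
            below = h                                  # input to the next layer
        h_prev = h_new
        out.append(params)
    return out

rng = random.Random(0)
layers = [init_layer(6, 4, rng), init_layer(4, 3, rng), init_layer(3, 2, rng)]
docs = [[1, 0, 2, 0, 0, 1], [0, 3, 0, 1, 0, 0], [2, 0, 0, 0, 1, 1]]
posterior = encode(docs, layers)
```

The parameters for sentence `j` depend only on the context vectors up to `j`, so changing a later context leaves earlier posteriors untouched.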

Rather than finding a point estimate of the global parameters $\{\Phi^l, \Pi^l\}_{l=1}^{L}$ of the rGBN, we adopt a hybrid inference algorithm by combining TLASGR-MCMC, described in Cong et al. (2017a) and Zhang et al. (2018), with our proposed recurrent variational inference network. In other words, the global parameters $\{\Phi^l, \Pi^l\}_{l=1}^{L}$ can be sampled with TLASGR-MCMC, while the parameters of the language model and variational recurrent inference network, denoted by $\Omega$, can be updated via stochastic gradient descent (SGD) by maximizing the ELBO in (6). We describe a hybrid variational/sampling inference for rGBN-RNN in Algorithm 1 and provide more details about sampling with TLASGR-MCMC in Appendix C. We defer the details on model complexity to Appendix E.

To sum up, as shown in Fig. 1(c), the proposed rGBN-RNN works with a recurrent variational autoencoder inference framework, which takes the document context of the $j$th sentence within a document as input and learns hierarchical topic weight vectors $\theta^{1:L}_j$ that evolve sequentially with $j$. The learned topic vectors in different layers are then used to reconstruct the document context input and as an additional feature for the language model to generate the $j$th sentence.

  Set mini-batch size $m$ and the number of layers $L$
  Initialize encoder and neural language model parameters $\Omega$, and topic model parameters $\{\Phi^l, \Pi^l\}_{l=1}^{L}$.
  for iter = 1, 2, ... do
      Randomly select a mini-batch of $m$ documents, each consisting of $J$ sentences, to form a subset;
      Draw random noise $\{\epsilon^l_j\}$ from the uniform distribution;
      Calculate the gradient of the ELBO with respect to $\Omega$ according to (6), and update $\Omega$;
      Sample $\theta^l_j$ from (7) and (8) via $\Omega$ to update $\{\Phi^l\}_{l=1}^{L}$ and $\{\Pi^l\}_{l=1}^{L}$, as described in Appendix C;
  end for
Algorithm 1 Hybrid SG-MCMC and recurrent autoencoding variational inference for rGBN-RNN.

3 Experimental results

We consider three publicly available corpora: APNEWS, IMDB, and BNC. The links, preprocessing steps, and summary statistics for them are deferred to Appendix D. We consider a recurrent variational inference network for rGBN-RNN to infer the topic weight vectors $\theta^l_j$, as shown in Fig. 1(c), whose number of hidden units in (8) is set the same as the number of topics in the corresponding layer. Following Lau et al. (2017), word embeddings are pre-trained 300-dimension word2vec Google News vectors. Dropout is applied to the input of the stacked RNN at each layer, i.e., the word embedding or the coupling vector fed to the word-level LSTM in (3). The gradients are clipped if the norm of the parameter vector exceeds a fixed threshold. We use the Adam optimizer (Kingma & Ba, 2015). The length of an input sentence is fixed, and the mini-batch size, the number of training epochs, and the scaling hyperparameter $\tau_0$ are held at fixed values. Code in TensorFlow is provided.

3.1 Quantitative comparison


For fair comparison, we use standard language model perplexity as the evaluation metric, and consider the following baselines: (i) a standard LSTM language model (Hochreiter & Schmidhuber, 1997); (ii) LCLM (Tian & Cho, 2016), a larger-context language model that incorporates context from preceding sentences, which are treated as a bag of words; (iii) a standard LSTM language model incorporating the topic information of a separately trained LDA (LDA+LSTM); (iv) Topic-RNN (Dieng et al., 2017), a hybrid model rescoring the prediction of the next word by incorporating the topic information through a linear transformation; (v) TDLM (Lau et al., 2017), a joint learning framework which learns a convolutional based topic model and a language model simultaneously; (vi) TCNLM (Wang et al., 2018), which extracts the global semantic coherence of a document via a neural topic model, with the probability of each learned latent topic further adopted to build a mixture-of-experts language model; (vii) TGVAE (Wang et al., 2019), combining a variational auto-encoder based neural sequence model with a neural topic model; (viii) GBN-RNN, a simplified rGBN-RNN that removes the recurrent structure of its rGBN component.
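For reference, perplexity is the exponential of the average negative log-probability per token, so lower is better. The sketch below computes it from per-token natural-log probabilities grouped by sentence; the function name and input layout are assumptions of this example rather than the evaluation script used for Table 1.

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative log-probability over all tokens.

    token_log_probs: list of sentences, each a list of natural-log
    probabilities ln p(y_t | context) assigned by the language model.
    """
    n_tokens = sum(len(sent) for sent in token_log_probs)
    if n_tokens == 0:
        raise ValueError("need at least one token")
    total = sum(lp for sent in token_log_probs for lp in sent)
    return math.exp(-total / n_tokens)

# A model that spreads probability uniformly over 50 words has perplexity 50.
uniform_lp = math.log(1.0 / 50.0)
ppl = perplexity([[uniform_lp] * 4, [uniform_lp] * 6])
```

A perplexity of 50 can thus be read as the model being, on average, as uncertain as a uniform choice among 50 words.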

For rGBN-RNN, to ensure that information about the words in the $j$th sentence to be predicted does not leak through the sequential document context vectors at the testing stage, the input $d_j$ in (8) only summarizes the preceding sentences $S_{<j}$. For GBN-RNN, following TDLM (Lau et al., 2017) and TCNLM (Wang et al., 2018), all the sentences in a document, excluding the one being predicted, are used to obtain the BoW document context. As shown in Table 1, rGBN-RNN outperforms all baselines, and the trend of improvement continues as its number of layers increases, indicating the effectiveness of assimilating recurrent hierarchical topic information. rGBN-RNN consistently outperforms GBN-RNN, suggesting the benefits of exploiting the sequential dependencies of the sentence-contexts for language modeling. Moreover, comparing Table 1 and Table 4 of Appendix E suggests that rGBN-RNN, with its hierarchical and temporal topical guidance, achieves better performance with fewer parameters than comparable RNN-based baselines.

Note that for language modeling, there has been significant recent interest in replacing RNNs with Transformer (Vaswani et al., 2017), which consists of stacked multi-head self-attention modules, and its variants (Dai et al., 2019; Devlin et al., 2019; Radford et al., 2018, 2019). While Transformer based language models have been shown to be powerful in various natural language processing tasks, they often have significantly more parameters, require much more training data, and take much longer to train than RNN-based language models. For example, Transformer-XL with 12L and that with 24L (Dai et al., 2019), which improve Transformer to capture longer-range dependencies, have 41M and 277M parameters, respectively, while the proposed rGBN-RNN with three stochastic hidden layers has as few as 7.3M parameters, as shown in Table 4, when used for language modeling. From a structural point-of-view, we consider the proposed rGBN-RNN as complementary to rather than competing with Transformer based language models, and consider replacing RNN with Transformer to construct rGBN guided Transformer as a promising future extension.

BLEU: Following Wang et al. (2019), we use test-BLEU to evaluate the quality of generated sentences with a set of real test sentences as the reference, and self-BLEU to evaluate the diversity of the generated sentences (Zhu et al., 2018). Given the global parameters of the deep recurrent topic model (rGBN) and language model, we can generate the sentences by following the data generation process of rGBN-RNN: we first generate the top-layer topic weight vectors randomly and then propagate them downward through the rGBN as in (2) to generate the topic weight vectors of the lower layers. By assimilating the randomly drawn topic weight vectors with the hidden states of the language model in each layer depicted in (3), we generate a corresponding sentence, where we start from a zero hidden state at each layer in the language model, and sample words sequentially until the end-of-the-sentence symbol is generated. Comparisons of the BLEU scores between different methods are shown in Fig. 2, using the benchmark tool in Texygen (Zhu et al., 2018). We show below BLEU-3 and BLEU-4 for BNC and defer the analogous plots for IMDB and APNEWS to Appendix G and H. Note that we set the validation dataset as the ground-truth. For all datasets, it is clear that rGBN-RNN yields both higher test-BLEU and lower self-BLEU scores than related methods do, indicating that the stacked-RNN based language model in rGBN-RNN generalizes well and does not suffer from mode collapse (i.e., low diversity).

| Model | LSTM Size | Topic Size | Perplexity (APNEWS) | Perplexity (IMDB) | Perplexity (BNC) |
|---|---|---|---|---|---|
| LCLM (Tian & Cho, 2016) | 600 | - | 54.18 | 67.78 | 96.50 |
| | 900-900 | - | 50.63 | 67.86 | 87.77 |
| LDA+LSTM (Lau et al., 2017) | 600 | 100 | 55.52 | 69.64 | 96.50 |
| | 900-900 | 100 | 50.75 | 63.04 | 87.77 |
| TopicRNN (Dieng et al., 2017) | 600 | 100 | 54.54 | 67.83 | 93.57 |
| | 900-900 | 100 | 50.24 | 61.59 | 84.62 |
| TDLM (Lau et al., 2017) | 600 | 100 | 52.75 | 63.45 | 85.99 |
| | 900-900 | 100 | 48.97 | 59.04 | 81.83 |
| TCNLM (Wang et al., 2018) | 600 | 100 | 52.63 | 62.64 | 86.44 |
| | 900-900 | 100 | 47.81 | 56.38 | 80.14 |
| TGVAE (Wang et al., 2019) | 600 | 50 | 48.73 | 57.11 | 87.86 |
| basic-LSTM (Hochreiter & Schmidhuber, 1997) | 600 | - | 64.13 | 72.14 | 102.89 |
| | 900-900 | - | 58.89 | 66.47 | 94.23 |
| | 900-900-900 | - | 60.13 | 65.16 | 95.73 |
| GBN-RNN | 600 | 100 | 47.42 | 57.01 | 86.39 |
| | 600-512 | 100-80 | 44.64 | 55.42 | 82.95 |
| | 600-512-256 | 100-80-50 | 44.35 | 54.53 | 80.25 |
| rGBN-RNN | 600 | 100 | 46.35 | 55.76 | 81.94 |
| | 600-512 | 100-80 | 43.26 | 53.82 | 80.25 |
| | 600-512-256 | 100-80-50 | 42.71 | 51.36 | 79.13 |

Table 1: Comparison of perplexity on three different datasets.
Figure 2: BLEU scores of different methods for BNC, with panels (a) BLEU-3 and (b) BLEU-4. x-axis denotes test-BLEU, y-axis self-BLEU, and a better BLEU score would fall within the lower right corner.
Figure 3: Visualization of the $\ell_2$-norm of the hidden states of the language model of rGBN-RNN, shown in the top row, and that of GBN-RNN, shown in the bottom row.
Figure 4: Topics and their temporal trajectories inferred by a three-hidden-layer rGBN-RNN from the APNEWS dataset, and the generated sentences under topic guidance (best viewed in color). Top words of each topic at layer 3, 2, 1 are shown in orange, yellow and blue boxes respectively, and each sentence is shown in a dotted line box labeled with the corresponding topic index. Sentences generated with a combination of topics in different layers are at the bottom of the figure.

3.2 Qualitative analysis

Hierarchical structure of language model: In Fig. 3, we visualize the hierarchical multi-scale structures learned with the language model of rGBN-RNN and that of GBN-RNN, by plotting the $\ell_2$-norm of the hidden states in each layer while reading a sentence from the APNEWS validation set: “the service employee international union asked why cme group needs tax relief when it is making huge amounts of money?” As shown in Fig. 3(a), in the bottom hidden layer (h1), the norm sequence varies quickly from word to word, except within short phrases such as “service employee,” “international union,” and “tax relief,” suggesting that layer h1 is in charge of capturing short-term local dependencies. By contrast, in the top hidden layer (h3), the norm sequence varies slowly and exhibits semantically/syntactically meaningful long segments, such as “service employee international union,” “asked why cme group needs tax relief,” “when it is,” and “making huge amounts of,” suggesting that layer h3 is in charge of capturing long-range dependencies. Therefore, the language model in rGBN-RNN can allow more specific information to transmit through lower layers, while allowing more general higher level information to transmit through higher layers. Our proposed model is thus able to learn the hierarchical structure of the sequence, even though it is not purposely designed as a multiscale RNN like that of Chung et al. (2017). We also visualize the language model of GBN-RNN in Fig. 3(b); with much less smoothly time-evolved deeper layers, GBN-RNN fails to utilize its stacked RNN structure as effectively as rGBN-RNN does. This suggests that the language model is much better trained in rGBN-RNN than in GBN-RNN for capturing long-range temporal dependencies, which helps explain why rGBN-RNN exhibits clearly boosted BLEU scores in comparison to GBN-RNN.

Hierarchical topics: We present an example topic hierarchy inferred by a three-layer rGBN-RNN from APNEWS. In Fig. 4, we select a large-weighted topic at the top hidden layer and move down the network to include any lower-layer topics connected to their ancestors with sufficiently large weights. Horizontal arrows link temporally related topics at the same layer, while top-down arrows link hierarchically related topics across layers. For example, the topic on “budget, lawmakers, gov., revenue,” is related not only in hierarchy to the topic on “lawmakers, pay, proposal, legislation” and the topic of the lower layer on “budget, gov., revenue, vote, costs, mayor,” but also in time to the topic of the same layer on “democratic, taxes, proposed, future, state.” rGBN-RNN captures highly interpretable hierarchical relationships between the topics at different layers, as well as temporal relationships between the topics at the same layer, and the topics are often quite specific semantically at the bottom layer while becoming increasingly more general when moving upwards.

Sentence generation under topic guidance: Given the learned rGBN-RNN, we can sample sentences conditioned either on a single topic of a certain layer or on a combination of topics from different layers. As shown in the dotted-line boxes in Fig. 4, most of the generated sentences conditioned on a single topic or a combination of topics are highly related to the given topics in terms of their semantic meaning but not necessarily in key words, indicating that the language model is successfully guided by the recurrent hierarchical topics. These observations suggest that rGBN-RNN has successfully captured syntax and global semantics simultaneously for natural language generation.

Figure 5: Examples of generated sentences and paragraph conditioned on a document from APNEWS (green denotes novel words, blue the key words in document and generated sentences.)

Sentence/paragraph generation conditioning on a paragraph: Given the GBN-RNN and rGBN-RNN learned on APNEWS, we further present the generated sentences conditioning on a paragraph, as shown in Fig. 5. To randomly generate sentences, we encode the paragraph into a hierarchical latent representation and then feed it into the stacked-RNN. In addition, we can generate a paragraph with rGBN-RNN, using its recurrent inference network to encode the paragraph into a dynamic hierarchical latent representation, which is fed into the language model to predict the word sequence in each sentence of the input paragraph. It is clear that both the proposed GBN-RNN and rGBN-RNN can successfully capture the key textual information of the input paragraph, and generate diverse realistic sentences. Interestingly, rGBN-RNN can generate semantically coherent paragraphs, incorporating contextual information both within and beyond the sentences. Note that with the topics that extract the document-level word co-occurrence patterns, our proposed models can generate semantically meaningful words, which may not exist in the original document.

4 Conclusion

We propose a recurrent gamma belief network (rGBN) guided neural language modeling framework, a novel method to learn a language model and a deep recurrent topic model simultaneously. For scalable inference, we develop hybrid SG-MCMC and recurrent autoencoding variational inference, allowing efficient end-to-end training. Experimental results on real-world corpora demonstrate that the proposed models outperform a variety of shallow-topic-model-guided neural language models, and effectively generate sentences from the designated multi-level topics or noise, while inferring interpretable hierarchical latent topic structures of documents and hierarchical multiscale structures of sequences. For future work, we plan to extend the proposed models to specific natural language processing tasks, such as machine translation, image paragraph captioning, and text summarization. Another promising extension is to replace the stacked-RNN in rGBN-RNN with Transformer, i.e., constructing an rGBN guided Transformer as a new larger-context neural language model.


  • Ahn et al. (2017) Sungjin Ahn, Heeyoul Choi, Tanel Parnamaa, and Yoshua Bengio. A neural knowledge language model. arXiv: Computation and Language, 2017.
  • Blei & Lafferty (2006) D. M. Blei and J. D. Lafferty. Dynamic topic models. In ICML, 2006.
  • Blei et al. (2003) David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022, 2003.
  • Cho et al. (2014) Kyunghyun Cho, Bart Van Merrienboer, Caglar Gulcehre, Fethi Bougares, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Computer Science, 2014.
  • Chung et al. (2017) Junyoung Chung, Sungjin Ahn, and Yoshua Bengio. Hierarchical multiscale recurrent neural networks. In ICLR, 2017.
  • Cong et al. (2017a) Yulai Cong, Bo Chen, Hongwei Liu, and Mingyuan Zhou. Deep latent Dirichlet allocation with topic-layer-adaptive stochastic gradient Riemannian MCMC. In ICML, 2017a.
  • Cong et al. (2017b) Yulai Cong, Bo Chen, and Mingyuan Zhou. Fast simulation of hyperplane-truncated multivariate normal distributions. Bayesian Anal., 12(4):1017–1037, 2017b.
  • Consortium (2007) BNC Consortium. The British National Corpus, version 3 (BNC XML Edition), 2007.
  • Dai et al. (2019) Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G Carbonell, Quoc V Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. In ACL, 2019.
  • Devlin et al. (2019) Jacob Devlin, Mingwei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In North American Chapter of the Association for Computational Linguistics, pp. 4171–4186, 2019.
  • Dieng et al. (2017) Adji B Dieng, Chong Wang, Jianfeng Gao, and John Paisley. TopicRNN: A recurrent neural network with long-range semantic dependency. In ICLR, 2017.
  • Gan et al. (2015) Zhe Gan, Changyou Chen, Ricardo Henao, David Carlson, and Lawrence Carin. Scalable deep Poisson factor analysis for topic modeling. In ICML, pp. 1823–1832, 2015.
  • Gan et al. (2017) Zhe Gan, Chuang Gan, Xiaodong He, Yunchen Pu, Kenneth Tran, Jianfeng Gao, Lawrence Carin, and Li Deng. Semantic compositional networks for visual captioning. In CVPR, pp. 1141–1150, 2017.
  • Gehrmann et al. (2018) Sebastian Gehrmann, Yuntian Deng, and Alexander Rush. Bottom-up abstractive summarization. In EMNLP, pp. 4098–4109, 2018.
  • Girolami & Calderhead (2011) Mark A Girolami and Ben Calderhead. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of The Royal Statistical Society Series B-statistical Methodology, 73(2):123–214, 2011.
  • Graves et al. (2013) A. Graves, A. Mohamed, and G. Hinton. Speech recognition with deep recurrent neural networks. In ICASSP, pp. 6645–6649, 2013.
  • Graves (2013) Alex Graves. Generating sequences with recurrent neural networks. arXiv: Neural and Evolutionary Computing, 2013.
  • Griffiths & Steyvers (2004) T. L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, 101:5228–5235, 2004.
  • Griffiths et al. (2004) Thomas L Griffiths, Mark Steyvers, David M Blei, and Joshua B Tenenbaum. Integrating topics and syntax. In NeurIPS, pp. 537–544, 2004.
  • Guo et al. (2018) Dandan Guo, Bo Chen, Hao Zhang, and Mingyuan Zhou. Deep Poisson gamma dynamical systems. In NeurIPS, pp. 8451–8461, 2018.
  • Hochreiter & Schmidhuber (1997) Sepp Hochreiter and Jurgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
  • Hoffman et al. (2013) Matthew D Hoffman, David M Blei, Chong Wang, and John Paisley. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347, 2013.
  • Kingma & Ba (2015) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
  • Kingma & Welling (2013) Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In ICLR, 2013.
  • Klein & Manning (2003) D Klein and Christopher D Manning. Accurate unlexicalized parsing. In Meeting of the Association for Computational Linguistics, 2003.
  • Lau et al. (2017) Jey Han Lau, Timothy Baldwin, and Trevor Cohn. Topically driven neural language model. In Meeting of the Association for Computational Linguistics, pp. 355–365, 2017.
  • Li et al. (2015) C. Li, C. Chen, D. Carlson, and L. Carin. Preconditioned stochastic gradient Langevin dynamics for deep neural networks. arXiv, 2015.
  • Ma et al. (2015) Y. Ma, T. Chen, and E. Fox. A complete recipe for stochastic gradient MCMC. In NIPS, pp. 2899–2907, 2015.
  • Maas et al. (2011) Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Huang Dan, Andrew Y. Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Meeting of the Association for Computational Linguistics: Human Language Technologies, 2011.
  • Mao et al. (2015) Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Zhiheng Huang, and Alan L Yuille. Deep captioning with multimodal recurrent neural networks (m-RNN). In ICLR, 2015.
  • Miao et al. (2017) Yishu Miao, Edward Grefenstette, and Phil Blunsom. Discovering discrete latent topics with neural variational inference. In ICML, pp. 2410–2419, 2017.
  • Mikolov & Zweig (2012) Tomas Mikolov and Geoffrey Zweig. Context dependent recurrent neural network language model. In SLT, pp. 234–239, 2012.
  • Mikolov et al. (2010) Tomas Mikolov, Martin Karafiat, Lukas Burget, Jan Cernocky, and Sanjeev Khudanpur. Recurrent neural network based language model. In Interspeech, 2010.
  • Mikolov et al. (2011) Tomas Mikolov, Stefan Kombrink, Lukas Burget, Jan Cernocky, and Sanjeev Khudanpur. Extensions of recurrent neural network language model. In ICASSP, pp. 5528–5531, 2011.
  • Mnih & Gregor (2014) Andriy Mnih and Karol Gregor. Neural variational inference and learning in belief networks. In ICML, pp. 1791–1799, 2014.
  • Patterson & Teh (2013) Sam Patterson and Yee Whye Teh. Stochastic gradient Riemannian Langevin dynamics on the probability simplex. In NIPS, pp. 3102–3110, 2013.
  • Radford et al. (2018) Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. 2018.
  • Radford et al. (2019) Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019.
  • Rennie et al. (2017) Steven J Rennie, Etienne Marcheret, Youssef Mroueh, Jerret Ross, and Vaibhava Goel. Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7008–7024, 2017.
  • Rezende et al. (2014) Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, pp. 1278–1286, 2014.
  • Rush et al. (2015) Alexander M Rush, Sumit Chopra, and Jason Weston. A neural attention model for abstractive sentence summarization. In EMNLP, pp. 379–389, 2015.
  • Schein et al. (2016) Aaron Schein, Hanna Wallach, and Mingyuan Zhou. Poisson–gamma dynamical systems. In Neural Information Processing Systems, 2016.
  • Srivastava & Sutton (2017) Akash Srivastava and Charles Sutton. Autoencoding variational inference for topic models. In ICLR, 2017.
  • Srivastava et al. (2013) Nitish Srivastava, Ruslan Salakhutdinov, and Geoffrey E Hinton. Modeling documents with deep Boltzmann machines. Uncertainty in Artificial Intelligence, 2013.
  • Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104–3112, 2014.
  • Teh et al. (2006) Yee Whye Teh, Michael I Jordan, Matthew J Beal, and David M Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581, 2006.
  • Tian & Cho (2016) Wang Tian and Kyunghyun Cho. Larger-context language modelling with recurrent neural network. In Meeting of the Association for Computational Linguistics, 2016.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Neural Information Processing Systems, pp. 6000–6010, 2017.
  • Vinyals et al. (2015) Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In CVPR, 2015.
  • Wallach (2006) Hanna M Wallach. Topic modeling: beyond bag-of-words. In ICML, pp. 977–984, 2006.
  • Wang et al. (2018) Wenlin Wang, Zhe Gan, Wenqi Wang, Dinghan Shen, Jiaji Huang, Wei Ping, Sanjeev Satheesh, and Lawrence Carin. Topic compositional neural language model. In AISTATS, pp. 356–365, 2018.
  • Wang et al. (2019) Wenlin Wang, Zhe Gan, Hongteng Xu, Ruiyi Zhang, Guoyin Wang, Dinghan Shen, Changyou Chen, and Lawrence Carin. Topic-guided variational autoencoders for text generation. In NAACL, 2019.
  • Welling & Teh (2011) Max Welling and Yee Whye Teh. Bayesian learning via stochastic gradient Langevin dynamics. In ICML, pp. 681–688, 2011.
  • Xu et al. (2015) Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, pp. 2048–2057, 2015.
  • Zhang et al. (2018) Hao Zhang, Bo Chen, Dandan Guo, and Mingyuan Zhou. WHAI: Weibull hybrid autoencoding inference for deep topic modeling. In ICLR, 2018.
  • Zhao et al. (2018) He Zhao, Lan Du, Wray Buntine, and Mingyuan Zhou. Dirichlet belief networks for topic structure learning. In Neural Information Processing Systems, pp. 7955–7966, 2018.
  • Zhou & Carin (2015) Mingyuan Zhou and Lawrence Carin. Negative binomial process count and mixture modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2):307–320, 2015.
  • Zhou et al. (2016) Mingyuan Zhou, Yulai Cong, and Bo Chen. Augmentable gamma belief networks. J. Mach. Learn. Res., 17(163):1–44, 2016.
  • Zhu et al. (2018) Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, and Yong Yu. Texygen: A benchmarking platform for text generation models. SIGIR, 2018.

Appendix A The GBN-RNN

GBN-RNN: Each training example is a sentence-context pair, in which the context represents the document-level context as a word frequency count vector: each element counts the number of times the corresponding vocabulary word appears in the document, excluding the sentence itself. The hierarchical model of a multi-hidden-layer GBN, from top to bottom, is expressed as


The stacked-RNN based language model described in (3) is also used in GBN-RNN.
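The top-down GBN hierarchy described above can be illustrated with a minimal sampling sketch. It assumes the standard gamma belief network construction of Zhou et al. (2016): gamma-distributed topic weights at each hidden layer, whose shape parameters come from the layer above through a column-stochastic loading matrix, and Poisson-distributed word counts at the bottom. The names `phis`, `r`, and `tau`, the toy layer sizes, and the treatment of `tau` as a rate parameter are assumptions made for illustration only.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix stored as a list of rows by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def random_topics(n_rows, n_cols, rng):
    """Return an n_rows x n_cols matrix whose columns lie on the simplex."""
    cols = []
    for _ in range(n_cols):
        g = [rng.gammavariate(0.5, 1.0) + 1e-12 for _ in range(n_rows)]
        total = sum(g)
        cols.append([x / total for x in g])
    return [[cols[k][v] for k in range(n_cols)] for v in range(n_rows)]


def sample_poisson(rate, rng):
    """Knuth's method; adequate for the small rates in this toy example."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_gbn(phis, r, tau, rng):
    """Top-down draw. phis[0] maps layer-1 weights to word rates, and
    phis[l] maps layer-(l+1) weights to layer-l gamma shape parameters."""
    theta = [rng.gammavariate(shape, 1.0 / tau) for shape in r]  # top layer
    thetas = [theta]
    for phi in reversed(phis[1:]):
        shapes = matvec(phi, theta)
        # Floor the shape so gammavariate never receives a non-positive value.
        theta = [rng.gammavariate(max(s, 1e-6), 1.0 / tau) for s in shapes]
        thetas.append(theta)
    counts = [sample_poisson(rate, rng) for rate in matvec(phis[0], theta)]
    return counts, thetas[::-1]  # thetas[0] is the bottom hidden layer


rng = random.Random(0)
vocab_size, layer_sizes = 20, [8, 4, 2]  # toy sizes, bottom to top
dims = [vocab_size] + layer_sizes
phis = [random_topics(dims[l], dims[l + 1], rng) for l in range(len(layer_sizes))]
counts, thetas = sample_gbn(phis, r=[1.0] * layer_sizes[-1], tau=1.0, rng=rng)
print(counts)
```

Because every column of a loading matrix sums to one, the expected total weight is passed down unchanged from layer to layer, which is why the toy word counts stay small.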

Statistical inference: To infer GBN-RNN, we consider a hybrid of stochastic gradient MCMC (Welling & Teh, 2011; Patterson & Teh, 2013; Li et al., 2015; Ma et al., 2015; Cong et al., 2017a), used for the GBN topics, and auto-encoding variational inference (Kingma & Welling, 2013; Rezende et al., 2014), used for the parameters of both the inference network (encoder) and the RNN. More specifically, GBN-RNN generalizes the Weibull hybrid auto-encoding inference (WHAI) of Zhang et al. (2018): it uses a deterministic-upward–stochastic-downward inference network to encode the bag-of-words representation of the document context into the latent topic-weight variables across all hidden layers, which are fed not only into the GBN to reconstruct the document context, but also into the stacked RNN of the language model, as shown in (3), to predict the word sequence in the sentence. The topics can be sampled with topic-layer-adaptive stochastic gradient Riemannian (TLASGR) MCMC, whose details can be found in Cong et al. (2017a) and Zhang et al. (2018) and are omitted here for brevity. Given the sampled topics, the joint marginal likelihood of the sentence-context pair is defined as


For efficient inference, an inference network is used to provide an evidence lower bound (ELBO) on the log joint marginal likelihood as


and training is performed by maximizing this ELBO over the training data. Following Zhang et al. (2018), we define a Weibull variational posterior for each layer, two parameters of which are deterministically transformed from the document context using neural networks. Distinct from a usual variational auto-encoder, whose inference network has a purely bottom-up structure, the inference network here has a deterministic-upward–stochastic-downward ladder structure (Zhang et al., 2018).
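The Weibull distribution is convenient in this setting for two reasons: it can be sampled by pushing uniform noise through its inverse CDF, and its KL divergence to a gamma distribution has a closed form. The sketch below illustrates both properties for a single scalar. The function names and the example parameter values are hypothetical, and the gamma distribution is assumed to be parameterized by shape `alpha` and rate `beta`.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329


def sample_weibull(k, lam, rng):
    """Reparameterized draw: push uniform noise through the inverse CDF."""
    u = rng.random()
    return lam * (-math.log(1.0 - u)) ** (1.0 / k)


def kl_weibull_gamma(k, lam, alpha, beta):
    """KL( Weibull(shape=k, scale=lam) || Gamma(shape=alpha, rate=beta) )."""
    return (EULER_GAMMA * alpha / k - alpha * math.log(lam) + math.log(k)
            + beta * lam * math.gamma(1.0 + 1.0 / k)
            - EULER_GAMMA - 1.0 - alpha * math.log(beta) + math.lgamma(alpha))


rng = random.Random(0)
print([sample_weibull(2.0, 1.5, rng) for _ in range(3)])
print(kl_weibull_gamma(2.0, 1.5, 2.0, 1.0))
```

Since the draw is a deterministic, differentiable function of `k`, `lam`, and the noise `u`, gradients of the ELBO can flow through the sample to the encoder parameters.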

Appendix B The coupling vector

Following Lau et al. (2017), the coupling vector can be implemented with a gating unit similar to a GRU (Cho et al., 2014), described as
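As an illustration of what such a GRU-like gating unit looks like, the sketch below mixes an LSTM hidden state `h` with a topic weight vector `theta` through an update gate, a reset gate, and a candidate state. The exact form used by the authors is not recoverable from this section, so the parameter names (`W_z`, `U_z`, `b_z`, and so on), the shapes, and the final interpolation are assumptions in the spirit of the gate in Lau et al. (2017).

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix stored as a list of rows by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def init_params(hidden, topics, rng):
    """Small random weights; all names here are illustrative assumptions."""
    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    params = {}
    for gate in ("z", "r", "h"):
        params["W_" + gate] = mat(hidden, topics)   # acts on topic weights
        params["U_" + gate] = mat(hidden, hidden)   # acts on the hidden state
        params["b_" + gate] = [0.0] * hidden
    return params


def gated_coupling(h, theta, params):
    """GRU-style gate combining hidden state h with topic weights theta."""
    def affine(gate, state):
        wx = matvec(params["W_" + gate], theta)
        uh = matvec(params["U_" + gate], state)
        return [a + b + c for a, b, c in zip(wx, uh, params["b_" + gate])]

    z = [sigmoid(v) for v in affine("z", h)]            # update gate
    r = [sigmoid(v) for v in affine("r", h)]            # reset gate
    reset_h = [ri * hi for ri, hi in zip(r, h)]
    cand = [math.tanh(v) for v in affine("h", reset_h)]  # candidate state
    # Interpolate between the original state and the topic-aware candidate.
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]


rng = random.Random(0)
params = init_params(hidden=3, topics=2, rng=rng)
print(gated_coupling([0.2, -0.1, 0.4], [1.5, 0.3], params))
```

When the update gate is close to zero the output reduces to the original hidden state, so the language model can ignore the topic information wherever it is not useful.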


Appendix C SGMCMC for rGBN-RNN

To allow for scalable inference, we apply the topic-layer-adaptive stochastic gradient Riemannian (TLASGR) MCMC algorithm described in Cong et al. (2017a) and Zhang et al. (2018), which can be used to sample simplex-constrained global parameters (Cong et al., 2017b) in a mini-batch based manner. It improves sampling efficiency via the use of the Fisher information matrix (FIM) (Girolami & Calderhead, 2011), with adaptive step sizes for the latent factors and transition matrices of different layers. In this section, we discuss in detail how to update the global parameters of rGBN and give the complete procedure in Algorithm 1.

Sample the auxiliary counts: This step is about the “backward” and “upward” pass. Let us denote ,  , and , where is shown in (2.1). Working backward for and upward for , we draw


Note that via the deep structure, the latent counts will be influenced by the effects from both time at layer  and time at layer . With and , we can sample the latent counts at layer  and by


and then draw


In rGBN, the prior and the likelihood of the transition matrices are very similar to those of the loading matrices, so we apply the TLASGR MCMC sampling algorithm to both of them, conditioned on the auxiliary counts.
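The auxiliary-count draws themselves are not recoverable from this section. In gamma belief networks (Zhou et al., 2016), count augmentation of this kind typically relies on the Chinese restaurant table (CRT) distribution, so the sketch below shows only how a single CRT variable can be drawn. That the draws above use exactly this distribution is an assumption, and `sample_crt` is a hypothetical helper name.

```python
import random


def sample_crt(n, r, rng):
    """Draw l ~ CRT(n, r): the number of occupied tables after seating n
    customers in a Chinese restaurant process with concentration r."""
    # Customer i (0-based) opens a new table with probability r / (r + i).
    return sum(1 for i in range(n) if rng.random() < r / (r + i))


rng = random.Random(0)
print([sample_crt(10, 1.0, rng) for _ in range(5)])
```

The first customer always opens a table, so the draw is at least 1 whenever `n` is positive and never exceeds `n`.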

Sample the hierarchical components: For each column of the loading matrix of each layer, sampling can be efficiently realized as


where is calculated using the estimated FIM, and , comes from the augmented latent counts in (13), denote the prior of , and denotes a simplex constraint.

Sample the transition matrix: For each column of the transition matrix of each layer, sampling can be efficiently realized as


where is calculated using the estimated FIM, and , comes from the augmented latent counts in (16), denotes a simplex constraint, and denotes the prior of . More details about TLASGR-MCMC for our proposed model can be found in Cong et al. (2017a).
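Both column updates share the same structure: a preconditioned drift toward the count-based posterior, Gaussian noise whose covariance is proportional to diag(phi) minus phi phi^T, and a simplex constraint. The sketch below follows the form of the update in Cong et al. (2017a) for a single column. The argument names, the scalar `precond` standing in for the FIM-based preconditioner, and the clip-and-renormalize step used in place of the exact simplex projection are all assumptions or simplifications.

```python
import math
import random


def sgr_simplex_step(phi, counts, eta, rho, step, precond, rng):
    """One stochastic-gradient Riemannian MCMC step for a simplex column.

    phi: current column (sums to 1); counts: mini-batch latent counts;
    eta: Dirichlet prior parameter; rho: dataset-to-mini-batch size ratio.
    """
    vocab = len(phi)
    total = sum(counts)
    scale = step / precond
    drift = [(rho * n + eta) - (rho * total + eta * vocab) * p
             for n, p in zip(counts, phi)]
    # Noise with covariance 2*scale*(diag(phi) - phi phi^T): draw independent
    # Gaussians with variance 2*scale*phi_v, then remove the phi-weighted sum.
    z = [rng.gauss(0.0, math.sqrt(2.0 * scale * p)) for p in phi]
    z_sum = sum(z)
    noise = [zi - p * z_sum for zi, p in zip(z, phi)]
    proposal = [p + scale * d + e for p, d, e in zip(phi, drift, noise)]
    # Simplified simplex constraint: clip to positive values and renormalize.
    clipped = [max(x, 1e-10) for x in proposal]
    norm = sum(clipped)
    return [x / norm for x in clipped]


rng = random.Random(0)
phi = [0.25, 0.25, 0.25, 0.25]
for _ in range(2000):
    phi = sgr_simplex_step(phi, counts=[50, 0, 0, 0], eta=0.1, rho=1.0,
                           step=0.01, precond=50.0, rng=rng)
print(phi)
```

Both the drift and the noise sum to zero across the column, so the proposal already sums to one before the constraint is applied; the clipping only handles entries that stray below zero.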

Appendix D Datasets

We consider three publicly available corpora. APNEWS is a collection of Associated Press news articles from 2009 to 2016, IMDB is a set of movie reviews collected by Maas et al. (2011), and BNC is the written portion of the British National Corpus (Consortium, 2007). Following the preprocessing steps in Lau et al. (2017), we tokenize words and sentences using Stanford CoreNLP (Klein & Manning, 2003), lowercase all word tokens, and filter out word tokens that occur fewer than 10 times. For the topic model, we additionally exclude stopwords (we use Mallet’s stopword list) and the top most frequent words. All these corpora are partitioned into training, validation, and testing sets, whose summary statistics are provided in Table 2.
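The vocabulary construction described above can be sketched as follows, assuming documents are already tokenized into sentences and words. The function names are hypothetical, and `top_fraction` is a placeholder for the share of most frequent words removed from the topic-model vocabulary, whose value is not given in this section.

```python
from collections import Counter


def build_vocabularies(docs, stopwords, min_count=10, top_fraction=0.0):
    """docs: list of documents, each a list of sentences (lists of tokens)."""
    freq = Counter(tok.lower() for doc in docs for sent in doc for tok in sent)
    # Language-model vocabulary: lowercased words seen at least min_count times.
    lm_vocab = {w for w, c in freq.items() if c >= min_count}
    # Topic-model vocabulary: also drop stopwords and the most frequent words.
    n_top = int(len(lm_vocab) * top_fraction)
    most_common = {w for w, _ in freq.most_common(n_top)}
    tm_vocab = {w for w in lm_vocab
                if w not in stopwords and w not in most_common}
    return lm_vocab, tm_vocab


def bow_excluding_sentence(doc, skip_index, tm_vocab):
    """Count topic-model words in a document, leaving out one sentence."""
    counts = Counter()
    for i, sent in enumerate(doc):
        if i == skip_index:
            continue
        counts.update(t.lower() for t in sent if t.lower() in tm_vocab)
    return counts


docs = [
    [["The", "cat", "sat"], ["the", "cat", "ran"]],
    [["A", "dog", "sat"], ["the", "dog", "ran"]],
]
lm_vocab, tm_vocab = build_vocabularies(docs, stopwords={"the", "a"}, min_count=2)
print(sorted(lm_vocab), sorted(tm_vocab))
print(bow_excluding_sentence(docs[0], 0, tm_vocab))
```

The second helper builds the document-level context for one sentence by counting the topic-model words in all other sentences of the same document.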

Dataset   Vocabulary       Training               Validation             Testing
          LM      TM       Docs   Sents   Tokens  Docs   Sents   Tokens  Docs   Sents   Tokens
APNEWS    34231   32169    50K    0.8M    15M     2K     33K     0.6M    2K     32K     0.6M
IMDB      36009   34925    75K    1.1M    20M     12.5K  0.18M   0.3M    12.5K  0.18M   0.3M
BNC       43703   41552    15K    1M      18M     1K     57K     1M      1K     66K     1M
Table 2: Summary statistics for the datasets.

Appendix E Complexity of rGBN-RNN

The proposed rGBN-RNN consists of both language model and topic model components. For the topic model component, there are the global parameters of rGBN (decoder), defined in (2.1), and the parameters of the variational recurrent inference network (encoder), defined in (8). The language model component is parameterized by the stacked-RNN parameters in (3) and the coupling vectors described in Appendix B. We summarize in Table 3 the complexity of rGBN-RNN (ignoring all bias terms). The complexity depends on the vocabulary size of the language model, the dimension of the word embedding vectors, the size of the topic-model vocabulary (which excludes stop words), the number of hidden units of the word-level LSTM at each layer (stacked-RNN language model), the number of hidden units of the sentence-level RNN at each layer (variational recurrent inference network), and the number of topics at each layer.

Table 4 further compares the number of parameters of various RNN-based language models, where we follow the convention of ignoring the word embedding layers. Some models in Table 1 are not included here because we could not find sufficient information in their papers or code to accurately calculate the number of model parameters. Note that when used for language generation at the testing stage, rGBN-RNN no longer needs its topics, whose parameters are hence not counted. Note that the number of parameters of the topic model component is often dominated by that of the language model component.
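As a quick sanity check on the language-model parameter counts, the weights of a stacked LSTM (ignoring biases and the embedding layer) can be counted directly: each layer has four gates, each with an input-to-hidden and a hidden-to-hidden matrix. The sketch below assumes a 300-dimensional word embedding, which is not stated in this section; under that assumption it reproduces the 2.16M and 10.80M figures reported for the one- and two-layer basic-LSTM.

```python
def lstm_param_count(input_size, hidden_sizes):
    """Weights of a stacked LSTM, ignoring biases and word embeddings."""
    total, in_dim = 0, input_size
    for hidden in hidden_sizes:
        # Four gates, each with input-to-hidden and hidden-to-hidden weights.
        total += 4 * (in_dim + hidden) * hidden
        in_dim = hidden  # the next layer reads this layer's hidden state
    return total


print(lstm_param_count(300, [600]))       # 2160000
print(lstm_param_count(300, [900, 900]))  # 10800000
```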

Component   Language Model     Topic Model
Param       in (3)   in B      in (2.1)   in (2.1)   in (8)   in (8)   in (8)
Table 3: Complexity of the three-layer rGBN-RNN.

Model                                         LSTM Size     Topic Size   # LM Param   # TM Param   # All Param
TDLM (Lau et al., 2017)                       600           100          3.35M        0.019M       3.37M
                                              900-900       100          13.38M       0.019M       13.40M
basic-LSTM (Hochreiter & Schmidhuber, 1997)   600           -            2.16M        -            2.16M
                                              900-900       -            10.80M       -            10.80M
                                              900-900-900   -            17.68M       -            17.68M
GBN-RNN                                       600           100          3.40M        0.02M        3.42M
                                              600-512       100-80       6.50M        0.04M        6.54M
                                              600-512-256   100-80-50    7.20M        0.05M        7.25M
rGBN-RNN                                      600           100          3.40M        0.03M        3.43M
                                              600-512       100-80       6.50M        0.06M        6.56M
                                              600-512-256   100-80-50    7.20M        0.07M        7.27M
Table 4: Comparison of the number of parameters of different models when used for language generation.

Appendix F More experimental results on IMDB and BNC

Figure 6: Topics and their temporal trajectories inferred by a three-hidden-layer rGBN-RNN from the IMDB dataset, and the sentences generated under topic guidance (best viewed in color). Top words of each topic are shown in orange, yellow, and blue boxes, and each sentence is shown in a dotted-line box labeled with the corresponding topic index. Sentences generated with a combination of topics in different layers are at the bottom of the figure.
Figure 7: Topics and their temporal trajectories inferred by a three-hidden-layer rGBN-RNN from the BNC dataset, and the sentences generated under topic guidance (best viewed in color). Top words of each topic are shown in orange, yellow, and blue boxes, and each sentence is shown in a dotted-line box labeled with the corresponding topic index. Sentences generated with a combination of topics in different layers are at the bottom of the figure.

Appendix G BLEU scores for IMDB

Figure 8: BLEU scores of different methods for IMDB. The x-axis denotes test-BLEU and the y-axis self-BLEU. The left panel shows BLEU-3 and the right panel BLEU-4; a better BLEU score falls in the lower right corner. Black points represent mean values, and circles of different colors denote the elliptical surface of probability of BLEU in a two-dimensional space.

Appendix H BLEU scores for APNEWS

Figure 9: BLEU scores of different methods for APNEWS. The x-axis denotes test-BLEU and the y-axis self-BLEU. The left panel shows BLEU-3 and the right panel BLEU-4; a better BLEU score falls in the lower right corner. Black points represent mean values, and circles of different colors denote the elliptical surface of probability of BLEU in a two-dimensional space.
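To make the two axes in the BLEU figures concrete, the sketch below computes test-BLEU (generated sentences scored against held-out test sentences, higher is better) and self-BLEU in the sense of Zhu et al. (2018) (each generated sentence scored against the other generated sentences, lower means more diverse). The BLEU implementation is a simplified, unsmoothed version written for illustration; it is not the evaluation code used for the figures, and the function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hypothesis, references, max_n=3):
    """Unsmoothed BLEU with uniform n-gram weights and a brevity penalty."""
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp = ngrams(hypothesis, n)
        if not hyp:
            return 0.0
        # Clip each n-gram count by its maximum count in any single reference.
        best = Counter()
        for ref in references:
            for gram, c in ngrams(ref, n).items():
                best[gram] = max(best[gram], c)
        matched = sum(min(c, best[g]) for g, c in hyp.items())
        if matched == 0:
            return 0.0
        log_precisions.append(math.log(matched / sum(hyp.values())))
    # Brevity penalty relative to the reference closest in length.
    closest = min((abs(len(r) - len(hypothesis)), len(r)) for r in references)[1]
    if len(hypothesis) >= closest:
        penalty = 1.0
    else:
        penalty = math.exp(1.0 - closest / len(hypothesis))
    return penalty * math.exp(sum(log_precisions) / max_n)


def bleu_vs_test_set(generated, test_set, max_n=3):
    """Test-BLEU: average BLEU of generated sentences against test sentences."""
    return sum(bleu(g, test_set, max_n) for g in generated) / len(generated)


def self_bleu(generated, max_n=3):
    """Self-BLEU: each generated sentence is scored against all the others."""
    scores = [bleu(g, generated[:i] + generated[i + 1:], max_n)
              for i, g in enumerate(generated)]
    return sum(scores) / len(scores)


generated = [["a", "b", "c", "d"], ["e", "f", "g", "h"]]
test_set = [["a", "b", "c", "d"]]
print(bleu_vs_test_set(generated, test_set), self_bleu(generated))
```

A model in the lower right corner of the figures therefore produces sentences that resemble real test text while differing from one another.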