Topic Compositional Neural Language Model

12/28/2017 ∙ by Wenlin Wang, et al.

We propose a Topic Compositional Neural Language Model (TCNLM), a novel method designed to simultaneously capture both the global semantic meaning and the local word ordering structure in a document. The TCNLM learns the global semantic coherence of a document via a neural topic model, and the probability of each learned latent topic is further used to build a Mixture-of-Experts (MoE) language model, where each expert (corresponding to one topic) is a recurrent neural network (RNN) that accounts for learning the local structure of a word sequence. In order to train the MoE model efficiently, a matrix factorization method is applied, by extending each weight matrix of the RNN to be an ensemble of topic-dependent weight matrices. The degree to which each member of the ensemble is used is tied to the document-dependent probability of the corresponding topics. Experimental results on several corpora show that the proposed approach outperforms both a pure RNN-based model and other topic-guided language models. Further, our model yields sensible topics, and also has the capacity to generate meaningful sentences conditioned on given topics.


1 Introduction

Figure 1: The overall architecture of the proposed model.

A language model is a fundamental component of natural language processing (NLP). It plays a key role in many traditional NLP tasks, ranging from speech recognition (Mikolov et al., 2010; Arisoy et al., 2012; Sriram et al., 2017) and machine translation (Schwenk et al., 2012; Vaswani et al., 2013) to image captioning (Mao et al., 2014; Devlin et al., 2015). Training a good language model often improves the underlying metrics of these applications, e.g., word error rates for speech recognition and BLEU scores (Papineni et al., 2002) for machine translation. Hence, learning a powerful language model has become a central task in NLP. Typically, the primary goal of a language model is to predict distributions over words, which must encode both the semantic knowledge and grammatical structure in the documents. RNN-based neural language models have yielded state-of-the-art performance (Jozefowicz et al., 2016; Shazeer et al., 2017). However, they are typically applied only at the sentence level, without access to the broad document context. Such models may consequently fail to capture long-term dependencies of a document (Dieng et al., 2016).

Fortunately, such broader context information is of a semantic nature, and can be captured by a topic model. Topic models have been studied for decades and have become a powerful tool for extracting high-level semantic structure of document collections, by inferring latent topics. The classical Latent Dirichlet Allocation (LDA) method (Blei et al., 2003) and its variants, including recent work on neural topic models (Wan et al., 2012; Cao et al., 2015; Miao et al., 2017), have been useful for a plethora of applications in NLP.

Although language models that leverage topics have shown promise, they also have several limitations. For example, some of the existing methods use only pre-trained topic models (Mikolov and Zweig, 2012), without considering the word-sequence prediction task of interest. Another key limitation of the existing methods lies in the integration of the learned topics into the language model; e.g., either through concatenating the topic vector as an additional feature of RNNs (Mikolov and Zweig, 2012; Lau et al., 2017), or re-scoring the predicted distribution over words using the topic vector (Dieng et al., 2016). The former requires a balance between the number of RNN hidden units and the number of topics, while the latter has to carefully design the vocabulary of the topic model.

Motivated by the aforementioned goals and limitations of existing approaches, we propose the Topic Compositional Neural Language Model (TCNLM), a new approach to simultaneously learn a neural topic model and a neural language model. As depicted in Figure 1, TCNLM learns the latent topics within a variational autoencoder (Kingma and Welling, 2013) framework, and the designed latent code quantifies the probability of topic usage within a document. The latent code is further used in a Mixture-of-Experts model (Hu et al., 1997), where each latent topic has a corresponding language model (expert). A combination of these “experts,” weighted by the topic-usage probabilities, results in our prediction for the sentences. A matrix factorization approach is further utilized to reduce computational cost as well as prevent overfitting. The entire model is trained end-to-end by maximizing the variational lower bound. Through a comprehensive set of experiments, we demonstrate that the proposed model is able to significantly reduce the perplexity of a language model and effectively assemble the meaning of topics to generate meaningful sentences. Both quantitative and qualitative comparisons are provided to verify the superiority of our model.

2 Preliminaries

We briefly review RNN-based language models and traditional probabilistic topic models.

Language Model

A language model aims to learn a probability distribution over a sequence of words in a pre-defined vocabulary. We denote $\mathcal{V}$ as the vocabulary set and $\{y_1, \ldots, y_M\}$ to be a sequence of words, with each $y_m \in \mathcal{V}$. A language model defines the likelihood of the sequence through a joint probability distribution

$$p(y_1, \ldots, y_M) = p(y_1) \prod_{m=2}^{M} p(y_m \mid y_{1:m-1}). \tag{1}$$

RNN-based language models define the conditional probability of each word $y_m$ given all the previous words $y_{1:m-1}$ through the hidden state $h_m$:

$$p(y_m \mid y_{1:m-1}) = p(y_m \mid h_m), \tag{2}$$
$$h_m = f(h_{m-1}, x_m). \tag{3}$$

The function $f(\cdot)$ is typically implemented as a basic RNN cell, a Long Short-Term Memory (LSTM) cell (Hochreiter and Schmidhuber, 1997), or a Gated Recurrent Unit (GRU) cell (Cho et al., 2014). The input and output words are related via the relation $x_m = y_{m-1}$.
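As a rough illustration of Eqs. (1)-(3), the sketch below scores a word-id sequence with a basic tanh RNN cell in pure Python. The dimensions, the random weights, the embedding table `E`, and the use of id 0 as a start token are all assumptions made for this toy example; they are not taken from the paper.

```python
import math
import random

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def sequence_log_prob(words, E, W, U, V, start_id=0):
    """Chain rule: log p(y_1..y_M) = sum_m log p(y_m | h_m),
    with h_m = f(h_{m-1}, x_m) and x_m the embedding of y_{m-1}."""
    h = [0.0] * len(U)           # initial hidden state
    prev = start_id              # assumed start token feeding the first step
    total = 0.0
    for y in words:
        x = E[prev]
        pre = [p + q for p, q in zip(matvec(W, x), matvec(U, h))]
        h = [math.tanh(v) for v in pre]      # basic RNN cell as f
        probs = softmax(matvec(V, h))        # p(y_m | h_m)
        total += math.log(probs[y])
        prev = y
    return total

rng = random.Random(0)
vocab, n_x, n_h = 5, 3, 4                    # illustrative sizes
E = rand_matrix(vocab, n_x, rng)
W = rand_matrix(n_h, n_x, rng)
U = rand_matrix(n_h, n_h, rng)
V = rand_matrix(vocab, n_h, rng)
log_p = sequence_log_prob([1, 3, 2, 4], E, W, U, V)
```

Because each step normalizes over the vocabulary, the probabilities of all sequences of a fixed length sum to one, which is a quick sanity check on the factorization.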

Topic Model

A topic model is a probabilistic graphical representation for uncovering the underlying semantic structure of a document collection. Latent Dirichlet Allocation (LDA) (Blei et al., 2003), for example, provides a robust and scalable approach for document modeling, by introducing latent variables for each token, indicating its topic assignment. Specifically, let $t_d$ denote the topic proportion for document $d$, and $z_n$ represent the topic assignment for word $w_n$. The Dirichlet distribution is employed as the prior of $t_d$. The generative process of LDA may be summarized as:

$$t_d \sim \mathrm{Dir}(\alpha_0), \qquad z_n \sim \mathrm{Discrete}(t_d), \qquad w_n \sim \mathrm{Discrete}(\beta_{z_n}),$$

where $\beta_{z_n}$ represents the distribution over words for topic $z_n$, $\alpha_0$ is the hyper-parameter of the Dirichlet prior, $n \in [1, N_d]$, and $N_d$ is the number of words in document $d$. The marginal likelihood for document $d$ can be expressed as

$$p(d \mid \alpha_0, \beta) = \int_{t_d} p(t_d \mid \alpha_0) \prod_{n} \sum_{z_n} p(w_n \mid \beta_{z_n})\, p(z_n \mid t_d)\, dt_d.$$
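The LDA generative process described above can be sketched in a few lines of pure Python. The two-topic, four-word `beta` matrix and the value of `alpha0` are made up for illustration, and the Dirichlet draw uses the standard construction from normalized Gamma samples.

```python
import random

def sample_dirichlet(alpha, rng):
    # Normalized Gamma(alpha_k, 1) draws follow a Dirichlet(alpha) distribution.
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [v / s for v in g]

def sample_discrete(probs, rng):
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_document(alpha0, beta, n_words, rng):
    """Draw topic proportions, then a topic and a word for each position."""
    t_d = sample_dirichlet(alpha0, rng)       # document-level topic proportions
    topics, words = [], []
    for _ in range(n_words):
        z = sample_discrete(t_d, rng)         # topic assignment for this word
        w = sample_discrete(beta[z], rng)     # word drawn from the chosen topic
        topics.append(z)
        words.append(w)
    return t_d, topics, words

rng = random.Random(0)
alpha0 = [0.5, 0.5]                           # hypothetical Dirichlet hyper-parameter
beta = [[0.6, 0.2, 0.1, 0.1],                 # hypothetical topic-word distributions
        [0.1, 0.1, 0.2, 0.6]]
t_d, topics, words = generate_document(alpha0, beta, 8, rng)
```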

3 Topic Compositional Neural Language Model

We describe the proposed TCNLM, as illustrated in Figure 1. Our model consists of two key components: (i) a neural topic model (NTM), and (ii) a neural language model (NLM). The NTM aims to capture the long-range semantic meanings across the document, while the NLM is designed to learn the local semantic and syntactic relationships between words.

3.1 Neural Topic Model

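Let $d \in \mathbb{Z}_+^{D}$ denote the bag-of-words representation of a document, with $\mathbb{Z}_+$ denoting nonnegative integers. $D$ is the vocabulary size, and each element of $d$ reflects a count of the number of times the corresponding word occurs in the document. Distinct from LDA (Blei et al., 2003), we pass a Gaussian random vector through a softmax function to parameterize the multinomial document topic distributions (Miao et al., 2017). Specifically, the generative process of the NTM is

$$\theta \sim \mathcal{N}(\mu_0, \sigma_0^2), \qquad t = g(\theta), \qquad z_n \sim \mathrm{Discrete}(t), \qquad w_n \sim \mathrm{Discrete}(\beta_{z_n}), \tag{4}$$

where $\mathcal{N}(\mu_0, \sigma_0^2)$ is an isotropic Gaussian distribution, with mean $\mu_0$ and variance $\sigma_0^2$ in each dimension; $g(\cdot)$ is a transformation function that maps sample $\theta$ to the topic embedding $t$, defined here as $g(\theta) = \mathrm{softmax}(\hat{W}\theta + \hat{b})$, where $\hat{W}$ and $\hat{b}$ are trainable parameters.

The marginal likelihood for document $d$ is:

$$p(d \mid \mu_0, \sigma_0, \beta) = \int_t p(t \mid \mu_0, \sigma_0^2) \prod_n \sum_{z_n} p(w_n \mid \beta_{z_n})\, p(z_n \mid t)\, dt = \int_t p(t \mid \mu_0, \sigma_0^2) \prod_n p(w_n \mid \beta, t)\, dt = \int_t p(t \mid \mu_0, \sigma_0^2)\, p(d \mid \beta, t)\, dt. \tag{5}$$

The second equality in (5) holds because we can readily marginalize out the sampled topic words $z_n$ by

$$p(w_n \mid \beta, t) = \sum_{z_n} p(w_n \mid \beta_{z_n})\, p(z_n \mid t) = \beta t. \tag{6}$$

$\beta = \{\beta_1, \beta_2, \ldots, \beta_T\}$ is the transition matrix from the topic distribution to the word distribution, whose entries are trainable parameters of the decoder; $T$ is the number of topics and $\beta_i \in \mathbb{R}^{D}$ is the topic distribution over words (all elements of $\beta_i$ are nonnegative, and they sum to one).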
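To make the Gaussian-softmax construction and the marginalization over topic assignments concrete, here is a minimal pure-Python sketch. The latent dimension, the toy `beta`, the random projection `W_hat`, and the example count vector are assumptions for illustration only. Note that `beta` is stored here as a list of topic rows, so the product written as a matrix-vector product in the text becomes a weighted sum of rows.

```python
import math
import random

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def topic_proportions(theta, W_hat, b_hat):
    """Map a Gaussian sample to topic proportions via an affine map and softmax."""
    logits = [sum(w * th for w, th in zip(row, theta)) + b
              for row, b in zip(W_hat, b_hat)]
    return softmax(logits)

def word_distribution(beta, t):
    """Marginalize topic assignments: p(w) = sum_k t_k * beta_k[w]."""
    n_words = len(beta[0])
    return [sum(t[k] * beta[k][w] for k in range(len(beta))) for w in range(n_words)]

def bow_log_likelihood(d, beta, t):
    """Log-likelihood of bag-of-words counts d under the mixed word distribution."""
    p = word_distribution(beta, t)
    return sum(c * math.log(p[w]) for w, c in enumerate(d) if c > 0)

rng = random.Random(0)
T, K = 3, 2                                   # topics and latent dimension (illustrative)
theta = [rng.gauss(0.0, 1.0) for _ in range(K)]
W_hat = [[rng.uniform(-1.0, 1.0) for _ in range(K)] for _ in range(T)]
b_hat = [0.0] * T
beta = [[0.7, 0.1, 0.1, 0.1],                 # each row: one topic's word distribution
        [0.1, 0.7, 0.1, 0.1],
        [0.1, 0.1, 0.1, 0.7]]
t = topic_proportions(theta, W_hat, b_hat)
p_w = word_distribution(beta, t)
ll = bow_log_likelihood([2, 0, 1, 3], beta, t)
```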

The re-parameterization trick (Kingma and Welling, 2013) can be applied to build an unbiased and low-variance gradient estimator for the variational distribution. The parameter updates can still be derived directly from the variational lower bound, as discussed in Section 3.3.

Diversity Regularizer

Redundancy in inferred topics is a common issue in general topic models. To address this issue, it is straightforward to regularize the row-wise distance between each pair of topics to diversify them. Following Xie et al. (2015) and Miao et al. (2017), we apply a topic diversity regularization while carrying out the inference.

Specifically, the distance between a pair of topics is measured by their cosine distance $a(\beta_i, \beta_j) = \arccos\left(\frac{|\beta_i \cdot \beta_j|}{\|\beta_i\|_2 \|\beta_j\|_2}\right)$. The mean angle of all pairs of $T$ topics is $\phi = \frac{1}{T^2} \sum_i \sum_j a(\beta_i, \beta_j)$, and the variance is $\nu = \frac{1}{T^2} \sum_i \sum_j \left(a(\beta_i, \beta_j) - \phi\right)^2$. Finally, the topic diversity regularization is defined as $R = \phi - \nu$.
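A small sketch of the diversity regularizer, assuming it is the mean pairwise angle between topic vectors minus the variance of those angles, with both averages taken over all ordered pairs (including each topic with itself). The two toy topic matrices are hypothetical.

```python
import math

def angle(u, v):
    """Angle between two topic vectors, from the absolute cosine similarity."""
    dot = abs(sum(a * b for a, b in zip(u, v)))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = min(1.0, dot / (norm_u * norm_v))   # clamp against rounding error
    return math.acos(cosine)

def diversity_regularizer(beta):
    """Mean pairwise angle minus the variance of the pairwise angles."""
    T = len(beta)
    angles = [angle(beta[i], beta[j]) for i in range(T) for j in range(T)]
    phi = sum(angles) / T ** 2
    nu = sum((a - phi) ** 2 for a in angles) / T ** 2
    return phi - nu

identical = [[0.5, 0.5], [0.5, 0.5]]             # fully redundant topics
orthogonal = [[1.0, 0.0], [0.0, 1.0]]            # maximally distinct topics
r_identical = diversity_regularizer(identical)
r_orthogonal = diversity_regularizer(orthogonal)
```

Redundant topics score near zero and distinct topics score higher, so adding this term to a maximized objective pushes the topics apart.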

3.2 Neural Language Model

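We propose a Mixture-of-Experts (MoE) language model, which consists of a set of $T$ “expert networks”, i.e., $E_1, E_2, \ldots, E_T$. Each expert is itself an RNN with its own parameters corresponding to a latent topic.

Without loss of generality, we begin by discussing an RNN with a simple transition function, which is then generalized to the LSTM. Specifically, we define two weight tensors $\mathcal{W} \in \mathbb{R}^{n_h \times n_x \times T}$ and $\mathcal{U} \in \mathbb{R}^{n_h \times n_h \times T}$, where $n_h$ is the number of hidden units and $n_x$ is the dimension of word embedding. Each expert $E_k$ corresponds to a set of parameters $\mathcal{W}[k]$ and $\mathcal{U}[k]$, which denote the $k$-th 2D “slice” of $\mathcal{W}$ and $\mathcal{U}$, respectively. All $T$ experts work cooperatively to generate an output $y_m$. Specifically,

$$p(y_m) = \sum_{k=1}^{T} t_k \cdot \mathrm{softmax}\left(V h_m^{(k)}\right), \tag{7}$$
$$h_m^{(k)} = \sigma\left(\mathcal{W}[k]\, x_m + \mathcal{U}[k]\, h_{m-1}\right), \tag{8}$$

where $t_k$ is the usage of topic $k$ (component $k$ of $t$), and $\sigma(\cdot)$ is a sigmoid function; $V$ is the weight matrix connecting the RNN’s hidden state, used for computing a distribution over words. Bias terms are omitted for simplicity.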
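The naive mixture in Eqs. (7)-(8) can be sketched as below: every expert runs its own recurrent update, and the resulting word distributions are averaged with the topic usage as weights. The sizes, random weights, and the choice to feed one shared previous hidden state to all experts are assumptions for this toy example.

```python
import math
import random

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def rand_matrix(rows, cols, rng, scale=0.5):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def naive_moe_step(x, h_prev, t, W, U, V):
    """Mix the experts' output distributions, weighted by topic usage t."""
    mixed = [0.0] * len(V)
    for k, t_k in enumerate(t):
        pre = [a + b for a, b in zip(matvec(W[k], x), matvec(U[k], h_prev))]
        h_k = [sigmoid(v) for v in pre]          # hidden state of expert k
        p_k = softmax(matvec(V, h_k))            # expert k's word distribution
        mixed = [m + t_k * p for m, p in zip(mixed, p_k)]
    return mixed

rng = random.Random(0)
T, n_x, n_h, vocab = 3, 2, 4, 5                  # illustrative sizes
W = [rand_matrix(n_h, n_x, rng) for _ in range(T)]   # one slice per topic
U = [rand_matrix(n_h, n_h, rng) for _ in range(T)]
V = rand_matrix(vocab, n_h, rng)
t = [0.7, 0.2, 0.1]
p_next = naive_moe_step([1.0, -1.0], [0.0] * n_h, t, W, U, V)
```

The loop makes the cost visible: one full recurrent update and one softmax per topic at every time step, plus a separate pair of weight matrices per topic.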

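However, such an MoE module is computationally prohibitive and demands excessive storage. The training process is inefficient and even infeasible in practice. To remedy this, instead of ensembling the output of the $T$ experts as in (7), we extend the weight matrix of the RNN to be an ensemble of topic-dependent weight matrices. Specifically, the $T$ experts work together as follows:

$$p(y_m) = \mathrm{softmax}(V h_m), \tag{9}$$
$$h_m = \sigma\left(W(t)\, x_m + U(t)\, h_{m-1}\right), \tag{10}$$

and

$$W(t) = \sum_{k=1}^{T} t_k \cdot \mathcal{W}[k], \qquad U(t) = \sum_{k=1}^{T} t_k \cdot \mathcal{U}[k]. \tag{11}$$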

In order to reduce the number of model parameters, motivated by Gan et al. (2016) and Song et al. (2016), instead of implementing a tensor as in (11), we decompose $W(t)$ into a multiplication of three terms $W_a \in \mathbb{R}^{n_h \times n_f}$, $W_b \in \mathbb{R}^{n_f \times T}$ and $W_c \in \mathbb{R}^{n_f \times n_x}$, where $n_f$ is the number of factors. Specifically,

$$W(t) = W_a \cdot \mathrm{diag}(W_b t) \cdot W_c = W_a \cdot (W_b t \odot W_c), \tag{12}$$

where $\odot$ represents the Hadamard operator. $W_a$ and $W_c$ are shared parameters across all topics, to capture the common linguistic patterns. $W_b$ contains the factors, which are weighted by the learned topic embedding $t$. The same factorization is also applied for $U(t)$. The topic distribution $t$ affects RNN parameters associated with the document when predicting the succeeding words, which implicitly defines an ensemble of $T$ language models. In this factorized model, the RNN weight matrices that correspond to each topic share “structure”.
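The factorization can be checked numerically. The sketch below assumes three factor matrices with shapes (hidden × factors), (factors × topics), and (factors × embedding), builds the topic-dependent weight matrix by scaling the factors with the topic vector, and confirms that it equals the topic-weighted sum of per-topic slices. All sizes and values are illustrative.

```python
import random

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def rand_matrix(rows, cols, rng, scale=1.0):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def composed_weight(W_a, W_b, W_c, t):
    """W(t) = W_a · diag(W_b t) · W_c, without storing one matrix per topic."""
    factor_scale = matvec(W_b, t)                # one scale per factor
    n_h, n_f, n_x = len(W_a), len(W_c), len(W_c[0])
    return [[sum(W_a[i][f] * factor_scale[f] * W_c[f][j] for f in range(n_f))
             for j in range(n_x)] for i in range(n_h)]

def slice_for_topic(W_a, W_b, W_c, k):
    """Weight matrix implied for topic k alone (one-hot topic vector)."""
    T = len(W_b[0])
    one_hot = [1.0 if j == k else 0.0 for j in range(T)]
    return composed_weight(W_a, W_b, W_c, one_hot)

rng = random.Random(0)
n_h, n_f, n_x, T = 3, 4, 2, 3                    # illustrative sizes
W_a = rand_matrix(n_h, n_f, rng)
W_b = rand_matrix(n_f, T, rng)
W_c = rand_matrix(n_f, n_x, rng)
t = [0.5, 0.3, 0.2]

W_t = composed_weight(W_a, W_b, W_c, t)
slices = [slice_for_topic(W_a, W_b, W_c, k) for k in range(T)]
W_mix = [[sum(t[k] * slices[k][i][j] for k in range(T)) for j in range(n_x)]
         for i in range(n_h)]
max_gap = max(abs(W_t[i][j] - W_mix[i][j]) for i in range(n_h) for j in range(n_x))
```

The agreement follows from linearity in the topic vector: the factorized form is an ensemble of per-topic matrices that share the two outer factor matrices.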

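Now we generalize the above analysis by using LSTM units. Specifically, we summarize the new topic compositional LSTM cell as:

$$\begin{aligned}
i_m &= \sigma\left(W_{ia}\, \tilde{x}_{i,m} + U_{ia}\, \tilde{h}_{i,m-1}\right), \\
f_m &= \sigma\left(W_{fa}\, \tilde{x}_{f,m} + U_{fa}\, \tilde{h}_{f,m-1}\right), \\
o_m &= \sigma\left(W_{oa}\, \tilde{x}_{o,m} + U_{oa}\, \tilde{h}_{o,m-1}\right), \\
\tilde{c}_m &= \tanh\left(W_{ca}\, \tilde{x}_{c,m} + U_{ca}\, \tilde{h}_{c,m-1}\right), \\
c_m &= i_m \odot \tilde{c}_m + f_m \odot c_{m-1}, \\
h_m &= o_m \odot \tanh(c_m).
\end{aligned} \tag{13}$$

For $* = i, f, o, c$, we define

$$\tilde{x}_{*,m} = W_{*b}\, t \odot W_{*c}\, x_m, \tag{14}$$
$$\tilde{h}_{*,m-1} = U_{*b}\, t \odot U_{*c}\, h_{m-1}. \tag{15}$$

Compared with a standard LSTM cell, our LSTM unit has a total number of parameters of size $4 n_f \cdot (n_x + 2T + 3 n_h)$, and the additional computational cost comes from (14) and (15). Further, an empirical comparison is conducted in Section 5.4 to verify that our proposed model is superior to the naive MoE implementation in (7).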
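A back-of-the-envelope parameter count helps show why the factorized cell is cheaper than keeping one LSTM per topic. The sketch assumes each of the four gates has three input-side factor matrices and three recurrent-side factor matrices, ignores biases, and uses made-up sizes; the naive count simply replicates a standard bias-free LSTM once per topic.

```python
def factorized_lstm_params(n_x, n_h, n_f, T):
    """Parameters of a factorized topic-compositional LSTM cell (no biases)."""
    input_side = n_h * n_f + n_f * T + n_f * n_x      # W_a, W_b, W_c for one gate
    recurrent_side = n_h * n_f + n_f * T + n_f * n_h  # U_a, U_b, U_c for one gate
    return 4 * (input_side + recurrent_side)

def naive_moe_lstm_params(n_x, n_h, T):
    """Parameters when every topic keeps its own standard LSTM cell (no biases)."""
    return T * 4 * n_h * (n_x + n_h)

# Hypothetical sizes chosen only to illustrate the gap.
n_x, n_h, n_f, T = 600, 600, 600, 100
factorized = factorized_lstm_params(n_x, n_h, n_f, T)
naive = naive_moe_lstm_params(n_x, n_h, T)
ratio = naive / factorized
```

Under these assumed sizes the naive ensemble needs tens of times more recurrent parameters, and the gap grows linearly with the number of topics.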

3.3 Model Inference

The proposed model (see Figure 1) follows the variational autoencoder (Kingma and Welling, 2013) framework, which takes the bag-of-words as input and embeds a document into the topic vector. This vector is then used to reconstruct the bag-of-words input, and also to learn an ensemble of RNNs for predicting a sequence of words in the document.

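The joint marginal likelihood can be written as:

$$p(y_{1:M}, d \mid \mu_0, \sigma_0^2, \beta) = \int_t p(t \mid \mu_0, \sigma_0^2)\, p(d \mid \beta, t) \prod_{m=1}^{M} p(y_m \mid y_{1:m-1}, t)\, dt. \tag{16}$$

Since the direct optimization of (16) is intractable, we employ variational inference (Jordan et al., 1999). We denote $q(t \mid d)$ to be the variational distribution for $t$. Hence, we construct the variational objective function, also called the evidence lower bound (ELBO), as

$$\mathcal{L} = \underbrace{\mathbb{E}_{q(t \mid d)}\left[\log p(d \mid t)\right] - \mathrm{KL}\left(q(t \mid d)\, \|\, p(t \mid \mu_0, \sigma_0^2)\right)}_{\text{neural topic model}} + \underbrace{\mathbb{E}_{q(t \mid d)}\left[\sum_{m=1}^{M} \log p(y_m \mid y_{1:m-1}, t)\right]}_{\text{neural language model}} \le \log p(y_{1:M}, d \mid \mu_0, \sigma_0^2, \beta). \tag{17}$$

More details can be found in the Supplementary Material. In experiments, we optimize the ELBO together with the diversity regularization:

$$\mathcal{J} = \mathcal{L} + \lambda \cdot R. \tag{18}$$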
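The pieces of the training objective can be sketched numerically. The code below assumes the variational posterior is a diagonal Gaussian over the latent vector (a common choice in variational autoencoders, not something the text states explicitly), uses the closed-form KL divergence to an isotropic Gaussian prior, and combines one-sample estimates of the two log-likelihood terms with the diversity regularizer. Function names and the example numbers are hypothetical.

```python
import math
import random

def kl_to_isotropic_prior(mu, log_var, mu0=0.0, sigma0_sq=1.0):
    """KL( N(mu, diag(exp(log_var))) || N(mu0, sigma0_sq * I) ), closed form."""
    return 0.5 * sum(
        math.exp(lv) / sigma0_sq + (m - mu0) ** 2 / sigma0_sq
        - 1.0 + math.log(sigma0_sq) - lv
        for m, lv in zip(mu, log_var)
    )

def reparameterize(mu, log_var, rng):
    """Sample theta = mu + sigma * eps with eps ~ N(0, I)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]

def objective(log_p_doc, log_p_words, kl, diversity, lam=0.1):
    """One-sample estimate of the regularized bound: ELBO + lam * diversity."""
    elbo = log_p_doc - kl + log_p_words
    return elbo + lam * diversity

rng = random.Random(0)
mu, log_var = [1.0, -0.5], [0.2, -0.3]           # hypothetical encoder outputs
theta = reparameterize(mu, log_var, rng)
kl = kl_to_isotropic_prior(mu, log_var)
kl_same = kl_to_isotropic_prior([0.0, 0.0], [0.0, 0.0])
j_value = objective(log_p_doc=-10.0, log_p_words=-20.0, kl=1.5, diversity=0.2, lam=0.1)
```

In a real implementation the two log-likelihood terms would come from the topic decoder and the language model evaluated at the sampled latent vector, and gradients would flow through `reparameterize`.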

4 Related Work

Dataset Vocabulary Training Development Testing
LM TM # Docs # Sents # Tokens # Docs # Sents # Tokens # Docs # Sents # Tokens
APNEWS
IMDB
BNC
Table 1: Summary statistics for the datasets used in the experiments.

Topic Model

Topic models have been studied for a variety of applications in document modeling. Beyond LDA (Blei et al., 2003), significant extensions have been proposed, including capturing topic correlations (Blei and Lafferty, 2007), modeling temporal dependencies (Blei and Lafferty, 2006), discovering an unbounded number of topics (Teh et al., 2005), learning deep architectures (Henao et al., 2015; Zhou et al., 2015), among many others. Recently, neural topic models have attracted much attention, building upon the successful usage of restricted Boltzmann machines (Hinton and Salakhutdinov, 2009), auto-regressive models (Larochelle and Lauly, 2012), sigmoid belief networks (Gan et al., 2015), and variational autoencoders (Miao et al., 2016).

Variational inference has been successfully applied in a variety of applications (Pu et al., 2016; Wang et al., 2017; Chen et al., 2017). The recent work of Miao et al. (2017) employs variational inference to train topic models, and is closely related to our work. Their model follows the original LDA formulation and extends it by parameterizing the multinomial distribution with neural networks. In contrast, our model requires the neural network not only to model documents as bags of words, but also to transfer the inferred topic knowledge to a language model for word-sequence generation.

Language Model

Neural language models have recently achieved remarkable advances (Mikolov et al., 2010). The RNN-based language model (RNNLM) is superior for its ability to model longer-term temporal dependencies without imposing a strong conditional independence assumption; it has recently been shown to outperform carefully-tuned traditional n-gram-based language models (Jozefowicz et al., 2016).

An RNNLM can be further improved by utilizing the broad document context (Mikolov and Zweig, 2012). Such models typically extract latent topics via a topic model, and then send the topic vector to a language model for sentence generation. Important work in this direction includes Mikolov and Zweig (2012); Dieng et al. (2016); Lau et al. (2017); Ahn et al. (2016). The key difference among these methods lies in either the topic model itself or the method of integrating the topic vector into the language model. In terms of the topic model, Mikolov and Zweig (2012) use a pre-trained LDA model; Dieng et al. (2016) use a variational autoencoder; Lau et al. (2017) introduce an attention-based convolutional neural network to extract semantic topics; and Ahn et al. (2016) utilize the topic associated with the fact pairs derived from a knowledge graph (Vinyals and Le, 2015).

Concerning the method of incorporating the topic vector into the language model, Mikolov and Zweig (2012) and Lau et al. (2017) extend the RNN cell with additional topic features. Dieng et al. (2016) and Ahn et al. (2016) use a hybrid model combining the predicted word distribution given by both a topic model and a standard RNNLM. Distinct from these approaches, our model learns the topic model and the language model jointly under the VAE framework, allowing an efficient end-to-end training process. Further, the topic information is used as guidance for a Mixture-of-Experts (MoE) model design. Under our factorization method, the model can yield boosted performance efficiently (as corroborated in the experiments).

Recently, Shazeer et al. (2017) proposed an MoE model for large-scale language modeling. Different from ours, they introduce an MoE layer, in which each expert stands for a small feed-forward neural network on the previous output of the LSTM layer. Therefore, it yields a significant quantity of additional parameters and computational cost, which is infeasible to train on a single GPU machine. Moreover, they provide no semantic meanings for each expert, and all experts are treated equally; in contrast, the proposed model can generate meaningful sentences conditioned on given topics.

Our TCNLM is similar to Gan et al. (2016). However, Gan et al. (2016) use a two-step pipeline, first learning a multi-label classifier on a group of pre-defined image tags, and then generating image captions conditioned on them. In comparison, our model jointly learns a topic model and a language model, and focuses on the language modeling task.

5 Experiments

Dataset LSTM basic-LSTM LDA+LSTM LCLM Topic-RNN TDLM TCNLM
type 50 100 150 50 100 150 50 100 150 50 100 150
APNEWS small 52.59
large 47.74
IMDB small 62.59
large 56.12
BNC small 86.21
large 80.12
Table 2: Test perplexities of different models on APNEWS, IMDB and BNC. () taken from Lau et al. (2017).

Datasets

We present experimental results on three publicly available corpora: APNEWS, IMDB and BNC. APNEWS (https://www.ap.org/en-gb/) is a collection of Associated Press news articles from 2009 to 2016. IMDB is a set of movie reviews collected by Maas et al. (2011), and BNC (BNC Consortium, 2007) is the written portion of the British National Corpus, which contains excerpts from journals, books, letters, essays, memoranda, news and other types of text. These three datasets can be downloaded from GitHub (https://github.com/jhlau/topically-driven-language-model).

We follow the preprocessing steps in Lau et al. (2017). Specifically, words and sentences are tokenized using Stanford CoreNLP (Manning et al., 2014). We lowercase all word tokens, and filter out word tokens that occur fewer than 10 times. For topic modeling, we additionally remove stopwords (we use the following stopwords list: https://github.com/mimno/Mallet/blob/master/stoplists/en.txt) in the documents and exclude the top most frequent words and also words that appear in fewer than 100 documents. All these datasets are divided into training, development and testing sets. Summary statistics for these datasets are provided in Table 1.

Setup

For the NTM part, we consider a 2-layer feed-forward neural network to model $q(t \mid d)$, with hidden units in each layer; ReLU (Nair and Hinton, 2010) is used as the activation function. The hyper-parameter $\lambda$ for the diversity regularizer is fixed to 0.1 across all the experiments. All the sentences in a paragraph, excluding the one being predicted, are used to obtain the bag-of-words document representation $d$. The maximum number of words in a paragraph is set to .

In terms of the NLM part, we consider 2 settings: (i) a small 1-layer LSTM model with hidden units, and (ii) a large 2-layer LSTM model with hidden units in each layer. The sequence length is fixed to 30. In order to alleviate overfitting, dropout with a rate of is used in each LSTM layer. In addition, adaptive softmax (Grave et al., 2016) is used to speed up the training process.

During training, the NTM and NLM parameters are jointly learned using Adam (Kingma and Ba, 2014). All the hyper-parameters are tuned based on the performance on the development set. We empirically find that the optimal settings are fairly robust across the 3 datasets. All the experiments were conducted using TensorFlow and trained on an NVIDIA GTX TITAN X with 3072 cores and 12GB global memory.

5.1 Language Model Evaluation

Perplexity is used as the metric to evaluate the performance of the language model. In order to demonstrate the advantage of the proposed model, we compare TCNLM with the following baselines:

  • basic-LSTM: A baseline LSTM-based language model, using the same architecture and hyper-parameters as TCNLM wherever applicable.

  • LDA+LSTM: A topic-enrolled LSTM-based language model. We first pretrain an LDA model (Blei et al., 2003) to learn 50/100/150 topics for APNEWS, IMDB and BNC. Given a document, the LDA topic distribution is incorporated by concatenating with the output of the hidden states to predict the next word.

  • LCLM (Wang and Cho, 2016): A context-based language model, which incorporates context information from preceding sentences. The preceding sentences are treated as bag-of-words, and an attention mechanism is used when predicting the next word.

  • TDLM (Lau et al., 2017): A convolutional topic model enrolled language model. Its topic knowledge is utilized by concatenating to a dense layer of a recurrent language model.

  • Topic-RNN (Dieng et al., 2016): A joint learning framework that learns a topic model and a language model simultaneously. The topic information is incorporated through a linear transformation to rescore the prediction of the next word.

We implemented Topic-RNN (Dieng et al., 2016) ourselves; the other baseline results are taken from Lau et al. (2017). Results are presented in Table 2. We highlight some observations. (i) All the topic-enrolled methods outperform the basic-LSTM model, indicating the effectiveness of incorporating global semantic topic information. (ii) Our TCNLM performs the best across all datasets, and its performance keeps improving as the number of topics increases. (iii) The improved performance of TCNLM over LCLM implies that encoding the document context into meaningful topics provides a better way to improve the language model compared with using the extra context words directly. (iv) The margin between LDA+LSTM/Topic-RNN and our TCNLM indicates that our model supplies a more efficient way to utilize the topic information through the joint variational learning framework to implicitly train an ensemble model.
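For reference, perplexity is the exponential of the average negative log-likelihood per token, so lower is better. A minimal sketch, with a made-up list of per-token log-probabilities standing in for a model's output:

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative natural-log probability per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# A model that spreads probability uniformly over 4 choices at every step.
uniform_ppl = perplexity([math.log(0.25)] * 10)

# A sharper (hypothetical) model assigns higher probability to the observed tokens.
sharper_ppl = perplexity([math.log(0.5), math.log(0.8), math.log(0.4)])
```

A perplexity of 4 means the model is, on average, as uncertain as a uniform choice among four words.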

Dataset army animal medical market lottory terrorism law art transportation education
APNEWS afghanistan animals patients zacks casino syria lawsuit album airlines students
veterans dogs drug cents mega iran damages music fraud math
soldiers zoo fda earnings lottery militants plaintiffs film scheme schools
brigade bear disease keywords gambling al-qaida filed songs conspiracy education
infantry wildlife virus share jackpot korea suit comedy flights teachers
IMDB horror action family children war detective sci-fi negative ethic epsiode
zombie martial rampling kids war eyre alien awful gay season
slasher kung relationship snoopy che rochester godzilla unfunny school episodes
massacre li binoche santa documentary book tarzan sex girls series
chainsaw chan marie cartoon muslims austen planet poor women columbo
gore fu mother parents jews holmes aliens worst sex batman
BNC environment education politics business facilities sports art award expression crime
pollution courses elections corp bedrooms goal album john eye police
emissions training economic turnover hotel score band award looked murder
nuclear students minister unix garden cup guitar research hair killed
waste medau political net situated ball music darlington lips jury
environmental education democratic profits rooms season film speaker stared trail
Table 3: 10 topics learned from our TCNLM on APNEWS, IMDB and BNC.
# Topic Model Coherence
APNEWS IMDB BNC
50 0.125 0.084 0.106
0.075 0.064 0.081
0.149 0.104 0.102
0.130 0.088 0.095
Topic-RNN(s) 0.134 0.103 0.102
Topic-RNN(l) 0.127 0.096 0.100
TCNLM(s) 0.159 0.106 0.114
TCNLM(l)
100 0.136 0.092 0.119
0.085 0.071 0.070
0.152 0.087 0.106
0.142 0.097 0.101
Topic-RNN(s) 0.158 0.096 0.108
Topic-RNN(l) 0.143 0.093 0.105
TCNLM(s) 0.160 0.101
TCNLM(l)
150 0.134 0.094 0.119
0.078 0.075 0.072
0.147 0.085 0.100
0.145 0.091 0.104
Topic-RNN(s) 0.146 0.089 0.102
Topic-RNN(l) 0.137 0.092 0.097
TCNLM(s) 0.096
TCNLM(l) 0.155
Table 4: Topic coherence scores of different models on APNEWS, IMDB and BNC. (s) and (l) indicate small and large model, respectively.() taken from Lau et al. (2017).
Data Topic Generated Sentences
APNEWS army a female sergeant, serving in the fort worth, has served as she served in the military in iraq .
animal most of the bear will have stumbled to the lake .
medical physicians seeking help in utah and the nih has had any solutions to using the policy and uses offline to be fitted with a testing or body .
market the company said it expects revenue of $ unk million to $ unk million in the third quarter .
lottory where the winning numbers drawn up for a mega ball was sold .
army+terrorism the taliban ’s presence has earned a degree from the 1950-53 korean war in pakistan ’s historic life since 1964 , with two example of unk soldiers from wounded iraqi army shootings and bahrain in the eastern army .
animal+lottory she told the newspaper that she was concerned that the buyer was in a neighborhood last year and had a gray wolf .
IMDB horror the killer is a guy who is n’t even a zombie .
action the action is a bit too much , but the action is n’t very good .
family the film is also the story of a young woman whose unk and unk and very yet ultimately sympathetic , unk relationship , unk , and palestine being equal , and the old man , a unk .
children i consider this movie to be a children ’s film for kids .
war the documentary is a documentary about the war and the unk of the war .
horror+negative if this movie was indeed a horrible movie i think i will be better off the film .
sci-fi+children paul thinks him has to make up when the unk eugene discovers defeat in order to take too much time without resorting to mortal bugs , and then finds his wife and boys .
BNC environment environmentalists immediate base calls to defend the world .
education the school has recently been founded by a unk of the next generation for two years .
politics a new economy in which privatization was announced on july 4 .
business net earnings per share rose unk % to $ unk in the quarter , and $ unk m , on turnover that rose unk % to $ unk m.
facilities all rooms have excellent amenities .
environment+politics the commission ’s report on oct. 2 , 1990 , on jan. 7 denied the government ’s grant to ” the national level of water ” .
art+crime as well as 36, he is returning freelance into the red army of drama where he has finally been struck for their premiere .
Table 5: Generated sentences from given topics. More examples are provided in the Supplementary Material.

5.2 Topic Model Evaluation

We evaluate the topic model by inspecting the coherence of inferred topics (Chang et al., 2009; Newman et al., 2010; Mimno et al., 2011). Following Lau et al. (2014), we compute topic coherence using normalized PMI (NPMI). Given the top words of a topic, the coherence is calculated based on the sum of pairwise NPMI scores between topic words, where the word probabilities used in the NPMI calculation are based on co-occurrence statistics mined from English Wikipedia with a sliding window. In practice, we average topic coherence over the top topic words. To aggregate topic coherence score for a trained model, we then further average the coherence scores over topics. For comparison, we use the following baseline topic models:

  • LDA: LDA (Blei et al., 2003) is used as a baseline topic model. We use LDA to learn the topic distributions for LDA+LSTM.

  • NTM: We evaluate the neural topic model proposed in Cao et al. (2015). The document-topic and topic-words multinomials are expressed using neural networks. N-gram embeddings are incorporated as inputs of the model.

  • TDLM (Lau et al., 2017): The same model as used in the language model evaluation.

  • Topic-RNN (Dieng et al., 2016): The same model as used in the language model evaluation.
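The NPMI-based coherence used in this section can be sketched as follows. NPMI divides pointwise mutual information by the negative log of the joint probability, giving a score between -1 and 1. The word and pair probabilities below are invented; in the paper they come from Wikipedia co-occurrence counts. Averaging over word pairs (rather than summing) is a scaling choice made for this sketch.

```python
import math

def npmi(p_i, p_j, p_ij):
    """Normalized pointwise mutual information of two words."""
    if p_ij <= 0.0:
        return -1.0                       # never co-occur: lowest possible score
    if p_ij >= 1.0:
        return 1.0                        # always co-occur: avoid dividing by log(1)
    return math.log(p_ij / (p_i * p_j)) / -math.log(p_ij)

def topic_coherence(top_words, p_word, p_pair):
    """Mean NPMI over all unordered pairs of a topic's top words."""
    pairs = [(a, b) for i, a in enumerate(top_words) for b in top_words[i + 1:]]
    scores = [npmi(p_word[a], p_word[b], p_pair.get(frozenset((a, b)), 0.0))
              for a, b in pairs]
    return sum(scores) / len(scores)

# Hypothetical co-occurrence statistics for three words.
p_word = {"police": 0.10, "murder": 0.05, "garden": 0.08}
p_pair = {
    frozenset(("police", "murder")): 0.03,
    frozenset(("police", "garden")): 0.008,   # equals 0.10 * 0.08: independent
}
coherent = topic_coherence(["police", "murder"], p_word, p_pair)
independent = npmi(0.10, 0.08, 0.10 * 0.08)
identical = npmi(0.1, 0.1, 0.1)
```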

Results are summarized in Table 4. Our TCNLM achieves promising results. Specifically, (i) we achieve the best coherence performance over APNEWS and IMDB, and are relatively competitive with LDA on BNC. (ii) We also observe that a larger model may result in a slightly worse coherence performance. One possible explanation is that a larger language model may have more impact on the topic model, and the inherited stronger sequential information may be harmful to the coherence measurement. (iii) Additionally, the advantage of our TCNLM over Topic-RNN indicates that our TCNLM supplies a more powerful topic guidance.

Figure 2: Inferred topic distributions on one sample document in each dataset. Content of the three documents is provided in the Supplementary Material.

In order to better understand the topic model, we provide the top 5 words for 10 randomly chosen topics on each dataset (the boldface word is the topic name summarized by us), as shown in Table 3. These results correspond to the small network with 100 neurons. We also present some inferred topic distributions for several documents from our TCNLM in Figure 2. The topic usage for a specific document is sparse, demonstrating the effectiveness of our NTM. More inferred topic distribution examples are provided in the Supplementary Material.

5.3 Sentence Generation

Another advantage of our TCNLM is its capacity to generate meaningful sentences conditioned on given topics. Given topic $i$, we construct an LSTM generator by using only the $i$-th factor of $W_b$ and $U_b$ (the topic-dependent factor matrices of Section 3.2). Then we start from a zero hidden state, and greedily sample words until an end token occurs. Table 5 shows the generated sentences from our TCNLM learned with 50 topics using the small network. Most of the sentences are strongly correlated with the given topics. More interestingly, we can also generate reasonable sentences conditioned on a mixed combination of topics, even if the topic pairs are divergent, e.g., “animal” and “lottory” for APNEWS. More examples are provided in the Supplementary Material. This shows that our TCNLM is able to generate topic-related sentences, providing an interpretable way to understand the topic model and the language model simultaneously. This qualitative analysis further demonstrates that our model effectively assembles the meaning of topics to generate sentences.

5.4 Empirical Comparison with Naive MoE

We explore the usage of a naive MoE language model as in (7). In order to fit the model on a single GPU machine, we train a NTM with topics and each NLM of the MoE is a 1-layer LSTM with hidden units. Results are summarized in Table 6. Both the naive MoE and our TCNLM provide better performance than the basic LSTM. Interestingly, though requiring less computational cost and storage usage, our TCNLM outperforms the naive MoE by a non-trivial margin. We attribute this boosted performance to the “structure” design of our matrix factorization method. The inherent topic-guided factor control significantly prevents overfitting, and yields efficient training, demonstrating the advantage of our model for transferring semantic knowledge learned from the topic model to the language model.

Dataset basic-LSTM naive MoE TCNLM
APNEWS 101.62 85.87 82.67
IMDB 105.29 96.16 94.64
BNC 146.50 130.01 125.09
Table 6: Test perplexity comparison between the naive MoE implementation and our TCNLM on APNEWS, IMDB and BNC.

6 Conclusion

We have presented the Topic Compositional Neural Language Model (TCNLM), a new method to learn a topic model and a language model simultaneously. The topic model part captures the global semantic meaning in a document, while the language model part learns the local semantic and syntactic relationships between words. The inferred topic information is incorporated into the language model through a Mixture-of-Experts model design. Experiments conducted on three corpora validate the superiority of the proposed approach. Further, our model infers sensible topics, and has the capacity to generate meaningful sentences conditioned on given topics. One possible future direction is to extend the TCNLM to a conditional model and apply it to machine translation.

Appendix A Detailed model inference

We provide the detailed derivation for the model inference. Starting from (16), we have

Appendix B Documents used to infer topic distributions

The documents used to infer the topic distributions plotted in Figure 2 are provided below.

APNEWS: colombia ’s police director says six police officers have been killed and a seventh wounded in ambush in a rural southwestern area where leftist rebels operate . gen. jose roberto leon tells the associated press that the officers were riding on four motorcycles when they were attacked with gunfire monday afternoon on a rural stretch of highway in the cauca state town of padilla . he said a front of the revolutionary armed forces of colombia , or farc , operates in the area . if the farc is responsible , the deaths would bring to 15 the number of security force members killed since the government and rebels formally opened peace talks in norway on oct. 18 . the talks to end a nearly five-decade-old conflict are set to begin in earnest in cuba on nov. 15 .

IMDB: having just watched this movie for a second time , some years after my initial viewing , my feelings remain unchanged . this is a solid sci-fi drama that i enjoy very much . what sci-fi elements there are , are primarily of added interest rather than the main substance of the film . what this movie is really about is wartime conflict , but in a sci-fi setting . it has a solid cast , from the ever reliable david warner to the up and coming freddie prinze jr , also including many british tv regulars ( that obviously add a touch of class :) , not forgetting the superb tcheky karyo . i feel this is more of an ensemble piece than a starring vehicle . reminiscent of wwii films based around submarine combat and air-combat ( the fighters seem like adaptations of wwii corsairs in their design , evoking a retro feel ) this is one of few american films that i felt was not overwhelmed by sentiment or saccharine . the sets and special effects are all well done , never detracting form the believability of the story , although the kilrathi themselves are rather under developed and one dimensional . this is a film more about humanity in conflict rather than a film about exploring a new and original alien race or high-brow sci-fi concepts . forget that it ’s sci-fi , just watch and enjoy .

BNC: an army and civilian exercise went ahead in secret yesterday a casualty of the general election . the simulated disaster in exercise gryphon ’s lift was a midair collision between military and civilian planes over catterick garrison . hamish lumsden , the ministry of defence ’s chief press officer who arrived from london , said : ’ there ’s an absolute ban on proactive pr during an election . ’ journalists who reported to gaza barracks at 7.15 am as instructed were told they would not be allowed to witness the exercise , which involved 24 airmobile brigade , north yorkshire police , fire and ambulance services , the county emergency planning department and ’ casualties ’ from longlands college , middlesbrough . the aim of the gryphon lift was to test army support for civil emergencies . brief details supplied to the press outlined the disaster . a fully loaded civilian plane crashes in mid-air with an armed military plane over catterick garrison . the 1st battalion the green howards and a bomb disposal squad cordon and clear an area scattered with armaments . 24 airmobile field ambulance , which served in the gulf war , tends a burning , packed youth hostel hit by pieces of aircraft . 38 engineer regiment from claro barracks , ripon , search a lake where a light aircraft crashed when hit by flying wreckage . civilian emergency services , including the police underwater team , were due to work alongside military personnel under the overall co-ordination of the police . mr lumsden said : ’ there is a very very strict rule that during a general election nothing that the government does can intrude on the election process . ’

Figure 3: Inferred topic distributions for the first 5 documents in the development set over each dataset.

Appendix C More inferred topic distribution examples

We present the inferred topic distributions for the first 5 documents in the development set over each dataset in Figure 3.

Appendix D More generated sentences

We present generated sentences using the topics listed in Table 3 for each dataset. The generated sentences for a single topic are provided in Tables 7, 8, and 9; the generated sentences for a mixed combination of topics are provided in Table 10.

Topic Generated Sentences
army a female sergeant, serving in the fort worth, has served as she served in the military in iraq .
obama said the obama administration is seeking the state ’s expected endorsement of a family by afghan soldiers at the military
 in world war ii, whose lives at the base of kandahar .
the vfw announced final results on the u.s. and a total of $ 5 million on the battlefield , but he ’s still running for the democratic nomination
 for the senate .
animal most of the bear will have stumbled to the lake .
feral horses takes a unique mix to forage for their pets and is birth to humans .
the zoo has estimated such a loss of half in captivity , which is soaked in a year .
medical physicians seeking help in utah and the nih has had any solutions to using the policy and uses offline to be fitted with a testing or body .
that triggers monday for the study were found behind a list of breast cancer treatment until the study , as does in nationwide , has 60 days
 to sleep there .
right stronger , including the virus is reflected in one type of drug now called in the clinical form of radiation .
market the company said it expects revenue of $ unk million to $ unk million in the third quarter .
biglari said outside parliament district of january, up $ 4.30 to 100 cents per share , the last summer of its year to $ 2 .
four analysts surveyed by zacks expected $ unk billion .
lottery the numbers drawn friday night were unk .
where the winning numbers drawn up for a mega ball was sold .
the jackpot is expected to be in july .
terrorism the russian officials have previously said the senior president made no threats .
obama began halting control of the talks friday and last year in another round of the peace talks after the north ’s artillery attack there
 wednesday have unk their highest debate over his cultural weapons soon .
the turkish regime is using militants into mercenaries abroad to take on gates was fired by the west and east jerusalem in recent years
law the eeoc lawsuit says it ’s entitled to work time for repeated reporting problems that would kick a nod on cheap steel from the owner
the state allowed marathon to file employment , and the ncaa has a broken record of sale and fined for $ unk for a check
the taxpayers in the lawsuit were legally alive and march unk , or past at improper times of los alamos
art quentin tarantino ’s announcements that received the movie unk spanish
cathy johnson , jane ’s short-lived singer steve dean and ” the broadway music musical show , ” the early show , ” adds classics , unk
 or 5,500 , while restaurants have picked other unk next
katie said , he ’s never created the drama series : the movies could drops into his music lounge and knife away in a long forgotten gown
transportation young and bernard madoff would pay more than $ 11 million in banks , the airline said announced by the unk
the fraud case included a delta ’s former business travel business official whose fake cards ” led to the scheme , ” and to have been more
 than $ 10,000 .
a former u.s. attorney ’s office cited in a fraud scheme involving two engines , including mining companies led to the government from
 the government .
education the state ’s unk school board of education is not a unk .
assembly member unk, charter schools chairman who were born in new york who married districts making more than lifelong education
 play the issue , tells the same story that they ’ll be able to support the legislation .
the state ’s leading school of grant staff has added the current schools to unk students in a unk class and ripley aims to serve
 child unk and social sciences areas filled into in may and the latest resources
Table 7: More generated sentences using topics learned from APNEWS.
Topic Generated Sentences
horror the killer is a guy who is n’t even a zombie .
the cannibals want to one scene , the girl has something out of the head and chopping a concert by a shark in the head , and he heads in the
 shower while somebody is sent to unk .
a bunch of teenage slasher types are treated to a girl who gets murdered by a group of friends .
action the action is a bit too much , but the action is n’t very good .
some really may be that the war scene was a trademark unk when fighting sequences were used by modern kung fu ’s rubbish .
action packed neatly , a fair amount of gunplay and science fiction acts as a legacy to a cross between 80s and , great gunplay and scenery .
family the film is also the story of a young woman whose unk and unk and very yet ultimately sympathetic , unk relationship , unk .
 , and palestine being equal , and the old man , a unk .
catherine seeks work and her share of each other , a unk desire , and submit to her , but he does not really want to rethink her issues ,
 or where he aborted his mother ’s quest to unk .
then i ’m about the love , but after a family meeting , her friend aditya ( tatum unk ) marries a 16 year old girl , will be able to
 understand the amount of her boyfriend anytime
children snoopy starts to learn what we do but i ’m still bringing up in there .
i consider this movie to be a children ’s film for kids .
my favorite childhood is a touch that depicts her how the mother was what they apparently would ’ve brought it to the right place to fox .
war the documentary is a documentary about the war and the unk of the war.
one of the major failings of the war is that the germans also struggle to overthrow the death of the muslims and the nazi regime , and unk .
the film goes , far as to the political , but the news that will be unk at how these people can be reduced to a rescue .
detective hopefully that ’s starting unk as half of rochester takes the character in jane ’s way , though holmes managed to make tyrone power
 perfected a lot of magical stuff , playing the one with hamlet today .
while the film was based on the stage adaptation , i know she looked up to suspect from the entire production .
there was no previous version in my book i saw , only to those that read the novel , and i thought that no part that he was to why are far
 more professional .
sci-fi the monster is much better than the alien in which the unk
 was required for nearly every moment of the film .
were the astronauts feel like enough to challenge the space godzilla , where it first prevails
but the adventure that will arise from the earth is having a monster that can make it clear that the aliens are not wearing the unk all that
 will unk the laser .
negative the movie reinforces my token bad ratings - it ’s the worst movie i have ever seen .
it was pretty bad , but aside from a show to the 2 idiots in their cast members , i ’m psychotic.
we had the garbage using peckinpah ’s movies with so many unk , i can not recommend this film to anyone else .
ethic englund earlier in a supporting role , a closeted gay gal reporter who apparently hopes to disgrace the girls in being sexual .
this film is just plain stupid and insane , and a little bit of cheesy .
the film is well made during its turbulent , exquisite , warm and sinister joys , while a glimpse of teen relationships .
episode 3 episodes as a unk won 3 emmy series !
i remember the sopranos on tv , early 80 ’s , ( and in my opinion it was an abc show made to a minimum ) . .
the show is notable ( more of course , not with the audience ) , and most of the actors are excellent and the overall dialogue is nice to
 watch ( the show may have been a great episode )
Table 8: More generated sentences using topics learned from IMDB.
Topic Generated Sentences
environment environmentalists immediate base calls to defend the world .
on the formation of the political and the federal space agency , the ec ’s long-term interests were scheduled to assess global warming
 and aimed at developing programmes based on the american industrial plants .
companies can reconstitute or use large synthetic oil or gas .
education the school has recently been founded by a unk of the next generation for two years .
the institute for student education committees depends on the attention of the first received from the top of lasmo ’s first round of
 the year .
66 years later the team joined blackpool , with unk lives from the very good and beaten for all support crew – and a new unk
 student calling under ulster news may provide an unk modern enterprise .
politics the restoration of the government announced on nov. 4 that the republic’s independence to direct cuba ( three years of electoral
 unk would be also held in the united kingdom ( unk ) .
a new economy in which privatization was announced on july 4 .
agreements were hitherto accepted from simplified terms the following august 1969 in the april elections [ see pp. unk . ]
business net earnings per share rose unk % to $ unk in the quarter , and $ unk m , on turnover that rose unk % to $ unk m.
the insurance management has issued the first quarter net profit up unk - $ unk m , during turnover up unk % at $ unk
 m ; net profit for the six months has reported the cumulative losses of $
the first quarter would also have a small loss of five figures in effect , due following efforts of $ 7.3 .
facilities the hotel is situated in the unk and the unk .
all rooms have excellent amenities .
the restaurant is in a small garden , with its own views and unk .
sports the unk , who had a unk win over the unk , was a unk goal .
harvey has been an thrashing goal for both unk and the institutional striker fell gently on the play-off tip and , through in regular
 unk to in the season .
botham ’s team took the fourth chance over a season without playing .
art radio code said the band ’s first album ’ newest ’ army club and unk album .
they have a unk of the first album , ’ run ’ for a ’ unk ’ , the orchestra , which includes the unk , which is and the band ’s life .
nearly all this year ’s album ’s a sell-out tour !
award super french for them to meet winners the label unk with just # 50,000 should be paid for a very large size .
a spokeswoman for unk said : this may have been a matter of practice .
female speaker they ’ll start discussing music at their home and why decisions are celebrating children again , but their popularity has
 arisen for environmental research .
expression roirbak stared at him , and the smile hovering .
but they must have seen it in the great night , aunt , so you made the blush poor that fool .
making her cry , it was fair .
crime the prosecution say the case is not the same .
the chief inspector supt michael unk , across bristol road , and delivered on the site that the police had been accused to take him
 job because it is not above the legal services .
she was near the same time as one of the eight men who died taken prisoner and subsequently stabbed , where she was hit away .
Table 9: More generated sentences using topics learned from BNC.
Data Topic Generated Sentences
APNEWS army+terrorism the taliban ’s presence has earned a degree from the 1950-53 korean war in pakistan ’s historic life since 1964 , with two
 example of unk soldiers from wounded iraqi army shootings and bahrain in the eastern army .
at the same level , it would eventually be given the government administration ’s enhanced military force since the war .
the unk previously blamed for the attacks in afghanistan , which now covers the afghan army , and the united nations will be
 a great opportunity to practice it .
animal+lottery when the numbers began , the u.s. fish and wildlife service unveiled a gambling measure by agreeing to acquire a permit by animal
 protection staff after the previous permits became selected from the governor ’s office .
she told the newspaper that she was concerned that the buyer was in a neighborhood last year and had a gray wolf .
the tippecanoe county historical society says it was n’t selling any wolf hunts .
IMDB horror+negative if this movie was indeed a horrible movie i think i will be better off the film .
he starts talking to the woman when the person gets the town, she suddenly loses children for blood and it ’s annoying to death
 even though it is up to her fans and baby.
what ’s really scary about this movie is it ’s not that bad .
sci-fi+children mystery inc is a lot of smoke , when a trivial , whiny girl unk troy unk and a woman gets attacked by the unk captain (
 played by hurley ) .
paul thinks him has to make up when the unk eugene discovers defeat in order to take too much time without resorting to
 mortal bugs , and then finds his wife and boys .
the turtles are grown up to billy ( as he takes the rest of the fire ) and the scepter is a family and is dying .
BNC environment+politics unk shallow water area complex in addition an international activity had been purchased to hit unk tonnes of nuclear power
 at the un plant in unk , which had begun strike action to the people of southern countries .
the national energy minister , michael unk of unk , has given a ” right ” route to the united kingdom ’s european parliament
 , but to be passed by unk , the first and fourth states .
the commission ’s report on oct. 2 , 1990 , on jan. 7 denied the government ’s grant to ” the national level of water ” .
art+crime as well as 36 , he is returning freelance into the red army of drama where he has finally been struck for their premiere .
by alan donovan , two arrested by one guest of a star is supported by some teenage women for the queen .
after the talks , the record is already featuring shaun unk ’s play ’ unk ’ in the quartet of the ira bomb .
Table 10: More generated sentences using a mixed combination of topics.