Explaining Away Syntactic Structure in Semantic Document Representations

06/05/2018 ∙ by Erik Holmer, et al. ∙ ETH Zurich

Most generative document models act on bag-of-words input in an attempt to focus on the semantic content and thereby partially forego syntactic information. We argue that it is preferable to keep the original word order intact and explicitly account for the syntactic structure instead. We propose an extension to the Neural Variational Document Model (Miao et al., 2016) that does exactly that to separate local (syntactic) context from the global (semantic) representation of the document. Our model builds on the variational autoencoder framework to define a generative document model based on next-word prediction. We name our approach Sequence-Aware Variational Autoencoder since in contrast to its predecessor, it operates on the true input sequence. In a series of experiments we observe stronger topicality of the learned representations as well as increased robustness to syntactic noise in our training data.




1 Introduction

In natural language processing (NLP), it is becoming increasingly important to be able to classify and cluster different collections of text. Document models are fundamental for many use cases and applications in information retrieval and text mining, for instance, in search ranking and text categorization. They are also often used for collaborative filtering, by making appropriate analogies, e.g. identifying users with documents and items with words.

Most current document models operate on bag-of-words input. On the one hand this is due to computational reasons, but on the other hand there is also the underlying assumption that the removal of the sequence information puts a stronger focus on the topicality of a document. We argue that keeping the input word order intact is in fact preferable and that through explicit modeling of the local syntactic structure, the syntactic and semantic properties of text can be teased apart more effectively.

In this paper, we propose an extension to the Neural Variational Document Model (NVDM) Miao et al. (2016), an unsupervised generative document model for text documents of arbitrary length. Our approach employs the next-word prediction model in the variational autoencoder framework Kingma and Welling (2014); Rezende et al. (2014). By explicitly modeling syntax in a local context, our method can separate and "explain away" these effects in the global document representation. Embeddings learned by our model show higher similarity within topics and larger separation between them, and are more robust to syntactic noise in the text datasets of our evaluations. We name our model Sequence-Aware Variational Autoencoder (SAVAE) because in contrast to its base model NVDM, it explicitly incorporates sequence information.

The separation of local and global embeddings in SAVAE is inspired by the Paragraph Vectors model Le and Mikolov (2014), where a paragraph's embedding is concatenated with word vectors from a given context and then used to predict the following word. While providing a simple and powerful way of producing document representations, the paragraph vectors have to be trained at prediction time for any new document. Our model features a proper probabilistic model over documents which lets it infer document representations in a single feedforward pass at prediction time.

In the following, we analyze the state of the art in generative document modeling and highlight the connections between the models. We then introduce our own model which we subsequently evaluate on document retrieval, clustering and sentiment classification tasks. We also inspect the produced word embeddings, and discover that SAVAE nicely separates semantic and syntactic aspects of text.

2 Related Work

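Generative document models aim to learn a $\theta$-parameterized distribution $p_\theta$ over words for a given document $\mathbf{x}$ of length $N$,

$$p_\theta(\mathbf{x}), \qquad \mathbf{x} = (x_1, \dots, x_N), \qquad x_i \in V,$$

where $V$ is a predefined vocabulary of size $|V|$. In order to focus on the topicality of a document, the distribution is often defined over word multisets $\{x_1, \dots, x_N\}$ instead of the exact sequence $(x_1, \dots, x_N)$, which is also called the bag-of-words model.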

The introduction of topic models has drawn a lot of attention to the field. Latent Dirichlet Allocation (LDA) Blei et al. (2003) is a particularly renowned topic model where documents are viewed as mixtures over latent topics. These topics define a distribution over words. In the specified generative process, for each word in the document a topic is sampled independently. Subsequently, the word itself is sampled according to the distribution of that topic.
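The generative process described above can be sketched in a few lines of Python. This is a minimal illustration only: the function name, the toy topic distributions and the fixed per-document topic mixture are assumptions, and the Dirichlet priors that full LDA places on these distributions are left out.

```python
import random

def sample_document(topic_word_probs, doc_topic_probs, length, rng):
    """Sample word ids: for each position draw a topic, then a word from that topic."""
    topics = list(range(len(topic_word_probs)))
    vocab = list(range(len(topic_word_probs[0])))
    words = []
    for _ in range(length):
        # The topic is drawn independently for every word position.
        topic = rng.choices(topics, weights=doc_topic_probs, k=1)[0]
        # The word is then drawn from that topic's distribution over the vocabulary.
        word = rng.choices(vocab, weights=topic_word_probs[topic], k=1)[0]
        words.append(word)
    return words

# Toy setup (assumed values): 2 topics over a 4-word vocabulary.
topic_word_probs = [
    [0.5, 0.5, 0.0, 0.0],  # topic 0 only emits words 0 and 1
    [0.0, 0.0, 0.5, 0.5],  # topic 1 only emits words 2 and 3
]
doc_topic_probs = [1.0, 0.0]  # this document is entirely about topic 0
doc = sample_document(topic_word_probs, doc_topic_probs, length=10,
                      rng=random.Random(0))
```

Because the toy document puts all of its mass on topic 0, every sampled word id comes from that topic's support.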

The Replicated Softmax Hinton and Salakhutdinov (2009) is an adaptation of the general Restricted Boltzmann Machine to the bag-of-words model. It consists of a layer of visible units that represent the input bag-of-words, and a layer of hidden binary units that can be seen as topics and collectively serve as the document representation.

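A common definition of models for arbitrary-length word sequences exploits the chain rule

$$p_\theta(\mathbf{x}) = \prod_{i=1}^{N} p_\theta(x_i \mid \mathbf{x}_{<i}), \qquad \mathbf{x}_{<i} = (x_1, \dots, x_{i-1}).$$

All we need to provide is a next-word prediction model, which is typically given in the form of a softmax function

$$p_\theta(x_i = w \mid \mathbf{x}_{<i}) = \frac{\exp(\mathbf{e}_w^\top \mathbf{h}_i + b_w)}{\sum_{w' \in V} \exp(\mathbf{e}_{w'}^\top \mathbf{h}_i + b_{w'})}, \tag{2}$$

where $\mathbf{e}_w$ are the word embeddings, $b_w$ are biases, and $\mathbf{h}_i$ are word sequence embeddings computed from the context $\mathbf{x}_{<i}$. A proponent of the next-word model based on the Neural Autoregressive Distribution Estimator (NADE) Larochelle and Murray (2011) is Document NADE (DocNADE) Larochelle and Lauly (2012). To compute the context embeddings, it uses

$$\mathbf{h}_i = \sigma\Big(\mathbf{c} + \sum_{j < i} \mathbf{u}_{x_j}\Big), \tag{3}$$

where $\mathbf{c}$ is a bias vector, $\mathbf{u}_{x_j}$ is the input embedding of word $x_j$, and $\sigma$ is the sigmoid nonlinearity. Therefore, there are two embedding vectors of equal dimension per word $w$ in the vocabulary, $\mathbf{e}_w$ and $\mathbf{u}_w$. We can see from Eq. (3) that the context embedding $\mathbf{h}_i$ is independent of the word ordering in the context $\mathbf{x}_{<i}$. Although the next-word model operates on the true word sequence of a document, the authors of DocNADE assume that this ordering is not readily available for many datasets, and they therefore train on randomized word orderings. The paper notes that the performance of the model when trained on two different random orderings is comparable.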
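As a rough illustration of this order-independence, the following Python sketch computes a DocNADE-style context embedding (a sigmoid applied to a bias plus the sum of the context words' input embeddings) and feeds it to a softmax over the vocabulary. All names and the toy embedding values are assumptions made for illustration; they are not taken from the DocNADE implementation.

```python
import math

def context_embedding(context, in_emb, bias):
    """Sigmoid of the bias plus the summed input embeddings of the context words."""
    total = list(bias)
    for word in context:
        total = [t + e for t, e in zip(total, in_emb[word])]
    return [1.0 / (1.0 + math.exp(-t)) for t in total]

def next_word_probs(h, out_emb, out_bias):
    """Softmax over the vocabulary for a given context embedding h."""
    scores = [sum(e_d * h_d for e_d, h_d in zip(e, h)) + b
              for e, b in zip(out_emb, out_bias)]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    return [x / norm for x in exps]

# Toy parameters (assumed): vocabulary of 3 words, embedding dimension 2.
in_emb = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]]
out_emb = [[0.3, 0.1], [-0.2, 0.6], [0.0, -0.4]]
bias = [0.0, 0.0]
out_bias = [0.0, 0.0, 0.0]

# The same context words in two different orders.
h_forward = context_embedding([0, 1, 2], in_emb, bias)
h_shuffled = context_embedding([2, 0, 1], in_emb, bias)
probs = next_word_probs(h_forward, out_emb, out_bias)
```

Since the context words only enter through a sum, `h_forward` and `h_shuffled` agree up to floating-point rounding, which is why training on shuffled orderings does not change what the context embedding can express.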

In DeepDocNADE Lauly et al. (2016) the DocNADE model is extended to compute the context embeddings with a multilayer architecture. It completely removes the autoregressive parts in DocNADE, and instead predicts the remaining words in the document from a bag-of-words context without any word ordering assumptions.

A different approach from the next-word model is the latent Bayesian model, which introduces a latent variable $\mathbf{z}$ to define the probabilistic model as

$$p_\theta(\mathbf{x}) = \int p_\theta(\mathbf{x} \mid \mathbf{z})\, p(\mathbf{z})\, d\mathbf{z},$$

where one typically assumes that $p(\mathbf{z})$ is simple (and not parameterized) and we can use the softmax for the word probabilities $p_\theta(x_i \mid \mathbf{z})$ as in Eq. (2). The problem with the latent variable model is that we cannot easily perform the integration, even if $p(\mathbf{z})$ is as simple as an isotropic normal distribution. Hence, we cannot evaluate the model, e.g. to compute a document probability or perplexity.

A solution to this problem has been introduced with the Variational Autoencoder (VAE) Kingma and Welling (2014); Rezende et al. (2014), which approximates the integral by maximizing the variational lower bound. The Neural Variational Document Model Miao et al. (2016) applies the VAE framework to document modeling. The inference network of the autoencoder estimates the mean and spread of a variational distribution $q_\phi(\mathbf{z} \mid \mathbf{x})$ over the latent variable $\mathbf{z}$. Specifically, the mean is modeled as

$$\boldsymbol{\mu}(\mathbf{x}) = g\Big(\mathbf{c} + \sum_{i=1}^{N} \mathbf{u}_{x_i}\Big),$$

where $\mathbf{c}$ is a bias vector, $\mathbf{u}_{x_i}$ is the input embedding of word $x_i$, and $g$ is the rectified linear activation function (ReLU). The decoder $p_\theta(x_i \mid \mathbf{z})$ is again the softmax function from Eq. (2). From a modeling perspective, NVDM is very similar to DocNADE, with two exceptions: (i) the ReLU nonlinearity is applied to infer context embeddings, and (ii) in addition to the mean, a dispersion parameter is estimated for the latent variable $\mathbf{z}$. This effectively adds noise to the training process, as the sampling from $q_\phi(\mathbf{z} \mid \mathbf{x})$ generates slightly different results in each training pass.

Recent work attempted to apply Generative Adversarial Networks (GAN) to model documents Glover (2016). While this seems to be a promising direction of research, the results so far are still behind the state of the art.

Interesting models for shorter sequences of text (sentences) apply RNNs in a VAE framework Bowman et al. (2016). The authors report difficulties in making efficient use of the latent representation, most likely due to an overly powerful RNN decoder that can independently explain the structure in the data. This is further analyzed in Chen et al. (2017), where the authors state under which conditions the latent code is used in a VAE with an autoregressive decoder. Namely, when the autoregressive model is restricted to operate on limited local context, the global semantics must be captured in the latent code. This observation is a key element of the design of our model.

Multiple recent studies work on the encoder of VAEs. Kingma et al. (2016) use inverse autoregressive flows to obtain more flexible latent distributions. Similarly, Serban et al. (2017) employ a piecewise constant prior to allow for multiple modes. The Stein variational autoencoder Pu et al. (2017) uses Stein’s identity to get rid of the prior distribution for the latent variable altogether. These approaches are orthogonal to our method, and it would be interesting to see the results of combining them.

Outside of the realm of generative models, a different method has been proposed to solely learn the document representations. The method was originally named Paragraph Vectors Le and Mikolov (2014), but is also commonly referred to as doc2vec because of its similarities to the word2vec method Mikolov et al. (2013) for learning word embeddings. doc2vec exists in two versions, called the distributed memory model (PV-DM) and the distributed bag-of-words model (PV-DBOW). The PV-DM model predicts the next word from a paragraph/document embedding together with word embeddings from the previous words. The document and word embeddings are then updated by backpropagation. This model has strong similarities to our model, as we discuss in Section 3. The other model, PV-DBOW, predicts words that are randomly sampled from the current paragraph, solely based on the document embedding. This method completely ignores the local context, but is conceptually simpler and faster to train.

Semantic document representations can benefit other tasks, such as language modeling. As an example, TopicRNN Dieng et al. (2017) uses the same encoder as NVDM to provide its decoder with additional semantic information. We can imagine that work in our area can be incorporated in recurrent models working on such tasks.

3 Model

In Le and Mikolov (2014), the authors propose a novel way of generating low dimensional representations for paragraphs/documents. One of the main ideas in the paper is to maintain a global representation of a document and combine it (through concatenation or averaging) with word vectors from a local context of previous words to predict the next word. The hope is that this separates global from local features, allowing the model to attend to them independently and thus producing better results.

[Figure 1 panels: Local context; Global context]

Figure 1: Illustration of how SAVAE predicts the next word $x_i$. We look up global word embeddings for all words in $\mathbf{x}$ and local embeddings for the $k$ previous words $x_{i-k}, \dots, x_{i-1}$. The global word embeddings are then summed up and transformed through the layers of an MLP, inferring $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$. Then we sample from $q_\phi(\mathbf{z} \mid \mathbf{x})$ to obtain a document representation $\mathbf{z}$, which is concatenated with the local embedding $\mathbf{l}_i$ and finally used to predict the next word $x_i$.

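In this section we take this idea of combining local and global context features to build a generative document model. We start by defining a generative model

$$p_\theta(\mathbf{x}) = \int p_\theta(\mathbf{x} \mid \mathbf{z})\, p(\mathbf{z})\, d\mathbf{z} \tag{7}$$

with a latent variable $\mathbf{z}$, which we will refer to as the global context vector. We can interpret the model as a VAE, defining an encoder $q_\phi(\mathbf{z} \mid \mathbf{x})$ and a decoder $p_\theta(\mathbf{x} \mid \mathbf{z})$. We let the decoder reconstruct documents word by word based on the global context vector and a small window of the $k$ previous words. Thus Eq. (7) becomes

$$p_\theta(\mathbf{x}) = \int p(\mathbf{z}) \prod_{i=1}^{N} p_\theta(x_i \mid \mathbf{z}, x_{i-k}, \dots, x_{i-1})\, d\mathbf{z}. \tag{8}$$

We define the next-word prediction model as the softmax function

$$p_\theta(x_i = w \mid \mathbf{z}, x_{i-k}, \dots, x_{i-1}) = \frac{\exp(\mathbf{e}_w^\top \mathbf{h}_i + b_w)}{\sum_{w' \in V} \exp(\mathbf{e}_{w'}^\top \mathbf{h}_i + b_{w'})}, \tag{9}$$

where $b_w$ are biases, $\mathbf{e}_w$ are word embedding vectors taken from a matrix $E$, and $\mathbf{h}_i = [\mathbf{z}; \mathbf{l}_i]$ is the concatenation of the global context vector $\mathbf{z}$ and the local context vector $\mathbf{l}_i$, which we define as

$$\mathbf{l}_i = \mathbf{c} + \sum_{j=i-k}^{i-1} \mathbf{u}_{x_j}, \tag{10}$$

where $\mathbf{c}$ is a bias vector, and $\mathbf{u}_{x_j}$ are word embedding vectors taken from a matrix $U$. Note that within each context $(x_{i-k}, \dots, x_{i-1})$, the ordering of the words is ignored and their embeddings are simply added up.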
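The decoder step described above can be sketched as follows: a local context vector (a bias plus the sum of the local embeddings of the previous words) is concatenated with the global context vector and passed to a softmax. This is a minimal sketch under stated assumptions: the function names, the toy parameters and the handling of positions with fewer than `k` preceding words (the window is simply truncated) are illustrative choices, not details confirmed by the paper.

```python
import math

def local_context(words, i, k, local_emb, bias):
    """Bias plus the summed local embeddings of up to k words before position i."""
    vec = list(bias)
    for word in words[max(0, i - k):i]:  # assumption: truncate at document start
        vec = [v + e for v, e in zip(vec, local_emb[word])]
    return vec

def softmax(scores):
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    return [x / norm for x in exps]

def predict_next_word(z, words, i, k, local_emb, bias, out_emb, out_bias):
    """Next-word distribution from the concatenation of global and local context."""
    h = list(z) + local_context(words, i, k, local_emb, bias)
    scores = [sum(e_d * h_d for e_d, h_d in zip(e, h)) + b
              for e, b in zip(out_emb, out_bias)]
    return softmax(scores)

# Toy parameters (assumed): 3 words, global dimension 2, local dimension 2.
local_emb = [[0.2, 0.0], [0.0, 0.3], [-0.1, 0.1]]
out_emb = [[0.5, 0.0, 0.1, 0.0], [0.0, 0.5, 0.0, 0.2], [-0.5, -0.5, 0.3, 0.1]]
bias = [0.0, 0.0]
out_bias = [0.0, 0.0, 0.0]
doc = [0, 1, 2, 1]

# Same local window (positions 1 and 2), two different global context vectors.
probs_topic_a = predict_next_word([1.0, 0.0], doc, 3, 2, local_emb, bias, out_emb, out_bias)
probs_topic_b = predict_next_word([0.0, 1.0], doc, 3, 2, local_emb, bias, out_emb, out_bias)
```

With the local window held fixed, changing only the global vector shifts the most likely next word, which is the division of labor the model relies on.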

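The encoder distribution $q_\phi(\mathbf{z} \mid \mathbf{x})$ is modeled as a Gaussian where the mean $\boldsymbol{\mu}$ and spread $\boldsymbol{\sigma}$ are inferred using a standard feedforward MLP of $L$ layers. Now, instead of directly optimizing Eq. (8) we can use the variational lower bound on the log-likelihood (ELBO) as our training objective

$$\log p_\theta(\mathbf{x}) \ge \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}\Big[\sum_{i=1}^{N} \log p_\theta(x_i \mid \mathbf{z}, x_{i-k}, \dots, x_{i-1})\Big] - D_{\mathrm{KL}}\big(q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, p(\mathbf{z})\big), \tag{11}$$

where $p_\theta(x_i \mid \mathbf{z}, x_{i-k}, \dots, x_{i-1})$ is the softmax function in Eq. (9). In the particular case where $p(\mathbf{z})$ is a standard Gaussian, the KL divergence term can be computed analytically, as described in Miao et al. (2016). Note that we need to sample $\mathbf{z}$ from $q_\phi(\mathbf{z} \mid \mathbf{x})$ to approximate the expectation in Eq. (11). An illustration of how SAVAE predicts the next word in a document is shown in Figure 1.

After training the model, we can use the ELBO to give a lower bound on the probability for a document $\mathbf{x}$. We can also use the inferred mean $\boldsymbol{\mu}$ as a representation of a document for further use, e.g. classification, information retrieval or clustering.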
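To make the training objective more concrete, here is a small Python sketch of a Monte Carlo ELBO estimate for a diagonal Gaussian encoder and a standard Gaussian prior. The closed-form KL term used here is the standard one for this pair of distributions. The function names, the `log_sigma` parameterization and the `log_likelihood` callback (standing in for the summed log-softmax terms of the decoder) are assumptions for illustration.

```python
import math
import random

def kl_to_standard_normal(mu, log_sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) in closed form."""
    return 0.5 * sum(m * m + math.exp(2.0 * ls) - 1.0 - 2.0 * ls
                     for m, ls in zip(mu, log_sigma))

def sample_latent(mu, log_sigma, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I)."""
    return [m + math.exp(ls) * rng.gauss(0.0, 1.0) for m, ls in zip(mu, log_sigma)]

def elbo_estimate(mu, log_sigma, log_likelihood, rng, num_samples=1):
    """Average the reconstruction term over samples, then subtract the KL term."""
    recon = sum(log_likelihood(sample_latent(mu, log_sigma, rng))
                for _ in range(num_samples)) / num_samples
    return recon - kl_to_standard_normal(mu, log_sigma)

# Hypothetical stand-in for the decoder: a constant document log-likelihood.
def constant_log_likelihood(z):
    return -3.0

rng = random.Random(0)
elbo = elbo_estimate([0.0, 0.0], [0.0, 0.0], constant_log_likelihood, rng,
                     num_samples=20)
```

The `num_samples` argument mirrors the use of a single sample during training and more samples during evaluation; when the encoder output equals the prior, the KL term vanishes and the estimate reduces to the reconstruction term.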

4 Experiments

We evaluate the semantic document representations and inspect the learned word embeddings. We compare our model against its base version NVDM, which uses only the global context. We further evaluate doc2vec, which obtains document representations in a slightly different manner but also takes the local context into account. Finally, we also analyze DocNADE as a different proponent of the next-word prediction model.

In the following, we first describe the datasets, preprocessing and the choice of parameters that were used in training, before presenting the results of our experiments. Code and data to reproduce the experiments will be made available.

4.1 Datasets and Preprocessing

Three standard text corpora were used for the subsequent evaluations. They are 20 Newsgroups, Reuters Corpus Volume 1 version 2 (RCV1-v2) Lewis et al. (2004), and the IMDB movie review dataset Maas et al. (2011). The 20 Newsgroups corpus contains 18,845 posts from Usenet discussion groups. These posts range across 20 diverse discussion topics such as computer hardware, sports and religion, which also serve as their labels. The Reuters RCV1-v2 is an archive of 804,414 newswire stories that were categorized by their editors into possibly multiple topics out of 103 possibilities. These labels are structured in a tree hierarchy, and documents inherit all parent labels of their most specific label. The IMDB movie review dataset contains 100,000 movie reviews, where 25,000 are labeled with a positive sentiment, 25,000 with a negative one, and 50,000 come without a label.

Since our model requires the actual word sequence as input, we cannot use the same preprocessed data that was used in Hinton and Salakhutdinov (2009) and subsequent work. We therefore keep the preprocessing standard and adopt choices from prior work where we can. For all corpora, we apply the lowercasing and tokenization of scikit-learn. We do not remove any stopwords, and we do not apply any discounting to the word counts. Further corpus-specific preprocessing is described below.
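A preprocessing pipeline of this kind might look roughly like the following Python sketch, which keeps the word order intact. It assumes the default token pattern of scikit-learn's `CountVectorizer` (tokens of two or more word characters) and assumes that out-of-vocabulary words are simply dropped; neither detail is stated in the text, and the helper names are hypothetical.

```python
import re
from collections import Counter

# Assumption: mirrors the default token pattern of scikit-learn's CountVectorizer.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and split it into tokens, keeping the original order."""
    return TOKEN_PATTERN.findall(text.lower())

def build_vocabulary(documents, max_size):
    """Map the max_size most common tokens to integer ids."""
    counts = Counter(tok for doc in documents for tok in tokenize(doc))
    return {word: idx for idx, (word, _) in enumerate(counts.most_common(max_size))}

def encode(text, vocab):
    """Turn a document into a sequence of word ids (assumption: drop unknown words)."""
    return [vocab[tok] for tok in tokenize(text) if tok in vocab]

corpus = ["the cat sat", "the cat ran", "the dog"]
vocab = build_vocabulary(corpus, max_size=2)
encoded = encode("The dog and the cat", vocab)
```

In contrast to a bag-of-words pipeline, `encode` returns an ordered list of ids, which is what a sequence-aware model needs as input.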

20 Newsgroups.

As in previous work, we restrict the vocabulary to the 2,000 most common words. We use the same train/test split. We then shuffle the respective datasets with a random seed of 2. By default, the 20 Newsgroups posts come with metadata such as headers, footers and quotes that appear in typical email conversations. These content-wise uninformative data can vary strongly between discussion groups, but stay consistent within a group. Thus, they provide strong signals for classifiers and also influence unsupervised representation learning. We consider this an artefact of data collection and not representative of general documents, and therefore remove metadata from the corpus.

Reuters RCV1-v2.

Following prior work, the vocabulary size was set to 10,000. We also randomly split the data into 794,414 training and 10,000 test articles. Note that this is not the same random split as in prior work.

IMDB movie reviews.

We keep a vocabulary of the 10,000 most common words. The original dataset comes with train and test splits for positive and negative samples, along with unlabeled reviews. Since we learn the document representations completely unsupervised, we pool all the data together to learn the document representations as in Le and Mikolov (2014).

Figure 2: Document retrieval evaluation on 20 Newsgroups (left) and Reuters RCV1 (right).

4.2 Training Details

We set the dimensionality of our embeddings to 50 for 20 Newsgroups and Reuters, and to 100 for IMDB, following previous work. We trained all generative models for 1,000 epochs on 20 Newsgroups and IMDB, and 100 epochs on Reuters. The exception is doc2vec, which we only trained for 20 epochs on all corpora, as results did not improve when we trained for more epochs. We found that after convergence the results remained stable for all algorithms.

In the course of analyzing the various models, we reimplemented all but doc2vec in the Tensorflow framework. This allowed us to adjust the model parameters, and in the process we found that using the Adam optimizer Kingma and Ba (2015) improved results for all models.


DocNADE.

We used a learning rate of 0.001. Training for DocNADE was rather unstable, so we employed early stopping based on the best attained perplexity during training.


NVDM.

We used a learning rate of 0.0001. The inference network consists of two layers of dimension 500 and ReLU nonlinearities. To approximate the variational lower bound we use 1 sample during training and 20 samples during evaluation, as suggested in Miao et al. (2016).


doc2vec.

Since no code was published with the original paper, we used an implementation by the co-author Tomas Mikolov (posted in a Google Groups discussion) to train doc2vec. Although our model is very similar to the distributed memory model (PV-DM) in the original paper, later studies Dai et al. (2015); Lau and Baldwin (2016) have found the distributed bag-of-words model (PV-DBOW) to perform better. Since we observed the same behavior in our own experiments, we use the distributed bag-of-words version for our evaluations. We jointly train the word and document embeddings. The algorithm uses stochastic gradient descent as an optimizer, with a learning rate that linearly decays from 0.05 to 0. We used a window size of 10, and applied negative sampling with 10 negative samples per positive sample. We downsampled frequent words above a frequency of 0.0001.


SAVAE.

We fixed the learning rate at 0.00001. The inference network is the same as in NVDM. In the decoder, we tuned the size $k$ of the local context and report results for the best-performing value. To approximate the variational lower bound we use 1 sample during training and 20 samples during evaluation, just as in NVDM.

4.3 Document Retrieval Evaluation

Adopting the document retrieval evaluation from prior work, we train document representations without supervision on the train sets of 20 Newsgroups and Reuters. For each query document in the held-out test set, we first infer its document representation while keeping the model parameters fixed. We then rank the training documents by the cosine similarity of their representations with the query document's representation. We can then compute precision-recall curves as we go from documents that have been predicted to be similar to those that are supposed to be dissimilar.

How precision is computed differs between the two datasets, since the documents of 20 Newsgroups have only one label (out of 20 categories), whereas the Reuters documents can have multiple labels (out of 103 labels). For 20 Newsgroups we compute the precision at rank $k$ as the fraction of the $k$ most similar documents with the same label as the query document. For the Reuters dataset we define the relevance of a document in the training set for a given query document to be the Jaccard similarity between their respective label sets, i.e. the cardinality of the intersection divided by the cardinality of the union of the two sets. For both datasets we get a precision-recall curve per query document that we average for the final results.
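The two relevance notions described above can be sketched in Python as follows. This is an illustrative sketch only: the function names and the toy vectors and labels are assumptions, and the averaging over queries and the construction of the full precision-recall curve are omitted.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def rank_by_similarity(query_vec, train_vecs):
    """Indices of training documents, most similar to the query first."""
    sims = [cosine_similarity(query_vec, v) for v in train_vecs]
    return sorted(range(len(train_vecs)), key=lambda i: sims[i], reverse=True)

def precision_at_k(ranking, train_labels, query_label, k):
    """Single-label case: fraction of the top-k documents sharing the query label."""
    return sum(1 for i in ranking[:k] if train_labels[i] == query_label) / k

def jaccard(labels_a, labels_b):
    """Multi-label case: size of the intersection over size of the union."""
    a, b = set(labels_a), set(labels_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

# Toy data (assumed): three training documents with 2-dimensional representations.
train_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
train_labels = ["a", "a", "b"]
ranking = rank_by_similarity([1.0, 0.05], train_vecs)
p_at_2 = precision_at_k(ranking, train_labels, "a", k=2)
p_at_3 = precision_at_k(ranking, train_labels, "a", k=3)
```

For the multi-label case, `jaccard` would replace the exact label match as the per-document relevance score.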

The results for 20 Newsgroups can be seen in Figure 2 on the left. SAVAE improves substantially on its base model NVDM, and even beats the representations of doc2vec. DocNADE performs the worst throughout the entire task. It is interesting to note that SAVAE's performance does not degrade like that of the other models at higher recall rates. On the right-hand side of Figure 2, we show the evaluation on the Reuters RCV1 corpus. For very low recall, NVDM and SAVAE perform almost identically. As we increase the recall, NVDM loses precision and matches the performance of doc2vec. The gap between doc2vec and SAVAE slightly widens as recall increases.

| Model | Davies-Bouldin index (↓) | Dunn index (↑) | Silhouette coefficient (↑) |
|---|---|---|---|
| DocNADE | 124.0651 ± 62.0585 | 0.0076 | -0.0117 ± 0.0566 |
| NVDM | 5.2976 ± 2.1503 | 0.1905 | 0.0745 ± 0.0715 |
| doc2vec | 6.5670 ± 3.0634 | 0.1405 | 0.0970 ± 0.0770 |
| SAVAE | 3.2995 ± 1.3666 | 0.2875 | 0.0911 ± 0.0810 |

Table 1: Clustering metrics on representations trained on 20 Newsgroups. An upwards arrow (↑) indicates that a higher score is better, a downwards arrow (↓) that lower is better. Where applicable, the mean and standard deviation across clusters is reported.

4.4 Cluster Analysis

We decide to further analyze the properties of clusters produced from 20 Newsgroups data. While this task was not employed in prior work, we adopt it in order to gain insight into the structure of our document embedding space. We want to know how well the algorithms separate different topics, and how tightly they group related documents.

Our algorithms only output an implicit clustering through the documents’ vector representations; in particular no labels are output. We thus resort to internal evaluation metrics which evaluate the intra- and inter-cluster distances. For a robust solution to the clustering problem, we prefer the intra-cluster distances to be low, while the inter-cluster distances should be high. In our evaluation, we use three established metrics that evaluate this principle from slightly different angles.

The Davies-Bouldin index Davies and Bouldin (1979) is defined as

$$\mathrm{DB} = \frac{1}{K} \sum_{i=1}^{K} \max_{j \neq i} \frac{\sigma_i + \sigma_j}{d(c_i, c_j)},$$

where $K$ is the number of clusters, $\sigma_i$ is the mean distance of all elements of cluster $i$ to its centroid $c_i$, and $d$ is our distance measure. The Davies-Bouldin index should be minimized, which can be achieved with low intra-cluster distances ($\sigma_i$) and high inter-cluster distances ($d(c_i, c_j)$). Note that only the cluster $j$ that maximizes this ratio is considered, which therefore is the most similar one. Second, the Dunn index Dunn (1973) is

$$\mathrm{D} = \frac{\min_{i \neq j} d(c_i, c_j)}{\max_{k} \Delta_k},$$

where $\Delta_k$ denotes the intra-cluster distance of cluster $k$. The Dunn index inverts the ratio of intra- and inter-cluster distances compared to the Davies-Bouldin index, and should consequently be maximized. Furthermore, instead of computing a score per cluster, it outputs a global ratio of the lowest inter-cluster distance to the highest intra-cluster distance. It therefore penalizes the currently worst cluster. Finally, the silhouette coefficient Rousseeuw (1987) computes a point-wise normalized distance to the true cluster centroid and the closest wrong centroid. We further average the per-cluster scores, to balance the contributions of the individual clusters.

$$\mathrm{S} = \frac{1}{K} \sum_{i=1}^{K} \frac{1}{|C_i|} \sum_{x \in C_i} \frac{b(x) - a(x)}{\max\{a(x), b(x)\}}, \qquad a(x) = d(x, c_i), \quad b(x) = \min_{j \neq i} d(x, c_j),$$

where $C_i$ is the set of data points in cluster $i$. For an individual data point $x$, a score of zero means that the true and the closest wrong cluster's centroid have an equal distance to $x$. A negative score means that there is a centroid closer to $x$ than the true one, and a positive score indicates that the true centroid is also the closest.

We use the cosine distance as our distance measure and normalize the centroid distances per data point, so that outliers cannot influence the scores disproportionately. For the Davies-Bouldin index and the silhouette coefficient, where we average the cluster scores, we also report the standard deviation across clusters.
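The three metrics can be sketched in Python as centroid-based scores with the cosine distance. This is a simplified illustration under stated assumptions: the intra-cluster distance in the Dunn index is taken to be the mean distance to the centroid, the per-data-point normalization of centroid distances mentioned above is omitted, and the function names and toy clusters are hypothetical.

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def cluster_stats(clusters):
    """Centroid and mean distance to the centroid for each cluster."""
    cents = [[sum(col) / len(pts) for col in zip(*pts)] for pts in clusters]
    spreads = [sum(cosine_distance(p, c) for p in pts) / len(pts)
               for pts, c in zip(clusters, cents)]
    return cents, spreads

def davies_bouldin(clusters):
    """Mean over clusters of the worst (largest) spread-to-separation ratio."""
    cents, spreads = cluster_stats(clusters)
    k = len(clusters)
    worst = [max((spreads[i] + spreads[j]) / cosine_distance(cents[i], cents[j])
                 for j in range(k) if j != i) for i in range(k)]
    return sum(worst) / k

def dunn(clusters):
    """Smallest centroid separation over largest spread (assumed intra-cluster distance)."""
    cents, spreads = cluster_stats(clusters)
    k = len(clusters)
    min_inter = min(cosine_distance(cents[i], cents[j])
                    for i in range(k) for j in range(i + 1, k))
    return min_inter / max(spreads)

def silhouette(clusters):
    """Centroid-based silhouette, averaged per cluster and then across clusters."""
    cents, _ = cluster_stats(clusters)
    per_cluster = []
    for i, pts in enumerate(clusters):
        scores = []
        for p in pts:
            a = cosine_distance(p, cents[i])  # distance to the true centroid
            b = min(cosine_distance(p, c) for j, c in enumerate(cents) if j != i)
            scores.append((b - a) / max(a, b))
        per_cluster.append(sum(scores) / len(scores))
    return sum(per_cluster) / len(per_cluster)

# Toy data (assumed): the same two topics, once tightly and once loosely clustered.
tight = [[[1.0, 0.1], [1.0, -0.1]], [[0.1, 1.0], [-0.1, 1.0]]]
loose = [[[1.0, 0.5], [1.0, -0.5]], [[0.5, 1.0], [-0.5, 1.0]]]
```

On the toy data, the tighter clustering yields a lower Davies-Bouldin index and higher Dunn and silhouette scores, matching the direction in which each metric should move.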

The results are shown in Table 1 and confirm the findings of the previous experiment. While the absolute values are not meaningful by themselves, we can qualitatively determine that SAVAE produces the most robust clustering with higher inter- and lower intra-cluster distances than the other methods. The silhouette coefficient is positive for NVDM, doc2vec and SAVAE, which indicates that for the most part, the true centroid is also the closest for a given data point. The variance between clusters is rather large, which suggests that some topics of the 20 Newsgroups dataset are harder to separate.

| Model | Query word | 5 nearest neighbors |
|---|---|---|
| NVDM | weapons | guns, weapon, batf, firearms, militia |
| NVDM | medical | medicine, disease, health, patients, treatment |
| NVDM | define | defined, null, int, morality, constitution |
| doc2vec | weapons | weapon, firearms, guns, defense, citizens |
| doc2vec | medical | health, disease, patients, study, volume |
| doc2vec | define | defined, definition, must, indeed, false |
| SAVAE (global) | weapons | weapon, firearms, arms, guns, crime |
| SAVAE (global) | medical | health, disease, medicine, patients, treatment |
| SAVAE (global) | define | draw, realize, assume, count, notice |
| SAVAE (local) | weapons | weapon, practice, files, operation, event |
| SAVAE (local) | medical | basic, serial, party, graphics, page |
| SAVAE (local) | define | with, make, completely, include, towards |

Table 2: Learned word embeddings on the 20 Newsgroups corpus for NVDM, doc2vec and SAVAE.

| Model | Accuracy |
|---|---|
| DocNADE | 80.79 % |
| NVDM | 88.96 % |
| doc2vec | 89.89 % |
| SAVAE | 89.05 % |
| TopicRNN* | 93.72 % |
| Virtual Adversarial* | 94.09 % |

Table 3: Classification accuracy on the sentiment classification task of IMDB movie reviews. Results with an asterisk (*) are taken from the respective publications.

4.5 Word embeddings

SAVAE produces semantic (global) and syntactic (local) embeddings. We therefore want to inspect these and compare them to the embeddings produced by our baselines. We select a subset of words chosen in previous studies Larochelle and Lauly (2012); Miao et al. (2016) and show their 5 nearest neighbors by cosine distance in Table 2.
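A nearest-neighbor lookup of this kind can be sketched in a few lines of Python. The function name and the toy embedding values below are made up for illustration; they are not the learned embeddings reported in Table 2.

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def nearest_neighbors(word, embeddings, n=5):
    """Return the n words whose embeddings are closest to that of `word`."""
    query = embeddings[word]
    others = [w for w in embeddings if w != word]
    return sorted(others, key=lambda w: cosine_distance(query, embeddings[w]))[:n]

# Toy embeddings (assumed values, two dimensions for readability).
toy_embeddings = {
    "weapons": [1.0, 0.0],
    "guns": [0.9, 0.1],
    "medical": [0.0, 1.0],
    "health": [0.1, 0.9],
    "define": [0.7, 0.7],
}
neighbors = nearest_neighbors("weapons", toy_embeddings, n=2)
```

Running the same lookup once on the global and once on the local embedding matrix would give the two SAVAE neighbor lists for each query word.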

We can see that the word embeddings of NVDM, doc2vec and SAVAE (global) are very similar and semantically meaningful for the words weapons and medical. The biggest difference shows for the word define, where both NVDM and doc2vec group the word with a context of program code (null, int, false). This is a syntactic similarity based on co-occurrence instead of a related meaning. SAVAE nicely separates synonyms from co-occurring terms into the global and local embeddings, respectively.

Taking a closer look at the local embeddings of SAVAE, we can see two effects at work. The co-occurrence in many contexts is a straightforward effect that causes local embeddings of two words to move closer together. define often co-occurs with adverbs and prepositions, which gets captured accordingly. A second effect we found was the existence of connector words that appear in the contexts of two seemingly unrelated words. Examples for such connector words are organization, which has a high co-occurrence count with both medical and graphics, and research for medical and basic. Finally, there are artefacts specific to our data that do not have a linguistic reason. We observe many co-occurrences of medical and page due to a phrase that looks like the page footer of a scientific journal that did not get filtered out by preprocessing. However, this serves as yet another compelling argument to separate accidental syntactic co-occurrences from the semantic representations of words and documents.

4.6 Sentiment Classification

An interesting task adopted in the doc2vec paper Le and Mikolov (2014) involves sentiment classification on IMDB movie reviews. In a first step, document representations are trained in an unsupervised manner on all reviews. The representations of 12,500 samples of both positive and negative sentiment are then used to train a linear classifier that predicts the polarity of the 25,000 test samples. In representation learning, a popular view holds that a good representation should be able to "disentangle" the factors of variation present in the data and make them linearly separable. In this task, we test this hypothesis for the dimension of sentiment. The results are listed in Table 3. We observe that all the semantic document models perform very similarly, with the exception of DocNADE. To our knowledge, the current state of the art is held by an approach that combines virtual adversarial training with recurrent neural networks Miyato et al. (2017) and uses much higher dimensional hidden representations. TopicRNN, which uses the same encoder as NVDM, is close behind.
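The second step of this protocol, fitting a linear classifier on fixed document representations, might look like the following Python sketch. The text only says that a linear classifier is used; logistic regression trained by batch gradient descent is an assumption made here, as are the function names, hyperparameters and toy data.

```python
import math

def train_logistic_regression(features, labels, epochs=200, lr=0.5):
    """Fit weights and bias of a linear classifier with batch gradient descent."""
    dim, n = len(features[0]), len(features)
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            logit = sum(w * v for w, v in zip(weights, x)) + bias
            err = 1.0 / (1.0 + math.exp(-logit)) - y  # predicted probability minus label
            grad_w = [g + err * v for g, v in zip(grad_w, x)]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def predict(x, weights, bias):
    """Return 1 (positive) if the linear score is above zero, else 0 (negative)."""
    return 1 if sum(w * v for w, v in zip(weights, x)) + bias > 0 else 0

def accuracy(features, labels, weights, bias):
    hits = sum(1 for x, y in zip(features, labels) if predict(x, weights, bias) == y)
    return hits / len(labels)

# Toy document representations (assumed): sentiment varies along the first dimension.
train_x = [[1.0, 0.2], [0.8, -0.1], [-1.0, 0.1], [-0.7, -0.3]]
train_y = [1, 1, 0, 0]
test_x = [[0.9, 0.0], [-0.9, 0.0]]
test_y = [1, 0]

weights, bias = train_logistic_regression(train_x, train_y)
test_accuracy = accuracy(test_x, test_y, weights, bias)
```

Because the representations are kept fixed, the classifier can only succeed to the extent that sentiment is already linearly separable in the embedding space, which is the hypothesis this task probes.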

The results are somewhat expected. It is well known that word sequence plays an important role in sentiment classification. This can be demonstrated with the example of negation, where the adverb not changes the sentiment of a statement. The cases where syntax decides the sentiment will lead to classification errors for semantic document representations, since they either ignore, or – in the case of SAVAE and doc2vec – explicitly explain away the syntactic structure. It is remarkable, however, how close to the state of the art the document representations get considering they are not trained end-to-end.

5 Conclusion

In this paper, we extend a popular generative document model Miao et al. (2016) built on the variational autoencoder framework. We explicitly model the local context to separate syntax from the semantic document representations. This is different from most state-of-the-art document models that only make use of global context, i.e. bag-of-words. In contrast to Le and Mikolov (2014), we provide a probabilistic model that does not need to be trained at prediction time. We compared against several document model baselines on established tasks and found that our model consistently finds better representations, produces more topical clusters and is more robust to syntactic peculiarities of the training data.

There are several promising directions for future inquiry. The variational autoencoder framework leaves room for much creativity. Since our extension is independent of the exact encoder and decoder, future work may readily combine it with more flexible encoders or decoders with higher capacity. Moreover, the idea of explaining away local context by explicitly modeling it is sufficiently general to be applicable in other models and for purposes other than document modeling.


References

  • Blei et al. (2003) David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent dirichlet allocation. In Journal of Machine Learning Research.
  • Bowman et al. (2016) Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew Dai, Rafal Jozefowicz, and Samy Bengio. 2016. Generating sentences from a continuous space. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning.
  • Chen et al. (2017) Xi Chen, Diederik P Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman, Ilya Sutskever, and Pieter Abbeel. 2017. Variational lossy autoencoder. ICLR.
  • Dai et al. (2015) Andrew M Dai, Christopher Olah, and Quoc V Le. 2015. Document embedding with paragraph vectors. arXiv preprint arXiv:1507.07998.
  • Davies and Bouldin (1979) David L Davies and Donald W Bouldin. 1979. A cluster separation measure. In IEEE Transactions on Pattern Analysis and Machine Intelligence.
  • Dieng et al. (2017) Adji B Dieng, Chong Wang, Jianfeng Gao, and John Paisley. 2017. Topicrnn: A recurrent neural network with long-range semantic dependency. ICLR.
  • Dunn (1973) J.C. Dunn. 1973. A fuzzy relative of the isodata process and its use in detecting compact well-separated clusters. In Journal of Cybernetics.
  • Glover (2016) John Glover. 2016. Modeling documents with generative adversarial networks. arXiv preprint arXiv:1612.09122.
  • Hinton and Salakhutdinov (2009) Geoffrey E Hinton and Ruslan R Salakhutdinov. 2009. Replicated softmax: an undirected topic model. In Advances in Neural Information Processing Systems.
  • Kingma and Ba (2015) Diederik Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. ICLR.
  • Kingma et al. (2016) Diederik P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. 2016. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems.
  • Kingma and Welling (2014) Diederik P Kingma and Max Welling. 2014. Auto-encoding variational bayes. ICLR.
  • Larochelle and Lauly (2012) Hugo Larochelle and Stanislas Lauly. 2012. A neural autoregressive topic model. In Advances in Neural Information Processing Systems.
  • Larochelle and Murray (2011) Hugo Larochelle and Iain Murray. 2011. The neural autoregressive distribution estimator. In AISTATS.
  • Lau and Baldwin (2016) Jey Han Lau and Timothy Baldwin. 2016. An empirical evaluation of doc2vec with practical insights into document embedding generation. arXiv preprint arXiv:1607.05368.
  • Lauly et al. (2016) Stanislas Lauly, Yin Zheng, Alexandre Allauzen, and Hugo Larochelle. 2016. Document neural autoregressive distribution estimation. arXiv preprint arXiv:1603.05962.
  • Le and Mikolov (2014) Quoc V Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. International Conference on Machine Learning.
  • Lewis et al. (2004) David D Lewis, Yiming Yang, Tony G Rose, and Fan Li. 2004. Rcv1: A new benchmark collection for text categorization research. In Journal of Machine Learning Research.
  • Maas et al. (2011) Andrew L Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1.
  • Miao et al. (2016) Yishu Miao, Lei Yu, and Phil Blunsom. 2016. Neural variational inference for text processing. In ICML.
  • Mikolov et al. (2013) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. ICLR Workshop Proceedings.
  • Miyato et al. (2017) Takeru Miyato, Andrew M Dai, and Ian Goodfellow. 2017. Adversarial training methods for semi-supervised text classification. ICLR.
  • Pu et al. (2017) Yuchen Pu, Zhe Gan, Ricardo Henao, Chunyuan Li, Shaobo Han, and Lawrence Carin. 2017. Vae learning via stein variational gradient descent. In Advances in Neural Information Processing Systems.
  • Rezende et al. (2014) Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. 2014. Stochastic backpropagation and approximate inference in deep generative models. International Conference on Machine Learning.
  • Rousseeuw (1987) Peter J Rousseeuw. 1987. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. In Journal of Computational and Applied Mathematics.
  • Serban et al. (2017) Iulian Vlad Serban, Alexander G Ororbia, Joelle Pineau, and Aaron Courville. 2017. Piecewise latent variables for neural variational text processing. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing.