1 Introduction
Paragraph vectors (Le and Mikolov, 2014) are a recent method for embedding pieces of natural language text as fixed-length, real-valued vectors. Extending the word2vec framework (Mikolov et al., 2013b), paragraph vectors are typically presented as neural language models, and compute a single vector representation for each paragraph. Unlike word embeddings, paragraph vectors are not shared across the entire corpus, but are instead local to each paragraph. When interpreted as a latent variable, we expect them to have higher uncertainty when the paragraphs are short.
Recently, Barkan (2017) proposed a probabilistic view of word2vec that has motivated research on combining word2vec with other priors (Bamler and Mandt, 2017). Inspired by this progress, we extend paragraph vectors to a probabilistic model. Our model may be specified via modern inference tools like Edward (Tran et al., 2016), which makes it easy to experiment with different inference algorithms. The experiments in Sec. 4 confirm the intuition that paragraph vectors have higher posterior uncertainty when paragraphs are short, and we show that explicitly modeling this uncertainty improves performance in supervised prediction tasks.
2 Related work
Paragraph embeddings are built on top of word embeddings, a set of dimensionality reduction tools that map words from a large vocabulary to a dense vector representation. Most word embedding methods learn a point estimate for each embedding vector (Mikolov et al., 2013a, b; Mnih and Kavukcuoglu, 2013; Goldberg and Levy, 2014; Pennington et al., 2014). Barkan (2017) pointed out that the skip-gram model with negative sampling, also known as word2vec (Mikolov et al., 2013b), admits a Bayesian interpretation. The Bayesian skip-gram model allows uncertainty to be taken into account in a principled way, and lays the basis for our proposed Bayesian paragraph vector model.
Many tasks in natural language processing require fixed-length features for text passages of variable length, such as sentences, paragraphs, or documents (in this paper, we treat these three terms interchangeably). Generalizing embeddings of single words, several methods have been proposed to find dense vector representations of paragraphs (Le and Mikolov, 2014; Kiros et al., 2015; Wieting et al., 2015; Palangi et al., 2016; Pagliardini et al., 2017). Since paragraph embeddings are local to short pieces of text, we expect them to have high posterior uncertainty if the paragraphs are short. In this work, we incorporate the idea of paragraph vectors (Le and Mikolov, 2014) into the Bayesian skip-gram model in order to coherently infer the uncertainty associated with paragraph vectors.
3 Method
In Sec. 3.1, we summarize the Bayesian skip-gram model on which our model is based. We then present our Bayesian paragraph model in Sec. 3.2, and discuss two inference methods in Sec. 3.3.
3.1 Bayesian skip-gram model
The Bayesian skip-gram model (Barkan, 2017) is a probabilistic interpretation of word2vec (Mikolov et al., 2013b). The left part of Figure 1 shows the generative process. For each word $i$ in the vocabulary, the model draws a latent word embedding vector $U_i \in \mathbb{R}^E$ and a latent context embedding vector $V_i \in \mathbb{R}^E$ from a Gaussian prior $\mathcal{N}(0, \lambda^2 I)$. Here, $E$ is the embedding dimension and $\lambda$ is a hyperparameter. The model then constructs labeled pairs of words following a two-step process. First, a proposal pair of words $(i, j)$ is drawn from a uniform distribution over the vocabulary. Then, the model assigns to the proposal pair a binary label $z_{ij} \sim \mathrm{Bern}(\sigma(U_i^\top V_j))$, where $\sigma(x) = 1/(1 + e^{-x})$ is the sigmoid function. The pairs with label $z_{ij} = 1$ form the so-called positive examples, and are assumed to correspond to occurrences of the word $i$ in the context of word $j$ somewhere in the corpus. The so-called negative examples with label $z_{ij} = 0$ do not correspond to any observation in the corpus. When training the model, we resort to the heuristics proposed by Mikolov et al. (2013b) to create artificial evidence for the negative examples (see Section 3.2 below).
3.2 Bayesian paragraph vectors
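Bayesian paragraph vectors (BPV) are a direct extension of the Bayesian skip-gram model. The right part of Figure 1 shows the generative process. In addition to global word and context embeddings $U$ and $V$, the model draws a paragraph vector $d_n \sim \mathcal{N}(0, \phi^2 I)$ for each document $n$ in the corpus. Following Le and Mikolov (2014), we add $d_n$ to the context vector $V_j$ when we classify a given pair of words $(i, j)$ as a positive or a negative example. Thus, the likelihood of a word pair $(i, j)$ in document $n$ to have label $z_{n,ij} \in \{0, 1\}$ is
$$p(z_{n,ij} \mid U_i, V_j, d_n) = \sigma\big(U_i^\top (V_j + d_n)\big)^{z_{n,ij}} \, \big[1 - \sigma\big(U_i^\top (V_j + d_n)\big)\big]^{1 - z_{n,ij}}. \tag{1}$$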
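As a rough illustration of Eq. (1), the following sketch evaluates the likelihood of a single labeled pair in plain Python. The function names and the use of Python lists for vectors are illustrative assumptions and not part of the Edward implementation.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def pair_likelihood(u_i, v_j, d_n, z):
    """Probability that the pair (i, j) in document n carries label z (0 or 1)."""
    # The paragraph vector shifts the context vector before the inner product.
    score = sum(u * (v + d) for u, v, d in zip(u_i, v_j, d_n))
    p_positive = sigmoid(score)
    return p_positive if z == 1 else 1.0 - p_positive
```

Setting `d_n` to the zero vector recovers the pair likelihood of the Bayesian skip-gram model in Sec. 3.1.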
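We collect evidence for the positive examples $z_n^+$ in each document $n$ by forming pairs of words $(w_{n,t}, w_{n,t+\delta})$. Here, $w_{n,t}$ is the word class of the $t$-th token, $t$ runs over all tokens in document $n$, $\delta$ runs from $-c$ to $c$ where $c$ is a small context window size, and we exclude $\delta = 0$. Negative examples $z_n^-$ are not observed in the corpus. Following Mikolov et al. (2013b), we construct artificial evidence for negative pairs by sampling from the noise distribution $P(i, j) \propto f(i)\, f(j)^{3/4}$, where $f$ is the empirical unigram frequency across the training corpus. The log-likelihood of the entire data is thus
$$\log p(z^+, z^- \mid U, V, d) = \sum_n \Big[ \sum_{(i,j) \in z_n^+} \log \sigma\big(U_i^\top (V_j + d_n)\big) + \sum_{(i,j) \in z_n^-} \log \sigma\big(-U_i^\top (V_j + d_n)\big) \Big]. \tag{2}$$
In the limit $d_n \to 0$, Eq. (2) reduces to the negative loss function of word2vec. BPV can be easily specified in Edward, a Python library for probabilistic modeling and inference (Tran et al., 2016):
```python
import tensorflow as tf
from edward.models import Bernoulli, Normal

# W (vocabulary size), E (embedding dimension), the prior scales lam and phi,
# and the word-index tensors indices_n_I, indices_n_J are assumed to be defined.
U = Normal(loc=tf.zeros((W, E), dtype=tf.float32), scale=lam)  # word embeddings
V = Normal(loc=tf.zeros((W, E), dtype=tf.float32), scale=lam)  # context embeddings
d_n = Normal(loc=tf.zeros(E, dtype=tf.float32), scale=phi)     # paragraph vector
# Embeddings of the words i and j of every pair in document n
u_n = tf.nn.embedding_lookup(U, indices_n_I)
v_n = tf.nn.embedding_lookup(V, indices_n_J)
z_n = Bernoulli(logits=tf.reduce_sum(u_n * (v_n + d_n), axis=1))  # labels, Eq. (1)
```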
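The construction of positive and negative pairs described above can be sketched in a few lines of standard-library Python. This is a minimal sketch under stated assumptions: the helper names are hypothetical, documents are lists of word tokens, and the 3/4 power on the context-word frequency follows the word2vec convention of Mikolov et al. (2013b) and is exposed as the `power` parameter.

```python
import random
from collections import Counter


def positive_pairs(tokens, c):
    """All pairs (w_t, w_{t+delta}) with 1 <= |delta| <= c inside one document."""
    pairs = []
    for t, w in enumerate(tokens):
        for delta in range(-c, c + 1):
            if delta != 0 and 0 <= t + delta < len(tokens):
                pairs.append((w, tokens[t + delta]))
    return pairs


def sample_negative_pairs(corpus, num_pairs, rng, power=0.75):
    """Draw pairs (i, j) with i ~ f and j ~ f**power, f being unigram counts."""
    counts = Counter(w for doc in corpus for w in doc)
    words = sorted(counts)
    f = [counts[w] for w in words]
    f_pow = [counts[w] ** power for w in words]
    # random.choices normalizes the weights internally.
    left = rng.choices(words, weights=f, k=num_pairs)
    right = rng.choices(words, weights=f_pow, k=num_pairs)
    return list(zip(left, right))
```

For example, `positive_pairs(["a", "b", "c"], 1)` returns `[("a", "b"), ("b", "a"), ("b", "c"), ("c", "b")]`.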
3.3 MAP and black box variational inference
The BPV model has global and local latent variables. We expect the posterior of the global variables to be peaked, and therefore approximate the global word embedding matrices $U$ and $V$ via point estimates. We expect a broader posterior distribution for the local paragraph vectors $d_n$. Thus we use variational inference (VI) (Blei et al., 2016) to fit the posterior over $d_n$ with a fully factorized Gaussian distribution. We split inference into two stages. In the first stage, we point estimate all parameters. In the second stage, we fix $U$ and $V$ and only perform VI for the paragraph vectors.
In the first stage, our goal is to train the global variables via stochastic gradient descent, where every minibatch contains a single document $n$ and a fixed set of negative examples. We first maximize the joint probability $p(z_n, d_n \mid U, V)$, where $z_n$ collects the labels of document $n$, w.r.t. the paragraph vector $d_n$. As this local optimization is noise free, it converges quickly under a constant learning rate. Then, we perform a single gradient step for the global variables $U$ and $V$. This gradient is noisy due to the minibatch sampling and the stochastic generation of negative examples. For this reason, a decreasing learning rate is used. Finally, we reinitialize $d_n$ and proceed to the next document. Optimizing $d_n$ in a nested loop before each update step saves memory since we only need to keep track of the document vectors one at a time.
In the second stage, we fit a variational distribution for the paragraph vectors while holding $U$ and $V$ fixed. We use black box VI (Ranganath et al., 2014) with reparameterization gradients (Kingma and Welling, 2014; Rezende et al., 2014), which is provided by the Edward library. This time, we generate new negative examples in each update step to avoid overfitting. The stochastic optimization is again performed with a decreasing learning rate. We also perform a separate maximum a posteriori (MAP) estimate of the paragraph vectors to serve as the baseline for downstream classification tasks.
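The inner optimization over a single paragraph vector with the global embeddings held fixed can be illustrated as follows. This is a simplified sketch, not the Edward-based implementation: the function names, the plain gradient ascent with a constant step size, and the representation of each example as a `(u_i, v_j)` tuple of embedding lists are all assumptions made for illustration.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))


def log_joint(d, pos, neg, phi):
    """Log joint of labels and paragraph vector d, up to an additive constant."""
    def score(u, v):
        return sum(a * (b + c) for a, b, c in zip(u, v, d))
    value = sum(log_sigmoid(score(u, v)) for u, v in pos)
    value += sum(log_sigmoid(-score(u, v)) for u, v in neg)
    # Gaussian prior on d with standard deviation phi
    return value - sum(x * x for x in d) / (2.0 * phi ** 2)


def fit_paragraph_vector(pos, neg, dim, phi, lr=0.1, steps=200):
    """Gradient ascent on log_joint w.r.t. d with a constant learning rate."""
    d = [0.0] * dim
    for _ in range(steps):
        grad = [-x / phi ** 2 for x in d]  # gradient of the log prior
        for pairs, label in ((pos, 1.0), (neg, 0.0)):
            for u, v in pairs:
                s = sum(a * (b + c) for a, b, c in zip(u, v, d))
                p = math.exp(log_sigmoid(s))
                # Per-pair gradient of the log-likelihood: (label - sigmoid(s)) * u
                for k in range(dim):
                    grad[k] += (label - p) * u[k]
        d = [x + lr * g for x, g in zip(d, grad)]
    return d
```

Because the objective is concave in the paragraph vector, a constant learning rate suffices here, which matches the noise-free local optimization described above.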
4 Experiments
Paragraph vectors are often used as input features for supervised learning in natural language processing (Le and Mikolov, 2014; Kiros et al., 2015; Palangi et al., 2016). In this section, we apply BPV to two binary classification tasks: sentiment analysis and paraphrase detection. We find that the posterior uncertainty of BPV decreases as the length of paragraphs grows. We also find that by concatenating the variational mean and standard deviation features inferred by VI, we improve classification accuracy compared to MAP point estimates of paragraph embeddings.
4.1 Sentiment analysis
We use the IMDB dataset (Maas et al., 2011) for sentiment analysis. It contains 100k movie reviews, split into 75k training points (25k labeled, 50k unlabeled) and 25k labeled test points. Positive and negative labels are balanced in both labeled subsets, and typical reviews consist of several sentences.
As our algorithm is unsupervised, we run the inference algorithms described in Sec. 3.3 using all the training data, and then train a logistic regression classifier using the paragraph vectors of the labeled training data only. We use the most frequent 10k words as the vocabulary, and set the context window size , the embedding dimension , the hyperparameters for the prior , and the number of negative examples per document equal to the average number of positive pairs of all the documents. The feature vectors for the classifier are the point estimates of the paragraph vectors for MAP, and the concatenation of the variational mean and standard deviation of $d_n$ for VI.
Table 1 shows the test accuracy of the two inference methods. VI slightly outperforms MAP (87.0 vs. 86.9) since it takes into account posterior uncertainty in paragraph embeddings. Fig. 2 (left) shows the entropy of the paragraph vectors, computed using the posterior variance obtained from VI. As the document length grows, the entropy decreases, which makes intuitive sense since longer reviews can be more specific.
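For a fully factorized Gaussian variational distribution, the entropy has a closed form in the per-dimension standard deviations, and the VI classifier features are a simple concatenation. The sketch below illustrates both; the function names are hypothetical and the inputs are assumed to be plain Python lists.

```python
import math


def gaussian_entropy(std):
    """Differential entropy of a Gaussian with diagonal covariance diag(std**2)."""
    return sum(0.5 * math.log(2.0 * math.pi * math.e * s * s) for s in std)


def vi_features(mean, std):
    """Classifier input for VI: variational mean followed by standard deviation."""
    return list(mean) + list(std)
```

Smaller standard deviations give lower entropy, which is the quantity plotted against document length.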
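Table 1: Test accuracy (%) of the two inference methods.
| Task (dataset) | MAP | VI |
| --- | --- | --- |
| Sentiment analysis (IMDB) | 86.9 | 87.0 |
| Paraphrase detection (MSR) | 70.0 | 71.0 |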
4.2 Paraphrase detection
We also test the discriminative power of BPV on the Microsoft Research Paraphrase Corpus (Dolan et al., 2004). Each data point consists of two sentences extracted from news sources on the web, and the goal is to predict whether they are paraphrases of each other. The training set contains 4076 sentence pairs, of which 2753 are paraphrases, and the test set contains 1725 pairs, of which 1147 are paraphrases. We use the same hyperparameters as in the sentiment analysis task, except that we include all words appearing more than once in the vocabulary because this dataset is much smaller.
After finding the paragraph vectors, we train the classifier by following Kiros et al. (2015), where features are constructed by concatenating the component-wise product and the absolute difference between the two feature vectors of each sentence pair. The classification results in Table 1 show that VI again outperforms MAP. The relationship between entropy and document length shown in Fig. 2 (right) is also similar to that of the IMDB dataset.
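The pair features can be written down directly. This is a minimal sketch with a hypothetical function name; which per-sentence vectors are passed in (MAP point estimates, or the concatenated variational mean and standard deviation) is left to the caller.

```python
def pair_features(a, b):
    """Component-wise product followed by absolute difference of two vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    product = [x * y for x, y in zip(a, b)]
    abs_diff = [abs(x - y) for x, y in zip(a, b)]
    return product + abs_diff
```

Both parts are symmetric in the two sentences, so the resulting feature vector does not depend on the order of the pair.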
5 Discussion
We proposed Bayesian paragraph vectors, a generative model of paragraph embeddings. We treated the local latent variables of paragraph vectors in a Bayesian way because we expected high uncertainty, especially for short documents. Our experiments confirmed this intuition, and showed that knowledge of the posterior uncertainty improves the performance of downstream supervised tasks.
In addition to MAP and VI, we experimented with Hamiltonian Monte Carlo (HMC) inference, but our preliminary results showed worse performance; we plan to investigate further. A possible reason might be that we had to use a fixed set of negative examples for each document when generating HMC samples, which may result in overfitting to the noise. Finally, we believe that more sophisticated models of document embeddings would also benefit from a Bayesian treatment of the local variables.
References
Bamler, R. and Mandt, S. (2017). Dynamic word embeddings. In ICML.
Barkan, O. (2017). Bayesian neural word embedding. In AAAI.
Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. (2016). Variational inference: A review for statisticians. arXiv:1601.00670.
Dolan, B., Quirk, C., and Brockett, C. (2004). Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In ACL.
Goldberg, Y. and Levy, O. (2014). Word2vec explained: Deriving Mikolov et al.’s negative-sampling word-embedding method. arXiv:1402.3722.
Kingma, D. P. and Welling, M. (2014). Auto-encoding variational Bayes. In ICLR.
Kiros, R., Zhu, Y., Salakhutdinov, R. R., Zemel, R., Urtasun, R., Torralba, A., and Fidler, S. (2015). Skip-thought vectors. In NIPS.
Le, Q. and Mikolov, T. (2014). Distributed representations of sentences and documents. In ICML.
Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., and Potts, C. (2011). Learning word vectors for sentiment analysis. In ACL.
Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013a). Efficient estimation of word representations in vector space. arXiv:1301.3781.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013b). Distributed representations of words and phrases and their compositionality. In NIPS.
Mnih, A. and Kavukcuoglu, K. (2013). Learning word embeddings efficiently with noise-contrastive estimation. In NIPS.
Pagliardini, M., Gupta, P., and Jaggi, M. (2017). Unsupervised learning of sentence embeddings using compositional n-gram features. arXiv:1703.02507.
Palangi, H., Deng, L., Shen, Y., Gao, J., He, X., Chen, J., Song, X., and Ward, R. (2016). Deep sentence embedding using long short-term memory networks: Analysis and application to information retrieval. TASLP, 24(4):694–707.
Pennington, J., Socher, R., and Manning, C. D. (2014). GloVe: Global vectors for word representation. In EMNLP.
Ranganath, R., Gerrish, S., and Blei, D. M. (2014). Black box variational inference. In AISTATS.
Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In ICML.
Tran, D., Kucukelbir, A., Dieng, A. B., Rudolph, M., Liang, D., and Blei, D. M. (2016). Edward: A library for probabilistic modeling, inference, and criticism. arXiv:1610.09787.
Wieting, J., Bansal, M., Gimpel, K., Livescu, K., and Roth, D. (2015). From paraphrase database to compositional paraphrase model and back. TACL, 3:345–358.