Learning document embeddings along with their uncertainties

08/20/2019 ∙ by Santosh Kesiraju, et al. ∙ Brno University of Technology

The majority of text modelling techniques yield only point estimates of document embeddings and fail to capture the uncertainty of those estimates. These uncertainties give a notion of how well the embeddings represent a document. We present Bayesian subspace multinomial model (Bayesian SMM), a generative log-linear model that learns to represent documents in the form of Gaussian distributions, thereby encoding the uncertainty in its covariance. Additionally, in the proposed Bayesian SMM, we address a commonly encountered problem of intractability that appears during variational inference in mixed-logit models. We also present a generative Gaussian linear classifier for topic identification that exploits the uncertainty in document embeddings. Our intrinsic evaluation using the perplexity measure shows that the proposed Bayesian SMM fits the data better than a variational auto-encoder based document model. Our topic identification experiments on speech (Fisher) and text (20Newsgroups) corpora show that the proposed Bayesian SMM is robust to over-fitting on unseen test data. The topic ID results show that the proposed model is significantly better than variational auto-encoder based methods and achieves results comparable to fully supervised discriminative models.




I Introduction

Learning word and document embeddings has proven to be useful in a wide range of information retrieval, speech and natural language processing applications [Wei:2006:LDA_IR, Mikolov:2012:LM_adap, Win:2014:STD, Chen:2015:LM_adap, Benes:2018]. These embeddings elicit the latent semantic relations present among the co-occurring words in a sentence or the bag-of-words from a document. The majority of techniques for learning these embeddings are based on two complementary paradigms: (i) topic modelling and (ii) word prediction. The former methods are primarily built on top of the bag-of-words model and tend to capture higher-level semantics such as topics. The latter techniques capture lower-level semantics by exploiting the contextual information of words in a sequence [Mikolov:2013:word2vec, Jeffrey:2014:GloVe, Quoc:2014:PV].

On the other hand, there is a growing interest in developing pre-trained language models [Ruder:2018:Universal, Peters:2018:ELMO], which are then fine-tuned for specific tasks such as document classification, question answering, named entity recognition, etc. Although these models achieve state-of-the-art results in several NLP tasks, they require enormous computational resources to train.


Latent variable models [Bishop:1999:LVM] are a popular choice in unsupervised learning, where the observed data are assumed to be generated through the latent variables according to a stochastic process. The goal is then to learn the model parameters and, often, also the estimates of the latent variables. In probabilistic topic models (PTMs) [Blei:2012:PTM], the latent variables are attributed to topics; the generative process assumes that every topic is a distribution over the words in the vocabulary and that documents are generated from a distribution over (latent) topics. Recent works showed that auto-encoders can also be seen as generative models for images and text [Kingma:2014:AEVB, NVI:2016]. Having a generative model allows us to incorporate prior information about the latent variables, and with the help of variational Bayes (VB) techniques [Rezende:2014:SBP], one can infer a posterior distribution over the latent variables instead of just point estimates. The posterior distribution captures the uncertainty of the latent variable estimates while trying to explain (fit) the observed data and our prior belief. In the context of text modelling, these latent variables are seen as embeddings.

In this paper, we present Bayesian subspace multinomial model (Bayesian SMM) as a generative model for the bag-of-words representation of documents. We show that our model can learn to represent document embeddings in the form of Gaussian distributions, thereby encoding the uncertainty in its covariance. Further, we propose a generative Gaussian classifier that exploits this uncertainty for the task of topic identification. The proposed VB framework can be extended in a straightforward way to the subspace n-gram model [Mehdi:2013:SnGM], which can model n-gram distributions of words from sentences.

Earlier, the (non-Bayesian) SMM was used for learning document embeddings in an unsupervised fashion, which were then used for training linear classifiers for the task of topic ID from speech and text [May:2015:mivec, Kesiraju:2016:SMM]. However, one of the limitations was that the learned document embeddings (also termed document i-vectors) were only point estimates and were prone to over-fitting, especially in scenarios with shorter documents. Our proposed model overcomes this problem by capturing the uncertainty of the embeddings in the form of posterior distributions.

Given the significant prior research in probabilistic topic models and related algorithms for learning representations, it is important to draw relations between the presented model and prior research. We do this from the following viewpoints: (a) Graphical models illustrating the dependency of random and observed variables, (b) assumptions of distributions over random variables and their limitations, and (c) approximations made during inference and their consequences.

The contributions of this paper are as follows: (a) we present Bayesian subspace multinomial model and analyse its relation to popular models such as latent Dirichlet allocation (LDA) [Blei:2003:LDA], correlated topic model (CTM) [Blei:2005:CTM], paragraph vector (PV-DBOW) [Quoc:2014:PV] and neural variational document model (NVDM) [NVI:2016], (b) we adapt tricks from [Kingma:2014:AEVB] for faster and more efficient variational inference in the proposed model, (c) we combine optimization techniques from [Kingma:2014:Adam, Andrew:2007:L1] and use them to train the proposed model, (d) we propose a generative Gaussian classifier that exploits the uncertainty in the posterior distribution of document embeddings, (e) we provide experimental results on both text and speech data showing that the proposed document representations achieve state-of-the-art perplexity scores, and (f) with the proposed classifier, we illustrate the robustness of the model to over-fitting while at the same time achieving superior classification results when compared to SMM and NVDM.

We begin with the description of Bayesian SMM in Section II, followed by VB for the model in Section III. The complete VB training procedure and algorithm is presented in Section III-A. The procedure for inferring the document embedding posterior distributions for (unseen) documents is described in Section III-B. Section IV presents a generative Gaussian classifier that exploits the uncertainty encoded in document embedding posterior distributions. Relationship between Bayesian SMM and existing popular topic models is described in Section V. Experimental details are given in Section VI, followed by results and analysis in Section VII. Finally, we conclude in Section VIII and discuss directions for future research.

II Bayesian subspace multinomial model

Our generative probabilistic model assumes that the training data (bag-of-words) were generated as follows:

For each document, a K-dimensional latent vector w is generated from an isotropic Gaussian prior with mean 0 and precision λ:

    w ~ N(0, λ⁻¹ I)    (1)

The latent vector w is a low-dimensional embedding (K ≪ V) of the document-specific distribution of words, where V is the size of the vocabulary. More precisely, for each document, the V-dimensional vector of word probabilities θ is calculated as:

    θ = softmax(m + T w)    (2)

where {m, T} are the parameters of the model. The vector m, known as the universal background model, represents log uni-gram probabilities of words. The matrix T, known as the total variability matrix [Marcel:2010:SMM, Najim:2011:ivec], is a low-rank matrix defining the subspace of document-specific distributions.

Finally, for each document, a vector of word counts x (bag-of-words) is sampled from a Multinomial distribution:

    x ~ Mult(N, θ)    (3)

where N is the number of words in the document.

The above generative process fully defines our Bayesian model, which we will now use to address the following problems: given training data X, we estimate the model parameters {m, T}, and for any given document x, we infer the posterior distribution over the corresponding document embedding w. The parameters of such a posterior distribution can then be used as a low-dimensional representation of the document. Note that this distribution also encodes the inferred uncertainty about the representation.
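The generative process above can be sketched numerically. This is a minimal illustration under the stated model, not the paper's implementation; the function name, the numpy usage and the max-shifted softmax are our own choices:

```python
import numpy as np

def sample_document(m, T, lam, n_words, rng=None):
    """Sample one bag-of-words vector from the Bayesian SMM generative
    process: m holds the log uni-gram probabilities (universal background
    model), T is the low-rank total variability matrix, and lam is the
    precision of the isotropic Gaussian prior."""
    rng = np.random.default_rng() if rng is None else rng
    V, K = T.shape                       # vocabulary size, embedding dim
    # Step 1: draw a K-dimensional latent vector from the isotropic prior.
    w = rng.normal(0.0, 1.0 / np.sqrt(lam), size=K)
    # Step 2: map it to a V-dimensional vector of word probabilities.
    logits = m + T @ w
    theta = np.exp(logits - logits.max())
    theta /= theta.sum()                 # numerically stable softmax
    # Step 3: sample the word counts from a Multinomial distribution.
    x = rng.multinomial(n_words, theta)
    return w, x
```

Repeating the call per document yields a synthetic corpus, which is handy for checking an inference implementation against known parameters.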

Using Bayes’ rule, the posterior distribution of a document embedding is written as¹:

    p(w | x) = p(x | w) p(w) / ∫ p(x | w) p(w) dw    (4)

¹For clarity, explicit conditioning on m and T is omitted in the subsequent equations.

In the numerator of Eq. (4), p(w) represents the prior distribution of document embeddings, which is Gaussian according to Eq. (1), and p(x | w) represents the likelihood of the observed data. According to our generative process, we assumed that every document is a sample from a Multinomial distribution (Eq. (3)), and the log-likelihood is given as follows:

    log p(x | w) = Σᵢ xᵢ [ (mᵢ + tᵢᵀ w) − log Σⱼ exp(mⱼ + tⱼᵀ w) ]    (7)

where tᵢ represents a row in matrix T.

Fig. 1: Graphical model for Bayesian SMM

The problem arises while computing the denominator in Eq. (4). It involves solving the integral over the product of the likelihood term containing the softmax function and the Gaussian distribution (prior). There exists no analytical form for this integral. This is a generic problem that arises while performing Bayesian inference for mixed-logit models, multi-class logistic regression, or any other model where the likelihood function and prior are not conjugate to each other [Bishop:2006:PRML]. In such cases, one can resort to variational inference and find an approximation to the posterior distribution p(w | x). This approximation to the true posterior is referred to as the variational distribution q(w), and is obtained by minimizing the Kullback-Leibler (KL) divergence between the approximate and true posterior. We can express the log marginal (evidence) of the data as:

    log p(x) = D_KL( q(w) ‖ p(w | x) ) + L(q),  with  L(q) = E_q[ log p(x, w) ] + H[q]

Here H[q] represents the entropy of q(w). Given the data x, log p(x) is a constant with respect to q(w), and the KL divergence can be minimized by maximizing L(q), which is known as the Evidence Lower BOund (ELBO) for a document. This is the standard formulation of variational Bayes [Bishop:2006:PRML], where the problem of finding an approximate posterior is transformed into the optimization of the functional L(q).

III Variational Bayes

In this section, using the VB framework, we derive and explain the procedure for estimating the model parameters and inferring the variational distribution q(w). Before proceeding, we note that our model assumes that all documents, and hence the corresponding document embeddings (latent variables), are independent. This can be seen from the graphical model in Fig. 1. Hence, we derive the inference only for one document embedding w, given the observed vector of word counts x.

We choose the variational distribution to be Gaussian, with mean μ and precision Λ, i.e., q(w) = N(w | μ, Λ⁻¹). The functional L(q) now becomes:

    L(q) = E_q[ log p(x | w) ] − D_KL( q(w) ‖ p(w) )    (11)
The first term subtracted in Eq. (11) is the KL divergence between the variational distribution q(w) and the document-independent prior p(w) from Eq. (1). This can be computed analytically [cookbook] as:

    D_KL( q(w) ‖ p(w) ) = ½ [ λ ( tr(Λ⁻¹) + μᵀμ ) − K − K log λ + log |Λ| ]    (12)

where K denotes the dimension of the document embedding. The other term in Eq. (11) is the expectation of the log-likelihood of a document (Eq. (7)):

    E_q[ log p(x | w) ] = Σᵢ xᵢ ( mᵢ + tᵢᵀ μ ) − N E_q[ log Σⱼ exp( mⱼ + tⱼᵀ w ) ]    (13)
Eq. (13) involves solving the expectation over the log-sum-exp operation (denoted here by lse), which is intractable. It appears when dealing with variational inference in mixed-logit models [Blei:2005:CTM, Depraetere:2017:mixed]. We can approximate the expectation with an empirical average using samples from q(w). But lse is a function of w, whose distribution parameters we are seeking by optimizing L(q). The corresponding gradients of L(q) with respect to the parameters of q(w) will exhibit high variance if we directly take the samples from q(w) for the empirical expectation. To overcome this, we re-parametrize the random variable w [Kingma:2014:AEVB]. This is done by introducing a differentiable function of another random variable ε. If ε ~ N(0, I), then,

    w = μ + L ε    (14)

where L is the Cholesky factor of the covariance Λ⁻¹ of q(w). Using this re-parametrization of w, we obtain the following approximation:

    E_q[ lse(m + T w) ] ≈ (1/R) Σᵣ lse( m + T(μ + L εᵣ) )    (15)
where R denotes the total number of samples εᵣ drawn from N(0, I). Combining Eqs. (12), (13) and (15), we get the approximation to the per-document ELBO. We introduce the document suffix d to make the notation explicit:

    F̂_d = Σᵢ x_{di} ( mᵢ + tᵢᵀ μ_d ) − N_d (1/R) Σᵣ lse( m + T( μ_d + L_d εᵣ ) ) − D_KL( q(w_d) ‖ p(w) )    (16)

For the entire training data X, the complete ELBO is simply the summation over all documents, i.e., L = Σ_d F̂_d.
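Putting the analytic KL term and the re-parametrized Monte Carlo estimate together, the per-document bound can be sketched as follows, assuming the diagonal-covariance q(w) introduced later in Section III-A2; this is an illustration with our own function and variable names, not the paper's implementation:

```python
import numpy as np

def elbo_document(x, m, T, mu, sigma, lam, n_samples=8, rng=None):
    """Monte Carlo estimate of the per-document ELBO: expected
    log-likelihood under q(w) = N(mu, diag(sigma**2)) via the
    re-parametrization w = mu + sigma * eps, minus the analytic KL
    divergence against the isotropic prior N(0, (1/lam) I)."""
    rng = np.random.default_rng() if rng is None else rng
    n = x.sum()                          # document length
    # Expected log-likelihood: linear term plus sampled log-sum-exp term.
    ell = x @ (m + T @ mu)
    lse = 0.0
    for _ in range(n_samples):
        w = mu + sigma * rng.normal(size=mu.shape)
        logits = m + T @ w
        lse += logits.max() + np.log(np.exp(logits - logits.max()).sum())
    ell -= n * lse / n_samples
    # Analytic KL divergence for the diagonal covariance case.
    K, var = mu.shape[0], sigma ** 2
    kl = 0.5 * (lam * (var.sum() + mu @ mu) - K - K * np.log(lam)
                - np.log(var).sum())
    return ell - kl
```

With T set to zero and m uniform, the bound reduces to the exact uni-gram log-likelihood, which makes a convenient unit test.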

III-A Training

The variational Bayes (VB) training procedure for Bayesian SMM is stochastic because of the sampling involved in the re-parametrization trick (Eq. (14)). As in the standard VB approach [Bishop:2006:PRML], we optimize the ELBO alternately with respect to the variational parameters {μ_d, Λ_d} and the model parameters {m, T}. Since we do not have closed-form update equations, we perform gradient-based updates. Additionally, we regularize the rows of matrix T while optimizing. Thus, the final objective function becomes,

    max  L − γ Σᵢ ‖tᵢ‖₁    (17)

where we have added a term for ℓ1 regularization of the rows in matrix T, with corresponding weight γ. The same regularization was previously used for the non-Bayesian SMM in [Kesiraju:2016:SMM]. This can also be seen as obtaining a maximum a posteriori estimate of T with Laplace priors.

III-A1 Parameter initialization

The vector m is initialized to the log uni-gram probabilities estimated from the training data. The values in matrix T are randomly initialized. The prior over the latent variables is set to an isotropic Gaussian distribution with zero mean and precision λ. The variational distribution is initialized to a Gaussian sharper than the prior. Later, in Section VII-A, we will show that initializing the posterior to a sharper Gaussian distribution helps in faster convergence.

III-A2 Optimization

The gradient-based updates are made using the adam optimization scheme [Kingma:2014:Adam], in addition to the following tricks:

We simplified the variational distribution by making its precision matrix diagonal². Further, while updating it, we used the standard-deviation parametrization, i.e.,

    σ_d = exp(ζ_d)  (element-wise),  so that  Λ_d⁻¹ = diag( σ_d ⊙ σ_d )    (18)

²This is not a limitation, but only a simplification.
The gradient of the objective w.r.t. the mean μ_d is given as follows:

    ∇_{μ_d} F̂_d = Tᵀ ( x_d − N_d θ̄_d ) − λ μ_d    (19)

    where  θ̄_d = (1/R) Σᵣ softmax( m + T w_{dr} )    (20)

The gradient w.r.t. the standard deviation (through ζ_d) is given as:

    ∇_{ζ_d} F̂_d = (1/R) Σᵣ [ Tᵀ ( x_d − N_d θ_{dr} ) ⊙ εᵣ ] ⊙ σ_d + 1 − λ ( σ_d ⊙ σ_d )    (21)

where 1 represents a column vector of 1s, θ_{dr} is the softmax term according to Eq. (20), ⊙ denotes the element-wise product, and exp(·) in Eq. (18) is the element-wise exponential operation.
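As a sanity check, the gradient w.r.t. the variational mean can be sketched and compared against finite differences. This assumes the diagonal q(w), the isotropic prior N(0, (1/λ)I), and a Monte Carlo estimate of the softmax term; the function name is ours:

```python
import numpy as np

def elbo_grad_mu(x, m, T, mu, sigma, lam, n_samples=8, rng=None):
    """Monte Carlo sketch of the per-document ELBO gradient w.r.t. the
    variational mean mu: average T^T (x - N * theta) over re-parametrized
    samples of w, minus the KL-divergence gradient lam * mu."""
    rng = np.random.default_rng() if rng is None else rng
    n = x.sum()                          # document length N
    g = np.zeros_like(mu)
    for _ in range(n_samples):
        w = mu + sigma * rng.normal(size=mu.shape)
        logits = m + T @ w
        theta = np.exp(logits - logits.max())
        theta /= theta.sum()             # softmax word probabilities
        g += T.T @ (x - n * theta)
    return g / n_samples - lam * mu
```

Setting sigma to zero makes the estimate deterministic, so it can be verified numerically against the log-likelihood minus the quadratic prior term.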

The ℓ1 regularization term in the objective (Eq. (17)) makes it discontinuous (non-differentiable) at points where it crosses an orthant boundary. Hence, we used sub-gradients and employed orthant-wise learning [Andrew:2007:L1]. The gradient of the unregularized objective w.r.t. a row tᵢ in matrix T is computed as follows:

    ∇_{tᵢ} L = Σ_d (1/R) Σᵣ ( x_{di} − N_d θ_{dri} ) w_{dr}    (22)

where θ_{dri} and w_{dr} are according to Eqs. (20) and (14), and sign(·) below operates element-wise. The sub-gradient of the regularized objective is defined as:

    ∇̃_{tᵢ} = ∇_{tᵢ} L − γ sign(tᵢ)  where tᵢⱼ ≠ 0;  at tᵢⱼ = 0 it is  ∇ − γ  if ∇ > γ,  ∇ + γ  if ∇ < −γ,  and 0 otherwise    (23)

Finally, the rows in matrix T are updated according to,

    tᵢ ← π( tᵢ + η m̂ᵢ ⊘ ( √v̂ᵢ + ε̃ ), tᵢ )    (24)

Here η is the learning rate, and m̂ᵢ and v̂ᵢ represent the bias-corrected first and second moments (as required by adam) of the sub-gradient, respectively. π represents the orthant projection as defined in Eq. (26), which assures that the update step does not cross a point of non-differentiability: coordinates whose sign would flip are set to zero. This introduces explicit zeros in the estimated matrix T and results in a sparse solution. Unlike in [Kesiraju:2016:SMM], we do not need to apply the sign projection, because both the gradient and the step point lie in the same orthant (due to properties of adam). The stochastic VB training is outlined in Algorithm 1.

1   initialize model and variational parameters
2   repeat
3       for each document d in the training data do
4           sample εᵣ ~ N(0, I), r = 1 … R
5           compute F̂_d using Eq. (16)
6           compute the gradient w.r.t. μ_d using Eq. (19)
7           compute the gradient w.r.t. σ_d using Eq. (21)
8           update μ_d and σ_d using adam
9       end for
10      compute the objective using Eq. (17)
11      compute the sub-gradients using Eqs. (22) and (23)
12      update the rows in T using Eq. (24)
13  until convergence or max_iterations

Algorithm 1: Stochastic VB training

III-B Inferring embeddings for new documents

After obtaining the model parameters from VB training, we can infer (extract) the document embedding posterior distribution q(w) for any given document. This is done by iteratively updating the parameters of q(w) to maximize the ELBO from Eq. (16). These updates are performed following the same adam optimization scheme as in training.

Note that the embeddings are extracted by maximizing the ELBO, and this does not involve any supervision, i.e., topic labels. These embeddings, which are in the form of posterior distributions, will be used as input features for training topic ID classifiers. Alternatively, one can use only the means of the posterior distributions as point estimates of document embeddings.

IV Gaussian classifier with uncertainty

In this section, we present a generative Gaussian classifier that exploits the uncertainty in the document embedding posterior distributions, both during training and while computing the posterior probability of class labels. The proposed classifier is called Gaussian linear classifier with uncertainty (GLCU) and is inspired by [Kenny:2013:PLDA, Sandro:2015:IS]. It can be seen as an extension of the simple Gaussian linear classifier (GLC) [Bishop:2006:PRML].

Let C denote the number of class labels, d the document index, and h_d the class label of document d in one-hot encoding.

GLC assumes that every class is Gaussian distributed with a class-specific mean m_c and a precision matrix Λ_W shared across classes. Let M denote the matrix of class means, with m_c representing a column. GLC is described by the following model:

    w_d = M h_d + ε_d    (27)

where ε_d ~ N(0, Λ_W⁻¹), and w_d represents the embedding of document d. GLC can be trained by estimating the parameters {M, Λ_W} that maximize the class-conditional likelihood of all training examples:

    Π_d p( w_d | h_d ) = Π_d N( w_d | M h_d, Λ_W⁻¹ )    (28)
In our case, however, the training examples come in the form of posterior distributions q_d(w) = N(w | μ_d, diag(σ_d ⊙ σ_d)), as extracted using our Bayesian SMM. In such a case, the proper ML training procedure would aim to maximize the expected class-conditional likelihood, where the expectation over w would be calculated for each training example with respect to the posterior q_d(w).

However, it is more convenient to introduce an equivalent model, where the observations are the means μ_d of the posteriors, and the uncertainty encoded in q_d(w) is introduced into the model through a latent variable y_d, as

    μ_d = M h_d + y_d + ε_d    (29)

where y_d ~ N(0, diag(σ_d ⊙ σ_d)). The resulting model is called GLCU. Since the random variables y_d and ε_d are Gaussian distributed, the resulting class-conditional likelihood is obtained as the convolution of the two Gaussians [Bishop:2006:PRML], i.e.,

    p( μ_d | h_d ) = N( μ_d | M h_d, Λ_W⁻¹ + diag(σ_d ⊙ σ_d) )    (30)
GLCU can be trained by estimating its parameters {M, Λ_W} that maximize the class-conditional likelihood (Eq. (30)) of the training data. This can be done efficiently using the following EM algorithm, which alternates between the E-step and the M-step.

IV-A EM algorithm

In the E-step, we calculate the posterior distribution of the latent variables:

    p( y_d | μ_d, h_d ) = N( y_d | ȳ_d, Γ_d⁻¹ )    (31)

    with  Γ_d = diag( σ_d ⊙ σ_d )⁻¹ + Λ_W   and   ȳ_d = Γ_d⁻¹ Λ_W ( μ_d − M h_d )

In the M-step, we maximize the auxiliary function Q with respect to the model parameters {M, Λ_W}. It is the expectation of the joint log-probability with respect to the posterior from the E-step, i.e.,

    Q = Σ_d E_{p(y_d | μ_d, h_d)} [ log p( μ_d, y_d | h_d ) ]    (32)

Maximizing the auxiliary function w.r.t. m_c, we have:

    m_c = (1 / |D_c|) Σ_{d ∈ D_c} ( μ_d − ȳ_d )    (33)

where D_c is the set of documents from class c. In our experiments, we observed that the model converges in about 5 iterations.

IV-B Classification

Given a test document embedding posterior distribution q(w) = N(w | μ, diag(σ ⊙ σ)), we compute the class-conditional likelihood according to Eq. (30), and the posterior probability of a class c is obtained by applying Bayes’ rule:

    p( c | μ ) = p( μ | c ) P(c) / Σ_{c'} p( μ | c' ) P(c')
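For diagonal covariances, the classification rule (convolution likelihood of Eq. (30) followed by Bayes' rule) can be sketched as follows; the diagonal shared covariance is a simplifying assumption and all names are ours:

```python
import numpy as np

def glcu_class_posteriors(mu, s2, class_means, shared_var, priors):
    """Posterior class probabilities for one document under a GLCU-style
    model with diagonal covariances.  mu, s2: mean and diagonal variance
    of the document embedding posterior; class_means: (C, K) matrix of
    class means; shared_var: diagonal of the shared within-class
    covariance; priors: prior class probabilities.  The class-conditional
    likelihood is a Gaussian with covariance shared_var + s2 evaluated at
    mu (the convolution of the two Gaussians)."""
    var = shared_var + s2                 # covariance of the convolution
    diff = class_means - mu               # (C, K) mean offsets
    log_lik = -0.5 * (((diff ** 2) / var).sum(axis=1)
                      + np.log(2.0 * np.pi * var).sum())
    log_post = log_lik + np.log(priors)
    log_post -= log_post.max()            # stabilize before exponentiating
    post = np.exp(log_post)
    return post / post.sum()
```

Note how a larger embedding variance s2 flattens the class posteriors, which is exactly the behaviour the uncertainty-aware classifier is meant to have.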
V Related models

In this section, we review and relate some of the popular PTMs and neural network based document models. We begin with a brief review of LDA [Blei:2003:LDA], a probabilistic generative model for the bag-of-words representation of documents.

V-A Latent Dirichlet allocation

LDA assumes that every latent topic is a discrete probability distribution over the vocabulary of words, and every document is a discrete probability distribution over the latent topics. The generative process for a document (bag-of-words) can be explained by the following two steps: (1) first, a document-specific vector (embedding) of topic proportions is sampled from a Dirichlet prior; (2) then, for each word in the document, a latent topic is sampled from those proportions, and the word is in turn sampled from the topic-specific distribution over the vocabulary.
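The two-step LDA generative process can be sketched as follows; the array shapes and names are our own illustration:

```python
import numpy as np

def sample_lda_document(alpha, beta, n_words, rng=None):
    """Sample one document from the LDA generative process: alpha
    parametrizes the Dirichlet prior over topic proportions, and beta is
    a (topics x vocabulary) matrix whose rows are topic-word
    distributions."""
    rng = np.random.default_rng() if rng is None else rng
    theta = rng.dirichlet(alpha)              # document-topic proportions
    words = np.empty(n_words, dtype=int)
    for i in range(n_words):
        z = rng.choice(len(alpha), p=theta)   # latent topic for this word
        words[i] = rng.choice(beta.shape[1], p=beta[z])
    return theta, words
```

The per-word latent topic z is precisely the discrete variable that forces the mean-field approximation discussed below, and that Bayesian SMM avoids.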

Fig. 2: Graphical model for LDA

The topic vectors live in the word simplex and the document vectors in the topic simplex. For every word in a document, there is a discrete latent variable z that tells which topic was responsible for generating that word. This can be seen from the respective graphical model in Fig. 2.

During inference, the generative process is inverted to obtain the posterior distribution over the latent variables, given the observed data and the prior belief. Since the true posterior is intractable, Blei et al. [Blei:2003:LDA] resorted to variational inference, which finds an approximation to the true posterior in the form of a variational distribution. Further, a mean-field approximation was made to make the inference tractable, i.e., the variational distribution factorizes over the latent variables.

In the original model proposed by Blei et al. [Blei:2003:LDA], the parameters were obtained using a maximum likelihood approach. The choice of a Dirichlet distribution for the topic proportions simplifies the inference process because of the Dirichlet-Multinomial conjugacy. However, the Dirichlet assumption also limits the model: it cannot capture the correlations between topics in a document. This was the motivation for Blei and Lafferty [Blei:2005:CTM] to model documents with a logistic-normal distribution; the resulting model is called the correlated topic model (CTM).

V-B Correlated topic model

The generative process for a document in CTM is the same as in LDA, except that the document vectors are now drawn from a logistic-normal distribution: a Gaussian sample is mapped onto the simplex through the softmax function. In this formulation, the document embeddings are no longer confined to the simplex; they are mapped to it through the logistic-normal transformation (the natural parametrization of the exponential family). This is the same as in our proposed Bayesian SMM (Eq. (1)). The advantage is that the document vectors can model the correlations between topics. The topic distributions over the vocabulary, however, still remain constrained to the simplex. In Bayesian SMM, the topic-word parameters (the columns of T) are not constrained in this way; hence the model can capture correlations between words and (latent) topics.

Fig. 3: Graphical model for CTM

The variational inference in CTM is similar to that of LDA, including the mean-field approximation, because of the discrete latent variable (Fig. 3). The additional problem is dealing with the non-conjugacy; more specifically, the intractability of the expectation over the log-sum-exp function (see Eq. (13)). Blei and Lafferty [Blei:2005:CTM] used Jensen’s inequality to form an upper bound on the log-sum-exp term, which in turn acted as a lower bound on the ELBO. In our proposed Bayesian SMM, we encountered the same problem, and we approximated the expectation using the re-parametrization trick (Section III). Similar approximation techniques exist based on Quasi Monte Carlo sampling [Depraetere:2017:mixed].

Unlike LDA or CTM, Bayesian SMM does not require a mean-field approximation, because words are not generated through a per-word mixture of topics, thus eliminating the need for the discrete latent variable z.

V-C Subspace multinomial model

SMM is a log-linear model, originally proposed for modelling discrete prosodic features for the task of speaker verification [Marcel:2010:SMM]. Later, it was used for phonotactic language recognition [Mehdi:2011:SMM] and eventually for topic identification and document clustering [May:2015:mivec, Kesiraju:2016:SMM]. A similar model was proposed by Maas et al. [Maas:2011:Sent] for unsupervised learning of word representations. One of the major differences among these works is the type of regularization used for matrix T. Another major difference is in obtaining embeddings for a given test document. Maas et al. [Maas:2011:Sent] obtained them by projecting the vector of word counts onto the matrix T, whereas [May:2015:mivec, Kesiraju:2016:SMM] extracted the embeddings by maximizing a regularized log-likelihood function.

V-D Paragraph vector

Paragraph vector bag-of-words (PV-DBOW) [Quoc:2014:PV] is also a log-linear model, where the document embeddings are stochastically trained to maximize the likelihood of a set of words in a given document. During the stochastic updates, this set of words is randomly sampled from the observed document. SMM can be seen as a generalization of PV-DBOW, as it maximizes the likelihood of all the words in a document.

V-E Neural network based models

Neural variational document model (NVDM) is an adaptation of variational auto-encoders for document modelling [NVI:2016]. The encoder models the posterior distribution of the latent variables given the input, and the decoder models the distribution of the input data given the latent variables. In NVDM, the authors used bag-of-words as input, while the encoder and decoder are two-layer feed-forward neural networks. The decoder part of NVDM is similar to Bayesian SMM, as both models maximize the expected log-likelihood of data under a Multinomial distribution. In simple terms, Bayesian SMM can be seen as a decoder with a single feed-forward layer. For a given test document, in NVDM, the approximate posterior distribution of the latent variables is obtained directly by forward propagation through the encoder, whereas in Bayesian SMM, it is obtained by iteratively optimizing the ELBO. We will show in Section VII that the posterior distributions obtained from Bayesian SMM represent the data better than the ones obtained directly from the encoder of NVDM.

V-F Sparsity in topic models

Sparsity is often one of the desired properties in topic models [SAGE:2011, Biksha:2007:Sparse]. A sparse-coding-inspired topic model (STC) was proposed in [Zhu:2011:STC], where the authors obtained sparse representations for both documents and words. ℓ1 regularization over T for SMM (ℓ1 SMM) was observed to yield better results when compared to LDA, STC and ℓ2-regularized SMM (ℓ2 SMM) [Kesiraju:2016:SMM]. The relation between SMM and the sparse additive generative model (SAGE) [SAGE:2011] was explained in [May:2015:mivec]. In [Mekala:2017:SCDV], the authors proposed an algorithm to obtain sparse document embeddings, called sparse composite document vectors (SCDV), from pre-trained word embeddings. In our proposed Bayesian SMM, we introduce sparsity into the model by applying ℓ1 regularization and using orthant-wise learning.

VI Experiments

VI-A Description of datasets

We have conducted experiments on both speech and text corpora. The speech data used is the Fisher phase 1 corpus³, a collection of 5850 conversational telephone speech recordings with a closed set of 40 topics. Each conversation is approximately 10 minutes long, has two sides of the call, and is supposedly about one topic. Each side of a call (recording) is treated as an independent document, which resulted in a total of 11700 documents. The details of the data splits are presented in Table I; they are the same as used in earlier research [Hazen:2007:ASRU, Hazen:2011:MCE_topic_ID, May:2015:mivec]. Our preprocessing involved removing punctuation and special characters, but we did not remove any stop words. We have used both manual and automatic transcriptions. The latter are obtained from a DNN-HMM based automatic speech recognition (ASR) system built with the Kaldi toolkit [Kaldi:2011:ASRU], following the training algorithm (recipe) described in [Karel:2013]. The ASR system resulted in 18% word-error-rate on a held-out test set. The vocabulary size was 24854 when using manual transcriptions and 18292 for automatic ones; the average document length (in number of words) is 830 and 856, respectively.

³https://catalog.ldc.upenn.edu/LDC2004S13

Set # docs. Duration (hrs.)
ASR training 6208 553
Topic ID training 2748 244
Topic ID test 2744 226
TABLE I: Data splits from Fisher phase 1 corpus, where each document represents one side of the conversation.

The text corpus used is 20Newsgroups444http://qwone.com/~jason/20Newsgroups/, which contains 11314 training and 7532 test documents over 20 topics. Our preprocessing involved removing punctuation and also words that do not occur in at least 2 documents. This resulted in a vocabulary of 56433 words. The average document length is 290 words.

VI-B Hyper-parameters of Bayesian SMM

In our topic ID experiments, we observed that the embedding dimension K and the regularization weight γ for the rows in matrix T are the two most important hyper-parameters. Both were tuned by grid search over a range of values.

VI-C Proposed topic ID systems

In our Bayesian SMM, the document embeddings are extracted (inferred) in an iterative fashion by optimizing the ELBO; this does not necessarily correlate with the performance of topic ID. This is true for SMM, NVDM or any other generative model trained without supervision. A typical way to overcome this problem is to have a topic ID performance monitoring system (PMS), which evaluates the topic ID accuracy on a held-out (or cross-validation) set at regular intervals during the inference. The PMS can be used to stop the inference earlier if needed.

Using the above-described scheme, we trained three different classifiers: (i) a Gaussian linear classifier (GLC) and (ii) multi-class logistic regression (LR), both trained using only the mean parameter μ of the posterior distributions, and (iii) the Gaussian linear classifier with uncertainty (GLCU), trained using the full posterior distributions, i.e., along with the uncertainties of the document embeddings, as described in Section IV.

VI-D Baseline topic ID systems

VI-D1 NVDM

In the original paper [NVI:2016], the authors did not use the embeddings from NVDM for document classification. Since NVDM and our proposed Bayesian SMM share similarities, we chose to extract the embeddings from NVDM and use them for training linear classifiers. Given a trained NVDM model, embeddings for any test document can be extracted just by forward propagating through the encoder. Although this is computationally cheap, one needs to decide when to stop training, as a fully converged NVDM may not yield optimal embeddings for discriminative tasks such as topic ID. Hence, we used the same strategy of having a topic ID PMS. We used the same three classifier pipelines (LR, GLC, GLCU) as for Bayesian SMM. Our architecture and training scheme are similar to the ones proposed in [NVI:2016], i.e., two feed-forward layers, with the number of hidden units, the activation functions and the latent dimension tuned in cross-validation experiments.

VI-D2 SMM

Our second baseline system is the non-Bayesian SMM with ℓ1 regularization over the rows in matrix T, i.e., ℓ1 SMM. It was trained over a grid of hyper-parameters: the embedding dimension K and the regularization weight γ. The embeddings obtained from SMM were then used to train GLC and LR classifiers. Note that we cannot use GLCU here, because SMM yields only point estimates of the embeddings. The classifier training scheme is the same as for Bayesian SMM and NVDM, i.e., using the topic ID performance monitoring system.

Later, in Section VII-C, we will show that Bayesian SMM is more robust to over-fitting when compared to SMM and NVDM, and does not require a performance monitoring system.


VI-D3 ULMFiT

The third baseline system is the universal language model fine-tuned for classification (ULMFiT) [Ruder:2018:Universal]. The pre-trained⁵ model consists of 3 BiLSTM layers. Fine-tuning the model involves two steps: (a) fine-tuning the LM on the target dataset, and (b) training the classifier (MLP layer) on the target dataset. We trained several models with various drop-out rates. More specifically, the LM was fine-tuned for 15 epochs⁶ and the classifier for 50 epochs, each with several drop-out choices. All the hyper-parameters were tuned on a held-out development set.

⁵https://github.com/fastai/fastai
⁶Fine-tuning the LM for a higher number of epochs degraded the classification performance.

VI-D4 TF-IDF

The fourth baseline system is a standard term frequency - inverse document frequency (TF-IDF) based document representation, followed by multi-class logistic regression (LR). Although TF-IDF is not a topic model, the classification performance of TF-IDF based systems is often close to that of state-of-the-art systems [May:2015:mivec]. The hyper-parameter (ℓ2 regularization weight) of LR was selected based on 5-fold cross-validation experiments on the training set.
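As a sketch of this representation, one common TF-IDF weighting variant (idf = log(N/df) + 1) can be computed as below; practical systems typically add further smoothing and normalization options, and the function name is ours:

```python
import numpy as np

def tfidf(counts):
    """TF-IDF weighting of a (documents x vocabulary) count matrix using
    one common variant: tf is the within-document relative frequency and
    idf = log(N / df) + 1, where df is the document frequency of each
    term and N the number of documents."""
    tf = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1.0)
    df = (counts > 0).sum(axis=0)
    idf = np.log(counts.shape[0] / np.maximum(df, 1.0)) + 1.0
    return tf * idf
```

A term occurring in every document gets the minimum idf weight, while rarer terms are up-weighted, which is what makes the representation effective for linear classifiers.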

VII Results and Discussion

Fig. 4: Convergence of Bayesian SMM with respect to the initialization of variational distribution.

VII-A Convergence of Bayesian SMM

We observed that the posterior distributions extracted using Bayesian SMM are always much sharper than the standard Normal distribution. Hence, for faster convergence, we initialized the variational distribution to . This can be seen in Fig. 4, where the objective (ELBO) is plotted against training iterations for two different initializations of the variational distribution. Here, the model was trained on manual transcriptions of the Fisher corpus, with embedding dimension , regularization weight , and the prior set to the standard . We can observe that the model initialized to converges faster than the one initialized to the standard .

VII-B Perplexity

Perplexity is an intrinsic measure for topic models [DBM:2013:Hinton, NVI:2016]. We computed it over the entire test data according to

    PPL = exp( - ( Σ_d log p(w_d) ) / ( Σ_d N_d ) ),

where N_d is the number of words in document d. For Bayesian SMM, log p(w_d) is approximated with a lower bound according to Eq. (16). This is also the case with NVDM. Since log p(w_d) is approximated with a lower bound, the resulting perplexity values act as upper bounds.
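Concretely, the corpus-level computation can be sketched as follows; the per-document log-likelihood lower bounds (e.g. the ELBOs from Eq. (16)) and document lengths here are made-up numbers:

```python
import numpy as np

# Lower bounds on log p(w_d) for each test document (illustrative values),
# and the corresponding document lengths N_d.
log_p = np.array([-3500.0, -1200.0, -2750.0])
n_words = np.array([500, 180, 400])

# Perplexity: exponentiated negative average log-likelihood per word.
# Since log_p are lower bounds, this value is an upper bound on the true PPL.
ppl = np.exp(-log_p.sum() / n_words.sum())
```

A better-fitting model assigns higher (less negative) log-likelihood bounds and therefore a lower perplexity upper bound.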

Figs. 5(a) and 5(b) compare the test data perplexities of Bayesian SMM and NVDM on both datasets. The horizontal solid green line shows the perplexity computed using the maximum likelihood probabilities estimated on the test data.

The NVDM model was trained for 3000 VB iterations, and at regular checkpoints we froze the model, extracted the embeddings for the test data and computed the perplexities. The Bayesian SMM was trained for 5000 VB iterations, and the test document embeddings were extracted for 5000 iterations; the perplexity was computed at regular checkpoints during extraction and is shown in Figs. 5(a) and 5(b). We can observe that Bayesian SMM fits the test data better than NVDM. Moreover, NVDM tends to over-fit on the training data, and hence its test data perplexities increase from around iteration 400. This suggests that a held-out set is required to monitor the perplexity and decide when to stop training.

(a) PPL of Fisher test data.
(b) PPL of 20Newsgroups test data.
Fig. 5: Comparison of test data perplexities of Bayesian SMM, NVDM from both Fisher and 20Newsgroups datasets. The latent (embedding) dimension was set to 100. The horizontal solid green line shows the perplexity computed using the maximum likelihood probabilities estimated on the test data.

VII-C Performance monitoring for topic ID systems

Embeddings extracted from a model trained in a purely unsupervised fashion do not necessarily yield optimal results when used in a supervised scenario. As discussed earlier, the document embeddings extracted from the unsupervised models (SMM, Bayesian SMM, NVDM) are used as features for training linear classifiers for topic ID, and this requires a performance monitoring system to achieve optimal results. To illustrate this, we plot topic ID accuracy against the iterative optimization of the unsupervised models: NVDM, SMM and Bayesian SMM.

NVDM was trained for 3000 iterations, and at regular intervals during training, we froze the model and extracted embeddings for both training and test documents. We then trained topic ID classifiers and evaluated them on the training data using 5-fold cross-validation, and on the test data. In the case of SMM and Bayesian SMM, the model was trained for 5000 iterations, and the embeddings were then extracted for 5000 iterations; at regular intervals during extraction, we trained a topic ID system and evaluated it on the training set with 5-fold cross-validation, and on the test set. The cross-validation results can be used to decide when to stop the iterative optimization. Fig. 6 shows the above-described scheme executed on Fisher data; for the purpose of illustration, the embedding dimension was set to 100 and GLC was chosen as the classifier. From Fig. 6, we can observe that the embeddings extracted from the non-Bayesian SMM are prone to over-fitting, hence the steep fall in test accuracy. The cross-validation accuracies of NVDM and Bayesian SMM are similar and consistent over the iterations. However, the test accuracy of NVDM is much lower than that of Bayesian SMM and also decreases over the iterations, whereas the test accuracy of Bayesian SMM increases and stays consistent. This shows that our proposed model is robust and does not require any performance monitoring system for topic ID.
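The stopping rule reduces to picking the checkpoint with the best cross-validation accuracy and only then reporting its test accuracy; a minimal sketch with made-up accuracy curves:

```python
import numpy as np

# Hypothetical 5-fold cross-validation accuracies logged at regular
# checkpoints during embedding extraction (one value per checkpoint).
checkpoints = np.array([500, 1000, 1500, 2000, 2500])
cv_accuracy = np.array([0.78, 0.84, 0.86, 0.85, 0.83])

# Select the checkpoint whose embeddings gave the best cross-validation
# score; the test set is never used for this selection.
best = checkpoints[np.argmax(cv_accuracy)]
print(best)   # 1500
```

For Bayesian SMM the curves are flat enough that this selection step is unnecessary, which is the robustness claim illustrated in Fig. 6.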

Fig. 6: Topic ID performance monitoring system on Fisher data.

VII-D Topic ID results

                                                    Manual transcriptions    Automatic transcriptions
Systems       Model                    Classifier   Accuracy (%)    CE       Accuracy (%)    CE
Baseline      BoW [Hazen:2007:ASRU]    NB           87.61           -        -               -
              TF-IDF [May:2015:mivec]  LR           86.41           -        -               -
Our baseline  TF-IDF                   LR           86.59           0.93     86.77           0.94
              ULMFiT                   MLP          86.41           0.50     86.08           0.50
              SMM                      LR           86.81           0.91     87.02           1.09
              SMM                      GLC          85.17           1.64     85.53           1.54
              NVDM                     LR           81.16           0.94     83.67           1.15
              NVDM                     GLC          84.47           1.25     84.15           1.22
              NVDM                     GLCU         83.96           0.93     83.01           0.97
Proposed      Bayesian SMM             LR           89.91           0.89     88.23           0.95
              Bayesian SMM             GLC          89.47           1.05     87.23           1.46
              Bayesian SMM             GLCU         89.54           0.68     87.54           0.77
TABLE II: Comparison of results on Fisher test sets, from earlier published works, our baselines and proposed systems.

In this section, we present the topic ID results in terms of classification accuracy (in %) and cross-entropy (CE) on the test sets. Cross-entropy indicates how confident the classifier is about its predictions; a well-calibrated classifier tends to have a lower cross-entropy.
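The CE reported in the tables is the average negative log-probability the classifier assigns to the correct class; a small sketch with toy posteriors:

```python
import numpy as np

# Toy posterior probabilities over 3 topics for 2 documents, and true labels.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.1, 0.8]])
labels = np.array([0, 2])

# Cross-entropy: mean negative log-probability of the correct class.
# A confident, well-calibrated classifier puts high probability on the
# correct label and therefore scores a low CE.
ce = -np.mean(np.log(probs[np.arange(len(labels)), labels]))
print(round(ce, 3))   # → 0.29
```

An over-confident classifier that is wrong pays a large log-penalty, which is why CE separates systems with similar accuracies.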

Table II presents the classification results on the Fisher speech corpora with manual and automatic transcriptions. The first two rows are the best reported results from earlier published research on the Fisher corpus with manual transcriptions. Hazen [Hazen:2007:ASRU] used discriminative vocabulary selection followed by a naïve Bayes (NB) classifier. The major drawback of this approach is that the naïve Bayes model is restricted to a limited (small) vocabulary. Although we used the same training and test splits, May [May:2015:mivec] used a slightly larger vocabulary than ours, and their best system was based on TF-IDF, which is similar to our baseline TF-IDF based system. The remaining rows in Table II show our baseline and proposed systems. We can see that our proposed systems achieve consistently better accuracies than the other systems. Notably, GLCU, which exploits the uncertainty in document embeddings, has a much lower cross-entropy than its counterpart, GLC. To the best of our knowledge, the proposed systems achieve the best classification results on the Fisher corpora in the current set-up, i.e., treating each side of a conversation as an independent document.

Table III presents the classification results on the 20Newsgroups dataset. The first three rows give the results as reported in earlier works. In [Raghu:2018:CNN], the authors present a CNN based discriminative model trained to jointly optimize a categorical cross-entropy loss for the classification task along with a binary cross-entropy loss for a verification task. The neural tensor skip-gram model (NTSG) [Liu:2015:NTSG] extends the idea of the skip-gram model to obtain document embeddings, whereas the sparse composite document vector (SCDV) [Mekala:2017:SCDV] exploits pre-trained word embeddings to obtain sparse document embeddings. The authors in [Mekala:2017:SCDV] showed superior classification results compared to paragraph vector, LDA and other systems. The remaining rows in Table III present our baseline and proposed systems. We see that the topic ID system based on Bayesian SMM with logistic regression is better than all the other models except the discriminatively trained CNN. We can also see that all the Bayesian SMM based systems are consistently better than the variational auto-encoder inspired NVDM.

Systems        Model                    Classifier   Accuracy (%)   CE
Baseline       CNN [Raghu:2018:CNN]     -            86.12          -
               SCDV [Mekala:2017:SCDV]  SVM          84.60          -
               NTSG-1 [Liu:2015:NTSG]   SVM          82.60          -
Our baselines  TF-IDF                   LR           84.47          0.73
               ULMFiT                   MLP          83.06          0.89
               SMM                      LR           82.01          0.75
               SMM                      GLC          82.02          1.33
               NVDM                     LR           79.57          0.86
               NVDM                     GLC          77.60          1.65
               NVDM                     GLCU         76.86          0.88
Proposed       Bayesian SMM             LR           84.65          0.53
               Bayesian SMM             GLC          83.22          1.28
               Bayesian SMM             GLCU         82.81          0.79
TABLE III: Comparison of results on 20Newsgroups.

VII-E Uncertainty in document embeddings

Fig. 7: Uncertainty captured in the document embeddings inferred using Bayesian SMM.

The uncertainty captured in the posterior distribution of document embeddings correlates well with the size of the document. This can be seen in the gradient equation (Eq. (21)). It can also be seen by plotting the trace of the covariance matrix of the inferred posterior distribution of the embeddings. Fig. 7 shows an example of the uncertainty captured in the embeddings, where the corresponding Bayesian SMM was trained on 20Newsgroups with the embedding dimension set to 100. Note that our model is trained in an unsupervised fashion, and hence the captured uncertainties do not necessarily correlate with the topic labels.
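This diagnostic can be sketched as follows; the diagonal posterior covariances here are simulated with an assumed shrinkage rate so that shorter documents get broader posteriors, mimicking the trend in Fig. 7 rather than reproducing the trained model:

```python
import numpy as np

K = 100   # embedding dimension, as in the Fig. 7 experiment

# Simulated document lengths and diagonal posterior covariances whose
# per-dimension variance shrinks with document size (assumed rate 0.01).
doc_lengths = np.array([50, 200, 800, 3200])
traces = np.array([np.trace(np.diag(np.full(K, 1.0 / (1.0 + 0.01 * n))))
                   for n in doc_lengths])

# Longer documents -> more evidence -> smaller total posterior uncertainty.
print(traces.round(2))
```

With embeddings from a trained Bayesian SMM, the same trace-versus-length plot would be built from the inferred covariances instead of these simulated ones.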

VIII Conclusions and future work

We have presented a generative model for learning document representations (embeddings) along with their uncertainties. We showed that our model achieves superior perplexity on the standard 20Newsgroups and Fisher datasets when compared to the neural variational document model. Next, we showed that the proposed model is robust to over-fitting and, unlike SMM and NVDM, does not require any topic ID performance monitoring system. We proposed an extension to the simple Gaussian linear classifier that exploits the uncertainty in document embeddings and achieves better cross-entropy scores on the test data than the simple GLC. Other scoring mechanisms that exploit the uncertainty in embeddings also exist [Brummer:2018:GE]. Using simple linear classifiers on the obtained document embeddings, we achieved superior classification results on Fisher speech data and comparable results on 20Newsgroups text data. We also addressed a commonly encountered problem of intractability in variational inference for mixed-logit models by using the re-parametrization trick. This idea translates in a straightforward way to the subspace n-gram model for learning sentence embeddings, and also to learning word embeddings along with their uncertainties. The proposed Bayesian SMM can be extended to have topic-specific priors for document embeddings, which can in turn be learned from labelled data, thus implicitly incorporating topic label uncertainty in the document embeddings.

Appendix A Gradients of Lower Bound

The variational distribution is diagonal with the following parametrization:


The lower bound for a single document is:




It is convenient to have the following derivatives:


Derivatives of the parameters of variational distribution:

Taking the derivative of the objective function with respect to the mean parameter and using Eq. (45):

Taking the derivative of the objective function with respect to and using Eq. (46):

Derivatives of the model parameters:

Taking the derivative of the complete objective (Eq. (17)) with respect to a row of matrix :


Here, operates element-wise.

Appendix B EM algorithm for GLCU


Obtaining the posterior distribution of latent variable . Using the results from [cookbook] (p. 41, Eq. (358)):

where is simplified as:

resulting in:



Maximizing the auxiliary function


Using the results from [cookbook] (p. 43, Eq. (378)), the auxiliary function is computed as:

Maximizing the auxiliary function with respect to model parameters

Taking derivative with respect to each column in and equating it to zero:

Taking derivative with respect to shared precision matrix and equating it to zero:



This work was supported by the U.S. DARPA LORELEI contract No. HR0011-15-C-0115. The views expressed are those of the authors and do not reflect the official policy or position of the Department of Defense or the U.S. Government. The work was also supported by the Technology Agency of the Czech Republic project No. TJ01000208 "NOSICI" and by the Czech Ministry of Education, Youth and Sports from the National Programme of Sustainability (NPU II) project "IT4Innovations excellence in science - LQ1602".