Effective Estimation of Deep Generative Language Models

by Tom Pelsmaeker et al.
University of Amsterdam

Advances in variational inference enable parameterisation of probabilistic models by deep neural networks. This combines the statistical transparency of the probabilistic modelling framework with the representational power of deep learning. Yet, it seems difficult to effectively estimate such models in the context of language modelling. Even models based on rather simple generative stories struggle to make use of additional structure due to a problem known as posterior collapse. We concentrate on one such model, namely, a variational auto-encoder, which we argue is an important building block in hierarchical probabilistic models of language. This paper contributes a sober view of the problem, a survey of techniques to address it, novel techniques, and extensions to the model. Our experiments on modelling written English text support a number of recommendations that should help researchers interested in this exciting field.






1 Introduction

Deep generative models (DGMs) are probabilistic latent variable models parameterised by neural networks (NNs). Specifically, DGMs optimised with amortised variational inference and reparameterised gradient estimates (Kingma and Welling, 2014; Rezende et al., 2014)

, better known as variational auto-encoders (VAEs), have spurred much interest in various domains, including computer vision and natural language processing (NLP).

In NLP, VAEs have been developed for word representation (Rios et al., 2018), morphological analysis (Zhou and Neubig, 2017), syntactic and semantic parsing (Corro and Titov, 2018; Lyu and Titov, 2018), document modelling (Miao et al., 2016), summarisation (Miao and Blunsom, 2016), machine translation (Zhang et al., 2016), language and vision (Pu et al., 2016; Wang et al., 2017), dialogue modelling (Wen et al., 2017; Serban et al., 2017), speech modelling (Fraccaro et al., 2016), and, of course, language modelling (Bowman et al., 2016; Goyal et al., 2017a). One problem remains common to the majority of these models: VAEs often learn to ignore the latent variables.

We investigate this problem, dubbed by many posterior collapse, in the context of language modelling (LM). This is motivated by the fact that within NLP, DGMs attract a lot of attention from researchers in language generation, where systems usually employ an LM component.[1] In a deep generative LM (Bowman et al., 2016), sentences are generated conditioned on samples from a continuous latent space, an idea with various practical applications. For example, it gives NLP researchers an opportunity to shape this latent space and promote generalisations that are in line with linguistic knowledge and/or intuition (Xu and Durrett, 2018). This also allows for greater flexibility in how the model is used; for example, we can generate sentences that live, in latent space, in a neighbourhood of a given observation (Bowman et al., 2016). Deterministically trained language models, e.g. recurrent NN-based LMs (Mikolov et al., 2010), lack a latent space and are thus deprived of such explicit mechanisms.[2]

[1] Albeit typically modified to condition on additional inputs, for example, a chat history in dialogue modelling.
[2] It is possible to promote some of these properties in non-probabilistic auto-encoding frameworks (Vincent et al., 2008; Tolstikhin et al., 2017).

Decoding | Generated sentence
Greedy | The company said it expects to report net income of $UNK-NUM million
Sample | They are getting out of my own things ?
Sample | IBM also said it will expect to take next year .
(a) Greedy generation from prior samples (top) yields the same sentence every time, showing that the latent code is ignored. Yet, ancestral sampling (bottom) produces good sentences, showing that the recurrent decoder learns about the structure of English sentences.

The two sides hadn’t met since Oct. 18.
I don’t know how much money will be involved.
The specific reason for gold is too painful.
The New Jersey Stock Exchange Composite Index gained 1 to 16.
And some of these concerns aren’t known.
Prices of high-yield corporate securities ended unchanged.
(b) Homotopy: ancestral samples mapped from points along a linear interpolation of two given sentences as represented in latent space. The sentences do not seem to exhibit any coherent relation, showing that the model does not exploit neighbourhood in latent space to capture regularities in data space.

Figure 1: Sentences generated from a vanilla VAE-LM.

Despite this potential, VAEs that employ strong generators (e.g. recurrent NNs) tend to ignore the latent variable (Bowman et al., 2016; Zhang et al., 2016). Figure 1 illustrates this point with samples from a vanilla VAE LM: the model does not capture useful patterns in data space and behaves just like a standard recurrent LM. Various strategies to counter this problem have been independently proposed and tested, in particular, within the computer vision and machine learning communities. One of our contributions is a review and comparison of such strategies, as well as a novel strategy based on constrained optimisation.

There have also been attempts at identifying the fundamental culprit for posterior collapse (Chen et al., 2017; Alemi et al., 2018), leading to strategies based on changes to the generator, prior, and/or posterior. Following up on that, we improve inference for Bowman et al. (2016)’s VAE by employing a class of flexible approximate posteriors (Tabak et al., 2010; Rezende et al., 2014) and modify the model to employ strong priors.

Finally, we compare models and techniques intrinsically in terms of perplexity as well as bounds on mutual information between latent variable and observations. Our findings support a number of recommendations on how to effectively train a deep generative language model.

2 Density Estimation for Text

Density estimation for written text has a long history (Jelinek, 1980; Goodman, 2001), but in this work we concentrate on neural network models (Bengio et al., 2003), in particular, autoregressive ones (Mikolov et al., 2010). Following common practice, we model sentences independently, each a sequence of tokens.

2.1 Language models

A language model (LM) prescribes the generation of a sentence x as a sequence of categorical draws parameterised in context, i.e.

P(x|θ) = ∏_{i=1}^{|x|} Cat(x_i | f(x_{<i}; θ)) .

To condition on all of the available context, a fixed NN f maps from a prefix sequence (denoted x_{<i}) to the parameters of a categorical distribution over the vocabulary. Given a dataset of i.i.d. observations, we estimate the parameters θ of the model by searching for a local optimum of the log-likelihood function E[log P(x|θ)] via stochastic gradient-based optimisation (Robbins and Monro, 1951; Bottou and Cun, 2004), where the expectation is taken w.r.t. the true data distribution and approximated with samples. Throughout, we refer to this model as RnnLM.
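To make the factorisation concrete, the sketch below computes a sentence's log-probability as a sum of categorical log-probabilities. It is illustrative only: toy_scores is a stand-in for the recurrent NN f, and all names are hypothetical.

```python
import math

def softmax(logits):
    """Map unnormalised scores to the parameters of a categorical distribution."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_scores(prefix, vocab_size=4):
    # Stand-in for the NN f(x_{<i}; theta): any deterministic map from a prefix
    # to one score per vocabulary item will do for this illustration.
    return [float((len(prefix) + w) % 3) for w in range(vocab_size)]

def sentence_log_prob(sentence):
    # log P(x | theta) = sum_i log Cat(x_i | f(x_{<i}; theta))
    log_p = 0.0
    for i, token in enumerate(sentence):
        probs = softmax(toy_scores(sentence[:i]))
        log_p += math.log(probs[token])
    return log_p
```

In a real RnnLM, toy_scores would be replaced by a recurrent network reading the prefix; the chain-rule factorisation is unchanged.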

2.2 Deep generative language models

Bowman et al. (2016) model observations as draws from the marginal of a DGM. An NN maps from a latent sentence embedding z to a distribution over sentences,

Z ~ N(0, I)
X_i | z, x_{<i} ~ Cat(f(z, x_{<i}; θ)) ,

where z follows a standard Gaussian prior.[3]

Generation still happens one word at a time without Markov assumptions, but now conditions on z in addition to the observed prefix. The conditional P(x|z, θ) is commonly referred to as generator or decoder. The quantity p(x|θ) = ∫ p(z) P(x|z, θ) dz is the marginal likelihood, essential for parameter estimation.

[3] We use uppercase P for probability mass functions and lowercase p for probability density functions.

This model is trained to assign high (marginal) probability to observations, much like standard LMs. However, unlike standard LMs, it employs a latent space which can accommodate a low-dimensional manifold where discrete sentences are mapped to, via posterior inference p(z|x, θ), and from, via generation P(x|z, θ). This gives the model an explicit mechanism to exploit neighbourhood and smoothness in latent space to capture regularities in data space. For example, it may group sentences according to certain latent factors (e.g. lexical choices, syntactic complexity, lexical semantics, etc.). It also gives users a mechanism to steer generation towards a certain purpose; for example, one may be interested in generating sentences that are mapped from the neighbourhood of another in latent space. The more appreciable the regularities this embedding space captures, the more interesting this property becomes.

Approximate inference

Marginal inference for this model is intractable and calls for approximate methods, in particular, variational inference (VI; Jordan et al., 1999), whereby an auxiliary and independently parameterised model q(z|x, λ) approximates the true posterior p(z|x, θ). Where this inference model is itself parameterised by a neural network, we have a case of amortised inference (Kingma and Welling, 2014; Rezende et al., 2014) and an instance of what is known as a VAE. Bowman et al. (2016) approach posterior inference with a Gaussian model

q(z|x, λ) = N(z | u(x; λ), diag(s(x; λ) ⊙ s(x; λ))) ,

whose parameters, i.e. a location vector u(x; λ) and a scale vector s(x; λ) > 0, are predicted by a neural network architecture from an encoding of the complete observation x.[4] In this work, we use a bidirectional recurrent encoder (see Appendix A.1 for the complete design). Throughout the text we will refer to this model as SenVAE.

[4] We use boldface for deterministic vectors and ⊙ for elementwise multiplication.

Parameter estimation

We can jointly estimate the parameters of both models (i.e. generative and inference) by locally maximising a lowerbound on the log-likelihood function (ELBO)

E_{q(z|x, λ)}[log P(x|z, θ)] − KL(q(z|x, λ) || p(z))

via gradient-based optimisation. For as long as we can reparameterise latent samples using a fixed random source, e.g. z = u(x; λ) + s(x; λ) ⊙ ε with ε ~ N(0, I), automatic differentiation (Baydin et al., 2018) can be used to obtain unbiased gradient estimates of the ELBO (Kingma and Welling, 2014; Rezende et al., 2014). In §5 we discuss a general class of reparameterisable distributions of which the Gaussian distribution is a special case.
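The reparameterised single-sample ELBO estimate can be sketched in plain Python (no autodiff framework; log_lik stands in for the decoder's log-likelihood and is an assumption of this sketch):

```python
import math
import random

def gaussian_kl(mu, sigma):
    # Analytic KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over dimensions.
    return sum(0.5 * (s * s + m * m - 1.0) - math.log(s)
               for m, s in zip(mu, sigma))

def elbo_estimate(mu, sigma, log_lik, num_samples=1):
    # Reparameterised ELBO estimate: E_q[log P(x|z)] - KL(q || p),
    # with z = mu + sigma * eps and eps ~ N(0, I) as the fixed random source.
    acc = 0.0
    for _ in range(num_samples):
        eps = [random.gauss(0.0, 1.0) for _ in mu]
        z = [m + s * e for m, s, e in zip(mu, sigma, eps)]
        acc += log_lik(z)
    return acc / num_samples - gaussian_kl(mu, sigma)
```

Because z is a deterministic, differentiable function of (mu, sigma) given eps, gradients of this estimate w.r.t. the variational parameters are unbiased.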

3 Posterior Collapse and the Strong Generator Problem

In VI, we make inferences using an approximation q(z|x, λ) to the true posterior and choose λ as to minimise the KL divergence KL(q(z|x, λ) || p(z|x, θ)). The same principle yields a lowerbound on log-likelihood used to estimate θ jointly with λ, thus making the true posterior a moving target. If the estimated conditional can be made independent of z, which in our case means relying exclusively on the prefix x_{<i} to predict the distribution of X_i, the true posterior will be independent of the data and equal to the prior.[5] Based on such observation, Chen et al. (2017) argue that information that can be modelled by the generator without using latent variables will be modelled that way, precisely because when no information is encoded in the latent variable the true posterior equals the prior and it is then trivial to reduce the KL divergence to zero. This is typically diagnosed by noting that after training KL(q(z|x, λ) || p(z)) ≈ 0 for most x: we say that the approximate posterior collapses to the prior.

[5] This follows trivially from the definition of posterior: p(z|x, θ) = p(z) P(x|z, θ) / p(x|θ).

In fact, Alemi et al. (2018) show that the rate, R = E[KL(q(z|x, λ) || p(z))], is an upperbound to the mutual information (MI) between X and Z. From the non-negativity of MI, it follows that whenever the KL is close to zero for most training instances, MI is either zero or negligible. Alemi et al. (2018) also show that the distortion, D = −E[log P(x|z, θ)], relates to a lowerbound on MI (the lowerbound being H − D, where H is the unknown but constant data entropy). Due to this relationship to MI, they argue that reporting R and D along with log-likelihood on held-out data offers better insights about a trained VAE, advice we follow in §6.

A generator that makes no Markov assumptions, such as a recurrent LM, can potentially achieve low distortion at zero rate, and indeed many have noticed that VAEs whose observation models are parameterised by such strong generators (or strong decoders) learn to ignore the latent representation (Bowman et al., 2016; Higgins et al., 2017; Sønderby et al., 2016; Zhao et al., 2018b). For this reason, a strategy to prevent posterior collapse is to weaken the decoder (Yang et al., 2017; Park et al., 2018). While in many cases there are good reasons for changing the model’s factorisation, in this work we are interested in employing a strong generator, thus we will not investigate weaker decoders. Alternative solutions typically involve changes to the optimisation procedure and/or manipulations to the objective. The former aims at finding local optima of the ELBO with non-negligible MI. The latter seeks alternatives to the ELBO that target MI more directly.


Annealing

Bowman et al. (2016) propose “KL annealing”, whereby the KL term in the ELBO is incorporated into the objective in gradual steps. This way, early on in optimisation the optimiser can focus on reducing distortion, potentially by increasing the MI between X and Z. They also propose to drop words from the prefix uniformly at random to somewhat weaken the decoder and promote an increase in MI; the intuition is that the model would have to rely on z to compensate for missing history. As we do not want to compromise the decoder, we propose a slight modification of this technique whereby we slowly vary this word dropout rate from 1 to 0, instead of selecting a fixed value. In a sense, we anneal the decoder from a weak generator to a strong generator.
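The two schedules can be sketched as follows; linear schedules are one plausible choice, the step sizes (increment, decrement) are the hyperparameters listed in Table 2:

```python
def kl_annealing_weight(step, increment):
    # KL annealing: the weight on the KL term grows linearly from 0 to 1.
    return min(1.0, step * increment)

def annealed_word_dropout(step, decrement):
    # Our variant: the word dropout rate shrinks linearly from 1 to 0,
    # annealing the decoder from a weak to a strong generator.
    return max(0.0, 1.0 - step * decrement)
```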

Targeting rates

Another idea is to target a pre-specified positive rate r (Alemi et al., 2018). Kingma et al. (2016) replace the KL term in the ELBO with max(r, KL(q(z|x, λ) || p(z))), dubbed free bits (FB) because it allows encoding the first r nats of information “for free”. For r > 0 we are not optimising a proper ELBO (it misses the KL term whenever the rate is below r), and the max introduces a discontinuity at KL = r. Chen et al. (2017) propose soft free bits (SFB), that instead multiplies the KL term in the ELBO with a weighing factor β that is dynamically adjusted based on the target rate r: β is incremented (or reduced) whenever the rate falls below (or exceeds) r. Note that this technique requires hyperparameters besides r to be tuned in order to determine how β is updated.
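A minimal sketch of both surrogates follows; the multiplicative SFB update shown here is one plausible instantiation (the exact update rule and its hyperparameters vary):

```python
def free_bits_kl(kl, target_rate):
    # FB: the KL term is replaced by max(r, KL), so the first r nats are "free".
    # Note the kink (gradient discontinuity) at kl == target_rate.
    return max(target_rate, kl)

def soft_free_bits_beta(beta, kl, target_rate, gamma=0.05,
                        beta_min=1e-3, beta_max=1.0):
    # SFB: the KL weight beta is nudged down when the rate undershoots the
    # target (encouraging more information in z) and back up once it is met.
    if kl < target_rate:
        beta = max(beta * (1.0 - gamma), beta_min)
    elif kl > target_rate:
        beta = min(beta * (1.0 + gamma), beta_max)
    return beta
```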

Change of objective

If we accept there is a fundamental problem with the ELBO, we may seek alternative objectives and relate them to quantities of interest such as marginal likelihood and MI. A simple adaptation of the ELBO is weighing its KL term by a constant factor β (β-VAE; Higgins et al., 2017). Although it was originally aimed at disentanglement of latent features with a β > 1, setting β < 1 promotes a positive rate and thus increased MI. Whilst being a useful counter to posterior collapse, low β might lead to variational posteriors becoming point estimates. The InfoVAE objective (Zhao et al., 2018b) mitigates this with an extra term on top of the β-VAE objective which minimises the divergence between the aggregated variational posterior q(z|λ) = E[q(z|x, λ)] and the prior. In our experiments we compute this divergence with an unbiased estimate of the maximum mean discrepancy (MMD; Gretton et al., 2012).

4 Minimum desired rate

We propose minimum desired rate (MDR), a technique to attain ELBO values at a pre-specified rate that does not suffer from the gradient discontinuities of FB, and does not introduce the additional hyperparameters of SFB. The idea is to optimise the ELBO subject to a minimum rate constraint KL(q(z|x, λ) || p(z)) ≥ r:

min_{θ, λ} −ELBO(θ, λ)  subject to  KL(q(z|x, λ) || p(z)) ≥ r .

Because constrained optimisation is generally intractable, we optimise the Lagrangian (Boyd and Vandenberghe, 2004)

L(θ, λ, u) = −ELBO(θ, λ) + u (r − KL(q(z|x, λ) || p(z))) ,

where u is a positive Lagrangian multiplier. We define the dual function g(u) = min_{θ, λ} L(θ, λ, u) and solve the dual problem max_{u ≥ 0} g(u). Local minima of the resulting min-max objective can be found by performing stochastic gradient descent with respect to (θ, λ) and stochastic gradient ascent with respect to u.

Appendix B presents further theoretical remarks comparing β-VAE, annealing, FB, SFB and the proposed MDR. We show that MDR is a form of KL weighing, albeit one that targets a specific rate. It can be seen, for example, as β-VAE where β = 1 − u (though note that u is not fixed). Compared to annealing, we argue that a target rate is far more interpretable a hyperparameter than the length (number of steps) and type (e.g. linear or exponential) of annealing schedule. Like SFB, MDR addresses FB’s discontinuity in the gradients of the rate. Finally, we show that MDR is a form of SFB where β is dynamically set to 1 − u, thus much simpler to tune.
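The MDR updates can be sketched as follows. Writing −ELBO = D + R, the Lagrangian is D + R + u(r − R), minimised in the model parameters and maximised in u via projected gradient ascent (step sizes are illustrative):

```python
def mdr_objective(distortion, rate, u, target_rate):
    # Lagrangian of -ELBO subject to rate >= r:
    # L = D + R + u * (r - R); minimised in (theta, lambda), maximised in u.
    return distortion + rate + u * (target_rate - rate)

def mdr_update_multiplier(u, rate, target_rate, step_size=0.01):
    # Projected gradient ascent keeps u >= 0: u grows while the rate is
    # below target, and shrinks back towards 0 once the constraint is met.
    return max(0.0, u + step_size * (target_rate - rate))
```

Note that collecting terms gives D + (1 − u)R + u·r, which makes the connection to β-VAE with β = 1 − u explicit.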

5 Expressive Latent Components

The observation by Chen et al. (2017) suggests that estimating θ and λ jointly leads to choosing a generative model such that its corresponding (true) posterior is simple and can be matched exactly. With a Gaussian prior and a complex observation model, unless the latent variable is ignored, the posterior is certainly not Gaussian and likely multimodal. In §5.1, we modify Bowman et al. (2016)’s inference network to parameterise an expressive posterior approximation in an attempt to reach better local optima.

The information theoretic perspective of Alemi et al. (2018) suggests that the prior regularises the inference model, capping the MI between X and Z. Their bounds also suggest that, for a fixed posterior approximation, the optimum prior is the aggregated posterior q(z|λ) = E[q(z|x, λ)], and, therefore, investigating the use of strong priors seems like a fruitful avenue for effective estimation of DGMs. In §5.2, we modify SenVAE’s generative story to employ an expressive prior.

5.1 Expressive posterior

We improve inference for SenVAE using normalising flows (NFs; Rezende and Mohamed, 2015). An NF expresses the density of a transformed variable in terms of the density of a base variable using the change of density rule:

q(z|x, λ) = q(ε) |det J_t(ε)|^{−1} ,  where z = t(ε) ,

where z and ε are D-dimensional and t is a differentiable and invertible transformation with Jacobian J_t(ε). For efficiency, it is crucial that the determinant of J_t is simple, e.g. computable in O(D). NFs parameterise t (or its inverse) with neural networks, where either t, the network, or both are carefully designed to comply with the aforementioned conditions. A special case of an NF is when ε ~ N(0, I) and t is affine with strictly positive slope, which essentially makes Z a diagonal Gaussian.
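The affine special case makes the change-of-density rule easy to verify by hand. A minimal sketch (illustrative names, standard-normal base density):

```python
import math

def standard_normal_logpdf(vec):
    # log density of N(0, I) at vec, summed over dimensions.
    return sum(-0.5 * (v * v + math.log(2.0 * math.pi)) for v in vec)

def affine_flow(eps, mu, sigma):
    # z = t(eps) = mu + sigma * eps (elementwise): invertible, with a
    # diagonal Jacobian, so log|det J_t| = sum_d log sigma_d.
    z = [m + s * e for m, s, e in zip(mu, sigma, eps)]
    log_det = sum(math.log(s) for s in sigma)
    # Change of density: log q(z) = log q(eps) - log|det J_t(eps)|.
    return z, standard_normal_logpdf(eps) - log_det
```

Scaling sigma up spreads the density out, and the log-density drops by exactly the log-determinant, as the rule prescribes.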

We design a posterior approximation based on an inverse autoregressive flow (IAF; Kingma et al., 2016), whereby we transform a base sample ε into a posterior sample by computing

z = u(ε) + s(ε) ⊙ ε

via an affine transformation whose inverse is autoregressive. This is crucial to obtaining a Jacobian whose determinant is simple to compute (see Appendix C.1 for the derivation), i.e. the product of the scales. For increased flexibility, we compose several such transformations, each parameterised by an independent MADE (Germain et al., 2015).[6] We also investigate a flow that is more compact in terms of number of parameters, known as a planar flow (PF; Rezende and Mohamed, 2015). The transformation in a PF is not autoregressive, but it is designed such that the determinant of its Jacobian is simple (see Appendix C.2).

[6] A MADE is a dense layer whose weight matrix is masked to be strictly lower triangular; it realises autoregressive transformations between fixed-dimension representations in parallel (see Appendix A.1 for details).

5.2 Expressive priors

Here we extend the prior to some more complex, ideally multimodal, parametric family and fit its parameters. A perhaps obvious choice is a uniform mixture of Gaussians (MoG), i.e.

p(z|θ) = (1/K) ∑_{k=1}^{K} N(z | μ_k, diag(σ_k ⊙ σ_k)) ,

where the Gaussian parameters {μ_k, σ_k} are optimised along with other generative parameters.
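Evaluating such a uniform mixture is a log-sum-exp over component log-densities; a minimal sketch (illustrative names):

```python
import math

def diag_gaussian_logpdf(z, mu, sigma):
    # log N(z | mu, diag(sigma^2)), summed over dimensions.
    return sum(-0.5 * (((zi - m) / s) ** 2 + math.log(2.0 * math.pi))
               - math.log(s)
               for zi, m, s in zip(z, mu, sigma))

def mog_log_prior(z, components):
    # Uniform mixture: log p(z) = log (1/K) sum_k N(z | mu_k, diag(sigma_k^2)),
    # computed with log-sum-exp for numerical stability.
    logs = [diag_gaussian_logpdf(z, mu, sigma) for mu, sigma in components]
    m = max(logs)
    return m + math.log(sum(math.exp(l - m) for l in logs) / len(logs))
```

In training, the component parameters would be free (learned) tensors; here they are plain lists for illustration.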

A less obvious choice is a variational mixture of posteriors (VampPrior; Tomczak and Welling, 2017). This prior is motivated by the fact that, for a fixed posterior approximation, the prior that optimises the ELBO is the aggregated posterior q(z|λ). Though we could obtain an empirical estimate of this quantity, this is an intensive computation to perform for every sampled z. Instead, Tomczak and Welling (2017) propose to use K learned pseudo inputs and design the prior

p(z|λ) = (1/K) ∑_{k=1}^{K} q(z | u_k, λ) ,

where u_k is the kth such input (in their case a continuous deterministic vector). Again the parameters of the prior, i.e. the pseudo inputs, are optimised along with other generative parameters.

Applying this technique to our deep generative LM poses additional challenges as our inference model conditions on a sequence of discrete observations. We adapt this technique by point-estimating a sequence of word embeddings, which makes up a pseudo input. That is, a pseudo input is a sequence of vectors, each with the dimensionality of our word embeddings, whose length is fixed at the beginning of training. See Appendix A.1 for remarks about both priors.

5.3 KL term

Be it due to an expressive posterior or due to an expressive prior (or both), we lose analytical access to the KL term in the ELBO. That is, however, not a problem, since we can MC-estimate the term using samples z^(1), …, z^(S) ~ q(z|x, λ):

KL(q(z|x, λ) || p(z)) ≈ (1/S) ∑_{s=1}^{S} [log q(z^(s)|x, λ) − log p(z^(s))] ,

where in experiments we use a single sample (S = 1).
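The estimator itself is a one-liner, assuming the two log-densities are available as callables (names illustrative):

```python
def mc_kl_estimate(samples, log_q, log_p):
    # KL(q || p) ~= (1/S) sum_s [log q(z_s | x) - log p(z_s)], z_s ~ q(.|x).
    # Unbiased for any S >= 1; more samples reduce variance.
    return sum(log_q(z) - log_p(z) for z in samples) / len(samples)
```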

6 Experiments

Our goal is to identify which techniques are effective in training VAEs for language modelling, and our evaluation concentrates on intrinsic metrics: negative log-likelihood (NLL), perplexity per token (PPL), rate (R), distortion (D), the number of active units (AU; Burda et al., 2015), and the gap in accuracy of next word prediction (given gold prefixes) when decoding from prior samples versus decoding from posterior samples (ACC).

For VAE models, NLL (and therefore PPL) can only be estimated, since we do not have access to the exact marginal likelihood. For that we derive an importance sampling (IS) estimate

p(x|θ) ≈ (1/S) ∑_{s=1}^{S} P(x|z^(s), θ) p(z^(s)) / q(z^(s)|x, λ) ,  z^(s) ~ q(z|x, λ) ,

using our trained approximate posterior as importance distribution.
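In log space, the estimate is a log-sum-exp over importance weights. A minimal sketch, assuming the joint and posterior log-densities are available as callables (names illustrative):

```python
import math

def is_log_marginal(samples, log_joint, log_q):
    # log p(x) ~= log (1/S) sum_s exp( log p(x, z_s) - log q(z_s | x) ),
    # with z_s ~ q(z | x); log-sum-exp keeps the computation stable.
    log_w = [log_joint(z) - log_q(z) for z in samples]
    m = max(log_w)
    return m + math.log(sum(math.exp(w - m) for w in log_w) / len(log_w))
```

The estimate lowerbounds log p(x) in expectation and tightens as S grows, which is why we prefer it over single-sample ELBO estimates for evaluation.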

We train and test our models on the English Penn Treebank (PTB) dataset (Marcus et al., 1993).[7] Hyperparameters for our architectures are chosen via Bayesian optimisation (BO; Snoek et al., 2012); see Appendix A.2 for details.

[7] We employ Dyer et al. (2016)’s pre-processing and (standard) partitioning of the data.


We compare our RnnLM to an external baseline employing a comparable number of parameters (Dyer et al., 2016).[8] Table 1 shows that our RnnLM is a strong baseline and its architecture makes a strong generator building block.

[8] The current SOTA PTB-trained models use vastly more parameters and different pre-processing (Melis et al., 2017).

Table 1: Baseline LMs (ours and Dyer et al. (2016)'s) on the PTB test set, over five independent runs. Contrary to us, Dyer et al. (2016) removed the end of sentence token when computing perplexity. In the last column, we report perplexity computed with the stop token removed.

On optimisation strategies

Technique | Hyperparameters
KL annealing | increment
Annealed word dropout (AWD) | decrement
FB | target rate r
SFB | target rate r, plus the β-update hyperparameters
MDR | target rate r
β-VAE | KL weight β
InfoVAE | two weighing factors
Table 2: Techniques and their hyperparameters.
Mode | Hyperparameters | D | R | NLL | PPL | ACC
RnnLM | - | - | - | 118.7 ± 0.12 | 107.1 ± 0.46 | -
Vanilla | - | 118.4 ± 0.09 | 0.0 ± 0.00 | 118.4 ± 0.09 | 105.7 ± 0.36 | 0.0 ± 0.00
Annealing | increment: 2e | 115.3 ± 0.24 | 3.3 ± 0.30 | 117.9 ± 0.08 | 103.7 ± 0.31 | 6.0 ± 0.25
AWD | decrement: 2e | 117.6 ± 0.14 | 0.0 ± 0.00 | 117.6 ± 0.14 | 102.5 ± 0.60 | 0.0 ± 0.00
FB | r: 5.0 | 113.3 ± 0.17 | 5.0 ± 0.06 | 117.5 ± 0.18 | 101.9 ± 0.77 | 5.8 ± 0.11
SFB | r: 6.467; others: 1%, 1, 1.05 | 112.0 ± 0.15 | 6.4 ± 0.04 | 117.3 ± 0.13 | 101.0 ± 0.51 | 7.0 ± 0.08
MDR | r: 5.0 | 113.5 ± 0.10 | 5.0 ± 0.03 | 117.5 ± 0.11 | 102.1 ± 0.45 | 6.2 ± 0.14
β-VAE | β: 0.66 | 113.0 ± 0.14 | 5.3 ± 0.05 | 117.4 ± 0.13 | 101.7 ± 0.50 | 6.1 ± 0.12
InfoVAE | 0.700, 31.623 | 113.5 ± 0.09 | 4.3 ± 0.02 | 117.2 ± 0.09 | 100.8 ± 0.35 | 5.2 ± 0.09
Table 3: Performance (mean ± std across five independent runs) of SenVAE on the PTB validation set. D: distortion; R: rate.
Figure 2: Validation results for SenVAE trained with free bits (FB) or minimum desired rate (MDR): (a) PPL for various target rates; (b) rate over time; (c) target rate minus validation rate at the end of training for various targets.

First, we assess the effectiveness of techniques that aim at promoting local optima of SenVAE with a better MI tradeoff. The techniques we compare have hyperparameters of their own (see Table 2), which we tune using BO towards minimising estimated NLL of the validation data. As for the architecture, the approximate posterior employs a bidirectional recurrent encoder, and the generator is essentially our RnnLM initialised with a learned projection of z (Appendix A.1 contains the complete specification). Models were trained with Adam (Kingma and Ba, 2014) with default parameters until convergence, five times for each technique.

Results can be found in Table 3. First, note how the vanilla VAE (no special treatment) encodes no information in latent space (R ≈ 0). Then note that all techniques converged to VAEs that attain better perplexities than the RnnLM, and all but annealed word dropout did so at non-negligible rate. Notably, the two most popular techniques, word dropout and annealing, perform sub-par to the other techniques.[9] The techniques that work well at non-negligible rate can be separated into two classes: the first requires setting a target rate, whereas the second requires tuning one or more other hyperparameters.[10] We argue that the rate hyperparameter is more interpretable and practical in most cases; for example, it likely requires less (manual or Bayesian) tuning by the researcher. Hence, we further investigated this first class, specifically FB and MDR, by varying the target rate further. Figure 2(a) shows that they attain comparable perplexities over a large range of rates. Figure 2(c) shows the difference between the specified target rate and the rate estimated on validation data at the end of training: MDR is just as good as FB for lower targets and becomes more effective than FB for higher targets.

[9] Though here we show annealed word dropout, to focus on techniques that do not weaken the generator, standard word dropout also converged to negligible rates.
[10] Soft free bits actually requires both.

On expressive priors and posteriors

Posterior | Prior | D | R | PPL | AU | ACC
- | - | - | - | 84.5 ± 0.53 | - | -
Gaussian | Gaussian | 103.5 ± 0.14 | 5.0 ± 0.06 | 81.5 ± 0.49 | 13 ± 0.7 | 5.4 ± 0.13
Gaussian | MoG | 103.3 ± 0.11 | 5.0 ± 0.07 | 81.4 ± 0.54 | 32 ± 0.0 | 5.8 ± 0.11
Gaussian | Vamp | 103.1 ± 0.11 | 5.0 ± 0.06 | 81.2 ± 0.40 | 22 ± 1.6 | 5.8 ± 0.07
Planar | Gaussian | 103.4 ± 0.09 | 4.9 ± 0.02 | 80.9 ± 0.33 | 12 ± 0.4 | 5.4 ± 0.11
IAF | Gaussian | 103.4 ± 0.10 | 5.0 ± 0.02 | 81.4 ± 0.34 | 32 ± 0.0 | 5.5 ± 0.10
IAF | MoG | 103.2 ± 0.25 | 5.1 ± 0.05 | 81.5 ± 0.70 | 32 ± 0.0 | 6.0 ± 0.09
Table 4: Performance on the PTB test set of SenVAE with various prior and posterior distributions (mean ± std across independent runs). D: distortion; R: rate. All VAEs were trained with a target rate of five, and the top row shows RnnLM.
Figure 3: Comparison of SenVAEs trained with standard prior and Gaussian posterior (Gauss), MoG prior and IAF posterior (IAF-MoG), and Vamp prior and Gaussian posterior (Vamp) to attain pre-specified rates: (a) perplexity on the validation set (models perform similarly well and perplexity degrades considerably at high target rates); (b) accuracy gap (VAEs with stronger latent components rely more on posterior samples for reconstruction).

Second, we compare the impact of expressive posteriors and priors. This time, flow and prior hyperparameters were selected via grid search, and can be found in Appendix A.1. All models were trained with a target rate of five, with settings otherwise the same as in the previous experiment. Table 4 shows that more expressive components did not improve perplexity further. It is possible, however, that now that we have stronger latent components, we need to target models with higher MI between X and Z. Figure 3(a) shows that this is not the case: all models perform roughly the same, and performance degrades quickly at higher rates. It is worth highlighting that, though perplexity did not improve, models with expressive latent components did show other indicators of increased MI. Figure 3(b) shows that SenVAEs trained with expressive latent components learn to rely more on information encoded in the latent variables; note the increased gap in performance when reconstructing from posterior rather than prior samples. This result is also hinted at by the increase in active units for expressive latent components shown in Table 4.

Generated samples

Figure 4: Samples from SenVAE (MoG prior and IAF posterior) trained via MDR: we sample from the prior and decode greedily. We also show the closest training instance in terms of a string edit distance (TER).

Sample: For example, the Dow Jones Industrial Average fell almost 80 points to close at 2643.65.
Closest: By futures-related program buying, the Dow Jones Industrial Average gained 4.92 points to close at 2643.65.

Sample: The department store concern said it expects to report profit from continuing operations in 1990.
Closest: Rolls-Royce Motor Cars Inc. said it expects its U.S. sales to remain steady at about 1,200 cars in 1990.

Sample: The new U.S. auto makers say the accord would require banks to focus on their core businesses of their own account.
Closest: International Minerals said the sale will allow Mallinckrodt to focus its resources on its core businesses of medical products, specialty chemicals and flavors.
The inquiry soon focused on the judge.
The judge declined to comment on the floor.
The judge was dismissed as part of the settlement.
The judge was sentenced to death in prison.
The announcement was filed against the SEC.
The offer was misstated in late September.
The offer was filed against bankruptcy court in New York.
The letter was dated Oct. 6.
Figure 5: Latent space homotopy from a properly trained SenVAE. Note the smooth transition of topic and grammaticality of the samples. All sentences were greedily decoded.

Figure 4 shows samples from a well-trained SenVAE, where we decode greedily from a prior sample; this way all variability is due to the generator’s reliance on the latent sample. Recall that a vanilla VAE ignores z and thus greedy generation from a prior sample is essentially deterministic in that case (see Figure 1a). Next to the samples we show the closest training instance, which we measure in terms of an edit distance (TER; Snover et al., 2006).[11] The motivation to retrieve this “nearest neighbour” is to help us assess whether the generator is producing novel text or simply reproducing something it memorised from training. We also show a homotopy in Figure 5: here we decode greedily from points lying between a posterior sample conditioned on the first sentence and a posterior sample conditioned on the last sentence. In contrast to the vanilla VAE (Figure 1b), neighbourhood in latent space is now used to capture some regularities in data space. These samples add support to the quantitative evidence that our DGMs have been effectively trained not to neglect the latent space. In Appendix D we provide more samples (also for other variants of the model).

[11] This distance metric varies from 0 to 1, where 1 indicates the sentence is completely novel and 0 indicates the sentence is essentially copied from the training data.


Based on our path through the land of SenVAEs, we recommend targeting a specific rate via MDR (or FB) instead of annealing (or word dropout). It is easy to pick a rate by plotting validation performance against a handful of rate values, without sophisticated Bayesian optimisation. Use importance-sampled estimates of NLL, rather than single-sample ELBO estimates, for model selection, for the latter can be too loose a bound and/or too heavily influenced by noisy estimates of KL. Use as many samples as you can for that: you will observe a tighter bound and lower variance. Inspect sentences generated by greedily decoding from a prior (or posterior) sample, as this shows whether the generator is at all sensitive to variation in latent space. Retrieve nearest neighbours from the training data to spot copying behaviour. Do investigate stronger latent components (priors and approximate posteriors); they seem to lead to higher mutual information without hurting perplexity (which weaker generators probably would).
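The importance-sampled NLL estimate we recommend can be sketched in a few lines. The example below is our own toy illustration (not the paper's code): a one-dimensional linear-Gaussian model whose marginal likelihood is known in closed form, so the estimate log p(x) ≈ log (1/S) Σ_s p(x, z_s)/q(z_s|x) can be checked exactly; here the proposal happens to be the true posterior, so the importance weights are constant and the estimate is exact.

```python
import numpy as np

def importance_sampled_nll(log_joint, log_q, z_samples):
    """-log p(x), estimated as -(logsumexp_s [log p(x,z_s) - log q(z_s|x)] - log S)."""
    log_w = np.array([log_joint(z) - log_q(z) for z in z_samples])
    m = log_w.max()
    return -(m + np.log(np.exp(log_w - m).sum()) - np.log(len(log_w)))

# Toy model: z ~ N(0,1), x|z ~ N(z,1), hence marginally x ~ N(0,2).
x = 1.3
log_joint = lambda z: (-0.5 * z ** 2 - 0.5 * np.log(2 * np.pi)
                       - 0.5 * (x - z) ** 2 - 0.5 * np.log(2 * np.pi))
# Proposal: the true posterior p(z|x) = N(x/2, 1/2).
mu_q, var_q = x / 2.0, 0.5
log_q = lambda z: (-0.5 * (z - mu_q) ** 2 / var_q
                   - 0.5 * np.log(2 * np.pi * var_q))

rng = np.random.default_rng(0)
zs = rng.normal(mu_q, np.sqrt(var_q), size=1000)
est = importance_sampled_nll(log_joint, log_q, zs)
exact = 0.5 * x ** 2 / 2.0 + 0.5 * np.log(2 * np.pi * 2.0)  # NLL under N(0,2)
```

With a learnt approximate posterior the weights are no longer constant; more samples then tighten the bound and reduce variance, which is the behaviour the recommendation above relies on.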

7 Related Work

In NLP, posterior collapse was probably first noticed by Bowman et al. (2016), who addressed it via word dropout and/or KL scaling. Further investigation revealed that in the presence of strong generators, the ELBO itself becomes the culprit (Chen et al., 2017; Alemi et al., 2018), for it does not have a term that explicitly promotes high MI between latent and observed data. Posterior collapse has also been ascribed to amortised inference (Kim et al., 2018). Beyond the techniques compared and developed in this work, other solutions have been proposed, including further adaptations to the generator architecture (Semeniuta et al., 2017; Yang et al., 2017; Park et al., 2018; Dieng et al., 2018), the ELBO (Tolstikhin et al., 2017; Goyal et al., 2017a), and the latent distributions (Xu and Durrett, 2018; Razavi et al., 2019).

Concurrently with this work, He et al. (2019) proposed aggressive optimisation of the inference network until convergence of MI estimates. The authors show that this outperforms KL scaling techniques, specifically the β-VAE. However, in contrast to our MDR objective, the extra optimisation of the inference network slows down training considerably. A comparison to their technique is an interesting direction for future work.

GECO (Rezende and Viola, 2018) and the Lagrangian VAE (LagVAE; Zhao et al., 2018a) cast VAE optimisation as a dual problem, and in that sense they are closely related to our MDR. GECO targets minimisation of KL under constraints on reconstruction error, whereas LagVAE targets either maximisation or minimisation of (bounds on) the MI between data and latent variable under constraints on the InfoVAE objective. Contrary to MDR, GECO focuses on latent space regularisation and offers no explicit mechanism to mitigate posterior collapse. LagVAE, in MI-maximisation mode, promotes non-negligible rates, but requires constraints based on feasible ELBO values (this might be a reasonable requirement for often-explored datasets, such as MNIST). Thus, in this setting, it is somewhat the opposite of our technique: MDR optimises the ELBO while targeting a specific rate (an upper bound on MI), whereas LagVAE maximises MI while targeting an ELBO value. Which of the two methods is more convenient may depend on the specific problem. All three techniques share the advantage that they can be trivially extended with other constraints at the researcher's behest.

Expressive latent components have been extensively and successfully applied to the image domain. Expressive posteriors, based on NFs, include the IAF (Kingma et al., 2016), NAF (Huang et al., 2018a), ODE-based flows (Chen et al., 2018) and FFJORD (Grathwohl et al., 2019), the Sylvester flow (van den Berg et al., 2018) and the Householder flow (Tomczak and Welling, 2017). Expressive priors include the VampPrior (Tomczak and Welling, 2017), autoregressive flows (Papamakarios et al., 2017) and various non-parametric priors (Nalisnick and Smyth, 2016; Goyal et al., 2017b; Bodin et al., 2017). However, these techniques have seen little application to the language domain so far, with the exception of the Householder flow for variational topic modelling (Liu et al., 2018) and, concurrently with this work, NFs for latent sentence modelling with character-level latent variables and weak generators (Ziegler and Rush, 2019). We believe we are the first to employ expressive latent models at the sentence level, and hope this will stimulate the NLP community to further investigate these techniques.

8 Discussion

The typical RnnLM is built upon an exact factorisation of the joint distribution; thus a well-trained architecture is hard to improve upon in terms of log-likelihood of gold-standard data. Our interest in latent variable models stems from the desire to obtain generative stories that are less opaque than that of an RnnLM, for example, in that they may expose knobs that we can use to control generation and a hierarchy of steps that may award a degree of interpretability to the model. The SenVAE is not that model, but it is a crucial building block in the pursuit of hierarchical probabilistic models of language. SenVAE is a deep generative model whose generative story is rather shallow; yet, due to its strong generator component, it is hard to make effective use of the extra knob it offers. In this paper, we have shown that effective estimation of such a model is possible; in particular, optimisation subject to a minimum rate constraint seems a simple and effective strategy to alleviate posterior collapse. Many questions remain open, especially regarding the potential of expressive latent components, but we hope this work, i.e. the organised review it contributes and the techniques it introduces, will pave the way to deeper (in statistical hierarchy) generative models of language.


This project has received funding from the Dutch Organization for Scientific Research VICI Grant No 277-89-002 and from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 825299 (GoURMET).


Appendix A Architectures and Hyperparameters

In order to ensure that all our experiments are fully reproducible, this section provides an extensive overview of the model architectures, as well as model and optimisation hyperparameters.

Some hyperparameters are common to all experiments, e.g. the optimiser and dropout; these can be found in Table 5. All models were optimised with Adam using default settings (Kingma and Ba, 2014). To regularise the models, we use (variational) dropout with a shared mask across time-steps (Gal and Ghahramani, 2016) and weight decay proportional to the dropout rate (Gal and Ghahramani, 2015) on the input and output layers of the generative networks (i.e. RnnLM and the recurrent decoder in SenVAE). No dropout is applied to layers of the inference network, as this does not lead to consistent empirical benefits and lacks a good theoretical basis. Gradient norms are clipped to prevent exploding gradients, and long sentences are truncated to three standard deviations above the average sentence length in the training data.

Parameter Value
Optimizer Adam
Optimizer Parameters default (Kingma and Ba, 2014)
Learning Rate 0.001
Batch Size 64
Decoder Dropout Rate 0.4
Weight Decay proportional to dropout rate
Maximum Sentence Length 56
Maximum Gradient Norm 1.5
Table 5: Experimental settings.

A.1 Architectures

Model Parameter Value
A embedding units 256
A vocabulary size 25643
R and S decoder layers 2
R and S decoder hidden units 256
S encoder hidden units 256
S encoder layers 1
S latent units 32
I context units 512
I and P flow steps 4
MoG mixture components 100
VampPrior pseudo inputs 100
Table 6: Architecture parameters: all (A), RnnLM (R), SenVAE (S), specific to IAF (I) and/or planar (P) variants.

This section describes the components that parameterise our models. All models were implemented with the PyTorch library (Paszke et al., 2017), using default modules for the recurrent networks, embedders and optimisers. We use mnemonic blocks to describe architectures. Table 6 lists hyperparameters for the models discussed in what follows.


RnnLM

At each step, an RnnLM parameterises a categorical distribution over the vocabulary, i.e. X_i | x_{<i} ∼ Cat(f(x_{<i}; θ)), where f is the function computed by the network. We employ an embedding layer, one (or more) GRU cell(s) (the number of layers is a hyperparameter of the model), and an affine output layer followed by a softmax to map from the dimensionality of the GRU to the vocabulary size.

Gaussian SenVAE

A Gaussian SenVAE also parameterises a categorical distribution over the vocabulary for each given prefix, but, in addition, it conditions on a latent embedding z, i.e. X_i | z, x_{<i} ∼ Cat(f(z, x_{<i}; θ)), where Z is drawn from a standard Gaussian prior. Compared to RnnLM, we modify f only slightly by initialising the GRU cell(s) with a hidden state computed as a learnt transformation of z. Because the marginal likelihood of the Gaussian SenVAE is intractable, we train it via variational inference using an inference model q(z|x) = N(z | u, diag(s ⊙ s)), where u and s are predicted by a recurrent encoder. Note that we reuse the embedding layer from the generative model. Finally, a sample is obtained via z = u + s ⊙ ε, where ε ∼ N(0, I).
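The sampling path above is the standard reparameterisation trick, which keeps sampling differentiable with respect to u and s. A minimal sketch (our own, in NumPy rather than the paper's PyTorch; the dimensionality and scale values are arbitrary):

```python
import numpy as np

def reparameterised_sample(u, s, rng):
    """z = u + s * eps with eps ~ N(0, I): sampling stays differentiable in (u, s)."""
    eps = rng.standard_normal(u.shape)
    return u + s * eps

rng = np.random.default_rng(42)
u = np.zeros(32)           # posterior mean (toy values)
s = np.full(32, 2.0)       # posterior standard deviation (toy values)
zs = np.stack([reparameterised_sample(u, s, rng) for _ in range(5000)])
# the empirical moments of the samples should approach (u, s)
```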

IAF SenVAE

Unlike the Gaussian case, an IAF (Kingma et al., 2016) does not parameterise a distribution directly, but rather a sampling procedure in which we transform a D-dimensional sample from a base distribution (here a Gaussian) via an invertible and differentiable transformation. Here we show the design of an IAF where we employ MADE layers:

z^(t) = μ^(t) + σ^(t) ⊙ z^(t−1), with [μ^(t), σ^(t)] = MADE(z^(t−1), c; θ_t), for t = 1, …, T.

The context vector c represents the complete input sequence and allows each step of the flow to condition on x. Note that while z^(0) is actually Gaussian-distributed, i.e. z^(0) ∼ N(u, diag(s ⊙ s)), the distribution of each z^(t) for t ≥ 1 is potentially increasingly more complex. A sample from the approximate posterior is the output of the flow at step T, i.e. z = z^(T), whose log-density is

log q(z|x) = log N(z^(0) | u, diag(s ⊙ s)) − ∑_{t=1}^{T} ∑_{d=1}^{D} log σ_d^(t).

See Appendix C for more on NFs.


MADE

We denote by MADE(z, c; θ) a masked dense layer (Germain et al., 2015) with inputs z and context c, which is autoregressive on z, i.e. the d-th output depends only on z_{<d} (and on c). This is achieved by masking the weight matrices acting on z: one is a lower-triangular weight matrix with non-zero diagonal elements and the other a strictly lower-triangular weight matrix (with zeros on and above the diagonal). These matrices, together with the context weights and biases, constitute the parameters θ of the MADE.
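The masking idea can be sketched compactly. The example below is our own simplification: a single strictly lower-triangular mask on the z-input plus a dense context transform, rather than the exact two-matrix parameterisation used in the model; all shapes and values are arbitrary:

```python
import numpy as np

def masked_dense(z, c, W, V, b, mask):
    """Autoregressive dense layer: output d sees only z_{<d}, plus the context c."""
    return (W * mask) @ z + V @ c + b

D, C = 6, 4
rng = np.random.default_rng(0)
mask = np.tril(np.ones((D, D)), k=-1)   # strictly lower-triangular mask on z
W = rng.normal(size=(D, D))
V = rng.normal(size=(D, C))
b = rng.normal(size=D)
z, c = rng.normal(size=D), rng.normal(size=C)

out = masked_dense(z, c, W, V, b, mask)
z_perturbed = z.copy()
z_perturbed[3] += 10.0                   # change dimension 3 only
out_perturbed = masked_dense(z_perturbed, c, W, V, b, mask)
# outputs 0..3 are untouched; only later outputs may change
```

Perturbing z_3 leaves outputs 0 through 3 unchanged, which is exactly the autoregressive property the masks enforce.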

Planar SenVAE

A planar flow has a more compact parameterisation than an IAF, but is based on the same principle, namely, we parameterise a sampling procedure by an invertible and differentiable transformation of a fixed random source (a standard Gaussian in this case):

z^(t) = z^(t−1) + u_t h(w_tᵀ z^(t−1) + b_t),

where a sample from the approximate posterior is the output of the flow at step T, i.e. z = z^(T), with log-density

log q(z|x) = log N(z^(0) | u, diag(s ⊙ s)) − ∑_{t=1}^{T} log |1 + u_tᵀ ψ_t(z^(t−1))|,

and ψ_t(z) = h′(w_tᵀ z + b_t) w_t (Rezende and Mohamed, 2015).

In line with the work of van den Berg et al. (2018), we amortise all parameters of the flow in addition to the parameters of the base distribution.
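A single planar step and its log-determinant can be sketched and verified numerically. The sketch below is our own NumPy illustration (parameter values are arbitrary; a small u keeps the step invertible):

```python
import numpy as np

def planar_step(z, u, w, b):
    """f(z) = z + u * tanh(w.z + b); returns f(z) and log|det df/dz|."""
    a = w @ z + b
    psi = (1.0 - np.tanh(a) ** 2) * w            # h'(a) * w
    log_det = np.log(np.abs(1.0 + u @ psi))
    return z + u * np.tanh(a), log_det

rng = np.random.default_rng(1)
D = 3
u = 0.1 * rng.normal(size=D)                     # small u keeps the step invertible
w, b = rng.normal(size=D), 0.3
z = rng.normal(size=D)
f, log_det = planar_step(z, u, w, b)

# central-difference Jacobian as an independent check of the determinant
h = 1e-5
J = np.zeros((D, D))
for j in range(D):
    zp, zm = z.copy(), z.copy()
    zp[j] += h
    zm[j] -= h
    J[:, j] = (planar_step(zp, u, w, b)[0] - planar_step(zm, u, w, b)[0]) / (2 * h)
num_log_det = np.log(np.abs(np.linalg.det(J)))
```

The analytic log |1 + uᵀψ(z)| agrees with the log-determinant of the finite-difference Jacobian, a useful unit test when implementing flows.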

MoG prior

We parameterise K diagonal Gaussians, which are mixed uniformly. To do so we need K location vectors, each in R^D, and K scale vectors, each in R^D with strictly positive components. To ensure strict positivity, the scales are obtained by transforming unconstrained parameters with a positive activation such as softplus. The set of generative parameters is therefore extended with the K locations and K unconstrained scale vectors, each in R^D.
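The resulting uniform-mixture log-density can be sketched with a numerically stable logsumexp. This is our own illustration; the softplus link for the scales is one common choice and an assumption here:

```python
import numpy as np

def log_mog_prior(z, locs, scales):
    """Uniform mixture of K diagonal Gaussians: logsumexp over components - log K."""
    comp = (-0.5 * ((z - locs) / scales) ** 2
            - np.log(scales)
            - 0.5 * np.log(2 * np.pi)).sum(axis=1)      # (K,) component log-densities
    m = comp.max()
    return m + np.log(np.exp(comp - m).sum()) - np.log(len(comp))

rng = np.random.default_rng(0)
K, D = 100, 32
locs = rng.normal(size=(K, D))
scales = np.log1p(np.exp(rng.normal(size=(K, D))))      # softplus keeps scales positive
z = rng.normal(size=D)
lp = log_mog_prior(z, locs, scales)

# sanity check: K = 1 with zero mean and unit scale is a standard Gaussian
single = log_mog_prior(z, np.zeros((1, D)), np.ones((1, D)))
expected_single = -0.5 * (z ** 2).sum() - 0.5 * D * np.log(2 * np.pi)
```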


VampPrior

For this we estimate N sequences of input vectors, each sequence corresponding to a pseudo-input. This means we extend the set of generative parameters with the pseudo-input embedding sequences. For each pseudo-input, we sample its length at the beginning of training and keep it fixed. Specifically, we drew samples from a normal distribution with the dataset sentence-length mean and variance, which we rounded to the nearest integer.

A.2 Bayesian Optimisation

Parameter Value
Objective Function Validation NLL
Kernel Matern
Acquisition Function Expected Improvement
Parameter Inference MCMC
MCMC Samples 10
Leapfrog Steps 20
Burnin Samples 100
Table 7: Bayesian optimisation settings.

Bayesian optimisation (BO) is an efficient method to approximately search for global optima of a (typically expensive to compute) objective function f(h), where h is a vector containing the values of hyperparameters that may influence the outcome of the function (Snoek et al., 2012). Hence, it forms an alternative to grid search or random search (Bergstra and Bengio, 2012) for tuning the hyperparameters of a machine learning algorithm. BO works by assuming that our observations f(h_i) (for i = 1, …, n) are drawn from a Gaussian process (GP; Rasmussen and Williams, 2005). Then, based on the GP posterior, we can design and infer an acquisition function. This acquisition function can be used to determine where to "look next" in parameter space, i.e. it can be used to draw a candidate h for which we then evaluate the objective function f(h). This procedure iterates until a set of optimal parameters is found with some level of confidence.

In practice, the efficiency of BO hinges on multiple choices, such as the specific form of the acquisition function, the covariance matrix (or kernel) of the GP and how the parameters of the acquisition function are estimated. Our objective function is the (importance-sampled) validation NLL, which can only be computed after a model converges (via gradient-based optimisation of the ELBO). We follow the advice of Snoek et al. (2012) and use MCMC for estimating the parameters of the acquisition function. This reduced the number of objective function evaluations, speeding up the overall search. Other settings were also based on results by Snoek et al. (2012), and we refer the interested reader to that paper for more information about BO in general. A summary of all relevant settings of BO can be found in Table 7. We used the GPyOpt library (The GPyOpt authors, 2016) to implement this procedure.
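As an illustration of the acquisition step, the expected-improvement criterion from Table 7 has a closed form for a Gaussian posterior predictive. The sketch below is our own; the candidate means and standard deviations are made-up numbers standing in for a GP posterior over validation NLL:

```python
import math

def expected_improvement(mu, sigma, best, xi=0.0):
    """EI for minimisation: E[max(best - f - xi, 0)] with f ~ N(mu, sigma^2)."""
    if sigma <= 0.0:
        return max(best - mu - xi, 0.0)
    gamma = (best - mu - xi) / sigma
    Phi = 0.5 * (1.0 + math.erf(gamma / math.sqrt(2.0)))     # standard normal CDF
    phi = math.exp(-0.5 * gamma * gamma) / math.sqrt(2.0 * math.pi)
    return (best - mu - xi) * Phi + sigma * phi

# (mu, sigma) of the GP posterior at three hypothetical hyperparameter settings
candidates = [(100.2, 0.1),   # confident but mediocre
              (101.0, 5.0),   # worse mean, very uncertain
              (99.9, 0.5)]    # slightly better mean, fairly certain
best_so_far = 100.0           # best validation NLL observed so far
scores = [expected_improvement(mu, s, best_so_far) for mu, s in candidates]
next_idx = max(range(len(scores)), key=scores.__getitem__)
```

Note how the high-uncertainty candidate wins despite a worse mean: this exploration behaviour is what makes BO more sample-efficient than grid or random search.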

Appendix B Relation between optimisation techniques

It is insightful to compare the various techniques we surveyed to the technique we propose in terms of the quantities involved in their optimisation. To avoid clutter, let us assume a single data point x, and denote the distortion by D and the rate by R.

The losses minimised by the β-VAE, annealing and SFB all have the form

L(θ, φ) = D + β R,

where β is a weighting factor. FB minimises the loss

L(θ, φ) = D + max(r, R),

where r is the target rate. Last, with respect to θ and φ, MDR minimises the loss

L(θ, φ, λ) = D + R + λ (r − R),

where λ is the Lagrangian multiplier. And with respect to λ, it minimises

−λ (r − R),

subject to λ ≥ 0.
Since we aim to minimise these losses as a function of the parameters with stochastic gradient descent, it makes sense to evaluate how these methods influence optimisation by checking their gradients. First, FB has the following gradients w.r.t. its parameters:

∇L = ∇D + 1[R > r] ∇R,

which shows the discontinuity in the gradients as a result of this objective, i.e. there is a sudden 'jump' from zero to a large gradient w.r.t. the KL when the KL rises above the target rate r. The β-VAE, KL annealing, and SFB have a gradient that does not suffer such discontinuities:

∇L = ∇D + β ∇R,

where one can see that the magnitude of the gradient w.r.t. the KL is influenced by the value of β at that point in the optimisation. Last, observe the gradient of the MDR objective:

∇L = ∇D + (1 − λ) ∇R,

thus, essentially, a β-weighted gradient with β = 1 − λ. Hence, MDR is another form of KL weighting, albeit one that allows specific rate targeting.
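These relationships are easy to verify numerically. The toy check below (our own) treats D and R as scalars and differentiates each loss with respect to R by finite differences:

```python
def beta_loss(D, R, beta):       # beta-VAE / annealing / SFB family
    return D + beta * R

def fb_loss(D, R, r):            # free bits: rate term clipped at the target r
    return D + max(r, R)

def mdr_loss(D, R, r, lam):      # Lagrangian with a minimum-rate constraint
    return D + R + lam * (r - R)

def grad_wrt_R(loss, R, h=1e-6):
    """Finite-difference derivative of a loss w.r.t. the rate R."""
    return (loss(R + h) - loss(R - h)) / (2 * h)

D, r, lam = 10.0, 5.0, 0.7
g_mdr = grad_wrt_R(lambda R: mdr_loss(D, R, r, lam), R=3.0)
g_beta = grad_wrt_R(lambda R: beta_loss(D, R, beta=1.0 - lam), R=3.0)
g_fb_below = grad_wrt_R(lambda R: fb_loss(D, R, r), R=3.0)  # rate below target
g_fb_above = grad_wrt_R(lambda R: fb_loss(D, R, r), R=7.0)  # rate above target
```

The MDR gradient matches β-weighting with β = 1 − λ, and FB's gradient with respect to the rate jumps from 0 to 1 at the target.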

Compared to the β-VAE, MDR has the advantage that β is not fixed, but estimated to meet the requirements on rate. This might mitigate the problem noticed by He et al. (2019) that the β-VAE can lead to under-regularisation at the end of training. Similar to their technique, MDR can cut the inference network more 'slack' at the start of training, but enforce stricter regularisation at the end, once the constraint is met. We observe that this happens in practice. Furthermore, we would argue that tuning towards a specific rate is more interpretable than tuning β.

A similar argument can be made against KL annealing. Although β is not fixed in this scheme, it requires multiple decisions that are not very interpretable, such as the length (number of steps) and type (e.g. linear or exponential) of the schedule.

Most similar, then, is SFB. Like MDR, it flexibly updates β by targeting a rate. However, differences between the two techniques become apparent when we observe how β is updated. SFB adjusts β with an update rule that involves three additional hyperparameters. MDR, in contrast (not taking optimiser-specific dynamics into account), updates the multiplier as

λ ← λ + η (r − R),

where η is a learning rate. From this, we can draw the conclusion that MDR is akin to SFB without any extra hyperparameters. It also gives some insight into suitable hyperparameters for SFB: for a particular setting of its hyperparameters (always incrementing β by a fixed multiple of the rate violation), SFB is essentially equal to performing Lagrangian relaxation on the ELBO with a constraint on the minimum rate.
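The MDR multiplier update is simple enough to simulate directly. In the sketch below (our own; the rate trajectory is a made-up stand-in for training), λ grows while the rate is below target and shrinks once the constraint is met, with a projection keeping λ non-negative:

```python
def update_lambda(lam, R, target, eta):
    """One MDR multiplier step, projected to keep lambda non-negative."""
    return max(0.0, lam + eta * (target - R))

# hypothetical rates observed during training: collapsed early, recovering later
rates = [0.1, 0.5, 1.0, 2.0, 4.0, 5.0, 6.0, 6.5]
lam, trace = 0.0, []
for R in rates:
    lam = update_lambda(lam, R, target=5.0, eta=0.1)
    trace.append(lam)
# pressure builds while R < target and relaxes once the constraint is met
```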

All in all, this analysis shows that there is a clear relation between several of the optimisation techniques compared in this paper. MDR seems to be the most flexible, whilst requiring the least amount of hyperparameter tuning or heuristics.

Appendix C Normalising flows

This section reviews a general class of reparameterisable distributions known as normalising flows (NFs; Tabak et al., 2010). NFs express the density of a transformed variable y = f(x) in terms of the density of a base variable x using the change of density rule:

p_Y(y) = p_X(x) |det J_f(x)|⁻¹,

or conversely, by application of the inverse function theorem,

p_X(x) = p_Y(y) |det J_f(x)|,

where x and y are D-dimensional and f is a differentiable and invertible transformation with Jacobian J_f(x) = ∂f/∂x. The change of densities rule can be used to map a sample from a complex distribution to a sample from a simple distribution, or the other way around, and it relates their densities analytically. For efficiency, it is crucial that the determinant of J_f be simple, e.g. assessed in time O(D). NFs parameterise f (or its inverse) with neural networks, where either f, the network, or both are carefully designed to comply with the aforementioned conditions.

NFs can be used where the input to the flow is a sample from a simple fixed distribution, such as a uniform or standard Gaussian, and the output is a sample from a much more complex distribution. This leads to very expressive approximate posteriors for amortised variational inference. A general strategy for designing tractable flows is to design simple transformations, each of which meets our requirements, and compose enough of them, exploiting the fact that a composition of invertible functions remains invertible. In fact, where the base distribution is a standard Gaussian and the transformation is affine with strictly positive slope (an invertible and differentiable function), the resulting distribution is a parameterised Gaussian, showing that Gaussians can be seen as a particularly simple normalising flow. NFs can also be used where the input to the flow is a data point and the output is a sample from a simple distribution; this leads to very expressive density estimators for continuous observations. Differentiability and invertibility constraints preclude direct use of NFs to model discrete distributions.
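The Gaussian-as-a-flow observation makes a convenient correctness check for the change of density rule: pushing a standard Gaussian through an affine map y = a x + b and applying the rule must reproduce the N(b, a²) density exactly. A small self-contained check (our own):

```python
import math

def affine_flow_logdensity(y, a, b):
    """Density of y = a*x + b, x ~ N(0,1), via the change of density rule:
    log p_Y(y) = log p_X(f^{-1}(y)) - log|a|."""
    x = (y - b) / a
    log_px = -0.5 * x * x - 0.5 * math.log(2.0 * math.pi)
    return log_px - math.log(abs(a))

def gaussian_logdensity(y, mu, sigma):
    return (-0.5 * ((y - mu) / sigma) ** 2
            - math.log(sigma) - 0.5 * math.log(2.0 * math.pi))

# the flow density must coincide with the N(b, a^2) density
lhs = affine_flow_logdensity(1.7, a=2.0, b=-0.5)
rhs = gaussian_logdensity(1.7, mu=-0.5, sigma=2.0)
```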

Normalising flows have been introduced in the context of variational inference (Rezende and Mohamed, 2015) and density estimation (Rippel and Adams, 2013). Various transformations have been designed, all aiming at increasing expressiveness with manageable computation (Kingma et al., 2016; Dinh et al., 2017; Papamakarios et al., 2017; Huang et al., 2018b).

C.1 Inverse autoregressive flows

In an IAF (Kingma et al., 2016), z = f(ε; x), where

z_d = μ_d(ε_{<d}; x) + σ_d(ε_{<d}; x) ε_d

is a differentiable transformation whose inverse

ε_d = (z_d − μ_d(ε_{<d}; x)) / σ_d(ε_{<d}; x)

is autoregressive (the d-th output depends on the previously computed ε_{<d}). The parameters of the transformation, i.e. μ and σ, are computed by neural networks, and note that in the forward direction we can compute all transformations in parallel using a MADE (Germain et al., 2015). Moreover, the Jacobian is lower-triangular and thus has a simple determinant (the product of its diagonal elements). To see that, let us compute the entries of the Jacobian of f. Below the main diagonal, i.e. for i > j, we have:

∂z_i/∂ε_j = ∂μ_i/∂ε_j + ε_i ∂σ_i/∂ε_j.

On the main diagonal, i.e. for i = j, we have

∂z_i/∂ε_i = σ_i.

And finally, above the main diagonal, i.e. for i < j, the partial derivative is zero. The Jacobian matrix is therefore lower triangular with the i-th element of its diagonal equal to σ_i, which leads to efficient determinant computation:

det J_f(ε) = ∏_{i=1}^{D} σ_i,

and from the inverse function theorem it holds that

det J_{f⁻¹}(z) = (∏_{i=1}^{D} σ_i)⁻¹.
Therefore, where ε is sampled from a simple random source (e.g. a Gaussian), we can assess the log-density of z = f(ε; x) via:

log q(z|x) = log p(ε) − ∑_{i=1}^{D} log σ_i.
Naturally, composing such transformations leads to more complex distributions.
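A single IAF step and its log-determinant can be checked numerically. In this sketch (our own), μ and σ are toy autoregressive functions built from a strictly lower-triangular matrix, standing in for a MADE:

```python
import numpy as np

def iaf_step(eps, mu_fn, sigma_fn):
    """z_d = mu_d(eps_{<d}) + sigma_d(eps_{<d}) * eps_d, all d in parallel;
    log|det dz/deps| = sum_d log sigma_d."""
    mu, sigma = mu_fn(eps), sigma_fn(eps)
    return mu + sigma * eps, np.log(sigma).sum()

D = 4
rng = np.random.default_rng(7)
W = np.tril(rng.normal(size=(D, D)), k=-1)   # strictly lower-triangular: autoregressive
mu_fn = lambda e: W @ e
sigma_fn = lambda e: np.exp(0.1 * (W @ e))   # positive scales

eps = rng.normal(size=D)
z, log_det = iaf_step(eps, mu_fn, sigma_fn)

# central-difference Jacobian: should be lower-triangular with det = prod(sigma)
h = 1e-5
J = np.zeros((D, D))
for j in range(D):
    ep, em = eps.copy(), eps.copy()
    ep[j] += h
    em[j] -= h
    J[:, j] = (iaf_step(ep, mu_fn, sigma_fn)[0]
               - iaf_step(em, mu_fn, sigma_fn)[0]) / (2 * h)
num_log_det = np.log(abs(np.linalg.det(J)))
```

The numerical Jacobian is lower-triangular with the σ_i on its diagonal, so its log-determinant matches ∑_i log σ_i, as derived above.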

C.2 Planar flows

A planar flow (Rezende and Mohamed, 2015) is based on a transformation f, where

f(z) = z + u h(wᵀz + b),

where u ∈ R^D, w ∈ R^D and b ∈ R are parameters of the flow, and h is a smooth elementwise non-linearity (we use tanh) with derivative h′. It can be shown that

|det J_f(z)| = |1 + uᵀψ(z)|,

where ψ(z) = h′(wᵀz + b) w. Then, where z^(0) is sampled from a simple random source (e.g. a Gaussian),