1 Introduction
Deep generative models (DGMs) are probabilistic latent variable models parameterised by neural networks (NNs). Specifically, DGMs optimised with amortised variational inference and reparameterised gradient estimates (Kingma and Welling, 2014; Rezende et al., 2014), better known as variational autoencoders (VAEs), have spurred much interest in various domains, including computer vision and natural language processing (NLP).
In NLP, VAEs have been developed for word representation (Rios et al., 2018), morphological analysis (Zhou and Neubig, 2017), syntactic and semantic parsing (Corro and Titov, 2018; Lyu and Titov, 2018), document modelling (Miao et al., 2016), summarisation (Miao and Blunsom, 2016), machine translation (Zhang et al., 2016), language and vision (Pu et al., 2016; Wang et al., 2017), dialogue modelling (Wen et al., 2017; Serban et al., 2017), speech modelling (Fraccaro et al., 2016), and, of course, language modelling (Bowman et al., 2016; Goyal et al., 2017a). One problem remains common to the majority of these models: VAEs often learn to ignore the latent variables.
We investigate this problem, dubbed by many posterior collapse, in the context of language modelling (LM). This is motivated by the fact that within NLP, DGMs attract a lot of attention from researchers in language generation, where systems usually employ an LM component.¹ [¹Albeit typically modified to condition on additional inputs, for example, a chat history in dialogue modelling.] In a deep generative LM (Bowman et al., 2016), sentences are generated conditioned on samples from a continuous latent space, an idea with various practical applications. For example, it gives NLP researchers an opportunity to shape this latent space and promote generalisations that are in line with linguistic knowledge and/or intuition (Xu and Durrett, 2018). It also allows for greater flexibility in how the model is used: for example, we can generate sentences that live—in latent space—in a neighbourhood of a given observation (Bowman et al., 2016). Deterministically trained language models, e.g. recurrent NN-based LMs (Mikolov et al., 2010), lack a latent space and are thus deprived of such explicit mechanisms.² [²It is possible to promote some of these properties in non-probabilistic autoencoding frameworks (Vincent et al., 2008; Tolstikhin et al., 2017).]

Figure 1: Homotopy: ancestral samples mapped from points along a linear interpolation of two given sentences as represented in latent space. The sentences do not seem to exhibit any coherent relation, showing that the model does not exploit neighbourhood in latent space to capture regularities in data space.
Despite this potential, VAEs that employ strong generators (e.g. recurrent NNs) tend to ignore the latent variable (Bowman et al., 2016; Zhang et al., 2016). Figure 1 illustrates this point with samples from a vanilla VAE LM: the model does not capture useful patterns in data space and behaves just like a standard recurrent LM. Various strategies to counter this problem have been independently proposed and tested, in particular within the computer vision and machine learning communities. One of our contributions is a review and comparison of such strategies, as well as a novel strategy based on constrained optimisation.
There have also been attempts at identifying the fundamental culprit behind posterior collapse (Chen et al., 2017; Alemi et al., 2018), leading to strategies based on changes to the generator, prior, and/or posterior. Following up on those insights, we improve inference for Bowman et al. (2016)'s VAE by employing a class of flexible approximate posteriors (Tabak et al., 2010; Rezende et al., 2014) and modify the model to employ strong priors.
Finally, we compare models and techniques intrinsically in terms of perplexity as well as bounds on mutual information between latent variable and observations. Our findings support a number of recommendations on how to effectively train a deep generative language model.
2 Density Estimation for Text
Density estimation for written text has a long history (Jelinek, 1980; Goodman, 2001), but in this work we concentrate on neural network models (Bengio et al., 2003), in particular, autoregressive ones (Mikolov et al., 2010). Following common practice, we model sentences independently, each a sequence of tokens.
2.1 Language models
A language model (LM) prescribes the generation of a sentence as a sequence of categorical draws parameterised in context, i.e.
(1)  $P(x \mid \theta) = \prod_{i=1}^{|x|} P(x_i \mid x_{<i}, \theta)$
To condition on all of the available context, a fixed NN maps from a prefix sequence (denoted $x_{<i}$) to the parameters of a categorical distribution over the vocabulary. Given a dataset of i.i.d. observations, we estimate the parameters of the model by searching for a local optimum of the log-likelihood function via stochastic gradient-based optimisation (Robbins and Monro, 1951; Bottou and Cun, 2004), where the expectation is taken w.r.t. the true data distribution and approximated with training samples. Throughout, we refer to this model as RnnLM.
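The factorisation in Eq. (1) can be illustrated with a few lines of Python; the bigram table below is a hypothetical stand-in for the neural network that maps a prefix to categorical parameters (all tokens and probabilities are invented for illustration):

```python
import math

# Hypothetical bigram table standing in for the NN that maps a prefix x_{<i}
# to the parameters of a categorical distribution over the vocabulary.
BIGRAM = {
    ("<s>", "the"): 0.6, ("<s>", "a"): 0.4,
    ("the", "cat"): 0.5, ("the", "dog"): 0.5,
    ("a", "cat"): 0.3, ("a", "dog"): 0.7,
    ("cat", "</s>"): 1.0, ("dog", "</s>"): 1.0,
}

def log_likelihood(sentence):
    """log P(x) = sum_i log P(x_i | x_{<i}), with boundary tokens added."""
    tokens = ["<s>"] + sentence + ["</s>"]
    return sum(math.log(BIGRAM[(prev, cur)])
               for prev, cur in zip(tokens, tokens[1:]))

ll = log_likelihood(["the", "cat"])  # = log(0.6) + log(0.5) + log(1.0)
```

Note that this sketch is Markovian only to stay self-contained; a real RnnLM conditions on the entire prefix through its recurrent state.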
2.2 Deep generative language models
Bowman et al. (2016) model observations as draws from the marginal of a DGM. An NN maps from a latent sentence embedding $z$ to a distribution over sentences,

(2)  $P(x \mid z, \theta) = \prod_{i=1}^{|x|} P(x_i \mid x_{<i}, z, \theta)$

where $z$ follows a standard Gaussian prior, i.e. $Z \sim \mathcal{N}(0, I)$.³ [³We use uppercase $P$ for probability mass functions and lowercase $p$ for probability density functions.]
Generation still happens one word at a time without Markov assumptions, but now conditions on $z$ in addition to the observed prefix. The conditional $P(x \mid z, \theta)$ is commonly referred to as generator or decoder. The quantity $P(x \mid \theta) = \int p(z)\, P(x \mid z, \theta)\, \mathrm{d}z$ is the marginal likelihood, essential for parameter estimation. This model is trained to assign high (marginal) probability to observations, much like standard LMs. However, unlike standard LMs, it employs a latent space which can accommodate a low-dimensional manifold where discrete sentences are mapped to—via posterior inference $p(z \mid x, \theta)$—and from—via generation $P(x \mid z, \theta)$. This gives the model an explicit mechanism to exploit neighbourhood and smoothness in latent space to capture regularities in data space. For example, it may group sentences according to certain latent factors (e.g. lexical choices, syntactic complexity, lexical semantics, etc.). It also gives users a mechanism to steer generation towards a certain purpose, for example, one may be interested in generating sentences that are mapped from the neighbourhood of another in latent space. Interest in this property grows to the extent that the embedding space captures appreciable regularities.
Approximate inference
Marginal inference for this model is intractable and calls for approximate methods, in particular, variational inference (VI; Jordan et al., 1999), whereby an auxiliary and independently parameterised model $q(z \mid x, \lambda)$ approximates the true posterior $p(z \mid x, \theta)$. Where this inference model is itself parameterised by a neural network, we have a case of amortised inference (Kingma and Welling, 2014; Rezende et al., 2014) and an instance of what is known as a VAE. Bowman et al. (2016) approach posterior inference with a Gaussian model
(3)  $q(z \mid x, \lambda) = \mathcal{N}(z \mid \mathbf{u}, \operatorname{diag}(\mathbf{s} \odot \mathbf{s}))$

whose parameters, i.e. a location vector $\mathbf{u}$ and a scale vector $\mathbf{s}$, are predicted by a neural network architecture from an encoding of the complete observation $x$.⁴ [⁴We use boldface for deterministic vectors and $\odot$ for elementwise multiplication.] In this work, we use a bidirectional recurrent encoder (see Appendix A.1 for the complete design). Throughout the text we will refer to this model as SenVAE.

Parameter estimation
We can jointly estimate the parameters of both models (i.e. generative and inference) by locally maximising a lower bound on the log-likelihood function (the ELBO)

(4)  $\mathcal{E}(\lambda, \theta) = \mathbb{E}_{q(z \mid x, \lambda)}\left[\log P(x \mid z, \theta)\right] - \mathrm{KL}\left(q(z \mid x, \lambda) \,\|\, p(z)\right)$
via gradient-based optimisation. For as long as we can reparameterise latent samples using a fixed random source, automatic differentiation (Baydin et al., 2018) can be used to obtain unbiased gradient estimates of the ELBO (Kingma and Welling, 2014; Rezende et al., 2014). In §5 we discuss a general class of reparameterisable distributions of which the Gaussian distribution is a special case.
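As an illustration, the reparameterisation and the analytic KL term of the diagonal-Gaussian case can be sketched as follows; the decoder's log-likelihood is a placeholder value, since the full model is not shown:

```python
import math, random

def gaussian_kl(mu, sigma):
    """Analytic KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over dimensions."""
    return sum(0.5 * (s * s + m * m - 1.0 - math.log(s * s))
               for m, s in zip(mu, sigma))

def reparameterise(mu, sigma, rng):
    """z = mu + sigma * eps with eps ~ N(0, I): the sample is a deterministic,
    differentiable function of (mu, sigma), so gradients can flow through it."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

rng = random.Random(0)
mu, sigma = [0.5, -1.0], [1.0, 0.5]
z = reparameterise(mu, sigma, rng)

log_px_given_z = -10.0  # placeholder for the decoder's log P(x | z)
elbo_estimate = log_px_given_z - gaussian_kl(mu, sigma)  # single-sample ELBO
```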
3 Posterior Collapse and the Strong Generator Problem
In VI, we make inferences using an approximation $q(z \mid x, \lambda)$ to the true posterior and choose $\lambda$ so as to minimise the KL divergence $\mathrm{KL}(q(z \mid x, \lambda) \,\|\, p(z \mid x, \theta))$. The same principle yields a lower bound on the log-likelihood used to estimate $\theta$ jointly with $\lambda$, thus making the true posterior a moving target. If the estimated conditional $P(x \mid z, \theta)$ can be made independent of $z$, which in our case means relying exclusively on the observed prefix to predict the distribution of the next word, the true posterior will be independent of the data and equal to the prior.⁵ [⁵This follows trivially from the definition of the posterior: $p(z \mid x, \theta) = p(z)\, P(x \mid z, \theta) / P(x \mid \theta)$.] Based on this observation, Chen et al. (2017) argue that information that can be modelled by the generator without using latent variables will be modelled that way—precisely because when no information is encoded in the latent variable the true posterior equals the prior and it is then trivial to reduce the KL term to zero. This is typically diagnosed by noting that after training $\mathrm{KL}(q(z \mid x, \lambda) \,\|\, p(z)) \approx 0$ for most $x$: we say that the approximate posterior collapses to the prior.
In fact, Alemi et al. (2018) show that the rate, $R = \mathbb{E}_x\left[\mathrm{KL}(q(z \mid x, \lambda) \,\|\, p(z))\right]$, is an upper bound on the mutual information (MI) between $X$ and $Z$. From the non-negativity of MI, it follows that whenever the rate is close to zero for most training instances, MI is either zero or negligible. Alemi et al. (2018) also show that the distortion, $D = -\mathbb{E}_x\left[\mathbb{E}_{q(z \mid x, \lambda)}\left[\log P(x \mid z, \theta)\right]\right]$, relates to a lower bound on MI (the lower bound being $H - D$, where $H$ is the unknown but constant data entropy). Due to this relationship to MI, they argue that reporting $R$ and $D$ along with log-likelihood on held-out data offers better insights about a trained VAE, advice we follow in §6.
A generator that makes no Markov assumptions, such as a recurrent LM, can potentially model the data well at zero rate, and indeed many have noticed that VAEs whose observation models are parameterised by such strong generators (or strong decoders) learn to ignore the latent representation (Bowman et al., 2016; Higgins et al., 2017; Sønderby et al., 2016; Zhao et al., 2018b). For this reason, one strategy to prevent posterior collapse is to weaken the decoder (Yang et al., 2017; Park et al., 2018). While in many cases there are good reasons for changing the model's factorisation, in this work we are interested in employing a strong generator, thus we will not investigate weaker decoders. Alternative solutions typically involve changes to the optimisation procedure and/or manipulations of the objective. The former aim at finding local optima of the ELBO with non-negligible MI; the latter seek alternatives to the ELBO that target MI more directly.
Annealing
Bowman et al. (2016) propose "KL annealing", whereby the KL term in the ELBO is incorporated into the objective in gradual steps. This way, early on in optimisation the optimiser can focus on reducing distortion, potentially by increasing the MI between $X$ and $Z$. They also propose to drop words from the prefix uniformly at random to somewhat weaken the decoder and promote an increase in MI—the intuition is that the model would have to rely on $z$ to compensate for the missing history. As we do not want to compromise the decoder, we propose a slight modification of this technique whereby we slowly vary the word dropout rate from $1$ to $0$, instead of selecting a fixed value. In a sense, we anneal the decoder from a weak generator to a strong generator.
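The two schedules can be sketched as simple functions of the training step; linear schedules are shown here, while the actual schedule type and length are hyperparameters:

```python
def kl_weight(step, total_steps):
    """Linear KL annealing: the weight on the KL term grows from 0 to 1,
    letting early optimisation focus on reducing distortion."""
    return min(1.0, step / total_steps)

def word_dropout_rate(step, total_steps):
    """Annealed word dropout: the dropout rate decays from 1 (weak decoder,
    forced reliance on z) to 0 (full-strength decoder)."""
    return max(0.0, 1.0 - step / total_steps)
```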
Targeting rates
Another idea is to target a prespecified positive rate $r$ (Alemi et al., 2018). Kingma et al. (2016) replace the KL term in the ELBO with $\max(r, \mathrm{KL}(q(z \mid x, \lambda) \,\|\, p(z)))$, dubbed free bits (FB) because it allows encoding the first $r$ nats of information "for free". For as long as the KL is below $r$, we are not optimising a proper ELBO (it misses the KL term), and the $\max$ introduces a discontinuity in the gradient at $r$. Chen et al. (2017) propose soft free bits (SFB), which instead multiplies the KL term in the ELBO by a weighing factor $\beta$ that is dynamically adjusted based on the target rate $r$: $\beta$ is incremented (or reduced) whenever the rate is above (or below) the target. Note that this technique requires hyperparameters besides $r$ (a tolerance, an initial weight, and an update factor) to be tuned in order to determine how $\beta$ is updated.

Change of objective
If we accept there is a fundamental problem with the ELBO, we may seek alternative objectives and relate them to quantities of interest such as marginal likelihood and MI. A simple adaptation of the ELBO is weighing its KL term by a constant factor $\beta$ ($\beta$-VAE; Higgins et al., 2017). Although it was originally aimed at disentanglement of latent features with $\beta > 1$, setting $\beta < 1$ promotes larger rates and thus increased MI. Whilst a useful counter to posterior collapse, low $\beta$ might lead to variational posteriors becoming point estimates. The InfoVAE objective (Zhao et al., 2018b) mitigates this with an extra term on top of the $\beta$-VAE objective which minimises the divergence between the aggregated variational posterior $q(z) = \mathbb{E}_x\left[q(z \mid x, \lambda)\right]$ and the prior. In our experiments we compute this divergence with an unbiased estimate of the maximum mean discrepancy (MMD; Gretton et al., 2012).

4 Minimum desired rate
We propose minimum desired rate (MDR), a technique to attain ELBO values at a prespecified rate that does not suffer from the gradient discontinuities of FB, and does not introduce the additional hyperparameters of SFB. The idea is to optimise the ELBO subject to a minimum rate constraint $r$:

(5)  $\min_{\theta, \lambda} -\mathcal{E}(\lambda, \theta) \quad \text{subject to} \quad \mathrm{KL}(q(z \mid x, \lambda) \,\|\, p(z)) \geq r$
Because constrained optimisation is generally intractable, we optimise the Lagrangian (Boyd and Vandenberghe, 2004)

(6)  $\mathcal{L}(\theta, \lambda, u) = -\mathcal{E}(\lambda, \theta) - u \left(\mathrm{KL}(q(z \mid x, \lambda) \,\|\, p(z)) - r\right)$

where $u \geq 0$ is a Lagrangian multiplier. We define the dual function $g(u) = \min_{\theta, \lambda} \mathcal{L}(\theta, \lambda, u)$ and solve the dual problem $\max_{u \geq 0} g(u)$. Local minima of the resulting min-max objective can be found by performing stochastic gradient descent with respect to $\theta$ and $\lambda$ and stochastic gradient ascent with respect to $u$. Appendix B presents further theoretical remarks comparing $\beta$-VAE, annealing, FB, SFB and the proposed MDR. We show that MDR is a form of KL weighing, albeit one that targets a specific rate. It can be seen, for example, as $\beta$-VAE with $\beta = 1 - u$ (though note that $u$ is not fixed). Compared to annealing, we argue that a target rate is a far more interpretable hyperparameter than the length (number of steps) and type (e.g. linear or exponential) of an annealing schedule. Like SFB, MDR addresses FB's discontinuity in the gradients of the rate. Finally, we show that MDR is a form of SFB where the KL weight is set dynamically by the multiplier update, and is thus much simpler to tune.
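The three rate-targeting strategies can be contrasted in a short sketch; the update rules follow the descriptions above, while the step sizes and tolerances are illustrative values, not the ones used in our experiments:

```python
def free_bits_kl(kl, r):
    """FB: the KL term is replaced by max(r, kl).
    The gradient of this term is discontinuous at kl = r."""
    return max(r, kl)

def soft_free_bits_update(beta, kl, r, factor=1.05, tol=0.05):
    """SFB (sketch): multiplicatively adjust the KL weight beta so the
    rate is pushed towards the target r (tolerance and factor are tunable)."""
    if kl > (1 + tol) * r:
        return beta * factor
    if kl < (1 - tol) * r:
        return beta / factor
    return beta

def mdr_multiplier_update(u, kl, r, lr=0.01):
    """MDR: one step of gradient ascent on the Lagrange multiplier u for the
    constraint KL >= r, projected back to u >= 0.  The gradient of
    -u * (kl - r) with respect to u is (r - kl), so u grows while the
    rate is below target, relaxing the effective KL penalty."""
    return max(0.0, u + lr * (r - kl))
```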
5 Expressive Latent Components
The observation by Chen et al. (2017) suggests that estimating $\theta$ and $\lambda$ jointly leads to choosing a generative model such that its corresponding (true) posterior is simple and can be matched exactly. With a Gaussian prior and a complex observation model, unless the latent variable is ignored, the posterior is certainly not Gaussian and likely multimodal. In §5.1, we modify Bowman et al. (2016)'s inference network to parameterise an expressive posterior approximation in an attempt to reach better local optima.
The information-theoretic perspective of Alemi et al. (2018) suggests that the prior regularises the inference model, capping the MI between $X$ and $Z$. Their bounds also suggest that, for a fixed posterior approximation, the optimal prior is the aggregated posterior $\mathbb{E}_x\left[q(z \mid x, \lambda)\right]$, and, therefore, investigating the use of strong priors seems like a fruitful avenue for effective estimation of DGMs. In §5.2, we modify SenVAE's generative story to employ an expressive prior.
5.1 Expressive posterior
We improve inference for SenVAE using normalising flows (NFs; Rezende and Mohamed, 2015). An NF expresses the density of a transformed variable in terms of the density of a base variable using the change of density rule:
(7)  $q(z) = q(\epsilon) \left| \det \frac{\partial t(\epsilon)}{\partial \epsilon} \right|^{-1}$

where $z = t(\epsilon)$, both $\epsilon$ and $z$ are $d$-dimensional, and $t$ is a differentiable and invertible transformation with Jacobian matrix $\frac{\partial t(\epsilon)}{\partial \epsilon}$. For efficiency, it is crucial that the determinant of the Jacobian is simple, e.g. computable in $O(d)$ time. NFs parameterise $t$ (or its inverse) with neural networks, where either $t$, the network, or both are carefully designed to comply with the aforementioned conditions. A special case of an NF is when $\epsilon$ follows a standard Gaussian and $t$ is affine with strictly positive slope, which essentially makes $z$ a diagonal Gaussian.
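The change-of-density rule is easy to verify numerically in the affine special case just mentioned; the flow density should coincide with the closed-form Gaussian density:

```python
import math

def std_normal_logpdf(e):
    """Base density: log N(e; 0, 1)."""
    return -0.5 * (e * e + math.log(2.0 * math.pi))

def affine_flow_logpdf(z, mu, sigma):
    """log q(z) via the change-of-density rule for t(eps) = mu + sigma * eps:
    log q(z) = log q_base(eps) - log |dt/deps|, with eps = (z - mu) / sigma."""
    eps = (z - mu) / sigma
    return std_normal_logpdf(eps) - math.log(sigma)

def gaussian_logpdf(z, mu, sigma):
    """Closed-form log N(z; mu, sigma^2), for comparison."""
    return -0.5 * (((z - mu) / sigma) ** 2 + math.log(2.0 * math.pi)) - math.log(sigma)
```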
We design a posterior approximation based on an inverse autoregressive flow (IAF; Kingma et al., 2016), whereby we transform a base sample $\epsilon$ into a posterior sample by computing

(8)  $z = \mathbf{u}(\epsilon; \lambda) + \mathbf{s}(\epsilon; \lambda) \odot \epsilon$

via an affine transformation whose inverse is autoregressive. This is crucial to obtaining a Jacobian whose determinant is simple to compute (see Appendix C.1 for the derivation), i.e. $\log \left|\det \frac{\partial t(\epsilon)}{\partial \epsilon}\right| = \sum_{i=1}^{d} \log s_i$. For increased flexibility, we compose several such transformations, each parameterised by an independent MADE (Germain et al., 2015).⁶ [⁶A MADE is a dense layer whose weight matrix is masked to be strictly lower triangular; it realises autoregressive transformations between fixed-dimension representations in parallel (see Appendix A.1 for details).] We also investigate a more compact flow—in terms of number of parameters—known as a planar flow (PF; Rezende and Mohamed, 2015). The transformation in a PF is not autoregressive, but it is designed such that the determinant of its Jacobian is simple (see Appendix C.2).
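A toy one-step IAF, with fixed strictly lower-triangular weights standing in for a MADE, illustrates why the determinant is cheap: since $\mathbf{u}_i$ and $\mathbf{s}_i$ depend only on earlier coordinates, the Jacobian is lower triangular with diagonal $\mathbf{s}$ (weights here are arbitrary illustrative values):

```python
import math

# Strictly lower-triangular "MADE-style" weights for d = 3:
# output coordinate i depends only on input coordinates < i.
W_MU = [[0.0, 0.0, 0.0], [0.4, 0.0, 0.0], [-0.3, 0.2, 0.0]]
W_S = [[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.2, -0.1, 0.0]]

def iaf_step(z):
    """One IAF affine step: z'_i = mu_i(z_<i) + s_i(z_<i) * z_i.
    Because mu_i and s_i do not depend on z_i, the Jacobian is lower
    triangular with diagonal s, so log|det J| = sum_i log s_i(z_<i)."""
    mu = [sum(w * x for w, x in zip(row, z)) for row in W_MU]
    s = [math.exp(sum(w * x for w, x in zip(row, z))) for row in W_S]
    out = [m + si * zi for m, si, zi in zip(mu, s, z)]
    return out, sum(math.log(si) for si in s)

z_new, logdet = iaf_step([0.5, -1.0, 2.0])
```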
5.2 Expressive priors
Here we extend the prior to some more complex, ideally multimodal, parametric family and fit its parameters. A perhaps obvious choice is a uniform mixture of Gaussians (MoG), i.e.

(9)  $p(z; \theta) = \frac{1}{K} \sum_{k=1}^{K} \mathcal{N}(z \mid \mathbf{u}_k, \operatorname{diag}(\mathbf{s}_k \odot \mathbf{s}_k))$

where the Gaussian parameters $\{\mathbf{u}_k, \mathbf{s}_k\}_{k=1}^{K}$ are optimised along with other generative parameters.
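The MoG density of Eq. (9) amounts to a log-sum-exp over component densities; a small sketch with diagonal Gaussians and uniform mixture weights:

```python
import math

def gaussian_logpdf(z, mu, sigma):
    """log N(z; mu, diag(sigma^2)) for a diagonal Gaussian."""
    return sum(-0.5 * (((zi - mi) / si) ** 2 + math.log(2 * math.pi)) - math.log(si)
               for zi, mi, si in zip(z, mu, sigma))

def mog_logpdf(z, components):
    """Uniform mixture of K diagonal Gaussians:
    log p(z) = logsumexp_k log N(z; mu_k, sigma_k^2) - log K."""
    logs = [gaussian_logpdf(z, mu, sigma) for mu, sigma in components]
    m = max(logs)  # stabilise the log-sum-exp
    return m + math.log(sum(math.exp(l - m) for l in logs)) - math.log(len(components))
```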
A less obvious choice is a variational mixture of posteriors (VampPrior; Tomczak and Welling, 2017). This prior is motivated by the fact that, for a fixed posterior approximation, the prior that optimises the ELBO is the aggregated posterior $\mathbb{E}_x\left[q(z \mid x, \lambda)\right]$. Though we could obtain an empirical estimate of this quantity, this is an intensive computation to perform for every sampled $z$. Instead, Tomczak and Welling (2017) propose to use $K$ learned pseudo-inputs and design the prior

(10)  $p(z; \theta) = \frac{1}{K} \sum_{k=1}^{K} q(z \mid u_k, \lambda)$

where $u_k$ is the $k$th such pseudo-input—in their case a continuous deterministic vector. Again the parameters of the prior, i.e. the pseudo-inputs, are optimised along with other generative parameters.
Applying this technique to our deep generative LM poses additional challenges, as our inference model conditions on a sequence of discrete observations. We adapt the technique by point-estimating a sequence of word embeddings, which makes up a pseudo-input. That is, a pseudo-input is a sequence of vectors, each with the dimensionality of our word embeddings, whose length is fixed at the beginning of training. See Appendix A.1 for remarks about both priors.
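A sketch of the VampPrior idea, with a deliberately simplistic stand-in for the inference network (the real encoder is a bidirectional recurrent NN and the pseudo-inputs are learned embedding sequences; `toy_encoder` and its mapping are invented for illustration):

```python
import math

def toy_encoder(pseudo_input):
    """Hypothetical stand-in for the inference network: maps a pseudo-input
    (here a flat list of embedding values) to Gaussian posterior parameters."""
    mu = sum(pseudo_input) / len(pseudo_input)
    sigma = 1.0 + 0.1 * abs(pseudo_input[0])
    return [mu], [sigma]

def vamp_prior_logpdf(z, pseudo_inputs):
    """VampPrior: p(z) = (1/K) * sum_k q(z | u_k), a uniform mixture of the
    variational posteriors evaluated at K learned pseudo-inputs u_k."""
    logs = []
    for u in pseudo_inputs:
        mu, sigma = toy_encoder(u)
        logs.append(sum(-0.5 * (((zi - m) / s) ** 2 + math.log(2 * math.pi)) - math.log(s)
                        for zi, m, s in zip(z, mu, sigma)))
    m = max(logs)  # stabilise the log-sum-exp
    return m + math.log(sum(math.exp(l - m) for l in logs)) - math.log(len(pseudo_inputs))
```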
5.3 KL term
Be it due to an expressive posterior or an expressive prior (or both), we lose analytical access to the KL term in the ELBO. That is, however, not a problem, since we can MC-estimate the term using samples $z^{(1)}, \ldots, z^{(S)} \sim q(z \mid x, \lambda)$:

(11)  $\mathrm{KL}\left(q(z \mid x, \lambda) \,\|\, p(z; \theta)\right) \approx \frac{1}{S} \sum_{s=1}^{S} \left[\log q(z^{(s)} \mid x, \lambda) - \log p(z^{(s)}; \theta)\right]$

where $S$ is the number of samples drawn per update.
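The MC estimator in Eq. (11) is straightforward to implement; for a univariate Gaussian case where the KL is known analytically, the estimate can be checked against the closed form:

```python
import math, random

def logpdf(z, mu, sigma):
    """log N(z; mu, sigma^2)."""
    return -0.5 * (((z - mu) / sigma) ** 2 + math.log(2 * math.pi)) - math.log(sigma)

def mc_kl(mu_q, sigma_q, mu_p, sigma_p, n_samples, rng):
    """MC estimate of KL(q || p): average of log q(z) - log p(z) over z ~ q."""
    total = 0.0
    for _ in range(n_samples):
        z = rng.gauss(mu_q, sigma_q)
        total += logpdf(z, mu_q, sigma_q) - logpdf(z, mu_p, sigma_p)
    return total / n_samples

rng = random.Random(42)
estimate = mc_kl(1.0, 1.0, 0.0, 1.0, 100_000, rng)  # analytic KL is 0.5
```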
6 Experiments
Our goal is to identify which techniques are effective in training VAEs for language modelling, and our evaluation concentrates on intrinsic metrics: negative log-likelihood (NLL), perplexity per token (PPL), rate ($R$), distortion ($D$), the number of active units (AU; Burda et al., 2015), and the gap in accuracy of next-word prediction (given gold prefixes) when decoding from prior samples versus decoding from posterior samples (ACC).
For VAE models, NLL (and therefore PPL) can only be estimated, since we do not have access to the exact marginal likelihood. For that we derive an importance sampling (IS) estimate
(12)  $P(x \mid \theta) \approx \frac{1}{S} \sum_{s=1}^{S} \frac{P(x \mid z^{(s)}, \theta)\, p(z^{(s)})}{q(z^{(s)} \mid x, \lambda)} \quad \text{with } z^{(s)} \sim q(z \mid x, \lambda)$
using our trained approximate posterior as importance distribution, with $S$ importance samples.
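The IS estimator can be sanity-checked on a toy conjugate model where the marginal is known in closed form; with the exact posterior as importance distribution, the estimate is exact for any number of samples (the toy model is an assumption made for the sketch, not our language model):

```python
import math, random

def logpdf(z, mu, sigma):
    """log N(z; mu, sigma^2)."""
    return -0.5 * (((z - mu) / sigma) ** 2 + math.log(2 * math.pi)) - math.log(sigma)

def is_log_marginal(x, n_samples, rng):
    """Importance-sampled estimate of log p(x) for the toy model
    z ~ N(0, 1), x | z ~ N(z, 1):
    log p(x) ~= logsumexp_s [log p(x, z_s) - log q(z_s | x)] - log S."""
    mu_q, sigma_q = x / 2.0, math.sqrt(0.5)  # exact posterior for this toy model
    terms = []
    for _ in range(n_samples):
        z = rng.gauss(mu_q, sigma_q)
        log_joint = logpdf(z, 0.0, 1.0) + logpdf(x, z, 1.0)
        terms.append(log_joint - logpdf(z, mu_q, sigma_q))
    m = max(terms)  # stabilise the log-sum-exp
    return m + math.log(sum(math.exp(t - m) for t in terms)) - math.log(n_samples)

rng = random.Random(0)
est = is_log_marginal(1.3, 100, rng)
exact = logpdf(1.3, 0.0, math.sqrt(2.0))  # analytic marginal is N(0, 2)
```

With an imperfect proposal (as in a trained VAE) the estimate is a lower bound in expectation on the log scale, which tightens as $S$ grows.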
We train and test our models on the English Penn Treebank (PTB) dataset (Marcus et al., 1993).⁷ [⁷We employ Dyer et al. (2016)'s preprocessing and (standard) partitioning of the data.] Hyperparameters for our architectures are chosen via Bayesian optimisation (BO; Snoek et al., 2012)—see Appendix A.2 for details.
Baseline
We compare our RnnLM to an external baseline employing a comparable number of parameters (Dyer et al., 2016).⁸ [⁸The current SOTA PTB-trained models use vastly more parameters and different preprocessing (Melis et al., 2017).] Table 1 shows that our RnnLM is a strong baseline and its architecture makes a strong generator building block.
On optimisation strategies
Table 2: Hyperparameters of each technique.

Technique | Hyperparameters
KL annealing | increment
annealed word dropout (AWD) | decrement
FB | target rate $r$
SFB | target rate $r$, tolerance, initial weight, update factor
MDR | target rate $r$
$\beta$-VAE | KL weight $\beta$
InfoVAE | two weighing factors
Table 3: Results per technique (mean ± std over five runs).

Mode | Hyperparameters | D | R | NLL | PPL | ACC
RnnLM | – | – | – | 118.7 ± 0.12 | 107.1 ± 0.46 | –
Vanilla | – | 118.4 ± 0.09 | 0.0 ± 0.00 | 118.4 ± 0.09 | 105.7 ± 0.36 | 0.0 ± 0.00
annealing | increment: 2e | 115.3 ± 0.24 | 3.3 ± 0.30 | 117.9 ± 0.08 | 103.7 ± 0.31 | 6.0 ± 0.25
AWD | decrement: 2e | 117.6 ± 0.14 | 0.0 ± 0.00 | 117.6 ± 0.14 | 102.5 ± 0.60 | 0.0 ± 0.00
FB | r: 5.0 | 113.3 ± 0.17 | 5.0 ± 0.06 | 117.5 ± 0.18 | 101.9 ± 0.77 | 5.8 ± 0.11
SFB | r: 6.467, 1%, 1, 1.05 | 112.0 ± 0.15 | 6.4 ± 0.04 | 117.3 ± 0.13 | 101.0 ± 0.51 | 7.0 ± 0.08
MDR | r: 5.0 | 113.5 ± 0.10 | 5.0 ± 0.03 | 117.5 ± 0.11 | 102.1 ± 0.45 | 6.2 ± 0.14
β-VAE | β: 0.66 | 113.0 ± 0.14 | 5.3 ± 0.05 | 117.4 ± 0.13 | 101.7 ± 0.50 | 6.1 ± 0.12
InfoVAE | 0.700, 31.623 | 113.5 ± 0.09 | 4.3 ± 0.02 | 117.2 ± 0.09 | 100.8 ± 0.35 | 5.2 ± 0.09
First, we assess the effectiveness of techniques that aim at promoting local optima of SenVAE with a better MI trade-off. The techniques we compare have hyperparameters of their own (see Table 2), which we tune using BO towards minimising the estimated NLL of the validation data. As for the architecture, the approximate posterior employs a bidirectional recurrent encoder, and the generator is essentially our RnnLM initialised with a learned projection of $z$ (Appendix A.1 contains the complete specification). Models were trained with Adam (Kingma and Ba, 2014) with default parameters and a fixed learning rate until convergence, five times for each technique.
Results can be found in Table 3. First, note how the vanilla VAE (no special treatment) encodes no information in latent space ($R \approx 0$). Then note that all techniques converged to VAEs that attain better perplexities than the RnnLM, and all but annealed word dropout did so at non-negligible rate. Notably, the two most popular techniques, word dropout and annealing, perform subpar to the other techniques.⁹ [⁹Though here we show annealed word dropout, to focus on techniques that do not weaken the generator, standard word dropout also converged to negligible rates.] The techniques that work well at non-negligible rate can be separated into two classes: the first requires setting a target rate, whereas the second requires tuning one or more other hyperparameters.¹⁰ [¹⁰Soft free bits actually requires both.] We argue that the rate hyperparameter is more interpretable and practical in most cases; for example, it likely requires less (manual or Bayesian) tuning by the researcher. Hence, we further investigated this first class, specifically FB and MDR, by varying the target rate further. Figure 1(a) shows that they attain comparable perplexities over a large range of rates. Figure 1(c) shows the difference between the specified target rate and the rate estimated on validation data at the end of training: MDR is just as good as FB for lower targets and becomes more effective than FB for higher targets.
On expressive priors and posteriors
Table 4: Results for expressive posteriors and priors (mean ± std over five runs); all models trained with a target rate of five.

Posterior | Prior | NLL | R | PPL | AU | ACC
– | – | – | – | 84.5 ± 0.53 | – | –
Gaussian | Gaussian | 103.5 ± 0.14 | 5.0 ± 0.06 | 81.5 ± 0.49 | 13 ± 0.7 | 5.4 ± 0.13
Gaussian | MoG | 103.3 ± 0.11 | 5.0 ± 0.07 | 81.4 ± 0.54 | 32 ± 0.0 | 5.8 ± 0.11
Gaussian | Vamp | 103.1 ± 0.11 | 5.0 ± 0.06 | 81.2 ± 0.40 | 22 ± 1.6 | 5.8 ± 0.07
Planar | Gaussian | 103.4 ± 0.09 | 4.9 ± 0.02 | 80.9 ± 0.33 | 12 ± 0.4 | 5.4 ± 0.11
IAF | Gaussian | 103.4 ± 0.10 | 5.0 ± 0.02 | 81.4 ± 0.34 | 32 ± 0.0 | 5.5 ± 0.10
IAF | MoG | 103.2 ± 0.25 | 5.1 ± 0.05 | 81.5 ± 0.70 | 32 ± 0.0 | 6.0 ± 0.09
Second, we compare the impact of expressive posteriors and priors. This time, flow and prior hyperparameters were selected via grid search, and can be found in Appendix A.1. All models were trained with a target rate of five, with settings otherwise the same as in the previous experiment. Table 4 shows that more expressive components did not improve perplexity further. It is possible, however, that now that we have stronger latent components, we need to target models with higher MI between $X$ and $Z$. Figure 2(a) shows that this is not the case: all models perform roughly the same, and beyond a certain rate performance degrades quickly. It is worth highlighting that, though perplexity did not improve, models with expressive latent components did show other indicators of increased MI. Figure 2(b) shows that SenVAEs trained with expressive latent components learn to rely more on information encoded in the latent variables—note the increased gap in performance when reconstructing from posterior rather than prior samples. This result is also hinted at by the increase in active units for expressive latent components shown in Table 4.
Generated samples
Sample  Closest training instance  TER 

For example, the Dow Jones Industrial Average fell almost 80 points to close at 2643.65.  By futuresrelated program buying, the Dow Jones Industrial Average gained 4.92 points to close at 2643.65.  
The department store concern said it expects to report profit from continuing operations in 1990.  RollsRoyce Motor Cars Inc. said it expects its U.S. sales to remain steady at about 1,200 cars in 1990.  
The new U.S. auto makers say the accord would require banks to focus on their core businesses of their own account.  International Minerals said the sale will allow Mallinckrodt to focus its resources on its core businesses of medical products, specialty chemicals and flavors. 
The inquiry soon focused on the judge. 
The judge declined to comment on the floor. 
The judge was dismissed as part of the settlement. 
The judge was sentenced to death in prison. 
The announcement was filed against the SEC. 
The offer was misstated in late September. 
The offer was filed against bankruptcy court in New York. 
The letter was dated Oct. 6. 
Figure 4 shows samples from a well-trained SenVAE, where we decode greedily from a prior sample—this way all variability is due to the generator's reliance on the latent sample. Recall that a vanilla VAE ignores $z$, and thus greedy generation from a prior sample is essentially deterministic in that case (see Figure 0(a)). Next to the samples we show the closest training instance, measured in terms of an edit distance (TER; Snover et al., 2006).¹¹ [¹¹This distance metric varies from 0 to 1, where 1 indicates the sentence is completely novel and 0 indicates the sentence is essentially copied from the training data.] The motivation to retrieve this "nearest neighbour" is to help us assess whether the generator is producing novel text or simply reproducing something it memorised from training. We also show a homotopy in Figure 5: here we decode greedily from points lying between a posterior sample conditioned on the first sentence and a posterior sample conditioned on the last sentence. In contrast to the vanilla VAE (Figure 0(b)), neighbourhood in latent space is now used to capture some regularities in data space. These samples add support to the quantitative evidence that our DGMs have been effectively trained not to neglect the latent space. In Appendix D we provide more samples (also for other variants of the model).
Recommendations
Based on our path through the land of SenVAEs, we recommend targeting a specific rate via MDR (or FB) instead of annealing (or word dropout). It is easy to pick a rate by plotting validation performance against a handful of rate values, without sophisticated Bayesian optimisation. Use importance-sampled estimates of NLL, rather than single-sample ELBO estimates, for model selection, for the latter can be too loose a bound and/or too heavily influenced by noisy estimates of KL. Use as many samples as you can for that: you will observe a tighter bound and lower variance. Inspect sentences generated by greedily decoding from a prior (or posterior) sample, as this shows whether the generator is at all sensitive to variation in latent space. Retrieve nearest neighbours from training data to spot copying behaviour. Do investigate stronger latent components (priors and approximate posteriors): they seem to lead to higher mutual information without hurting perplexity (which weaker generators probably would).

7 Related Work
In NLP, posterior collapse was probably first noticed by Bowman et al. (2016), who addressed it via word dropout and/or KL scaling. Further investigation revealed that in the presence of strong generators, the ELBO itself becomes the culprit (Chen et al., 2017; Alemi et al., 2018), for it does not have a term that explicitly promotes high MI between latent and observed data. Posterior collapse has also been ascribed to amortised inference (Kim et al., 2018). Beyond the techniques compared and developed in this work, other solutions have been proposed, including further adaptations to the generator architecture (Semeniuta et al., 2017; Yang et al., 2017; Park et al., 2018; Dieng et al., 2018), the ELBO (Tolstikhin et al., 2017; Goyal et al., 2017a), and the latent distributions (Xu and Durrett, 2018; Razavi et al., 2019).
Concurrently to this work, He et al. (2019) proposed aggressive optimisation of the inference network until convergence of MI estimates. The authors show that this outperforms KL scaling techniques, specifically $\beta$-VAE. However, in contrast to our MDR objective, the extra optimisation of the inference network slows down training considerably. A comparison to their technique is an interesting direction for future work.
GECO (Rezende and Viola, 2018) and the Lagrangian VAE (LagVAE; Zhao et al., 2018a) cast VAE optimisation as a dual problem, and for that they are closely related to our MDR. GECO targets minimisation of the KL under constraints on reconstruction error, whereas LagVAE targets either maximisation or minimisation of (bounds on) the MI between $X$ and $Z$ under constraints on the InfoVAE objective. Contrary to MDR, GECO focuses on latent space regularisation and offers no explicit mechanism to mitigate posterior collapse. LagVAE, in MI-maximisation mode, promotes non-negligible rates, but requires constraints based on feasible ELBO values.¹² [¹²This might be a reasonable requirement for often-explored datasets, such as MNIST.] Thus, in this setting, it is somewhat the opposite of our technique: MDR optimises the ELBO while targeting the rate (an upper bound on MI), whereas LagVAE maximises MI while targeting an ELBO. It might depend on the specific problem which of the two methods is more convenient. All three techniques share the advantage that they can be trivially extended with other constraints at the researcher's behest.
Expressive latent components have been extensively and successfully applied to the image domain. Expressive posteriors, based on NFs, include the IAF (Kingma et al., 2016), NAF (Huang et al., 2018a), ODE (Chen et al., 2018) and FFJORD (Grathwohl et al., 2019), Sylvester flow (van den Berg et al., 2018) and Householder flow (Tomczak and Welling, 2017). Expressive priors include the VampPrior (Tomczak and Welling, 2017), autoregressive flows (Papamakarios et al., 2017) and various non-parametric priors (Nalisnick and Smyth, 2016; Goyal et al., 2017b; Bodin et al., 2017). However, these techniques have seen little application to the language domain so far, with the exception of the Householder flow for variational topic modelling (Liu et al., 2018) and, concurrently to this work, NFs for latent sentence modelling with character-level latent variables and weak generators (Ziegler and Rush, 2019). We believe we are the first to employ expressive latent models at the sentence level, and hope this will stimulate the NLP community to further investigate these techniques.
8 Discussion
The typical RnnLM is built upon an exact factorisation of the joint distribution, thus a well-trained architecture is hard to improve upon in terms of log-likelihood of gold-standard data. Our interest in latent variable models stems from the desire to obtain generative stories that are less opaque than that of an RnnLM, for example, in that they may expose knobs that we can use to control generation and a hierarchy of steps that may award a degree of interpretability to the model. The SenVAE is not that model, but it is a crucial building block in the pursuit of hierarchical probabilistic models of language. SenVAE is a deep generative model whose generative story is rather shallow; yet, due to its strong generator component, it is hard to make effective use of the extra knob it offers. In this paper, we have shown that effective estimation of such a model is possible; in particular, optimisation subject to a minimum rate constraint seems a simple and effective strategy to alleviate posterior collapse. Many questions remain open, especially regarding the potential of expressive latent components, but we hope this work, i.e. the organised review it contributes and the techniques it introduces, will pave the way to deeper—in statistical hierarchy—generative models of language.

Acknowledgments
This project has received funding from the Dutch Organization for Scientific Research VICI Grant No 27789002 and from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 825299 (GoURMET).
References
 Alemi et al. (2018) Alexander Alemi, Ben Poole, Ian Fischer, Joshua Dillon, Rif A Saurous, and Kevin Murphy. 2018. Fixing a broken ELBO. In International Conference on Machine Learning, pages 159–168.
 authors (2016) The GPyOpt authors. 2016. GPyOpt: A Bayesian optimization framework in Python. http://github.com/SheffieldML/GPyOpt.
 Baydin et al. (2018) Atilim Gunes Baydin, Barak A Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind. 2018. Automatic differentiation in machine learning: a survey. Journal of Machine Learning Research, 18:1–43.
 Bengio et al. (2003) Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of machine learning research, 3(Feb):1137–1155.

 van den Berg et al. (2018) Rianne van den Berg, Leonard Hasenclever, Jakub Tomczak, and Max Welling. 2018. Sylvester normalizing flows for variational inference. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI).
 Bergstra and Bengio (2012) James Bergstra and Yoshua Bengio. 2012. Random search for hyperparameter optimization. Journal of Machine Learning Research, 13(Feb):281–305.
 Bodin et al. (2017) Erik Bodin, Iman Malik, Carl Henrik Ek, and Neill DF Campbell. 2017. Nonparametric inference for autoencoding variational bayes. arXiv preprint arXiv:1712.06536.
 Bottou and Cun (2004) Léon Bottou and Yann L. Cun. 2004. Large scale online learning. In S. Thrun, L. K. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16, pages 217–224. MIT Press.
 Bowman et al. (2016) Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew Dai, Rafal Jozefowicz, and Samy Bengio. 2016. Generating sentences from a continuous space. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 10–21.
 Boyd and Vandenberghe (2004) Stephen Boyd and Lieven Vandenberghe. 2004. Convex optimization. Cambridge university press.
 Burda et al. (2015) Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. 2015. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519.
 Chen et al. (2018) Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. 2018. Neural ordinary differential equations. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. CesaBianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 6571–6583. Curran Associates, Inc.

 Chen et al. (2017) Xi Chen, Diederik P Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman, Ilya Sutskever, and Pieter Abbeel. 2017. Variational lossy autoencoder. In International Conference on Learning Representations.
 Corro and Titov (2018) Caio Corro and Ivan Titov. 2018. Differentiable perturb-and-parse: Semi-supervised parsing with a structured variational autoencoder. In ICLR.
 Dieng et al. (2018) Adji B. Dieng, Yoon Kim, Alexander M. Rush, and David M. Blei. 2018. Avoiding latent variable collapse with generative skip models. CoRR, abs/1807.04863.
 Dinh et al. (2017) Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. 2017. Density estimation using Real NVP. In International Conference on Learning Representations.
 Dyer et al. (2016) Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. 2016. Recurrent neural network grammars. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 199–209. Association for Computational Linguistics.
 Fraccaro et al. (2016) Marco Fraccaro, Søren Kaae Sønderby, Ulrich Paquet, and Ole Winther. 2016. Sequential neural models with stochastic layers. In Advances in neural information processing systems, pages 2199–2207.
 Gal and Ghahramani (2015) Yarin Gal and Zoubin Ghahramani. 2015. Dropout as a Bayesian approximation. arXiv preprint arXiv:1506.02157.

 Gal and Ghahramani (2016) Yarin Gal and Zoubin Ghahramani. 2016. A theoretically grounded application of dropout in recurrent neural networks. In Advances in neural information processing systems, pages 1019–1027.
 Germain et al. (2015) Mathieu Germain, Karol Gregor, Iain Murray, and Hugo Larochelle. 2015. MADE: Masked autoencoder for distribution estimation. In International Conference on Machine Learning, pages 881–889.
 Goodman (2001) Joshua T. Goodman. 2001. A bit of progress in language modeling. Comput. Speech Lang., 15(4):403–434.
 Goyal et al. (2017a) Anirudh Goyal ALIAS PARTH Goyal, Alessandro Sordoni, MarcAlexandre Côté, Nan Rosemary Ke, and Yoshua Bengio. 2017a. Zforcing: Training stochastic recurrent networks. In Advances in neural information processing systems, pages 6713–6723.
 Goyal et al. (2017b) Prasoon Goyal, Zhiting Hu, Xiaodan Liang, Chenyu Wang, Eric P Xing, and Carnegie Mellon. 2017b. Nonparametric variational autoencoders for hierarchical representation learning. In ICCV, pages 5104–5112.
 Grathwohl et al. (2019) Will Grathwohl, Ricky T. Q. Chen, Jesse Bettencourt, Ilya Sutskever, and David Duvenaud. 2019. FFJORD: Free-form continuous dynamics for scalable reversible generative models. International Conference on Learning Representations.
 Gretton et al. (2012) Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. 2012. A kernel twosample test. Journal of Machine Learning Research, 13(Mar):723–773.
 He et al. (2019) Junxian He, Daniel Spokoyny, Graham Neubig, and Taylor BergKirkpatrick. 2019. Lagging inference networks and posterior collapse in variational autoencoders. In Proceedings of ICLR.
 Higgins et al. (2017) Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. 2017. beta-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations.
 Huang et al. (2018a) ChinWei Huang, David Krueger, Alexandre Lacoste, and Aaron Courville. 2018a. Neural autoregressive flows. arXiv preprint arXiv:1804.00779.
 Huang et al. (2018b) ChinWei Huang, David Krueger, Alexandre Lacoste, and Aaron Courville. 2018b. Neural autoregressive flows. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 2078–2087, Stockholmsmässan, Stockholm Sweden. PMLR.

 Jelinek (1980) Frederick Jelinek. 1980. Interpolated estimation of Markov source parameters from sparse data. In Proc. Workshop on Pattern Recognition in Practice, 1980.
 Jordan et al. (1999) Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K. Saul. 1999. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233.
 Kim et al. (2018) Yoon Kim, Sam Wiseman, Andrew Miller, David Sontag, and Alexander Rush. 2018. Semiamortized variational autoencoders. In International Conference on Machine Learning, pages 2683–2692.
 Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
 Kingma et al. (2016) Diederik P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. 2016. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pages 4743–4751.
 Kingma and Welling (2014) Diederik P. Kingma and Max Welling. 2014. Auto-encoding variational Bayes. In International Conference on Learning Representations.
 Liu et al. (2018) Luyang Liu, Heyan Huang, and Yang Gao. 2018. Correlated topic modeling via Householder flow.
 Lyu and Titov (2018) Chunchuan Lyu and Ivan Titov. 2018. AMR parsing as graph prediction with latent alignment. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 397–407, Melbourne, Australia. Association for Computational Linguistics.
 Marcus et al. (1993) Mitchell P Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.
 Melis et al. (2017) Gábor Melis, Chris Dyer, and Phil Blunsom. 2017. On the state of the art of evaluation in neural language models. arXiv preprint arXiv:1707.05589.
 Miao and Blunsom (2016) Yishu Miao and Phil Blunsom. 2016. Language as a latent variable: Discrete generative models for sentence compression. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 319–328. Association for Computational Linguistics.
 Miao et al. (2016) Yishu Miao, Lei Yu, and Phil Blunsom. 2016. Neural variational inference for text processing. In International conference on machine learning, pages 1727–1736.
 Mikolov et al. (2010) Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černockỳ, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association.
 Nalisnick and Smyth (2016) Eric Nalisnick and Padhraic Smyth. 2016. Stickbreaking variational autoencoders. arXiv preprint arXiv:1605.06197.
 Papamakarios et al. (2017) George Papamakarios, Theo Pavlakou, and Iain Murray. 2017. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, pages 2338–2347.
 Park et al. (2018) Yookoon Park, Jaemin Cho, and Gunhee Kim. 2018. A hierarchical latent structure for variational conversation modeling. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1792–1801. Association for Computational Linguistics.

 Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In NIPS-W.
 Pu et al. (2016) Yunchen Pu, Zhe Gan, Ricardo Henao, Xin Yuan, Chunyuan Li, Andrew Stevens, and Lawrence Carin. 2016. Variational autoencoder for deep learning of images, labels and captions. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 2352–2360. Curran Associates, Inc.
 Rasmussen and Williams (2005) Carl Edward Rasmussen and Christopher K. I. Williams. 2005. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press.
 Razavi et al. (2019) Ali Razavi, Aäron van den Oord, Ben Poole, and Oriol Vinyals. 2019. Preventing posterior collapse with delta-VAEs. arXiv preprint arXiv:1901.03416.
 Rezende and Mohamed (2015) Danilo Rezende and Shakir Mohamed. 2015. Variational inference with normalizing flows. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1530–1538, Lille, France. PMLR.
 Rezende et al. (2014) Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. 2014. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 1278–1286, Bejing, China. PMLR.
 Rezende and Viola (2018) Danilo Jimenez Rezende and Fabio Viola. 2018. Taming VAEs. arXiv preprint arXiv:1810.00597.
 Rios et al. (2018) Miguel Rios, Wilker Aziz, and Khalil Simaan. 2018. Deep generative model for joint alignment and word representation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1011–1023. Association for Computational Linguistics.
 Rippel and Adams (2013) Oren Rippel and Ryan Prescott Adams. 2013. High-dimensional probability estimation with deep density models. arXiv preprint arXiv:1302.5125.
 Robbins and Monro (1951) Herbert Robbins and Sutton Monro. 1951. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407.
 Semeniuta et al. (2017) Stanislau Semeniuta, Aliaksei Severyn, and Erhardt Barth. 2017. A hybrid convolutional variational autoencoder for text generation. arXiv preprint arXiv:1702.02390.
 Serban et al. (2017) Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron Courville, and Yoshua Bengio. 2017. A hierarchical latent variable encoderdecoder model for generating dialogues. In ThirtyFirst AAAI Conference on Artificial Intelligence.
 Snoek et al. (2012) Jasper Snoek, Hugo Larochelle, and Ryan P Adams. 2012. Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pages 2951–2959.
 Snover et al. (2006) Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of association for machine translation in the Americas, volume 200.
 Sønderby et al. (2016) Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther. 2016. Ladder variational autoencoders. In Advances in neural information processing systems, pages 3738–3746.
 Tabak et al. (2010) Esteban G Tabak, Eric Vanden-Eijnden, et al. 2010. Density estimation by dual ascent of the log-likelihood. Communications in Mathematical Sciences, 8(1):217–233.
 Tolstikhin et al. (2017) Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf. 2017. Wasserstein autoencoders. arXiv preprint arXiv:1711.01558.
 Tomczak and Welling (2017) Jakub M Tomczak and Max Welling. 2017. VAE with a VampPrior. arXiv preprint arXiv:1705.07120.

 Vincent et al. (2008) Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. 2008. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on Machine learning, pages 1096–1103. ACM.
 Wang et al. (2017) Liwei Wang, Alexander Schwing, and Svetlana Lazebnik. 2017. Diverse and accurate image description using a variational autoencoder with an additive gaussian encoding space. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5756–5766. Curran Associates, Inc.
 Wen et al. (2017) Tsung-Hsien Wen, Yishu Miao, Phil Blunsom, and Steve Young. 2017. Latent intention dialogue models. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 3732–3741. JMLR.org.
 Xu and Durrett (2018) Jiacheng Xu and Greg Durrett. 2018. Spherical latent spaces for stable variational autoencoders. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4503–4513, Brussels, Belgium. Association for Computational Linguistics.
 Yang et al. (2017) Zichao Yang, Zhiting Hu, Ruslan Salakhutdinov, and Taylor BergKirkpatrick. 2017. Improved variational autoencoders for text modeling using dilated convolutions. arXiv preprint arXiv:1702.08139.
 Zhang et al. (2016) Biao Zhang, Deyi Xiong, Jinsong Su, Hong Duan, and Min Zhang. 2016. Variational neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 521–530, Austin, Texas. Association for Computational Linguistics.
 Zhao et al. (2018a) Shengjia Zhao, Jiaming Song, and Stefano Ermon. 2018a. The information autoencoding family: A lagrangian perspective on latent variable generative models. In Conference on Uncertainty in Artificial Intelligence, Monterey, California.
 Zhao et al. (2018b) Shengjia Zhao, Jiaming Song, and Stefano Ermon. 2018b. InfoVAE: Information maximizing variational autoencoders. In Theoretical Foundations and Applications of Deep Generative Models, ICML18.
 Zhou and Neubig (2017) Chunting Zhou and Graham Neubig. 2017. Multispace variational encoderdecoders for semisupervised labeled sequence transduction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 310–320. Association for Computational Linguistics.
 Ziegler and Rush (2019) Zachary M Ziegler and Alexander M Rush. 2019. Latent normalizing flows for discrete sequences. arXiv preprint arXiv:1901.10548.
Appendix A Architectures and Hyperparameters
In order to ensure that all our experiments are fully reproducible, this section provides an extensive overview of the model architectures, as well as model and optimisation hyperparameters.
Some hyperparameters are common to all experiments, e.g. the optimiser and dropout rate; these can be found in Table 5. All models were optimised with Adam using default settings (Kingma and Ba, 2014). To regularise the models, we use (variational) dropout with a shared mask across timesteps (Gal and Ghahramani, 2016) and weight decay proportional to the dropout rate (Gal and Ghahramani, 2015) on the input and output layers of the generative networks (i.e. the RnnLM and the recurrent decoder in SenVAE). No dropout is applied to layers of the inference network, as this does not lead to consistent empirical benefits and lacks a good theoretical basis. Gradient norms are clipped to prevent exploding gradients, and long sentences are truncated to three standard deviations above the average sentence length in the training data.
Parameter                  Value
--------------------------------
Optimizer                  Adam
Optimizer parameters       default
Learning rate              0.001
Batch size                 64
Decoder dropout rate       0.4
Weight decay               proportional to dropout rate
Maximum sentence length    56
Maximum gradient norm      1.5

Table 5: Hyperparameters common to all experiments.
A.1 Architectures
Model        Parameter               Value
------------------------------------------
A            embedding units         256
A            vocabulary size         25643
R and S      decoder layers          2
R and S      decoder hidden units    256
S            encoder hidden units    256
S            encoder layers          1
S            latent units            32
I            context units           512
I and P      flow steps              4
MoG          mixture components      100
VampPrior    pseudo-inputs           100

Table 6: Architecture hyperparameters (A: all models, R: RnnLM, S: SenVAE, I: IAF, P: planar flow).
This section describes the components that parameterise our models. All models were implemented with the PyTorch library (Paszke et al., 2017), using default modules for the recurrent networks, embedders and optimisers. We use mnemonic blocks to describe architectures. Table 6 lists hyperparameters for the models discussed in what follows.
RnnLM
At each step, an RnnLM parameterises a categorical distribution over the vocabulary, i.e. X_i | x_{<i} ∼ Cat(f(x_{<i}; θ)), where

  e_i = emb(x_i; θ)    (13a)
  h_i = GRU(h_{i−1}, e_{i−1}; θ)    (13b)
  f(x_{<i}; θ) = softmax(dense_V(h_i; θ))    (13c)

We employ an embedding layer (emb), one (or more) GRU cell(s) (the number of layers is a hyperparameter of the model), and a dense layer to map from the dimensionality of the GRU to the vocabulary size V.
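As a minimal illustration of this generative story, the sketch below runs one prefix through an embedding lookup, a single GRU cell and a softmax output layer. It is a pure-Python toy with random weights and made-up dimensions, not the paper's trained architecture:

```python
import math, random

random.seed(0)
V, E, H = 5, 4, 3  # toy vocabulary, embedding and hidden sizes

def mat(r, c):  # random weight matrix
    return [[random.uniform(-0.1, 0.1) for _ in range(c)] for _ in range(r)]

def mv(W, x):  # matrix-vector product
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def add(a, b): return [x + y for x, y in zip(a, b)]
def sig(v): return [1 / (1 + math.exp(-x)) for x in v]

emb = mat(V, E)                  # embedding layer
Wz, Uz = mat(H, E), mat(H, H)    # GRU update gate
Wr, Ur = mat(H, E), mat(H, H)    # GRU reset gate
Wh, Uh = mat(H, E), mat(H, H)    # GRU candidate state
Wo = mat(V, H)                   # dense layer to vocabulary size

def gru_step(h, e):
    z = sig(add(mv(Wz, e), mv(Uz, h)))
    r = sig(add(mv(Wr, e), mv(Ur, h)))
    gated = [ri * hi for ri, hi in zip(r, h)]
    hh = [math.tanh(a + b) for a, b in zip(mv(Wh, e), mv(Uh, gated))]
    return [(1 - zi) * hi + zi * hhi for zi, hi, hhi in zip(z, h, hh)]

def softmax(v):
    m = max(v); ex = [math.exp(x - m) for x in v]
    s = sum(ex); return [x / s for x in ex]

h = [0.0] * H                    # initial hidden state
for x_prev in [1, 3]:            # condition on the prefix x_{<i}
    h = gru_step(h, emb[x_prev])
probs = softmax(mv(Wo, h))       # Cat(f(x_{<i}; theta)) over the vocabulary
assert abs(sum(probs) - 1.0) < 1e-9
```

Regardless of the weights, the output is always a proper categorical distribution over the V vocabulary entries.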
Gaussian SenVAE
A Gaussian SenVAE also parameterises a categorical distribution over the vocabulary for each given prefix, but, in addition, it conditions on a latent embedding z, i.e. X_i | z, x_{<i} ∼ Cat(f(z, x_{<i}; θ)) where Z ∼ N(0, I) and

  e_i = emb(x_i; θ)    (14a)
  h_0 = tanh(dense(z; θ))    (14b)
  h_i = GRU(h_{i−1}, e_{i−1}; θ)    (14c)
  f(z, x_{<i}; θ) = softmax(dense_V(h_i; θ))    (14d)

Compared to RnnLM, we modify f only slightly by initialising the GRU cell(s) with h_0, computed as a learnt transformation of z. Because the marginal likelihood of the Gaussian SenVAE is intractable, we train it via variational inference using an inference model q(z | x; λ) = N(z | u, diag(s ⊙ s)) where

  e_i = emb(x_i; θ)    (15a)
  h = BiGRU(e_1, …, e_n; λ)    (15b)
  u = dense(h; λ)    (15c)
  s = softplus(dense(h; λ))    (15d)

Note that we reuse the embedding layer from the generative model. Finally, a sample is obtained via z = u + s ⊙ ε where ε ∼ N(0, I).
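The reparameterised sample and the resulting rate term can be sketched directly. The values of u and s below are hypothetical stand-ins for inference-network outputs; the closed-form KL between the diagonal Gaussian posterior and the standard Gaussian prior is the rate that the training constraints in this paper act on:

```python
import math, random

random.seed(1)
u = [0.5, -0.3, 0.0]   # inferred location (hypothetical values)
s = [0.8, 1.0, 1.2]    # inferred scale (softplus output, hence positive)

# reparameterised sample: z = u + s * eps, with eps ~ N(0, I)
eps = [random.gauss(0.0, 1.0) for _ in u]
z = [ui + si * ei for ui, si, ei in zip(u, s, eps)]

# rate R = KL(N(u, diag(s^2)) || N(0, I)), summed over dimensions
kl = sum(0.5 * (si**2 + ui**2 - 1.0 - math.log(si**2))
         for ui, si in zip(u, s))
assert kl >= 0.0  # KL divergence is non-negative

# with u = 0 and s = 1 the posterior equals the prior and the rate is zero
kl0 = sum(0.5 * (1.0 + 0.0 - 1.0 - math.log(1.0)) for _ in u)
assert kl0 == 0.0
```

When the posterior collapses to the prior (u = 0, s = 1), this rate vanishes, which is exactly the failure mode the minimum rate constraint guards against.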
IAF SenVAE
Unlike the Gaussian case, an IAF (Kingma et al., 2016) does not parameterise a distribution directly, but rather a sampling procedure where we transform a D-dimensional sample from a base distribution (here a standard Gaussian) via an invertible and differentiable transformation. Here we show the design of an IAF where we employ MADE layers:

  e_i = emb(x_i; θ)    (16a)
  h = BiGRU(e_1, …, e_n; λ)    (16b)
  u = dense(h; λ)    (16c)
  s = softplus(dense(h; λ))    (16d)
  c = dense(h; λ)    (16e)
  z_0 = u + s ⊙ ε with ε ∼ N(0, I)    (16f)
  [m_t, v_t] = MADE(z_{t−1}, c; λ)    (16g)
  σ_t = sigmoid(v_t)    (16h)
  z_t = σ_t ⊙ z_{t−1} + (1 − σ_t) ⊙ m_t    (16i)

The context vector c represents the complete input sequence and allows each step of the flow to condition on x. Note that while z_0 is actually Gaussian-distributed, i.e. Z_0 ∼ N(u, diag(s ⊙ s)), the distribution of each Z_t for t ≥ 1 is potentially increasingly more complex. A sample from q(z | x; λ) is the output of the flow at step T, i.e.

  z = z_T    (16j)

whose log-density is

  log q(z | x; λ) = log N(z_0 | u, diag(s ⊙ s)) − Σ_{t=1}^{T} Σ_{d=1}^{D} log σ_{t,d}    (16k)
See Appendix C for more on NFs.
MADE
We denote by [m, v] = MADE(z, c; λ) a masked dense network (Germain et al., 2015) with inputs z and c, which is autoregressive on z (the d-th output depends only on z_{<d}):

  a = ELU(M̃ z + V c + b)    (17a)
  m = M_m a + b_m    (17b)
  v = M_v a + b_v    (17c)

where M̃ is a strictly lower-triangular weight matrix (with zeros on and above the diagonal) and M_m, M_v are lower-triangular weight matrices with non-zero diagonal elements. The parameters of the MADE are the weight matrices and bias vectors above.
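The autoregressive property can be checked numerically: with these triangular masks, perturbing z_d must leave all outputs up to (and including) position d unchanged. The sketch below uses toy sizes and random weights, and the specific two-layer ELU layout is only an illustration of the masking idea, not the exact network used in the paper:

```python
import math, random

random.seed(2)
D, C = 4, 2  # toy latent and context sizes

def mat(r, c, mask=None):
    W = [[random.uniform(-1, 1) for _ in range(c)] for _ in range(r)]
    if mask == "strict":  # zeros on and above the diagonal
        W = [[W[i][j] if j < i else 0.0 for j in range(c)] for i in range(r)]
    if mask == "lower":   # zeros strictly above the diagonal
        W = [[W[i][j] if j <= i else 0.0 for j in range(c)] for i in range(r)]
    return W

def mv(W, x): return [sum(w * xi for w, xi in zip(row, x)) for row in W]
def elu(v): return [x if x > 0 else math.exp(x) - 1.0 for x in v]

M_strict = mat(D, D, "strict")  # input layer: a_d depends on z_{<d}
V = mat(D, C)                   # context weights need no mask
M_out = mat(D, D, "lower")      # output layer preserves the ordering

def made(z, c):
    a = elu([x + y for x, y in zip(mv(M_strict, z), mv(V, c))])
    return mv(M_out, a)         # the d-th output depends only on z_{<d}

c = [0.3, -0.7]
z = [0.1, 0.2, 0.3, 0.4]
m = made(z, c)
z2 = list(z); z2[1] += 10.0     # perturb the second input
m2 = made(z2, c)
assert m2[0] == m[0] and m2[1] == m[1]  # earlier outputs unchanged
assert m2[2] != m[2] or m2[3] != m[3]   # later outputs do change
```

This ordering is what makes the Jacobian of the resulting flow step triangular, and hence its determinant cheap to compute.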
Planar SenVAE
A planar flow has a more compact parameterisation than an IAF, but is based on the same principle, namely, we parameterise a sampling procedure by an invertible and differentiable transformation of a fixed random source (a standard Gaussian in this case):

  e_i = emb(x_i; θ)    (18a)
  h = BiGRU(e_1, …, e_n; λ)    (18b)
  u = dense(h; λ)    (18c)
  s = softplus(dense(h; λ))    (18d)
  z_0 = u + s ⊙ ε with ε ∼ N(0, I)    (18e)
  w_t = dense(h; λ)    (18f)
  v_t = dense(h; λ)    (18g)
  b_t = dense(h; λ)    (18h)
  z_t = z_{t−1} + v_t tanh(w_t^⊤ z_{t−1} + b_t)    (18i)

where a sample from q(z | x; λ) is the output of the flow at step T, i.e. z = z_T (18j), with log-density

  log q(z | x; λ) = log N(z_0 | u, diag(s ⊙ s)) − Σ_{t=1}^{T} log |1 + v_t^⊤ ψ_t(z_{t−1})|    (18k)

and ψ_t(z) = tanh′(w_t^⊤ z + b_t) w_t (Rezende and Mohamed, 2015).
In line with the work of van den Berg et al. (2018), we amortise all parameters of the flow in addition to the parameters of the base distribution.
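One planar step and its log-determinant term can be sketched in a few lines of pure Python. The parameter values below are arbitrary toy choices standing in for the amortised outputs:

```python
import math

def dot(a, b): return sum(x * y for x, y in zip(a, b))

def planar_step(z, u, w, b):
    """One planar flow step: z' = z + u * tanh(w^T z + b)."""
    a = math.tanh(dot(w, z) + b)
    z_new = [zi + ui * a for zi, ui in zip(z, u)]
    # log |det Jacobian| = log |1 + u^T psi(z)|, psi(z) = tanh'(w^T z + b) * w
    dtanh = 1.0 - a * a
    logdet = math.log(abs(1.0 + dtanh * dot(u, w)))
    return z_new, logdet

z = [0.5, -1.0]
u, w, b = [0.3, 0.1], [1.0, 2.0], 0.2  # hypothetical flow parameters
z1, ld = planar_step(z, u, w, b)
assert len(z1) == 2

# with u = 0 the transformation is the identity and the log-det is zero
z_id, ld0 = planar_step(z, [0.0, 0.0], w, b)
assert z_id == z and ld0 == 0.0
```

Summing the log-determinants over the composed steps, and subtracting them from the base log-density, gives the flow's log-density as in the equations above.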
MoG prior
We parameterise K diagonal Gaussians, which are mixed uniformly. To do so we need K location vectors, each in R^D, and K scale vectors, each in R^D with strictly positive entries. To ensure strict positivity for the scales we make σ_k = softplus(ŝ_k). The set of generative parameters is therefore extended with μ_1, …, μ_K and ŝ_1, …, ŝ_K, each in R^D.
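Evaluating such a uniform mixture density is straightforward; the sketch below uses two toy components in two dimensions (made-up locations and raw scales), with the softplus link assumed above and a log-sum-exp for numerical stability:

```python
import math

def softplus(x): return math.log1p(math.exp(x))

def log_normal(z, mu, sd):
    # log-density of a diagonal Gaussian
    return sum(-0.5 * math.log(2 * math.pi) - math.log(s)
               - 0.5 * ((zi - m) / s) ** 2
               for zi, m, s in zip(z, mu, sd))

# K = 2 toy components in D = 2; scales via softplus to ensure positivity
locs = [[0.0, 0.0], [2.0, -1.0]]
raw_scales = [[0.0, 0.0], [-1.0, 1.0]]
scales = [[softplus(x) for x in row] for row in raw_scales]

def log_mog(z):
    # uniform mixture: log p(z) = -log K + logsumexp_k log N(z; mu_k, sd_k)
    K = len(locs)
    lps = [log_normal(z, locs[k], scales[k]) for k in range(K)]
    m = max(lps)
    return -math.log(K) + m + math.log(sum(math.exp(l - m) for l in lps))

p = math.exp(log_mog([0.0, 0.0]))
assert 0.0 < p < 1.0
```

Unlike a single Gaussian, this prior can place probability mass in several separated regions of the latent space.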
VampPrior
For this we estimate K sequences of input vectors; each sequence corresponds to a pseudo-input. This means we extend the set of generative parameters with u_{k,1}, …, u_{k,n_k}, each an embedding-sized vector, for k = 1, …, K. For each k, we sample the length n_k at the beginning of training and keep it fixed. Specifically, we drew samples from a normal distribution, N(μ_n, σ_n²), which we rounded to the nearest integer; μ_n and σ_n² are the dataset sentence length mean and variance, respectively.
A.2 Bayesian Optimisation
Parameter               Value
------------------------------
Objective function      Validation NLL
Kernel                  Matérn
Acquisition function    Expected Improvement
Parameter inference     MCMC
MCMC samples            10
Leapfrog steps          20
Burn-in samples         100

Table 7: Bayesian optimisation settings.
Bayesian optimisation (BO) is an efficient method for approximately finding global optima of a (typically expensive-to-compute) objective function f(ψ), where ψ is a vector containing the values of hyperparameters that may influence the outcome of the function (Snoek et al., 2012). Hence, it forms an alternative to grid search or random search (Bergstra and Bengio, 2012) for tuning the hyperparameters of a machine learning algorithm. BO works by assuming that our observations f(ψ_i) (for i = 1, …, n) are drawn from a Gaussian process (GP; Rasmussen and Williams, 2005). Based on the GP posterior, we can then design and infer an acquisition function, which can be used to determine where to "look next" in parameter space, i.e. it can be used to draw a new candidate ψ_{n+1} for which we then evaluate the objective function f(ψ_{n+1}). This procedure iterates until a set of optimal parameters is found with some level of confidence.
In practice, the efficiency of BO hinges on multiple choices, such as the specific form of the acquisition function, the covariance matrix (or kernel) of the GP and how the parameters of the acquisition function are estimated. Our objective function is the (importance-sampled) validation NLL, which can only be computed after a model converges (via gradient-based optimisation of the ELBO). We follow the advice of Snoek et al. (2012) and use MCMC for estimating the parameters of the acquisition function. This reduced the number of objective function evaluations, speeding up the overall search. Other settings were also based on results by Snoek et al. (2012), and we refer the interested reader to that paper for more information about BO in general. A summary of all relevant settings of BO can be found in Table 7. We used the GPyOpt library (authors, 2016) to implement this procedure.
Appendix B Relation between optimisation techniques
It is insightful to compare the various techniques we surveyed to the technique we propose in terms of the quantities involved in their optimisation. To avoid clutter, let us assume a single data point x, and denote the distortion by D and the rate by R.

The losses minimised by the VAE, annealing and SFB all have the form

  L = D + β R    (19)

where β is a weighting factor. FB minimises the loss

  L = D + max(r, R)    (20)

where r is the target rate. Last, with respect to the generative and variational parameters θ and φ, MDR minimises the loss

  L = D + R + λ (r − R)    (21)

where λ ≥ 0 is the Lagrangian multiplier. And with respect to λ, it minimises

  L = −(D + R + λ (r − R))    (22)
Since we aim to minimise these losses as a function of the parameters with stochastic gradient descent, it makes sense to evaluate how these methods influence optimisation by checking their gradients. First, FB has the following gradient w.r.t. its parameters:

  ∇L = ∇D + [R > r] ∇R    (23)

where [·] is the indicator function. This shows the discontinuity in the gradients as a result of this objective: there is a sudden 'jump' from zero to a large gradient w.r.t. the KL when the KL rises above r. VAE, KL annealing, and SFB have a gradient that does not suffer such discontinuities:

  ∇L = ∇D + β ∇R    (24)

where you can see that the magnitude of the gradient w.r.t. the KL is influenced by the value of β at that point in the optimisation. Last, observe the gradient of the MDR objective:

  ∇L = ∇D + (1 − λ) ∇R    (25)

thus, essentially, ∇L = ∇D + β ∇R with β = 1 − λ. Hence, MDR is another form of KL weighting, albeit one that allows specific rate targeting.
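The contrast between the discontinuous gradient in (23) and the smooth reweighting in (25) can be seen numerically by differentiating the rate-dependent part of each loss with respect to R (scalar finite differences; the distortion term is held fixed and the multiplier value is an arbitrary toy choice):

```python
# Rate-dependent term of each loss as a function of the rate R.
r = 2.0    # target rate
lam = 0.4  # toy value for the MDR multiplier

fb  = lambda R: max(r, R)           # FB contributes max(r, R)
mdr = lambda R: R + lam * (r - R)   # MDR contributes (1 - lam) * R + const

def grad(f, x, h=1e-6):             # central finite difference
    return (f(x + h) - f(x - h)) / (2 * h)

# FB: zero gradient below the target, unit gradient above it -> jump at R = r
assert abs(grad(fb, r - 0.5) - 0.0) < 1e-6
assert abs(grad(fb, r + 0.5) - 1.0) < 1e-6

# MDR: constant gradient (1 - lam) everywhere -> no discontinuity
assert abs(grad(mdr, r - 0.5) - (1 - lam)) < 1e-6
assert abs(grad(mdr, r + 0.5) - (1 - lam)) < 1e-6
```

The FB gradient switches abruptly at the target rate, whereas the MDR gradient is the same smooth KL reweighting on both sides of it.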
Compared to the VAE objective with a fixed weight, MDR has the advantage that β is not fixed, but estimated to meet the requirements on rate. This might mitigate the problem noticed by He et al. (2019), namely that the VAE objective can lead to under-regularisation at the end of training. Similar to their technique, MDR can cut the inference network more 'slack' during the start of training, but enforce stricter regularisation at the end, once the constraint is met. We observe that this happens in practice. Furthermore, we would argue that tuning towards a specific rate r is more interpretable than tuning β.
A similar argument can be made against annealing. Although β is not fixed in this scheme, it requires multiple decisions that are not very interpretable, such as the length (number of steps) and type (e.g. linear or exponential) of the schedule.
Most similar, then, seems SFB. Like MDR, it flexibly updates β by targeting a rate. However, differences between the two techniques become apparent when we observe how β is updated. In the case of SFB:

  β ← (1 − ν) β  if R < (1 + ε) r,  and  β ← (1 + ν) β  otherwise    (26)

where r, ε and ν are hyperparameters. In the case of MDR (not taking optimiser-specific dynamics into account):

  λ ← λ + η (r − R)    (27)

where η is a learning rate. From this, we can draw the conclusion that MDR is akin to SFB without any extra hyperparameters. Yet, it also gives some insight into suitable hyperparameters for SFB; if we set νβ = η (if we always increment β with this value, that is) and ε = 0, SFB is essentially equal to performing Lagrangian relaxation on the ELBO with a constraint on the minimum rate.
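The multiplier dynamics of (27) can be simulated on a toy model. The response curve below, mapping the effective KL weight β = 1 − λ to a rate, is entirely made up (the real response comes from gradient steps on the ELBO), but it shows the qualitative behaviour: λ rises while the rate is below the target and settles once the constraint is met:

```python
# Toy simulation of the MDR multiplier update (27).
def rate(beta):
    """Assumed monotone response: a smaller KL weight yields a larger rate."""
    return max(0.0, 5.0 * (1.0 - beta))

r, eta = 3.0, 0.05  # target rate and multiplier learning rate
lam = 0.0
for _ in range(500):
    R = rate(1.0 - lam)                   # current rate under beta = 1 - lam
    lam = max(0.0, lam + eta * (r - R))   # projected ascent keeps lam >= 0

# the multiplier settles where the rate constraint is (approximately) met
assert lam > 0.0
assert abs(rate(1.0 - lam) - r) < 0.1
```

No schedule needs to be designed by hand: the update direction and magnitude follow from how far the current rate is from the target.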
All in all, this analysis shows that there is a clear relation between several of the optimisation techniques compared in this paper. MDR seems to be the most flexible, whilst requiring the least amount of hyperparameter tuning or heuristics.
Appendix C Normalising flows
This section reviews a general class of reparameterisable distributions known as normalising flows (NFs; Tabak et al., 2010). NFs express the density of a transformed variable z = f(ε) in terms of the density of a base variable ε using the change of density rule:

  q(z) = q(ε) |det J_f(ε)|^{−1}    (28)

or conversely, by application of the inverse function theorem,

  q(z) = q(ε) |det J_{f^{−1}}(z)|    (29)

where ε and z are D-dimensional and f is a differentiable and invertible transformation with Jacobian J_f(ε) = ∂f(ε)/∂ε. The change of densities rule can be used to map a sample from a complex distribution to a sample from a simple distribution, or the other way around, and it relates their densities analytically. For efficiency, it is crucial that the determinant of J_f be simple, e.g. assessed in time O(D). NFs parameterise f (or its inverse) with neural networks, where either f, the network, or both are carefully designed to comply with the aforementioned conditions.
NFs can be used where the input to the flow is a sample from a simple fixed distribution, such as a uniform or standard Gaussian, and the output is a sample from a much more complex distribution. This leads to very expressive approximate posteriors for amortised variational inference. A general strategy for designing tractable flows is to design simple transformations, each of which meets our requirements, and compose enough of them, exploiting the fact that a composition of invertible functions remains invertible. In fact, where the base distribution is a standard Gaussian and the transformation is affine with strictly positive slope (an invertible and differentiable function), the resulting distribution is a parameterised Gaussian, showing that Gaussians can be seen as a particularly simple normalising flow. NFs can also be used where the input to the flow is a data point and the output is a sample from a simple distribution; this leads to very expressive density estimators for continuous observations—differentiability and invertibility constraints preclude direct use of NFs to model discrete distributions.
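The affine special case mentioned above can be verified numerically: if ε ∼ N(0, 1) and z = aε + b with a > 0, the change-of-density rule recovers exactly the N(b, a²) density (pure Python, scalar case with toy values of a and b):

```python
import math

def std_normal_logpdf(x):
    return -0.5 * math.log(2 * math.pi) - 0.5 * x * x

def normal_logpdf(x, mu, sd):
    return -0.5 * math.log(2 * math.pi) - math.log(sd) \
           - 0.5 * ((x - mu) / sd) ** 2

a, b = 2.0, -1.0  # affine flow z = f(eps) = a * eps + b, with a > 0
eps = 0.7
z = a * eps + b

# change of density: q(z) = q(eps) * |det J_f(eps)|^{-1}, with J_f = a
log_q_flow = std_normal_logpdf(eps) - math.log(abs(a))
log_q_gauss = normal_logpdf(z, b, a)  # the N(b, a^2) density evaluated directly
assert abs(log_q_flow - log_q_gauss) < 1e-12
```

The log |a| correction accounts for how the transformation stretches probability mass; richer flows differ only in making this Jacobian term depend on the input.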
Normalising flows have been introduced in the context of variational inference (Rezende and Mohamed, 2015) and density estimation (Rippel and Adams, 2013). Various transformations have been designed, all aiming at increasing expressiveness with manageable computation (Kingma et al., 2016; Dinh et al., 2017; Papamakarios et al., 2017; Huang et al., 2018b).
C.1 Inverse autoregressive flows
In an IAF (Kingma et al., 2016), z = f(ε), where

  f(ε)_d = s_d(ε_{<d}; λ) ε_d + m_d(ε_{<d}; λ)    (30)

is a differentiable transformation whose inverse

  f^{−1}(z)_d = (z_d − m_d(ε_{<d}; λ)) / s_d(ε_{<d}; λ)    (31)

is autoregressive (the d-th output depends only on ε_{<d}). The parameters of f, i.e. s(·) and m(·), are computed by neural networks, and note that in the forward direction we can compute all transformations in parallel using a MADE (Germain et al., 2015). Moreover, the Jacobian of f is lower-triangular and thus has a simple determinant (the product of its diagonal elements). To see that, let us compute the entries of the Jacobian of f. Below the main diagonal, i.e. for k < d, we have:

  ∂f(ε)_d/∂ε_k = ε_d ∂s_d(ε_{<d}; λ)/∂ε_k + ∂m_d(ε_{<d}; λ)/∂ε_k    (32)

On the main diagonal, i.e. k = d, we have

  ∂f(ε)_d/∂ε_d = s_d(ε_{<d}; λ)    (33)

And finally, above the main diagonal, i.e. for k > d, the partial derivative is zero. The Jacobian matrix is therefore lower triangular with the d-th element of its diagonal equal to s_d(ε_{<d}; λ), which leads to efficient determinant computation:

  det J_f(ε) = ∏_{d=1}^{D} s_d(ε_{<d}; λ)    (34)

and from the inverse function theorem it holds that

  det J_{f^{−1}}(z) = (det J_f(ε))^{−1}    (35)

Therefore, where ε is sampled from a simple random source q(ε) (e.g. a Gaussian), we can assess the log-density of z = f(ε) via:

  log q(z) = log q(ε) − Σ_{d=1}^{D} log s_d(ε_{<d}; λ)    (36)
Naturally, composing transformations as the one in (30) leads to more complex distributions.
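The triangular structure of (32)–(34) can be checked with a small numeric example. Below, s and m are made-up autoregressive functions of the prefix (stand-ins for the MADE outputs); the finite-difference Jacobian comes out lower triangular, and its determinant equals the product of the diagonal scales:

```python
import math

# toy autoregressive networks: s_d and m_d depend only on eps_{<d}
def s_fn(prefix):
    return 1.0 + 0.5 * math.tanh(sum(prefix))  # strictly positive scale

def m_fn(prefix):
    return 0.3 * sum(prefix)

def flow(eps):  # f(eps)_d = s_d(eps_{<d}) * eps_d + m_d(eps_{<d})
    return [s_fn(eps[:d]) * eps[d] + m_fn(eps[:d]) for d in range(len(eps))]

eps = [0.2, -0.4, 0.9]
D, h = len(eps), 1e-6

# finite-difference Jacobian J[d][k] = d f_d / d eps_k
J = [[0.0] * D for _ in range(D)]
for k in range(D):
    ep, em = list(eps), list(eps)
    ep[k] += h; em[k] -= h
    fp, fm = flow(ep), flow(em)
    for d in range(D):
        J[d][k] = (fp[d] - fm[d]) / (2 * h)

# the upper triangle vanishes and the diagonal equals s_d(eps_{<d})
for d in range(D):
    for k in range(d + 1, D):
        assert abs(J[d][k]) < 1e-6
    assert abs(J[d][d] - s_fn(eps[:d])) < 1e-5

# determinant of a triangular matrix = product of its diagonal entries
det_analytic = math.prod(s_fn(eps[:d]) for d in range(D))
det_diag = math.prod(J[d][d] for d in range(D))
assert abs(det_diag - det_analytic) < 1e-4
```

This is why the IAF log-density in (36) needs only a sum of log-scales rather than a full determinant computation.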
C.2 Planar flows
A planar flow (Rezende and Mohamed, 2015) is based on a transformation z = f(ε) where

  f(ε) = ε + u h(w^⊤ ε + b)    (37)

where u ∈ R^D, w ∈ R^D and b ∈ R are parameters of the flow, and h is a smooth elementwise nonlinearity (we use tanh) with derivative h′. It can be shown that

  |det ∂f(ε)/∂ε| = |1 + u^⊤ ψ(ε)|    (38)

where ψ(ε) = h′(w^⊤ ε + b) w. Then, where ε is sampled from a simple random source q(ε) (e.g. a Gaussian),

  log q(z) = log q(ε) − log |1 + u^⊤ ψ(ε)|    (39)