
Preventing posterior collapse in variational autoencoders for text generation via decoder regularization

by Alban Petit, et al.

Variational autoencoders trained to minimize the reconstruction error are sensitive to the posterior collapse problem, that is, the proposal posterior distribution is always equal to the prior. We propose a novel regularization method based on fraternal dropout to prevent posterior collapse. We evaluate our approach using several metrics and observe improvements in all the tested configurations.





1 Introduction

Deep generative models like Variational Autoencoders (VAE) Kingma and Welling [2014] and Generative Adversarial Networks (GAN) Goodfellow et al. [2014], among others, enjoy great popularity in many applications of machine learning, including natural language processing (NLP). Unlike GANs, VAEs can manipulate discrete observed variables, which makes them suitable for applications in text generation. In these models, a sentence is generated according to the following process:

$$z \sim p(z), \qquad x \sim p_\theta(x \mid z),$$

where $z$ is a latent sentence representation (or sentence embedding) and $x = (x_1, \dots, x_n) \in V^n$ are observed words (or the generated sentence), with $V$ the vocabulary. The generation stops when a special end of the sentence word is generated. Without loss of generality, we assume that the prior distribution $p(z)$ is fixed. The subscript $\theta$ indicates the parameters of the conditional distribution and, in our case, $\theta$ corresponds to the parameters of a neural network. It is important to note that we do not make any independence assumption in the conditional distribution $p_\theta(x \mid z)$.
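The generative process described above can be sketched as ancestral sampling. This is only an illustration: the standard Gaussian prior, the `EOS` symbol and the `toy_decoder` function are assumptions introduced here, standing in for the neural decoder.

```python
import random

EOS = "</s>"  # hypothetical end-of-sentence symbol

def toy_decoder(z, prefix):
    """Hypothetical stand-in for p_theta(x_i | x_<i, z): returns {word: prob}."""
    # The chance of stopping grows with the prefix length; z shifts the word choice.
    p_stop = min(1.0, 0.2 * len(prefix))
    p_cat = (1.0 - p_stop) * (0.9 if z[0] > 0 else 0.1)
    return {EOS: p_stop, "cat": p_cat, "dog": 1.0 - p_stop - p_cat}

def generate(decoder, latent_dim=2, max_len=20, rng=random):
    # z ~ p(z): assumed here to be a standard Gaussian prior.
    z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
    words = []
    while len(words) < max_len:
        dist = decoder(z, words)  # depends on z and on all previous words
        word = rng.choices(list(dist), weights=list(dist.values()))[0]
        if word == EOS:  # generation stops at the end-of-sentence word
            break
        words.append(word)
    return z, words
```

Each word is drawn conditionally on the latent vector and on the full prefix, which mirrors the absence of independence assumptions in the conditional distribution.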


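Training aims to search for parameters $\theta$ that maximize the likelihood of observed data, also called the evidence:

$$\max_\theta \; \mathbb{E}_{\hat{p}(x)}\left[\log p_\theta(x)\right] = \max_\theta \; \mathbb{E}_{\hat{p}(x)}\left[\log \int p(z)\, p_\theta(x \mid z)\, dz\right], \tag{1}$$

where $\hat{p}(x)$ is the empirical training data distribution. In general, the objective function is intractable because of the marginalization over latent variables $z$. Variational methods propose to introduce a proposal distribution $q_\phi(z \mid x)$ to create a surrogate objective called the evidence lower bound (ELBO) defined as follows:

$$\mathcal{L}(x; \theta, \phi) = \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - \mathrm{KL}\left[q_\phi(z \mid x) \,\|\, p(z)\right], \tag{2}$$

where the ELBO defines a family of lower bounds on $\log p_\theta(x)$ parameterized by $\phi$, i.e. $\forall \phi: \log p_\theta(x) \ge \mathcal{L}(x; \theta, \phi)$. During training, it is important to search for proposal parameters $\phi$ that give the best bound (i.e. maximize the bound), hence the name variational. The new training problem is then:

$$\max_{\theta, \phi} \; \mathbb{E}_{\hat{p}(x)}\left[\mathcal{L}(x; \theta, \phi)\right].$$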


The Expectation-Maximization (EM) algorithm Dempster et al. [1977] solves this problem via block coordinate ascent, i.e. maximizing successively with respect to $\phi$ (E step) and $\theta$ (M step). Unlike EM, the VAE approach consists in optimizing this problem by joint stochastic gradient ascent over $\theta$ and $\phi$. Moreover, unlike in the standard applications of EM, neither the distribution nor the family of the posterior $p_\theta(z \mid x)$ are known. As a result, an independence assumption is made over the coordinates of $z$ in the distribution $q_\phi(z \mid x)$ (also called mean field distribution). Finally, the distribution $q_\phi(z \mid x)$ is amortized over the data and parameterized by a neural network. It is worth noting that during training, the reconstruction term in equation 2 is estimated via the Monte-Carlo method using a single sample. It is usual to call the distribution $q_\phi(z \mid x)$ that generates a latent representation from a sentence the encoder and the distribution $p_\theta(x \mid z)$ that reconstructs the original sentence the decoder.
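As an illustration of the single-sample estimate, here is a minimal sketch assuming a diagonal Gaussian encoder and a standard Gaussian prior (neither is stated explicitly in the text above). The callable `log_likelihood` is a hypothetical placeholder for the decoder's $\log p_\theta(x \mid z)$.

```python
import math
import random

def gaussian_kl(mu, log_var):
    """KL[N(mu, diag(exp(log_var))) || N(0, I)] in closed form."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))

def single_sample_elbo(mu, log_var, log_likelihood, rng=random):
    """One-sample Monte Carlo estimate of the ELBO for one sentence.

    `log_likelihood(z)` is a hypothetical callable returning log p_theta(x | z).
    """
    # Reparameterized sample z = mu + sigma * eps, with eps ~ N(0, I).
    z = [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]
    return log_likelihood(z) - gaussian_kl(mu, log_var)
```

Under these assumptions only the reconstruction term is sampled; the divergence with the prior is computed exactly, and it is zero when the encoder outputs the prior itself.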

Unfortunately, in practice, VAEs for automatic text generation are sensitive to posterior collapse Bowman et al. [2016]. Informally, this means that the proposal distribution is not optimized correctly and remains close to the prior distribution for all data points: $\forall x: q_\phi(z \mid x) \approx p(z)$. This leads to a poor approximation of the objective 1 by the ELBO. In the end, the decoder ignores the latent variable and no sentence representation is learned.
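The quality of the approximation can be made explicit with a standard identity, written here with $q_\phi(z \mid x)$ for the proposal distribution and $p_\theta(z \mid x)$ for the true posterior of the generative model:

```latex
\begin{align*}
\log p_\theta(x)
  &= \mathbb{E}_{q_\phi(z \mid x)}\left[\log \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right]
   + \mathrm{KL}\left[q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\right] \\
  &= \underbrace{\mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right]
   - \mathrm{KL}\left[q_\phi(z \mid x) \,\|\, p(z)\right]}_{\text{ELBO}}
   + \underbrace{\mathrm{KL}\left[q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\right]}_{\ge 0}
\end{align*}
```

The gap between the evidence and the ELBO is therefore the divergence between the proposal and the true posterior. If the proposal stays at the prior, the bound is tight only when the true posterior is itself equal to the prior, that is when the decoder does not use $z$.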

Previous work on the posterior collapse problem can be classified in two categories:

  1. On the one hand, modifications to the objective function were suggested. Bowman et al. [2016] reweight the divergence with the prior term to temper its importance during training. Kingma et al. [2016] and Pelsmaeker and Aziz [2020] proposed to introduce constraints to force the divergence term to be greater than a hyperparameter. Finally, alternative surrogate objectives to train VAEs in the context of text generation have been proposed Livne et al. [2020], Havrylov and Titov [2020].

  2. On the other hand, several authors proposed to modify the architecture of the decoder so that it is forced to rely on latent variable information. Yang et al. [2017] replace the recurrent neural network in the decoder by a convolutional neural network. Dieng et al. [2019] proposed to use skip connections between the latent representation and the various hidden layers of the decoder.

In this work, we propose a different approach to prevent the posterior collapse problem based on parameter regularization, that is without changing the generative model objective function or the architecture of the decoder. Our contributions can be summarized as follows:

  • We propose to regularize the decoder parameters to prevent the posterior collapse. In particular, we use fraternal dropout Zolna et al. [2018] to force the decoder to use the latent representation.

  • We evaluate our approach in various settings and report improved results with respect to several evaluation metrics.

We hope to encourage future research to explore this direction to improve VAEs for text generation.

2 Decoder regularization: fraternal dropout

We first give an intuitive motivation for our approach. LSTMs have been widely used as neural language models and achieve competitive results Merity et al. [2018] (see footnote 1). Language models are trained via the same reconstruction term as the one used in VAEs, showing that these neural architectures can efficiently maximize this term in the absence of latent variables, i.e. it can be maximized while ignoring the latent variable values. To bypass this issue, we propose to introduce a regularization term in the objective that forces the hidden representations computed by the LSTM to be similar even if different words in the input are masked. The decoder is then forced to rely on latent variable information. To this end, we propose to rely on fraternal dropout Zolna et al. [2018].

Footnote 1: Our method is agnostic of the decoder architecture. As such, the same methodology can be directly applied to self-attentive networks/transformers Vaswani et al. [2017].

The reconstruction term of the ELBO maximizes the log-likelihood using the usual teacher forcing technique for language models: during training, the auto-regressive model is trained to predict the next word based on gold previous words. Let $x = (x_1, \dots, x_n)$ be a sentence of length $n$. Each word $x_i$ is represented by a vector taken from an embedding table. We denote all of these vectors by a matrix $E \in \mathbb{R}^{d \times n}$ where $d$ is the word embedding dimension. A contextual representation is computed for each position in the sentence using an LSTM:

$$H = \mathrm{LSTM}(E, z),$$

where $H \in \mathbb{R}^{h \times n}$ is a matrix containing hidden representations of dimension $h$ for each sentence position. The latent variable $z$ is projected and then given as an initialization for both the memory and the hidden state and also concatenated to each input. We refer the reader to Li et al. [2019] for more details on the architecture and the various parameters.
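The two input-side mechanisms mentioned above, teacher forcing and the concatenation of the latent variable to each input, can be sketched as follows. The function names and the begin-of-sentence identifier `bos_id` are assumptions made for the illustration; the actual architecture is the one of Li et al. [2019].

```python
import numpy as np

def decoder_inputs(embeddings, z):
    """Concatenate the latent vector z to every input embedding.

    embeddings: array of shape (d, n), one column per gold input word.
    z: array of shape (k,). Returns an array of shape (d + k, n).
    """
    n = embeddings.shape[1]
    return np.vstack([embeddings, np.tile(z[:, None], (1, n))])

def teacher_forcing_pairs(word_ids, bos_id, eos_id):
    """Inputs are the gold previous words, targets are the next words."""
    inputs = [bos_id] + list(word_ids)
    targets = list(word_ids) + [eos_id]
    return inputs, targets
```

For a sentence with word identifiers `[5, 8, 3]`, the model reads `[bos_id, 5, 8, 3]` and is trained to predict `[5, 8, 3, eos_id]`, one position at a time.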

Word dropout Dozat and Manning [2017] consists in randomly replacing embeddings by a vector of zeros during training to prevent overfitting:

$$E' = E \, \mathrm{diag}(m),$$

where $E$ is the matrix of input embeddings and $m \in \{0, 1\}^n$ is a vector of booleans where each element is independently drawn from a Bernoulli distribution of parameter $p$. The matrix $E'$ corresponds to the matrix $E$ where a number of columns are filled with zeros. Fraternal dropout consists in creating two matrices $E'$ and $E''$ as follows:

$$E' = E \, \mathrm{diag}(m'), \qquad E'' = E \, \mathrm{diag}(m''),$$

where $m'$ and $m''$ are two independently drawn masks. Matrices $E'$ and $E''$ are both used to compute the log-likelihood of the sentence and the mean log-likelihood then replaces the original reconstruction term. Finally, a regularization term penalizing the distance between the two sets of hidden representations is introduced in the objective function:

$$\kappa \, \lVert \mathrm{LSTM}(E', z) - \mathrm{LSTM}(E'', z) \rVert_2^2,$$

where $\kappa$ is a hyperparameter. Note that the regularization term forces the hidden representations computed by the LSTM to be similar with different masked inputs $E'$ and $E''$ but the same latent variable $z$. Hence, the regularization term forces the decoder to rely on latent variables to compute the LSTM’s hidden representations.
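A minimal sketch of word dropout and of the fraternal dropout penalty is given below. Several things are assumptions made for the illustration: a small tanh recurrent network (`toy_rnn`) replaces the LSTM, the latent vector directly initializes the hidden state without a projection, and the penalty is a sum of squared differences scaled by a weight called `kappa` here.

```python
import numpy as np

def word_dropout(embeddings, drop_prob, rng):
    """Zero out whole columns (words) of a (d, n) embedding matrix."""
    keep = rng.random(embeddings.shape[1]) >= drop_prob  # one boolean per word
    return embeddings * keep[None, :]

def toy_rnn(embeddings, z, w_in, w_rec):
    """Hypothetical tanh RNN standing in for the LSTM decoder; returns (h, n)."""
    h = np.tanh(z)  # the latent variable initializes the hidden state
    states = []
    for x_t in embeddings.T:  # iterate over sentence positions
        h = np.tanh(w_in @ x_t + w_rec @ h)
        states.append(h)
    return np.stack(states, axis=1)

def fraternal_penalty(embeddings, z, w_in, w_rec, drop_prob, kappa, rng):
    # Two independent word-dropout masks, same latent variable z.
    h1 = toy_rnn(word_dropout(embeddings, drop_prob, rng), z, w_in, w_rec)
    h2 = toy_rnn(word_dropout(embeddings, drop_prob, rng), z, w_in, w_rec)
    return kappa * np.sum((h1 - h2) ** 2)
```

With `drop_prob=0.0` both passes see the same input and the penalty vanishes. With masking, the penalty is small only if the hidden states depend little on which words were dropped, which leaves the shared latent vector as the stable source of information.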

3 Experiments

We evaluate our approach using the code distributed by Li et al. [2019]. We kept the same hyperparameters as the ones used by the authors to avoid skewing the results in our favor.

We experiment with two datasets: Yelp Shen et al. [2017] and the Stanford Natural Language Inference (SNLI) Bowman et al. [2015] data. These two datasets were subsampled to contain 100,000 training sentences and 10,000 evaluation and testing sentences. SNLI and Yelp have a respective vocabulary of 9,990 and 8,411 words and both have an average of 10 words per sentence.

3.1 Evaluation metrics

We use several metrics to evaluate the quality of the learned generative models.

(Log-likelihood and perplexity per word) The negative log-likelihood (NLL) indicates how well the model reconstructs the input sentence. The perplexity per word (PPL) is the geometric mean of the reciprocal of the probability assigned to the correct word by the model. Since the NLL is the opposite of the log-likelihood and the PPL is built from reciprocal probabilities, we want to minimize both of these metrics. They are approximated by sampling 100 times in the distribution $q_\phi(z \mid x)$ for each sentence.
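The relation between the two metrics can be sketched as follows, assuming the probabilities assigned to the gold words are already available (the sampling-based approximation mentioned above is not reproduced here).

```python
import math

def nll_and_ppl(word_probs):
    """word_probs: probability assigned by the model to each gold word."""
    nll = -sum(math.log(p) for p in word_probs)
    ppl = math.exp(nll / len(word_probs))  # geometric mean of 1 / p
    return nll, ppl
```

For instance, two words predicted with probabilities 0.5 and 0.25 give a perplexity of $\sqrt{2 \times 4} \approx 2.83$.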

(BLEU score) For each sentence in the test set, a latent representation is sampled in the distribution $q_\phi(z \mid x)$ and a sentence is generated from this representation without using teacher forcing. The BLEU score Papineni et al. [2002] represents the proportion of n-grams (for n going from 1 to 4) in the generated sentence that can be found in the original one. If the generated sentence is shorter than the original, a penalty is applied since it is easier to avoid mistakes when fewer words are produced.
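The description above corresponds to the following simplified sentence-level computation. It is a sketch with a single reference, uniform weights and no smoothing, not the exact evaluation script used for the reported numbers.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(candidate, reference, max_n=4):
    """Simplified single-reference BLEU with uniform weights over 1..max_n-grams."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        total = sum(cand.values())
        # Clipped counts: an n-gram is credited at most as often as it occurs in the reference.
        matched = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or matched == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(matched / total))
    # Brevity penalty: only applied when the candidate is shorter than the reference.
    c, r = len(candidate), len(reference)
    penalty = 1.0 if c >= r else math.exp(1.0 - r / c)
    return penalty * math.exp(sum(log_precisions) / max_n)
```

A candidate that copies only the first five words of an eight-word reference has perfect n-gram precision but is scaled down by the brevity penalty $\exp(1 - 8/5)$.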

(Active units) The number of active units (AU) represents the number of dimensions of the latent variable that co-vary with the observations. According to Burda et al. [2016], a greater number of active units is usually representative of a richer latent variable. We follow their article for the threshold value. The dimension of the latent variable is 32, therefore we will have at most 32 active units.
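Counting active units can be sketched as follows, assuming the posterior mean of each sentence is available as a row of an array. A dimension is counted as active when the variance of its posterior mean over the dataset exceeds a threshold, which is left as a parameter here.

```python
import numpy as np

def count_active_units(posterior_means, threshold):
    """posterior_means: array of shape (num_sentences, latent_dim) holding E_q[z | x]."""
    variances = np.var(posterior_means, axis=0)  # variance over the dataset, per dimension
    return int(np.sum(variances > threshold))
```

A dimension whose posterior mean is the same for every sentence has zero variance and is never counted, whatever the threshold.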

(Mutual Information) We also report the mutual information between the latent variable and the output probability distribution. A higher mutual information indicates that the latent variable is better used by the model. We follow the methodology of He et al. [2019].
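One common Monte Carlo estimator of this quantity, in the spirit of He et al. [2019], is sketched below. It assumes diagonal Gaussian posteriors and approximates the aggregated posterior by a mixture over the evaluated sentences; it is an illustration and not the exact evaluation code.

```python
import numpy as np

def mutual_information(mu, log_var, rng):
    """Monte Carlo estimate of I(x; z) = E_x E_q[log q(z|x)] - E_q(z)[log q(z)].

    mu, log_var: arrays of shape (N, d) describing diagonal Gaussian posteriors.
    """
    n, d = mu.shape
    # First term: negative entropy of each Gaussian posterior, averaged over x.
    neg_entropy = np.mean(-0.5 * d * np.log(2 * np.pi) - 0.5 * np.sum(1.0 + log_var, axis=1))
    # One sample z_i ~ q(z | x_i) per sentence.
    z = mu + np.exp(0.5 * log_var) * rng.standard_normal((n, d))
    # log q(z_i | x_j) for every pair (i, j), shape (N, N).
    diff = z[:, None, :] - mu[None, :, :]
    log_q = -0.5 * np.sum(diff ** 2 / np.exp(log_var)[None] + log_var[None] + np.log(2 * np.pi), axis=2)
    # Aggregated posterior q(z_i) = (1/N) sum_j q(z_i | x_j), via a stable log-sum-exp.
    m = log_q.max(axis=1, keepdims=True)
    log_qz = (m + np.log(np.sum(np.exp(log_q - m), axis=1, keepdims=True))).ravel() - np.log(n)
    return float(neg_entropy - np.mean(log_qz))
```

With this estimator the value is bounded by roughly $\log N$ for $N$ evaluated sentences, and it is close to zero when every posterior equals the prior.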

3.2 Baseline

We compare our approach to previous work including a "standard" VAE, the free bits technique Kingma et al. [2016] and the pre-training approach Li et al. [2019]:

  • (Free bits) The free bits technique consists in introducing a constraint on the divergence term with the prior distribution so that it doesn’t fall below a pre-determined threshold. We follow previous work for the value of this threshold in all of our experiments.

  • (Pre-training) This baseline consists in training the model as a classic autoencoder first. Then the decoder is reset and the model is trained as a VAE.
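The free bits constraint can be illustrated with a deliberately simplified sketch that applies the threshold to the whole divergence term; published variants apply it per dimension or per group of dimensions, and the function name is an assumption.

```python
def free_bits_kl(kl, threshold):
    """Divergence term used in the loss: it cannot go below `threshold`."""
    # Below the threshold the term is constant, so nothing pushes the KL further toward 0.
    return max(kl, threshold)
```

With a threshold of 2.0, a divergence of 0.5 is replaced by 2.0 in the loss, while a divergence of 3.0 is left unchanged.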

All of our experiments reweight the divergence term during the training of the model. This technique, proposed by Bowman et al. [2016], consists in steadily increasing the reweighting factor from 0 to 1 during the first epochs of training. The idea is to force the model to ignore the divergence with prior term in the beginning of training.
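A possible schedule for the reweighting factor is sketched below. The linear shape and the `warmup_steps` parameter are assumptions; the text only states that the factor increases steadily from 0 to 1.

```python
def kl_weight(step, warmup_steps):
    """Linear annealing of the divergence weight from 0 to 1."""
    if warmup_steps <= 0:
        return 1.0
    return min(1.0, step / warmup_steps)

def annealed_objective(reconstruction, kl, step, warmup_steps):
    # Reweighted ELBO (to be maximized): reconstruction - weight * KL.
    return reconstruction - kl_weight(step, warmup_steps) * kl
```

At step 0 the objective reduces to the reconstruction term alone, and after the warm-up it coincides with the ELBO.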

3.3 Results and analysis

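Table 1: Impact of the fraternal dropout hyperparameter over the Yelp dataset. We want to maximize the metrics with a ↑ and minimize those with a ↓.

| Hyperparameter | NLL ↓ | PPL ↓ | AU ↑ | MI ↑ | BLEU ↑ |
|---|---|---|---|---|---|
| 0.01 | 28.91 | 18.44 | 4 | 6.34 | 5.20 |
| 0.1 | 29.50 | 19.12 | 5 | 7.03 | 6.40 |
| 0.5 | 30.69 | 21.53 | 6 | 7.47 | 6.81 |
| 1.0 | 32.30 | 25.28 | 4 | 6.87 | 6.03 |
| 2.0 | 33.41 | 28.27 | 4 | 7.29 | 6.09 |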

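Table 2: Results of the various metrics over Yelp and SNLI for four different VAEs without and with fraternal dropout. We want to maximize the metrics with a ↑ and minimize those with a ↓.

| Model | Yelp NLL ↓ | Yelp PPL ↓ | Yelp AU ↑ | Yelp MI ↑ | Yelp BLEU ↑ | SNLI NLL ↓ | SNLI PPL ↓ | SNLI AU ↑ | SNLI MI ↑ | SNLI BLEU ↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| Standard | 33.40 | 28.25 | 2 | 1.14 | 1.43 | 32.57 | 20.64 | 3 | 0.52 | 2.32 |
| + fraternal dropout | 29.50 | 19.12 | 5 | 7.03 | 6.40 | 30.01 | 16.28 | 2 | 4.75 | 5.73 |
| Free bits | 29.54 | 19.20 | 32 | 5.69 | 4.02 | 28.88 | 14.66 | 32 | 4.63 | 4.77 |
| + fraternal dropout | 25.46 | 12.76 | 32 | 8.65 | 11.23 | 27.92 | 13.40 | 32 | 7.11 | 8.44 |
| Pre-train. | 33.74 | 29.21 | 2 | 0.71 | 0.83 | 31.76 | 19.14 | 3 | 1.14 | 2.76 |
| + fraternal dropout | 26.18 | 13.71 | 22 | 8.24 | 9.69 | 24.69 | 9.92 | 22 | 8.32 | 13.43 |
| Free bits + Pre-train. | 25.93 | 13.37 | 32 | 8.14 | 7.54 | 23.33 | 8.75 | 32 | 8.49 | 13.9 |
| + fraternal dropout | 23.63 | 10.62 | 32 | 8.81 | 13.54 | 21.00 | 7.04 | 32 | 9.07 | 21.35 |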

We first evaluate the impact of the fraternal dropout hyperparameter on the Yelp dataset. Results are reported in Table 1. We can observe that a compromise needs to be found between all the metrics, i.e. we need to find a point of equilibrium between minimizing the NLL and the PPL and maximizing AU, MI and the BLEU score. In the following experiments, we fixed the fraternal dropout hyperparameter to 0.1, the value whose Table 1 row matches the standard VAE with fraternal dropout on Yelp in Table 2.

We report results in different configurations in Table 2. Our approach improves NLL, PPL, MI and BLEU in all configurations for both datasets. The number of active units increases or stays at its maximum of 32, except for the standard VAE on SNLI where it goes from 3 to 2. The posterior collapse problem is significant in configurations not using free bits or fraternal dropout, the mutual information being around 1 and the number of active units being 2 or 3. Adding fraternal dropout results in a gain between 4.2 and 7.5 points for the mutual information in these two configurations. In configurations where the model is pre-trained, we observe that this pre-training alone does not prevent the posterior collapse since there are only 2 and 3 active units on Yelp and SNLI respectively, while the same model with fraternal dropout retains 22 active units. Interestingly, our approach has a bigger impact than free bits on Yelp and a similar one on SNLI while using fewer active units in both cases. This is an indication that the free bits technique forces the latent variables to be decorrelated artificially, i.e. the decoder still ignores their values.
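The mutual information gains for the two configurations without free bits can be recomputed directly from the values of Table 2, as in this short worked example (the dictionary layout is only a convenience for the illustration).

```python
# Mutual information from Table 2: (without, with) fraternal dropout.
mi = {
    ("Yelp", "Standard"): (1.14, 7.03),
    ("Yelp", "Pre-train."): (0.71, 8.24),
    ("SNLI", "Standard"): (0.52, 4.75),
    ("SNLI", "Pre-train."): (1.14, 8.32),
}
gains = {key: round(after - before, 2) for key, (before, after) in mi.items()}
for (dataset, model), gain in gains.items():
    print(f"{dataset:5s} {model:11s} +{gain:.2f}")
```

The smallest gain is obtained for the standard VAE on SNLI and the largest for the pre-trained model on Yelp.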

As explained previously, the BLEU score is computed over sentences generated without teacher forcing and therefore can also be used to estimate the quality of the latent representations. Once again, fraternal dropout improves results on this metric.


We show some examples of sentences generated via interpolation between two latent representations sampled from the prior in Table 3. We see that our method seems to produce coherent sentences with a gradual change in length and meaning between the successive sentences.

Table 3: Examples of interpolations between two representations sampled from the prior for the configuration with a pre-training and the free bits over the SNLI dataset. The second column also includes fraternal dropout.

| Free bits + Pre-train | Free bits + Pre-train + fraternal dropout |
|---|---|
| a boy is in front of a group of people. | the young boy is in a picture. |
| a man in a blue shirt is standing in front of a crowd of people. | the young child is in front of a mother. |
| a child in blue is holding a camera. | the small child is in front of a mother. |
| a child in blue pants holding a camera while another man watches. | a small child in pink holds a picture of her mother. |
| a child in blue pants holding a camera while another man in a black shirt looks on. | a small child in pink sits in a picture with her mother. |
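The latent vectors behind such an interpolation can be produced as sketched below. Linear interpolation and the standard Gaussian prior are assumptions, since the text does not specify them; each vector of `path` would then be given to the decoder to generate one sentence.

```python
import random

def interpolate(z_start, z_end, num_steps):
    """Linearly interpolate between two latent vectors (both endpoints included)."""
    path = []
    for step in range(num_steps):
        t = step / (num_steps - 1)
        path.append([(1 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return path

rng = random.Random(0)
latent_dim = 32  # latent dimension reported in Section 3.1
z_a = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]  # samples from an assumed N(0, I) prior
z_b = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
path = interpolate(z_a, z_b, num_steps=5)  # five sentences per column in Table 3
```

The function requires at least two steps, since both endpoints are part of the path.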

4 Conclusion

In this work, we propose to rely on parameter regularization to prevent the posterior collapse problem in VAEs. This approach is different from previous work in the literature. We observe that our approach has two benefits: it improves the quality of generated text and increases the use of the latent variable. Future work could explore other methods for parameter regularization Kanuparthi et al. [2019], Krueger et al. [2016], Gal and Ghahramani [2016].

We thank François Yvon and Matthieu Labeau for proofreading the article. This work benefited from computations done on the Saclay-IA platform and from access to the computational resources of IDRIS through the resource allocation 20XX-AD11011600 attributed by GENCI.


References

  • S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning (2015) A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP).
  • S. R. Bowman, L. Vilnis, O. Vinyals, A. Dai, R. Jozefowicz, and S. Bengio (2016) Generating sentences from a continuous space. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, Berlin, Germany, pp. 10–21.
  • Y. Burda, R. B. Grosse, and R. Salakhutdinov (2016) Importance weighted autoencoders. In Proceedings of 4th International Conference on Learning Representations, Y. Bengio and Y. LeCun (Eds.).
  • A. P. Dempster, N. M. Laird, and D. B. Rubin (1977) Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological) 39 (1), pp. 1–22.
  • A. B. Dieng, Y. Kim, A. M. Rush, and D. M. Blei (2019) Avoiding latent variable collapse with generative skip models. In Proceedings of Machine Learning Research, K. Chaudhuri and M. Sugiyama (Eds.), Vol. 89, pp. 2397–2405.
  • T. Dozat and C. D. Manning (2017) Deep biaffine attention for neural dependency parsing. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings.
  • Y. Gal and Z. Ghahramani (2016) A theoretically grounded application of dropout in recurrent neural networks. In Advances in Neural Information Processing Systems, D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett (Eds.), Vol. 29, pp. 1019–1027.
  • I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.), pp. 2672–2680.
  • S. Havrylov and I. Titov (2020) Preventing posterior collapse with Levenshtein variational autoencoder. arXiv preprint arXiv:2004.14758.
  • J. He, D. Spokoyny, G. Neubig, and T. Berg-Kirkpatrick (2019) Lagging inference networks and posterior collapse in variational autoencoders. In Proceedings of the 7th International Conference on Learning Representations.
  • B. Kanuparthi, D. Arpit, G. Kerg, N. R. Ke, I. Mitliagkas, and Y. Bengio (2019) H-detach: modifying the LSTM gradient towards better optimization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019.
  • D. P. Kingma and M. Welling (2014) Auto-encoding variational Bayes. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.).
  • D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling (2016) Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Eds.), pp. 4743–4751.
  • D. Krueger, T. Maharaj, J. Kramár, M. Pezeshki, N. Ballas, N. R. Ke, A. Goyal, Y. Bengio, H. Larochelle, A. C. Courville, and C. Pal (2016) Zoneout: regularizing RNNs by randomly preserving hidden activations. CoRR abs/1606.01305.
  • B. Li, J. He, G. Neubig, T. Berg-Kirkpatrick, and Y. Yang (2019) A surprisingly effective fix for deep latent variable modeling of text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 3603–3614.
  • M. Livne, K. Swersky, and D. J. Fleet (2020) SentenceMIM: a latent variable language model. arXiv preprint arXiv:2003.02645.
  • S. Merity, N. S. Keskar, and R. Socher (2018) Regularizing and optimizing LSTM language models. In Proceedings of ICLR 2018.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318.
  • T. Pelsmaeker and W. Aziz (2020) Effective estimation of deep generative language models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 7220–7236.
  • T. Shen, T. Lei, R. Barzilay, and T. Jaakkola (2017) Style transfer from non-parallel text by cross-alignment. In Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30, pp. 6830–6841.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30.
  • Z. Yang, Z. Hu, R. Salakhutdinov, and T. Berg-Kirkpatrick (2017) Improved variational autoencoders for text modeling using dilated convolutions. In Proceedings of the 34th International Conference on Machine Learning, D. Precup and Y. W. Teh (Eds.), Proceedings of Machine Learning Research, Vol. 70, International Convention Centre, Sydney, Australia, pp. 3881–3890.
  • K. Zolna, D. Arpit, D. Suhubdy, and Y. Bengio (2018) Fraternal dropout. In International Conference on Learning Representations.