Deep generative models like Variational Autoencoders (VAE) Kingma and Welling and Generative Adversarial Networks (GAN) Goodfellow et al., among others, enjoy great popularity in many applications of machine learning, including natural language processing (NLP). Unlike GANs, VAEs can manipulate discrete observed variables, which makes them suitable for applications in text generation. In these models, a sentence is generated according to the following process:

$z \sim p(z), \qquad w_t \sim p_\theta(w_t \mid w_{<t}, z) \quad \text{for } t = 1, 2, \dots$
where $z$ is a latent sentence representation (or sentence embedding) and $w = w_1, \dots, w_n \in \mathcal{V}^n$ are the observed words (or the generated sentence), with $\mathcal{V}$ the vocabulary. The generation stops when a special end-of-sentence word is generated. Without loss of generality, we assume that the prior distribution $p(z)$ is fixed. The subscript $\theta$ indicates the parameters of the conditional distribution and, in our case, corresponds to the parameters of a neural network. It is important to note that we do not make any independence assumption in the conditional distribution $p_\theta(w \mid z)$.
Training aims to search for parameters $\theta$ that maximize the likelihood of the observed data, also called the evidence:

$\max_\theta \; \mathbb{E}_{w \sim \mathcal{D}} \left[ \log p_\theta(w) \right] \qquad (1)$

where $\mathcal{D}$ is the empirical training data distribution. In general, this objective function is intractable because of the marginalization over latent variables $z$. Variational methods propose to introduce a proposal distribution $q_\phi(z \mid w)$ to create a surrogate objective called the evidence lower bound (ELBO), defined as follows:

$\mathrm{ELBO}(\theta, \phi; w) = \mathbb{E}_{q_\phi(z \mid w)} \left[ \log p_\theta(w \mid z) \right] - \mathrm{KL}\left( q_\phi(z \mid w) \,\|\, p(z) \right) \qquad (2)$

where the ELBO defines a family of lower bounds on $\log p_\theta(w)$ parameterized by $\phi$, i.e. $\mathrm{ELBO}(\theta, \phi; w) \le \log p_\theta(w)$. During training, it is important to search for proposal parameters $\phi$ that give the best bound (i.e. maximize the bound), hence the name variational. The new training problem is then:

$\max_{\theta, \phi} \; \mathbb{E}_{w \sim \mathcal{D}} \left[ \mathrm{ELBO}(\theta, \phi; w) \right]$
The Expectation-Maximization (EM) algorithm Dempster et al. solves this problem via block coordinate ascent, i.e. maximizing successively with respect to $\phi$ (E step) and $\theta$ (M step). Unlike EM, the VAE approach consists in optimizing this problem by joint stochastic gradient ascent over $\theta$ and $\phi$. Moreover, unlike in the standard applications of EM, neither the posterior distribution $p_\theta(z \mid w)$ nor the family it belongs to is known. As a result, an independence assumption is made over the coordinates of $z$ in the distribution $q_\phi$ (also called a mean field distribution). Finally, the distribution $q_\phi$ is amortized over the data and parameterized by a neural network. It is worth noting that during training, the reconstruction term in Equation 2 is estimated via the Monte-Carlo method using a single sample. It is usual to call the distribution $q_\phi(z \mid w)$ that generates a latent representation from a sentence the encoder, and the distribution $p_\theta(w \mid z)$ that reconstructs the original sentence the decoder.
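As a concrete illustration, the single-sample estimate of the ELBO can be sketched for a one-dimensional Gaussian posterior and a standard Gaussian prior. This is a minimal sketch: the function `log_px_given_z` stands in for the decoder's log-likelihood and is not part of the paper's model.

```python
import math
import random

def elbo_single_sample(mu, sigma, log_px_given_z):
    """One-sample Monte Carlo estimate of the ELBO for a 1-D Gaussian
    posterior q(z|w) = N(mu, sigma^2) and a prior p(z) = N(0, 1).
    `log_px_given_z` is a caller-supplied stand-in for log p(w|z)."""
    eps = random.gauss(0.0, 1.0)
    z = mu + sigma * eps                 # reparameterization trick
    recon = log_px_given_z(z)            # reconstruction term, single sample
    # KL(N(mu, sigma^2) || N(0, 1)) has a closed form for Gaussians
    kl = 0.5 * (mu ** 2 + sigma ** 2 - 1.0 - 2.0 * math.log(sigma))
    return recon - kl
```

Note that when the posterior equals the prior (mu = 0, sigma = 1) the KL term vanishes: this is exactly the posterior-collapse situation discussed below, where the bound is maximized without the latent variable carrying information.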
Unfortunately, in practice VAEs for automatic text generation are sensitive to posterior collapse Bowman et al.. Informally, this means that the proposal distribution is not optimized correctly and remains close to the prior distribution for all data points: $q_\phi(z \mid w) \approx p(z)$. This leads to a poor approximation of the objective (1) by the ELBO. In the end, the decoder ignores the latent variable and no sentence representation is learned.
Previous work on the posterior collapse problem can be classified in two categories:
On the one hand, modifications to the objective function have been suggested. Bowman et al. reweight the divergence-with-the-prior term to temper its importance during training. Kingma et al. and Pelsmaeker and Aziz proposed to introduce constraints that force the divergence term to be greater than a hyperparameter. Finally, alternative surrogate objectives to train VAEs in the context of text generation have been proposed Livne et al., Havrylov and Titov.

On the other hand, several authors proposed to modify the architecture of the decoder so that it is forced to rely on latent variable information. Yang et al. replaced the recurrent decoder with dilated convolutions, and Dieng et al. proposed to use skip connections between the latent representation and the various hidden layers of the decoder.
In this work, we propose a different approach to prevent the posterior collapse problem based on parameter regularization, that is, without changing the generative model's objective function or the architecture of the decoder. Our contributions can be summarized as follows:

We propose to regularize the decoder parameters to prevent posterior collapse. In particular, we use fraternal dropout Zolna et al. to force the decoder to use the latent representation.

We evaluate our approach in various settings and report improved results with respect to several evaluation metrics.
We hope to encourage future research to explore this direction to improve VAEs for text generation.
2 Decoder regularization: fraternal dropout
We first give an intuitive motivation for our approach. LSTMs have been widely used as neural language models and achieve competitive results Merity et al.. Note that our method is agnostic to the decoder architecture: the same methodology can be directly applied to self-attentive networks/transformers Vaswani et al..
Language models are trained via the same reconstruction term as the one used in VAEs, showing that these neural architectures can efficiently maximize this term in the absence of latent variables, i.e. the term can be maximized while ignoring the values of the latent variables. To bypass this issue, we propose to introduce a regularization term in the objective that forces the hidden representations computed by the LSTM to be similar even when different words in the input are masked. The decoder is then forced to rely on latent variable information. To this end, we propose to rely on fraternal dropout Zolna et al..
The reconstruction term of the ELBO maximizes the log-likelihood using the usual teacher forcing technique for language models: during training, the auto-regressive model is trained to predict the next word based on the gold previous words. Let $w = w_1, \dots, w_n$ be a sentence of length $n$. Each word $w_t$ is represented by a vector taken from an embedding table. We denote all of these vectors by a matrix $X \in \mathbb{R}^{e \times n}$, where $e$ is the word embedding dimension. A contextual representation is computed for each position in the sentence using an LSTM:

$H = \mathrm{LSTM}(X)$

where $H \in \mathbb{R}^{d \times n}$ is a matrix containing hidden representations of dimension $d$ for each sentence position. The latent variable $z$ is projected and then given as an initialization for both the memory and the hidden state, and is also concatenated to each input. We refer the reader to Li et al. for more details on the architecture and the various parameters.
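The latent conditioning step above can be sketched as follows. This is a minimal illustration of the data flow only; the names `W_init` and `b_init` are hypothetical projection parameters, not taken from the paper's code.

```python
def condition_on_latent(z, embeddings, W_init, b_init):
    """Project z to obtain the LSTM's initial hidden/memory state and
    concatenate z to every word embedding.
    z: latent vector; embeddings: list of word-embedding vectors;
    W_init, b_init: illustrative affine projection parameters."""
    # initial state: affine projection of z (plain dot products here)
    h0 = [sum(wi * zi for wi, zi in zip(row, z)) + b
          for row, b in zip(W_init, b_init)]
    c0 = list(h0)  # the same projection initializes the memory cell
    # each LSTM input is the word embedding concatenated with z
    inputs = [x + z for x in embeddings]
    return h0, c0, inputs
```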
Word dropout Dozat and Manning consists in randomly replacing embeddings by a vector of zeros during training to prevent overfitting:

$\tilde{X} = X \,\mathrm{diag}(b)$

where $b \in \{0, 1\}^n$ is a vector of booleans whose elements are independently drawn from a Bernoulli distribution of parameter $p$. The matrix $\tilde{X}$ corresponds to the matrix $X$ where a number of columns are filled with zeros. Fraternal dropout consists in creating two such matrices $\tilde{X}^{(1)}$ and $\tilde{X}^{(2)}$ from two independently drawn mask vectors $b^{(1)}$ and $b^{(2)}$.
Matrices $\tilde{X}^{(1)}$ and $\tilde{X}^{(2)}$ are both used to compute the log-likelihood of the sentence, and the mean of the two log-likelihoods then replaces the original reconstruction term. Finally, a regularization term is introduced in the objective function:

$\kappa \, \| H^{(1)} - H^{(2)} \|_2^2$

where $\kappa$ is a hyperparameter and $H^{(1)}$, $H^{(2)}$ are the hidden representations computed from $\tilde{X}^{(1)}$ and $\tilde{X}^{(2)}$. Note that the regularization term forces the hidden representations computed by the LSTM to be similar given the two different masked inputs but the same latent variable $z$. Hence, the regularization term forces the decoder to rely on the latent variable to compute the LSTM's hidden representations.
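The full fraternal dropout objective can be sketched as follows. This is a schematic version, assuming `X` is stored as a list of embedding columns; `log_lik` and `hidden` are placeholder callables standing in for the decoder's forward pass, not the paper's actual implementation.

```python
import random

def word_dropout_mask(n, p, rng=random):
    """One independent Bernoulli keep/drop decision per sentence position."""
    return [1.0 if rng.random() > p else 0.0 for _ in range(n)]

def apply_mask(X, mask):
    """Zero out the embedding columns whose mask entry is 0."""
    return [[v * m for v in col] for col, m in zip(X, mask)]

def fraternal_objective(log_lik, hidden, X, p, kappa):
    """Two independently masked passes over the same sentence: average the
    two log-likelihoods and penalize the squared distance between the two
    hidden-representation matrices (same latent variable on both passes)."""
    m1, m2 = word_dropout_mask(len(X), p), word_dropout_mask(len(X), p)
    X1, X2 = apply_mask(X, m1), apply_mask(X, m2)
    ll = 0.5 * (log_lik(X1) + log_lik(X2))   # mean reconstruction term
    H1, H2 = hidden(X1), hidden(X2)
    reg = sum((a - b) ** 2 for c1, c2 in zip(H1, H2) for a, b in zip(c1, c2))
    return ll - kappa * reg
```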
3 Experiments

We evaluate our approach using the code distributed by Li et al.. We kept the same hyperparameters as the ones used by the authors to avoid skewing the results in our favor.
We experiment with two datasets: Yelp Shen et al. and the Stanford Natural Language Inference (SNLI) Bowman et al. data. These two datasets were subsampled to contain 100,000 training sentences and 10,000 evaluation and testing sentences each. SNLI and Yelp have respective vocabularies of 9,990 and 8,411 words, and both have an average of 10 words per sentence.
3.1 Evaluation metrics
We use several metrics to evaluate the quality of the learned generative models.
(Log-likelihood and perplexity per word) The negative log-likelihood (NLL) indicates how well the model reconstructs the input sentence. The perplexity per word (PPL) is the geometric mean of the reciprocals of the probabilities assigned to the correct words by the model. Since the first is the negative of the log-likelihood and the second is built from reciprocal probabilities, we want to minimize both of these metrics. They are approximated by sampling 100 times in the distribution $q_\phi(z \mid w)$ for each sentence.
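The relation between these two metrics, and the sampling-based estimate of the log-marginal, can be sketched as follows. This is a generic importance-sampling sketch under the definitions above, not the paper's exact evaluation code.

```python
import math

def perplexity_per_word(total_nll_nats, total_words):
    """PPL = exp(NLL / word count): the geometric mean of the reciprocal
    probabilities assigned to the correct words."""
    return math.exp(total_nll_nats / total_words)

def log_marginal_estimate(log_weights):
    """Importance-sampling estimate of log p(w) from per-sample values of
    log p(w, z_k) - log q(z_k | w) (100 samples per sentence here),
    computed with the log-sum-exp trick for numerical stability."""
    m = max(log_weights)
    k = len(log_weights)
    return m + math.log(sum(math.exp(v - m) for v in log_weights) / k)
```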
(BLEU score) For each sentence in the test set, a latent representation is sampled from the distribution $q_\phi(z \mid w)$ and a sentence is generated from this representation without using teacher forcing. The BLEU score Papineni et al. represents the proportion of n-grams (for n going from 1 to 4) in the generated sentence that can be found in the original one. If the generated sentence is shorter than the original, a brevity penalty is applied since it is easier to avoid mistakes when fewer words are produced.
(Active units) The number of active units (AU) represents the number of dimensions of the latent variable that co-vary with the observations. According to Burda et al., a greater number of active units is usually representative of a richer latent variable. We follow their article and use a threshold value of $10^{-2}$. The dimension of the latent variable is 32, therefore there are at most 32 active units.
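The AU count can be computed as follows: a dimension is active when the variance, across the data, of its posterior mean exceeds the threshold. A minimal sketch of this definition (the $10^{-2}$ default follows Burda et al.):

```python
def active_units(posterior_means, threshold=1e-2):
    """Count latent dimensions whose posterior mean varies across the data.
    posterior_means: one mean vector E_q[z | x] per data point.
    A dimension d is active when Var_x(E_q[z_d | x]) > threshold."""
    n = len(posterior_means)
    dims = len(posterior_means[0])
    count = 0
    for d in range(dims):
        col = [mu[d] for mu in posterior_means]
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / n
        if var > threshold:
            count += 1
    return count
```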
3.2 Baselines

(Free bits) The free bits technique consists in introducing a constraint on the divergence term with the prior distribution so that it does not fall below a pre-determined threshold $\lambda$. We follow previous work to fix $\lambda$ in all of our experiments.
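The constraint amounts to clamping the KL term from below, so the optimizer gains nothing from pushing it lower. The following is one common per-dimension variant of free bits; the exact formulation in the codebase we build on may differ.

```python
def free_bits_kl(kl_per_dim, lam):
    """Clamp each per-dimension KL contribution from below at lam.
    Gradients vanish for dimensions already below the threshold, so the
    model is not rewarded for collapsing them further toward the prior."""
    return sum(max(kl, lam) for kl in kl_per_dim)
```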
(Pre-training) This baseline consists in first training the model as a classic autoencoder. Then, the decoder is reset and the model is trained as a VAE.
All of our experiments reweight the divergence term during the training of the model. This technique, proposed by Bowman et al., consists in steadily increasing the reweighting factor from 0 to 1 during the first epochs of training. The idea is to let the model ignore the divergence-with-the-prior term at the beginning of training.
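The annealing schedule is a one-liner; a linear warmup sketch (the step/epoch granularity is an implementation choice):

```python
def kl_weight(step, warmup_steps):
    """Linear KL annealing: the divergence reweighting factor grows
    from 0 to 1 over the first warmup_steps steps, then stays at 1."""
    return min(1.0, step / warmup_steps)
```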
3.3 Results and analysis
| Model | Yelp NLL | Yelp PPL | Yelp AU | Yelp MI | Yelp BLEU | SNLI NLL | SNLI PPL | SNLI AU | SNLI MI | SNLI BLEU |
| + fraternal dropout | 29.50 | 19.12 | 5 | 7.03 | 6.40 | 30.01 | 16.28 | 2 | 4.75 | 5.73 |
| + fraternal dropout | 25.46 | 12.76 | 32 | 8.65 | 11.23 | 27.92 | 13.40 | 32 | 7.11 | 8.44 |
| + fraternal dropout | 26.18 | 13.71 | 22 | 8.24 | 9.69 | 24.69 | 9.92 | 22 | 8.32 | 13.43 |
| Free bits + Pre-train. | 25.93 | 13.37 | 32 | 8.14 | 7.54 | 23.33 | 8.75 | 32 | 8.49 | 13.9 |
| + fraternal dropout | 23.63 | 10.62 | 32 | 8.81 | 13.54 | 21.00 | 7.04 | 32 | 9.07 | 21.35 |
We first evaluate the impact of the fraternal dropout hyperparameter $\kappa$ on the Yelp dataset. Results are reported in Table 1. We can observe that a compromise needs to be found between all the metrics, i.e. we need to find a point of equilibrium between minimizing the NLL and the PPL and maximizing the AU, MI and BLEU scores. The value of $\kappa$ is fixed in all following experiments.
We report results for different configurations in Table 2. Our approach yields improvements for all the metrics in all configurations for both datasets. The posterior collapse problem is significant in configurations using neither free bits nor fraternal dropout, the mutual information being around 1 and the number of active units being 2. Adding fraternal dropout results in a gain of between 4 and 7 points of mutual information in these two configurations. In configurations where the model is pre-trained, we observe that pre-training alone does not prevent posterior collapse, since there are only 2 and 3 active units respectively, while the same model with fraternal dropout retains 22 active units. Interestingly, our approach has a bigger impact than free bits on Yelp and a similar one on SNLI, while using fewer active units in both cases. This is an indication that the free bits technique artificially forces the latent variables to be decorrelated, i.e. the decoder still ignores their values.
As explained previously, the BLEU score is computed over sentences generated without teacher forcing and therefore can also be used to estimate the quality of the latent representations. Once again, fraternal dropout improves results on this metric.
We show some examples of sentences generated via interpolation between two latent representations sampled from the prior in Table 3. Our method seems to produce coherent sentences with a gradual change in length and meaning between the successive sentences.
| Free bits + Pre-train | Free bits + Pre-train + fraternal dropout |
| a boy is in front of a group of people. | the young boy is in a picture. |
| a man in a blue shirt is standing in front of a crowd of people. | the young child is in front of a mother. |
| a child in blue is holding a camera. | the small child is in front of a mother. |
| a child in blue pants holding a camera while another man watches. | a small child in pink holds a picture of her mother. |
| a child in blue pants holding a camera while another man in a black shirt looks on. | a small child in pink sits in a picture with her mother. |
4 Conclusion

In this work, we proposed to rely on parameter regularization to prevent the posterior collapse problem in VAEs, an approach that differs from previous work in the literature. We observe that our approach has two benefits: it improves the quality of the generated text and increases the use of the latent variable. Future work could explore other methods of parameter regularization Kanuparthi et al., Krueger et al., Gal and Ghahramani.
We thank François Yvon and Matthieu Labeau for proofreading the article. This work benefited from computations done on the Saclay-IA platform and an access to the computational resources of IDRIS through the resource allocation 20XX-AD11011600 attributed by GENCI.
- Bowman et al. (2015). A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP).
- Bowman et al. (2016). Generating sentences from a continuous space. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning (CoNLL), Berlin, Germany, pp. 10–21.
- Burda et al. (2016). Importance weighted autoencoders. In Proceedings of the 4th International Conference on Learning Representations (ICLR).
- Dempster et al. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological) 39(1), pp. 1–22.
- Dieng et al. (2019). Avoiding latent variable collapse with generative skip models. In Proceedings of Machine Learning Research, Vol. 89, pp. 2397–2405.
- Dozat and Manning (2017). Deep biaffine attention for neural dependency parsing. In Proceedings of the 5th International Conference on Learning Representations (ICLR), Toulon, France.
- Gal and Ghahramani (2016). A theoretically grounded application of dropout in recurrent neural networks. In Advances in Neural Information Processing Systems 29, pp. 1019–1027.
- Goodfellow et al. (2014). Generative adversarial nets. In Advances in Neural Information Processing Systems 27, Montreal, Canada, pp. 2672–2680.
- Havrylov and Titov (2020). Preventing posterior collapse with Levenshtein variational autoencoder. arXiv preprint arXiv:2004.14758.
- He et al. (2019). Lagging inference networks and posterior collapse in variational autoencoders. In Proceedings of the 7th International Conference on Learning Representations (ICLR).
- Kanuparthi et al. (2019). h-detach: modifying the LSTM gradient towards better optimization. In Proceedings of the 7th International Conference on Learning Representations (ICLR), New Orleans, LA, USA.
- Kingma and Welling (2014). Auto-encoding variational Bayes. In Proceedings of the 2nd International Conference on Learning Representations (ICLR), Banff, Canada.
- Kingma et al. (2016). Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems 29, pp. 4743–4751.
- Krueger et al. (2016). Zoneout: regularizing RNNs by randomly preserving hidden activations. CoRR abs/1606.01305.
- Li et al. (2019). A surprisingly effective fix for deep latent variable modeling of text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 3603–3614.
- Livne et al. (2020). SentenceMIM: a latent variable language model. arXiv preprint arXiv:2003.02645.
- Merity et al. (2018). Regularizing and optimizing LSTM language models. In Proceedings of the 6th International Conference on Learning Representations (ICLR).
- Papineni et al. (2002). BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 311–318.
- Pelsmaeker and Aziz (2020). Effective estimation of deep generative language models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), Online, pp. 7220–7236.
- Shen et al. (2017). Style transfer from non-parallel text by cross-alignment. In Advances in Neural Information Processing Systems 30, pp. 6830–6841.
- Vaswani et al. (2017). Attention is all you need. In Advances in Neural Information Processing Systems 30.
- Yang et al. (2017). Improved variational autoencoders for text modeling using dilated convolutions. In Proceedings of the 34th International Conference on Machine Learning (ICML), Sydney, Australia, pp. 3881–3890.
- Zolna et al. (2018). Fraternal dropout. In Proceedings of the 6th International Conference on Learning Representations (ICLR).