1 Introduction
Deep generative models like Variational Autoencoders (VAE) Kingma and Welling [2014] and Generative Adversarial Networks (GAN) Goodfellow et al. [2014], among others, enjoy great popularity in many applications of machine learning, including natural language processing (NLP). Unlike GANs, VAEs can handle discrete observed variables, which makes them suitable for applications in text generation. In these models, a sentence is generated according to the following process:

$$z \sim p(z), \qquad w_t \sim p_\theta(w_t \mid z, w_{<t}) \quad \text{for } t = 1, \ldots, n,$$

where $z \in \mathbb{R}^d$ is a latent sentence representation (or sentence embedding) and $w_1, \ldots, w_n \in V$ are the observed words (or the generated sentence), with $V$ the vocabulary. The generation stops when a special end-of-sentence word is generated. Without loss of generality, we assume that the prior distribution $p(z)$ is fixed. The subscript $\theta$ indicates the parameters of the conditional distribution and, in our case,
corresponds to the parameters of a neural network. It is important to note that we do not make any independence assumption in the conditional distribution $p_\theta(w_t \mid z, w_{<t})$. Training aims to search for parameters $\theta$ that maximize the likelihood of the observed data, also called the evidence:

$$\max_\theta \; \mathbb{E}_{x \sim p_D(x)} \left[ \log p_\theta(x) \right], \qquad p_\theta(x) = \int p_\theta(x \mid z) \, p(z) \, dz, \tag{1}$$

where $p_D$ is the empirical training data distribution. In general, this objective function is intractable because of the marginalization over the latent variable $z$. Variational methods introduce a proposal distribution $q_\phi(z \mid x)$ to create a surrogate objective called the evidence lower bound (ELBO), defined as follows:

$$\mathrm{ELBO}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)} \left[ \log p_\theta(x \mid z) \right] - \mathrm{KL}\left( q_\phi(z \mid x) \,\|\, p(z) \right), \tag{2}$$

where the ELBO defines a family of lower bounds on $\log p_\theta(x)$ parameterized by $\phi$, i.e. $\mathrm{ELBO}(\theta, \phi; x) \le \log p_\theta(x)$. During training, it is important to search for the proposal parameters $\phi$ that give the best bound (i.e. maximize the bound), hence the name variational. The new training problem is then:

$$\max_{\theta, \phi} \; \mathbb{E}_{x \sim p_D(x)} \left[ \mathrm{ELBO}(\theta, \phi; x) \right]. \tag{3}$$
The Expectation-Maximization (EM) algorithm Dempster et al. [1977] solves this problem via block coordinate ascent, i.e. maximizing successively with respect to $\phi$ (E step) and $\theta$ (M step). Unlike EM, the VAE approach consists in optimizing this problem by joint stochastic gradient ascent over $\theta$ and $\phi$. Moreover, unlike in standard applications of EM, neither the posterior distribution $p_\theta(z \mid x)$ nor its family are known. As a result, an independence assumption is made over the coordinates of $z$ in the distribution $q_\phi$ (also called a mean-field distribution). Finally, the distribution $q_\phi$ is amortized over the data and parameterized by a neural network. It is worth noting that during training, the reconstruction term in Equation 2 is estimated via a Monte Carlo method using a single sample. It is usual to call the distribution $q_\phi(z \mid x)$ that generates a latent representation from a sentence the encoder, and the distribution $p_\theta(x \mid z)$ that reconstructs the original sentence the decoder. Unfortunately, in practice, VAEs for automatic text generation are sensitive to posterior collapse Bowman et al. [2016]. Informally, this means that the proposal distribution is not optimized correctly and remains close to the prior distribution for all data points: $q_\phi(z \mid x) \approx p(z)$ for all $x$. This leads to a poor approximation of objective (1) by the ELBO. In the end, the decoder ignores the latent variable and no sentence representation is learned.
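To make the training objective concrete, the single-sample Monte Carlo estimate of the ELBO can be sketched as follows. This is our illustration in NumPy, assuming a diagonal-Gaussian proposal and a standard Gaussian prior (the usual VAE setting); the function names are ours, not the paper's:

```python
import numpy as np

def elbo_single_sample(mu, logvar, log_lik_fn, rng):
    """One-sample Monte Carlo ELBO estimate.

    mu, logvar: parameters of q(z|x) = N(mu, diag(exp(logvar)))
    log_lik_fn: callable z -> log p(x|z) (the decoder's reconstruction term)
    """
    std = np.exp(0.5 * logvar)
    z = mu + std * rng.standard_normal(mu.shape)  # reparameterization trick
    # Closed-form KL(q(z|x) || N(0, I)) for diagonal Gaussians
    kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)
    return log_lik_fn(z) - kl

# When q equals the prior (mu = 0, logvar = 0), the KL term vanishes,
# which is precisely the posterior-collapse situation described above.
rng = np.random.default_rng(0)
elbo = elbo_single_sample(np.zeros(32), np.zeros(32), lambda z: 0.0, rng)
```

Note that when the KL term is exactly zero for every sentence, the objective reduces to a plain language-modeling loss, so nothing pushes the decoder to read $z$.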
Previous work on the posterior collapse problem can be classified into two categories:

On the one hand, modifications to the objective function have been suggested. Bowman et al. [2016] reweight the divergence with the prior to temper its importance during training. Kingma et al. [2016] and Pelsmaeker and Aziz [2020] proposed to introduce constraints that force the divergence term to be greater than a hyperparameter. Finally, alternative surrogate objectives to train VAEs in the context of text generation have been proposed Livne et al. [2020], Havrylov and Titov [2020].
On the other hand, several authors proposed to modify the architecture of the decoder so that it is forced to rely on the latent variable. Yang et al. [2017] replace the recurrent neural network in the decoder with a convolutional neural network. Dieng et al. [2019] proposed to use skip connections between the latent representation and the various hidden layers of the decoder.
In this work, we propose a different approach to prevent the posterior collapse problem based on parameter regularization, that is, without changing the generative model's objective function or the architecture of the decoder. Our contributions can be summarized as follows:

We propose to regularize the decoder parameters to prevent posterior collapse. In particular, we use fraternal dropout Zolna et al. [2018] to force the decoder to use the latent representation.

We evaluate our approach in various settings and report improved results with respect to several evaluation metrics.
We hope to encourage future research to explore this direction to improve VAEs for text generation.
2 Decoder regularization: fraternal dropout
We first give an intuitive motivation for our approach. LSTMs have been widely used as neural language models and achieve competitive results Merity et al. [2018]. (Our method is agnostic to the decoder architecture; as such, the same methodology can be directly applied to self-attentive networks/transformers Vaswani et al. [2017].)
Language models are trained via the same reconstruction term as the one used in VAEs, showing that these neural architectures can efficiently maximize this term in the absence of latent variables, i.e. the term can be maximized while ignoring the latent variable values. To bypass this issue, we propose to introduce a regularization term in the objective that forces the hidden representations computed by the LSTM to be similar even when different words in the input are masked. The decoder is then forced to rely on the latent variable. To this end, we rely on fraternal dropout Zolna et al. [2018].

The reconstruction term of the ELBO maximizes the log-likelihood using the usual teacher forcing technique for language models: during training, the autoregressive model is trained to predict the next word based on the gold previous words. Let $x = w_1, \ldots, w_n$ be a sentence of length $n$. Each word $w_t$ is represented by a vector taken from an embedding table. We denote all of these vectors by a matrix $W \in \mathbb{R}^{e \times n}$, where $e$ is the word embedding dimension. A contextual representation is computed for each position in the sentence using an LSTM: $H = \mathrm{LSTM}(W)$, where $H \in \mathbb{R}^{h \times n}$ is a matrix containing hidden representations of dimension $h$ for each sentence position. The latent variable $z$ is projected and then given as an initialization for both the memory and the hidden state, and is also concatenated to each input. We refer the reader to Li et al. [2019] for more details on the architecture and the various parameters.
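The latent-conditioning scheme described above can be sketched as follows. This is an illustrative NumPy mock-up, not the actual implementation of Li et al. [2019]: a simple tanh recurrence stands in for the LSTM, and all weight names are ours:

```python
import numpy as np

def decode_hidden_states(W, z, params):
    """Run a latent-conditioned recurrent decoder over embeddings W (e x n).

    z is projected to initialize the recurrent state and is also concatenated
    to the input at every position, as described in the text. A tanh cell
    stands in for the LSTM to keep the sketch short.
    """
    e, n = W.shape
    h = np.tanh(params["P"] @ z)              # project z to the initial state
    H = np.zeros((params["P"].shape[0], n))
    for t in range(n):
        x_t = np.concatenate([W[:, t], z])    # concatenate z to each input
        h = np.tanh(params["A"] @ h + params["B"] @ x_t)
        H[:, t] = h
    return H

rng = np.random.default_rng(0)
e, n, d, hdim = 8, 5, 4, 16
params = {
    "P": rng.standard_normal((hdim, d)) * 0.1,      # latent -> initial state
    "A": rng.standard_normal((hdim, hdim)) * 0.1,   # recurrent weights
    "B": rng.standard_normal((hdim, e + d)) * 0.1,  # input weights
}
H = decode_hidden_states(rng.standard_normal((e, n)), rng.standard_normal(d), params)
```

Because $z$ enters both the initial state and every input, perturbing the inputs while keeping $z$ fixed leaves a $z$-dependent component in every hidden state, which is what the regularizer below exploits.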
Word dropout Dozat and Manning [2017] consists in randomly replacing embeddings with a vector of zeros during training to prevent overfitting: $\widetilde{W} = W \operatorname{diag}(m)$, where $m \in \{0, 1\}^n$ is a vector of booleans whose elements are independently drawn from a Bernoulli distribution of parameter $p$. The matrix $\widetilde{W}$ corresponds to the matrix $W$ where a number of columns are filled with zeros. Fraternal dropout consists in creating two matrices $\widetilde{W}^{(1)}$ and $\widetilde{W}^{(2)}$ as follows: $\widetilde{W}^{(i)} = W \operatorname{diag}(m^{(i)})$, with two independently sampled masks $m^{(1)}$ and $m^{(2)}$. Matrices $\widetilde{W}^{(1)}$ and $\widetilde{W}^{(2)}$ are both used to compute the log-likelihood of the sentence, and the mean of the two log-likelihoods then replaces the original reconstruction term. Finally, a regularization term is introduced in the objective function:

$$\mathcal{R} = \frac{\kappa}{n} \sum_{t=1}^{n} \left\| h_t^{(1)} - h_t^{(2)} \right\|_2^2,$$

where $\kappa$ is a hyperparameter and $h_t^{(i)}$ denotes the hidden representation computed by the LSTM at position $t$ from input $\widetilde{W}^{(i)}$. Note that the regularization term forces the hidden representations computed by the LSTM to be similar with different masked inputs $\widetilde{W}^{(1)}$ and $\widetilde{W}^{(2)}$ but the same latent variable $z$. Hence, the regularization term forces the decoder to rely on the latent variable to compute the LSTM's hidden representations.
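The word-dropout masking and the fraternal regularizer can be sketched as follows (a NumPy sketch with names of our choosing; the regularization weight is the hyperparameter called $\kappa$ in Zolna et al. [2018], and a toy `hidden_states` function plays the role of the LSTM):

```python
import numpy as np

def word_dropout(W, p, rng):
    """Zero out each column (word embedding) of W independently with prob. p."""
    m = rng.random(W.shape[1]) >= p        # keep-mask over sentence positions
    return W * m[np.newaxis, :]

def fraternal_regularizer(W, z, hidden_states, p, kappa, rng):
    """kappa/n * sum_t ||h_t^(1) - h_t^(2)||^2 for two independent masks."""
    H1 = hidden_states(word_dropout(W, p, rng), z)
    H2 = hidden_states(word_dropout(W, p, rng), z)
    n = W.shape[1]
    return kappa / n * np.sum((H1 - H2) ** 2)

# Toy stand-in for the LSTM: hidden states mix the (masked) input with z.
def toy_hidden_states(W, z):
    return np.tanh(W + z[:, np.newaxis])

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 6))
z = rng.standard_normal(4)
reg = fraternal_regularizer(W, z, toy_hidden_states, p=0.3, kappa=0.1, rng=rng)
```

With dropout probability 0 the two masks coincide and the penalty is exactly zero; with masking enabled, the only way for the decoder to keep the two hidden sequences close is to lean on the shared information, i.e. on $z$.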
3 Experiments
We evaluate our approach using the code distributed by Li et al. [2019]. We kept the same hyperparameters as the ones used by the authors to avoid skewing the results in our favor.
We experiment with two datasets: Yelp Shen et al. [2017] and the Stanford Natural Language Inference (SNLI) Bowman et al. [2015] data. These two datasets were subsampled to contain 100,000 training sentences and 10,000 evaluation and testing sentences. SNLI and Yelp have respective vocabularies of 9,990 and 8,411 words, and both have an average of 10 words per sentence.

3.1 Evaluation metrics
We use several metrics to evaluate the quality of the learned generative models.
(Log-likelihood and perplexity per word) The negative log-likelihood (NLL) indicates how well the model reconstructs the input sentence. The perplexity per word (PPL) is the geometric mean of the reciprocal of the probability assigned to the correct word by the model. Since we consider the negative of the log-likelihood in the first case and the reciprocal of the probability in the second, we want to minimize both of these metrics. They are approximated by sampling 100 times from the distribution $q_\phi(z \mid x)$ for each sentence.

(BLEU score) For each sentence in the test set, a latent representation is sampled from the distribution $q_\phi(z \mid x)$ and a sentence is generated from this representation without using teacher forcing. The BLEU score Papineni et al. [2002] represents the proportion of n-grams (for n going from 1 to 4) in the generated sentence that can be found in the original one. If the generated sentence is shorter than the original, a penalty is applied since it is easier to avoid mistakes when fewer words are produced.
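As a rough illustration of the metric, here is a simplified single-sentence BLEU with uniform n-gram weights and a brevity penalty (our sketch, not the exact corpus-level BLEU of Papineni et al. [2002], which also aggregates counts over the corpus):

```python
import math
from collections import Counter

def sentence_bleu(reference, candidate, max_n=4):
    """Simplified sentence-level BLEU: clipped n-gram precisions (n = 1..max_n),
    combined by geometric mean and multiplied by the brevity penalty."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(tuple(candidate[i:i+n]) for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i+n]) for i in range(len(reference) - n + 1))
        total = sum(cand.values())
        if total == 0:          # candidate shorter than n tokens
            return 0.0
        matched = sum(min(c, ref[g]) for g, c in cand.items())  # clipped counts
        if matched == 0:        # no smoothing in this sketch
            return 0.0
        log_precisions.append(math.log(matched / total))
    # Brevity penalty: punish candidates shorter than the reference.
    bp = 1.0 if len(candidate) >= len(reference) else math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(log_precisions) / max_n)
```

For example, a candidate that copies a 4-word prefix of a 6-word reference has all n-gram precisions equal to 1 but is scaled down by the brevity penalty $e^{1 - 6/4} \approx 0.61$.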
(Active units) The number of active units (AU) represents the number of dimensions of the latent variable that covary with the observations. According to Burda et al. [2016], a greater number of active units is usually representative of a richer latent variable. We follow their article and use a threshold value of 0.01: a dimension $i$ is considered active if $\operatorname{Var}_x\!\left( \mathbb{E}_{z \sim q_\phi(z \mid x)}[z_i] \right) > 0.01$. The dimension of the latent variable is 32, therefore we have at most 32 active units.
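The AU statistic can be computed as follows (a NumPy sketch following Burda et al. [2016]; the array name `mu` is ours and holds the posterior means for each sentence):

```python
import numpy as np

def active_units(mu, threshold=0.01):
    """Count latent dimensions whose posterior mean varies across the data.

    mu: array of shape (num_sentences, latent_dim) holding E_{q(z|x)}[z]
    for each sentence. A dimension is active if the variance of its
    posterior mean over the dataset exceeds the threshold (0.01 in
    Burda et al. [2016]).
    """
    return int(np.sum(np.var(mu, axis=0) > threshold))

# Toy check: 3 informative dimensions and 29 collapsed (constant) ones.
rng = np.random.default_rng(0)
mu = np.concatenate([rng.standard_normal((1000, 3)), np.zeros((1000, 29))], axis=1)
```

A collapsed dimension has a posterior mean that is (nearly) the same for every input, so its variance over the dataset falls below the threshold.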
(Mutual information) We also report the mutual information between the latent variable and the output probability distribution. A higher mutual information indicates that the latent variable is better used by the model. We follow the methodology of He et al. [2019].

3.2 Baselines
We compare our approach to previous work, including a "standard" VAE, the free bits technique Kingma et al. [2016], and the pretraining approach of Li et al. [2019]:

(Free bits) The free bits technique consists in introducing a constraint on the divergence term with the prior distribution so that it does not fall below a predetermined threshold $\lambda$. We follow previous work to set $\lambda$ in all of our experiments.
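One common formulation of the resulting objective (our rendering, following Kingma et al. [2016]) replaces the divergence term of the ELBO with a thresholded version:

```latex
\mathcal{L}_{\mathrm{FB}}(\theta, \phi; x)
  = \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right]
  - \max\!\left(\lambda,\; \mathrm{KL}\!\left(q_\phi(z \mid x) \,\|\, p(z)\right)\right),
```

so the gradient of the divergence term vanishes whenever the KL is already below $\lambda$, leaving the encoder free to use at least $\lambda$ nats of information.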

(Pretraining) This baseline consists in first training the model as a classic autoencoder. Then the decoder is reset and the model is trained as a VAE.
All of our experiments reweight the divergence term during the training of the model. This technique, proposed by Bowman et al. [2016], consists in steadily increasing the reweighting factor from 0 to 1 during the first epochs of training. The idea is to let the model ignore the divergence with the prior at the beginning of training.
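This annealing schedule can be sketched as follows (a linear warm-up over a fixed number of steps; the function and parameter names are ours):

```python
def kl_weight(step, warmup_steps):
    """Linearly increase the KL reweighting factor from 0 to 1, then hold at 1."""
    return min(1.0, step / warmup_steps)

# The reweighted objective at training step t is then
#   E_q[log p(x|z)] - kl_weight(t, warmup_steps) * KL(q(z|x) || p(z))
```

With a weight of 0 at the start, the model first behaves like a plain autoencoder; the divergence pressure is only applied progressively.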
3.3 Results and analysis

Table 1: Impact of the fraternal dropout hyperparameter $\kappa$ on the Yelp dataset.

$\kappa$  NLL    PPL    AU  MI    BLEU
0.01      28.91  18.44  4   6.34  5.20
0.1       29.50  19.12  5   7.03  6.40
0.5       30.69  21.53  6   7.47  6.81
1.0       32.30  25.28  4   6.87  6.03
2.0       33.41  28.27  4   7.29  6.09
Table 2: Results for the different configurations on Yelp and SNLI.

                       |  Yelp                           |  SNLI
Configuration          |  NLL    PPL    AU  MI    BLEU   |  NLL    PPL    AU  MI    BLEU
Standard               |  33.40  28.25  2   1.14  1.43   |  32.57  20.64  3   0.52  2.32
+ fraternal dropout    |  29.50  19.12  5   7.03  6.40   |  30.01  16.28  2   4.75  5.73
Free bits              |  29.54  19.20  32  5.69  4.02   |  28.88  14.66  32  4.63  4.77
+ fraternal dropout    |  25.46  12.76  32  8.65  11.23  |  27.92  13.40  32  7.11  8.44
Pretrain.              |  33.74  29.21  2   0.71  0.83   |  31.76  19.14  3   1.14  2.76
+ fraternal dropout    |  26.18  13.71  22  8.24  9.69   |  24.69  9.92   22  8.32  13.43
Free bits + Pretrain.  |  25.93  13.37  32  8.14  7.54   |  23.33  8.75   32  8.49  13.90
+ fraternal dropout    |  23.63  10.62  32  8.81  13.54  |  21.00  7.04   32  9.07  21.35
We first evaluate the impact of the fraternal dropout hyperparameter $\kappa$ on the Yelp dataset. Results are reported in Table 1. We can observe that a compromise needs to be found between all the metrics, i.e. we need to find a point of equilibrium between minimizing the NLL and the PPL and maximizing the AU, MI, and BLEU scores. In the following experiments, we fixed $\kappa = 0.1$.
We report results for the different configurations in Table 2. Our approach yields improvements for all the metrics in all configurations for both datasets. The posterior collapse problem is significant in configurations using neither free bits nor fraternal dropout, the mutual information being around 1 and the number of active units being 2 or 3. Adding fraternal dropout results in a gain of between 4 and 7 points of mutual information in these two configurations. In configurations where the model is pretrained, we observe that pretraining alone does not prevent posterior collapse, since there are only 2 and 3 active units respectively, while the same model with fraternal dropout retains 22 active units. Interestingly, our approach has a bigger impact than free bits on Yelp and a similar one on SNLI while using fewer active units in both cases. This is an indication that the free bits technique artificially forces the latent variables to be decorrelated, i.e. the decoder still ignores their values.
As explained previously, the BLEU score is computed over sentences generated without teacher forcing and therefore can also be used to estimate the quality of the latent representations. Once again, fraternal dropout improves results on this metric.
Interpolation
We show some examples of sentences generated via interpolation between two latent representations sampled from the prior in Table 3. Our method seems to produce coherent sentences with a gradual change in length and meaning between the successive sentences.

Table 3: Sentences generated by interpolating between two latent representations sampled from the prior.

Free bits + Pretrain.  |  Free bits + Pretrain. + fraternal dropout
a boy is in front of a group of people.  the young boy is in a picture. 
a man in a blue shirt is standing in front of a crowd of people.  the young child is in front of a mother. 
a child in blue is holding a camera.  the small child is in front of a mother. 
a child in blue pants holding a camera while another man watches.  a small child in pink holds a picture of her mother. 
a child in blue pants holding a camera while another man in a black shirt looks on.  a small child in pink sits in a picture with her mother. 
4 Conclusion
In this work, we propose to rely on parameter regularization to prevent the posterior collapse problem in VAEs. This approach differs from previous work in the literature. We observe that our approach has two benefits: it improves the quality of the generated text and increases the use of the latent variable. Future work could explore other methods of parameter regularization Kanuparthi et al. [2019], Krueger et al. [2016], Gal and Ghahramani [2016].
We thank François Yvon and Matthieu Labeau for proofreading the article. This work benefited from computations done on the Saclay-IA platform and from access to the computational resources of IDRIS through the resource allocation 20XXAD11011600 attributed by GENCI.
References
S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP).
S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Jozefowicz, and S. Bengio. Generating sentences from a continuous space. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, Berlin, Germany, pp. 10–21.
Y. Burda, R. Grosse, and R. Salakhutdinov. Importance weighted autoencoders. In Proceedings of the 4th International Conference on Learning Representations.
A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological) 39(1), pp. 1–22.
A. B. Dieng, Y. Kim, A. M. Rush, and D. M. Blei. Avoiding latent variable collapse with generative skip models. In Proceedings of Machine Learning Research, Vol. 89, pp. 2397–2405.
T. Dozat and C. D. Manning. Deep biaffine attention for neural dependency parsing. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24–26, 2017.
Y. Gal and Z. Ghahramani. A theoretically grounded application of dropout in recurrent neural networks. In Advances in Neural Information Processing Systems, Vol. 29, pp. 1019–1027.
I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems 27, Montreal, Quebec, Canada, pp. 2672–2680.
S. Havrylov and I. Titov. Preventing posterior collapse with Levenshtein variational autoencoder. arXiv preprint arXiv:2004.14758.
J. He, D. Spokoyny, G. Neubig, and T. Berg-Kirkpatrick. Lagging inference networks and posterior collapse in variational autoencoders. In Proceedings of the 7th International Conference on Learning Representations.
B. Kanuparthi, D. Arpit, G. Kerg, N. R. Ke, I. Mitliagkas, and Y. Bengio. h-detach: modifying the LSTM gradient towards better optimization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6–9, 2019.
D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14–16, 2014.
D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems 29, pp. 4743–4751.
D. Krueger, T. Maharaj, J. Kramár, M. Pezeshki, N. Ballas, N. R. Ke, A. Goyal, Y. Bengio, A. Courville, and C. Pal. Zoneout: regularizing RNNs by randomly preserving hidden activations. CoRR abs/1606.01305.
B. Li, J. He, G. Neubig, T. Berg-Kirkpatrick, and Y. Yang. A surprisingly effective fix for deep latent variable modeling of text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 3603–3614.
M. Livne, K. Swersky, and D. J. Fleet. SentenceMIM: a latent variable language model. arXiv preprint arXiv:2003.02645.
S. Merity, N. S. Keskar, and R. Socher. Regularizing and optimizing LSTM language models. In Proceedings of ICLR 2018.
K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318.
T. Pelsmaeker and W. Aziz. Effective estimation of deep generative language models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 7220–7236.
T. Shen, T. Lei, R. Barzilay, and T. Jaakkola. Style transfer from non-parallel text by cross-alignment. In Advances in Neural Information Processing Systems, Vol. 30, pp. 6830–6841.
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, Vol. 30.
Z. Yang, Z. Hu, R. Salakhutdinov, and T. Berg-Kirkpatrick. Improved variational autoencoders for text modeling using dilated convolutions. In Proceedings of the 34th International Conference on Machine Learning, Vol. 70, Sydney, Australia, pp. 3881–3890.
K. Zolna, D. Arpit, D. Suhubdy, and Y. Bengio. Fraternal dropout. In International Conference on Learning Representations.