1 Introduction
Variational autoencoders (VAEs; Kingma and Welling, 2013; Rezende et al., 2014) have been widely adopted in natural language generation (Bowman et al., 2016). A VAE employs a global latent variable to represent sentence semantics, leading to diverse and coherent generated sentences. The vanilla VAE adopts a continuous latent variable following a multivariate Gaussian distribution with a diagonal covariance matrix. Recently,
Zhao et al. (2018b) propose to replace the continuous latent variable of the VAE with a discrete one for better interpretability in dialog generation. The discrete latent variable can represent dialog actions in their system, which gives promising results even in an unsupervised setting. However, we argue that a VAE with only a discrete latent variable is not sufficient for interpretable language generation. Compared with a continuous latent variable, a discrete one suffers from relatively low model capacity: because the discrete latent space contains only a limited number of points, it cannot convey as much information as the continuous latent space (infinitely many points).
In this paper, we propose to generate text using the Gaussian mixture VAE (GMVAE), which has been effective in image modeling (Dilokthanakul et al., 2016; Jiang et al., 2017). It enjoys the benefits of both discrete and continuous latent spaces, where the discrete variable is easy to control. This is superior to semi-supervised VAEs (Kingma et al., 2014; Hu et al., 2017; Zhou and Neubig, 2017), which contain two independent discrete and continuous latent variables.
However, the vanilla GMVAE suffers from the mode-collapse problem in language generation: the multiple Gaussian priors tend to concentrate during training and eventually degenerate into a single Gaussian. Essentially, the Gaussian components in the latent space end up with nearly identical means. GMVAE thus fails to capture the multi-mode structure of the data and cannot effectively utilize the discrete latent variable. For example, as illustrated in Fig. 1(a), utterances requesting the weather and utterances requesting an appointment are mapped into the same mode due to mode-collapse. The mode-collapse problem has also been observed in image modeling with GMVAE (Dilokthanakul et al., 2016), but it is more severe in language generation.
In this paper, we propose the Dispersed-GMVAE (DGMVAE), which fixes the mode-collapse problem of GMVAE. First, we show theoretically that mode-collapse is intrinsically caused by the evidence lower bound (ELBO) of GMVAE: maximizing the ELBO implicitly pulls together the means of the Gaussian mixture priors. We introduce two extra terms in DGMVAE to alleviate mode-collapse and obtain a better-structured latent space (Fig. 1(b)). Experimental results show that DGMVAE avoids the mode-collapse problem effectively. Furthermore, DGMVAE achieves significantly better results in language modeling on the Penn Treebank (PTB; Marcus et al., 1993) and in interpretable text generation on two dialog benchmarks.
Our contributions are: a) we propose to use GMVAE for interpretable text generation; b) we theoretically analyze the mode-collapse problem in GMVAE and address it by proposing DGMVAE; c) we empirically study the performance of DGMVAE and show that it obtains good results on various generation tasks.
2 Related Work
VAEs for Language Generation. Variational autoencoders are proposed by Kingma and Welling (2013) and Rezende et al. (2014), and applied by Bowman et al. (2016) to natural language generation. VAEs have been extended by many subsequent works on specific language generation tasks, such as dialog generation (Serban et al., 2017; Wen et al., 2017; Zhao et al., 2017b, 2018b), summarization (Li et al., 2017a) and other natural language generation tasks (Miao et al., 2016; Zhang et al., 2016; Semeniuta et al., 2017; Gupta et al., 2018; Xu and Durrett, 2018).
Additionally, Wen et al. (2017) and Zhao et al. (2018b) propose to replace the continuous latent variable with a discrete one for interpretable sentence generation. Kingma et al. (2014) propose the semi-VAE for semi-supervised learning. This model is then adopted by Hu et al. (2017) and Zhou and Neubig (2017) for style transfer and labeled sequence transduction, respectively. Different from GMVAE, the continuous and discrete latent variables in semi-VAE are independent.

Gaussian Mixture VAEs. Using Gaussian mixture models as priors in VAEs is not new. Gaussian mixture variational autoencoders have been used in unsupervised clustering (Dilokthanakul et al., 2016; Jiang et al., 2017), obtaining promising results. Wang et al. (2019) use GMMs as priors for topic-guided text generation. The GMVAE used in this paper is similar to that of Jiang et al. (2017). However, we apply GMVAE to interpretable text generation and propose DGMVAE to address the mode-collapse problem based on our theoretical analysis.

KL Collapse vs. Mode Collapse. Vanilla VAE models usually suffer from the KL-collapse problem in language generation, in which the KL regularization term quickly collapses to 0. A line of follow-up work (Bowman et al., 2016; Zhao et al., 2017b, 2018b; Higgins et al., 2017) proposes ways to avoid KL collapse. Mode collapse, in contrast, is specific to mixture models, in which multiple modes vanish and collapse into a single mode. Mode collapse is also driven by the KL term, but its essential cause is different. In this paper, we focus on addressing the mode-collapse problem.
3 Proposed Approach
Probabilistic graphical models of the VAE and its variants are shown in Fig. 2. The vanilla VAE (Kingma and Welling, 2013) includes only a continuous latent variable; discrete VAE models such as DI-VAE (Zhao et al., 2018b) adopt a discrete latent variable for interpretability; semi-VAE (Kingma et al., 2014) employs independent discrete and continuous latent variables; the Gaussian mixture VAE (GMVAE) (Dilokthanakul et al., 2016; Jiang et al., 2017) uses dependent discrete and continuous latent variables for better interpretable generation performance.
In the following, we first describe the vanilla GMVAE (Sec. 3.1) and give a theoretical analysis of its mode-collapse issue (Sec. 3.2). Based on these theoretical insights, we propose DGMVAE to fix the issue (Sec. 3.3).
3.1 Gaussian Mixture VAE
GMVAE is a probabilistic generative model that adopts a Gaussian mixture model (Bishop, 2006) as its prior. GMVAE employs a discrete latent variable $c$ and a continuous latent variable $z$, with $z$ dependent on $c$. In this model, the marginal likelihood of a sentence $x$ is:

$$p_\theta(x) = \sum_{c} p(c) \int p(z|c)\, p_\theta(x|z)\, dz, \qquad (1)$$

in which $\theta$ denotes the parameters of the generation model $p_\theta(x|z)$, which generates $x$ from $z$. The Gaussian mixture prior distribution can be computed as $p(z) = \sum_c p(c)\, p(z|c)$. Intuitively, $c$ indexes the components of the mixture Gaussian and $p(c)$ can be assumed to be a uniform distribution, while $p(z|c) = \mathcal{N}(z; \mu_c, \sigma_c^2 I)$ is the multivariate Gaussian distribution of the corresponding component.

Testing. During testing, a mixture component $c$ is first chosen according to the prior distribution $p(c)$. Then the continuous variable $z$ is sampled from the chosen Gaussian prior $p(z|c)$. As in Bowman et al. (2016), a generation network takes $z$ as input and generates the sentence through the decoder $p_\theta(x|z)$.
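The testing (generation) process above can be sketched as follows. This is an illustrative toy, not the paper's implementation: the mixture parameters are made-up values and the decoder $p_\theta(x|z)$ is omitted.

```python
import random

# Toy GMM prior: K components with uniform p(c), each component a
# diagonal Gaussian N(mu_c, sigma_c^2 I) over a 2-d latent space.
# The means and standard deviations below are made-up values.
K = 3
mu = [[-2.0, 0.0], [0.0, 2.0], [2.0, 0.0]]  # component means mu_c
sigma = [0.5, 0.5, 0.5]                     # per-component std sigma_c

def sample_prior(rng):
    """Ancestral sampling: draw c ~ p(c) (uniform), then z ~ p(z|c)."""
    c = rng.randrange(K)
    z = [rng.gauss(m, sigma[c]) for m in mu[c]]
    return c, z

rng = random.Random(0)
c, z = sample_prior(rng)
# A full system would now feed z to the decoder p_theta(x|z); here we
# only obtain a latent sample near the chosen component's mean.
```

In the full model, the decoder is a recurrent language model initialized with $z$, as described later in Sec. 3.3.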
Training. Optimization and inference for Eq. 1 are difficult. Following the previous work of Kingma and Welling (2013) and Rezende et al. (2014), we use a variational posterior distribution $q_\phi(z, c|x)$ with parameters $\phi$ to approximate the true posterior $p(z, c|x)$. With the mean-field approximation (Xing et al., 2003), $q_\phi(z, c|x)$ can be factorized as:

$$q_\phi(z, c|x) = q_\phi(z|x)\, q_\phi(c|x).$$

The posterior $q_\phi(z|x)$ is assumed to be a multivariate Gaussian distribution, whose mean $\mu(x)$ and variance $\sigma^2(x)$ are obtained through a neural network (the recognition network). $q_\phi(c|x)$ is calculated according to:

$$q_\phi(c|x) = \frac{p(c)\, p(\bar{z}(x)\,|\,c)}{\sum_{c'} p(c')\, p(\bar{z}(x)\,|\,c')}, \qquad (2)$$
in which $p(\bar{z}(x)|c)$ is the probability of generating $\bar{z}(x)$, the mapped vector of $x$ in the latent space, under the $c$-th Gaussian component, and $p(c)$ is taken as the prior. In practice, $\bar{z}(x)$ is taken as the deterministic mapping of $x$ into the latent space, which does not break the conditional independence of the mean-field approximation.

Instead of optimizing the marginal likelihood in Eq. 1, we maximize an evidence lower bound (ELBO). The ELBO can be decomposed into the sum of a reconstruction term and regularization terms for $z$ and $c$, respectively:

$$\mathrm{ELBO}(x) = \mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big] - \mathbb{E}_{q_\phi(c|x)}\big[\mathrm{KL}\big(q_\phi(z|x)\,\|\,p(z|c)\big)\big] - \mathrm{KL}\big(q_\phi(c|x)\,\|\,p(c)\big). \qquad (3)$$
All parameters, including $\theta$ and $\phi$, can be jointly trained with the reparameterization trick (Kingma and Welling, 2013) for sampling $z$ and Gumbel-softmax (Jang et al., 2016) for sampling $c$.
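The two sampling tricks can be sketched in a dependency-free way. This is a minimal illustration under my own toy parameters, not the paper's implementation: `gumbel_softmax_sample` draws a relaxed sample of $c$, `reparameterize` draws $z = \mu + \sigma\epsilon$, and the closed-form diagonal-Gaussian KL gives the $z$-regularizer of Eq. 3 under a relaxed $q(c|x)$.

```python
import math, random

rng = random.Random(0)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def gumbel_softmax_sample(logits, tau=1.0):
    """Relaxed sample from Categorical(softmax(logits)): perturb the
    logits with Gumbel noise, then apply a temperature-tau softmax."""
    gumbels = [-math.log(-math.log(rng.random())) for _ in logits]
    return softmax([(v + g) / tau for v, g in zip(logits, gumbels)])

def reparameterize(mu, log_var):
    """z = mu + sigma * eps with eps ~ N(0, I) (reparameterization trick)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def kl_diag_gauss(mu_q, log_var_q, mu_p, log_var_p):
    """KL(N(mu_q, diag exp(log_var_q)) || N(mu_p, diag exp(log_var_p)))."""
    return sum(0.5 * ((lp - lq)
                      + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp)
                      - 1.0)
               for mq, lq, mp, lp in zip(mu_q, log_var_q, mu_p, log_var_p))

# Toy 2-d posterior and a 2-component prior (made-up numbers):
# evaluate E_{q(c|x)}[KL(q(z|x) || p(z|c))] under a relaxed q(c|x).
q_mu, q_lv = [0.1, -0.2], [0.0, 0.0]              # posterior q(z|x)
prior_mu = [[-1.0, 0.0], [1.0, 0.0]]              # component means mu_c
prior_lv = [[0.0, 0.0], [0.0, 0.0]]               # log-variances (sigma = 1)
q_c = gumbel_softmax_sample([0.0, 0.0], tau=0.5)  # relaxed q(c|x)
kl_z = sum(w * kl_diag_gauss(q_mu, q_lv, pm, pl)
           for w, pm, pl in zip(q_c, prior_mu, prior_lv))
z = reparameterize(q_mu, q_lv)                    # sample for reconstruction
```

Both tricks exist to keep sampling differentiable: gradients flow through $\mu$, $\sigma$ and the relaxed categorical weights rather than through the random draws themselves.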
However, GMVAE often encounters mode-collapse, where all components of the GMM concentrate into one Gaussian distribution and GMVAE degenerates into a uni-Gaussian VAE.
3.2 Mode-Collapse Problem
We further investigate the objective function to analyze mode-collapse. To this end, we present two theorems, which indicate that the regularization terms of GMVAE's ELBO, $\mathbb{E}_{q_\phi(c|x)}[\mathrm{KL}(q_\phi(z|x)\,\|\,p(z|c))]$ and $\mathrm{KL}(q_\phi(c|x)\,\|\,p(c))$, are responsible for the mode-collapse problem. We only give explanations and remarks for each theorem; the proofs are included in the supplementary materials.
We abbreviate $\mu(x)$, the posterior mean of $z$ given $x$, as $\hat{\mu}$, and $\{\mu_c\}_{c=1}^{K}$, the set of means of the $K$ Gaussian components, as $\mu$. The trace of the variance matrix of the means in $\mu$ is denoted as $\mathrm{VM}(\mu)$.
Theorem 1. Maximizing the ELBO pushes a close upper bound of $\mathrm{VM}(\mu)$, namely $d(\hat{\mu}, \mu)/K$, to decrease. Here $d(\hat{\mu}, \mu) = \sum_c \|\hat{\mu} - \mu_c\|^2$ is the sum of squared distances between $\hat{\mu}$ and the component means.
By performing some algebraic manipulation, we find that the inner product of the gradients of the ELBO and of $d(\hat{\mu}, \mu)$ with respect to $\mu$ is always non-positive, which means the directions of the two gradients are opposite. So performing gradient ascent on the ELBO makes $d(\hat{\mu}, \mu)$ smaller. As a result, $\mathrm{VM}(\mu)$ is limited by the decreasing bound $d(\hat{\mu}, \mu)$.
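The bound behind Theorem 1 follows from a standard fact: the variance of the component means about their centroid is the minimum mean squared deviation over all reference points, so $\mathrm{VM}(\mu) \le d(\hat{\mu}, \mu)/K$ for any $\hat{\mu}$. A toy numeric check with made-up 1-d means:

```python
# Component means mu_c and an arbitrary reference point mu_hat
# (both are made-up values for illustration).
mus = [1.0, 2.0, 4.0, 7.0]
mu_hat = 3.0
K = len(mus)

centroid = sum(mus) / K
vm = sum((m - centroid) ** 2 for m in mus) / K           # VM(mu)
d_over_k = sum((m - mu_hat) ** 2 for m in mus) / K       # d(mu_hat, mu)/K

# The variance about the centroid can never exceed the mean squared
# deviation about any other point.
assert vm <= d_over_k
```

So whenever training drives $d(\hat{\mu}, \mu)$ down, the dispersion of the component means is forced down with it.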
We abbreviate as $\mathrm{VM}(\tilde{\mu})$ the trace of the variance matrix of the component means $\mu_c$ under the distribution $q_\phi(c|x)$, i.e., $\mathrm{VM}(\tilde{\mu}) = \mathrm{tr}\,\mathrm{Var}_{q_\phi(c|x)}(\mu_c)$, assuming the standard deviations of all Gaussian components are equal, i.e., $\sigma_c = \sigma$:

Theorem 2. The ELBO contains a negative regularization term on $\mathrm{VM}(\tilde{\mu})$, namely $-\frac{1}{2\sigma^2}\mathrm{VM}(\tilde{\mu})$.
The term $\mathbb{E}_{q_\phi(c|x)}[\mathrm{KL}(q_\phi(z|x)\,\|\,p(z|c))]$ can be rewritten as

$$\mathrm{KL}\big(q_\phi(z|x)\,\|\,\mathcal{N}(z; \bar{\mu}, \sigma^2 I)\big) + \frac{1}{2\sigma^2}\mathrm{VM}(\tilde{\mu}),$$

where $\mathcal{N}(z; \bar{\mu}, \sigma^2 I)$ is a multivariate Gaussian distribution whose mean $\bar{\mu} = \mathbb{E}_{q_\phi(c|x)}[\mu_c]$ is the expectation of the prior means under the distribution $q_\phi(c|x)$. Hence, maximizing the ELBO implicitly minimizes $\mathrm{VM}(\tilde{\mu})$, which may also lead to mode collapse in GMVAE.
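This decomposition can be checked numerically in one dimension: with equal component variances $\sigma^2$, the expected KL to the mixture components equals the KL to a single Gaussian at the weighted mean $\bar{\mu}$, plus the $q(c|x)$-weighted variance of the means divided by $2\sigma^2$. A sketch with made-up values:

```python
import math

def kl_1d(m1, s1, m2, s2):
    """KL(N(m1, s1^2) || N(m2, s2^2)) for 1-d Gaussians."""
    return math.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

# Toy posterior q(z|x) and three equal-variance prior components
# (all numbers are made up for illustration).
m_q, s_q = 0.3, 0.8
mus, sigma = [-1.0, 0.5, 2.0], 1.0
w = [0.2, 0.5, 0.3]                                    # q(c|x)

# LHS: E_{q(c|x)}[KL(q(z|x) || p(z|c))]
lhs = sum(wc * kl_1d(m_q, s_q, mc, sigma) for wc, mc in zip(w, mus))

# RHS: KL to the averaged prior, plus VM(mu~) / (2 sigma^2)
mu_bar = sum(wc * mc for wc, mc in zip(w, mus))        # E_{q(c|x)}[mu_c]
vm = sum(wc * (mc - mu_bar) ** 2 for wc, mc in zip(w, mus))
rhs = kl_1d(m_q, s_q, mu_bar, sigma) + vm / (2 * sigma**2)

assert abs(lhs - rhs) < 1e-9
```

Minimizing the left-hand side therefore shrinks the weighted spread of the component means, which is exactly the effect Theorem 2 describes.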
3.3 Dispersed-GMVAE
In this section, we propose the Dispersed-GMVAE (DGMVAE), a simple yet effective way to avoid the mode-collapse problem.
According to the theoretical insights in Sec. 3.2, we include two extra terms in our objective to counteract the mode collapse caused by the ELBO. We propose a new objective for $x$ sampled from the dataset $D$:

$$\mathcal{L}_{\mathrm{DGMVAE}}(x) = \mathrm{ELBO}(x) + \beta\,\mathrm{VM}(\mu) + I(x; c), \qquad (4)$$

where the mutual information term is computed with $q(c)$, the posterior marginal distribution estimated by $q(c) = \mathbb{E}_{x \sim D}[q_\phi(c|x)]$; empirically, it is obtained by averaging $q_\phi(c|x)$ within the minibatch. DGMVAE adds an extra variance regularization term $\beta\,\mathrm{VM}(\mu)$ and a mutual information term $I(x; c)$ to its objective. Intuitively, they serve different roles. $\beta\,\mathrm{VM}(\mu)$, with hyperparameter $\beta$, regularizes the concentration trend of the Gaussian mixture components; we can tune $\beta$ to trade off between variance and concentration.
We also include a mutual information term in Eq. 4: $I(x; c) = \mathbb{E}_{x \sim D}[\mathrm{KL}(q_\phi(c|x)\,\|\,q(c))]$. As shown in previous work (Chen et al., 2016; Zhao et al., 2017a, 2018b), $I(x; c)$ can enhance interpretability and alleviate KL-collapse. Our ablation study in the experiments shows that both $\beta\,\mathrm{VM}(\mu)$ and $I(x; c)$ are necessary for the good empirical performance of DGMVAE.
The final objective of DGMVAE can be written as:

$$\mathcal{L}_{\mathrm{DGMVAE}} = \mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big] - \mathbb{E}_{q_\phi(c|x)}\big[\mathrm{KL}\big(q_\phi(z|x)\,\|\,p(z|c)\big)\big] - \mathrm{KL}\big(q(c)\,\|\,p(c)\big) + \beta\,\mathrm{VM}(\mu),$$

in which $q(c) = \mathbb{E}_{x \sim D}[q_\phi(c|x)]$. More details on how to obtain the final objective from Eq. 3 and Eq. 4 can be found in the supplementary materials.
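The two extra terms can be estimated from a minibatch as below. This is an illustrative sketch under my own toy inputs, not the authors' implementation: `vm_term` takes the $K$ component means and `mi_term` takes the batch of posteriors $q_\phi(c|x_i)$, with $q(c)$ approximated by the batch average as described above.

```python
import math

def entropy(p):
    """Shannon entropy of a categorical distribution (natural log)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def vm_term(component_means):
    """VM(mu): trace of the variance matrix of the K component means."""
    K, D = len(component_means), len(component_means[0])
    centroid = [sum(m[d] for m in component_means) / K for d in range(D)]
    return sum((m[d] - centroid[d]) ** 2
               for m in component_means for d in range(D)) / K

def mi_term(posteriors):
    """Minibatch estimate of I(x; c) = H(q(c)) - E_x[H(q(c|x))],
    with the marginal q(c) approximated by the batch average."""
    B, K = len(posteriors), len(posteriors[0])
    q_marginal = [sum(row[k] for row in posteriors) / B for k in range(K)]
    return entropy(q_marginal) - sum(entropy(row) for row in posteriors) / B

# Toy usage with made-up component means, posteriors and beta = 0.9:
means = [[-1.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
posteriors = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1]]
extra_terms = 0.9 * vm_term(means) + mi_term(posteriors)
```

In a real training loop these scalars would be added to the ELBO before back-propagation; here they only illustrate what each term measures (dispersion of the prior means, and how informative $c$ is about $x$).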
Apart from the learning objective, DGMVAE has the same architecture as GMVAE: an RNN encoder for learning the posterior and an RNN decoder for generation.
Encoder. Recurrent neural networks such as GRUs (Chung et al., 2014) serve as recognition networks, encoding sentences into compact hidden states. The mean and variance of the posterior distribution (assumed to be a multivariate diagonal Gaussian) are obtained from the last hidden state through two affine transformations.

Decoder. In the decoding phase, we first sample a $z$ from the GMM prior (at test time) or from the posterior $q_\phi(z|x)$ (at training time) via the reparameterization trick (Kingma and Welling, 2013). Sentences are then generated in a recurrent neural language model fashion (the generation network), with $z$ as the initial hidden state. We use multiple independent discrete latent variables, following Zhao et al. (2018b).
Interpretable Dialog Generation. We follow the same approach as DI-VAE (Zhao et al., 2018b) for interpretable dialog generation. The approach could be extended to other interpretable generation scenarios, but we validate DGMVAE only on dialog, for comparison with Zhao et al. (2018b).
Specifically, in dialog generation, we generate a response given the dialog context. A DGMVAE model is pre-trained on all utterances of the training set to capture interpretable facts (the discrete latent variable $c$) such as dialog actions or intentions. In training, a hierarchical recurrent encoder-decoder (HRED) model with attention (Sordoni et al., 2015; Serban et al., 2016) is trained to generate the response. Here $c$ is obtained from the pre-trained recognition network of DGMVAE and then fed into the decoder. A policy network is trained jointly to predict $c$ sampled from $q_\phi(c|x)$, so that $c$ can be predicted at test time.
4 Experiments
In this section, we empirically test the generation quality and interpretability of our proposed model on standard benchmarks, compared against a line of baselines.
4.1 Setup
We conduct experiments following Zhao et al. (2018b). For generation quality, we use the Penn Treebank (PTB; Marcus et al., 1993), preprocessed by Mikolov et al. (2010), as the benchmark. For interpretability, we use the DailyDialog (DD; Li et al., 2017b) and Stanford Multi-Domain Dialog (SMD; Eric et al., 2017) datasets. DD is a chat-oriented dataset containing 13,118 multi-turn dialogs, annotated with dialog actions and emotions. SMD contains 3,031 human-Woz, task-oriented dialogs collected from three different domains (navigation, weather and scheduling).
We compare our model with the following baselines: 1) RNNLM, a language model (Mikolov et al., 2010) implemented with LSTMs (Hochreiter and Schmidhuber, 1997), following the TensorFlow PTB tutorial (https://github.com/tensorflow/models/blob/master/tutorials/rnn/ptb/ptb_word_lm.py); 2) AE, autoencoders (Vincent et al., 2010) without latent-space regularization; 3) DAE, autoencoders with a discrete latent space; 4) VAE, the vanilla VAE (Kingma and Welling, 2013) with only a continuous latent variable and a standard Gaussian prior; 5) DVAE, a VAE with discrete latent variables; 6) DI-VAE, a DVAE variant (Zhao et al., 2018b) with an extra mutual information term; 7) semi-VAE, the semi-supervised VAE proposed by Kingma et al. (2014), with independent discrete and continuous latent variables; 8) GMVAE, the vanilla GMVAE as introduced in Sec. 3.1. Results of these baselines are obtained by our implementation, except DI-VAE, for which we use the authors' released code; our reported results differ from those in their paper because we find their code performs tokenization twice. Gumbel-softmax (Jang et al., 2016) is used for reparameterization in the VAE variants with discrete latent variables.

The encoder and decoder in all models are implemented with a single-layer GRU (Chung et al., 2014) with hidden size 512. The dimension of the discrete latent variables is set to 10 for PTB and 5 for DD and SMD, while the number of discrete latent variables is set to 20, 3 and 3, respectively. The dimension of the continuous latent space is 100 for PTB, 15 for DD and 48 for SMD. $\beta$ is set to 0.9 for DGMVAE. KL annealing with a logistic weight function is adopted for all VAE variants. All hyperparameters, including $\beta$, are chosen according to the objective (language generation task) or BLEU scores (dialog generation task) on the validation set. Details of the hyperparameters are included in the supplementary materials.
4.2 Effects of DGMVAE on Mode-Collapse
We illustrate the effectiveness of DGMVAE in alleviating the mode-collapse problem. Fig. 3 gives a visualization. We train GMVAE and DGMVAE on the utterances of the DD dataset, and randomly sample 300 points from the test data at 2,000 and 10,000 training steps, respectively. The dimension of the latent space is set to 2 for visualization. In Fig. 3, the means and variances of the GMM priors are indicated by grey points and circles, respectively. The means of the posteriors are marked as colored points (points with different discrete latent variables are shown in different colors).

It can be seen that, after 10,000 training steps, the vanilla GMVAE degenerates into a uni-Gaussian VAE, with all Gaussian components sharing the same mean (Fig. 3(a) and 3(b)). DGMVAE gives quite promising results, as shown in Fig. 3(g) and 3(h), in which the different components of the GMM are dispersed and cluster the data points into multiple modes. To verify the effects of the two additional terms of DGMVAE, we incorporate the MI and VM terms incrementally. GMVAE with the MI term does help alleviate mode-collapse; however, the posterior points concentrate tightly around the priors, indicating that the latent space is not smooth enough. GMVAE with the VM term can also avoid the mode-collapse problem, but it does not cluster points with the same discrete labels together well.
4.3 Language Generation Performance
We evaluate language generation performance on PTB in Tab. 1, comparing DGMVAE with the baselines described in Sec. 4.1. The test set of PTB is also included, for comparison of text fluency.
We include four metrics to evaluate generation performance: reverse perplexity (rPPL), BLEU (Papineni et al., 2002), word-level KL divergence (wKL) and perplexity (PPL). Reverse perplexity is the perplexity of an LSTM language model (Merity et al., 2017) trained on synthetic data sampled from the priors of the VAE variants and evaluated on the test set (Zhao et al., 2018a); lower rPPL indicates that the sampled sentences are more diverse and fluent. The BLEU score between input and reconstructed sentences reflects reconstruction ability. The word-level KL divergence between the word frequencies of generated data and training data shows the quality of generation. The perplexity of samples generated from the posterior on the test set measures the fluency of reconstruction. These metrics are evaluated on the test set of PTB, except rPPL and wKL, which are calculated on sentences generated by sampling from the models' prior distributions (sampling a random vector for AE), with a sample size of 40,000, the same as the PTB training set.
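The wKL metric can be computed as below. This is a sketch under my own assumptions: the paper does not specify its smoothing scheme, so the epsilon smoothing on the generated side is a choice made here to keep the divergence finite.

```python
import math
from collections import Counter

def word_kl(reference_tokens, generated_tokens, eps=1e-8):
    """Word-level KL(P_ref || P_gen) between unigram word distributions.
    eps-smoothing (an assumption, not from the paper) keeps the
    generated distribution nonzero on every vocabulary word."""
    ref, gen = Counter(reference_tokens), Counter(generated_tokens)
    vocab = set(ref) | set(gen)
    n_ref, n_gen = len(reference_tokens), len(generated_tokens)
    kl = 0.0
    for w in vocab:
        p = ref[w] / n_ref                          # reference frequency
        q = (gen[w] + eps) / (n_gen + eps * len(vocab))  # smoothed generated
        if p > 0:
            kl += p * math.log(p / q)
    return kl
```

A generator whose samples match the training-set word frequencies drives this value toward 0, which is why lower wKL indicates better generation quality.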
In addition, the values of the regularization terms are included to give indications of mode-collapse and KL-collapse. We list the KL divergence of the continuous latent variables, KL(z) (for models with the VM term, this is the KL divergence between the posterior and the expected prior $\mathcal{N}(z; \bar{\mu}, \sigma^2 I)$), and of the discrete latent variables, KL(c) (for models with the MI term, this is the KL divergence between the marginal posterior $q(c)$ and the prior $p(c)$), each averaged over the number of discrete variables. The variance of the GMM means (VM, calculated on the posterior and averaged over all samples) and the mutual information (MI) terms are shown as well.
Table 1: Results on PTB. Left: evaluation metrics; right: regularization terms.

Model                           | rPPL   | BLEU  | wKL  | PPL    | KL(z) | KL(c) | VM    | MI
--------------------------------|--------|-------|------|--------|-------|-------|-------|------
Test Set                        | -      | 100.0 | 0.14 | -      | -     | -     | -     | -
RNNLM (Mikolov et al., 2010)    | -      | -     | -    | 117.60 | -     | -     | -     | -
AE (Vincent et al., 2010)       | 730.81 | 10.88 | 0.58 | 31.90  | -     | -     | -     | -
VAE (Kingma and Welling, 2013)  | 922.71 | 3.73  | 0.76 | 91.95  | 6.62  | -     | -     | -
DAE                             | 797.17 | 3.93  | 0.58 | 88.55  | -     | -     | -     | -
DVAE                            | 453.53 | 3.61  | 0.58 | 100.56 | -     | 1.74  | -     | 1.22
DI-VAE (Zhao et al., 2018b)     | 425.11 | 4.19  | 0.69 | 93.72  | -     | 0.13  | -     | 1.26
semi-VAE (Kingma et al., 2014)  | 779.53 | 3.59  | 0.79 | 93.78  | 6.97  | 0.02  | -     | 0.019
semi-VAE + I(x,c)               | 721.34 | 4.87  | 0.73 | 92.95  | 0.49  | 0.14  | -     | 1.34
GMVAE                           | 923.66 | 4.17  | 0.80 | 90.26  | 7.13  | 0.02  | 0.38  | 0.016
DGMVAE (w/o VM)                 | 331.80 | 6.34  | 0.45 | 61.77  | 13.03 | 0.10  | 9.93  | 1.30
DGMVAE (w/o MI)                 | 560.56 | 5.64  | 0.62 | 71.12  | 3.87  | 0.31  | 24.84 | 0.28
DGMVAE                          | 244.30 | 8.45  | 0.35 | 49.60  | 6.41  | 0.10  | 21.42 | 1.19
We first present an ablation study to show that the two extra terms (Sec. 3.3) both contribute to the performance gains. As shown in Tab. 1, the VM term helps alleviate mode-collapse, as indicated by the higher variance of the means; the MI term helps increase the information encoded by the discrete latent variables, as indicated by the higher mutual information. As shown in the last three rows of Tab. 1, both terms improve performance, and DGMVAE with both terms achieves the best results.
With both continuous and discrete latent variables, DGMVAE enjoys higher model capacity and gives the best reconstruction performance (BLEU, PPL), superior to the other VAE variants. Although semi-VAE also includes discrete and continuous latent variables, it fails to make use of both of them because of its independence assumption: as shown in Tab. 1, either the discrete or the continuous latent variable collapses in the semi-VAEs. AE reproduces input sentences well, but fails to generate diverse sentences.
Beyond reconstruction, we find that DGMVAE also significantly outperforms related work in generating high-quality sentences. rPPL is a strong metric for measuring fluency and diversity; DGMVAE obtains the lowest rPPL, significantly better than all other models. The lowest wKL also shows that the word distribution of DGMVAE's generations is the most consistent with the training set.
4.4 Interpretable Generation Results
Following Zhao et al. (2018b), we conduct experiments on interpretable language generation on DD and on dialog generation on SMD.
Because utterances in DD are annotated with action and emotion labels, we evaluate the ability of DGMVAE to capture these latent attributes on DD. We take the index $c$ with the largest posterior probability $q_\phi(c|x)$ as the latent action label. Following Zhao et al. (2018b), we use homogeneity as the metric to evaluate the consistency of the gold action and emotion labels with the labels obtained from DGMVAE. The number of our labels is 125. Homogeneity results for action (act) and emotion (em), together with the MI term and BLEU, are shown in Tab. 2; DGMVAE outperforms the other VAEs in reconstruction and gives the best homogeneity on both action and emotion.

We also evaluate the interpretable dialog generation ability of DGMVAE on SMD, with both automatic and human evaluation. BLEU and three word-embedding-based topic similarity metrics (Serban et al., 2017), Embedding Average, Embedding Extrema and Embedding Greedy (Mitchell and Lapata, 2008; Forgues et al., 2014; Rus and Lintean, 2012), are used to evaluate the quality of responses; we use 300-dimensional GloVe embeddings (Pennington et al., 2014) trained on 840B tokens, from https://nlp.stanford.edu/projects/glove/. In addition, three human evaluators were asked to score the quality (from 0 to 3) of 159 responses generated by DI-VAE and DGMVAE. Because SMD does not offer human-annotated action labels for dialog utterances, we follow Zhao et al. (2018b) and have human experts assign a dialog action name to each discrete latent variable $c$, according to its sampled utterances. Another three annotators were then asked to evaluate the consistency between the action name and another five sampled utterances, which reflects interpretability.
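Homogeneity can be computed from entropies: it is 1 minus the conditional entropy of the gold labels given the induced clusters, normalized by the gold-label entropy. A minimal sketch (the same definition as scikit-learn's `homogeneity_score`):

```python
import math
from collections import Counter

def _entropy(counts, n):
    """Entropy (natural log) of a label distribution given its counts."""
    return -sum(c / n * math.log(c / n) for c in counts.values() if c)

def homogeneity(gold, pred):
    """Homogeneity = 1 - H(gold|pred) / H(gold); it reaches 1.0 when
    every induced cluster contains members of a single gold class."""
    n = len(gold)
    h_gold = _entropy(Counter(gold), n)
    if h_gold == 0.0:
        return 1.0
    clusters = {}
    for g, p in zip(gold, pred):
        clusters.setdefault(p, []).append(g)
    h_cond = sum(len(ms) / n * _entropy(Counter(ms), len(ms))
                 for ms in clusters.values())
    return 1.0 - h_cond / h_gold
```

Pure clusters give a score of 1.0 even when there are more clusters than gold classes, which suits this setting, where the 125 latent labels are compared against a much smaller set of annotated actions and emotions.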
Results are shown in Tab. 3. Both automatic and human evaluation show that DGMVAE obtains better generation quality and interpretability than DI-VAE on SMD. We perform one-tailed t-tests on the human evaluation scores and find that the superiority of our model is significant in both quality and consistency, with p-values no greater than 0.05.
We also perform case studies to validate the performance of DGMVAE qualitatively. Some dialog actions with their utterances discovered by DGMVAE are shown in Tab. 4; it can be seen that utterances with the same action are assigned the same discrete latent variable $c$. We further show dialog cases generated by DGMVAE, with their contexts, in Tab. 5. Given the same context, responses with different actions are generated by sampling different values of the discrete latent variables, which shows that DGMVAE can generate diverse and interpretable responses. More cases can be found in the supplementary materials.
5 Conclusion
In practice, the mode-collapse problem frequently occurs in GMVAE. In this paper, we give a theoretical analysis of this problem and, based on the resulting insights, propose DGMVAE, which effectively alleviates mode collapse. Experimental results show that DGMVAE outperforms a line of related works, obtaining better language generation performance and better interpretability.
References
 Bishop (2006) Christopher M Bishop. 2006. Pattern Recognition and Machine Learning.
 Bowman et al. (2016) Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew Dai, Rafal Jozefowicz, and Samy Bengio. 2016. Generating sentences from a continuous space. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 10–21.
 Chen et al. (2016) Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. 2016. Infogan: interpretable representation learning by information maximizing generative adversarial nets. Neural Information Processing Systems, pages 2180–2188.

 Chung et al. (2014) Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. In NIPS 2014 Workshop on Deep Learning.
 Dilokthanakul et al. (2016) Nat Dilokthanakul, Pedro AM Mediano, Marta Garnelo, Matthew CH Lee, Hugh Salimbeni, Kai Arulkumaran, and Murray Shanahan. 2016. Deep unsupervised clustering with gaussian mixture variational autoencoders. arXiv preprint arXiv:1611.02648.
 Eric et al. (2017) Mihail Eric, Lakshmi Krishnan, Francois Charette, and Christopher D Manning. 2017. Key-value retrieval networks for task-oriented dialogue. Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 37–49.

 Forgues et al. (2014) Gabriel Forgues, Joelle Pineau, Jean-Marie Larchevêque, and Réal Tremblay. 2014. Bootstrapping dialog systems with word embeddings. In NIPS Modern Machine Learning and Natural Language Processing Workshop, volume 2.
 Gupta et al. (2018) Ankush Gupta, Arvind Agarwal, Prawaan Singh, and Piyush Rai. 2018. A deep generative framework for paraphrase generation. In National Conference on Artificial Intelligence.
 Higgins et al. (2017) Irina Higgins, Loic Matthey, Arka Pal, Christopher P Burgess, Xavier Glorot, Matthew M Botvinick, Shakir Mohamed, and Alexander Lerchner. 2017. beta-vae: Learning basic visual concepts with a constrained variational framework. International Conference on Learning Representations.
 Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.
 Hu et al. (2017) Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P Xing. 2017. Toward controlled generation of text. In International Conference on Machine Learning, pages 1587–1596.
 Jang et al. (2016) Eric Jang, Shixiang Gu, and Ben Poole. 2016. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144.
 Jiang et al. (2017) Zhuxi Jiang, Yin Zheng, Huachun Tan, Bangsheng Tang, and Hanning Zhou. 2017. Variational deep embedding: An unsupervised and generative approach to clustering. International Joint Conference on Artificial Intelligence, pages 1965–1972.
 Kingma et al. (2014) Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. 2014. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pages 3581–3589.
 Kingma and Welling (2013) Diederik P Kingma and Max Welling. 2013. Autoencoding variational bayes. arXiv preprint arXiv:1312.6114.

 Li et al. (2017a) Piji Li, Wai Lam, Lidong Bing, and Zihao Wang. 2017a. Deep recurrent generative decoder for abstractive text summarization. Empirical Methods in Natural Language Processing, pages 2091–2100.
 Li et al. (2017b) Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. 2017b. Dailydialog: A manually labelled multi-turn dialogue dataset. International Joint Conference on Natural Language Processing, 1:986–995.
 Marcus et al. (1993) Mitchell P Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of english: the penn treebank. Computational Linguistics, 19(2):313–330.
 Merity et al. (2017) Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2017. Regularizing and optimizing lstm language models. arXiv preprint arXiv:1708.02182.
 Miao et al. (2016) Yishu Miao, Lei Yu, and Phil Blunsom. 2016. Neural variational inference for text processing. International Conference on Machine Learning, pages 1727–1736.
 Mikolov et al. (2010) Tomas Mikolov, Martin Karafiat, Lukas Burget, Jan Cernocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. Interspeech, pages 1045–1048.
 Mitchell and Lapata (2008) Jeff Mitchell and Mirella Lapata. 2008. Vector-based models of semantic composition. Proceedings of ACL-08: HLT, pages 236–244.
 Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Weijing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.
 Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
 Rezende et al. (2014) Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. 2014. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082.
 Rus and Lintean (2012) Vasile Rus and Mihai Lintean. 2012. A comparison of greedy and optimal assessment of natural language student input using word-to-word similarity metrics. In Proceedings of the Seventh Workshop on Building Educational Applications Using NLP, pages 157–162. Association for Computational Linguistics.

 Semeniuta et al. (2017) Stanislau Semeniuta, Aliaksei Severyn, and Erhardt Barth. 2017. A hybrid convolutional variational autoencoder for text generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 627–637.
 Serban et al. (2016) Iulian V Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau. 2016. Building end-to-end dialogue systems using generative hierarchical neural network models. In Thirtieth AAAI Conference on Artificial Intelligence.
 Serban et al. (2017) Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron Courville, and Yoshua Bengio. 2017. A hierarchical latent variable encoder-decoder model for generating dialogues. In Thirty-First AAAI Conference on Artificial Intelligence.
 Sordoni et al. (2015) Alessandro Sordoni, Yoshua Bengio, Hossein Vahabi, Christina Lioma, Jakob Grue Simonsen, and Jian-Yun Nie. 2015. A hierarchical recurrent encoder-decoder for generative context-aware query suggestion. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pages 553–562. ACM.

 Vincent et al. (2010) Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. 2010. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11(12):3371–3408.
 Wang et al. (2019) Wenlin Wang, Zhe Gan, Hongteng Xu, Ruiyi Zhang, Guoyin Wang, Dinghan Shen, Changyou Chen, and Lawrence Carin. 2019. Topic-guided variational autoencoders for text generation. arXiv preprint arXiv:1903.07137.
 Wen et al. (2017) Tsung-Hsien Wen, Yishu Miao, Phil Blunsom, and Steve J Young. 2017. Latent intention dialogue models. International Conference on Machine Learning, pages 3732–3741.
 Xing et al. (2003) Eric P Xing, Michael I Jordan, and Stuart Russell. 2003. A generalized mean field algorithm for variational inference in exponential families. In Proceedings of the Nineteenth conference on Uncertainty in Artificial Intelligence, pages 583–591. Morgan Kaufmann Publishers Inc.
 Xu and Durrett (2018) Jiacheng Xu and Greg Durrett. 2018. Spherical latent spaces for stable variational autoencoders. Empirical Methods in Natural Language Processing.

Zhang et al. (2016) Biao Zhang, Deyi Xiong, Hong Duan, Min Zhang, et al. 2016. Variational neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 521–530.
Zhao et al. (2018a) Junbo Jake Zhao, Yoon Kim, Kelly Zhang, Alexander M. Rush, and Yann LeCun. 2018a. Adversarially regularized autoencoders. In International Conference on Machine Learning, pages 5897–5906.
Zhao et al. (2017a) Shengjia Zhao, Jiaming Song, and Stefano Ermon. 2017a. InfoVAE: Information maximizing variational autoencoders. arXiv preprint.
Zhao et al. (2018b) Tiancheng Zhao, Kyusong Lee, and Maxine Eskenazi. 2018b. Unsupervised discrete sentence representation learning for interpretable neural dialog generation. In ACL.
Zhao et al. (2017b) Tiancheng Zhao, Ran Zhao, and Maxine Eskenazi. 2017b. Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 654–664.
Zhou and Neubig (2017) Chunting Zhou and Graham Neubig. 2017. Multi-space variational encoder-decoders for semi-supervised labeled sequence transduction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 310–320.
Appendix A
A.1 Derivation of Mutual Information Term and Final Objective
In this part, we first split the mutual information term from the ELBO of GMVAE. We focus on the KL term of the discrete latent variable over the corpus,
(5) \mathbb{E}_{x \sim p_d(x)} \mathbb{E}_{q_\phi(z,c|x)} \left[ \log q_\phi(c|x) - \log p(c) \right]
By the mean-field approximation q_\phi(z,c|x) = q_\phi(z|x)\, q_\phi(c|x), the expectation over the continuous latent variable z can be removed by integration:
(6) \mathbb{E}_{x \sim p_d(x)} \left[ \mathrm{KL}\big( q_\phi(c|x) \,\|\, p(c) \big) \right] = \mathrm{KL}\big( q_\phi(c) \,\|\, p(c) \big) + I(c; x)
where q_\phi(c) = \mathbb{E}_{x \sim p_d(x)}\left[ q_\phi(c|x) \right] and I(c; x) is the mutual information between the discrete latent variable c and the input data x.
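This decomposition is the standard "ELBO surgery" identity: averaging the per-example KL between the discrete posterior and the prior over the corpus yields the KL between the aggregated posterior and the prior, plus the mutual information. A small numerical check of the identity (the posterior values below are made up for illustration and are not from the paper):

```python
import numpy as np

# Toy setup: 4 data points, 3 discrete components (values are arbitrary).
q_c_given_x = np.array([[0.7, 0.2, 0.1],
                        [0.1, 0.8, 0.1],
                        [0.2, 0.3, 0.5],
                        [0.6, 0.3, 0.1]])  # rows: q(c|x) for each x
p_c = np.array([1 / 3, 1 / 3, 1 / 3])      # uniform prior p(c)

def kl(p, q):
    return np.sum(p * np.log(p / q))

# Left-hand side: E_x[ KL(q(c|x) || p(c)) ]
lhs = np.mean([kl(row, p_c) for row in q_c_given_x])

# Right-hand side: KL(q(c) || p(c)) + I(c; x), where q(c) is the
# aggregated posterior and I(c; x) = E_x[ KL(q(c|x) || q(c)) ].
q_c = q_c_given_x.mean(axis=0)
mi = np.mean([kl(row, q_c) for row in q_c_given_x])
rhs = kl(q_c, p_c) + mi

assert np.isclose(lhs, rhs)
```

The identity holds exactly for any choice of posteriors and prior, which is why the mutual information term can be split off and regularized independently.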
Replacing the discrete KL term by Eq. 6 and replacing the continuous term by the two terms derived in Theorem 2, the ELBO over the dataset can be written as
(7)
Adding the regularization terms to the ELBO, the final objective is
(8)
which consists of a reconstruction term, a KL divergence over the continuous latent variable, a KL divergence over the discrete latent variable, and a variance-of-means term.
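As a concrete sketch, the four terms listed above can be assembled as follows for a single example. All tensor names and shapes are illustrative, the reconstruction term is a stand-in scalar, and the signs and weights of the regularization terms are assumptions (the paper's Eq. 8, not reproduced here, fixes them):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical model quantities (illustrative only).
K, D = 3, 2                         # number of components, latent dim
mu_c = rng.normal(size=(K, D))      # Gaussian component means (prior)
sigma = 1.0                         # shared prior standard deviation
q_c = np.array([0.6, 0.3, 0.1])     # posterior over components q(c|x)
mu_z = rng.normal(size=D)           # posterior mean of q(z|x)
var_z = np.full(D, 0.5)             # posterior variance of q(z|x)
recon_nll = 12.3                    # stand-in for the reconstruction term

# KL over the discrete variable against a uniform prior p(c) = 1/K.
kl_c = np.sum(q_c * np.log(q_c * K))

# KL over the continuous variable against a Gaussian centered at the
# q(c|x)-weighted mean of the component means (cf. Theorem 2).
mu_bar = q_c @ mu_c
kl_z = np.sum(np.log(sigma / np.sqrt(var_z))
              + (var_z + (mu_z - mu_bar) ** 2) / (2 * sigma ** 2) - 0.5)

# Variance of the component means under q(c|x) (per dimension, summed).
var_means = np.sum(q_c @ (mu_c ** 2) - mu_bar ** 2)

# Assumed combination; the actual weighting follows the paper's objective.
loss = recon_nll + kl_z + kl_c + var_means / (2 * sigma ** 2)
```

Each of the three KL/variance quantities is non-negative by construction, so the sketch at least reproduces the qualitative structure of the objective.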
A.2 Proof of Theorem 1
We first simplify the term of interest:
(9)
where K is the number of Gaussian components, and
(10)
This is because
(11)
and
(12)
Now we derive the partial derivative with respect to the component mean:
(13)
where
(14)
(15)
(16)
(17)
Adding these components, we get:
(18)
At the same time,
(19)
Finally, we multiply the two results above, because for each component:
(20)
Define the following two quantities,
(21) 
where
(22)  
(23)  
(24) 
as
(25) 
When performing gradient ascent on the objective, the gradient therefore drives the component means closer together. As a result, the distance between the means is limited by a decreasing bound.
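The collapse dynamics can be illustrated with a toy experiment. This is a deliberately simplified stand-in for the paper's objective, not its exact training loss: gradient ascent on the log-density of a uniform Gaussian-mixture prior, evaluated at a fixed posterior sample, pulls every component mean toward that sample, so the spread of the means shrinks:

```python
import numpy as np

# Toy: log p(z) for a uniform mixture of K univariate Gaussians with a
# shared std; run gradient ascent on the component means at a fixed z.
sigma, z, lr = 1.0, 0.0, 0.1
mu = np.array([-2.0, 0.5, 3.0])  # initial component means (arbitrary)

def spread(mu):
    return mu.max() - mu.min()

s0 = spread(mu)
for _ in range(200):
    # Responsibilities gamma_c proportional to N(z; mu_c, sigma).
    logits = -(z - mu) ** 2 / (2 * sigma ** 2)
    gamma = np.exp(logits - logits.max())
    gamma /= gamma.sum()
    # d log p(z) / d mu_c = gamma_c * (z - mu_c) / sigma^2
    mu = mu + lr * gamma * (z - mu) / sigma ** 2
s1 = spread(mu)

assert s1 < s0  # the component means have drawn closer together
```

In a full model the posterior samples move as well, but the attracting force on the means is the same, which is the mechanism the theorem bounds.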
A.3 Proof of Theorem 2
We prove this property in the case of univariate Gaussian distribution priors. It is straightforward to generalize it to diagonal multivariate Gaussian distributions by summing over all dimensions.
The term can be rewritten as
(26) \mathbb{E}_{q_\phi(z|x)} \mathbb{E}_{q_\phi(c|x)} \left[ \log p(z|c) \right] = \mathbb{E}_{q_\phi(z|x)} \Big[ \sum_c q_\phi(c|x) \log p(z|c) \Big]
Each prior component p(z|c) is defined as a Gaussian distribution with mean \mu_c and standard deviation \sigma,
(27) p(z|c) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\Big( -\frac{(z - \mu_c)^2}{2\sigma^2} \Big)
We define f(z) = \exp\big( \sum_c q_\phi(c|x) \log p(z|c) \big), which can be written as
(28) f(z) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\Big( -\frac{\sum_c q_\phi(c|x)(z - \mu_c)^2}{2\sigma^2} \Big) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\Big( -\frac{(z - \bar\mu)^2}{2\sigma^2} \Big) \cdot \exp\Big( -\frac{V}{2\sigma^2} \Big)
According to Eq. 28, f(z) can be split into two multiplied terms. The first term is the probability density function of a Gaussian distribution, denoted as
(29) \mathcal{N}(z; \bar\mu, \sigma), \quad \bar\mu = \sum_c q_\phi(c|x)\,\mu_c
whose mean \bar\mu is the expectation of the component means under the posterior distribution q_\phi(c|x) and whose standard deviation is \sigma. The second term contains the variance of the means of the Gaussian components under the distribution q_\phi(c|x),
(30) V = \sum_c q_\phi(c|x)\,\mu_c^2 - \bar\mu^2 = \sum_c q_\phi(c|x)\,(\mu_c - \bar\mu)^2
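The decomposition can be checked numerically: for univariate Gaussian components with a shared standard deviation, the q(c|x)-weighted average of the component log-densities equals the log-density of a single Gaussian centered at the weighted mean of the component means, minus the variance-of-means term scaled by 1/(2σ²). Variable names below are illustrative:

```python
import numpy as np

q_c = np.array([0.5, 0.3, 0.2])   # posterior over components q(c|x)
mu = np.array([-1.0, 0.0, 2.0])   # component means mu_c
sigma = 0.8                       # shared standard deviation
z = 0.37                          # arbitrary evaluation point

def log_gauss(z, m, s):
    return -0.5 * np.log(2 * np.pi * s ** 2) - (z - m) ** 2 / (2 * s ** 2)

# Left-hand side: sum_c q(c|x) * log N(z; mu_c, sigma)
lhs = np.sum(q_c * log_gauss(z, mu, sigma))

mu_bar = np.sum(q_c * mu)                      # expectation of the means
var_means = np.sum(q_c * (mu - mu_bar) ** 2)   # variance of the means
rhs = log_gauss(z, mu_bar, sigma) - var_means / (2 * sigma ** 2)

assert np.isclose(lhs, rhs)
```

The check passes for any z, posterior, and means, since expanding the weighted sum of squares gives (z − μ̄)² plus the variance of the means exactly.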
A.4 Hyper-Parameters
The Adam optimizer is adopted with a learning rate of 0.001. The batch size is set to 30 and the word embedding size to 300. Sentences longer than 40 tokens are truncated. The vocabulary size is set to 10,000. All results were obtained by repeating each experiment three times and averaging.
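For reference, the settings above can be gathered into a single configuration object; the layout and key names here are hypothetical, not taken from the paper's code:

```python
# Hypothetical training configuration mirroring the hyper-parameters above.
config = {
    "optimizer": "adam",
    "learning_rate": 0.001,
    "batch_size": 30,
    "embedding_size": 300,
    "max_sentence_length": 40,  # longer sentences are truncated
    "vocab_size": 10_000,
    "num_runs": 3,              # results averaged over three repetitions
}

assert config["learning_rate"] == 0.001
```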
We also illustrate how the weighting hyper-parameter affects model performance in Fig. 18; we generally find that larger values yield better results. We therefore use this setting for the experiments on PTB and DD.
A.5 More Cases
More examples of actions discovered by DGMVAE are shown in Tab. 6, and more examples of responses generated by DGMVAE are shown in Tab. 7. In Tab. 7, an example without context is included to show the model's ability to begin a dialog on different topics (weather, navigation, and scheduling). For DGMVAE, we can sample different continuous latent variables from a single component; as shown in Tab. 8, this yields diverse responses that share the same action.
Action Name: Request-location
Utterances:
- Which location do you want the weather for?
- Which location should I look up information about?
- Which city are you asking about?

Action Name: Inform-time/appointment
Utterances:
- Your next dinner event is with your father on Friday.
- Your father will be attending your yoga activity on the 2nd with you.
- Your doctor's appointment is Monday at 1 pm.

Action Name: General-thanks
Utterances:
- Thanks.
- Thanks a lot.
- Perfect. Thanks.
Context:
- User: What's the temperature going to be this week?
- Sys: What city are you wanting to know the temperature for?
Predict:
- (110, inform-address) Cleveland.
- (402, request-weather) Will it rain in Redwood City today?

Context: None
Predict:
- (140, request-route/address) find me a nearby coffee shop
- (104, request-weather) what's the weather going to be like today and tomorrow
- (124, command-reminder) remind me about meeting later
Context: User: What is the highest temperature in Brentwood over the next two days?
Action Name: (230) inform-weather
Predict:
- It is currently foggy in Brentwood on Tuesday.
- It will be between 70F and 40F on Saturday.

Context: User: I need gas.
Action Name: (242) inform-route/address
Predict:
- There is a Chevron 3 miles from you.
- There is a Safeway.

Context: User: schedule meeting
Action Name: (304) request-time
Predict:
- What day and time should I set your meeting for?
- What time should I set the alarm?