
Fixing Gaussian Mixture VAEs for Interpretable Text Generation

Variational auto-encoder (VAE) with Gaussian priors is effective in text generation. To improve the controllability and interpretability, we propose to use Gaussian mixture distribution as the prior for VAE (GMVAE), since it includes an extra discrete latent variable in addition to the continuous one. Unfortunately, training GMVAE using standard variational approximation often leads to the mode-collapse problem. We theoretically analyze the root cause --- maximizing the evidence lower bound of GMVAE implicitly aggregates the means of multiple Gaussian priors. We propose Dispersed-GMVAE (DGMVAE), an improved model for text generation. It introduces two extra terms to alleviate mode-collapse and to induce a better structured latent space. Experimental results show that DGMVAE outperforms strong baselines in several language modeling and text generation benchmarks.



1 Introduction

Variational auto-encoders (Kingma and Welling, 2013; Rezende et al., 2014, VAEs) have been widely adopted in natural language generation (Bowman et al., 2016). VAE employs a global latent variable to represent semantics, leading to diverse and coherent generated sentences. Vanilla VAE adopts a continuous latent variable following a multivariate Gaussian distribution with a diagonal covariance matrix. Recently, Zhao et al. (2018b) proposed to replace the continuous latent variable of VAE with a discrete one for better interpretation in generating dialog. The discrete latent variable could represent the dialog actions in their system, which gives promising results even in an unsupervised setting.

However, we argue that a VAE with only a discrete latent variable is not sufficient for interpretable language generation. Compared with the continuous latent variable, the discrete one suffers from relatively low model capacity. Because the discrete latent space contains only a finite number of points, it cannot convey as much information as the continuous latent space (infinitely many points).

In this paper, we propose to generate text using Gaussian mixture VAE (GMVAE). GMVAE has been effective in image modeling (Dilokthanakul et al., 2016; Jiang et al., 2017). It enjoys the benefits of both discrete and continuous latent spaces, where the discrete variable is easy to control. This is superior to semi-supervised VAEs (Kingma et al., 2014; Hu et al., 2017; Zhou and Neubig, 2017), whose discrete and continuous latent variables are independent of each other.

However, vanilla GMVAE suffers from the mode-collapse problem in language generation, where the multiple Gaussian priors tend to concentrate during training and eventually degenerate into a single Gaussian (Fig. 3). Essentially, the Gaussian components in the latent space tend to have close means. GMVAE then fails to capture the multi-modal structure of the data and cannot effectively utilize the discrete latent variable. For example, as illustrated in Fig. 3(a), utterances requesting the weather and requesting an appointment are mapped into the same mode due to the mode-collapse. The mode-collapse problem has also been observed in the image modeling task (Dilokthanakul et al., 2016) using GMVAE. However, the problem is more severe in the scenario of language generation.

Figure 3: Latent space learned by (a) GMVAE and (b) our proposed DGMVAE. Points correspond to dialog utterances. Notice that our proposed DGMVAE avoids the mode-collapse of GMVAE.

Figure 8: Graphical models for various VAEs: (a) VAE, (b) DI-VAE, (c) semi-VAE, (d) GMVAE. $z$ and $c$ are continuous and discrete latent variables, respectively. $x$ is observed data. The solid lines are conditional dependencies and dashed lines represent variational posteriors.

In this paper, we propose the Dispersed-GMVAE (DGMVAE), which fixes the mode-collapse problem in GMVAE. First, we show theoretically that the mode-collapse problem is intrinsically caused by the evidence lower bound (ELBO) of GMVAE. Maximizing the ELBO of GMVAE implicitly aggregates the means of the Gaussian mixture priors. We introduce two extra terms in our proposed DGMVAE to alleviate mode-collapse and to obtain a better structured latent space (Fig. 3(b)). Experimental results show that DGMVAE can avoid the mode-collapse problem effectively. Furthermore, DGMVAE achieves significantly better results in language modeling on Penn Treebank (Marcus et al., 1993, PTB) and in interpretable text generation over two dialog benchmarks.

Our contributions include: a) we propose to use GMVAE for interpretable text generation; b) we theoretically analyze the mode-collapse problem in GMVAE and address it by proposing DGMVAE; c) we empirically study the performance of DGMVAE and show that it obtains good results on various generation tasks.

2 Related Work

VAEs for Language Generation. Variational auto-encoders are proposed by Kingma and Welling (2013, VAEs) and Rezende et al. (2014), and applied by Bowman et al. (2016) for natural language generation. VAEs are extended by many following works in various specific language generation tasks, such as dialog generation (Serban et al., 2017; Wen et al., 2017; Zhao et al., 2017b, 2018b), summarization (Li et al., 2017a) and other natural language generation tasks (Miao et al., 2016; Zhang et al., 2016; Semeniuta et al., 2017; Gupta et al., 2018; Xu and Durrett, 2018).

Additionally, Wen et al. (2017) and Zhao et al. (2018b) propose to replace the continuous latent variable with a discrete one for interpretable sentence generation. Kingma et al. (2014) propose the semi-VAE for semi-supervised learning. This model is then adopted by Hu et al. (2017) and Zhou and Neubig (2017) for style-transfer and labeled sequence transduction, respectively. Different from GMVAE, continuous and discrete latent variables in semi-VAE are independent.

Gaussian Mixture VAEs. Using Gaussian mixture models as priors in VAEs is not new. Gaussian mixture variational auto-encoders have been used in unsupervised clustering (Dilokthanakul et al., 2016; Jiang et al., 2017), obtaining promising results. Wang et al. (2019) used GMM as priors for topic-guided text generation. The GMVAE used in this paper is similar to that of Jiang et al. (2017). However, we apply GMVAE to interpretable text generation and propose the DGMVAE to address the mode-collapse problem according to our theoretical analysis.

KL Collapse vs. Mode Collapse. Vanilla VAE models usually suffer from the KL collapse problem in language generation, in which the KL regularization term quickly collapses to 0. A line of follow-up work (Bowman et al., 2016; Zhao et al., 2017b, 2018b; Higgins et al., 2017) has been proposed to avoid the KL collapse problem. Mode collapse, in contrast, is specific to mixture models: multiple modes vanish and collapse into a single mode. Mode collapse is also caused by the KL term, but the essential cause is different. In this paper, we focus on addressing the mode-collapse problem.

3 Proposed Approach

Probabilistic graphical models of VAE and its variants are shown in Fig. 8. Vanilla VAE (Kingma and Welling, 2013) only includes a continuous latent variable; discrete VAE models such as DI-VAE (Zhao et al., 2018b) adopt a discrete latent variable for interpretability; semi-VAE (Kingma et al., 2014) employs independent discrete and continuous latent variables; Gaussian mixture VAE (GMVAE) (Dilokthanakul et al., 2016; Jiang et al., 2017) uses dependent discrete and continuous latent variables for more interpretable generation.

In the following, we will first describe the vanilla GMVAE in Sec. 3.1. We will give a theoretical analysis of the mode-collapse issue in GMVAE. Based on the theoretical insights, we propose DGMVAE to fix the issue.

3.1 Gaussian Mixture VAE

GMVAE is a probabilistic generative model that adopts a Gaussian mixture model (Bishop, 2006) as its prior. GMVAE employs a discrete latent variable $c$ and a continuous latent variable $z$, with $z$ dependent on $c$. In this model, the marginal likelihood of a sentence $x$ is:

$$p(x) = \int \sum_{c} p_\theta(x \mid z)\, p(z, c)\, dz \tag{1}$$

in which $\theta$ denotes the parameters of the generation model, which generates $x$ from $z$. $p(z, c)$ is the Gaussian mixture prior distribution and can be computed by $p(z, c) = p(c)\, p(z \mid c)$. Intuitively, $p(c)$ represents the components of the Gaussian mixture and can be assumed to be a uniform distribution, while $p(z \mid c)$ is a multivariate Gaussian distribution of the corresponding component.

Testing. During testing, a Gaussian mixture component $c$ is first chosen according to the prior distribution $p(c)$. Then the continuous variable $z$ is sampled from the chosen Gaussian prior $p(z \mid c)$. As in Bowman et al. (2016), a generation network takes $z$ as input and generates the sentence $x$ through a decoder $p_\theta(x \mid z)$.
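The test-time procedure is ancestral sampling from the mixture prior. The following is a minimal NumPy sketch under stated assumptions: a uniform prior over components and diagonal covariances. The function name `sample_gmm_prior` and the toy component parameters are illustrative and not taken from the paper; the decoder step is omitted.

```python
import numpy as np

def sample_gmm_prior(means, stds, rng, n_samples=1):
    """Draw c ~ Uniform{0..K-1}, then z ~ N(mu_c, diag(std_c^2)).

    means, stds: arrays of shape (K, D), one row per Gaussian component.
    Returns the sampled component indices (n_samples,) and latents (n_samples, D).
    """
    num_components, dim = means.shape
    c = rng.integers(0, num_components, size=n_samples)  # uniform prior over components (assumption)
    eps = rng.standard_normal((n_samples, dim))
    z = means[c] + stds[c] * eps                          # sample from the chosen component
    return c, z

rng = np.random.default_rng(0)
means = np.array([[-3.0, 0.0], [3.0, 0.0], [0.0, 3.0]])  # toy component means
stds = np.full_like(means, 0.5)
c, z = sample_gmm_prior(means, stds, rng, n_samples=4)
```

Each sampled `z` would then be handed to the decoder as its initial state.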

Training. Optimization and inference for Eq. 1 are difficult. Following Kingma and Welling (2013) and Rezende et al. (2014), we use a variational posterior distribution $q_\phi(z, c \mid x)$ with parameters $\phi$ to approximate the real posterior distribution $p(z, c \mid x)$. With the mean field approximation (Xing et al., 2003), $q_\phi(z, c \mid x)$ can be factorized as:

$$q_\phi(z, c \mid x) = q_\phi(z \mid x)\, q_\phi(c \mid x)$$

The posterior $q_\phi(z \mid x)$ is assumed to be a multivariate Gaussian distribution, whose mean $\mu_\phi(x)$ and variance $\sigma^2_\phi(x)$ are obtained through a neural network (recognition network). $q_\phi(c \mid x)$ is calculated according to:

$$q_\phi(c = k \mid x) = \frac{\pi_k\, p(\hat{z} \mid c = k)}{\sum_{j} \pi_j\, p(\hat{z} \mid c = j)}$$

in which $p(\hat{z} \mid c = k)$ is the probability of generating $\hat{z}$, the mapped vector of $x$ in latent space, by the $k$-th Gaussian component, and $\pi_k$ could be taken as the prior $p(c = k)$. In practice, $\hat{z}$ is taken as the deterministic mapping of $x$ in the latent space, which does not damage the conditional independence in the mean field approximation.
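The discrete posterior is therefore a normalized product of component prior and component density at the mapped latent vector. A hedged NumPy sketch follows; it assumes diagonal-covariance components and, by default, a uniform prior. The helper names `log_gaussian_diag` and `component_posterior` are hypothetical.

```python
import numpy as np

def log_gaussian_diag(z, means, stds):
    """log N(z; mu_k, diag(std_k^2)) for every component k.

    z: (D,), means/stds: (K, D). Returns an array of shape (K,).
    """
    var = stds ** 2
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (z - means) ** 2 / var, axis=1)

def component_posterior(z_x, means, stds, log_prior=None):
    """q(c = k | x) proportional to p(c = k) * p(z_x | c = k), computed stably in log space."""
    num_components = means.shape[0]
    if log_prior is None:  # uniform prior over components (assumption)
        log_prior = np.full(num_components, -np.log(num_components))
    logits = log_gaussian_diag(z_x, means, stds) + log_prior
    logits = logits - logits.max()  # subtract the max before exponentiating
    probs = np.exp(logits)
    return probs / probs.sum()

means = np.array([[0.0, 0.0], [5.0, 5.0]])
stds = np.ones_like(means)
q_c = component_posterior(np.array([0.2, -0.1]), means, stds)  # mass concentrates on component 0
```

Note that when all component means coincide, this posterior equals the prior, so the discrete variable carries no information about $x$.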

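Instead of optimizing the marginal likelihood in Eq. 1, we maximize an evidence lower bound (ELBO). The ELBO can be decomposed as the sum of a reconstruction term and regularization terms for $z$ and $c$, denoted $\mathcal{R}_z$ and $\mathcal{R}_c$, respectively:

$$\mathrm{ELBO} = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] \underbrace{-\, \mathbb{E}_{q_\phi(c \mid x)}\big[\mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z \mid c)\big)\big]}_{\mathcal{R}_z} \underbrace{-\, \mathrm{KL}\big(q_\phi(c \mid x)\,\|\,p(c)\big)}_{\mathcal{R}_c} \tag{3}$$

All parameters, including $\theta$ and $\phi$, can be jointly trained with the reparameterization trick (Kingma and Welling, 2013) (for sampling $z$) and Gumbel-softmax (Jang et al., 2016) (for sampling $c$).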
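To make the two regularization terms concrete, the sketch below evaluates them for a single example with closed-form KL divergences. It assumes diagonal Gaussians and a uniform prior over components; the function names are illustrative, and the reconstruction term (which needs a decoder) is left out.

```python
import numpy as np

def kl_diag_gaussians(mu_q, std_q, mu_p, std_p):
    """KL(N(mu_q, diag(std_q^2)) || N(mu_p, diag(std_p^2))), summed over the last axis."""
    return np.sum(
        np.log(std_p / std_q) + (std_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * std_p ** 2) - 0.5,
        axis=-1,
    )

def elbo_regularizers(mu_q, std_q, q_c, means, stds, eps=1e-12):
    """Return (R_z, R_c), the two non-positive regularization terms of the ELBO.

    mu_q, std_q: (D,) posterior parameters of z given x.
    q_c: (K,) discrete posterior over components.
    means, stds: (K, D) parameters of the Gaussian mixture prior.
    """
    kl_per_component = kl_diag_gaussians(mu_q, std_q, means, stds)  # shape (K,)
    r_z = -np.sum(q_c * kl_per_component)                           # -E_{q(c|x)} KL(q(z|x) || p(z|c))
    num_components = len(q_c)
    r_c = -np.sum(q_c * (np.log(q_c + eps) + np.log(num_components)))  # -KL(q(c|x) || uniform)
    return r_z, r_c

means = np.zeros((3, 2))
stds = np.ones((3, 2))
r_z, r_c = elbo_regularizers(np.ones(2), np.ones(2), np.array([0.8, 0.1, 0.1]), means, stds)
```

Both values are at most zero, and the second reaches zero exactly when the discrete posterior equals the uniform prior.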

However, GMVAE often encounters mode-collapse, where all components of GMM are concentrated into one Gaussian distribution and GMVAE degenerates into uni-Gaussian VAE.

3.2 Mode-Collapse Problem

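We further investigate the objective function to analyze mode-collapse. To this end, we present two theorems, which indicate that the regularization terms of GMVAE's ELBO, $\mathcal{R}_z$ (for the continuous variable $z$) and $\mathcal{R}_c$ (for the discrete variable $c$), are responsible for the mode collapse problem. We only give explanations and remarks for each theorem, with the details included in the supplementary materials.

We write $\mu_x$ for the posterior mean of $z$ given $x$, and $\mu_c$ for the set of means of the Gaussian components. The trace of the variance matrix of the means in $\mu_c$ is denoted as $\mathrm{Var}(\mu_c)$.

Theorem 1.

Maximizing $\mathcal{R}_c$ pushes a close upper bound of $\mathrm{Var}(\mu_c)$ to decrease. This bound is the sum of squared distances between $\mu_x$ and the means in $\mu_c$.

By performing some algebraic operations, we find that the inner product of the gradient of $\mathcal{R}_c$ and the gradient of this bound is always non-positive, which means the two gradients point in opposite directions. So, performing gradient ascent on $\mathcal{R}_c$ makes the bound smaller. As a result, $\mathrm{Var}(\mu_c)$ is limited by the decreasing bound.

We denote by $\mathrm{Var}_{q(c \mid x)}(\mu_c)$ the trace of the variance matrix of the means under the distribution $q_\phi(c \mid x)$, assuming that all Gaussian components share the same standard deviation $\sigma$.

Theorem 2.

$\mathcal{R}_z$ contains a negative regularization term of $\mathrm{Var}_{q(c \mid x)}(\mu_c)$.

$\mathcal{R}_z$ can be rewritten as the negative KL divergence between the posterior $q_\phi(z \mid x)$ and an expected prior $\hat{p}(z)$, minus a positive multiple of $\mathrm{Var}_{q(c \mid x)}(\mu_c)$, where $\hat{p}(z)$ is a multivariate Gaussian distribution whose mean is the expectation of the means of the priors under the distribution $q_\phi(c \mid x)$.

Hence, maximizing $\mathcal{R}_z$ implicitly minimizes $\mathrm{Var}_{q(c \mid x)}(\mu_c)$, which may also lead to mode collapse in GMVAE.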
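The decomposition behind Theorem 2 can be worked out in a few lines. The derivation below is a sketch under explicit assumptions that go beyond the text: every component prior is isotropic, $p(z \mid c = k) = \mathcal{N}(\mu_k, \sigma^2 I)$, and the shorthand $q_k = q_\phi(c = k \mid x)$, $\bar{\mu} = \sum_k q_k \mu_k$ and $\hat{p}(z) = \mathcal{N}(\bar{\mu}, \sigma^2 I)$ is introduced here for illustration. The exact constants in the paper's supplementary materials may be stated differently.

```latex
% For a Gaussian posterior q(z|x) with mean \mu_x, only the mean-distance term depends on k:
\mathrm{KL}\big(q(z \mid x)\,\|\,\mathcal{N}(\mu_k, \sigma^2 I)\big)
  = A + \frac{\lVert \mu_x - \mu_k \rVert^2}{2\sigma^2},
  \qquad A \text{ independent of } k .

% Bias-variance split of the expected squared distance under q(c|x):
\sum_k q_k \lVert \mu_x - \mu_k \rVert^2
  = \lVert \mu_x - \bar{\mu} \rVert^2 + \sum_k q_k \lVert \mu_k - \bar{\mu} \rVert^2 .

% Combining the two lines:
\mathcal{R}_z
  = -\sum_k q_k\, \mathrm{KL}\big(q(z \mid x)\,\|\,p(z \mid c = k)\big)
  = -\mathrm{KL}\big(q(z \mid x)\,\|\,\hat{p}(z)\big)
    - \frac{1}{2\sigma^2} \sum_k q_k \lVert \mu_k - \bar{\mu} \rVert^2 .
```

The last sum is the trace of the variance matrix of the component means under $q_\phi(c \mid x)$, so increasing $\mathcal{R}_z$ rewards shrinking that variance.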

3.3 Dispersed-GMVAE

In this section, we propose the Dispersed-GMVAE (DGMVAE), which is a simple yet effective way to avoid the mode-collapse problem.

According to the theoretical insights in Sec. 3.2, we propose to include two extra terms in our objective to counteract the mode collapse induced by the ELBO. The new objective (Eq. 4), defined for $x$ sampled from the dataset, adds a variance regularization term and a mutual information term to the ELBO.

Here $q(c)$ is the posterior marginal distribution of $c$, estimated by the expectation of $q_\phi(c \mid x)$ over the data. Empirically, it is obtained by averaging $q_\phi(c \mid x)$ within the mini-batch.

Intuitively, the two extra terms serve different roles. The variance regularization term, weighted by a hyper-parameter, is proposed to regularize the concentration trend of the Gaussian mixture components. We can tune this hyper-parameter to make a trade-off between variance and concentration degree.

The mutual information term in Eq. 4 follows previous works (Chen et al., 2016; Zhao et al., 2017a, 2018b), in which it enhances interpretability and alleviates the KL collapse. Our ablation study in the experiments shows that both terms are necessary to obtain good empirical performance of DGMVAE.
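The two extra quantities can be estimated from a mini-batch of discrete posteriors. The sketch below is one plausible reading, not the authors' implementation. It assumes the variance term is the trace of the variance of the component means under $q(c \mid x)$, and that the mutual information is the entropy of the batch-averaged posterior minus the average posterior entropy, as in Zhao et al. (2018b). The function names are hypothetical.

```python
import numpy as np

def variance_of_means(q_c, means):
    """Trace of the variance matrix of component means under q(c|x).

    q_c: (K,) discrete posterior, means: (K, D) component means.
    """
    mean_bar = q_c @ means  # expected mean under q(c|x)
    return float(np.sum(q_c * np.sum((means - mean_bar) ** 2, axis=1)))

def batch_mutual_information(q_c_batch, eps=1e-12):
    """Estimate I(x; c) = H(q(c)) - E_x[H(q(c|x))] from a batch of posteriors (B, K)."""
    q_marginal = q_c_batch.mean(axis=0)  # q(c) averaged within the mini-batch
    h_marginal = -np.sum(q_marginal * np.log(q_marginal + eps))
    h_conditional = -np.mean(np.sum(q_c_batch * np.log(q_c_batch + eps), axis=1))
    return float(h_marginal - h_conditional)

means = np.array([[0.0, 0.0], [2.0, 0.0]])
vm = variance_of_means(np.array([0.5, 0.5]), means)          # dispersed means give a positive value
mi = batch_mutual_information(np.array([[1.0, 0.0], [0.0, 1.0]]))  # confident, balanced posteriors
```

With collapsed means the first quantity is zero, and with uninformative (identical) posteriors the second is zero, which matches the two symptoms of mode-collapse discussed above.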

The final objective of DGMVAE is obtained by combining Eq. 3 and Eq. 4. More details on how to obtain the final objective function from Eq. 3 and Eq. 4 can be found in the supplementary materials.

Except for the learning objective, DGMVAE has a similar architecture to GMVAE. It consists of an RNN encoder for learning the posterior and an RNN decoder for generation.

Encoder. Recurrent neural networks such as GRU (Chung et al., 2014) serve as recognition networks that encode sentences into compact hidden states. The mean and variance of the posterior distribution $q_\phi(z \mid x)$ (assumed to be a multivariate diagonal Gaussian) are obtained from the last hidden states through two affine transformations.

Decoder. In the decoding phase, we first sample $z$ from the GMM priors (in testing) or from the posterior $q_\phi(z \mid x)$ (in training) by the reparameterization trick (Kingma and Welling, 2013). The sentences are then generated in a recurrent neural language model fashion (generation networks), with $z$ as the initial hidden state. We use multiple independent discrete latent variables following Zhao et al. (2018b).
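The encoder heads and the training-time sampling step can be sketched as follows. This is an illustrative NumPy version, not the authors' code: the second affine map is assumed to output a log-variance (a common parameterization that the text does not specify), and the weight names and shapes are hypothetical. The GRU itself is replaced by random stand-in hidden states.

```python
import numpy as np

def gaussian_posterior_params(h_last, W_mu, b_mu, W_logvar, b_logvar):
    """Two affine transformations of the last hidden state give the mean and log-variance."""
    mu = h_last @ W_mu + b_mu
    logvar = h_last @ W_logvar + b_logvar  # log-variance output is an assumption
    return mu, logvar

def reparameterize(mu, logvar, rng):
    """z = mu + sigma * eps with eps ~ N(0, I), so gradients can flow through mu and sigma."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

rng = np.random.default_rng(0)
hidden_size, latent_dim, batch = 8, 3, 4
h_last = rng.standard_normal((batch, hidden_size))  # stand-in for the GRU's last hidden states
W_mu = rng.standard_normal((hidden_size, latent_dim))
W_logvar = rng.standard_normal((hidden_size, latent_dim))
mu, logvar = gaussian_posterior_params(h_last, W_mu, np.zeros(latent_dim), W_logvar, np.zeros(latent_dim))
z = reparameterize(mu, logvar, rng)  # would initialize the decoder's hidden state
```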

Interpretable Dialog Generation. We follow the same approach of DI-VAE (Zhao et al., 2018b) for interpretable dialog generation. The approach could be extended to other scenarios of interpretable generations, but we only validate our DGMVAE on dialog for comparing with Zhao et al. (2018b).

Specifically, in dialog generation, we generate a response given the dialog context. A DGMVAE model is pre-trained on all utterances of the training set to capture the interpretable factors (the discrete latent variable $c$) such as dialog actions or intentions. In training, a hierarchical recurrent encoder-decoder model (HRED) with attention (Sordoni et al., 2015; Serban et al., 2016) is trained to generate the response. Here $c$ is obtained from the pre-trained recognition network of DGMVAE and then fed into the decoder. A policy network is trained jointly to predict the $c$ sampled from $q_\phi(c \mid x)$, so that it can predict $c$ in the testing stage.

4 Experiments

In this section, we empirically test the generation quality and interpretable ability of our proposed model on standard benchmarks, compared with a line of baselines.

4.1 Setup

We conduct experiments following Zhao et al. (2018b). For generation quality, we use the Penn Treebank (Marcus et al., 1993, PTB) pre-processed by Mikolov (Mikolov et al., 2010) as the benchmark. For interpretability, we use the Daily Dialogs (Li et al., 2017b, DD) and the Stanford Multi-Domain Dialog (Eric et al., 2017, SMD) datasets. DD is a chat-oriented dataset containing 13,118 multi-turn dialogs, annotated with dialog actions and emotions. SMD contains 3,031 human-Woz, task-oriented dialogs collected from 3 different domains (navigation, weather and scheduling).

We compare our model with the following baselines: 1) RNNLM, language model (Mikolov et al., 2010) implemented by LSTM (Hochreiter and Schmidhuber, 1997); 2) AE, auto-encoders (Vincent et al., 2010) without latent space regularization; 3) DAE, auto-encoders with discrete latent space; 4) VAE, the vanilla VAE (Kingma and Welling, 2013) with only a continuous latent variable and a normal distribution prior; 5) DVAE, VAE with discrete latent variables; 6) DI-VAE, a DVAE variant (Zhao et al., 2018b) with an extra mutual information term; 7) semi-VAE, the semi-supervised VAE model proposed by Kingma et al. (2014) with independent discrete and continuous latent variables; 8) GMVAE, vanilla GMVAE models as introduced in Sec. 3.1. Results of these baselines are obtained by our implementation, except for DI-VAE, whose results we obtain with the released code. (Our reported results differ from those in their paper because we find that tokenization is performed twice in their code.) Gumbel-softmax (Jang et al., 2016) is used for reparameterization in VAE variants with a discrete latent variable.
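For readers unfamiliar with Gumbel-softmax, the relaxation can be sketched in a few lines of NumPy. This shows the generic technique of Jang et al. (2016) rather than the experimental code of this paper; the temperature value and the function name are illustrative.

```python
import numpy as np

def gumbel_softmax_sample(logits, temperature, rng):
    """Draw a relaxed one-hot sample from a categorical distribution.

    logits: (..., K) unnormalized log-probabilities.
    Lower temperatures give samples closer to one-hot vectors.
    """
    u = rng.uniform(low=1e-10, high=1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))                 # Gumbel(0, 1) noise
    y = (logits + gumbel) / temperature
    y = y - y.max(axis=-1, keepdims=True)        # stabilize the softmax
    exp_y = np.exp(y)
    return exp_y / exp_y.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
logits = np.log(np.array([[0.7, 0.2, 0.1]]))
soft_sample = gumbel_softmax_sample(logits, temperature=0.5, rng=rng)
```

In an autodiff framework the same expression is differentiable with respect to the logits, which is what allows the discrete latent variable to be trained jointly with the rest of the model.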

The encoder and decoder in all models are implemented with a single-layer GRU (Chung et al., 2014), with a hidden size of 512. The dimension of each discrete latent variable is set to 10 for PTB and 5 for DD and SMD, while the number of discrete latent variables is set to 20, 3 and 3, respectively. The dimension of the continuous latent space is 100 for PTB, 15 for DD and 48 for SMD. The hyper-parameter of the variance regularization term is set to 0.9 for DGMVAE. KL annealing with a logistic weight function is adopted for all VAE variants. All hyper-parameters, including this one, are chosen according to the objective (language generation task) or BLEU scores (dialog generation task) on the validation set. Details of the hyper-parameters are included in the supplementary materials.
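KL annealing with a logistic weight can be sketched as below. The midpoint and steepness values are placeholders chosen for illustration only; the actual schedule parameters are not given in this section.

```python
import math

def logistic_kl_weight(step, midpoint=2500, steepness=0.0025):
    """Weight in (0, 1) that multiplies the KL terms at a given training step.

    midpoint: step at which the weight reaches 0.5 (hypothetical value).
    steepness: controls how quickly the weight rises (hypothetical value).
    """
    return 1.0 / (1.0 + math.exp(-steepness * (step - midpoint)))

# The weight starts near 0, so early training focuses on reconstruction,
# and approaches 1 so that the full objective is optimized later on.
schedule = [logistic_kl_weight(s) for s in (0, 2500, 10000)]
```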

4.2 Effects of DGMVAE on Mode-Collapse

Figure 17: Visualization of the mode collapse problem on the DD dataset for GMVAE (panels a, b), GMVAE with one of the two extra terms added (panels c, d and e, f), and DGMVAE (panels g, h), each after 2,000 and 10,000 training steps. GMM priors are represented by grey points (mean) and circles (variance). The means of the posteriors are marked as colored points (colors are associated with discrete latent variables).

We illustrate the effectiveness of DGMVAE to alleviate the mode-collapse problem. Fig. 17 gives a visualization. We train GMVAE and DGMVAE in utterances on the DD dataset, and randomly sample 300 points from test data at 2,000 and 10,000 training steps, respectively. The dimension of latent space is set to 2 for visualization. As in Fig. 17, the mean and variance of GMM priors are indicated by grey points and circles, respectively. The means of posteriors are marked as colored points (points with different discrete latent variables are associated with different colors).

It can be seen that, after 10,000 training steps, the vanilla GMVAE degenerates into a uni-Gaussian VAE, with the same mean values for all Gaussian components (Fig. 17(a) and 17(b)). DGMVAE gives quite promising results, as shown in Fig. 17(g) and 17(h), in which different components of the GMM are dispersed and cluster data points into multiple modes well. In order to verify the effects of the two additional terms of DGMVAE, we incorporate them one at a time. GMVAE with the first term indeed helps alleviate the mode-collapse; however, the posterior points are quite concentrated around the priors, which indicates that the latent space in this case is not smooth enough. GMVAE with the second term can also avoid the mode-collapse problem, but it does not cluster points with the same discrete labels together well.

4.3 Language Generation Performance

We evaluate the performance of language generation on PTB in Tab. 1, comparing DGMVAE with baselines described in Sec. 4.1. The test set of PTB is also included for comparison of text fluency.

We include four metrics to evaluate the generation performance: reverse perplexity (rPPL), BLEU (Papineni et al., 2002), word level KL divergence (wKL) and perplexity (PPL). Reverse perplexity is the perplexity of an LSTM language model (Merity et al., 2017) trained on the synthetic data sampled from the priors of VAE variants, and evaluated on the test set (Zhao et al., 2018a). Lower rPPL indicates that the sampled sentences are more diverse and fluent. The BLEU score between input and reconstructed sentences reflects the ability of reconstruction. Word level KL divergence between word frequencies calculated in generated data and training data shows the quality of generation. Perplexity of samples generated from the posterior on the test set measures the fluency of reconstruction. These metrics are evaluated on the test set of PTB, except rPPL and wKL, which are calculated on sentences generated by sampling from these models' prior distributions (sampling a random vector for AE). The sample size is 40,000, the same as the PTB training set.
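As an illustration of the word level KL metric, the sketch below compares unigram frequencies of two corpora. Several details are assumptions, since the text does not specify them: the direction KL(generated || training), whitespace tokenization, and add-one smoothing over the joint vocabulary. The function name `word_level_kl` is hypothetical.

```python
import math
from collections import Counter

def word_level_kl(generated_sents, reference_sents, smoothing=1.0):
    """KL divergence between smoothed unigram distributions of two corpora."""
    gen = Counter(w for s in generated_sents for w in s.split())
    ref = Counter(w for s in reference_sents for w in s.split())
    vocab = set(gen) | set(ref)
    gen_total = sum(gen.values()) + smoothing * len(vocab)
    ref_total = sum(ref.values()) + smoothing * len(vocab)
    kl = 0.0
    for word in vocab:
        p = (gen[word] + smoothing) / gen_total  # frequency in generated data
        q = (ref[word] + smoothing) / ref_total  # frequency in training data
        kl += p * math.log(p / q)
    return kl

train = ["the market rose today", "the market fell today"]
samples = ["the market rose", "the the the market"]
wkl = word_level_kl(samples, train)  # larger values mean a bigger frequency mismatch
```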

Besides, the values of the regularization terms are also included in order to give some indication of the mode-collapse and KL collapse. We list the KL divergence of the continuous latent variables (KL(z)) and of the discrete latent variables (KL(c)), which are averaged by the number of discrete variables. For models with the VM term, KL(z) is the KL divergence between the posterior and the expected prior. For models with the MI term, KL(c) is the KL divergence between the marginal posterior $q(c)$ and the prior $p(c)$. The variance of the GMM means (VM) and the mutual information (MI) terms are shown as well. VM is calculated on the posterior $q_\phi(c \mid x)$ and averaged over all samples.

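Columns rPPL to PPL are evaluation results; columns KL(z) to MI are regularization terms.

| Model | rPPL↓ | BLEU↑ | wKL↓ | PPL↓ | KL(z) | KL(c) | VM | MI |
| Test Set | - | 100.0 | 0.14 | - | - | - | - | - |
| RNNLM (Mikolov et al., 2010) | - | - | - | 117.60 | - | - | - | - |
| AE (Vincent et al., 2010) | 730.81 | 10.88 | 0.58 | 31.90 | - | - | - | - |
| VAE (Kingma and Welling, 2013) | 922.71 | 3.73 | 0.76 | 91.95 | 6.62 | - | - | - |
| DAE | 797.17 | 3.93 | 0.58 | 88.55 | - | - | - | - |
| DVAE | 453.53 | 3.61 | 0.58 | 100.56 | - | 1.74 | - | 1.22 |
| DI-VAE (Zhao et al., 2018b) | 425.11 | 4.19 | 0.69 | 93.72 | - | 0.13 | - | 1.26 |
| semi-VAE (Kingma et al., 2014) | 779.53 | 3.59 | 0.79 | 93.78 | 6.97 | 0.02 | - | 0.019 |
| semi-VAE | 721.34 | 4.87 | 0.73 | 92.95 | 0.49 | 0.14 | - | 1.34 |
| GMVAE | 923.66 | 4.17 | 0.80 | 90.26 | 7.13 | 0.02 | 0.38 | 0.016 |
| DGMVAE | 331.80 | 6.34 | 0.45 | 61.77 | 13.03 | 0.10 | 9.93 | 1.30 |
| DGMVAE | 560.56 | 5.64 | 0.62 | 71.12 | 3.87 | 0.31 | 24.84 | 0.28 |
| DGMVAE | 244.30 | 8.45 | 0.35 | 49.60 | 6.41 | 0.10 | 21.42 | 1.19 |

Table 1: Language generation results on PTB. DGMVAE uses the hyper-parameter setting of Sec. 4.1. ↑ (↓): the larger (lower), the better.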

We first present the ablation study to show whether the two extra terms (Sec. 3.3) both contribute to the performance gains. As shown in Tab. 1, VM term helps to alleviate the mode-collapse, according to the higher variance of mean. MI term helps to increase the information encoded by discrete latent variables, according to higher mutual information. As shown in the last three rows of Tab. 1, both the two terms improve the performance. The DGMVAE with both terms achieves the best results.

With both continuous and discrete latent variables, DGMVAE has a higher model capacity and gives the best reconstruction performance (BLEU, PPL) among the VAE variants. Although semi-VAE also includes discrete and continuous latent variables, it fails to make use of both of them because of the independence assumption. As shown in Tab. 1, either the discrete or the continuous latent variable collapses in semi-VAEs. AE could reproduce input sentences well, but it fails to generate diverse sentences.

Besides the reconstruction, we also find that DGMVAE significantly outperforms related work in generating high-quality sentences. rPPL is a powerful metric for measuring the fluency and diversity; DGMVAE obtains the lowest rPPL, which is significantly better than all other models. The lowest wKL also shows that word distribution in DGMVAE generations is most consistent with the training set.

4.4 Interpretable Generation Results

Following Zhao et al. (2018b), we include the experiments of interpretable language generation on DD and dialog generation on SMD, respectively.

Because utterances in DD are annotated with Action and Emotion labels, we evaluate the ability of DGMVAE to capture these latent attributes on DD. We take the index $k$ with the largest posterior probability $q_\phi(c = k \mid x)$ as the latent action label. Following Zhao et al. (2018b), we use homogeneity as the metric to evaluate the consistency between the golden action and emotion labels and the labels obtained from DGMVAE. The number of our labels is 125. Results of homogeneity of action (act) and emotion (em), together with the MI term and BLEU, are shown in Tab. 2. They show that DGMVAE outperforms other VAEs in reconstruction and gives the best homogeneity on both action and emotion.
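The labelling and scoring steps can be illustrated in plain Python. Two points are assumptions made for this sketch: the 125 labels are taken to arise from combining 3 discrete variables with 5 values each (5^3 = 125, consistent with the DD setup in Sec. 4.1), and homogeneity is computed with the standard entropy-based definition, 1 - H(gold | predicted) / H(gold). The function names are hypothetical.

```python
import math
from collections import Counter

def joint_label(values, num_values=5):
    """Combine the argmax indices of several discrete variables into one label id."""
    label = 0
    for v in values:
        label = label * num_values + v
    return label

def homogeneity(gold, pred):
    """1 - H(gold | pred) / H(gold); equals 1.0 when every cluster holds a single gold class."""
    n = len(gold)
    class_counts = Counter(gold)
    cluster_counts = Counter(pred)
    joint_counts = Counter(zip(gold, pred))
    h_gold = -sum(c / n * math.log(c / n) for c in class_counts.values())
    if h_gold == 0.0:
        return 1.0
    h_gold_given_pred = -sum(
        n_ck / n * math.log(n_ck / cluster_counts[k]) for (_, k), n_ck in joint_counts.items()
    )
    return 1.0 - h_gold_given_pred / h_gold

gold_actions = ["inform", "inform", "question", "question"]
latent_labels = [joint_label((1, 0, 2)), joint_label((1, 0, 2)), joint_label((3, 2, 4)), joint_label((3, 2, 4))]
score = homogeneity(gold_actions, latent_labels)  # each latent label maps to a single action
```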

| Model | MI | BLEU | act | em |
| DI-VAE | 1.20 | 3.05 | 0.18 | 0.09 |
| semi-VAE | 0.03 | 4.06 | 0.02 | 0.08 |
| semi-VAE | 1.21 | 3.69 | 0.21 | 0.14 |
| GMVAE | 0.00 | 2.03 | 0.08 | 0.02 |
| DGMVAE | 1.41 | 2.96 | 0.19 | 0.09 |
| DGMVAE | 0.53 | 7.63 | 0.11 | 0.09 |
| DGMVAE | 1.32 | 7.39 | 0.23 | 0.16 |

Table 2: Results of interpretable language generation on DD. Mutual information (MI), BLEU and homogeneity with actions (act) and emotions (em) are shown. The larger, the better.

Automatic Metrics

| Model | BLEU | Ave. | Ext. | Grd. |
| DI-VAE | 7.06 | 76.17 | 43.98 | 60.92 |
| DGMVAE | 10.16 | 78.93 | 48.14 | 64.87 |

Human Evaluation

| Model | Quality | Consistency |
| DI-VAE | 2.31 | 3.08 |
| DGMVAE | 2.45 | 3.35 |

Table 3: Dialog evaluation results on SMD. Four automatic metrics are shown: BLEU, and average (Ave.), extrema (Ext.) and greedy (Grd.) word embedding based similarity. Response quality and consistency within the same $c$ are scored by humans.

We also evaluate the ability of interpretable dialog generation of DGMVAE on SMD. Both automatic evaluation and human evaluation are conducted. BLEU and three word-embedding-based topic similarity metrics (Serban et al., 2017), namely Embedding Average, Embedding Extrema and Embedding Greedy (Mitchell and Lapata, 2008; Forgues et al., 2014; Rus and Lintean, 2012), are used to evaluate the quality of responses. We use 300-dimensional GloVe (Pennington et al., 2014) word embeddings trained on 840B tokens. In addition, three human evaluators were asked to score the quality (from 0 to 3) of 159 responses generated by DI-VAE and DGMVAE. Because SMD does not offer human-annotated action labels for dialog utterances, we follow Zhao et al. (2018b) and have human experts label a dialog action for each discrete latent variable $c$, according to its sampled utterances. Another 3 annotators are asked to evaluate the consistency between the action name and another 5 sampled utterances, which reflects the interpretability.
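The three embedding-based similarities can be sketched as follows. This is a simplified illustration with a toy embedding table standing in for GloVe; the greedy score is symmetrized by averaging both matching directions, which is a common convention but an assumption here. All names are hypothetical.

```python
import numpy as np

def _cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def _extrema(vectors):
    """Per dimension, keep the value with the largest absolute magnitude."""
    idx = np.argmax(np.abs(vectors), axis=0)
    return vectors[idx, np.arange(vectors.shape[1])]

def embedding_scores(tokens_a, tokens_b, emb):
    """Return Embedding Average, Extrema and Greedy similarities between two token lists."""
    a = np.array([emb[t] for t in tokens_a], dtype=float)
    b = np.array([emb[t] for t in tokens_b], dtype=float)
    average = _cosine(a.mean(axis=0), b.mean(axis=0))
    extrema = _cosine(_extrema(a), _extrema(b))
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    sim = a_unit @ b_unit.T  # pairwise word cosine similarities
    greedy = 0.5 * (sim.max(axis=1).mean() + sim.max(axis=0).mean())
    return {"average": average, "extrema": extrema, "greedy": float(greedy)}

# Toy 3-dimensional embeddings (illustrative only).
toy_emb = {"what": [1.0, 0.0, 0.0], "weather": [0.0, 1.0, 0.0], "address": [0.0, 0.0, 1.0]}
scores = embedding_scores(["what", "weather"], ["what", "address"], toy_emb)
```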

Results are shown in Tab. 3. Both automatic and human evaluations show that DGMVAE obtains better generation quality and interpretability than DI-VAE on SMD. We perform one-tailed t-tests on the human evaluation scores and find that the superiority of our model is significant in both quality and consistency, with p-values no more than 0.05.

| Act | Utt |
| Inform-route/address | There is a Safeway 4 miles away. |
| | There are no hospitals within 2 miles. |
| | There is Jing Jing and PF Changs. |
| Request-weather | What is the weather today? |
| | What is the weather like in the city? |
| | What's the weather forecast in New York? |

Table 4: Example actions (Act) and corresponding utterances (Utt) discovered by DGMVAE on SMD. The action name is annotated by experts.

| Context | Predict |
| Sys: Taking you to Chevron. | (1-1-3, thanks) Thank you car, let's go there! |
| | (1-0-2, request-address) What is the address? |
| User: Make an appointment for the doctor. | (3-2-4, set-reminder) Setting a reminder for your doctor's appointment on the 12th at 3pm. |
| | (3-0-4, request-time) What time would you like to be schedule your doctor's appointment? |

Table 5: Dialog cases on SMD, which are generated by sampling different $c$ from the policy network. The labels of the sampled $c$ are listed in parentheses with the annotated action name.

We perform case studies to validate the performance of DGMVAE qualitatively. Some dialog actions with their utterances discovered by DGMVAE are shown in Tab. 4. It can be seen that utterances of the same action are assigned the same discrete latent variable $c$. We also give some dialog cases generated by DGMVAE in Tab. 5 with their contexts. Given the same context, responses with different actions are generated by sampling different values of the discrete latent variables, which shows that DGMVAE has the ability to generate diverse and interpretable responses. More cases can be found in the supplementary materials.

5 Conclusion

In practice, the mode-collapse problem frequently occurs in GMVAE. In this paper, we give a theoretical analysis of this problem. Given the theoretical insights, we propose the DGMVAE, which can effectively alleviate the mode collapse problem. Additionally, experimental results show that DGMVAE outperforms a line of related works, obtaining higher language generation performance and better interpretable results.


  • Bishop (2006) Christopher M Bishop. 2006. Pattern Recognition and Machine Learning.
  • Bowman et al. (2016) Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew Dai, Rafal Jozefowicz, and Samy Bengio. 2016. Generating sentences from a continuous space. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 10–21.
  • Chen et al. (2016) Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. 2016. Infogan: interpretable representation learning by information maximizing generative adversarial nets. Neural Information Processing Systems, pages 2180–2188.
  • Chung et al. (2014) Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. In NIPS 2014 Workshop on Deep Learning, December 2014.
  • Dilokthanakul et al. (2016) Nat Dilokthanakul, Pedro AM Mediano, Marta Garnelo, Matthew CH Lee, Hugh Salimbeni, Kai Arulkumaran, and Murray Shanahan. 2016. Deep unsupervised clustering with gaussian mixture variational autoencoders. arXiv preprint arXiv:1611.02648.
  • Eric et al. (2017) Mihail Eric, Lakshmi Krishnan, Francois Charette, and Christopher D Manning. 2017. Key-value retrieval networks for task-oriented dialogue. Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 37–49.
  • Forgues et al. (2014) Gabriel Forgues, Joelle Pineau, Jean-Marie Larchevêque, and Réal Tremblay. 2014. Bootstrapping dialog systems with word embeddings. In NIPS, Modern Machine Learning and Natural Language Processing Workshop, volume 2.
  • Gupta et al. (2018) Ankush Gupta, Arvind Agarwal, Prawaan Singh, and Piyush Rai. 2018. A deep generative framework for paraphrase generation. National Conference on Artificial Intelligence.
  • Higgins et al. (2017) Irina Higgins, Loic Matthey, Arka Pal, Christopher P Burgess, Xavier Glorot, Matthew M Botvinick, Shakir Mohamed, and Alexander Lerchner. 2017. beta-vae: Learning basic visual concepts with a constrained variational framework. International Conference on Learning Representations.
  • Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.
  • Hu et al. (2017) Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P Xing. 2017. Toward controlled generation of text. In International Conference on Machine Learning, pages 1587–1596.
  • Jang et al. (2016) Eric Jang, Shixiang Gu, and Ben Poole. 2016. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144.
  • Jiang et al. (2017) Zhuxi Jiang, Yin Zheng, Huachun Tan, Bangsheng Tang, and Hanning Zhou. 2017. Variational deep embedding: An unsupervised and generative approach to clustering. International Joint Conference on Artificial Intelligence, pages 1965–1972.
  • Kingma et al. (2014) Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. 2014. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pages 3581–3589.
  • Kingma and Welling (2013) Diederik P Kingma and Max Welling. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
  • Li et al. (2017a) Piji Li, Wai Lam, Lidong Bing, and Zihao Wang. 2017a. Deep recurrent generative decoder for abstractive text summarization. Empirical Methods in Natural Language Processing, pages 2091–2100.
  • Li et al. (2017b) Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. 2017b. Dailydialog: A manually labelled multi-turn dialogue dataset. International Joint Conference on Natural Language Processing, 1:986–995.
  • Marcus et al. (1993) Mitchell P Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of english: the penn treebank. Computational Linguistics, 19(2):313–330.
  • Merity et al. (2017) Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2017. Regularizing and optimizing lstm language models. arXiv preprint arXiv:1708.02182.
  • Miao et al. (2016) Yishu Miao, Lei Yu, and Phil Blunsom. 2016. Neural variational inference for text processing. International Conference on Machine Learning, pages 1727–1736.
  • Mikolov et al. (2010) Tomas Mikolov, Martin Karafiat, Lukas Burget, Jan Cernocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. Interspeech, pages 1045–1048.
  • Mitchell and Lapata (2008) Jeff Mitchell and Mirella Lapata. 2008. Vector-based models of semantic composition. proceedings of ACL-08: HLT, pages 236–244.
  • Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Weijing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. pages 311–318.
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
  • Rezende et al. (2014) Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. 2014. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082.
  • Rus and Lintean (2012) Vasile Rus and Mihai Lintean. 2012. A comparison of greedy and optimal assessment of natural language student input using word-to-word similarity metrics. In Proceedings of the Seventh Workshop on Building Educational Applications Using NLP, pages 157–162. Association for Computational Linguistics.
  • Semeniuta et al. (2017) Stanislau Semeniuta, Aliaksei Severyn, and Erhardt Barth. 2017. A hybrid convolutional variational autoencoder for text generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 627–637.
  • Serban et al. (2016) Iulian V Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau. 2016. Building end-to-end dialogue systems using generative hierarchical neural network models. In Thirtieth AAAI Conference on Artificial Intelligence.
  • Serban et al. (2017) Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron Courville, and Yoshua Bengio. 2017. A hierarchical latent variable encoder-decoder model for generating dialogues. In Thirty-First AAAI Conference on Artificial Intelligence.
  • Sordoni et al. (2015) Alessandro Sordoni, Yoshua Bengio, Hossein Vahabi, Christina Lioma, Jakob Grue Simonsen, and Jian-Yun Nie. 2015. A hierarchical recurrent encoder-decoder for generative context-aware query suggestion. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pages 553–562. ACM.
  • Vincent et al. (2010) Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre Antoine Manzagol. 2010. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11(12):3371–3408.
  • Wang et al. (2019) Wenlin Wang, Zhe Gan, Hongteng Xu, Ruiyi Zhang, Guoyin Wang, Dinghan Shen, Changyou Chen, and Lawrence Carin. 2019. Topic-guided variational autoencoders for text generation. arXiv preprint arXiv:1903.07137.
  • Wen et al. (2017) Tsunghsien Wen, Yishu Miao, Phil Blunsom, and Steve J Young. 2017. Latent intention dialogue models. International Conference on Machine Learning, pages 3732–3741.
  • Xing et al. (2003) Eric P Xing, Michael I Jordan, and Stuart Russell. 2003. A generalized mean field algorithm for variational inference in exponential families. In Proceedings of the Nineteenth conference on Uncertainty in Artificial Intelligence, pages 583–591. Morgan Kaufmann Publishers Inc.
  • Xu and Durrett (2018) Jiacheng Xu and Greg Durrett. 2018. Spherical latent spaces for stable variational autoencoders. Empirical Methods in Natural Language Processing.
  • Zhang et al. (2016) Biao Zhang, Deyi Xiong, Hong Duan, Min Zhang, et al. 2016. Variational neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 521–530.
  • Zhao et al. (2018a) Junbo Jake Zhao, Yoon Kim, Kelly Zhang, Alexander M Rush, and Yann Lecun. 2018a. Adversarially regularized autoencoders. International Conference on Machine Learning, pages 5897–5906.
  • Zhao et al. (2017a) Shengjia Zhao, Jiaming Song, and Stefano Ermon. 2017a. Infovae: Information maximizing variational autoencoders. arXiv: Learning.
  • Zhao et al. (2018b) Tiancheng Zhao, Kyusong Lee, and Maxine Eskenazi. 2018b. Unsupervised discrete sentence representation learning for interpretable neural dialog generation. In ACL.
  • Zhao et al. (2017b) Tiancheng Zhao, Ran Zhao, and Maxine Eskenazi. 2017b. Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 654–664.
  • Zhou and Neubig (2017) Chunting Zhou and Graham Neubig. 2017. Multi-space variational encoder-decoders for semi-supervised labeled sequence transduction. Meeting of the Association for Computational Linguistics, 1:310–320.

Appendix A Appendix

A.1 Derivation of Mutual Information Term and Final Objective

In this part, we first split the mutual information term out of the ELBO of GMVAE. We focus on the KL term over the discrete latent variable $c$, taken over the corpus,


Under the mean-field approximation, the expectation over $z$ can be removed by integrating it out.


in which $q(c) = \mathbb{E}_{x}[q(c \mid x)]$, and $I(c, x)$ is the mutual information between the discrete latent variable $c$ and the input data $x$.
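The decomposition described above can be checked numerically. The following is a minimal sketch, assuming the standard identity $\mathbb{E}_x[\mathrm{KL}(q(c \mid x) \,\|\, p(c))] = I(c, x) + \mathrm{KL}(q(c) \,\|\, p(c))$ with $q(c) = \mathbb{E}_x[q(c \mid x)]$. The toy corpus size, the number of components, the uniform prior, and all variable names are illustrative assumptions and do not come from the paper.

```python
import numpy as np

def kl_categorical(q, p):
    """KL(q || p) for categorical distributions with strictly positive entries."""
    return float(np.sum(q * (np.log(q) - np.log(p))))

rng = np.random.default_rng(0)
N, K = 6, 4  # toy corpus size and number of components (assumed values)

# Random soft posteriors q(c | x_n), one row per data point.
logits = rng.normal(size=(N, K))
q_c_given_x = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
p_c = np.full(K, 1.0 / K)  # uniform prior over components (assumed)

# Left-hand side: corpus average of KL(q(c|x) || p(c)).
lhs = np.mean([kl_categorical(q, p_c) for q in q_c_given_x])

# Aggregated posterior q(c) = E_x[q(c|x)] under a uniform empirical p(x).
q_c = q_c_given_x.mean(axis=0)

# Mutual information I(c, x) = E_x[KL(q(c|x) || q(c))].
mutual_info = np.mean([kl_categorical(q, q_c) for q in q_c_given_x])

# Right-hand side: mutual information plus KL of the aggregated posterior.
rhs = mutual_info + kl_categorical(q_c, p_c)
```

Here `lhs` and `rhs` agree up to floating-point error, which is the sense in which the mutual information term can be separated from the KL term over $c$.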

Replacing the KL term over $c$ by Eq. 6 and replacing the KL term over $z$ by the two terms we derived in Theorem 2, the ELBO of the dataset is written as


Adding the two regularization terms to the ELBO, the final objective is


which consists of a reconstruction term, a KL divergence over $z$, a KL divergence over $c$, and a variance-of-means term.

A.2 Proof of Theorem 1

We first simplify :


where is the number of Gaussian components and


It is because




Now we derive the partial derivative of with respect to :




Adding these components, we get:


At the same time,


Finally, we multiply and to get , because for :


Define and ,






As , . So when performing gradient ascent on , the gradient of will make smaller. As a result, () is limited by a decreasing bound.
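The aggregation effect that this proof argues for can be illustrated on a toy problem. The sketch below is not the proof and not the paper's training procedure. It assumes univariate Gaussians with a shared prior standard deviation, fixed uniform soft assignments $q(c \mid x)$ (mimicking an encoder that does not yet separate the components), and gradient ascent on the negated KL term over $z$ with respect to the prior means only. All values and names are hypothetical.

```python
import numpy as np

sigma = 1.0                                          # shared prior std (assumed)
posterior_means = np.array([-2.0, -1.0, 1.0, 2.0])   # means of q(z | x_n), toy values
prior_means = np.array([-3.0, -1.0, 1.0, 3.0])       # initial component means mu_c
K = prior_means.size

# Fixed, uniform soft assignments q(c | x_n); an assumption for illustration.
q_c_given_x = np.full((posterior_means.size, K), 1.0 / K)

def grad_neg_kl(mu):
    """Gradient w.r.t. mu_c of -(1/N) sum_n sum_c q(c|x_n) KL(q(z|x_n) || N(mu_c, sigma^2)).

    For Gaussians, d/d mu_c of -KL equals (m_n - mu_c) / sigma^2.
    """
    diff = posterior_means[:, None] - mu[None, :]    # shape (N, K)
    return (q_c_given_x * diff).mean(axis=0) / sigma**2

initial_spread = prior_means.var()
mu = prior_means.copy()
for _ in range(200):
    mu = mu + 0.5 * grad_neg_kl(mu)                  # plain gradient ascent step
final_spread = mu.var()
```

Under these assumptions every component mean is pulled toward the same weighted average of the posterior means, so `final_spread` shrinks toward zero. With sharper, well-separated assignments the pull would be weaker, which is why this should be read only as an illustration of the tendency.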

A.3 Proof of Theorem 2

We prove this property in the case of univariate Gaussian distribution priors. It is straightforward to generalize it to diagonal multivariate Gaussian distributions by summing over all dimensions.

The KL term over $z$ can be rewritten as


The prior $p(z \mid c)$ is defined as a Gaussian distribution with mean $\mu_c$ and standard deviation $\sigma$,


We define , which can be written as


According to Eq. 28, it can be split into two multiplied terms. The first term is the probability density function of a Gaussian distribution, denoted as



whose mean is the expectation of $\mu_c$ under the posterior distribution $q(c \mid x)$ and whose standard deviation is $\sigma$. The second term is actually the variance of the means of the Gaussian components under the distribution $q(c \mid x)$,


According to Eqs. 28-30, Eq. 26 consists of two terms:


The expectation over $z$ of the second term is removed by integration, since the variance term is independent of $z$. Eq. 31 therefore contains a negative variance term of the prior means.
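The two-term split can be verified numerically in the univariate case. The sketch below assumes a Gaussian posterior $q(z \mid x) = \mathcal{N}(m, s^2)$, priors $p(z \mid c) = \mathcal{N}(\mu_c, \sigma^2)$ with a shared $\sigma$, and checks the identity $\mathbb{E}_{q(c \mid x)}[\mathrm{KL}(q(z \mid x) \,\|\, p(z \mid c))] = \mathrm{KL}(q(z \mid x) \,\|\, \mathcal{N}(\bar{\mu}, \sigma^2)) + \mathrm{Var}_{q(c \mid x)}[\mu_c] / (2\sigma^2)$, where $\bar{\mu} = \mathbb{E}_{q(c \mid x)}[\mu_c]$. The symbols $m$, $s$, $\bar{\mu}$ and all numeric values are assumptions introduced for this example.

```python
import numpy as np

def kl_gauss(m, s, mu, sigma):
    """Closed-form KL(N(m, s^2) || N(mu, sigma^2)) for univariate Gaussians."""
    return np.log(sigma / s) + (s**2 + (m - mu) ** 2) / (2.0 * sigma**2) - 0.5

m, s = 0.3, 0.7                            # posterior q(z|x) parameters (toy values)
sigma = 1.2                                # shared prior std (assumed)
mu_c = np.array([-2.0, 0.5, 1.0, 3.0])     # prior component means
q_c = np.array([0.1, 0.4, 0.3, 0.2])       # posterior q(c|x) over components

# Left-hand side: expected KL to each component prior.
lhs = float(np.sum(q_c * kl_gauss(m, s, mu_c, sigma)))

# Right-hand side: KL to a single Gaussian at the averaged mean,
# plus the variance of the component means scaled by 1 / (2 sigma^2).
mean_mu = float(np.sum(q_c * mu_c))
var_mu = float(np.sum(q_c * (mu_c - mean_mu) ** 2))
rhs = float(kl_gauss(m, s, mean_mu, sigma) + var_mu / (2.0 * sigma**2))
```

Because the ELBO subtracts this expected KL, the variance of the prior means enters the ELBO with a negative sign, which matches the closing remark of the proof.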

A.4 Hyper-Parameters

The Adam optimizer is used with a learning rate of 0.001. The batch size is set to 30 and the word embedding size to 300. Sentences longer than 40 tokens are truncated. The vocabulary size is set to 10,000. All results are averaged over three runs of each experiment.
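For reference, these settings can be collected in a small configuration object. This is only a sketch: the class name, field names, and helper functions are hypothetical and are not taken from the authors' code; only the numeric values come from the paragraph above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    # Values as stated in the text; field names are illustrative.
    optimizer: str = "adam"
    learning_rate: float = 1e-3
    batch_size: int = 30
    embedding_size: int = 300
    max_sentence_length: int = 40
    vocab_size: int = 10_000
    num_runs: int = 3

def truncate(tokens, max_len):
    """Cut a token sequence off after max_len tokens."""
    return list(tokens[:max_len])

def average_runs(scores):
    """Average a metric over repeated runs of the same experiment."""
    return sum(scores) / len(scores)

config = TrainConfig()
example = truncate(["tok"] * 55, config.max_sentence_length)
```

A sentence of 55 tokens is cut to 40 here, and `average_runs` would be applied to the metric values from the three repeated runs.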

We also illustrate how will affect the model performance in Fig. 18; and we generally find that larger can get better results. As a result, we use for experiment in PTB and DD.

Figure 18: VM and PPL of DGMVAE with different , on the validation set of PTB. Results of DGMVAE and DGMVAE are displayed by the blue lines and orange lines, respectively.

A.5 More Cases

More examples of actions discovered by DGMVAE are shown in Tab. 6. More examples of responses generated by DGMVAE are shown in Tab. 7. In Tab. 7, an example without context is given to show the ability to begin a dialog on different topics (weather, navigation and scheduling). For DGMVAE, we can sample different continuous latent variables from one component. As shown in Tab. 8, this generates diverse responses with the same action.

| Action Name | Utterances |
|---|---|
| Request-location | Which location do you want the weather for? |
| | Which location should I look up information about? |
| | Which city are you asking about? |
| Inform-time/appointment | Your next dinner event is with your father on Friday. |
| | Your father will be attending your yoga activity on the 2nd with you. |
| | Your doctor’s appointment is Monday at 1 pm. |
| General-thanks | Thanks. |
| | Thanks a lot. |
| | Perfect. Thanks. |

Table 6: Example actions discovered by DGMVAE on SMD. The action names are annotated by experts.

| Context | Predict |
|---|---|
| User: What’s the temperature going to be this week? Sys: What city are you wanting to know the temperature for? | (1-1-0, inform-address) Cleveland. |
| | (4-0-2, request-weather) Will it rain in Redwood City today? |
| None | (1-4-0, request-route/address) find me a nearby coffee shop |
| | (1-0-4, request-weather) what’s the weather going to be like today and tomorrow |
| | (1-2-4, command-reminder) remind me about meeting later |

Table 7: Dialog cases on SMD, generated by sampling different $c$ from the policy network. The labels of the sampled $c$ are listed in parentheses with the annotated action names. When the context is None, the model predicts the beginning of a dialog.

| Context | Action Name | Predict |
|---|---|---|
| User: What is the highest temperature in Brentwood over the next two days? | (2-3-0) inform-weather | It is currently foggy in Brentwood on Tuesday. |
| | | It will be between 70 - 40F and turn - 40F on Saturday. |
| User: I need gas. | (2-4-2) inform-route/address | There is a Chevron 3 miles from you. |
| | | There is a Safeway. |
| User: schedule meeting | (3-0-4) request-time | What day and time should I set your meeting for? |
| | | What time should I set the alarm? |

Table 8: Dialog cases on SMD, generated by sampling different $z$ from the same actions (i.e., the same $c$).
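Sampling different continuous latent variables from one component, as described in this section, can be sketched as follows. This is a minimal illustration that assumes diagonal Gaussian components with a shared standard deviation. The number of components, the latent dimension, the randomly initialized means, and the function name are all hypothetical, and the decoder that would turn each sample into a response is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
K, D = 5, 8                                        # components and latent size (assumed)
prior_means = rng.normal(scale=3.0, size=(K, D))   # stand-in for learned means mu_c
sigma = 1.0                                        # shared std (assumed)

def sample_from_component(c, num_samples):
    """Draw z ~ N(mu_c, sigma^2 I) for a fixed discrete choice c."""
    noise = rng.normal(size=(num_samples, D))
    return prior_means[c] + sigma * noise

# Same action (component), several different continuous codes.
z_samples = sample_from_component(c=2, num_samples=3)
```

Each row of `z_samples` shares the same component mean but differs in its noise, which is the mechanism behind producing varied responses for one action.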