Conditional Response Generation Using Variational Alignment

11/10/2019 ∙ by Kashif Khan, et al. ∙ University of Waterloo

Generating relevant/conditioned responses in dialog is challenging, and requires not only proper modelling of context in the conversation, but also the ability to generate fluent sentences during inference. In this paper, we propose a two-step framework based on generative adversarial nets for generating conditioned responses. Our model first learns meaningful representations of sentences, and then uses a generator to match the query with the response distribution. Latent codes from the latter are then used to generate responses. Both quantitative and qualitative evaluations show that our model generates more fluent, relevant and diverse responses than the existing state-of-the-art methods.







1 Introduction

Dialog generation refers to the task of generating responses for a given utterance. It is a challenging problem in generation because it not only requires us to model the context in the conversation, but also to exploit it to generate a relevant and fluent response. Therefore, dialog generation can be divided into two parts: 1) encoding the context of conversation, and 2) generating a response conditioned on the given context. A generated response is considered to be “good” if it is meaningful, fluent, and most importantly, related/relevant to the given context.

The encoder-decoder based sequence-to-sequence models (Seq2Seq) Sutskever et al. (2014), coupled with effective attention mechanisms Bahdanau et al. (2014); Luong et al. (2015), have served as the de facto frameworks for such text generation tasks. Inspired by the success of deep generative models such as variational autoencoders (VAEs) Kingma and Welling (2013) and generative adversarial nets (GANs) Goodfellow et al. (2014) in vision, there have been multiple attempts to adapt them to the text domain Bowman et al. (2015); Yu et al. (2017) and leverage their outstanding generation quality. However, the autoregressive nature of the decoder and the innate discreteness of the text domain, respectively, give rise to numerous optimization challenges. For instance, Bowman et al. (2015) discuss KL-cost annealing and history-less decoding for stable VAE training; Yu et al. (2017) introduce a reinforcement-learning based algorithm to address the non-differentiability induced by sampling from the generator, but it results in an unstable training procedure.

A common choice of loss function in existing methods is the cross-entropy loss, which learns a widespread distribution via maximum likelihood estimation. It thus tends to produce generic responses, analogous to the mode-averaging problem for continuous variables. To mitigate generic response generation, we use an adversarial loss, which exhibits mode-collapsing rather than mode-averaging behaviour. Furthermore, we use the mean squared error (MSE) as an auxiliary loss that helps the network select a few meaningful modes, resulting in more relevant responses. However, training an adversarial network directly on words is difficult; hence, we train the network on their latent codes. More precisely, we employ a two-step training procedure, where: 1) we train a variational autoencoder to learn meaningful representations of input sentences, and 2) we use a generator to transform a query's (more generally, the context's) latent code into that of the response, which, in turn, is fed to the pretrained language model's decoder for actual response generation. Splitting the task into two easier steps and enforcing adversarial training only on the latent codes reaps multiple optimization benefits, in addition to achieving the primary task of better conditioned response generation. Compared to DialogWAE Gu et al. (2018b), the current state of the art for dialog generation, we achieve better results in terms of BLEU scores, diversity, and fluency.

We evaluate our model on a deduplicated version Bahuleyan et al. (2017) of the benchmark DailyDialog dataset. Our model outperforms the prior methods on all automatic evaluation metrics. Results indicate that responses generated by our model are more relevant to the input query (context, in general) and are simultaneously more diverse and fluent.

The rest of the paper is organized as follows: Section 2 touches upon the relevant literature; Section 3 describes the proposed methodology in detail; Section 4 presents experimental results and analysis; and, finally, we conclude in Section 5.

(a) Step 1: Variational autoencoder network
(b) Step 2: Adversarial network (dashed box)
Figure 1: Two-step training procedure. (a) We first train an autoencoder which takes an utterance $x$ as input, obtains its latent code $z$ from the encoder, and then feeds it through the decoder to get the reconstructed utterance $\hat{x}$. (b) We use the pretrained encoder from Step 1 to get the latent variables $z_q$ and $z_r$ of the query ($q$) and response ($r$) utterances, respectively. The query latent variable $z_q$ is then fed to the generator (G), which maps it to a predicted response latent variable $\tilde{z}_r$. When training the generator, we aim to match $\tilde{z}_r$ and $z_r$ through the generator loss combined with a mean-squared error loss. When training the discriminator, we pass $z_r \oplus z_q$ and $\tilde{z}_r \oplus z_q$ (obtained by concatenating $z_r$ with $z_q$ and $\tilde{z}_r$ with $z_q$, respectively) through a classification layer that tries to guess the source of its input. Note: $\oplus$ denotes concatenation.

2 Related Work

RNN-based encoder-decoder approaches have been very popular for dialog models. To encourage diversity and move away from safe responses, several variants have been proposed. Some variants make use of additional information such as the topic Xing et al. (2016). Others use a more complex architecture, like HRED Serban et al. (2015), which uses a hierarchical encoder-decoder architecture to encourage more complex and diverse responses.

Variational autoencoder (VAE) Kingma and Welling (2013) based models are used in dialog systems to encourage complex and diverse responses by capturing their variability. The conditional variational autoencoder (CVAE) Zhao et al. (2017) encourages diverse responses by using the variational inference framework to learn the posterior conditioned on the context latent variable. However, the "posterior collapse" issue remains a challenge for VAE-based models. Several variants have been proposed that tackle this and allow for more complex representations, such as VHRED Serban et al. (2016), which makes use of the hierarchical encoder-decoder framework and augments it with a latent variable at the decoder, trained by maximizing a variational lower bound on the log-likelihood. Park et al. (2018) propose Variational Hierarchical Conversation RNNs (VHCR), which impose a hierarchical structure on the learned latent variables. CVAE-CO, a variant of the CVAE proposed by Shen et al. (2018), makes use of autoencoder training that collaborates with the CVAE to generate better responses.

Generative adversarial networks (GANs), proposed by Goodfellow et al. (2014), have been used with great success in generating images. Text generation with GANs has not been as successful due to the discrete nature of text Xu et al. (2017). Li et al. (2017) proposed combining reinforcement learning with GANs, where the outputs of the discriminator are used as rewards for the generator. However, Shen et al. (2017) argue that training with REINFORCE is unstable due to the high variance of the sampled gradients. Makhzani et al. (2015) proposed adversarial autoencoders, which apply adversarial regularization on the latent space to match the learned aggregated posterior to an arbitrary prior. DialogWAE, proposed by Gu et al. (2018b), combined a conditional Wasserstein autoencoder (WAE) with adversarial learning for dialog response generation.

3 Approach

Figure 1 provides an overview of our proposed two-step approach. We first provide background on the relevant components of our approach, the VAE and the GAN, then present the problem formulation and a stepwise description of our method.

3.1 Variational Autoencoder (VAE)

The variational autoencoder (VAE) Kingma and Welling (2013) was adapted by Bowman et al. (2015) to generate sentences from a continuous space. The VAE imposes a probabilistic distribution on the latent vector and uses KL-divergence to match the posterior with a standard prior distribution. The decoder then reconstructs data from a latent vector sampled from this posterior distribution.

The autoencoding loss, in this case, is given by:

$$\mathcal{L}_{AE}(x) = \mathbb{E}_{q(z|x)}\big[-\log p(x|z)\big] + \lambda_{KL}\,\mathrm{KL}\big(q(z|x)\,\|\,p(z)\big)$$

$p(z)$ is the prior distribution, usually set to the standard normal distribution $\mathcal{N}(0, I)$. $\lambda_{KL}$ is a tunable hyperparameter representing the KL term's weight, and $q(z|x)$ represents a posterior distribution of the form $\mathcal{N}(\mu, \mathrm{diag}\,\sigma^2)$, where $\mu$ and $\sigma$ are learnt by the encoder.
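For a diagonal Gaussian posterior and a standard normal prior, the KL term above has a well-known closed form. A minimal, illustrative sketch in pure Python (the function name is ours):

```python
import math

def gaussian_kl(mu, log_var):
    """Closed-form KL( N(mu, diag sigma^2) || N(0, I) ), summed over
    latent dimensions: 0.5 * sum(mu^2 + sigma^2 - 1 - log sigma^2)."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))

# A posterior that matches the prior exactly incurs zero KL penalty.
print(gaussian_kl([0.0, 0.0], [0.0, 0.0]))  # 0.0
```

This is the quantity the encoder is penalized with during training; the reconstruction term is the usual token-level cross-entropy from the decoder.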

3.2 Generative Adversarial Network (GAN)

The generative adversarial network was introduced by Goodfellow et al. (2014). It has two components: a generator $G$ and a discriminator $D$. The generator tries to produce fake samples in order to maximize the probability of the discriminator making a mistake. The optimization criterion corresponding to this minimax game is:

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{data}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]$$

In a traditional GAN, $p_{data}$ represents the data distribution to be learnt by the generator network $G$, which uses noise $z \sim p_z(z)$ as input and learns an internal distribution $p_G$ to mimic $p_{data}$ as closely as possible. As shown in Goodfellow et al. (2014), at the optimal discriminator the above criterion can be rewritten as:

$$\min_G \; 2\,\mathrm{JSD}(p_{data} \,\|\, p_G) - 2\log 2$$

where $\mathrm{JSD}$ is the Jensen–Shannon divergence between the two distributions $p_{data}$ and $p_G$.
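The equivalence with the Jensen–Shannon divergence can be verified numerically: plugging the optimal discriminator $D^*(x) = p_{data}(x)/(p_{data}(x)+p_G(x))$ into the minimax criterion yields exactly $2\,\mathrm{JSD} - 2\log 2$. A small self-contained check on toy discrete distributions (our own illustration, not from the paper):

```python
import math

def game_value(p, q):
    """Value of the GAN criterion at the optimal discriminator
    D*(x) = p(x) / (p(x) + q(x)), for discrete p (data) and q (generator)."""
    return (sum(pi * math.log(pi / (pi + qi)) for pi, qi in zip(p, q)) +
            sum(qi * math.log(qi / (pi + qi)) for pi, qi in zip(p, q)))

def jsd(p, q):
    """Jensen-Shannon divergence between discrete distributions p and q."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    kl = lambda a, b: sum(ai * math.log(ai / bi) for ai, bi in zip(a, b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p, q = [0.7, 0.2, 0.1], [0.1, 0.3, 0.6]  # toy distributions
assert abs(game_value(p, q) - (2 * jsd(p, q) - 2 * math.log(2))) < 1e-12
```

The identity holds for any pair of distributions, which is why minimizing the criterion over $G$ drives $p_G$ toward $p_{data}$.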

3.3 Problem Formulation

First, we define all the variables that we use. We denote the set of all queries by $Q$ and the set of all responses by $R$. The set of all sentences, comprising queries and responses, is denoted by $S = Q \cup R$. A sentence containing $n$ tokens is denoted by $x = (w_1, \dots, w_n)$.

The main task can be described as follows: given a query $q \in Q$, generate the corresponding response $r \in R$.

Formally, generating a response for a given query can be viewed as learning the conditional distribution $p(Z_r \mid Z_q)$, where $Z_q$ and $Z_r$ denote the random variables corresponding to the query and response distributions, respectively.

We employ a two-step procedure to learn the conditional distribution. The details of the two steps are as follows:

3.4 Step 1: Training an Autoencoder

First, we train an autoencoder to learn a meaningful encoding of a given sentence, irrespective of whether it is a query or a response:

$$\mathcal{L}_{AE}(x) = \mathbb{E}_{q(z|x)}\big[-\log p(x|z)\big] + \lambda_{KL}\,\mathrm{KL}\big(q(z|x)\,\|\,p(z)\big)$$

where $z$ is the latent representation of a sentence $x$ and $\lambda_{KL}$ is the weight of the KL term.
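A minimal PyTorch sketch of the Step-1 encoder producing the posterior parameters and a reparameterized sample (the class name is ours, and the layer sizes match Section 4.2 but other details are illustrative, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class VariationalEncoder(nn.Module):
    """Bidirectional LSTM encoder producing mu and log-variance of q(z|x),
    plus a reparameterized sample z."""
    def __init__(self, vocab_size, emb_dim=300, hidden=512, latent=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.to_mu = nn.Linear(2 * hidden, latent)
        self.to_logvar = nn.Linear(2 * hidden, latent)

    def forward(self, tokens):
        _, (h, _) = self.rnn(self.emb(tokens))
        h = torch.cat([h[0], h[1]], dim=-1)   # concat final states of both directions
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterize
        return z, mu, logvar

enc = VariationalEncoder(vocab_size=1000)
z, mu, logvar = enc(torch.randint(0, 1000, (4, 12)))  # batch of 4 sentences, length 12
print(z.shape)  # torch.Size([4, 128])
```

The decoder (a unidirectional LSTM, omitted here) would consume $z$ and be trained with the reconstruction plus weighted KL loss above.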

3.5 Step 2: Learning the Conditional Distribution

Once we have finished training the autoencoder, we proceed to our second step. In the second step, we train a conditional generative adversarial network to learn the conditional response distribution.

In addition to the objective proposed for the conditional GAN by Mirza and Osindero (2014), we apply an MSE (mean squared error) loss on the generator to promote training stability and faster convergence. The training objective for our second step is therefore:

$$\mathcal{L}_{GAN} = \mathbb{E}\big[\log D(z_r \oplus z_q)\big] + \mathbb{E}\big[\log\big(1 - D(\tilde{z}_r \oplus z_q)\big)\big]$$

$$\mathcal{L}_{MSE} = \|\tilde{z}_r - z_r\|_2^2, \qquad \tilde{z}_r = G(z_q)$$

$$\min_G \max_D \; \mathcal{L}_{GAN} + \lambda\,\mathcal{L}_{MSE}$$

Here $\tilde{z}_r = G(z_q)$ follows the learned distribution $p_G$, and $\lambda$ is a tunable hyperparameter that moderates the effect of the MSE loss.
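One way to sketch a single Step-2 update in PyTorch. The latent codes here are random stand-ins for outputs of the frozen Step-1 encoder, and the exact network shapes and loss bookkeeping are our illustrative choices, not the paper's verified configuration:

```python
import torch
import torch.nn as nn

latent = 128
G = nn.Sequential(nn.Linear(latent, 256), nn.LeakyReLU(0.2), nn.Linear(256, latent))
# D sees a (latent, query-latent) concatenation, as in Figure 1.
D = nn.Sequential(nn.Linear(2 * latent, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))
bce = nn.BCEWithLogitsLoss()
lam = 1.0  # weight of the auxiliary MSE term (tunable)

z_q = torch.randn(8, latent)   # query latent codes from the frozen encoder
z_r = torch.randn(8, latent)   # gold response latent codes

# Discriminator step: real = (z_r ⊕ z_q), fake = (G(z_q) ⊕ z_q)
z_fake = G(z_q).detach()
d_loss = bce(D(torch.cat([z_r, z_q], -1)), torch.ones(8, 1)) + \
         bce(D(torch.cat([z_fake, z_q], -1)), torch.zeros(8, 1))

# Generator step: fool D, plus MSE pulling G(z_q) toward z_r
z_fake = G(z_q)
g_loss = bce(D(torch.cat([z_fake, z_q], -1)), torch.ones(8, 1)) + \
         lam * ((z_fake - z_r) ** 2).mean()
```

In practice each loss would be followed by `backward()` and an optimizer step for the corresponding network; only `G`'s output is fed to the pretrained decoder at inference time.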

3.6 Multi-Turn Setting

We train and evaluate our proposed approach in two different settings: single-turn and multi-turn. In the single-turn setting, we form query-response pairs by extracting every possible pair of consecutive utterances belonging to the same dialog. In the multi-turn setting, for every response utterance we use all the available preceding utterances in the same conversation as the context. We encode this multi-utterance context using the pretrained encoder and use an RNN-based generator to exploit the full multi-utterance context.
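The RNN-based generator for the multi-turn setting can be sketched as a GRU run over the sequence of per-utterance latent codes (a hypothetical minimal version; the paper does not specify the exact recurrent cell or sizes):

```python
import torch
import torch.nn as nn

class ContextGenerator(nn.Module):
    """GRU over per-utterance latent codes; the final hidden state is
    mapped to the predicted response latent code."""
    def __init__(self, latent=128, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(latent, hidden, batch_first=True)
        self.out = nn.Linear(hidden, latent)

    def forward(self, context_codes):      # (batch, n_utterances, latent)
        _, h = self.rnn(context_codes)
        return self.out(h[-1])             # predicted response latent code

gen = ContextGenerator()
z_r_hat = gen(torch.randn(4, 5, 128))      # 5 preceding utterances per dialog
print(z_r_hat.shape)  # torch.Size([4, 128])
```

In the single-turn setting this reduces to a feedforward generator over a single query latent code.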

Model          | BLEU: P / R / F       | Diversity: Intra-1 / Intra-2 / Inter-1 / Inter-2 | Fluency: ASL / PPL
Seq2Seq        | 0.143 / 0.217 / 0.172 | 0.99 / 0.99 / 0.46 / 0.49 | 4.63 / 18.45
WED-S          | 0.215 / 0.357 / 0.268 | 0.94 / 0.99 / 0.48 / 0.74 | 10.42 / 33.91
DialogWAE      | 0.296 / 0.356 / 0.323 | 0.85 / 0.97 / 0.42 / 0.74 | 19.34 / 20.00
VAE-AM (ours)  | 0.320 / 0.378 / 0.347 | 0.91 / 0.99 / 0.51 / 0.86 | 16.11 / 18.41
VAE-M (ours)   | 0.259 / 0.304 / 0.280 | 0.93 / 0.99 / 0.05 / 0.36 | 13.39 / 73.51
VAE-A (ours)   | 0.259 / 0.303 / 0.280 | 0.80 / 0.84 / 0.23 / 0.40 | 9.04 / 323.70
Table 1: DailyDialog dataset, single-turn results. Suffix A denotes the adversarial loss; suffix M denotes the MSE loss.
Model          | BLEU: P / R / F       | Diversity: Intra-1 / Intra-2 / Inter-1 / Inter-2 | Fluency: ASL / PPL
HRED*          | 0.232 / 0.232 / 0.232 | 0.94 / 0.97 / 0.09 / 0.09 | 10.1 / -
CVAE*          | 0.222 / 0.265 / 0.242 | 0.94 / 0.97 / 0.09 / 0.09 | 10.0 / -
CVAE-CO*       | 0.244 / 0.259 / 0.251 | 0.94 / 0.97 / 0.09 / 0.09 | 11.2 / -
VHCR*          | 0.266 / 0.289 / 0.277 | 0.85 / 0.97 / 0.42 / 0.74 | 16.9 / -
DialogWAE      | 0.282 / 0.369 / 0.320 | 0.77 / 0.91 / 0.34 / 0.66 | 21.97 / 209.59
VAE-AM (ours)  | 0.324 / 0.389 / 0.353 | 0.93 / 0.95 / 0.48 / 0.94 | 15.4 / 122.09
Table 2: DailyDialog dataset, multi-turn results. * denotes results taken as-is from Gu et al. (2018a); these numbers correspond to evaluation on the non-deduplicated dataset, and the PPL metric was not reported.

4 Experiments

We perform experiments on the DailyDialog dataset Li et al. (2017), a manually labelled multi-turn dialog dataset. We use the original split after removing duplicates, following Bahuleyan et al. (2018).

4.1 Baselines

We use the following baseline models:

Seq2Seq: A standard Seq2Seq model using a bidirectional LSTM with an attention mechanism.

HRED: A generalized Seq2Seq model that uses a hierarchical RNN encoder Serban et al. (2015).

CVAE: A conditional VAE model with KL annealing Zhao et al. (2017).

CVAE-CO: A collaborative conditional VAE model Shen et al. (2018).

WED-S: The stochastic Wasserstein encoder-decoder Bahuleyan et al. (2018) with the default hyperparameters, trained on the deduplicated dataset.

DialogWAE: The model proposed by Gu et al. (2018b). It also trains a GAN on the latent space, but uses a WAE instead of a VAE to encode sentences, and Gumbel-softmax to generate diverse/multi-modal responses.

4.2 Parameter Settings and Training

We use a bidirectional LSTM Hochreiter and Schmidhuber (1997) for the encoder of the VAE and a unidirectional LSTM for the decoder. Both use a hidden size of 512. We use an embedding dimension of 300, and the dimension of our latent vectors is 128.

We also adopt the techniques of KL annealing and word dropout from Bowman et al. (2015). We use a dropout probability of 0.5 and a sigmoid annealing schedule to anneal the KL weight up to 0.15 over 4500 iterations.
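The sigmoid annealing schedule can be sketched as follows. The paper fixes only the 0.15 ceiling and the 4500-iteration horizon; the midpoint and slope below are our hypothetical choices:

```python
import math

def kl_weight(step, max_weight=0.15, total=4500, slope=0.0025):
    """Sigmoid schedule annealing the KL weight from ~0 toward max_weight
    over `total` iterations; midpoint (total/2) and slope are illustrative."""
    return max_weight / (1.0 + math.exp(-slope * (step - total / 2)))

print(round(kl_weight(0), 4), round(kl_weight(4500), 4))
```

Starting near zero lets the model first learn to reconstruct; the KL pressure then ramps up smoothly, which is the standard remedy for posterior collapse.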

For the GAN, we adopt a standard feedforward network architecture with a hidden layer of 256 units, along with batch normalization Ioffe and Szegedy (2015) and LeakyReLU activations Maas et al. (2013). We follow standard GAN tricks from Goodfellow (2016) and train our GAN until convergence.

4.3 Metrics

We measure the performance of the model based on three criteria: relevance of response, diversity of responses, and the fluency of responses.


Relevance We measure the relevance of the generated responses by measuring n-gram overlaps with the gold response. We use the BLEU scores proposed by Papineni et al. (2002) and adopt the smoothing techniques proposed by Chen and Cherry (2014); specifically, we use BLEU-3 with Smoothing 7.

For each query, we sample 10 responses and compute Precision (the average BLEU score), Recall (the maximum BLEU score), and F-score (the harmonic mean of precision and recall).
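The precision/recall/F aggregation over the 10 sampled responses can be sketched as below. The `bleu` function here is a stand-in unigram-overlap scorer for self-containment, not the smoothed BLEU-3 actually used (in practice one would substitute, e.g., NLTK's `sentence_bleu` with `SmoothingFunction().method7`):

```python
def bleu(hyp, ref):
    """Stand-in sentence-level score (unigram overlap with the reference);
    replace with smoothed BLEU-3 in a real evaluation."""
    hyp, ref = hyp.split(), ref.split()
    return sum(1 for w in hyp if w in ref) / max(len(hyp), 1)

def prf_bleu(samples, gold):
    """Precision = average score, Recall = maximum score,
    F = harmonic mean of the two, over the sampled responses."""
    scores = [bleu(s, gold) for s in samples]
    p = sum(scores) / len(scores)
    r = max(scores)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

p, r, f = prf_bleu(["i am fine", "fine thanks", "hello there"], "i am fine thanks")
```

Averaging rewards consistently relevant samples (precision), while taking the maximum rewards having at least one strong hit among the samples (recall).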

Diversity For each query we sample 10 responses and compute two types of diversity measures, Intra distinct-1 and Intra distinct-2, which measure the proportion of distinct unigrams and bigrams respectively in each response.

Inter distinct-1 and Inter distinct-2 measure the proportion of distinct unigrams and bigrams respectively across all 10 responses for a query.

We also report the average sentence length (ASL) of the responses. For comparison, the average length of the ground-truth responses is 14.43.
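The distinct-n measures above can be written in a few lines (a plain illustration of the definitions; function names are ours):

```python
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def intra_distinct(response, n):
    """Proportion of distinct n-grams within a single response."""
    grams = ngrams(response.split(), n)
    return len(set(grams)) / len(grams) if grams else 0.0

def inter_distinct(responses, n):
    """Proportion of distinct n-grams pooled across all sampled responses."""
    grams = [g for r in responses for g in ngrams(r.split(), n)]
    return len(set(grams)) / len(grams) if grams else 0.0

print(intra_distinct("no no no", 1))                # 1 distinct of 3 unigrams
print(inter_distinct(["yes sure", "yes sure"], 2))  # 1 distinct of 2 bigrams
```

Low intra scores expose word repetition within a response; low inter scores expose a model that keeps producing the same response for a query.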

Perplexity We measure the perplexity (PPL) of our generated responses using a Kneser–Ney trigram language model Kneser and Ney (1995).

4.4 Results and Analysis

We present our single-turn results in Table 1 and our multi-turn results in Table 2.

Relevance: Our model is the best performing model in terms of relevance, achieving the best precision and recall BLEU scores. The DialogWAE model also produces good BLEU scores, while the Seq2Seq model is the worst performing in this respect.

Diversity: The intra-diversity scores are similar for most models. They indicate the diversity of words within a sentence; a low score indicates poor training artifacts such as word repetition. The inter-diversity scores are more important, however, and our model performs best across both inter-diversity metrics. We also measure the average sentence length of the responses. As expected, the responses generated by the Seq2Seq model are very short; the DialogWAE model generates longer responses on average, but our model is closer in length to the ground truth (14.43).

Fluency: The PPL scores measure fluency and how similar our responses are to those found in the dataset. Our model achieves the best PPL scores, with DialogWAE close behind. The good score of the Seq2Seq model is likely due to its very short and generic responses.

Our model thus performs well across all criteria, with a particularly significant improvement in the diversity of responses for a given query (Inter-1 and Inter-2). Our ablation results show that combining the adversarial and MSE losses leads to significant improvement across almost all metrics, especially overlap, response diversity, and fluency. We also observed that the MSE term leads to quicker and more stable convergence of the GAN (within 6 epochs), making training easier.

5 Conclusion

We propose an efficient two-stage model for conditional text generation. We make use of the semantically aware sentence representations learned by a variational autoencoder and train a conditional generative adversarial network on the VAE latent space to generate diverse responses conditioned on the query. Our model outperforms existing state-of-the-art VAE-based approaches, generating more diverse, fluent, and relevant responses.


  • D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §1.
  • H. Bahuleyan, L. Mou, O. Vechtomova, and P. Poupart (2017) Variational attention for sequence-to-sequence models. arXiv preprint arXiv:1712.08207. Cited by: §1.
  • H. Bahuleyan, L. Mou, H. Zhou, and O. Vechtomova (2018) Stochastic wasserstein autoencoder for probabilistic sentence generation. arXiv preprint arXiv:1806.08462. Cited by: §4.1.
  • S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Jozefowicz, and S. Bengio (2015) Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349. Cited by: §1, §4.2.
  • B. Chen and C. Cherry (2014) A systematic comparison of smoothing techniques for sentence-level BLEU. In Proceedings of the Ninth Workshop on Statistical Machine Translation, Baltimore, Maryland, USA, pp. 362–367. External Links: Link, Document Cited by: §4.3.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §1, §3.2.
  • I. Goodfellow (2016) NIPS 2016 tutorial: generative adversarial networks. External Links: 1701.00160 Cited by: §4.2.
  • X. Gu, K. Cho, J. Ha, and S. Kim (2018a) DialogWAE: multimodal response generation with conditional wasserstein auto-encoder. arXiv preprint arXiv:1805.12352. Cited by: Table 2.
  • X. Gu, K. Cho, J. Ha, and S. Kim (2018b) DialogWAE: multimodal response generation with conditional wasserstein auto-encoder. CoRR abs/1805.12352. External Links: Link, 1805.12352 Cited by: §2.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §4.2.
  • S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. External Links: 1502.03167 Cited by: §4.2.
  • D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §1, §2.
  • R. Kneser and H. Ney (1995) Improved backing-off for m-gram language modeling. In ICASSP, pp. 181–184. External Links: Document Cited by: §4.3.
  • J. Li, W. Monroe, T. Shi, A. Ritter, and D. Jurafsky (2017) Adversarial learning for neural dialogue generation. CoRR abs/1701.06547. External Links: Link, 1701.06547 Cited by: §2.
  • Y. Li, H. Su, X. Shen, W. Li, Z. Cao, and S. Niu (2017) DailyDialog: a manually labelled multi-turn dialogue dataset. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Taipei, Taiwan, pp. 986–995. External Links: Link Cited by: §4.
  • M. Luong, H. Pham, and C. D. Manning (2015) Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025. Cited by: §1.
  • A. L. Maas, A. Y. Hannun, and A. Y. Ng (2013) Rectifier nonlinearities improve neural network acoustic models. In ICML Workshop on Deep Learning for Audio, Speech and Language Processing. Cited by: §4.2.
  • A. Makhzani, J. Shlens, N. Jaitly, and I. J. Goodfellow (2015) Adversarial autoencoders. CoRR abs/1511.05644. External Links: Link, 1511.05644 Cited by: §2.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL ’02, Stroudsburg, PA, USA, pp. 311–318. External Links: Link, Document Cited by: §4.3.
  • Y. Park, J. Cho, and G. Kim (2018) A hierarchical latent structure for variational conversation modeling. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 1792–1801. External Links: Link, Document Cited by: §2.
  • I. V. Serban, A. Sordoni, Y. Bengio, A. Courville, and J. Pineau (2015) Building end-to-end dialogue systems using generative hierarchical neural network models. External Links: 1507.04808 Cited by: §2, §4.1.
  • I. V. Serban, A. Sordoni, R. Lowe, L. Charlin, J. Pineau, A. Courville, and Y. Bengio (2016) A hierarchical latent variable encoder-decoder model for generating dialogues. External Links: 1605.06069 Cited by: §2.
  • T. Shen, T. Lei, R. Barzilay, and T. S. Jaakkola (2017) Style transfer from non-parallel text by cross-alignment. CoRR abs/1705.09655. External Links: Link, 1705.09655 Cited by: §2.
  • X. Shen, H. Su, S. Niu, and V. Demberg (2018) Improving variational encoder-decoders in dialogue generation. External Links: 1802.02032 Cited by: §2, §4.1.
  • I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104–3112. Cited by: §1.
  • C. Xing, W. Wu, Y. Wu, J. Liu, Y. Huang, M. Zhou, and W. Ma (2016) Topic aware neural response generation. External Links: 1606.08340 Cited by: §2.
  • Z. Xu, B. Liu, B. Wang, C. Sun, X. Wang, Z. Wang, and C. Qi (2017) Neural response generation via GAN with an approximate embedding layer. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 617–626. External Links: Link, Document Cited by: §2.
  • L. Yu, W. Zhang, J. Wang, and Y. Yu (2017) SeqGAN: sequence generative adversarial nets with policy gradient. In Thirty-First AAAI Conference on Artificial Intelligence. Cited by: §1.
  • T. Zhao, R. Zhao, and M. Eskenazi (2017) Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). External Links: Link, Document Cited by: §2, §4.1.