Dialog generation refers to the task of generating responses for a given utterance. It is a challenging problem in generation because it not only requires us to model the context in the conversation, but also to exploit it to generate a relevant and fluent response. Therefore, dialog generation can be divided into two parts: 1) encoding the context of conversation, and 2) generating a response conditioned on the given context. A generated response is considered to be “good” if it is meaningful, fluent, and most importantly, related/relevant to the given context.
The encoder-decoder based sequence-to-sequence models (Seq2Seq) Sutskever et al. (2014), coupled with effective attention mechanisms Bahdanau et al. (2014); Luong et al. (2015), have served as the de facto choice for dialog generation. Motivated by the success of deep generative models such as variational autoencoders (VAEs) Kingma and Welling (2013) and generative adversarial nets (GANs) Goodfellow et al. (2014) in vision, there have been multiple attempts to adapt them to the text domain Bowman et al. (2015); Yu et al. (2017) and leverage their outstanding generation quality. However, the autoregressive nature of the decoder and the innate discreteness of the text domain, respectively, give rise to numerous optimization challenges. For instance, Bowman et al. (2015) discuss KL-cost annealing and history-less decoding for stable VAE training; Yu et al. (2017) introduce a reinforcement-learning based algorithm to address the non-differentiability induced in the network by sampling from the generator, but it results in an unstable training procedure.
A common choice of loss function in existing methods is the cross-entropy loss, which learns a widespread distribution via maximum likelihood estimation. It thus results in generic generated responses, analogous to the mode-averaging problem for continuous variables. To mitigate this problem of generic response generation, we use an adversarial loss, which exhibits the property of mode-collapse rather than mode-averaging. Furthermore, we use the mean squared error (MSE) as an auxiliary loss that helps the network select a few meaningful modes, resulting in more relevant responses. However, training an adversarial network on words is difficult; hence, we train the network on their latent codes. More precisely, we employ a two-step training procedure: 1) we train a variational autoencoder to learn meaningful representations of input sentences, and 2) we use a generator to transform a query's (more generally, the context's) latent code into that of the response, which, in turn, is fed to the pretrained autoencoder's decoder for actual response generation. Splitting the task into two easier steps and enforcing adversarial training only on the latent codes reaps multiple optimization benefits, in addition to achieving the primary goal of better-conditioned response generation. Compared to Gu et al. (2018b), the current state of the art for dialog generation, we achieve better results in terms of BLEU scores, diversity, and fluency.
We evaluate our model on a deduplicated version Bahuleyan et al. (2017) of the benchmark DailyDialog dataset. Our model outperforms the prior methods on all automatic evaluation metrics. The results indicate that responses generated by our model are more relevant to the input query (context, in general) while simultaneously being more diverse and fluent.
The rest of the paper is organized as follows: Section 2 touches upon the relevant literature; Section 3 describes the proposed methodology in detail; Section 4 presents experiment results and analysis; and, finally, we conclude in Section 5.
2 Related Work
RNN-based encoder-decoder approaches have been very popular for dialog models. To encourage diversity and move away from safe responses, several variants have been proposed. Some focus on making use of additional information such as the topic Xing et al. (2016). Others use a more complex architecture, like HRED Serban et al. (2015), which uses a hierarchical encoder-decoder to encourage more complex and diverse responses.
Variational autoencoder (VAE) Kingma and Welling (2013) based models are used in dialog systems to encourage complex and diverse responses by capturing variability. The conditional variational autoencoder (CVAE) Zhao et al. (2017) encourages diverse responses by using the variational inference framework to learn the posterior conditioned on the context latent variable. However, the "posterior collapse" issue remains a challenge for VAE-based models. Several variants have been proposed that tackle this and allow for more complex representations, like VHRED Serban et al. (2016), which makes use of the hierarchical encoder-decoder framework and augments it with a latent variable at the decoder, trained by maximizing a variational lower bound on the log-likelihood. Park et al. (2018) propose Variational Hierarchical Conversation RNNs (VHCR), which impose a hierarchical structure on the learned latent variables. A variant of the CVAE called CVAE-CO was proposed by Shen et al. (2018); it makes use of autoencoder training that collaborates with the CVAE to generate better responses.
Generative adversarial networks (GANs), proposed by Goodfellow et al. (2014), have been used with great success in generating images. Text generation with GANs has not been as successful due to the discrete nature of text Xu et al. (2017). Li et al. (2017) proposed combining reinforcement learning with GANs, where the outputs from the discriminator are used as rewards for the generator. However, Shen et al. (2017) argue that training with REINFORCE is unstable due to the high variance of the sampled gradients. Makhzani et al. (2015) proposed adversarial autoencoders, which use adversarial regularization on the latent space to match the learned aggregated posterior to an arbitrary prior. DialogWAE, proposed by Gu et al. (2018b), combines a conditional Wasserstein autoencoder (WAE) with adversarial learning for dialog response generation.
3 Proposed Methodology
Figure 1 provides an overview of our proposed two-step approach. We provide a background on the relevant components used in our approach: VAE and GAN. We then provide the problem formulation and a stepwise description of our approach.
3.1 Variational Autoencoder (VAE)
The variational autoencoder (VAE) was first applied by Bowman et al. (2015) to generating sentences from a continuous space. The VAE imposes a probabilistic distribution on the latent vector and uses KL-divergence to match the posterior with a standard prior distribution. The decoder then reconstructs the data based on a latent vector sampled from the posterior distribution.
The autoencoding loss, in this case, is given by:

L_VAE(x) = -E_{q(z|x)}[log p(x|z)] + λ · KL(q(z|x) || p(z))

where p(z) is the prior distribution, usually set to the standard normal N(0, I); λ is a tunable hyperparameter representing the KL term's weight; and q(z|x) represents the posterior distribution of the form N(μ, diag(σ²)), where μ and σ are learnt by the encoder.
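As a concrete check of the loss above, the KL term between a diagonal Gaussian posterior and the standard normal prior has a well-known closed form. The sketch below (pure Python, with illustrative function names not taken from the paper) computes that closed form and the weighted total loss:

```python
import math

def kl_to_standard_normal(mu, sigma):
    # Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ),
    # summed over latent dimensions.
    return 0.5 * sum(s * s + m * m - 1.0 - math.log(s * s)
                     for m, s in zip(mu, sigma))

def vae_loss(recon_nll, mu, sigma, kl_weight):
    # Total VAE loss = reconstruction NLL + weighted KL term.
    return recon_nll + kl_weight * kl_to_standard_normal(mu, sigma)

# When mu = 0 and sigma = 1 the posterior equals the prior, so KL = 0
# and the loss reduces to the reconstruction term alone.
mu, sigma = [0.0] * 128, [1.0] * 128
print(kl_to_standard_normal(mu, sigma))   # 0.0
print(vae_loss(2.0, mu, sigma, 0.15))     # 2.0
```

Note that the KL term is non-negative and grows as the posterior drifts from the prior, which is exactly what the weight λ trades off against reconstruction quality.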
3.2 Generative Adversarial Network (GAN)
The generative adversarial network was introduced by Goodfellow et al. (2014). It has two components: a generator G and a discriminator D. The generator tries to produce fake samples in order to maximize the probability of the discriminator making a mistake. The optimizing criterion corresponding to this minimax game is:

min_G max_D V(D, G) = E_{x ~ p_data(x)}[log D(x)] + E_{z ~ p_z(z)}[log(1 − D(G(z)))]

In a traditional GAN, p_data represents the data distribution to be learnt by the generator network G, which uses noise z ~ p_z(z) as input and learns an internal distribution p_g to mimic p_data as closely as possible. As shown in Goodfellow et al. (2014), for the optimal discriminator the above criterion can be rewritten as:

C(G) = −log 4 + 2 · JSD(p_data || p_g)

where JSD(p_data || p_g) is the Jensen–Shannon divergence between the two distributions.
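To make the rewritten criterion tangible, the Jensen–Shannon divergence for discrete distributions can be sketched in a few lines (pure Python; function names are illustrative). When p_g matches p_data the JSD is 0 and the criterion attains its minimum −log 4:

```python
import math

def kl_div(p, q):
    # KL divergence between two discrete distributions (natural log);
    # terms with p_i = 0 contribute nothing.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    # Jensen-Shannon divergence: symmetrised KL against the mixture.
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl_div(p, m) + 0.5 * kl_div(q, m)

p = [0.5, 0.5]
print(jsd(p, p))                    # 0.0: generator matches the data
print(jsd([1.0, 0.0], [0.0, 1.0]))  # log 2: maximally separated supports
```

The second call shows the JSD is bounded (by log 2 in nats), unlike the KL divergence, which is one reason it appears in the GAN analysis.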
3.3 Problem Formulation
First, we define the variables that we use. We denote the set of all queries by Q and the set of all responses by R. We denote the set of all sentences, made up of queries and responses, by S = Q ∪ R. A sentence containing n tokens is denoted by x = (x_1, x_2, …, x_n).

The main task can be described as follows: given a query q ∈ Q, generate the corresponding response r ∈ R. Formally, generating a response for a given query can be viewed as learning the conditional distribution P(r | q), where q and r denote random variables corresponding to the query and the response distribution, respectively.
We employ a two-step procedure to learn the conditional distribution. The details of the two steps are as follows:
3.4 Step 1: Training an Autoencoder
First, we train an autoencoder to learn meaningful encodings of a given sentence, irrespective of whether it is a query or a response. The training objective is the VAE loss described in Section 3.1:

L(x) = -E_{q(z|x)}[log p(x|z)] + λ · KL(q(z|x) || p(z))

where z is the latent representation of a sentence x and λ is the weight of the KL term.
3.5 Step 2: Learning the Conditional Distribution
Once we have finished training the autoencoder, we proceed to our second step, in which we train a conditional generative adversarial network to learn the conditional response distribution. Apart from the objective function proposed for the conditional GAN by Mirza and Osindero (2014), we apply an MSE (mean squared error) loss on the generator to promote training stability and faster convergence. Therefore, the training objective for our second step is:

min_G max_D E[log D(z_q, z_r)] + E[log(1 − D(z_q, G(z_q)))] + β · E[||G(z_q) − z_r||²]

Here z_q and z_r denote the latent codes of the query and the response, G(z_q) represents the learned distribution of response latent codes, and β is a tunable hyperparameter that moderates the effect of the MSE loss.
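A minimal numerical sketch of this combined objective, assuming a sigmoid discriminator output and treating latent codes as plain vectors (function names are illustrative, not from the paper's implementation):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def d_loss(real_logits, fake_logits):
    # Discriminator maximises log D(z_q, z_r) + log(1 - D(z_q, G(z_q))),
    # i.e. minimises the negated mean.
    return -sum(math.log(sigmoid(r)) + math.log(1.0 - sigmoid(f))
                for r, f in zip(real_logits, fake_logits)) / len(real_logits)

def g_loss(fake_logits, z_fake, z_real, beta):
    # Generator: adversarial term plus the auxiliary MSE term,
    # weighted by the hyperparameter beta.
    adv = -sum(math.log(sigmoid(f)) for f in fake_logits) / len(fake_logits)
    mse = sum((a - b) ** 2 for a, b in zip(z_fake, z_real)) / len(z_fake)
    return adv + beta * mse
```

When the generated code already equals the target code, the MSE term vanishes and only the adversarial term remains, which is the intended behaviour of an auxiliary loss that pulls G toward meaningful modes.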
3.6 Multi-Turn Setting
We train and evaluate our proposed approach in two different settings: single-turn and multi-turn. In the single-turn setting, we form query-response pairs by extracting every possible pair of consecutive utterances belonging to the same dialog. In the multi-turn setting, for every response utterance we use all the available preceding utterances in the same conversation as the context. We encode this multi-utterance context using the pretrained encoder and use an RNN-based generator to utilize the full multi-utterance context.
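The two pair-construction schemes above can be sketched directly (pure Python; a dialog is just a list of utterance strings, and the function names are illustrative):

```python
def single_turn_pairs(dialog):
    # Every pair of consecutive utterances in a dialog
    # forms one (query, response) training example.
    return [(dialog[i], dialog[i + 1]) for i in range(len(dialog) - 1)]

def multi_turn_pairs(dialog):
    # For each response utterance, the context is the list of
    # all preceding utterances in the same conversation.
    return [(dialog[:i], dialog[i]) for i in range(1, len(dialog))]

dialog = ["hi", "hello, how are you?", "fine, thanks"]
print(single_turn_pairs(dialog))
print(multi_turn_pairs(dialog))
```

A dialog with k utterances thus yields k−1 examples under both settings; they differ only in how much preceding context each example carries.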
4 Experiments

We perform experiments on the DailyDialog dataset Li et al. (2017), a manually labelled multi-turn dialog dataset. We use the original split after removing duplicates, following Bahuleyan et al. (2018).
4.1 Baselines

We use the following baseline models:
Seq2Seq: A standard Seq2Seq model using a bidirectional LSTM with an attention mechanism.
HRED: A generalized Seq2Seq model that uses a hierarchical RNN encoder Serban et al. (2015).
CVAE: A conditional VAE model with KL annealing Zhao et al. (2017).
CVAE-CO: A collaborative conditional VAE model Shen et al. (2018).
WED-S: The stochastic Wasserstein encoder-decoder Bahuleyan et al. (2018) with the default hyperparameters, trained on the deduplicated dataset.
DialogWAE: The DialogWAE model proposed by Gu et al. (2018b). This model also trains a GAN on the latent space, but uses a WAE instead of a VAE to encode sentences, and Gumbel-Softmax to generate diverse/multi-modal responses.
4.2 Parameter Settings and Training
We use a bidirectional LSTM Hochreiter and Schmidhuber (1997) for the encoder of the VAE and a unidirectional LSTM for the decoder. Both use a hidden size of 512. We use an embedding dimension of 300. The dimension of our latent vectors is 128.
We also adopt the techniques of KL annealing and word dropout from Bowman et al. (2015). We use a dropout probability of 0.5 and a sigmoid annealing schedule that anneals the KL weight to 0.15 over 4500 iterations.
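A sigmoid annealing schedule like the one described can be sketched as follows; the steepness and midpoint below are illustrative assumptions, since the exact schedule parameters are not specified above:

```python
import math

def kl_weight(step, total_steps=4500, max_weight=0.15, steepness=10.0):
    # Sigmoid annealing: ramps the KL weight from near 0 up to
    # max_weight over total_steps iterations. The steepness and
    # midpoint (total_steps / 2) are illustrative choices.
    x = steepness * (step / total_steps - 0.5)
    return max_weight / (1.0 + math.exp(-x))

print(kl_weight(0))      # near 0 at the start of training
print(kl_weight(2250))   # exactly half of max_weight at the midpoint
print(kl_weight(4500))   # close to the target 0.15
```

Starting the KL weight near zero lets the decoder learn to reconstruct before the KL term begins pressuring the posterior toward the prior, which is the standard remedy for posterior collapse mentioned in Section 2.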
4.3 Evaluation Metrics

We measure the performance of the model based on three criteria: relevance, diversity, and fluency of the responses.

Relevance We measure the relevance of the generated responses via n-gram overlap with the gold response. We use the BLEU scores proposed by Papineni et al. (2002) and adopt the smoothing techniques proposed by Chen and Cherry (2014). Specifically, we use BLEU-3 with Smoothing 7.
Diversity For each query, we sample 10 responses and compute two types of diversity measures. Intra distinct-1 and Intra distinct-2 measure the proportion of distinct unigrams and bigrams, respectively, within each response. Inter distinct-1 and Inter distinct-2 measure the proportion of distinct unigrams and bigrams, respectively, across all 10 responses for a query.
We also report the average sentence length of the responses. For comparison, the average response length of the ground truth responses is 14.43.
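The distinct-n metrics described above can be sketched as follows (pure Python over tokenized responses; function names are illustrative):

```python
def distinct_n(tokens, n):
    # Proportion of distinct n-grams in a single token sequence.
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

def intra_distinct_n(responses, n):
    # Average distinct-n computed within each sampled response.
    return sum(distinct_n(r, n) for r in responses) / len(responses)

def inter_distinct_n(responses, n):
    # Distinct-n over the n-grams of all sampled responses pooled
    # together (no n-grams are formed across response boundaries).
    all_ngrams = [tuple(r[i:i + n])
                  for r in responses for i in range(len(r) - n + 1)]
    return len(set(all_ngrams)) / len(all_ngrams) if all_ngrams else 0.0

responses = [["i", "am", "fine"], ["i", "am", "good"]]
print(intra_distinct_n(responses, 1))  # 1.0: no repeats inside either response
print(inter_distinct_n(responses, 1))  # 4 distinct unigrams out of 6 total
```

The split between the two views is the point of the metric pair: intra scores catch degenerate repetition within a response, while inter scores catch a model that keeps emitting the same safe response for a query.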
Perplexity We measure the perplexity (PPL) of our generated responses using a Kneser–Ney trigram language model Kneser and Ney (1995).
4.4 Results and Analysis
We present our results in Table 1.
Relevance: Our model is the best performing model in terms of relevance, achieving the best precision and recall for BLEU scores. The DialogWAE model also produces good BLEU scores, while the Seq2Seq model is the worst performing in this respect.
Diversity: The Intra diversity scores are fairly similar across most models. They indicate the diversity of words within a sentence; a low score reveals poor training artifacts such as word repetition. The Inter diversity scores, however, are more informative, and our model performs the best on both Inter diversity metrics. We also measure the average sentence length of the responses. As expected, the responses generated by the Seq2Seq model are very short. The DialogWAE model generates longer responses on average, but our model is closer in length to the ground truth (14.43).
Fluency: The PPL scores measure fluency, i.e., how similar our responses are to the responses found in the dataset. Our model achieves the best PPL scores, with DialogWAE close behind. The good scores of the Seq2Seq model are likely due to its very short and generic responses.
Overall, our model performs well across all criteria, and it shows a significant improvement in the diversity of responses for a given query (Inter-1 and Inter-2). Our ablation results show that combining the adversarial loss and the MSE loss leads to significant improvements across almost all metrics, especially overlap, response diversity, and fluency. We also observed that the MSE term leads to quicker and more stable convergence of the GAN (within 6 epochs), making training easier.
5 Conclusion

We propose an efficient two-stage model for conditional text generation. We make use of the semantically aware sentence representations learned by a variational autoencoder and train a conditional generative adversarial network on the VAE latent space to generate diverse responses conditioned on the query. Our model outperforms existing state-of-the-art VAE-based approaches, generating more diverse, fluent, and relevant responses.
References

- Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §1.
- Variational attention for sequence-to-sequence models. arXiv preprint arXiv:1712.08207. Cited by: §1.
- Stochastic Wasserstein autoencoder for probabilistic sentence generation. arXiv preprint arXiv:1806.08462. Cited by: §4.1.
- Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349. Cited by: §1, §4.2.
- A systematic comparison of smoothing techniques for sentence-level BLEU. In Proceedings of the Ninth Workshop on Statistical Machine Translation, Baltimore, Maryland, USA, pp. 362–367. Cited by: §4.3.
- Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680. Cited by: §1, §3.2.
- NIPS 2016 tutorial: generative adversarial networks. Cited by: §4.2.
- DialogWAE: multimodal response generation with conditional Wasserstein auto-encoder. arXiv preprint arXiv:1805.12352. Cited by: §2, Table 2.
- Long short-term memory. Neural Computation 9(8), pp. 1735–1780. Cited by: §4.2.
- Batch normalization: accelerating deep network training by reducing internal covariate shift. Cited by: §4.2.
- Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114. Cited by: §1, §2.
- Improved backing-off for m-gram language modeling. In ICASSP, pp. 181–184. Cited by: §4.3.
- Adversarial learning for neural dialogue generation. CoRR abs/1701.06547. Cited by: §2.
- DailyDialog: a manually labelled multi-turn dialogue dataset. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Taipei, Taiwan, pp. 986–995. Cited by: §4.
- Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025. Cited by: §1.
- Rectifier nonlinearities improve neural network acoustic models. In ICML Workshop on Deep Learning for Audio, Speech and Language Processing. Cited by: §4.2.
- Adversarial autoencoders. CoRR abs/1511.05644. Cited by: §2.
- BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, ACL '02, Stroudsburg, PA, USA, pp. 311–318. Cited by: §4.3.
- A hierarchical latent structure for variational conversation modeling. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 1792–1801. Cited by: §2.
- Building end-to-end dialogue systems using generative hierarchical neural network models. Cited by: §2, §4.1.
- A hierarchical latent variable encoder-decoder model for generating dialogues. Cited by: §2.
- Style transfer from non-parallel text by cross-alignment. CoRR abs/1705.09655. Cited by: §2.
- Improving variational encoder-decoders in dialogue generation. Cited by: §2, §4.1.
- Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pp. 3104–3112. Cited by: §1.
- Topic aware neural response generation. Cited by: §2.
- Neural response generation via GAN with an approximate embedding layer. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 617–626. Cited by: §2.
- SeqGAN: sequence generative adversarial nets with policy gradient. In Thirty-First AAAI Conference on Artificial Intelligence. Cited by: §1.
- Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Cited by: §2, §4.1.