
# DialogWAE: Multimodal Response Generation with Conditional Wasserstein Auto-Encoder

Variational autoencoders (VAEs) have shown promise in data-driven conversation modeling. However, most VAE conversation models match the approximate posterior distribution over the latent variables to a simple prior such as a standard normal distribution, thereby restricting the generated responses to a relatively simple (e.g., single-modal) scope. In this paper, we propose DialogWAE, a conditional Wasserstein autoencoder (WAE) specially designed for dialogue modeling. Unlike VAEs that impose a simple distribution over the latent variables, DialogWAE models the distribution of data by training a GAN within the latent variable space. Specifically, our model samples from the prior and posterior distributions over the latent variables by transforming context-dependent random noise using neural networks and minimizes the Wasserstein distance between the two distributions. We further develop a Gaussian mixture prior network to enrich the latent space. Experiments on two widely used datasets show that DialogWAE outperforms the state-of-the-art approaches in generating more coherent, informative and diverse responses.

## 1 Introduction

Neural response generation has long been of interest in natural language research. Most recent approaches to data-driven conversation modeling build upon sequence-to-sequence learning (Cho et al., 2014; Sutskever et al., 2014). Previous research has demonstrated that sequence-to-sequence conversation models often suffer from the safe response problem and fail to generate meaningful, diverse on-topic responses (Li et al., 2015; Sato et al., 2017). Conditional variational autoencoders (CVAE) have shown promising results in addressing the safe response issue (Zhao et al., 2017; Shen et al., 2018). A CVAE generates the response conditioned on a latent variable (representing topics, tones and situations of the response) and approximates the posterior distribution over latent variables using a neural network. The latent variable captures variabilities in the dialogue and thus yields more diverse responses. However, previous studies have shown that VAE models tend to suffer from the posterior collapse problem, where the decoder learns to ignore the latent variable and degrades to a vanilla RNN (Shen et al., 2018; Park et al., 2018; Bowman et al., 2015). Furthermore, they match the approximate posterior distribution over the latent variables to a simple prior such as the standard normal distribution, thereby restricting the generated responses to a relatively simple (e.g., unimodal) scope (Goyal et al., 2017).

A number of studies have sought GAN-based approaches (Goodfellow et al., 2014; Li et al., 2017a; Xu et al., 2017) which directly model the distribution of the responses. However, adversarial training over discrete tokens is known to be difficult because sampling discrete tokens is non-differentiable. Li et al. (2017a) proposed a hybrid model of GAN and reinforcement learning (RL) where the score predicted by a discriminator is used as a reward to train the generator. However, training with REINFORCE has been observed to be unstable due to the high variance of the gradient estimate (Shen et al., 2017). Xu et al. (2017) make the GAN model differentiable with an approximate word embedding layer. However, their model only injects variability at the word level and is thus limited in representing high-level response variabilities such as topics and situations.

In this paper, we propose DialogWAE, a novel variant of GAN for neural conversation modeling. Unlike VAE conversation models that impose a simple distribution over latent variables, DialogWAE models the data distribution by training a GAN within the latent variable space. Specifically, it samples from the prior and posterior distributions over the latent variables by transforming context-dependent random noise with neural networks, and minimizes the Wasserstein distance (Arjovsky et al., 2017) between the prior and the approximate posterior distributions. Furthermore, our model takes into account the multimodal nature of responses (a multimodal distribution is a continuous probability distribution with two or more modes) by using a Gaussian mixture prior network. Adversarial training with the Gaussian mixture prior network enables DialogWAE to capture a richer latent space, yielding more coherent, informative and diverse responses.

Our main contributions are two-fold: (1) A novel GAN-based model for neural dialogue modeling, which employs GAN to generate samples of latent variables. (2) A Gaussian mixture prior network to sample random noise from a multimodal prior distribution. To the best of our knowledge, the proposed DialogWAE is the first GAN conversation model that exploits multimodal latent structures.

We evaluate our model on two benchmark datasets, Switchboard (Godfrey and Holliman, 1997) and DailyDialog (Li et al., 2017b). The results demonstrate that our model substantially outperforms the state-of-the-art methods in terms of BLEU, word embedding similarity, and distinct. Furthermore, we highlight how the GAN architecture with a Gaussian mixture prior network facilitates the generation of more diverse and informative responses.

## 2 Related Work

Encoder-decoder variants  To address the “safe response” problem of the naive encoder-decoder conversation model, a number of variants have been proposed. Li et al. (2015) propose a diversity-promoting objective function to encourage more varied responses. Sato et al. (2017) incorporate various types of situations behind conversations when encoding utterances and decoding their responses. Xing et al. (2017) incorporate topic information into the sequence-to-sequence framework to generate informative and interesting responses. Our work differs from the aforementioned studies in that it does not rely on extra information such as situations and topics.

VAE conversation models  The variational autoencoder (VAE) (Kingma and Welling, 2014) is among the most popular frameworks for dialogue modeling (Zhao et al., 2017; Shen et al., 2018; Park et al., 2018). Serban et al. (2017) propose VHRED, a hierarchical latent variable sequence-to-sequence model that explicitly models multiple levels of variability in the responses. A main challenge for the VAE conversation models is the so-called “posterior collapse”. To alleviate the problem, Zhao et al. (2017) introduce an auxiliary bag-of-words loss to the decoder. They further incorporate extra dialogue information such as dialogue acts and speaker profiles. Shen et al. (2018) propose a collaborative CVAE model which samples the latent variable by transforming a Gaussian noise using neural networks and matches the prior and posterior distributions of the Gaussian noise with KL divergence. Park et al. (2018) propose a variational hierarchical conversation RNN (VHCR) which incorporates a hierarchical structure to latent variables. DialogWAE addresses the limitation of VAE conversation models by using a GAN architecture in the latent space.

GAN conversation models  Although GAN/CGAN has shown great success in image generation, adapting it to natural dialog generators is a non-trivial task. This is due to the non-differentiable nature of natural language tokens (Shen et al., 2017; Xu et al., 2017). Li et al. (2017a) address this problem by combining GAN with reinforcement learning (RL), where the discriminator predicts a reward to optimize the generator. However, training with REINFORCE can be unstable due to the high variance of the sampled gradient (Shen et al., 2017). Xu et al. (2017) make the sequence-to-sequence GAN differentiable by directly multiplying the word probabilities obtained from the decoder by the corresponding word vectors, yielding an approximately vectorized representation of the target sequence. However, their approach injects diversity at the word level rather than at the level of whole responses. DialogWAE differs from existing GAN conversation models in that it shapes the distribution of responses in a high-level latent space rather than over tokens directly, and it does not rely on RL, where the gradient variances are large.

## 3 Proposed Approach

### 3.1 Problem Statement

Let $d=[u_1,\dots,u_k]$ denote a dialogue of $k$ utterances where $u_i=[w_1,\dots,w_{|u_i|}]$ represents an utterance and $w_n$ denotes the $n$-th word in $u_i$. Let $c=[u_1,\dots,u_{k-1}]$ denote a dialogue context, the $k-1$ historical utterances, and $x=u_k$ be a response, that is, the next utterance. Our goal is to estimate the conditional distribution $p_\theta(x|c)$.

As $c$ and $x$ are sequences of discrete tokens, it is non-trivial to find a direct coupling between them. Instead, we introduce a continuous latent variable $z\in\mathcal{Z}$ that represents the high-level representation of the response. The response generation can be viewed as a two-step procedure, where a latent variable $z$ is sampled from a distribution $p_\theta(z|c)$ on a latent space $\mathcal{Z}$, and then the response $x$ is decoded from $z$ with $p_\theta(x|c,z)$. Under this model, the likelihood of a response is

$$p_\theta(x|c)=\int_z p(x|c,z)\,p(z|c)\,dz. \tag{1}$$

The exact log-probability is difficult to compute since it is intractable to marginalize out $z$. Therefore, we approximate the posterior distribution of $z$ as $q_\phi(z|x,c)$, which can be computed by a neural network named the recognition network. Using this approximate posterior, we can instead compute the evidence lower bound (ELBO):

$$\log p_\theta(x|c)\ge\mathbb{E}_{q_\phi(z|x,c)}\left[\log p(x|c,z)\right]-\mathrm{KL}\left(q_\phi(z|x,c)\,\|\,p(z|c)\right), \tag{2}$$

where $p(z|c)$ represents the prior distribution of $z$ given $c$ and can be modeled with a neural network named the prior network.
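When both the approximate posterior and the prior are diagonal Gaussians, as is typical in VAE conversation models, the KL regularizer of an ELBO has a closed form. The following is a minimal sketch of that closed form; the function name and the list-based representation are illustrative assumptions, not part of the paper.

```python
import math

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) for two diagonal Gaussians given as lists of means and log-variances."""
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        # Per-dimension term: 0.5 * (log(var_p/var_q) + (var_q + (mu_q - mu_p)^2) / var_p - 1)
        total += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return total

# A unit-variance posterior shifted by 1 from a standard normal prior.
print(kl_diag_gaussians([1.0], [0.0], [0.0], [0.0]))  # prints 0.5
```

DialogWAE replaces this KL term with a Wasserstein distance estimated adversarially, as described in the next section.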

### 3.2 Conditional Wasserstein Auto-Encoders for Dialogue Modeling

Conventional VAE conversation models assume that the latent variable $z$ follows a simple prior distribution such as the normal distribution. However, the latent space of real responses is more complicated and difficult to estimate with such a simple distribution. This often leads to the posterior collapse problem (Shen et al., 2018).

Inspired by GAN and the adversarial auto-encoder (AAE) (Makhzani et al., 2015; Tolstikhin et al., 2017; Zhao et al., 2018), we model the distribution of $z$ by training a GAN within the latent space. We sample from the prior and posterior over the latent variables by transforming random noise $\epsilon$ using neural networks. Specifically, the prior sample $\tilde{z}\sim p_\theta(z|c)$ is generated by a generator $G$ from context-dependent random noise $\tilde{\epsilon}$, while the approximate posterior sample $z\sim q_\phi(z|x,c)$ is generated by a generator $Q$ from context-dependent random noise $\epsilon$. Both $\tilde{\epsilon}$ and $\epsilon$ are drawn from a normal distribution whose mean and covariance matrix (assumed diagonal) are computed by the prior network and the recognition network, respectively:

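$$\tilde{z}=G_\theta(\tilde{\epsilon}),\quad \tilde{\epsilon}\sim\mathcal{N}(\epsilon;\tilde{\mu},\tilde{\sigma}^2I),\quad \begin{bmatrix}\tilde{\mu}\\ \log\tilde{\sigma}^2\end{bmatrix}=\tilde{W}f_\theta(c)+\tilde{b} \tag{3}$$

$$z=Q_\phi(\epsilon),\quad \epsilon\sim\mathcal{N}(\epsilon;\mu,\sigma^2I),\quad \begin{bmatrix}\mu\\ \log\sigma^2\end{bmatrix}=Wg_\phi\left(\begin{bmatrix}x\\ c\end{bmatrix}\right)+b, \tag{4}$$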

where $f_\theta(\cdot)$ and $g_\phi(\cdot)$ are feed-forward neural networks. Our goal is to minimize the divergence between $p_\theta(z|c)$ and $q_\phi(z|x,c)$ while maximizing the log-probability of a reconstructed response from $z$. We thus solve the following problem:

$$\min_{\theta,\phi,\psi}\;-\mathbb{E}_{q_\phi(z|x,c)}\log p_\psi(x|z,c)+W\left(q_\phi(z|x,c)\,\|\,p_\theta(z|c)\right), \tag{5}$$

where $p_\theta(z|c)$ and $q_\phi(z|x,c)$ are neural networks implementing Equations 3 and 4, respectively, and $p_\psi(x|z,c)$ is a decoder. $W(\cdot\,\|\,\cdot)$ represents the Wasserstein distance between these two distributions (Arjovsky et al., 2017). We choose the Wasserstein distance as the divergence since the WGAN has been shown to produce good results in text generation (Zhao et al., 2018).

Figure 1 illustrates an overview of our model. The utterance encoder (RNN) transforms each utterance (including the response $x$) in the dialogue into a real-valued vector. For the $i$-th utterance in the context, the context encoder (RNN) takes as input the concatenation of its encoding vector and the conversation floor (1 if the utterance is from the speaker of the response, otherwise 0) and computes its hidden state $h_i^{ctx}$. The final hidden state of the context encoder is used as the context representation.

At generation time, the model draws a random noise $\tilde{\epsilon}$ from the prior network (PriNet), which transforms $c$ through a feed-forward network followed by two matrix multiplications that yield the mean and the diagonal covariance, respectively. Then, the generator G generates a sample of the latent variable $\tilde{z}$ from the noise through a feed-forward network. The decoder RNN decodes the generated $\tilde{z}$ into a response.
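The generation-time path (context, then mean and log-variance, then noise, then latent sample) can be sketched in a few lines. This is a toy illustration under stated assumptions: `prinet_output` is a made-up vector standing in for the prior network's output, and `toy_generator` is a hypothetical placeholder for the trained generator G, not the authors' network.

```python
import math
import random

def split_mean_logvar(h):
    """Split a network output [mu; log sigma^2] into its two halves."""
    half = len(h) // 2
    return h[:half], h[half:]

def reparameterize(mu, logvar, rng):
    """Draw eps ~ N(mu, sigma^2 I) as mu + sigma * standard-normal noise."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]

def toy_generator(eps):
    """Hypothetical stand-in for G: an element-wise tanh instead of a trained network."""
    return [math.tanh(e) for e in eps]

rng = random.Random(0)
prinet_output = [0.5, -0.5, -20.0, -20.0]  # made-up [mu; log sigma^2] for 2-d noise
mu, logvar = split_mean_logvar(prinet_output)
eps_tilde = reparameterize(mu, logvar, rng)  # context-dependent noise
z_tilde = toy_generator(eps_tilde)           # latent sample passed to the decoder
```

Writing the noise as mean plus scaled standard-normal noise keeps the sample differentiable with respect to the mean and log-variance, which is what the re-parametrization trick relies on.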

At training time, the model infers the posterior distribution of the latent variable conditioned on the context $c$ and the response $x$. The recognition network (RecNet) takes as input the concatenation of $x$ and $c$ and transforms it through a feed-forward network followed by two matrix multiplications that define the normal mean and diagonal covariance, respectively. A Gaussian noise $\epsilon$ is drawn from the recognition network with the re-parametrization trick. Then, the generator Q transforms the Gaussian noise $\epsilon$ into a sample of the latent variable $z$ through a feed-forward network. The response decoder (RNN) computes the reconstruction loss:

$$\mathcal{L}_{rec}=-\mathbb{E}_{z=Q(\epsilon),\,\epsilon\sim \mathrm{RecNet}(x,c)}\log p_\psi(x|c,z) \tag{6}$$

We match the approximate posterior with the prior distribution of $z$ by introducing an adversarial discriminator D which tells apart the prior samples from the posterior samples. D is implemented as a feed-forward neural network which takes as input the concatenation of $z$ and $c$ and outputs a real value. We train D by minimizing the discriminator loss:

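$$\mathcal{L}_{disc}=\mathbb{E}_{\epsilon\sim \mathrm{RecNet}(x,c)}\left[D(Q(\epsilon),c)\right]-\mathbb{E}_{\tilde{\epsilon}\sim \mathrm{PriNet}(c)}\left[D(G(\tilde{\epsilon}),c)\right] \tag{7}$$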
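On a mini-batch, the two expectations in the discriminator loss are estimated by sample means. A minimal sketch, assuming the discriminator scores have already been computed; the function name and the example scores are made up for illustration.

```python
def discriminator_loss(d_posterior, d_prior):
    """Batch estimate of the discriminator loss: mean D score on posterior
    samples minus mean D score on prior samples."""
    return sum(d_posterior) / len(d_posterior) - sum(d_prior) / len(d_prior)

# Made-up discriminator outputs for two posterior and two prior samples.
loss = discriminator_loss([1.0, 3.0], [0.0, 2.0])  # (1+3)/2 - (0+2)/2 = 1.0
```

Minimizing this quantity pushes D toward low scores on posterior samples and high scores on prior samples.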

### 3.3 Multimodal Response Generation with a Gaussian Mixture Prior Network

It is common practice for the prior distribution in the AAE architecture to be a normal distribution. However, responses often have a multimodal nature reflecting many equally possible situations (Sato et al., 2017), topics and sentiments. Random noise with a normal distribution could restrict the generator to output a latent space with a single dominant mode due to the unimodal nature of the Gaussian distribution. Consequently, the generated responses could follow simple prototypes.

To capture multiple modes in the probability distribution over the latent variable, we further propose to use a distribution that explicitly defines more than one mode. Each time, the noise used to generate the latent variable is selected from one of the modes. To this end, we let the prior network capture a mixture of Gaussian distributions, namely, $\mathrm{GMM}(\{\pi_k,\mu_k,\sigma_k^2I\}_{k=1}^{K})$, where $\pi_k$, $\mu_k$ and $\sigma_k$ are parameters of the $k$-th component. This allows it to learn a multimodal manifold in the latent variable space in a two-step generation process: first choosing a component $k$ with $\pi_k$, and then sampling Gaussian noise within the selected component:

$$p(\epsilon|c)=\sum_{k=1}^{K}v_k\,\mathcal{N}(\epsilon;\mu_k,\sigma_k^2I), \tag{8}$$

where $v_k\in[0,1]$ is a component indicator with class probabilities $\pi_1,\dots,\pi_K$; $\pi_k$ is the mixture coefficient of the $k$-th component of the GMM. They are computed as

$$\pi_k=\frac{\exp(e_k)}{\sum_{i=1}^{K}\exp(e_i)},\quad\text{where }\begin{bmatrix}e_k\\ \mu_k\\ \log\sigma_k^2\end{bmatrix}=W_kf_\theta(c)+b_k \tag{9}$$
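For reference, exact two-step sampling from such a mixture can be sketched as follows. This is an illustrative toy under stated assumptions: the logits, means and log-variances are made-up inputs standing in for the prior network's outputs, and the function names are not from the paper.

```python
import math
import random

def softmax(logits):
    """Mixture coefficients pi_k from unnormalized scores e_k."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(e - m) for e in logits]
    total = sum(exps)
    return [x / total for x in exps]

def sample_gmm_noise(logits, mus, logvars, rng):
    """Pick component k with probability pi_k, then draw Gaussian noise from it."""
    pis = softmax(logits)
    k = rng.choices(range(len(pis)), weights=pis, k=1)[0]
    eps = [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
           for m, lv in zip(mus[k], logvars[k])]
    return k, eps

rng = random.Random(1)
# Two made-up components with 2-d means and unit variances.
k, eps = sample_gmm_noise([0.0, 1.0], [[0.0, 0.0], [5.0, 5.0]],
                          [[0.0, 0.0], [0.0, 0.0]], rng)
```

The discrete choice of `k` is not differentiable with respect to the logits, which motivates a relaxed sampler.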

Instead of exact sampling, we use Gumbel-Softmax re-parametrization (Kusner and Hernández-Lobato, 2016) to sample an instance of $v$:

$$v_k=\frac{\exp((e_k+g_k)/\tau)}{\sum_{i=1}^{K}\exp((e_i+g_i)/\tau)}, \tag{10}$$

where $g_i$ is a Gumbel noise computed as

$$g_i=-\log(-\log(u_i)),\quad u_i\sim U(0,1)$$

and $\tau\in[0,1]$ is the softmax temperature, which is set to 0.1 in all experiments.
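The Gumbel-Softmax relaxation above can be sketched in plain Python. The function name and the clamping constant are illustrative assumptions; the clamp only guards against taking the logarithm of zero.

```python
import math
import random

def gumbel_softmax(logits, tau, rng):
    """Relaxed one-hot sample: softmax of (logit + Gumbel noise) / tau."""
    gumbels = []
    for _ in logits:
        u = max(rng.random(), 1e-12)  # clamp away from 0 so the logs are finite
        gumbels.append(-math.log(-math.log(u)))
    scores = [(e + g) / tau for e, g in zip(logits, gumbels)]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]

rng = random.Random(0)
v = gumbel_softmax([0.2, 1.5, -0.3], tau=0.1, rng=rng)  # three made-up logits
```

With a low temperature such as 0.1, most of the mass in `v` tends to concentrate on a single component, so the sample behaves close to a one-hot indicator while remaining differentiable in the logits.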

We refer to this framework as DialogWAE-GMP. A comparison of performance with different numbers of prior components will be shown in Section 5.1.

### 3.4 Training

Our model is trained epoch by epoch until convergence. In each epoch, we train the model iteratively by alternating two phases: an AE phase, during which the reconstruction loss of decoded responses is minimized, and a GAN phase, which minimizes the Wasserstein distance between the prior and approximate posterior distributions over the latent variables. The detailed procedures are presented in Algorithm 1.
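The alternation between the two phases can be sketched as a simple loop. This is only a schematic under stated assumptions: alternating on every mini-batch is an assumption, and `dummy_ae_step` and `dummy_gan_step` are hypothetical placeholders for the real parameter updates.

```python
def train_one_epoch(batches, ae_step, gan_step):
    """Alternate an AE phase and a GAN phase on every mini-batch
    (per-batch alternation is an assumption of this sketch)."""
    log = []
    for batch in batches:
        log.append(("AE", ae_step(batch)))    # minimize the reconstruction loss
        log.append(("GAN", gan_step(batch)))  # adversarial update in latent space
    return log

# Hypothetical stand-ins for the real updates; they only return dummy losses.
def dummy_ae_step(batch):
    return float(len(batch))

def dummy_gan_step(batch):
    return -float(len(batch))

history = train_one_epoch([[1, 2], [3]], dummy_ae_step, dummy_gan_step)
```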

## 4 Experimental Setup

Datasets  We evaluate our model on two dialogue datasets, DailyDialog (Li et al., 2017b) and Switchboard (Godfrey and Holliman, 1997), which have been widely used in recent studies (Shen et al., 2018; Zhao et al., 2017). DailyDialog has 13,118 daily-life conversations for English learners. Switchboard contains 2,400 two-way telephone conversations under 70 specified topics. The datasets are separated into training, validation, and test sets with the same ratios as in the baseline papers, that is, 2316:60:62 for Switchboard (Zhao et al., 2017) and 10:1:1 for DailyDialog (Shen et al., 2018).

Metrics  To measure the performance of DialogWAE, we adopted several standard metrics widely used in existing studies: BLEU (Papineni et al., 2002), BOW Embedding (Liu et al., 2016) and distinct (Li et al., 2015). In particular, BLEU measures how much a generated response contains $n$-gram overlaps with the reference. We compute BLEU scores for $n<4$ using smoothing techniques (smoothing 7) (Chen and Cherry, 2014). For each test context, we sample 10 responses from the models and compute their BLEU scores. We define $n$-gram precision and $n$-gram recall as the average and the maximum score, respectively (Zhao et al., 2017).
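Given the BLEU scores of the sampled responses for one context, the precision and recall defined above reduce to a mean and a maximum. A minimal sketch; the function name and the scores are made up for illustration.

```python
def bleu_precision_recall(sample_scores):
    """Precision = average BLEU over the sampled responses; recall = maximum BLEU."""
    precision = sum(sample_scores) / len(sample_scores)
    recall = max(sample_scores)
    return precision, recall

# Made-up BLEU scores for four sampled responses to one context.
precision, recall = bleu_precision_recall([0.10, 0.30, 0.20, 0.40])
```

Recall rewards a model if at least one of its samples is close to the reference, while precision penalizes samples that drift away from it.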

BOW embedding metric is the cosine similarity of bag-of-words embeddings between the hypothesis and the reference. We use three metrics to compute the word embedding similarity: 1. Greedy: greedily matching words in two utterances based on the cosine similarities between their embeddings, and averaging the obtained scores (Rus and Lintean, 2012). 2. Average: cosine similarity between the averaged word embeddings in the two utterances (Mitchell and Lapata, 2008). 3. Extrema: cosine similarity between the largest extreme values among the word embeddings in the two utterances (Forgues et al., 2014). We use Glove vectors (Pennington et al., 2014) as the embeddings, which will be discussed later in this section. For each test context, we report the maximum BOW embedding score among the 10 sampled responses.
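The three embedding metrics can be sketched on toy vectors as follows. The details are assumptions rather than the exact evaluation code: Extrema keeps, per dimension, the value with the largest magnitude, and Greedy is averaged over both matching directions.

```python
import math

def cosine(a, b):
    """Cosine similarity of two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def average_vec(vectors):
    """Element-wise mean of the word embeddings in an utterance."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def extrema_vec(vectors):
    """Per dimension, keep the value with the largest magnitude (assumed definition)."""
    return [max(col, key=abs) for col in zip(*vectors)]

def greedy_score(hyp, ref):
    """Match each word to its most similar word on the other side, then average
    the two directions (the symmetric form is an assumption)."""
    def one_way(src, tgt):
        return sum(max(cosine(s, t) for t in tgt) for s in src) / len(src)
    return 0.5 * (one_way(hyp, ref) + one_way(ref, hyp))

# Made-up 2-d word embeddings for a hypothesis and a reference.
hyp = [[1.0, 0.0], [0.0, 1.0]]
ref = [[1.0, 0.0], [0.5, 0.5]]
avg_sim = cosine(average_vec(hyp), average_vec(ref))
ext_sim = cosine(extrema_vec(hyp), extrema_vec(ref))
grd_sim = greedy_score(hyp, ref)
```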

Distinct computes the diversity of the generated responses. dist-$n$ is defined as the ratio of unique $n$-grams ($n$=1,2) over all $n$-grams in the generated responses. As we sample multiple responses for each test context, we evaluate diversity both within and among the sampled responses. We define intra-dist as the average of distinct values within each sampled response and inter-dist as the distinct value among all sampled responses.
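The intra-dist and inter-dist definitions can be sketched directly from token lists. The helper names and the toy responses are illustrative assumptions.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def distinct(token_lists, n):
    """Ratio of unique n-grams over all n-grams pooled from the given token lists."""
    pooled = [g for tokens in token_lists for g in ngrams(tokens, n)]
    return len(set(pooled)) / len(pooled) if pooled else 0.0

def intra_inter_dist(responses, n):
    """intra-dist: mean distinct value within each response;
    inter-dist: distinct value over all responses pooled together."""
    intra = sum(distinct([r], n) for r in responses) / len(responses)
    inter = distinct(responses, n)
    return intra, inter

# Two made-up sampled responses for one context.
responses = [["i", "am", "fine"], ["i", "am", "ok"]]
intra2, inter2 = intra_inter_dist(responses, 2)  # intra 1.0, inter 3/4
```

Here each response has no repeated bigram (intra-dist 1.0), but the two responses share the bigram ("i", "am"), which lowers inter-dist to 0.75.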

Baselines  We compare the performance of DialogWAE with seven recently-proposed baselines for dialogue modeling: (i) HRED: a generalized sequence-to-sequence model with hierarchical RNN encoder (Serban et al., 2016), (ii) SeqGAN: a GAN based model for sequence generation (Li et al., 2017a), (iii) CVAE: a conditional VAE model with KL-annealing (Zhao et al., 2017), (iv) CVAE-BOW: a conditional VAE model with a BOW loss (Zhao et al., 2017), (v) CVAE-CO: a collaborative conditional VAE model (Shen et al., 2018), (vi) VHRED: a hierarchical VAE model (Serban et al., 2017), and (vii) VHCR: a hierarchical VAE model with conversation modeling (Park et al., 2018).

Training and Evaluation Details  We use gated recurrent units (GRU) (Cho et al., 2014) for the RNN encoders and decoders. The utterance encoder is a bidirectional GRU with 300 hidden units in each direction. The context encoder and decoder are both GRUs with 300 hidden units. The prior and the recognition networks are both 2-layer feed-forward networks of size 200 with tanh non-linearity. The generators $Q$ and $G$ as well as the discriminator $D$ are 3-layer feed-forward networks with ReLU non-linearity (Nair and Hinton, 2010) and hidden sizes of 200, 200 and 400, respectively. The dimension of a latent variable $z$ is set to 200. The initial weights for all fully connected layers are sampled from a uniform distribution over [-0.02, 0.02]. The gradient penalty (Gulrajani et al., 2017) is used when training $D$, and its hyper-parameter $\lambda$ is set to 10. We set the vocabulary size to 10,000 and map all out-of-vocabulary words to a special token `<unk>`. The word embedding size is 200, and the embeddings are initialized with Glove vectors pre-trained on Twitter (Pennington et al., 2014). The size of the context window is set to 10 with a maximum utterance length of 40. We sample responses with greedy decoding so that the randomness comes entirely from the latent variables. The baselines were implemented with the same set of hyper-parameters. All the models are implemented with PyTorch 0.4.0 and fine-tuned on the NAVER Smart Machine Learning (NSML) platform (Sung et al., 2017; Kim et al., 2018).

The models are trained end-to-end with mini-batches of 32 examples. In the AE phase, the models are trained by SGD with an initial learning rate of 1.0 and gradient clipping at 1 (Pascanu et al., 2013). We decay the learning rate by 40% every 10th epoch. In the GAN phase, the models are updated using RMSprop (Tieleman and Hinton) with fixed, separate learning rates for the generator and the discriminator. We tune the hyper-parameters on the validation set and measure the performance on the test set.
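The step decay used in the AE phase can be written as a small function. This is a sketch assuming zero-indexed epochs, with the first decay applied at epoch 10; the function name is hypothetical.

```python
def ae_learning_rate(epoch, initial_lr=1.0, decay=0.4, every=10):
    """Reduce the learning rate by 40% (multiply by 0.6) once per 10 epochs.
    Zero-indexed epochs are an assumption of this sketch."""
    return initial_lr * (1.0 - decay) ** (epoch // every)

# Learning rates at the start of training and after one and two decays.
schedule = [ae_learning_rate(e) for e in (0, 10, 20)]  # approximately [1.0, 0.6, 0.36]
```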

## 5 Experimental Results

### 5.1 Quantitative Analysis

Tables 1 and 2 show the performance of DialogWAE and the baselines on the two datasets. DialogWAE outperforms the baselines in the majority of the experiments. In terms of BLEU scores, DialogWAE (with a Gaussian mixture prior network) generates more relevant responses, with average recall of 42.0% and 37.2% on the two datasets. These are significantly higher than those of the CVAE baselines (29.9% and 26.5%). We observe a similar trend in the BOW embedding metrics.

DialogWAE generates more diverse responses than the baselines do. The inter-dist scores are significantly higher than those of the baseline models. This indicates that the sampled responses contain more distinct $n$-grams. DialogWAE does not show better intra-dist scores. We conjecture that this is due to the relatively long responses generated by DialogWAE, as shown in the last columns of both tables: a short response is highly unlikely to contain many repeated $n$-grams, so models that produce shorter responses tend to score higher on intra-dist.

We further investigate the effects of the number of prior components ($K$). Figure 2 shows the performance of DialogWAE-GMP with respect to the number of prior components $K$. We vary $K$ from 1 to 9. As shown in the results, in most cases, the performance increases with $K$ and decreases once $K$ reaches a certain threshold, for example, three. The optimal $K$ on both datasets was around 3. We attribute this degradation to the training difficulty of a mixture density network and the lack of appropriate regularization, which is left for future investigation.

### 5.2 Qualitative Analysis

Table 3 presents examples of responses generated by the models on the DailyDialog dataset. Due to space limitations, we report the results of CVAE-CO and DialogWAE-GMP, which are the representative models among the baselines and the proposed models. For each context in the test set, we show three samples of generated responses from each model. As expected, DialogWAE generates more coherent and diverse responses that cover multiple plausible aspects. Furthermore, we notice that the generated responses are long and exhibit informative content. By contrast, the responses generated by the baseline model exhibit relatively limited variations. Although the responses show some variation in content, most of them share a similar prefix such as “how much”.

We further investigate the interpretability of the Gaussian components in the prior network, that is, what each Gaussian model has captured before generation. We pick a dialogue context, “I’d like to invite you to dinner tonight, do you have time?”, which is also used by Shen et al. (2018) for analysis, and generate five responses for each Gaussian component.

As shown in Table 4, different Gaussian models generate different types of responses: component 1 expresses a strong will, while component 2 expresses some uncertainty, and component 3 generates strong negative responses. The overlap between components is marginal (around 1/5). The results indicate that the Gaussian mixture prior network can successfully capture the multimodal distribution of the responses.

To validate the previous results, we further conduct a human evaluation with Amazon Mechanical Turk. We randomly selected 50 dialogues from the test set of DailyDialog. For each dialogue context, we generated 10 responses from each of the four models. Responses for each context were inspected by 5 participants, who were asked to choose the model that performs best with regard to coherence, diversity and informativeness while being blind to the underlying algorithms. The average percentages with which each model was selected as the best on a specific criterion are shown in Table 5.

The proposed approach clearly outperforms the current state of the art, CVAE-CO and VHCR, by a large margin in terms of all three metrics. This improvement is especially clear when the Gaussian mixture prior was used.

## 6 Conclusion

In this paper, we introduced a new approach, named DialogWAE, for dialogue modeling. Unlike existing VAE models, which impose a simple prior distribution over the latent variables, DialogWAE draws prior and posterior samples of latent variables by transforming context-dependent Gaussian noise using neural networks, and minimizes the Wasserstein distance between the prior and posterior distributions. Furthermore, we enhance the model with a Gaussian mixture prior network to enrich the latent space. Experiments on two widely used datasets show that our model outperforms state-of-the-art VAE models and generates more coherent, informative and diverse responses.

### Acknowledgments

This work was supported by the Creative Industrial Technology Development Program (10053249) funded by the Ministry of Trade, Industry and Energy (MOTIE, Korea).