Tensorflow Implementation of Improving Variational Encoder-Decoders in Dialogue Generation
Variational encoder-decoders (VEDs) have shown promising results in dialogue generation. However, the latent variable distributions are usually approximated by a much simpler model than the powerful RNN structure used for encoding and decoding, yielding the KL-vanishing problem and inconsistent training objective. In this paper, we separate the training step into two phases: The first phase learns to autoencode discrete texts into continuous embeddings, from which the second phase learns to generalize latent representations by reconstructing the encoded embedding. In this case, latent variables are sampled by transforming Gaussian noise through multi-layer perceptrons and are trained with a separate VED model, which has the potential of realizing a much more flexible distribution. We compare our model with current popular models and the experiment demonstrates substantial improvement in both metric-based and human evaluations.
Deep latent variable models are a popular way to learn such representations in a generative setting. Latent representations and generators can be jointly trained in an unsupervised way. By learning the probability of synthesizing real data from intermediate latent variables, they are expected to uncover and disentangle the causal factors that are most important to explain the data. The exact log-likelihood normally requires an integral over a high-dimensional space and cannot be expressed analytically. Current approaches solve this intractability problem by imposing a recognition network to approximate the real posterior probability. Variational autoencoders (VAEs) [Kingma and Welling2014, Rezende, Mohamed, and Wierstra2014] follow this approach, and their conditional variant (CVAE) has been combined with RNN encoder-decoders in dialogue generation, in the hope that the CVAE's advantage in learning global representations will complement the RNN's power at modeling local dependencies. However, this simple combination runs into the KL-vanishing problem: the RNN part ends up explaining all the structure without making use of the latent representation. The reason is that the RNN is a universal approximator with much more flexibility than the simple Gaussian-distributed latent variables, so the model lacks enough motivation to utilize them.
Current approaches normally address this problem by weakening the RNN decoder to match the simpler latent variable distribution, which essentially sacrifices the generating capacity for better representation learning and is inappropriate when our main goal is to learn a generative model. In this paper, on the contrary, we take advantage of the universality of RNNs to help realize a more flexible latent variable distribution. By this means, we can not only add motivation for utilizing latent variables, but also strengthen the expressiveness of the generating model. Specifically, we split the whole structure into a CVAE module and an autoencoder (AE) module. The CVAE module learns to generate latent variables while the AE module builds the connection between them and real dialogue utterances. The outputs of the CVAE serve as input latent variables for the AE module, which is potentially much more flexible than restricting the latent variables to follow a fixed distribution. As the RNN encoder-decoders in the AE module are universal approximators, they are adjusted to extract continuous vectors from the dialogue data that can be more easily modelled by the CVAE module. Combined with a scheduled sampling trick, this structure can significantly improve the generating performance. We show this structure can be compared to an adversarial encoder-decoder which substitutes the GAN step with a VAE alternative. Though theoretically less accurate, our framework is preferred to AED as the training process of VAE is much more reliable than GAN in seq2seq tasks and the universality of RNN ensures this inaccuracy can be controlled within an acceptable range.
In this section, we review the VAE and VHRED structure, then analyze where the training difficulty comes from when applied in dialogue generation and how current approaches try to solve this problem.
The variational autoencoder (VAE) [Kingma and Welling2014, Rezende, Mohamed, and Wierstra2014] is a popular generative model. Its generating process is as follows: data $x$ is generated by the generative distribution $p(x|z)$, and $z$ is sampled from the prior distribution $p(z)$. Instead of calculating the exact log-likelihood, it can be efficiently trained by optimizing a valid lower bound [Jordan et al.1999]. The objective takes the following form:

$$\mathcal{L}(x) = \mathbb{E}_{q(z|x)}\left[\log p(x|z)\right] - \mathrm{KL}\left(q(z|x)\,\|\,p(z)\right) = \log p(x) - \mathrm{KL}\left(q(z|x)\,\|\,p(z|x)\right) \le \log p(x) \qquad (1)$$

$p(z|x)$ is the real posterior distribution of $z$ given the prior distribution $p(z)$ and the likelihood $p(x|z)$. The objective therefore maximizes the likelihood $\log p(x)$ while minimizing the mismatch between the approximated posterior $q(z|x)$, which is parametrized by neural networks, and the real posterior $p(z|x)$. When the gap is large, the objective becomes inconsistent and the generating process cannot recover the real data distribution even at the global optimum.
The whole process can be conditioned on an additional context $c$, which leads to the conditional VAE [Sohn, Lee, and Yan2015] (CVAE): the output $x$ is generated from the distribution $p(x|z,c)$, and the latent variable $z$ is drawn from the prior distribution $p(z|c)$. The variational lower bound of the CVAE is written as follows:

$$\mathcal{L}(x, c) = \mathbb{E}_{q(z|x,c)}\left[\log p(x|z,c)\right] - \mathrm{KL}\left(q(z|x,c)\,\|\,p(z|c)\right) \le \log p(x|c) \qquad (2)$$

In particular, when both the context $c$ and the output $x$ are sequential data, the CVAE can also be treated as a seq2seq model [Sutskever, Vinyals, and Le2014].
The variational hierarchical recurrent encoder-decoder (VHRED) [Serban et al.2017b] is a CVAE with hierarchical RNN encoders, where the first-layer RNN encodes token-level variations and the second-layer RNN captures sentence-level topic shifts. In this case, $c$ in Equation 2 stands for the dialogue history, $x$ is the response to be decoded and $z$ is the latent variable reflecting the high-level representation of $x$. The distributions $p(z|c)$ and $q(z|x,c)$ are usually set as simple Gaussian distributions with diagonal covariance matrices.
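When both the prior and the approximate posterior are diagonal Gaussians, as described above, the KL term of Equation 2 has a closed form. The following is a minimal sketch of that computation in plain Python; the function names and the log-variance parameterisation are assumptions made for illustration and are not taken from the authors' implementation.

```python
import math


def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between two diagonal Gaussians, summed over dimensions.

    All arguments are equal-length lists of floats. Variances are passed as
    log-variances (an assumed, commonly used parameterisation).
    """
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        var_ratio = math.exp(lq - lp)              # sigma_q^2 / sigma_p^2
        mean_term = (mq - mp) ** 2 / math.exp(lp)  # (mu_q - mu_p)^2 / sigma_p^2
        total += 0.5 * (lp - lq + var_ratio + mean_term - 1.0)
    return total


def cvae_lower_bound(recon_log_lik, mu_q, logvar_q, mu_p, logvar_p):
    """Single-sample estimate of the bound: reconstruction term minus KL."""
    return recon_log_lik - kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p)
```

If the posterior collapses onto the prior, `kl_diag_gaussians` returns zero and the bound reduces to the reconstruction term alone, which is the KL-vanishing situation discussed in the text.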
In VHRED, straightforwardly optimizing Equation 2 suffers from the KL-vanishing problem because the RNN decoder is a universal function approximator and tends to represent the distribution $p(x|c)$ without referring to the latent variable. At the beginning of the training process, when the approximate posterior $q(z|x,c)$ carries little useful information, it is natural for the model to blindly set $q(z|x,c)$ closer to the Gaussian prior $p(z|c)$ so that the extra cost from the KL divergence can be avoided [Chen et al.2017].
To better analyze where the optimization difficulty comes from, we can rewrite Equation 2 as follows:

$$\mathcal{L}(x, c) = \log p(x|c) - \mathrm{KL}\left(q(z|x,c)\,\|\,p(z|x,c)\right) \qquad (3)$$

Let’s first take a look at the first item, $\log p(x|c)$, where $p(x|c) = \int p(x|z,c)\,p(z|c)\,dz$. When the family of $p(x|z,c)$ is complex enough to include the real distribution of $x$ given $c$, this item can reach its optimum without any reliance on $z$. However, reliance on $z$ gives the model a chance to take advantage of the distribution $p(z|c)$ and reduces the complexity required of the distribution family of $p(x|z,c)$. For example, suppose both the real distribution of $x$ and the prior $p(z|c)$ are Gaussian. Modeling $x$ accurately without reliance on $z$ requires the family of $p(x|z,c)$ to include the Gaussian distribution, while by means of a linear mapping between $z$ and $x$, $p(x|z,c)$ can describe the real distribution with only linear complexity. When the Gaussian distribution is not covered by the family of $p(x|z,c)$, the model has to exploit the relation between $x$ and $z$ to model the real distribution. Likewise, in dialogue generation, although the RNN decoder can in theory approximate an arbitrary function, perfectly fitting the real dialogue distribution is still difficult because of the optimization challenge, the training corpus size and approximation errors. Therefore, to achieve the global optimum, we believe this first item will always prefer utilizing the latent variables, so long as the decoder is not perfect. The weaker the decoder family is, the more it will be biased towards utilizing latent variables. A more flexible prior distribution will also increase this tendency, as it provides more possibilities for utilization.

The second item is the KL divergence, whose minimum value is 0 if and only if $q(z|x,c) = p(z|x,c)$. According to Bayes' theorem, we can express $p(z|x,c)$ as:

$$p(z|x,c) = \frac{p(x|z,c)\,p(z|c)}{p(x|c)} \qquad (4)$$

If the latent variable $z$ is ignored, $p(x|z,c)$ and $p(x|c)$ cancel out, so setting $q(z|x,c) = p(z|c)$ easily reaches the global optimum 0. Otherwise, when $q(z|x,c)$ is parametrised as a mean-field Gaussian distribution as in VHRED, the real posterior can hardly fall into the same distribution family. Firstly, the independence relation cannot be satisfied. To make the dimensions of $z$ independent of each other, the likelihood $p(x|z,c)$ must exactly disentangle the effect of every dimension, which is unrealistic when $p(x|z,c)$ is a categorical distribution modelled by the RNN softmax. Secondly, the real posterior distribution can hardly still follow a Gaussian distribution when the likelihood is based on discrete sequential data. Normally the training process will adjust $p(x|z,c)$ to make the real posterior easier to model with $q(z|x,c)$ [Hinton et al.1995]. However, when $x$ represents sentences of variable length, the value of $p(x|z,c)$ shrinks rapidly as the length grows, which makes this adjustment much more difficult. This implies the second item will always prefer ignoring the latent variables, so long as the approximated posterior is not powerful enough to perfectly match the real posterior. The weaker the approximating posterior distribution family is, the more it will be biased towards ignoring latent variables.
Overall, the objective function of variational encoder-decoders in dialogue generation is essentially a competition between these two items, which are biased towards utilizing and ignoring latent variables respectively. The KL divergence vanishes at the global optimum because the second term gains more from ignoring the latent variables than the first term gains from utilizing them.
If we use the ELBO objective, as explained, there are two directions to prevent the KL-vanishing problem: increasing the advantage of utilizing latent variables in the likelihood term $\log p(x|c)$, or weakening the advantage of abandoning latent variables in the KL term $\mathrm{KL}\left(q(z|x,c)\,\|\,p(z|x,c)\right)$.
For the former direction, we need to use a smaller distribution family to model the decoder $p(x|z,c)$. When the decoder is weaker, ignoring the latent variables leaves it farther from the real distribution at the global optimum, which encourages the latent variables to be exploited. Word drop-out [Bowman et al.2016] is a common method to weaken the RNN decoder. At each time step, the input word has a certain chance (the drop-out rate) of being replaced by another word, so the RNN decoder cannot store a continuous history context. In [Xie et al.2017], word drop-out is also explained as a special kind of smoothing. Similarly, for CNN decoders, limiting their power also encodes more information into the latent variables [Yang et al.2017, Chen et al.2017]. The bag-of-words loss proposed by [Zhao, Zhao, and Eskenazi2017] also falls into this category. It imposes an extra loss which forces the latent variable to predict the whole sentence without word inputs, which essentially increases the weight of the reconstruction loss with the drop-out rate set to 1.
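Word drop-out as described above can be sketched in a few lines. The version below is illustrative only: the function name, the `<unk>` placeholder used as the replacement word, and the injectable random generator are assumptions, not details taken from the paper's code.

```python
import random


def word_dropout(tokens, drop_rate, unk_token="<unk>", rng=None):
    """Replace each decoder input token with a placeholder with probability drop_rate.

    Using a single placeholder token as the replacement is an assumption made
    for this sketch. A drop_rate of 1.0 removes all word inputs, which is the
    setting the bag-of-words loss is compared to in the text.
    """
    if rng is None:
        rng = random.Random()
    return [unk_token if rng.random() < drop_rate else tok for tok in tokens]


# Example: the 25% rate used in the experiments, applied only at training time.
noisy_inputs = word_dropout(["what", "for", "?"], 0.25, rng=random.Random(0))
```

Only the decoder inputs are corrupted; the prediction targets stay unchanged, so the decoder is pushed to rely on the latent variable for the missing history.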
For the latter direction, we need to use a more flexible prior or posterior distribution for latent variables. Once the approximated posterior distribution is powerful enough, the KL divergence can be close to zero without losing the dependence on latent variables. [Serban et al.2017a] applies a piecewise distribution to replace the Gaussian prior distribution. Though it can represent multi-modal conditions, it is still limited as a fixed distribution with a pre-defined number of modes. [Salimans, Kingma, and Welling2015] samples latent variables through Markov chains, but this imposes an extra approximation and the objective becomes less accurate. [Rezende and Mohamed2015, Kingma et al.2016, Chen et al.2017] use a normalizing flow: latent variables are first sampled from a simple distribution and then passed through several invertible transformations to gain flexibility. Normalizing flows are computationally more costly and have not been applied in text generation yet.
We can also change the original ELBO objective for easier optimization. KL-annealing [Bowman et al.2016] and free bits [Kingma et al.2016] are two popular strategies. In KL-annealing, a weight is applied to the KL divergence term in Equation 2, which starts from zero and gradually increases to 1. This prevents the model from zeroing out the KL divergence at the early training stage. Once the KL divergence vanishes, it is difficult to recover because of the short-sighted nature of gradient descent. Free bits reserve some space of KL divergence for every dimension of the latent variables, so that the KL divergence is only optimized when it exceeds the predefined quota. A similar idea can be found in [Yang et al.2017], which reserves space for the total KL divergence instead of for every dimension.
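The two strategies can be illustrated with a short sketch. The text only states that the annealing weight starts at zero and rises to 1, so the linear ramp below is an assumption, as are the function names; the free-bits functions follow the per-dimension and whole-space variants described above.

```python
def kl_annealing_weight(step, warmup_steps):
    """Weight on the KL term: 0 at step 0, rising linearly (assumed) to 1."""
    return min(1.0, step / float(warmup_steps))


def free_bits_kl(kl_per_dim, quota_per_dim):
    """Per-dimension free bits: a dimension only contributes beyond its quota."""
    return sum(max(kl, quota_per_dim) for kl in kl_per_dim)


def free_bits_kl_total(kl_per_dim, quota_total):
    """Variant that reserves space for the total KL rather than per dimension."""
    return max(sum(kl_per_dim), quota_total)


def annealed_loss(recon_nll, kl, step, warmup_steps):
    """Negative lower bound with the annealed KL weight applied."""
    return recon_nll + kl_annealing_weight(step, warmup_steps) * kl
```

With `max(kl, quota)`, the gradient with respect to a dimension is zero while its KL stays below the quota, so the optimizer has no incentive to push that dimension further towards the prior.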
As discussed above, the two ways of alleviating the optimization challenge are weakening the RNN decoders and improving the flexibility of the latent variable distributions. The latter is more fundamental since it also brings more expressiveness to the generating model. Weakening the decoders, though it attenuates the KL-vanishing problem, will inevitably hurt the overall performance.
An ideal way of representing the latent variable distribution is to use a universal approximator such as a neural network. [Makhzani et al.2016] proposed the adversarial autoencoder (AAE), which samples posterior latent variables by transforming Gaussian noise through multi-layer perceptrons. The flexibility of neural networks ensures it can fit an arbitrary distribution. However, the probability density is intractable, so adversarial learning [Goodfellow et al.2014] must be used to replace the original KL divergence term.
We can apply this idea to dialogue generation, where the AAE is changed to a context-dependent adversarial encoder-decoder (AED). The training objective can be represented as:

$$\mathcal{L}_{\mathrm{AED}}(x, c) = \mathbb{E}_{q(z|x,c)}\left[\log p(x|z,c)\right] - D\left(q(z|c)\,\|\,p(z|c)\right) \qquad (5)$$

where $D$ denotes the divergence estimated by the GAN discriminator. The training alternates between the autoencoder (AE) phase, which optimizes the reconstruction term $\mathbb{E}_{q(z|x,c)}\left[\log p(x|z,c)\right]$, and the GAN phase, which matches the aggregated posterior $q(z|c)$ and the prior $p(z|c)$. The prior and the posterior are implicitly defined by passing context-dependent Gaussian random variables through multi-layer perceptrons. The graphical model is depicted in Figure 2. It can be shown that this objective differs from the original ELBO by adding an extra entropy penalty and by using the Jensen-Shannon divergence in lieu of the KL divergence. In the non-parametric limit, its generating model can recover the exact data distribution.
The idea of AED sounds appealing, but GANs are notoriously difficult to train. Especially when both the prior and the posterior need to be updated towards each other, the model becomes extremely sensitive to hyper-parameters and training is very unstable. Consequently, we try replacing the GAN phase with a CVAE alternative. An RNN encoder is first applied to extract the corresponding latent variable target for each dialogue turn, based on which a CVAE is trained to reconstruct this target from context-dependent Gaussian noise. The connection to AED can be seen in Figure 2. Specifically, we replace the GAN matching term in Equation 5 with a CVAE objective (Equation 6), which has the form of Equation 2 with the encoded target as the output to be reconstructed.
This objective relies on an approximated posterior. It can be shown that when the approximated posterior is powerful enough to cover the real posterior, Objective 6 has the same global optimum as the GAN term it replaces. We can therefore alternate between the AE phase and the CVAE phase to achieve the same effect as in AED.
The accuracy of the CVAE objective depends on how well the approximated posterior matches the real posterior. Therefore, in the AE phase, apart from encoding representative information to reduce the normal AE reconstruction loss, the RNN encoder should also encode utterances in a manner that makes the real posterior easier to model with the approximated posterior of the CVAE phase. To do this, we add a KL divergence constraint to the RNN encoder in the AE phase: the RNN encoder has to keep the KL divergence within a specific range. It is also possible to constrain the value of the whole CVAE objective of Equation 6, but we find that constraining only the KL divergence is enough when the alternating step is not too large. Note that in the AE phase the model can only adjust the RNN encoder-decoders to control the KL divergence; the generating parameters for the latent variables are fixed.
In the AE phase, we also find it useful to initially use the ground-truth encoding and then gradually change to the noisier CVAE output. We apply the scheduled sampling strategy proposed in [Bengio et al.2015]. Before decoding, a coin is flipped to decide whether to feed the real hidden vector or the noisy one. In the beginning, to make the task easy, we mostly pick the real vector. As training proceeds, we gradually raise the difficulty by increasing the chance of selecting the noisy vector, until finally all inputs are replaced with the noisy CVAE output. The chance of selecting the real vector follows a linear decay function (Equation 7) of the training step number, with a constant controlling the decay speed. Other decay functions, such as exponential decay or inverse sigmoid decay, are also applicable.
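The scheduled sampling step can be sketched as follows. The exact form of the linear decay is not recoverable from the text here, so `max(0, 1 - step / k)` is an assumed form that matches the description (the probability starts at 1 and reaches 0 after `k` steps); the names `keep_probability` and `choose_decoder_input` are hypothetical.

```python
import random


def keep_probability(step, k):
    """Chance of feeding the real encoding: assumed linear decay from 1 to 0.

    k is the constant controlling the decay speed; after k steps the
    real encoding is never used.
    """
    return max(0.0, 1.0 - step / float(k))


def choose_decoder_input(real_encoding, noisy_encoding, step, k, rng=None):
    """Flip a biased coin to pick the real encoding or the noisy CVAE output."""
    if rng is None:
        rng = random.Random()
    if rng.random() < keep_probability(step, k):
        return real_encoding
    return noisy_encoding
```

Early in training the decoder therefore sees clean targets from the RNN encoder, and it is exposed to the CVAE's samples only gradually, which avoids training it on uninformative noise at the start.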
Our model contains a CVAE phase and an AE phase. These two phases are trained iteratively until an equilibrium is achieved.
In the CVAE phase, a sample is obtained from the AE by transforming dialogue texts into a continuous embedding, which is used as the target for the maximum likelihood training of the CVAE. The loss function is the CVAE objective of Equation 6 evaluated on this target. The transform function is the RNN encoder, which is fixed as part of the AE module during this phase.
In the AE phase, an observation $x$ is sampled from the training data and fed into the transform function to get a continuous vector representation. The corresponding latent variable is sampled from the posterior distribution provided by the CVAE part. The sampled latent variable, together with the continuous vector representation, forms a target for training the AE. The objective function combines a KL divergence term with the reconstruction loss.
The first item keeps the KL divergence within a reasonable range so that the transformed representation can be more easily modelled in the CVAE phase. A weight hyper-parameter adjusts the balance between the reconstruction loss and the KL divergence, where a lower value leads to a lower KL divergence in the end. The keeping rate used for scheduled sampling is the one defined in Equation 7. The detailed architecture is depicted in Figure 1. We refer to this framework as collaborative VED, where the AE and CVAE phases collaborate with each other to achieve a better generating performance.
In summary, we replace the GAN phase of AED with a CVAE alternative. The outputs of the CVAE part are latent variables, which can represent a much broader distribution family than a mean-field Gaussian. As the CVAE is in theory less accurate than the GAN because it needs to approximate the real posterior, we leverage the more powerful RNN encoder-decoders. In the AE phase, they should autoencode utterances in a way that makes the real posterior easily representable by the CVAE part.
We conduct our experiments on two dialogue datasets: Dailydialog [Li et al.2017] and Switchboard [Godfrey and Holliman]. Dailydialog contains 13118 daily conversations under ten different topics. This dataset is crawled from various websites for English learners to practise English in daily life. Statistics show that the average number of speaker turns is roughly 8 and the average number of tokens per utterance is about 15, which is appropriate for training dialogue models. Switchboard has 2400 two-sided telephone conversations under 70 specified topics with manually transcribed speech and alignment. Compared with Dailydialog, every dialogue has many more turns and the subjects are more dispersed. Both datasets are randomly separated into training/validation/test sets with a ratio of 10:1:1.
For comparison, we also implemented the HRED model (a seq2seq model with hierarchical RNN encoders), which is the basis of VHRED. Latent variable models are trained by standard KL-annealing with different weights [Bowman et al.2016, Higgins et al.2017], with an additional BOW loss [Zhao, Zhao, and Eskenazi2017, Semeniuta, Severyn, and Barth2017], with word drop-out [Bowman et al.2016], with free bits [Kingma et al.2016] and with our collaborative VED (CO) with the scheduled sampling trick (SS). For our framework, we use the encoder RNN as the transformation function. We tuned the parameters on the validation set and measured the performance on the test set. In all experiments, all letters are transformed to lower case, the vocabulary size is set to 20,000 and all OOV words are mapped to a special token <unk>. We set the word embedding size to 300 and initialized the embeddings with Word2Vec embeddings trained on the Google News Corpus. The first-layer encoder, second-layer encoder and decoder RNNs in the following experiments are single-layer GRUs with 512, 1024 and 512 hidden neurons respectively. The dimension of the latent variables is set to 512. The batch size is 128 and we fix the learning rate at 0.0002 for all models. Our framework is trained epoch-wise by alternately training the CVAE and AE parts. The probability estimators for the VAE are 2-layer feedforward neural networks. At test time, we output the most likely responses using beam search with the beam size set to 5 [Graves2012], and <unk> tokens are prevented from being generated. We implemented all the models with the open-source Python library Tensorflow [Abadi et al.2016] and optimized them using the Adam optimizer [Kingma and Ba2014]. Dialogues are cut into slices of 80 words each before being fed into GPU memory.
We compare our model with the basic HRED and several current approaches including KL-annealing (KLA), word drop-out (DO), free bits (FB) and bag-of-words loss (BOW). The details are summarized in Tables 1 and 2. For KLA, we initialize the weight to 0 and gradually increase it to 1 over the first 12000 or 25000 training steps for Dailydialog and Switchboard respectively. The word drop-out rate is fixed at 25%, and words are dropped out only during training. We set the reserved space for every dimension to 0.01 in free bits (FB) and also try reserving 5 bits for the whole dimension space (FB-all). We use a weight value of 5 for our collaborative model (CO) and set the scheduled sampling (SS) constant separately for each dataset (5000 for Switchboard). We also experiment with jointly training the AE and CVAE parts in our model and report the results.
Table 1 reports the perplexity (PPL), KL divergence (KL) and negative log-likelihood (NLL). NLL is averaged over all the 80-word slices within every batch. For latent variable models, NLL is computed from the ELBO, which gives an upper bound on the real NLL.
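As a reminder of how these quantities relate, the sketch below computes a perplexity from per-token NLL values and an ELBO-based NLL from its two components. The per-token normalisation and the function names are assumptions for illustration; the text does not spell out the exact normalisation used for Table 1.

```python
import math


def perplexity(token_nlls):
    """Perplexity as the exponential of the mean per-token NLL (in nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def elbo_nll(reconstruction_nll, kl):
    """NLL reported for latent variable models: negative ELBO.

    Since the ELBO lower-bounds the log-likelihood, this value is an
    upper bound on the true NLL.
    """
    return reconstruction_nll + kl
```

For example, a model that assigns probability 0.5 to every token has a per-token NLL of `log 2` and therefore a perplexity of 2.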
As can be seen, our model CO+SS achieves the lowest NLL on both datasets. The scheduled sampling (SS) strategy significantly helps bring down the NLL. Word drop-out (DO), though it weakens the RNN decoder, improves the performance when combined with both KLA and CO, which supports the assumption that DO can function as a smoothing technique in neural network language models [Xie et al.2017]. KLA by itself needs early stopping; otherwise the KL divergence vanishes once the weight increases to 1. BOW avoids the KL-vanishing problem, but the overall performance decreases significantly because adding an additional loss in theory leads to a biased result for latent variables. BOW information is encoded into the latent variable, but this prevents the decoder from stably learning word order patterns during training and thus sacrifices NLL performance. FB-all performs much better than FB, which suggests that the most important information is concentrated in a few dimensions, so equally reserving space for every dimension is not suitable. Finally, we also tested the necessity of iteratively training our model. Jointly training the model degrades both the perplexity and the KL divergence on the two datasets.
Figure 3 visualizes the latent variables drawn from VHRED and from our framework. We randomly pick a dialogue context “I’d like to invite you to dinner tonight , do you have time ?” and apply an information-retrieval-based method to gather 10 responses with similar contexts from the corpus. All 10 responses are verified by humans as appropriate, and they span different possibilities such as “Thank you for your invitation. ”, “Don’t be silly . Let’s go Dutch .” and “Are you asking me for a date ? ”. For each response, 10 samples are drawn from the posterior latent variable distribution, which gives 100 posterior latent variable samples (blue dots) in total. Likewise, 100 samples are drawn from the prior latent variable distribution (dots) given only the dialogue context. The visualization clearly indicates the superiority of our framework in modelling more flexible prior and posterior latent variable distributions. In the VHRED model, both the prior and posterior distributions are limited to uni-modal Gaussians with only a little overlap. In our framework, the distributions are more diverse and samples from the prior and posterior distributions overlap more with each other.
Unlike the NLL, which measures the token-level match, embedding-based metrics map responses to a vector space and compute the cosine similarity with the golden answers, which can to a large extent measure sentence-level semantic similarity.
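A minimal sketch of one embedding-based score is given below. It uses embedding averaging, which is an assumption about one of the metrics rather than a statement of the exact variants used in the evaluation; the function names, the toy embedding table and the choice to return 0.0 when a sentence has no known words are likewise illustrative.

```python
import math


def average_embedding(tokens, embeddings):
    """Mean of the word vectors of the tokens found in the embedding table."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def embedding_average_score(response, gold, embeddings):
    """Sentence-level similarity between a generated response and the gold one."""
    r = average_embedding(response, embeddings)
    g = average_embedding(gold, embeddings)
    if r is None or g is None:
        return 0.0  # assumed convention when no token is in the vocabulary
    return cosine_similarity(r, g)


# Toy 2-dimensional embedding table, for illustration only.
toy_embeddings = {"price": [1.0, 0.0], "cost": [1.0, 0.0], "doctor": [0.0, 1.0]}
```

Because word order is discarded by the averaging step, a response can score well on such a metric while still being disfluent, which is consistent with the behaviour reported for the BOW model.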
We can see that our model still achieves the highest topic similarity according to all three metrics. This suggests our model improves both token-level coherence and sentence-level topic match. BOW, though not good on the NLL metric, performs remarkably well here, which implies that BOW helps the decoder generate the correct high-level meaning but fails to turn that meaning into a fluent sentence. In contrast, FB has a relatively lower on-topic similarity score compared with its performance on the token-level likelihood.
Table 3: Examples of generated responses.

Context: The taxi drivers are on strike again . __eou__ What for ? __eou__ They want the government to reduce the price of the gasoline . __eou__
KLA: all right . i’ll take it .
KLA+BOW: well , how much is the price of the price price ?
CO+SS: that’s good . i want to buy a new car .
Gold: It is really a hot potato .

Context: I’d like you to do me a favor . __eou__ What is it ? __eou__ Can you mail this at the post office for me ? I have to go to the doctor for an appointment and I don’t have enough time to go to the post office . __eou__
KLA: i’m sorry , but i’m not sure . i don’t want to have a look at the moment .
KLA+BOW: oh , i’m sorry , but it’s a pleasure to be in my room and again-b .
CO+SS: ok , i will . do you have any other questions about the company ?
Gold: Sure . And I’d like you to get some cold medicine for me when you go to the doctor’s .

Context: I heard you’ve found a new job . __eou__ Yes , I’m now working in a company . __eou__ So you’re quite happy with your new job ? __eou__
KLA: to be honest , but i’m not familiar with my friends .
KLA+BOW: i’d like to . but i would like my mind .
CO+SS: not really , but how about you , sue ?
Gold: Right . I enjoy what I’m doing .
The accurate evaluation of dialogue systems is an open problem. To validate the previous metric-based results, we further conduct a human evaluation on several models. We randomly sample 100 contexts from the test corpus and apply 6 different models to generate the best response with beam search. The evaluation is conducted only on the Dailydialog corpus since it is closer to everyday conversation and easier for humans to judge. All the generated responses, together with the dialogue context, are then randomly shuffled and judged on the crowdsourcing website CrowdFlower. People are asked to judge the plausibility of the generated response by giving a binary score on three aspects: grammaticality, coherence with the dialogue context and diversity (ensuring the response is not a dull sentence). In total, 54 people took part in evaluating the 600 responses; each response is judged by 3 different people and the score agreed on by the majority is adopted. Each person can judge at most 50 responses, and judges are filtered by manually set test questions.
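The aggregation step described above, three binary judgements per response with the majority score adopted, can be sketched as follows. The data layout (a dictionary from model name to a list of judgement triples) and the function names are assumptions made for this illustration.

```python
def majority_vote(judgements):
    """Return 1 if more than half of the binary judgements are 1, else 0."""
    return 1 if sum(judgements) * 2 > len(judgements) else 0


def model_scores(judgements_by_model):
    """Fraction of responses per model whose majority judgement is positive.

    judgements_by_model maps a model name to a list of per-response
    judgement lists, e.g. {"CO+SS": [[1, 1, 0], [0, 0, 1]]}.
    """
    scores = {}
    for model, responses in judgements_by_model.items():
        votes = [majority_vote(j) for j in responses]
        scores[model] = sum(votes) / float(len(votes))
    return scores
```

With an odd number of judges per response, ties cannot occur, so no tie-breaking rule is needed.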
The results show that our model generates highly fluent sentences compared to the other approaches. KLA+BOW, as expected, receives the lowest score on fluency. Our model also achieves relatively good scores on coherence and diversity, implying that it can generate novel responses related to the conversation topic. However, we notice that the human evaluation is rather subjective and not reliable enough. If a sentence is not fluent, humans tend to reject it even though the topic might be coherent and the content might be diverse. It is difficult to give an objective score separately for each of the three aspects. Models with lower scores on fluency, such as KLA+BOW and FB-all, normally also receive lower scores on the other two aspects. Therefore, we consider this evaluation only as a complement to the metric-based results, indicating that humans agree with the generations of our model more than with those of the others.
Table 3 shows example generated responses. We can see that our improved collaborative VED model with scheduled sampling more accurately identifies the topic and generates more coherent responses. Standard KL-annealing tends to generate smooth sentences that are irrelevant to the context. Imposing an additional BOW loss increases the probability of correctly capturing the main topic, but the generated responses are sometimes grammatically wrong, as the metric-based results also showed. In the first example, the context is about the taxi drivers’ request to reduce the gasoline price; the response from KLA is a fluent natural sentence but not closely related to the context. KLA+BOW starts with a reasonable beginning but ends with a disfluent continuation. Though disfluent, the KLA+BOW response does capture the main topic of price, indicating that the model can successfully predict the order-insensitive bag of words but fails to form a natural sentence. In contrast, the response of our model is not only a fluent sentence but also close to the topic. More importantly, it brings in some new information, “i want to buy a new car”, which is helpful for an interactive conversation. Similar patterns can be seen in the other two examples.
Variational encoder-decoders and recurrent neural networks are powerful in representation learning and natural language processing respectively. Though quite a few recent works have started to apply them to dialogue generation, the training process is still unstable and the performance is hard to guarantee. In this work, we thoroughly analyze the reasons for the training difficulty and compare different current approaches, then propose a new framework that allows these two structures to be combined effectively in dialogue generation. We split the whole structure into two parts to obtain more flexible prior and posterior latent variable distributions. The training process is simple, efficient and scales well to large datasets.
We demonstrate the superiority of our model over other popular methods on two dialogue corpora. Experiments show that our model samples latent variables from more flexible distributions without sacrificing the recurrent neural network’s capability of synthesizing coherent sentences. Without loss of generality, our model should be applicable to any seq2seq task, which we leave for future work.
Xiaoyu Shen is supported by an IMPRS-CS fellowship. The work is partially funded by the DFG collaborative research center SFB 1102 and the National Natural Science Foundation of China under Grant No. 61602451.