VariationalSeq2Seq
A pytorch implementation of "Latent Variable Dialogue Models and their Diversity"
We present a dialogue generation model that directly captures the variability in possible responses to a given input, which reduces the ‘boring output’ issue of deterministic dialogue models. Experiments show that our model generates more diverse outputs than baseline models, and also generates more consistently acceptable output than sampling from a deterministic encoder-decoder model.
The task of open-domain dialogue generation is an area of active development, with neural sequence-to-sequence models dominating the recently published literature (Shang et al., 2015; Vinyals and Le, 2015; Li et al., 2016b, a; Serban et al., 2016). Most previously published models train to minimise the negative log-likelihood of the training data, and then at generation time either perform beam search to find the output which maximises the likelihood of the response given the input (Shang et al., 2015; Vinyals and Le, 2015; Serban et al., 2016) (ML decoding), or sample from the resulting distribution (Serban et al., 2016).
A notorious issue with ML decoding is that this tends to generate short, boring responses to a wide range of inputs, such as “I don’t know”. These responses are common in the training data, and can be replies to a wide range of inputs (Li et al., 2016a; Serban et al., 2016). In addition, shorter responses typically have higher likelihoods, and so wide beam sizes often result in very short responses (Tu et al., 2017; Belz, 2007). To resolve this problem, Li et al. (2016a) propose instead using maximum mutual information with a length boost as a decoding objective, and report more interesting generated responses.
Further, natural dialogue is not deterministic; for example, the replies to “What’s your name and where do you come from?” will vary from person to person. Li et al. (2016b) have proposed learning representations of personas to account for inter-person variation, but there can be variation even among a single person’s responses to certain questions.
Recently, Serban et al. (2017) have introduced latent variables to the dialogue modelling framework, to model the underlying distribution over possible responses directly. These models have the benefit that, at generation time, we can sample a response from the distribution by first sampling an assignment of the latent variables, and then decoding deterministically. In this way, we introduce stochasticity without resorting to sampling from the decoder, which can lead to incoherent output – see Table 1 for examples.
In this paper, we present a latent variable model for one-shot dialogue response, and investigate what kinds of diversity the latent variables capture. Our experiments show that our model has higher lexical as well as sentential diversity than baseline models. We also show that our model generates more acceptable diverse output than sampling from a deterministic decoder. We end by noting that all three methods proposed above to combat the ‘maximum likelihood response’ are ways of decreasing the probability of the generated output, and report some preliminary results for how response probability interacts with grammaticality and interestingness.
Prompt | Dial-LV | Sampling |
---|---|---|
the infection’s spreading. | he’s dead. | he was lee . |
 | nothing but no more. | leave it a way. |
 | i know what you’re doing | there’s something going on in the phone rickshaw and put the afloat continuously… |
we only have fourteen hours to save the earth! | i’ve got no choice . | can i win? |
 | and then there’s something that’s what it’s like. | go peacekeeper go. |
 | well, we need the help of what’s happening. | go to say hello again in this conversation. |
why don’t i do some research and send you an email? | well, we’re gonna have to make it. | no disrespect team. |
 | because i know how much it is. | excuse me. |
 | because it ’s not right. | because i’m hotel. |
Our task is to model the true probability of a response $Y$ given an input $X$. We denote our model distribution by $P(Y \mid X)$. We introduce a latent variable $z$ with a standard Gaussian prior, i.e. $P(z) = \mathcal{N}(0, I)$, and factor $P(Y \mid X)$ as:

$$P(Y \mid X) = \int P(Y \mid z, X)\, P(z)\, dz \qquad (1)$$

To motivate this model, we point out that existing encoder-decoder models encode an input $X$ as a single fixed representation. Hence, all of the possible replies to $X$ must be stored within the decoder’s probability distribution $P(Y \mid X)$, and during decoding it is hard to disentangle these possible replies.

However, our model contains a stochastic component $z$ in the decoder $P(Y \mid z, X)$, and so by sampling different $z$ and then performing ML decoding on $P(Y \mid z, X)$, we hope to tease apart the replies stored in the probability distribution $P(Y \mid X)$, without resorting to sampling from the decoder. This has the benefit that we use the decoder at generation time in a similar way to how we train it, making it more likely that the output of our model is grammatical and coherent. Further, as we do not marginalize out $z$ when decoding, we no longer perform exact maximum likelihood search for a reply $Y$, and so we hope to avoid the boring reply problem.
At training time, we follow the variational autoencoder framework (Kingma and Welling, 2014; Kingma et al., 2014; Sohn et al., 2015; Miao et al., 2016), and approximate the posterior $P(z \mid X, Y)$ with a proposal distribution $Q(z \mid X, Y)$, which in our case is a diagonal Gaussian whose parameters depend on $X$ and $Y$. We thus have the following evidence lower bound (ELBO) for the log-likelihood of the data:

$$\log P(Y \mid X) \geq -\mathrm{KL}\big(Q(z \mid X, Y) \,\|\, P(z)\big) + \mathbb{E}_{z \sim Q}\big[\log P(Y \mid z, X)\big] \qquad (2)$$
Note that this loss decomposes into two parts: the KL divergence between the approximate posterior and the prior, and the cross-entropy loss between the model distribution and the data distribution. If the model can encode useful information into $z$, then the KL divergence term will be non-zero (Bowman et al., 2016). As our model decoder is given a deterministic representation of $X$ already, $z$ will then encode information about the variation in replies to $X$.
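For a diagonal Gaussian proposal and a standard Gaussian prior, the KL term has a well-known closed form. The sketch below is illustrative only: the function names, the use of a log-variance parametrisation, and the optional `kl_weight` argument are assumptions made for this example, not details taken from the paper.

```python
import math


def kl_diag_gaussian_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), in nats.

    Closed form: 0.5 * sum(sigma^2 + mu^2 - 1 - log sigma^2).
    """
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv
        for m, lv in zip(mu, log_var)
    )


def negative_elbo(reconstruction_nll, mu, log_var, kl_weight=1.0):
    """Training loss: reconstruction NLL plus the (optionally weighted) KL term.

    kl_weight is a hypothetical knob for this sketch; with kl_weight=1.0
    the result is exactly the negative of the ELBO.
    """
    kl = kl_diag_gaussian_to_standard_normal(mu, log_var)
    return reconstruction_nll + kl_weight * kl
```

When the proposal equals the prior (zero mean, unit variance) the KL term is exactly zero, which is the degenerate case in which the latent variable carries no information about the reply.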
Given an input sentence $X$ and a response $Y$, we run two separate bidirectional RNNs over their word embeddings. We concatenate the final states of each and pass them through a single nonlinear layer to obtain our representations $h_x$ and $h_y$ of $X$ and $Y$. We use GRUs (Cho et al., 2014) as our RNN cell as a compromise between expressive power and computational cost.
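We calculate the mean and variance of $Q(z \mid X, Y)$ as:

$$\mu = W_\mu [h_x, h_y] + b_\mu, \qquad \log \Sigma = \mathrm{diag}\big(W_\Sigma [h_x, h_y] + b_\Sigma\big) \qquad (3)$$

where $[a, b]$ denotes the concatenation of $a$ and $b$, and $\mathrm{diag}(v)$ denotes inserting $v$ along the diagonal of a matrix.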
We take a single sample $z$ from $Q(z \mid X, Y)$ using the reparametrization trick (Kingma and Welling, 2014), concatenate $h_x$ and $z$, and initialize the hidden state of the decoder GRU with $[h_x, z]$. We then train the decoder GRU to minimize the negative log-likelihood of the response $Y$.
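The sampling step can be illustrated with a small sketch. It assumes the proposal is parametrised by a mean vector and a log-variance vector; the function names are hypothetical, and plain Python lists stand in for the tensors a real implementation would use.

```python
import math
import random


def sample_latent(mu, log_var, rng=random):
    """Reparametrised sample: z = mu + sigma * eps, with eps ~ N(0, I).

    The noise eps is drawn independently of mu and log_var, so in an
    autodiff framework gradients can flow through both parameters.
    """
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mu, log_var)
    ]


def initial_decoder_state(input_repr, z):
    """Concatenate the input representation with the latent sample."""
    return list(input_repr) + list(z)
```

With the sizes reported in the supplementary material (a 512-dimensional encoder state and 64 latent dimensions), the concatenated vector has 576 entries, matching the stated decoder hidden state size.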
While training this model, we noted the same difficulties as Bowman et al. (2016): as RNNs are powerful density estimators, the model will prefer to ignore the latent variables and instead optimize the data reconstruction term of the ELBO, while forcing the KL term to 0. We overcome this using similar techniques, gradually annealing the KL term weight over the course of model training and using word dropout in the decoder with a drop rate of .

We compare our model, Dial-LV, to three baselines. The first is an encoder-decoder dialogue model with ML decoding (Dial-MLE). The second baseline model implements the anti-LM decoder of Li et al. (2016a) (Dial-MMI) on top of the encoder-decoder, with no length normalization. For these models, we use beam search with a width of 2 to find the sentence which maximises the decoding objective (either ML or MMI).
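The two training techniques mentioned earlier, KL weight annealing and decoder word dropout, can be sketched as follows. This is a minimal illustration under stated assumptions: the linear schedule, the `warmup_steps` parameter, and the `<unk>` replacement token are choices made for this example, since the text does not specify the schedule shape or the drop rate.

```python
import random


def kl_weight(step, warmup_steps):
    """Linear KL annealing: 0 at step 0, rising to 1 at warmup_steps.

    The linear shape is an assumption of this sketch; other monotone
    schedules (e.g. sigmoid) serve the same purpose.
    """
    if warmup_steps <= 0:
        return 1.0
    return min(1.0, step / warmup_steps)


def word_dropout(tokens, drop_rate, unk_token="<unk>", rng=random):
    """Replace each decoder input token with unk_token with probability drop_rate.

    Weakening the decoder's view of the previous words pushes the model
    to rely on the latent variable instead.
    """
    return [unk_token if rng.random() < drop_rate else tok for tok in tokens]
```

The annealed weight would multiply the KL term of the training loss, so that early in training the model is free to use the latent variable before the prior starts to pull the posterior back.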
The final baseline uses the encoder-decoder model, but instead samples from the decoder to find $Y$ (Dial-Samp). We found that naively sampling from the decoder resulted in meaningless jumbles of words. To solve this, we introduced a temperature parameter $\tau$, which scales the probability of each word of the decoder as . This parameter serves to sharpen the word distribution of the decoder. We found to be a reasonable balance between preserving stochasticity and improving the coherence of the generated output.
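A common way to sharpen a word distribution with a temperature is to raise each probability to the power of one over the temperature and renormalise. The sketch below uses that form as an assumption; the exact scaling used in the paper is not recoverable from the text above, and the function names are hypothetical.

```python
import math
import random


def apply_temperature(probs, tau):
    """Rescale a distribution as p_i^(1/tau) / sum_j p_j^(1/tau).

    Values of tau below 1 sharpen the distribution; tau = 1 leaves it
    unchanged. Computed in log space for numerical stability.
    """
    if tau <= 0:
        raise ValueError("tau must be positive")
    logits = [math.log(p) / tau if p > 0 else float("-inf") for p in probs]
    top = max(logits)
    weights = [math.exp(value - top) for value in logits]
    total = sum(weights)
    return [w / total for w in weights]


def sample_word(probs, tau, rng=random):
    """Sample a word index from the temperature-scaled distribution."""
    scaled = apply_temperature(probs, tau)
    return rng.choices(range(len(scaled)), weights=scaled, k=1)[0]
```

As the temperature approaches zero this procedure approaches greedy decoding, which is one way to see why a sharpened sampler trades stochasticity for coherence.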
We used the OpenSubtitles dataset of movie subtitles to train our models (Tiedemann, 2012). We took a random sample of 100K files from the full dataset to train our models on, and then pruned this of repeated files to leave roughly 95K files and capped sentence length to 50. The total size of the resulting corpus was around 731M tokens. Please see the supplementary material for model hyperparameters and training details.
As seeds for our replies, we used a list of 200 prompts: 150 lines from the OpenSubtitles dataset outside of our training set which we judged to make sense as independent sentences, and 50 questions chosen from a list of suggested conversation starters (obtained from http://conversationstartersworld.com/250-conversation-starters/).
Model | Zipf parameter | NLL | Unique % |
---|---|---|---|
Dial-LV | 1.39 | 15.54 | 76 |
Dial-MLE | 1.43 | 12.15 | 35 |
Dial-MMI | 1.60 | 15.12 | 62 |
Dial-Samp | 1.53 | 16.66 | 78 |
Previous work (e.g. Li et al. (2016a)) used type-token ratio (TTR) to measure the diversity of the generated output. However, as language follows a Zipf distribution, TTR is affected by the length of the generated replies (Mitchell, 2015). Hence, we use the estimated parameter of a Zipf distribution fitted to our replies as a proxy for the lexical diversity of generated output, with more diverse output having smaller scores. As ML decoding is known to give the same few replies repeatedly, we also report the percentage of unique replies, as a coarser measure of sentential diversity compared to lexical diversity. Further, we give the negative log-likelihood (NLL) as predicted by the deterministic encoder-decoder model, to see what regions of the probability space the replies occupy. We present these statistics in Table 2.
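The two diversity statistics can be computed along the following lines. The fitting procedure here, ordinary least squares on log frequency against log rank with whitespace tokenisation, is an assumption made for illustration; the text does not say how the Zipf parameter was estimated, and a maximum-likelihood fit would be an equally reasonable choice.

```python
import math
from collections import Counter


def zipf_parameter(replies):
    """Estimate s in freq(rank) ~ rank^(-s) by least squares in log-log space.

    Smaller values indicate a flatter word distribution, i.e. more
    lexically diverse output.
    """
    counts = Counter(tok for reply in replies for tok in reply.split())
    freqs = sorted(counts.values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct word types")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return -covariance / variance


def unique_percentage(replies):
    """Percentage of replies that are distinct strings."""
    return 100.0 * len(set(replies)) / len(replies)
```

Unlike the type-token ratio, the fitted exponent describes the shape of the frequency curve rather than a ratio of counts, which is why it is less sensitive to reply length.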
We note that Dial-LV generates more diverse replies than the other models that decode deterministically (Dial-MLE and Dial-MMI), measured in terms of percentage of unique responses. Interestingly, the lexical diversity of Dial-LV is almost identical to that of Dial-MLE (Zipf parameters of 1.39 and 1.43), even though its percentage of unique replies is more than twice as high (76 against 35), suggesting that the latent variables help Dial-LV avoid the boring output problem and generate more diverse outputs. We note that Dial-LV even rivals Dial-Samp in terms of sentential diversity, and beats Dial-Samp in terms of lexical diversity. This could be because Dial-Samp chooses words greedily, and so is biased towards choosing high-probability words at each timestep. This suggests that maintaining a beam of hypotheses while sampling could help sampling-based methods escape the trap of having to make near-greedy local decisions.
Model | Acceptable replies (mean) | Acceptable replies (variance) | NLL | Zipf | Unique % |
---|---|---|---|---|---|
Dial-LV | 1.183 | 0.402 | 15.51 | 1.32 | 76.4 |
Dial-Samp | 1.196 | 0.577 | 16.91 | 1.56 | 73.6 |
We also tested whether Dial-LV could generate a greater number of acceptable replies to a prompt than Dial-Samp. We randomly selected 50 prompts from our list of 200, and generated 5 replies at random to each one using both models. We then asked human annotators (50 in total, 25 for each model) to judge how many replies were appropriate, taking into account grammaticality, coherence and relevance. The results are shown in Table 3.
Interestingly, even though Dial-LV has a lower NLL score, both models generate roughly the same number of acceptable replies. Dial-LV also has less variance in the number of acceptable replies, suggesting that the outputs it generates are more consistent than responses from Dial-Samp. Finally, we note that Dial-LV generates more diverse output than Dial-Samp in this scenario, even though its replies are judged equally acceptable, suggesting that it is managing to produce a wide range of coherent, fluent and appropriate output.
Shell radius | Zipf parameter | NLL | Unique % |
---|---|---|---|
0 | 1.49 | 13.12 | 7 |
4 | 1.62 | 14.02 | 42.1 |
8 | 1.59 | 15.72 | 63.1 |
12 | 1.56 | 17.65 | 67.7 |
16 | 1.78 | 18.16 | 67.1 |
We next explored the effect of sampling from different regions of the latent space. For each prompt in the test set, we took 5 uniform samples from shells of radius 0 (which collapses to deterministic decoding), 4, 8, 12 and 16 in the latent space, by sampling from $\mathcal{N}(0, I)$ and then scaling the sample by the appropriate amount. (For a $k$-dimensional standard Gaussian, $\mathbb{E}[\|z\|^2] = k$, so samples concentrate near radius $\sqrt{k}$ as $k$ grows; here $k = 64$.) We then generated a response to the prompt using each value of $z$, and measured some statistics of the replies. The results are shown in Table 4.
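The shell sampling procedure can be sketched as below: a standard Gaussian draw has a uniformly distributed direction, so normalising it and multiplying by the target radius gives a uniform sample on the shell. The function name and the use of plain Python lists are assumptions for this illustration.

```python
import math
import random


def sample_on_shell(dim, radius, rng=random):
    """Uniform sample from the sphere of the given radius in dim dimensions.

    Radius 0 returns the zero vector, which corresponds to decoding
    without any latent noise.
    """
    if radius == 0:
        return [0.0] * dim
    while True:
        gaussian = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(v * v for v in gaussian))
        if norm > 0:  # guard against the (vanishingly unlikely) zero draw
            return [radius * v / norm for v in gaussian]
```

With 64 latent dimensions, unscaled draws from the prior have a typical norm of about 8, so the radii 4 and 16 probe regions that the prior itself rarely visits.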
As expected, samples with small radius show less diversity in terms of unique outputs. Further, we see a consistent trend that samples with greater radius have a higher NLL score, showing the influence of the prior in Eqn. 1. However, at the highest radius, we observe the highest NLLs, but also the lowest lexical diversities, suggesting that it manages to combine the words it produces in many different ways.
Taken together, our experiments show that ML decoding does not seem to be the best objective for generating diverse dialogue, which corroborates the inadequacy of perplexity as an evaluation metric for dialogue models (Liu et al., 2016). Indeed, all three models which show a diversity gain over the vanilla encoder-decoder with MLE decoding instead try to sample responses from a lower-probability region of the response space. However, if the response probability is too low, the response runs the risk of being nonsensical. Hence, there appears to be a ‘Goldilocks’ region of the probability space, where the responses are interesting and coherent. Finding ways of concentrating model samples to this region is thus a potentially promising area of research for open-domain dialogue agents.

We also note that our proposed model can be combined with MMI decoding or temperature-based sampling to get the benefits of both worlds. We did not do this in our experiments in order to isolate the impact of our model, but doing so could improve the diversity of our generated output even further.
In this paper, we present a latent variable model to generate responses to input utterances. We investigate the diversity of output generated from this model, and show that it improves both lexical and sentential diversity. It also generates more consistently acceptable output as judged by humans compared to sampling from a decoder.
KC is supported by an EPSRC doctoral award. SC is supported by ERC Starting Grant DisCoTex (306920) and ERC Proof of Concept Grant GroundForce (693579). The authors would like to thank everyone who helped prototype the human evaluation experiments. The authors would also like to thank the anonymous reviewers for all their insightful comments.
Cho et al. (2014). On the properties of neural machine translation: Encoder-decoder approaches. Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, 2014.

Liu et al. (2016). Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2122–2132, Austin, Texas, November 2016. Association for Computational Linguistics. URL https://aclweb.org/anthology/D16-1230.

Morin and Bengio (2005). Hierarchical probabilistic neural network language model. Tenth International Workshop on Artificial Intelligence and Statistics, 2005.

We implemented all of our models using Keras (Chollet, 2015) running on Theano (Theano Development Team, 2016). As vocabulary, we took all words appearing at least 1000 times in the whole corpus. As this amounted to 30K words, we used a 2-level hierarchical approximation to the full softmax to speed up model training (Morin and Bengio, 2005), with random clustering. We trained all our models for 3 epochs using the Adadelta optimizer (Zeiler, 2012), with default values for the optimizer parameters.

We used 512-dimensional word embeddings and encoder hidden state sizes across all of our models. We used 64-dimensional latent variables, and so the decoder RNN for the Dial-LV model had hidden state size 576. The decoder RNN for the Dial-MLE model also had hidden state size 576, to keep the capacity of the decoder comparable across the two models. We used tanh non-linearities throughout our model. For training the vanilla encoder-decoder, we also used word dropout on the decoder input with a drop rate of to prevent overfitting. Each epoch took roughly 4 days on a Titan Black.
For the MMI decoding, we used a LM penalty weight of and applied this for the first words.
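The anti-LM objective of Li et al. (2016a) subtracts a weighted language-model log-probability from the conditional log-likelihood, with the penalty applied only to the first few words of the reply. The sketch below scores one candidate under that objective. The function and argument names are hypothetical, and the penalty weight and word count are left as parameters because their values are not recoverable from the text above.

```python
def anti_lm_score(cond_logprobs, lm_logprobs, penalty_weight, num_penalised):
    """Score a candidate reply under an anti-LM (MMI) decoding objective.

    cond_logprobs: per-word log P(word | input, previous words)
    lm_logprobs:   per-word log-probabilities under an unconditional LM
    Only the first num_penalised words contribute to the LM penalty.
    """
    if len(cond_logprobs) != len(lm_logprobs):
        raise ValueError("both sequences must cover the same words")
    conditional = sum(cond_logprobs)
    lm_penalty = sum(lm_logprobs[:num_penalised])
    return conditional - penalty_weight * lm_penalty
```

Because generic openings are highly probable under the language model, subtracting their log-probability raises the relative score of replies that begin less predictably; a beam search would rank hypotheses by this score instead of the plain log-likelihood.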