Tensorflow Implementation of Knowledge-Guided CVAE for dialog generation. It is released by Tiancheng Zhao (Tony) from Dialog Research Center, LTI, CMU
While recent neural encoder-decoder models have shown great promise in modeling open-domain conversations, they often generate dull and generic responses. Unlike past work that has focused on diversifying the output of the decoder at word-level to alleviate this problem, we present a novel framework based on conditional variational autoencoders that captures the discourse-level diversity in the encoder. Our model uses latent variables to learn a distribution over potential conversational intents and generates diverse responses using only greedy decoders. We have further developed a novel variant that is integrated with linguistic prior knowledge for better performance. Finally, the training procedure is improved by introducing a bag-of-word loss. Our proposed models have been validated to generate significantly more diverse responses than baseline approaches and exhibit competence in discourse-level decision-making.READ FULL TEXT VIEW PDF
Tensorflow Implementation of Knowledge-Guided CVAE for dialog generation. It is released by Tiancheng Zhao (Tony) from Dialog Research Center, LTI, CMU
The dialog manager is one of the key components of dialog systems, which is responsible for modeling the decision-making process. Specifically, it typically takes a new utterance and the dialog context as input, and generates discourse-level decisions Bohus and Rudnicky (2003); Williams and Young (2007). Advanced dialog managers usually have a list of potential actions that enable them to have diverse behavior during a conversation, e.g. different strategies to recover from non-understanding Yu et al. (2016). However, the conventional approach of designing a dialog manager Williams and Young (2007) does not scale well to open-domain conversation models because of the vast quantity of possible decisions. Thus, there has been a growing interest in applying encoder-decoder models Sutskever et al. (2014) for modeling open-domain conversation Vinyals and Le (2015); Serban et al. (2016a)
. The basic approach treats a conversation as a transduction task, in which the dialog history is the source sequence and the next response is the target sequence. The model is then trained end-to-end on large conversation corpora using the maximum-likelihood estimation (MLE) objective without the need for manual crafting.
However recent research has found that encoder-decoder models tend to generate generic and dull responses (e.g., I don’t know), rather than meaningful and specific answers Li et al. (2015); Serban et al. (2016b). There have been many attempts to explain and solve this limitation, and they can be broadly divided into two categories (see Section 2 for details): (1) the first category argues that the dialog history is only one of the factors that decide the next response. Other features should be extracted and provided to the models as conditionals in order to generate more specific responses Xing et al. (2016); Li et al. (2016a); (2) the second category aims to improve the encoder-decoder model itself, including decoding with beam search and its variations Wiseman and Rush (2016), encouraging responses that have long-term payoff Li et al. (2016b), etc.
Building upon the past work in dialog managers and encoder-decoder models, the key idea of this paper is to model dialogs as a one-to-many problem at the discourse level. Previous studies indicate that there are many factors in open-domain dialogs that decide the next response, and it is non-trivial to extract all of them. Intuitively, given a similar dialog history (and other observed inputs), there may exist many valid responses (at the discourse-level), each corresponding to a certain configuration of the latent variables that are not presented in the input. To uncover the potential responses, we strive to model a probabilistic distribution over the distributed utterance embeddings of the potential responses using a latent variable (Figure 1
). This allows us to generate diverse responses by drawing samples from the learned distribution and reconstruct their words via a decoder neural network.
, which introduces a latent variable that can capture discourse-level variations as described above 2. We propose Knowledge-Guided CVAE (kgCVAE), which enables easy integration of expert knowledge and results in performance improvement and model interpretability. 3. We develop a training method in addressing the difficulty of optimizing CVAE for natural language generationBowman et al. (2015). We evaluate our models on human-human conversation data and yield promising results in: (a) generating appropriate and discourse-level diverse responses, and (b) showing that the proposed training method is more effective than the previous techniques.
Our work is related to both recent advancement in encoder-decoder dialog models and generative models based on CVAE.
Since the emergence of the neural dialog model, the problem of output diversity has received much attention in the research community. Ideal output responses should be both coherent and diverse. However, most models end up with generic and dull responses. To tackle this problem, one line of research has focused on augmenting the input of encoder-decoder models with richer context information, in order to generate more specific responses. Li et al., Li et al. (2016a) captured speakers’ characteristics by encoding background information and speaking style into the distributed embeddings, which are used to re-rank the generated response from an encoder-decoder model. Xing et al., Xing et al. (2016) maintain topic encoding based on Latent Dirichlet Allocation (LDA) Blei et al. (2003) of the conversation to encourage the model to output more topic coherent responses.
On the other hand, many attempts have also been made to improve the architecture of encoder-decoder models. Li et al,. Li et al. (2015)
proposed to optimize the standard encoder-decoder by maximizing the mutual information between input and output, which in turn reduces generic responses. This approach penalized unconditionally high frequency responses, and favored responses that have high conditional probability given the input. Wiseman and RushWiseman and Rush (2016) focused on improving the decoder network by alleviating the biases between training and testing. They introduced a search-based loss that directly optimizes the networks for beam search decoding. The resulting model achieves better performance on word ordering, parsing and machine translation. Besides improving beam search, Li et al., Li et al. (2016b)
pointed out that the MLE objective of an encoder-decoder model is unable to approximate the real-world goal of the conversation. Thus, they initialized a encoder-decoder model with MLE objective and leveraged reinforcement learning to fine tune the model by optimizing three heuristic rewards functions: informativity, coherence, and ease of answering.
into a probability distributioninstead of a point encoding in the autoencoder. Then VAE applies a decoder network to reconstruct the original input using samples from . To generate images, VAE first obtains a sample of from the prior distribution, e.g. , and then produces an image via the decoder network. A more advanced model, the conditional VAE (CVAE), is a recent modification of VAE to generate diverse images conditioned on certain attributes, e.g. generating different human faces given skin color Yan et al. (2015); Sohn et al. (2015). Inspired by CVAE, we view the dialog contexts as the conditional attributes and adapt CVAE to generate diverse responses instead of images.
Although VAE/CVAE has achieved impressive results in image generation, adapting this to natural language generators is non-trivial. Bowman et al., Bowman et al. (2015)
have used VAE with Long-Short Term Memory (LSTM)-based recognition and decoder networks to generate sentences from a latent Gaussian variable. They showed that their model is able to generate diverse sentences with even a greedy LSTM decoder. They also reported the difficulty of training because the LSTM decoder tends to ignore the latent variable. We refer to this issue as thevanishing latent variable problem. Serban et al., Serban et al. (2016b) have applied a latent variable hierarchical encoder-decoder dialog model to introduce utterance-level variations and facilitate longer responses. To improve upon the past models, we firstly introduce a novel mechanism to leverage linguistic knowledge in training end-to-end neural dialog models, and we also propose a novel training technique that mitigates the vanishing latent variable problem.
Each dyadic conversation can be represented via three random variables: the dialog context(context window size ), the response utterance (the utterance) and a latent variable , which is used to capture the latent distribution over the valid responses. Further, is composed of the dialog history: the preceding k-1 utterances; conversational floor (1 if the utterance is from the same speaker of , otherwise 0) and meta features (e.g. the topic). We then define the conditional distribution and our goal is to use deep neural networks (parametrized by ) to approximate and . We refer to as the prior network and as the response decoder. Then the generative process of is (Figure 2 (a)):
Sample a latent variable from the prior network .
Generate through the response decoder .
CVAE is trained to maximize the conditional log likelihood of given , which involves an intractable marginalization over the latent variable . As proposed in Sohn et al. (2015); Yan et al. (2015), CVAE can be efficiently trained with the Stochastic Gradient Variational Bayes (SGVB) framework Kingma and Welling (2013) by maximizing the variational lower bound of the conditional log likelihood. We assume the
follows multivariate Gaussian distribution with a diagonal covariance matrix and introduce arecognition network to approximate the true posterior distribution . Sohn and et al,. Sohn et al. (2015) have shown that the variational lower bound can be written as:
demonstrates an overview of our model. The utterance encoder is a bidirectional recurrent neural network (BRNN)Schuster and Paliwal (1997)
with a gated recurrent unit (GRU)Chung et al. (2014) to encode each utterance into fixed-size vectors by concatenating the last hidden states of the forward and backward RNN . is simply . The context encoder is a 1-layer GRU network that encodes the preceding k-1 utterances by taking and the corresponding conversation floor as inputs. The last hidden state of the context encoder is concatenated with meta features and . Since we assume follows isotropic Gaussian distribution, the recognition network and the prior network , and then we have:
We then use the reparametrization trick Kingma and Welling (2013) to obtain samples of either from predicted by the recognition network (training) or predicted by the prior network (testing). Finally, the response decoder is a 1-layer GRU network with initial state . The response decoder then predicts the words in sequentially.
In practice, training CVAE is a challenging optimization problem and often requires large amount of data. On the other hand, past research in spoken dialog systems and discourse analysis has suggested that many linguistic cues capture crucial features in representing natural conversation. For example, dialog acts Poesio and Traum (1998) have been widely used in the dialog managers Litman and Allen (1987); Raux et al. (2005); Zhao and Eskenazi (2016) to represent the propositional function of the system. Therefore, we conjecture that it will be beneficial for the model to learn meaningful latent if it is provided with explicitly extracted discourse features during the training.
In order to incorporate the linguistic features into the basic CVAE model, we first denote the set of linguistic features as . Then we assume that the generation of depends on , and . relies on and as shown in Figure 2. Specifically, during training the initial state of the response decoder is and the input at every step is where is the word embedding of word in . In addition, there is an MLP to predict based on and . In the testing stage, the predicted is used by the response decoder instead of the oracle decoders. We denote the modified model as knowledge-guided CVAE (kgCVAE) and developers can add desired discourse features that they wish the latent variable to capture. KgCVAE model is trained by maximizing:
Since now the reconstruction of
is a part of the loss function, kgCVAE can more efficiently encode-related information into than discovering it only based on the surface-level and . Another advantage of kgCVAE is that it can output a high-level label (e.g. dialog act) along with the word-level responses, which allows easier interpretation of the model’s outputs.
A straightforward VAE with RNN decoder fails to encode meaningful information in due to the vanishing latent variable problem Bowman et al. (2015). Bowman et al., Bowman et al. (2015) proposed two solutions: (1) KL annealing: gradually increasing the weight of the KL term from 0 to 1 during training; (2) word drop decoding: setting a certain percentage of the target words to 0. We found that CVAE suffers from the same issue when the decoder is an RNN. Also we did not consider word drop decoding because Bowman et al,. Bowman et al. (2015) have shown that it may hurt the performance when the drop rate is too high.
As a result, we propose a simple yet novel technique to tackle the vanishing latent variable problem: bag-of-word loss. The idea is to introduce an auxiliary loss that requires the decoder network to predict the bag-of-words in the response as shown in Figure 3(b). We decompose into two variables: with word order and without order, and assume that and are conditionally independent given and : . Due to the conditional independence assumption, the latent variable is forced to capture global information about the target response. Let where is vocabulary size, and we have:
where is the length of and is the word index of word in . The modified variational lower bound for CVAE with bag-of-word loss is (see Appendix A for kgCVAE):
We will show that the bag-of-word loss in Equation 3.3 is very effective against the vanishing latent variable and it is also complementary to the KL annealing technique.
We chose the Switchboard (SW) 1 Release 2 Corpus Godfrey and Holliman (1997) to evaluate the proposed models. SW has 2400 two-sided telephone conversations with manually transcribed speech and alignment. In the beginning of the call, a computer operator gave the callers recorded prompts that define the desired topic of discussion. There are 70 available topics. We randomly split the data into dialogs for train/validate/test. The pre-processing includes (1) tokenize using the NLTK tokenizer Bird et al. (2009); (2) remove non-verbal symbols and repeated words due to false starts; (3) keep the top 10K frequent word types as the vocabulary. The final data have pairs for train/validate/test. Furthermore, a subset of SW was manually labeled with dialog acts Stolcke et al. (2000). We extracted dialog act labels based on the dialog act recognizer proposed in Ribeiro et al. (2015)
. The features include the uni-gram and bi-gram of the utterance, and the contextual features of the last 3 utterances. We trained a Support Vector Machine (SVM)Suykens and Vandewalle (1999) with linear kernel on the subset of SW with human annotations. There are 42 types of dialog acts and the SVM achieved 77.3% accuracy on held-out data. Then the rest of SW data are labelled with dialog acts using the trained SVM dialog act recognizer.
We trained with the following hyperparameters (according to the loss on the validate dataset): word embedding has size 200 and is shared across everywhere. We initialize the word embedding from Glove embedding pre-trained on TwitterPennington et al. (2014). The utterance encoder has a hidden size of 300 for each direction. The context encoder has a hidden size of 600 and the response decoder has a hidden size of 400. The prior network and the MLP for predicting both have 1 hidden layer of size 400 and non-linearity. The latent variable has a size of 200. The context window
is 10. All the initial weights are sampled from a uniform distribution [-0.08, 0.08]. The mini-batch size is 30. The models are trained end-to-end using the Adam optimizerKingma and Ba (2014)
with a learning rate of 0.001 and gradient clipping at 5. We selected the best models based on the variational lower bound on the validate data. Finally, we use the BOW loss along withKL annealing of 10,000 batches to achieve the best performance. Section 5.4 gives a detailed argument for the importance of the BOW loss.
We compared three neural dialog models: a strong baseline model, CVAE, and kgCVAE. The baseline model is an encoder-decoder neural dialog model without latent variables similar to Serban et al. (2016a). The baseline model’s encoder uses the same context encoder to encode the dialog history and the meta features as shown in Figure 3. The encoded context is directly fed into the decoder networks as the initial state. The hyperparameters of the baseline are the same as the ones reported in Section 4.2 and the baseline is trained to minimize the standard cross entropy loss of the decoder RNN model without any auxiliary loss.
Also, to compare the diversity introduced by the stochasticity in the proposed latent variable versus the softmax of RNN at each decoding step, we generate responses from the baseline by sampling from the softmax. For CVAE/kgCVAE, we sample times from the latent and only use greedy decoders so that the randomness comes entirely from the latent variable .
Automatically evaluating an open-domain generative dialog model is an open research challenge Liu et al. (2016). Following our one-to-many hypothesis, we propose the following metrics. We assume that for a given dialog context , there exist reference responses , . Meanwhile a model can generate hypothesis responses , . The generalized response-level precision/recall for a given dialog context is:
where is a distance function which lies between 0 to 1 and measures the similarities between and . The final score is averaged over the entire test dataset and we report the performance with 3 types of distance functions in order to evaluate the systems from various linguistic points of view:
Cosine Distance of Bag-of-word Embedding: a simple method to obtain sentence embeddings is to take the average or extrema of all the word embeddings in the sentences Forgues et al. (2014); Adi et al. (2016). The is the cosine distance of the two embedding vectors. We used Glove embedding described in Section 4 and denote the average method as A-bow and extrema method as E-bow. The score is normalized to [0, 1].
Dialog Act Match: to measure the similarity at the discourse level, the same dialog-act tagger from 4.1 is applied to label all the generated responses of each model. We set if and have the same dialog acts, otherwise .
One challenge of using the above metrics is that there is only one, rather than multiple reference responses/contexts. This impacts reliability of our measures. Inspired by Sordoni et al. (2015), we utilized information retrieval techniques (see Appendix A
) to gather 10 extra candidate reference responses/context from other conversations with the same topics. Then the 10 candidate references are filtered by two experts, which serve as the ground truth to train the reference response classifier. The result is 6.69 extra references in average per context. The average number of distinct reference dialog acts is 4.2. Table1 shows the results.
|perplexity (KL)||35.4 (n/a)||20.2 (11.36)||16.02 (13.08)|
The proposed models outperform the baseline in terms of recall in all the metrics with statistical significance. This confirms our hypothesis that generating responses with discourse-level diversity can lead to a more comprehensive coverage of the potential responses than promoting only word-level diversity. As for precision, we observed that the baseline has higher or similar scores than CVAE in all metrics, which is expected since the baseline tends to generate the mostly likely and safe responses repeatedly in the
hypotheses. However, kgCVAE is able to achieve the highest precision and recall in the 4 metrics at the same time (BLEU1-4, A-BOW). One reason for kgCVAE’s good performance is that the predicted dialog act label in kgCVAE can regularize the generation process of its RNN decoder by forcing it to generate more coherent and precise words. We further analyze the precision/recall of BLEU-4 by looking at the average score versus the number of distinct reference dialog acts. A low number of distinct dialog acts represents the situation where the dialog context has a strong constraint on the range of the next response (low entropy), while a high number indicates the opposite (high-entropy). Figure4 shows that CVAE/kgCVAE achieves significantly higher recall than the baseline in higher entropy contexts. Also it shows that CVAE suffers from lower precision, especially in low entropy contexts. Finally, kgCVAE gets higher precision than both the baseline and CVAE in the full spectrum of context entropy.
|Example 1-Topic: Recycling Context: A: are they doing a lot of recycling out in Georgia? Target-B (statement): well at my workplace we have palaces for aluminium cans and we have a separate trash can for recyclable paper|
|1. well I’m a graduate student and have two kids||1. (non-understand) pardon|
|2. well I was in last year and so we’ve had lots of recycling||2. (statement) oh you’re not going to have a curbside pick up here|
|3. I’m not sure||3. (statement) okay I am sure about a recycling center|
|4. well I don’t know I just moved here in new york||4. (yes-answer) yeah so|
|Example 2-Topic: Child Care Context: A: you know a private home to take their children to when they’re young until they hit the preschool age and they Target-B (backchannel): uh-huh|
|1. um - hum||1. (backchannel) uh-huh|
|2. yeah||2. (turn-exit) um-hum|
|3. um - hum||3. (backchannel) yeah|
|4. uh-huh||4. (statement) oh yeah I think that’s part of the problem|
Table 2 shows the outputs generated from the baseline and kgCVAE. In example 1, caller A begins with an open-ended question. The kgCVAE model generated highly diverse answers that cover multiple plausible dialog acts. Further, we notice that the generated text exhibits similar dialog acts compared to the ones predicted separately by the model, implying the consistency of natural language generation based on . On the contrary, the responses from the baseline model are limited to local n-gram variations and share a similar prefix, i.e. ”I’m”. Example 2 is a situation where caller A is telling B stories. The ground truth response is a back-channel and the range of valid answers is more constrained than example 1 since B is playing the role of a listener. The baseline successfully predicts ”uh-huh”. The kgCVAE model is also able to generate various ways of back-channeling. This implies that the latent is able to capture context-sensitive variations, i.e. in low-entropy dialog contexts modeling lexical diversity while in high-entropy ones modeling discourse-level diversity. Moreover, kgCVAE is occasionally able to generate more sophisticated grounding (sample 4) beyond a simple back-channel, which is also an acceptable response given the dialog context.
In addition, past work Kingma and Welling (2013) has shown that the recognition network is able to learn to cluster high-dimension data, so we conjecture that posterior outputted from the recognition network should cluster the responses into meaningful groups. Figure 5 visualizes the posterior of responses in the test dataset in 2D space using t-SNE Maaten and Hinton (2008). We found that the learned latent space is highly correlated with the dialog act and length of responses, which confirms our assumption.
Finally, we evaluate the effectiveness of bag-of-word (BOW) loss for training VAE/CVAE with the RNN decoder. To compare with past work Bowman et al. (2015), we conducted the same language modelling (LM) task on Penn Treebank using VAE. The network architecture is same except we use GRU instead of LSTM. We compared four different training setups: (1) standard VAE without any heuristics; (2) VAE with KL annealing (KLA); (3) VAE with BOW loss; (4) VAE with both BOW loss and KLA. Intuitively, a well trained model should lead to a low reconstruction loss and small but non-trivial KL cost. For all models with KLA, the KL weight increases linearly from 0 to 1 in the first 5000 batches.
Table 3 shows the reconstruction perplexity and the KL cost on the test dataset. The standard VAE fails to learn a meaningful latent variable by having a KL cost close to and a reconstruction perplexity similar to a small LSTM LM Zaremba et al. (2014). KLA helps to improve the reconstruction loss, but it requires early stopping since the models will fall back to the standard VAE after the KL weight becomes 1. At last, the models with BOW loss achieved significantly lower perplexity and larger KL cost.
Figure 6 visualizes the evolution of the KL cost. We can see that for the standard model, the KL cost crashes to 0 at the beginning of training and never recovers. On the contrary, the model with only KLA learns to encode substantial information in latent when the KL cost weight is small. However, after the KL weight is increased to 1 (after 5000 batch), the model once again decides to ignore the latent and falls back to the naive implementation. The model with BOW loss, however, consistently converges to a non-trivial KL cost even without KLA, which confirms the importance of BOW loss for training latent variable models with the RNN decoder. Last but not least, our experiments showed that the conclusions drawn from LM using VAE also apply to training CVAE/kgCVAE, so we used BOW loss together with KLA for all previous experiments.
In conclusion, we identified the one-to-many nature of open-domain conversation and proposed two novel models that show superior performance in generating diverse and appropriate responses at the discourse level. While the current paper addresses diversifying responses in respect to dialogue acts, this work is part of a larger research direction that targets leveraging both past linguistic findings and the learning power of deep neural networks to learn better representation of the latent factors in dialog. In turn, the output of this novel neural dialog model will be easier to explain and control by humans. In addition to dialog acts, we plan to apply our kgCVAE model to capture other different linguistic phenomena including sentiment, named entities,etc. Last but not least, the recognition network in our model will serve as the foundation for designing a data-driven dialog manager, which automatically discovers useful high-level intents. All of the above suggest a promising research direction.
This work was funded by NSF grant CNS-1512973. The opinions expressed in this paper do not necessarily reflect those of NSF.
Journal of machine Learning research3(Jan):993–1022.
Proceedings of the 30th AAAI Conference on Artificial Intelligence (AAAI-16).
Partially observable markov decision processes for spoken dialog systems.Computer Speech & Language 21(2):393–422.
We assume that even with the presence of linguistic feature regarding , the prediction of still only depends on the and . Therefore, we have:
We collected multiple reference responses for each dialog context in the test set by information retrieval techniques combining with traditional a machine learning method. First, we encode the dialog history using Term Frequency-Inverse Document Frequency (TFIDF) Salton and Buckley (1988) weighted bag-of-words into vector representation . Then we denote the topic of the conversation as and denote as the conversation floor, i.e. if the speakers of the last utterance in the dialog history and response utterance are the same otherwise . Then we computed the similarity between two dialog contexts using:
Unlike past work Sordoni et al. (2015)
, this similarity function only cares about the distance in the context and imposes no constraints on the response, therefore is suitbale for finding diverse responses regarding to the same dialog context. Secondly, for each dialog context in the test set, we retrieved the 10 nearest neighbors from the training set and treated the responses from the training set as candidate reference responses. Thirdly, we further sampled 240 context-responses pairs from 5481 pairs in the total test set and post-processed the selected candidate responses by two human computational linguistic experts who were told to give a binary label for each candidate response about whether the response is appropriate regarding its dialog context. The filtered lists then served as the ground truth to train our reference response classifier. For the next step, we extracted bigrams, part-of-speech bigrams and word part-of-speech pairs from both dialogue contexts and candidate reference responses with rare threshold for feature extraction being set to 20. Then L2-regularized logistic regression with 10-fold cross validation was applied as the machine learning algorithm. Cross validation accuracy on the human-labelled data was 71%. Finally, we automatically annotated the rest of test set with this trained classifier and the resulting data were used for model evaluation.