Chit-chat is a rich domain that challenges machine learning models to express fluent natural language and to successfully interact with other agents. Chit-chat stands in contrast to goal-oriented dialogue, such as when a customer has the explicit goal of booking a flight ticket. When agents communicate, they each have internal state (e.g., their knowledge, intent) and typically have limited knowledge of the state of other agents Chen et al. (2017). As a result, human-like chit-chat requires agents to be fluent, engaging and consistent with what has been said and with their persona Zhang et al. (2018).
These requirements make learning generative chit-chat models a complex task. First, given an existing conversation history, there may be a large number of valid responses Vinyals and Le (2015). Hence, supervised learning of chit-chat models that cover a large number of topics and styles requires a significant amount of data Zhou et al. (2018). Second, as conversations progress and more opportunities for contradiction arise, maintaining consistency becomes more difficult Serban et al. (2016, 2017). Third, engaging chit-chat responses follow conversational structures that are not captured well by perplexity Dinan et al. (2019). Indeed, our human user studies show that both consistency and engagingness are only weakly correlated with perplexity, and fluency is not correlated with it at all.
We propose Sketch-Fill-A-R, a dialogue agent framework that can learn to generate fluent, consistent and engaging chit-chat responses. Our key motivation is the hypothesis that human-like chit-chat responses often 1) follow common conversational patterns with insertions of agent-specific traits, and 2) condition explicitly on those persona traits.
Sketch-Fill-A-R decomposes response generation into three phases: sketching, filling and ranking, see Figure 1. First, Sketch-Fill-A-R dynamically generates a sketch response with slots, which enables it to learn response patterns that are compatible with many specific persona traits. Second, it generates candidate responses by filling in slots with words stored in memory. This enables Sketch-Fill-A-R’s responses to adhere to its persona. Third, the candidate responses are ranked by perplexity under a pre-trained language model (LM), which encourages the final response (with lowest LM perplexity) to be fluent.
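As an illustration, the three phases can be sketched in a few lines of simplified Python. The function names `sketch_model` and `lm_perplexity` are hypothetical stand-ins for the trained sketch decoder and the pre-trained LM scorer, and real sketches may contain multiple slots:

```python
# Illustrative sketch of the three-phase pipeline; not the paper's
# implementation. `sketch_model` and `lm_perplexity` are stand-ins.

def generate_response(history, persona_words, sketch_model, lm_perplexity):
    # Phase 1: sketching -- produce a template with @persona slots.
    sketch = sketch_model(history, persona_words)   # e.g. "i love @persona !"
    # Phase 2: filling -- substitute the slot with each candidate persona word.
    candidates = [sketch.replace("@persona", w) for w in persona_words]
    # Phase 3: ranking -- return the candidate with lowest LM perplexity.
    return min(candidates, key=lm_perplexity)
```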
In sum, our contributions are as follows:
We describe Sketch-Fill-A-R and how its multi-phase generation process encourages fluency, consistency and engagingness.
We show that Sketch-Fill-A-R significantly improves hold-out perplexity by points on the Persona-Chat dataset over state-of-the-art baselines.
We show Sketch-Fill-A-R is rated higher on conversational metrics and preferred over baselines in single and multi-turn user studies.
We extensively analyze Sketch-Fill-A-R’s response statistics and human feedback, and show that it is more consistent by using a narrower set of responses, and more engaging, by asking more questions than baselines.
2 Related Work
Dialogue agents such as Amazon Alexa, Apple Siri, and Google Home are commonplace today, and are mainly task-oriented: they help users achieve specific tasks. On the other hand, Microsoft XiaoIce Zhou et al. (2018) is an example of an undirected chit-chat dialogue agent.
Historically, task-oriented dialogue systems are composed of components such as dialogue state tracking and natural language generation Jurafsky and Martin (2009). Even now, the natural language generation component often uses hand-crafted templates and rules defined by domain experts that are filled via heuristics Gao et al. (2019). More recently, task-oriented dialogue systems have been trained end-to-end Bordes et al. (2016), but these systems have specific user intents they aim to fulfill, and so represent a more constrained task. Early conversational dialogue systems such as ELIZA Weizenbaum (1966) and Alice Wallace (2009) were also based on hand-crafted rules and thus brittle. To alleviate this rigidity, more recent neural seq2seq models Sutskever et al. (2014) are trained end-to-end Vinyals and Le (2015); Sordoni et al. (2015); Serban et al. (2017); Li et al. (2016). To help guide conversation, Ghazvininejad et al. (2018); Dinan et al. (2018) incorporated knowledge-grounded datasets, while Zhang et al. (2018) created the Persona-Chat dataset used in this work. Sketch-Fill-A-R dynamically generates slot sketches and bears resemblance to Wu et al. (2019), which assumed data are structured domain-specific triplets and contexts follow templates. However, Sketch-Fill-A-R does not assume the personas and responses have rigid syntactic structure, and introduces a ranking procedure. In contrast to our sketch-and-fill procedure, Qian et al. (2017) train a model to select a persona trait and decode around the trait. Finally, Welleck et al. (2018) also re-rank by scoring utterances with Natural Language Inference to improve consistency.
Neural Sequence Models
Our framework builds on neural sequence models, which auto-regressively embed and generate sequences. Hence, our framework is general and is compatible with non-recurrent encoders and decoders, such as Transformer networks with non-recurrent self-attention Vaswani et al. (2017); Devlin et al. (2018).
Sketch-Fill-A-R uses a simple memory module to store words from personas, which act as context for generation. Weston et al. (2014); Sukhbaatar et al. (2015) introduced learned Key-Value Memory Networks, while Kumar et al. (2016) introduced Dynamic Memory Nets for question-answering via an iterative attention over memory. Also, Sketch-Fill-A-R decodes responses using a re-ranking strategy based on language model scores, which complements strategies in Kulikov et al. (2018).
3 Sketch-Fill-A-R
Our key motivation is to generate human-like chit-chat responses that are conditioned on persona-relevant information. Sketch-Fill-A-R generates chit-chat using a persona-memory to dynamically generate sketches that capture conversational patterns, and inserts persona-relevant information into them.
To set notation: capital letters (e.g., $W$) denote matrices, $i$ and $t$ are vector-matrix indices, and lowercase letters (e.g., $x$, $h$) denote vectors. The model input at time $t$ is $x_t$ and the output at time $t$ is $y_t$. We denote the conversation by $C$ and persona trait words by $P$. Both input and output words are 1-hot vectors $x_t, y_t \in \{0, 1\}^{V}$, where $V$ denotes the vocabulary size. The vocabulary contains all unique words, punctuation and special symbols (e.g., EOS, @persona). $x_{1:T}$ denotes a sequence $(x_1, \ldots, x_T)$.
Formally, we aim to learn a response generation model that predicts words $y_t$ using a probability distribution $P_{\theta}(y_{1:T} \mid C, P)$ over sequences of words, given the conversation and persona traits with rare words. Here $T$ is the output sequence length and $\theta$ are the model weights. We mainly focus on deep neural networks, a model class that has recently seen great success in language generation tasks Sutskever et al. (2014); Bahdanau et al. (2014).
Sketch-Fill-A-R composes several components to generate sketch responses:
An encoder that computes hidden representations of the input.
A memory module that stores all rare words from persona traits (constructed by removing stop words).
A language model that computes a distribution over next words.
A sketch decoder that synthesizes both the encoded input and memory readouts, and predicts the next word in the sketch response.
3.1 Sketch Response Generation
We instantiate both encoder and decoder using recurrent neural networks. In this work, we use LSTMs Hochreiter and Schmidhuber (1997), although other choices are possible Elman (1990); Cho et al. (2014). The encoder computes hidden states auto-regressively:

$h_t = \mathrm{LSTM}(h_{t-1}, e_t), \quad e_t = E x_t,$

where $e_t$ are word-embedding representations of the raw input tokens $x_t$. As such, Sketch-Fill-A-R encodes both conversation history and individual persona traits into hidden states $h^c_t$ and $h^{p_i}_t$. We denote the final hidden states for all personas as $H^p$.
Sketch-Fill-A-R selects a subset of rare words $r_1, \ldots, r_M$ from the persona traits by removing stop-words, punctuation, and other symbols. After encoding the input dialogue, Sketch-Fill-A-R does a memory readout using the final conversation encoder hidden state $h^c_T$ as a query:

$a_i = \mathrm{softmax}_i\!\left((A r_i)^{\top} h^c_T\right), \quad m = \sum_i a_i \, B r_i,$

where $i$ is a vector index over the rare word memory, $\mathrm{softmax}$ is a softmax activation function creating attention weights $a_i$, and $A$, $B$ are trainable embedding matrices.
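The readout can be illustrated as a standard key-value attention over the rare-word memory. The shapes and matrix names below are illustrative assumptions, not the paper's exact parameterization:

```python
import numpy as np

def softmax(z):
    z = z - z.max()            # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def memory_readout(query, rare_onehot, A, B):
    """query: final conversation encoder state, shape (d,).
    rare_onehot: 1-hot rare-word rows, shape (m, V).
    A, B: trainable key/value embedding matrices, shape (V, d).
    Returns attention weights over memory and the readout vector."""
    keys = rare_onehot @ A          # (m, d) key embeddings
    values = rare_onehot @ B        # (m, d) value embeddings
    attn = softmax(keys @ query)    # attention weights, sums to 1
    return attn, attn @ values      # readout = weighted sum of values
```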
The decoder is an LSTM which recursively computes hidden states $d_t$ that are mapped into a distribution over output words:

$P(y_t \mid y_{1:t-1}, C, P) = \mathrm{softmax}(E^{\top} o_t).$

At decoding time the decoder computes the next hidden state $d_t$ using the previous predicted word $y_{t-1}$ and decoder hidden state $d_{t-1}$, in addition to attention over the context of the response, i.e., previous utterances and the agent's persona traits. A learned projection maps the memory readout $m$ down to the initial hidden state $d_0$ of the decoder, and $E^{\top}$ is the transpose of the encoding embedding matrix. The decoding context $o_t$ augments the decoder hidden state $d_t$ with attention vectors $c_t$ over encoded hidden states $h^c_{1:T}$ and $c^p_t$ over encoded persona hidden states $H^p$:

$o_t = [d_t; c_t; c^p_t].$
3.2 Inference Reranking Strategy
Sketch-Fill-A-R trains the sketch-decoder outputs (Equation 7) by minimizing cross-entropy loss with the ground truth sketches $\tilde{y}_{1:T}$. However, during inference, Sketch-Fill-A-R uses an iterative generate-and-score approach to produce the final response:
Perform beam search with beam size $B$ to generate $B$ sketch responses that may contain @persona tags.
For each sketch with tags, select the persona trait $p^{*}$ with the highest attention weight at the first sketch tag location, and construct candidate responses by filling each @persona slot with words selected from $p^{*}$.
Compute the perplexity of all candidate responses using a pre-trained language model:

$\mathrm{PPL}(y_{1:T}) = \exp\!\Big(-\tfrac{1}{T} \sum_{t=1}^{T} \log P_{\mathrm{LM}}(y_t \mid y_{1:t-1})\Big).$

The final response is the candidate with the lowest LM perplexity.
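To make the generate-and-score loop concrete, here is a toy version of the fill-and-rank steps. The unigram language model is only a stand-in assumption for illustration; the real system scores candidates with a pre-trained neural LM:

```python
import math

def lm_perplexity(tokens, unigram_probs, eps=1e-8):
    # Per-token perplexity under a toy unigram LM (a stand-in for the
    # pre-trained LM used for reranking).
    nll = -sum(math.log(unigram_probs.get(t, eps)) for t in tokens)
    return math.exp(nll / len(tokens))

def fill_and_rank(sketch, persona_words, unigram_probs):
    # Fill each @persona slot with a candidate persona word, then keep
    # the filled response with the lowest perplexity.
    candidates = [[w if tok == "@persona" else tok for tok in sketch]
                  for w in persona_words]
    return min(candidates, key=lambda c: lm_perplexity(c, unigram_probs))
```

A candidate containing an out-of-vocabulary filler word receives near-zero probability mass, so its perplexity explodes and the fluent fill wins.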
For models that do not use reranking to fill slots, we follow the methodology of Wu et al. (2019) in using a global-to-local memory pointer network in order to fill slots. For detail, see the Appendix.
Percentage of unique n-grams and full responses generated by each model on the validation set.
| Sequence size | KVMemNet | Sketch-Fill-A-R (ours) |
| --- | --- | --- |
| Bigram | 32.65% | 7.32% |
| Trigram | 54.95% | 13.97% |
| Full responses | 70.16% | 50.60% |
4 Empirical Validation
To validate Sketch-Fill-A-R, we first show that it achieves better supervised learning performance than baselines on a chit-chat dialogue dataset.
We trained Sketch-Fill-A-R to generate single-turn agent responses on the Persona-Chat dataset Zhang et al. (2018), which contains 10,907 dialogues. Here, a dialogue consists of multiple turns: a single turn contains the utterance of a single agent. We processed this dataset into training examples that each consist of the conversation history $C$, the set of persona traits $P$ of the model, and the ground truth sketch response $\tilde{y}_{1:T}$. This process yielded 131,438 training examples. Rare words were identified by removing all punctuation and stop words from the set of persona traits (see Appendix for more information). Ground truth sketch responses were then constructed by replacing all rare word instances in ground truth responses with @persona tags.
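A minimal sketch of this preprocessing step; the stop-word list here is an illustrative subset, not the one used in the paper:

```python
STOP_WORDS = {"i", "my", "a", "the", "to", "is", "and", "of", "you"}  # subset
PUNCT = ".,!?"

def rare_words(persona_traits):
    # Collect persona words that survive stop-word and punctuation removal.
    rare = set()
    for trait in persona_traits:
        for tok in trait.lower().split():
            tok = tok.strip(PUNCT)
            if tok and tok not in STOP_WORDS:
                rare.add(tok)
    return rare

def to_sketch(response, rare):
    # Replace every rare-word occurrence in the response with @persona.
    return " ".join("@persona" if tok.strip(PUNCT) in rare else tok
                    for tok in response.lower().split())
```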
Language Model Pre-training
We compared 4 variations of Sketch-Fill-A-R with a strong baseline:
Key-Value Memory Network (KVMemNet) Zhang et al. (2018),
Sketch-Fill (SF): the base model,
Sketch-Fill-A: SF + attention,
Sketch-Fill-R: SF + reranking,
Sketch-Fill-A-R: SF + attention + reranking.
(A number of chit-chat models posted results in the ConvAI2 competition. However, we could not reproduce these, as all competitive methods rely on extensive pre-training with large models, or do not have code or trained models available.)
Zhang et al. (2018) showed not only that models trained on Persona-Chat outperform models trained on other dialogue datasets (movies, Twitter) in engagingness, but also that KVMemNet outperforms vanilla Seq2Seq on Persona-Chat. As a result, we omit comparison with vanilla Seq2Seq. Further, KVMemNet is the strongest of the few public baselines available to compare against on chit-chat with personas.
All Sketch-Fill-A-R models use language model reranking (see Section 3.2). All input tokens were first encoded using 300-dimensional GloVe word embeddings Pennington et al. (2014). All models were trained by minimizing the cross-entropy loss on the ground truth sketch response $\tilde{y}_{1:T}$:

$\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log P_{\theta}(\tilde{y}_t \mid \tilde{y}_{1:t-1}, C, P).$
For training details, see the Appendix. The results are shown in Table 2. Sketch-Fill models outperform KVMemNet on validation perplexity, while using significantly fewer weights than KVMemNet. This suggests the structure of Sketch-Fill models fits well with chit-chat dialogue.
User study ratings of single-turn responses (score range: 1 (lowest) to 5 (highest)). Each experiment showed generated responses from a Sketch-Fill-A-R variation and KVMemNet on 100 conversations to 5 human raters. Each row shows ratings from a single heads-up experiment. Sketch-Fill variants with reranking show a small gain over KVMemNet on all qualitative metrics, but the variance in the ratings is high. Sketch-Fill variants without reranking perform much worse, due to their responses not being fluent, despite achieving low perplexity (see Figure 2).
|A/B Experiment||KVMemNet||Sketch-Fill- (ours)|
5 User Study and Qualitative Analysis
Although Sketch-Fill models perform well quantitatively, a crucial test is to evaluate how well they perform when judged by human users on conversational quality, which is not explicitly captured by perplexity. We performed single and multi-turn dialogue user studies to assess the quality of Sketch-Fill-A-R, rated along several dimensions:
Fluency: whether responses are grammatically correct and sound natural.
Consistency: whether responses do not contradict the previous conversation.
Engagingness: how well responses fit the previous conversation and how likely the conversation would continue.
Our definition of engagingness includes relevance, defined in pragmatics and relevance theory (Wilson and Sperber, 2002; Grice, 1991) as a statement leading to positive cognitive effect. However, an engaging statement may also be ironic (Sperber and Wilson, 1981), humorous, or otherwise specific to individuals.
We also explore which qualities of Sketch-Fill-A-R’s outputs are correlated with human ratings and perplexity scores. Our results suggest that:
Conditioning on persona-memory provides more consistency.
Sketch-Fill-A-R poses more questions, which correlates with higher engagingness.
Responses need to be fluent in order to be consistent or engaging. In addition, more consistent responses are more likely to be engaging.
Perplexity is not correlated with high-quality responses.
Human: hi there . how are you
Model: hi good just writing some music and you
Human: i am well . just took my dog for a walk
KVMemNet: sorry , i trying to become the next big thing in music , i sing
Sketch-Fill-A-R: what kind of music do you like to do ?
5.1 Single-turn Experiments
The studies were completed on 100 random examples sampled from the validation set, where each example was rated by 5 judges. Judges hired for the study came from English-speaking countries. As a calibration step, they were shown examples of good and bad responses along all of the measured dimensions before proceeding with the study.
The study was executed in two settings: fine-grained, where the judges were asked to rate the responses on a scale from 1 (lowest) to 5 (highest) for each of the mentioned dimensions, and binary, where they were asked to choose the response that would best fit the conversation.
The results of the fine-grained survey are presented in Table 3, where each row corresponds to a separate heads-up experiment in which the KVMemNet model was paired with one of the versions of Sketch-Fill-A-R. The study showed small gains on all metrics for all Sketch-Fill-A-R variations; however, the variance of results was high. We believe this could be caused by a number of factors, including subjective preferences of raters and potential ambiguities in the experiment descriptions. We notice that Sketch-Fill and Sketch-Fill-A reach lower perplexity values than KVMemNet, but comparatively have lower evaluations across the board. Conversely, reranking models like Sketch-Fill-R and Sketch-Fill-A-R have higher scores on all metrics. We observe that the difference is due to the reranker producing more fluent outputs via better selection of persona words.
Table 4 shows the results of the human study in a binary setting. In these experiments the base and attention-augmented versions of Sketch-Fill-A-R outperformed KVMemNet by a clear margin.
The following subsections present in-depth analysis of the human study. The analysis focuses on the Sketch-Fill-A-R model, since it yielded the best perplexity and user study results.
Correlation between ratings
To better understand the reasoning behind the ratings assigned by annotators, we look at the correlation between the different dimensions in which responses were scored. Figure 5 shows Kernel-Density-Estimation plots of the data points and associated Pearson correlation coefficients. The data shows weak to moderate correlation between fluency and consistency, and between fluency and engagingness, respectively, and a strong correlation between engagingness and consistency ratings. See the Appendix for more detailed information. The numbers were obtained on human ratings of the Sketch-Fill-A-R model, but comparable numbers were also obtained for the KVMemNet model. These results follow intuition, as fluency of a response is a notion that can be easily defined and identified. On the other hand, consistency and engagingness are ambiguous, and (possibly) partially overlapping, concepts.
To associate quantitative metrics from Table 2 with human ratings, we computed the correlation between perplexity values from the sketch decoder of the Sketch-Fill-A-R model and human scores across different dimensions. The study showed no correlation for fluency, and weak correlations for consistency and engagingness.
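The coefficients discussed here are standard Pearson r values; for completeness, a self-contained way to compute them over two rating vectors:

```python
import numpy as np

def pearson_r(x, y):
    # Pearson correlation coefficient between two rating vectors.
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()   # center both vectors
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))
```

Applied to per-response fluency, consistency, and engagingness ratings, this yields the coefficients of the kind plotted in Figure 5.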
Model vocabulary analysis
To assess the diversity of responses generated by the models, we calculated the percentage of unique n-grams and full responses present in the model outputs. Table 2 presents these values for KVMemNet and Sketch-Fill-A-R computed on the full validation set. The numbers show that the KVMemNet model clearly outperforms our model in terms of generating diverse and unique outputs, by a factor of 3-4x. However, we hypothesize that this additional diversity may lead to lower engagingness scores.
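The diversity metric can be computed as the share of distinct n-grams among all n-gram occurrences; the exact counting convention below is our assumption for illustration:

```python
def unique_ngram_pct(responses, n):
    # Percentage of distinct n-grams among all n-gram occurrences.
    grams = []
    for resp in responses:
        toks = resp.split()
        grams.extend(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    return 100.0 * len(set(grams)) / len(grams) if grams else 0.0

def unique_response_pct(responses):
    # Percentage of distinct full responses.
    return 100.0 * len(set(responses)) / len(responses) if responses else 0.0
```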
Consistency over time
In order to evaluate the models' capacity to stay consistent with their previous statements, and thus implicitly their ability to utilize information present in the chat history, we compared how the consistency rating changed as the number of lines of the conversation increased. Figure 4 visualizes this metric for both our model and KVMemNet. For both models, consistency decreases as the chat history gets longer, indicating that the models have problems keeping track of their previous statements. When analyzing the linear trend we noticed that the decrease in performance is slower for the Sketch-Fill-A-R model. We hypothesize that this effect may be partially caused by the high diversity of sequences generated by KVMemNet, which in turn affects the model's ability to generate consistent conversation.
Effect of question responses
We hypothesize that for a conversation to be engaging, responses in chit-chat dialogue should be a mix of statements, where the model shares its persona information, and questions, where the model inquires about traits and information of the other agent. To confirm this intuition, we evaluated the effect that the presence of a question in the response has on the ratings coming from the judges. The results are presented in Figure 3(c). The study showed that there is a strong correlation between the model asking a question and users rating the response as more engaging. Asking questions has a small but positive influence on engagingness and fluency.
To further analyze this aspect, we measured the frequency of questions in the set of 100 responses coming from the Sketch-Fill-A-R and KVMemNet models. We found that our model produced 49 question responses, of which 25 contained both a statement and a question. In the same setting, KVMemNet produced 15 questions, of which only 1 contained both a statement and a question. This insight could explain the gains in engagingness ratings found in our human study.
5.2 Multi-turn User Study
To evaluate both models in the more challenging multi-turn setting, we collected conversations of 20 turns between each model and human users. Users were asked to score their conversations with the models on a scale from 1 (lowest) to 5 (highest) across the same dimensions as in the single-turn experiments. Table 8 shows the human ratings for both Sketch-Fill-A-R and KVMemNet. Both were judged as less fluent than in the single-turn case. This is likely due to the models having to respond to a range of conversation histories unseen during training.
Notably, Sketch-Fill-A-R outperformed KVMemNet on consistency, by a significantly larger margin (3.72 vs 2.15) than in the single-turn setting. This suggests that Sketch-Fill-A-R benefits from conditioning response generation on its persona-memory and so adheres more closely to responses that are compatible with its persona.
Further, Sketch-Fill-A-R is more engaging. This suggests that in the multi-turn setting there is also a positive correlation between engagingness and consistency, as in the single-turn case (see Appendix): consistent models can be more engaging as well.
Table 7 shows an example of KVMemNet's inconsistency. While every model utterance is fluent individually, KVMemNet noticeably contradicts itself in the context of previous utterances and frequently ignores the human responses (e.g., "i do not have any myself" after "my little girl"). We believe the lack of structure inherent in models built on vanilla Seq2Seq makes KVMemNet prone to this mistake. Table 7 also shows that Sketch-Fill-A-R conducts a more engaging conversation, with pertinent responses and questions. However, this structure can restrict Sketch-Fill-A-R, as sketches may be filled with incorrect persona traits (e.g., "i love papaya food ."). See the Appendix for more examples.
6 Discussion and Future Work
In our study we have identified several paths for future work. First, our results show that perplexity does not strongly correlate with human judgment of response quality. Developing an automated metric that correlates well with human judgment is crucial, as human evaluation is expensive, time-consuming, and prone to inconsistencies. Second, despite outperforming other models in the multi-turn dialogue setting on consistency and engagement, our model has not reached human-like fluency. In order to demonstrate higher-level complex traits such as empathy, models must first master these lower-level abilities. Finally, correct usage of rare words and proper nouns leads to higher human scores. Existing models are unable to handle out-of-vocabulary tokens and rare words gracefully, and incorporating commonsense via external knowledge bases or other methods will be useful.
During experiments, we identified a number of ethical implications for future work. The Persona-Chat dataset was noted by some raters to contain potentially inappropriate statements (e.g., "my wife spends all my money") and is based in US culture (e.g., food, music, cars, names). The dataset also lacked examples of failing gracefully when no appropriate response is available (e.g., "I'm sorry, I don't understand," "I don't know"). As such, learned model responses were occasionally insensitive and confusing to human users.
References
- Bahdanau et al. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
- Bordes et al. (2016). Learning end-to-end goal-oriented dialog. arXiv preprint arXiv:1605.07683.
- Chen et al. (2017). A survey on dialogue systems: recent advances and new frontiers. ACM SIGKDD Explorations Newsletter 19(2), pp. 25-35.
- Cho et al. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
- Devlin et al. (2018). BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Dinan et al. (2019). The second conversational intelligence challenge (ConvAI2). arXiv preprint arXiv:1902.00098.
- Dinan et al. (2018). Wizard of Wikipedia: knowledge-powered conversational agents. arXiv preprint arXiv:1811.01241.
- Elman (1990). Finding structure in time. Cognitive Science 14(2), pp. 179-211.
- Gao et al. (2019). Neural approaches to conversational AI. Foundations and Trends in Information Retrieval 13(2-3), pp. 127-298.
- Ghazvininejad et al. (2018). A knowledge-grounded neural conversation model. In Thirty-Second AAAI Conference on Artificial Intelligence.
- Grice (1991). Studies in the way of words. Harvard University Press.
- Hochreiter and Schmidhuber (1997). Long short-term memory. Neural Computation 9(8), pp. 1735-1780.
- Jurafsky and Martin (2009). Speech and language processing: an introduction to natural language processing, computational linguistics, and speech recognition, 2nd edition. Prentice Hall.
- Kulikov et al. (2018). Importance of a search strategy in neural dialogue modelling. arXiv preprint arXiv:1811.00907.
- Kumar et al. (2016). Ask me anything: dynamic memory networks for natural language processing. In International Conference on Machine Learning, pp. 1378-1387.
- Li et al. (2016). A persona-based neural conversation model. arXiv preprint arXiv:1603.06155.
- Pennington et al. (2014). GloVe: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532-1543.
- Qian et al. (2017). Assigning personality/identity to a chatting machine for coherent conversation generation.
- Radford et al. (2018). Improving language understanding by generative pre-training. OpenAI.
- Serban et al. (2016). Generative deep neural networks for dialogue: a short review. arXiv preprint arXiv:1611.06216.
- Serban et al. (2017). A hierarchical latent variable encoder-decoder model for generating dialogues. In Thirty-First AAAI Conference on Artificial Intelligence.
- Sordoni et al. (2015). A neural network approach to context-sensitive generation of conversational responses. arXiv preprint arXiv:1506.06714.
- Sperber and Wilson (1981). Irony and the use-mention distinction. Philosophy 3, pp. 143-184.
- Sukhbaatar et al. (2015). End-to-end memory networks. In Advances in Neural Information Processing Systems, pp. 2440-2448.
- Sutskever et al. (2014). Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pp. 3104-3112.
- Vaswani et al. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998-6008.
- Vinyals and Le (2015). A neural conversational model. arXiv preprint arXiv:1506.05869.
- Wallace (2009). The anatomy of ALICE. In Parsing the Turing Test, pp. 181-210.
- Weizenbaum (1966). ELIZA: a computer program for the study of natural language communication between man and machine. Communications of the ACM 9(1), pp. 36-45.
- Welleck et al. (2018). Dialogue natural language inference. arXiv preprint arXiv:1811.00671.
- Weston et al. (2014). Memory networks. arXiv preprint arXiv:1410.3916.
- Wilson and Sperber (2002). Relevance theory. Blackwell.
- Wu et al. (2019). Global-to-local memory pointer networks for task-oriented dialogue. In International Conference on Learning Representations.
- Zhang et al. (2018). Personalizing dialogue agents: i have a dog, do you have pets too? arXiv preprint arXiv:1801.07243.
- Zhou et al. (2018). The design and implementation of XiaoIce, an empathetic social chatbot. arXiv preprint arXiv:1812.08989.
- Zhu et al. (2015). Aligning books and movies: towards story-like visual explanations by watching movies and reading books. arXiv preprint arXiv:1506.06724.