Sketch-Fill-A-R: A Persona-Grounded Chit-Chat Generation Framework

by Michael Shum et al.

Human-like chit-chat conversation requires agents to generate responses that are fluent, engaging and consistent. We propose Sketch-Fill-A-R, a framework that uses a persona-memory to generate chit-chat responses in three phases. First, it generates dynamic sketch responses with open slots. Second, it generates candidate responses by filling slots with parts of its stored persona traits. Lastly, it ranks and selects the final response via a language model score. Sketch-Fill-A-R outperforms a state-of-the-art baseline both quantitatively (10-point lower perplexity) and qualitatively (preferred by human judges in heads-up single-turn and multi-turn user studies) on the Persona-Chat dataset. Finally, we extensively analyze Sketch-Fill-A-R's responses and human feedback, and show it is more consistent and engaging, producing more relevant responses and questions.




1 Introduction


Chit-chat is a rich domain that challenges machine learning models to express fluent natural language and to successfully interact with other agents. Chit-chat stands in contrast to goal-oriented dialogue, such as when a customer has the explicit goal of booking a flight ticket. When agents communicate, they each have internal state (e.g., their knowledge, intent) and typically have limited knowledge of the state of other agents Chen et al. (2017). As a result, human-like chit-chat requires agents to be fluent, engaging and consistent with what has been said and with their persona Zhang et al. (2018).

These requirements make learning generative chit-chat models a complex task. First, given an existing conversation history, there may be a large number of valid responses Vinyals and Le (2015). Hence, supervised learning of chit-chat models that cover a large number of topics and styles requires a significant amount of data Zhou et al. (2018). Second, as conversations progress and more opportunities for contradiction arise, maintaining consistency becomes more difficult Serban et al. (2016, 2017). Third, engaging chit-chat responses follow conversational structures that are not captured well by perplexity Dinan et al. (2019). Indeed, our human user studies show that both consistency and engagingness are only weakly correlated with perplexity, and fluency is not correlated at all.

Figure 1: Chit-chat generation with Sketch-Fill-A-R.

We propose Sketch-Fill-A-R, a dialogue agent framework that can learn to generate fluent, consistent and engaging chit-chat responses. Our key motivation is the hypothesis that human-like chit-chat responses often 1) follow common conversational patterns with insertions of agent-specific traits, and 2) condition explicitly on those persona traits.

Sketch-Fill-A-R decomposes response generation into three phases: sketching, filling and ranking, see Figure 1. First, Sketch-Fill-A-R dynamically generates a sketch response with slots, which enables it to learn response patterns that are compatible with many specific persona traits. Second, it generates candidate responses by filling in slots with words stored in memory. This enables Sketch-Fill-A-R’s responses to adhere to its persona. Third, the candidate responses are ranked by perplexity under a pre-trained language model (LM), which encourages the final response (with lowest LM perplexity) to be fluent.
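The three-phase decomposition can be sketched end-to-end in a few lines of Python. Everything here is an illustrative stand-in: the sketches are hard-coded instead of coming from a learned decoder, and the toy scoring function replaces the pre-trained language model used for ranking.

```python
def fill_sketch(sketch, rare_words):
    """Phase 2: expand a sketch with @persona slots into candidate responses."""
    if "@persona" not in sketch:
        return [sketch]
    candidates = []
    for word in rare_words:
        candidates.extend(fill_sketch(sketch.replace("@persona", word, 1), rare_words))
    return candidates

def rank(candidates, lm_score):
    """Phase 3: return the candidate the language model scores best (lowest)."""
    return min(candidates, key=lm_score)

# Phase 1 stand-in: sketches that a trained decoder might emit (hypothetical).
sketches = ["i love @persona food .", "do you like @persona ?"]
rare_words = ["papaya", "pizza"]          # rare words extracted from persona traits
toy_lm_score = lambda s: len(s.split())  # placeholder for LM perplexity

candidates = [c for sk in sketches for c in fill_sketch(sk, rare_words)]
response = rank(candidates, toy_lm_score)
```

Note that each @persona slot multiplies the number of candidates by the size of the rare-word memory, which is why the full system restricts filling to words from the most-attended persona trait.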

In sum, our contributions are as follows:

  • We describe Sketch-Fill-A-R and how its multi-phase generation process encourages fluency, consistency and engagingness.

  • We show that Sketch-Fill-A-R significantly improves hold-out perplexity, by 10 points, on the Persona-Chat dataset over state-of-the-art baselines.

  • We show Sketch-Fill-A-R is rated higher on conversational metrics and preferred over baselines in single and multi-turn user studies.

  • We extensively analyze Sketch-Fill-A-R’s response statistics and human feedback, and show that it is more consistent by using a narrower set of responses, and more engaging, by asking more questions than baselines.

2 Related Work

Chit-chat Dialogue

Dialogue agents such as Amazon Alexa, Apple Siri, and Google Home are commonplace today, and are mainly task-oriented: they help users achieve specific tasks. On the other hand, Microsoft XiaoIce Zhou et al. (2018) is an example of an undirected chit-chat dialogue agent.

Historically, task-oriented dialogue systems are composed of components such as dialogue state tracking and natural language generation Jurafsky and Martin (2009). Even now, the natural language generation component often uses hand-crafted templates and rules defined by domain experts that are filled via heuristics Gao et al. (2019). More recently, task-oriented dialogue systems have been trained end-to-end Bordes et al. (2016), but these systems have specific user intents they aim to fulfill, and so represent a more constrained task. Early conversational dialogue systems such as ELIZA Weizenbaum (1966) and Alice Wallace (2009) were also based on hand-crafted rules and thus brittle. To alleviate this rigidity, more recent neural seq2seq models Sutskever et al. (2014) are trained end-to-end Vinyals and Le (2015); Sordoni et al. (2015); Serban et al. (2017); Li et al. (2016). To help guide conversation, Ghazvininejad et al. (2018); Dinan et al. (2018) incorporated knowledge-grounded datasets, while Zhang et al. (2018) created the Persona-Chat dataset used in this work. Sketch-Fill-A-R dynamically generates slot sketches and bears resemblance to Wu et al. (2019), which assumed data are structured domain-specific triplets and contexts follow templates. However, Sketch-Fill-A-R does not assume the personas and responses have rigid syntactic structure, and introduces a ranking procedure. In contrast to our sketch-and-fill procedure, Qian et al. (2017) train a model to select a persona trait and decode around the trait. Finally, Welleck et al. (2018) also re-rank by scoring utterances with Natural Language Inference to improve consistency.

Neural Sequence Models

Sketch-Fill-A-R extends a neural encoder-decoder structure Sutskever et al. (2014) and is agnostic to the chosen form of encoder-decoder. We use recurrent models with attention Bahdanau et al. (2014), which auto-regressively embed and generate sequences. Hence, our framework is general and is compatible with non-recurrent encoders and decoders, such as Transformer networks with non-recurrent self-attention Vaswani et al. (2017); Devlin et al. (2018).

Sketch-Fill-A-R uses a simple memory module to store words from personas, which act as context for generation. Weston et al. (2014); Sukhbaatar et al. (2015) introduced learned Key-Value Memory Networks, while Kumar et al. (2016) introduced Dynamic Memory Nets for question-answering via an iterative attention over memory. Also, Sketch-Fill-A-R decodes responses using a re-ranking strategy based on language model scores, which complements strategies in Kulikov et al. (2018).

Figure 2: Sketch-Fill-A-R generates sketch responses in 4 steps. First, it encodes the conversation history and personas into hidden states. It stores word-embeddings for selected rare words from persona traits in a persona-memory. The final encoder hidden state produces a read-out vector. Lastly, the decoder outputs a sketch response with @persona slots using the read-out vector, the encoder hidden states, and attention over personas and conversation.
Figure 3: Sketch-Fill-A-R inference strategy. During inference, Sketch-Fill-A-R uses beam search to generate sketch responses (step 1). In step 2, each beam with @persona slots produces candidate responses by filling it with rare words selected from the persona traits. Finally, a language model scores each candidate and Sketch-Fill-A-R returns the lowest-perplexity candidate.

3 Sketch-Fill-A-R

Our key motivation is to generate human-like chit-chat responses that are conditioned on persona-relevant information. Sketch-Fill-A-R uses a persona-memory to dynamically generate sketches that capture conversational patterns, and inserts persona-relevant information into their slots.

To set notation: capitals denote matrices, subscripts $i, j, t$ are vector-matrix indices, and lowercase symbols denote vectors. The model input at time $t$ is $x_t$ and the output at time $t$ is $y_t$. We denote the conversation history by $c$ and persona trait words by $p$. Both input and output words are 1-hot vectors of dimension $V$, where $V$ denotes the vocabulary size. The vocabulary contains all unique words, punctuation and special symbols (e.g., EOS, @persona). $x_{1:T}$ denotes a sequence $(x_1, \ldots, x_T)$.

Formally, we aim to learn a response generation model that predicts words $y_t$ using a probability distribution $P_\theta(y_{1:T} \mid c, p)$ over sequences of words, conditioned on the conversation history and the persona traits with their rare words. Here $T$ is the output sequence length and $\theta$ are the model weights. We mainly focus on deep neural networks, a model class that has recently seen great success in language generation tasks Sutskever et al. (2014); Bahdanau et al. (2014).

Sketch-Fill-A-R composes several components to generate sketch responses:

  • An encoder that computes hidden representations of the input.

  • A memory module that stores all rare words from persona traits (constructed by removing stop words).

  • A language model that computes a distribution over next words.

  • A sketch decoder that synthesizes both the encoded input and memory readouts, and predicts the next word in the sketch response.

3.1 Sketch Response Generation


We instantiate both encoder and decoder using recurrent neural networks. In this work, we use LSTMs Hochreiter and Schmidhuber (1997), although other choices are possible Elman (1990); Cho et al. (2014). The encoder computes hidden states auto-regressively:

$h_t = \mathrm{LSTM}(h_{t-1}, e_t), \qquad e_t = E x_t,$

where $e_t$ are word-embedding representations of the raw input tokens $x_t$ and $E$ is the embedding matrix. As such, Sketch-Fill-A-R encodes both the conversation history and individual persona traits into hidden states $h^c$ and $h^{p_i}$. We denote the final hidden states for all personas as $H^p$.

Memory Module

Sketch-Fill-A-R selects a subset of rare words from the persona traits by removing stop-words, punctuation, and other symbols. After encoding the input dialogue, Sketch-Fill-A-R does a memory readout using the final conversation encoder hidden state $h^c_T$ as a query:

$a_j = \mathrm{softmax}_j\!\left( (h^c_T)^\top A\, m_j \right), \qquad r = \sum_j a_j \, B\, m_j,$

where $j$ is a vector index over the rare-word memory $m$, softmax is a softmax activation function creating attention weights $a_j$, and $A$ and $B$ are trainable embedding matrices.

Attention Decoder

The decoder is an LSTM which recursively computes hidden states $d_t$ that are mapped into a distribution over output words. At decoding time the decoder computes the next hidden state $d_t$ using the previous predicted word $y_{t-1}$ and the previous decoder hidden state $d_{t-1}$, in addition to attention over the context of the response, i.e., previous utterances and the agent's persona traits. A linear projection maps the memory read-out $r$ down to the initial hidden state of the decoder, and the output projection is the transpose of the encoding embedding matrix. The decoding context augments the decoder hidden state with attention vectors over the encoded conversation hidden states $h^c$ and over the encoded persona hidden states $H^p$, combined through a tanh nonlinearity followed by a softmax activation over output words. In the attention computations, the softmax is taken over the encoder time dimension and the attention score is an inner product.

3.2 Inference Reranking Strategy

Sketch-Fill-A-R trains the sketch-decoder outputs by minimizing cross-entropy loss with the ground-truth sketches $\hat{y}_{1:T}$. However, during inference, Sketch-Fill-A-R uses an iterative generate-and-score approach to produce the final response:

  1. Perform beam search with beam size $B$ to generate $B$ sketch responses that may contain @persona tags.

  2. For each sketch with tags, select the persona trait with the highest attention weight at the first sketch-tag location, and construct candidate responses by filling each @persona slot with words selected from that trait.

  3. Compute the perplexity of all candidate responses using a pre-trained language model:

$\mathrm{PPL}(y_{1:T}) = \exp\!\Big( -\tfrac{1}{T} \textstyle\sum_{t=1}^{T} \log P_{\mathrm{LM}}(y_t \mid y_{1:t-1}) \Big).$

  4. The final response is the candidate with the lowest LM perplexity.
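The ranking criterion in step 3 is ordinary language-model perplexity: the exponential of the average negative log-likelihood of the candidate's tokens. A minimal sketch, where `logprob` is a placeholder for the pre-trained LM's conditional log-probabilities:

```python
import math

def lm_perplexity(tokens, logprob):
    """exp of the average negative log-likelihood under the language model."""
    nll = -sum(logprob(tok) for tok in tokens) / len(tokens)
    return math.exp(nll)

# A uniform toy LM over a 10-word vocabulary (an assumption for illustration).
uniform_logprob = lambda tok: math.log(1 / 10)
ppl = lm_perplexity("i love papaya food .".split(), uniform_logprob)
```

A uniform model over 10 words yields a perplexity of exactly 10; the candidate whose tokens the LM finds most predictable gets the lowest value and is returned in step 4.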

For models that do not use reranking to fill slots, we follow the methodology of Wu et al. (2019) in using a global-to-local memory pointer network to fill slots. For details, see the Appendix.

Model Parameters Perplexity
KVMemNet 46.3M
Sketch-Fill 26.6M
Sketch-Fill-R 26.6M
Sketch-Fill-A 26.9M
Sketch-Fill-A-R 26.9M
Table 1: Sketch-Fill-A-R achieves significantly lower out-of-sample perplexity than KVMemNet. Perplexity for Sketch-Fill-A-R is measured over the sketch template responses. The number of persona tag occurrences is very small, constituting 8% of the total words. See Appendix for more information.
Sequence size KVMemNet Sketch-Fill-A-R (ours)
Unigram 5.39% 1.72%
Bigram 32.65 % 7.32 %
Trigram 54.95 % 13.97 %
Full responses 70.16 % 50.60 %
Table 2: Percentage of novel n-grams and full responses generated by the KVMemNet and Sketch-Fill-A-R models, computed on the full validation set.

4 Empirical Validation

To validate Sketch-Fill-A-R, we first show that it achieves better supervised learning performance than baselines on a chit-chat dialogue dataset.

Persona-Chat Dataset

We trained Sketch-Fill-A-R to generate single-turn agent responses on the Persona-Chat dataset Zhang et al. (2018), which contains 10,907 dialogues. Here, a dialogue consists of multiple turns: a single turn contains the utterance of a single agent. We processed this dataset into training examples that each consist of the conversation history, the set of persona traits of the model, and the ground-truth sketch response. This process yielded 131,438 training examples. Rare words were identified by removing all punctuation and stop words from the set of persona traits (see Appendix for more information). Ground-truth sketch responses were then constructed by replacing all rare-word instances in ground-truth responses with @persona tags.
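The construction of sketch targets can be approximated as below. The stop-word list and tokenization are simplified assumptions standing in for the actual preprocessing described in the Appendix.

```python
STOP_WORDS = {"i", "my", "is", "a", "the", "am", "to", "me"}  # illustrative subset

def rare_words(persona_traits):
    """Rare words: persona-trait words minus stop words and punctuation."""
    words = set()
    for trait in persona_traits:
        for token in trait.lower().split():
            token = token.strip(".,!?'")
            if token and token not in STOP_WORDS:
                words.add(token)
    return words

def make_sketch(response, rare):
    """Replace rare-word occurrences in a ground-truth response with @persona tags."""
    return " ".join("@persona" if tok in rare else tok for tok in response.lower().split())

persona = ["i am a bee farmer", "my favorite food is papaya"]
rare = rare_words(persona)                     # e.g. includes 'bee' and 'papaya'
sketch = make_sketch("i love papaya !", rare)  # rare word replaced by a slot
```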

Language Model Pre-training

Sketch-Fill-A-R uses a Transformer-based GPT-network Radford et al. (2018) pre-trained on the Books text corpus Zhu et al. (2015) to rank candidate responses with filled @persona slots according to their LM-perplexity scores. For model details, see the Appendix.

Experimental Setup

We compared 4 variations of Sketch-Fill-A-R with a strong baseline (a number of chit-chat models posted results in the ConvAI2 competition; however, we could not reproduce these, as all competitive methods rely on extensive pre-training with large models, or do not have code or trained models available):

  • Key-Value Memory Network (KVMemNet) Zhang et al. (2018),

  • Sketch-Fill (SF)

  • Sketch-Fill-A: SF + attention

  • Sketch-Fill-R: SF + reranking

  • Sketch-Fill-A-R: SF + attention + reranking

Zhang et al. (2018) showed not only that models trained on Persona-Chat outperform models trained on other dialogue datasets (movies, Twitter) in engagingness, but also that KVMemNet outperforms vanilla Seq2Seq on Persona-Chat. As a result, we omit comparison with vanilla Seq2Seq. Further, KVMemNet is the strongest of the few public baselines available to compare against on chit-chat with personas.

All Sketch-Fill-A-R models use language model reranking (see Section 3.2). All input tokens were first encoded using 300-dimensional GloVe word embeddings Pennington et al. (2014). All models were trained by minimizing the cross-entropy loss on the ground-truth sketch response $\hat{y}_{1:T}$:

$\mathcal{L}(\theta) = -\textstyle\sum_{t} \log P_\theta(\hat{y}_t \mid \hat{y}_{1:t-1}, c, p).$

For training details, see the Appendix. The results are shown in Table 1. Sketch-Fill models outperform KVMemNet on validation perplexity, while using significantly fewer weights than KVMemNet. This suggests the structure of Sketch-Fill models fits well with chit-chat dialogue.

Baseline Consistency Engagingness Fluency Ours Consistency Engagingness Fluency
Test I KVMemNet Sketch-Fill
Test II KVMemNet Sketch-Fill-A
Test III KVMemNet Sketch-Fill-R
Test IV KVMemNet Sketch-Fill-A-R
Table 3: User study ratings of single-turn responses (score range 1 (lowest) to 5 (highest)). Each experiment showed generated responses from a Sketch-Fill-A-R variation and KVMemNet on 100 conversations to 5 human raters. Each row shows ratings from a single heads-up experiment. Sketch-Fill variants with reranking show a small gain over KVMemNet on all qualitative metrics, but the variance in the ratings is high. Sketch-Fill variants without reranking perform much worse, due to their responses not being fluent, despite achieving low perplexity.

A/B Experiment KVMemNet Sketch-Fill variant (ours)
vs Sketch-Fill 380 120
vs Sketch-Fill-A 396 103
vs Sketch-Fill-R 225 275
vs Sketch-Fill-A-R 232 266
Table 4: Human user A/B-preferences on 100 conversations, each shown to 5 users. Two Sketch-Fill-A-R variations are preferred over KVMemNet.
Fluency Consistency Engagingness Perplexity
Fluency 1 0.40 0.46 -0.01
Consistency - 1 0.67 -0.20
Engagingness - - 1 -0.15
Perplexity - - - 1
Table 5: Pearson’s correlation between human ratings and perplexity of user study examples. For visual KDE-plots of the data, see the Appendix.

5 User Study and Qualitative Analysis

Although Sketch-Fill models perform well quantitatively, a crucial test is to evaluate how well they perform when judged by human users on conversational quality, which is not explicitly captured by perplexity. We performed single and multi-turn dialogue user studies to assess the quality of Sketch-Fill-A-R, rated along several dimensions:

  • Fluency: whether responses are grammatically correct and sound natural.

  • Consistency: whether responses do not contradict the previous conversation.

  • Engagingness: how well responses fit the previous conversation and how likely the conversation would continue.

Our definition of engagingness includes relevance, defined in pragmatics and relevance theory (Wilson and Sperber, 2002; Grice, 1991) as a statement leading to positive cognitive effect. However an engaging statement may be ironic (Sperber and Wilson, 1981), humorous, or further specific to individuals.

We also explore which qualities of Sketch-Fill-A-R’s outputs are correlated with human ratings and perplexity scores. Our results suggest that:

  • Conditioning on persona-memory provides more consistency.

  • Sketch-Fill-A-R poses more questions, which correlates with higher engagingness.

  • Responses need to be fluent in order to be consistent or engaging. In addition, more consistent responses are more likely to be engaging.

  • Perplexity is not correlated with high-quality responses.

Model Trait 1: I married a super model from Italy
Model Trait 2: I've zero family that I'm close to
Model Trait 3: My name is George
Model Trait 4: I'm a bee farmer
Model Trait 5: My favorite food is papaya
Human: hi what's up
Model: do you like papaya ?
Human: eh, i'm more into mango
Model: what is your favorite papaya ?
Human: probably just the normal ones.
Human: what do you like to do?
Model: i am a bee farmer .
Human: woah , that's cool
Model: how are you ?
Human: i'm doing well ,
Human: just trying to get some work done
Model: do you have any hobbies ?
Human: i like to play basketball , what about you?
Model: i love papaya food .
Human: what is your name?
Model: george . what is your favorite name ?
Table 6: Example multi-turn dialog between a human and Sketch-Fill-A-R. Model persona traits are shown above; the human was not given specific persona traits.
Model Trait 1: I work for our local supermarket
Model Trait 2: My favorite band is The Who
Model Trait 3: I have never been out of Ohio
Model Trait 4: My favorite food is pizza with black olives
Human: what are you doing today?
Model: i am ok what do you do for fun ?
Human: I like biking. Do you bike?
Model: cool ! i am watching a movie with my little girl
Human: You have a daughter? How old is she?
Model: oh that is cool . i do not have any myself .
Human: Oh. do you have a girlfriend then?
Model: she is an ex singer from the early 2000s ,
Model: mostly does pop songs and the like
Human: What kind of movies do you like?
Model: little girl and my mother lives with us
Model: she's a concert pianist, she in between jobs
Human: who is this little girl?
Model: she also likes karate
Model: she is why i got a brown belt !
Human: what is her name?
Model: she had a green thumb and loved flowers
Table 7: Example multi-turn dialog between a human and KVMemNet. Model persona traits are shown above; the human was not given specific persona traits.
Metric KVMemNet Sketch-Fill-A-R (ours)
Table 8: Multi-turn user study ratings (score range 1 (lowest) - 5 (highest)). We collected conversations with 20 turns between human raters and models. KVMemNet is more fluent, but Sketch-Fill-A-R is more engaging and significantly more consistent.
Human: hi there . how are you
Model: hi good just writing some music and you
Human: i am well . just took my dog for a walk
KVMemNet: sorry , i trying to become the next big thing
in music , i sing
Sketch-Fill-A-R: what kind of music do you like to do ?
Table 9: An example single-turn response for KVMemNet and Sketch-Fill-A-R given the same context.

5.1 Single-turn Experiments

The studies were completed on 100 random examples sampled from the validation set, where each example was rated by 5 judges. Judges hired for the study came from English-speaking countries. As a calibration step, they were shown examples of good and bad responses in all of the measured dimensions before proceeding with the study.

The study was executed in two settings, fine-grained, where the judges were asked to rate the responses on a scale from 1 (lowest) to 5 (highest) for each of the mentioned dimensions, and binary, where they were asked to choose a response that would best fit the conversation.

The results of the fine-grained survey are presented in Table 3, where each row corresponds to a separate heads-up experiment in which the KVMemNet model was paired with one of the versions of Sketch-Fill-A-R. The study showed small gains on all metrics for all Sketch-Fill-A-R variations; however, the variance of results was high. We believe that this artifact could be caused by a number of factors, including subjective preferences of raters and potential ambiguities in the experiment description. We notice that Sketch-Fill and Sketch-Fill-A reach lower perplexity values than KVMemNet, but comparatively have lower evaluations across the board. Conversely, reranking models like Sketch-Fill-R and Sketch-Fill-A-R have higher scores on all metrics. We observe that the difference is due to the reranker giving more fluent outputs via better selection of persona words to use.

Table 4 shows the results of the human study in a binary setting. In these experiments the base and attention-augmented versions of Sketch-Fill-A-R outperformed KVMemNet by a clear margin.

The following subsections present in-depth analysis of the human study. The analysis focuses on the Sketch-Fill-A-R model, since it yielded the best perplexity and user study results.

Correlation between ratings

To study and better understand the reasoning behind the ratings assigned by annotators, we look at the correlation between the different dimensions in which responses were scored. Figure 5 shows Kernel-Density-Estimation plots of the data points and the associated Pearson correlation coefficients. The data shows weak (r = 0.40) to moderate (r = 0.46) correlation between fluency and consistency, and fluency and engagingness ratings, respectively. The data shows r = 0.67 between engagingness and consistency ratings, suggesting strong correlation between those dimensions. See the Appendix for more detailed information. The numbers were obtained on human ratings of the Sketch-Fill-A-R model, but comparable numbers were also obtained for the KVMemNet model. These results follow intuition, as fluency of a response is a notion that can be easily defined and identified. On the other hand, consistency and engagingness are ambiguous, and (possibly) partially overlapping, concepts.

To associate quantitative metrics from Table 1 with human ratings, we computed the correlation between perplexity values from the sketch decoder of the Sketch-Fill-A-R model and human scores across different dimensions. The study showed no correlation for fluency (r = -0.01), and weak correlations for consistency (r = -0.20) and engagingness (r = -0.15).
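The coefficients above are plain Pearson correlations between paired rating vectors; a self-contained implementation (equivalent to the first return value of `scipy.stats.pearsonr`) for reference:

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length rating lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Perfectly aligned ratings give r = 1, perfectly opposed ratings give r = -1.
r_pos = pearson([1, 2, 3, 4], [2, 4, 6, 8])
r_neg = pearson([1, 2, 3], [3, 2, 1])
```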

Model vocabulary analysis

To assess the diversity of responses generated by the models, we calculated the percentage of unique n-grams and full responses present in the model outputs. Table 2 presents these values for KVMemNet and Sketch-Fill-A-R computed on the full validation set. The numbers show that the KVMemNet model clearly outperforms our model in terms of generating diverse and unique outputs, by a factor of 3-4x. However, we hypothesize that this additional diversity may lead to lower engagingness scores.
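One way to read the diversity metric is as the share of distinct n-grams among all n-grams a model emits over the validation set. A small sketch under that reading (whitespace tokenization is an assumption):

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def pct_unique(responses, n):
    """Percentage of distinct n-grams among all n-grams in the responses."""
    grams = [g for r in responses for g in ngrams(r.split(), n)]
    return 100 * len(set(grams)) / len(grams)

responses = ["do you like papaya ?", "do you like pizza ?", "i am a bee farmer"]
unigram_diversity = pct_unique(responses, 1)  # 11 distinct of 15 total unigrams
```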

Consistency over time

In order to evaluate the models' capacity to stay consistent with their previous statements, and thus implicitly their ability to utilize information present in the chat history, we compared how the consistency rating changed as the number of lines of the conversation increased. Figure 4 visualizes this metric both for our model and KVMemNet. For both models, consistency decreases as the chat history gets longer, indicating that the models have problems keeping track of their previous statements. When analyzing the linear trend we noticed that the decrease in performance is slower for the Sketch-Fill-A-R model. We hypothesize that this effect can be partially caused by the high diversity of sequences generated by KVMemNet, which in turn affects the model's ability to generate consistent conversation.

(a) KVMemNet
(b) Sketch-Fill-A-R
(c) Sketch-Fill-A-R: Human ratings vs question/no-question responses
Figure 4: Impact of conversation length on the consistency of outputs generated by the KVMemNet (left) and Sketch-Fill-A-R (middle). As conversation length increases (more dialogue turns) both models become less consistent, but KVMemNet degrades faster than Sketch-Fill-A-R. Right: impact of response containing a question on human ratings. Responses including questions tend to receive higher human ratings.

Effect of question responses

We hypothesize that for a conversation to be engaging, responses in chit-chat dialogue should be a mix of statements, where the model shares its persona information, and questions, where the model inquires about certain traits and information of the other agent. To confirm this intuition, we evaluated the effect that the presence of a question in the response has on the ratings coming from the judges. The results are presented in Figure 4(c). The study showed that there is a strong correlation between the model asking a question and the users rating the response as more engaging. Asking questions also has a small but positive influence on consistency and fluency.

To further analyze this aspect, we measured the frequency of questions in the set of 100 responses coming from the Sketch-Fill-A-R and KVMemNet models. We found that our model produced 49 question responses out of which 25 had both a statement and a question. In the same setting the KVMemNet produced 15 questions out of which only 1 contained a statement and a question. This insight could explain the gains on the engagingness ratings found by our human study.

5.2 Multi-turn User Study

To evaluate both models in the more challenging multi-turn setting, we collected conversations lasting 20 turns between each model and human users. Users were asked to score their conversations with the models on a scale from 1 (lowest) to 5 (highest) across the same dimensions as in the single-turn experiments. Table 8 shows the human ratings for both Sketch-Fill-A-R and KVMemNet. Both were judged as less fluent than in the single-turn case. This is likely due to the models having to respond to a range of conversation histories unseen during training.

Notably, Sketch-Fill-A-R outperformed KVMemNet on consistency, by a significantly larger margin (3.72 vs 2.15) than in the single-turn setting. This suggests that Sketch-Fill-A-R benefits from conditioning response generation on its persona-memory and so adheres more closely to responses that are compatible with its persona.

Further, Sketch-Fill-A-R is more engaging. This suggests that in the multi-turn setting, there also is a positive correlation between engagingness and consistency as in the single-turn case (see Appendix): consistent models can be more engaging as well.

Table 7 shows an example of KVMemNet's inconsistency. While every model utterance is fluent individually, KVMemNet noticeably contradicts itself in the context of previous utterances and frequently ignores the human responses (e.g., "i do not have any myself" after "my little girl"). We believe the lack of structure inherent in models built on vanilla Seq2Seq makes KVMemNet prone to this mistake. Table 6 shows Sketch-Fill-A-R conducting a more engaging conversation, with pertinent responses and questions. However, this structure can restrict Sketch-Fill-A-R, as sketches may be filled with incorrect persona traits (e.g., "i love papaya food ."). See the Appendix for more examples.

6 Discussion and Future Work

In our study we have identified several paths for future work. First, our results show that perplexity does not strongly correlate with human judgment of the quality of responses. Developing an automated metric that correlates well with human judgment is crucial as human evaluation is expensive, time consuming, and prone to inconsistencies. Secondly, despite outperforming other models in the multi-turn dialogue setting on consistency and engagement, our model has not reached human-like fluency. In order to demonstrate higher-level complex traits such as empathy, models must first master these lower-level abilities. Finally, correct usage of rare words and proper nouns leads to higher human scores. Existing models are unable to deal with out-of-vocabulary tokens and rare words gracefully, and incorporation of commonsense via external knowledge bases or other methods will be useful.

Ethical Implications

During experiments, we identified a number of ethical implications for future work. The Persona-Chat dataset was noted by some raters to contain potentially inappropriate statements (e.g., ”my wife spends all my money”) and is based in US culture (e.g., food, music, cars, names). It also lacked content to fail gracefully when it didn’t have an appropriate response (e.g., ”I’m sorry I don’t understand,” ”I don’t know”). As such, learned model responses were occasionally insensitive and confusing to human users.


References

  • D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
  • A. Bordes, Y. Boureau, and J. Weston (2016) Learning end-to-end goal-oriented dialog. arXiv preprint arXiv:1605.07683.
  • H. Chen, X. Liu, D. Yin, and J. Tang (2017) A survey on dialogue systems: recent advances and new frontiers. ACM SIGKDD Explorations Newsletter 19 (2), pp. 25–35.
  • K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • E. Dinan, V. Logacheva, V. Malykh, A. Miller, K. Shuster, J. Urbanek, D. Kiela, A. Szlam, I. Serban, R. Lowe, et al. (2019) The second conversational intelligence challenge (ConvAI2). arXiv preprint arXiv:1902.00098.
  • E. Dinan, S. Roller, K. Shuster, A. Fan, M. Auli, and J. Weston (2018) Wizard of Wikipedia: knowledge-powered conversational agents. arXiv preprint arXiv:1811.01241.
  • J. L. Elman (1990) Finding structure in time. Cognitive Science 14 (2), pp. 179–211.
  • J. Gao, M. Galley, L. Li, et al. (2019) Neural approaches to conversational AI. Foundations and Trends in Information Retrieval 13 (2-3), pp. 127–298.
  • M. Ghazvininejad, C. Brockett, M. Chang, B. Dolan, J. Gao, W. Yih, and M. Galley (2018) A knowledge-grounded neural conversation model. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • H. P. Grice (1991) Studies in the Way of Words. Harvard University Press.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
  • D. Jurafsky and J. H. Martin (2009) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, 2nd edition. Prentice Hall Series in Artificial Intelligence, Prentice Hall, Pearson Education International.
  • I. Kulikov, A. H. Miller, K. Cho, and J. Weston (2018) Importance of a search strategy in neural dialogue modelling. arXiv preprint arXiv:1811.00907.
  • A. Kumar, O. Irsoy, P. Ondruska, M. Iyyer, J. Bradbury, I. Gulrajani, V. Zhong, R. Paulus, and R. Socher (2016) Ask me anything: dynamic memory networks for natural language processing. In International Conference on Machine Learning, pp. 1378–1387.
  • J. Li, M. Galley, C. Brockett, G. P. Spithourakis, J. Gao, and B. Dolan (2016) A persona-based neural conversation model. arXiv preprint arXiv:1603.06155.
  • J. Pennington, R. Socher, and C. Manning (2014) GloVe: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543.
  • Q. Qian, M. Huang, H. Zhao, J. Xu, and X. Zhu (2017) Assigning personality/identity to a chatting machine for coherent conversation generation. arXiv preprint arXiv:1706.02861.
  • A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever (2018) Improving language understanding by generative pre-training. OpenAI.
  • I. V. Serban, R. Lowe, L. Charlin, and J. Pineau (2016) Generative deep neural networks for dialogue: a short review. arXiv preprint arXiv:1611.06216.
  • I. V. Serban, A. Sordoni, R. Lowe, L. Charlin, J. Pineau, A. Courville, and Y. Bengio (2017) A hierarchical latent variable encoder-decoder model for generating dialogues. In Thirty-First AAAI Conference on Artificial Intelligence.
  • A. Sordoni, M. Galley, M. Auli, C. Brockett, Y. Ji, M. Mitchell, J. Nie, J. Gao, and B. Dolan (2015) A neural network approach to context-sensitive generation of conversational responses. arXiv preprint arXiv:1506.06714.
  • D. Sperber and D. Wilson (1981) Irony and the use-mention distinction. Philosophy 3, pp. 143–184.
  • S. Sukhbaatar, J. Weston, R. Fergus, et al. (2015) End-to-end memory networks. In Advances in Neural Information Processing Systems, pp. 2440–2448.
  • I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pp. 3104–3112.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
  • O. Vinyals and Q. Le (2015) A neural conversational model. arXiv preprint arXiv:1506.05869.
  • R. S. Wallace (2009) The anatomy of ALICE. In Parsing the Turing Test, pp. 181–210.
  • J. Weizenbaum (1966) ELIZA—a computer program for the study of natural language communication between man and machine. Communications of the ACM 9 (1), pp. 36–45.
  • S. Welleck, J. Weston, A. Szlam, and K. Cho (2018) Dialogue natural language inference. arXiv preprint arXiv:1811.00671.
  • J. Weston, S. Chopra, and A. Bordes (2014) Memory networks. arXiv preprint arXiv:1410.3916.
  • D. Wilson and D. Sperber (2002) Relevance Theory. Blackwell.
  • C. Wu, R. Socher, and C. Xiong (2019) Global-to-local memory pointer networks for task-oriented dialogue. In International Conference on Learning Representations.
  • S. Zhang, E. Dinan, J. Urbanek, A. Szlam, D. Kiela, and J. Weston (2018) Personalizing dialogue agents: I have a dog, do you have pets too? arXiv preprint arXiv:1801.07243.
  • L. Zhou, J. Gao, D. Li, and H. Shum (2018) The design and implementation of XiaoIce, an empathetic social chatbot. arXiv preprint arXiv:1812.08989.
  • Y. Zhu, R. Kiros, R. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, and S. Fidler (2015) Aligning books and movies: towards story-like visual explanations by watching movies and reading books. arXiv preprint arXiv:1506.06724.