What makes a good conversation? How controllable attributes affect human judgments

02/22/2019 · by Abigail See, et al. · Stanford University and Facebook AI Research

A good conversation requires balance -- between simplicity and detail; staying on topic and changing it; asking questions and answering them. Although dialogue agents are commonly evaluated via human judgments of overall quality, the relationship between quality and these individual factors is less well-studied. In this work, we examine two controllable neural text generation methods, conditional training and weighted decoding, in order to control four important attributes for chitchat dialogue: repetition, specificity, response-relatedness and question-asking. We conduct a large-scale human evaluation to measure the effect of these control parameters on multi-turn interactive conversations on the PersonaChat task. We provide a detailed analysis of their relationship to high-level aspects of conversation, and show that by controlling combinations of these variables our models obtain clear improvements in human quality judgments.


1 Introduction

Neural generation models for dialogue, despite their ubiquity, are still poorly understood. Well-known problems, such as the genericness and repetitiveness of responses (Serban et al., 2016a), remain without a de facto solution. Strikingly, the factors that determine overall human judgments of conversation quality are almost entirely unexplored. Most work has been limited to the next-utterance prediction problem, whereas a multi-turn evaluation is necessary to assess the quality of a full conversation.

In this work we both (i) conduct a large-scale study to identify the fine-grained factors governing human judgments of full conversations, and (ii) develop models that apply our findings in practice, leading to state-of-the-art performance. Specifically, we identify and study eight aspects of conversation that can be measured by human judgments, while varying four types of low-level attributes that can be algorithmically controlled in neural models; see Figure 1. To control the low-level model attributes, we consider two general algorithms: conditional training, whereby control features are added to the sequence representations, and weighted decoding, where features are added to the decoding scoring function only at test time.

Figure 1: We manipulate four low-level attributes and measure their effect on human judgments of individual conversational aspects, as well as overall quality.

One major result of our findings is that existing work has ignored the importance of conversational flow, as standard models (i) repeat or contradict previous statements, (ii) fail to balance asking questions with other dialogue acts, and (iii) fail to balance specificity with brevity. Conducting experiments on the PersonaChat task (Zhang et al., 2018b), we obtain significantly higher engagingness scores than the baseline by optimizing control of repetition, specificity and question-asking over multiple turns. Using these findings, our best model matches the performance of the winning entry in the recent NeurIPS ConvAI2 competition, which used substantially more training data but had no control. All of our code, pretrained models, and full chatlogs will be released open-source.

2 Related Work

Dialogue

Dialogue evaluation is relatively well understood in goal-oriented tasks, where automated approaches can be coded by measuring task completion (Hastie, 2012; El Asri et al., 2017; Henderson et al., 2014; Wen et al., 2017; Bordes et al., 2017). Task success combined with dialogue cost can be linked to human judgments like user satisfaction via the PARADISE framework (Walker et al., 1997). However, this leaves unaddressed the large portion of human social communication where there is no explicit goal.

In chitchat tasks, which we study in this work, automatic metrics and their relation to human ratings are less well-understood. While word-overlap metrics are effective for question-answering and machine translation, for dialogue they have little to no correlation with human judgments (Venkatesh et al., 2017) – this is due to the open-ended nature of dialogue. There are more recent attempts to find better automatic approaches, such as adversarial evaluation (Li et al., 2017b) and learning a scoring model (Lowe et al., 2017), but their value is still unclear.

Nevertheless, a number of studies use only automated metrics, with no human study at all (Lowe et al., 2015; Serban et al., 2016b; Parthasarathi and Pineau, 2018). Other works do use human evaluations (Vinyals and Le, 2015; Venkatesh et al., 2017; Li et al., 2016b, a; Zhang et al., 2018b; Dinan et al., 2018), typically reporting just one type of judgment (such as quality or appropriateness), via either a Likert scale or pairwise comparisons. Most of those works consider only single-turn evaluations, often with a shortened dialogue history, rather than full multi-turn dialogue. Further, they offer no link between controllable factors of their models and improved results.

A more comprehensive evaluation strategy has been studied within the scope of the Alexa prize (Venkatesh et al., 2017; Guo et al., 2018) by considering multiple metrics designed to correlate well with human judgment. However, the metrics they use (engagement, domain coverage, coherence, topical diversity and conversational depth) are themselves higher-level factors of a conversation, and not easily controllable by a model. For this reason, they are less helpful in the direct construction of a model that is able to optimize them.

Controllable neural text generation

Researchers have proposed several approaches to control aspects of RNN-based natural language generation such as sentiment, length, speaker style and tense (Ficler and Goldberg, 2017; Ghazvininejad et al., 2017; Hu et al., 2017; Peng et al., 2018; Fan et al., 2018; Kikuchi et al., 2016). In particular, several works use control to tackle the same common sequence-to-sequence problems we address here (repetition, genericness and unrelated output), in the context of single-turn response generation (Li et al., 2017a; Shen et al., 2017; Xing et al., 2017; Wang et al., 2017; Zhang et al., 2018a; Zhou et al., 2017). However, in this work we focus on developing controls for, and human evaluation of, multi-turn interactive dialogue – this includes a new method (described in Section 5) to control attributes at the dialogue level rather than the utterance level.

In this work, we require a control method that is both general-purpose (one technique to simultaneously control many attributes) and easily tunable (the control setting is adjustable after training). Given these constraints, we study two control methods: conditional training (Fan et al., 2018; Kikuchi et al., 2016; Peng et al., 2018) and weighted decoding (Ghazvininejad et al., 2017). To our knowledge, this work is the first to systematically compare the effectiveness of two general-purpose control methods across several attributes.

3 The PersonaChat dataset

PersonaChat (Zhang et al., 2018b) is a chitchat dialogue task involving two participants (two humans or a human and a bot). Each participant is given a persona – a short collection of personal traits such as I’m left handed or My favorite season is spring – and is instructed to get to know the other by chatting naturally using their designated persona, for 6–8 turns. The training set contains 8939 conversations and 955 personas, collected via crowdworkers, plus 1000 conversations and 100 personas for validation, and a similar number in the hidden test set. The PersonaChat task was the subject of the NeurIPS 2018 ConvAI2 Challenge (http://convai.io/), in which competitors were first evaluated with respect to automatic metrics (perplexity, hits@1 and F1 score), and then with respect to human judgment via the question “How much did you enjoy talking to this user?” on a scale of 1–4.

4 Baseline model

Our baseline model is a 2-layer LSTM sequence-to-sequence model with attention. On any dialogue turn, the input to the encoder is the entire dialogue history (separated using unique speaker-identifying tokens), with the model’s own persona prepended. Conditioned on this input sequence, the decoder generates a response. Except when stated otherwise, all our models decode using beam search with beam size 20.

The word embedding matrix was initialized with 300-dimensional GloVe embeddings (Pennington et al., 2014), and the model was pretrained on a dataset of 2.5 million Twitter message-response pairs before being fine-tuned on PersonaChat, all trained within the ParlAI framework (Miller et al., 2017). On the PersonaChat validation set, the baseline model has a perplexity of 26.83 and F1 of 17.02, which would have placed us 4th out of 26 models in the competition. We attempt to improve over this baseline using control.

5 Controllable text generation methods

Suppose we have a sequence-to-sequence model which learns P(y|x), the conditional probability of a response y (the model’s next utterance) given input x (the context, which in our case includes the model’s own persona and the dialogue history).

Unlike most previous work, which controls attributes at the sentence level, we wish to control attributes of the output at the dialogue level. For example, to control question-asking, we provide a control setting at the beginning of each dialogue (e.g. 20% questions or 70% questions) rather than providing a control setting for each utterance (e.g. is a question or isn’t a question). With this approach, the sequence-to-sequence model is able to choose what value the controlled attribute should take for any particular utterance, but we are able to choose the overall distribution. We find that this approach works well – the sequence-to-sequence model is generally good at detecting when to ask a question, for example – and works better than the alternative of developing a separate method to decide, for each utterance, whether to ask a question. In what follows, we describe the two methods we use to control attributes of the output at the dialogue level.

5.1 Conditional Training (CT)

Conditional Training (Fan et al., 2018; Kikuchi et al., 2016; Peng et al., 2018) is a method to learn a sequence-to-sequence model P(y|x, z), where z is a discrete control variable. If the control attribute is naturally continuous (for example in our work, repetitiveness, specificity and response-relatedness), we use z to represent bucketed ranges. For a binary attribute like question-asking, z represents an overall probability.

To train a CT model, we first automatically annotate every (x, y) pair in the training set with the attribute we wish to control (for example, whether y contains a question mark). During training, we determine the corresponding z value for each example (for continuous attributes, this simply means sorting y into the correct bucket; for question-asking, see Section 6.4). The model takes x and z as input, and learns to produce y.

Each of the possible values of the control variable z is represented via an embedding; in all our experiments, the embedding has length 10. There are several possible ways to condition the sequence-to-sequence model on z – for example, appending z to the end of the input sequence, or using z as the START symbol for the decoder. We find it most effective to concatenate z to the decoder’s input on every step. To train a CT model conditioned on multiple controls z_1, ..., z_n, we simply concatenate the control embeddings.

Our CT models are initialized with the baseline parameters (the new decoder parameters are initialized with small random values), then fine-tuned on the control-conditional PersonaChat task until validation perplexity converges.
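To make the conditioning mechanism concrete, here is a minimal PyTorch-style sketch of a decoder that concatenates a learned control embedding to the word embedding at every decoding step. The module, dimension and parameter names are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class ControlledDecoder(nn.Module):
    """Sketch of a CT decoder: the control embedding for z is concatenated
    to the word embedding at every step (illustrative names and sizes)."""

    def __init__(self, vocab_size, emb_dim=300, hid_dim=1024,
                 num_control_values=10, ctrl_emb_dim=10):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, emb_dim)
        self.ctrl_emb = nn.Embedding(num_control_values, ctrl_emb_dim)
        # The decoder LSTM sees [word embedding ; control embedding] at each step
        self.lstm = nn.LSTM(emb_dim + ctrl_emb_dim, hid_dim,
                            num_layers=2, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, prev_tokens, z, hidden=None):
        # prev_tokens: (batch, seq_len) token ids; z: (batch,) control bucket ids
        w = self.word_emb(prev_tokens)            # (batch, seq_len, emb_dim)
        c = self.ctrl_emb(z).unsqueeze(1)         # (batch, 1, ctrl_emb_dim)
        c = c.expand(-1, w.size(1), -1)           # repeat the control on every step
        out, hidden = self.lstm(torch.cat([w, c], dim=-1), hidden)
        return self.out(out), hidden
```

Conditioning on several controls would simply mean concatenating several such embeddings before the LSTM input.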

5.2 Weighted Decoding (WD)

Weighted Decoding (Ghazvininejad et al., 2017) is a decoding method that increases or decreases the probability of words with certain features. The technique is applied only at test time, requiring no change to the training method. A disadvantage of WD is that it can only control attributes at the word-level; any utterance-level attribute must be broken down into word-level decoding features.

On the t-th step of decoding, a partial hypothesis y_{<t} = y_1, ..., y_{t-1} is expanded by computing the score for each possible next word w in the vocabulary:

score(w, y_{<t}; x) = score(y_{<t}; x) + log P_RNN(w | y_{<t}, x) + Σ_i w_i · f_i(w; y_{<t}, x)

Here, log P_RNN(w | y_{<t}, x) is the log-probability of the word w calculated by the RNN, score(y_{<t}; x) is the accumulated score of the already-generated words in the hypothesis y_{<t}, and f_i(w; y_{<t}, x) are decoding features with associated weights w_i.

A decoding feature f_i(w; y_{<t}, x) assigns a real value to the word w, in the context of the text generated so far y_{<t} and the context x. The feature can be continuous (e.g. the unigram probability of w), discrete (e.g. the number of characters in w), or binary (e.g. whether w starts with a given letter). A positive weight w_i increases the probability of words that score highly with respect to f_i; a negative weight decreases their probability.
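As a concrete illustration, the following sketch applies the scoring rule above to a single expansion step. The feature functions, their signatures, and the example interrogative-word feature are illustrative placeholders rather than the paper's exact code.

```python
def wd_score(partial_score, log_prob, word, hypothesis, context,
             features, weights):
    """Weighted decoding (sketch):
    score(w, y_<t; x) = score(y_<t; x) + log P_RNN(w | y_<t, x)
                        + sum_i w_i * f_i(w; y_<t, x).
    `features` is a list of callables f(word, hypothesis, context) -> float."""
    total = partial_score + log_prob
    for f, w_i in zip(features, weights):
        total += w_i * f(word, hypothesis, context)
    return total

# Example binary feature (illustrative): is the candidate an interrogative word?
QN_WORDS = {"how", "what", "when", "where", "which",
            "who", "whom", "whose", "why", "?"}

def is_qn_word(word, hypothesis, context):
    return 1.0 if word.lower() in QN_WORDS else 0.0
```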

Input: Yes, I’m studying law at the moment
Baseline Response: That sounds like a lot of fun!

(a) Weighted Decoding
  NIDF    Response
  0.6%    Oh………………………………..
  17.1%   That sounds like a lot of fun!
  18.3%   That sounds like a lot of fun. How long have you been studying?
  38.5%   I majored in practising my spiritual full time philosophy test
  71.9%   Oh wow! Merna jean isa paino yi hao hui bu acara sya gila […]

(b) Conditional Training
  NIDF    Response
  16.8%   Sounds like you are a great person!
  18.3%   So you are a law student?
  18.4%   That sounds like a lot of fun
  22.8%   That sounds like a rewarding job!
  24.4%   That sounds like a rewarding career!

Table 1: (a) Example of controlling specificity via Weighted Decoding. At the extremes, the model produces only the most rare or the most common tokens. (b) Example of controlling specificity via Conditional Training. A range of control settings is shown.

6 Controlling conversational attributes

In this section, we describe how we use weighted decoding and conditional training to control four attributes: repetition, specificity, response-relatedness and question-asking. We evaluate the effectiveness of both control methods via automatic metrics (i.e., measuring how well the attribute was controlled), and use our findings to select control methods and control settings to be explored further in the human evaluation (Section 8).

6.1 Repetition

Our baseline exhibits three common types of repetition, which we call external repetition (self-repetition across utterances), internal repetition (self-repetition within utterances), and partner repetition (repeating the conversational partner).

To control repetition with weighted decoding, we define five n-gram based decoding features (see Appendix D). Three of these features (extrep_bigram, intrep_bigram and partnerrep_bigram) identify repeating bigrams for the three repetition types. The other two features (extrep_unigram and intrep_unigram) identify repeating content words. By applying a negative weight to these features, we can reduce undesired repetition. (We also tried controlling repetition with conditional training, defining z as the bucketed maximum ROUGE-L precision between the response and the bot’s previous utterances. However, this method was unsuccessful because there are not enough repetitive examples in the training data for the model to learn the control; experimenting with data augmentation to solve this problem is an area for future work.) As we observe that repetition control is very important, for all other control experiments we control both repetition and the other attribute concurrently.
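The following sketch shows how the external-repetition features described above might be computed for a candidate word during decoding. The tokenization and the stopword list are simplified assumptions, not the paper's exact definitions.

```python
STOPWORDS = {"the", "a", "an", "i", "you", "is", "are", "to", "of", "and"}  # simplified

def extrep_bigram(word, hypothesis, prev_model_utterances):
    """1.0 if adding `word` to the hypothesis would create a bigram that
    already appears in a previous utterance by the model, else 0.0 (sketch)."""
    if not hypothesis:
        return 0.0
    candidate_bigram = (hypothesis[-1], word)
    for utt in prev_model_utterances:
        tokens = utt.split()
        if candidate_bigram in zip(tokens, tokens[1:]):
            return 1.0
    return 0.0

def extrep_unigram(word, hypothesis, prev_model_utterances):
    """1.0 if `word` is a non-stopword that already appears in a previous
    utterance by the model, else 0.0 (sketch)."""
    if word in STOPWORDS:
        return 0.0
    return 1.0 if any(word in utt.split() for utt in prev_model_utterances) else 0.0
```

The internal and partner variants would be analogous, checking the partial hypothesis or the partner's previous utterances instead.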

6.2 Specificity

Input: Do you go get coffee often
Baseline Response: I do, when I am not playing the piano.

  Weight   Sim     Weighted Decoding Response
  -10.0    -0.05   I am a musician.
  -5.0     -0.02   I do, when I am not playing the piano.
  0.0      -0.02   I do, when I am not playing the piano.
  5.0      0.40    I do, usually at starbucks.
  8.0      0.59    Not usually, especially when you drink latte.
  11.0     0.72    Not often, usually with drinks, espresso, latte, tea, etc.

Table 2: Example of controlling response-relatedness (cosine similarity to input) via Weighted Decoding.

Like many sequence-to-sequence models, our baseline frequently asks generic questions such as What music do you like? and gives dull, unspecific responses, e.g. I like all kinds of music.

We control specificity using Normalized Inverse Document Frequency (NIDF) as a measure of word rareness. The Inverse Document Frequency of a word w is IDF(w) = log(R / c_w), where R is the number of responses in the dataset, and c_w is the number of those responses that contain w. Normalized IDF (which ranges from 0 to 1) is

    NIDF(w) = (IDF(w) - min_idf) / (max_idf - min_idf)        (1)

where min_idf and max_idf are the minimum and maximum IDFs, taken over all words. To control specificity with weighted decoding, we use NIDF as a decoding feature. As shown in Table 1(a), this method produces reasonable outputs for weights within a certain range, but at the extremes the outputs are nonsensical. The boundary for nonsensical output differs from example to example.

To control specificity with conditional training, we define the specificity of an utterance y to be the mean NIDF of the words in y. Thus our control variable z is the mean NIDF of y, discretized into 10 equal-sized buckets. As shown in Table 1(b), this method gives outputs with a narrower NIDF range, but overall produces less nonsensical outputs.
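A minimal sketch of how NIDF could be computed from a corpus of tokenized responses, and how an utterance's mean NIDF could be bucketed for conditional training. All names are illustrative, and the bucketing here uses equal-width bins for simplicity, whereas the paper uses equal-sized buckets.

```python
import math
from collections import Counter

def build_nidf(responses):
    """NIDF(w) = (IDF(w) - min_idf) / (max_idf - min_idf), with
    IDF(w) = log(R / c_w), over a list of tokenized responses (sketch)."""
    R = len(responses)
    doc_counts = Counter(w for resp in responses for w in set(resp))
    idf = {w: math.log(R / c) for w, c in doc_counts.items()}
    lo, hi = min(idf.values()), max(idf.values())
    span = (hi - lo) or 1.0
    return {w: (v - lo) / span for w, v in idf.items()}

def specificity_bucket(utterance_tokens, nidf, num_buckets=10):
    """Bucket the mean NIDF of an utterance (equal-width bins for simplicity)."""
    scores = [nidf.get(w, 0.0) for w in utterance_tokens]
    mean_nidf = sum(scores) / max(len(scores), 1)
    return min(int(mean_nidf * num_buckets), num_buckets - 1)
```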

6.3 Response-relatedness

In conversation, it’s generally desirable to produce a response that is related to the partner’s last utterance; for example it is inappropriate to say Do you have any pets? in response to My grandfather died last month. To control response-relatedness with weighted decoding, we use a decoding feature equal to the cosine similarity between the GloVe embedding for the word w and the sentence embedding for the partner’s last utterance (which is a part of the context x). Here the sentence embedding for an utterance is a weighted average of the GloVe embeddings of its words, with the first principal component projected out; for full details, see Arora et al. (2017). We find that weighted decoding is effective to control the semantic relatedness of the model’s response to the partner’s last utterance (see Table 2). As before, we find that extreme weights lead to nonsensical output.
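Below is a simplified sketch of this decoding feature: it computes the cosine similarity between a candidate word's GloVe vector and an Arora-style sentence embedding of the partner's last utterance, omitting the principal-component removal step for brevity. Function and variable names are assumptions, not the paper's code.

```python
import numpy as np

def sentence_embedding(tokens, glove, a=1e-3, word_freq=None):
    """Simplified Arora-style sentence embedding: frequency-weighted average of
    word vectors (the principal-component removal step is omitted here)."""
    vecs = []
    for w in tokens:
        if w in glove:
            freq = word_freq.get(w, 0.0) if word_freq else 0.0
            vecs.append((a / (a + freq)) * glove[w])
    return np.mean(vecs, axis=0) if vecs else None

def response_relatedness(word, glove, partner_last_utt_emb):
    """Decoding feature: cosine similarity between the candidate word's
    embedding and the partner's last-utterance embedding (sketch)."""
    if word not in glove or partner_last_utt_emb is None:
        return 0.0
    v, u = glove[word], partner_last_utt_emb
    return float(np.dot(v, u) / (np.linalg.norm(v) * np.linalg.norm(u) + 1e-8))
```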

To control response-relatedness with conditional training, we define the control variable z to be the overall cosine similarity between the sentence embeddings of the partner’s last utterance and the model’s response y (again, discretized into buckets). However, we find this method ineffective – the CT model learns only a very weak connection between z and the semantic relatedness of the output (see Section 7 for more details).

6.4 Question-asking

Considerate chitchat requires a reciprocal asking and answering of questions – asking too few or too many can appear self-centered or nosy. We control question-asking in order to study these trade-offs.

To control question-asking with weighted decoding, we use a binary decoding feature which is equal to 1 if and only if the word w is in a pre-defined list of interrogative words (how, what, when, where, which, who, whom, whose, why, ?). We find this is a reasonably effective method to encourage or discourage questions, but with unintended side-effects: a negative weight can discourage valid non-question utterances that happen to contain interrogative words (such as I’m learning how to knit) and a positive weight can result in degenerate utterances (such as What??????? or Who? When? How?).

For conditional training, we regard an utterance y as containing a question if and only if y contains a question mark. We train our CT model on a control variable z with 11 possible values: {0, 1, ..., 10}. As discussed in Section 5, we wish to control question-asking at the distributional, dialogue level, rather than at the binary, utterance level. Thus the setting z means that the model should produce, on average, an utterance containing ‘?’ with probability z/10. During training we randomly assign examples to buckets such that each bucket is trained on examples with the correct proportion of questions (z/10), and all buckets contain the same number of training examples.
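One plausible way to realize this bucket assignment is sketched below: each training example is assigned a bucket z with probabilities weighted so that higher buckets see more questions. This is an illustrative reading of the description above, not the authors' exact procedure, and it does not strictly enforce equal bucket sizes.

```python
import random

def is_question(utterance):
    """The CT question label: an utterance counts as a question
    iff it contains a question mark."""
    return "?" in utterance

def assign_question_bucket(utterance, num_buckets=11, rng=random):
    """Heuristic bucket assignment (sketch): questions are more likely to land
    in high buckets and non-questions in low buckets, so that bucket z is
    trained on roughly z/10 questions."""
    z_values = list(range(num_buckets))          # z = 0, 1, ..., 10
    if is_question(utterance):
        weights = [z / 10 for z in z_values]     # target P(question | z) = z/10
    else:
        weights = [1 - z / 10 for z in z_values]
    return rng.choices(z_values, weights=weights, k=1)[0]
```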

For controlling question-asking, conditional training is preferable to weighted decoding for two reasons. Firstly, it allows us to achieve (close to) 0% questions, 100% questions, or anything in between, without introducing the risk of degenerate output. Secondly, directly controlling presence-of-a-question-mark avoids the pollution of our control variable that occurs when increasing or decreasing the probability of interrogative words. For these reasons, only the CT method is considered in the human evaluation.

7 Comparison of control methods

The previous section shows that conditional training and weighted decoding are both useful techniques, with different strengths and weaknesses.

The primary disadvantage of conditional training is that it is only as effective as the underlying learning algorithm, which must learn the connection between the control variable z and the target output y. In practice, we find the model can learn simple attributes of the output (such as the presence of ‘?’, and overall genericness), but not relationships between the input and output (such as semantic relatedness). By contrast, weighted decoding can force the desired feature to appear in the output by raising the weight arbitrarily high (though this may have unintended side-effects).

The primary disadvantage of weighted decoding is that it risks going off-distribution when the weight is too strong. By contrast, conditional training produces mostly well-formed, in-distribution outputs. This highlights the importance of learned control – it is safer to learn to produce output that both satisfies the control variable and is appropriate, than to alter the decoding process to satisfy the control variable, potentially trading off appropriateness in the process.

Other practical considerations include convenience (conditional training requires retraining; weighted decoding doesn’t, but is slower at test time), data (conditional training requires sufficient examples of the controllable attribute; weighted decoding can control any computable feature) and attribute definition (conditional training can control sentence-level attributes, but they must be discrete; weighted decoding requires word-level features, but they can be continuous).

8 Human evaluation results

In order to study the relationship between our controllable attributes and human judgments of conversational quality, we conduct a large-scale human evaluation of 28 model configurations, plus human-human conversations for comparison.

Approach

In our evaluation, a crowdworker chats with a model (or in the human-human case, another crowdworker) for six conversational turns, then answers eight multiple-choice questions which each capture a different aspect of conversational quality: avoiding repetition, interestingness, making sense, fluency, listening, inquisitiveness, humanness and engagingness. We also add a persona retrieval question, to match ConvAI2. Our evaluation is the same as the ConvAI2 Challenge evaluation, but more detailed (ConvAI2 includes only engagingness and persona retrieval). There are two other differences between our evaluation and ConvAI2’s: (1) we fix capitalization and spacing before showing the chatbot’s utterances to crowdworkers, while ConvAI2 shows the raw lowercase tokenized form, which we found interferes with fluency evaluations; (2) we conduct 6 dialogue turns, while ConvAI2 conducts 4–6, which was necessary to evaluate repetitiveness. The eight questions are Likert questions on a 1–4 scale, where higher is better, with two exceptions: avoiding repetition uses a 1–3 scale, as this was deemed to give clearer instructions, and inquisitiveness has an optimal score of 3 (1 and 2 represent too little question-asking, and 4 represents too much). For full details of the evaluation design, see Appendix B. For persona retrieval, the crowdworker is asked to select which of two possible personas was the model’s persona. In designing these questions, we aimed to capture the four aspects we expected to directly improve via control (avoiding repetition, interestingness, listening, inquisitiveness), two important error classes that we anticipated would be affected by our controls (fluency, making sense), and two overall quality measures (engagingness, humanness).

As in the ConvAI2 challenge, each of our 28 model configurations was evaluated by over 100 crowdworkers, and the results were adjusted for annotator variance via a Bayesian calibration (Kulikov et al., 2018). Full results for all our models are provided in Appendices G and H, and sample conversations are provided in Appendix C.

Figure 2: Human judgments of engagingness for the baselines and best controlled models (left); for different specificity control settings (middle); and for different question-asking control settings (right).
Figure 3: Human judgments of conversational aspects for the baselines and best controlled models.
  Model                  Win%    Top 3 reasons for preferring model
  Specificity WD ()      84.1%   More information; Better flow; More descriptive
  Specificity WD ()      75.5%   More information; They describe their life in more detail; Funny
  Specificity CT ()      56.2%   More information; Better flow; Seems more interested

Table 3: A/B tests comparing models to the repetition-controlled baseline on interestingness. We find all comparisons are significant (binomial test).

8.1 Main findings

We summarize the main findings of our human evaluation. As Figure 2 shows, controlling repetition, specificity and question-asking each leads to large improvements in engagingness over the greedy and beam-search baseline models. We find that controlling for multi-turn (self) repetition is essential and should be incorporated alongside other attribute control methods. We found no improvement for response-relatedness (shown in Appendix H). Our best controlled model matches the engagingness of the winning entry in the ConvAI2 competition, even though ours was trained on significantly less data.

To understand why we get these improvements, we consider the wider set of human judgments, shown in Figure 3. We find that repetition control leads to improvements across all our aspects of conversational quality. Specificity control shows improvements in interestingness and listening ability over the repetition-controlled baseline, which might explain the increased engagingness. The question-asking model is rated more inquisitive, which may explain why it is also more interesting and engaging.

Altogether, our evaluation clearly shows that controlling low-level attributes over multiple turns leads to improved overall quality.

8.2 Effect of controlled attributes

Repetition

We observe that self-repetition across utterances (external repetition) is by far the most severe form of repetition in our baseline. We evaluate several settings of the extrep_bigram weighted decoding feature, and find that an aggressive repetition-reduction setting (reducing bigram repetition rate to below gold data levels) is rated best. We also find that blocking repeated content words improves the avoiding repetition score (see Appendices E, F and G for full details).

As shown in Figure 2 (left) and Figure 3, our repetition-controlled model improves hugely over the beam search baseline in all metrics, and achieves close-to-human scores on all metrics except humanness. This striking result demonstrates that repetition is by far the biggest limiting quality factor for naive sequence-to-sequence dialogue agents. The result also emphasizes the importance of multi-turn dialogue evaluation to detect the problem. We refer to this model as the repetition-controlled baseline, and use it as a basis to control repetition in all remaining experiments.

Specificity

For our weighted decoding models, we find that the extreme settings (very generic and very specific) score poorly in engagingness due to the frequent presence of degenerate output – see Figure 2 (middle). We find that an intermediate setting (more specific than the repetition-controlled baseline and about as specific as the gold data) maximizes engagingness. As shown in Figure 2 (left) and Figure 3, this more-specific model is rated more interesting, engaging, and a better listener than the repetition-controlled baseline, but at the cost of reduced fluency and making sense. For further discussion of the interestingness of our specificity models, see Section 8.3.

Response-relatedness

We evaluated several response-relatedness control settings and found that none scored better than having no response-relatedness control at all – see Appendix H. This is surprising – prior to running the human evaluation, we annotated 100 examples ourselves to determine the best control settings. While we identified a more responsive setting as less likely than the uncontrolled model to ignore the user, crowdworkers rated it as a slightly worse listener than the uncontrolled model. One explanation for this discrepancy is that the more responsive model takes more risks, using more rare words (0.197 NIDF, up from 0.177), and thus receiving a lower makes-sense score (3.41, down from 3.70). We hypothesize that, compared to us, the crowdworkers are less tolerant of slightly nonsensical output, and more tolerant of generic unrelated utterances.

Question-asking

As shown in Figure 2 (right), a question-asking rate of 65.7% maximizes engagingness. This model, which asks more questions than both the repetition-controlled baseline (50.0%) and humans (36.8%), brings us even closer to human-level engagingness – see Figure 2 (left). Although approximately 65.7% question-asking is the most engaging rate, a lower rate (48.9%) is rated the best listener. Lastly, we find that although asking too many questions is less engaging, most crowdworkers will not criticize a chatbot that asks questions on every turn: only 11.9% of crowdworkers judged the boosted setting, which asks questions in 99% of turns, as asking too many questions. (Though this conclusion may hold true for the PersonaChat task – a synthetic chatting task that instructs participants to get to know each other – in real-life social conversations, incessant question-asking may be less tolerated.) For full details of these scores, see Appendices F and H.

8.3 A/B tests for interestingness

Though our more-specific models yielded significant improvements in engagingness, we were surprised that they did not yield clearer improvements in interestingness. To investigate further, we conducted an A/B interestingness evaluation of three specificity-controlled models, compared to the repetition-controlled baseline.

Crowdworkers were shown two conversations (from the main human evaluation) and asked to choose which model was more interesting (see Appendix A for details). We collected 500 samples per comparison, plus 200 additional human vs. repetition-controlled samples, which were used for quality-control filtering. After discarding low-quality crowdworkers, we have roughly 300 evaluations per comparison; inter-annotator agreement was measured with Cohen’s kappa.

As shown in Table 3, all three models were rated significantly more interesting than the repetition-controlled baseline. This convincingly shows that producing utterances with more rare words is a valid strategy to improve interestingness.

We have two explanations for why these interestingness differences did not materialize in our main evaluation. Firstly, interestingness is a particularly subjective metric (unlike more tangible metrics such as avoiding repetition and making sense) – this makes it hard to calibrate across crowdworkers. Secondly, we suspect that in our original evaluation, the crowdworkers may have evaluated the interestingness of the task rather than the chatbot. This could account for why subtle increases in conversational ability did not result in higher interestingness ratings – the PersonaChat task itself has a natural interestingness limit.

9 Conclusion

What makes a good conversation?

A good conversation is about trade-offs: we showed in our large-scale evaluation that appropriate control of repetition, specificity and question-asking leads to large improvements in human judgments. Modeling these aspects explicitly, in contrast to black-box neural generation systems, allows us to precisely study and control the relationship between low-level attributes and high-level dialogue quality.

While neural models are usually trained to predict only the next utterance, humans judge a conversation as a whole, not just as the sum of its turns. Optimizing for human judgments of a good conversation is made viable through control variables.

Outlook

While humanness has long been perceived as the end-goal of dialogue research, engagingness is arguably the more important metric. The Turing test is essentially proof-by-contradiction: a model can be caught out on a single turn. Engagingness, on the other hand, is a more forgiving metric, involving interestingness, listening, inquisitiveness, fluency and making sense, measured over all the turns.

While our models do well with engagingness and our other measured aspects, they fall markedly short in humanness judgments. How to bridge that gap, using controllable aspects of dialogue, is an open problem that constitutes an interesting direction for future work in dialogue research.

References

Supplementary Material

Appendix A Screenshots of human evaluation interface

Figure 4: Screenshot of the Task Description
Figure 5: Screenshot of the chat UI, talking with the beam search baseline model.
Figure 6: Screenshot of the A/B test UI, comparing a human-human conversation (left) and a Repetition-controlled baseline model (right).

Appendix B Human evaluation questionnaire design

Here are the questions and multiple-choice options used in the human evaluation, in the order presented:

[Engagingness] How much did you enjoy talking to this user? (Not at all / A little / Somewhat / A lot)
[Interestingness] How interesting or boring did you find this conversation? (Very boring / A little boring / A little interesting / Very interesting)
[Inquisitiveness] How much did the user try to get to know you? (Didn’t ask about me at all / Asked about me some / Asked about me a good amount / Asked about me too much)
[Listening] How much did the user seem to pay attention to what you said? (Always ignored what I said / Mostly ignored what I said / Mostly paid attention to what I said / Always paid attention to what I said)
[Avoiding Repetition] How repetitive was this user? (Repeated themselves over and over / Sometimes said the same thing twice / Always said something new)
[Fluency] How naturally did this user speak English? (Very unnatural / Mostly unnatural / Mostly natural / Very natural)
[Making sense] How often did this user say something which did NOT make sense? (Never made any sense / Most responses didn’t make sense / Some responses didn’t make sense / Everything made perfect sense)
[Humanness] Do you think this user is a bot or a human? (Definitely a bot / Probably a bot / Probably a human / Definitely a human)
[Persona retrieval] Which prompt (character) do you think the other user was given for this conversation? (Respondent chooses one of two provided personas)

Appendix C Example conversations from human evaluation

Figure 7: Example conversations with (a) Baseline, (b) Repetition-controlled baseline, (c) Question-controlled CT (), and (d) Specificity-controlled WD ().

Appendix D Repetition-control decoding features

  Feature              Condition
  extrep_bigram        Adding w to the hypothesis would create a 2-gram that appears in a previous utterance by the model
  extrep_unigram       w is a non-stopword and w appears in a previous utterance by the model
  intrep_bigram        Adding w to the hypothesis would create a 2-gram that appears earlier in the hypothesis
  intrep_unigram       w is a non-stopword and w appears earlier in the hypothesis
  partnerrep_bigram    Adding w to the hypothesis would create a 2-gram that appears in a previous utterance by the partner

Table 4: We define five binary features for controlling different types of repetition via weighted decoding. Each feature depends on the candidate word w, the partial hypothesis, and the dialogue context. Each of these features is equal to 1 if and only if the condition on the right is true; otherwise 0.

Appendix E Control settings for all configurations

Repetition Specificity Response-rel Questions
External Internal Partner Rep.
Bigram Unigram Bigram Unigram Bigram NIDF cos sim Has ‘?’
Baselines:
Greedy Search
Beam Search (beam size 20)
Repetition control (WD):
Extrep bigram -0.5 wt -0.5
Extrep bigram -1.25 wt -1.25
Extrep bigram -3.5 wt -3.5
Extrep bigram -inf wt -
Repetition-Controlled Baseline wt -3.5 wt - wt -
Specificity control (WD)
Specificity-controlled WD -10 wt -3.5 wt - wt - wt -10
Specificity-controlled WD -4 wt -3.5 wt - wt - wt -4
Specificity-controlled WD 4 wt -3.5 wt - wt - wt 4
Specificity-controlled WD 6 wt -3.5 wt - wt - wt 6
Specificity-controlled WD 8 wt -3.5 wt - wt - wt 8
Specificity control (CT)
Specificity-controlled CT 0 wt -3.5 wt - wt -
Specificity-controlled CT 2 wt -3.5 wt - wt -
Specificity-controlled CT 4 wt -3.5 wt - wt -
Specificity-controlled CT 7 wt -3.5 wt - wt -
Specificity-controlled CT 9 wt -3.5 wt - wt -
Response-rel control (WD)
Response-controlled WD -10 wt -3.5 wt - wt - wt - wt - wt -10
Response-controlled WD 5 wt -3.5 wt - wt - wt - wt - wt 5
Response-controlled WD 10 wt -3.5 wt - wt - wt - wt - wt 10
Response-controlled WD 13 wt -3.5 wt - wt - wt - wt - wt 13
Question control (CT)
Question-controlled CT 0 wt -3.5 wt - wt -
Question-controlled CT 1 wt -3.5 wt - wt -
Question-controlled CT 4 wt -3.5 wt - wt -
Question-controlled CT 7 wt -3.5 wt - wt -
Question-controlled CT 10 wt -3.5 wt - wt -
Question-controlled CT 10 (boost) wt 0* wt - wt -
Table 5: Control settings for all configurations that were human-evaluated. ‘wt’ means the weight used for a weighted decoding feature and ‘z=’ means the setting (i.e. bucket) for the control variable in conditional training.

* In the setting Question-controlled CT 10 (boost), extrep_bigram is not used for decoding during beam search, but it is used to rerank the candidates after beam search.

Note that the Response-controlled models additionally introduce internal bigram and partner bigram blocks on emitted utterances. We found that without these additional controls, models tended to parrot their partner’s last utterance. In Table 8, we find this outperforms our canonical Repetition-controlled baseline, but the initial decision on which to use for other control methods was made in an early pilot study.

Appendix F Automatic metrics for all configurations

Repetition Specificity Response-rel Questions
External Internal Partner Rep.
Bigram Unigram Bigram Unigram Bigram NIDF cos sim Has ‘?’
Gold data and baselines:
Gold Data 4.65% 9.62% 0.38% 0.97% 5.10% 0.2119 0.1691 28.80%
Greedy search 35.88% 36.31% 8.08% 10.59% 12.20% 0.1688 0.1850 6.46%
Beam search (beam size 20) 46.85% 44.15% 0.32% 0.61% 12.90% 0.1662 0.0957 80.87%
Repetition control (WD):
Extrep bigram WD -0.5 19.70% 16.85% 0.26% 0.62% 11.93% 0.1730 0.1348 73.04%
Extrep bigram WD -1.25 4.62% 4.79% 0.40% 0.89% 10.61% 0.1763 0.1504 61.22%
Extrep bigram WD -3.5 0.75% 4.61% 0.47% 0.94% 9.89% 0.1771 0.1681 48.89%
Extrep bigram WD -inf 0.00% 4.74% 0.51% 1.05% 9.56% 0.1780 0.1711 45.98%
Repetition-controlled baseline 0.73% 0.00% 0.17% 0.00% 9.55% 0.1766 0.1676 49.98%
Specificity control (WD)
Specificity-controlled WD -10 0.14% 0.00% 10.59% 0.00% 8.70% 0.1107 0.0994 33.55%
Specificity-controlled WD -4 0.65% 0.00% 1.98% 0.00% 9.95% 0.1501 0.1398 44.92%
Specificity-controlled WD 4 0.15% 0.00% 0.19% 0.00% 7.54% 0.2121 0.1972 45.53%
Specificity-controlled WD 6 0.07% 0.00% 0.13% 0.00% 6.50% 0.2546 0.2040 39.37%
Specificity-controlled WD 8 0.01% 0.00% 0.10% 0.00% 3.40% 0.4035 0.1436 26.68%
Specificity control (CT)
Specificity-controlled CT 0 0.60% 0.00% 0.20% 0.00% 9.05% 0.1478 0.1522 48.75%
Specificity-controlled CT 2 0.28% 0.00% 0.10% 0.00% 8.37% 0.1772 0.1833 50.57%
Specificity-controlled CT 4 0.12% 0.00% 0.08% 0.00% 7.90% 0.1921 0.1877 29.46%
Specificity-controlled CT 7 0.02% 0.00% 0.14% 0.00% 8.17% 0.2156 0.1955 16.51%
Specificity-controlled CT 9 0.01% 0.00% 0.11% 0.00% 8.01% 0.2462 0.1990 8.50%
Response-rel control (WD)
Response-controlled WD -10 0.13% 0.00% 0.00% 0.00% 0.00% 0.1914 -0.0921 25.71%
Response-controlled WD 5 0.15% 0.00% 0.00% 0.00% 0.00% 0.1973 0.4360 39.78%
Response-controlled WD 10 0.05% 0.00% 0.00% 0.00% 0.00% 0.2535 0.6653 27.56%
Response-controlled WD 13 0.02% 0.00% 0.00% 0.00% 0.00% 0.2999 0.7251 20.47%
Question control (CT)
Question-controlled CT 0 0.06% 0.00% 0.19% 0.00% 9.20% 0.1871 0.1753 2.01%
Question-controlled CT 1 0.09% 0.00% 0.19% 0.00% 8.66% 0.1844 0.1722 17.33%
Question-controlled CT 4 0.40% 0.00% 0.25% 0.00% 8.53% 0.1794 0.1713 48.88%
Question-controlled CT 7 0.80% 0.00% 0.17% 0.00% 8.48% 0.1771 0.1724 65.65%
Question-controlled CT 10 1.27% 0.00% 0.16% 0.00% 8.48% 0.1761 0.1728 79.67%
Question-controlled CT 10 (boost)* 7.64% 0.00% 0.03% 0.00% 10.76% 0.1701 0.1651 99.54%
Table 6: Automatic metrics (computed over validation set) for all model configurations that were human-evaluated.

*The purpose of the Question-controlled CT 10 (boost) setting is to achieve 100% question-asking rate. This is necessary because the Question-controlled CT 10 setting achieves only 79.67% questions, due to the interference of the extrep_bigram control. The 10 (boost) setting relaxes the repetition control in order to achieve 99.54% question-asking at the cost of slightly increased external bigram repetition.

Appendix G Human evaluation results for all configurations

Model Avoiding Rep. Engage Fluency Humanness Inquisitive Interesting Listening Make Sense Persona
Human 2.90 0.39 3.31 0.90 3.66 0.71 3.40 0.80 2.63 0.63 3.23 0.83 3.64 0.63 3.84 0.52 0.92 0.27
Greedy search baseline 2.16 0.72 2.31 1.08 3.20 0.81 1.78 0.90 2.00 0.81 2.36 0.98 2.78 0.84 3.33 0.75 0.87 0.34
Beam search baseline 2.14 0.72 2.35 1.01 3.23 0.93 1.81 0.87 2.50 0.72 2.35 0.98 2.63 0.85 3.40 0.77 0.77 0.42
Extrep bigram -0.5 2.66 0.56 2.56 0.92 3.57 0.64 2.19 0.94 2.67 0.62 2.61 0.87 3.08 0.78 3.60 0.57 0.75 0.43
Extrep bigram -1.25 2.84 0.39 2.91 0.90 3.59 0.64 2.32 0.98 2.63 0.60 2.86 0.89 3.21 0.71 3.64 0.62 0.72 0.45
Extrep bigram -3.5 2.90 0.30 2.95 0.86 3.73 0.50 2.45 1.03 2.55 0.61 2.88 0.80 3.27 0.79 3.68 0.49 0.80 0.40
Extrep bigram -inf 2.82 0.43 2.96 0.86 3.64 0.58 2.40 0.96 2.65 0.69 2.86 0.82 3.31 0.69 3.66 0.59 0.91 0.29
Repetition-controlled baseline 2.89 0.39 2.89 0.89 3.66 0.56 2.50 0.99 2.70 0.64 2.96 0.92 3.25 0.71 3.68 0.54 0.87 0.34
Question-controlled CT 0 2.95 0.25 2.92 0.90 3.70 0.54 2.49 0.97 2.48 0.72 2.85 0.93 3.29 0.69 3.56 0.66 0.86 0.35
Question-controlled CT 1 2.88 0.33 2.94 0.93 3.59 0.66 2.47 0.95 2.52 0.69 2.85 0.90 3.32 0.73 3.63 0.55 0.85 0.36
Question-controlled CT 4 2.88 0.38 2.88 0.94 3.59 0.73 2.42 1.07 2.55 0.66 2.82 0.85 3.37 0.74 3.63 0.59 0.84 0.37
Question-controlled CT 7 2.88 0.37 3.07 0.90 3.67 0.54 2.42 0.98 2.75 0.58 2.97 0.84 3.23 0.76 3.53 0.76 0.80 0.40
Question-controlled CT 10 2.74 0.46 2.90 0.93 3.70 0.50 2.43 1.04 2.71 0.57 2.72 0.88 3.12 0.73 3.59 0.66 0.79 0.41
Question-controlled CT 10 (boost) 2.76 0.49 2.84 0.94 3.60 0.64 2.26 0.97 2.94 0.57 2.83 0.94 3.18 0.80 3.52 0.67 0.72 0.45
Specificity-controlled CT 0 2.83 0.40 2.96 0.93 3.62 0.58 2.42 0.99 2.60 0.56 2.86 0.89 3.29 0.70 3.66 0.60 0.72 0.45
Specificity-controlled CT 2 2.90 0.36 2.78 1.00 3.60 0.64 2.37 0.93 2.66 0.66 2.80 0.96 3.14 0.77 3.50 0.63 0.81 0.39
Specificity-controlled CT 4 2.92 0.27 2.81 0.88 3.65 0.59 2.34 1.02 2.57 0.62 2.80 0.78 3.25 0.78 3.50 0.66 0.86 0.35
Specificity-controlled CT 7 2.89 0.32 3.00 0.94 3.64 0.67 2.53 1.03 2.56 0.66 2.90 0.90 3.34 0.70 3.59 0.60 0.82 0.39
Specificity-controlled CT 9 2.90 0.35 2.83 0.87 3.61 0.62 2.40 0.97 2.31 0.74 2.84 0.83 3.07 0.81 3.58 0.56 0.88 0.32
Specificity-controlled WD -10 2.85 0.43 2.43 0.99 3.34 0.83 2.15 0.91 2.31 0.69 2.38 0.94 3.03 0.75 3.33 0.70 0.71 0.45
Specificity-controlled WD -4 2.90 0.30 2.78 0.95 3.55 0.63 2.41 0.92 2.52 0.66 2.64 0.93 3.28 0.73 3.56 0.62 0.82 0.38
Specificity-controlled WD 4 2.95 0.21 2.99 0.86 3.65 0.55 2.49 0.90 2.65 0.55 3.00 0.78 3.37 0.59 3.63 0.50 0.93 0.25
Specificity-controlled WD 6 2.93 0.26 2.96 0.90 3.52 0.76 2.41 1.04 2.58 0.66 3.06 0.80 3.24 0.76 3.50 0.66 0.93 0.26
Specificity-controlled WD 8 2.78 0.52 2.40 1.23 2.67 1.25 1.86 0.97 2.03 0.87 2.55 1.14 2.61 1.05 2.91 0.91 0.92 0.28
Response-related controlled WD -10 2.86 0.44 2.48 0.98 3.42 0.74 2.02 0.93 2.38 0.75 2.53 0.94 2.84 0.80 3.14 0.75 0.91 0.29
Response-related controlled WD 0 2.96 0.23 3.01 0.90 3.72 0.54 2.73 1.00 2.56 0.67 2.92 0.84 3.37 0.72 3.73 0.52 0.82 0.38
Response-related controlled WD 5 2.90 0.33 2.88 0.90 3.51 0.63 2.41 1.01 2.53 0.65 2.85 0.90 3.27 0.73 3.49 0.63 0.82 0.39
Response-related controlled WD 10 2.78 0.43 2.39 1.04 3.06 0.90 1.97 0.99 2.22 0.67 2.57 1.01 3.03 0.76 3.16 0.63 0.75 0.43
Response-related controlled WD 13 2.71 0.57 2.10 1.13 2.54 1.12 1.81 1.07 2.14 0.84 2.33 1.06 2.69 0.83 2.70 0.88 0.62 0.49
Table 7: Raw scores (mean ± std.) for all models and human evaluation metrics.

Model Avoiding Rep. Engage Fluency Humanness Inquisitive Interesting Listening Make Sense
Human 2.79 0.12 3.04 0.11 3.36 0.12 3.35 0.11 2.44 0.12 2.92 0.11 3.32 0.13 3.68 0.11
Greedy search baseline 2.08 0.10 2.24 0.11 3.03 0.10 1.75 0.12 1.95 0.10 2.29 0.13 2.62 0.10 3.23 0.10
Beam search baseline 2.08 0.11 2.29 0.11 3.09 0.13 1.71 0.13 2.42 0.11 2.29 0.14 2.47 0.12 3.35 0.13
Extrep bigram -0.5 2.62 0.10 2.54 0.12 3.35 0.12 2.13 0.11 2.63 0.11 2.56 0.11 2.93 0.11 3.48 0.11
Extrep bigram -1.25 2.78 0.09 2.82 0.13 3.40 0.12 2.27 0.12 2.54 0.09 2.76 0.10 3.05 0.11 3.53 0.14
Extrep bigram -3.5 2.83 0.11 2.93 0.10 3.56 0.10 2.43 0.11 2.47 0.11 2.83 0.10 3.14 0.10 3.62 0.12
Extrep bigram -inf 2.74 0.11 2.87 0.14 3.49 0.12 2.32 0.13 2.56 0.11 2.75 0.12 3.13 0.12 3.59 0.12
Repetition-controlled baseline 2.86 0.12 2.82 0.12 3.53 0.10 2.40 0.11 2.62 0.13 2.84 0.12 3.10 0.11 3.58 0.14
Question-controlled CT 0 2.87 0.12 2.84 0.13 3.51 0.10 2.46 0.11 2.36 0.09 2.76 0.09 3.10 0.10 3.49 0.12
Question-controlled CT 1 2.82 0.11 2.88 0.11 3.42 0.10 2.46 0.12 2.47 0.11 2.79 0.13 3.14 0.11 3.55 0.10
Question-controlled CT 4 2.78 0.12 2.88 0.10 3.47 0.11 2.40 0.09 2.53 0.13 2.83 0.13 3.24 0.11 3.59 0.10
Question-controlled CT 7 2.81 0.10 2.99 0.11 3.54 0.09 2.35 0.11 2.66 0.12 2.92 0.12 3.11 0.10 3.47 0.10
Question-controlled CT 10 2.67 0.13 2.87 0.11 3.52 0.12 2.35 0.12 2.63 0.12 2.66 0.10 2.94 0.11 3.53 0.12
Question-controlled CT 10 (boost) 2.68 0.12 2.74 0.09 3.42 0.12 2.19 0.13 2.79 0.11 2.74 0.11 3.00 0.12 3.45 0.13
Specificity-controlled WD -10 2.76 0.11 2.41 0.12 3.19 0.12 2.15 0.11 2.28 0.13 2.35 0.12 2.89 0.11 3.28 0.12
Specificity-controlled WD -4 2.83 0.10 2.76 0.12 3.37 0.10 2.36 0.11 2.46 0.11 2.62 0.12 3.14 0.09 3.52 0.11
Specificity-controlled WD 0 2.86 0.12 2.82 0.12 3.53 0.10 2.40 0.11 2.62 0.13 2.84 0.12 3.10 0.11 3.58 0.14
Specificity-controlled WD 4 2.84 0.10 2.96 0.12 3.45 0.13 2.44 0.12 2.56 0.09 2.94 0.11 3.20 0.10 3.54 0.11
Specificity-controlled WD 6 2.81 0.09 2.91 0.10 3.34 0.09 2.31 0.11 2.53 0.12 2.93 0.12 3.09 0.10 3.41 0.12
Specificity-controlled WD 8 2.70 0.11 2.39 0.12 2.54 0.12 1.80 0.13 2.00 0.10 2.49 0.12 2.47 0.10 2.87 0.11
Specificity-controlled CT 0 2.79 0.10 2.93 0.09 3.44 0.12 2.38 0.11 2.56 0.12 2.84 0.12 3.12 0.13 3.61 0.11
Specificity-controlled CT 2 2.78 0.12 2.74 0.11 3.39 0.13 2.31 0.13 2.56 0.13 2.74 0.12 2.99 0.11 3.47 0.10
Specificity-controlled CT 4 2.82 0.10 2.80 0.13 3.44 0.14 2.32 0.13 2.51 0.12 2.78 0.15 3.09 0.13 3.46 0.13
Specificity-controlled CT 7 2.81 0.12 2.91 0.13 3.43 0.11 2.45 0.10 2.49 0.11 2.81 0.12 3.15 0.12 3.55 0.11
Specificity-controlled CT 9 2.80 0.13 2.78 0.10 3.41 0.12 2.35 0.13 2.28 0.11 2.79 0.11 2.91 0.11 3.51 0.12
Response-related controlled WD -10 2.77 0.12 2.45 0.12 3.26 0.11 1.96 0.10 2.31 0.12 2.47 0.12 2.73 0.11 3.12 0.12
Response-related controlled WD 0 2.87 0.12 2.97 0.11 3.55 0.09 2.62 0.11 2.48 0.10 2.88 0.12 3.21 0.09 3.70 0.10
Response-related controlled WD 5 2.79 0.10 2.83 0.09 3.35 0.12 2.40 0.12 2.51 0.13 2.80 0.13 3.13 0.12 3.41 0.12
Response-related controlled WD 10 2.74 0.11 2.42 0.12 2.93 0.11 1.95 0.12 2.20 0.12 2.56 0.12 2.90 0.12 3.12 0.10
Response-related controlled WD 13 2.63 0.12 2.06 0.11 2.40 0.09 1.74 0.11 2.07 0.11 2.25 0.12 2.49 0.14 2.63 0.10
Table 8: Calibrated scores (mean ± std.) for each of the models and human evaluation metrics.

Appendix H Plots of human evaluation results for all configurations

Figure 8: Calibrated human evaluation scores for all models.