Neural generation models for dialogue, despite their ubiquity, are still poorly understood. Well known problems, such as the genericness and repetitiveness of responses (Serban et al., 2016a), remain without a de facto solution. Strikingly, the factors that determine overall human judgments of conversation quality are almost entirely unexplored. Most works have been limited to the next utterance prediction problem, whereas a multi-turn evaluation is necessary to evaluate the quality of a full conversation.
In this work we both (i) conduct a large-scale study to identify the fine-grained factors governing human judgments of full conversations, and (ii) develop models that apply our findings in practice, leading to state-of-the-art performance. Specifically, we identify and study eight aspects of conversation that can be measured by human judgments, while varying four types of low-level attributes that can be algorithmically controlled in neural models; see Figure 1. To control the low-level model attributes, we consider two general algorithms: conditional training, whereby control features are added to the sequence representations, and weighted decoding, where features are added to the decoding scoring function only at test time.
One major result of our findings is that existing work has ignored the importance of conversational flow, as standard models (i) repeat or contradict previous statements, (ii) fail to balance asking questions with other dialogue acts, and (iii) fail to balance specificity with brevity. Conducting experiments on the PersonaChat task (Zhang et al., 2018b), we obtain significantly higher engagingness scores than the baseline by optimizing control of repetition, specificity and question-asking over multiple turns. Our best model matches the performance of the winning entry in the recent NeurIPS ConvAI2 competition, which was trained on substantially more data but had no control mechanisms. All of our code, pretrained models, and full chat logs will be released open-source.
2 Related Work
Dialogue evaluation is relatively well understood in goal-oriented tasks, where automated approaches can be coded by measuring task completion (Hastie, 2012; El Asri et al., 2017; Henderson et al., 2014; Wen et al., 2017; Bordes et al., 2017). Task success combined with dialogue cost can be linked to human judgments like user satisfaction via the PARADISE framework (Walker et al., 1997). However, this leaves unaddressed the large portion of human social communication where there is no explicit goal.
In chitchat tasks, which we study in this work, automatic metrics and their relation to human ratings are less well-understood. While word-overlap metrics are effective for question-answering and machine translation, for dialogue they have little to no correlation with human judgments (Venkatesh et al., 2017) – this is due to the open-ended nature of dialogue. There are more recent attempts to find better automatic approaches, such as adversarial evaluation (Li et al., 2017b) and learning a scoring model (Lowe et al., 2017), but their value is still unclear.
Nevertheless, a number of studies only use automated metrics, with no human study at all (Lowe et al., 2015; Serban et al., 2016b; Parthasarathi and Pineau, 2018). Other works do use human evaluations (Vinyals and Le, 2015; Venkatesh et al., 2017; Li et al., 2016a,b; Zhang et al., 2018b; Dinan et al., 2018), typically reporting just one type of judgment (either quality or appropriateness) via a Likert scale or pairwise comparisons. Most of those works only consider single-turn evaluations, often with a shortened dialogue history, rather than full multi-turn dialogue. Further, they offer no link between controllable factors of their models and improved results.
A more comprehensive evaluation strategy has been studied within the scope of the Alexa prize (Venkatesh et al., 2017; Guo et al., 2018) by considering multiple metrics designed to correlate well with human judgment. However, the metrics they use (engagement, domain coverage, coherence, topical diversity and conversational depth) are themselves higher-level factors of a conversation, and not easily controllable by a model. For this reason, they are less helpful in the direct construction of a model that is able to optimize them.
Controllable neural text generation
Researchers have proposed several approaches to control aspects of RNN-based natural language generation such as sentiment, length, speaker style and tense (Ficler and Goldberg, 2017; Ghazvininejad et al., 2017; Hu et al., 2017; Peng et al., 2018; Fan et al., 2018; Kikuchi et al., 2016). In particular, several works use control to tackle the same common sequence-to-sequence problems we address here (repetition, genericness and unrelated output), in the context of single-turn response generation (Li et al., 2017a; Shen et al., 2017; Xing et al., 2017; Wang et al., 2017; Zhang et al., 2018a; Zhou et al., 2017). However, in this work we focus on developing controls for, and human evaluation of, multi-turn interactive dialogue – this includes a new method (described in Section 5) to control attributes at the dialogue level rather than the utterance level.
In this work, we require a control method that is both general-purpose (one technique to simultaneously control many attributes) and easily tunable (the control setting is adjustable after training). Given these constraints, we study two control methods: conditional training (Fan et al., 2018; Kikuchi et al., 2016; Peng et al., 2018) and weighted decoding (Ghazvininejad et al., 2017). To our knowledge, this work is the first to systematically compare the effectiveness of two general-purpose control methods across several attributes.
3 The PersonaChat dataset
PersonaChat (Zhang et al., 2018b) is a chitchat dialogue task involving two participants (two humans, or a human and a bot). Each participant is given a persona – a short collection of personal traits such as I’m left handed or My favorite season is spring – and is instructed to get to know the other by chatting naturally using their designated personas, for 6–8 turns. The training set contains 8939 conversations and 955 personas, collected via crowdworkers, plus 1000 conversations and 100 personas for validation, and a similar number in the hidden test set. The PersonaChat task was the subject of the NeurIPS 2018 ConvAI2 Challenge (http://convai.io/), in which competitors were first evaluated with respect to automatic metrics (perplexity, hits@1 and F1 score), and then with respect to human judgment via the question “How much did you enjoy talking to this user?” on a scale of 1–4.
4 Baseline model
Our baseline model is a 2-layer LSTM sequence-to-sequence model with attention. On any dialogue turn, the input to the encoder is the entire dialogue history (separated using unique speaker-identifying tokens), with the model’s own persona prepended. Conditioned on this input sequence, the decoder generates a response. Except when stated otherwise, all our models decode using beam search with beam size 20.
The word embedding matrix was initialized with 300-dimensional GloVe embeddings (Pennington et al., 2014), and the model was pretrained on a dataset of 2.5 million Twitter message-response pairs before being fine-tuned on PersonaChat, all trained within the ParlAI framework (Miller et al., 2017). On the PersonaChat validation set, the baseline model has a perplexity of 26.83 and F1 of 17.02, which would have placed us 4th out of 26 models in the competition. We attempt to improve over this baseline using control.
5 Controllable text generation methods
Suppose we have a sequence-to-sequence model which learns P(y|x), the conditional probability of a response y (the model’s next utterance) given input x (the context, which in our case includes the model’s own persona and the dialogue history).
In contrast to most previous work, which controls at the sentence level, we wish to control attributes of the output at the dialogue level. For example, to control question-asking, we provide a control setting at the beginning of each dialogue (e.g. 20% questions or 70% questions) rather than providing a control setting for each utterance (e.g. this utterance should or shouldn’t be a question). With this approach, the sequence-to-sequence model is free to choose what value the controlled attribute should take for any particular utterance, but we choose the overall distribution. We find that this approach works well – the sequence-to-sequence model is generally good at detecting when to ask a question, for example – and works better than the alternative of developing a separate method to decide, for each utterance, whether to ask a question. In what follows, we describe the two methods we use to control attributes of the output at the dialogue level.
5.1 Conditional Training (CT)
Conditional Training (Fan et al., 2018; Kikuchi et al., 2016; Peng et al., 2018) is a method to learn a sequence-to-sequence model P(y|x, z), where z is a discrete control variable. If the control attribute is naturally continuous (in our work: repetitiveness, specificity and response-relatedness), we use z to represent bucketed ranges. For a binary attribute like question-asking, z represents an overall probability.
To train a CT model, we first automatically annotate every (x, y) pair in the training set with the attribute we wish to control (for example, whether y contains a question mark). During training, we determine the corresponding z value for each example (for continuous attributes, this simply means sorting y into the correct bucket; for question-asking, see Section 6.4). The model takes x and z as input, and learns to produce y.
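The bucketing step for a continuous attribute can be sketched as follows. This is a minimal illustration (the function names are ours, not from the paper’s code), assuming equal-sized buckets whose boundaries are the empirical quantiles of the training distribution:

```python
def bucket_boundaries(values, n_buckets=10):
    """Compute boundaries so each bucket holds roughly the same
    number of training attribute values."""
    sorted_vals = sorted(values)
    n = len(sorted_vals)
    # boundary i is the value below which i/n_buckets of the data falls
    return [sorted_vals[(i * n) // n_buckets] for i in range(1, n_buckets)]

def to_bucket(value, boundaries):
    """Map a continuous attribute value to its discrete control value z."""
    for z, b in enumerate(boundaries):
        if value < b:
            return z
    return len(boundaries)  # last bucket

# Example: bucket stand-in specificity scores into 10 classes
train_scores = [i / 100 for i in range(100)]
bounds = bucket_boundaries(train_scores, n_buckets=10)
z = to_bucket(0.42, bounds)   # falls in bucket 4
```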
Each of the possible values of the control variable z is represented via an embedding; in all our experiments, the embedding has length 10. There are several possible ways to condition the sequence-to-sequence model on z – for example, appending z to the end of the input sequence, or using z as the START symbol for the decoder. We find it most effective to concatenate z to the decoder’s input on every step. To train a CT model conditioned on multiple controls z1, …, zn, we simply concatenate the control embeddings.
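This conditioning mechanism can be sketched in a few lines. The snippet below is a simplified NumPy illustration, not the actual LSTM training code; in practice the control embeddings are learned parameters rather than random vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
n_control_values, ctrl_dim, word_dim = 10, 10, 300

# one length-10 embedding per possible control value z (learned in practice)
control_embeddings = rng.normal(size=(n_control_values, ctrl_dim))

def decoder_inputs(word_embs, z):
    """Concatenate the control embedding for z to the word embedding
    on every decoder step, as in conditional training."""
    ctrl = control_embeddings[z]                         # (ctrl_dim,)
    ctrl_tiled = np.tile(ctrl, (word_embs.shape[0], 1))  # (T, ctrl_dim)
    return np.concatenate([word_embs, ctrl_tiled], axis=1)

word_embs = rng.normal(size=(5, word_dim))  # 5 decoder steps
inputs = decoder_inputs(word_embs, z=3)     # shape (5, 310)
```

Conditioning on multiple controls would simply concatenate several such control embeddings onto each step’s input.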
Our CT models are initialized with the baseline parameters (the new decoder parameters are initialized with small random values), then fine-tuned on the control-conditional PersonaChat task until validation perplexity converges.
5.2 Weighted Decoding (WD)
Weighted Decoding (Ghazvininejad et al., 2017) is a decoding method that increases or decreases the probability of words with certain features. The technique is applied only at test time, requiring no change to the training method. A disadvantage of WD is that it can only control attributes at the word-level; any utterance-level attribute must be broken down into word-level decoding features.
On the t-th step of decoding, a partial hypothesis y_{<t} = y_1, …, y_{t−1} is expanded by computing the following score for each possible next word w:

score(w, y_{<t}; x) = score(y_{<t}; x) + log P_RNN(w | y_{<t}, x) + Σ_i w_i · f_i(w; y_{<t}, x).

Here, log P_RNN(w | y_{<t}, x) is the log-probability of the word w calculated by the RNN, score(y_{<t}; x) is the accumulated score of the already-generated words in the hypothesis y_{<t}, and the f_i(w; y_{<t}, x) are decoding features with associated weights w_i.
A decoding feature f_i(w; y_{<t}, x) assigns a real value to the word w, in the context of the text generated so far y_{<t} and the context x. The feature can be continuous (e.g. the unigram probability of w), discrete (e.g. the number of characters in w), or binary (e.g. whether w starts with a given letter). A positive weight w_i increases the probability of words that score highly with respect to f_i; a negative weight decreases their probability.
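The scoring rule above can be illustrated with a toy computation. The feature names and weight values below are placeholders for illustration, not the paper’s tuned settings:

```python
import math

def wd_score(log_p_rnn, hyp_score, features, weights):
    """Weighted decoding score for a candidate next word:
    accumulated hypothesis score + RNN log-probability
    + weighted sum of decoding feature values."""
    return hyp_score + log_p_rnn + sum(weights[name] * val
                                       for name, val in features.items())

# Two hypothetical features: a binary repetition flag and continuous
# NIDF word rareness. The candidate word repeats a bigram, so the
# negative repetition weight pulls its score down.
weights = {"extrep_bigram": -3.5, "nidf": 4.0}
features = {"extrep_bigram": 1.0, "nidf": 0.2}
s = wd_score(log_p_rnn=math.log(0.1), hyp_score=-2.0,
             features=features, weights=weights)
```

At decoding time this score would be computed for every word in the vocabulary before the beam is pruned.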
Table 1: Example of controlling specificity (NIDF) via (a) weighted decoding and (b) conditional training.

Input: Yes, I’m studying law at the moment
Baseline Response: That sounds like a lot of fun!

(a) Weighted Decoding

| NIDF | Response |
|---|---|
| 17.1% | That sounds like a lot of fun! |
| 18.3% | That sounds like a lot of fun. How long have you been studying? |
| 38.5% | I majored in practising my spiritual full time philosophy test |
| 71.9% | Oh wow! Merna jean isa paino yi hao hui bu acara sya gila […] |

(b) Conditional Training

| NIDF | Response |
|---|---|
| 16.8% | Sounds like you are a great person! |
| 18.3% | So you are a law student? |
| 18.4% | That sounds like a lot of fun |
| 22.8% | That sounds like a rewarding job! |
| 24.4% | That sounds like a rewarding career! |
6 Controlling conversational attributes
In this section, we describe how we use weighted decoding and conditional training to control four attributes: repetition, specificity, response-relatedness and question-asking. We evaluate the effectiveness of both control methods via automatic metrics (i.e., measuring how well the attribute was controlled), and use our findings to select control methods and control settings to be explored further in the human evaluation (Section 8).
6.1 Repetition

Our baseline exhibits three common types of repetition, which we call external repetition (self-repetition across utterances), internal repetition (self-repetition within utterances), and partner repetition (repeating the conversational partner).
To control repetition with weighted decoding, we define five n-gram based decoding features (see Appendix D). Three of these features (extrep_bigram, intrep_bigram and partnerrep_bigram) identify repeating bigrams for the three repetition types. The other two features (extrep_unigram and intrep_unigram) identify repeating content words. By applying a negative weight to these features, we can reduce undesired repetition. (We also tried controlling repetition with conditional training, defining z as the bucketed maximum ROUGE-L precision between the response and the bot’s previous utterances. However, this method was unsuccessful because the training data contains too few repetitive examples for the model to learn the control; experimenting with data augmentation to solve this problem is an area for future work.) As we observe that repetition control is very important, in all other control experiments we control both repetition and the other variable concurrently.
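As an illustration, a feature like extrep_bigram might be computed as follows. This is our own sketch; the paper’s exact feature definitions are given in its Appendix D:

```python
def bigrams(tokens):
    """Set of adjacent word pairs in a token list."""
    return set(zip(tokens, tokens[1:]))

def extrep_bigram(word, hypothesis, prev_bot_utterances):
    """1.0 if appending `word` to the partial hypothesis would create a
    bigram already used in one of the bot's own previous utterances
    (external repetition), else 0.0."""
    if not hypothesis:
        return 0.0
    candidate = (hypothesis[-1], word)
    seen = set()
    for utt in prev_bot_utterances:
        seen |= bigrams(utt)
    return 1.0 if candidate in seen else 0.0

prev = [["i", "love", "to", "cook"]]
hyp = ["yes", "i", "love", "to"]
f_repeat = extrep_bigram("cook", hyp, prev)  # ("to", "cook") repeats
f_fresh = extrep_bigram("read", hyp, prev)   # ("to", "read") is new
```

With a negative weight attached, candidates flagged by this feature are penalized during beam search.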
Table 2: Example of controlling response-relatedness (cosine similarity to input) via weighted decoding.

Input: Do you go get coffee often
Baseline Response: I do, when I am not playing the piano.

| Weight | Sim | Weighted Decoding Response |
|---|---|---|
| -10.0 | -0.05 | I am a musician. |
| -5.0 | -0.02 | I do, when I am not playing the piano. |
| 0.0 | -0.02 | I do, when I am not playing the piano. |
| 5.0 | 0.40 | I do, usually at starbucks. |
| 8.0 | 0.59 | Not usually, especially when you drink latte. |
| 11.0 | 0.72 | Not often, usually with drinks, espresso, latte, tea, etc. |
6.2 Specificity

Like many sequence-to-sequence models, our baseline frequently asks generic questions such as What music do you like? and gives dull, unspecific responses, e.g. I like all kinds of music.
We control specificity using Normalized Inverse Document Frequency (NIDF) as a measure of word rareness. The Inverse Document Frequency of a word w is IDF(w) = log(R / c_w), where R is the number of responses in the dataset and c_w is the number of those responses that contain w. Normalized IDF (which ranges from 0 to 1) is

NIDF(w) = (IDF(w) − min_idf) / (max_idf − min_idf),

where min_idf and max_idf are the minimum and maximum IDFs, taken over all words. To control specificity with weighted decoding, we use NIDF as a decoding feature. As shown in Table 1(a), this method produces reasonable outputs for weights within a certain range, but at the extremes the outputs are nonsensical. The boundary for nonsensical output differs from example to example.
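The NIDF computation follows directly from the definition above. Below is a minimal sketch over a toy corpus of tokenized responses; it assumes word frequencies vary, so that max_idf > min_idf:

```python
import math

def nidf_table(responses):
    """Compute NIDF(w) = (IDF(w) - min_idf) / (max_idf - min_idf)
    over a corpus of tokenized responses, with IDF(w) = log(R / c_w)."""
    R = len(responses)
    counts = {}
    for resp in responses:
        for w in set(resp):              # count each word once per response
            counts[w] = counts.get(w, 0) + 1
    idf = {w: math.log(R / c) for w, c in counts.items()}
    lo, hi = min(idf.values()), max(idf.values())
    # assumes hi > lo, i.e. not all words are equally frequent
    return {w: (v - lo) / (hi - lo) for w, v in idf.items()}

responses = [["i", "like", "music"],
             ["i", "like", "jazz"],
             ["i", "play", "piano"]]
nidf = nidf_table(responses)
# "i" appears in every response (NIDF 0); "piano" in only one (NIDF 1)
```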
To control specificity with conditional training, we define the specificity of an utterance y to be the mean NIDF of the words in y. Thus our control variable z is mean NIDF (discretized into 10 equal-sized buckets). As shown in Table 1(b), this method gives outputs with a narrower NIDF range, but overall produces less nonsensical output.
6.3 Response-relatedness

In conversation, it is generally desirable to produce a response that is related to the partner’s last utterance; for example, it is inappropriate to say Do you have any pets? in response to My grandfather died last month. To control response-relatedness with weighted decoding, we use the decoding feature resp_rel(w; y_{<t}, x): the cosine similarity between word_emb(w), the GloVe embedding for the word w, and sent_emb(ℓ), the sentence embedding for the partner’s last utterance ℓ (which is part of the context x). Here the sentence embedding for an utterance is a weighted average of the GloVe embeddings of its words, with the first principal component projected out; for full details, see Arora et al. (2017). We find that weighted decoding is effective for controlling the semantic relatedness of the model’s response to the partner’s last utterance (see Table 2). As before, we find that extreme weights lead to nonsensical output.
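A simplified version of this decoding feature is sketched below. For clarity we use a plain average of word vectors as the sentence embedding, rather than the SIF weighting and principal-component removal of Arora et al. (2017), and random vectors stand in for GloVe embeddings:

```python
import numpy as np

def cos_sim(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def resp_rel(word_vec, last_utterance_vecs):
    """Decoding feature: cosine similarity between the candidate word's
    embedding and an embedding of the partner's last utterance.
    (Simplified: a plain average of word vectors stands in for the
    SIF sentence embedding.)"""
    sent_emb = np.mean(last_utterance_vecs, axis=0)
    return cos_sim(word_vec, sent_emb)

rng = np.random.default_rng(1)
utt_vecs = rng.normal(size=(4, 300))  # stand-ins for GloVe word vectors
score = resp_rel(utt_vecs[0], utt_vecs)
```

During decoding this feature is computed per candidate word and added to the beam score with a tunable weight, as in the scoring rule of Section 5.2.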
To control response-relatedness with conditional training, we define the control variable z to be cos_sim(sent_emb(ℓ), sent_emb(y)), the overall cosine similarity between the partner’s last utterance ℓ and the model’s response y (again, we discretize z into buckets). However, we find this method ineffective – the CT model learns only a very weak connection between z and the semantic relatedness of the output (see Section 7 for more details).
6.4 Question-asking

Considerate chitchat requires a reciprocal asking and answering of questions – asking too few or too many can appear self-centered or nosy. We control question-asking in order to study these trade-offs.
To control question-asking with weighted decoding, we use the binary decoding feature is_qn_word(w), which is equal to 1 if and only if the word w is in a pre-defined list of interrogative words (how, what, when, where, which, who, whom, whose, why, ?). We find this is a reasonably effective method to encourage or discourage questions, but with unintended side-effects: a negative weight can discourage valid non-question utterances that happen to contain interrogative words (such as I’m learning how to knit), and a positive weight can result in degenerate utterances (such as What??????? or Who? When? How?).
For conditional training, we regard an utterance y as containing a question if and only if y contains a question mark. We train our CT model on a control variable z with 11 possible values: z ∈ {0, 0.1, …, 1}. As discussed in Section 5, we wish to control question-asking at the distributional, dialogue level, rather than at the binary, utterance level. Thus the setting z means that the model should produce, on average, an utterance containing ‘?’ with probability z. During training we randomly assign examples to buckets such that each bucket is trained on examples with the correct proportion of questions, and all buckets have the same number of training examples.
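The bucket-assignment procedure can be sketched as follows. This is our own reconstruction of the idea (the paper does not publish this exact code), and it assumes the data contains enough question and non-question examples to fill every bucket:

```python
import random

def assign_question_buckets(examples, n_buckets=11, seed=0):
    """Randomly assign examples to buckets z in {0, ..., n_buckets-1} so
    that bucket z contains a fraction z/(n_buckets-1) of question examples
    and all buckets have the same size.
    `examples` is a list of (example, is_question) pairs."""
    rnd = random.Random(seed)
    questions = [e for e, q in examples if q]
    non_questions = [e for e, q in examples if not q]
    rnd.shuffle(questions)
    rnd.shuffle(non_questions)
    per_bucket = len(examples) // n_buckets
    buckets = []
    for z in range(n_buckets):
        n_q = round(per_bucket * z / (n_buckets - 1))
        bucket = [questions.pop() for _ in range(n_q)]
        bucket += [non_questions.pop() for _ in range(per_bucket - n_q)]
        buckets.append(bucket)
    return buckets

# toy data: 110 examples, half of them questions (ids 0-54)
examples = [(i, i < 55) for i in range(110)]
buckets = assign_question_buckets(examples)
```

Bucket 0 then trains the z = 0 setting on question-free examples, bucket 10 trains z = 1 on all-question examples, and so on.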
For controlling question-asking, conditional training is preferable to weighted decoding for two reasons. Firstly, it allows us to achieve (close to) 0% questions, 100% questions, or anything in between, without introducing the risk of degenerate output. Secondly, directly controlling presence-of-a-question-mark avoids the pollution of our control variable that occurs when increasing or decreasing the probability of interrogative words. For these reasons, only the CT method is considered in the human evaluation.
7 Comparison of control methods
The previous section shows that conditional training and weighted decoding are both useful techniques, with different strengths and weaknesses.
The primary disadvantage of conditional training is that it is only as effective as the underlying learning algorithm, which must learn the connection between the control variable z and the target output y. In practice, we find the model can learn simple attributes of the output (such as the presence of ‘?’, and overall genericness), but not relationships between the input and output (such as semantic relatedness). By contrast, weighted decoding can force the desired feature to appear in the output by raising the weight arbitrarily high (though this may have unintended side-effects).
The primary disadvantage of weighted decoding is that it risks going off-distribution when the weight is too strong. By contrast, conditional training produces mostly well-formed, in-distribution outputs. This highlights the importance of learned control – it is safer to learn to produce output that both satisfies the control variable and is appropriate, than to alter the decoding process to satisfy the control variable, potentially trading off appropriateness in the process.
Other practical considerations include convenience (conditional training requires retraining; weighted decoding doesn’t, but is slower at test time), data (conditional training requires sufficient examples of the controllable attribute; weighted decoding can control any computable feature) and attribute definition (conditional training can control sentence-level attributes, but they must be discrete; weighted decoding requires word-level features, but they can be continuous).
8 Human evaluation results
In order to study the relationship between our controllable attributes and human judgments of conversational quality, we conduct a large-scale human evaluation of 28 model configurations, plus human-human conversations for comparison.
In our evaluation, a crowdworker chats with a model (or, in the human-human case, another crowdworker) for six conversational turns, then answers eight multiple-choice questions which each capture a different aspect of conversational quality: avoiding repetition, interestingness, making sense, fluency, listening, inquisitiveness, humanness and engagingness. We also add a persona retrieval question, to match ConvAI2. Our evaluation is the same as the ConvAI2 Challenge evaluation, but more detailed (ConvAI2 includes only engagingness and persona retrieval). There are two other differences between our evaluation and ConvAI2’s: (1) we fix capitalization and spacing before showing the chatbot’s utterances to crowdworkers, while ConvAI2 shows the raw lowercase tokenized form, which we found interferes with fluency evaluations; (2) we conduct 6 dialogue turns, while ConvAI2 conducts 4–6, which was necessary to evaluate repetitiveness. The eight questions are Likert questions on a 1–4 scale, where higher is better, with two exceptions: avoiding repetition is on a 1–3 scale, as this was deemed to give clearer instructions, and inquisitiveness has an optimal score of 3, where 1 and 2 represent too little question-asking and 4 represents too much. For full details of the evaluation design, see Appendix B. For persona retrieval, the crowdworker is asked to select which of two possible personas was the model’s persona. In designing these questions, we aimed to capture the four aspects we expected to directly improve via control (avoiding repetition, interestingness, listening, inquisitiveness), two important error classes that we anticipated would be affected by our controls (fluency, making sense), and two overall quality measures (engagingness, humanness).
As in the ConvAI2 challenge, each of our 28 model configurations was evaluated by over 100 crowdworkers, and the results were adjusted for annotator variance via a Bayesian calibration (Kulikov et al., 2018). Full results for all our models are provided in Appendices G and H, and sample conversations are provided in Appendix C.
Table 3: A/B test results: percentage of evaluations in which each specificity-controlled model was preferred over the repetition-controlled baseline for interestingness.

| Model | Win% | Top 3 reasons for preferring model |
|---|---|---|
| Specificity WD | 84.1% | More information; Better flow; More descriptive |
| Specificity WD | 75.5% | More information; They describe their life in more detail; Funny |
| Specificity CT | 56.2% | More information; Better flow; Seems more interested |
8.1 Main findings
We summarize the main findings of our human evaluation. As Figure 2 shows, controlling for repetition, specificity and question-asking all lead to large improvements in engagingness over the greedy and beam-search baseline models. We find that controlling for multi-turn (self) repetition is essential and should be incorporated alongside other attribute control methods. We found no improvement for response-relatedness (shown in Appendix H). Our best controlled model matches the engagingness of the winning entry in the ConvAI2 competition, though ours was trained on significantly less data.
To understand why we get these improvements, we consider the wider set of human judgments, shown in Figure 3. We find that repetition control leads to improvements across all our aspects of conversational quality. Specificity control shows improvements in interestingness and listening ability over the repetition-controlled baseline, which might explain the increased engagingness. The question-controlled model is rated more inquisitive, which may explain why it is also rated more interesting and engaging.
Altogether, our evaluation clearly shows that controlling low-level attributes over multiple turns leads to improved overall quality.
8.2 Effect of controlled attributes
We observe that self-repetition across utterances (external repetition) is by far the most severe form of repetition in our baseline. We evaluate several settings of the extrep_bigram weighted decoding feature, and find that an aggressive repetition-reduction setting (reducing bigram repetition rate to below gold data levels) is rated best. We also find that blocking repeated content words improves the avoiding repetition score (see Appendices E, F and G for full details).
As shown in Figure 2 (left) and Figure 3, our repetition-controlled model improves hugely over the beam search baseline in all metrics, and achieves close-to-human scores on all metrics except humanness. This striking result demonstrates that repetition is by far the biggest limiting quality factor for naive sequence-to-sequence dialogue agents. The result also emphasizes the importance of multi-turn dialogue evaluation to detect the problem. We refer to this model as the repetition-controlled baseline, and use it as a basis to control repetition in all remaining experiments.
For our weighted decoding models, we find that the extreme settings (very generic and very specific) score poorly in engagingness due to the frequent presence of degenerate output – see Figure 2 (middle). We find that a mid-range weight setting – more specific than the repetition-controlled baseline, and about as specific as the gold data – maximizes engagingness. As shown in Figure 2 (left) and Figure 3, this more-specific model is rated more interesting, engaging, and a better listener than the repetition-controlled baseline, but at the cost of reduced fluency and making sense. For further discussion of the interestingness of our specificity models, see Section 8.3.
We evaluated several response-relatedness control settings and found that none scored better than weight 0 (no response-relatedness control) – see Appendix H. This is surprising – prior to running the human evaluation, we annotated 100 examples ourselves to determine the best control settings. While we identified a more responsive setting (with a higher weight) as less likely than the uncontrolled model to ignore the user, crowdworkers rated it a slightly worse listener than the uncontrolled model. One explanation for this discrepancy is that the more responsive model takes more risks, using more rare words (0.197 mean NIDF, up from 0.177), and thus receives a lower makes-sense score (3.41, down from 3.70). We hypothesize that, compared to us, the crowdworkers are less tolerant of slightly nonsensical output, and more tolerant of generic unrelated utterances.
As shown in Figure 2 (right), a question-asking rate of 65.7% maximizes engagingness. This model, which asks more questions than both the repetition-controlled baseline (50.0%) and humans (36.8%), brings us even closer to human-level engagingness – see Figure 2 (left). Although a rate of approximately 65.7% question-asking is the most engaging, a lower rate (48.9%) is rated the best listener. Lastly, we find that although asking too many questions is less engaging, most crowdworkers will not criticize a chatbot that asks questions on every turn: only 11.9% of crowdworkers judged the boosted setting, which asks 99% questions, as asking too many questions. (Though this conclusion may hold true for the PersonaChat task – a synthetic chatting task that instructs participants to get to know each other – in real-life social conversations, incessant question-asking may be less tolerated.) For full details of these scores, see Appendices F and H.
8.3 A/B tests for interestingness
Though our more-specific models yielded significant improvements in engagingness, we were surprised that they did not yield clearer improvements in interestingness. To investigate further, we conducted an A/B interestingness evaluation of three specificity-controlled models, compared to the repetition-controlled baseline.
Crowdworkers were shown two conversations (from the main human evaluation) and asked to choose which model was more interesting (see Appendix A for details). We collected 500 samples per comparison, plus 200 additional human vs. repetition-controlled samples, which were used for quality control. After discarding low-quality crowdworkers, we have roughly 300 evaluations per comparison, with an average Cohen’s κ.
As shown in Table 3, all three models were rated significantly more interesting than the repetition-controlled baseline. This convincingly shows that producing utterances with more rare words is a valid strategy to improve interestingness.
We have two explanations for why these interestingness differences did not materialize in our main evaluation. Firstly, interestingness is a particularly subjective metric (unlike more tangible metrics such as avoiding repetition and making sense) – this makes it hard to calibrate across crowdworkers. Secondly, we suspect that in our original evaluation, the crowdworkers may have evaluated the interestingness of the task rather than the chatbot. This could account for why subtle increases in conversational ability did not result in higher interestingness ratings – the PersonaChat task itself has a natural interestingness limit.
9 What makes a good conversation?
A good conversation is about trade-offs: we showed in our large-scale evaluation that appropriate control of repetition, specificity and question-asking all lead to large improvements in human judgments. Modeling these aspects explicitly, contrary to black box neural generation systems, allows us to precisely study and control the relationship between low-level attributes and high-level dialogue quality.
While neural models are usually trained to predict only the next utterance, humans judge a conversation as a whole, not just the sum of its turns. Optimizing for human judgments of a good conversation is made viable through control variables.
While humanness has long been perceived as the end-goal of dialogue research, engagingness is arguably the more important metric. The Turing test is essentially proof-by-contradiction, where models can be caught out on a single turn. Engagingness, on the other hand, is a more forgiving metric – involving interestingness, listening, inquisitiveness, fluency and making sense – measured over all the turns.
While our models do well with engagingness and our other measured aspects, they fall markedly short in humanness judgments. How to bridge that gap, using controllable aspects of dialogue, is an open problem that constitutes an interesting direction for future work in dialogue research.
- Arora et al. (2017) Sanjeev Arora, Yingyu Liang, and Tengyu Ma. 2017. A simple but tough-to-beat baseline for sentence embeddings. In Proceedings of the International Conference on Learning Representations (ICLR).
- Bordes et al. (2017) Antoine Bordes, Y-Lan Boureau, and Jason Weston. 2017. Learning end-to-end goal-oriented dialog. In Proceedings of the International Conference on Learning Representations (ICLR).
- Dinan et al. (2018) Emily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, and Jason Weston. 2018. Wizard of Wikipedia: Knowledge-powered conversational agents. arXiv preprint arXiv:1811.01241.
- El Asri et al. (2017) Layla El Asri, Hannes Schulz, Shikhar Sharma, Jeremie Zumer, Justin Harris, Emery Fine, Rahul Mehrotra, and Kaheer Suleman. 2017. Frames: a corpus for adding memory to goal-oriented dialogue systems. In Proceedings of the 18th Annual SIGDIAL Meeting on Discourse and Dialogue, pages 207–219, Saarbrücken, Germany. Association for Computational Linguistics.
- Fan et al. (2018) Angela Fan, David Grangier, and Michael Auli. 2018. Controllable abstractive summarization. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, pages 45–54. Association for Computational Linguistics.
- Ficler and Goldberg (2017) Jessica Ficler and Yoav Goldberg. 2017. Controlling linguistic style aspects in neural language generation. In Proceedings of the Workshop on Stylistic Variation, pages 94–104. Association for Computational Linguistics.
- Ghazvininejad et al. (2017) Marjan Ghazvininejad, Xing Shi, Jay Priyadarshi, and Kevin Knight. 2017. Hafez: an interactive poetry generation system. In Proceedings of ACL 2017, System Demonstrations, pages 43–48. Association for Computational Linguistics.
- Guo et al. (2018) Fenfei Guo, Angeliki Metallinou, Chandra Khatri, Anirudh Raju, Anu Venkatesh, and Ashwin Ram. 2018. Topic-based evaluation for conversational bots. Advances in Neural Information Processing Systems, Conversational AI Workshop.
- Hastie (2012) Helen Hastie. 2012. Metrics and evaluation of spoken dialogue systems, pages 131–150. Springer.
- Henderson et al. (2014) Matthew Henderson, Blaise Thomson, and Jason D Williams. 2014. The second dialog state tracking challenge. In Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), pages 263–272.
- Hu et al. (2017) Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P Xing. 2017. Toward controlled generation of text. In Thirty-fourth International Conference on Machine Learning.
- Kikuchi et al. (2016) Yuta Kikuchi, Graham Neubig, Ryohei Sasano, Hiroya Takamura, and Manabu Okumura. 2016. Controlling output length in neural encoder-decoders. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1328–1338. Association for Computational Linguistics.
- Kulikov et al. (2018) Ilya Kulikov, Alexander H Miller, Kyunghyun Cho, and Jason Weston. 2018. Importance of a search strategy in neural dialogue modelling. arXiv preprint arXiv:1811.00907.
- Li et al. (2016a) Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016a. A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 110–119. Association for Computational Linguistics.
- Li et al. (2017a) Jiwei Li, Will Monroe, and Dan Jurafsky. 2017a. Learning to decode for future success. arXiv preprint arXiv:1701.06549.
- Li et al. (2016b) Jiwei Li, Will Monroe, Alan Ritter, Dan Jurafsky, Michel Galley, and Jianfeng Gao. 2016b. Deep reinforcement learning for dialogue generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1192–1202, Austin, Texas. Association for Computational Linguistics.
- Li et al. (2017b) Jiwei Li, Will Monroe, Tianlin Shi, Sébastien Jean, Alan Ritter, and Dan Jurafsky. 2017b. Adversarial learning for neural dialogue generation. arXiv preprint arXiv:1701.06547.
- Lowe et al. (2017) Ryan Lowe, Michael Noseworthy, Iulian Vlad Serban, Nicolas Angelard-Gontier, Yoshua Bengio, and Joelle Pineau. 2017. Towards an automatic turing test: Learning to evaluate dialogue responses. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1116–1126. Association for Computational Linguistics.
- Lowe et al. (2015) Ryan Lowe, Nissan Pow, Iulian Serban, and Joelle Pineau. 2015. The Ubuntu dialogue corpus: A large dataset for research in unstructured multi-turn dialogue systems. In Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 285–294, Prague, Czech Republic. Association for Computational Linguistics.
- Miller et al. (2017) Alexander Miller, Will Feng, Dhruv Batra, Antoine Bordes, Adam Fisch, Jiasen Lu, Devi Parikh, and Jason Weston. 2017. ParlAI: A dialog research software platform. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 79–84, Copenhagen, Denmark. Association for Computational Linguistics.
- Parthasarathi and Pineau (2018) Prasanna Parthasarathi and Joelle Pineau. 2018. Extending neural generative conversational model using external knowledge sources. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 690–695, Brussels, Belgium. Association for Computational Linguistics.
- Peng et al. (2018) Nanyun Peng, Marjan Ghazvininejad, Jonathan May, and Kevin Knight. 2018. Towards controllable story generation. In Proceedings of the First Workshop on Storytelling, pages 43–49. Association for Computational Linguistics.
- Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.
- Serban et al. (2016a) Iulian Vlad Serban, Ryan Lowe, Laurent Charlin, and Joelle Pineau. 2016a. Generative deep neural networks for dialogue: A short review. Advances in Neural Information Processing Systems workshop on Learning Methods for Dialogue.
- Serban et al. (2016b) Iulian Vlad Serban, Alessandro Sordoni, Yoshua Bengio, Aaron C Courville, and Joelle Pineau. 2016b. Building end-to-end dialogue systems using generative hierarchical neural network models. In AAAI, volume 16, pages 3776–3784.
- Shen et al. (2017) Xiaoyu Shen, Hui Su, Yanran Li, Wenjie Li, Shuzi Niu, Yang Zhao, Akiko Aizawa, and Guoping Long. 2017. A conditional variational framework for dialog generation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 504–509. Association for Computational Linguistics.
- Venkatesh et al. (2017) Anu Venkatesh, Chandra Khatri, Ashwin Ram, Fenfei Guo, Raefer Gabriel, Ashish Nagar, Rohit Prasad, Ming Cheng, Behnam Hedayatnia, Angeliki Metallinou, et al. 2017. On evaluating and comparing conversational agents. Advances in Neural Information Processing Systems, Conversational AI Workshop.
- Vinyals and Le (2015) Oriol Vinyals and Quoc Le. 2015. A neural conversational model. In Proceedings of the 31st International Conference on Machine Learning, Deep Learning Workshop, Lille, France.
- Walker et al. (1997) Marilyn A. Walker, Diane J. Litman, Candace A. Kamm, and Alicia Abella. 1997. PARADISE: A framework for evaluating spoken dialogue agents. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics, pages 271–280, Madrid, Spain. Association for Computational Linguistics.
- Wang et al. (2017) Di Wang, Nebojsa Jojic, Chris Brockett, and Eric Nyberg. 2017. Steering output style and topic in neural response generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2140–2150. Association for Computational Linguistics.
- Wen et al. (2017) Tsung-Hsien Wen, David Vandyke, Nikola Mrkšić, Milica Gasic, Lina M. Rojas Barahona, Pei-Hao Su, Stefan Ultes, and Steve Young. 2017. A network-based end-to-end trainable task-oriented dialogue system. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 438–449. Association for Computational Linguistics.
- Xing et al. (2017) Chen Xing, Wei Wu, Yu Wu, Jie Liu, Yalou Huang, Ming Zhou, and Wei-Ying Ma. 2017. Topic aware neural response generation. In AAAI, volume 17, pages 3351–3357.
- Zhang et al. (2018a) Ruqing Zhang, Jiafeng Guo, Yixing Fan, Yanyan Lan, Jun Xu, and Xueqi Cheng. 2018a. Learning to control the specificity in neural response generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1108–1117, Melbourne, Australia. Association for Computational Linguistics.
- Zhang et al. (2018b) Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018b. Personalizing dialogue agents: I have a dog, do you have pets too? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2204–2213, Melbourne, Australia. Association for Computational Linguistics.
- Zhou et al. (2017) Ganbin Zhou, Ping Luo, Rongyu Cao, Fen Lin, Bo Chen, and Qing He. 2017. Mechanism-aware neural machine for dialogue response generation. In AAAI, pages 3400–3407.
Appendix A Screenshots of human evaluation interface
Appendix B Human evaluation questionnaire design
Here are the questions and multiple-choice options used in the human evaluation, in the order presented:
- [Engagingness] How much did you enjoy talking to this user? (Not at all / A little / Somewhat / A lot)
- [Interestingness] How interesting or boring did you find this conversation? (Very boring / A little boring / A little interesting / Very interesting)
- [Inquisitiveness] How much did the user try to get to know you? (Didn’t ask about me at all / Asked about me some / Asked about me a good amount / Asked about me too much)
- [Listening] How much did the user seem to pay attention to what you said? (Always ignored what I said / Mostly ignored what I said / Mostly paid attention to what I said / Always paid attention to what I said)
- [Avoiding Repetition] How repetitive was this user? (Repeated themselves over and over / Sometimes said the same thing twice / Always said something new)
- [Fluency] How naturally did this user speak English? (Very unnatural / Mostly unnatural / Mostly natural / Very natural)
- [Making sense] How often did this user say something which did NOT make sense? (Never made any sense / Most responses didn’t make sense / Some responses didn’t make sense / Everything made perfect sense)
- [Humanness] Do you think this user is a bot or a human? (Definitely a bot / Probably a bot / Probably a human / Definitely a human)
- [Persona retrieval] Which prompt (character) do you think the other user was given for this conversation? (Respondent chooses one of two provided personas)
Appendix C Example conversations from human evaluation
Appendix D Repetition-control decoding features
| Feature | Condition under which the feature equals 1 |
|---|---|
| extrep_bigram | Adding the word to the hypothesis would create a 2-gram that appears in a previous utterance by the model |
| extrep_unigram | The word is a non-stopword and appears in a previous utterance by the model |
| intrep_bigram | Adding the word to the hypothesis would create a 2-gram that appears earlier in the hypothesis |
| intrep_unigram | The word is a non-stopword and appears earlier in the hypothesis |
| partnerrep_bigram | Adding the word to the hypothesis would create a 2-gram that appears in a previous utterance by the partner |
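The repetition features above reduce to simple n-gram bookkeeping. A minimal sketch of two of them (hypothetical helper names; the stopword set is an illustrative subset, and the remaining features follow the same two patterns):

```python
def ngrams(tokens, n):
    """Set of n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

STOPWORDS = {"the", "a", "an", "i", "you", "to", "is", "and"}  # illustrative subset

def extrep_bigram(word, hypothesis, prev_model_utterances):
    """1.0 iff appending `word` to the hypothesis creates a 2-gram that
    appears in a previous utterance by the model (external repetition)."""
    if not hypothesis:
        return 0.0
    new_bigram = (hypothesis[-1], word)
    return 1.0 if any(new_bigram in ngrams(u, 2)
                      for u in prev_model_utterances) else 0.0

def intrep_unigram(word, hypothesis):
    """1.0 iff `word` is a non-stopword that already appears earlier in the
    hypothesis (internal repetition)."""
    return 1.0 if word not in STOPWORDS and word in hypothesis else 0.0
```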
Appendix E Control settings for all configurations
| Model | extrep_bigram | extrep_unigram | intrep_bigram | intrep_unigram | partnerrep_bigram | NIDF | cos sim | Has ‘?’ |
|---|---|---|---|---|---|---|---|---|
| Beam search (beam size 20) |  |  |  |  |  |  |  |  |
| *Repetition control (WD):* |  |  |  |  |  |  |  |  |
| Extrep bigram -0.5 | wt -0.5 |  |  |  |  |  |  |  |
| Extrep bigram -1.25 | wt -1.25 |  |  |  |  |  |  |  |
| Extrep bigram -3.5 | wt -3.5 |  |  |  |  |  |  |  |
| Extrep bigram -inf | wt -∞ |  |  |  |  |  |  |  |
| Repetition-controlled baseline | wt -3.5 | wt -∞ |  | wt -∞ |  |  |  |  |
| *Specificity control (WD):* |  |  |  |  |  |  |  |  |
| Specificity-controlled WD -10 | wt -3.5 | wt -∞ |  | wt -∞ |  | wt -10 |  |  |
| Specificity-controlled WD -4 | wt -3.5 | wt -∞ |  | wt -∞ |  | wt -4 |  |  |
| Specificity-controlled WD 4 | wt -3.5 | wt -∞ |  | wt -∞ |  | wt 4 |  |  |
| Specificity-controlled WD 6 | wt -3.5 | wt -∞ |  | wt -∞ |  | wt 6 |  |  |
| Specificity-controlled WD 8 | wt -3.5 | wt -∞ |  | wt -∞ |  | wt 8 |  |  |
| *Specificity control (CT):* |  |  |  |  |  |  |  |  |
| Specificity-controlled CT 0 | wt -3.5 | wt -∞ |  | wt -∞ |  |  |  |  |
| Specificity-controlled CT 2 | wt -3.5 | wt -∞ |  | wt -∞ |  |  |  |  |
| Specificity-controlled CT 4 | wt -3.5 | wt -∞ |  | wt -∞ |  |  |  |  |
| Specificity-controlled CT 7 | wt -3.5 | wt -∞ |  | wt -∞ |  |  |  |  |
| Specificity-controlled CT 9 | wt -3.5 | wt -∞ |  | wt -∞ |  |  |  |  |
| *Response-rel control (WD):* |  |  |  |  |  |  |  |  |
| Response-controlled WD -10 | wt -3.5 | wt -∞ | wt -∞ | wt -∞ | wt -∞ |  | wt -10 |  |
| Response-controlled WD 5 | wt -3.5 | wt -∞ | wt -∞ | wt -∞ | wt -∞ |  | wt 5 |  |
| Response-controlled WD 10 | wt -3.5 | wt -∞ | wt -∞ | wt -∞ | wt -∞ |  | wt 10 |  |
| Response-controlled WD 13 | wt -3.5 | wt -∞ | wt -∞ | wt -∞ | wt -∞ |  | wt 13 |  |
| *Question control (CT):* |  |  |  |  |  |  |  |  |
| Question-controlled CT 0 | wt -3.5 | wt -∞ |  | wt -∞ |  |  |  |  |
| Question-controlled CT 1 | wt -3.5 | wt -∞ |  | wt -∞ |  |  |  |  |
| Question-controlled CT 4 | wt -3.5 | wt -∞ |  | wt -∞ |  |  |  |  |
| Question-controlled CT 7 | wt -3.5 | wt -∞ |  | wt -∞ |  |  |  |  |
| Question-controlled CT 10 | wt -3.5 | wt -∞ |  | wt -∞ |  |  |  |  |
| Question-controlled CT 10 (boost) | wt 0* | wt -∞ |  | wt -∞ |  |  |  |  |
* In the setting Question-controlled CT 10 (boost), extrep_bigram is not used for decoding during beam search, but it is used to rerank the candidates after beam search.
Note that the Response-controlled models additionally introduce internal bigram and partner bigram blocks on emitted utterances. We found that without these additional controls, models tended to parrot their partner’s last utterance. In Table 8, we find this outperforms our canonical Repetition-controlled baseline, but the initial decision on which to use for other control methods was made in an early pilot study.
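The post-hoc reranking used in the boost setting can be sketched as follows (hypothetical candidate representation): the feature is not applied during beam search, but each completed candidate's log-probability is adjusted by a weighted sentence-level feature score before the top candidate is selected.

```python
def rerank_candidates(candidates, feature, weight):
    """Rerank completed beam-search candidates by adding a weighted
    sentence-level feature score to each candidate's log-probability,
    instead of applying the feature during the search itself."""
    return sorted(candidates,
                  key=lambda c: c["logprob"] + weight * feature(c["tokens"]),
                  reverse=True)
```

Applying a repetition penalty only at the reranking stage leaves the search free to complete high-probability hypotheses (such as questions) that a hard in-search block would have pruned.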
Appendix F Automatic metrics for all configurations
| Model | Extrep bigram | Extrep unigram | Intrep bigram | Intrep unigram | Partnerrep bigram | NIDF | cos sim | Has ‘?’ |
|---|---|---|---|---|---|---|---|---|
| *Gold data and baselines:* |  |  |  |  |  |  |  |  |
| Beam search (beam size 20) | 46.85% | 44.15% | 0.32% | 0.61% | 12.90% | 0.1662 | 0.0957 | 80.87% |
| *Repetition control (WD):* |  |  |  |  |  |  |  |  |
| Extrep bigram WD -0.5 | 19.70% | 16.85% | 0.26% | 0.62% | 11.93% | 0.1730 | 0.1348 | 73.04% |
| Extrep bigram WD -1.25 | 4.62% | 4.79% | 0.40% | 0.89% | 10.61% | 0.1763 | 0.1504 | 61.22% |
| Extrep bigram WD -3.5 | 0.75% | 4.61% | 0.47% | 0.94% | 9.89% | 0.1771 | 0.1681 | 48.89% |
| Extrep bigram WD -inf | 0.00% | 4.74% | 0.51% | 1.05% | 9.56% | 0.1780 | 0.1711 | 45.98% |
| *Specificity control (WD):* |  |  |  |  |  |  |  |  |
| Specificity-controlled WD -10 | 0.14% | 0.00% | 10.59% | 0.00% | 8.70% | 0.1107 | 0.0994 | 33.55% |
| Specificity-controlled WD -4 | 0.65% | 0.00% | 1.98% | 0.00% | 9.95% | 0.1501 | 0.1398 | 44.92% |
| Specificity-controlled WD 4 | 0.15% | 0.00% | 0.19% | 0.00% | 7.54% | 0.2121 | 0.1972 | 45.53% |
| Specificity-controlled WD 6 | 0.07% | 0.00% | 0.13% | 0.00% | 6.50% | 0.2546 | 0.2040 | 39.37% |
| Specificity-controlled WD 8 | 0.01% | 0.00% | 0.10% | 0.00% | 3.40% | 0.4035 | 0.1436 | 26.68% |
| *Specificity control (CT):* |  |  |  |  |  |  |  |  |
| Specificity-controlled CT 0 | 0.60% | 0.00% | 0.20% | 0.00% | 9.05% | 0.1478 | 0.1522 | 48.75% |
| Specificity-controlled CT 2 | 0.28% | 0.00% | 0.10% | 0.00% | 8.37% | 0.1772 | 0.1833 | 50.57% |
| Specificity-controlled CT 4 | 0.12% | 0.00% | 0.08% | 0.00% | 7.90% | 0.1921 | 0.1877 | 29.46% |
| Specificity-controlled CT 7 | 0.02% | 0.00% | 0.14% | 0.00% | 8.17% | 0.2156 | 0.1955 | 16.51% |
| Specificity-controlled CT 9 | 0.01% | 0.00% | 0.11% | 0.00% | 8.01% | 0.2462 | 0.1990 | 8.50% |
| *Response-rel control (WD):* |  |  |  |  |  |  |  |  |
| Response-controlled WD -10 | 0.13% | 0.00% | 0.00% | 0.00% | 0.00% | 0.1914 | -0.0921 | 25.71% |
| Response-controlled WD 5 | 0.15% | 0.00% | 0.00% | 0.00% | 0.00% | 0.1973 | 0.4360 | 39.78% |
| Response-controlled WD 10 | 0.05% | 0.00% | 0.00% | 0.00% | 0.00% | 0.2535 | 0.6653 | 27.56% |
| Response-controlled WD 13 | 0.02% | 0.00% | 0.00% | 0.00% | 0.00% | 0.2999 | 0.7251 | 20.47% |
| *Question control (CT):* |  |  |  |  |  |  |  |  |
| Question-controlled CT 0 | 0.06% | 0.00% | 0.19% | 0.00% | 9.20% | 0.1871 | 0.1753 | 2.01% |
| Question-controlled CT 1 | 0.09% | 0.00% | 0.19% | 0.00% | 8.66% | 0.1844 | 0.1722 | 17.33% |
| Question-controlled CT 4 | 0.40% | 0.00% | 0.25% | 0.00% | 8.53% | 0.1794 | 0.1713 | 48.88% |
| Question-controlled CT 7 | 0.80% | 0.00% | 0.17% | 0.00% | 8.48% | 0.1771 | 0.1724 | 65.65% |
| Question-controlled CT 10 | 1.27% | 0.00% | 0.16% | 0.00% | 8.48% | 0.1761 | 0.1728 | 79.67% |
| Question-controlled CT 10 (boost)* | 7.64% | 0.00% | 0.03% | 0.00% | 10.76% | 0.1701 | 0.1651 | 99.54% |
*The purpose of the Question-controlled CT 10 (boost) setting is to achieve 100% question-asking rate. This is necessary because the Question-controlled CT 10 setting achieves only 79.67% questions, due to the interference of the extrep_bigram control. The 10 (boost) setting relaxes the repetition control in order to achieve 99.54% question-asking at the cost of slightly increased external bigram repetition.
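The NIDF column above measures response specificity. A minimal sketch, assuming the standard construction (inverse document frequency computed over a corpus, min-max normalized over the vocabulary, then averaged over the words of an utterance; function names are illustrative):

```python
import math

def nidf_scores(documents):
    """Normalized inverse document frequency per word: IDF over the corpus,
    min-max normalized so scores fall in [0, 1]."""
    n = len(documents)
    df = {}
    for doc in documents:
        for w in set(doc):
            df[w] = df.get(w, 0) + 1
    idf = {w: math.log(n / c) for w, c in df.items()}
    lo, hi = min(idf.values()), max(idf.values())
    return {w: (v - lo) / (hi - lo) if hi > lo else 0.0
            for w, v in idf.items()}

def utterance_nidf(tokens, nidf):
    """Mean NIDF of an utterance's words: higher means more specific."""
    vals = [nidf[w] for w in tokens if w in nidf]
    return sum(vals) / len(vals) if vals else 0.0
```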
Appendix G Human evaluation results for all configurations
| Model | Avoiding Rep. | Engage | Fluency | Humanness | Inquisitive | Interesting | Listening | Make Sense | Persona |
|---|---|---|---|---|---|---|---|---|---|
| Human | 2.90 ± 0.39 | 3.31 ± 0.90 | 3.66 ± 0.71 | 3.40 ± 0.80 | 2.63 ± 0.63 | 3.23 ± 0.83 | 3.64 ± 0.63 | 3.84 ± 0.52 | 0.92 ± 0.27 |
| Greedy search baseline | 2.16 ± 0.72 | 2.31 ± 1.08 | 3.20 ± 0.81 | 1.78 ± 0.90 | 2.00 ± 0.81 | 2.36 ± 0.98 | 2.78 ± 0.84 | 3.33 ± 0.75 | 0.87 ± 0.34 |
| Beam search baseline | 2.14 ± 0.72 | 2.35 ± 1.01 | 3.23 ± 0.93 | 1.81 ± 0.87 | 2.50 ± 0.72 | 2.35 ± 0.98 | 2.63 ± 0.85 | 3.40 ± 0.77 | 0.77 ± 0.42 |
| Extrep bigram -0.5 | 2.66 ± 0.56 | 2.56 ± 0.92 | 3.57 ± 0.64 | 2.19 ± 0.94 | 2.67 ± 0.62 | 2.61 ± 0.87 | 3.08 ± 0.78 | 3.60 ± 0.57 | 0.75 ± 0.43 |
| Extrep bigram -1.25 | 2.84 ± 0.39 | 2.91 ± 0.90 | 3.59 ± 0.64 | 2.32 ± 0.98 | 2.63 ± 0.60 | 2.86 ± 0.89 | 3.21 ± 0.71 | 3.64 ± 0.62 | 0.72 ± 0.45 |
| Extrep bigram -3.5 | 2.90 ± 0.30 | 2.95 ± 0.86 | 3.73 ± 0.50 | 2.45 ± 1.03 | 2.55 ± 0.61 | 2.88 ± 0.80 | 3.27 ± 0.79 | 3.68 ± 0.49 | 0.80 ± 0.40 |
| Extrep bigram -inf | 2.82 ± 0.43 | 2.96 ± 0.86 | 3.64 ± 0.58 | 2.40 ± 0.96 | 2.65 ± 0.69 | 2.86 ± 0.82 | 3.31 ± 0.69 | 3.66 ± 0.59 | 0.91 ± 0.29 |
| Repetition-controlled baseline | 2.89 ± 0.39 | 2.89 ± 0.89 | 3.66 ± 0.56 | 2.50 ± 0.99 | 2.70 ± 0.64 | 2.96 ± 0.92 | 3.25 ± 0.71 | 3.68 ± 0.54 | 0.87 ± 0.34 |
| Question-controlled CT 0 | 2.95 ± 0.25 | 2.92 ± 0.90 | 3.70 ± 0.54 | 2.49 ± 0.97 | 2.48 ± 0.72 | 2.85 ± 0.93 | 3.29 ± 0.69 | 3.56 ± 0.66 | 0.86 ± 0.35 |
| Question-controlled CT 1 | 2.88 ± 0.33 | 2.94 ± 0.93 | 3.59 ± 0.66 | 2.47 ± 0.95 | 2.52 ± 0.69 | 2.85 ± 0.90 | 3.32 ± 0.73 | 3.63 ± 0.55 | 0.85 ± 0.36 |
| Question-controlled CT 4 | 2.88 ± 0.38 | 2.88 ± 0.94 | 3.59 ± 0.73 | 2.42 ± 1.07 | 2.55 ± 0.66 | 2.82 ± 0.85 | 3.37 ± 0.74 | 3.63 ± 0.59 | 0.84 ± 0.37 |
| Question-controlled CT 7 | 2.88 ± 0.37 | 3.07 ± 0.90 | 3.67 ± 0.54 | 2.42 ± 0.98 | 2.75 ± 0.58 | 2.97 ± 0.84 | 3.23 ± 0.76 | 3.53 ± 0.76 | 0.80 ± 0.40 |
| Question-controlled CT 10 | 2.74 ± 0.46 | 2.90 ± 0.93 | 3.70 ± 0.50 | 2.43 ± 1.04 | 2.71 ± 0.57 | 2.72 ± 0.88 | 3.12 ± 0.73 | 3.59 ± 0.66 | 0.79 ± 0.41 |
| Question-controlled CT 10 (boost) | 2.76 ± 0.49 | 2.84 ± 0.94 | 3.60 ± 0.64 | 2.26 ± 0.97 | 2.94 ± 0.57 | 2.83 ± 0.94 | 3.18 ± 0.80 | 3.52 ± 0.67 | 0.72 ± 0.45 |
| Specificity-controlled CT 0 | 2.83 ± 0.40 | 2.96 ± 0.93 | 3.62 ± 0.58 | 2.42 ± 0.99 | 2.60 ± 0.56 | 2.86 ± 0.89 | 3.29 ± 0.70 | 3.66 ± 0.60 | 0.72 ± 0.45 |
| Specificity-controlled CT 2 | 2.90 ± 0.36 | 2.78 ± 1.00 | 3.60 ± 0.64 | 2.37 ± 0.93 | 2.66 ± 0.66 | 2.80 ± 0.96 | 3.14 ± 0.77 | 3.50 ± 0.63 | 0.81 ± 0.39 |
| Specificity-controlled CT 4 | 2.92 ± 0.27 | 2.81 ± 0.88 | 3.65 ± 0.59 | 2.34 ± 1.02 | 2.57 ± 0.62 | 2.80 ± 0.78 | 3.25 ± 0.78 | 3.50 ± 0.66 | 0.86 ± 0.35 |
| Specificity-controlled CT 7 | 2.89 ± 0.32 | 3.00 ± 0.94 | 3.64 ± 0.67 | 2.53 ± 1.03 | 2.56 ± 0.66 | 2.90 ± 0.90 | 3.34 ± 0.70 | 3.59 ± 0.60 | 0.82 ± 0.39 |
| Specificity-controlled CT 9 | 2.90 ± 0.35 | 2.83 ± 0.87 | 3.61 ± 0.62 | 2.40 ± 0.97 | 2.31 ± 0.74 | 2.84 ± 0.83 | 3.07 ± 0.81 | 3.58 ± 0.56 | 0.88 ± 0.32 |
| Specificity-controlled WD -10 | 2.85 ± 0.43 | 2.43 ± 0.99 | 3.34 ± 0.83 | 2.15 ± 0.91 | 2.31 ± 0.69 | 2.38 ± 0.94 | 3.03 ± 0.75 | 3.33 ± 0.70 | 0.71 ± 0.45 |
| Specificity-controlled WD -4 | 2.90 ± 0.30 | 2.78 ± 0.95 | 3.55 ± 0.63 | 2.41 ± 0.92 | 2.52 ± 0.66 | 2.64 ± 0.93 | 3.28 ± 0.73 | 3.56 ± 0.62 | 0.82 ± 0.38 |
| Specificity-controlled WD 4 | 2.95 ± 0.21 | 2.99 ± 0.86 | 3.65 ± 0.55 | 2.49 ± 0.90 | 2.65 ± 0.55 | 3.00 ± 0.78 | 3.37 ± 0.59 | 3.63 ± 0.50 | 0.93 ± 0.25 |
| Specificity-controlled WD 6 | 2.93 ± 0.26 | 2.96 ± 0.90 | 3.52 ± 0.76 | 2.41 ± 1.04 | 2.58 ± 0.66 | 3.06 ± 0.80 | 3.24 ± 0.76 | 3.50 ± 0.66 | 0.93 ± 0.26 |
| Specificity-controlled WD 8 | 2.78 ± 0.52 | 2.40 ± 1.23 | 2.67 ± 1.25 | 1.86 ± 0.97 | 2.03 ± 0.87 | 2.55 ± 1.14 | 2.61 ± 1.05 | 2.91 ± 0.91 | 0.92 ± 0.28 |
| Response-related controlled WD -10 | 2.86 ± 0.44 | 2.48 ± 0.98 | 3.42 ± 0.74 | 2.02 ± 0.93 | 2.38 ± 0.75 | 2.53 ± 0.94 | 2.84 ± 0.80 | 3.14 ± 0.75 | 0.91 ± 0.29 |
| Response-related controlled WD 0 | 2.96 ± 0.23 | 3.01 ± 0.90 | 3.72 ± 0.54 | 2.73 ± 1.00 | 2.56 ± 0.67 | 2.92 ± 0.84 | 3.37 ± 0.72 | 3.73 ± 0.52 | 0.82 ± 0.38 |
| Response-related controlled WD 5 | 2.90 ± 0.33 | 2.88 ± 0.90 | 3.51 ± 0.63 | 2.41 ± 1.01 | 2.53 ± 0.65 | 2.85 ± 0.90 | 3.27 ± 0.73 | 3.49 ± 0.63 | 0.82 ± 0.39 |
| Response-related controlled WD 10 | 2.78 ± 0.43 | 2.39 ± 1.04 | 3.06 ± 0.90 | 1.97 ± 0.99 | 2.22 ± 0.67 | 2.57 ± 1.01 | 3.03 ± 0.76 | 3.16 ± 0.63 | 0.75 ± 0.43 |
| Response-related controlled WD 13 | 2.71 ± 0.57 | 2.10 ± 1.13 | 2.54 ± 1.12 | 1.81 ± 1.07 | 2.14 ± 0.84 | 2.33 ± 1.06 | 2.69 ± 0.83 | 2.70 ± 0.88 | 0.62 ± 0.49 |

Human evaluation results (mean ± std.) for all models and human evaluation metrics.
| Model | Avoiding Rep. | Engage | Fluency | Humanness | Inquisitive | Interesting | Listening | Make Sense |
|---|---|---|---|---|---|---|---|---|
| Human | 2.79 ± 0.12 | 3.04 ± 0.11 | 3.36 ± 0.12 | 3.35 ± 0.11 | 2.44 ± 0.12 | 2.92 ± 0.11 | 3.32 ± 0.13 | 3.68 ± 0.11 |
| Greedy search baseline | 2.08 ± 0.10 | 2.24 ± 0.11 | 3.03 ± 0.10 | 1.75 ± 0.12 | 1.95 ± 0.10 | 2.29 ± 0.13 | 2.62 ± 0.10 | 3.23 ± 0.10 |
| Beam search baseline | 2.08 ± 0.11 | 2.29 ± 0.11 | 3.09 ± 0.13 | 1.71 ± 0.13 | 2.42 ± 0.11 | 2.29 ± 0.14 | 2.47 ± 0.12 | 3.35 ± 0.13 |
| Extrep bigram -0.5 | 2.62 ± 0.10 | 2.54 ± 0.12 | 3.35 ± 0.12 | 2.13 ± 0.11 | 2.63 ± 0.11 | 2.56 ± 0.11 | 2.93 ± 0.11 | 3.48 ± 0.11 |
| Extrep bigram -1.25 | 2.78 ± 0.09 | 2.82 ± 0.13 | 3.40 ± 0.12 | 2.27 ± 0.12 | 2.54 ± 0.09 | 2.76 ± 0.10 | 3.05 ± 0.11 | 3.53 ± 0.14 |
| Extrep bigram -3.5 | 2.83 ± 0.11 | 2.93 ± 0.10 | 3.56 ± 0.10 | 2.43 ± 0.11 | 2.47 ± 0.11 | 2.83 ± 0.10 | 3.14 ± 0.10 | 3.62 ± 0.12 |
| Extrep bigram -inf | 2.74 ± 0.11 | 2.87 ± 0.14 | 3.49 ± 0.12 | 2.32 ± 0.13 | 2.56 ± 0.11 | 2.75 ± 0.12 | 3.13 ± 0.12 | 3.59 ± 0.12 |
| Repetition-controlled baseline | 2.86 ± 0.12 | 2.82 ± 0.12 | 3.53 ± 0.10 | 2.40 ± 0.11 | 2.62 ± 0.13 | 2.84 ± 0.12 | 3.10 ± 0.11 | 3.58 ± 0.14 |
| Question-controlled CT 0 | 2.87 ± 0.12 | 2.84 ± 0.13 | 3.51 ± 0.10 | 2.46 ± 0.11 | 2.36 ± 0.09 | 2.76 ± 0.09 | 3.10 ± 0.10 | 3.49 ± 0.12 |
| Question-controlled CT 1 | 2.82 ± 0.11 | 2.88 ± 0.11 | 3.42 ± 0.10 | 2.46 ± 0.12 | 2.47 ± 0.11 | 2.79 ± 0.13 | 3.14 ± 0.11 | 3.55 ± 0.10 |
| Question-controlled CT 4 | 2.78 ± 0.12 | 2.88 ± 0.10 | 3.47 ± 0.11 | 2.40 ± 0.09 | 2.53 ± 0.13 | 2.83 ± 0.13 | 3.24 ± 0.11 | 3.59 ± 0.10 |
| Question-controlled CT 7 | 2.81 ± 0.10 | 2.99 ± 0.11 | 3.54 ± 0.09 | 2.35 ± 0.11 | 2.66 ± 0.12 | 2.92 ± 0.12 | 3.11 ± 0.10 | 3.47 ± 0.10 |
| Question-controlled CT 10 | 2.67 ± 0.13 | 2.87 ± 0.11 | 3.52 ± 0.12 | 2.35 ± 0.12 | 2.63 ± 0.12 | 2.66 ± 0.10 | 2.94 ± 0.11 | 3.53 ± 0.12 |
| Question-controlled CT 10 (boost) | 2.68 ± 0.12 | 2.74 ± 0.09 | 3.42 ± 0.12 | 2.19 ± 0.13 | 2.79 ± 0.11 | 2.74 ± 0.11 | 3.00 ± 0.12 | 3.45 ± 0.13 |
| Specificity-controlled WD -10 | 2.76 ± 0.11 | 2.41 ± 0.12 | 3.19 ± 0.12 | 2.15 ± 0.11 | 2.28 ± 0.13 | 2.35 ± 0.12 | 2.89 ± 0.11 | 3.28 ± 0.12 |
| Specificity-controlled WD -4 | 2.83 ± 0.10 | 2.76 ± 0.12 | 3.37 ± 0.10 | 2.36 ± 0.11 | 2.46 ± 0.11 | 2.62 ± 0.12 | 3.14 ± 0.09 | 3.52 ± 0.11 |
| Specificity-controlled WD 0 | 2.86 ± 0.12 | 2.82 ± 0.12 | 3.53 ± 0.10 | 2.40 ± 0.11 | 2.62 ± 0.13 | 2.84 ± 0.12 | 3.10 ± 0.11 | 3.58 ± 0.14 |
| Specificity-controlled WD 4 | 2.84 ± 0.10 | 2.96 ± 0.12 | 3.45 ± 0.13 | 2.44 ± 0.12 | 2.56 ± 0.09 | 2.94 ± 0.11 | 3.20 ± 0.10 | 3.54 ± 0.11 |
| Specificity-controlled WD 6 | 2.81 ± 0.09 | 2.91 ± 0.10 | 3.34 ± 0.09 | 2.31 ± 0.11 | 2.53 ± 0.12 | 2.93 ± 0.12 | 3.09 ± 0.10 | 3.41 ± 0.12 |
| Specificity-controlled WD 8 | 2.70 ± 0.11 | 2.39 ± 0.12 | 2.54 ± 0.12 | 1.80 ± 0.13 | 2.00 ± 0.10 | 2.49 ± 0.12 | 2.47 ± 0.10 | 2.87 ± 0.11 |
| Specificity-controlled CT 0 | 2.79 ± 0.10 | 2.93 ± 0.09 | 3.44 ± 0.12 | 2.38 ± 0.11 | 2.56 ± 0.12 | 2.84 ± 0.12 | 3.12 ± 0.13 | 3.61 ± 0.11 |
| Specificity-controlled CT 2 | 2.78 ± 0.12 | 2.74 ± 0.11 | 3.39 ± 0.13 | 2.31 ± 0.13 | 2.56 ± 0.13 | 2.74 ± 0.12 | 2.99 ± 0.11 | 3.47 ± 0.10 |
| Specificity-controlled CT 4 | 2.82 ± 0.10 | 2.80 ± 0.13 | 3.44 ± 0.14 | 2.32 ± 0.13 | 2.51 ± 0.12 | 2.78 ± 0.15 | 3.09 ± 0.13 | 3.46 ± 0.13 |
| Specificity-controlled CT 7 | 2.81 ± 0.12 | 2.91 ± 0.13 | 3.43 ± 0.11 | 2.45 ± 0.10 | 2.49 ± 0.11 | 2.81 ± 0.12 | 3.15 ± 0.12 | 3.55 ± 0.11 |
| Specificity-controlled CT 9 | 2.80 ± 0.13 | 2.78 ± 0.10 | 3.41 ± 0.12 | 2.35 ± 0.13 | 2.28 ± 0.11 | 2.79 ± 0.11 | 2.91 ± 0.11 | 3.51 ± 0.12 |
| Response-related controlled WD -10 | 2.77 ± 0.12 | 2.45 ± 0.12 | 3.26 ± 0.11 | 1.96 ± 0.10 | 2.31 ± 0.12 | 2.47 ± 0.12 | 2.73 ± 0.11 | 3.12 ± 0.12 |
| Response-related controlled WD 0 | 2.87 ± 0.12 | 2.97 ± 0.11 | 3.55 ± 0.09 | 2.62 ± 0.11 | 2.48 ± 0.10 | 2.88 ± 0.12 | 3.21 ± 0.09 | 3.70 ± 0.10 |
| Response-related controlled WD 5 | 2.79 ± 0.10 | 2.83 ± 0.09 | 3.35 ± 0.12 | 2.40 ± 0.12 | 2.51 ± 0.13 | 2.80 ± 0.13 | 3.13 ± 0.12 | 3.41 ± 0.12 |
| Response-related controlled WD 10 | 2.74 ± 0.11 | 2.42 ± 0.12 | 2.93 ± 0.11 | 1.95 ± 0.12 | 2.20 ± 0.12 | 2.56 ± 0.12 | 2.90 ± 0.12 | 3.12 ± 0.10 |
| Response-related controlled WD 13 | 2.63 ± 0.12 | 2.06 ± 0.11 | 2.40 ± 0.09 | 1.74 ± 0.11 | 2.07 ± 0.11 | 2.25 ± 0.12 | 2.49 ± 0.14 | 2.63 ± 0.10 |