Approximating Interactive Human Evaluation with Self-Play for Open-Domain Dialog Systems

by   Asma Ghandeharioun, et al.

Building an open-domain conversational agent is a challenging problem. Current evaluation methods, mostly post-hoc judgments of single-turn evaluation, do not capture conversation quality in a realistic interactive context. In this paper, we investigate interactive human evaluation and provide evidence for its necessity; we then introduce a novel, model-agnostic, and dataset-agnostic method to approximate it. In particular, we propose a self-play scenario where the dialog system talks to itself and we calculate a combination of proxies such as sentiment and semantic coherence on the conversation trajectory. We show that this metric is capable of capturing the human-rated quality of a dialog model better than any automated metric known to-date, achieving a significant Pearson correlation (r>.7, p<.05). To investigate the strengths of this novel metric and interactive evaluation in comparison to state-of-the-art metrics and one-turn evaluation, we perform extended experiments with a set of models, including several that make novel improvements to recent hierarchical dialog generation architectures through sentiment and semantic knowledge distillation on the utterance level. Finally, we open-source the interactive evaluation platform we built and the dataset we collected to allow researchers to efficiently deploy and evaluate generative dialog models.




RUBER: An Unsupervised Method for Automatic Evaluation of Open-Domain Dialog Systems

Open-domain human-computer conversation has been attracting increasing a...

USR: An Unsupervised and Reference Free Evaluation Metric for Dialog Generation

The lack of meaningful automatic evaluation metrics for dialog has imped...

Hierarchical Reinforcement Learning for Open-Domain Dialog

Open-domain dialog generation is a challenging problem; maximum likeliho...

We've had this conversation before: A Novel Approach to Measuring Dialog Similarity

Dialog is a core building block of human natural language interactions. ...

Unsupervised Evaluation of Interactive Dialog with DialoGPT

It is important to define meaningful and interpretable automatic evaluat...

Proposal Towards a Personalized Knowledge-powered Self-play Based Ensemble Dialog System

This is the application document for the 2019 Amazon Alexa competition. ...

Discovering Dialog Structure Graph for Open-Domain Dialog Generation

Learning interpretable dialog structure from human-human dialogs yields ...

Code Repositories


Code to support training, evaluating and interacting neural network dialog models, and training them with reinforcement learning. Code to deploy a web server which hosts the models live online is available at:



The server portion of the Neural Chat project to deploy chatbots on web. This code is accompanied by another repository that includes the chatbot models: for training, evaluating and interacting with our open-sourced neural dialog models, use


1 Introduction

The goal of an open-domain conversational agent is to carry out natural social interactions with humans. Current state-of-the-art generative neural networks fail in producing key aspects of good natural conversation, including staying on topic, not being repetitive, and generating emotionally appropriate responses. One of the biggest challenges in training better dialog systems relates to the difficulty of evaluating them. Automatic metrics such as BLEU score relate poorly to human judgment of dialog quality Liu et al. (2016), and while embedding-distance based metrics provide an alternative Mitchell and Lapata (2008), we will show that they also do not correlate well with human evaluation. Without a reliable metric to optimize, training high quality dialog models remains difficult.

Since humans are the ultimate authority on what constitutes a good conversation, many authors rely on human ratings to evaluate their methods (e.g. Serban et al. (2017b); Park et al. (2018); Serban et al. (2016)). The predominant procedure for obtaining human ratings uses single-turn evaluation: a context of several sentences is used as input to generate a single response for a human to evaluate. However, a single turn is incapable of capturing common failure modes of open-domain dialog systems, such as a lack of diversity in the responses, inability to track long-term aspects of the conversation, and inability to maintain a consistent persona Dinan et al. (2019); Li et al. (2016a); Mazare et al. (2018). In spite of this, most authors still rely on single-turn evaluation, even when evaluating these exact qualities.

Conversation is inherently a process. In this paper, we argue that multi-turn interactive human evaluation is essential for evaluating this process, and for making progress in improving open-domain dialog models. Table 1 illustrates an example where a bot generates a coherent single-turn response, but multi-turn evaluation shows that the distribution of its responses has collapsed onto repeatedly saying the same phrase, a common problem in this domain Li et al. (2016b).

Columns, left to right: Single-turn, HRED - Cornell | Multi-turn, HRED - Cornell | Multi-turn, EI HRED - Reddit | Self-play, EI HRED - Reddit

[User]: you’re weird.
[HRED]: i’m sorry.
[User]: i’m feeling sad today.
[HRED]: no.
[User]: yes i am!
[HRED]: i’m sorry, i’m sorry. i don’t want to know.
[User]: you’re weird.
[HRED]: i’m sorry.
[User]: talk to me
[EI HRED]: it’s my job.
[User]: i love you
[EI HRED]: thank you for the kind words.
[User]: do you love me?
[EI HRED]: it’s a bit personal, so I sent you a dm.
[User]: i’m blushing
[EI HRED]: i can’t wait to see what i can do.
[EI HRED]: that’s a good idea.
[EI HRED]: how do you deal with people?
[EI HRED]: i’ve been on reddit for a long time.
[EI HRED]: i am in the same boat.

Table 1: Single-turn evaluation fails to capture a lack of diversity in a dialog model’s responses, as well as its inability to track the conversation and respond in emotionally appropriate ways. We argue multi-turn evaluation is needed to evaluate dialog models, and show that our Emotion+Infersent (EI) models, trained on a larger and more diverse corpus, produce better interactive dialog. We present strong evidence that our novel dialog self-play framework combined with psychologically motivated novel automated metrics can accurately estimate quality of a model with respect to its ability to carry out multi-turn conversations.

The relative sparsity of interactive human evaluation of dialog systems may relate to the difficulty and expense of collecting human data. Therefore, we develop a way to approximate human judgment of interactive dialog quality using a novel form of dialog self-play. We begin by proposing a series of metrics to evaluate the quality of conversation motivated by findings in psychology. We then fit a function that predicts human assessments of conversation quality given these metrics. This function is used to predict bot quality through self-play: for a fixed number of turns, the bot generates utterances which are fed back into itself as input in the next turn. The same metrics described above are computed on the self-play generated conversation, and the same function fit to human data is used to predict the bot quality. We show a very high correlation (r>.7) between the predicted quality scores and the ground-truth human judgments of bot quality, suggesting self-play is a good proxy for interactive conversation assessment.

To demonstrate the relevance of the interactive evaluation and the proposed self-play evaluation, we perform extended experiments with different hierarchical architectures. In particular, we compare three recent baseline hierarchical architectures: HRED, VHRED, VHCR. Motivated by sentiment and semantics being key aspects of producing high quality conversations, we regularize the top level of the hierarchy to ensure it encodes such information, using a form of model distillation Hinton et al. (2015). Our results show the effectiveness of the proposed regularization in interactive evaluation in both the human-bot and the self-play scenarios.

This paper makes three main contributions: 1) it demonstrates the necessity of interactive multi-turn evaluation to capture the quality of dialog systems; 2) it presents a novel self-play framework to estimate a new psychology-motivated hybrid quality score, whose estimates correlate with quality scores obtained from interactive human evaluation more strongly than state-of-the-art automated metrics do; 3) it proposes a new method of regularizing hierarchical seq2seq models with knowledge distillation. All the code, data, and interactive evaluation platform resulting from our work are publicly available.

2 Related work

Despite the noisiness of single-turn human evaluation, interactive evaluation in dialog has been mostly limited to presenting the results of competitions (e.g. the Alexa prize Serban et al. (2017a); Venkatesh et al. (2018), or the Conversational Intelligence Challenge Dinan et al. (2019)). Those findings reveal that most bots do not perform well in interactive evaluation, due to repetitiveness, inability to balance dialog acts across the conversation, and inability to maintain a consistent persona Dinan et al. (2019). Even work aimed at maintaining a persona does not test in an interactive setting Mazare et al. (2018); Li et al. (2016a). To the best of our knowledge, no prior work has compared interactive, multi-turn human evaluations of open-domain dialog models to traditional forms of evaluation.

Dialog systems remain difficult to train due to the lack of metrics that can effectively capture good dialog quality. Several authors have proposed training automatic predictors of human judgment or combining human judgment with automatic metrics Hashimoto and Sassano (2018); Lowe et al. (2017); Hashimoto et al. (2019). However, a state-of-the-art model trained to predict human judgments achieved a correlation of less than 0.5 with the ground truth Lowe et al. (2017).

Perhaps the lack of research into interactive evaluation relates to the difficulty and expense. We show that human judgments of the quality of an interactive evaluation can be automatically and reliably approximated using dialog model self-play. There is limited work investigating self-play for dialog systems: Shah et al. (2018) use a task schema and user simulator to generate samples for input to a goal-directed dialog system, while Li et al. (2016b) use a copy of a dialog model to compute a reward function that can be optimized with reinforcement learning. However, we are not aware of prior work using self-play for approximating interactive human evaluation.

Multi-turn conversation necessitates tracking long-term aspects of the dialog like the topic and tone. Hierarchical recurrent neural networks (RNNs) have been proposed as a way to improve long-term tracking of the conversation, through maintaining both a word- and utterance-level RNN (e.g. Serban et al. (2016, 2017b); Park et al. (2018); Shen et al. (2018); Zhao et al. (2017)). Yet dialog is more than language modeling; it requires topic and social coherence. Prior performance improvements to dialog models using topic information include appending topic as an additional input Ghosh et al. (2016), or extracting topic information using Latent Dirichlet Allocation Li and Jurafsky (2017); Xing et al. (2017). Towards social and emotional coherence, previous works have investigated various features and loss functions based on emotion Zhou et al. (2018); Zhou and Wang (2018); Huang et al. (2018); Rashkin et al. (2018).

3 Knowledge distillation for sentiment and semantic regularization

We build on three existing hierarchical seq2seq architectures designed for dialog. Here, we provide a brief summary; for detailed information, see Serban et al. (2016, 2017b); Park et al. (2018). The first baseline model, Hierarchical Recurrent Encoder Decoder (HRED) Serban et al. (2016), extends a traditional seq2seq model by adding a third recurrent neural network (RNN), which is only updated after each dialog turn, or utterance. The idea behind this Context RNN is that it could potentially track longer term aspects of the conversation, such as the topic; however, there is no guarantee that it will learn to do so. The decoder of the HRED model conditions on both the embedding produced by the encoder for the current utterance, $h^e_n$, and the embedding of the Context RNN for the previous utterance, $h^c_{n-1}$.

Figure 1: Illustration of the EI regularization (blue) applied to VHRED baseline (red) to enforce encoding sentiment and semantics of an utterance in the Context RNN. The EI regularization can be similarly applied to HRED and VHCR.

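The second baseline model, Variational HRED (VHRED) Serban et al. (2017b), extends HRED with a variational constraint on the utterance embedding space $z$. Let $x_n$ be the $n$-th utterance composed of tokens $[w_{n1}, w_{n2}, \dots, w_{nm}]$. VHRED predicts $x_n$ as follows:

$$h^e_n = f^e(x_{n-1}) \qquad (1)$$
$$h^c_n = f^c(h^c_{n-1}, h^e_n) \qquad (2)$$
$$\mu, \Sigma = f(h^c_n) \qquad (3)$$
$$p_\theta(z_n \mid x_{<n}) = \mathcal{N}(z_n \mid \mu, \Sigma) \qquad (4)$$
$$p(x_n \mid x_{<n}) = f^d(h^c_n, z_n) \qquad (5)$$

Equations (1)-(5) describe the computation of VHRED at inference time, where $f^e$, $f^c$, and $f^d$ are Gated Recurrent Unit (GRU) networks for the encoder, context, and decoder RNNs, respectively; at training time, it allows the computation of $z$, $\mu$, and $\Sigma$ to condition on the encoding of the target utterance, $f^e(x_n)$, giving the posterior distribution $p_\psi(z_n \mid x_{\le n})$. A Kullback-Leibler (KL) divergence constraint is placed between the posterior and prior, $D_{KL}(p_\psi \,\|\, p_\theta)$.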
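The KL constraint between posterior and prior has a closed form when both distributions are Gaussian with diagonal covariance. The following is a minimal sketch under that assumption (diagonal covariance is common for VHRED-style models but is not stated in the text); the function name `kl_diag_gaussians` and its argument layout are illustrative, not taken from the authors' code.

```python
import math


def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """KL(q || p) for two diagonal Gaussians.

    Each argument is a list with one entry per latent dimension;
    var_* holds variances (not standard deviations).
    """
    total = 0.0
    for mq, vq, mp, vp in zip(mu_q, var_q, mu_p, var_p):
        # Per-dimension closed form:
        # 0.5 * (log(vp / vq) + (vq + (mq - mp)^2) / vp - 1)
        total += 0.5 * (math.log(vp / vq) + (vq + (mq - mp) ** 2) / vp - 1.0)
    return total
```

With `q` as the posterior and `p` as the prior, the value is zero only when the two distributions coincide and grows as the posterior moves away from the prior.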

The third model, Variational Hierarchical Conversation RNN (VHCR) Park et al. (2018), further extends VHRED by drawing a prior encoding $z^{conv}$ for each conversation, allowing all parts of the model to condition on $z^{conv}$, which is unchanging throughout the conversation.

3.1 Emotion and Infersent regularization (EI)

While the hierarchical design of these models is motivated by a desire to allow tracking high-level, slow-changing aspects of the conversation like topic or tone, it is unclear that the network will be able to model these aspects without additional structure or information. We thus propose a regularization to the top level of the hierarchy, the Context RNN, to force it to encode both the sentiment and semantics of the utterance. To do this, we leverage a state-of-the-art sentiment detection model trained on a large Twitter corpus Felbo et al. (2017), as well as the recently proposed Infersent sentence-embedding model trained to predict the meaning (i.e. entailment, contradiction) of sentences Conneau et al. (2017), and distill them into the Context RNN.

First, we use these models to predict the emotional content, $f_E(x_n)$, and Infersent embedding, $f_I(x_n)$, of each input utterance. We then add an additional network to the hierarchical models which predicts these values based on the Context RNN embedding of the utterance: $f_{distill}(h^c_n) = \langle f_E(x_n), f_I(x_n) \rangle$. The goal is to transfer knowledge of emotion and semantics in text into the Context RNN via knowledge distillation Hinton et al. (2015).

Figure 1 illustrates, in blue color, the EI regularization applied to the VHRED model. The regularization can be similarly applied to HRED and VHCR. In our experiments we refer to the regularized models as HRED-EI, VHRED-EI, and VHCR-EI, respectively, or, more generally, EI models as opposed to baseline models. The code for all our models is available at and was originally based on Park et al. (2018).

4 Interactive evaluation methodologies

4.1 Traditional evaluation

Automatic metrics. Embedding-based metrics compare generated sentences to ground truth sentences using a vector representation of words Mitchell and Lapata (2008). In this work we use three embedding metrics: embedding average, vector extrema, and greedy matching. These three metrics are used in previous open-domain dialog models Serban et al. (2017b); Liu et al. (2016); Park et al. (2018). We also use perplexity as a standard measure of the likelihood of the generated sentences with respect to the target outputs. Another common metric for variational models is the KL-Divergence between the posterior and the prior distribution, as a way of assessing the information encoded into the latent variables Shen et al. (2018) (Figure 1 illustrates KL for the VHRED model).
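As a rough illustration of these automatic metrics, the sketch below implements embedding average, vector extrema, greedy matching, and perplexity on plain Python lists. It follows the usual definitions of these metrics (as in Liu et al. (2016)) rather than the authors' code; the function names are hypothetical, and word vectors are assumed to be supplied by the caller as one list of floats per token.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def average_vec(word_vectors):
    """Embedding average: mean of the word vectors of one sentence."""
    n = len(word_vectors)
    return [sum(col) / n for col in zip(*word_vectors)]


def extrema_vec(word_vectors):
    """Vector extrema: per dimension, keep the value with the largest magnitude."""
    return [max(col, key=abs) for col in zip(*word_vectors)]


def greedy_match(sent_a, sent_b):
    """Greedy matching: each word is paired with its most similar word in the
    other sentence; the two directions are averaged."""
    def one_way(src, tgt):
        return sum(max(cosine(w, t) for t in tgt) for w in src) / len(src)
    return 0.5 * (one_way(sent_a, sent_b) + one_way(sent_b, sent_a))


def perplexity(token_log_probs):
    """Perplexity from natural-log probabilities of the target tokens."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))
```

The average and extrema scores are then the cosine similarity between the two sentence-level vectors, for example `cosine(average_vec(generated), average_vec(reference))`.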

Conventional one-turn human evaluation. We employ a similar method to previous work for our single-turn human evaluation of generated responses Serban et al. (2017b); Park et al. (2018), sampling contexts from each corpus and asking humans to compare the generated responses. To reduce ambiguity, we exclude contexts shorter than 10 tokens and contexts containing <unknown> tokens. We recruited participants from Amazon Mechanical Turk (AMT) to compare generated sentences. Annotators could also select a third “tied” option. For each example (context and pair of generated sentences), we asked annotators to rate quality, fluency, relatedness, and empathy of the generated sentences. Each batch of 100 pairwise comparisons was labeled by 6-8 annotators.

4.2 Interactive human evaluation

To address the limitations of single-turn human evaluation, we built a platform for conducting interactive evaluation of dialog models with humans, which we make available in open-source to the community (see Figure 2). Annotators rated quality, fluency, relatedness, and empathy of a bot after interacting with it for at least 3 turns. Participants can also upvote or downvote each bot response.

Figure 2: Screenshots of our Interactive Evaluation Platform (available at (a) chat window (left) and first part of the evaluation form (right); (b) second part of the evaluation form (to show all evaluation questions asked).

4.3 Novel metrics and self-play

Inspired by real-world human interactions, we introduce novel metrics to capture the morphology of a conversation, i.e., how the users’ responses progress over time and how the bot’s responses interact with them. We propose a hybrid combination of these metrics, $M_H$, that is optimized to predict conversation quality on human data. We then apply $M_H$ to self-play, i.e., the trajectory of bot-generated responses, and investigate how it relates to human ratings of conversation quality.

Sentiment metrics. To approximate the emotional tone of an utterance, we use a state-of-the-art sentiment detector trained on a large Twitter corpus Felbo et al. (2017). This pre-trained model outputs an emotion embedding: a probability distribution over the 64 most frequently used emojis. To estimate the Sentiment Coherence between the user’s query and generated samples, we calculate the cosine similarity between their emotion embeddings. We define a set of weights over the 64 emojis and calculate the weighted sum over an emotion embedding vector to derive a Sentiment score, which is higher for positive sentiment and lower for negative sentiment. We define Sentiment Transition as the change between the user’s Sentiment before and after a bot response. Additionally, Sentiment Min-Max is defined by the slope of change between min and max Sentiment in user utterances over the course of a conversation. Since humour can be used to create solidarity Hay (2000), we count the number of "ha"s in the user response as a proxy for Laughter. The combination of these metrics provides a snapshot of the trajectory of sentiment in a conversation and quantifies if the bot is able to elicit positive emotions in the user.
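The sentiment metrics can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: emotion embeddings are passed in as plain lists (any length; 64 in the setting described above), the emoji weights are supplied by the caller, the slope in `sentiment_min_max` is taken over turn indices, and `laughter` counts "ha" only inside tokens that consist purely of repeated "ha" so that words such as "that" are not counted. All function names are hypothetical.

```python
import math
import re


def sentiment_coherence(user_emotion, bot_emotion):
    """Cosine similarity between two emotion embeddings."""
    dot = sum(a * b for a, b in zip(user_emotion, bot_emotion))
    nu = math.sqrt(sum(a * a for a in user_emotion))
    nb = math.sqrt(sum(b * b for b in bot_emotion))
    return dot / (nu * nb) if nu > 0 and nb > 0 else 0.0


def sentiment_score(emotion_embedding, weights):
    """Weighted sum over the emoji distribution (weights are caller-supplied)."""
    return sum(p * w for p, w in zip(emotion_embedding, weights))


def sentiment_transitions(user_scores):
    """Change in user sentiment between consecutive user turns,
    i.e. before and after each intervening bot response."""
    return [after - before for before, after in zip(user_scores, user_scores[1:])]


def sentiment_min_max(user_scores):
    """Slope between the lowest and highest user sentiment (assumed: per turn index)."""
    i_min = min(range(len(user_scores)), key=user_scores.__getitem__)
    i_max = max(range(len(user_scores)), key=user_scores.__getitem__)
    if i_min == i_max:
        return 0.0
    return (user_scores[i_max] - user_scores[i_min]) / (i_max - i_min)


def laughter(user_utterance):
    """Number of "ha"s inside laugh-like tokens such as "ha", "haha", "hahah"."""
    tokens = re.findall(r"[a-z]+", user_utterance.lower())
    return sum(t.count("ha") for t in tokens if re.fullmatch(r"(?:ha)+h?", t))
```

A negative `sentiment_min_max` means the user's lowest sentiment came after the highest, i.e. the conversation trended downwards.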

Semantic metrics. Language style matching is a strong predictor of relationship stability Ireland et al. (2011) and social cohesiveness Gonzales et al. (2010); thus, we introduce metrics to capture lexical similarity. We use Infersent, a state-of-the-art sentence-embedding model, to encode the user and bot responses into a 4096-dimensional embedding space Conneau et al. (2017). Infersent was trained to distinguish if two sentences are supporting, contradicting, or have a neutral relationship. We estimate Semantic Similarity by calculating the cosine similarity between the Infersent embedding of the user’s query and the generated bot sample. Additionally, we use the classic Word2Vec embeddings trained on Google News Corpus along with average, extrema, and greedy aggregation methods similar to Section 4.1 to derive Average Word Coherence, Extrema Word Coherence, and Greedy Word Coherence between user and bot responses.

Engagement metrics. Asking questions is an important active listening skill which is linked to conversation management, attentiveness, and responsiveness Bodie et al. (2012). Therefore, we define Question Score to quantify if the bot is using question words and/or a question mark. We also introduce # Words as a proxy for user engagement that counts the number of words in their response.
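A minimal sketch of the two engagement metrics is given below. The exact scoring is not specified in the text, so this version makes two assumptions: the list of question words is a hypothetical one chosen for illustration, and the Question Score awards one point for a question word and one for a question mark.

```python
import re

# Assumed word list for illustration; the original list is not given in the text.
QUESTION_WORDS = {"who", "what", "when", "where", "why", "how", "which", "whose", "whom"}


def question_score(bot_utterance):
    """0, 1, or 2: one point for a question word, one for a question mark."""
    tokens = re.findall(r"[a-z']+", bot_utterance.lower())
    has_question_word = any(t in QUESTION_WORDS for t in tokens)
    has_question_mark = "?" in bot_utterance
    return int(has_question_word) + int(has_question_mark)


def num_words(user_utterance):
    """# Words: whitespace-separated token count of the user's response."""
    return len(user_utterance.split())
```

For example, the bot turn "how do you deal with people?" from Table 1 contains both a question word and a question mark.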

Hybrid metric ($M_H$). We combine the aforementioned metrics ($M_i$) using linear regression, and optimize their coefficients ($\lambda_i$) to best predict human judgment of interactive conversation quality: $M_H = \sum_i \lambda_i M_i + M_0$. We use a leave-bot-out scenario where we isolate all the human conversations with one of the dialog models, $\chi_j$, as the hold-out test set. We train the $\lambda_{i,j}$ on the remaining quality ratings. We found that the learned $\lambda_i$ values were stable across the training folds, only exhibiting small variations. Other researchers are encouraged to use our learned coefficients directly or adjust them according to their own interactive human evaluation dataset.
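The leave-bot-out fit described above can be sketched as ordinary least squares with an intercept. The code below is an illustration only: it solves the normal equations in pure Python, and the record layout (dictionaries with the keys `"bot"`, `"metrics"`, and `"quality"`) is a hypothetical choice made for this example, not the format of the released dataset.

```python
def fit_linear(X, y):
    """Least-squares fit of y ~ w0 + sum_i w_i * x_i via the normal equations.

    Returns [w0, w1, ..., wd]. Assumes the features are not collinear;
    a singular system would raise ZeroDivisionError.
    """
    rows = [[1.0] + list(x) for x in X]  # prepend the intercept column
    d = len(rows[0])
    A = [[sum(r[i] * r[j] for r in rows) for j in range(d)] for i in range(d)]
    b = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(d)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(d):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * p for a, p in zip(A[r], A[col])]
                b[r] -= factor * b[col]
    return [b[i] / A[i][i] for i in range(d)]


def hybrid_metric(coeffs, metrics):
    """Apply the fitted coefficients to one conversation's metric vector."""
    return coeffs[0] + sum(c * m for c, m in zip(coeffs[1:], metrics))


def leave_bot_out(conversations, held_out_bot):
    """Fit coefficients on every rated conversation except those with one bot."""
    train = [c for c in conversations if c["bot"] != held_out_bot]
    return fit_linear([c["metrics"] for c in train], [c["quality"] for c in train])
```

Running `leave_bot_out` once per bot yields one coefficient set per fold, which is what makes it possible to check how stable the coefficients are across folds.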

Self-play as an approximation for interactive evaluation. Since interactive human evaluation is costly, we propose a self-play scenario where the dialog system talks to itself, i.e. the bot-generated responses are fed back into it as the next-turn input. For each model $\chi_j$, we generate 100 random conversations, fixed at 10 turns. The self-play trajectories created using model $\chi_j$ are treated as the hold-out set. Therefore, the trained coefficient values $\lambda_{i,j}$, based on all conversations except for the ones with $\chi_j$, are used to calculate the hybrid metric $M_H$ on each generated bot-bot conversation trajectory for $\chi_j$. The estimated $M_H$ values are averaged across conversation samples for $\chi_j$. This value is used for comparison against the ground-truth interactive quality ratings aggregated at the bot level.
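The self-play procedure can be sketched as a simple loop. This is a simplified illustration with several assumptions: `bot` is any callable mapping the previous utterance to a response (a real model would typically condition on more of the history), each conversation is started from a caller-supplied seed utterance (how conversations are randomized is not described here), and `score_fn` stands in for the hybrid metric computed on a trajectory.

```python
def self_play(bot, seed_utterance, num_turns=10):
    """Feed the bot's own output back in as the next input for num_turns turns."""
    trajectory = []
    utterance = seed_utterance
    for _ in range(num_turns):
        utterance = bot(utterance)
        trajectory.append(utterance)
    return trajectory


def estimate_bot_quality(bot, seeds, score_fn, num_turns=10):
    """Average a trajectory-level score over one self-play conversation per seed."""
    scores = [score_fn(self_play(bot, seed, num_turns)) for seed in seeds]
    return sum(scores) / len(scores)
```

With 100 seeds and `num_turns=10`, this mirrors the setup of 100 conversations fixed at 10 turns, giving one averaged estimate per model.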

5 Experiments

5.1 Datasets

A common source of data for open-domain dialog systems is movie scripts, among which the Cornell dataset (Danescu-Niculescu-Mizil and Lee, 2011) is the largest and most commonly used. Therefore, we use it to benchmark against previous state-of-the-art results Park et al. (2018). Its median conversation length is 3 utterances and the conversations are strictly between pairs of speakers. Recognizing that movie lines have limited conversation diversity, we also built a new corpus, Reddit. Across the many different subreddits available, the conversations differ vastly in topic, language style, and participation patterns. We select the Casual Conversations forum (, a community of conversationalists discussing a variety of topics. We collect a dataset of conversations of at least 3 turns with the median conversation containing utterances from conversational exchanges on the platform in 2018 (footnote 1: this Reddit dataset is available at for public use).

5.2 Interactive human evaluation

Table 1 (in Section 1) illustrates how EI regularization produces a higher quality conversation when compared to baseline. Rather than cherry-picking results, we make all of the bots evaluated in the study available at for readers to assess interactively.

                      Cornell                          Reddit
Model  Metric         Baseline         EI               Baseline         EI
HRED   quality        2.182 ± 0.305    2.347 ± 0.313    2.527 ± 0.310    2.714 ± 0.299
       fluency        3.909 ± 0.387    4.000 ± 0.381    4.436 ± 0.349    4.786 ± 0.316
       diversity      2.836 ± 0.374    2.735 ± 0.380    3.418 ± 0.386    3.554 ± 0.372
       contingency    2.200 ± 0.291    2.469 ± 0.336    2.382 ± 0.288    2.536 ± 0.322
       empathy        2.673 ± 0.352    2.490 ± 0.350    3.018 ± 0.329    3.107 ± 0.337
VHRED  quality        2.022 ± 0.309    2.333 ± 0.252    2.694 ± 0.392    2.864 ± 0.341
       fluency        3.109 ± 0.351    3.949 ± 0.396    4.250 ± 0.496    4.477 ± 0.402
       diversity      3.565 ± 0.442    4.385 ± 0.371    5.00 ± 0.468     4.705 ± 0.353
       contingency    2.261 ± 0.287    2.487 ± 0.346    2.472 ± 0.362    2.773 ± 0.370
       empathy        2.739 ± 0.374    2.564 ± 0.367    3.000 ± 0.393    3.341 ± 0.385
VHCR   quality        2.132 ± 0.247    2.548 ± 0.380    2.615 ± 0.350    2.692 ± 0.298
       fluency        2.679 ± 0.306    3.976 ± 0.380    3.923 ± 0.433    4.308 ± 0.395
       diversity      3.755 ± 0.340    4.238 ± 0.421    4.436 ± 0.455    4.231 ± 0.382
       contingency    2.189 ± 0.270    2.571 ± 0.356    2.077 ± 0.298    2.692 ± 0.354
       empathy        2.340 ± 0.316    2.714 ± 0.368    2.974 ± 0.434    3.288 ± 0.379

Table 2: Mean ratings (from humans) for Baseline and EI (Emotion+Infersent) models for HRED, VHRED, and VHCR architectures with 90% confidence intervals. For 3-factor ANOVA results, see Section


Overall, N=566 ratings were captured. Table 2 summarizes human ratings of baseline and EI models obtained via interactive evaluation. We ran a 3-factor ANOVA on the sum of user scores, where the independent variables are model architecture (HRED, VHRED, VHCR), EI regularization (Baseline, EI), and dataset (Cornell, Reddit). We found a significant main effect of EI regularization and dataset, but no significant difference between the three types of hierarchical models. Adding emotion and Infersent (EI) regularization to baseline models significantly improved the interactive chat experience. Further, the models trained on the Reddit dataset performed significantly better. This finding validates the hypothesis that distilling information about topic and tone into the top level of the hierarchy is useful for good conversation, and suggests that the Reddit dataset could provide more realistic training for open-domain dialog and be valuable to the community.

5.3 Traditional metrics

Automatic metrics. Several prior works have focused on ensuring that the variational KL term remains high in order to improve model quality (e.g. Shen et al. (2018); Park et al. (2018)). However, we observe there is no consistency between human quality rating and KL (Table 3). Thus, it is not evident that KL captures human judgments of dialog quality. Even perplexity (a transformation of the cross-entropy loss used to train our models) falls short of capturing human quality judgments, underscoring the difficulty in effectively training good language models. We find embedding metrics show more promise in preserving the order of human quality ratings, but have only weak correlation with human ratings. We present evidence for our novel hybrid metric being a much stronger alternative.

                   Cornell                                Reddit
Model  Version     PPL      KL     Avg    Ext    Grd      PPL      KL     Avg    Ext    Grd
HRED   baseline    52.311   -      .471   .329   .331     41.730   -      .649   .394   .474
       EI          47.636   -      .560   .383   .400     41.245   -      .651   .398   .482
VHRED  baseline    49.414   .264   .539   .352   .395     36.240   .188   .635   .383   .464
       EI          50.526   .517   .545   .355   .394     35.510   .167   .636   .392   .465
VHCR   baseline    61.000   .562   .532   .345   .382     36.736   .267   .619   .371   .448
       EI          49.243   .475   .588   .369   .444     37.198   .231   .639   .394   .469

Table 3: Results of automatic traditional metrics for 1-turn responses of models per context of baseline and EI (Emotion + Infersent) models. PPL: perplexity, KL: KL divergence, Avg: Average, Ext: Extrema, Grd: Greedy

           Cornell                                    Reddit
Model      Wins %       Losses %     Ties %           Wins %       Losses %     Ties %
HRED-EI    40.8 ± 4.9   24.5 ± 4.9   34.8 ± 9.2       31.3 ± 5.2   29.5 ± 6.6   39.3 ± 10.7
VHRED-EI   36.9 ± 4.7   36.6 ± 5.6   26.6 ± 6.9       39.0 ± 7.0   34.0 ± 5.3   27.0 ± 8.9
VHCR-EI    33.0 ± 6.1   29.0 ± 5.4   38.0 ± 10.1      33.7 ± 7.9   27.3 ± 3.3   39.0 ± 8.6

Table 4: Results from human single-turn evaluation for EI (Emotion+Infersent) vs. baseline models as measured by pairwise comparisons of Quality with 90% confidence intervals.

Human one-turn evaluation. As shown in Table 4, while single-turn human evaluation suggests EI regularization is effective due to a higher number of win judgments (following Park et al. (2018), we highlight the higher value between wins/losses and report 90% confidence intervals), the results are noisy and difficult to interpret due to large confidence intervals and a high percentage of ties. The median inter-annotator agreement measured pairwise through Cohen’s kappa Fleiss et al. (1969) for our human evaluation was only 0.176 and 0.120 for Cornell and Reddit respectively. This level of annotator agreement is lower than the median Cohen’s kappa of previous work Liu et al. (2016) and explains the larger confidence intervals. Even after removing ambiguous examples (i.e. where an equal number of annotators select each response as being better), large annotation variation persists. This may be due to subjective interpretations and ambiguity arising from different interpretations of <unknown> tokens or the short length of contexts in the Cornell corpus (e.g. median length of conversation of 3 utterances). These findings further highlight the importance of an interactive evaluation as opposed to limited single-turn responses.
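For reference, pairwise Cohen's kappa and its median over annotator pairs can be computed as in the sketch below. It uses the standard unweighted kappa formula; the label names and the assumption that every annotator labeled the same items in the same order are illustrative, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from statistics import median


def cohens_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two annotators over the same items."""
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each annotator's own label frequencies.
    expected = sum(counts_a[l] * counts_b[l] for l in set(labels_a) | set(labels_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


def median_pairwise_kappa(annotations):
    """Median kappa over all annotator pairs; annotations is a list of label lists."""
    return median(cohens_kappa(a, b) for a, b in combinations(annotations, 2))
```

A kappa near 0 indicates agreement no better than chance, which puts the reported values of 0.176 and 0.120 in context.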

5.4 Novel metrics applied to human data and self-play

We examine how the novel psychologically-inspired metrics relate to the trajectories of the 100 best and 100 worst quality conversations. This is only feasible with interactive evaluation. As shown in Figure 3, we observe that appropriate sentiment, coherent semantics, and engaging users are indispensable to attaining high quality ratings in multi-turn interaction. Comparing EI and baseline conditions, we see a replication of these trends (Figure 4). For example, EI elicits longer responses from users (greater engagement), with more laughter and higher semantic coherence.

Figure 5 summarizes the relationships between interactive human ratings and the automated metrics. We observe that our sentiment metric applied to human data on its own has higher correlation with interactive human ratings than the commonly used metrics such as perplexity and embedding distance metrics. Most importantly, our novel hybrid metric, $M_H$, applied to self-play and aggregated at the model level, is strongly correlated with all human ratings, while previous metrics achieved weaker correlations (Figure 5). This is a significant finding, suggesting that even without running interactive human evaluation, we can automatically approximate it through self-play. This metric is agnostic to the training set and model type and can be calculated on the trajectory of self-play utterances for any chatbot, regardless of its architecture. One interpretation is that the self-play framework keeps the conversation within the training set distribution, and the model is less likely to produce <unknown> tokens. Therefore, $M_H$ and its sub-components have meaningful values for the generated responses and can be useful for quality approximation.
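The model-level comparison amounts to a Pearson correlation between one predicted score per model and the corresponding mean human rating. A minimal sketch is shown below; the function name is hypothetical and the inputs are assumed to be two equal-length lists, one entry per dialog model.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)
```

In this setting `xs` would hold the self-play estimates averaged per model and `ys` the human ratings aggregated per model, so the sample size equals the number of models rather than the number of conversations.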

Though we expect that the hybrid nature of $M_H$ makes it less exploitable, optimizing for its sub-components in isolation through a self-play scenario should be avoided. Unlike in human interaction, maintaining extreme similarity in sentiment or semantics, or just asking questions, in self-play conversation trajectories could backfire by reducing the diversity of generated responses.

Figure 3: One hundred highest vs. lowest quality conversation trajectories; lines: mean, shaded area: 90% confidence intervals, x-axis: conversation turns. (a) Timing of upvote/downvote ratings: A bad first impression impedes overall rating. (b) Participants talk longer and use more words in conversations rated higher. (c) High-quality conversations elicit more positive user sentiment; many participants leave after expressing negative sentiment. (d) High-quality conversations are more semantically similar as measured by average word coherence between user query and bot responses. Users tend to leave the conversation when the bot responses are semantically dissimilar.
Figure 4: EI vs. baseline conversation trajectories; lines: mean, shaded area: 90% confidence intervals, x-axis: conversation turns. (a) EI elicits longer responses from users, suggesting that they are more engaged compared to baseline. (b) EI evokes more laughter from users compared to baseline. (c) EI has higher semantic coherence as measured by average word coherence. The same pattern applies to greedy and extrema word coherence.

Figure 5: Correlations between five human metrics and automated metrics. Sentiment -U has higher correlation with interactive human ratings than prior metrics. Hybrid Metric $M_H$-B/B, our novel self-play based metric, has a higher correlation with all human metrics than any other metric proposed to date. Notes: -U: Calculated on user response, -B: Calculated on bot response, -U/B: Calculated between user and bot response, -B/B: Calculated between consecutive bot utterances.

6 Conclusions

A major obstacle in open-domain dialog generation is the predominant optimization of an objective function that does not map onto human judgment of conversation quality in a naturalistic chat. In this paper, we have argued that it is necessary to go beyond single-turn evaluation by investigating the strengths of interactive evaluation and highlighting blind spots of traditional one-turn evaluation methods. To alleviate this problem, we have combined interactive human data with psychologically motivated measures and introduced a novel hybrid metric. Using this metric in a self-play framework provides results that are strongly correlated with human judgment of chatbot empathy (r>.8) and quality (r>.7). Additionally, we have demonstrated a significant improvement to several hierarchical seq2seq generative models using regularization of the utterance level of the hierarchy with knowledge distillation. Finally, we have open-sourced the platform together with a new Reddit dataset.


We thank Ardavan Saeedi, Max Kleiman-Weiner, Oliver Saunders Wilder, Kyle Kastner, Sebastian Zepf, Ryan Lowe, and Kristy Johnson for helpful discussions, and many others for helping test-drive our bots. We thank the MIT Quest for Intelligence, and MIT Stephen A. Schwarzman College of Computing, Machine Learning Across Disciplines Challenge for providing computing resources, and MIT Media Lab Consortium for supporting this research.


  • Bodie et al. [2012] Graham D Bodie, Kellie St. Cyr, Michelle Pence, Michael Rold, and James Honeycutt. Listening competence in initial interactions i: Distinguishing between what listening is and what listeners do. International Journal of Listening, 26(1):1–28, 2012.
  • Conneau et al. [2017] Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 670–680, 2017.
  • Danescu-Niculescu-Mizil and Lee [2011] Cristian Danescu-Niculescu-Mizil and Lillian Lee. Chameleons in imagined conversations: A new approach to understanding coordination of linguistic style in dialogs. In Proceedings of the 2nd Workshop on Cognitive Modeling and Computational Linguistics, pages 76–87. Association for Computational Linguistics, 2011.
  • Dinan et al. [2019] Emily Dinan, Varvara Logacheva, Valentin Malykh, Alexander Miller, Kurt Shuster, Jack Urbanek, Douwe Kiela, Arthur Szlam, Iulian Serban, Ryan Lowe, et al. The second conversational intelligence challenge (convai2). arXiv preprint arXiv:1902.00098, 2019.
  • Felbo et al. [2017] Bjarke Felbo, Alan Mislove, Anders Søgaard, Iyad Rahwan, and Sune Lehmann. Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm. In 2017 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2017.
  • Fleiss et al. [1969] Joseph L Fleiss, Jacob Cohen, and Brian S Everitt. Large sample standard errors of kappa and weighted kappa. Psychological Bulletin, 72(5):323, 1969.
  • Ghosh et al. [2016] Shalini Ghosh, Oriol Vinyals, Brian Strope, Scott Roy, Tom Dean, and Larry Heck. Contextual lstm (clstm) models for large scale nlp tasks. arXiv preprint arXiv:1602.06291, 2016.
  • Gonzales et al. [2010] Amy L Gonzales, Jeffrey T Hancock, and James W Pennebaker. Language style matching as a predictor of social dynamics in small groups. Communication Research, 37(1):3–19, 2010.
  • Hashimoto and Sassano [2018] Chikara Hashimoto and Manabu Sassano. Detecting absurd conversations from intelligent assistant logs by exploiting user feedback utterances. In Proceedings of the 2018 World Wide Web Conference on World Wide Web, pages 147–156. International World Wide Web Conferences Steering Committee, 2018.
  • Hashimoto et al. [2019] Tatsunori B Hashimoto, Hugh Zhang, and Percy Liang. Unifying human and statistical evaluation for natural language generation. arXiv preprint arXiv:1904.02792, 2019.
  • Hay [2000] Jennifer Hay. Functions of humor in the conversations of men and women. Journal of pragmatics, 32(6):709–742, 2000.
  • Hinton et al. [2015] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
  • Huang et al. [2018] Chenyang Huang, Osmar Zaiane, Amine Trabelsi, and Nouha Dziri. Automatic dialogue generation with expressed emotions. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 49–54, 2018.
  • Ireland et al. [2011] Molly E Ireland, Richard B Slatcher, Paul W Eastwick, Lauren E Scissors, Eli J Finkel, and James W Pennebaker. Language style matching predicts relationship initiation and stability. Psychological science, 22(1):39–44, 2011.
  • Li and Jurafsky [2017] Jiwei Li and Dan Jurafsky. Neural net models of open-domain discourse coherence. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 198–209, 2017.
  • Li et al. [2016a] Jiwei Li, Michel Galley, Chris Brockett, Georgios Spithourakis, Jianfeng Gao, and Bill Dolan. A persona-based neural conversation model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 994–1003, 2016a.
  • Li et al. [2016b] Jiwei Li, Will Monroe, Alan Ritter, Dan Jurafsky, Michel Galley, and Jianfeng Gao. Deep reinforcement learning for dialogue generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1192–1202, 2016b.
  • Liu et al. [2016] Chia-Wei Liu, Ryan Lowe, Iulian Serban, Mike Noseworthy, Laurent Charlin, and Joelle Pineau. How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2122–2132, 2016.
  • Lowe et al. [2017] Ryan Lowe, Michael Noseworthy, Iulian Vlad Serban, Nicolas Angelard-Gontier, Yoshua Bengio, and Joelle Pineau. Towards an automatic turing test: Learning to evaluate dialogue responses. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1116–1126, 2017.
  • Mazare et al. [2018] Pierre-Emmanuel Mazare, Samuel Humeau, Martin Raison, and Antoine Bordes. Training millions of personalized dialogue agents. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2775–2779, 2018.
  • Mitchell and Lapata [2008] Jeff Mitchell and Mirella Lapata. Vector-based models of semantic composition. In Proceedings of ACL-08: HLT, pages 236–244, 2008.
  • Park et al. [2018] Yookoon Park, Jaemin Cho, and Gunhee Kim. A hierarchical latent structure for variational conversation modeling. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1792–1801, 2018.
  • Rashkin et al. [2018] Hannah Rashkin, Eric Michael Smith, Margaret Li, and Y-Lan Boureau. I know the feeling: Learning to converse with empathy. arXiv preprint arXiv:1811.00207, 2018.
  • Serban et al. [2016] Iulian V Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau. Building end-to-end dialogue systems using generative hierarchical neural network models. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
  • Serban et al. [2017a] Iulian V Serban, Chinnadhurai Sankar, Mathieu Germain, Saizheng Zhang, Zhouhan Lin, Sandeep Subramanian, Taesup Kim, Michael Pieper, Sarath Chandar, Nan Rosemary Ke, et al. A deep reinforcement learning chatbot. arXiv preprint arXiv:1709.02349, 2017a.
  • Serban et al. [2017b] Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron Courville, and Yoshua Bengio. A hierarchical latent variable encoder-decoder model for generating dialogues. In Thirty-First AAAI Conference on Artificial Intelligence, 2017b.
  • Shah et al. [2018] Pararth Shah, Dilek Hakkani-Tur, Bing Liu, and Gokhan Tur. Bootstrapping a neural conversational agent with dialogue self-play, crowdsourcing and on-line reinforcement learning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers), pages 41–51, 2018.
  • Shen et al. [2018] Xiaoyu Shen, Hui Su, Shuzi Niu, and Vera Demberg. Improving variational encoder-decoders in dialogue generation. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • Venkatesh et al. [2018] Anu Venkatesh, Chandra Khatri, Ashwin Ram, Fenfei Guo, Raefer Gabriel, Ashish Nagar, Rohit Prasad, Ming Cheng, Behnam Hedayatnia, Angeliki Metallinou, et al. On evaluating and comparing conversational agents. arXiv preprint arXiv:1801.03625, 4:60–68, 2018.
  • Xing et al. [2017] Chen Xing, Wei Wu, Yu Wu, Jie Liu, Yalou Huang, Ming Zhou, and Wei-Ying Ma. Topic aware neural response generation. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
  • Zhao et al. [2017] Tiancheng Zhao, Ran Zhao, and Maxine Eskenazi. Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 654–664, 2017.
  • Zhou et al. [2018] Hao Zhou, Minlie Huang, Tianyang Zhang, Xiaoyan Zhu, and Bing Liu. Emotional chatting machine: Emotional conversation generation with internal and external memory. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • Zhou and Wang [2018] Xianda Zhou and William Yang Wang. Mojitalk: Generating emotional responses at scale. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1128–1137, 2018.

7 Supplementary Materials

7.1 Ablation models results

We conducted additional evaluations of ablations of our EI models, to determine whether emotion or infersent regularization provided the most benefit. The results in Table 5 reveal that this depends on the dataset and the model in question. We also checked whether simply appending the emotion and infersent embedding of an utterance to the top level of the hierarchy could provide the same benefit as knowledge distillation, even though this would require retaining copies of the DeepMoji and Infersent models, and would be more computationally expensive at inference time. Table 5 reveals that the input-only models do not out-perform the knowledge-distillation EI models on automatic metrics.

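| Model | Version | Cornell PPL | Cornell KL | Cornell Avg | Cornell Ext | Cornell Grd | Reddit PPL | Reddit KL | Reddit Avg | Reddit Ext | Reddit Grd |
|---|---|---|---|---|---|---|---|---|---|---|---|
| HRED | baseline | 52.311 | - | .471 | .329 | .331 | 41.730 | - | .649 | .394 | .474 |
| HRED | input only | 47.911 | - | .549 | .381 | .392 | 41.227 | - | .644 | .395 | .469 |
| HRED | EI | 48.619 | - | .562 | .359 | .416 | 47.395 | - | .541 | .310 | .371 |
| HRED | EI | 47.988 | - | .562 | .381 | .405 | 41.083 | - | .646 | .394 | .472 |
| HRED | EI | 47.636 | - | .560 | .383 | .400 | 41.245 | - | .651 | .398 | .482 |
| VHRED | baseline | 49.414 | .264 | .539 | .352 | .395 | 36.240 | .188 | .635 | .383 | .464 |
| VHRED | input only | 49.819 | .442 | .543 | .353 | .393 | 40.248 | .312 | .630 | .377 | .456 |
| VHRED | EI | 51.346 | .636 | .552 | .358 | .401 | 36.212 | .199 | .631 | .380 | .458 |
| VHRED | EI | 52.143 | .702 | .539 | .346 | .392 | 36.518 | .222 | .637 | .381 | .463 |
| VHRED | EI | 50.526 | .517 | .545 | .355 | .394 | 35.510 | .167 | .636 | .392 | .465 |
| VHCR | baseline | 61.000 | .562 | .532 | .345 | .382 | 36.736 | .267 | .619 | .371 | .448 |
| VHCR | input only | 50.966 | .558 | .531 | .344 | .382 | 37.342 | .287 | .608 | .365 | .431 |
| VHCR | EI | 52.407 | .590 | .585 | .374 | .442 | 37.449 | .254 | .619 | .366 | .444 |
| VHCR | EI | 53.085 | .575 | .544 | .356 | .390 | 37.109 | .199 | .629 | .378 | .457 |
| VHCR | EI | 49.243 | .475 | .588 | .369 | .444 | 37.198 | .231 | .639 | .394 | .469 |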
Table 5: Automatic metrics computed on ablations of the EI models, trained with distillation from only the emotion recognition model (EI), the infersent model (EI), or receiving emotion and infersent only as input, without knowledge distillation (input-only). Whether emotion or semantics provides the most benefit depends on the dataset and the model.

7.2 Hybrid metric coefficients

Figure 6: The learned coefficients of the hybrid metric. Using a leave-bot-out method, we observe that the coefficients are stable. The error bars show 90% confidence intervals.
Figure 7: Correlation matrix showing the relationships between different aspects of interactive human evaluation. We observe a strong correlation across these aspects.

We optimized the coefficients of the sub-components of the hybrid metric using a leave-bot-out scenario. As shown in Figure 6, the coefficients are stable across these training iterations. However, because we optimized a linear regression equation and some of the features carry overlapping information, such as different aggregation methods for calculating word coherence, we do not suggest interpreting the coefficients directly; further investigation is required.
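A leave-bot-out stability check of this kind can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure used for Figure 6: it fits a single-feature least-squares line for brevity, whereas the hybrid metric combines several features, and the `BotData` layout and function names are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical data layout: bot name -> (self-play proxy value, mean human rating).
BotData = Dict[str, Tuple[float, float]]


def fit_line(points: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def leave_bot_out(data: BotData) -> Dict[str, Tuple[float, float]]:
    """Refit the regression once per bot, each time holding that bot out."""
    fits = {}
    for held_out in data:
        train = [value for name, value in data.items() if name != held_out]
        fits[held_out] = fit_line(train)
    return fits


# Synthetic points lying exactly on y = 2x + 1, so every refit should agree.
toy_data = {"a": (0.0, 1.0), "b": (1.0, 3.0), "c": (2.0, 5.0), "d": (3.0, 7.0)}
toy_fits = leave_bot_out(toy_data)
```

Comparing the spread of the fitted coefficients across held-out bots gives a rough sense of their stability; with real, noisy data the refits would differ, and the size of that difference is what the error bars summarize.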

7.3 Human interactive ratings correlation table

Figure 7 provides detailed information about different aspects of interactive human ratings. We observe that quality is highly correlated with the other aspects of the conversation. Specifically, it is most strongly correlated with contingency, which further highlights the importance of semantic metrics of bot-generated responses in a good-quality conversation. It is also highly correlated with empathy, which could be better captured by sentiment metrics.

7.4 Self-play correlation table

Figure 8 provides detailed information about the introduced metrics applied to self-play. We observe that several sentiment, semantic, and engagement metrics also transfer to self-play trajectories, and that the introduced hybrid metric is highly correlated with human quality ratings aggregated at the bot level. However, exploiting sentiment or semantic similarity in a self-play scenario should be avoided, as it hurts the ratings of the model, especially the diversity of responses.

Figure 8: Correlation matrix showing the relationships between different automated metrics on self-play trajectories and interactive human ratings aggregated at the bot level. We observe that inducing positive sentiment, as measured by Sentiment and Laughter, and being able to generate longer sentences in self-play are associated with higher quality ratings of the model. It is worth mentioning that maintaining extreme similarity in sentiment or semantics, or just asking questions, in self-play conversation trajectories could backfire by reducing the diversity of generated responses, although these metrics remain applicable to interactive human data. Most importantly, our novel hybrid metric applied to self-play (Hybrid Metric -B/B) is highly correlated with all human ratings of the dialog model. Postfixes: -I: Interactive human evaluation, -B: Calculated on bot response, -B/B: Metric applied to self-play on two consecutive bot-generated utterances when the bot converses with itself.

7.5 Reddit casual conversation corpus details

Using the 1.7 Billion post comments dataset hosted on Google BigQuery, we extracted post ids for all posts on from July 2018 to December 2018. For each post, we built a conversation tree of comments and subsequent replies to extract three-turn dialog. We removed links, excluded [removed] and [deleted] tag comments, and only used text before “edit” comments to preserve the original content in the conversation. We make this dataset available for public use at

7.6 Embedding-based metrics

Embedding Average: Taking the mean word embedding of the generated sentence and of the target sentence, the embedding average metric is the cosine distance between the two.

Vector Extrema: The extrema vector for a sentence can be calculated by taking the most extreme value for each dimension among the word vectors in the sentence. The extrema embedding metric is again the cosine distance between the extrema sentence vectors.

Greedy Matching: The greedy matching distance is computed by matching word vectors in a source sentence with the closest word vectors in the target sentence.
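The three metrics can be sketched in plain Python as below. This is an illustrative sketch rather than the evaluation code behind the reported numbers. It assumes word vectors are given as lists of floats, computes cosine similarity (cosine distance is one minus this value), takes the "most extreme" value to be the one with the largest magnitude, and averages greedy matching over both directions, which follows the common formulation discussed by Liu et al. [2016]; all function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def embedding_average(words: List[Vector]) -> Vector:
    """Mean of the word vectors in a sentence, dimension by dimension."""
    n = len(words)
    return [sum(column) / n for column in zip(*words)]


def vector_extrema(words: List[Vector]) -> Vector:
    """Per dimension, keep the value with the largest magnitude."""
    return [max(column, key=abs) for column in zip(*words)]


def greedy_match(source: List[Vector], target: List[Vector]) -> float:
    """Average, over source words, of the best cosine match in the target."""
    return sum(max(cosine(s, t) for t in target) for s in source) / len(source)


def greedy_matching(generated: List[Vector], target: List[Vector]) -> float:
    """Greedy matching averaged over both directions."""
    return 0.5 * (greedy_match(generated, target) + greedy_match(target, generated))


# Toy 2-dimensional "word vectors" for a generated and a target sentence.
generated = [[1.0, 0.0], [0.0, 1.0]]
target = [[1.0, 0.0]]

avg_score = cosine(embedding_average(generated), embedding_average(target))
ext_score = cosine(vector_extrema(generated), vector_extrema(target))
grd_score = greedy_matching(generated, target)
```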


7.7 One-turn evaluation setup details

Figure 9: One-Turn evaluation interface crowdworkers see.
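| Model | Metric | Cornell Wins % | Cornell Losses % | Cornell Ties % | Reddit Wins % | Reddit Losses % | Reddit Ties % |
|---|---|---|---|---|---|---|---|
| HRED | quality | 40.8 ± 4.9 | 24.5 ± 4.9 | 34.8 ± 9.2 | 31.3 ± 5.2 | 29.5 ± 6.6 | 39.3 ± 10.7 |
| HRED | fluency | 10.3 ± 4.4 | 17.3 ± 4.1 | 72.5 ± 8.1 | 22.8 ± 5.3 | 20.0 ± 7.1 | 57.3 ± 11.2 |
| HRED | relatedness | 36.3 ± 6.5 | 28.7 ± 4.8 | 35.0 ± 7.9 | 34.3 ± 2.8 | 30.3 ± 7.8 | 35.5 ± 9.7 |
| HRED | empathy | 37.8 ± 7.2 | 24.5 ± 5.6 | 37.8 ± 8.9 | 32.5 ± 3.4 | 31.2 ± 5.9 | 36.3 ± 8.0 |
| VHRED | quality | 36.9 ± 4.7 | 36.6 ± 5.6 | 26.6 ± 6.9 | 39.0 ± 7.0 | 34.0 ± 5.3 | 27.0 ± 8.9 |
| VHRED | fluency | 23.4 ± 9.6 | 27.7 ± 8.3 | 48.9 ± 16.3 | 29.0 ± 13.6 | 23.3 ± 9.3 | 47.7 ± 21.6 |
| VHRED | relatedness | 37.4 ± 5.4 | 33.1 ± 7.2 | 29.7 ± 9.6 | 38.3 ± 5.6 | 33.0 ± 5.1 | 28.7 ± 9.0 |
| VHRED | empathy | 36.6 ± 9.4 | 34.0 ± 8.4 | 29.4 ± 15.8 | 34.7 ± 8.7 | 33.7 ± 6.7 | 31.7 ± 10.9 |
| VHCR | quality | 33.0 ± 6.1 | 29.0 ± 5.4 | 38.0 ± 10.1 | 33.7 ± 7.9 | 27.3 ± 3.3 | 39.0 ± 8.6 |
| VHCR | fluency | 13.5 ± 4.1 | 25.5 ± 4.3 | 66.0 ± 7.7 | 24.7 ± 7.2 | 18.3 ± 5.2 | 57.0 ± 10.2 |
| VHCR | relatedness | 40.8 ± 4.8 | 26.8 ± 6.8 | 32.5 ± 10.5 | 28.3 ± 6.6 | 31.3 ± 3.6 | 40.3 ± 8.4 |
| VHCR | empathy | 32.8 ± 6.6 | 28.0 ± 7.8 | 39.3 ± 13.7 | 30.3 ± 3.9 | 24.0 ± 4.6 | 45.7 ± 7.6 |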
Table 6: Results from human single-turn evaluation for EI vs. Baseline models for HRED, VHRED, and VHCR models across quality, fluency, relatedness and empathy pairwise comparisons with 90% confidence intervals

We replicated the one-turn evaluation found in previous work (Serban et al. [2017b], Park et al. [2018]). We sampled conversation contexts from the test set of each corpus and generated samples from each model based on these contexts. After filtering by context length (> 10 tokens) and removing contexts that contain <unknown> tokens, we sampled 100 examples. We divided each set of 100 examples into two batches of 50 for annotators to rate. Annotators recruited through Amazon Mechanical Turk were first trained with an example question. Annotators had to be in the United States and had to answer all training questions correctly to complete the task. Figure 9 shows the interface displayed to crowdworkers in the one-turn evaluation task. We asked annotators to select which sentence was better in terms of quality, fluency, relatedness, and empathy. Table 6 summarizes the results for all four metrics and is an uncondensed version of Table 4. One notable exception to the pattern of EI models winning is fluency: baseline models trained on the Cornell corpus obtained more fluency wins.

Noting the high disagreement between annotators in this task, we further examined the ambiguous examples in the human evaluation test set. We define an ambiguous example as a question for which as many annotators select the first sentence as better as select the second sentence. If the two sentences were similar, annotators would select the "tied" option, so an equal number of selections for each answer as the winner indicates a disagreement in perception. Table 7 summarizes the number of ambiguous examples per model and metric, out of 100 in total for each cell. After removing these ambiguous examples from the calculation of wins, losses, and ties, the results are similar to Table 2. The number of ambiguous examples further highlights the noisy and unreliable nature of single-turn evaluation.
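The definition of an ambiguous example can be made concrete with a small sketch. The vote labels `"A"`, `"B"`, and `"tie"` and the function names are hypothetical; the sketch also assumes that an example on which every annotator chose "tied" does not count as ambiguous, since that indicates agreement rather than disagreement.

```python
from collections import Counter
from typing import List


def is_ambiguous(votes: List[str]) -> bool:
    """True when as many annotators prefer sentence A as prefer sentence B.

    Assumption: at least one annotator must have picked a winner, so an
    example where everyone selected "tie" is not treated as ambiguous.
    """
    counts = Counter(votes)
    return counts["A"] == counts["B"] and counts["A"] > 0


def count_ambiguous(examples: List[List[str]]) -> int:
    """Number of ambiguous examples in a batch of per-example vote lists."""
    return sum(is_ambiguous(votes) for votes in examples)


# Toy batch: three annotators per example.
toy_batch = [
    ["A", "B", "tie"],      # one vote each way: ambiguous
    ["A", "A", "B"],        # clear majority for A
    ["tie", "tie", "tie"],  # annotators agree the sentences are similar
]
num_ambiguous = count_ambiguous(toy_batch)
```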

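| Metric | Cornell HRED | Cornell VHRED | Cornell VHCR | Reddit HRED | Reddit VHRED | Reddit VHCR |
|---|---|---|---|---|---|---|
| Quality | 12 | 13 | 15 | 26 | 15 | 9 |
| Fluency | 4 | 10 | 10 | 12 | 20 | 6 |
| Relatedness | 11 | 12 | 10 | 15 | 13 | 7 |
| Empathy | 16 | 9 | 12 | 14 | 17 | 7 |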
Table 7: Count of ambiguous examples in human one-turn evaluation.

7.8 Interactive evaluation details

Figure 10: Interactive evaluation chat interface

For our interactive evaluation, we built a platform to mimic a natural chat setting. Figure 10 shows an example conversation within the platform as seen by interactive evaluation participants. Annotators can optionally click the up and down arrows beside each chatbot response to give feedback on the specific utterance. Once 3 or more turns of the conversation have taken place, participants may click "Close Chat and Rate". This takes them to the rating page, where the conversation to be rated is presented alongside the 7-point Likert scale questions used to assess the conversation (Figure 2).

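| Version | Cornell HRED | Cornell VHRED | Cornell VHCR | Reddit HRED | Reddit VHRED | Reddit VHCR |
|---|---|---|---|---|---|---|
| Baseline | 55 | 46 | 53 | 55 | 36 | 39 |
| EI | 49 | 39 | 42 | 56 | 44 | 52 |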
Table 8: Summary table of ratings collected per model.

Participants were recruited for interactive evaluation both from Amazon Mechanical Turk and from the authors' institution. Although the minimum required number of turns was 3, the average number of responses per conversation varied across participants from 3.00 to 10.58 turns, with an overall average of 5.43 turns. Table 8 summarizes the number of ratings collected for each model.

The average rating differed significantly between annotators. As a result, we also computed scores for interactive evaluation after normalizing each annotator's scores. We restricted the ratings to those from annotators who completed 10 or more ratings, which left 301 ratings. Similar to Table 2, the mean ratings for EI (Emotion+Infersent) models were higher than the mean ratings for the baseline models.
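One way to carry out such a per-annotator normalization is sketched below. The text does not specify the normalization used, so z-scoring each annotator's ratings is an assumption made here for illustration, as are the `Rating` tuple layout and the function name; only the threshold of 10 ratings comes from the text.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, List, Tuple

# Hypothetical record layout: (annotator_id, model_name, score).
Rating = Tuple[str, str, float]


def normalize_ratings(ratings: List[Rating], min_ratings: int = 10) -> List[Rating]:
    """Z-score each annotator's ratings, dropping annotators with too few ratings."""
    by_annotator: Dict[str, List[float]] = defaultdict(list)
    for annotator, _, score in ratings:
        by_annotator[annotator].append(score)

    normalized = []
    for annotator, model, score in ratings:
        scores = by_annotator[annotator]
        if len(scores) < min_ratings:
            continue  # annotator did not complete enough ratings
        mu, sigma = mean(scores), pstdev(scores)
        # An annotator who always gives the same score carries no signal.
        z = (score - mu) / sigma if sigma else 0.0
        normalized.append((annotator, model, z))
    return normalized


# Toy data with a lowered threshold so the example stays small.
toy_ratings = [("u1", "EI", 5.0), ("u1", "baseline", 3.0), ("u2", "EI", 7.0)]
toy_normalized = normalize_ratings(toy_ratings, min_ratings=2)
```

Mean ratings per model can then be compared on the normalized scores, which removes differences in how generous individual annotators are.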

7.9 Website server setup and configuration

The server was hosted on a Google Cloud Platform virtual instance with 64GB of RAM and an NVIDIA Tesla P100 graphics card. The backend was a Django program served by NGINX and uWSGI. For simplicity, we opted to have the Django process import the chatbots into the same Python process as Django, rather than have the two connect to each other via other means such as sockets. This configuration decreased development time and increased reliability, but it would need to be revisited if the server needed to scale several orders of magnitude past what was required for this study. The current configuration was still able to support hundreds of simultaneous users and host more than 30 bots concurrently.

The chatbots were kept in a separate project from the Django project and maintained separately from the server code. Each chatbot extended an abstract class that defined key methods for the Django program to use, and was registered to a globally accessible dictionary via a decorator. The Django project was provided the path to the Chatbots project in its PYTHONPATH, so it could import the dictionary in which all the chatbot objects had been registered and use that to dynamically determine which chatbots were available and to access them in its views.
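The registration pattern described above can be sketched as follows. This is an illustration of the general technique, not the project's actual code: the names `CHATBOTS`, `Chatbot`, `handle_messages`, `register_chatbot`, and `EchoBot` are all hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List, Type

# Globally accessible registry mapping a bot identifier to a loaded instance.
CHATBOTS: Dict[str, "Chatbot"] = {}


class Chatbot(ABC):
    """Abstract interface the web views rely on (method name is hypothetical)."""

    @abstractmethod
    def handle_messages(self, messages: List[str]) -> str:
        """Return the bot's reply given the conversation so far."""


def register_chatbot(bot_id: str) -> Callable[[Type[Chatbot]], Type[Chatbot]]:
    """Class decorator that instantiates a bot once and stores it in the registry."""
    def decorator(cls: Type[Chatbot]) -> Type[Chatbot]:
        CHATBOTS[bot_id] = cls()  # instantiated at import time, kept in memory
        return cls
    return decorator


@register_chatbot("echo")
class EchoBot(Chatbot):
    """Toy bot used only to show registration; it repeats the last message."""

    def handle_messages(self, messages: List[str]) -> str:
        return messages[-1]


# A view can discover the available bots and dispatch to one of them.
available = sorted(CHATBOTS)
reply = CHATBOTS["echo"].handle_messages(["hello"])
```

Because registration happens when the module is imported, the server only needs the chatbot package on its import path to discover every bot, which matches the PYTHONPATH arrangement described above.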

It is important to note that the chatbots used PyCUDA, and PyCUDA does not work in a multiprocessing environment. Because of this, uWSGI needed to be configured to have only one Python process and to disable any attempt at multiprocessing. Furthermore, the chatbots required substantial startup times, so all chatbots were kept in memory at all times in the Django process. In order to keep all the chatbots in memory concurrently, we needed a very large amount of RAM on our server and opted for a 64GB virtual instance and a GPU with 16GB of RAM. This combination of CUDA to run the chatbots on the GPU and a large amount of RAM to keep all bots in memory at the same time resulted in very fast server response times, with effectively no increase in response time for requests that used the bots compared to requests that did not.

For further information and instructions on server configuration, please read the server documentation available at

7.10 Emotion embedding details

Figure 11: (a) 64-most frequent emojis as predicted by Felbo et al. [2017] used for calculating emotion embeddings. (b) Assigned weights used for reducing the 64-dimensional emotion embedding into a Sentiment score.

We calculate emotion embeddings of an utterance using a state-of-the-art sentiment-detection model (Felbo et al. [2017]). This pre-trained model outputs a probability distribution over the 64 most frequently used emojis, as presented in Felbo et al. [2017]. We define a set of weights over the emojis and calculate the weighted sum over an emotion embedding vector to derive a Sentiment score, which is higher for positive sentiment and lower for negative sentiment (see Figure 11).
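The weighted sum that reduces an emotion embedding to a Sentiment score can be sketched as below. The function name is hypothetical, and the toy example uses 4 made-up emoji probabilities and weights instead of the 64-dimensional output and the weights shown in Figure 11, which are not reproduced here.

```python
from typing import Sequence


def sentiment_score(emoji_probs: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of an emoji probability distribution.

    Positive weights mark emojis treated as positive sentiment and negative
    weights mark emojis treated as negative sentiment.
    """
    if len(emoji_probs) != len(weights):
        raise ValueError("emoji_probs and weights must have the same length")
    return sum(p * w for p, w in zip(emoji_probs, weights))


# Toy 4-emoji example (illustrative values only, not those of Figure 11).
toy_probs = [0.5, 0.25, 0.25, 0.0]
toy_weights = [1.0, 0.5, -1.0, -0.5]
toy_sentiment = sentiment_score(toy_probs, toy_weights)
```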

7.11 Hyper-parameter tuning details

For the baseline models that were trained on the Cornell dataset, we used the parameters reported in Serban et al. [2016, 2017b] and Park et al. [2018] that achieved state-of-the-art results for HRED, VHRED, and VHCR models trained on the same dataset, respectively. For EI models, we compared combinations of values for encoder hidden size (400, 600, 800, 1250), decoder hidden size (400, 600, 800, 1250), context size (1000, 1250), embedding size (300, 400, 500), word drop (0, .25), sentence drop (0, .25), and beam size (1, 5). The learning rate (.0001) and dropout (.2) were fixed. A batch size of 80 was used; if a job could not be completed due to memory limitations, a batch size of 64 was used instead. Additionally, we tuned the EI parameters, i.e., emotion weight (25, 150), infersent weight (25K, 30K, 50K, 100K), emotion sizes (64, 128, 256), and infersent sizes (128, 1000, 2000, 4000). Due to limited computational resources, we were not able to run a grid search over the aforementioned values. Instead, we used combinations of the parameters that heuristically were more viable.
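To give a sense of why an exhaustive grid search was out of reach, the size of the full grid implied by the values listed above for the Cornell EI models can be counted directly. The dictionary keys are hypothetical names chosen for this sketch; the candidate values are those listed in the text.

```python
from math import prod
from typing import Dict, List

# Candidate values for EI models trained on the Cornell dataset, as listed above.
cornell_ei_grid: Dict[str, List[float]] = {
    "encoder_hidden_size": [400, 600, 800, 1250],
    "decoder_hidden_size": [400, 600, 800, 1250],
    "context_size": [1000, 1250],
    "embedding_size": [300, 400, 500],
    "word_drop": [0, 0.25],
    "sentence_drop": [0, 0.25],
    "beam_size": [1, 5],
    "emotion_weight": [25, 150],
    "infersent_weight": [25_000, 30_000, 50_000, 100_000],
    "emotion_size": [64, 128, 256],
    "infersent_size": [128, 1000, 2000, 4000],
}


def grid_size(grid: Dict[str, List[float]]) -> int:
    """Number of configurations in the full Cartesian product of the grid."""
    return prod(len(values) for values in grid.values())


total_configs = grid_size(cornell_ei_grid)  # 768 architecture settings x 96 EI settings
```

The full product contains 73,728 configurations per model architecture, each requiring a complete training run, which is consistent with the decision to evaluate only heuristically chosen combinations.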

For the models that were trained on the Reddit dataset, a set of properly tuned baseline parameters did not exist. Thus, to ensure a fair comparison, we used a similar approach for baseline and EI hyper-parameter tuning. We explored combinations of values for encoder hidden size (400, 600, 800, 1250), decoder hidden size (400, 600, 800, 1250), context size (1000, 1250), embedding size (300, 400, 500, 600), word drop (0, .25), sentence drop (0, .1, .25), and beam size (1, 5). The learning rate (.0001) and dropout (.2) were fixed. A batch size of 64 was used; if a job could not be completed due to memory limitations, a batch size of 32 was used instead. Due to limited computational resources, we were not able to run a grid search over all the aforementioned values. Instead, we used combinations of the parameters that heuristically were more viable, and any selected combination was tested for both baseline and EI models. Then, for EI models, we tuned the parameters that were solely relevant to the EI design, such as the weights of the emotion and infersent terms in the loss function and the sizes of the added discriminator networks: emotion weight (25), infersent weight (25K, 50K, 100K), emotion sizes (64, 128, 256), and infersent sizes (100, 128, 1000, 2000, 4000). See Table 9 for a summary of the final selected parameters.




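| Dataset | Version | Model | Batch size | Dropout | Decoder hidden size | Encoder hidden size | Context size | Embedding size | Word drop | Sentence drop | Beam size | Emotion weight | Emotion discriminator layer size | Infersent weight | Infersent discriminator layer size |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Cornell | Baseline | HRED | 80 | .2 | 400 | 400 | 1000 | 300 | .0 | .0 | 5 | - | - | - | - |
| Cornell | Baseline | VHRED | 80 | .0 | 1000 | 1000 | 1000 | 400 | .25 | .0 | 5 | - | - | - | - |
| Cornell | Baseline | VHCR | 80 | .2 | 1000 | 1000 | 1000 | 500 | .25 | .25 | 5 | - | - | - | - |
| Cornell | EI | HRED | 64 | .2 | 1000 | 1000 | 1000 | 500 | .0 | .0 | 1 | 25 | 128 | 100K | 4000 |
| Cornell | EI | VHRED | 80 | .2 | 1250 | 1250 | 1000 | 600 | .0 | .0 | 1 | 25 | 128 | 30K | 128 |
| Cornell | EI | VHCR | 32 | .2 | 1000 | 1000 | 1250 | 600 | .0 | .0 | 1 | 25 | 128 | 25K | 4000 |
| Reddit | Baseline | HRED | 64 | .2 | 1000 | 1000 | 1000 | 500 | .0 | .0 | 1 | - | - | - | - |
| Reddit | Baseline | VHRED | 32 | .2 | 1250 | 1250 | 1000 | 600 | .0 | .0 | 1 | - | - | - | - |
| Reddit | Baseline | VHCR | 32 | .2 | 1000 | 1000 | 1250 | 600 | .0 | .25 | 1 | - | - | - | - |
| Reddit | EI | HRED | 64 | .2 | 1000 | 1000 | 1000 | 500 | .0 | .0 | 1 | 25 | 128 | 25K | 2000 |
| Reddit | EI | VHRED | 32 | .2 | 1250 | 1250 | 1250 | 600 | .0 | .0 | 1 | 25 | 128 | 100K | 4000 |
| Reddit | EI | VHCR | 32 | .2 | 1000 | 1000 | 1250 | 600 | .0 | .0 | 1 | 25 | 128 | 100K | 4000 |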
Table 9: Hyper-parameters used for different models.