Open-domain dialog system evaluation is one of the most difficult challenges in the dialog community. Open-domain chatbots have a user-centric goal: to provide humans with an enjoyable user experience. However, user experience is difficult to quantify due to bias and variance among different users. Previous research has optimized automatic dialog evaluation metrics such as BLEU bleu, which measures the difference between the generated responses and the reference responses. Due to the contrast between the one-to-many nature of open-domain conversations and the limited number of available references, such metrics correlate poorly with human judgments howNotEval; ademACL; whyWeNeedNewEval. Designing a fully automatic dialog evaluation metric is still an open research problem.
Currently, both academia and industry alexa; li2019end; MOSS rely on human ratings to evaluate open-domain dialogs. Following the ubiquitous application of Likert scores in survey research such as online reviews marketing1 and consumer satisfaction marketing1992, a common practice in human evaluation of dialogs is to ask either a third-person rater or the chatbot user to report a Likert score. However, concerns have been raised about the validity of Likert score-based ratings. Kulikov et al. DBLP:journals/corr/abs-1811-00907 observe high bias and variance of Likert scores. This issue is more severe in real-world commercial dialog systems like the Alexa social chatbot alexa; alexaEval, because real-world users have neither the monetary incentive nor the annotation training needed to calibrate their ratings.
To explore the validity of Likert score based dialog evaluation, we first perform a large-scale data analysis of 3,608 collected real-world human-machine dialogs along with their self-reported Likert scale ratings from the Amazon Alexa Prize Challenge alexa; gunrock; chen2018gunrock. One noticeable property of the ratings is their J-shaped skewed distribution: nearly half of the dialogs are rated with the highest Likert score. The prevalence of such extreme distributions of ratings has long been observed by the business research community in various aspects of real life marketing25platforms; marketing1; marketing2; marketing3.
Although we could tell which dialog system is better by running a statistical test on a large number of noisy ratings, it is difficult to reliably locate dialogs with bad performance in order to improve dialog system quality. In this paper, we take on the challenge of calibrating a large number of noisy self-reported user ratings to build better dialog evaluation models. We formulate the task as first denoising the self-reported user ratings and then training a model on the cleaned ratings. We design CMADE (Comparison Model for Automatic Dialog Evaluation), a progressive three-stage denoising pipeline. We first perform self-supervised learning to obtain good dialog representations. We then fine-tune CMADE on smoothed self-reported user ratings to improve the dialog representation while preventing the network from overfitting on noisy ratings. Finally, we apply data Shapley to remove noisy training data, and fine-tune the model on the cleaned training set. Our experiments show that CMADE is able to successfully identify noisy training data and achieves 89.2% in accuracy and 0.787 in Kappa on a test set with unseen expert-rated dialog pairs.
2 Related Work
Open-domain dialog system evaluation is a long-lasting challenge. It has been shown that previous automatic dialog evaluation metrics correlate poorly with human judgments howNotEval; ademACL; whyWeNeedNewEval. A well-known reason is that these automatic dialog evaluation metrics rely on modeling the distance between the generated response and a limited number of references available. The fundamental gap between the open-ended nature of the conversations and the limited references multiRefTianchengZhao is not addressed in methods that are lexical-level based bleu; rouge; meteor, embedding based greedyMatching; forgues2014bootstrapping, or learning based ruber; ademACL.
Given the aforementioned limitations, Likert-score based rating is the de-facto standard for current dialog research and social conversational systems such as the Amazon Alexa Prize Challenge gunrock; chen2018gunrock. Various forms of evaluation settings have been explored to better measure human judgments. Single-turn pairwise comparison singleTurnAB1; singleTurnAB2 is primarily used for comparing two dialog systems. Each system predicts a single utterance given the static “gold” context utterance from human-human logs. Although such an A/B test setting is robust to annotator score bias, it cannot capture the multi-turn nature of dialogs. A more complete multi-turn evaluation is typically measured with a Likert scale for the full dialog history, where either a third-person rater or the chatbot user WhatGoodConversation reports a Likert score on user experience alexaEval, engagement engagement or appropriateness ademACL. However, as observed in DBLP:journals/corr/abs-1811-00907; alexa; alexaEval, Likert scores suffer from bias and variance among different users. Different from previous empirical observations, we conduct a large-scale quantitative and qualitative data analysis of Likert score based ratings. To address the issue of Likert scores, the Alexa team proposed a rule-based ensemble of turn-granularity expert ratings AlexaEvalDilek, and automatic metrics like topical diversity AlexaTopicEval and conversational breadth. ACUTE-EVAL ACUTEEVAL makes a small-scale attempt to use multi-turn pairwise comparison to rank different chatbots. Given the ubiquity and simplicity of Likert score based evaluation, instead of proposing an alternative measure, we take on the challenge of denoising Likert scores with minimal expert annotations introduced (one order of magnitude smaller). Different from ACUTEEVAL, our proposed expert annotation scheme compares dialogs within the same chatbot.
3 Dialog Rating Study
The data used in this study was collected during the 2018 Amazon Alexa Prize Competition alexa2018. Our data contain long and engaging spoken conversations between thousands of real-world Amazon Alexa customers and Gunrock, the 2018 Alexa Prize winning social bot gunrock. The chatbot has 11 topic dialog modules including movies, books, and animals. One notable characteristic of the chatbot is its versatile and complex dialog flows, which interleave facts, opinions and questions to make the conversation flexible and interesting chen2018gunrock. At the end of each dialog, a self-reported Likert scale rating is elicited by the question “on a scale of one to five, how likely would you talk to this social bot again?”
We first filter out dialogs that have inappropriate content using keyword matching. We then select 3,608 ten-turn dialogs on movies, because movie dialogs are more coherent and diverse compared to other topics according to both real users and Amazon selected experts. We observe that dialogs with more than eight turns are more meaningful and semantically versatile, while dialogs with more than ten turns exceed the maximum input length of the BERT model (512 tokens). We therefore select dialogs that have ten turns. Our approach could support longer conversations by adopting a memory-efficient algorithm for self-attention that supports sequences with thousands of tokens musicTransformer. We leave this to future work.
We aim to evaluate user experience for each dialog from the same chatbot of the same length. This is significantly more challenging than identifying which chatbot provides a better user experience on average, since our problem setup requires us to capture more subtle differences in user experience.
3.2 Likert Score Based Evaluation
J-Shape Skewness We perform a detailed analysis of the self-reported Likert scale ratings. As shown in Table 1, abnormally, nearly half of the dialogs are rated as five, which is the highest score. A similar skewed distribution is also observed in previous years’ Alexa competition UW2017alexaWin. In fact, the business research community has long observed the prevalence of the extreme distribution of reviews in which the reviews are heavily skewed to the positive end of the rating scale (known as “J-shape”) in online reviews (e.g., Amazon, Airbnb, Yelp) marketing1; marketing2; marketing3, word of mouth marketing4 and consumer satisfaction marketing1992; marketing1996.
Comparison to expert ratings We randomly selected dialogs rated score-5 and showed these to an expert, and our expert rated of them with score-4 or less. The Alexa team alexaEval has also reported that the inter-user agreement is quite low for their internal rating analysis. Such phenomena indicate that the self-reported Likert scale ratings are extremely noisy. Using such ratings cannot localize individual bad interactions. In addition, Likert score based evaluation also suffers from insensitivity issues. As observed by the Alexa team alexaEval in multiple internal user studies, even though users evaluated multiple dialogs with the same score, they had a clear rank order among the dialogs.
The skewness, noisiness and insensitivity of the self-reported Likert scale rating make it a sub-optimal dialog evaluation metric. In practice, we find that directly training a classifier (even a pre-trained BERT-based model) on the noisy self-reported Likert scale ratings suffers from underfitting. One of the Alexa Prize Challenge teams, Alana AlexaTeam2017alana, trains a binary classifier between successful dialogs (human rating 4 or 5) and unsuccessful dialogs (rating 1 or 2) with heavily hand-engineered features. They reach 69.40% accuracy on this binary classification problem, which is far from usable in real-world settings.
3.3 Pairwise Comparison Based Evaluation
Selecting the better dialog from two options is easier for a human evaluator than giving an absolute number like the Likert score, which requires the evaluator to maintain a consistent standard. People’s perception is inherently relative, and pairwise comparison is local and does not require the user to have global consistency. There are many other examples where humans find it easier to perform pairwise comparisons rather than providing direct labels reviewer3; deepstore; memsec, including content search comparisonContentSearch; comparisonImageSearch; ijcv, and age estimation ageEstimationPairPosterior.
We randomly sample 400 dialog pairs for experts to annotate. We ask the question, “If you were the user, in which scenario would you be more likely to come back and talk to the system again?” We guide the experts to focus on the user experience rather than calibrating the performance of any specific module of the dialog system. Two researchers with conversational training experience annotated the data. The leading expert has been working in an Alexa competition team for more than one year with an emphasis on the user ratings. For each dialog pair $(A, B)$, they label ‘$A$ is better than $B$’ or ‘$B$ is better than $A$’ or ‘cannot tell’. They reached a high inter-annotator agreement score cohen1968weightedkappa with kappa . To make sure that the dev & test sets are accurate, we throw away all “cannot tell” dialog pairs. We then study the correlation between Likert score based evaluation and pairwise comparison based evaluation.
3.4 Correlation Between User Ratings and Expert Ratings
To further analyze the self-reported Likert scale ratings, we also compare the annotated labels of the 403 dialog pairs with the self-reported Likert scale ratings of these dialogs. For each pair of dialogs, we compare the pairwise comparison label and the delta between the self-reported Likert scale ratings of the two dialogs. Ideally, the dialog with a higher self-reported Likert scale rating should be the one that is annotated as having a better user experience in the pairwise comparison. We count the number and fraction of “disagreements” between the two types of ratings. Overall, roughly 1/3 of the dialog pairs disagree. As shown in Table 2, as the gap between the self-reported Likert scale ratings becomes larger, the disagreement between expert and self-reported ratings goes down. This suggests that if the difference between the two dialogs’ Likert scores is large, the self-reported ratings are more likely to be consistent with the comparison ratings.
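The disagreement count described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `disagreement_by_gap`, the tuple layout, and the choice to skip pairs whose two Likert ratings are tied are all hypothetical and not taken from the paper.

```python
from collections import defaultdict

def disagreement_by_gap(pairs):
    """Group dialog pairs by Likert gap and count expert/user disagreements.

    pairs: iterable of (rating_a, rating_b, expert_prefers_a), where the
    ratings are self-reported Likert scores and expert_prefers_a is a bool.
    Returns {gap: (num_disagreements, num_pairs, fraction)}.
    """
    stats = defaultdict(lambda: [0, 0])  # gap -> [disagreements, total]
    for rating_a, rating_b, expert_prefers_a in pairs:
        gap = abs(rating_a - rating_b)
        if gap == 0:
            continue  # assumption: tied ratings express no preference
        likert_prefers_a = rating_a > rating_b
        stats[gap][1] += 1
        if likert_prefers_a != expert_prefers_a:
            stats[gap][0] += 1
    return {gap: (d, n, d / n) for gap, (d, n) in sorted(stats.items())}

# Toy data, not from the paper.
toy_pairs = [(5, 4, True), (5, 4, False), (5, 1, True), (2, 5, False), (3, 3, True)]
print(disagreement_by_gap(toy_pairs))
```

Grouping by the absolute gap mirrors the observation that larger rating gaps tend to agree more often with the expert comparison.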
4 Problem Formulation
Suppose the training set $D_{train}$ consists of data points $D_{train} = \{(x_i, y_i)\}_{i=1}^{N}$, where $x_i$ is a dialog and $y_i$ is the noisy self-reported user rating. We define a strict partial order relationship $\triangleright$ where $x_i \triangleright x_j$ means that dialog $x_i$ provides a better user experience than dialog $x_j$. Note that $y_i > y_j$ does not always imply $x_i \triangleright x_j$ since self-reported user ratings are noisy (§ 3.3, § 3.4). The test set $D_{test}$ consists of dialog pairs along with their binary pairwise comparison labels, $D_{test} = \{(x_A, x_B, z)\}$, where $z$ is annotated by experts and indicates whether dialog A provides a better user experience than dialog B, i.e., $z = \mathbb{1}(x_A \triangleright x_B)$. The development set $D_{dev}$ has a similar structure.
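Following the structure of the expert annotated pairs, we formulate our model as a pairwise dialog predictor with an architecture similar to RankNet ranknet. For a dialog pair $(x_A, x_B)$, the model predicts an un-normalized score for each dialog: $o_A = f(\phi(x_A))$ and $o_B = f(\phi(x_B))$, where $\phi$ is a dialog encoder that maps each dialog to a feature space and $f$ is a linear transformation that converts each dialog feature into a real number. We define a binary relationship $\hat{\triangleright}$ where $x_A \mathbin{\hat{\triangleright}} x_B$ means that the model predicts that dialog $x_A$ provides a better user experience than dialog $x_B$. We denote the model’s prediction of the expert label $z$ as $\hat{z}$, where $\hat{z} = \mathbb{1}(x_A \mathbin{\hat{\triangleright}} x_B)$. We model the predicted posterior as:

$$P(\hat{z} = 1) = P(x_A \mathbin{\hat{\triangleright}} x_B) = \frac{1}{1 + e^{-(o_A - o_B)}}$$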
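As a minimal sketch of the RankNet-style predictor, assuming the posterior is the logistic function of the score difference, $1 / (1 + e^{-(o_A - o_B)})$, the following computes the posterior and the matching cross-entropy loss from two scalar scores. The function names are hypothetical, and the scores here are plain floats standing in for the encoder output.

```python
import math

def preference_posterior(score_a, score_b):
    """P(A is better than B) as a logistic function of the score difference."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))

def pairwise_cross_entropy(score_a, score_b, label):
    """Cross-entropy between the posterior and a binary label (1: A better)."""
    p = preference_posterior(score_a, score_b)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))

# The loss is small when the higher score goes to the preferred dialog.
print(pairwise_cross_entropy(2.0, 0.0, label=1))
print(pairwise_cross_entropy(0.0, 2.0, label=1))
```

Because only the difference of the two scores enters the posterior, swapping the dialogs yields the complementary probability.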
Our goal is to reduce the noise of the self-reported user ratings (§ 3). Directly training a classification model using the noisy ratings leads to severe underfitting. To this end, we propose a three-stage training pipeline to denoise self-reported ratings to train an automatic dialog comparison model. Figure 1 describes the overall pipeline:
In Stage 1, we learn dialog feature representation with a self-supervised dialog flow anomaly detection task.
In Stage 2, we perform label smoothing to adjust the noisy self-reported ratings in the training set and fine-tune the dialog comparison model on the smoothed ratings.
In Stage 3, we perform data Shapley jamesShapley; boxinShapley on the self-reported user ratings to identify and remove noisy data points. We then fine-tune the dialog comparison model on the cleaned training set.
5.1 Stage 1: Learning Representation via self-supervised dialog anomaly detection
|Sys: What movie did you see?|
|User: Spider man into the spider verse|
|Sys: Ah, I know about Spider man into the spider verse! I’m wondering. What would you rate this movie on a scale from 1 to 10?|
|Replaced Sys: Isn’t it crazy how famous actors can get? Are you interested in talking more about Scarlett Johansson?|
Table 3: A fake dialog example created by dialog flow perturbation in Stage 1. We perturb the dialog flow by replacing a system utterance (here the second Sys utterance in the table) with a random system utterance from the corpus (here the Replaced Sys utterance) to generate a fake dialog. With high probability, the fake dialog is less appropriate than the original one.
Having a good dialog representation is the first step towards denoising the data. Our primary goal in this stage is to train a dialog encoder $\phi$ to learn good dialog feature representations for the following stages. Here $\phi$ could be any sequence encoder that can encode a dialog, and we use BERT bert in this paper.
For each dialog in the training set, we perturb the dialog flow to generate a fake dialog and train the model to differentiate the fake dialog from the real one. Dialog flow is a user-centric measure of whether a conversation is “going smoothly” beyondTuring. To perturb the dialog flow for each dialog $x_i$, we randomly replace a user utterance in $x_i$ with a random user utterance from the training corpus $D_{train}$, yielding a perturbed dialog $x_i'$. With high probability, the system utterance immediately following the replaced user utterance becomes inappropriate. Therefore, we incorporate $(x_i, x_i')$ into the training pairs, labeled so that $x_i$ is the better dialog. Similarly, we also randomly replace a system utterance and yield another perturbed dialog. We generate two perturbed dialogs for each dialog in the training set and thus $2N$ real-fake dialog pairs in total, where $N$ is the number of training dialogs. An example is shown in Table 3. We note that appropriateness is one of the most widely applied metrics of human evaluation on dialogs ademACL. By learning to differentiate the perturbed dialog and the original one, we expect CMADE to learn a good dialog encoder $\phi$ which maps dialogs with similar dialog flow close to each other in the feature space.
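The perturbation procedure can be sketched as follows. This is an illustrative reading of the description above, not the authors' implementation: the dialog layout (a list of `(speaker, text)` tuples), the function names, and the choice to never draw a replacement identical to the original utterance are assumptions.

```python
import random

def perturb_dialog(dialog, corpus, speaker, rng):
    """Copy `dialog` with one utterance of `speaker` replaced by a random
    utterance of the same speaker drawn from the whole corpus."""
    positions = [i for i, (who, _) in enumerate(dialog) if who == speaker]
    pool = [text for d in corpus for (who, text) in d if who == speaker]
    pos = rng.choice(positions)
    original = dialog[pos][1]
    # Assumption: skip identical text so the fake always differs from the real dialog.
    candidates = [text for text in pool if text != original]
    fake = list(dialog)
    fake[pos] = (speaker, rng.choice(candidates))
    return fake

def make_real_fake_pairs(corpus, seed=0):
    """Two (real, fake) pairs per dialog: one user and one system perturbation."""
    rng = random.Random(seed)
    pairs = []
    for dialog in corpus:
        for speaker in ("user", "sys"):
            pairs.append((dialog, perturb_dialog(dialog, corpus, speaker, rng)))
    return pairs
```

Each returned pair puts the real dialog first, so a pairwise model can be trained with the label "first is better" for all of them.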
5.2 Stage 2: Fine-tuning with smoothed self-reported user ratings
Stage 1 only performs unsupervised learning and does not incorporate any supervision from human ratings. To obtain better dialog feature representations for Stage 3, Stage 2 fine-tunes $\phi$ with supervision from the noisy self-reported user ratings. We adopt a simple yet effective label smoothing, inspired by labelSmooth; allen, using the representation learned in Stage 1. A key assumption in Stage 2 is that dialogs with similar dialog flow provide a similar user experience. For each dialog $x_i$, we find its $K$ nearest neighbors in the feature space defined by $\phi$. We use the average self-reported rating of the $K$ nearest neighbors as a smoothed rating $y_i^s$ for $x_i$. To construct training dialog pairs, we randomly sample dialog pairs $x_i$ and $x_j$ and derive a pairwise comparison label $z_{ij}^s$ by comparing the smoothed ratings $y_i^s$ and $y_j^s$: $z_{ij}^s = \mathbb{1}(y_i^s > y_j^s)$. We discard the pairs with equal $y_i^s$ and $y_j^s$. To improve the dialog feature representation, we fine-tune the model on sampled dialog pairs along with the labels derived from comparing the smoothed scores, $\{(x_i, x_j, z_{ij}^s)\}$. We note that $z_{ij}^s$ depends solely on the noisy self-reported ratings in the training set and does not depend on the expert annotations. Theoretically, we could iterate between label smoothing and model fine-tuning since the fine-tuned model provides better dialog feature representation. In practice, we find that one iteration is enough to reach good prediction performance.
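A small NumPy sketch of the KNN smoothing and pair construction described above follows. The function names are hypothetical, features are assumed to be given as an array, and excluding a dialog from its own neighbor set is an assumption that the text does not settle.

```python
import numpy as np

def knn_smooth_ratings(features, ratings, k):
    """Replace each rating by the mean rating of its k nearest neighbours
    (Euclidean distance in feature space)."""
    features = np.asarray(features, dtype=float)
    ratings = np.asarray(ratings, dtype=float)
    diffs = features[:, None, :] - features[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)  # assumption: a dialog is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    return ratings[neighbours].mean(axis=1)

def sample_smoothed_pairs(smoothed, num_pairs, seed=0):
    """Sample (i, j, label) with label = 1 if dialog i has the higher smoothed
    rating; ties are discarded. Assumes not all smoothed ratings are equal."""
    rng = np.random.default_rng(seed)
    pairs = []
    while len(pairs) < num_pairs:
        i, j = rng.choice(len(smoothed), size=2, replace=False)
        if smoothed[i] == smoothed[j]:
            continue
        pairs.append((int(i), int(j), int(smoothed[i] > smoothed[j])))
    return pairs
```

The dense distance matrix is quadratic in the number of dialogs, which is acceptable for a few thousand dialogs but would need an approximate neighbor index at larger scale.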
Label smoothing has led to state-of-the-art models in image classification labelSmooth, language translation attentionIsAllYouNeed and speech recognition speechLabelSmoothApplication. Prior attempts in label smoothing labelSmooth; attentionIsAllYouNeed; speechLabelSmoothApplication; labelSmoothWhen focus on categorical labels to prevent the network from becoming over-confident while we apply label smoothing on ordinal labels (i.e., Likert scores) to prevent the network from overfitting on noisy ordinal labels.
5.3 Stage 3: Denoising with data Shapley & further fine-tuning
In Stage 2, noisy ratings still affect the smoothed ratings of other data points. In Stage 3, we aim to identify and remove dialogs with noisy self-reported user ratings using the data Shapley value technique jamesShapley; boxinShapley; boxinShapley2. The Shapley value comes originally from cooperative game theory dubey1975uniqueness. In a cooperative game, there are $n$ players and a utility function $U$ assigns a reward to each of the $2^n$ subsets of players: $U(S)$ is the reward if the players in subset $S$ cooperate. The Shapley value defines a unique scheme to distribute the total gains generated by the coalition of all players with a set of appealing mathematical properties. The Shapley value has been applied to problems in various domains, ranging from economics shapleyEconomics2005featureSelectionShapley; yona2019s.
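In our setting, given $D_{train} = \{(x_i, y_i)\}_{i=1}^{N}$, we view the data points as $N$ players. We could also view the utility function $U(S)$ as the performance on the development set of a model trained on the subset $S$. The Shapley value $s_i$ for player $i$ is defined as the average marginal contribution of $(x_i, y_i)$ to all possible subsets that are formed by the other players boxinShapley:

$$s_i = \frac{1}{N} \sum_{S \subseteq D_{train} \setminus \{(x_i, y_i)\}} \frac{1}{\binom{N-1}{|S|}} \left[ U(S \cup \{(x_i, y_i)\}) - U(S) \right]$$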
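As suggested by the definition of data Shapley, computing the data Shapley value requires an exponentially large number of computations to enumerate all $2^N$ possible subsets and train the model on each subset, which is intractable. Inspired by boxinShapley, CMADE tackles this issue by reducing the deep model to a k-nearest neighbors (KNN) model and then applying the closed-form solution of the Shapley value on KNN. Using the feature extractor $\phi$ trained in Stage 1 and Stage 2, we fix $\phi$ and map all $N$ dialogs in the training data to $\{\phi(x_i)\}_{i=1}^{N}$. We first define the utility function in a special case where the development set only contains one dialog pair $(x_A, x_B)$ with expert label $z$. In our setting, the development set contains dialog pairs annotated by experts. Given any nonempty subset $S$ of the training set, we use the KNN Regressor to rate $x_A$ and $x_B$. To do this, we compute $\phi(x_A)$ and sort the dialogs in $S$ based on their Euclidean distance in the dialog feature space to $\phi(x_A)$, yielding $(x_{\alpha_1^{(A)}}, \ldots, x_{\alpha_{|S|}^{(A)}})$ with $x_{\alpha_1^{(A)}}, \ldots, x_{\alpha_K^{(A)}}$ as the top-K most similar dialogs to $x_A$. Similarly, we get $(x_{\alpha_1^{(B)}}, \ldots, x_{\alpha_{|S|}^{(B)}})$ with $x_{\alpha_1^{(B)}}, \ldots, x_{\alpha_K^{(B)}}$ as the top-K most similar dialogs to $x_B$. Based on the self-reported user ratings in the training data, we use the KNN Regressor to rate $x_A$ and $x_B$ as follows:

$$\hat{y}_A = \frac{1}{K} \sum_{k=1}^{\min(K, |S|)} y_{\alpha_k^{(A)}}, \qquad \hat{y}_B = \frac{1}{K} \sum_{k=1}^{\min(K, |S|)} y_{\alpha_k^{(B)}}$$

The model predicts $x_A \mathbin{\hat{\triangleright}} x_B$ if $\hat{y}_A > \hat{y}_B$, and vice versa.

To obtain a closed-form solution to calculate the Shapley value, instead of defining the utility function as the accuracy of the pairwise prediction, we define the utility function as follows:

$$U(S) = \begin{cases} \hat{y}_A - \hat{y}_B & \text{if } z = 1 \\ \hat{y}_B - \hat{y}_A & \text{if } z = 0 \end{cases} \tag{3}$$

Theorem 1 Consider the utility function in Equation (3). Then the Shapley value $s_i$ of each training point can be decomposed into two terms $s_i^{(A)}$ and $s_i^{(B)}$ which depend on $x_A$ and $x_B$ respectively: $s_i = s_i^{(A)} - s_i^{(B)}$ if $z = 1$, and $s_i = s_i^{(B)} - s_i^{(A)}$ if $z = 0$. With the full training set sorted by distance to $\phi(x_A)$ and to $\phi(x_B)$ respectively, $s_i^{(A)}$ and $s_i^{(B)}$ can be calculated recursively as follows:

$$s^{(A)}_{\alpha_N^{(A)}} = \frac{y_{\alpha_N^{(A)}}}{N}, \qquad s^{(A)}_{\alpha_i^{(A)}} = s^{(A)}_{\alpha_{i+1}^{(A)}} + \frac{y_{\alpha_i^{(A)}} - y_{\alpha_{i+1}^{(A)}}}{K} \cdot \frac{\min(K, i)}{i}$$

$$s^{(B)}_{\alpha_N^{(B)}} = \frac{y_{\alpha_N^{(B)}}}{N}, \qquad s^{(B)}_{\alpha_i^{(B)}} = s^{(B)}_{\alpha_{i+1}^{(B)}} + \frac{y_{\alpha_i^{(B)}} - y_{\alpha_{i+1}^{(B)}}}{K} \cdot \frac{\min(K, i)}{i}$$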
With Theorem 1, the Shapley value calculation can be finished in $O(N \log N)$ time, since it is dominated by sorting the training dialogs by distance. The above result for a single point in the development set can be readily extended to the case of multiple test points. In our experiment, with this optimization, the Shapley value calculation takes less than 5 seconds to finish. Theorem 1 comes primarily from boxinShapley; boxinShapley2, and we extend their results for the vanilla KNN regressor boxinShapley to our pairwise testing setting.
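To make the closed form concrete, here is a hedged sketch that assumes a linear pairwise utility: the KNN rating of dialog A minus that of dialog B when the expert prefers A, with the sign flipped otherwise, and with each KNN rating equal to $1/K$ times the sum of the ratings of the $\min(K, |S|)$ nearest points. Under that assumption the recursion below follows the KNN Shapley recursion of boxinShapley, and the brute-force function checks it on a tiny example. All function names are hypothetical, and this is not the authors' code.

```python
import itertools
import math
import numpy as np

def knn_shapley_single(dists, ratings, k):
    """Closed-form Shapley values for U(S) = (1/k) * (sum of ratings of the
    min(k, |S|) points of S nearest to one query). Assumes k <= len(ratings)."""
    n = len(ratings)
    order = np.argsort(dists, kind="stable")  # nearest first, ties by index
    y = np.asarray(ratings, dtype=float)[order]
    s = np.zeros(n)
    s[n - 1] = y[n - 1] / n
    for i in range(n - 2, -1, -1):
        rank = i + 1  # 1-based rank of the point at sorted position i
        s[i] = s[i + 1] + (y[i] - y[i + 1]) / k * min(k, rank) / rank
    values = np.zeros(n)
    values[order] = s
    return values

def pairwise_knn_shapley(dists_a, dists_b, ratings, k, a_is_better):
    """Shapley values for the pairwise utility: difference of two KNN terms."""
    sign = 1.0 if a_is_better else -1.0
    return sign * (knn_shapley_single(dists_a, ratings, k)
                   - knn_shapley_single(dists_b, ratings, k))

def brute_force_shapley(dists_a, dists_b, ratings, k, a_is_better):
    """Exponential-time reference that enumerates every subset."""
    n = len(ratings)

    def knn_rating(subset, dists):
        nearest = sorted(subset, key=lambda idx: (dists[idx], idx))[:k]
        return sum(ratings[idx] for idx in nearest) / k

    def utility(subset):
        diff = knn_rating(subset, dists_a) - knn_rating(subset, dists_b)
        return diff if a_is_better else -diff

    values = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = 1.0 / (n * math.comb(n - 1, size))
            for subset in itertools.combinations(others, size):
                values[i] += weight * (utility(subset + (i,)) - utility(subset))
    return values
```

For several development pairs, the per-pair values can be averaged, since the Shapley value is additive in the utility.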
By applying the Shapley technique to the data, we identify noisy training data points which contribute negatively to the performance and remove them from the training set. Similar to Stage 2, to construct training dialog pairs, we randomly sample dialog pairs $x_i$ and $x_j$ from the cleaned training set and derive the label $z_{ij}$ by comparing the self-reported ratings $y_i$ and $y_j$. We then further fine-tune the model from Stage 2. Theoretically, we could iterate between Stage 2 and Stage 3 multiple times, while in practice one iteration is enough.
5.4 Towards Scalable Pair-based Training
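We use a factorization technique similar to the one used for pairwise ranking in LambdaRank lambdarank to speed up training. For Stage 2 and 3, we have $O(N^2)$ possible dialog pairs, which leads to quadratically increasing training time. Similar to LambdaRank lambdarank, it is possible to calculate the exact gradient of $O(N^2)$ possible dialog pairs with $O(N)$ forward passes and back-propagations. More specifically, we denote the possible input pairs during training at Stage 2 or Stage 3 as $I = \{(i, j)\}$, where each pair of dialogs $(x_i, x_j)$ carries a comparison label $z_{ij}$. The total cost for the $O(N^2)$ possible dialog pairs is the sum of $O(N^2)$ cross-entropy costs:

$$C = \sum_{(i,j) \in I} C_{ij}, \qquad C_{ij} = -z_{ij} \log P_{ij} - (1 - z_{ij}) \log (1 - P_{ij}), \qquad P_{ij} = \frac{1}{1 + e^{-(o_i - o_j)}}$$

Theorem 2 We can compute $\partial C / \partial w_k$ in $O(N)$ by factoring it into a weighted sum of $\partial o_i / \partial w_k$, where the weight $\lambda_i$ only depends on $\{o_j\}$ and $\{z_{ij}\}$. W.l.o.g., we assume $z_{ij} = 1$.

$$\frac{\partial C}{\partial w_k} = \sum_{(i,j) \in I} \lambda_{ij} \left( \frac{\partial o_i}{\partial w_k} - \frac{\partial o_j}{\partial w_k} \right) = \sum_{i} \lambda_i \frac{\partial o_i}{\partial w_k}$$

$$\lambda_{ij} = \frac{\partial C_{ij}}{\partial o_i} = \frac{-1}{1 + e^{o_i - o_j}}, \qquad \lambda_i = \sum_{j : (i,j) \in I} \lambda_{ij} - \sum_{j : (j,i) \in I} \lambda_{ji}$$

Here $o_i$ and $o_j$ are the outputs of the two branches of the model, and $w_k$ is a model parameter. Theorem 2 shows that instead of performing back-propagation for all possible pairs, we could first perform $N$ forward passes to obtain $\{o_i\}$ and then calculate $\{\lambda_i\}$. Calculating $\lambda_i$ from $\{o_j\}$ as in Theorem 2 takes negligible time since this stage does not involve any neural network operation. Finally, we calculate a weighted sum of $N$ back-propagations and update the model parameters.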
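The factorization can be checked numerically on a toy model. The sketch below assumes every pair is ordered as (better, worse), uses the pairwise cost $\log(1 + e^{-(o_i - o_j)})$, and replaces the BERT encoder with a linear scorer $o_i = w^\top x_i$ so that the per-dialog gradient is simply $x_i$. These simplifications and all names are assumptions for illustration.

```python
import numpy as np

def pairwise_lambdas(scores, better_worse_pairs):
    """lambda_i = dC/do_i for C = sum over (i, j) of log(1 + exp(-(o_i - o_j)))."""
    lambdas = np.zeros_like(scores, dtype=float)
    for i, j in better_worse_pairs:
        lam_ij = -1.0 / (1.0 + np.exp(scores[i] - scores[j]))
        lambdas[i] += lam_ij  # dC_ij / do_i
        lambdas[j] -= lam_ij  # dC_ij / do_j = -dC_ij / do_i
    return lambdas

def total_cost(weights, features, better_worse_pairs):
    """Sum of pairwise cross-entropy costs for a linear scorer."""
    scores = features @ weights
    return sum(np.log1p(np.exp(-(scores[i] - scores[j])))
               for i, j in better_worse_pairs)

def factorized_gradient(weights, features, better_worse_pairs):
    """Gradient of the total cost using one score per dialog."""
    scores = features @ weights  # N forward passes
    lambdas = pairwise_lambdas(scores, better_worse_pairs)
    # For a linear scorer do_i/dw = x_i, so the gradient is a lambda-weighted sum.
    return features.T @ lambdas
```

With a neural encoder, the same lambdas would be used as upstream gradients for the N scores in a single backward pass, rather than back-propagating once per pair.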
We fine-tune the pre-trained BERT bert to learn the dialog feature extractor $\phi$. We partition the 403 expert annotated dialog pairs into a 200-pair development set and a 203-pair test set. We use the same value of $K$ for both the KNN label smoothing in Stage 2 and the KNN Shapley value calculation in Stage 3.
The details of extending BERT to encode multi-turn dialogs are as follows. Each dialog is represented as a sequence of tokens in the following input format: starting with a special starting token $[CLS]$, we concatenate tokenized user and system utterances in chronological order with $[SEP]$ as the separator between adjacent utterances. In other words, we represent each dialog as a sequence: $[CLS]$, $S_{1,1}$, $S_{1,2}$, $\ldots$, $[SEP]$, $U_{1,1}$, $U_{1,2}$, $\ldots$, $[SEP]$, $S_{2,1}$, $S_{2,2}$, $\ldots$, $[SEP]$, where $S_{i,j}$ and $U_{i,j}$ are the $j$-th token of the system and user utterance in the $i$-th turn. Following BERT, we also add a learned embedding to every token indicating whether it comes from user utterances or system utterances.
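The input layout can be illustrated with a short helper. This sketch assumes BERT's standard `[CLS]` and `[SEP]` special tokens, uses whitespace splitting as a stand-in for WordPiece tokenization, and assigns `[CLS]` the system speaker id; the function name and these choices are hypothetical.

```python
def build_bert_input(turns, tokenize=str.split):
    """Flatten a dialog into BERT-style tokens plus per-token speaker ids.

    turns: list of (speaker, utterance) with speaker in {"sys", "user"}.
    Returns (tokens, speaker_ids), where id 0 marks system and 1 marks user.
    """
    tokens = ["[CLS]"]
    speaker_ids = [0]  # assumption: [CLS] shares the system id
    for speaker, utterance in turns:
        sid = 0 if speaker == "sys" else 1
        pieces = tokenize(utterance) + ["[SEP]"]  # separator closes each utterance
        tokens.extend(pieces)
        speaker_ids.extend([sid] * len(pieces))
    return tokens, speaker_ids

tokens, speaker_ids = build_bert_input([
    ("sys", "what movie did you see"),
    ("user", "spider man"),
])
print(tokens)
print(speaker_ids)
```

The speaker ids play the role of the learned per-token embedding that marks whether a token comes from a user or a system utterance.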
|No.||Model||Acc.||Kappa||SE|
|(6)||Stage 2 + 3||0.764||0.529||0.042|
|(9)||Stage 1 + 3||0.788||0.628||0.039|
|(10)||Stage 1 + 2||0.837||0.673||0.037|
Model Comparisons and Ablations
We compare CMADE to several of its ablations (Table 4) and evaluate the performance on the test set, which is annotated by experts. We also report the kappa agreement cohen1968weightedkappa (kappa and Standard Error) between the predicted output and the expert annotations. (1) BERT-Classification and (2) BERT-Regression fine-tune the pre-trained BERT to perform a 5-class classification and regression respectively, directly using the noisy self-reported ratings. To test BERT-Classification on dialog pairs, we apply the DEX trick ageEstimationDEX to get a floating-point predicted rating and thus avoid the cases where the model predicts a tie for a dialog pair. (3) BERT-Pairwise shares the same model architecture with CMADE. It constructs dialog pairs for training by randomly sampling dialog pairs $x_i$ and $x_j$ and deriving the label $z_{ij}$ by comparing the corresponding self-reported user ratings $y_i$ and $y_j$. We discard the pairs with equal $y_i$ and $y_j$. (4) BERT-Pairwise+Dev augments (3) by adding the 200 expert annotated dialog pairs in the development set to the training data. We also compare the variants of CMADE that skip one or two of the three stages.
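The DEX trick turns class probabilities into a continuous rating by taking the softmax expectation over the class values. A minimal sketch, assuming the five classes correspond to ratings 1 through 5 (the function name is hypothetical):

```python
import numpy as np

def expected_rating(logits, rating_values=(1, 2, 3, 4, 5)):
    """Softmax expectation over rating classes, yielding a float rating."""
    logits = np.asarray(logits, dtype=float)
    probs = np.exp(logits - logits.max())  # subtract max for numerical stability
    probs /= probs.sum()
    return float(probs @ np.asarray(rating_values, dtype=float))

# Both dialogs have class 5 as the argmax, yet their expected ratings differ,
# so the pair can still be ranked.
print(expected_rating([0.0, 0.0, 0.0, 1.0, 3.0]))
print(expected_rating([0.0, 0.0, 0.0, 0.0, 3.0]))
```

Comparing expectations rather than argmax classes is what avoids ties when two dialogs fall into the same predicted class.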
Our first takeaway is that a vanilla classification or regression formulation might not be the best way to formulate the problem of learning a dialog evaluation model. As shown in Table 4, the pairwise architecture (BERT-Pairwise, 0.73) is better than classification (BERT-Classification, 0.53) or regression (BERT-Regression, 0.64) in this problem. Similar to our observation, the research community in computer vision has long observed that both vanilla classification and regression formulations have drawbacks in age estimation ageEstimationDEX; ageEstimationOrdinalRegression; ageEstimationPairPosterior.
Our second takeaway is that more aggressive denoising algorithms usually make stronger assumptions on the quality of feature representations. Therefore, it helps to create a denoising pipeline that starts with better feature representation learning and a less aggressive denoising algorithm to learn better feature representations before applying the more aggressive denoising algorithms. As shown in Table 4, our three-stage denoising pipeline CMADE (Acc. 0.892) outperforms all baselines by a large margin. Although (8) Stage 1 does not directly provide high accuracy (Acc. 0.620), the feature representation it learns is extremely important. Without Stage 1, both (5) Stage 2 (Acc. 0.755) and (6) Stage 2 + Stage 3 (Acc. 0.763) perform worse.
Since the KNN label smoothing is performed in the feature space, we expect the smoothing to perform worse without the self-supervised dialog feature representation learning in Stage 1. However, these variants still work better than baselines (1), (2), and (3), which are models that do not account for the noise in the data. This is because we use the pre-trained BERT to initialize our dialog encoder, which is thus still able to provide some useful features for Stage 2. In addition, we observe that denoising with data Shapley in Stage 3 requires better dialog feature representations. (7) Stage 3 (Acc. 0.714) performs even worse than BERT-Pairwise (0.730) because it lacks good representations for the Shapley denoising algorithm. Skipping Stage 2 also hurts performance (Acc. 0.788). However, this does not mean that Shapley denoising in Stage 3 is not powerful. We observe a large performance gain from applying Stage 3 after Stage 1 and Stage 2 (Acc. 0.837 vs. 0.892). Finally, we note that adding the expert annotated development set directly into the training data is much less efficient than using the development set for data Shapley denoising. BERT-Pairwise+Dev only reaches an accuracy of 0.749.
|User: the whole harry potter series|
|Sys: whoops i think i misheard you. can you repeat the movie name please?|
|User: all the harry potter series|
|Sys: ah, i don’t know any movies by that name. sorry if i sometimes confuse you saying something else with a movie name …|
|Sys: oh, okay. anyways, i really enjoyed watching black panther and wonder woman recently. what’s your favorite superhero movie?|
|User: super man|
|Sys: what was that? can you please say the movie name again?|
|User: stop i gotta go|
We also present an analysis to show how Shapley denoising works, as shown in Figure 2. We examine the Shapley value for each training datum in Stage 3. We first show an example dialog with a negative Shapley value in Table 5. According to the Shapley value, we remove data points one by one, starting from the least valuable and moving to the most valuable. Each time, after a point is removed, we create new KNN regressor models on the remaining dialogs and labels and evaluate them on the test set with expert annotations. We extract the features of the remaining dialogs using the dialog encoder $\phi$ tuned by Stage 1 and Stage 2. As shown in Figure 2, removing training data with low Shapley values increases the performance up to a certain point before convergence for all choices of $K$. We observe a similar trend when re-training a model on the remaining data. In contrast, removing data randomly decreases the performance on the test set. In addition, a larger $K$ yields better performance, which validates the denoising effect of KNN with large $K$.
The ultimate chatbot evaluation metric should be user-centric, as chatbots are there to provide humans with an enjoyable experience. Likert-score based self-reported rating is currently the de-facto standard for dialog evaluation. However, our analysis indicates that self-reported dialog ratings are skewed (J-shape), noisy and insensitive due to bias and variance among different users. We propose a three-stage denoising pipeline, CMADE, to reduce the noise in self-reported ratings and, at the same time, build an automatic comparison-based dialog quality predictor. CMADE’s results highly correlate with expert judgments on pairwise dialog comparison ratings (89.2% agreement, 0.787 Kappa).
We would like to sincerely thank ACL 2020 Chairs and Reviewers for their review efforts and helpful feedback. We thank Yu Li for his insightful guidance and support in shaping this project. We thank Boxin Wang for helpful discussions on data Shapley. We would also like to extend our gratitude to Yanbang Wang, Youzhi Tian, Weiyan Shi and Michihiro Yasunaga for their valuable feedback and suggestions.