Automatic evaluation is a crucial component in the development of open-domain dialogue systems (Danescu-Niculescu-Mizil and Lee, 2011; Yao et al., 2017). The goal of dialogue evaluation is to produce scores that correlate well with human judgments over multiple dialogue qualities, e.g., fluency, relevancy, and specificity (Zhang et al., 2018; Weston et al., 2018). In the Dialogue System Technology Challenge 10 (DSTC10, https://dstc10.dstc.community/home), the track5 sub-task1 “Automatic Open-domain Dialogue Evaluation” (Chen et al., 2021) poses such a challenge: participants must find effective automatic dialogue evaluation metrics on 14 datasets (37 evaluation qualities in total) during the development phase. A single overall score must be submitted for each dialogue in the test datasets during the final evaluation phase.
Nowadays, word overlap-based metrics and embedding-based metrics are standard automatic evaluation metrics. Word overlap-based metrics, which measure the overlapping words between reference and candidate responses, have been used to evaluate the dialogue responses (Papineni et al., 2002; Banerjee and Lavie, 2005; Sordoni et al., 2015). Embedding-based metrics measure the evaluation quality of a response by calculating the semantic similarity between the model response and corresponding reference, such as Greedy Matching (Rus and Lintean, 2012), Embedding Averaging (Wieting et al., 2016), and BERTScore (Zhang et al., 2019).
However, the aforementioned metrics rely heavily on the given references, and they have been shown to be ineffective because of the one-to-many nature of dialogue (Zhao et al., 2017). Recently, learning-based metrics, which aim to predict scores for various qualities of a response, have achieved better correlation with human judgment (Tao et al., 2018; Ghazarian et al., 2019; Lan et al., 2020). For example, USL-H (Phy et al., 2020) designs BERT-based (Devlin et al., 2019) classifiers for three groups of evaluation qualities: understandability (Nübel, 1997), sensibleness (Adiwardana et al., 2020), and likability. USL-H also applies a simple weighted sum to integrate the scores of each evaluation quality. As a result, USL-H achieves good correlations with human judgment.
Nevertheless, these metrics have some important issues in dealing with dialogue evaluation tasks. First, most evaluation models focus on only a few evaluation aspects and therefore struggle to fully measure the quality of open-domain dialogue. For example, USL-H ignores some important qualities like topic transition dynamics (Huang et al., 2020) and user engagement in dialogue (Ghazarian et al., 2020). Second, these metrics lack an effective score composition approach to integrate the scores generated for each evaluation quality.
To address the above issues, we propose a Multi-Metric Evaluation based on Correlation Re-Scaling (MME-CRS) for evaluating open-domain dialogue. First, to evaluate dialogue quality more comprehensively, we design 5 groups of sub-metrics for sub-task1 instead of the three groups of metrics designed by USL-H. Second, we propose a novel score composition method called Correlation Re-Scaling (CRS) to compose the metric scores. Our proposed approach ranks first and achieves an average Spearman correlation score of 31.04% on the test dataset, which is 1.11% higher than the second-placed team.
In particular, we summarize our contributions in this paper as follows:
We design an evaluation metric composed of 5 groups of sub-metrics to better evaluate the comprehensive quality of open-domain dialogue.
We propose a novel score composition method CRS to integrate sub-metric scores more effectively. The weight distribution generated by CRS generalizes well on unseen test data.
Our proposed metric MME-CRS ranks first on the “Automatic Open-domain Dialogue Evaluation” of DSTC10 track5 task1 by a large margin, which demonstrates the effectiveness of our designed metrics and the CRS method.
Figure 1 shows the architecture of our proposed metric MME-CRS. In this section, we first introduce the 5 groups of sub-metrics in detail. We then discuss the score composition approach CRS, which integrates sub-metric scores for diverse qualities.
Automatic Evaluation Metrics
The evaluation quality contains various aspects, such as fluency, relevancy, specificity, and user engagement. For example, sub-task1 of DSTC10 track5 contains 14 development datasets, which cover 37 different qualities in total. Moreover, the evaluation of each aspect usually relies on several metrics, and the weight distribution over sub-metrics varies from aspect to aspect. To better measure each evaluation aspect of dialogue, we design 5 groups of fundamental sub-metrics as follows.
Fluency Metric (FM) quantifies whether or not a response is fluent or understandable. A fluent utterance does not have to be grammatically correct, because an open-domain response is usually the central part of a complete sentence. The auxiliary verb or stop words may be missing.
We use this characteristic to build a training set of fluent and non-fluent responses. First, we randomly determine whether a response is fluent. If it is, we assign the response label one and randomly apply one of the following rules: (i) no modification, (ii) delete each stop word with a probability of 0.5. Otherwise, we label the response with zero and apply one of the following rules, following Sinha et al. (2020), for negative sampling: (i) word reorder (shuffle the order of all words), (ii) word drop (randomly drop x% of the words), or (iii) word repeat (randomly select span(s) of words and randomly repeat them).
For a given response, we fine-tune SimCSE (Gao et al., 2021) to embed each of its words.
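The sampling rules for the fluency training set can be sketched as follows. This is a minimal illustration, not the authors' implementation: the stop-word list `STOPWORDS`, the drop ratio `drop_ratio` (the text only says "x%"), and all function names are assumptions made for the example.

```python
import random

# Hypothetical stop-word list; the text does not say which list is used.
STOPWORDS = {"a", "an", "the", "is", "are", "do", "does", "to", "of", "in"}


def make_fluent(words, rng):
    """Positive sample: no modification, or drop each stop word with probability 0.5."""
    if rng.random() < 0.5:
        return list(words)
    return [w for w in words if w.lower() not in STOPWORDS or rng.random() >= 0.5]


def make_non_fluent(words, rng, drop_ratio=0.3):
    """Negative sample: word reorder, word drop, or word repeat (drop_ratio is assumed)."""
    words = list(words)
    if not words:
        return words
    rule = rng.choice(["reorder", "drop", "repeat"])
    if rule == "reorder":
        rng.shuffle(words)  # may leave very short inputs unchanged
    elif rule == "drop":
        n_drop = max(1, int(len(words) * drop_ratio))
        dropped = set(rng.sample(range(len(words)), min(n_drop, len(words))))
        words = [w for i, w in enumerate(words) if i not in dropped]
    else:
        start = rng.randrange(len(words))
        end = rng.randrange(start, len(words)) + 1
        words[end:end] = words[start:end]  # repeat the chosen span right after itself
    return words


def build_fluency_example(response, rng):
    """Return (text, label) with label 1 for fluent and 0 for non-fluent."""
    words = response.split()
    if rng.random() < 0.5:
        return " ".join(make_fluent(words, rng)), 1
    return " ".join(make_non_fluent(words, rng)), 0
```

Calling `build_fluency_example("do you want to go to the park", random.Random(0))` yields one labeled training pair; passing an explicit `random.Random` keeps the sampling reproducible.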
Relevance Metric (RM) measures coarse-grained relevance between context and response. We fine-tune another SimCSE model on the next utterance prediction task to predict whether a context-response pair is relevant or not. Similar to the fluency metric, we first randomly determine whether a context-response pair from the Daily Dialog dataset (Li et al., 2017) is valid or not. For the valid case, we randomly apply one of the following changes to the response: (i) no modification, (ii) remove stop words.
Lan et al. (2020) observe that most randomly sampled negative responses are low-quality, and that the learned decision boundary is far from the actual decision boundary, which hurts performance. Hence, for the invalid case, we propose a simple but effective negative response sampling method. First, we randomly choose ten responses from the response pool and compute the Word2Vec similarity (Mikolov et al., 2013) between the reference and each candidate response. Then we sort the candidate responses by their similarity score and choose the middle one as the negative response.
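The negative sampling step can be sketched as below. The sketch assumes a pluggable `similarity` function; the default `token_overlap` is only a dependency-free stand-in (Jaccard overlap of tokens) and is not the Word2Vec similarity described in the text. The function and parameter names are hypothetical.

```python
import random


def token_overlap(a, b):
    """Stand-in similarity (Jaccard over tokens); the text uses Word2Vec similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def sample_negative(reference, response_pool, rng, similarity=token_overlap, k=10):
    """Draw k candidates, sort them by similarity to the reference, return the middle one."""
    candidates = [r for r in response_pool if r != reference]
    if not candidates:
        raise ValueError("response pool has no candidate other than the reference")
    drawn = rng.sample(candidates, min(k, len(candidates)))
    ranked = sorted(drawn, key=lambda r: similarity(reference, r))
    return ranked[len(ranked) // 2]
```

Picking the middle candidate avoids both the easiest negatives (least similar) and likely false negatives (most similar), which is the intuition the paragraph above describes.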
To fine-tune the SimCSE model, we first concatenate a context-response pair to a single sentence. Then we compute the score using the same approach as the fluency metric.
Topic Coherence Metric (TCM) measures fine-grained topic transition dynamics of dialogue flows. Huang et al. (2020) demonstrate the effectiveness of incorporating graph information into dialogue evaluation. Following Huang et al. (2020), topic-level dialogue graphs are first constructed based on ConceptNet (Speer et al., 2017). The topic transition dynamics over the topic-level dialogue graphs are modeled using a graph neural network. Then the topic-level graph representation is fed into an MLP layer to predict the topic coherence score. Huang et al. (2020) also embed the context-response pair and jointly predict the coherence score together with the topic-level graph representation. We omit this context-response embedding here in order to focus on the topic coherence metric.
Engagement Metric (EM) measures whether the user is willing to participate in the dialogue. We build a training set based on human engagement scores. The user engagement score usually ranges from 0 to 5, and the user’s enthusiasm is proportional to the engagement score. Ghazarian et al. (2020) propose to label responses with an engagement score less than two as zero, while we find that scaling the engagement score to between 0 and 1 yields more significant benefits.
We train an utterance-level engagement classifier to predict whether the user engagement is high or low. Specifically, for a given response, we fine-tune SimCSE to get the contextual embedding of each word. We use average-pooling to get the embedding of the whole response. Then an MLP layer followed by a Softmax layer is added to predict the engagement score.
Ghazarian et al. (2020) aggregate the embeddings of both context and response to predict the user engagement score, while we observe that user engagement mainly relies on the model response. The relationship between dialogue context and response should be handled by the relevance metric or the topic coherence metric.
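The two labeling choices for the engagement training set, and the average-pooling step, can be illustrated with a small sketch. The function names are hypothetical, the 0 to 5 score range is taken from the text, and word embeddings are represented as plain lists of floats rather than model outputs.

```python
def binarize_engagement(score, threshold=2.0):
    """Thresholded label: scores below the threshold become 0, all others 1."""
    return 0 if score < threshold else 1


def scale_engagement(score, max_score=5.0):
    """Proportional label: map a raw score in [0, max_score] to [0, 1]."""
    return min(max(score / max_score, 0.0), 1.0)


def mean_pool(word_embeddings):
    """Average-pool per-word embeddings into a single response embedding."""
    n = len(word_embeddings)
    return [sum(dim) / n for dim in zip(*word_embeddings)]
```

Scaling keeps the ordering information among engagement scores that a hard threshold discards, which is one plausible reading of why the scaled labels help.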
Specificity Metric (SM) measures the model’s ability to handle diverse words in complex open-domain talking context. We introduce the specificity metric here because some deep models tend to generate general or ambiguous answers.
Mehri and Eskenazi (2020b) use a RoBERTa model (Liu et al., 2019) for the masked language model (MLM) task, while we use a lighter SimCSE model, consistent with the other proposed sub-metrics. Similar to Phy et al. (2020), we use only the response to compute the specificity score. In detail, we mask each response word and predict the negative log-likelihood (SM-LL) based on SimCSE-MLM. We also investigate negative cross-entropy (SM-NCE) and perplexity (SM-PPL) to further improve the effectiveness of the specificity metric.
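As a worked illustration of two of these quantities, the sketch below computes a mean negative log-likelihood and the corresponding perplexity from per-word probabilities. It assumes `token_probs` holds the probability the MLM assigns to each original word when that word is masked; how the model produces these values, and how the resulting numbers are normalized into a final specificity score, is not specified here and is left out.

```python
import math


def mean_nll(token_probs):
    """Mean negative log-likelihood of the masked words (natural log)."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def perplexity(token_probs):
    """Perplexity, i.e. the exponential of the mean negative log-likelihood."""
    return math.exp(mean_nll(token_probs))
```

Under this reading, a response made of words the model predicts easily gets a low perplexity, while rarer, more specific words raise it.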
Correlation Re-Scaling Method
Instead of designing a score composition function for the overall aspect alone, we propose to compute a weight distribution over the designed sub-metrics for each evaluation aspect. The evaluation of each aspect usually relies on several of the designed sub-metrics. For example, suppose an annotator thinks a response generated by the dialogue model is specific. In that case, the annotator probably implies that the response is also fluent and relevant to the dialogue context. However, the designed specificity metric is only trained to predict the specificity score of a response. Hence, to better evaluate each dialogue aspect, we propose to model the relationship between the designed sub-metrics and diverse evaluation qualities.
We propose a novel Correlation Re-Scaling (CRS) method to compute the weight distribution for each aspect. For a dialogue evaluation dataset $i$ with $J$ qualities, we first randomly sample 300 dialogues for Spearman correlation computation.
For each dialogue in dataset $i$, we compute the fundamental sub-metric scores $(s_1, \dots, s_K)$, where $K$ is the number of sub-metrics. Let $S_{ijk}$ denote the Spearman correlation between the scores of sub-metric $k$ and the human scores for quality $j$ on dataset $i$. If $S_{ijk}$ is less than 0, the corresponding sub-metric $k$ is believed to make no contribution to dialogue quality $j$; thus, $S_{ijk}$ is simply set to 0. We treat the correlation score $S_{ijk}$ as the importance of the corresponding sub-metric to quality $j$.
We believe that important sub-metrics should be given higher weight, which is significant for score composition over multiple scores. Hence we compute the normalized weight distribution as follows:
$$w_{ijk} = \frac{S_{ijk}^{\,d_{ij}}}{\sum_{k'=1}^{K} S_{ijk'}^{\,d_{ij}}}$$
where $d_{ij}$ is the power applied to $S_{ijk}$, and the weight assigned to $s_k$ is $w_{ijk}$. The larger $d_{ij}$ is, the more weight is given to more important sub-metrics. According to our experiments, score composition works best when $\max_k w_{ijk}$ is between 1/3 and 1/2, which gives a simple but effective way to determine the value of $d_{ij}$.
To further improve the generalization ability of the CRS method, we average the weights over the 14 development datasets as follows:
$$w_{jk} = \frac{1}{M_j} \sum_{i} w_{ijk}$$
where the sum runs over the development datasets that have quality $j$, $M_j$ is the number of such datasets, and $w_{jk}$ is the normalized weight distribution over the sub-metrics for quality $j$.
For each test dataset, we first compute the 7 kinds of sub-metric scores. Then the composition score for each evaluation quality $j$ can be computed as follows:
$$\mathrm{score}_j = \sum_{k=1}^{K} w_{jk}\, s_k$$
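The CRS steps described in this section (clip negative correlations to zero, raise them to a power, normalize, average across datasets, then take a weighted sum) can be sketched in a few lines. This is an illustrative reading of the prose, not the authors' code: the function names are hypothetical, the correlations are assumed to be precomputed, and the uniform fallback used when every correlation is non-positive is an assumption the text does not cover.

```python
def crs_weights(correlations, power=2.0):
    """Clip negative correlations to 0, raise to `power`, and normalize to sum to 1."""
    rescaled = [max(c, 0.0) ** power for c in correlations]
    total = sum(rescaled)
    if total == 0.0:
        # Assumed fallback: no sub-metric correlates positively, so weight uniformly.
        return [1.0 / len(correlations)] * len(correlations)
    return [r / total for r in rescaled]


def average_weights(per_dataset_weights):
    """Average the weight vectors of all datasets that annotate a given quality."""
    n = len(per_dataset_weights)
    return [sum(column) / n for column in zip(*per_dataset_weights)]


def compose_score(sub_metric_scores, weights):
    """Weighted sum of the sub-metric scores for one dialogue."""
    return sum(w * s for w, s in zip(weights, sub_metric_scores))


# Example with three sub-metrics and two development datasets for one quality.
weights_a = crs_weights([0.3, -0.1, 0.1], power=2.0)
weights_b = crs_weights([0.2, 0.2, 0.0], power=2.0)
quality_weights = average_weights([weights_a, weights_b])
overall = compose_score([0.8, 0.4, 0.6], quality_weights)
```

With a larger `power`, the weight mass concentrates on the sub-metric with the highest correlation; with `power=1.0` the weights are simply proportional to the clipped correlations.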
Dataset for Metric Training. The organizers of task1 require that the development datasets be used only for validating the proposed metrics, not for training the evaluation systems.
Hence, we select the Daily Dialog dataset (Li et al., 2017) for training our metrics (except for the user engagement metric). The Daily Dialog dataset, which covers day-to-day communication on daily topics, consists of 11118/1000/1000 dialogues for the train/valid/test sets, respectively. For the user engagement metric, we use 13124 responses from the ConvAI datasets (http://convai.io/2017/data/) and scale the engagement score proportionally to between 0 and 1.
|Top 1 (ours)||11.66||41.44||29.88||32.64||17.23||8.96||44.76||45.60||32.53||21.98||54.76||31.04|
|Submission||Rank||Average Spearman correlation (%)|
|Our submission 1||2||29.96|
|Our submission 2||1||31.04|
|Our submission 3||4||29.77|
|Our submission 4||5||29.64|
|Our submission 5||6||29.61|
Development dataset. During the development phase, the evaluation metrics are evaluated on 14 datasets (37 evaluation qualities in total). The statistics of the development datasets are listed in Table 1. The organizers of “Automatic Open-domain Dialogue Evaluation” identify the following datasets to test the effectiveness of the proposed evaluation metrics:
DSTC6 (Hori and Hori, 2017).
DSTC7 (Galley et al., 2019).
Persona-Chatlog (See et al., 2019).
PersonaChat-USR (Mehri and Eskenazi, 2020b).
TopicalChat-USR (Mehri and Eskenazi, 2020b).
FED-Turn (Mehri and Eskenazi, 2020a).
FED-Conversation (Mehri and Eskenazi, 2020a).
DailyDialog (Gupta et al., 2019).
DailyDialog (Zhao et al., 2020).
PersonaChat (Zhao et al., 2020).
DailyDialog (Huang et al., 2020).
Empathetic (Huang et al., 2020).
ConvAI2 (Huang et al., 2020).
HUMOD (Merdivan et al., 2020).
Test dataset. During the final test phase, 5 datasets (including 11 evaluation qualities in total) are introduced by task1 organizers to fully evaluate the proposed metrics. The dataset statistics are listed in Table 2.
The Spearman correlation is used in the “Automatic Open-domain Dialogue Evaluation Challenge” of DSTC10. The Spearman correlations between the submitted scores and corresponding human scores will be computed per evaluation category per dataset. The submissions from different participants will be ranked by the average correlation scores across all the datasets’ evaluation qualities.
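The ranking criterion can be made concrete with a small, dependency-free sketch. It computes Spearman correlation as the Pearson correlation of average ranks (so ties are handled) and then averages over all dataset-quality pairs. The function names and the dictionary layout keyed by `(dataset, quality)` are assumptions for illustration; the organizers' actual evaluation script is not shown in the text.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5  # undefined if either input is constant


def challenge_score(predictions, human):
    """Average Spearman correlation over all (dataset, quality) keys."""
    correlations = [spearman(predictions[key], human[key]) for key in human]
    return sum(correlations) / len(correlations)
```

Because the score is rank-based, only the ordering of the submitted scores within each dataset matters, not their scale.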
We fine-tune a pre-trained SimCSE model (Gao et al., 2021) for each proposed metric except the topic coherence metric, and the model weights of different metrics are not shared. All the metrics based on the SimCSE model are trained with an Adam optimizer (Kingma and Ba, 2015) with a learning rate of 1e-5. We train these metrics on the Daily Dialog dataset (Li et al., 2017) and choose the models that have the lowest loss on the Daily Dialog evaluation data. We also test other pre-trained models, such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019), but no performance improvement is observed.
Similar to Huang et al. (2020), we use 3 graph attention network (GAT) layers (Veličković et al., 2018) with 4 attention heads, but we remove the contextualized encoding of the context-response pair to simplify the model. The training of the model is consistent with Huang et al. (2020), except that we change the learning rate to 1e-5.
Comparison Setting. In this part, we will compare 1) Deep AM-FM (Zhang et al., 2021) (the baseline of task1) and top 5 teams; 2) our different submissions on the test datasets. Table 3 lists the Deep AM-FM and top 5 with the highest average Spearman correlation on the test datasets. In Table 4, we list the comparison results of our 5 submissions. The baseline method Deep AM-FM and our submissions are introduced as follows:
Deep AM-FM. A DNN-based automatic metric that measures the evaluation quality of dialogue generation along two dimensions: 1) Adequacy Metric: The semantic similarity between dialogue context and response; 2) Fluency Metric: The syntactic quality of the sentence construction.
Top1-5. Top1-5 refer to the top 5 teams with the highest average Spearman correlation on the test datasets of task1. Top1 is our team.
Our submission 1-3. The metric scores are integrated using our proposed CRS score composition method. To simplify the computation of the CRS method and improve the generalization ability of our metric, we simply set the power used in CRS to 1, 2, and 3 in our submissions 1, 2, and 3, respectively.
Our submission 4-5. We compute a specific power for each dialogue quality in each dataset. In submission 4, a SimCSE pre-trained model is used, while in submission 5, a BERT pre-trained model is used instead, as a safeguard in case the SimCSE model does not work well on the test set.
Comparison Results. We compare our approach with Deep AM-FM and the top 5 teams in Table 3. The performance comparison of our different submissions is shown in Table 4. These results support the following statements:
Our MME-CRS achieves the highest average Spearman correlation (1.11% higher than the second) on five test datasets, which demonstrates the effectiveness of our proposed metric MME-CRS.
Our method ranks first in 6 out of 11 dialogue evaluation qualities, demonstrating that our proposed evaluation metrics have a higher correlation with human judgments than baseline and other teams.
Our submissions 1-3, which fix the power to a constant, perform better than submissions 4-5, which indicates that a constant power generalizes better when migrating to the test datasets. Furthermore, setting the power to 2 achieves the best performance on the test datasets.
Comparison Setting. In the final evaluation period, participants must submit a single score for each dialogue, and the organizers will compute the correlation between human scores of each quality with submitted scores. In Table 5, we remove fluency metric (FM), relevance metric (RM), topic coherence metric (TCM), engagement metric (EM), and specific metric (SM), respectively, to explore the importance of different metrics in the submitted scores. We take the SM composed of three metrics as a whole part to explore the influence of the SM. Considering that RM and TCM both rely on dialogue context, we also remove them together in our experiments.
Comparison Results. The comparison results of the ablation experiment are shown in Table 5. These results support the following statements:
TCM, RM, and EM contribute most to the performance. When we delete them from score composition, the final average Spearman correlation will drop 3.26%, 1.56%, and 1.01%, respectively.
Coarse-grained RM and fine-grained TCM are a beneficial complement to each other. If we ignore one of them, the performance will drop slightly. However, if both of them are ignored, the average correlation will drastically drop to 11.07%.
The improvement from SM is negligible on the test datasets. We observe that many responses in the test datasets tend to be very specific but are not relevant to the dialogue context. We infer that the models used to generate these responses are overfitted on the test dataset.
The Effectiveness of CRS Method
Comparison Setting. Score composition is a significant component of open-domain dialogue evaluation because the full evaluation of a dialogue usually depends on many aspects. In this part, we compare the performance of MME-Avg, which simply averages the scores from different metrics, with our MME-CRS method.
Comparison Results. The comparison result is listed in Table 6, and the following statements can be drawn from the results.
The average Spearman correlation score of MME-CRS is substantially higher than that of MME-Avg (3.49% higher), indicating that our proposed CRS method can effectively integrate different scores to comprehensively measure the quality of dialogue.
The correlation of MME-CRS is higher on all evaluation qualities, demonstrating that each evaluation quality can benefit from the score composition method CRS.
In this paper, we propose an open-domain dialogue evaluation approach composed of 5 groups of metrics to fully measure the quality of model response. Further, we propose a novel metric composition method called CRS. CRS models the relationship between metrics and evaluation qualities to comprehensively integrate sub-metric scores for dialogue evaluation. Experimental results on test datasets show that our proposed MME-CRS achieves the best performance, showing that our metric correlates better with human judgments. Compared with baseline and other teams, our approach obtains superior performance and ranks 1st in the final evaluation.
- Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977.
- METEOR: an automatic metric for mt evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pp. 65–72.
- Automatic evaluation and moderation of open-domain dialogue systems. arXiv preprint arXiv:2111.02110.
- Chameleons in imagined conversations: a new approach to understanding coordination of linguistic style in dialogs. In Proceedings of the 2nd Workshop on Cognitive Modeling and Computational Linguistics, pp. 76–87.
- Bert: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, pp. 4171–4186.
- Grounded response generation task at dstc7. In AAAI Dialog System Technology Challenges Workshop.
- SimCSE: simple contrastive learning of sentence embeddings. arXiv preprint arXiv:2104.08821.
- Better automatic evaluation of open-domain dialogue systems with contextualized embeddings. In Proceedings of the Workshop on Methods for Optimizing and Evaluating Neural Language Generation, pp. 82–89.
- Predictive engagement: an efficient metric for automatic evaluation of open-domain dialogue systems. Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 7789–7796.
- Investigating evaluation of open-domain dialogue systems with human generated multiple references. In Proceedings of the 20th Annual SIGdial Meeting on Discourse and Dialogue, pp. 379–391.
- End-to-end conversation modeling track in dstc6. dialog 888 (107,506), pp. 2–000.
- Grade: automatic graph-enhanced coherence metric for evaluating open-domain dialogue systems. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 9230–9240.
- Adam: a method for stochastic optimization. In ICLR (Poster).
- Pone: a novel automatic evaluation metric for open-domain generative dialogue systems. ACM Transactions on Information Systems (TOIS) 39 (1), pp. 1–37.
- Dailydialog: a manually labelled multi-turn dialogue dataset. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 986–995.
- Roberta: a robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.
- Unsupervised evaluation of interactive dialog with dialogpt. In Proceedings of the 21th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pp. 225–235.
- Usr: an unsupervised and reference free evaluation metric for dialog generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 681–707.
- Human annotated dialogues dataset for natural conversational agents. Applied Sciences 10 (3), pp. 762.
- Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
- End-to-end evaluation in verbmobil i. Proceedings of MT Summit VI, pp. 232–239.
- Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp. 311–318.
- Deconstruct to reconstruct a configurable evaluation metric for open-domain dialogue systems. In Proceedings of the 28th International Conference on Computational Linguistics, pp. 4164–4178.
- An optimal assessment of natural language student input using word-to-word similarity metrics. In International Conference on Intelligent Tutoring Systems, pp. 675–676.
- What makes a good conversation? how controllable attributes affect human judgments. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 1702–1723.
- Learning an unreferenced metric for online dialogue evaluation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 2430–2441.
- A neural network approach to context-sensitive generation of conversational responses. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 196–205.
- Conceptnet 5.5: an open multilingual graph of general knowledge. In Thirty-first AAAI conference on artificial intelligence.
- Ruber: an unsupervised method for automatic evaluation of open-domain dialog systems. In Thirty-Second AAAI Conference on Artificial Intelligence.
- Graph attention networks. In International Conference on Learning Representations.
- Retrieve and refine: improved sequence generation models for dialogue. In Proceedings of the 2018 EMNLP Workshop SCAI: The 2nd International Workshop on Search-Oriented Conversational AI, pp. 87–92.
- Towards universal paraphrastic sentence embeddings. In 4th International Conference on Learning Representations.
- Towards implicit content-introducing for generative short-text conversation systems. In Proceedings of the 2017 conference on empirical methods in natural language processing, pp. 2190–2199.
- Deep am-fm: toolkit for automatic dialogue evaluation. In Conversational Dialogue Systems for the Next Decade, pp. 53–69.
- Learning to control the specificity in neural response generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1108–1117.
- BERTScore: evaluating text generation with bert. In International Conference on Learning Representations.
- Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 654–664.
- Designing precise and robust dialogue response evaluators. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 26–33.