Towards Unified Dialogue System Evaluation: A Comprehensive Analysis of Current Evaluation Protocols

06/10/2020 · Sarah E. Finch, et al. · Emory University

As chat-oriented dialogue management becomes an increasingly popular research topic, the need for a standardized and reliable evaluation procedure grows ever more pressing. Currently, a variety of evaluation protocols are used to assess chat-oriented dialogue systems, making it difficult to conduct fair comparative studies across different approaches and to gain an insightful understanding of their strengths. To foster this research, a more robust evaluation protocol must be put in place. This paper presents a comprehensive synthesis of both automated and human evaluation methods for dialogue systems, identifying their shortcomings while accumulating evidence towards the most effective evaluation dimensions. A total of 20 papers from the last two years are surveyed to analyze three types of evaluation protocols: automated, static, and interactive. Finally, the evaluation dimensions used in these papers are compared against our expert evaluation of the system-user dialogue data collected from the Alexa Prize 2020.


1 Introduction

Most successful automated dialogue systems follow a task-oriented dialogue management methodology, which defines an explicit goal that the system seeks to fulfill through the conversation with the user Gao et al. (2019). Recently, research in chat-oriented dialogue management has experienced a substantial increase in popularity. Unlike task-oriented dialogues, where success is generally measured as the ability to complete the goal of the task, the evaluation of chat-oriented dialogues is much less straightforward, since the conversational goals can be highly subjective Huang et al. (2019).

The evaluation of chat-oriented dialogue systems has typically been accomplished through the use of automated metrics and human evaluation (Section 2). Automated evaluation requires no human labor once the evaluation script is written (Section 3). For automated evaluation to be a reliable measurement of dialogue system quality, however, it needs to be shown to be a close approximation of human judgments (Section 4). Unfortunately, commonly used automated metrics correlate weakly with human judgments, indicating the poor utility of such metrics Liu et al. (2016). Human evaluation has become more commonplace in recent dialogue system works; however, it presents its own challenges. For one, it is time-consuming and expensive to obtain human judgments. More critically, there is a lack of a standardized protocol for such human evaluation, which makes it challenging to compare different approaches to one another.

There have been many previous attempts at standardizing dialogue system evaluations. A major limitation has been their focus on task-oriented dialogue systems, which does not translate well to chat-oriented dialogue systems Walker et al. (1997); Malchanau et al. (2019). Previous works that have included chat-oriented evaluations have lacked comprehensive coverage over the many varieties of such evaluation procedures currently in use. Instead, the emphasis has rested primarily on automated metrics at the expense of a detailed analysis of human evaluation Deriu et al. (2019). At this stage in conversational AI, it is probable that automated and human metrics reveal different aspects of dialogue systems Hashimoto et al. (2019). It would be remiss to focus on a single evaluation category when assessing the state of the field. For this reason, our work aims to fill in the gaps of previous dialogue system evaluation surveys by identifying and comparing human evaluation protocols for chat-oriented dialogue systems.

To this end, we present a comparative analysis of the evaluations used for chat-oriented dialogue systems over the past several years. Since the field of conversational AI has experienced a rapid growth in these years, it presents a unique opportunity to observe and assess which evaluation metrics have been most widely adopted by the larger community in this period of expeditious development. We provide a detailed survey of both automated and human evaluations in order to present the most accurate depiction of the current evaluation protocols. However, our in-depth analysis is limited to that of the human evaluations due to the abundance of previous work in automated metric analysis. As such, we defer to work such as Liu et al. (2016), Ghandeharioun et al. (2019), and Ghazarian et al. (2019) for more detail on automated metrics.

As a part of our analysis, we also present a case study of real human-machine dialogues which explores the significance of different human evaluation metrics in terms of overall user satisfaction through an expert analysis. As a result of our work, the most commonly used evaluation metrics in contemporary literature - both automated and human - are revealed in detail and our findings towards the prevalence, impact, and applicability of human evaluation metrics are illustrated.

2 Evaluation Protocols

For a holistic understanding of current evaluation protocols for dialogue systems, we have carefully selected 20 relevant papers published since 2018, primarily from top-tier venues, and synthesized their methods. Throughout the paper, the bracketed numbers 1-20 refer to these surveyed works: 1: Li and Sun (2018), 2: Liu et al. (2018), 3: Luo et al. (2018), 4: Moghe et al. (2018), 5: Parthasarathi and Pineau (2018), 6: Xu et al. (2018), 7: Young et al. (2018), 8: Zhang et al. (2018), 9: Du and Black (2019), 10: Li et al. (2019), 11: Lin et al. (2019), 12: Madotto et al. (2019), 13: Qiu et al. (2019), 14: Tian et al. (2019), 15: Wu et al. (2019), 16: Zhang et al. (2019), 17: Zhou et al. (2019), 18: Zhu et al. (2019), 19: Adiwardana et al. (2020), 20: Wang et al. (2020). These papers focus on open-domain (i.e., non-task-oriented) dialogue and employ a variety of approaches, including:

  • Incorporation of knowledge bases [2, 4, 7, 18, 20]

  • Integration of personality [8, 12]

  • Handling of emotion-driven responses [10]

  • Pure reliance on neural sequence-to-sequence models [19]

Based on these papers, three main categories of evaluation protocols for open-domain dialogue systems are found: automated, static, and interactive. Automated evaluation is performed systematically by a batch script such that no human effort is required once the script is written (Section 2.1). Static evaluation is done by humans, where the evaluator assesses a dialogue whose last utterance is generated by the dialogue system (Section 2.2). Interactive evaluation is also done by humans, although the evaluator assesses the quality of the dialogue after directly interacting with the dialogue system (Section 2.3).

Table 1 shows the distributions of the three evaluation protocols. Most recent approaches adopt both automated and human evaluations, with only 2 papers not including any form of human evaluation. The most common protocol for human evaluation is static evaluation, with very few papers conducting interactive assessments of dialogue systems. No work has adopted all three types of evaluation protocols.

Method References #
AUT [1, 2, 3, 4, 5, 6, 7, 9, 10, 11, 12, 13, 14, 15, 16, 17, 20] 17
STA [1, 3, 4, 7, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20] 16
INT [8, 19] 2
AUT & STA [1, 3, 4, 7, 9, 10, 11, 12, 13, 14, 15, 16, 17, 20] 14
AUT & INT [ ] 0
STA & INT [19] 1
Table 1: Distributions of the three evaluation protocols. #: number of papers using the corresponding protocol; AUT/STA/INT: automated/static/interactive evaluation; &: approaches using both protocols.

2.1 Automated Evaluation

Automated evaluation provides an objective quantitative measurement of dialogue systems by operationalizing various dimensions of dialogue into mathematical formulations. Depending on the specific objectives behind different systems, a few studies define novel automated metrics to capture the benefit of their proposed approaches. Automated evaluation provides the most straightforward and undemanding methods by which to evaluate dialogue systems; however, such metrics are generally viewed as poor indicators of true dialogue quality, following the results of Liu et al. (2016).

2.2 Static Evaluation

Static evaluation is an offline procedure where the evaluators never directly interact with the dialogue systems under review; instead, they are provided with dialogue excerpts. These excerpts are generated by first randomly sampling dialogues from a corpus of human-to-human conversations, then having the systems produce responses to the sampled dialogues. The sampled dialogues together with the system responses are provided to human evaluators to assess. Because only the last utterance in each excerpt is generated by the dialogue system, it is difficult to evaluate sequential aspects of dialogue management through static evaluation (e.g., coherence among responses generated by the same system).

Metric #
BLEU 14
C 1
Coherence 1
Distinct 9
Embedding 5
Entity A/R 1
Entity Score 2
Entropy 1
Inertia 1
Perplexity 7
ROUGE 2
Table 2: Metrics of the automated evaluation used by the 20 surveyed papers on open-domain dialogue systems. #: number of papers using the corresponding metric.

2.3 Interactive Evaluation

Unlike static evaluation, interactive evaluation has the same person play the role of both the user (one who interacts with the system) and the evaluator. In this setup, the evaluator has a conversation with the dialogue system and makes the assessment at the end of the conversation. Even though this procedure is more demanding in terms of time and human effort than static evaluation, it allows the evaluator to gain a better sense of the capability of the dialogue system through explicit interaction.

3 Analysis of Automated Evaluation

Table 2 shows the 11 metrics used for automated evaluation in our survey:

  • BLEU: a subset of BLEU-1 through BLEU-4 Papineni et al. (2002)

  • C: sum of entailment scores between response and persona description Madotto et al. (2019)

  • Coherence: average word embedding similarity between dialogue context and generated response Xu et al. (2018)

  • Distinct: a subset of Distinct-1, Distinct-2, and Distinct-sentence Li et al. (2016)

  • Embedding: a subset of average, extrema, and greedy embedding similarity Liu et al. (2016)

  • Entity A/R: Accuracy and recall for including the correct entities in the response Liu et al. (2018)

  • Entity Score: average number of entities per response Young et al. (2018)

  • Entropy: average character-level entropy over all responses Mou et al. (2016)

  • Inertia: inertia on the clusters of embeddings of responses Du and Black (2019)

  • Perplexity: inverse likelihood of predicting the responses of the test set Chen et al. (1998)

  • ROUGE: a subset of ROUGE-1, ROUGE-2, and ROUGE-L Lin (2004)

Dimension #
Appropriateness 2
Coherence 2
Consistency 3
Context Coherence 1
Correctness 2
Diversity 1
Emotion 1
Empathy 1
Engagingness 1
Fluency 9
Grammaticality 1
Humanness 1
Informativeness 4
Knowledge Rel. 3
Logic 1
Proactivity 1
Quality 2
Readability 1
Relevance 3
Sensibleness 1
Specificity 2
Table 3: Dimensions of the human evaluation used by the surveyed dialogue system papers [1, 2, 3, 4, 7, 8, 10, 11, 12, 13, 14, 15, 17, 18, 19, 20]. [5, 6] do not perform any human evaluation; [9, 16] perform human evaluation without reference to dimensions. #: number of papers adopting the corresponding dimension.

The automated metrics in Table 2 fall into the following five categories:

Ground Truth Response Similarity

Most commonly used automated metrics focus on assessing how well system responses match the ground truth human responses, using word overlap (BLEU, ROUGE) or embedding similarity.
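As an illustration of this category, the word-overlap variant can be computed with off-the-shelf tooling. Below is a minimal sketch using NLTK's sentence-level BLEU; the example strings and the whitespace tokenization are our own illustrative assumptions, not details from the surveyed papers.

```python
# Minimal sketch of ground-truth response similarity via word overlap (BLEU),
# assuming whitespace tokenization; surveyed papers may tokenize differently.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "i love hiking in the mountains".split()    # ground-truth human response
hypothesis = "i really love hiking outdoors".split()    # system-generated response

smooth = SmoothingFunction().method1  # avoids zero scores on short sentences
bleu1 = sentence_bleu([reference], hypothesis, weights=(1.0,), smoothing_function=smooth)
bleu2 = sentence_bleu([reference], hypothesis, weights=(0.5, 0.5), smoothing_function=smooth)
print(f"BLEU-1: {bleu1:.3f}  BLEU-2: {bleu2:.3f}")
```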

Context Coherence

Embedding similarities between dialogue contexts and system responses have been used to quantitatively assess the relevance between the system responses and the preceding dialogue history (Coherence, Embedding).
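A minimal sketch of this kind of metric is given below: it averages word vectors for the context and for the response and takes their cosine similarity. The `embeddings` lookup table is a hypothetical placeholder (e.g., pre-trained word vectors loaded elsewhere), not a component of any surveyed system.

```python
# Sketch of a Coherence-style metric: cosine similarity between the averaged
# word embeddings of the dialogue context and of the generated response.
import numpy as np

def avg_embedding(tokens, embeddings, dim=300):
    # Average the vectors of all in-vocabulary tokens; zero vector if none found.
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def context_coherence(context_tokens, response_tokens, embeddings):
    c = avg_embedding(context_tokens, embeddings)
    r = avg_embedding(response_tokens, embeddings)
    denom = np.linalg.norm(c) * np.linalg.norm(r)
    return float(np.dot(c, r) / denom) if denom > 0 else 0.0
```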

Response Diversity

Other widespread metrics assess the diversity of the system responses in order to determine the amount of repetition and generic content in the system responses (Distinct, Entropy, Inertia, Entity Score).
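For concreteness, Distinct-n, one of the diversity metrics in this category, can be sketched as the ratio of unique n-grams to total n-grams over all system responses; the short example below uses invented responses and assumes pre-tokenized input.

```python
# Sketch of Distinct-n: the ratio of unique n-grams to total n-grams across
# all system responses (higher values indicate less repetitive, more diverse output).
from typing import List

def distinct_n(responses: List[List[str]], n: int) -> float:
    ngrams = [tuple(resp[i:i + n]) for resp in responses for i in range(len(resp) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

responses = [["i", "like", "music"], ["i", "like", "movies"], ["that", "is", "great"]]
print(distinct_n(responses, 1), distinct_n(responses, 2))
```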

Language Model Fitness

Generative models are usually evaluated in terms of how well they learn to model the language of the dialogues in their training corpus (Perplexity).
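As a worked illustration (with invented token probabilities rather than output from any surveyed model), perplexity can be viewed as the exponentiated average negative log-likelihood the model assigns to the test responses:

```python
# Sketch relating perplexity to a model's token probabilities: lower perplexity
# means the model fits the dialogue language of the test set better.
import math

def perplexity(token_probs):
    # Average negative log-likelihood over tokens, then exponentiate.
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

print(perplexity([0.2, 0.1, 0.4, 0.25]))  # ~4.73 for these invented probabilities
```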

Application-Specific

The remaining metrics can be considered application-specific: Entity A/R measures the ability of the system to produce the correct entities in its responses, and C is specifically created as a measure of the consistency between dialogue responses and their respective persona descriptions.

4 Analysis of Human Evaluation

While automated evaluation measures dimensions of dialogue objectively, human evaluation captures the subjective assessment from the user’s point of view. Regardless of the exact method chosen, all human evaluations involve gathering human evaluators who answer questions regarding the dialogues produced by a dialogue system.

4.1 Dimensions of Human Evaluation

There is high variability in the dimensions of dialogue that previous studies have used for assessing dialogue systems in both static and interactive evaluations. Table 3 provides a detailed overview of the dimensions used by each of the surveyed papers when evaluating their work. There are a total of 21 uniquely-worded dimensions; 11 of them appear in only a single paper. This distribution provides clear evidence of the inconsistencies in human evaluation methods, as its sparsity indicates low overlap among those methods. The long-tail distribution of the evaluation dimensions makes cross-work comparisons difficult without a substantial study to align the disparate evaluations of one work with another.

Although the evaluation dimensions appear to be distinct on the surface, several of them appear to be similar in meaning. To analyze the level of overlap among the seemingly distinct evaluation dimensions, we compile the definitions and instructions shared by each of the papers regarding their evaluation dimensions and rating scales. Based on manual analysis, we are able to group dimensions together that are indeed evaluating the same aspect of dialogue as one another, even though the authors mention them by different names. Table 4 provides the dimension groupings that are identified on the basis of their respective definitions.

Fluency: Whether the response from the listener is understandable Lin et al. (2019); whether the response is fluent and natural Li et al. (2019); whether each sentence has correct grammar Luo et al. (2018); whether the produced response itself is fluent Wu et al. (2019)
Consistency: Whether the reply is fluent and grammatical Li and Sun (2018)
Readability: Whether the utterance is grammatically formed Qiu et al. (2019)
Grammaticality: Whether the response is fluent and grammatical Zhu et al. (2019)
(a) Grammatical Capability.
Relevance: Whether the responses of the listener seem appropriate to the conversation Lin et al. (2019); whether the response is appropriate/relevant in the current context Moghe et al. (2018); whether the reply is relevant to the query Qiu et al. (2019)
Appropriateness: Whether the response is appropriate in grammar, topic, and logic Young et al. (2018)
Coherence: Whether the generated response is relevant to the input Luo et al. (2018); whether the whole dialogue is fluent and does not contain irrelevant or illogical responses Wu et al. (2019)
Context Coherence: Whether the response is coherent with the context and guides the following utterances Li et al. (2019)
Logic: Whether the post and the reply are logically matched Li and Sun (2018)
Sensibleness: Whether the response makes sense given the context Adiwardana et al. (2020)
(b) Turn Coherence.
Informativeness: Whether the response provides new information and knowledge in addition to the post Young et al. (2018); whether the response has unique words and multi-topic clauses Tian et al. (2019); whether the response has meaningful information relevant to its message Zhu et al. (2019); whether the model makes full use of knowledge in the response Wu et al. (2019)
Specificity: Whether the model produced movie-specific responses or generic responses Moghe et al. (2018); whether the response is specific to the context Adiwardana et al. (2020)
Diversity: Whether the reply narrates with diverse words Qiu et al. (2019)
(c) Response Informativeness.
Table 4: Proposed reductions of dialogue evaluation dimensions into non-overlapping components.

Definitions in Table 4(a) aim to address the grammaticality of system responses, including words like grammar, understandable, and accurate. As a result, the four dimensions recorded in this group can be viewed as lexical variations of the same underlying Grammaticality dimension. Similarly, definitions in Table 4(b) highlight keywords like appropriate, relevant, and on-topic, thus providing evidence that each of those dimensions is an instance of the Relevance dimension. Finally, Table 4(c) has a high occurrence of information- and diversity-focused definitions, and we can reduce the dimensions shown there to the single Informativeness dimension.

Other than these highly overlapping dimensions, Quality Tian et al. (2019); Zhou et al. (2019) and Humanness Moghe et al. (2018) can both be considered as the single Quality dimension, since they are used to elicit an overall quality assessment of the dialogue system responses. Similarly, Emotion Li and Sun (2018) and Empathy Lin et al. (2019) can be reduced into the Emotional Understanding dimension that captures both the comprehension and production of emotional responses. The remaining two dialogue dimensions assess a unique quality of dialogue and are useful as independent dialogue dimensions:

  • Engagingness: whether the response includes interesting content Zhang et al. (2018)

  • Proactivity: whether the response introduces new topics without breaking coherence Wu et al. (2019)

Finally, two evaluation dimensions are specifically used for a subset of dialogue systems that incorporate knowledge:

  • Correctness: whether the response is accurate with respect to real-world knowledge Liu et al. (2018); Wang et al. (2020)

  • Knowledge Relevance: whether the knowledge shared in the response is appropriate to the context Liu et al. (2018); Wang et al. (2020)

Knowledge Relevance is very similar to the previously discussed Relevance dimension, although it is specifically targeting an assessment of the appropriateness of the knowledge being used. Even more niche, the Correctness dimension is unique to knowledge-focused systems that seek to present only true factual information to the user; thus, such a dimension may not be useful in other contexts. Due to their targeted nature, these two dimensions may fall outside of the scope of a general, comprehensive, unified evaluation of dialogue systems, and instead be used for a targeted subgroup.

Grammaticality: Responses are free of grammatical and semantic errors
Relevance: Responses are on-topic with the immediate dialogue history
Informativeness: Responses produce unique and non-generic information that is specific to the dialogue context
Emotional Understanding: Responses indicate an understanding of the user’s current emotional state and provide an appropriate emotional reaction based on the current dialogue context
Engagingness: Responses are engaging to the user and fulfill the particular conversational goals implied by the user
Consistency: Responses do not produce information that contradicts other information known about the system
Proactivity: Responses actively and appropriately move the conversation along different topics
Quality: The overall quality of and satisfaction with the dialogue
Table 5: The final set of our proposed dialogue dimensions for human evaluation.

After merging similar dimensions and discarding non-generalizable ones, a total of eight dimensions have been identified that share little to no definitional overlap and are reasonably applicable to all dialogue systems. Table 5 shows the finalized set of dialogue evaluation dimensions.

4.2 Diversities in Evaluation Metrics

Aside from the discrepancies in the dialogue dimensions used for evaluation among different works, the actual procedure for evaluating these dialogue dimensions varies even further, particularly for static evaluations. A majority of works instruct human annotators to rate the dialogue system responses on a set of dialogue dimensions using numeric scales, where the scales are often different even between works that employ the same dialogue dimensions. For instance, one of the most commonly used dimensions is the Fluency of the dialogue, with 9 out of the 16 papers in Table 3 adopting it as an evaluation dimension. Across those 9 studies, Fluency ratings use scales of:

  • 0–2: Wu et al. (2019); Li et al. (2019)

  • 0–3: Wang et al. (2020); Liu et al. (2018)

  • 1–5: Moghe et al. (2018); Zhang et al. (2018); Lin et al. (2019); Madotto et al. (2019)

  • 1–10: Luo et al. (2018)

Furthermore, some studies use a preference metric for static evaluation in addition to - or even instead of - the numerical ratings Lin et al. (2019); Young et al. (2018); Du and Black (2019); Zhang et al. (2019). In this case, human annotators are asked to select the most compelling response among many generated by multiple dialogue systems or even humans. Thus, preference metrics provide estimated ranking scores among different systems by measuring the percentage of times each system is preferred over the others.
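The ranking computation behind such preference metrics amounts to a simple win-rate calculation; the sketch below uses invented annotator choices purely to illustrate the idea.

```python
# Sketch of a preference metric: each annotation records which system produced
# the preferred response; the per-system win rate gives an estimated ranking.
from collections import Counter

preferences = ["sys_A", "sys_B", "sys_A", "human", "sys_A", "sys_B"]  # invented annotator choices
counts = Counter(preferences)
total = len(preferences)
for system, wins in counts.most_common():
    print(f"{system}: preferred {wins / total:.0%} of the time")
```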

In contrast to this diversity in static evaluation, the two papers employing interactive evaluation, Zhang et al. (2018) and Adiwardana et al. (2020), use only numerical ratings on specific dialogue dimensions; other methods such as preference metrics are not used in either case.

4.3 Static vs Interactive Evaluations

Establishing the necessary assessment metrics is only one consideration to achieve an accurate dialogue evaluation. The other major consideration is the procedure underlying the evaluation. This section discusses the two human evaluation protocols, static and interactive evaluations, that have previously been used by many dialogue systems.

Although both evaluation protocols overcome the deficiencies of automated evaluation through human judgment, interactive evaluation is hypothesized to be a more reliable assessment strategy than the static one. What static evaluation offers over interactive evaluation is a lower cost in terms of time and labor. By removing the human annotator from the task of interacting with the dialogue system, and instead having them review a dialogue excerpt, the amount of work required is reduced.

However, this lower cost is simultaneously a point in favor of static evaluation and a factor in why it is less reliable. As Ghandeharioun et al. (2019) suggest, chat-oriented dialogues have a less defined conversational goal, which can best be summarized as being able to hold a “natural social interaction with humans”. The success - or failure - at this can only be evaluated by the targeted recipient of the conversation; namely, the user that the system is interacting with. External annotators, at best, can estimate the user’s satisfaction with the conversation based on their own projected opinions, which is not necessarily the most accurate assessment.

OQ GR RE IN EU EN CO PR
1 5.00 (0.00) 1.94 (0.98) 2.86 (1.29) 1.00 (0.00) 2.33 (0.89) 4.94 (0.23) 1.64 (0.87)
2 4.70 (0.47) 2.85 (0.88) 3.25 (1.25) 1.15 (0.37) 3.15 (0.75) 4.90 (0.31) 2.15 (0.59)
3 4.62 (0.51) 3.46 (0.52) 2.92 (0.86) 1.08 (0.28) 2.92 (0.49) 4.77 (0.44) 2.38 (0.65)
4 4.71 (0.46) 3.89 (0.42) 4.25 (0.70) 1.11 (0.31) 3.86 (0.36) 4.82 (0.39) 2.93 (0.54)
5 4.33 (0.58) 4.33 (0.58) 3.67 (0.58) 1.33 (0.58) 4.00 (0.00) 5.00 (0.00) 3.00 (0.00)
(a) The OQ column shows the overall quality ratings from our expert and the other columns show the average ratings from the expert on the corresponding dialogue dimensions.
OQ GR RE IN EU EN CO PR
1 4.85 (0.37) 2.20 (1.20) 2.95 (1.28) 1.00 (0.00) 2.60 (1.05) 4.85 (0.37) 1.95 (0.94)
2 4.80 (0.41) 3.05 (1.10) 3.95 (1.19) 1.25 (0.44) 3.30 (0.92) 5.00 (0.00) 2.10 (0.79)
3 4.85 (0.37) 2.75 (1.07) 2.50 (0.95) 1.00 (0.00) 2.60 (0.75) 4.90 (0.31) 2.05 (0.89)
4 4.65 (0.49) 3.40 (0.82) 3.30 (0.92) 1.10 (0.31) 3.25 (0.79) 4.85 (0.37) 2.25 (0.72)
5 4.80 (0.41) 3.30 (1.13) 4.10 (0.97) 1.05 (0.22) 3.50 (0.76) 4.80 (0.41) 2.85 (0.75)
(b) The OQ column shows the overall quality ratings from the Alexa Prize and the other columns show the average ratings from the expert on the corresponding dialogue dimensions.
Table 6: The average ratings by our expert on each of the dialogue dimensions in Table 5 with respect to the overall ratings from the expert and the Alexa Prize. OQ: Quality, GR: Grammaticality, RE: Relevance, IN: Informativeness, EU: Emotional Understanding, EN: Engagingness, CO: Consistency, PR: Proactivity.

In addition, static evaluation is commonly conducted by producing a single system response in a fixed dialogue context. This fails to reveal certain system deficiencies, such as repetitiveness, inconsistency, and lack of long-term memory of the information shared in the conversation. It also prevents any assessment of the system’s error-handling and misunderstanding-recovery capabilities. All of these aspects are necessary to truly assess the quality of dialogues that a given dialogue system can produce. Without this information, only a biased perspective can be achieved, and the evaluation will not reflect the true capability of the system if it were to be used in practice.

5 Case Study: Alexa Prize 2020

This section presents a case study of the significance of the proposed dialogue dimensions in Table 5 using real human-machine dialogues. For this analysis, 100 rated conversations were taken from the Alexa Prize Socialbot Grand Challenge 3 (https://developer.amazon.com/alexaprize), a university competition to create innovative open-domain chatbots Ram et al. (2018). During the competition, conversations are rated in terms of Overall Quality on a scale of 1 (worst) to 5 (best) under the interactive evaluation protocol. For this case study, we sampled conversations with an equal distribution across all ratings, where every conversation has at least three turns to ensure sufficient content.

Because only the Overall Quality dimension is provided from the interactive evaluation, we also conducted an expert analysis on the same conversations in order to explore the implications of the other previously identified dialogue dimensions. To this end, one of the authors - who has over three years of experience in dialogue system research - manually rated the conversations on each of the dialogue dimensions in Table 5.

It is worth noting that the following findings should be taken as only a preliminary analysis, especially considering the low agreement between the expert and interactive evaluations on OQ, which will be discussed shortly (Section 5.2). This disparity between the expert and human user evaluations makes it difficult to draw a convincing conclusion regarding the significance of the evaluation dimensions. However, we hope this work builds momentum toward investigating the importance of such evaluation dimensions in the overall human perception of dialogue quality.

5.1 Quality vs. Other Dialogue Dimensions

Table 6 shows the average rating and its standard deviation on each of the 7 dialogue dimensions (GR, RE, IN, EU, EN, CO, PR) across the overall quality ratings (OQ). All ratings on those 7 dimensions are assessed by our expert. OQ ratings are provided by the expert for Table 6(a) and by the human users from the Alexa Prize for Table 6(b).

Relevance & Proactivity

The clearest positive relationship to OQ is observed from RE and PR, especially from the expert evaluation although it can be seen in the interactive evaluation as well. This suggests that these dimensions are pertinent to the human perception of dialogue quality, and that this relationship is even more apparent when evaluators are given the opportunity to review previous dialogue turns when determining OQ.

Informativeness & Engagingness

The relationship of IN and EN to OQ is not as obvious as for the previous two dimensions, RE and PR, although an indication of a positive relationship is observed.

Grammaticality

Due to the manual curation of responses in our Alexa Prize chatbot, we have tight control over the grammaticality of our responses; thus, the overall variance in GR is low. Interestingly, we do notice a slight inverse relationship between GR and OQ. Although this may seem counter-intuitive, the likely explanation is that conversations with higher OQ tend to be longer and thus comprise a greater number of topics, and as more topics are introduced, the chance that an (accidentally) ungrammatical response is revealed increases. Nonetheless, it appears that ungrammaticality is not a strict deterrent on OQ.

Emotional Understanding & Consistency

The effect of EU and CO on OQ is inconclusive from the presented analysis. This is attributed to the low variation of these dimensions in our chatbot, as we enforce the consistency of its responses and do not aim to tackle emotional understanding.

5.2 Expert vs. Interactive Evaluations

The inter-annotator agreement between the OQ ratings of the expert and the users from the Alexa Prize is provided in Table 7. The agreement is measured for both fine-grained ratings that consider all scales (1 - 5) and coarse-grained ratings that consider only two scales (low: 1 - 2, high: 3 - 5). Although the inter-annotator agreement is higher for the coarse-grained ratings, it is apparent that the agreement scores are dramatically low for both.

Rating Type Agreement
Fine-grained 0.13
Coarse-grained 0.22
Table 7: Cohen’s Kappa scores on the overall quality ratings between the expert and interactive evaluation.
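For reference, the fine- and coarse-grained agreement computations can be reproduced with a standard Cohen's Kappa implementation; the sketch below uses scikit-learn and invented ratings, with the coarse-grained variant collapsing ratings into low (1-2) and high (3-5) before computing agreement.

```python
# Sketch of fine- vs. coarse-grained agreement, assuming two parallel lists of
# 1-5 overall-quality ratings (the values here are invented for illustration).
from sklearn.metrics import cohen_kappa_score

expert =      [1, 2, 5, 3, 4, 1, 2, 3]
interactive = [2, 2, 4, 5, 4, 1, 3, 1]

def to_coarse(ratings):
    # Collapse the 1-5 scale to low (1-2) vs. high (3-5).
    return ["low" if r <= 2 else "high" for r in ratings]

fine = cohen_kappa_score(expert, interactive)
coarse = cohen_kappa_score(to_coarse(expert), to_coarse(interactive))
print(f"fine-grained kappa: {fine:.2f}, coarse-grained kappa: {coarse:.2f}")
```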

Table 8 shows that the expert evaluation tends to be more punishing overall, with far fewer conversations receiving a 5.0 rating. Indeed, 56% of the conversations from the expert evaluation would be categorized as a low rating, whereas the interactive evaluation has only 40%. Even so, the low agreement indicates that the quality assessments across the two evaluation protocols are highly variable for the same conversations.

OQ 1 2 3 4 5 Total
Interactive 20 20 20 20 20 100
Expert 36 20 13 28 3 100
Table 8: Comparison of the rating distribution between the expert and interactive evaluations.

This provides preliminary support for the hypothesis in Section 4 that external evaluators are unable to accurately infer the same impression of a conversation as the user who actually participates in it. Although there are potential methods that aim to mitigate this effect - such as aggregating ratings across more than one external annotator - the underlying cause of such variance may be attributed to the poor suitability of external evaluations for dialogue system evaluation as a whole, but further work is required.

6 Conclusion and Future Work

In this paper, we provide an extensive background on the current state of the three types of dialogue system evaluation protocols: automated, static, and interactive. Our analysis shows that static evaluation is the dominant human evaluation protocol in the most recent dialogue system works, although it has several concerning limitations, some of which are exemplified through our case study. We propose a set of eight dialogue dimensions that encapsulate the evaluations of previous studies without redundancy. As a result of our case study, we find preliminary evidence that the dimensions of relevance, proactivity, informativeness, and engagingness are likely to be contributing factors to the overall perception of dialogue quality.

Our future work will build upon these findings to develop a thorough understanding of the necessary dialogue dimensions for comprehensive interactive evaluation of dialogue systems. Through an analysis based on large-scale user studies, we look to propose an evaluation protocol that captures the human judgement of dialogue quality through precise formulation of evaluation dimensions, in order to enable targeted dialogue system advancements.

Acknowledgments

We gratefully acknowledge the support of the Alexa Prize Socialbot Grand Challenge 3. Any contents in this material are those of the authors and do not necessarily reflect the views of the Alexa Prize.

References

  • D. Adiwardana, M. Luong, D. R. So, J. Hall, N. Fiedel, R. Thoppilan, Z. Yang, A. Kulshreshtha, G. Nemade, Y. Lu, and Q. V. Le (2020) Towards a Human-like Open-Domain Chatbot. arXiv preprint arXiv:2001.09977.
  • S. F. Chen, D. Beeferman, and R. Rosenfeld (1998) Evaluation metrics for language models. In DARPA Broadcast News Transcription and Understanding Workshop (BNTUW).
  • J. Deriu, A. Rodrigo, A. Otegi, G. Echegoyen, S. Rosset, E. Agirre, and M. Cieliebak (2019) Survey on evaluation methods for dialogue systems. arXiv preprint arXiv:1905.04071.
  • W. Du and A. W. Black (2019) Boosting Dialog Response Generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 38–43.
  • J. Gao, M. Galley, L. Li, et al. (2019) Neural approaches to conversational AI. Foundations and Trends in Information Retrieval 13 (2-3), pp. 127–298.
  • T. Hashimoto, H. Zhang, and P. Liang (2019) Unifying human and statistical evaluation for natural language generation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 1689–1701.
  • M. Huang, X. Zhu, and J. Gao (2019) Challenges in building intelligent open-domain dialog systems. arXiv preprint arXiv:1905.05709.
  • J. Li and X. Sun (2018) A Syntactically Constrained Bidirectional-Asynchronous Approach for Emotional Conversation Generation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 678–683.
  • J. Li, M. Galley, C. Brockett, J. Gao, and B. Dolan (2016) A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 110–119.
  • Z. Li, C. Niu, F. Meng, Y. Feng, Q. Li, and J. Zhou (2019) Incremental Transformer with Deliberation Decoder for Document Grounded Conversations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 12–21.
  • C. Lin (2004) ROUGE: A Package for Automatic Evaluation of Summaries. In Proceedings of the Workshop on Text Summarization Branches Out, Barcelona, Spain, pp. 56–60.
  • Z. Lin, A. Madotto, J. Shin, P. Xu, and P. Fung (2019) MoEL: Mixture of Empathetic Listeners. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 121–132.
  • C. Liu, R. Lowe, I. Serban, M. Noseworthy, L. Charlin, and J. Pineau (2016) How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 2122–2132.
  • S. Liu, H. Chen, Z. Ren, Y. Feng, Q. Liu, and D. Yin (2018) Knowledge Diffusion for Neural Dialogue Generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 1489–1498.
  • L. Luo, J. Xu, J. Lin, Q. Zeng, and X. Sun (2018) An Auto-Encoder Matching Model for Learning Utterance-Level Semantic Dependency in Dialogue Generation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 702–707.
  • A. Madotto, Z. Lin, C. Wu, and P. Fung (2019) Personalizing Dialogue Agents via Meta-Learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 5454–5459.
  • A. Malchanau, V. Petukhova, and H. Bunt (2019) Multimodal Dialogue System Evaluation: A Case Study Applying Usability Standards. In 9th International Workshop on Spoken Dialogue System Technology, Vol. 579, pp. 145–159.
  • N. Moghe, S. Arora, S. Banerjee, and M. M. Khapra (2018) Towards Exploiting Background Knowledge for Building Conversation Systems. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 2322–2332.
  • L. Mou, Y. Song, R. Yan, G. Li, L. Zhang, and Z. Jin (2016) Sequence to Backward and Forward Sequences: A Content-Introducing Approach to Generative Short-Text Conversation. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pp. 3349–3358.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, pp. 311–318.
  • L. Qiu, J. Li, W. Bi, D. Zhao, and R. Yan (2019) Are Training Samples Correlated? Learning to Generate Dialogue Responses with Multiple References. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 3826–3835.
  • A. Ram, R. Prasad, C. Khatri, A. Venkatesh, R. Gabriel, Q. Liu, J. Nunn, B. Hedayatnia, M. Cheng, A. Nagar, et al. (2018) Conversational AI: The Science Behind the Alexa Prize. arXiv preprint arXiv:1801.03604.
  • Z. Tian, W. Bi, X. Li, and N. L. Zhang (2019) Learning to Abstract for Memory-augmented Conversational Response Generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 3816–3825.
  • M. A. Walker, D. J. Litman, C. A. Kamm, and A. Abella (1997) PARADISE: A Framework for Evaluating Spoken Dialogue Agents. In 35th Annual Meeting of the Association for Computational Linguistics and 8th Conference of the European Chapter of the Association for Computational Linguistics, Madrid, Spain, pp. 271–280.
  • J. Wang, J. Liu, W. Bi, X. Liu, K. He, R. Xu, and M. Yang (2020) Improving Knowledge-aware Dialogue Generation via Knowledge Base Question Answering. arXiv preprint arXiv:1912.07491.
  • W. Wu, Z. Guo, X. Zhou, H. Wu, X. Zhang, R. Lian, and H. Wang (2019) Proactive Human-Machine Conversation with Explicit Conversation Goal. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 3794–3804.
  • X. Xu, O. Dušek, I. Konstas, and V. Rieser (2018) Better Conversations by Modeling, Filtering, and Optimizing for Coherence and Diversity. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 3981–3991.
  • T. Young, E. Cambria, I. Chaturvedi, H. Zhou, S. Biswas, and M. Huang (2018) Augmenting End-to-End Dialogue Systems with Commonsense Knowledge. In Thirty-Second AAAI Conference on Artificial Intelligence, pp. 4970–4977.
  • H. Zhang, Y. Lan, L. Pang, J. Guo, and X. Cheng (2019) ReCoSa: Detecting the Relevant Contexts with Self-Attention for Multi-turn Dialogue Generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 3721–3730.
  • S. Zhang, E. Dinan, J. Urbanek, A. Szlam, D. Kiela, and J. Weston (2018) Personalizing Dialogue Agents: I have a dog, do you have pets too? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 2204–2213.
  • K. Zhou, K. Zhang, Y. Wu, S. Liu, and J. Yu (2019) Unsupervised Context Rewriting for Open Domain Conversation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 1834–1844.
  • Q. Zhu, L. Cui, W. Zhang, F. Wei, and T. Liu (2019) Retrieval-Enhanced Adversarial Training for Neural Response Generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 3763–3773.