A Review of Evaluation Techniques for Social Dialogue Systems

by Amanda Cercas Curry, et al.
Heriot-Watt University

In contrast with goal-oriented dialogue, social dialogue has no clear measure of task success. Consequently, evaluation of these systems is notoriously hard. In this paper, we review current evaluation methods, focusing on automatic metrics. We conclude that turn-based metrics often ignore the context and do not account for the fact that several replies are valid, while end-of-dialogue rewards are mainly hand-crafted. Both lack grounding in human perceptions.





1. Introduction

Non-task-oriented, social dialogue systems, aka “chatbots”, receive an increasing amount of attention as they are designed to establish a rapport with the user or customer, providing engaging and coherent dialogue. Traditional dialogue systems (McTear, 2004; Rieser and Lemon, 2011) tend to be task-oriented for a limited domain, and evaluation methods for such systems have been much researched (see Hastie, 2012, for an overview). Evaluation of social dialogue systems, on the other hand, is challenging: there is no clear measure of task success, and deciding whether a rapport has been established is far from clear-cut. One common method for evaluating such systems is human evaluation, where subjects are recruited to interact with and rate different systems. However, human evaluation is highly subjective, time-consuming and expensive, and it requires careful design of the experimental set-up.

2. Automatic Metrics

Automatic evaluation is popular because it is cost-effective and faster to run than human evaluation, and is needed for automatic benchmarking and tuning of algorithms. Here, we discuss existing automatic methods for developing social systems in terms of word-overlap metrics, machine learning-based estimation models and reward-based metrics. Since social systems lack a final success measure, many of the discussed metrics operate at turn-level.

2.1. Word-Overlap Metrics

Word-overlap metrics, such as BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004), are borrowed from Machine Translation (MT) and Summarisation and have been widely used to evaluate neural dialogue system output, as reported in, for example, (Li et al., 2016; Sordoni et al., 2015). However, these metrics have not been shown to correlate well with human judgements in a dialogue setting (Liu et al., 2016). One possible explanation is that, unlike in MT, there is no “gold standard” to compare with: there may be many valid responses to an utterance that share few or no n-grams with the reference and would thus receive low BLEU or ROUGE scores (see the example in Table 1). Measures from information theory such as perplexity have also been used for evaluation, e.g. comparing neural models to n-grams (Vinyals and Le, 2015); however, perplexity can be difficult to interpret. There is, therefore, a need for an evaluation method that does not measure success by comparing an utterance to human-generated responses but instead considers the utterance itself and its appropriateness within its context.

User utterance: Have you read Murakami’s new novel?
Reference response: No I don’t think I have read Murakami’s new novel, what is it about?
System output: Yes, it wasn’t my favourite but I still liked it.
Table 1. Valid system response with low word overlap to reference.
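
To make the point of Table 1 concrete, the following minimal sketch computes clipped n-gram precision, the core quantity behind BLEU, for the system output against the reference response. It is an illustration only and not a full BLEU implementation (no brevity penalty, no geometric mean over n-gram orders). The function names and the crude tokeniser are assumptions made for this example.

```python
from collections import Counter


def tokenize(text):
    # Crude illustrative tokeniser: lowercase, split on whitespace,
    # strip basic punctuation from the edges of each token.
    tokens = (w.strip(".,?!") for w in text.lower().split())
    return [t for t in tokens if t]


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    # Fraction of candidate n-grams that also occur in the reference,
    # counting each reference n-gram at most as often as it appears there.
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


reference = "No I don't think I have read Murakami's new novel, what is it about?"
system_output = "Yes, it wasn't my favourite but I still liked it."

ref_tokens = tokenize(reference)
sys_tokens = tokenize(system_output)

p1 = clipped_precision(sys_tokens, ref_tokens, 1)  # unigram precision
p2 = clipped_precision(sys_tokens, ref_tokens, 2)  # bigram precision

print(f"unigram precision: {p1:.2f}")  # 0.20 (only "i" and one "it" match)
print(f"bigram precision:  {p2:.2f}")  # 0.00
```

Although the system output is a perfectly acceptable reply, it shares only two unigrams and no bigrams with the reference, so any metric built on these counts scores it poorly.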

2.2. Machine Learning Methods for Dialogue Evaluation

Recently, Machine Learning (ML) based evaluation has gained popularity. This method operates at the turn level and aims to provide an estimation model of a “good” response. The advantage of this method is that it has been shown to come closer to human judgements (Lowe et al., 2017) than BLEU and ROUGE. However, such methods require retraining for each domain.

Discriminative Models:

These models attempt to distinguish the “right” from the “wrong” answer. Next-Utterance Classification (NUC) (Lowe et al., 2016) can be evaluated by measuring the system’s ability to select the next answer from a list of possible answers sampled from elsewhere in the corpus, using retrieval metrics such as recall. NUC offers several advantages: performance is easy to compute automatically and the task is interpretable and can be easily compared to human performance. However, similar issues to word-based metrics do apply in that there is not necessarily one single correct answer.
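
As an illustration of how NUC can be scored with a retrieval metric, the sketch below computes Recall@k: the fraction of test cases in which the true next utterance appears among the model's top k ranked candidates. The scoring function here is a toy word-overlap heuristic standing in for a real dialogue model, and all names and example data are hypothetical.

```python
def overlap_score(context, candidate):
    # Toy stand-in for a dialogue model's scoring function:
    # number of distinct words shared by context and candidate.
    return len(set(context.lower().split()) & set(candidate.lower().split()))


def recall_at_k(examples, score_fn, k):
    # examples: list of (context, candidates, true_response) triples,
    # where true_response is one of the candidates.
    hits = 0
    for context, candidates, true_response in examples:
        ranked = sorted(candidates,
                        key=lambda c: score_fn(context, c),
                        reverse=True)
        if true_response in ranked[:k]:
            hits += 1
    return hits / len(examples)


# Hypothetical test cases: the true next utterance plus distractors
# sampled from elsewhere in a corpus.
examples = [
    ("do you like jazz music",
     ["i love jazz music", "the train leaves at noon", "my cat is asleep"],
     "i love jazz music"),
    ("what did you eat today",
     ["i had some soup", "it rained all day", "did you see what today brought"],
     "i had some soup"),
]

r1 = recall_at_k(examples, overlap_score, k=1)
r2 = recall_at_k(examples, overlap_score, k=2)
print(f"Recall@1 = {r1:.2f}, Recall@2 = {r2:.2f}")  # Recall@1 = 0.50, Recall@2 = 1.00
```

In the second test case the toy scorer ranks a distractor first, so the case counts as a miss at k=1 even though the true response is only one position lower. Reporting recall at several values of k softens this, but it does not remove the underlying issue that more than one candidate may be a valid reply.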

More recently, adversarial evaluation measures have been proposed to distinguish a dialogue model’s output from that of a human. For example, the model proposed by (Kannan and Vinyals, 2017) achieves a 62.5% success rate using a Recurrent Neural Network (RNN) trained on email replies.

Classification Models:

(Lowe et al., 2017) propose to predict human scores from a large dataset of human ratings of Twitter responses. The proposed models learn distributed representations of the context, reference response and the system’s response using a hierarchical RNN encoder. The learned model correlates with human scores at the turn level and also generalises to unseen data. However, it does tend to have a bias towards generic responses.

2.3. Reward-based Metrics

Reinforcement Learning (RL) based models have been applied to task-based systems (Rieser and Lemon, 2011) to optimise interaction for some reward. For social systems, this has also been investigated as a means to avoid generic responses, such as “I don’t know”. Here, the evaluation function is implemented as the reward. We will discuss these types of reward at turn-level and at system-level.

Turn-level rewards:

(Li et al., 2016) propose a metric involving a weighted sum of three measures:

  • Coherence: semantic similarity between consecutive turns,

  • Information flow: semantic dissimilarity between utterances of the same speaker,

  • Ease of answering: negative log-likelihood of responding to an utterance with a dull response (as defined by a blacklist).

In their experiments, they find the RL approach outperforms their other systems in terms of dialogue length, diversity of answers and overall quality of multi-turn dialogues. This suggests that the proposed reward function successfully captures the relationship between an utterance and a response at least partially, which can be useful in evaluating potential responses without the need for human-generated references. However, while coherence at the turn-level is a key factor in quality estimation, it does not necessarily reflect the overall quality of the dialogue.
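
The general shape of such a turn-level reward can be sketched as a weighted sum of the three measures listed above. This is a simplified illustration and not the formulation of (Li et al., 2016): bag-of-words cosine similarity stands in for a learned semantic similarity, the log-probability of a dull follow-up is assumed to be supplied by some external dialogue model, and the weights are arbitrary placeholders.

```python
import math
from collections import Counter


def cosine(a, b):
    # Bag-of-words cosine similarity; a crude stand-in for semantic similarity.
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0


def turn_reward(prev_turn, response, prev_same_speaker, dull_logprob,
                weights=(0.4, 0.3, 0.3)):
    # dull_logprob: log-probability that the response is answered with a
    # blacklisted dull reply (hypothetical input from a dialogue model).
    # weights: placeholder values chosen for illustration only.
    coherence = cosine(prev_turn, response)
    information_flow = 1.0 - cosine(prev_same_speaker, response)
    ease_of_answering = -dull_logprob  # negative log-likelihood
    w_c, w_i, w_e = weights
    return w_c * coherence + w_i * information_flow + w_e * ease_of_answering


user_turn = "have you read any good books"
system_previous = "i enjoy reading"

r_engaging = turn_reward(user_turn, "yes what books do you like",
                         system_previous, dull_logprob=math.log(0.05))
r_dull = turn_reward(user_turn, "i don't know",
                     system_previous, dull_logprob=math.log(0.6))

print(r_engaging > r_dull)  # True
```

Under these assumptions the on-topic question scores higher than “I don’t know” on all three components, and no human-generated reference response is needed to compute the reward.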

System-level rewards:

The reward function by (Li et al., 2016) was based on heuristics, whereas (Yu et al., 2016) use a Wizard-of-Oz experiment to measure engagement and derive a reward function with the following metrics:

  • Conversational depth: the number of consecutive turns belonging to the same topic,

  • Lexical diversity/information gain: the number of unique words that are introduced into the conversation from both the system and the user,

  • Overall dialogue length.
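
The three metrics above are simple to compute once a dialogue has been logged. The sketch below is one possible reading of them, not the implementation of (Yu et al., 2016): it assumes each turn carries a topic label (for example from an annotator or a topic classifier), interprets conversational depth as the longest run of consecutive turns on the same topic, and counts unique whitespace-separated words across both speakers. All names and the example dialogue are hypothetical.

```python
def conversational_depth(topics):
    # Longest run of consecutive turns that share a topic label.
    longest = current = 0
    previous = None
    for topic in topics:
        current = current + 1 if topic == previous else 1
        longest = max(longest, current)
        previous = topic
    return longest


def lexical_diversity(utterances):
    # Number of unique words introduced by either speaker.
    return len({w for text in utterances for w in text.lower().split()})


def dialogue_length(utterances):
    # Overall length, measured in turns.
    return len(utterances)


# Hypothetical logged dialogue: (speaker, utterance, topic label).
dialogue = [
    ("user", "i watched a film yesterday", "movies"),
    ("system", "which film did you watch", "movies"),
    ("user", "a film about jazz", "movies"),
    ("system", "do you play music", "music"),
    ("user", "i play piano", "music"),
]

utterances = [text for _, text, _ in dialogue]
topics = [topic for _, _, topic in dialogue]

depth = conversational_depth(topics)
diversity = lexical_diversity(utterances)
length = dialogue_length(utterances)

print(depth, diversity, length)  # 3 15 5
```

How these quantities are combined into a single reward is left open here, since the text above does not specify it.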

3. Conclusion and Discussion

It is clear that there is still work to be done with respect to establishing an effective evaluation method that can capture all aspects of dialogue, from naturalness and coherence to long-term engagement and flow. Word-based metrics such as BLEU ignore the fact that there may be any number of equally valid and appropriate responses, while turn-based metrics do not account for the over-use of generic responses, and system-level rewards are based on heuristics. In future work, we will utilise data we gathered as part of the Amazon Alexa Prize challenge to build a data-driven model to predict customer ratings.


This research is supported by Rieser’s EPSRC projects DILiGENt (EP/M005429/1) and MaDrIgAL (EP/N017536/1), and by the RAEng/Leverhulme Trust Senior Research Fellowship Scheme (Hastie/LTSRF1617/13/37).