The automatic evaluation of generative dialogue systems remains an important open problem, with potential applications from tourism Şimşek and Fensel (2018) to medicine Fazzinga et al. (2021). In recent years, there has been increased focus on interpretable approaches Deriu et al. (2021); Chen et al. (2021), often through combining various sub-metrics, each for a specific aspect of dialogue Berlot-Attwell and Rudzicz (2021); Phy et al. (2020); Mehri and Eskenazi (2020b). One of these key aspects is “relevance” (sometimes called “context coherence”), commonly defined as whether “[r]esponses are on-topic with the immediate dialogue history” Finch and Choi (2020).
These interpretable approaches have motivated measures of dialogue relevance that are not reliant on expensive human annotations. Such measures have appeared in many recent papers on dialogue evaluation, including USR Mehri and Eskenazi (2020b), USL-H Phy et al. (2020), and others Pang et al. (2020); Merdivan et al. (2020). Additionally, dialogue relevance has been used directly in training dialogue models Xu et al. (2018).
Despite this work, comparison between these approaches has been limited. Aggravating this problem is that authors often collect human annotations on their own datasets with varying amounts and types of non-human responses. Consequently, direct comparisons are not possible. It is known that metrics of dialogue quality often perform poorly on new test sets of quality ratings Yeh et al. (2021), but it remains an open question whether poor generalization also plagues the much simpler dialogue relevance task. We address this problem by evaluating and comparing six prior approaches on four publicly available datasets of dialogue annotated with human ratings of relevance. We find poor correlation with human ratings across various methods, with high sensitivity to dataset.
Based on our observations, we propose a simple metric of logistic regression trained on pretrained BERT NSP features Devlin et al. (2019), using “i don’t know.” as the only negative example. With this metric, we achieve state-of-the-art correlation on the HUMOD dataset Merdivan et al. (2020). We release our metric and evaluation code to encourage comparable results in future research.
Our primary contributions are: (i) empirical evidence that current dialogue relevance metrics for English are sensitive to dataset, and often have poor correlation with human ratings, (ii) a simple relevance metric that exhibits good correlation and reduced domain sensitivity, and (iii) the counter-intuitive result that a single negative example can be equally effective as random negative sampling.
2 Prior metrics
Prior metrics of relevance in dialogue can generally be divided into more traditional approaches that are token-based, and more current approaches based on large pretrained models. These metrics are given the context (i.e., the two-person conversation up to a given point in time), as well as a response (i.e., the next speaker’s response, also known as the ‘next turn’ in the conversation). From these, they produce a measure of the response’s relevance to the context. The ground-truth response (i.e., the ‘gold response’) may or may not be available.
2.1 n-gram approaches
There have been attempts to use metrics based on n-grams from machine translation and summarization, such as BLEU Papineni et al. (2002), ROUGE Lin (2004), and METEOR Banerjee and Lavie (2005), in dialogue. However, we discard these approaches due to their limitations: they require a ground-truth response and correlate poorly with dialogue relevance Merdivan et al. (2020).
2.2 Average-embedding cosine similarity
Xu et al. (2018) proposed to measure the cosine similarity of vector representations of the context and the response. Specifically, the context and response are each represented by an aggregate (typically an average) of their uncontextualized word embeddings. This approach can be modified to exploit language models by instead using contextualized word embeddings.
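As an illustrative sketch, the approach reduces to averaging and a cosine. The tiny hand-made 3-d embedding table below is a stand-in for real (e.g., FastText or BERT) vectors, not part of any published implementation:

```python
import numpy as np

def cosine_relevance(context_tokens, response_tokens, embeddings):
    """Average the word embeddings of context and response, then
    return the cosine similarity of the two mean vectors."""
    ctx = np.mean([embeddings[t] for t in context_tokens], axis=0)
    rsp = np.mean([embeddings[t] for t in response_tokens], axis=0)
    return float(ctx @ rsp / (np.linalg.norm(ctx) * np.linalg.norm(rsp)))

# Toy 3-d embedding table (illustrative stand-in for learned vectors).
emb = {"do": [1.0, 0.0, 0.0], "you": [0.0, 1.0, 0.0],
       "like": [0.5, 0.5, 0.0], "tea": [0.0, 0.0, 1.0],
       "yes": [0.3, 0.3, 0.4]}
emb = {k: np.array(v) for k, v in emb.items()}
score = cosine_relevance(["do", "you", "like", "tea"], ["yes"], emb)
```

Note that nothing here ties the response to the *meaning* of the context beyond embedding-space proximity, which is one intuition for why such metrics can be domain-sensitive.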
2.3 Fine-tuned embedding model for Next Utterance Prediction (NUP)
This family of approaches combines a word embedding model (typically max- or average-pooled BERT word embeddings) with a simple 1-3 layer MLP, trained for next utterance prediction (typically using negative sampling) Mehri and Eskenazi (2020b); Phy et al. (2020). The embedding model is then fine-tuned to the domain of interest. In some variants, the model is provided with information in addition to the context and response; e.g., Mehri and Eskenazi (2020b) appended a topic string to the context. This approach has also been directly used as a metric of overall dialogue quality Ghazarian et al. (2019). In this paper, we focus on the specific implementation by Phy et al. (2020): max-pooled BERT embeddings passed into a single-layer MLP followed by a two-class softmax, trained with binary cross-entropy (BCE) loss and random negative sampling.
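The training recipe can be sketched with a toy stand-in: a hand-rolled interaction featurizer plays the role of BERT's joint encoding, and a one-layer classifier is trained with BCE loss against randomly re-paired negatives. All names and data here are illustrative, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def featurize(ctx, rsp):
    # Toy stand-in for BERT's joint encoding of (context, response):
    # element-wise interaction plus absolute difference.
    return np.concatenate([ctx * rsp, np.abs(ctx - rsp)])

def train_nup(contexts, responses, epochs=300, lr=0.5):
    """One-layer classifier trained with BCE loss for next-utterance
    prediction; negatives come from random response re-pairings."""
    n = len(contexts)
    neg = responses[rng.permutation(n)]            # random negative sampling
    X = np.array([featurize(c, r) for c, r in zip(contexts, responses)]
                 + [featurize(c, r) for c, r in zip(contexts, neg)])
    y = np.concatenate([np.ones(n), np.zeros(n)])
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        p = 1 / (1 + np.exp(-(X @ w + b)))         # sigmoid
        w -= lr * X.T @ (p - y) / len(y)           # BCE gradient step
        b -= lr * (p - y).mean()
    return w, b

# Synthetic data: "relevant" responses track their contexts.
ctx = rng.normal(size=(50, 4))
rsp = ctx + 0.1 * rng.normal(size=(50, 4))
w, b = train_nup(ctx, rsp)

def relevance(c, r):
    return float(1 / (1 + np.exp(-(featurize(c, r) @ w + b))))

pos = np.mean([relevance(ctx[i], rsp[i]) for i in range(50)])
neg_score = np.mean([relevance(ctx[i], rsp[(i + 7) % 50]) for i in range(50)])
```

After training, matched (context, response) pairs should score higher on average than mismatched ones, which is exactly the behaviour the NUP objective rewards.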
Note that, for methods that are fine-tuned or otherwise require training, annotated relevance data will often be unavailable for the domain of interest. As a result, model performance cannot be monitored on a validation set during training. Therefore, either the method must be trained to convergence on the training set, or some criterion other than validation-set performance must be used to halt training, at the risk of stopping on a model with poor performance.
Another concern with using trained metrics to evaluate trained dialogue systems is that they may both learn the same patterns in the training data. An extreme example would be a dialogue model that learns only to reproduce responses from the training data verbatim, and a relevance metric that learns to only accept verbatim responses from the training data. We believe that this risk can be reduced by training the metric on separate data from the model. However, this approach is only practical if the metric can be trained with a relatively small amount of data and therefore does not compete with the dialogue model for training examples. Alternatively, a sufficiently generalizable metric may be trained on data from a different domain.
2.4 Normalized conditional probability
Pang et al. (2020) also exploited pretrained models; however, they instead relied on a generative language model (specifically, GPT-2). Their proposed metric is the conditional log-probability of the response given the context, normalized to a fixed range (see Appendix D.1 for details).
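The underlying computation can be illustrated with a toy bigram language model standing in for GPT-2; the exact normalization of Pang et al. is in their Appendix D.1, so here we show only per-token length normalization, and the corpus and smoothing are our own illustrative choices:

```python
import math
from collections import Counter

# Toy bigram LM (stand-in for GPT-2) with add-one smoothing.
corpus = "i like tea . do you like tea ? yes i do .".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)
vocab = len(unigrams)

def logp_next(prev, word):
    return math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab))

def norm_cond_logprob(context, response):
    """Length-normalized conditional log-probability of the response
    given the context: (1/|r|) * sum_t log P(r_t | history)."""
    toks = context + response
    start = len(context)
    lp = sum(logp_next(toks[i - 1], toks[i]) for i in range(start, len(toks)))
    return lp / len(response)

rel = norm_cond_logprob("do you like".split(), "tea".split())
irr = norm_cond_logprob("do you like".split(), "banana".split())
```

A response the model finds likely given the context (here "tea") receives a higher normalized log-probability than an unlikely one ("banana").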
Mehri and Eskenazi (2020a) also relied on a generative language model (specifically, DialoGPT Zhang et al. (2020)); however, their approach measured the probability of follow-up utterances, e.g., “Why are you changing the topic?” to indicate irrelevance. Their relevance and correctness scores are defined as −log P(n | c, r), where n is a negative follow-up utterance suggesting irrelevance or incorrectness, and c and r are the context and response. Positive follow-up utterances can also be used; however, the authors’ measures of correctness and relevance used only negative utterances.
3 Datasets used for analysis
A literature review reveals that many of these methods have never been evaluated on the same datasets. As such, it is unclear both how these approaches compare, and how well they generalize to new data. For this reason, we consider four publicly available English datasets of both human and synthetic dialogue with human relevance annotations. All datasets are annotated with Likert ratings of relevance from multiple reviewers; following Merdivan et al. (2020), we average these ratings over all reviewers. Due to variations in data collection procedures, as well as anchoring effects Li et al. (2019), Likert ratings from different datasets may not be directly comparable. Consequently, we keep the datasets separate. This also allows us to observe generalization across datasets.
Altogether, our selected datasets cover a wide variety of responses, including human responses, responses generated by LSTMs, Transformers, Meena Adiwardana et al. (2020), and Mitsuku (the 2019 Loebner prize winner), as well as random distractors. See Table 1 for an overview.
3.1 HUMOD Dataset
The HUMOD dataset Merdivan et al. (2020) is an annotated subset of the Cornell movie dialogue dataset Danescu-Niculescu-Mizil and Lee (2011), which consists of conversations from films. HUMOD contexts each consist of between two and seven turns. Every context is paired with both the original human response and a randomly sampled human response. Each response is annotated with crowd-sourced ratings of relevance from 1 to 5. The authors measured inter-annotator agreement via Cohen’s kappa Cohen (1968), finding 0.86 between the closest ratings and 0.42 between randomly selected ratings. Following the authors, we split the dataset sequentially into training, validation, and test sets. As it is unclear how HUMOD was subsampled from the Cornell movie dialogue dataset, we do not use the Cornell dataset as training data.
3.2 USR Topical-Chat Dataset (USR-TC)
The USR-TC dataset is a subset of the Topical-Chat (TC) dialogue dataset Gopalakrishnan et al. (2019) created by Mehri and Eskenazi (2020b). The Topical-Chat dataset consists of conversations between Amazon Mechanical Turk workers, each grounding their conversation in a provided reading set. The USR-TC dataset consists of 60 contexts taken from the TC frequent test set, each consisting of 1-19 turns. Every context is paired with six responses: the original human response, a newly created human response, and four samples taken from a Transformer dialogue model Vaswani et al. (2017), each produced with a different decoding strategy (argmax sampling, and nucleus sampling Holtzman et al. (2020) at varying rates). Each response is annotated with a 1-3 human rating of relevance, produced by one of six dialogue researchers. The authors reported an inter-annotator agreement of 0.56 (Spearman’s correlation). We divide the dataset evenly into a validation and a test set, each containing 30 contexts, and use the TC train set as the training set.
3.3 Pang et al. (2020) Annotated DailyDialogue Dataset (P-DD)
The P-DD dataset Pang et al. (2020) is a subset of the DailyDialogue (DD) dataset Li et al. (2017). The DailyDialogue dataset consists of conversations scraped from websites where English language learners could practice English conversation. The P-DD dataset contains 200 contexts, each of a single turn and paired with a single synthetic response generated by a 2-layer LSTM Bahdanau et al. (2015). Responses are sampled using top-k sampling, with k varying by context. Each response is annotated with ten crowdsourced 1-5 ratings of relevance, with a reported inter-annotator Spearman’s correlation between 0.57 and 0.87. Due to the very small size of the dataset (only 200 dialogues in total), and the lack of information on how the contexts were sampled, we use this dataset exclusively for testing.
3.4 FED Dataset
The FED dataset Mehri and Eskenazi (2020a) consists of 375 annotated dialogue turns taken from 40 human-human, 40 human-Meena Adiwardana et al. (2020), and 40 human-Mitsuku conversations. We use a subset of the annotations, specifically turnwise relevance and turnwise correctness (the latter defined by the authors as whether there was “a misunderstanding of the conversation”). As the authors note, their definition of correctness is often encapsulated within relevance; we thus evaluate on both annotations. Due to its small size, we use this dataset only for testing.
4 Evaluating Prior Metrics
For each of the aforementioned datasets, we evaluate the following relevance metrics:
COS-MAX-BERT: Cosine similarity with max-pooled BERT contextualized word embeddings, inspired by BERT-RUBER Ghazarian et al. (2019).
COS-NSP-BERT: Cosine similarity using the pretrained features extracted from the [CLS] token by the next-sentence-prediction head.
NUP-BERT: Fine-tuned BERT next-utterance prediction approach, as implemented by Phy et al. (2020). We experiment with fine-tuning BERT to the HUMOD train set (3,750 dialogues), the full TC train set, and TC-S (a small subset of the TC training set).
NORM-PROB: GPT-2 based normalized conditional-probability; approach and implementation by Pang et al. (2020); note that the P-DD dataset was released in the same paper.
FED-RELEVANT & FED-CORRECT: DialoGPT based normalized conditional-probability; approach and implementation by Mehri and Eskenazi (2020a).
In all cases, we use the Hugging Face bert-base-uncased model as the pretrained BERT. Only NUP-BERT was fine-tuned. To prevent unfair fitting to any specific dialogue model, and to better reflect the evaluation of a new dialogue model, only human responses were used at train time. All hyperparameters were left at their recommended values. NUP-BERT performance is averaged over 3 runs.
Note that we also evaluate GRADE Huang et al. (2020) and DYNA-EVAL Zhang et al. (2021); however, these do not measure relevance but rather dialogue coherence: “whether a piece of text is in a consistent and logical manner, as opposed to a random collection of sentences” Zhang et al. (2021). As relevance is a major aspect of dialogue coherence, we include these baselines for completeness. As both metrics are graph neural networks intended for larger train sets, we use checkpoints provided by the authors. GRADE is trained on DailyDialogue Li et al. (2017), and DynaEval on Empathetic Dialogue Rashkin et al. (2019). Both are trained with negative sampling, with GRADE constructing more challenging negative samples.
A summary of the authors’ stated purpose for each metric can be found in Appendix C.
Table 2 makes it clear that the normalized probability and cosine similarity approaches do not generalize well across datasets. Although NORM-PROB excels on the P-DD dataset, it has weak performance on HUMOD and a significant negative correlation on USR-TC. Likewise, the FED metrics perform well on the FED data, but are negatively correlated on all other datasets. Consequently, we believe that the NORM-PROB and FED metrics are overfitted to their corresponding datasets. Similarly, although COS-FT has the best performance on the USR-TC dataset, it performs poorly on HUMOD and has a negative correlation on P-DD. As such, it is clear that, while both the cosine-similarity and normalized-probability approaches can perform well, they have serious limitations: they are very sensitive to the domain and models under evaluation, and can become negatively correlated with human ratings under suboptimal conditions.
Looking at the dialogue coherence metrics, DYNA-EVAL performs strongly on FED, and weakly on all other datasets. GRADE performs very strongly on HUMOD and P-DD (the latter, likely in part as it was trained on DailyDialogue), but is uncorrelated on USR-TC. Given that these metrics were not intended to measure relevance, uneven performance is to be expected as relevance and dialogue coherence will not always align.
The final baseline, NUP-BERT, is quite competitive, outperforming each of the other baselines on at least 2 of the datasets. Despite this, we can see that performance on HUMOD, USR-TC, and FED is still fairly weak. We can also observe that NUP-BERT has some sensitivity to the domain of the training data; fine-tuning on HUMOD data results in lower Spearman’s correlation on USR-TC, and fine-tuning on USR-TC performs worse on the FED datasets. However, the amount of training data (TC vs TC-S) has little impact.
Overall, the results of Table 2 are concerning, as they suggest that at least five current approaches generalize poorly across either dialogue models or domains. The absolute performance of all metrics studied varies considerably by dataset, and the relative performance of closely related metrics, such as COS-FT and COS-NSP-BERT, or NUP-BERT with different training data, also varies considerably between datasets. As a result, research into new dialogue relevance metrics is required. Furthermore, it is clear that the area’s evaluation methodology must be updated to use various dialogue models across various domains.
5 IDK: A metric for dialogue relevance
Based on these results, we propose a number of modifications to the NUP-BERT metric, producing a novel metric that we call IDK (“I Don’t Know”). The architecture is mostly unchanged; however, the training procedure and the features used are altered.
First, based on the observation that the amount of training data has little impact, we freeze the BERT features and do not fine-tune to the domain. Additionally, whereas the NUP-BERT baseline uses max-pooled BERT word embeddings, we use the pre-trained next sentence prediction (NSP) features: the “(classification token) further processed by a Linear layer and a Tanh activation function […] trained from the next sentence prediction (classification) objective during pre-training” (https://huggingface.co/transformers/v2.11.0/model_doc/bert.html).
Second, to improve generalization and reduce variation in training (particularly important as the practitioner typically has no annotated relevance data), and operating on the assumption that relevance is captured by a few key dimensions of the NSP features, we add L1 regularization to our regression weights. Note that experiments with L2 regularization yielded similar validation set performance (see Appendix, Table 10).
Third, in place of random sampling we use a fixed negative sample, “i don’t know”. This allows us to train the model on less data.
Additionally, we simplify the model, using logistic regression in place of a 2-class softmax. We train for 2 epochs using BCE loss (the same as the NUP-BERT baseline), with the Adam optimizer Kingma and Ba (2015), a fixed initial learning rate, and a batch size of 6.
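A minimal numpy sketch of this recipe (frozen features, L1-penalized logistic regression, one fixed negative example), with random toy features standing in for the real BERT NSP vectors; the dimensions, penalty, and data below are illustrative assumptions, not our actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def train_idk(feat_pos, feat_neg, l1=0.01, epochs=400, lr=0.2):
    """Logistic regression with an L1 penalty over frozen features.
    Positives: (context, gold response) pairs; negatives: the same
    contexts paired with the single fixed response "i don't know"."""
    X = np.vstack([feat_pos, feat_neg])
    y = np.concatenate([np.ones(len(feat_pos)), np.zeros(len(feat_neg))])
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        p = 1 / (1 + np.exp(-(X @ w + b)))
        w -= lr * (X.T @ (p - y) / len(y) + l1 * np.sign(w))  # BCE + L1 subgradient
        b -= lr * (p - y).mean()
    return w, b

# Toy stand-in for the 768-d BERT NSP features: only the first 3 of 32
# dimensions actually carry the relevance signal.
feat_pos = rng.normal(size=(100, 32)); feat_pos[:, :3] += 2.0
feat_neg = rng.normal(size=(100, 32))
w, b = train_idk(feat_pos, feat_neg)

signal = np.abs(w[:3]).mean()   # weights on the informative dimensions
noise = np.abs(w[3:]).mean()    # weights L1 should shrink toward zero
p_pos = float((1 / (1 + np.exp(-(feat_pos @ w + b)))).mean())
```

The L1 penalty concentrates weight mass on the few informative dimensions, which mirrors the low-dimensionality assumption motivating the design.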
Table 3 reports the correlation between the metric’s responses and the average human rating. We achieve a Pearson’s correlation on HUMOD of 0.58, surpassing the HUMOD baselines Merdivan et al. (2020) and achieving parity with GRADE. Examples of our metric’s output on the HUMOD dataset, and a scatter plot of IDK vs. human scores, are given in Appendices A and F, respectively.
Compared to NUP-BERT, our proposed metric provides a strong improvement on the HUMOD dataset and equivalent or stronger performance on USR-TC and FED, at the cost of performance on P-DD. In particular, IDK (TC-S) performance on the FED datasets is considerably stronger than NUP-BERT (TC-S). As the performance drop on P-DD is smaller than the performance gain on HUMOD, and as HUMOD is human data rather than LSTM data, we consider this tradeoff a net benefit.
Compared to GRADE in particular, we have reduced performance on P-DD, equivalent performance on HUMOD, and stronger performance on USR-TC and FED (in particular, correlation on the USR-TC dataset is non-zero). It is worth noting that, in general, our approach does not outperform the baselines in all cases, only in the majority of cases. As such, when annotated human data is not available for testing, our approach appears to be the preferred choice.
Our metric is also preferable as it is less sensitive to domain. To demonstrate this numerically, we measure the domain sensitivity of the evaluated metrics as the ratio of best Spearman’s correlation to worst Spearman’s correlation; this value should be positive (i.e., there is no dataset where the metric becomes negatively correlated) and as close to 1 as possible (i.e., there is no difference in performance across datasets). Looking at Table 10, we find IDK strongly outperforms all prior metrics, reducing this ratio substantially compared to the best baseline.
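The ratio is straightforward to compute; the per-dataset correlations below are illustrative made-up values, not taken from our tables:

```python
def domain_sensitivity(correlations):
    """Ratio of best to worst Spearman correlation across test sets.
    Positive and close to 1 => stable; negative or large => sensitive."""
    return max(correlations) / min(correlations)

# Hypothetical per-dataset correlations for two metrics (illustrative only).
stable = domain_sensitivity([0.45, 0.40, 0.38, 0.50])     # all positive, similar
unstable = domain_sensitivity([0.60, 0.05, -0.20, 0.30])  # sign flips across sets
```

A metric that becomes negatively correlated on any dataset immediately produces a negative (hence maximally alarming) ratio, which is why the sign matters as much as the magnitude.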
5.1 Testing NSP feature dimensionality
As a followup experiment, we tested our assumption that only a fraction of the BERT-NSP features are needed. Plotting the weights learned by IDK on HUMOD, we found a skewed distribution, with only a small fraction of weights having large magnitude (see Appendix, Figure 1). Hypothesizing that the largest weights correspond to the relevant dimensions, we modified the pretrained huggingface NSP BERT to zero all dimensions of the NSP feature except the seven corresponding to the largest-magnitude IDK HUMOD weights. We then evaluated NSP accuracy on three NLTK Bird et al. (2009) corpora: Brown, Gutenberg, and Webtext. As expected, we found that reducing the dimensionality from 768 to 7 had no negative impact (see Appendix, Table 7). Again, note that the mask was created using IDK trained on HUMOD data, and the weights of BERT and the NSP prediction head were in no way changed. Therefore, it is clear that (at least on these datasets) over 99% of the BERT NSP feature dimensions can be safely discarded.
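The masking procedure can be sketched as follows; the toy 6-dimensional weight vector stands in for the full-dimensional IDK weights, and the feature vector of ones is only for illustration:

```python
import numpy as np

def topk_mask(weights, k):
    """Boolean mask keeping only the k largest-magnitude dimensions;
    applied multiplicatively to zero out the remaining NSP features."""
    idx = np.argsort(np.abs(weights))[-k:]
    mask = np.zeros_like(weights, dtype=bool)
    mask[idx] = True
    return mask

w = np.array([0.02, -1.3, 0.0, 0.7, -0.05, 2.1])  # toy learned IDK weights
mask = topk_mask(w, 2)
masked_features = np.array([1.0] * 6) * mask       # features outside top-2 zeroed
```

In the actual experiment the mask is applied to the NSP feature vector before the (unchanged) NSP prediction head, so any accuracy drop would be attributable purely to the discarded dimensions.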
5.2 Ablation tests
Table 5 reports correlations when ablating the L1 regularization, or when using randomly sampled negative samples in place of “i don’t know”. Random samples are produced by shuffling the responses of other dialogues in the dataset.
Overall, it appears that the majority of the performance gains come from the combination of L1 regularization with pretrained BERT NSP features. The clearest observation is that L1 regularization is critical to good performance when using “i don’t know” in place of random samples; otherwise, the model presumably overfits. Second, using “i don’t know” in place of random samples has a mixed but relatively minor effect. Third, the effect of L1 regularization is quite positive when training on TC data (regardless of the negative samples), and mixed but smaller when training on HUMOD data. Overall, this suggests that when a validation set of domain-specific annotated relevance data is not available, L1 regularization may be helpful: its effect varies by domain, but its positive effects appear much stronger than its negative ones.
The result that L1 regularization allows us to use “i don’t know” in place of random negative samples is quite interesting, as it seems to counter work in contrastive representation learning Robinson et al. (2021) and dialogue quality evaluation Lan et al. (2020) suggesting that “harder” negative examples are better. We believe the reason for this apparent discrepancy is that we are not performing feature learning; the feature space is fixed, pretrained BERT NSP. Furthermore, we have shown that this feature space is effectively 7-dimensional. As a result, we believe that the L1 regularization causes an effective projection to 7D. Consequently, as our model is low-capacity, “i don’t know” is sufficient to find the separating hyperplane. Having said this, it is still unclear why we see improved performance on FED when training on HUMOD data. Comparing the histograms of learned weight magnitudes (see Appendix, Figure 2), we find that the ablated model has a larger number of large weights; we speculate that the random negative samples’ variation in irrelevant aspects, such as syntactic structure, is responsible.
5.3 Additional Experiments
We repeated our IDK experiments with two different fixed negative samples; performance and domain sensitivity are generally comparable, although unexpectedly more sensitive to the choice of training data (see Appendix J). We also experimented with using the pretrained BERT NSP predictor directly as a measure of relevance; however, performance is considerably worse on the longer-context FED dataset (see Appendix I). Finally, we observed that BCE loss encourages the model to always map “i don’t know” to zero; yet the relevance of “i don’t know” varies by context. Unfortunately, experiments with a modified triplet loss did not yield improvements (see Appendix H).
6 Related Work
In addition to the prior metrics already discussed, the area of dialogue relevance is both motivated by, and jointly developed with, the problem of automatic dialogue evaluation. As relevance is a major component of good dialogue, there is a bidirectional flow of innovations. The NUP-BERT relevance metric is very similar to BERT-RUBER Ghazarian et al. (2019); both train a small MLP to perform the next-utterance-prediction task based on aggregated BERT features. Both of these share a heritage with earlier self-supervised methods, such as adversarial approaches to dialogue evaluation that train a classifier to distinguish human from generated samples Kannan and Vinyals (2017). Another example of shared development is the use of word-overlap metrics such as BLEU Papineni et al. (2002) and ROUGE Lin (2004), imported wholesale into both dialogue relevance and overall quality from the fields of machine translation and summarization, respectively.
Simultaneously, metrics of dialogue evaluation have been motivated by dialogue relevance. There is a long history of evaluating dialogue models on specific aspects; Finch and Choi (2020) performed a meta-analysis of prior work and proposed the dimensions of grammaticality, relevance, informativeness, emotional understanding, engagingness, consistency, proactivity, and satisfaction. New approaches to dialogue evaluation have emerged from this body of work, seeking to aggregate individual measures of various dimensions of dialogue, often including relevance Mehri and Eskenazi (2020b); Phy et al. (2020); Berlot-Attwell and Rudzicz (2021). These approaches also share heritage with earlier ensemble measures of dialogue evaluation such as RUBER Tao et al. (2018), although RUBER combined a referenced and an unreferenced metric rather than separate aspects.
Metrics of dialogue relevance and quality also share common problems, such as the diversity of valid responses. Our finding that existing relevance metrics generalize poorly to new domains is consistent with previous findings about metrics of dialogue quality Lowe (2019); Yeh et al. (2021). Thus, our work suggests that this challenge extends to the subproblem of dialogue relevance as well.
At the same time, it must be remembered that measuring holistic dialogue quality is a very different task from measuring dialogue relevance; it is well established that aspects of dialogue such as fluency and interestingness are major components of quality Mehri and Eskenazi (2020b, a), and these should have no impact on relevance.
With respect to prior work comparing relevance metrics, we are aware of only one tangential work. Yeh et al. (2021) performed a comparison of various metrics of dialogue quality; within this work, they dedicated three paragraphs to a brief comparison of how these quality metrics performed at predicting various dialogue qualities, including relevance. They reported results on only two of the datasets we used (P-DD and FED). Interestingly, the authors found that the FED metric performs well on P-DD (reporting a Spearman’s correlation of 0.507); however, our results demonstrate that the components of FED meant to measure relevance (i.e., FED-REL and FED-COR) are significantly negatively correlated with human relevance scores. Additionally, as Yeh et al. (2021) focus on quality, they do not compare performance between the two relevance datasets. Instead, they compare performance on quality against performance on relevance, and use the discrepancy to conclude that measuring relevance alone (as done by NORM-PROB) is insufficient to determine quality. Although we agree that relevance alone is insufficient for dialogue quality evaluation, our work provides a richer understanding: our finding that NORM-PROB performs poorly across a range of relevance datasets suggests that its poor performance on the quality-prediction task stems not only from the insufficiency of relevance as a measure of overall quality, but also from poor relevance generalization.
Our experiments demonstrate that several published measures of dialogue relevance have poor, or even negative, correlation when evaluated on new datasets of dialogue relevance, suggesting overfitting to either model or domain. As such, it is clear that further research into new measures of dialogue relevance is required, and that care must be taken in their evaluation to compare against a number of different models in a number of domains. Furthermore, it is also clear that, for the current practitioner who requires a measure of relevance, there are no guarantees that current methods will perform well on a given domain. As such, it is wise to collect a validation dataset of human-annotated relevance data for use in selecting a relevance metric. If this is not possible, then our metric, IDK, appears to be the best option, achieving both good correlation and the lowest domain sensitivity, even when trained on different domains. Furthermore, when training data is scarce, our results suggest that strong regularization allows a single negative example, “i don’t know”, to be used in place of randomly sampled negatives. If that is still too data intensive, our results suggest that our metric is fairly agnostic to the domain of the training data; training data from a different dialogue domain can therefore be used in place of the domain of interest.
Having said this, it is clear that further research into what exactly these metrics are measuring, and why they fail to generalize, is merited. The results are often counter-intuitive; our demonstration that over 99% of the BERT NSP features can be safely discarded is just one striking example. Similarly, although our empirical results suggest that the use of a single negative example generalizes across domains, there is no compelling theoretical reason why this should be so. More generally, all the metrics outlined are complex, dependent on large corpora, and created without ground-truth annotations. As a result, they all depend on either surrogate tasks (i.e., NUP) or unsupervised learning (e.g., FastText embeddings). Consequently, it is especially difficult to conclude what exactly these metrics are measuring. At present, the only strong justification that these metrics are indeed measuring relevance is good correlation with human judgements, and poor generalization across similar domains is not an encouraging result.
Although the metric outlined is not appropriate for final model evaluation (as it risks unfairly favouring dialogue models based on the same pretrained BERT, or similar architectures), our aim is to provide a useful metric for rapid prototyping and hyperparameter search. Additionally, we hope that our findings on the domain sensitivity of existing metrics will spur further research into both the cause of – and solutions to – this problem.
Our work demonstrates that several existing metrics of dialogue relevance are problematic, as their performance varies wildly between test domains. We take a first step towards resolving this issue by proposing IDK: a simple metric that is less sensitive to test domain and trainable with minimal data. We reduce IDK’s data requirements through the novel use of a fixed negative example, provide evidence that the underlying BERT NSP features are low-dimensional, and propose that this fact (combined with IDK’s lack of feature learning) allows for the counter-intuitive use of a single negative example. Beyond this, we call for better evaluation of future relevance metrics, and thus release our code for processing four diverse, publicly available, relevance-annotated datasets.
Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute (https://vectorinstitute.ai/partners/). Ian Berlot-Attwell is funded by an Ontario Graduate Scholarship and a Vector Institute Research Grant. Frank Rudzicz is supported by a CIFAR Chair in AI. We would also like to thank the various reviewers who helped to shape and improve this work; without them it would not be what it is today.
- Towards a human-like open-domain chatbot. CoRR abs/2001.09977.
- Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.).
- METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Ann Arbor, Michigan, pp. 65–72.
- On the use of linguistic features for the evaluation of generative dialogue systems. CoRR abs/2104.06335.
- Natural language processing with Python. O'Reilly.
- DSTC10 track 5: automatic evaluation and moderation of open-domain dialogue systems. Accessed: 9-7-2021. https://drive.google.com/file/d/1B2YBtWaLJU5X3uudSZEaOyNWQ_QoTZLG/view
- Weighted kappa: nominal scale agreement provision for scaled disagreement or partial credit. Psychological Bulletin 70 (4), pp. 213.
- Improving neural conversational models with entropy-based data filtering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 5650–5669.
- Chameleons in imagined conversations: a new approach to understanding coordination of linguistic style in dialogs. In Proceedings of the 2nd Workshop on Cognitive Modeling and Computational Linguistics, Portland, Oregon, USA, pp. 76–87.
- Survey on evaluation methods for dialogue systems. Artificial Intelligence Review 54 (1), pp. 755–810.
- BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186.
- An argumentative dialogue system for COVID-19 vaccine information. In Logic and Argumentation, P. Baroni, C. Benzmüller, and Y. N. Wáng (Eds.), Cham, pp. 477–485.
- Towards unified dialogue system evaluation: a comprehensive analysis of current evaluation protocols. In Proceedings of the 21st Annual Meeting of the Special Interest Group on Discourse and Dialogue, 1st virtual meeting, pp. 236–245.
- Better automatic evaluation of open-domain dialogue systems with contextualized embeddings. In Proceedings of the Workshop on Methods for Optimizing and Evaluating Neural Language Generation, Minneapolis, Minnesota, pp. 82–89.
- Topical-Chat: towards knowledge-grounded open-domain conversations. In Proc. Interspeech 2019, pp. 1891–1895.
- The curious case of neural text degeneration. In International Conference on Learning Representations.
- GRADE: automatic graph-enhanced coherence metric for evaluating open-domain dialogue systems. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 9230–9240.
- Adversarial evaluation of dialogue models. CoRR abs/1701.08198.
- Adam: a method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.).
- PONE: a novel automatic evaluation metric for open-domain generative dialogue systems. ACM Transactions on Information Systems 39 (1), pp. 7:1–7:37.
- A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, pp. 110–119.
- ACUTE-EVAL: improved dialogue evaluation with optimized questions and multi-turn comparisons. CoRR abs/1909.03087.
- DailyDialog: a manually labelled multi-turn dialogue dataset. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Taipei, Taiwan, pp. 986–995.
- ROUGE: a package for automatic evaluation of summaries. In Text Summarization Branches Out, Barcelona, Spain, pp. 74–81.
- A retrospective for "Towards an automatic Turing test - learning to evaluate dialogue responses". ML Retrospectives.
- Unsupervised evaluation of interactive dialog with DialoGPT. In Proceedings of the 21st Annual Meeting of the Special Interest Group on Discourse and Dialogue, 1st virtual meeting, pp. 225–235.
- USR: an unsupervised and reference free evaluation metric for dialog generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 681–707.
- Human annotated dialogues dataset for natural conversational agents. Applied Sciences 10 (3).
- Towards holistic and automatic evaluation of open-domain dialogue generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 3619–3629.
- BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, pp. 311–318.
- Deconstruct to reconstruct a configurable evaluation metric for open-domain dialogue systems. In Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain (Online), pp. 4164–4178.
- Towards empathetic open-domain conversation models: a new benchmark and dataset. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 5370–5381.
- Contrastive learning with hard negative samples. In International Conference on Learning Representations.
- Now we are talking! Flexible and open goal-oriented dialogue systems for accessing touristic services. e-Review of Tourism Research.
- RUBER: an unsupervised method for automatic evaluation of open-domain dialog systems. In AAAI Conference on Artificial Intelligence.
- Attention is all you need. In Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30.
- Better conversations by modeling, filtering, and optimizing for coherence and diversity. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 3981–3991.
- A comprehensive assessment of dialog evaluation metrics. In The First Workshop on Evaluations and Assessments of Neural Conversation Systems, Online, pp. 15–33.
- DynaEval: unifying turn and dialogue level evaluation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, pp. 5676–5689.
- DialoGPT: large-scale generative pre-training for conversational response generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Online, pp. 270–278.
Appendix A Example Evaluations
| Utterance | Human | Predicted |
| --- | --- | --- |
| Did you ever make a wish? | - | - |
| Oh, lots of times. | - | - |
| Did your wishes ever come true? | 5.00 | 4.97 |
| What's your real name? | 1.00 | 3.81 |
| From high school Mary? Yeah, I saw her about six months ago at a convention in Las Vegas. | 1.00 | 1.13 |
| I made a wish today, and it came true just like Edward said it would. | 5 | 4.9 |
| When I am sure I am among friends. | 2.33 | 3.01 |

| Utterance | Human | Predicted |
| --- | --- | --- |
| John, we're going huntin'. | - | - |
| We're all going. | - | - |
| I will keep you safe. We are both older. | 2.00 | 1.09 |
| Nick , Vince , Albert and John. | 4.00 | 4.95 |
| A ride? Hell, that's a good idea. Okay, let's go. Hey, let's go. | 2.33 | 4.68 |
| I guess so | 3.00 | 2.59 |
Appendix B NSP Masking Experiment Results
The results of the NSP masking experiment are outlined in Table 7. Note that masking the NSP feature had no impact on the pretrained model, and actually improved accuracy on the WebText corpus.
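The masking operation itself is straightforward; as a minimal sketch (the function name is hypothetical, and the actual experiment masks the NSP feature inside the BERT-based pipeline):

```python
import numpy as np

def mask_feature_columns(features, columns):
    """Zero out the given feature columns, leaving all others untouched.

    `features` is an (n_examples, n_features) array; `columns` lists the
    indices of the features to mask (e.g., the NSP feature).
    """
    masked = features.copy()
    masked[:, columns] = 0.0
    return masked
```

Downstream accuracy is then recomputed on the masked features and compared against the unmasked baseline.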
Appendix C Exact objectives of prior metrics
In this section, we briefly outline the stated purpose of each of the relevance metrics we evaluated:
COS-FT: “In this work, given a dialogue history, we regard as a coherent response an utterance that is thematically correlated and naturally continuing from the previous turns, as well as lexically diverse.” Xu et al. (2018)
NUP-BERT: “Maintains Context: Does the response serve as a valid continuation of the preceding conversation?” Mehri and Eskenazi (2020b)
NORM-PROB: “context coherence of a dialogue: the meaningfulness of a response within the context of prior query” Pang et al. (2020)
FED-REL: “Is the response relevant to the conversation?” Mehri and Eskenazi (2020a)
FED-COR: “Is the response correct or was there a misunderstanding of the conversation? […] No one has specifically used Correct, however its meaning is often encapsulated in Relevant.” Mehri and Eskenazi (2020a)
We also outline the stated purpose of the dialogue coherence metrics evaluated:
Appendix D Details of Prior Work
Pang et al. (2020) relied on a pretrained generative language model (specifically GPT-2). Their proposed metric is the conditional log-probability of the response given the context, normalized to the range $[0, 1]$. Specifically, for a context $c$ with candidate response $r$, their proposed relevance score is defined as $s(c, r) = \Phi\!\left(\frac{1}{|r|}\log p(r \mid c)\right)$, where $|r|$ is the number of tokens in the response, $p(r \mid c)$ is the conditional probability of the response given the context under the language model, and $\Phi(x)$ is the percentile of $x$ in the distribution of $\frac{1}{|r|}\log p(r \mid c)$ over the examples being evaluated.
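The percentile-normalization step can be sketched as follows. This is a minimal illustration assuming the per-response averaged log-probabilities have already been computed with the language model; ties are broken arbitrarily, and the actual implementation may differ.

```python
import numpy as np

def percentile_normalize(avg_logprobs):
    """Map each averaged log-probability to its percentile rank in (0, 1].

    The highest-scoring response in the evaluated set maps to 1.0; the
    lowest maps to 1/N. Ties are broken by position (arbitrary).
    """
    scores = np.asarray(avg_logprobs, dtype=float)
    ranks = scores.argsort().argsort()   # rank 0 = lowest score
    return (ranks + 1) / len(scores)
```

Note that because the normalization is computed over the set of examples being evaluated, the score of any single response depends on the rest of the evaluation batch.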
Appendix E Learned HUMOD-IDK Weights
Figure 1 depicts the distribution of weight magnitudes learned by IDK on the HUMOD training set. Notably, there is a small subset of weights that are an order of magnitude larger than the rest. Figure 2 demonstrates that using random sampling in place of “i don’t know” when training on the HUMOD dataset produces a larger number of large weights.
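The "small subset of large weights" observation can be quantified with a few lines; this is a hypothetical sketch (the threshold factor and function name are illustrative choices, not taken from the paper):

```python
import numpy as np

def large_weight_fraction(weights, factor=10.0):
    """Fraction of weights whose magnitude exceeds `factor` times the
    median magnitude, i.e., roughly 'an order of magnitude larger'."""
    mags = np.abs(np.asarray(weights, dtype=float))
    return float((mags > factor * np.median(mags)).mean())
```

A small returned fraction is consistent with the claim that the BERT NSP features relevant to IDK are effectively low-dimensional.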
Appendix F Scatter Plots
Appendix G Performance on validation data split
| Name | HUMOD Spear | HUMOD Pear | TC Spear | TC Pear |
| --- | --- | --- | --- | --- |
| NUP-BERT (H) | *0.37 (0.01) | *0.38 (0.00) | *0.38 (0.02) | *0.39 (0.01) |
| NUP-BERT (TC-S) | *0.32 (0.01) | *0.36 (0.02) | *0.38 (0.04) | *0.41 (0.04) |
| NUP-BERT (TC) | *0.33 (0.02) | *0.37 (0.02) | *0.45 (0.07) | *0.44 (0.02) |
| H_Rand3750_bce | *0.58 (0.00) | *0.57 (0.01) | *0.46 (0.00) | *0.43 (0.02) |
| H_Rand3750 | *0.58 (0.00) | *0.58 (0.00) | *0.46 (0.00) | *0.45 (0.02) |
| H_IDK_L1 | *0.56 (0.01) | *0.53 (0.02) | *0.45 (0.03) | *0.44 (0.02) |
| H_IDK_L2 | *0.55 (0.00) | *0.55 (0.01) | *0.44 (0.00) | *0.44 (0.00) |
| H_Rand3750_L1 | *0.42 (0.22) | *0.40 (0.20) | *0.44 (0.00) | *0.45 (0.01) |
| H_Rand3750_L2 | *0.56 (0.00) | *0.55 (0.01) | *0.45 (0.00) | *0.44 (0.02) |
| H_Rand3750_bce_L1 | *0.58 (0.00) | *0.58 (0.00) | *0.45 (0.00) | *0.46 (0.00) |
| H_Rand3750_bce_L2 | *0.57 (0.00) | *0.56 (0.00) | *0.45 (0.00) | *0.42 (0.00) |
| H_IDK_bce_L1 | *0.57 (0.00) | *0.56 (0.00) | *0.42 (0.01) | *0.41 (0.00) |
| H_IDK_bce_L2 | *0.50 (0.01) | *0.51 (0.01) | *0.39 (0.00) | *0.42 (0.00) |
| H_IDK_bce | *0.39 (0.05) | *0.40 (0.05) | *0.36 (0.02) | *0.34 (0.00) |
| H_IDK | *0.15 (0.05) | *0.19 (0.06) | 0.09 (0.05) | ☥0.21 (0.05) |
| TC-S_IDK_L1 | *0.29 (0.43) | *0.23 (0.53) | *0.39 (0.07) | *0.41 (0.07) |
| TC-S_IDK_L2 | *0.54 (0.01) | *0.55 (0.01) | *0.43 (0.01) | *0.44 (0.00) |
| TC-S_IDK_bce_L1 | *0.57 (0.00) | *0.56 (0.00) | *0.43 (0.00) | *0.40 (0.00) |
| TC-S_IDK_bce_L2 | *0.47 (0.02) | *0.48 (0.01) | *0.41 (0.00) | *0.39 (0.01) |
| TC-S_IDK_bce | *0.35 (0.04) | *0.33 (0.05) | *0.40 (0.01) | *0.31 (0.01) |
| TC-S_IDK | *0.25 (0.10) | *0.24 (0.10) | *0.34 (0.05) | *0.36 (0.03) |
| TC-S_Rand3750_L1 | *-0.19 (0.67) | *-0.20 (0.63) | *-0.13 (0.52) | *-0.14 (0.50) |
| TC-S_Rand3750_L2 | ☥-0.33 (0.27) | ☥-0.32 (0.26) | *-0.45 (0.02) | *-0.43 (0.02) |
| TC-S_Rand3750_bce_L1 | *0.56 (0.01) | *0.52 (0.03) | *0.44 (0.03) | *0.40 (0.02) |
| TC-S_Rand3750_bce_L2 | *0.04 (0.55) | *0.09 (0.56) | ☥-0.26 (0.27) | ☥-0.23 (0.31) |
| TC-S_Rand3750_bce | *0.31 (0.05) | *0.36 (0.03) | ☥0.16 (0.29) | ☥0.18 (0.26) |
| TC-S_Rand3750 | ☥0.15 (0.17) | *0.11 (0.02) | ☥-0.14 (0.24) | ☥-0.06 (0.27) |
Appendix H Additional Experiments: Triplet Loss
An intuitive limitation of using “i don’t know” as a negative example with BCE loss is that it encourages the model to always map “i don’t know” to exactly zero. However, the relevance of “i don’t know” evidently varies by context: it is clearly a far less relevant response to “I was interrupted all week and couldn’t get anything done, it was terrible!” than to “what is the key to artificial general intelligence?” Motivated by this intuition, we experimented with a modified triplet loss of the form $\mathcal{L} = \max\big(0,\, m + z(c, \text{“i don’t know”}) - z(c, r)\big)$, where $z(c, u)$ is the model’s pre-sigmoid logit for response $u$ given context $c$, $r$ is the ground-truth response, and $m$ is a margin.
Intuitively, a triplet loss would allow for the relevance of “i don’t know” to shift, without impacting the loss as long as the ground-truth responses continue to score sufficiently higher. Note that the loss is modified to combat gradient saturation due to the sigmoid non-linearity. However, the results (see Table 9) suggest equivalence, at best. Often, this loss performs equivalently to BCE but it can also produce degenerate solutions (note the high variance when training on TC data). Furthermore, it does not appear to produce superior correlations.
For this reason, we believe that although a triplet loss adapted for next-utterance prediction could be used interchangeably with BCE (with validation data, if available, used to detect degenerate solutions), it does not appear to offer any advantage over BCE.
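A minimal sketch of a margin-based triplet loss applied to pre-sigmoid logits, in the spirit of the experiment above (the exact loss used in our experiments may differ; names are illustrative):

```python
def triplet_margin_loss(pos_logit, neg_logit, margin=1.0):
    """Hinge-style loss on raw logits: zero once the positive (ground-truth)
    logit beats the negative ("i don't know") logit by at least `margin`.

    Operating on logits rather than sigmoid outputs avoids the gradient
    saturation that motivates modifying the standard triplet loss.
    """
    return max(0.0, margin + neg_logit - pos_logit)
```

Because the loss is zero whenever the margin is satisfied, the absolute score of "i don't know" is free to drift per context, which is exactly the flexibility BCE with a fixed zero target lacks.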
Appendix I Additional Experiments: BERT NSP
As a follow-up experiment, we compared IDK against directly using the pretrained BERT NSP predictor. In general, Spearman’s correlation was comparable on all datasets except FED, while Pearson’s correlation was degraded. Performance on FED was inferior to IDK. We speculate that this is because the FED dataset has longer contexts, which is problematic for the NSP predictor as it was trained on sentence pairs rather than dialogue utterances. Results are summarized in Table 11.
Appendix J Additional Experiments: IDK with other fixed negative samples
As a follow-up experiment, we trained IDK using two different fixed negative samples: "i couldn’t say" (simply chosen as a synonym for "i don’t know") and "i’m ok." (chosen as an example of a generic response from Li et al. (2016)). Results are reported in Table 11; in general we still see a performance improvement over NUP-BERT, and in some cases we exceed the performance of baseline IDK. We also see that performance remains consistent between runs, maintaining a lower standard deviation than NUP-BERT.
However, it is also clear that changing the fixed negative sample has some unexpected consequences: specifically, we see variation based on training data that is not observed when using "i don’t know" as the fixed negative sample (although this variation appears to be smaller than NUP-BERT’s).
We retain the reduced sensitivity to test set. Specifically, our ratios of best-to-worst Spearman’s correlation are 3.44 for IDK-ICS (H), 4.14 for IDK-ICS (TC-S), 5.27 for IDK-OK (H), and 3.93 for IDK-OK (TC-S). Most are very close to the baseline IDK ratio of 3.9, and all improve on the best prior work, 6.2 for NUP-BERT (H); it is worth noting that NUP-BERT (TC-S) attains a ratio of 11.6, considerably worse than when trained on HUMOD data.
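The sensitivity ratio reported throughout is simply the best Spearman's correlation across test sets divided by the worst; as a sketch (assuming, as in the comparisons above, that all correlations are positive):

```python
def sensitivity_ratio(correlations):
    """Best-to-worst ratio of a metric's correlations across test sets.

    Lower is better: a ratio near 1 means the metric performs equally
    well on every test domain. Assumes all correlations are positive.
    """
    vals = sorted(correlations)
    return vals[-1] / vals[0]
```

For example, a metric with correlations spanning 0.10 to 0.62 across datasets yields a ratio of 6.2, matching the NUP-BERT (H) figure above.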