Table 1: A conversation with its ground truth response, appropriate candidate responses (A1–A3), and inappropriate ones (N1–N3). Each cell shows the metric score with its rank in parentheses.

| Context | A: What do you want to do tonight? B: Why don't we go see a movie? |
| Ground truth response | A: Yeah Let's go to the theater |

| | Response | BLEU | ROUGE | EMB | RUBER | SSREM | Human |
| --- | --- | --- | --- | --- | --- | --- | --- |
| A1 | That sounds good! Have you seen Thor? | 0.00 (3) | 0.00 (3) | 0.95 (2) | 0.59 (2) | 0.64 (1) | 5.00 (1) |
| A2 | Good, What movie? | 0.00 (3) | 0.00 (3) | 0.92 (4) | 0.55 (4) | 0.62 (2) | 5.00 (1) |
| A3 | Or hang out in city | 0.00 (3) | 0.00 (3) | 0.89 (6) | 0.48 (5) | 0.49 (3) | 3.80 (3) |
| N1 | The weather is no good for walking | 0.32 (1) | 0.15 (2) | 0.94 (3) | 0.47 (6) | 0.44 (4) | 2.60 (4) |
| N2 | The sight is extra beautiful here | 0.32 (1) | 0.17 (1) | 0.97 (1) | 0.64 (1) | 0.38 (5) | 1.00 (5) |
| N3 | Enjoy your concert | 0.00 (3) | 0.00 (3) | 0.91 (5) | 0.57 (3) | 0.33 (6) | 1.00 (5) |
Evaluating system-generated responses for open-domain dialogue is a difficult task. There are many possible appropriate responses for a given dialogue context, and automatic metrics such as BLEU P02-1040 or ROUGE W04-1013 rate responses that deviate from the ground truth as inappropriate. Still, it is important to develop and use automatic metrics because human annotation is very costly. In addition to BLEU and ROUGE, there is a widely-used evaluation metric based on distributed word representations D16-1230, but it shows low correlation with human judgments.
One reason for the difficulty in developing an automatic metric that correlates well with human judgments is that the range of appropriate responses for a given context is very wide. Table 1 shows an example of a conversation between Speakers A and B. While the ground truth response is "Yeah let's go to the theater," A could also have said "That sounds good! Have you seen Thor?" or "Good. What movie?" Note that based on word overlap with the ground truth, these two responses would receive low scores. Responses labeled N#, such as "The weather is no good for walking," are not appropriate. As Table 1 shows, the existing metrics from BLEU to RUBER are not able to tell apart these appropriate A# responses from the inappropriate N# responses.
Some recent metrics, such as ADEM lowe-etal-2017-towards and RUBER tao2018ruber, compute the similarity between a context and a generated response. However, ADEM requires human-annotated scores for training and thus cannot be applied to new datasets and domains. RUBER overcomes this limitation by using a random response as a "negative sample," but it is not able to distinguish the responses in the example in Table 1, because a single random sample does not provide sufficient information about appropriate and inappropriate responses.
In this paper, we propose the Speaker Sensitive Response Evaluation Model (SSREM), which analyzes the appropriateness of responses. We train the model with speaker sensitive responses, that is, responses generated by one speaker. We test SSREM against other evaluation metrics. First, we collect human-annotated scores for responses in Twitter conversation data. SSREM's evaluation scores show a higher correlation with human scores than other evaluation metrics, and SSREM outperforms the other metrics at identifying the ground truth response given a context. We also show an additional advantage of SSREM: it can be applied to a new corpus in a different domain. We train SSREM on a Twitter corpus and test it on a corpus of movie scripts, and SSREM again outperforms the other metrics in correlation with human scores and in identifying the ground truth response.
Our contributions in this paper include the following: we propose SSREM, an evaluation model that looks at the context and the ground truth response together; we introduce training with speaker sensitive negative samples; and we show that SSREM correlates better with human judgments than existing metrics and transfers to a corpus in a new domain.
2 Related Work
In this section, we describe existing automatic evaluation metrics for dialogue response generation and discuss their limitations.
For task-oriented dialogue models such as the airline travel information system tur2010left, completing the given task is most important, and the evaluation metrics reflect that hastie2012metrics; bordes2016learning. But open-domain conversation models have no specific assigned task; the main goal of an open-domain conversation model is generating appropriate responses given a conversation about any topic.
Existing automatic evaluation metrics compare a generated response with the ground truth response. The most widely-used metrics are BLEU P02-1040 and ROUGE W04-1013, which are based on the overlap of words between the two responses. A limitation of these word overlap-based metrics is that they cannot identify synonyms. To overcome this limitation, embedding-based metrics use distributed word vector representations D16-1230. However, these metrics have poor correlation with human judgments D16-1230; novikova-etal-2017-need; gupta-etal-2019-investigating because they still only look at the similarity between the generated response and the ground truth. SSREM is a model built on the awareness that a response can differ from the ground truth response but still be appropriate for the conversation context.
The responses for a casual conversation can vary widely. For example, Table 1 contains four appropriate responses for the given context, including the ground truth response. Some previous approaches, such as ADEM lowe-etal-2017-towards and RUBER tao2018ruber, consider the context together with the response. ADEM uses a pre-trained VHRED serban2017hierarchical to encode the texts and computes a score by mixing similarities among the context, the generated response, and the ground truth. One limitation of ADEM is that it requires human-annotated scores to learn the model; human labeling is cost-intensive, so applying ADEM to a new dataset or domain is impractical. RUBER uses negative sampling to overcome this issue, but it uses only one random negative sample against one positive sample, which is not ideal pmlr-v9-gutmann10a. SSREM does not require human scores to learn the model and uses many speaker sensitive negative samples.
3 Speaker Sensitive Response Evaluation Model
This section describes our Speaker Sensitive Response Evaluation Model (SSREM), which trains with speaker sensitive utterance samples. SSREM looks at a given context and its ground truth response together to evaluate a generated response. We describe the motivation for SSREM with empirical observations in section 3.1, present the structure of SSREM in section 3.2, and present the training method of SSREM with speaker sensitive utterance samples in section 3.3.
Table 2: Mean similarity of utterances in each of the four sets, with a 95% confidence interval.
3.1 Motivation
We are motivated by the assumption that there are varying degrees of similarity among utterances in a corpus of conversations containing many speakers and conversations:
1. If we pick a set of random utterances from the corpus, they will not be very similar.
2. If we pick a set of utterances from a single speaker conversing with multiple partners, those utterances will be more similar than the random utterances in (1).
3. If we pick a set of utterances from conversations between a single dyad, even if the conversations are far apart in time, those utterances will be more similar than those in (2).
4. If we pick a set of utterances in a single conversation session, they are the most similar, even more so than those in (3).
To test these assumptions, we first categorize one speaker A's utterances into four types of sets corresponding to the assumptions above:
Random: random utterances from speakers other than A
Same Speaker: A's utterances across all of A's conversations
Same Partner: A's utterances in conversations with the same partner B
Same Conversation: A's utterances in a single conversation
Figure 1 shows an example of the sets. We make three Same Conversation sets because A participates in three conversations. We make two Same Partner sets because A has conversations with B and C. The Same Speaker set contains all utterances from A, so we create one such set for A. Finally, a Random set contains random utterances from speakers other than A; we create five Random sets for each speaker.
From these sets, we compute the similarity among the utterances in each set. First, we convert an utterance into a vector by averaging the vectors of its words using GloVe Twitter 200d pennington2014glove. We then compute the similarity between vectors with the Frobenius norm. Finally, we calculate the mean similarity of each set with a 95% confidence interval. Table 2 shows the results. The Random sets have the lowest mean similarity, supporting the first assumption. The Same Speaker sets have a higher mean similarity than the Random sets, supporting the second assumption. The mean similarity of the Same Partner sets is higher than that of the Same Speaker sets, supporting the third assumption. Finally, the Same Conversation sets have the highest mean similarity, supporting the last assumption. From these observations, we assume that utterances are clustered by speakers and addressees.
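As a concrete illustration, the set-similarity comparison above can be sketched as follows. This is a toy example: the two-dimensional word vectors are made up, and cosine similarity stands in for the Frobenius-norm measure and the real GloVe embeddings used in the paper.

```python
import itertools
import math

# Toy stand-in for GloVe: a tiny word-vector table with made-up values.
EMB = {
    "movie": [0.9, 0.1], "film": [0.85, 0.2], "theater": [0.8, 0.3],
    "weather": [0.1, 0.9], "rain": [0.05, 0.95],
}

def utterance_vector(utt):
    """Average the vectors of the known words in an utterance."""
    vecs = [EMB[w] for w in utt.lower().split() if w in EMB]
    dim = len(next(iter(EMB.values())))
    if not vecs:
        return [0.0] * dim
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def mean_pairwise_similarity(utterances):
    """Mean similarity over all utterance pairs in a set."""
    vecs = [utterance_vector(u) for u in utterances]
    pairs = list(itertools.combinations(vecs, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)

# A Same Conversation set should score higher than a Random set.
same_conv = ["movie film", "film theater", "movie theater"]
random_set = ["movie film", "weather rain", "film rain"]
assert mean_pairwise_similarity(same_conv) > mean_pairwise_similarity(random_set)
```

The same comparison, run over real sets with real embeddings, yields the ordering reported in Table 2.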
3.2 Model Structure
SSREM evaluates a generated response r̂ given a context c and a ground truth response r. The output of SSREM is as follows:

SSREM(c, r, r̂) = g( f(c, r̂), h(r, r̂) )

where f is a parametrized function that measures the similarity between the context c and the generated response r̂:

f(c, r̂) = e(c)ᵀ M e(r̂)

Here e is a function that converts a sequence of words into a vector, and M is a matrix that weights the similarity between the two vectors; it is the parameter of the f function. h is another function that measures the similarity between the ground-truth response r and the generated response r̂. g is a function that mixes the values of the f and h functions. To normalize each output of the f and h functions, we adopt linear scaling to unit range aksoy2001feature, which rescales a value s as follows:

s′ = (s − min(s)) / (max(s) − min(s))

where max(s) is the maximum and min(s) is the minimum of s.
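The scoring scheme above can be sketched in a few lines. Everything here is a stand-in: the encoder is a toy deterministic featurizer rather than GloVe word averaging, cosine similarity replaces sentence mover's similarity for the reference comparison, the weight matrix is untrained, and the min/max normalization ranges are assumed.

```python
import numpy as np

DIM = 4

def encode(utt):
    """Toy stand-in for the word-averaging encoder e(.): deterministic features."""
    words = utt.lower().split() or [""]
    feats = [[(len(w) + i) % 5, w.count("a"), w.count("e"), len(w)]
             for i, w in enumerate(words)]
    return np.asarray(feats, dtype=float).mean(axis=0)

def unit_scale(x, lo, hi):
    """Linear scaling to unit range: (x - min) / (max - min)."""
    return (x - lo) / (hi - lo) if hi > lo else 0.0

def ssrem_score(context, ground_truth, generated, M, f_range, h_range):
    """Sketch of SSREM: mix the learned context similarity f with a
    reference similarity h via arithmetic averaging g."""
    c, r, r_hat = encode(context), encode(ground_truth), encode(generated)
    f = float(c @ M @ r_hat)  # f(c, r_hat) = e(c)^T M e(r_hat)
    h = float(r @ r_hat / (np.linalg.norm(r) * np.linalg.norm(r_hat) + 1e-8))
    return 0.5 * (unit_scale(f, *f_range) + unit_scale(h, *h_range))

M = np.eye(DIM)  # untrained placeholder for the learned weight matrix
score = ssrem_score("why don't we go see a movie",
                    "yeah let's go to the theater",
                    "that sounds good have you seen thor",
                    M, f_range=(0.0, 200.0), h_range=(-1.0, 1.0))
assert 0.0 <= score <= 1.0
```

In the trained model, M is learned from the classification problem of section 3.3, and the normalization ranges come from the observed minima and maxima of each function.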
SSREM is similar to RUBER, which computes the similarities among the context, the ground truth response, and the generated response separately and merges them at the end. However, SSREM is trained with many speaker sensitive samples, whereas RUBER takes one positive sample and one negative sample.
3.3 Training with Speaker Sensitive Samples
SSREM has a parametrized function f that takes a context and a generated response. To train this function, we define a classification problem: identify the ground truth response from a set of candidate responses. The candidate set contains the ground truth response and some negative samples, and a classifier tries to pick out the ground truth response from among the negative samples. Negative samples are usually selected from a uniform distribution, but for SSREM we sample the speaker sensitive utterances described in section 3.1.
Formally speaking, let A be the speaker of the ground truth response r; that is, it is A's turn to respond to the context c. The candidate response set C(r) consists of r together with negative samples drawn from A's speaker sensitive sets:

C(r) = {r} ∪ Rand ∪ SameSpeaker ∪ SamePartner ∪ SameConv

where Rand, SameSpeaker, SamePartner, and SameConv denote the negative samples from the speaker sensitive responses. Then the probability of the ground truth response r given the context c and C(r) is:

P(r | c, C(r)) = exp(f(c, r)) / Σ_{r′ ∈ C(r)} exp(f(c, r′))

where f is SSREM's parametrized context-response similarity function. We maximize this probability over all context and ground truth response pairs, so the loss function of the classification problem is:

L = − Σ_{(c, r)} log P(r | c, C(r))
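A minimal sketch of this classification objective follows, assuming a toy word-overlap stand-in for the trained similarity function f and hand-picked negatives in place of sampled speaker sensitive utterances.

```python
import math

def f_score(context, response):
    """Hypothetical stand-in for the trained similarity f(c, r):
    Jaccard word overlap instead of the learned bilinear form."""
    c, r = set(context.split()), set(response.split())
    return len(c & r) / max(len(c | r), 1)

def candidate_loss(context, ground_truth, negatives):
    """Negative log-probability of the ground truth under a softmax over
    the candidate set C = {ground truth} + negative samples."""
    cands = [ground_truth] + negatives
    logits = [f_score(context, r) for r in cands]
    z = sum(math.exp(l) for l in logits)
    p_true = math.exp(logits[0]) / z
    return -math.log(p_true)

# Negatives would be drawn from the Random, Same Speaker, Same Partner,
# and Same Conversation pools; here they are pre-picked for illustration.
negatives = ["enjoy your concert", "the sight is extra beautiful here",
             "the weather is no good for walking"]
loss = candidate_loss("why don't we go see a movie",
                      "yeah let's go to the theater", negatives)
assert loss > 0.0
```

Minimizing this loss over all context-response pairs trains the parameters of f; the only change from a standard negative-sampling objective is where the negatives come from.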
This approach is similar to learning sentence representations logeswaran2018an, but we use speaker sensitive negative samples. It is also similar to Noise Contrastive Estimation (NCE) pmlr-v9-gutmann10a; Mnih:2012:FSA:3042573.3042630, but we set the noise distribution to the speaker sensitive distribution and take only the data sample term of the NCE objective function.
Selecting negative samples is important for learning. The noise distribution should be close to the data distribution, because otherwise the classification problem is too easy for the model to learn much from the data pmlr-v9-gutmann10a. Mnih:2012:FSA:3042573.3042630 show that samples from the unigram distribution outperform samples from a naive uniform distribution when learning a neural probabilistic language model. Likewise, we create negative samples from the speaker sensitive utterances. In particular, Same Conversation samples are more similar to the ground truth response than any other negative samples; we show these patterns with empirical observations in section 3.1 and experimental results in section 6.2. These speaker sensitive samples make the classification problem harder and lead to learning the function better than naive uniformly distributed random samples.
To train SSREM, we need a conversation corpus with many conversations per speaker. We choose the Twitter conversation corpus bak-oh-2019-variational, which has 770K conversations among 27K Twitter users. We split the data 80/10/10 into training/validation/test sets.
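The split can be sketched as follows; the shuffling seed and the exact rounding at the split boundaries are assumptions, since the paper does not specify them.

```python
import random

def split_conversations(conversations, seed=0):
    """Shuffle a list of conversations and split it 80/10/10 into
    train/validation/test partitions."""
    convs = list(conversations)
    random.Random(seed).shuffle(convs)
    n = len(convs)
    n_train, n_valid = int(0.8 * n), int(0.1 * n)
    return (convs[:n_train],
            convs[n_train:n_train + n_valid],
            convs[n_train + n_valid:])

train, valid, test = split_conversations(range(1000))
assert (len(train), len(valid), len(test)) == (800, 100, 100)
```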
4 Annotating Human Scores
To measure the correlation of SSREM with human judgments, we first gather human judgments of responses given a conversation context, using Amazon Mechanical Turk (MTurk) to annotate the scores of the responses. We select 300 conversations from a dataset of Twitter conversations and, for each conversation, collect the ground truth response and responses generated by three conversation models for annotation.
Retrieval model pandey-etal-2018-exemplar: A BM25 retrieval model robertson2009probabilistic that uses TF-IDF vector space.
: A variational autoencoder model that has a global variable for a conversation.
VHUCM bak-oh-2019-variational: A variational autoencoder model that considers the speakers of a conversation.
Then we ask the MTurkers two questions: (1) How appropriate is the response overall? (2) How on-topic is the response? These questions are taken from lowe-etal-2017-towards, whose authors show that they have high inter-annotator agreement among workers. They suggest using the first question to annotate the human score, and we follow that suggestion; we use the second question to filter out workers who submit random answers. Each worker answers these questions on a five-point Likert scale.
We annotate 1,200 responses in total. Each worker answers ten conversations with four responses per conversation, for a total of 40 responses. Each response is tagged by five workers. Of the 287 workers in total, we retain the responses from the 150 workers who passed all the tests. We take the most frequently selected score as the human score for each response. The inter-annotator agreement (Fleiss' kappa fleiss1971measuring) is consistent with the results in lowe-etal-2017-towards. Table 3 shows the basic statistics of the annotations.
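Taking the most selected score per response can be sketched as follows; the tie-breaking rule is an assumption, since the paper does not specify one.

```python
from collections import Counter

def human_score(ratings):
    """Take the most selected five-point score as the human score.
    Ties are broken toward the lower score here (an assumption)."""
    counts = Counter(ratings)
    best = max(counts.values())
    return min(s for s, c in counts.items() if c == best)

# Five workers rated each response on a 1-5 Likert scale.
assert human_score([5, 5, 4, 5, 3]) == 5
assert human_score([2, 2, 4, 4, 1]) == 2
```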
5 Experiment 1 - Comparing with Human Scores
This section describes the experiment that looks at the correlation between the model scores and the human scores for given contexts and responses.
5.1 Experiment Setup
We use the Twitter conversation corpus bak-oh-2019-variational to train and validate SSREM and the baseline models. For testing, we remove the ground truth responses from the human-annotated corpus, since a ground truth response always receives the maximum BLEU and ROUGE scores.
We compare SSREM with the following response evaluation methods:
BLEU P02-1040: We compute the sentence-level BLEU score with smoothing technique 7 W14-3346.
ROUGE W04-1013: We compute the F score of ROUGE-L.
EMB D16-1230: We compute the average cosine similarity between the ground truth response and the test response in a word embedding space. (We also experimented with greedy and extreme embedding matching for comparison, but these methods were not better than average embedding.) We use the pre-trained Google News word embedding NIPS2013_5021 to avoid a dependency between the training data and the embedding.
RUBER tao2018ruber: We train the unreferenced metric in RUBER with one random negative sample, and we use arithmetic averaging to combine the referenced and unreferenced metrics.
RSREM: We use the same structure as SSREM but train with uniformly random negative samples rather than speaker sensitive samples.
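As an illustration of the reference-based baselines above, here is a stdlib-only sketch of the ROUGE-L F score via longest common subsequence; the β parameter value is an assumption.

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l_f(reference, candidate, beta=1.2):
    """ROUGE-L F score between a reference and a candidate response."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    lcs = lcs_len(ref, cand)
    if lcs == 0:
        return 0.0
    p, r = lcs / len(cand), lcs / len(ref)
    return (1 + beta ** 2) * p * r / (r + beta ** 2 * p)

# A response sharing no words with the ground truth scores zero,
# mirroring the zeros in Table 1.
assert rouge_l_f("yeah let's go to the theater",
                 "that sounds good have you seen thor") == 0.0
assert rouge_l_f("yeah let's go to the theater",
                 "yeah let's go to the theater") == 1.0
```

This zero-score behavior for appropriate but lexically different responses is exactly the weakness the context-based metrics are designed to address.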
We now specify the functions in SSREM for the experiment. For the function that converts a word sequence into a vector, we use word averaging over GloVe Twitter 200d word embeddings pennington2014glove; more advanced methods such as RNNs or sentence embeddings reimers-gurevych-2019-sentence could be used, but we select a similar approach to RUBER for a fair comparison. For the function comparing the ground truth and the generated response, we use sentence mover's similarity, a state-of-the-art method for evaluating a reference-candidate sentence pair using word and sentence embeddings clark-etal-2019-sentence; to avoid a dependency between the training data and the embedding, we use ELMo embeddings Peters:2018. For the mixing function, we use arithmetic averaging, which shows good results in tao2018ruber.
5.2 Results and Discussion
Figure 2: Scatterplots of human scores and model scores. The red line is a linear regression line, and coeff is the coefficient of the line. SSREM shows a higher positive correlation with human judgment than the other models.
Table 4 shows the Spearman and Pearson correlations between human scores and model scores. First, BLEU, ROUGE, and EMB are not correlated with human scores, which means that evaluating responses against the ground truth alone is not useful; these results match previous research D16-1230; lowe-etal-2017-towards; tao2018ruber. RUBER shows a higher correlation with human scores than the other baselines but has a high p-value, meaning low statistical significance. RSREM performs better than RUBER and the other baselines, showing that using multiple negative samples improves the learned model. Finally, SSREM outperforms all other methods on both correlations with low p-values, showing the effectiveness of speaker sensitive negative samples.
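For reference, the two correlation measures can be computed as follows. In practice a library such as scipy.stats would be used; this plain-Python sketch makes the computation explicit. The example scores reuse the SSREM and human columns of Table 1.

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy)

def spearman(x, y):
    """Spearman correlation: Pearson on the (tie-averaged) ranks."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        i = 0
        while i < len(order):
            j = i
            while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1  # average rank for tied values
            for k in range(i, j + 1):
                r[order[k]] = avg
            i = j + 1
        return r
    return pearson(ranks(x), ranks(y))

human = [5, 5, 4, 3, 1, 1]                       # human column of Table 1
model = [0.64, 0.62, 0.49, 0.44, 0.38, 0.33]     # SSREM column of Table 1
assert spearman(human, model) > 0.9
```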
Figure 2 shows scatterplots of the human and model scores. Each dot is one response, and the red line is a linear regression line. The x-axis is the human score, and the y-axis is each automatic evaluation metric. To visualize the dots better, we adopt the technique from lowe-etal-2017-towards that adds a small random number to each x-axis value; we train the linear regression on the original scores, however. BLEU and ROUGE have many zero values, since there are few overlapping words between the generated response and the ground-truth response. The dots for EMB, which uses word embeddings to overcome this limitation, are more spread out, but they show little relationship with human scores, and the linear regression coefficient is nearly flat. RUBER is better than BLEU, ROUGE, and EMB. RSREM, which uses more negative samples, performs better than RUBER. Finally, SSREM shows a higher positive correlation with human scores than the other baselines.
6 Experiment 2 - Identifying True and False Responses
The second experiment examines the performance of the context-response similarity function in SSREM by comparing it with baselines. RUBER, RSREM, and SSREM compute a score from the conversation context and the generated response. To investigate the performance of this score, we set up a task of identifying true and false responses for a given context. The true responses are ground-truth responses, and the false ones are the four kinds of negative samples described in section 3.3.
6.1 Experiment Setup
The data for this experiment is the test split of the Twitter conversation corpus. We extract contexts and true and false responses from the data. The true response is the ground-truth response, and the false responses are of the four types described in section 3.3 (Random, Same Speaker, Same Partner, and Same Conversation).
We compare SSREM with RUBER and RSREM, which also compute the similarity between a context and a response: we take the unreferenced metric score for RUBER and the output of the context-response similarity function for RSREM and SSREM. We use the same trained models as in section 5.
6.2 Results and Discussion
Figure 3 shows the results. The x-axis is the models, and the y-axis is the output of the unreferenced metric or the context-response similarity function. All models distinguish ground-truth responses from Random utterances well. But RUBER performs poorly at identifying Same Speaker, Same Partner, and Same Conversation responses, and RSREM cannot identify the false responses from the Same Conversation set. Finally, SSREM outperforms the other two models in all cases, and it maximizes the gap between ground-truth and Same Conversation responses more than the other two models. This is further evidence of the effectiveness of speaker sensitive negative samples.
One interesting result is that the output scores decrease from Same Conversation to Random, matching the observed differences among speaker sensitive utterances in section 3.1. It also means that separating ground-truth responses from Same Conversation responses is harder than separating them from Random responses, which is more evidence for why we use speaker sensitive negative samples, as discussed in section 3.3.
The Same Conversation set consists of the negative samples that are most difficult for the model to distinguish, so it might seem sensible to use only Same Conversation negative samples. But we include the other sets for two reasons. First, there are only a limited number of Same Conversation utterances, because they must all come from the same conversation, whereas we need a fairly large number of negative samples to train the model effectively Mnih:2012:FSA:3042573.3042630. Second, the other sets represent different degrees of similarity to the context utterances; training only on utterances from the same conversation would decrease model generalization.
7 Experiment 3 - Applying to a New Corpus
In this section, we investigate the applicability of SSREM to a new conversation corpus. SSREM takes its speaker sensitive samples from Twitter, but there are many other open-domain conversation corpora, such as movie scripts Danescu-Niculescu-Mizil+Lee:11a. tao2018ruber run a similar experiment with RUBER, but they use a similar domain of data, Chinese online forums (training on Douban and testing on Baidu Tieba). We choose the movie scripts corpus because it is written by script writers, whereas Twitter consists of personal casual online conversations. We present the performance of SSREM on the new corpus.
7.1 Experiment Setup
First, we annotate 1,200 responses for the movie dialog corpus. We use HRED sordoni2015hierarchical rather than VHUCM. The rest of the annotation procedure is the same as for the Twitter conversation responses in section 4. Two hundred forty-four workers tagged the responses, but 94 workers failed the attention check question, so we collect the answers of the remaining 150 workers. The inter-annotator agreement (Fleiss' kappa fleiss1971measuring) for the movie corpus is consistent with the results in lowe-etal-2017-towards and with the annotated Twitter conversations. The bottom row of Table 3 shows the basic statistics of the annotated responses.
We run two experiments: comparing with human scores and identifying true and false responses. We use the same models as in section 5, training RUBER, RSREM, and SSREM on the Twitter conversation corpus and testing them on the annotated movie dialogs. Unlike the Twitter conversation corpus, the movie dialogs have short conversations, so we use only the Random and Same Conversation sets in the second experiment.
7.2 Results and Discussion
Table 5 shows the results of comparing model scores with human scores on the movie dialogs corpus. First, BLEU, ROUGE, and EMB are not correlated with human scores. RUBER shows worse performance than when testing on the Twitter corpus. RSREM performs better than RUBER and the other baselines, but it also performs worse than on the Twitter corpus. Finally, SSREM outperforms all other methods on both correlations with low p-values, showing the effectiveness of speaker sensitive negative samples on the new corpus. Figure 4 shows similar results in scatterplots.
Figure 5 shows the results of the true and false response identification task on the movie dialogs corpus. RUBER fails to distinguish ground-truth responses from Random responses in a statistically significant way. RSREM performs better than RUBER, and SSREM outperforms the other two models in all cases on the new corpus.
8 Conclusion and Future Work
In this paper, we presented SSREM, an automatic evaluation model for conversational response generation. SSREM looks at the context of the conversation and the ground-truth response together. We proposed negative sampling with speaker sensitive samples to train SSREM. We showed that SSREM outperforms the other metrics, including RSREM, which uses only random negative samples. We also showed that SSREM is effective in evaluating a movie conversation corpus even when it is trained on Twitter conversations.
There are several future directions for improving SSREM. First, we can make SSREM more robust to adversarial attacks. sai2019reevaluating show the limitations of ADEM under adversarial attacks such as removing stopwords and replacing words with synonyms. We investigated another type of adversarial attack, the copy mechanism, which copies one of the utterances in the context as the generated response. All existing automatic evaluation methods that compare the context and the response, including RUBER, can be cheated by the copy mechanism, and SSREM is also susceptible. However, SSREM is fooled less than existing models because it learns with negative samples drawn from the utterances of the same conversation, and thus learns to differentiate among utterances in the same context. We showed this empirically in the experiment identifying true and false responses (section 6.2). Comparing the mean score of copied context utterances with the mean score of the ground-truth response (GT), the context utterances score 0.07 higher under RUBER but only 0.01 higher under SSREM. SSREM does not yet score the context utterances lower than GT, but it is not fooled as badly as RUBER. We will make SSREM more robust to such attacks.
Second, we can improve SSREM toward a higher correlation with human judgment. We chose a classification loss for SSREM because it is simple and widely used for estimating models with negative sampling, and even with this simple loss, SSREM outperforms all existing automatic evaluation models. However, as Table 2 and Figure 3 show, each type of negative sample has a different degree of similarity to the context. We will use a ranking loss 6909576; 7298682 to learn the differences among samples. Recently, Zhang2020BERTScore used BERT devlin-etal-2019-bert to evaluate generated candidate sentences against a reference sentence. We used word embeddings to represent an utterance as a vector for simplicity, but contextual embeddings generate more context-related representations, so we will also explore contextual embeddings for representing utterances.
Third, we can extend SSREM to various conversation corpora, such as task-oriented dialogues. We trained and tested SSREM on open-domain conversation corpora, but contextual coherence between the input context and the generated text is also important in multi-turn task-oriented conversations. We will apply SSREM to various conversation tasks for evaluating generated text automatically. We will explore these directions in our future work.
We would like to thank Jeongmin Byun (https://jmbyun.github.io) for building the annotation webpage, and the anonymous reviewers for helpful questions and comments. This work was supported by the Institute for Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2017-0-01779, A machine learning and statistical inference framework for explainable artificial intelligence).