Approximating Human Evaluation of Social Chatbots with Prompting

04/11/2023
by Ekaterina Svikhnushina et al.

Once powerful conversational models became available to a wide audience, users started actively engaging in social interactions with this technology. Such unprecedented interaction experiences may pose considerable social and psychological risks to users unless the technology is properly controlled, which creates an urgent need for scalable and robust evaluation metrics for conversational chatbots. Existing automatic evaluation metrics usually focus on objective quality measures and disregard subjective perceptions of social dimensions. Moreover, most of these approaches operate on pre-produced dialogs from available benchmark corpora, which implies human involvement in preparing the material for evaluation and thus impedes the scalability of the metrics. To address this limitation, we propose to make use of the emerging large language models (LLMs) from the GPT family and describe a new framework that allows dialog system evaluation to be conducted with prompting. With this framework, we are able to fully automate the evaluation pipeline and reach an impressive correlation with human judgment (up to Pearson r = 0.95 at the system level). The underlying concept is to collect synthetic chat logs of the evaluated bots with an LLM in the other-play setting, where the LLM is carefully conditioned to follow a specific scenario. We further explore different prompting approaches to produce evaluation scores with the same LLM. The best-performing prompts, which contain few-shot demonstrations and instructions, show outstanding performance on the tested dataset and demonstrate the ability to generalize to other dialog corpora.
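
The abstract describes a two-stage pipeline: an LLM first other-plays a user persona against the evaluated chatbot to produce a synthetic chat log, and the same LLM then scores that log from a prompt containing instructions and few-shot demonstrations; metric quality is measured as system-level Pearson correlation with human judgments. The following Python sketch only illustrates that flow under stated assumptions: query_llm, chatbot_reply, the prompt wording, and the 1-5 rating scale are hypothetical placeholders, not the authors' actual prompts or code.

```python
# Illustrative sketch of the prompting-based evaluation pipeline from the abstract:
# (1) collect a synthetic chat log by letting an LLM "other-play" a user following a
#     scenario against the evaluated chatbot, and (2) score the dialog with a few-shot prompt.
# query_llm and the prompt texts are placeholders, not the paper's prompts.

from typing import Callable, List, Tuple
from scipy.stats import pearsonr


def query_llm(prompt: str) -> str:
    """Placeholder for a call to a GPT-family model (plug in your own API client)."""
    raise NotImplementedError("connect this to an LLM client")


def format_dialog(history: List[Tuple[str, str]]) -> str:
    """Render the (speaker, utterance) history as plain text for prompting."""
    return "\n".join(f"{speaker}: {utterance}" for speaker, utterance in history)


def collect_chat_log(chatbot_reply: Callable[[List[Tuple[str, str]]], str],
                     scenario: str, n_turns: int = 6) -> List[Tuple[str, str]]:
    """Other-play: the LLM impersonates a user conditioned on `scenario`; the bot answers."""
    history: List[Tuple[str, str]] = []
    for _ in range(n_turns):
        user_prompt = (
            f"You are a user in the following scenario: {scenario}\n"
            f"Conversation so far:\n{format_dialog(history)}\n"
            "Write the user's next message only."
        )
        history.append(("User", query_llm(user_prompt)))
        history.append(("Bot", chatbot_reply(history)))
    return history


def score_dialog(history: List[Tuple[str, str]],
                 instructions: str, few_shot_demos: str) -> float:
    """Few-shot scoring prompt: instructions + demonstrations + the dialog to rate."""
    prompt = (
        f"{instructions}\n\n{few_shot_demos}\n\n"
        f"Dialog:\n{format_dialog(history)}\nScore (1-5):"
    )
    return float(query_llm(prompt).strip())


def system_level_correlation(auto_scores: List[float], human_scores: List[float]) -> float:
    """Pearson r between automatic and human scores aggregated per chatbot."""
    r, _ = pearsonr(auto_scores, human_scores)
    return r
```

Routing both dialog collection and scoring through the same query_llm placeholder mirrors the framework's use of a single GPT-family model for both stages; in practice the two stages could use different prompts or model configurations.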

Related research

Investigating the Impact of Pre-trained Language Models on Dialog Evaluation (10/05/2021)
Recently, there is a surge of interest in applying pre-trained language ...

A Comprehensive Assessment of Dialog Evaluation Metrics (06/07/2021)
Automatic evaluation metrics are a crucial component of dialog systems r...

Social Biases in Automatic Evaluation Metrics for NLG (10/17/2022)
Many studies have revealed that word embeddings, language models, and mo...

Beyond User Self-Reported Likert Scale Ratings: A Comparison Model for Automatic Dialog Evaluation (05/21/2020)
Open Domain dialog system evaluation is one of the most important challe...

Toward More Accurate and Generalizable Evaluation Metrics for Task-Oriented Dialogs (06/06/2023)
Measurement of interaction quality is a critical task for the improvemen...

Opportunities and Challenges in Neural Dialog Tutoring (01/24/2023)
Designing dialog tutors has been challenging as it involves modeling the...

Human-Centered Metrics for Dialog System Evaluation (05/24/2023)
We present metrics for evaluating dialog systems through a psychological...
