Benchmarking LLM powered Chatbots: Methods and Metrics

08/08/2023
by Debarag Banerjee, et al.

Autonomous conversational agents, i.e., chatbots, are becoming an increasingly common mechanism for enterprises to provide support to customers and partners. To rate chatbots, especially ones powered by generative AI tools such as Large Language Models (LLMs), we need to be able to assess their performance accurately. This is where chatbot benchmarking becomes important. In this paper, we propose a novel benchmark that we call the E2E (End to End) benchmark, and show how it can be used to evaluate the accuracy and usefulness of the answers provided by chatbots, especially ones powered by LLMs. We evaluate an example chatbot at different levels of sophistication using both our E2E benchmark and other metrics commonly used in the state of the art, and observe that the proposed benchmark shows better results than the others. In addition, while some metrics proved to be unpredictable, the metric associated with the E2E benchmark, which uses cosine similarity, performed well in evaluating chatbots. The performance of our best models shows that there are several benefits of using the cosine similarity score as a metric in the E2E benchmark.
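
To make the cosine similarity metric concrete, the following is a minimal sketch of scoring chatbot answers against reference answers. The abstract does not say how answers are turned into vectors, so `bag_of_words` is a hypothetical stand-in (plain word counts) for whatever embedding the authors use, and `e2e_score` is an illustrative name, not the paper's API.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Toy vectorizer (assumption): lowercase word counts, not a real embedding model."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dict-like counts."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty answer has no direction, so treat it as dissimilar
    return dot / (norm_a * norm_b)


def e2e_score(pairs):
    """Hypothetical aggregate: mean cosine similarity over (chatbot_answer, reference_answer) pairs."""
    scores = [cosine_similarity(bag_of_words(c), bag_of_words(r)) for c, r in pairs]
    return sum(scores) / len(scores) if scores else 0.0
```

Under these assumptions, an answer that matches its reference word for word scores close to 1, and an answer that shares no words with it scores 0. Replacing `bag_of_words` with a sentence embedding model would let paraphrases score highly as well, which word counts cannot capture.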


