Evaluating Open Question Answering Evaluation

05/21/2023
by Cunxiang Wang, et al.

This study focuses on the evaluation of Open Question Answering (Open-QA) tasks, which have become vital in the realm of artificial intelligence. Current automatic evaluation methods have shown limitations, indicating that human evaluation remains the most reliable approach. We introduce a new task, QA Evaluation (QA-Eval), designed to assess how accurately an evaluation method judges AI-generated answers against standard answers in Open-QA. We evaluate these methods against human-annotated results, using accuracy and F1 score to measure their performance; methods that correlate highly with human judgments are deemed more reliable. We also discuss the pitfalls of current methods, such as their inability to correctly judge responses that contain excessive information. The dataset generated from this work is expected to facilitate the development of more effective automatic evaluation tools. We believe this new QA-Eval task and corresponding dataset will prove valuable for future research in this area.
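As a rough illustration of the accuracy and F1 metrics used to score evaluation methods against human annotations, the sketch below computes both over binary correctness judgments. It is a minimal example under assumed inputs (the `human_labels` and `method_labels` names are hypothetical), not the authors' implementation.

```python
# Minimal sketch (not the paper's implementation): scoring an automatic
# QA-Eval method against human annotations with accuracy and F1.
# A label of 1 means the evaluator judged the model's answer correct, 0 incorrect.

def accuracy(human_labels, method_labels):
    # Fraction of answers on which the method agrees with the human judgment.
    agree = sum(h == m for h, m in zip(human_labels, method_labels))
    return agree / len(human_labels)

def f1_score(human_labels, method_labels):
    # Precision/recall over the "correct" label, with the human judgment as ground truth.
    tp = sum(h == 1 and m == 1 for h, m in zip(human_labels, method_labels))
    fp = sum(h == 0 and m == 1 for h, m in zip(human_labels, method_labels))
    fn = sum(h == 1 and m == 0 for h, m in zip(human_labels, method_labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Toy example: four answers judged by humans vs. an automatic method.
human = [1, 0, 1, 1]
method = [1, 0, 0, 1]
print(f"accuracy = {accuracy(human, method):.2f}")  # 0.75
print(f"F1 = {f1_score(human, method):.2f}")        # 0.80
```

F1 complements plain accuracy when correct and incorrect answers are imbalanced, since it tracks how well the method identifies the answers humans consider correct.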

Related research

01/06/2021
SF-QA: Simple and Fair Evaluation Library for Open-domain Question Answering
Although open-domain question answering (QA) draws great attention in re...

07/21/2016
Dataset and Neural Recurrent Sequence Labeling Model for Open-Domain Factoid Question Answering
While question answering (QA) with neural network, i.e. neural QA, has a...

09/21/2023
SQUARE: Automatic Question Answering Evaluation using Multiple Positive and Negative References
Evaluation of QA systems is very challenging and expensive, with the mos...

06/05/2023
Evaluation of AI Chatbots for Patient-Specific EHR Questions
This paper investigates the use of artificial intelligence chatbots for ...

05/31/2023
Building Extractive Question Answering System to Support Human-AI Health Coaching Model for Sleep Domain
Non-communicable diseases (NCDs) are a leading cause of global deaths, n...

05/18/2023
Writing your own book: A method for going from closed to open book QA to improve robustness and performance of smaller LLMs
We introduce two novel methods, Tree-Search and Self-contextualizing QA,...

01/29/2018
Game of Sketches: Deep Recurrent Models of Pictionary-style Word Guessing
The ability of intelligent agents to play games in human-like fashion is...
