Evaluation of ChatGPT as a Question Answering System for Answering Complex Questions

by Yiming Tan et al.

ChatGPT is a powerful large language model (LLM) that has made remarkable progress in natural language understanding. Nevertheless, the performance and limitations of the model still need to be extensively evaluated. As ChatGPT covers resources such as Wikipedia and supports natural language question answering, it has garnered attention as a potential replacement for traditional knowledge-based question answering (KBQA) models. Complex question answering is a challenging KBQA task that comprehensively tests a model's ability in semantic parsing and reasoning. To assess the performance of ChatGPT as a question answering system (QAS) using its own knowledge, we present a framework that evaluates its ability to answer complex questions. Our approach categorizes the potential features of complex questions and describes each test question with multiple labels to identify combinatorial reasoning. Following the black-box testing specifications of CheckList proposed by Ribeiro et al., we develop an evaluation method to measure the functionality and reliability of ChatGPT in reasoning for answering complex questions. We use the proposed framework to evaluate the question answering performance of ChatGPT on 8 real-world KB-based CQA datasets, including 6 English and 2 multilingual datasets, with a total of approximately 190,000 test cases. We compare the evaluation results of ChatGPT, GPT-3.5, GPT-3, and FLAN-T5 to identify common long-term problems in LLMs. The dataset and code are available at https://github.com/tan92hl/Complex-Question-Answering-Evaluation-of-ChatGPT.
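
As a rough illustration of the multi-label idea described above, the sketch below scores a model per question label, so that a question tagged with several labels contributes to each of them. It is only a minimal sketch under stated assumptions: the label names, the record layout, and the lenient substring-matching rule are hypothetical and are not taken from the paper or its repository.

```python
from collections import defaultdict


def accuracy_by_label(cases):
    """Compute accuracy per label over multi-labeled test cases.

    Each case is a dict with (hypothetical) keys:
      'labels' - list of feature labels attached to the question
      'gold'   - list of acceptable answer strings
      'answer' - the model's free-text answer
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for case in cases:
        answer = case["answer"].lower()
        # Assumed matching rule: the answer counts as correct if any gold
        # string occurs in it. The paper's actual metric may differ.
        correct = any(gold.lower() in answer for gold in case["gold"])
        for label in case["labels"]:
            totals[label] += 1
            hits[label] += int(correct)
    return {label: hits[label] / totals[label] for label in totals}


# Toy test cases; the questions' labels and answers are made up.
cases = [
    {"labels": ["multi-hop", "count"], "gold": ["3"],
     "answer": "There are 3 such films."},
    {"labels": ["multi-hop"], "gold": ["Paris"],
     "answer": "I believe it is Lyon."},
    {"labels": ["comparison"], "gold": ["Nile"],
     "answer": "The Nile is longer."},
]

scores = accuracy_by_label(cases)
for label, score in sorted(scores.items()):
    print(f"{label}: {score:.2f}")
```

Grouping results by label rather than by dataset is what lets a weakness in one reasoning type show up even when overall accuracy looks acceptable.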




