ToolQA: A Dataset for LLM Question Answering with External Tools

06/23/2023
by Yuchen Zhuang, et al.

Large Language Models (LLMs) have demonstrated impressive performance in various NLP tasks, but they still suffer from challenges such as hallucination and weak numerical reasoning. To overcome these challenges, external tools can be used to enhance LLMs' question-answering abilities. However, current evaluation methods do not distinguish between questions that can be answered using LLMs' internal knowledge and those that require external information through tool use. To address this issue, we introduce a new dataset called ToolQA, which is designed to faithfully evaluate LLMs' ability to use external tools for question answering. Our development of ToolQA involved a scalable, automated process for dataset curation, along with 13 specialized tools designed for interaction with external knowledge in order to answer questions. Importantly, we strive to minimize the overlap between our benchmark data and LLMs' pre-training data, enabling a more precise evaluation of LLMs' tool-use reasoning abilities. We conducted an in-depth diagnosis of existing tool-use LLMs to highlight their strengths, weaknesses, and potential improvements. Our findings set a new benchmark for evaluating LLMs and suggest new directions for future advancements. Our data and code are freely available to the broader scientific community on GitHub.

research
04/20/2023

Why Does ChatGPT Fall Short in Answering Questions Faithfully?

Recent advancements in Large Language Models, such as ChatGPT, have demo...
research
09/20/2023

LLM Guided Inductive Inference for Solving Compositional Problems

While large language models (LLMs) have demonstrated impressive performa...
research
05/23/2023

Pre-training Language Models for Comparative Reasoning

In this paper, we propose a novel framework to pre-train language models...
research
05/23/2023

CREATOR: Disentangling Abstract and Concrete Reasonings of Large Language Models through Tool Creation

Large Language Models (LLMs) have demonstrated significant progress in u...
research
08/01/2023

Structural Embeddings of Tools for Large Language Models

It is evident that the current state of Large Language Models (LLMs) nec...
research
12/17/2021

ActKnow: Active External Knowledge Infusion Learning for Question Answering in Low Data Regime

Deep learning models have set benchmark results in various Natural Langu...
research
05/29/2023

A Systematic Study and Comprehensive Evaluation of ChatGPT on Benchmark Datasets

The development of large language models (LLMs) such as ChatGPT has brou...
