MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback

09/19/2023
by   Xingyao Wang, et al.

To solve complex tasks, large language models (LLMs) often require multiple rounds of interaction with the user, sometimes assisted by external tools. However, current evaluation paradigms often focus solely on benchmark performance with single-turn exchanges, neglecting the nuanced interactions among the user, LLMs, and external tools, and creating a discrepancy between benchmark evaluation and real-world use cases. We introduce MINT, a benchmark that evaluates LLMs' ability to solve tasks with multi-turn interactions by (1) using tools and (2) leveraging natural language feedback. To ensure reproducibility, we provide an evaluation framework where LLMs can access tools by executing Python code and receive natural language feedback from a user simulated with GPT-4. We repurpose a diverse set of established datasets focusing on reasoning, coding, and decision-making, and carefully curate them into a compact subset of instances for efficient evaluation. Our analysis of 20 open- and closed-source LLMs offers intriguing findings. (1) LLMs generally benefit from tool use and language feedback, with performance gains (absolute, same below) of 1–8% for each turn of tool use and 2–17% with natural language feedback. (2) Better single-turn performance does not guarantee better multi-turn performance. (3) Surprisingly, on the LLMs we evaluated, we found that supervised instruction-finetuning (SIFT) and reinforcement learning from human feedback (RLHF) generally hurt multi-turn capabilities. We hope MINT can help measure progress and incentivize research into improving LLMs' multi-turn interaction capabilities, especially for open-source communities, where multi-turn human evaluation has been less accessible than for commercial LLMs with a larger user base.
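The evaluation loop the abstract describes — the model proposes Python code, the environment executes it and returns the result, and a simulated user supplies natural language feedback — can be sketched as below. This is a minimal illustration, not the MINT implementation: the function names (`execute_tool_call`, `interaction_loop`), the action dictionary shape, and the `feedback_provider` callback are all hypothetical, and a real harness would sandbox execution rather than call `exec` directly.

```python
import io
import contextlib

def execute_tool_call(code: str) -> str:
    """Execute model-proposed Python code; capture stdout as the observation."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, {})  # illustration only; a real harness would sandbox this
    except Exception as exc:
        return f"Error: {exc!r}"
    return buf.getvalue()

def interaction_loop(model, feedback_provider, task: str, max_turns: int = 5):
    """Multi-turn loop: model acts, environment executes, a simulated user
    (GPT-4 in MINT's setup) returns natural language feedback each turn."""
    history = [("task", task)]
    for _ in range(max_turns):
        action = model(history)  # returns {"code": ..., "final_answer": ...}
        if action.get("final_answer") is not None:
            return action["final_answer"], history
        history.append(("observation", execute_tool_call(action["code"])))
        history.append(("feedback", feedback_provider(history)))
    return None, history  # turn budget exhausted without a final answer
```

A toy "model" that uses one tool call before answering shows the turn structure: task, observation, feedback, then the final answer, which is how MINT can credit both tool use and feedback incorporation.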


