AgentBench: Evaluating LLMs as Agents

08/07/2023
by   Xiao Liu, et al.

Large Language Models (LLMs) are becoming increasingly smart and autonomous, targeting real-world pragmatic missions beyond traditional NLP tasks. As a result, evaluating LLMs as agents on challenging tasks in interactive environments has become an urgent need. We present AgentBench, a multi-dimensional, evolving benchmark that currently consists of 8 distinct environments to assess an LLM-as-Agent's reasoning and decision-making abilities in a multi-turn, open-ended generation setting. Our extensive tests over 25 LLMs (including API-based and open-source models) show that, while top commercial LLMs exhibit a strong ability to act as agents in complex environments, there is a significant performance disparity between them and their open-source competitors. AgentBench also serves as a component of an ongoing project with wider coverage and deeper consideration towards systematic LLM evaluation. Datasets, environments, and an integrated evaluation package for AgentBench are released at https://github.com/THUDM/AgentBench.
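The multi-turn, open-ended setting the abstract describes can be sketched as a simple agent-environment evaluation loop. The code below is a toy illustration, not AgentBench's actual interface: `GuessEnv`, `scripted_agent`, and `evaluate` are hypothetical names, and the scripted agent stands in for a real LLM call that would map the dialogue history to the next action.

```python
# Minimal sketch of a multi-turn agent evaluation loop.
# All names here are illustrative stand-ins, NOT AgentBench's API.

class GuessEnv:
    """Toy environment: the agent succeeds by outputting the target string."""
    def __init__(self, target, max_turns=4):
        self.target = target
        self.max_turns = max_turns
        self.turn = 0

    def reset(self):
        self.turn = 0
        return "Guess the secret word."

    def step(self, action):
        self.turn += 1
        success = action == self.target
        done = success or self.turn >= self.max_turns
        reward = 1.0 if success else 0.0
        observation = "correct" if success else "wrong, try again"
        return observation, reward, done

def scripted_agent(history):
    """Stand-in for an LLM call: returns a canned answer per agent turn."""
    scripted = ["apple", "banana"]
    n_replies = sum(1 for role, _ in history if role == "agent")
    return scripted[min(n_replies, len(scripted) - 1)]

def evaluate(agent, env):
    """Run one multi-turn episode; return the final reward and turn count."""
    history = [("env", env.reset())]
    reward, done = 0.0, False
    while not done:
        action = agent(history)          # agent conditions on full history
        history.append(("agent", action))
        obs, reward, done = env.step(action)
        history.append(("env", obs))
    return reward, len(history)

reward, turns = evaluate(scripted_agent, GuessEnv("banana"))
```

A real harness would replace `scripted_agent` with a model query and average the episode rewards over many tasks per environment; the loop structure stays the same.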


Related research

08/28/2023
ZhuJiu: A Multi-dimensional, Multi-faceted Chinese Benchmark for Large Language Models
The unprecedented performance of large language models (LLMs) requires c...

08/11/2023
BOLAA: Benchmarking and Orchestrating LLM-augmented Autonomous Agents
The massive successes of large language models (LLMs) encourage the emer...

08/08/2023
AgentSims: An Open-Source Sandbox for Large Language Model Evaluation
With ChatGPT-like large language models (LLM) prevailing in the communit...

11/16/2020
NLPGym – A toolkit for evaluating RL agents on Natural Language Processing Tasks
Reinforcement learning (RL) has recently shown impressive performance in...

05/03/2022
ElitePLM: An Empirical Study on General Language Ability Evaluation of Pretrained Language Models
Nowadays, pretrained language models (PLMs) have dominated the majority ...

05/25/2023
Ghost in the Minecraft: Generally Capable Agents for Open-World Environments via Large Language Models with Text-based Knowledge and Memory
The captivating realm of Minecraft has attracted substantial research in...

06/15/2023
KoLA: Carefully Benchmarking World Knowledge of Large Language Models
The unprecedented performance of large language models (LLMs) necessitat...
