AgentSims: An Open-Source Sandbox for Large Language Model Evaluation

08/08/2023
by Jiaju Lin, et al.

With ChatGPT-like large language models (LLMs) prevailing in the community, how to evaluate the abilities of LLMs is an open question. Existing evaluation methods suffer from the following shortcomings: (1) constrained evaluation abilities, (2) vulnerable benchmarks, and (3) subjective metrics. We suggest that task-based evaluation, in which LLM agents complete tasks in a simulated environment, is a one-for-all solution to the above problems. We present AgentSims, an easy-to-use infrastructure that lets researchers from any discipline test the specific capacities they are interested in. Researchers can build their evaluation tasks by adding agents and buildings on an interactive GUI, or deploy and test new support mechanisms, i.e., memory, planning, and tool-use systems, with a few lines of code. Our demo is available at https://agentsims.com .
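To make the "few lines of code" extension claim concrete, here is a minimal Python sketch of how a plug-in boundary for swappable support mechanisms could look. Every name in it (MemorySystem, KeywordMemory, Agent, and their methods) is hypothetical and illustrative, not the actual AgentSims API; it only shows the pattern of subclassing a memory interface and handing the instance to an agent.

```python
# Hypothetical sketch of plugging a custom memory system into a
# task-based LLM agent sandbox. None of these names come from the
# AgentSims codebase; they illustrate the extension pattern only.
from abc import ABC, abstractmethod


class MemorySystem(ABC):
    """Interface a sandbox could expose for swappable memory back ends."""

    @abstractmethod
    def store(self, observation: str) -> None: ...

    @abstractmethod
    def retrieve(self, query: str, k: int = 3) -> list[str]: ...


class KeywordMemory(MemorySystem):
    """Toy memory: keeps raw observations, retrieves by keyword overlap."""

    def __init__(self) -> None:
        self.records: list[str] = []

    def store(self, observation: str) -> None:
        self.records.append(observation)

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        words = set(query.lower().split())
        # Rank stored observations by how many query words they share.
        scored = sorted(
            self.records,
            key=lambda r: len(words & set(r.lower().split())),
            reverse=True,
        )
        return scored[:k]


class Agent:
    """Minimal stand-in for a sandbox agent that consults its memory."""

    def __init__(self, name: str, memory: MemorySystem) -> None:
        self.name = name
        self.memory = memory

    def observe(self, event: str) -> None:
        self.memory.store(event)

    def plan(self, goal: str) -> str:
        context = "; ".join(self.memory.retrieve(goal))
        # A real agent would prompt an LLM here with the goal + context.
        return f"{self.name} pursues '{goal}' given memories: {context}"


if __name__ == "__main__":
    agent = Agent("barista", KeywordMemory())
    agent.observe("the customer ordered an oat-milk latte")
    agent.observe("the espresso machine is out of beans")
    print(agent.plan("serve the latte order"))
```

The design point is the abstract base class: because agents depend only on the MemorySystem interface, a researcher can test a new memory, planning, or tool-use mechanism by writing one subclass and passing it in, without touching the rest of the simulation.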


