clembench: Using Game Play to Evaluate Chat-Optimized Language Models as Conversational Agents

05/22/2023
by   Kranti Chalamalasetti, et al.

Recent work has proposed a methodology for the systematic evaluation of "Situated Language Understanding Agents" (agents that operate in rich linguistic and non-linguistic contexts) through testing them in carefully constructed interactive settings. Other recent work has argued that Large Language Models (LLMs), if suitably set up, can be understood as (simulators of) such agents. A connection suggests itself, which this paper explores: Can LLMs be evaluated meaningfully by exposing them to constrained game-like settings that are built to challenge specific capabilities? As a proof of concept, this paper investigates five interaction settings, showing that current chat-optimised LLMs are, to an extent, capable of following game-play instructions. Both this capability and the quality of the game play, measured by how well the objectives of the different games are met, follow the development cycle, with newer models performing better. The metrics, even for the comparatively simple example games, are far from being saturated, suggesting that the proposed instrument will continue to have diagnostic value. Our general framework for implementing and evaluating games with LLMs is available at https://github.com/clp-research/clembench.
