Analysing Results from AI Benchmarks: Key Indicators and How to Obtain Them

Item response theory (IRT) can be applied to analyse the results of AI benchmarks. The two-parameter IRT model provides two indicators on the side of the item (or AI problem), difficulty and discrimination, but only one indicator on the side of the respondent (or AI agent), ability. In this paper we analyse how to make this set of indicators dual by adding a fourth indicator, generality, on the side of the respondent. Generality is meant to be dual to discrimination and is based on difficulty: it measures whether an agent is consistently good at easy problems and bad at difficult ones. With the addition of generality, this set of four key indicators gives more insight into the results of AI benchmarks. In particular, we explore two popular AI benchmarks, the Arcade Learning Environment (Atari 2600 games) and the General Video Game AI competition. We provide some guidelines for estimating and interpreting these indicators for other AI benchmarks and competitions.
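
As a rough illustration of the two-parameter model mentioned above, the sketch below evaluates the standard 2PL logistic item characteristic curve, which gives the probability that a respondent with a given ability succeeds on an item with a given difficulty and discrimination. The function name `p_correct` and all numeric values are hypothetical and chosen only for illustration; they are not taken from the paper, and the paper's own estimation procedure and its definition of generality are not reproduced here.

```python
import math


def p_correct(ability, difficulty, discrimination):
    """Standard 2PL item characteristic curve.

    Returns the modelled probability that a respondent (AI agent) with the
    given ability succeeds on an item (AI problem) with the given difficulty
    and discrimination.
    """
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))


# Hypothetical agent and items, for illustration only.
agent_ability = 1.0
items = {
    "easy problem": (-1.0, 1.5),  # (difficulty, discrimination)
    "hard problem": (2.0, 1.5),
}

for name, (difficulty, discrimination) in items.items():
    prob = p_correct(agent_ability, difficulty, discrimination)
    print(f"{name}: P(success) = {prob:.3f}")
```

When ability equals difficulty the modelled success probability is exactly 0.5, and a larger discrimination makes the curve steeper around that point, so the item separates agents just below and just above its difficulty more sharply.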
