Efficient Benchmarking (of Language Models)

08/22/2023
by Yotam Perlitz, et al.

The increasing versatility of language models (LMs) has given rise to a new class of benchmarks that comprehensively assess a broad range of capabilities. Such benchmarks are associated with massive computational costs, reaching thousands of GPU hours per model. However, the efficiency aspect of these evaluation efforts has received little discussion in the literature. In this work, we present the problem of Efficient Benchmarking, namely, intelligently reducing the computation costs of LM evaluation without compromising reliability. Using the HELM benchmark as a test case, we investigate how different benchmark design choices affect the computation-reliability tradeoff. We propose to evaluate the reliability of such decisions using a new measure, Decision Impact on Reliability (DIoR for short). We find, for example, that the current leader on HELM may change merely by removing a low-ranked model from the benchmark, and we observe that a handful of examples suffice to obtain the correct benchmark ranking. Conversely, a slightly different choice of HELM scenarios varies the ranking widely. Based on our findings, we outline a set of concrete recommendations for more efficient benchmark design and utilization practices, leading to dramatic cost savings with minimal loss of benchmark reliability, often reducing computation by x100 or more.
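The abstract does not spell out how DIoR is computed, but its two headline findings can be probed directly once per-example scores are available. The Python sketch below is a minimal illustration under stated assumptions, not the authors' implementation: it uses Kendall's tau as a stand-in ranking-agreement measure and a HELM-style mean win rate as the aggregation whose sensitivity to dropping a model is tested; all scores and names are synthetic.

```python
# Minimal sketch (not the paper's code): probing two of the abstract's
# findings with synthetic data. Assumptions: per-example 0/1 scores,
# Kendall's tau as the ranking-agreement measure, and a HELM-style
# mean win rate as the aggregation over scenarios.
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(0)
n_models, n_examples, n_scenarios = 10, 1000, 16

# Hypothetical per-example correctness, one row per model.
skill = np.linspace(0.4, 0.8, n_models)
example_scores = rng.binomial(1, skill[:, None], size=(n_models, n_examples))

# Finding 1: how few examples preserve the full-benchmark ranking?
full_means = example_scores.mean(axis=1)
for k in (10, 50, 200):
    subset = rng.choice(n_examples, size=k, replace=False)
    tau, _ = kendalltau(full_means, example_scores[:, subset].mean(axis=1))
    print(f"{k:4d} examples: Kendall tau vs. full ranking = {tau:.2f}")

# Finding 2: mean win rate depends on which models are included, so
# removing even a low-ranked model can reshuffle the top of the table.
scenario_scores = rng.beta(2, 2, size=(n_models, n_scenarios))

def mean_win_rate(scores):
    """Fraction of other models beaten, averaged over scenarios."""
    wins = (scores[:, None, :] > scores[None, :, :]).mean(axis=2)
    np.fill_diagonal(wins, 0.0)
    return wins.sum(axis=1) / (scores.shape[0] - 1)

mwr = mean_win_rate(scenario_scores)
worst = int(np.argmin(mwr))
kept = [i for i in range(n_models) if i != worst]
mwr_drop = mean_win_rate(scenario_scores[kept])
print("leader with all models:", int(np.argmax(mwr)),
      "| after dropping the lowest-ranked:", kept[int(np.argmax(mwr_drop))])
```

In the paper's setting, the per-example scores would come from actual HELM runs rather than simulation, but the same agreement computation applies: a design decision is reliable to the extent the ranking it induces agrees with the full benchmark's.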


Related research

Explainable Fuzzer Evaluation (12/19/2022)
While the aim of fuzzer evaluation is to establish fuzzer performance in...

Leveraging Knowledge and Reinforcement Learning for Enhanced Reliability of Language Models (08/25/2023)
The Natural Language Processing (NLP) community has been using crowd sour...

This is the way: designing and compiling LEPISZCZE, a comprehensive NLP benchmark for Polish (11/23/2022)
The availability of compute and data to train larger and larger language...

Prompting GPT-3 To Be Reliable (10/17/2022)
Large language models (LLMs) show impressive abilities via few-shot prom...

Characteristics of Harmful Text: Towards Rigorous Benchmarking of Language Models (06/16/2022)
Large language models produce human-like text that drive a growing numbe...

What Will it Take to Fix Benchmarking in Natural Language Understanding? (04/05/2021)
Evaluation for many natural language understanding (NLU) tasks is broken...

Benchmarking LLM powered Chatbots: Methods and Metrics (08/08/2023)
Autonomous conversational agents, i.e. chatbots, are becoming an increas...
