Holistic Evaluation of Language Models

11/16/2022 ∙ by Percy Liang, et al.
Language models (LMs) are becoming the foundation for almost all major language technologies, but their capabilities, limitations, and risks are not well understood. We present Holistic Evaluation of Language Models (HELM) to improve the transparency of language models. First, we taxonomize the vast space of potential scenarios (i.e. use cases) and metrics (i.e. desiderata) that are of interest for LMs. Then we select a broad subset based on coverage and feasibility, noting what's missing or underrepresented (e.g. question answering for neglected English dialects, metrics for trustworthiness). Second, we adopt a multi-metric approach: we measure 7 metrics (accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency) for each of 16 core scenarios when possible (87.5% of the time). This ensures that metrics beyond accuracy don't fall by the wayside, and that trade-offs are clearly exposed. We also perform 7 targeted evaluations, based on 26 targeted scenarios, to analyze specific aspects (e.g. reasoning, disinformation). Third, we conduct a large-scale evaluation of 30 prominent language models (spanning open, limited-access, and closed models) on all 42 scenarios, 21 of which were not previously used in mainstream LM evaluation. Prior to HELM, models on average were evaluated on just 17.9% of the core HELM scenarios, with some prominent models not sharing a single scenario in common. We improve this to 96.0%: now all 30 models have been densely benchmarked on the same core scenarios and metrics under standardized conditions. Our evaluation surfaces 25 top-level findings. For full transparency, we release all raw model prompts and completions publicly for further analysis, as well as a general modular toolkit. We intend for HELM to be a living benchmark for the community, continuously updated with new scenarios, metrics, and models.
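To make the coverage statistics above concrete, the sketch below shows one way to compute the density of a model × scenario evaluation matrix (the quantity behind the 17.9% vs. 96.0% figures). This is a minimal illustration in Python, not HELM's actual toolkit API; the model and scenario names are placeholders.

```python
# Illustrative sketch (not HELM's toolkit API): coverage density of a
# model x core-scenario matrix, i.e. the fraction of (model, scenario)
# pairs for which an evaluation result exists.

from typing import Dict, Set


def coverage_density(results: Dict[str, Set[str]], core_scenarios: Set[str]) -> float:
    """Fraction of (model, core scenario) pairs that have been evaluated."""
    total_pairs = len(results) * len(core_scenarios)
    evaluated_pairs = sum(len(scenarios & core_scenarios) for scenarios in results.values())
    return evaluated_pairs / total_pairs if total_pairs else 0.0


# Hypothetical example: 3 models, 4 core scenarios (names are placeholders).
core = {"mmlu", "natural_qa", "imdb", "civil_comments"}
before = {
    "model_a": {"mmlu"},
    "model_b": {"natural_qa", "imdb"},
    "model_c": set(),  # shares no evaluated scenario with the other models
}
after = {model: set(core) for model in before}  # dense benchmarking on shared scenarios

print(f"coverage before: {coverage_density(before, core):.1%}")  # 25.0%
print(f"coverage after:  {coverage_density(after, core):.1%}")   # 100.0%
```

Dense coverage is what makes scores comparable: once every model has been run on the same scenarios and metrics under standardized conditions, differences in the numbers reflect the models rather than gaps in the evaluation matrix.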

