KoLA: Carefully Benchmarking World Knowledge of Large Language Models

06/15/2023
by Jifan Yu, et al.

The unprecedented performance of large language models (LLMs) necessitates improvements in evaluations. Rather than merely exploring the breadth of LLM abilities, we believe meticulous and thoughtful designs are essential to thorough, unbiased, and applicable evaluations. Given the importance of world knowledge to LLMs, we construct a Knowledge-oriented LLM Assessment benchmark (KoLA), in which we carefully design three crucial factors: (1) For ability modeling, we mimic human cognition to form a four-level taxonomy of knowledge-related abilities, covering 19 tasks. (2) For data, to ensure fair comparisons, we use both Wikipedia, a corpus on which LLMs are prevalently pre-trained, and continuously collected emerging corpora, aiming to evaluate the capacity to handle unseen data and evolving knowledge. (3) For evaluation criteria, we adopt a contrastive system, including overall standard scores, for better numerical comparability across tasks and models, and a unique self-contrast metric for automatically evaluating knowledge hallucination. We evaluate 21 open-source and commercial LLMs and obtain some intriguing findings. The KoLA dataset and open-participation leaderboard are publicly released at https://kola.xlore.cn and will be continuously updated to provide references for developing LLMs and knowledge-related systems.
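To make the contrastive evaluation criteria concrete, below is a minimal Python sketch of the idea described in the abstract. It assumes the overall standard score is a per-task, z-score-style standardization of raw metric values across models, and it substitutes a crude token-overlap F1 for the similarity used by the self-contrast hallucination check. The function names (standard_scores, self_contrast, token_f1) and the example numbers are illustrative assumptions, not KoLA's actual implementation, whose exact definitions are given in the paper and leaderboard code.

```python
# Illustrative sketch only: assumes a z-score style "standard score" per task and a
# token-overlap stand-in for the self-contrast similarity; not KoLA's exact formulas.
from statistics import mean, pstdev


def standard_scores(raw_scores: dict[str, float]) -> dict[str, float]:
    """Standardize one task's raw metric values across models (assumed z-score form)."""
    mu = mean(raw_scores.values())
    sigma = pstdev(raw_scores.values()) or 1.0  # guard against zero variance
    return {model: (score - mu) / sigma for model, score in raw_scores.items()}


def token_f1(a: str, b: str) -> float:
    """Crude token-overlap F1 standing in for the paper's similarity metric."""
    ta, tb = a.lower().split(), b.lower().split()
    common = len(set(ta) & set(tb))
    if common == 0:
        return 0.0
    precision, recall = common / len(ta), common / len(tb)
    return 2 * precision * recall / (precision + recall)


def self_contrast(with_knowledge: str, without_knowledge: str) -> float:
    """Compare completions generated with and without the gold knowledge in the prompt;
    the directionality (higher similarity -> less hallucination) is assumed here."""
    return token_f1(with_knowledge, without_knowledge)


if __name__ == "__main__":
    # Hypothetical raw F1 results of three models on one knowledge task.
    raw = {"model_a": 0.62, "model_b": 0.48, "model_c": 0.55}
    print(standard_scores(raw))
    print(self_contrast("Marie Curie won two Nobel Prizes",
                        "Marie Curie won two Nobel Prizes in physics and chemistry"))
```

Standardizing per task keeps results computed with different raw metrics on a comparable numerical scale across models, which is the comparability property the abstract highlights.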



Related research

08/28/2023
ZhuJiu: A Multi-dimensional, Multi-faceted Chinese Benchmark for Large Language Models
The unprecedented performance of large language models (LLMs) requires c...

05/16/2023
Retentive or Forgetful? Diving into the Knowledge Memorizing Mechanism of Language Models
Memory is one of the most essential cognitive functions serving as a rep...

02/28/2022
KMIR: A Benchmark for Evaluating Knowledge Memorization, Identification and Reasoning Abilities of Language Models
Previous works show the great potential of pre-trained language models (...

04/29/2022
TemporalWiki: A Lifelong Benchmark for Training and Evaluating Ever-Evolving Language Models
Language Models (LMs) become outdated as the world changes; they often f...

08/21/2023
RaLLe: A Framework for Developing and Evaluating Retrieval-Augmented Large Language Models
Retrieval-augmented large language models (R-LLMs) combine pre-trained l...

08/07/2023
AgentBench: Evaluating LLMs as Agents
Large Language Models (LLMs) are becoming increasingly smart and autonom...

07/15/2023
Zero-shot NLG Evaluation through Pairwise Comparisons with LLMs
Evaluating Natural Language Generation (NLG) outputs is crucial but labo...
