ARB: Advanced Reasoning Benchmark for Large Language Models

07/25/2023
by Tomohiro Sawada, et al.

Large Language Models (LLMs) have demonstrated remarkable performance on various quantitative reasoning and knowledge benchmarks. However, many of these benchmarks are losing utility as LLMs get increasingly high scores, despite not yet reaching expert performance in these domains. We introduce ARB, a novel benchmark composed of advanced reasoning problems in multiple fields. ARB presents a more challenging test than prior benchmarks, featuring problems in mathematics, physics, biology, chemistry, and law. As a subset of ARB, we introduce a challenging set of math and physics problems which require advanced symbolic reasoning and domain knowledge. We evaluate recent models such as GPT-4 and Claude on ARB and demonstrate that current models score well below 50% on more demanding tasks. In order to improve both automatic and assisted evaluation capabilities, we introduce a rubric-based evaluation approach, allowing GPT-4 to score its own intermediate reasoning steps. Further, we conduct a human evaluation of the symbolic subset of ARB, finding promising agreement between annotators and GPT-4 rubric evaluation scores.

research
05/24/2023

Have LLMs Advanced Enough? A Challenging Problem Solving Benchmark For Large Language Models

The performance of Large Language Models (LLMs) on existing reasoning be...
research
06/29/2022

Solving Quantitative Reasoning Problems with Language Models

Language models have achieved remarkable performance on a wide range of ...
research
12/15/2022

Improving Chess Commentaries by Combining Language Models with Symbolic Reasoning Engines

Despite many recent advancements in language modeling, state-of-the-art ...
research
05/29/2023

Code Prompting: a Neural Symbolic Method for Complex Reasoning in Large Language Models

Large language models (LLMs) have scaled up to unlock a wide range of co...
research
06/02/2023

Evaluating Language Models for Mathematics through Interactions

The standard methodology of evaluating large language models (LLMs) base...
research
06/21/2023

Understanding Social Reasoning in Language Models with Language Models

As Large Language Models (LLMs) become increasingly integrated into our ...
research
05/23/2023

LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond

With the recent appearance of LLMs in practical settings, having methods...
