Have LLMs Advanced Enough? A Challenging Problem Solving Benchmark For Large Language Models

05/24/2023
by Daman Arora, et al.

The performance of Large Language Models (LLMs) on existing reasoning benchmarks has shot up considerably over the past years. In response, we present JEEBench, a considerably more challenging benchmark dataset for evaluating the problem-solving abilities of LLMs. We curate 450 challenging pre-engineering mathematics, physics, and chemistry problems from the IIT JEE-Advanced exam. Long-horizon reasoning on top of deep in-domain knowledge is essential for solving problems in this benchmark. Our evaluation on the GPT series of models reveals that although performance improves with newer models, the best being GPT-4, the highest score, even after using techniques like Self-Consistency and Chain-of-Thought prompting, is less than 40 percent. Our analysis demonstrates that errors in algebraic manipulation and failure to retrieve relevant domain-specific concepts are the primary contributors to GPT-4's low performance. Given the challenging nature of the benchmark, we hope that it can guide future research in problem solving using LLMs. Our code and dataset are available here.
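To make the evaluation setup concrete, the sketch below shows what Self-Consistency over Chain-of-Thought samples typically looks like for a JEEBench-style problem. It is a minimal illustration, not the authors' actual harness: the query_model wrapper, the prompt wording, and the answer-extraction heuristic are all assumptions standing in for whichever model and parsing the paper actually uses.

```python
# Minimal sketch of self-consistency evaluation on one problem.
# query_model() is a hypothetical placeholder for an LLM call (e.g. a GPT model).
from collections import Counter


def query_model(prompt: str) -> str:
    """Placeholder: return the model's full text response for the prompt."""
    raise NotImplementedError("wire this to your model of choice")


def self_consistency_answer(question: str, n_samples: int = 8) -> str:
    """Sample several chain-of-thought solutions and majority-vote the final answers."""
    cot_prompt = (
        "Solve the following problem step by step, "
        "then state only the final answer on the last line.\n\n" + question
    )
    answers = []
    for _ in range(n_samples):
        solution = query_model(cot_prompt)  # one sampled reasoning chain
        lines = solution.strip().splitlines() or [""]
        answers.append(lines[-1])  # crude heuristic: last line holds the answer
    # Self-consistency: the most frequent final answer across samples wins.
    return Counter(answers).most_common(1)[0][0]
```

With a real model plugged into query_model, the accuracy reported in the abstract corresponds to scoring self_consistency_answer against the gold answer for each of the 450 problems.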


Related research

07/25/2023: ARB: Advanced Reasoning Benchmark for Large Language Models
Large Language Models (LLMs) have demonstrated remarkable performance on...

06/02/2023: An Empirical Study on Challenging Math Problem Solving with GPT-4
Employing Large Language Models (LLMs) to address mathematical problems ...

07/20/2023: SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models
Recent advances in large language models (LLMs) have demonstrated notabl...

12/20/2022: True Detective: A Challenging Benchmark for Deep Abductive Reasoning in Foundation Models
Large language models (LLMs) have demonstrated strong performance in zer...

07/11/2023: Unleashing Cognitive Synergy in Large Language Models: A Task-Solving Agent through Multi-Persona Self-Collaboration
Human intelligence thrives on the concept of cognitive synergy, where co...

08/22/2023: Diversity Measures: Domain-Independent Proxies for Failure in Language Model Queries
Error prediction in large language models often relies on domain-specifi...

08/21/2023: LatEval: An Interactive LLMs Evaluation Benchmark with Incomplete Information from Lateral Thinking Puzzles
With the continuous evolution and refinement of LLMs, they are endowed w...
