SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models

07/20/2023
by Xiaoxuan Wang, et al.

Recent advances in large language models (LLMs) have demonstrated notable progress on many mathematical benchmarks. However, most of these benchmarks only feature problems grounded in junior and senior high school subjects, contain only multiple-choice questions, and are confined to a limited scope of elementary arithmetic operations. To address these issues, this paper introduces an expansive benchmark suite, SciBench, that aims to systematically examine the reasoning capabilities required for complex scientific problem solving. SciBench contains two carefully curated datasets: an open set featuring a range of collegiate-level scientific problems drawn from mathematics, chemistry, and physics textbooks, and a closed set comprising problems from undergraduate-level exams in computer science and mathematics. Based on the two datasets, we conduct an in-depth benchmark study of two representative LLMs with various prompting strategies. The results reveal that current LLMs fall short of delivering satisfactory performance, with an overall score of merely 35.80%. Furthermore, we categorize the errors made by LLMs into ten problem-solving abilities. Our analysis indicates that no single prompting strategy significantly outperforms the others, and that some strategies that improve certain problem-solving skills cause declines in other skills. We envision that SciBench will catalyze further developments in the reasoning abilities of LLMs, thereby ultimately contributing to scientific research and discovery.
