TheoremQA: A Theorem-driven Question Answering dataset

05/21/2023
by   Wenhu Chen, et al.

Recent LLMs such as GPT-4 and PaLM-2 have made tremendous progress on fundamental math problems, achieving over 90% accuracy on GSM8K. However, their ability to solve more challenging math problems that require domain-specific knowledge (i.e., theorems) has yet to be investigated. In this paper, we introduce TheoremQA, the first theorem-driven question-answering dataset designed to evaluate AI models' ability to apply theorems to solve challenging science problems. TheoremQA is curated by domain experts and contains 800 high-quality questions covering 350 theorems (e.g., Taylor's theorem, Lagrange's theorem, Huffman coding, Quantum Theorem, Elasticity Theorem) from Math, Physics, EE&CS, and Finance. We evaluate a wide spectrum of 16 large language and code models with different prompting strategies such as Chain-of-Thoughts and Program-of-Thoughts. We find that GPT-4's ability to solve these problems is unparalleled, achieving an accuracy of 51% with Program-of-Thoughts prompting. All existing open-source models score below 15%, barely surpassing the random-guess baseline. Given the diversity and broad coverage of TheoremQA, we believe it can serve as a better benchmark for evaluating LLMs' ability to solve challenging science problems. The data and code are released at https://github.com/wenhuchen/TheoremQA.
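To make the Program-of-Thoughts setting more concrete, the following is a minimal sketch of how such an evaluation could work: the model writes a short program rather than a prose derivation, the program is executed, and its result is compared with a reference answer. Everything here is an illustrative assumption rather than the released TheoremQA code: the `GENERATED_PROGRAM` string is a hand-written stand-in for a model response, the helper names `run_program`, `is_correct`, and `accuracy` are hypothetical, and the 1% relative tolerance is chosen only for the example.

```python
import math

# Hypothetical model output: under Program-of-Thoughts prompting the model
# emits code instead of a prose derivation. This string is an illustrative
# stand-in, not an actual model response or dataset item.
GENERATED_PROGRAM = """
import math

def solver():
    # Third-order Taylor polynomial of e^x around 0, evaluated at x = 0.5
    x = 0.5
    return sum(x**k / math.factorial(k) for k in range(4))

ans = solver()
"""


def run_program(source: str) -> float:
    """Execute generated code and return the value it binds to `ans`."""
    namespace: dict = {}
    # exec is unsafe on untrusted code; a real harness would sandbox this.
    exec(source, namespace)
    return float(namespace["ans"])


def is_correct(predicted: float, reference: float, rel_tol: float = 1e-2) -> bool:
    """Numeric match within a relative tolerance (assumed, not the paper's rule)."""
    return math.isclose(predicted, reference, rel_tol=rel_tol)


def accuracy(programs: list[str], references: list[float]) -> float:
    """Fraction of programs whose executed answer matches its reference."""
    hits = 0
    for source, reference in zip(programs, references):
        try:
            hits += is_correct(run_program(source), reference)
        except Exception:
            # A program that crashes or never sets `ans` counts as wrong.
            pass
    return hits / len(references)
```

Counting failed executions as wrong answers keeps the metric a plain fraction of solved questions, which is one reasonable reading of the accuracy figures quoted above.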

