Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification

08/15/2023
by Aojun Zhou, et al.

Recent progress in large language models (LLMs) such as GPT-4 and PaLM-2 has brought significant advances in math reasoning. In particular, OpenAI's latest version of GPT-4, known as GPT-4 Code Interpreter, shows remarkable performance on challenging math datasets. In this paper, we explore the effect of code on enhancing LLMs' reasoning capability by introducing different constraints on the Code Usage Frequency of GPT-4 Code Interpreter. We find that its success can be largely attributed to its powerful skills in generating and executing code, evaluating the output of code execution, and rectifying its solution when it receives unreasonable outputs. Based on this insight, we propose a novel and effective prompting method, explicit code-based self-verification (CSV), to further boost the mathematical reasoning potential of GPT-4 Code Interpreter. This method employs a zero-shot prompt on GPT-4 Code Interpreter to encourage it to use code to self-verify its answers. When the verification state registers as “False”, the model automatically amends its solution, much as a person corrects errors during a mathematics examination. Furthermore, we recognize that the states of the verification result indicate the confidence of a solution, which can improve the effectiveness of majority voting. With GPT-4 Code Interpreter and CSV, we achieve an impressive zero-shot accuracy on the MATH dataset (53.9% → 84.3%).
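The two ideas in the abstract, amending a solution whenever verification registers as “False” and using verification states as confidence weights in majority voting, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `solve`, `verify`, and `amend` callables stand in for model calls and are hypothetical, the “True” and “Uncertain” state names are assumed alongside the “False” state the abstract mentions, and the weight values are made up for the example.

```python
from collections import defaultdict

# Hypothetical weights per verification state. The abstract only says that
# verification states indicate confidence; these values are assumptions.
STATE_WEIGHTS = {"True": 1.0, "Uncertain": 0.5, "False": 0.2}


def solve_with_verification(problem, solve, verify, amend, max_rounds=3):
    """Solve, verify with code, and amend while verification says "False".

    solve, verify, and amend are caller-supplied stand-ins for model calls
    (hypothetical; not part of any real API).
    """
    answer = solve(problem)
    state = verify(problem, answer)
    rounds = 0
    while state == "False" and rounds < max_rounds:
        answer = amend(problem, answer)
        state = verify(problem, answer)
        rounds += 1
    return answer, state


def weighted_majority_vote(samples, weights=STATE_WEIGHTS):
    """Pick the answer with the highest total weight.

    samples is a list of (answer, verification_state) pairs, one per
    sampled solution.
    """
    scores = defaultdict(float)
    for answer, state in samples:
        scores[answer] += weights[state]
    return max(scores, key=scores.get)


# Toy usage: three unverified votes for "12" lose to one verified "15".
samples = [("12", "False"), ("12", "False"), ("12", "False"), ("15", "True")]
print(weighted_majority_vote(samples))  # prints 15
```

With equal weights this reduces to plain majority voting, which would pick "12" in the toy example; the weighting is what lets a single verified answer outvote several answers that failed verification.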


research
08/01/2023

SelfCheck: Using LLMs to Zero-Shot Check Their Own Step-by-Step Reasoning

The recent progress in large language models (LLMs), especially the inve...
research
06/05/2023

Prompt to be Consistent is Better than Self-Consistent? Few-Shot and Zero-Shot Fact Verification with Pre-trained Language Models

Few-shot or zero-shot fact verification only relies on a few or no label...
research
08/03/2023

Reasoning in Large Language Models Through Symbolic Math Word Problems

Large language models (LLMs) have revolutionized NLP by solving downstre...
research
12/19/2022

Large Language Models are reasoners with Self-Verification

When a large language model (LLM) performs complex reasoning by chain of...
research
03/24/2021

Thinking Aloud: Dynamic Context Generation Improves Zero-Shot Reasoning Performance of GPT-2

Thinking aloud is an effective meta-cognitive strategy human reasoners a...
research
09/04/2023

MathAttack: Attacking Large Language Models Towards Math Solving Ability

With the boom of Large Language Models (LLMs), the research of solving M...
research
12/21/2022

Crowd Score: A Method for the Evaluation of Jokes using Large Language Model AI Voters as Judges

This paper presents the Crowd Score, a novel method to assess the funnin...
