Can ChatGPT Defend the Truth? Automatic Dialectical Evaluation Elicits LLMs' Deficiencies in Reasoning

by   Boshi Wang, et al.

We explore testing the reasoning ability of large language models (LLMs), such as ChatGPT, by engaging with them in a debate-like conversation that probes deeper into their understanding of the subject. Specifically, we formulate a new task where given a question, the LLM can generate a correct solution while the user believes in a wrong solution in the beginning, and they need to discuss to make the correct decision through dialogue. Such a setting requires the LLM to not only achieve the correct answer on its own (which could be done by shallow memorization), but also be able to defend the truth instead of blindly believing or getting misled by the user's (invalid) arguments and critiques, thus testing in greater depth whether the LLM grasps the essence of the reasoning required to solve the problem. To automate this evaluation framework and save human labor, we simulate the user using another LLM conditioned on a synthesized wrong solution. Across a range of complex reasoning benchmarks spanning math, commonsense, logic and tasks from BIG-Bench, we find that despite being able to generate correct step-by-step solutions in the beginning, ChatGPT cannot maintain its belief in truth for a significant portion of examples when challenged by often-time absurdly invalid arguments. Our work reveals LLMs' weaknesses not captured by conventional benchmarking, and also points to danger zones of aligning models with human feedback.


page 1

page 2

page 3

page 4


STaR: Bootstrapping Reasoning With Reasoning

Generating step-by-step "chain-of-thought" rationales improves language ...

It Takes One to Tango but More Make Trouble? In-Context Training with Different Number of Demonstrations

Large language models (LLMs) are capable to perform complex reasoning by...

Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation

Program synthesis has been long studied with recent approaches focused o...

SYNDICOM: Improving Conversational Commonsense with Error-Injection and Natural Language Feedback

Commonsense reasoning is a critical aspect of human communication. Despi...

Are Multilingual LLMs Culturally-Diverse Reasoners? An Investigation into Multicultural Proverbs and Sayings

Large language models (LLMs) are highly adept at question answering and ...

Are LLMs All You Need for Task-Oriented Dialogue?

Instructions-tuned Large Language Models (LLMs) gained recently huge pop...

Please sign up or login with your details

Forgot password? Click here to reset