Chain-of-Thought Hub: A Continuous Effort to Measure Large Language Models' Reasoning Performance

05/26/2023
by Yao Fu, et al.

As large language models (LLMs) are continuously being developed, their evaluation becomes increasingly important yet challenging. This work proposes Chain-of-Thought Hub, an open-source evaluation suite for the multi-step reasoning capabilities of large language models. We are interested in this setting for two reasons: (1) from the behavior of the GPT and PaLM model families, we observe that complex reasoning is likely to be a key differentiator between weaker and stronger LLMs; (2) we envisage large language models becoming the next-generation computational platform and fostering an ecosystem of new LLM-based applications, which naturally requires the foundation models to perform complex tasks that often involve composing linguistic and logical operations. Our approach is to compile a suite of challenging reasoning benchmarks to track the progress of LLMs. Our current results show that: (1) model scale clearly correlates with reasoning capability; (2) as of May 2023, Claude-v1.3 and PaLM-2 are the only two models comparable with GPT-4, while open-source models still lag behind; (3) LLaMA-65B performs close to code-davinci-002, indicating that with successful further development such as reinforcement learning from human feedback (RLHF), it has great potential to approach GPT-3.5-Turbo. Our results also suggest that for open-source efforts to catch up, the community may focus more on building better base models and on exploring RLHF.

