NumGLUE: A Suite of Fundamental yet Challenging Mathematical Reasoning Tasks

04/12/2022
by Swaroop Mishra, et al.

Given the ubiquitous nature of numbers in text, reasoning with numbers to perform simple calculations is an important skill of AI systems. While many datasets and models have been developed to this end, state-of-the-art AI systems are brittle, failing to perform the underlying mathematical reasoning when it appears in a slightly different scenario. Drawing inspiration from GLUE, which was proposed in the context of natural language understanding, we propose NumGLUE, a multi-task benchmark that evaluates the performance of AI systems on eight different tasks that, at their core, require simple arithmetic understanding. We show that this benchmark is far from being solved, with neural models, including state-of-the-art large-scale language models, performing significantly worse than humans (lower by 46.4%). Further, NumGLUE promotes sharing knowledge across tasks, especially those with limited training data, as evidenced by the superior performance (average gain of 3.4% on each task) when a model is jointly trained on all the tasks as opposed to task-specific modeling. Finally, we hope that NumGLUE will encourage systems that perform robust and general arithmetic reasoning within language, a first step towards being able to perform more complex mathematical reasoning.


research
10/31/2022

Lila: A Unified Benchmark for Mathematical Reasoning

Mathematical reasoning skills are essential for general-purpose intellig...
research
04/20/2018

GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

For natural language understanding (NLU) technology to be maximally usef...
research
09/21/2023

MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models

Large language models (LLMs) have pushed the limits of natural language ...
research
10/16/2021

Learning to Solve Complex Tasks by Talking to Agents

Humans often solve complex problems by interacting (in natural language)...
research
12/20/2022

True Detective: A Challenging Benchmark for Deep Abductive Reasoning in Foundation Models

Large language models (LLMs) have demonstrated strong performance in zer...
research
06/07/2023

Multi-Task Training with In-Domain Language Models for Diagnostic Reasoning

Generative artificial intelligence (AI) is a promising direction for aug...
research
07/15/2021

Trusting RoBERTa over BERT: Insights from CheckListing the Natural Language Inference Task

The recent state-of-the-art natural language understanding (NLU) systems...
