Lila: A Unified Benchmark for Mathematical Reasoning

10/31/2022
by   Swaroop Mishra, et al.

Mathematical reasoning skills are essential for general-purpose intelligent systems to perform tasks ranging from grocery shopping to climate modeling. Towards evaluating and improving AI systems in this domain, we propose LILA, a unified mathematical reasoning benchmark consisting of 23 diverse tasks along four dimensions: (i) mathematical abilities, e.g., arithmetic, calculus; (ii) language format, e.g., question-answering, fill-in-the-blanks; (iii) language diversity, e.g., no language, simple language; and (iv) external knowledge, e.g., commonsense, physics. We construct our benchmark by extending 20 existing datasets, collecting task instructions and solutions in the form of Python programs, thereby obtaining explainable solutions in addition to the correct answer. We additionally introduce two evaluation datasets to measure out-of-distribution performance and robustness to language perturbation. Finally, we introduce BHASKARA, a general-purpose mathematical reasoning model trained on LILA. Importantly, we find that multi-tasking leads to significant improvements (average relative improvement of 21.83% F1 score vs. single-task models), while the best performing model only obtains 60.40%, indicating the room for improvement in general mathematical reasoning and understanding.
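
To make the idea of a solution "in the form of a Python program" concrete, here is a minimal sketch. The question, the program text, and the helper `run_program` are hypothetical illustrations, not taken from LILA or its tooling; the sketch only assumes that an annotated program stores its final result in a variable named `answer`.

```python
# Hypothetical example: the question and program below are made up for
# illustration and do not come from the LILA benchmark.
question = (
    "A shopper buys 3 apples at $0.50 each and 2 loaves of bread "
    "at $2.25 each. What is the total cost?"
)

# The solution is a program whose intermediate steps explain the answer.
program = """
apple_cost = 3 * 0.50
bread_cost = 2 * 2.25
answer = apple_cost + bread_cost
"""


def run_program(source: str):
    """Execute a solution program and return its `answer` variable.

    Assumption: programs follow the convention of binding the final
    result to a variable called `answer`.
    """
    namespace = {}
    exec(source, namespace)  # only run trusted, hand-written programs
    return namespace["answer"]


result = run_program(program)
print(result)
```

Because the program exposes each intermediate quantity, a reader can inspect how the answer was reached, and the final value can be checked by executing the program rather than by comparing free-form text.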


