Pushing the Limits of Rule Reasoning in Transformers through Natural Language Satisfiability

Investigating the reasoning abilities of transformer models, and discovering new challenging tasks for them, has been a topic of much interest. Recent studies have found these models to be surprisingly strong at performing deductive reasoning over formal logical theories expressed in natural language. A shortcoming of these studies, however, is that they do not take into account that logical theories, when sampled uniformly at random, do not necessarily lead to hard instances. We propose a new methodology for creating challenging algorithmic reasoning datasets that focus on natural language satisfiability (NLSat) problems. The key idea is to draw insights from empirical sampling of hard propositional SAT problems and from complexity-theoretic studies of language. This methodology allows us to distinguish easy from hard instances, and to systematically increase the complexity of existing reasoning benchmarks such as RuleTaker. We find that current transformers, given sufficient training data, are surprisingly robust at solving the resulting NLSat problems of substantially increased difficulty. They also exhibit some degree of scale-invariance: the ability to generalize to problems of larger size and scope. Our results, however, reveal important limitations too: a careful sampling of training data is crucial for building models that generalize to larger problems, and transformer models' limited scale-invariance suggests they are far from learning robust deductive reasoning algorithms.
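The "empirical sampling of hard propositional SAT problems" mentioned above typically refers to drawing random k-SAT formulas near the satisfiability phase transition, where instances are empirically hardest for solvers. The following is a minimal sketch of such a sampler; the function name and formula encoding are illustrative, not the paper's actual generation pipeline.

```python
import random

def random_ksat(n_vars, n_clauses, k=3, seed=0):
    """Sample a random k-SAT formula as a list of clauses.

    Each clause is a tuple of k signed variable indices drawn without
    replacement; a negative index denotes a negated literal.
    """
    rng = random.Random(seed)
    clauses = []
    for _ in range(n_clauses):
        chosen = rng.sample(range(1, n_vars + 1), k)
        clauses.append(tuple(v if rng.random() < 0.5 else -v for v in chosen))
    return clauses

# Empirically, random 3-SAT is hardest near a clause-to-variable ratio
# of roughly 4.26, the location of the satisfiability phase transition.
n = 50
formula = random_ksat(n, round(4.26 * n))
```

Sampling at this critical ratio (rather than uniformly over all theory sizes) is what lets one separate genuinely hard instances from easy ones when constructing a benchmark.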




1 Introduction

Motivated by the impressive performance of recent pre-trained transformers devlin2018bert; raffel2019exploring on a wide range of natural language understanding (NLU) benchmarks wang2019glue; wang2019superglue; xu-etal-2020-clue, there has been much recent interest in investigating the linguistic and reasoning abilities of state-of-the-art neural models (linzen2016assessing; talmor2019olmpics; kassner2020pre; yanaka2020neural; hupkes2020compositionality; richardson2020probing, inter alia). One particular thread of work probes whether transformers can perform logical reasoning over formal theories expressed in natural language clark2020transformers. Specifically, given a set of systematically constructed natural language theories, each consisting of explicitly stated rules and facts (e.g., the NL Theory in the bottom part of Figure LABEL:fig:top_fig, containing fictional rules about the characters Bob and Alan), the goal is to test whether a model can learn to perform deductive reasoning over such theories by correctly answering queries that require making novel inferences (e.g., predicting that Alan is green is true based on knowing that Alan is rough and applying the rule All rough people are green).
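The kind of deduction described here can be made concrete with a small forward-chaining sketch: facts and rules are ground symbolically, and rules are applied until a fixed point is reached. The encoding below (attribute pairs for rules like "All rough people are green") is a simplification for illustration, not the actual RuleTaker or NLSat representation.

```python
# Toy theory: facts are (person, attribute) pairs; each rule
# (premise, conclusion) reads "All <premise> people are <conclusion>."
facts = {("Alan", "rough"), ("Bob", "blue")}
rules = [("rough", "green"), ("green", "kind")]

def forward_chain(facts, rules):
    """Apply all rules repeatedly until no new facts are derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in rules:
            for person, attr in list(derived):
                if attr == premise and (person, conclusion) not in derived:
                    derived.add((person, conclusion))
                    changed = True
    return derived

closure = forward_chain(facts, rules)
# The query "Alan is green" is entailed: Alan is rough, and all rough
# people are green; chaining further also derives "Alan is kind".
print(("Alan", "green") in closure)
```

A trained transformer is asked to produce the same entailment judgments directly from the natural language statements, without access to such an explicit inference procedure.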