Unveiling Transformers with LEGO: a synthetic reasoning task

06/09/2022
by   Yi Zhang, et al.

We propose a synthetic task, LEGO (Learning Equality and Group Operations), that encapsulates the problem of following a chain of reasoning, and we study how the transformer architecture learns this task. We pay special attention to data effects such as pretraining (on seemingly unrelated NLP tasks) and dataset composition (e.g., differing chain lengths at training and test time), as well as architectural variants such as weight-tied layers or added convolutional components. We study how the trained models eventually succeed at the task, and in particular we are able to understand (to some extent) some of the attention heads as well as how information flows through the network. Based on these observations, we hypothesize that pretraining helps here merely by providing a smart initialization rather than through deep knowledge stored in the network. We also observe that in some data regimes the trained transformer finds "shortcut" solutions for following the chain of reasoning, which impede the model's ability to generalize to simple variants of the main task; moreover, we find that such shortcuts can be prevented with appropriate architectural modifications or careful data preparation. Motivated by these findings, we begin to explore the task of learning to execute C programs, where a convolutional modification to transformers, namely adding convolutional structures to the key/query/value maps, shows an encouraging edge.
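The abstract describes the LEGO task only in words, so here is a minimal, hypothetical Python sketch of what such a reasoning chain might look like: each variable is defined as a signed copy of the previous one, and resolving every variable's value requires following the chain of equalities. The clause format, variable names, and the choice of the sign group {+1, -1} are illustrative assumptions, not the paper's exact specification.

```python
import random

def make_lego_chain(length=6, seed=None):
    """Generate a toy LEGO-style chain of equality/sign clauses.

    Each clause assigns a variable to +/- the previous variable, with the
    first variable anchored at the constant 1. The clauses are shuffled so
    that resolving a variable's value requires following the chain rather
    than reading left to right.
    """
    rng = random.Random(seed)
    names = rng.sample("abcdefghijklmnopqrstuvwxyz", length)
    signs = [rng.choice([+1, -1]) for _ in range(length)]

    clauses, values = [], {}
    value = 1  # the chain is anchored at the constant 1
    for i, (name, sign) in enumerate(zip(names, signs)):
        value *= sign
        values[name] = value
        rhs = "1" if i == 0 else names[i - 1]
        clauses.append(f"{name}={'+' if sign > 0 else '-'}{rhs}")

    rng.shuffle(clauses)  # presentation order need not follow the chain
    return "; ".join(clauses), values

sentence, targets = make_lego_chain(length=5, seed=0)
print(sentence)   # e.g. "c=-b; a=+1; b=-a; ..."
print(targets)    # ground-truth +1/-1 value for each variable
```

A model trained on such sentences must predict the +1/-1 value of every variable, which requires chaining the equalities in dependency order even though they appear in shuffled order in the input.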


