Jump to Conclusions: Short-Cutting Transformers With Linear Transformations

03/16/2023
by   Alexander Yom Din, et al.
0

Transformer-based language models (LMs) create hidden representations of their inputs at every layer, but only use final-layer representations for prediction. This obscures the internal decision-making process of the model and the utility of its intermediate representations. One way to elucidate this is to cast the hidden representations as final representations, bypassing the transformer computation in-between. In this work, we suggest a simple method for such casting, by using linear transformations. We show that our approach produces more accurate approximations than the prevailing practice of inspecting hidden representations from all layers in the space of the final layer. Moreover, in the context of language modeling, our method allows "peeking" into early layer representations of GPT-2 and BERT, showing that often LMs already predict the final output in early layers. We then demonstrate the practicality of our method to recent early exit strategies, showing that when aiming, for example, at retention of 95 additional 7.9 savings of the original approach. Last, we extend our method to linearly approximate sub-modules, finding that attention is most tolerant to this change.

READ FULL TEXT

page 4

page 12

page 14

research
09/11/2019

How Does BERT Answer Questions? A Layer-Wise Analysis of Transformer Representations

Bidirectional Encoder Representations from Transformers (BERT) reach sta...
research
11/11/2022

Depth and Representation in Vision Models

Deep learning models develop successive representations of their input i...
research
10/07/2022

Understanding Transformer Memorization Recall Through Idioms

To produce accurate predictions, language models (LMs) must balance betw...
research
02/21/2022

Transformer Quality in Linear Time

We revisit the design choices in Transformers, and propose methods to ad...
research
04/15/2019

Multi-Head Multi-Layer Attention to Deep Language Representations for Grammatical Error Detection

It is known that a deep neural network model pre-trained with large-scal...
research
05/02/2023

The Benefits of Bad Advice: Autocontrastive Decoding across Model Layers

Applying language models to natural language processing tasks typically ...
research
03/28/2022

Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space

Transformer-based language models (LMs) are at the core of modern NLP, b...

Please sign up or login with your details

Forgot password? Click here to reset