Transformer Language Models without Positional Encodings Still Learn Positional Information

03/30/2022
by Adi Haviv et al.

Transformers typically require some form of positional encoding, such as positional embeddings, to process natural language sequences. Surprisingly, we find that transformer language models without any explicit positional encoding are still competitive with standard models, and that this phenomenon is robust across different datasets, model sizes, and sequence lengths. Probing experiments reveal that such models acquire an implicit notion of absolute positions throughout the network, effectively compensating for the missing information. We conjecture that causal attention enables the model to infer the number of predecessors that each token can attend to, thereby approximating its absolute position.
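To make the conjecture concrete, here is a minimal sketch (not the authors' probing setup) of how a causal mask alone can leak absolute position: with uniform causal attention over random, position-free embeddings, position i averages over i+1 tokens, so the magnitude of its attended summary shrinks roughly as 1/sqrt(i+1). The uniform-attention simplification and the norm-based readout are illustrative assumptions, not the paper's method.

```python
import torch

torch.manual_seed(0)

n, d = 128, 64                       # sequence length, embedding dimension
x = torch.randn(n, d)                # random token embeddings, no positional encoding

# Uniform causal attention: position i simply averages tokens 0..i.
mask = torch.tril(torch.ones(n, n))
attn = mask / mask.sum(dim=-1, keepdim=True)
out = attn @ x                       # (n, d) attended summaries

# Averaging i+1 i.i.d. vectors shrinks the norm like 1/sqrt(i+1),
# so the output statistics alone hint at each token's absolute position.
norms = out.norm(dim=-1)
print(norms[:3])     # early positions: large norms
print(norms[-3:])    # late positions: small norms
```

In a trained model the attention pattern is learned rather than uniform, but under a causal mask this predecessor-counting signal is available at every layer, which is consistent with the probing results described above.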
