KERPLE: Kernelized Relative Positional Embedding for Length Extrapolation

05/20/2022
by Ta-Chung Chi, et al.

Relative positional embeddings (RPEs) have received considerable attention because they effectively model the relative distances among tokens and enable length extrapolation. We propose KERPLE, a framework that generalizes relative positional embeddings for extrapolation by kernelizing positional differences. We achieve this goal using conditionally positive definite (CPD) kernels, a class of functions known to generalize distance metrics. To maintain the inner-product interpretation of self-attention, we show that a CPD kernel can be transformed into a PD kernel by adding a constant offset. This offset is implicitly absorbed into the softmax normalization during self-attention. The diversity of CPD kernels allows us to derive various RPEs that enable length extrapolation in a principled way. Experiments demonstrate that the logarithmic variant achieves excellent extrapolation performance on three large language modeling datasets.
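To see why the constant offset can be left implicit, note that softmax is invariant to any shift applied uniformly across the key positions. Writing the CPD-kernel bias as k_CPD(i, j) and the offset as c:

    softmax_j( q_i^T k_j + k_CPD(i, j) + c )
      = exp( q_i^T k_j + k_CPD(i, j) + c ) / sum_{j'} exp( q_i^T k_{j'} + k_CPD(i, j') + c )
      = softmax_j( q_i^T k_j + k_CPD(i, j) ),

since the factor exp(c) cancels between numerator and denominator. The CPD-to-PD correction therefore never has to be computed explicitly.

As an illustration of a kernelized relative-position bias, the sketch below implements a logarithmic bias of the form -r1 * log(1 + r2 * |i - j|) with positive learnable scales per attention head. The exact parameterization, the class name LogKernelBias, and the parameter names raw_r1 / raw_r2 are our assumptions for illustration, not code taken from the paper.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LogKernelBias(nn.Module):
        # Hypothetical sketch: per-head additive bias -r1 * log(1 + r2 * |i - j|),
        # one possible logarithmic CPD-kernel parameterization.
        def __init__(self, num_heads):
            super().__init__()
            self.raw_r1 = nn.Parameter(torch.zeros(num_heads))
            self.raw_r2 = nn.Parameter(torch.zeros(num_heads))

        def forward(self, seq_len):
            # softplus keeps the learnable scales strictly positive
            r1 = F.softplus(self.raw_r1)[:, None, None]          # (H, 1, 1)
            r2 = F.softplus(self.raw_r2)[:, None, None]          # (H, 1, 1)
            pos = torch.arange(seq_len)
            dist = (pos[None, :] - pos[:, None]).abs().float()   # (L, L), |i - j|
            return -r1 * torch.log1p(r2 * dist)                  # (H, L, L)

    # The bias is added to the attention logits before softmax, e.g.
    # scores = q @ k.transpose(-1, -2) / d ** 0.5 + LogKernelBias(num_heads)(seq_len)

Because the bias depends only on |i - j|, it is defined for arbitrary sequence lengths, which is what allows evaluation beyond the training context length.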


