Convolutions and Self-Attention: Re-interpreting Relative Positions in Pre-trained Language Models

06/10/2021
by Tyler A. Chang, et al.

In this paper, we detail the relationship between convolutions and self-attention in natural language tasks. We show that relative position embeddings in self-attention layers are equivalent to recently proposed dynamic lightweight convolutions, and we consider multiple new ways of integrating convolutions into Transformer self-attention. Specifically, we propose composite attention, which unites previous relative position embedding methods under a convolutional framework. We conduct experiments by training BERT with composite attention, finding that convolutions consistently improve performance on multiple downstream tasks while replacing absolute position embeddings. To inform future work, we present results comparing lightweight convolutions, dynamic convolutions, and depthwise-separable convolutions in language model pre-training, considering multiple injection points for convolutions in self-attention layers.
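To make the re-interpretation concrete, the sketch below (not the authors' released code) shows how an additive relative-position term in a single self-attention head can be read as a lightweight convolution: each clipped offset j - i contributes a scalar kernel weight to the attention score before the softmax. The function and variable names (composite_attention, rel_bias, max_offset) and the single-head, scalar-bias setup are illustrative assumptions.

    # Minimal NumPy sketch: an additive relative-position term in self-attention
    # acts like a lightweight convolution kernel indexed by token offset.
    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def composite_attention(X, Wq, Wk, Wv, rel_bias, max_offset):
        """Single-head self-attention with an additive relative-position term.

        rel_bias[k] is a scalar score for the relative offset (j - i), clipped to
        [-max_offset, max_offset]; these scalars play the role of a lightweight
        convolution kernel applied over positions.
        """
        n, d = X.shape
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        content = Q @ K.T / np.sqrt(d)                  # (n, n) content scores
        offsets = np.arange(n)[None, :] - np.arange(n)[:, None]
        offsets = np.clip(offsets, -max_offset, max_offset) + max_offset
        position = rel_bias[offsets]                    # (n, n) convolutional scores
        attn = softmax(content + position, axis=-1)     # combined before the softmax
        return attn @ V

    # Toy usage: 5 tokens, model width 8, relative window of +/-2.
    rng = np.random.default_rng(0)
    n, d, max_offset = 5, 8, 2
    X = rng.normal(size=(n, d))
    Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
    rel_bias = rng.normal(size=2 * max_offset + 1)
    out = composite_attention(X, Wq, Wk, Wv, rel_bias, max_offset)
    print(out.shape)  # (5, 8)

The sketch only mirrors the general idea of combining content scores with offset-indexed convolutional scores; it omits multi-head structure, dynamic kernel generation, and the depthwise-separable variants compared in the paper.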

Related research

06/03/2021 · The Case for Translation-Invariant Self-Attention in Transformer-Based Language Models
Mechanisms for encoding positional information are central for transform...

04/06/2022 · Paying More Attention to Self-attention: Improving Pre-trained Language Models via Attention Guiding
Pre-trained language models (PLM) have demonstrated their effectiveness ...

10/10/2020 · What Do Position Embeddings Learn? An Empirical Study of Pre-Trained Language Model Positional Encoding
In recent years, pre-trained Transformers have dominated the majority of...

01/29/2019 · Pay Less Attention with Lightweight and Dynamic Convolutions
Self-attention is a useful mechanism to build generative models for lang...

05/20/2022 · KERPLE: Kernelized Relative Positional Embedding for Length Extrapolation
Relative positional embeddings (RPE) have received considerable attentio...

09/27/2021 · Multiplicative Position-aware Transformer Models for Language Understanding
Transformer models, which leverage architectural improvements like self-...

11/29/2021 · On the Integration of Self-Attention and Convolution
Convolution and self-attention are two powerful techniques for represent...
