Rethinking the Positional Encoding in Language Pre-training

06/28/2020
by Guolin Ke, et al.

How to explicitly encode positional information into neural networks is an important problem in natural language processing. In the Transformer model, positional information is simply encoded as embedding vectors added in the input layer, or as a bias term in the self-attention module. In this work, we investigate the problems with these formulations and propose a new positional encoding method for BERT called Transformer with Untied Positional Encoding (TUPE). Unlike prior approaches, TUPE uses only the word embedding as input. In the self-attention module, the word correlation and positional correlation are computed separately with different parameterizations and then added together. This design removes the noisy word-position correlation and, by using different projection matrices, gives more expressiveness to characterize the relationships between words and between positions. Furthermore, TUPE unties the [CLS] symbol from the other positions, giving it a more specific role in capturing the global representation of the sentence. Extensive experiments and ablation studies on the GLUE benchmark demonstrate the effectiveness and efficiency of the proposed method: TUPE outperforms several baselines on almost all tasks by a large margin. In particular, it achieves a higher score than the baselines while using only 30% of the pre-training computational cost. We release our code at https://github.com/guolinke/TUPE.
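To make the untied attention described in the abstract concrete, below is a minimal single-head PyTorch sketch, not the authors' released implementation (see the repository linked above for that). The class and parameter names (`UntiedAttention`, `w_q`, `p_q`, `max_len`) are illustrative assumptions. Word-to-word and position-to-position correlations are computed with separate projection matrices and summed before the softmax; multi-head projections and the additional [CLS]-untying bias terms from the paper are omitted for brevity.

```python
# Minimal sketch of TUPE-style untied attention (single head, illustrative only).
import math
import torch
import torch.nn as nn

class UntiedAttention(nn.Module):
    def __init__(self, d_model: int, max_len: int = 512):
        super().__init__()
        # Word and position correlations use separate projection matrices.
        self.w_q = nn.Linear(d_model, d_model)   # projects word embeddings (queries)
        self.w_k = nn.Linear(d_model, d_model)   # projects word embeddings (keys)
        self.p_q = nn.Linear(d_model, d_model)   # projects positional embeddings (queries)
        self.p_k = nn.Linear(d_model, d_model)   # projects positional embeddings (keys)
        self.w_v = nn.Linear(d_model, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.scale = math.sqrt(2 * d_model)      # joint scaling of the two score terms

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) -- word embeddings only; positions are not added to the input.
        batch, seq_len, _ = x.shape
        pos = self.pos_emb(torch.arange(seq_len, device=x.device))     # (seq_len, d_model)

        # Word-to-word correlation, computed per example.
        word_scores = self.w_q(x) @ self.w_k(x).transpose(-2, -1)      # (batch, L, L)
        # Position-to-position correlation, shared across the batch.
        pos_scores = self.p_q(pos) @ self.p_k(pos).transpose(-2, -1)   # (L, L)

        scores = (word_scores + pos_scores.unsqueeze(0)) / self.scale
        attn = scores.softmax(dim=-1)
        return attn @ self.w_v(x)

# Example usage with hypothetical sizes:
x = torch.randn(2, 16, 64)                                # (batch, seq_len, d_model)
out = UntiedAttention(d_model=64, max_len=128)(x)         # -> (2, 16, 64)
```

In the full method, the position-to-position scores involving the [CLS] position are additionally replaced by learnable bias terms, which is what "untying the [CLS] symbol from other positions" refers to.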
