Improve Transformer Pre-Training with Decoupled Directional Relative Position Encoding and Representation Differentiations

10/09/2022
by Haojie Zhang, et al.

In this work, we revisit Transformer-based pre-trained language models and identify two problems that may limit the expressiveness of the model. First, existing relative position encoding models (e.g., T5 and DeBERTa) conflate two heterogeneous types of information: relative distance and direction. This may prevent the model from capturing the associative semantics shared by tokens at the same direction or the same distance, which in turn hurts performance on downstream tasks. Second, we observe that BERT pre-trained with the Masked Language Modeling (MLM) objective produces similar representations for different tokens and similar attention weights across different heads, which may make it difficult to capture discriminative semantic representations. Motivated by these observations, we propose two novel techniques to improve pre-trained language models: Decoupled Directional Relative Position (DDRP) encoding and the MTH pre-training objective. DDRP decouples the relative distance features from the directional features in classical relative position encoding for a better understanding of positional information. MTH adds two novel auxiliary losses to MLM that enlarge the dissimilarities between (a) the last hidden states of different tokens and (b) the attention weights of different heads, alleviating the homogenization and anisotropy problems in representation learning for better optimization. Extensive experiments and ablation studies on the GLUE benchmark demonstrate the effectiveness of our proposed methods.
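To make the decoupling idea concrete, the following is a minimal, hypothetical PyTorch sketch (not the authors' implementation): it keeps two separate embedding tables, one indexed by the clipped absolute distance |i - j| and one by the sign of (i - j), and sums both as per-head biases on the attention logits, in contrast to a single table indexed by the signed offset as in T5-style relative position bias. The class name, the clipping parameter, and the additive combination are all assumptions.

```python
import torch
import torch.nn as nn

class DecoupledRelativePositionBias(nn.Module):
    """Hypothetical sketch: separate embeddings for relative distance and direction."""
    def __init__(self, num_heads: int, max_distance: int = 128):
        super().__init__()
        # One table indexed by clipped |i - j| (distance), one by sign(i - j) (direction).
        self.distance_bias = nn.Embedding(max_distance + 1, num_heads)
        self.direction_bias = nn.Embedding(3, num_heads)  # left / same position / right
        self.max_distance = max_distance

    def forward(self, seq_len: int) -> torch.Tensor:
        pos = torch.arange(seq_len)
        offset = pos[None, :] - pos[:, None]                 # (L, L) signed offsets i - j
        distance = offset.abs().clamp(max=self.max_distance)  # distance component
        direction = torch.sign(offset).long() + 1              # direction component: 0, 1, 2
        # Sum the two biases and move heads to the front: (num_heads, L, L).
        bias = self.distance_bias(distance) + self.direction_bias(direction)
        return bias.permute(2, 0, 1)

# Usage: add the resulting bias to the attention logits before the softmax.
bias = DecoupledRelativePositionBias(num_heads=12)(seq_len=16)
print(bias.shape)  # torch.Size([12, 16, 16])
```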
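The two auxiliary objectives can likewise be pictured as dissimilarity regularizers. The sketch below is a rough assumption rather than the paper's exact losses: it penalizes high average pairwise cosine similarity between token representations and between the attention maps of different heads, and the weighting coefficients in the commented combination with MLM are placeholders.

```python
import torch
import torch.nn.functional as F

def token_dissimilarity_loss(hidden: torch.Tensor) -> torch.Tensor:
    """Penalize similar last hidden states across tokens. hidden: (batch, seq, dim)."""
    h = F.normalize(hidden, dim=-1)
    sim = h @ h.transpose(-1, -2)                       # (batch, seq, seq) cosine similarities
    mask = ~torch.eye(sim.size(-1), dtype=torch.bool)   # exclude self-similarity
    return sim[:, mask].mean()

def head_dissimilarity_loss(attn: torch.Tensor) -> torch.Tensor:
    """Penalize similar attention maps across heads. attn: (batch, heads, seq, seq)."""
    a = F.normalize(attn.flatten(2), dim=-1)            # (batch, heads, seq*seq)
    sim = a @ a.transpose(-1, -2)                       # (batch, heads, heads)
    mask = ~torch.eye(sim.size(-1), dtype=torch.bool)
    return sim[:, mask].mean()

# Assumed combination with the MLM objective (coefficients are placeholders):
# total_loss = mlm_loss + 0.1 * token_dissimilarity_loss(last_hidden) \
#              + 0.1 * head_dissimilarity_loss(attention_probs)
```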

