Addressing Token Uniformity in Transformers via Singular Value Transformation

08/24/2022
by Hanqi Yan, et al.

Token uniformity is commonly observed in transformer-based models: after passing through multiple stacked self-attention layers, different tokens come to share a large proportion of similar information. In this paper, we propose to use the distribution of singular values of each transformer layer's outputs to characterise token uniformity, and we empirically show that a less skewed singular value distribution alleviates the problem. Based on these observations, we define several desirable properties of singular value distributions and propose a novel transformation function for updating the singular values. We also show that, apart from alleviating token uniformity, the transformation function should preserve the local neighbourhood structure of the original embedding space. We apply the proposed singular value transformation function to a range of transformer-based language models, including BERT, ALBERT, RoBERTa and DistilBERT, and observe improved performance on semantic textual similarity evaluation and a range of GLUE tasks. Our source code is available at https://github.com/hanqi-qi/tokenUni.git.
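As a concrete illustration of the diagnostic described above, the minimal sketch below computes the singular value spectrum of each layer's token embeddings for a single sentence and summarises its skewness. It assumes a Hugging Face `transformers` BERT checkpoint and PyTorch; the helper `singular_value_skewness` and the particular skewness measure (third standardised moment of the normalised spectrum) are illustrative choices for exposition, not the paper's exact formulation.

```python
# Sketch: measure how skewed the singular value spectrum of each
# transformer layer's output is for one sentence (hypothetical helper names).
import torch
from transformers import AutoModel, AutoTokenizer

def singular_value_skewness(hidden_states: torch.Tensor) -> float:
    """Skewness of the singular values of a (tokens x dim) embedding matrix."""
    s = torch.linalg.svdvals(hidden_states)        # singular values, descending
    s = s / s.sum()                                # normalise to a distribution
    mean, std = s.mean(), s.std()
    return (((s - mean) / std) ** 3).mean().item() # third standardised moment

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

inputs = tokenizer("Token uniformity tends to grow with depth.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One value per layer: a larger skewness indicates a spectrum dominated by
# a few directions, i.e. tokens collapsing towards shared information.
for i, layer in enumerate(outputs.hidden_states):
    print(f"layer {i:2d}: skewness = {singular_value_skewness(layer.squeeze(0)):.3f}")
```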

research
04/22/2021

So-ViT: Mind Visual Tokens for Vision Transformer

Recently the vision transformer (ViT) architecture, where the backbone p...
research
05/31/2021

MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger Tokens

Transformers have offered a new methodology of designing neural networks...
research
05/23/2022

Outliers Dimensions that Disrupt Transformers Are Driven by Frequency

Transformer-based language models are known to display anisotropic behav...
research
08/13/2020

On the Importance of Local Information in Transformer Based Models

The self-attention module is a key component of Transformer-based models...
research
05/23/2023

Grounding and Distinguishing Conceptual Vocabulary Through Similarity Learning in Embodied Simulations

We present a novel method for using agent experiences gathered through a...
research
02/13/2023

Distinguishability Calibration to In-Context Learning

Recent years have witnessed increasing interests in prompt-based learnin...
research
06/13/2023

Is Anisotropy Inherent to Transformers?

The representation degeneration problem is a phenomenon that is widely o...
