Mimetic Initialization of Self-Attention Layers

05/16/2023
by Asher Trockman, et al.

It is notoriously difficult to train Transformers on small datasets; typically, large pre-trained models are instead used as the starting point. We explore the weights of such pre-trained Transformers (particularly for vision) to attempt to find reasons for this discrepancy. Surprisingly, we find that simply initializing the weights of self-attention layers so that they "look" more like their pre-trained counterparts allows us to train vanilla Transformers faster and to higher final accuracies, particularly on vision tasks such as CIFAR-10 and ImageNet classification, where we see gains in accuracy of over 5%. Our initialization is closed-form, learning-free, and very simple: we set the product of the query and key weights to be approximately the identity, and the product of the value and projection weights to approximately the negative identity. As this mimics the patterns we saw in pre-trained Transformers, we call the technique "mimetic initialization".
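The weight-setting rule described in the abstract can be illustrated with a short routine. The following is a minimal sketch in PyTorch, assuming a single-head nn.MultiheadAttention for simplicity; the helper factor_target, the function name mimetic_init_, and the scales alpha and beta are illustrative assumptions of ours, not values or APIs taken from the paper.

```python
import torch
import torch.nn as nn


def factor_target(target: torch.Tensor):
    """Factor a (d x d) target T into (A, B) with A @ B.T == T, via its SVD."""
    U, S, Vh = torch.linalg.svd(target)
    root = torch.diag(S.sqrt())
    return U @ root, Vh.T @ root


@torch.no_grad()
def mimetic_init_(attn: nn.MultiheadAttention, alpha: float = 0.7, beta: float = 0.1):
    """Set the query-key product to roughly the identity and the
    value-projection product to roughly the negative identity.
    alpha/beta are illustrative choices, not values from the paper."""
    d = attn.embed_dim
    noise = lambda: beta * torch.randn(d, d) / d ** 0.5

    # Targets: identity-like for query/key, negative-identity-like for value/output.
    Aq, Bk = factor_target(alpha * torch.eye(d) + noise())
    Av, Bo = factor_target(-alpha * torch.eye(d) + noise())

    # nn.MultiheadAttention stacks [W_q; W_k; W_v] in in_proj_weight and applies
    # each as x @ W.T, so the relevant composites are:
    #   attention logits involve W_q.T @ W_k      -> set W_q = Aq.T, W_k = Bk.T
    #   value path applies W_v.T @ W_o.T to x     -> set W_v = Av.T, W_o = Bo
    attn.in_proj_weight[:d].copy_(Aq.T)        # query weights
    attn.in_proj_weight[d:2 * d].copy_(Bk.T)   # key weights
    attn.in_proj_weight[2 * d:].copy_(Av.T)    # value weights
    attn.out_proj.weight.copy_(Bo)             # output projection weights

    if attn.in_proj_bias is not None:
        attn.in_proj_bias.zero_()
    if attn.out_proj.bias is not None:
        attn.out_proj.bias.zero_()


# Usage: a single-head attention layer initialized mimetically.
attn = nn.MultiheadAttention(embed_dim=192, num_heads=1, batch_first=True)
mimetic_init_(attn)
```

Factoring a near-identity (or near-negative-identity) target through an SVD is just one convenient way to make the query-key and value-projection products land near the desired ±I; a multi-head layer would apply the same idea per head.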


