Not all parameters are born equal: Attention is mostly what you need

10/22/2020
by Nikolay Bogoychev et al.

Transformers are widely used in state-of-the-art machine translation, but the key to their success is still unknown. To gain insight into this, we consider three groups of parameters: embeddings, attention, and feed-forward neural network (FFN) layers. We examine the relative importance of each by performing an ablation study in which we initialise them at random and freeze them, so that their weights do not change over the course of training. Through this, we show that the attention and FFN components are equally important and fulfil the same functionality in a model. We show that the decision about whether a component is frozen or allowed to train is at least as important for final model performance as its number of parameters; the number of parameters alone is not indicative of a component's importance. Finally, while the embedding layer is the least essential for machine translation tasks, it is the most important component for language modelling tasks.
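The ablation described above is straightforward to reproduce in outline. The sketch below uses PyTorch on a toy encoder-decoder model (an assumption for illustration; the paper's experiments use a full NMT toolkit, and all module and variable names here are illustrative, not the authors' code). It shows one parameter group, here the attention sub-layers, kept at its random initialisation and excluded from the optimiser, while the remaining components train as usual.

# Minimal sketch of the freezing ablation, assuming PyTorch and a toy
# Transformer NMT model. Names are illustrative only.
import torch
import torch.nn as nn

d_model, vocab_size = 512, 32000

class ToyNMT(nn.Module):
    """Toy encoder-decoder model: embeddings + Transformer + output projection."""
    def __init__(self):
        super().__init__()
        self.src_emb = nn.Embedding(vocab_size, d_model)
        self.tgt_emb = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(d_model=d_model, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, src, tgt):
        h = self.transformer(self.src_emb(src), self.tgt_emb(tgt))
        return self.out(h)

def freeze(module: nn.Module):
    """Keep the module's random initialisation and exclude it from training."""
    for p in module.parameters():
        p.requires_grad = False

model = ToyNMT()

# Example ablation: freeze every attention sub-layer, train everything else.
for name, sub in model.transformer.named_modules():
    if isinstance(sub, nn.MultiheadAttention):
        freeze(sub)

# The optimiser only sees the parameters that remain trainable.
optimiser = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=3e-4
)

The same pattern applies to the embedding or FFN group by selecting those submodules instead, which is how the relative importance of each group can be compared.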

Related research

Recurrent Stacking of Layers for Compact Neural Machine Translation Models (07/14/2018)
In Neural Machine Translation (NMT), the most common practice is to stac...

Augmenting Self-attention with Persistent Memory (07/02/2019)
Transformer networks have lead to important progress in language modelin...

Recurrent Stacking of Layers in Neural Networks: An Application to Neural Machine Translation (06/18/2021)
In deep neural network modeling, the most common practice is to stack a ...

LocalViT: Bringing Locality to Vision Transformers (04/12/2021)
We study how to introduce locality mechanisms into vision transformers. ...

Optimizing transformer-based machine translation model for single GPU training: a hyperparameter ablation study (08/11/2023)
In machine translation tasks, the relationship between model complexity ...

On the Strengths of Cross-Attention in Pretrained Transformers for Machine Translation (04/18/2021)
We study the power of cross-attention in the Transformer architecture wi...

Emergent inabilities? Inverse scaling over the course of pretraining (05/24/2023)
Does inverse scaling only occur as a function of model parameter size, o...
