Greedy Ordering of Layer Weight Matrices in Transformers Improves Translation

02/04/2023
by Elicia Ye, et al.

Prior work has attempted to understand the internal structures and functionalities of Transformer-based encoder-decoder architectures at the level of multi-head attention and feed-forward sublayers. Interpretations have focused on the encoder and decoder, along with the combinatorial possibilities of the self-attention, cross-attention, and feed-forward sublayers. However, without examining the low-level structures, one gains only a limited understanding of the motivation behind sublayer reordering. Could we dive below the sublayer abstraction and permute layer weight matrices to improve the quality of translation? We propose AEIUOrder, which greedily reorders layer weight matrices in the encoder by their well-trainedness, as measured by Heavy-Tailed Self-Regularization (HT-SR) metrics, and orders the decoder matrices correspondingly. Our results suggest that greedily reordering layer weight matrices to maximize Total well-trainedness enables the model to learn representations and generate translations more effectively.
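The following is a minimal, illustrative Python sketch of the kind of procedure the abstract describes, not the paper's actual implementation. It assumes that the HT-SR well-trainedness of each weight matrix is summarized by a power-law exponent (alpha) fitted to the tail of the eigenvalue spectrum of W^T W, as in HT-SR tooling such as the weightwatcher library, and that a greedy ordering simply sorts encoder matrices by that score; the exact metric, tail-fitting procedure, and decoder mirroring rule used by AEIUOrder may differ, and the helper names hill_alpha and greedy_order are hypothetical.

import numpy as np

def hill_alpha(W, tail_frac=0.5):
    """Crude HT-SR 'alpha' proxy (assumption): Hill estimator of the
    power-law exponent on the upper tail of the eigenvalues of W^T W.
    Lower alpha is commonly read as 'better trained'."""
    evals = np.linalg.eigvalsh(W.T @ W)        # spectrum of the correlation matrix
    evals = np.sort(evals[evals > 1e-12])
    k = max(2, int(len(evals) * tail_frac))    # number of tail eigenvalues used in the fit
    tail = evals[-k:]
    return 1.0 + k / np.sum(np.log(tail / tail[0]))

def greedy_order(weight_matrices):
    """Return layer indices sorted from most to least well-trained
    (ascending alpha), i.e. a greedy ordering by well-trainedness."""
    alphas = [hill_alpha(W) for W in weight_matrices]
    return sorted(range(len(weight_matrices)), key=lambda i: alphas[i])

# Toy usage: random matrices stand in for trained encoder layer weights.
rng = np.random.default_rng(0)
encoder_weights = [rng.standard_normal((256, 256)) for _ in range(6)]
perm = greedy_order(encoder_weights)
encoder_reordered = [encoder_weights[i] for i in perm]
decoder_perm = perm  # assumption: mirror the same ordering on the decoder side
print("greedy layer order:", perm)

Sorting ascending by alpha is one plausible reading of "maximizing Total well-trainedness"; in practice the metric would be computed with dedicated HT-SR tooling and the effect verified on translation quality.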


Related research

- Understanding How Encoder-Decoder Architectures Attend (10/28/2021): Encoder-decoder networks with attention have proven to be a powerful way...
- Are Transformers with One Layer Self-Attention Using Low-Rank Weight Matrices Universal Approximators? (07/26/2023): Existing analyses of the expressive capacity of Transformer models have...
- Analyzing Word Translation of Transformer Layers (03/21/2020): The Transformer translation model is popular for its effective paralleli...
- IOT: Instance-wise Layer Reordering for Transformer Structures (03/05/2021): With sequentially stacked self-attention, (optional) encoder-decoder att...
- Improving Generalization of Transformer for Speech Recognition with Parallel Schedule Sampling and Relative Positional Embedding (11/01/2019): Transformer showed promising results in many sequence to sequence transf...
- Generating Diverse Translation by Manipulating Multi-Head Attention (11/21/2019): Transformer model has been widely used on machine translation tasks and...
