IOT: Instance-wise Layer Reordering for Transformer Structures

by Jinhua Zhu, et al.

With sequentially stacked self-attention, (optional) encoder-decoder attention, and feed-forward layers, the Transformer has achieved great success in natural language processing (NLP), and many variants have been proposed. Almost all of these models assume that the layer order is fixed and identical across data samples. We observe that different data samples actually favor different layer orders. Based on this observation, we break the fixed-layer-order assumption in the Transformer and introduce instance-wise layer reordering into the model structure. Our Instance-wise Ordered Transformer (IOT) can model different functions through reordered layers, which lets each sample select the better-suited order and improves model performance with almost the same number of parameters. To achieve this, we introduce a light predictor, with negligible parameter and inference cost, that decides the most capable and favorable layer order for any input sequence. Experiments on 3 tasks (neural machine translation, abstractive summarization, and code generation) and 9 datasets demonstrate consistent improvements from our method. We further show that our method can also be applied to architectures beyond the Transformer. Our code is released on GitHub.
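The idea of instance-wise layer ordering can be illustrated with a toy sketch. This is not the released IOT code: the three sub-layer functions, the mean-pooling predictor, and the `weights` vector are all hypothetical stand-ins chosen so the example runs with the standard library only. It shows one shared set of sub-layers, a lightweight predictor that scores every candidate order for a given input, and a forward pass that applies the sub-layers in the selected order.

```python
from itertools import permutations
from typing import Callable, Dict, List, Sequence, Tuple

Vector = List[float]
Layer = Callable[[Vector], Vector]


# Hypothetical stand-ins for the three sub-layers. They do not commute,
# so applying them in a different order yields a different function.
def self_attention(x: Vector) -> Vector:
    mean = sum(x) / len(x)
    return [0.5 * v + 0.5 * mean for v in x]  # mix each element with the mean


def cross_attention(x: Vector) -> Vector:
    return [v + 1.0 for v in x]  # add a fixed "memory" value


def feed_forward(x: Vector) -> Vector:
    return [2.0 * max(v, 0.0) for v in x]  # ReLU followed by scaling


LAYERS: Dict[str, Layer] = {
    "sa": self_attention,
    "ca": cross_attention,
    "ffn": feed_forward,
}
# All 3! = 6 candidate orders over the same shared sub-layers.
ORDERS: List[Tuple[str, ...]] = list(permutations(LAYERS))


def predict_order(x: Vector, weights: Sequence[float]) -> Tuple[str, ...]:
    """Assumed light predictor: one weight per order, scored on the pooled input."""
    pooled = sum(x) / len(x)
    scores = [w * pooled for w in weights]
    best = max(range(len(ORDERS)), key=scores.__getitem__)
    return ORDERS[best]


def forward(x: Vector, weights: Sequence[float]) -> Tuple[Tuple[str, ...], Vector]:
    """Pick an order for this instance, then apply the shared sub-layers in it."""
    order = predict_order(x, weights)
    for name in order:
        x = LAYERS[name](x)
    return order, x


# Illustrative predictor weights (made up, not learned).
weights = [0.1, -0.3, 0.2, 0.0, 0.4, -0.1]
order_a, out_a = forward([1.0, 3.0], weights)
order_b, out_b = forward([-1.0, -3.0], weights)
```

In this sketch the predictor adds only one scalar per candidate order, which mirrors the abstract's claim of a negligible parameter cost; how the actual predictor is parameterized and trained is not stated in the abstract.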


GTrans: Grouping and Fusing Transformer Layers for Neural Machine Translation

Transformer structure, stacked by a sequence of encoder and decoder netw...

On the Sub-Layer Functionalities of Transformer Decoder

There have been significant efforts to interpret the encoder of Transfor...

Contrastive Triple Extraction with Generative Transformer

Triple extraction is an essential task in information extraction for nat...

Understanding and Improving Transformer From a Multi-Particle Dynamic System Point of View

The Transformer architecture is widely used in natural language processi...

GLU Variants Improve Transformer

Gated Linear Units (arXiv:1612.08083) consist of the component-wise prod...

Multi-Head Attention: Collaborate Instead of Concatenate

Attention layers are widely used in natural language processing (NLP) an...

Greedy Ordering of Layer Weight Matrices in Transformers Improves Translation

Prior work has attempted to understand the internal structures and funct...
