Lessons on Parameter Sharing across Layers in Transformers

by   Sho Takase, et al.

We propose a parameter sharing method for Transformers (Vaswani et al., 2017). The proposed approach relaxes a widely used technique, which shares parameters for one layer with all layers such as Universal Transformers (Dehghani et al., 2019), to increase the efficiency in the computational time. We propose three strategies: Sequence, Cycle, and Cycle (rev) to assign parameters to each layer. Experimental results show that the proposed strategies are efficient in the parameter size and computational time. Moreover, we indicate that the proposed strategies are also effective in the configuration where we use many training data such as the recent WMT competition.


page 1

page 2

page 3

page 4


Subformer: Exploring Weight Sharing for Parameter Efficiency in Generative Transformers

The advent of the Transformer can arguably be described as a driving for...

Dropout against Deep Leakage from Gradients

As the scale and size of the data increases significantly nowadays, fede...

A task in a suit and a tie: paraphrase generation with semantic augmentation

Paraphrasing is rooted in semantics. We show the effectiveness of transf...

Finetuning Pretrained Transformers into Variational Autoencoders

Text variational autoencoders (VAEs) are notorious for posterior collaps...

Stabilizing Transformer-Based Action Sequence Generation For Q-Learning

Since the publication of the original Transformer architecture (Vaswani ...

Sliced Recursive Transformer

We present a neat yet effective recursive operation on vision transforme...

On Layer Normalizations and Residual Connections in Transformers

In the perspective of a layer normalization (LN) position, the architect...