Lessons on Parameter Sharing across Layers in Transformers

04/13/2021
by Sho Takase, et al.

We propose a parameter sharing method for Transformers (Vaswani et al., 2017). The proposed approach relaxes a widely used technique that shares the parameters of one layer across all layers, as in Universal Transformers (Dehghani et al., 2019), in order to increase efficiency in computational time. We propose three strategies, Sequence, Cycle, and Cycle (rev), for assigning parameters to each layer. Experimental results show that the proposed strategies are efficient with respect to both parameter size and computational time. Moreover, we show that the proposed strategies are also effective in configurations with a large amount of training data, such as those used in the recent WMT competition.
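
The abstract does not spell out how each strategy maps the M unique parameter sets onto the N layers. The pure-Python sketch below illustrates one natural reading (the exact assignment patterns, the function names, and the N=6, M=3 example are assumptions for illustration, not taken from the paper): Sequence assigns each parameter set to a block of consecutive layers, Cycle repeats the sets in order, and Cycle (rev) repeats them in order but traverses the final cycle in reverse.

# Minimal sketch (assumed interpretation, not the authors' code): given M
# unique parameter sets and N layers, return which parameter set each layer
# uses under the three assignment strategies named in the abstract.

def sequence_assignment(num_layers: int, num_params: int) -> list[int]:
    # Consecutive layers share the same parameter set,
    # e.g. N=6, M=3 -> [0, 0, 1, 1, 2, 2]
    block = num_layers // num_params
    return [min(i // block, num_params - 1) for i in range(num_layers)]

def cycle_assignment(num_layers: int, num_params: int) -> list[int]:
    # Parameter sets are reused in a repeating cycle,
    # e.g. N=6, M=3 -> [0, 1, 2, 0, 1, 2]
    return [i % num_params for i in range(num_layers)]

def cycle_rev_assignment(num_layers: int, num_params: int) -> list[int]:
    # Like Cycle, but the last cycle is traversed in reverse order,
    # e.g. N=6, M=3 -> [0, 1, 2, 2, 1, 0]
    assignment = [i % num_params for i in range(num_layers - num_params)]
    assignment += list(reversed(range(num_params)))
    return assignment

if __name__ == "__main__":
    for name, fn in [("Sequence", sequence_assignment),
                     ("Cycle", cycle_assignment),
                     ("Cycle (rev)", cycle_rev_assignment)]:
        print(f"{name:12s}", fn(6, 3))

In all three cases only M sets of layer parameters are stored, so the model keeps the parameter count of an M-layer Transformer while running N layers of computation; the strategies differ only in which stored set each layer reuses.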


Related research:

01/01/2021 · Subformer: Exploring Weight Sharing for Parameter Efficiency in Generative Transformers - The advent of the Transformer can arguably be described as a driving for...
08/25/2021 · Dropout against Deep Leakage from Gradients - As the scale and size of the data increases significantly nowadays, fede...
10/31/2018 · A task in a suit and a tie: paraphrase generation with semantic augmentation - Paraphrasing is rooted in semantics. We show the effectiveness of transf...
08/05/2021 · Finetuning Pretrained Transformers into Variational Autoencoders - Text variational autoencoders (VAEs) are notorious for posterior collaps...
10/23/2020 · Stabilizing Transformer-Based Action Sequence Generation For Q-Learning - Since the publication of the original Transformer architecture (Vaswani ...
11/09/2021 · Sliced Recursive Transformer - We present a neat yet effective recursive operation on vision transforme...
06/01/2022 · On Layer Normalizations and Residual Connections in Transformers - In the perspective of a layer normalization (LN) position, the architect...