
Subformer: Exploring Weight Sharing for Parameter Efficiency in Generative Transformers

by   Machel Reid, et al.

The advent of the Transformer can arguably be described as a driving force behind many of the recent advances in natural language processing. However, despite its sizeable performance improvements, the model has recently been shown to be severely over-parameterized, making it parameter-inefficient and computationally expensive to train. Inspired by the success of parameter sharing in pretrained deep contextualized word representation encoders, we explore parameter-sharing methods in Transformers, with a specific focus on encoder-decoder models for sequence-to-sequence tasks such as neural machine translation. We analyze different parameter sharing/reduction methods and develop the Subformer, a parameter-efficient Transformer-based model that combines the newly proposed Sandwich-style parameter-sharing technique, designed to overcome the deficiencies of naive cross-layer parameter sharing for generative models, with self-attentive embedding factorization (SAFE). Experiments on machine translation, abstractive summarization, and language modeling show that the Subformer can outperform the Transformer even when using significantly fewer parameters.
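The core idea of Sandwich-style sharing, as described above, is that the outermost layers of the stack keep their own parameters while all middle layers reuse a single shared set. A minimal sketch of that wiring is below; the names (`Layer`, `build_sandwich_stack`) are illustrative stand-ins, not identifiers from the paper's code, and a real implementation would share actual Transformer layer weights rather than a placeholder dict.

```python
# Hedged sketch of Sandwich-style parameter sharing: first and last layers
# keep distinct parameters; every middle layer aliases one shared set.
# `Layer` is a stand-in for a full Transformer layer.

class Layer:
    """Minimal stand-in for a Transformer layer holding a parameter object."""
    def __init__(self, params):
        self.params = params

def build_sandwich_stack(num_layers):
    """Build `num_layers` layers where layers 1..num_layers-2 share parameters."""
    assert num_layers >= 3, "need at least first, shared middle, and last"
    shared = {"name": "shared-middle"}          # one parameter set for the middle
    layers = [Layer({"name": "first"})]         # distinct bottom layer
    layers += [Layer(shared) for _ in range(num_layers - 2)]
    layers.append(Layer({"name": "last"}))      # distinct top layer
    return layers

stack = build_sandwich_stack(6)
# All middle layers alias the same object; the ends remain distinct.
print(stack[1].params is stack[4].params)   # True
print(stack[0].params is stack[1].params)   # False
```

Because the middle layers hold references to one parameter set, a stack of N layers stores roughly 3 layers' worth of parameters instead of N, which is the source of the parameter savings the abstract describes.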

Analyzing Architectures for Neural Machine Translation Using Low Computational Resources

With the recent developments in the field of Natural Language Processing...

Lessons on Parameter Sharing across Layers in Transformers

We propose a parameter sharing method for Transformers (Vaswani et al., ...

Parameter Sharing Methods for Multilingual Self-Attentional Translation Models

In multilingual neural machine translation, it has been shown that shari...

Hierarchical Transformer for Multilingual Machine Translation

The choice of parameter sharing strategy in multilingual machine transla...

Cascaded Head-colliding Attention

Transformers have advanced the field of natural language processing (NLP...

EdgeFormer: A Parameter-Efficient Transformer for On-Device Seq2seq Generation

We propose EdgeFormer – a parameter-efficient Transformer of the encoder...

DeLighT: Very Deep and Light-weight Transformer

We introduce a very deep and light-weight transformer, DeLighT, that del...