DeepAI

Subformer: Exploring Weight Sharing for Parameter Efficiency in Generative Transformers

01/01/2021
by Machel Reid, et al.

The advent of the Transformer can arguably be described as a driving force behind many of the recent advances in natural language processing. However, despite its sizeable performance improvements, the model has recently been shown to be severely over-parameterized: it is parameter-inefficient and computationally expensive to train. Inspired by the success of parameter sharing in pretrained deep contextualized word representation encoders, we explore parameter-sharing methods in Transformers, with a specific focus on encoder-decoder models for sequence-to-sequence tasks such as neural machine translation. We analyze different parameter sharing/reduction methods and develop the Subformer, a parameter-efficient Transformer-based model that combines the newly proposed Sandwich-style parameter sharing technique, designed to overcome the deficiencies of naive cross-layer parameter sharing in generative models, with self-attentive embedding factorization (SAFE). Experiments on machine translation, abstractive summarization, and language modeling show that the Subformer can outperform the Transformer even when using significantly fewer parameters.
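The core idea behind Sandwich-style sharing is that the first and last layers of the stack keep their own parameters while the middle layers all alias a single shared parameter set, so the parameter count grows only marginally with depth. The sketch below is a minimal, framework-free illustration of that layer-aliasing pattern; the `Layer` class and `build_sandwich_stack` function are hypothetical names, not the authors' implementation.

```python
class Layer:
    """Stand-in for a full Transformer layer and its weights."""
    def __init__(self, name):
        self.name = name

def build_sandwich_stack(num_layers):
    """Return num_layers layer slots where every middle slot
    aliases one shared Layer object (Sandwich-style sharing)."""
    first = Layer("first")            # own parameters
    shared = Layer("shared-middle")   # one parameter set reused below
    last = Layer("last")              # own parameters
    return [first] + [shared] * (num_layers - 2) + [last]

stack = build_sandwich_stack(6)
distinct = len({id(layer) for layer in stack})
print(distinct)  # 6 slots, but only 3 distinct parameter sets
```

Because the middle slots are the same object, a gradient update to any middle layer updates all of them at once, which is what makes the scheme parameter-efficient without shrinking the effective depth.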

