Subformer: Exploring Weight Sharing for Parameter Efficiency in Generative Transformers

01/01/2021
by Machel Reid et al.

The advent of the Transformer can arguably be described as a driving force behind many of the recent advances in natural language processing. However, despite its sizeable performance improvements, the model has recently been shown to be severely over-parameterized, making it parameter-inefficient and computationally expensive to train. Inspired by the success of parameter sharing in pretrained deep contextualized word representation encoders, we explore parameter-sharing methods in Transformers, with a specific focus on encoder-decoder models for sequence-to-sequence tasks such as neural machine translation. We analyze different parameter sharing and reduction methods and develop the Subformer, a parameter-efficient Transformer-based model that combines the newly proposed Sandwich-style parameter sharing technique - designed to overcome the deficiencies of naive cross-layer parameter sharing in generative models - with self-attentive embedding factorization (SAFE). Experiments on machine translation, abstractive summarization, and language modeling show that the Subformer can outperform the Transformer even when using significantly fewer parameters.
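As a rough illustration of the sandwich-style idea, the PyTorch sketch below keeps independent weights for the first and last encoder layers while reusing a single shared layer for every inner position, and pairs this with a factorized embedding. The layer counts, dimensions, class names, and the plain linear up-projection standing in for SAFE's self-attentive factorization are placeholders for illustration, not the authors' released implementation.

    # Minimal sketch of sandwich-style parameter sharing (illustrative only,
    # not the paper's code). First and last layers keep independent weights;
    # all inner layers reuse one shared set of parameters. The factorized
    # embedding uses a plain linear up-projection as a stand-in for SAFE.
    import torch
    import torch.nn as nn


    class FactorizedEmbedding(nn.Module):
        """Embed into a small dimension, then project up to the model dimension."""

        def __init__(self, vocab_size, d_embed, d_model):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, d_embed)      # small embedding table
            self.up = nn.Linear(d_embed, d_model, bias=False)   # stand-in for SAFE

        def forward(self, tokens):
            return self.up(self.embed(tokens))


    class SandwichSharedEncoder(nn.Module):
        """First and last layers are unique; every inner layer shares one parameter set."""

        def __init__(self, vocab_size=32000, d_model=512, d_embed=128,
                     nhead=8, num_layers=6):
            super().__init__()

            def make_layer():
                return nn.TransformerEncoderLayer(
                    d_model, nhead, dim_feedforward=4 * d_model, batch_first=True
                )

            self.embedding = FactorizedEmbedding(vocab_size, d_embed, d_model)
            self.first = make_layer()     # independent bottom layer
            self.shared = make_layer()    # reused for all inner layers
            self.last = make_layer()      # independent top layer
            self.num_inner = num_layers - 2

        def forward(self, tokens):
            x = self.embedding(tokens)
            x = self.first(x)
            for _ in range(self.num_inner):   # apply the single shared layer repeatedly
                x = self.shared(x)
            return self.last(x)


    # Example: a 6-layer stack with only 3 distinct sets of layer parameters.
    model = SandwichSharedEncoder()
    out = model(torch.randint(0, 32000, (2, 16)))  # (batch=2, seq_len=16) token ids
    print(out.shape)  # torch.Size([2, 16, 512])

Under this scheme the parameter count is governed by the number of distinct layers (three here) rather than the nominal depth, and the embedding table shrinks from vocab_size x d_model to vocab_size x d_embed plus a small projection, which is broadly where the reported savings come from.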


