Scalable Transformers for Neural Machine Translation

06/04/2021
by Peng Gao, et al.

The Transformer has been widely adopted in Neural Machine Translation (NMT) because of its large capacity and its parallel training of sequence generation. However, deploying Transformers is challenging because different scenarios require models of different complexities and scales. Naively training multiple Transformers is redundant in terms of both computation and memory. In this paper, we propose novel scalable Transformers, which naturally contain sub-Transformers of different scales that share parameters. Each sub-Transformer can be easily obtained by cropping the parameters of the largest Transformer. A three-stage training scheme is proposed to tackle the difficulty of training the scalable Transformers, introducing additional supervision from word-level and sequence-level self-distillation. Extensive experiments were conducted on WMT En-De and En-Fr to validate the proposed scalable Transformers.
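
To make the two key mechanisms above concrete, here is a minimal PyTorch sketch of (a) cropping the parameters of a large Transformer layer to obtain a sub-layer with shared weights, and (b) a word-level self-distillation loss that pushes the sub-model's token distributions toward the full model's. The slicing rule, the helper names (crop_linear, word_level_distill_loss), and the exact loss form are illustrative assumptions rather than the paper's verbatim implementation.

```python
# A minimal, illustrative sketch only: the cropping rule (taking the leading
# rows/columns of each weight) and the distillation loss below are assumptions,
# not the paper's exact implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


def crop_linear(full: nn.Linear, in_dim: int, out_dim: int) -> nn.Linear:
    """Build a smaller linear layer whose weights are the leading slice of a
    larger one, so the sub-Transformer shares (a crop of) the full model's
    parameters instead of being trained from scratch."""
    sub = nn.Linear(in_dim, out_dim, bias=full.bias is not None)
    with torch.no_grad():
        sub.weight.copy_(full.weight[:out_dim, :in_dim])
        if full.bias is not None:
            sub.bias.copy_(full.bias[:out_dim])
    return sub


def word_level_distill_loss(student_logits: torch.Tensor,
                            teacher_logits: torch.Tensor,
                            temperature: float = 1.0) -> torch.Tensor:
    """Word-level self-distillation: KL divergence between the sub-model's and
    the full model's per-token output distributions (one common formulation)."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits.detach() / temperature, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2


# Example: crop a 1024-to-4096 feed-forward projection down to a 512-to-2048
# projection for a smaller sub-Transformer.
full_ffn = nn.Linear(1024, 4096)
sub_ffn = crop_linear(full_ffn, in_dim=512, out_dim=2048)
print(sub_ffn.weight.shape)  # torch.Size([2048, 512])

# Example: word-level self-distillation on dummy logits of shape
# (batch * target_length, vocab_size).
student_logits = torch.randn(8, 32000)
teacher_logits = torch.randn(8, 32000)
print(word_level_distill_loss(student_logits, teacher_logits))
```

Sequence-level self-distillation is typically realized by training the sub-models on target sentences decoded by the full model; the abstract's three-stage scheme combines this with the word-level supervision sketched above.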


Related research

10/21/2020 · Multi-Unit Transformers for Neural Machine Translation
Transformer models achieve remarkable success in Neural Machine Translat...

08/25/2021 · The Next 700 Program Transformers
In this paper, we describe a hierarchy of program transformers in which ...

08/11/2022 · Structural Biases for Improving Transformers on Translation into Morphologically Rich Languages
Machine translation has seen rapid progress with the advent of Transform...

02/21/2020 · Accessing Higher-level Representations in Sequential Transformers with Feedback Memory
Transformers are feedforward networks that can process input tokens in p...

09/17/2020 · Dissecting Lottery Ticket Transformers: Structural and Behavioral Study of Sparse Neural Machine Translation
Recent work on the lottery ticket hypothesis has produced highly sparse ...

02/09/2023 · Binarized Neural Machine Translation
The rapid scaling of language models is motivating research using low-bi...

05/04/2023 · BranchNorm: Robustly Scaling Extremely Deep Transformers
Recently, DeepNorm scales Transformers into extremely deep (i.e., 1000 l...
