Log In Sign Up

Scalable Transformers for Neural Machine Translation

by   Peng Gao, et al.

Transformer has been widely adopted in Neural Machine Translation (NMT) because of its large capacity and parallel training of sequence generation. However, the deployment of Transformer is challenging because different scenarios require models of different complexities and scales. Naively training multiple Transformers is redundant in terms of both computation and memory. In this paper, we propose a novel scalable Transformers, which naturally contains sub-Transformers of different scales and have shared parameters. Each sub-Transformer can be easily obtained by cropping the parameters of the largest Transformer. A three-stage training scheme is proposed to tackle the difficulty of training the scalable Transformers, which introduces additional supervisions from word-level and sequence-level self-distillation. Extensive experiments were conducted on WMT EN-De and En-Fr to validate our proposed scalable Transformers.


page 1

page 2

page 3

page 4


Multi-Unit Transformers for Neural Machine Translation

Transformer models achieve remarkable success in Neural Machine Translat...

The Next 700 Program Transformers

In this paper, we describe a hierarchy of program transformers in which ...

Structural Biases for Improving Transformers on Translation into Morphologically Rich Languages

Machine translation has seen rapid progress with the advent of Transform...

Deep Transformers with Latent Depth

The Transformer model has achieved state-of-the-art performance in many ...

Accessing Higher-level Representations in Sequential Transformers with Feedback Memory

Transformers are feedforward networks that can process input tokens in p...

Dissecting Lottery Ticket Transformers: Structural and Behavioral Study of Sparse Neural Machine Translation

Recent work on the lottery ticket hypothesis has produced highly sparse ...

What is my math transformer doing? – Three results on interpretability and generalization

This paper investigates the failure cases and out-of-distribution behavi...