
Scalable Transformers for Neural Machine Translation

06/04/2021
by Peng Gao, et al.

The Transformer has been widely adopted in Neural Machine Translation (NMT) because of its large capacity and parallelizable training for sequence generation. However, deploying Transformers is challenging because different scenarios require models of different complexities and scales. Naively training multiple Transformers is redundant in terms of both computation and memory. In this paper, we propose a novel scalable Transformer, which naturally contains sub-Transformers of different scales that share parameters. Each sub-Transformer can be easily obtained by cropping the parameters of the largest Transformer. A three-stage training scheme is proposed to tackle the difficulty of training the scalable Transformer, introducing additional supervision from word-level and sequence-level self-distillation. Extensive experiments were conducted on WMT En-De and En-Fr to validate the proposed scalable Transformer.
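
As a rough illustration of the parameter-cropping idea, the sketch below shows a hypothetical PyTorch layer whose smaller variants reuse the leading slice of the full weight matrix, together with one plausible form of a word-level self-distillation loss. The class `ScalableLinear`, its cropping rule, the `word_level_self_distillation` helper, and all dimensions are assumptions made for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ScalableLinear(nn.Module):
    """Linear layer whose sub-model variants reuse a slice of the full weight.

    Hypothetical cropping rule (illustration only): a sub-model of width
    ratio r uses the first r * out_features rows, and as many columns as its
    cropped input width, of the full weight, so every scale shares parameters.
    """

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.out_features = out_features
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        self.bias = nn.Parameter(torch.zeros(out_features))
        nn.init.xavier_uniform_(self.weight)

    def forward(self, x: torch.Tensor, ratio: float = 1.0) -> torch.Tensor:
        # Crop the shared parameters to the requested sub-model size.
        out_dim = max(1, int(self.out_features * ratio))
        in_dim = x.size(-1)  # the caller feeds an already-cropped hidden state
        return F.linear(x, self.weight[:out_dim, :in_dim], self.bias[:out_dim])


def word_level_self_distillation(sub_logits, full_logits, temperature=1.0):
    """KL divergence between the sub-Transformer's and the full Transformer's
    per-token output distributions (one plausible word-level distillation loss)."""
    teacher = F.softmax(full_logits.detach() / temperature, dim=-1)
    student = F.log_softmax(sub_logits / temperature, dim=-1)
    return F.kl_div(student, teacher, reduction="batchmean") * temperature ** 2


# Example: the half-width sub-model reuses a 1024x256 slice of the 2048x512 weight.
layer = ScalableLinear(512, 2048)
y_full = layer(torch.randn(8, 10, 512), ratio=1.0)
y_half = layer(torch.randn(8, 10, 256), ratio=0.5)
```

In a full encoder-decoder model, the same kind of slicing would have to be applied consistently to the embedding, attention-projection, and feed-forward weights, and sequence-level self-distillation would typically train the sub-models on translations decoded by the full model.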


Related research

Multi-Unit Transformers for Neural Machine Translation (10/21/2020)
Transformer models achieve remarkable success in Neural Machine Translat...

The Next 700 Program Transformers (08/25/2021)
In this paper, we describe a hierarchy of program transformers in which ...

Structural Biases for Improving Transformers on Translation into Morphologically Rich Languages (08/11/2022)
Machine translation has seen rapid progress with the advent of Transform...

Deep Transformers with Latent Depth (09/28/2020)
The Transformer model has achieved state-of-the-art performance in many ...

Accessing Higher-level Representations in Sequential Transformers with Feedback Memory (02/21/2020)
Transformers are feedforward networks that can process input tokens in p...

Dissecting Lottery Ticket Transformers: Structural and Behavioral Study of Sparse Neural Machine Translation (09/17/2020)
Recent work on the lottery ticket hypothesis has produced highly sparse ...

What is my math transformer doing? – Three results on interpretability and generalization (10/31/2022)
This paper investigates the failure cases and out-of-distribution behavi...