Balancing Cost and Benefit with Tied-Multi Transformers

02/20/2020
by Raj Dabre, et al.

We propose and evaluate a novel procedure for training multiple Transformers with tied parameters, which compresses multiple models into one and enables a dynamic choice of the number of encoder and decoder layers at decoding time. In sequence-to-sequence modeling, the output of the last layer of an N-layer encoder is typically fed to an M-layer decoder, and only the output of the last decoder layer is used to compute the loss. Instead, our method computes a single loss that sums N×M losses, each computed from the output of one of the M decoder layers connected to one of the N encoder layers. Such a model subsumes N×M models with different numbers of encoder and decoder layers, and it can be used for decoding with fewer than the maximum number of encoder and decoder layers. We then propose a mechanism to choose, a priori, the number of encoder and decoder layers for faster decoding, and we also explore recurrent stacking of layers and knowledge distillation for model compression. We present a cost-benefit analysis of applying the proposed approaches to neural machine translation and show that they reduce decoding costs while preserving translation quality.
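The combined objective can be illustrated with a short sketch. Below is a minimal PyTorch-style sketch, assuming standard TransformerEncoderLayer/TransformerDecoderLayer modules and a shared output projection; the class and variable names (e.g. TiedMultiSketch, enc_layers, dec_layers) are illustrative and not taken from the paper's implementation.

```python
import torch
import torch.nn as nn

class TiedMultiSketch(nn.Module):
    """Toy encoder-decoder whose training loss sums the N x M sub-model losses."""

    def __init__(self, vocab=100, d_model=64, n_enc=3, n_dec=3):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.enc_layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            for _ in range(n_enc))
        self.dec_layers = nn.ModuleList(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
            for _ in range(n_dec))
        self.proj = nn.Linear(d_model, vocab)   # softmax projection shared by all exits
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, src, tgt_in, tgt_out):
        # Causal mask so each target position only attends to earlier positions.
        L = tgt_in.size(1)
        causal = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)

        total = 0.0
        enc = self.embed(src)
        for enc_layer in self.enc_layers:        # n = 1 .. N
            enc = enc_layer(enc)
            dec = self.embed(tgt_in)
            for dec_layer in self.dec_layers:    # m = 1 .. M
                dec = dec_layer(dec, enc, tgt_mask=causal)
                logits = self.proj(dec)
                # One loss per (n, m) sub-model; the sum is the single N x M training loss.
                total = total + self.loss_fn(
                    logits.reshape(-1, logits.size(-1)), tgt_out.reshape(-1))
        return total


model = TiedMultiSketch()
src = torch.randint(0, 100, (2, 7))              # dummy source batch
tgt = torch.randint(0, 100, (2, 6))              # dummy target batch
loss = model(src, tgt[:, :-1], tgt[:, 1:])       # teacher forcing: shift target by one
loss.backward()
```

Because every (n, m) path contributes to the training loss, decoding can stop after any chosen encoder depth n and decoder depth m, which is what allows the trade-off between decoding cost and translation quality described above.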


research
08/27/2019

Multi-Layer Softmaxing during Training Neural Machine Translation for Flexible Decoding with Fewer Layers

This paper proposes a novel procedure for training an encoder-decoder ba...
research
05/16/2020

Layer-Wise Cross-View Decoding for Sequence-to-Sequence Learning

In sequence-to-sequence learning, the attention mechanism has been a gre...
research
09/06/2023

Dynamic Encoding and Decoding of Information for Split Learning in Mobile-Edge Computing: Leveraging Information Bottleneck Theory

Split learning is a privacy-preserving distributed learning paradigm in ...
research
08/23/2021

Regularizing Transformers With Deep Probabilistic Layers

Language models (LM) have grown with non-stop in the last decade, from s...
research
11/21/2022

You Need Multiple Exiting: Dynamic Early Exiting for Accelerating Unified Vision Language Model

Large-scale Transformer models bring significant improvements for variou...
research
06/18/2021

Recurrent Stacking of Layers in Neural Networks: An Application to Neural Machine Translation

In deep neural network modeling, the most common practice is to stack a ...
research
04/24/2020

On Sparsifying Encoder Outputs in Sequence-to-Sequence Models

Sequence-to-sequence models usually transfer all encoder outputs to the ...
