DeLighT: Very Deep and Light-weight Transformer

08/03/2020
by Sachin Mehta et al.

We introduce DeLighT, a very deep and light-weight transformer that delivers performance similar to or better than transformer-based models with significantly fewer parameters. DeLighT allocates parameters more efficiently both (1) within each transformer block, using DExTra, a deep and light-weight transformation, and (2) across blocks, using block-wise scaling, which allows for shallower and narrower DeLighT blocks near the input and wider and deeper DeLighT blocks near the output. Overall, DeLighT networks are 2.5 to 4 times deeper than standard transformer models, yet they have fewer parameters and operations. Experiments on machine translation and language modeling tasks show that DeLighT matches the performance of baseline transformers with significantly fewer parameters. On the high-resource WMT'14 En-Fr dataset, DeLighT requires 1.8 times fewer parameters and 2 times fewer operations while achieving better performance (+0.4 BLEU) than baseline transformers. On the low-resource WMT'16 En-Ro dataset, DeLighT delivers similar performance with 2.8 times fewer parameters.
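To make the two ideas concrete, the sketch below pairs a DExTra-style expand-reduce stack of grouped linear transformations with a block-wise scaling schedule. It is a minimal illustration under stated assumptions, not the authors' implementation: DExTra is built from group linear transformations in the paper, but the exact dimension schedule, the group-count rule, the GELU nonlinearity, and the names and defaults used here (n_min, n_max, w_max, g_max) are assumptions made for clarity.

```python
# A minimal PyTorch sketch of DExTra-style blocks plus block-wise scaling.
# This is NOT the authors' implementation: the expand-reduce schedule,
# group-count rule, and all defaults (n_min, n_max, w_max, g_max) are
# illustrative assumptions.
import torch
import torch.nn as nn

class GroupLinear(nn.Module):
    """Grouped linear transform: split the features into `groups` chunks,
    map each chunk with its own small weight matrix, then concatenate.
    With g groups this uses roughly 1/g of a dense layer's parameters."""
    def __init__(self, in_dim: int, out_dim: int, groups: int):
        super().__init__()
        assert in_dim % groups == 0 and out_dim % groups == 0
        self.groups = groups
        self.weight = nn.Parameter(
            torch.randn(groups, in_dim // groups, out_dim // groups) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        *lead, d = x.shape
        x = x.reshape(-1, self.groups, d // self.groups)
        y = torch.einsum("ngi,gio->ngo", x, self.weight)
        return y.reshape(*lead, -1)

def expand_reduce_dims(d_model: int, depth: int, w: float) -> list:
    """Feature sizes for an expand-reduce stack: grow from d_model towards
    w * d_model, then shrink back to d_model. Sizes are rounded to
    multiples of 8 so the grouped layers always divide evenly."""
    half = max(depth // 2, 1)
    dims = []
    for i in range(depth + 1):
        frac = min(min(i, depth - i) / half, 1.0)
        dims.append(max(8, int(d_model * (1.0 + (w - 1.0) * frac)) // 8 * 8))
    return dims

class DExTraSketch(nn.Module):
    """Illustrative DExTra-style transformation: a deep, light-weight stack
    of grouped linear layers with GELUs, expanding then reducing width."""
    def __init__(self, d_model: int, depth: int, w: float, g_max: int = 4):
        super().__init__()
        dims = expand_reduce_dims(d_model, depth, w)
        layers = []
        for l, (din, dout) in enumerate(zip(dims[:-1], dims[1:])):
            groups = min(2 ** l, g_max)  # more groups deeper in the stack
            layers += [GroupLinear(din, dout, groups), nn.GELU()]
        self.net = nn.Sequential(*layers[:-1])  # drop the final activation

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def blockwise_scaling(num_blocks: int, n_min: int = 4, n_max: int = 8,
                      w_max: float = 2.0):
    """Block-wise scaling: per-block (depth, width multiplier), shallow and
    narrow near the input, deep and wide near the output."""
    configs = []
    for b in range(num_blocks):
        t = b / max(num_blocks - 1, 1)  # 0 at the input, 1 at the output
        configs.append((round(n_min + (n_max - n_min) * t),
                        1.0 + (w_max - 1.0) * t))
    return configs

if __name__ == "__main__":
    blocks = nn.Sequential(*[DExTraSketch(128, depth, w)
                             for depth, w in blockwise_scaling(6)])
    x = torch.randn(2, 10, 128)  # (batch, sequence, d_model)
    print(blocks(x).shape)       # torch.Size([2, 10, 128])
```

The grouped layers are what keep the deeper stack light: with g groups, each layer holds roughly 1/g of the parameters of an equivalent dense layer, so per-block depth can grow without the parameter count growing in step, which is the trade the abstract describes.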


Code Repositories

delight

DeLighT: Very Deep and Light-Weight Transformers
