DeLighT: Very Deep and Light-weight Transformer

by Sachin Mehta, et al.

We introduce a very deep and light-weight transformer, DeLighT, that delivers similar or better performance than transformer-based models with significantly fewer parameters. DeLighT allocates parameters more efficiently both (1) within each Transformer block, using DExTra, a deep and light-weight transformation, and (2) across blocks, using block-wise scaling, which allows for shallower and narrower DeLighT blocks near the input and wider and deeper DeLighT blocks near the output. Overall, DeLighT networks are 2.5 to 4 times deeper than standard transformer models and yet have fewer parameters and operations. Experiments on machine translation and language modeling tasks show that DeLighT matches the performance of baseline Transformers with significantly fewer parameters. On the WMT'14 En-Fr high-resource dataset, DeLighT requires 1.8 times fewer parameters and 2 times fewer operations and achieves better performance (+0.4 BLEU score) than baseline transformers. On the WMT'16 En-Ro low-resource dataset, DeLighT delivers similar performance with 2.8 times fewer parameters than baseline transformers.
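The block-wise scaling idea described above can be illustrated with a small sketch. The abstract does not state the exact scaling rule, so the linear interpolation used here is an assumption made for illustration, and the function name `blockwise_depths` and its parameters are hypothetical rather than taken from the DeLighT code base.

```python
def blockwise_depths(num_blocks, min_depth, max_depth):
    """Assign a depth to each block, growing from input to output.

    Assumption: depth is interpolated linearly between min_depth (first
    block, nearest the input) and max_depth (last block, nearest the
    output). The abstract only says blocks get deeper and wider toward
    the output; the linear rule is illustrative.
    """
    if num_blocks < 1:
        raise ValueError("num_blocks must be at least 1")
    if min_depth > max_depth:
        raise ValueError("min_depth must not exceed max_depth")
    if num_blocks == 1:
        return [min_depth]
    step = (max_depth - min_depth) / (num_blocks - 1)
    # Round to the nearest integer because a depth is a layer count.
    return [round(min_depth + step * b) for b in range(num_blocks)]


if __name__ == "__main__":
    # Five blocks, from 4 layers near the input to 8 near the output.
    print(blockwise_depths(num_blocks=5, min_depth=4, max_depth=8))
```

The same interpolation could be applied to a width multiplier instead of a depth, so that blocks near the output are both deeper and wider than blocks near the input.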


EdgeFormer: Improving Light-weight ConvNets by Learning from Vision Transformers

Recently, vision transformers started to show impressive results which o...

Improving Transformer Models by Reordering their Sublayers

Multilayer transformer networks consist of interleaved self-attention an...

DeFINE: DEep Factorized INput Word Embeddings for Neural Sequence Modeling

For sequence models with large word-level vocabularies, a majority of ne...

Structural Biases for Improving Transformers on Translation into Morphologically Rich Languages

Machine translation has seen rapid progress with the advent of Transform...

Transformer with a Mixture of Gaussian Keys

Multi-head attention is a driving force behind state-of-the-art transfor...

Fast Portrait Segmentation with extremely light-weight network

In this paper, we describe a fast and light-weight portrait segmentation...

BiT: Robustly Binarized Multi-distilled Transformer

Modern pre-trained transformers have rapidly advanced the state-of-the-a...

Code Repositories


DeLighT: Very Deep and Light-Weight Transformers
