
Weighted Transformer Network for Machine Translation
State-of-the-art results on neural machine translation often use attenti...

DeFINE: DEep Factorized INput Word Embeddings for Neural Sequence Modeling
For sequence models with large word-level vocabularies, a majority of ne...

Controlling Computation versus Quality for Neural Sequence Models
Most neural networks utilize the same amount of compute for every exampl...

Explicitly Modeling Adaptive Depths for Transformer
The vanilla Transformer conducts a fixed number of computations over all...

Attending to Mathematical Language with Transformers
Mathematical expressions were generated, evaluated and used to train neu...

Reformer: The Efficient Transformer
Large Transformer models routinely achieve state-of-the-art results on a...

GLU Variants Improve Transformer
Gated Linear Units (arXiv:1612.08083) consist of the component-wise prod...

Depth-Adaptive Transformer
State-of-the-art sequence-to-sequence models perform a fixed number of computations for each input sequence regardless of whether it is easy or hard to process. In this paper, we train Transformer models which can make output predictions at different stages of the network and we investigate different ways to predict how much computation is required for a particular sequence. Unlike dynamic computation in Universal Transformers, which applies the same set of layers iteratively, we apply different layers at every step to adjust both the amount of computation as well as the model capacity. Experiments on machine translation benchmarks show that this approach can match the accuracy of a baseline Transformer while using only half the number of decoder layers.
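The early-exit idea described in the abstract can be illustrated with a toy sketch: distinct (untrained, randomly initialized) layers are applied in sequence, an output classifier is evaluated after each one, and decoding halts as soon as the classifier's confidence passes a threshold. This is only an illustrative sketch of the general mechanism, not the paper's actual architecture or training procedure; all class and parameter names here are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class DepthAdaptiveDecoder:
    """Toy sketch of depth-adaptive decoding. Each depth has its own
    weights (unlike Universal Transformers, which reuse one set of
    layers iteratively), and a halting decision is made after every
    layer. Weights are random placeholders, not trained parameters."""

    def __init__(self, n_layers, d_model, vocab_size, threshold=0.5):
        # One distinct weight matrix per depth.
        self.layers = [rng.normal(size=(d_model, d_model)) * 0.1
                       for _ in range(n_layers)]
        # Shared output classifier used at every exit point.
        self.classifier = rng.normal(size=(d_model, vocab_size)) * 0.1
        self.threshold = threshold

    def decode_step(self, h):
        """Return (predicted token, depth used) for one hidden state h."""
        for depth, W in enumerate(self.layers, start=1):
            h = np.tanh(h @ W)                      # apply this depth's layer
            probs = softmax(h @ self.classifier)    # exit classifier
            # Exit early if confident, or if this is the last layer.
            if probs.max() >= self.threshold or depth == len(self.layers):
                return int(probs.argmax()), depth
```

Easy tokens exit after few layers, hard ones use the full stack, so average decoding cost adapts to the input while the maximum capacity stays available.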