Reservoir Transformers

12/30/2020
by Sheng Shen et al.

We demonstrate that transformers obtain impressive performance even when some of the layers are randomly initialized and never updated. Inspired by old and well-established ideas in machine learning, we explore a variety of non-linear "reservoir" layers interspersed with regular transformer layers, and show improvements in wall-clock compute time until convergence, as well as overall performance, on various machine translation and (masked) language modelling tasks.
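To make the idea concrete, below is a minimal PyTorch-style sketch of interspersing frozen, randomly initialized "reservoir" blocks between regular transformer encoder layers. The class names (ReservoirLayer, ReservoirTransformerEncoder), the residual connection around the reservoir, and the placement schedule are illustrative assumptions, not the authors' released implementation.

```python
# Sketch: frozen random "reservoir" layers interspersed with trainable
# transformer encoder layers. Names and layout are illustrative only.
import torch
import torch.nn as nn


class ReservoirLayer(nn.Module):
    """A non-linear feed-forward block whose weights are never updated."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, d_model),
        )
        # Freeze the randomly initialized parameters so they stay fixed.
        for p in self.ff.parameters():
            p.requires_grad = False

    def forward(self, x):
        # Residual connection keeps the frozen block from destroying the signal.
        return x + self.ff(x)


class ReservoirTransformerEncoder(nn.Module):
    def __init__(self, d_model=512, nhead=8, num_layers=6, reservoir_every=2):
        super().__init__()
        layers = []
        for i in range(num_layers):
            layers.append(
                nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            )
            # Intersperse a frozen reservoir block every `reservoir_every` layers.
            if (i + 1) % reservoir_every == 0:
                layers.append(ReservoirLayer(d_model, 4 * d_model))
        self.layers = nn.ModuleList(layers)

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x


if __name__ == "__main__":
    model = ReservoirTransformerEncoder()
    x = torch.randn(2, 16, 512)   # (batch, sequence, d_model)
    print(model(x).shape)         # torch.Size([2, 16, 512])
```

In this sketch only the non-frozen parameters receive gradients and optimizer updates, which is what reduces the per-step training cost relative to a fully trained stack of the same depth.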

Related research

What's Hidden in a One-layer Randomly Weighted Transformer? (09/08/2021)
We demonstrate that, hidden within one-layer randomly weighted neural ne...

Deep Transformers with Latent Depth (09/28/2020)
The Transformer model has achieved state-of-the-art performance in many ...

Sparse is Enough in Scaling Transformers (11/24/2021)
Large Transformer models yield impressive results on many tasks, but are...

Depth-Adaptive Transformer (10/22/2019)
State of the art sequence-to-sequence models perform a fixed number of c...

ODE Transformer: An Ordinary Differential Equation-Inspired Model for Neural Machine Translation (04/06/2021)
It has been found that residual networks are an Euler discretization of ...

Multi-Pass Transformer for Machine Translation (09/23/2020)
In contrast with previous approaches where information flows only toward...

ViR: the Vision Reservoir (12/27/2021)
The most recent year has witnessed the success of applying the Vision Tr...