GLU Variants Improve Transformer

02/12/2020
by Noam Shazeer

Gated Linear Units (arXiv:1612.08083) consist of the component-wise product of two linear projections, one of which is first passed through a sigmoid function. Variations on GLU are possible, using different nonlinear (or even linear) functions in place of sigmoid. We test these variants in the feed-forward sublayers of the Transformer (arXiv:1706.03762) sequence-to-sequence model, and find that some of them yield quality improvements over the typically used ReLU or GELU activations.
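As a rough illustration of the idea in the abstract, the sketch below shows a GLU-gated feed-forward sublayer in PyTorch: two parallel linear projections, one passed through a gating nonlinearity, multiplied component-wise, then projected back to the model dimension. The class name, layer sizes, and bias-free projections are illustrative assumptions rather than details taken from the paper; the variant names in the comments (GEGLU, SwiGLU, ReGLU) follow common usage for swapping sigmoid for GELU, Swish/SiLU, or ReLU.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GLUFeedForward(nn.Module):
    """Transformer feed-forward sublayer with a GLU-style gate.

    Computes (activation(x W) * (x V)) W2: the component-wise product of
    two linear projections, one of which is passed through a nonlinearity
    (sigmoid for the original GLU; other functions for the variants).
    """

    def __init__(self, d_model: int, d_ff: int, activation=torch.sigmoid):
        super().__init__()
        # Two parallel input projections: one gated, one linear.
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_value = nn.Linear(d_model, d_ff, bias=False)
        # Output projection back to the model dimension.
        self.w_out = nn.Linear(d_ff, d_model, bias=False)
        self.activation = activation

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate = self.activation(self.w_gate(x))   # nonlinear half
        value = self.w_value(x)                  # linear half
        return self.w_out(gate * value)          # component-wise product


# Variants differ only in the gating nonlinearity (sizes are illustrative):
glu_ffn    = GLUFeedForward(512, 2048, activation=torch.sigmoid)  # GLU
geglu_ffn  = GLUFeedForward(512, 2048, activation=F.gelu)         # GEGLU
swiglu_ffn = GLUFeedForward(512, 2048, activation=F.silu)         # SwiGLU
reglu_ffn  = GLUFeedForward(512, 2048, activation=F.relu)         # ReGLU

x = torch.randn(2, 16, 512)   # (batch, sequence, d_model)
y = swiglu_ffn(x)             # output has the same shape as x
```

The only design difference between the variants is the function applied to the gating projection, so they can all share one module with the activation passed in as a parameter.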


