Talking-Heads Attention

03/05/2020
by   Noam Shazeer, et al.
0

We introduce "talking-heads attention" - a variation on multi-head attention which includes linearprojections across the attention-heads dimension, immediately before and after the softmax operation.While inserting only a small number of additional parameters and a moderate amount of additionalcomputation, talking-heads attention leads to better perplexities on masked language modeling tasks, aswell as better quality when transfer-learning to language comprehension and question answering tasks.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
04/13/2021

Structural analysis of an all-purpose question answering model

Attention is a key component of the now ubiquitous pre-trained language ...
research
10/07/2021

Layer-wise Pruning of Transformer Attention Heads for Efficient Language Modeling

While Transformer-based models have shown impressive language modeling p...
research
03/25/2018

Pay More Attention - Neural Architectures for Question-Answering

Machine comprehension is a representative task of natural language under...
research
09/25/2019

Reducing Transformer Depth on Demand with Structured Dropout

Overparameterized transformer networks have obtained state of the art re...
research
10/11/2022

Mixture of Attention Heads: Selecting Attention Heads Per Token

Mixture-of-Experts (MoE) networks have been proposed as an efficient way...
research
10/08/2021

A Few More Examples May Be Worth Billions of Parameters

We investigate the dynamics of increasing the number of model parameters...
research
09/28/2021

Text Simplification for Comprehension-based Question-Answering

Text simplification is the process of splitting and rephrasing a sentenc...

Please sign up or login with your details

Forgot password? Click here to reset