Approximating How Single Head Attention Learns

03/13/2021
by Charlie Snell, et al.

Why do models often attend to salient words, and how does this attention evolve throughout training? We approximate model training as a two-stage process: early in training, when the attention weights are uniform, the model learns to translate an individual input word `i` to an output word `o` if they co-occur frequently; later, the model learns to attend to `i` when the correct output is `o` because it already knows that `i` translates to `o`. To formalize this, we define a model property, Knowledge to Translate Individual Words (KTIW) (e.g. knowing that `i` translates to `o`), and claim that it drives the learning of the attention. This claim is supported by the fact that before the attention mechanism is learned, KTIW can be learned from word co-occurrence statistics, but not the other way around. In particular, when we construct a training distribution that makes KTIW hard to learn, the attention fails to learn and the model cannot even learn the simple task of copying input words to the output. Our approximation explains why models sometimes attend to salient words, and it motivates a toy example in which a multi-head attention model overcomes the hard training distribution above by improving learning dynamics rather than expressiveness.
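To make the setup concrete, here is a minimal sketch (not the paper's code) of a single-head attention "translator" with an ablation that fixes the attention weights to uniform, mimicking the early-training regime described above. The class name, dimensions, the `uniform_attention` flag, and the toy copy task are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of a single-head attention translator (illustrative only).
import torch
import torch.nn as nn


class SingleHeadAttentionTranslator(nn.Module):
    def __init__(self, vocab_size, dim, uniform_attention=False):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.query = nn.Linear(dim, dim, bias=False)
        self.key = nn.Linear(dim, dim, bias=False)
        self.out = nn.Linear(dim, vocab_size)
        # When True, attention is fixed to uniform: the model can still learn
        # which input word translates to which output word (KTIW) purely from
        # co-occurrence, but it cannot localize the relevant source position.
        self.uniform_attention = uniform_attention

    def forward(self, src_ids, prev_out_ids):
        src = self.embed(src_ids)                   # (batch, src_len, dim)
        q = self.query(self.embed(prev_out_ids))    # (batch, dim), query from previous output word
        scores = torch.einsum("bd,bsd->bs", q, self.key(src))
        if self.uniform_attention:
            attn = torch.full_like(scores, 1.0 / scores.size(-1))
        else:
            attn = scores.softmax(dim=-1)           # (batch, src_len)
        context = torch.einsum("bs,bsd->bd", attn, src)
        return self.out(context), attn              # logits over the output vocabulary


# Toy usage: a copy task where the target is one of the input words.
# With uniform attention, the model can still learn that each word maps to
# itself from co-occurrence, even though it cannot pick out the target position.
torch.manual_seed(0)
model = SingleHeadAttentionTranslator(vocab_size=20, dim=32, uniform_attention=True)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for step in range(200):
    src = torch.randint(0, 20, (64, 5))             # random 5-word inputs
    tgt = src[:, 0]                                 # copy the first source word
    prev = torch.zeros(64, dtype=torch.long)        # dummy "previous output" token
    logits, _ = model(src, prev)
    loss = nn.functional.cross_entropy(logits, tgt)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Setting `uniform_attention=False` in this sketch restores the usual softmax attention, which is the regime in which, per the abstract's claim, the already-acquired KTIW can drive the attention toward the word being translated.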
