On the Learning Dynamics of Attention Networks

07/25/2023
by Rahul Vashisht, et al.

Attention models are typically learned by optimizing one of three standard loss functions, commonly referred to as soft attention, hard attention, and latent variable marginal likelihood (LVML) attention. All three paradigms are motivated by the same goal of finding two models: a `focus' model that `selects' the right segment of the input and a `classification' model that processes the selected segment into the target label. However, they differ significantly in the way the selected segments are aggregated, resulting in distinct dynamics and final results. We observe a unique signature of models learned using these paradigms and explain it as a consequence of the evolution of the classification model under gradient descent when the focus model is fixed. We also analyze these paradigms in a simple setting and derive closed-form expressions for the parameter trajectory under gradient flow. With the soft attention loss, the focus model improves quickly at initialization but stagnates later on; the hard attention loss behaves in the opposite fashion. Based on our observations, we propose a simple hybrid approach that combines the advantages of the different loss functions and demonstrate it on a collection of semi-synthetic and real-world datasets.
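To make the distinction between the aggregation schemes concrete, here is a minimal sketch (not the authors' code) contrasting the soft attention loss with an LVML-style marginal likelihood loss on a toy example. The input is assumed to be pre-split into segments, and `focus_net`/`clf_net` are illustrative linear stand-ins for the focus and classification models.

```python
# Illustrative sketch: soft attention vs. LVML attention losses
# for one example split into segments. Assumed setup, not the paper's code.
import torch
import torch.nn.functional as F

torch.manual_seed(0)

n_segments, seg_dim, n_classes = 4, 8, 3
focus_net = torch.nn.Linear(seg_dim, 1)        # scores each segment
clf_net = torch.nn.Linear(seg_dim, n_classes)  # classifies a segment

x = torch.randn(n_segments, seg_dim)  # one input, split into segments
y = torch.tensor(1)                   # target label

# Focus model: a distribution over segments.
alpha = torch.softmax(focus_net(x).squeeze(-1), dim=0)

# Soft attention: classify the attention-weighted blend of segments.
blended = (alpha.unsqueeze(-1) * x).sum(dim=0)
soft_loss = F.cross_entropy(clf_net(blended).unsqueeze(0), y.unsqueeze(0))

# LVML attention: marginalize the class likelihood over the latent
# segment choice, i.e. -log sum_i alpha_i * p(y | segment_i).
probs = torch.softmax(clf_net(x), dim=-1)
lvml_loss = -torch.log((alpha * probs[:, y]).sum())

print(float(soft_loss), float(lvml_loss))
```

In the soft case the gradient reaches the classifier only through the blended segment, whereas in the LVML case every segment contributes its own likelihood term weighted by the focus model, which is one way to see why the two losses induce different learning dynamics.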
