The Dual Form of Neural Networks Revisited: Connecting Test Time Predictions to Training Patterns via Spotlights of Attention

02/11/2022
by Kazuki Irie, et al.

Linear layers in neural networks (NNs) trained by gradient descent can be expressed as a key-value memory system which stores all training datapoints and the initial weights, and produces outputs using unnormalised dot attention over the entire training experience. While this has been technically known since the '60s, no prior work has effectively studied the operations of NNs in such a form, presumably due to prohibitive time and space complexities and impractical model sizes, all of them growing linearly with the number of training patterns, which may get very large. However, this dual formulation offers a possibility of directly visualising how an NN makes use of training patterns at test time, by examining the corresponding attention weights. We conduct experiments on small-scale supervised image classification tasks in single-task, multi-task, and continual learning settings, as well as language modelling, and discuss the potential and limits of this view for better understanding and interpreting how NNs exploit training patterns. Our code is public.
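
The primal/dual equivalence described in the abstract can be illustrated in a few lines of numpy: each gradient-descent update of a linear layer stores one (key, value) pair, and the test-time output equals the initial layer's output plus unnormalised dot attention over those stored training patterns. The sketch below is only an illustration under assumed choices (squared loss, plain SGD, and all variable names such as lr and n_steps are ours, not the paper's code):

    # Minimal sketch (assumptions: squared loss, plain SGD) of the duality
    # between a linear layer's weights and attention over training patterns.
    import numpy as np

    rng = np.random.default_rng(0)
    d_in, d_out, n_steps, lr = 5, 3, 200, 0.01

    W0 = rng.normal(size=(d_out, d_in))           # initial weights
    W = W0.copy()                                 # primal form: explicit weight matrix
    keys, values = [], []                         # dual form: stored training patterns

    for _ in range(n_steps):
        x_t = rng.normal(size=d_in)               # training input (key)
        y_t = rng.normal(size=d_out)              # training target
        err = y_t - W @ x_t                       # -dL/dy for L = 0.5*||y_t - W x_t||^2
        W += lr * np.outer(err, x_t)              # primal SGD update: W <- W + e_t x_t^T
        keys.append(x_t)
        values.append(lr * err)                   # value e_t produced at step t

    # Test time: primal prediction vs. dual prediction via unnormalised dot attention
    x_test = rng.normal(size=d_in)
    primal = W @ x_test
    attn = np.array([k @ x_test for k in keys])   # attention weights over training patterns
    dual = W0 @ x_test + sum(a * v for a, v in zip(attn, values))

    print(np.allclose(primal, dual))              # True: both forms give the same output

The attention weights attn are exactly the quantities the paper proposes to inspect: they show how strongly each stored training pattern contributes to a given test-time prediction.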


