Lipschitz Normalization for Self-Attention Layers with Application to Graph Neural Networks

03/08/2021
by George Dasoulas, et al.

Attention-based neural networks are the state of the art in a wide range of applications. However, their performance tends to degrade as the number of layers increases. In this work, we show that enforcing Lipschitz continuity by normalizing the attention scores can significantly improve the performance of deep attention models. First, we show that, for deep graph attention networks (GAT), gradient explosion appears during training, leading to poor performance of gradient-based training algorithms. To address this issue, we derive a theoretical analysis of the Lipschitz continuity of attention modules and introduce LipschitzNorm, a simple and parameter-free normalization for self-attention mechanisms that constrains the model to be Lipschitz continuous. We then apply LipschitzNorm to GAT and Graph Transformers and show that their performance is substantially improved in the deep setting (10 to 30 layers). More specifically, a deep GAT model with LipschitzNorm achieves state-of-the-art results on node label prediction tasks that exhibit long-range dependencies, and normalized models show consistent improvements over their unnormalized counterparts on benchmark node classification tasks.
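
The abstract describes LipschitzNorm only at a high level: a parameter-free rescaling of the attention scores before the softmax so that the attention layer remains Lipschitz continuous. The snippet below is a minimal sketch of that idea in PyTorch; the specific scaling rule (dividing the raw scores by the largest query/key/value norm) and the function name `normalized_attention` are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch of a parameter-free attention-score normalization (assumed form,
# not the exact LipschitzNorm formula from the paper).
import torch
import torch.nn.functional as F


def normalized_attention(q, k, v, eps=1e-6):
    """Dot-product self-attention with a parameter-free score rescaling.

    q, k, v: tensors of shape (n, d).
    """
    scores = q @ k.T                                  # raw attention scores, (n, n)
    # Parameter-free scale: largest row norm among queries, keys and values.
    # (Illustrative choice; the paper derives the Lipschitz-motivated scale.)
    scale = torch.cat([q, k, v]).norm(dim=-1).max().clamp_min(eps)
    weights = F.softmax(scores / scale, dim=-1)       # normalized scores -> softmax
    return weights @ v                                # attended values, (n, d)


# Usage: self-attention over random node features of a small graph.
x = torch.randn(8, 16)
out = normalized_attention(x, x, x)
print(out.shape)  # torch.Size([8, 16])
```

Because the rescaling introduces no learnable parameters, a normalization of this kind can be dropped into an existing GAT or Transformer layer without changing its parameter count.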

Related research

- Centered Self-Attention Layers (06/02/2023): The self-attention mechanism in transformers and the message-passing mec...
- The Lipschitz Constant of Self-Attention (06/08/2020): Lipschitz constants of neural networks have been explored in various con...
- Regularisation of Neural Networks by Enforcing Lipschitz Continuity (04/12/2018): We investigate the effect of explicitly enforcing the Lipschitz continui...
- Invertible Attention (06/16/2021): Attention has been proved to be an efficient mechanism to capture long-r...
- Inductive Biases and Variable Creation in Self-Attention Mechanisms (10/19/2021): Self-attention, an architectural motif designed to model long-range inte...
- A Mathematical Theory of Attention (07/06/2020): Attention is a powerful component of modern neural networks across a wid...
- Choose a Transformer: Fourier or Galerkin (05/31/2021): In this paper, we apply the self-attention from the state-of-the-art Tra...
