1 Introduction
Gradient vanishing in the attention block: Since the transformer is a pixel-wise process, the distribution of the input matters more than it does for a CNN. In a CNN, the deviation caused by values exceeding the expected range is diluted within patches. The statistical assumption that the input is normally distributed supports the gradient stability of Softmax. However, there are always some values that exceed the expected range, causing gradient vanishing. This situation becomes worse in attention mechanisms: the input of Softmax represents the relationships between embeddings (patches), and the distribution of the input varies across images. That means that in attention mechanisms, part of the input is always stuck in the saturation area, as shown in Figure 2, leading to gradient vanishing and long training.
Motivations: We are interested in the observation from transformer-based models that the formation of attention corresponding to objects often seems to lag behind that corresponding to boundaries, as shown in the first row of Figure 1. Attention can form on the boundary in the early stages of training, but appears on objects only slowly, in the mid-to-late stages. However, objects should be preferred positions for attention, since there are no inductive biases such as translation equivariance and locality [1], and objects should be more important than boundaries for the transformer. By investigating the pre-Softmax attention values, we find that the values corresponding to objects are larger than those of boundaries and are more likely to fall into the saturation area of Softmax. Therefore, we speculate that objects are indeed more important, but it is difficult to form attention on them, since their values are too large and fall into the saturation area. In contrast, the values of boundaries are moderate, so attention forms smoothly there. We believe this situation is one of the reasons why transformers need long training.
In this work, we suggest replacing the exponential function with periodic functions, and we delve into some potential periodic alternatives to Softmax from the view of value and gradient. Through experiments on a simply designed demo network modeled on LeViT, our methods are shown to alleviate the gradient problem and to perform better than Softmax and its variants.
To summarize, the main contributions of this paper are:

Exploring the gradient performance of Softmax in the transformer block, and showing that the input of Softmax is not normally distributed and that gradient vanishing does happen;

Introducing a series of periodic alternatives to Softmax in attention mechanisms, which compress the input into the unsaturated region by periodic functions to escape gradient vanishing;

Exploring the impact of pre-normalization for Softmax and for our methods, and making the observation that pre-normalization is only a conditional solution, not always a good choice.
2 Related works
There are few studies on alternatives to Softmax, since Softmax is mostly used to output classification results, where gradient vanishing can be avoided by using the analytical solution of the joint gradient of Softmax and cross-entropy. But in the attention mechanism, Softmax is used alone, so the gradient vanishing problem appears. Other works are devoted to enhancing the input features of Softmax by normalization [3; 4; 5; 6; 7]. However, they all focus on the representation of features and do not address the gradient vanishing problem.
Taylor softmax: Vincent et al. [8] used the second-order Taylor series approximation $e^x \approx 1 + x + \frac{x^2}{2}$ and derived the Taylor softmax as follows:

$\mathrm{Tsm}(x)_i = \frac{1 + x_i + x_i^2/2}{\sum_{j=1}^{d}\left(1 + x_j + x_j^2/2\right)}$
where $d$ is the dimension of the input of the Taylor softmax. Since the quadratic function changes smoothly, Taylor softmax generates softer classification results to alleviate overfitting. However, when used without cross-entropy, Taylor softmax causes gradient vanishing too, because of its saturation area.
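Both properties are easy to check numerically. The sketch below is our own illustration of the formula above, not the authors' code:

```python
import numpy as np

def taylor_softmax(x):
    # Second-order Taylor approximation of exp(x): 1 + x + x^2/2 is
    # positive for every real x (negative discriminant), so the scores
    # stay positive without using an exponential.
    z = 1.0 + x + 0.5 * x ** 2
    return z / z.sum()

print(taylor_softmax(np.array([1.0, -1.0, 0.0])))   # [0.625 0.125 0.25]
# The saturation area survives: a large element still takes almost all
# of the mass, so its gradient vanishes just as in Softmax.
print(taylor_softmax(np.array([100.0, 0.0, 0.0])))  # first score ~0.9996
```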
Soft-margin softmax: Liang et al. [9] introduced a distance margin into Softmax to strengthen the intra-class compactness and inter-class separability of learned features. The Soft-margin (SM) softmax can be described as follows:

$\mathrm{SM}(x)_i = \frac{e^{x_i - m}}{e^{x_i - m} + \sum_{j \neq i} e^{x_j}}$
where $m$ is set manually; when $m$ is set to zero, SM-softmax becomes identical to the original Softmax. SM-softmax can be considered a shifted version of Softmax. Similar to Taylor softmax, SM-softmax and its variant, Ensemble Soft-Margin softmax [10], are proposed to encourage the discriminability of features, and the gradient vanishing problem is still not addressed.
SM-Taylor softmax: Kunal et al. [11] explored the higher-order Taylor series approximation of $e^x$ to come up with an $n$-th-order Taylor softmax, where:

$f_n(x) = \sum_{k=0}^{n} \frac{x^k}{k!}$
They proved that $f_n(x)$ is always positive definite if $n$ is even. Additionally, they combined the strengths of the Taylor softmax and the Soft-margin softmax and proposed the SM-Taylor softmax, which replaces the exponential in the SM-softmax with $f_n$.
However, it is still a method to enhance features, not a solution to the gradient problem.
3 Formulation
For convenience, we denote the input, the inter-value, and the scores by $x$, $z$, and $s$, where $z_i = f(x_i)$ and $s_i = z_i / \sum_{j=1}^{d} z_j$. In this section, we try to build some periodic functions as alternatives to Softmax. There are five aspects that determine whether a function is a favorable alternative: (1) value stability; (2) gradient stability; (3) saturation area; (4) zero-region gradient; (5) information submergence. Furthermore, when judging the gradient-related properties of the functions, we only consider the aspects related to $x_i$ instead of the other elements contained in the input. Since the scores are mapped by $s_i = z_i / \sum_j z_j$, the correlation between the gradient and the other elements is unavoidable for the periodic functions. The plots of the functions and more discussion on how the other elements in the input influence the gradient of $s_i$ are provided in A.1.
According to the research on the Taylor softmax proposed by Vincent et al. [8] and the higher-order Taylor softmax proposed by Kunal et al. [11], it is reasonable to map the input with a monotonic function, since the input can adapt to a suitable value range as the parameters update. Therefore, we suggest using a periodic function to compress the input, so as to avoid approximating small inputs by a fixed value (to keep them positive), while also avoiding outputs so large that no appropriate gradient remains.
3.1 Softmax
What Softmax does is map the input $x$ to an inter-value $z_i = e^{x_i}$, and map the inter-value to scores $s_i = e^{x_i} / \sum_{j=1}^{d} e^{x_j}$. The exponential function keeps negative inputs positive, but also makes positive inputs extremely large. For a large input $x_i$, $e^{x_i}$ is too large and dominates $\sum_j e^{x_j}$, which means $s_i \to 1$ and $\partial s_i / \partial x_i \to 0$; for a small input $x_i$, $e^{x_i} \to 0$, and since $s_i \to 0$, $\partial s_i / \partial x_i \to 0$. Therefore, the major cause of gradient vanishing in Softmax is that we need to compress values with unknown upper and lower bounds into $(0, 1)$. To do this, there has to be a saturation area for large and small values.
Before discussing the alternatives to Softmax, it is necessary to clarify its advantages. First of all, the output of Softmax is positive definite. Due to the exponential function, differences between inputs are magnified, which means Softmax separates the inputs well — a good characteristic for attention mechanisms. Besides, according to the definition:

$\frac{\partial s_i}{\partial x_i} = s_i\,(1 - s_i)$

Let $g(s_i) = s_i(1 - s_i)$; we have:

$\max_{s_i} g(s_i) = g(0.5) = 0.25,$

which means the max gradient of Softmax is 0.25, so it will not cause gradient explosion. Additionally, in spite of the saturation problem, no matter how large the other elements of the input are, a sufficiently large $x_i$ can always obtain an appropriate gradient, since the inter-value $e^{x_i}$ is unbounded.
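Both the 0.25 bound and the saturation behavior can be verified directly. A minimal sketch (our own, for illustration):

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())  # subtract the max for numerical stability
    return z / z.sum()

def grad_diag(x):
    # Diagonal of the Softmax Jacobian: d s_i / d x_i = s_i * (1 - s_i)
    s = softmax(x)
    return s * (1 - s)

# s * (1 - s) peaks at s = 0.5, so the diagonal gradient never exceeds 0.25.
print(grad_diag(np.array([0.0, 0.0])))        # exactly [0.25, 0.25]
# One element deep in the saturation area collapses every gradient.
print(grad_diag(np.array([20.0, 0.0, 0.0])))  # all entries vanishingly small
```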
3.2 Sin-max-constant / Cos-max
When it comes to periodic functions, there is a good reason to use the sine function, since it is widely used and differentiable. Therefore, we set the inter-value to $z_i = \sin(x_i) + 1$ to keep the function positive definite, following the suggestion in [12], and Sin-max-constant is defined as follows:

$\mathrm{Sin\text{-}max\text{-}const}(x)_i = \frac{\sin(x_i) + 1}{\sum_{j=1}^{d}\left(\sin(x_j) + 1\right)}$
where $d$ represents the dimension of the input. The sine function compresses $x$ into $[-1, 1]$, and for $x \in (-\pi/2, \pi/2)$, there is no saturation area.
However, let $x_m = \max_j x_j$; then $\sin(x_m) + 1 \le 2$, and we have:

$s_m = \frac{\sin(x_m) + 1}{\sum_{j=1}^{d}\left(\sin(x_j) + 1\right)} \le \frac{2}{\sum_{j=1}^{d}\left(\sin(x_j) + 1\right)} \approx \frac{2}{d},$

which means that as the dimension of $x$ increases, the influence of $\sin(x_i)$ will be weakened by the constant term, causing the input to be overwhelmed and not mapped to the scores correctly. Besides, consider the view of the gradient:

$\frac{\partial s_i}{\partial x_i} = \frac{\cos(x_i)\sum_{j \neq i}\left(\sin(x_j) + 1\right)}{\left[\sum_{j=1}^{d}\left(\sin(x_j) + 1\right)\right]^2}$
Similar to the value, as the dimension of $x$ increases, the gradient of Sin-max-constant will drop to zero (the numerator grows like $d$ while the denominator grows like $d^2$), and gradient vanishing will occur over the entire value range. The main reason for these defects is the constant term 1 in $\sin(x_i) + 1$.
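The submergence effect is easy to reproduce. The sketch below is our own illustration of the Sin-max-constant form discussed above:

```python
import numpy as np

def sinmax_constant(x):
    # z_i = sin(x_i) + 1 stays positive, but the constant 1 makes the
    # denominator grow like the dimension d.
    z = np.sin(x) + 1.0
    return z / z.sum()

rng = np.random.default_rng(0)
for d in (4, 64, 1024):
    s = sinmax_constant(rng.normal(size=d))
    # The largest score is capped near 2/d, so for large d every score is
    # flattened toward the uniform value 1/d: the input is submerged.
    print(d, s.max(), 2.0 / d)
```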
We try to remove the constant term, and the expression becomes:

$\mathrm{Sin\text{-}max}(x)_i = \frac{\sin(x_i)}{\sum_{j=1}^{d}\sin(x_j)}$
Now let the elements of $x$ be such that $\sum_{j=1}^{d}\sin(x_j) \to 0$; we have:

$|s_i| = \left|\frac{\sin(x_i)}{\sum_{j=1}^{d}\sin(x_j)}\right| \to \infty,$
which means the value of Sin-max is unstable, causing the network to have a great risk of breaking down. And for the gradient, since $z_i = \sin(x_i)$ and $z_i' = \cos(x_i)$, we have:

$\frac{\partial s_i}{\partial x_i} = \frac{\cos(x_i)\sum_{j \neq i}\sin(x_j)}{\left[\sum_{j=1}^{d}\sin(x_j)\right]^2},$
which is also unstable because $\sum_{j=1}^{d}\sin(x_j)$ is not positive definite.
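The instability can be seen with two inputs whose sines nearly cancel (our own illustration):

```python
import numpy as np

def sinmax(x):
    # Without the constant term, the denominator sum(sin(x_j)) is not
    # positive definite: it can change sign and pass through zero.
    z = np.sin(x)
    return z / z.sum()

# Well-separated inputs behave fine ...
print(sinmax(np.array([0.5, 1.0])))
# ... but near-cancelling inputs make the denominator ~1e-6 and the
# scores explode, which is what breaks training.
print(sinmax(np.array([0.1, -0.1 + 1e-6])))
```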
The reason why simply adding a constant term back is not good is similar to Sin-max-constant: the differences within $x$ will be submerged by the constant term as the dimension grows. Let $z_i = \cos(x_i)$ and $s_i = z_i / \sum_j z_j$; we can define Cos-max as:

$\mathrm{Cos\text{-}max}(x)_i = \frac{\cos(x_i)}{\sum_{j=1}^{d}\cos(x_j)}$
Assume that $x$ strictly belongs to $(-\pi/2, \pi/2)$; then Cos-max can be considered as Sin-max shifted to a positive definite range, since $\cos(x) = \sin(x + \pi/2)$. However, a gradient vanishing problem will appear when $x$ clusters in the zero region, because the derivative $-\sin(x_i)$ approaches zero there. Besides, from the view of gradient stability, we have:

$\frac{\partial s_i}{\partial x_i} = \frac{-\sin(x_i)\sum_{j \neq i}\cos(x_j)}{\left[\sum_{j=1}^{d}\cos(x_j)\right]^2}$
Let $x_j = \pi/2 - \epsilon$ for all $j$; we have:

$\frac{\partial s_i}{\partial x_i} = \frac{-\cos(\epsilon)\,(d-1)\sin(\epsilon)}{\left[d\,\sin(\epsilon)\right]^2} = -\frac{(d-1)\cos(\epsilon)}{d^2\,\sin(\epsilon)}$

According to the solving provided in A.5, we can have:

$\mathrm{Extreme}\left(\frac{\partial s_i}{\partial x_i}\right) \approx -\frac{d-1}{d^2\,\epsilon}$

As $x$ approaches the boundary $\pi/2$ (i.e., as $\epsilon \to 0$), the extreme approaches infinity, which means Cos-max is gradient unstable, causing the network to have a great risk of breaking down, like Sin-max.
3.3 Sin2-max-shifted
To ensure that $z$ is positive definite and no extra constant term is introduced, $\sin^2(x)$ is a reasonable choice. So we can define Sin2-max as:

$\mathrm{Sin2\text{-}max}(x)_i = \frac{\sin^2(x_i)}{\sum_{j=1}^{d}\sin^2(x_j)}$
Note that although $\sin^2(x) = \frac{1 - \cos(2x)}{2}$, Sin2-max is not just a scaled double-frequency version of Cos-max, owing to the constant term. Therefore, the numerical and gradient characteristics of Sin2-max and Cos-max are different.
As shown in Figure 4, the possible problem of Sin2-max is that, assuming $x \sim N(0, 1)$, most of $x$ clusters in the region close to 0, so most of the gradients are close to 0, which makes the parameters difficult to update. To solve this 'conditional' problem, we can shift $x$ to the nonzero region by adding a phase $\varphi$ to $x$.
Let $z_i = \sin^2(x_i + \varphi)$. We have:

$\frac{\partial z_i}{\partial x_i} = \sin(2x_i + 2\varphi)$
Unfortunately, the best $\varphi$ will change with $x$, so we have to find an approximate solution. Besides, as $\varphi$ changes, the gradient will oscillate in the range of $[-1, 1]$, causing gradient explosion or vanishing over the entire value range. Since $\sin^2(x)$ and $\sin(2x)$ have the same period $\pi$, the period of $\varphi$ is $\pi$. We set $\varphi = \pi/4$, which maximizes the zero-region gradient, to make the gradient stable. So we get Sin2-max-shifted as follows:

$\mathrm{Sin2\text{-}max\text{-}shifted}(x)_i = \frac{\sin^2(x_i + \pi/4)}{\sum_{j=1}^{d}\sin^2(x_j + \pi/4)}$
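A minimal sketch of the shifted map follows. The phase $\pi/4$ used here is an illustrative choice: the derivative of $\sin^2(x + \varphi)$ is $\sin(2x + 2\varphi)$, which equals 1 at $x = 0$ when $\varphi = \pi/4$, so inputs clustered near zero keep a strong gradient:

```python
import numpy as np

def sin2max_shifted(x, phi=np.pi / 4):
    # With phi = pi/4, the gradient of sin^2(x + phi) is sin(2x + pi/2),
    # which is exactly 1 at x = 0: the zero region is no longer flat.
    z = np.sin(x + phi) ** 2
    return z / z.sum()

print(sin2max_shifted(np.zeros(4)))  # uniform scores, no flat spot at zero
# Numerical gradient of the inter-value at x = 0 is ~1
# (versus ~0 for the unshifted sin^2):
eps = 1e-6
print((np.sin(eps + np.pi / 4) ** 2 - np.sin(np.pi / 4) ** 2) / eps)
```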
3.4 SinSoftmax
From another view, instead of replacing the exponential function, compressing the input into the unsaturated region is also a reasonable choice. To keep the gradient of the input in the zero region away from 0, we choose sine rather than cosine. We define SinSoftmax as follows:

$\mathrm{SinSoftmax}(x)_i = \frac{e^{\sin(x_i)}}{\sum_{j=1}^{d} e^{\sin(x_j)}}$
SinSoftmax can be considered a periodic-normalized version of Softmax, which is also similar to the periodic activation function proposed in SirenNet [13]. The best part of SinSoftmax is that the input is compressed by a periodic function into the well-performing region of Softmax, so the value and gradient are both stable, as shown in Figure 4 and A.1. Additionally, $e^{\sin(x)}$ is positive definite owing to the exponential function, and the gradient in the near-zero region also performs well. However, the possible defect of SinSoftmax is that the largest score can be at most $e^2$ times the smallest, since $\sin(x) \in [-1, 1]$, which might cause the most contributing value to be drowned among a large number of low-contributing values as the dimension of the score maps (or the number of embeddings) increases.
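A minimal sketch of SinSoftmax and its bounded score ratio (our own illustration):

```python
import numpy as np

def sin_softmax(x):
    # The input is first compressed into [-1, 1] by sine, i.e. into the
    # unsaturated, well-behaved region of the exponential.
    z = np.exp(np.sin(x))
    return z / z.sum()

# Extreme inputs no longer saturate the scores ...
s = sin_softmax(np.array([100.0, -50.0, 0.0, 3.0]))
print(s)
# ... but the price is bounded contrast: max/min can never exceed e^2.
print(s.max() / s.min(), np.exp(2))
```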
3.5 Siren-max
Inspired by SirenNet [13], where sine is used as the activation function, we define an inter-value function with beneficial gradient properties by:
Since the inter-value is bounded below, to make it positive definite, we add a constant term to it and define Siren-max as follows:
Note that the upper bound of the inter-value is infinity, so adding a constant term to it will not cause the differences between inputs to be submerged as in Sin-max-constant. As shown in Figure 4, there is no saturation area in Siren-max, and it performs well in the near-zero region. The possible defect is that the gradient has periodic jump points, which might make training unstable.
                    Value stability?  Gradient stability?  No saturation area?  Zero-region gradient good?  No info submergence?
Softmax             ✓                 ✓                    ✘/✓                  ✓                           ✓
Sin-max-constant    ✘                 ✓                    ✓                    ✓                           ✘
Cos-max             ✘/✓               ✘/✓                  ✓                    ✘                           ✓
Sin2-max            ✓                 ✓                    ✓                    ✘                           ✓
Sin2-max-shifted    ✓                 ✓                    ✓                    ✓                           ✓
SinSoftmax          ✓                 ✓                    ✓                    ✓                           ✓
Siren-max           ✓                 ✘/✓                  ✓                    ✓                           ✓
3.6 Pre-normalization
Note that the pre-normalization discussed here is not parameterized like BatchNorm [14], LayerNorm [15], or GroupNorm [16]. We denote by pre-normalization the row-wise normalization of the elements of the attention score map. Considering the saturation problem, normalizing the input is also a reasonable operation. However, since the distribution of attention score maps differs across images, normalization can hardly compress the maps into a specified value range precisely. Besides, the normalization function itself may saturate, and its saturation area shifts with the mean $\mu$ and standard deviation $\sigma$. Therefore, the gradient situation of normalization is similar to that of Softmax, which may cause gradient vanishing too. As a result, although pre-normalization can roughly gather the values into a specified range, it may on the contrary bring new gradient problems. More discussion and the gradient plots of normalization and of the pre-normalized versions of Softmax and the periodic alternatives are provided in A.2.
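The non-parametric pre-normalization discussed above can be sketched as a row-wise standardization of the score map before the mapping function. This is our own sketch, for illustration:

```python
import numpy as np

def pre_normalize(scores, eps=1e-6):
    # Row-wise standardization with no learned scale or shift (unlike
    # BatchNorm / LayerNorm): each row of the attention score map is
    # brought to zero mean and unit variance.
    mu = scores.mean(axis=-1, keepdims=True)
    sigma = scores.std(axis=-1, keepdims=True)
    return (scores - mu) / (sigma + eps)

def softmax(x):
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

# Normalization pulls a wild row back toward Softmax's working range,
# but only roughly: the post-normalization spread still depends on the
# row's shape, so saturation is reduced, not eliminated.
row = np.array([[40.0, -3.0, 0.5, 1.2]])
print(softmax(row))                 # saturated: one score ~1
print(softmax(pre_normalize(row)))  # noticeably softer scores
```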
4 Experiments
To eliminate the unexpected effects of various tricks, the experiments are conducted on a simply designed demo network modeled on LeViT [1], as shown in Figure 5.
In the experiments, we observe that most of the gradients of Softmax are very small, and only a small part of the updates can be successfully backpropagated, even in the early stages of training, as shown in Figure 6. This phenomenon supports our point that the values used to generate the attention scores are related to the content of the input images and are not strictly normally distributed. Therefore, even when divided by $\sqrt{d}$ following the originally designed transformer block, the values may still fall into the saturation area of Softmax, making updates difficult. In our methods, there is no saturation area in the functions, so the gradient is satisfactory at each training stage, which promotes the updating of parameters. More 3D graphs of gradients extracted from the experiments are provided in A.3.
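Standard attention divides the dot-product scores by $\sqrt{d_k}$, but that scaling only yields unit-variance scores when queries and keys are independent, which trained projections on real images need not satisfy. The sketch below is our own illustration; the correlation strength 0.1 is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64
q = rng.normal(size=d_k)
k_indep = rng.normal(size=d_k)
# Independent q and k: the scaled score q.k / sqrt(d_k) has unit
# variance, which is the assumption behind dividing by sqrt(d_k).
print(q @ k_indep / np.sqrt(d_k))
# Correlated q and k (as learned projections can produce): the scaled
# score grows like sqrt(d_k) and lands deep in Softmax's saturation area.
k_corr = q + 0.1 * rng.normal(size=d_k)
print(q @ k_corr / np.sqrt(d_k))
```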
As shown in Figure 1 and Figure 7, due to the gradient vanishing problem, Softmax might cause difficulty in the formation of attention, especially in the early stages of training. We observe that attention forms more smoothly on the boundary, whereas the attention corresponding to objects only forms in the later stages of training. A possible reason is that the scores of the boundary between object and background are moderate, so the gradient flows smoothly, while the scores of the objects are larger and might fall into the saturation area of Softmax, causing gradient vanishing and locking the formation of attention. Under the periodic alternatives, in contrast, attention is updated unrestrained over the image, which strengthens our argument.

                    Depth = 1        Depth = 2        Depth = 4        Depth = 8
                    base    norm     base    norm     base    norm     base    norm
Softmax             52.67   53.80    58.21   59.57    67.01   68.75    81.12   82.31
Sin2-max-shifted    53.90   53.16    60.37   59.84    71.48   72.38    84.81   83.20
SinSoftmax          54.64   54.27    60.40   60.18    72.39   72.84    85.14   84.63
Siren-max           \       55.19    \       59.62    \       73.25    \       84.70
* "\" means training breaks down in the early stages
The gradient performance in the zero region is crucial for training, and the early breakdown of training under Cos-max and Sin2-max can be ascribed to this. Besides, gradient stability is also very important: since there are jump points within the range of the input, training under Siren-max breaks down too. In addition, since the input is submerged in the constant term, training under Sin-max-constant diverges.
Encouragingly, Sin2-max-shifted and SinSoftmax exceed Softmax in the results, just as we speculated, and norm-Siren-max also performs surprisingly well. The results are shown in Table 2. The major drawbacks of Cos-max and Sin-max-constant are the gradient performance in the zero region and information submergence, respectively, which cannot be fixed by pre-normalization. As for Siren-max, pre-normalization optimizes the distribution of the input and helps Siren-max avoid the gradient jump points, resulting in satisfactory performance. Softmax can also be improved, since pre-normalization helps the input escape from the saturation area to some degree. However, Sin2-max-shifted and SinSoftmax are not subject to the input distribution, so they gain no benefit from pre-normalization. On the contrary, since pre-normalization brings unexpected gradient problems, the performance of norm-Sin2-max-shifted and norm-SinSoftmax decreases slightly. The plots and complete results of the experiments are provided in A.4.
5 Conclusion
Through the visualization of attention and gradients extracted from transformer blocks, we show that in the attention mechanism, Softmax does lead to the gradient vanishing problem and makes training difficult. To address the problem, we propose a series of periodic alternatives to Softmax, and the experimental results show that SinSoftmax, Sin2-max-shifted, and norm-Siren-max perform better than Softmax in the attention mechanism. Additionally, we observe that pre-normalization is only a conditional solution, not always a good choice.
With the periodic alternatives, an embedding requiring more attention does not necessarily require a larger value, which makes the generation of the queries and keys freer, and it is hard to say whether this will lead to unexpected problems. This change might affect the representation of the model, and we will explore how it happens in future work.