# Escaping the Gradient Vanishing: Periodic Alternatives of Softmax in Attention Mechanism

Softmax is widely used in neural networks for multiclass classification, gate structures and attention mechanisms. The statistical assumption that the input is normally distributed supports the gradient stability of Softmax. However, when used in attention mechanisms such as transformers, since the correlation scores between embeddings are often not normally distributed, the gradient vanishing problem appears, and we confirm this point experimentally. In this work, we suggest replacing the exponential function with periodic functions, and we delve into some potential periodic alternatives of Softmax from the view of value and gradient. Through experiments on a simply designed demo based on LeViT, our method is shown to alleviate the gradient problem and yield substantial improvements compared to Softmax and its variants. Further, we analyze the impact of pre-normalization for Softmax and our methods through mathematics and experiments. Lastly, we increase the depth of the demo and demonstrate the applicability of our method in deep structures.


## 1 Introduction

Gradient Vanishing in the Attention Block: Since the transformer is a pixel-wise process, the distribution of the input matters more than it does for a CNN. For a CNN, the deviation caused by values exceeding the expected range is diluted within patches. The statistical assumption that the input is normally distributed supports the gradient stability of Softmax. However, there are always some values that exceed the expected value range, causing gradient vanishing. This situation becomes worse in attention mechanisms. The input of Softmax represents the relationships between embeddings (patches), and the distribution of the input varies across images. That means in attention mechanisms, a part of the input is always stuck in the saturation area as shown in Figure 2, leading to gradient vanishing and long training.

Motivations: We are interested in the observation from transformer-based models that the formation of attention corresponding to objects often seems to lag behind that corresponding to the boundary, as shown in the first row of Figure 1. Attention can be formed on the boundary in the early stages of training, but appears on objects only slowly, in the mid-late stages. However, an object should be a preferred position to put attention on, since there is no inductive bias such as translation equivariance and locality 1, and objects should be more important than boundaries for the transformer. By investigating the input of Softmax, we find that the value corresponding to the object is larger than that of the boundary, and is more likely to fall into the saturation area of Softmax. Therefore, we speculate that the object is indeed more important, but it is difficult to form attention on it, since its value is too large and falls into the saturation area. In contrast, the value of the boundary is moderate, so attention can be formed smoothly. We believe that this situation is one of the reasons why the transformer needs long training.

In this work, we suggest replacing the exponential function with periodic functions, and we delve into some potential periodic alternatives of Softmax from the view of value and gradient. Through experiments on a simply designed demo based on LeViT, our method is shown to alleviate the gradient problem and perform better than Softmax and its variants.

To summarize, the main contributions of this paper are:

• Explore the gradient performance of Softmax in the transformer block, and show that the input of Softmax is not normally distributed and that gradient vanishing does happen;

• Introduce a series of periodic alternatives for Softmax in attention mechanisms, which compress the input into the unsaturated area by periodic functions to escape gradient vanishing;

• Explore the impact of pre-normalization for Softmax and our methods, and observe that pre-normalization is only a conditional solution, not always a good choice;

## 2 Related works

There are few studies on alternatives of Softmax, since Softmax is mostly used to output classification results, and the gradient vanishing can be avoided by using a joint gradient analytical solution of Softmax and Cross-Entropy. But in the attention mechanism, Softmax is used alone, so the gradient vanishing problem appears. Other works are devoted to enhancing the input feature of Softmax by normalization 3; 4; 5; 6; 7. However, they all focus on the representation of features, not on addressing the gradient vanishing problem.

Taylor softmax: Vincent et al. 8 used a second-order Taylor series approximation of $e^x$ and derived the Taylor softmax as follows:

$$S_j = \frac{1 + x_j + 0.5\,x_j^2}{\sum_i^d \left(1 + x_i + 0.5\,x_i^2\right)}, \quad x_j \in (x_1, x_2, \ldots, x_j, \ldots, x_d)$$

where $d$ is the dimension of the input of Taylor softmax. Since the quadratic function changes smoothly, Taylor softmax can generate softer classification results to alleviate overfitting. However, when used without Cross-Entropy, Taylor softmax also causes gradient vanishing because of its saturation area.
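As a minimal NumPy sketch (illustrative code, not the authors' implementation; `taylor_softmax` is our name), the definition above is only a few lines:

```python
import numpy as np

def taylor_softmax(x):
    # Replace exp(x) with its second-order Taylor expansion 1 + x + 0.5*x^2,
    # which equals 0.5*((x + 1)^2 + 1) and is therefore always positive.
    z = 1.0 + x + 0.5 * x ** 2
    return z / z.sum(axis=-1, keepdims=True)

s = taylor_softmax(np.array([1.0, 2.0, 3.0]))
print(s)  # scores are positive and sum to 1
```

Because the quadratic grows much more slowly than the exponential, the resulting distribution is visibly softer than that of the standard Softmax on the same input.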

Soft-margin softmax: Liang et al. 9 introduced a distance margin into Softmax to strengthen the intra-class compactness and inter-class separability between learned features. The Soft-margin (SM) softmax can be described as follows:

$$S_j = \frac{e^{x_j - m}}{e^{x_j - m} + \sum_{i \neq j}^{d} e^{x_i}}, \quad x_j \in (x_1, x_2, \ldots, x_j, \ldots, x_d)$$

where $m$ is manually set; when $m$ is set to zero, SM-Softmax becomes identical to the original Softmax. SM-Softmax can be considered a shifted version of Softmax. Similar to Taylor softmax, SM-Softmax and its variant, Ensemble Soft-Margin Softmax 10, are proposed to encourage the discriminability of features, and the gradient vanishing problem is still not addressed.

SM-Taylor softmax: Kunal et al. 11 explored a higher-order Taylor series approximation of $e^x$ to come up with an $n$-th order Taylor softmax, where:

$$f_n(x) = \sum_{i=0}^{n} \frac{x^i}{i!}$$

They proved that $f_n(x)$ is always positive definite if $n$ is even. Additionally, they combined the strengths of Taylor softmax and Soft-margin softmax, and proposed SM-Taylor softmax as follows:

$$S_j = \frac{f_n(x_j)}{\sum_i^d f_n(x_i)}, \quad x_j \in (x_1, x_2, \ldots, x_j, \ldots, x_d)$$

However, it is still a method to enhance features, but not a solution to the gradient problem.
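A hedged sketch of SM-Taylor softmax, under the assumption (from the classification setting) that the margin $m$ is subtracted from the target logit only; `f_n` and `sm_taylor_softmax` are illustrative names, not the authors' code:

```python
import numpy as np
from math import factorial

def f_n(x, n):
    # n-th order Taylor series of exp(x); positive definite when n is even
    return sum(x ** i / factorial(i) for i in range(n + 1))

def sm_taylor_softmax(x, target, n=2, m=0.5):
    # Subtract the margin m from the target logit only, then normalize.
    shifted = x.copy()
    shifted[target] -= m
    z = f_n(shifted, n)
    return z / z.sum()

x = np.array([1.0, 2.0, 3.0])
print(sm_taylor_softmax(x, target=2, m=0.0))  # m = 0 recovers Taylor softmax
```

Subtracting the margin lowers the target's score, which is exactly what forces the learned features of the target class to be more compact.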

## 3 Formulation

For convenience, we denote the input, inter-value and scores by $x$, $z$ and $S$. In this section, we try to build some periodic functions as alternatives of Softmax. There are five aspects that determine whether a function is a favorable alternative: (1) value stability; (2) gradient stability; (3) saturation area; (4) zero-region gradient; (5) information submergence. Furthermore, when judging the gradient-related properties of the functions, we only consider aspects related to $x_j$ instead of the other elements contained in the input. Since the scores are mapped by periodic functions, the correlation between the gradient and the other elements is unavoidable. The plots of the functions and more discussion on how the other elements in the input influence the gradient of $S_j$ are provided in A.1.

According to the research on the Taylor Softmax function proposed by Vincent et al. 8, and the higher-order Taylor Softmax proposed by Kunal et al. 11, it is reasonable to map the input with a monotonic function, since the input can adapt to a suitable value range as parameters update. Therefore, we suggest using a periodic function to compress the input, so as to avoid approximating the small input to a fixed value (to keep it positive), and also to avoid the output being too large to have an appropriate gradient.

### 3.1 Softmax

What Softmax does is map the input $x$ to an inter-value $z = e^x$, and map the inter-value to scores $S$. The exponential function keeps the negative input positive, but also makes the positive input extremely large. For a large input $x_j$, $e^{x_j}$ is too large and dominates $\sum_i e^{x_i}$, which means $S_j \to 1$ and $\partial S_j / \partial x_j \to 0$; and for a small input $x_j$, $e^{x_j} \to 0$, and since $S_j \to 0$, $\partial S_j / \partial x_j \to 0$ as well. Therefore, the major cause of gradient vanishing in Softmax is that we need to compress values with unknown upper and lower bounds into $(0, 1)$. To do this, there has to be a saturation area for large and small values.

Before the discussion of the alternatives of Softmax, it is necessary to clarify the advantages of Softmax. First of all, the output of Softmax is positive definite. And due to the exponential function, the difference between inputs is magnified, which means Softmax can show the difference between inputs well, a good characteristic for attention mechanisms. Besides, according to the definition:

$$S_j^{\mathrm{softmax}} = \frac{e^{x_j}}{\sum_i^d e^{x_i}}, \quad x_j \in (x_1, x_2, \ldots, x_j, \ldots, x_d)$$

$$\frac{\partial S_j}{\partial x_j} = \frac{M \cdot e^{x_j}}{(M + e^{x_j})^2}, \quad M = \sum_{i \neq j}^{d} e^{x_i}$$

$$\frac{\partial^2 S_j}{\partial x_j^2} = M \cdot e^{x_j} \cdot (e^{x_j} + M)^{-2} - 2 \cdot M \cdot e^{2 x_j} \cdot (e^{x_j} + M)^{-3}$$

Let $\partial^2 S_j / \partial x_j^2 = 0$; we have:

$$M \cdot e^{x_j} \cdot (e^{x_j} + M)^{-2} - 2 \cdot M \cdot e^{2 x_j} \cdot (e^{x_j} + M)^{-3} = 0 \;\Rightarrow\; e^{x_j} = M$$

$$\mathrm{Extre}\left(\frac{\partial S_j}{\partial x_j}\right) = \frac{M \cdot M}{(M + M)^2} = \frac{1}{4}$$

which means the max gradient of Softmax is 0.25, so it will not cause gradient explosion. Additionally, in spite of the saturation problem, no matter how large the other elements of the input are, a sufficiently large value can always get an appropriate gradient.
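The saturation behaviour and the 0.25 bound can be checked numerically; the sketch below (an illustration, not the paper's code) uses the identity $\partial S_j / \partial x_j = S_j(1 - S_j)$, which equals $M e^{x_j} / (M + e^{x_j})^2$:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract the max for numerical stability
    return e / e.sum()

def diag_grad(x, j):
    # dS_j/dx_j = S_j * (1 - S_j) = M*e^{x_j} / (M + e^{x_j})^2
    s = softmax(x)[j]
    return s * (1.0 - s)

xs = np.linspace(-10.0, 10.0, 2001)
grads = [diag_grad(np.array([x, 0.0]), 0) for x in xs]
print(max(grads))           # 0.25, attained where e^{x_j} = M (x = 0 here)
print(grads[0], grads[-1])  # ~0 at both ends: the saturation area
```

Sweeping one logit against a fixed one makes the two saturation tails visible directly, without any training.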

### 3.2 Sin-max-constant / Cos-max

When it comes to periodic functions, there is a good reason to use the sine function, since it is widely used and differentiable. Therefore, we add a constant term 1 to $\sin(x)$ to keep the function positive definite, following the suggestion in 12, and Sin-max-constant is defined as follows:

$$S_j^{\mathrm{sinmax}} = \frac{1 + \sin(x_j)}{d + M + \sin(x_j)}, \quad M = \sum_{i \neq j}^{d} \sin(x_i)$$

where $d$ represents the dimension of the input. Sin-max-constant compresses the inter-value into $(0, 2)$, and since the sine function is periodic, there is no saturation area.

However, let $E(\sin(x_i)) = 0$, so that $E(M) = 0$; we have:

$$E(S_j^{\mathrm{sinmax}}) = \frac{1 + \sin(x_j)}{d} = \frac{1}{d} + \frac{\sin(x_j)}{d}, \quad d \gg \sin(x_j)$$

$$E(S_j^{\mathrm{sinmax}}) \approx \frac{1}{d}$$

which means that as the dimension $d$ of the input increases, the influence of $\sin(x_j)$ is weakened by the constant term, causing $x_j$ to be overwhelmed so it cannot be mapped to $S_j$ correctly. Besides, consider it from the view of the gradient:

$$\frac{\partial S_j}{\partial x_j} = \frac{(M + d - 1) \cdot \cos(x_j)}{(M + d + \sin(x_j))^2}$$

$$E\left(\frac{\partial S_j}{\partial x_j}\right) = \frac{(-\sin(x_j) + d - 1) \cdot \cos(x_j)}{d^2}$$

Similar to the value, as the dimension $d$ of the input increases, the gradient of Sin-max-constant drops to zero, and gradient vanishing occurs in the entire value range. The main reason for these defects is the constant term 1 in the inter-value $1 + \sin(x)$.
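The submergence by the constant term is easy to reproduce; in this illustrative sketch (our code, our names) the score spread collapses toward the uniform value $1/d$ as $d$ grows:

```python
import numpy as np

def sin_max_constant(x):
    # S_j = (1 + sin(x_j)) / sum_i (1 + sin(x_i))
    z = 1.0 + np.sin(x)
    return z / z.sum()

for d in (4, 64, 1024):
    s = sin_max_constant(np.linspace(-3.0, 3.0, d))
    # max - min is bounded by roughly 2/d: the input is drowned out
    print(d, s.max() - s.min())
```

Since the inter-value lives in $(0, 2)$ while the denominator grows like $d$, no element can ever stand out once $d$ is large.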

We try to remove the constant term in the inter-value, and the expression becomes:

$$S_j^{\mathrm{sinmax}} = \frac{\sin(x_j)}{M + \sin(x_j)}, \quad M = \sum_{i \neq j}^{d} \sin(x_i)$$

Now let $M + \sin(x_j) \to 0$; then we have:

$$E(S_j^{\mathrm{sinmax}}) = \frac{\sin(x_j)}{0} \to \pm\infty$$

which means the value of $S_j$ is unstable, causing the network to have a great risk of breaking down. And for the gradient, since $E(M) = -\sin(x_j)$ and $M + \sin(x_j) \to 0$, we have:

$$\frac{\partial S_j}{\partial x_j} = \frac{M \cdot \cos(x_j)}{(M + \sin(x_j))^2}$$

$$E\left(\frac{\partial S_j}{\partial x_j}\right) = \frac{-0.5 \cdot \sin(2 x_j)}{0} \to \pm\infty$$

which is also unstable because $\sin(x)$ is not positive definite.

The reason why $1 + \cos(x)$ is not good is similar to Sin-max-constant: the difference within the input is submerged by the constant term as the dimension grows. Letting the inter-value be $z = \cos(x)$, we can define Cos-max as:

$$S_j^{\mathrm{cosmax}} = \frac{\cos(x_j)}{M + \cos(x_j)}, \quad M = \sum_{i \neq j}^{d} \cos(x_i)$$

$$\frac{\partial S_j}{\partial x_j} = \frac{-M \cdot \sin(x_j)}{(M + \cos(x_j))^2}$$

Assume that $x$ strictly belongs to $(-0.5\pi, 0.5\pi)$; then Cos-max can be considered as Sin-max shifted to a positive definite range. However, a gradient vanishing problem appears when the input clusters in the zero-region, since the gradient is proportional to $\sin(x_j)$, which approaches zero there. Besides, from the view of gradient stability, we have:

$$\frac{\partial^2 S_j}{\partial x_j^2} = -M \cdot \cos(x_j) \cdot (M + \cos(x_j))^{-2} + 2 \cdot M \cdot \sin(x_j) \cdot (M + \cos(x_j))^{-3}$$

Let $\partial^2 S_j / \partial x_j^2 = 0$; we have:

$$\cos(x_j) - 2 \cdot \sin(x_j) \cdot (M + \cos(x_j))^{-1} = 0$$

$$-4 + (M^2 + 1) \cdot \cos^2(x_j) + 2 \cdot M \cdot \cos^3(x_j) + \cos^4(x_j) = 0$$

According to the solution provided in A.5, we have:

$$\mathrm{Extre}\left(\frac{\partial S_j}{\partial x_j}\right) \in (-\infty, +\infty), \quad M \in (-\infty, +\infty)$$

As $M$ increases, the extreme value of $\partial S_j / \partial x_j$ approaches $\pm\infty$, which means Cos-max is gradient-unstable, causing the network to have a great risk of breaking down, like Sin-max.

### 3.3 Sin2-max-shifted

To ensure that the inter-value is positive definite and no extra constant term is introduced, $\sin^2(x)$ is a reasonable choice. So we can define Sin2-max as:

$$S_j^{\mathrm{sin^2max}} = \frac{\sin^2(x_j)}{M + \sin^2(x_j)}, \quad M = \sum_{i \neq j}^{d} \sin^2(x_i)$$

Note that although $\sin^2(x) = 0.5 \cdot (1 - \cos(2x))$, Sin2-max is not just a scaled double-frequency version of Cos-max, owing to the constant terms. Therefore, the numerical and gradient characteristics of Sin2-max and Cos-max are different.

As shown in Figure 4, the possible problem of Sin2-max is that, assuming the input is distributed around zero, most of the input clusters in the region close to 0, so most of the gradients are close to 0, which makes the parameters difficult to update. To solve this 'conditional' problem, we can shift the input to the non-zero region by adding a phase to $x$.

Let $\partial^2 S_j / \partial x_j^2 = 0$. We have:

$$\cos(2 x_j) = -\frac{1}{2} \cdot \left(2M + 1 \pm \sqrt{8 + (2M + 1)^2}\right)$$

$$x_j \text{ for } \max\left(\frac{\partial S_j}{\partial x_j}\right) = \frac{1}{2} \arccos\left(-\frac{1}{2} \cdot \left(2M + 1 \pm \sqrt{8 + (2M + 1)^2}\right)\right)$$

Unfortunately, the $x_j$ for $\max(\partial S_j / \partial x_j)$ changes with $M$, so we have to find an approximate solution. Besides, as $M$ changes, the gradient oscillates within its range, causing gradient explosion or vanishing in the entire value range. Since $\sin^2(x)$ and $\sin(2x)$ have the same period $\pi$, the period of the gradient is $\pi$. We set the phase to $0.25\pi$ to make the gradient stable. So we get Sin2-max-shifted as follows:

$$\frac{\partial S_j}{\partial x_j} = \frac{M \cdot \sin(2 x_j + 0.5\pi)}{(M + \sin^2(x_j + 0.25\pi))^2}$$
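Sin2-max-shifted then amounts to a few lines. Below is a minimal sketch (names are ours, not the paper's), with a finite-difference check that the zero-region gradient does not vanish:

```python
import numpy as np

def sin2_max_shifted(x, phase=0.25 * np.pi):
    # S_j = sin^2(x_j + pi/4) / sum_i sin^2(x_i + pi/4)
    z = np.sin(x + phase) ** 2
    return z / z.sum()

x = np.zeros(4)                       # worst case for plain Sin2-max
eps = 1e-6
s0 = sin2_max_shifted(x)[0]
x[0] += eps
grad = (sin2_max_shifted(x)[0] - s0) / eps
print(s0, grad)  # uniform score 0.25, but a clearly non-zero gradient at x = 0
```

At $x = 0$ the shift puts every element at $\sin^2(\pi/4) = 0.5$, the steepest part of the curve, which is exactly the 'conditional' problem of plain Sin2-max being solved.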

### 3.4 Sin-Softmax

From another view, instead of replacing the exponential function, compressing the input into the unsaturated area is also a reasonable choice. To keep the gradient of the input in the zero region away from 0, we choose sine rather than cosine. We define Sin-Softmax as follows:

$$S_j^{\mathrm{sin\text{-}softmax}} = \frac{e^{\sin(x_j)}}{M + e^{\sin(x_j)}}, \quad M = \sum_{i \neq j}^{d} e^{\sin(x_i)}$$

$$\frac{\partial S_j}{\partial x_j} = \frac{M \cdot e^{\sin(x_j)} \cdot \cos(x_j)}{(M + e^{\sin(x_j)})^2}$$

Sin-Softmax can be considered a periodic-normalized version of Softmax, which is also similar to the periodic activation function proposed in SirenNet 13. The best part of Sin-Softmax is that the input is compressed into the well-performing region of Softmax by the periodic function, so the value and gradient are both stable, as shown in Figure 4 and A.1. Additionally, the inter-value is positive definite owing to the exponential function, and the gradient in the near-zero region is also well behaved. However, the possible defect of Sin-Softmax is that the largest score can only be $e^2$ times the smallest, since for any $x_j$:

$$\sin(x_j) \in (-1, +1), \quad e^{\sin(x_j)} \in \left(\frac{1}{e}, e\right)$$

which might cause the most contributing value to be submerged in a large number of low-contributing values as the dimension of the score maps (or the number of embeddings) increases.
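A minimal sketch of Sin-Softmax with the bounded-ratio defect made explicit (illustrative code, not the authors'):

```python
import numpy as np

def sin_softmax(x):
    # compress the input into [-1, 1] with sine, then apply Softmax
    z = np.exp(np.sin(x))
    return z / z.sum()

x = np.array([0.5 * np.pi, -0.5 * np.pi, 0.0])  # sin hits +1, -1, 0
s = sin_softmax(x)
print(s.max() / s.min())  # the ratio can never exceed e^2 ≈ 7.389
```

However extreme the raw scores are, the largest-to-smallest score ratio saturates at $e^2$, which is the submergence risk discussed above.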

### 3.5 Siren-max

Inspired by SirenNet 13, where sine is used as the activation function, we define an $f(x)$ with beneficial gradient properties by:

$$f(x) = \frac{\sin(x)}{1 - \sin(x)}$$

Since $f(x) \in [-0.5, +\infty)$, to make it positive definite, we add $0.5$ to $f(x)$ and define Siren-max as follows:

$$S_j^{\mathrm{siren\text{-}max}} = \frac{\dfrac{1 + \sin(x_j)}{2 - 2\sin(x_j)}}{M + \dfrac{1 + \sin(x_j)}{2 - 2\sin(x_j)}}, \quad M = \sum_{i \neq j}^{d} \frac{1 + \sin(x_i)}{2 - 2\sin(x_i)}$$

Note that the upper bound of $f(x) + 0.5$ is infinity, so adding a constant term to it will not cause the difference between inputs to be submerged, as in Sin-max-constant. As shown in Figure 4, there is no saturation area in Siren-max, and it performs well in the near-zero region. The possible defect is that the gradient has periodic jump points, which might make training unstable.
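A sketch of Siren-max (our illustrative code); the small `eps` guarding the $\sin(x_j) = 1$ pole is our addition, not part of the paper's definition:

```python
import numpy as np

def siren_max(x, eps=1e-8):
    # z = f(x) + 0.5 = (1 + sin(x)) / (2 - 2*sin(x)): positive, unbounded above
    z = (1.0 + np.sin(x)) / (2.0 - 2.0 * np.sin(x) + eps)
    return z / z.sum()

s = siren_max(np.array([0.0, 1.0, -1.0]))
print(s)  # the large z of x = 1 dominates: differences are not submerged
```

Because the inter-value is unbounded above, one element can still take most of the probability mass, in contrast to the bounded-ratio behaviour of Sin-Softmax.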

### 3.6 Pre-normalization

Note that the pre-normalization discussed here is not parameterized like Batch-norm 14, Layer-norm 15 or Group-norm 16; we denote by pre-normalization the row-wise normalization of the elements of the input. Considering the saturation problem, normalizing the input is also a reasonable operation. However, since the distribution of attention score maps differs across images, the normalization can hardly compress the maps into a specified value range precisely. Besides, the normalized function might still saturate, and the saturation area shifts with the statistics of the row. Therefore, the gradient situation under normalization is similar to that of Softmax, which may cause gradient vanishing too. As a result, although pre-normalization can roughly gather the values in a specified range, it may, on the contrary, bring new gradient problems. More discussion and the gradient plots of normalization and the pre-normalized versions of Softmax and the periodic alternatives are provided in A.2.
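The non-parametric, row-wise pre-normalization discussed above can be sketched as follows (illustrative code; the $10^{-6}$ guard against zero variance is our assumption):

```python
import numpy as np

def pre_norm_softmax(scores):
    # normalize each row to zero mean and unit variance, then apply Softmax
    mu = scores.mean(axis=-1, keepdims=True)
    sigma = scores.std(axis=-1, keepdims=True) + 1e-6
    z = np.exp((scores - mu) / sigma)
    return z / z.sum(axis=-1, keepdims=True)

row = np.array([[0.0, 50.0, 100.0]])  # raw Softmax would saturate here
print(pre_norm_softmax(row))
```

The row is pulled back toward the non-saturated region of the exponential, which illustrates both the benefit (less saturation) and the limitation (the mapping depends entirely on the per-row statistics).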

## 4 Experiments

To eliminate the unexpected effects of various tricks, the experiments are conducted on a simply designed demo based on LeViT 1, as shown in Figure 5.

In the experiments, we observe that most of the gradients of Softmax are very small, and only a small part of the updates can be successfully back-propagated, even in the early stages of training, as shown in Figure 6. This phenomenon supports our point that the values used to generate the attention scores are related to the content of the input images and are not strictly normally distributed. Therefore, even when scaled following the originally designed transformer block, the values may still fall into the saturation area of Softmax, making updates difficult. In our method, there is no saturation area in the functions, so the gradient is satisfactory at each training stage, which promotes the updating of parameters. More 3D graphs of gradients extracted from the experiments are provided in A.3.

As shown in Figure 1 and Figure 7, due to the gradient vanishing problem, Softmax might cause difficulty in the formation of attention, especially in the early stages of training. We observe that attention forms more smoothly on the boundary; on the contrary, the attention corresponding to the objects can only form in the later stages of training. A possible reason is that the scores of the boundary between object and content are moderate, so the gradient flows smoothly, while the scores of the objects are larger and might fall into the saturation area of Softmax, causing gradient vanishing and locking the formation of attention. Under the periodic alternatives, the attention is updated without restriction across the image, which strengthens our arguments.

The gradient performance in the zero-region is crucial for training, and the early breakdown of training under Cos-max and Sin2-max can be ascribed to this. Besides, the stability of the gradient is also very important: since there are jump points within the range of the input, the training under Siren-max breaks down too. In addition, since the input is submerged in the constant terms, the training under Sin-max-constant diverges.

Encouragingly, Sin2-max-shifted and Sin-Softmax exceed Softmax in the results, just as we speculated, and norm-Siren-max also performs surprisingly well. The results are shown in Table 2. The major drawbacks of Cos-max and Sin-max-constant are the gradient performance in the zero-region and information submergence, respectively, which cannot be optimized by pre-normalization. As for Siren-max, pre-normalization optimizes the distribution of the input and helps Siren-max avoid the gradient jump points, resulting in a satisfactory performance. Softmax can also be improved, since pre-normalization helps the input escape from the saturation area to some degree. However, Sin2-max-shifted and Sin-Softmax are not subject to the input distribution, so they cannot benefit from pre-normalization. On the contrary, since pre-normalization brings unexpected gradient problems, the performance of norm-Sin2-max-shifted and norm-Sin-Softmax decreases slightly. The plots and complete results of the experiments are provided in A.4.

## 5 Conclusion

Through the visualization of attention and gradients extracted from transformer blocks, we show that in the attention mechanism, Softmax does lead to the gradient vanishing problem and makes training difficult. To address the problem, we propose a series of periodic alternatives of Softmax, and the experimental results show that Sin-Softmax, Sin2-max-shifted, and norm-Siren-max perform better than Softmax in the attention mechanism. Additionally, we observe that pre-normalization is only a conditional solution, not always a good choice.

In the periodic alternatives, an embedding requiring more attention does not necessarily require a larger value, which makes the generation of the attention input freer, and it is hard to say whether this will lead to unexpected problems. This change might affect the representation of the model, and we will explore how this change happens in future work.