1 Introduction
Gradient vanishing in the attention block: Since the transformer is a pixel-wise process, the distribution of the input matters more than it does for a CNN. In a CNN, the deviation caused by values exceeding the expected range is diluted within patches. The statistical assumption that the input is normally distributed supports the gradient stability of Softmax. However, there are always some values that exceed the expected range, causing gradient vanishing. This situation becomes worse in attention mechanisms: the input of Softmax represents the relationships between embeddings (patches), and the distribution of the input varies across images. That means that in attention mechanisms, part of the input is always stuck in the saturation area, as shown in Figure 2, leading to gradient vanishing and long training.
Motivations: We are interested in the observation from transformer-based models that the formation of attention corresponding to objects often seems to lag behind that corresponding to boundaries, as shown in the first row of Figure 1. Attention can form on the boundary in the early stages of training, but appears on objects only slowly, in the mid-to-late stages. However, objects should be preferred positions for attention, since there are no inductive biases such as translation equivariance and locality [1], and objects should be more important than boundaries for the transformer. By investigating the pre-Softmax attention values, we find that the values corresponding to objects are larger than those of boundaries and are more likely to fall into the saturation area of Softmax. Therefore, we speculate that objects are indeed more important, but it is difficult to form attention on them, since their values are too large and fall into the saturation area. In contrast, the values of boundaries are moderate, so attention forms smoothly there. We believe this situation is one of the reasons why transformers need long training.
In this work, we suggest replacing the exponential function with periodic functions, and we delve into some potential periodic alternatives to Softmax from the view of value and gradient. Through experiments on a simply designed demo network modeled on LeViT, our methods are shown to alleviate the gradient problem and to perform better than Softmax and its variants.
To summarize, the main contributions of this paper are:

Exploring the gradient performance of Softmax in the transformer block, and showing that the input of Softmax is not normally distributed and that gradient vanishing does happen;

Introducing a series of periodic alternatives to Softmax in attention mechanisms, which compress the input into the unsaturated region by periodic functions to escape gradient vanishing;

Exploring the impact of pre-normalization for Softmax and for our methods, and making the observation that pre-normalization is only a conditional solution, not always a good choice.
2 Related works
There are few studies on alternatives to Softmax, since Softmax is mostly used to output classification results, where gradient vanishing can be avoided by using the analytical solution of the joint gradient of Softmax and cross-entropy. But in the attention mechanism, Softmax is used alone, so the gradient vanishing problem appears. Other works are devoted to enhancing the input features of Softmax by normalization [3; 4; 5; 6; 7]. However, they all focus on the representation of features and do not address the gradient vanishing problem.
Taylor softmax: Vincent et al. [8] used the second-order Taylor series approximation $e^x \approx 1 + x + \frac{x^2}{2}$ and derived the Taylor softmax as follows:

$\mathrm{Tsm}(x)_i = \frac{1 + x_i + x_i^2/2}{\sum_{j=1}^{d}\left(1 + x_j + x_j^2/2\right)}$
where $d$ is the dimension of the input of the Taylor softmax. Since the quadratic function changes smoothly, Taylor softmax generates softer classification results to alleviate overfitting. However, when used without cross-entropy, Taylor softmax causes gradient vanishing too, because of its saturation area.
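Both properties are easy to check numerically. The sketch below is our own illustration of the formula above, not the authors' code:

```python
import numpy as np

def taylor_softmax(x):
    # Second-order Taylor approximation of exp(x): 1 + x + x^2/2 is
    # positive for every real x (negative discriminant), so the scores
    # stay positive without using an exponential.
    z = 1.0 + x + 0.5 * x ** 2
    return z / z.sum()

print(taylor_softmax(np.array([1.0, -1.0, 0.0])))   # [0.625 0.125 0.25]
# The saturation area survives: a large element still takes almost all
# of the mass, so its gradient vanishes just as in Softmax.
print(taylor_softmax(np.array([100.0, 0.0, 0.0])))  # first score ~0.9996
```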
Soft-margin softmax: Liang et al. [9] introduced a distance margin into Softmax to strengthen the intra-class compactness and inter-class separability of learned features. The Soft-margin (SM) softmax can be described as follows:

$\mathrm{SM}(x)_i = \frac{e^{x_i - m}}{e^{x_i - m} + \sum_{j \neq i} e^{x_j}}$
where $m$ is set manually; when $m$ is set to zero, SM-softmax becomes identical to the original Softmax. SM-softmax can be considered a shifted version of Softmax. Similar to Taylor softmax, SM-softmax and its variant, Ensemble Soft-Margin softmax [10], are proposed to encourage the discriminability of features, and the gradient vanishing problem is still not addressed.
SM-Taylor softmax: Kunal et al. [11] explored the higher-order Taylor series approximation of $e^x$ to come up with an $n$-th-order Taylor softmax, where:

$f_n(x) = \sum_{k=0}^{n} \frac{x^k}{k!}$
They proved that $f_n(x)$ is always positive definite if $n$ is even. Additionally, they combined the strengths of the Taylor softmax and the Soft-margin softmax and proposed the SM-Taylor softmax, which replaces the exponential in the SM-softmax with $f_n$.
However, it is still a method to enhance features, not a solution to the gradient problem.
3 Formulation
For convenience, we denote the input, the inter-value, and the scores by $x$, $z$, and $s$, where $z_i = f(x_i)$ and $s_i = z_i / \sum_{j=1}^{d} z_j$. In this section, we try to build some periodic functions as alternatives to Softmax. There are five aspects that determine whether a function is a favorable alternative: (1) value stability; (2) gradient stability; (3) saturation area; (4) zero-region gradient; (5) information submergence. Furthermore, when judging the gradient-related properties of the functions, we only consider the aspects related to $x_i$ instead of the other elements contained in the input. Since the scores are mapped by $s_i = z_i / \sum_j z_j$, the correlation between the gradient and the other elements is unavoidable for the periodic functions. The plots of the functions and more discussion on how the other elements in the input influence the gradient of $s_i$ are provided in A.1.
According to the research on the Taylor softmax proposed by Vincent et al. [8] and the higher-order Taylor softmax proposed by Kunal et al. [11], it is reasonable to map the input with a monotonic function, since the input can adapt to a suitable value range as the parameters update. Therefore, we suggest using a periodic function to compress the input, so as to avoid approximating small inputs by a fixed value (to keep them positive), while also avoiding outputs so large that no appropriate gradient remains.
3.1 Softmax
What Softmax does is map the input $x$ to an inter-value $z_i = e^{x_i}$, and map the inter-value to scores $s_i = e^{x_i} / \sum_{j=1}^{d} e^{x_j}$. The exponential function keeps negative inputs positive, but also makes positive inputs extremely large. For a large input $x_i$, $e^{x_i}$ is too large and dominates $\sum_j e^{x_j}$, which means $s_i \to 1$ and $\partial s_i / \partial x_i \to 0$; for a small input $x_i$, $e^{x_i} \to 0$, and since $s_i \to 0$, $\partial s_i / \partial x_i \to 0$. Therefore, the major cause of gradient vanishing in Softmax is that we need to compress values with unknown upper and lower bounds into $(0, 1)$. To do this, there has to be a saturation area for large and small values.
Before discussing the alternatives to Softmax, it is necessary to clarify its advantages. First of all, the output of Softmax is positive definite. Due to the exponential function, differences between inputs are magnified, which means Softmax separates the inputs well — a good characteristic for attention mechanisms. Besides, according to the definition:

$\frac{\partial s_i}{\partial x_i} = s_i\,(1 - s_i)$

Let $g(s_i) = s_i(1 - s_i)$; we have:

$\max_{s_i} g(s_i) = g(0.5) = 0.25,$

which means the max gradient of Softmax is 0.25, so it will not cause gradient explosion. Additionally, in spite of the saturation problem, no matter how large the other elements of the input are, a sufficiently large $x_i$ can always obtain an appropriate gradient, since the inter-value $e^{x_i}$ is unbounded.
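Both the 0.25 bound and the saturation behavior can be verified directly. A minimal sketch (our own, for illustration):

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())  # subtract the max for numerical stability
    return z / z.sum()

def grad_diag(x):
    # Diagonal of the Softmax Jacobian: d s_i / d x_i = s_i * (1 - s_i)
    s = softmax(x)
    return s * (1 - s)

# s * (1 - s) peaks at s = 0.5, so the diagonal gradient never exceeds 0.25.
print(grad_diag(np.array([0.0, 0.0])))        # exactly [0.25, 0.25]
# One element deep in the saturation area collapses every gradient.
print(grad_diag(np.array([20.0, 0.0, 0.0])))  # all entries vanishingly small
```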
3.2 Sin-max-constant / Cos-max
When it comes to periodic functions, there is a good reason to use the sine function, since it is widely used and differentiable. Therefore, we set the inter-value to $z_i = \sin(x_i) + 1$ to keep the function positive definite, following the suggestion in [12], and Sin-max-constant is defined as follows:

$\mathrm{Sin\text{-}max\text{-}const}(x)_i = \frac{\sin(x_i) + 1}{\sum_{j=1}^{d}\left(\sin(x_j) + 1\right)}$
where $d$ represents the dimension of the input. The sine function compresses $x$ into $[-1, 1]$, and for $x \in (-\pi/2, \pi/2)$, there is no saturation area.
However, let $x_m = \max_j x_j$; then $\sin(x_m) + 1 \le 2$, and we have:

$s_m = \frac{\sin(x_m) + 1}{\sum_{j=1}^{d}\left(\sin(x_j) + 1\right)} \le \frac{2}{\sum_{j=1}^{d}\left(\sin(x_j) + 1\right)} \approx \frac{2}{d},$

which means that as the dimension of $x$ increases, the influence of $\sin(x_i)$ will be weakened by the constant term, causing the input to be overwhelmed and not mapped to the scores correctly. Besides, consider the view of the gradient:

$\frac{\partial s_i}{\partial x_i} = \frac{\cos(x_i)\sum_{j \neq i}\left(\sin(x_j) + 1\right)}{\left[\sum_{j=1}^{d}\left(\sin(x_j) + 1\right)\right]^2}$
Similar to the value, as the dimension of $x$ increases, the gradient of Sin-max-constant will drop to zero (the numerator grows like $d$ while the denominator grows like $d^2$), and gradient vanishing will occur over the entire value range. The main reason for these defects is the constant term 1 in $\sin(x_i) + 1$.
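The submergence effect is easy to reproduce. The sketch below is our own illustration of the Sin-max-constant form discussed above:

```python
import numpy as np

def sinmax_constant(x):
    # z_i = sin(x_i) + 1 stays positive, but the constant 1 makes the
    # denominator grow like the dimension d.
    z = np.sin(x) + 1.0
    return z / z.sum()

rng = np.random.default_rng(0)
for d in (4, 64, 1024):
    s = sinmax_constant(rng.normal(size=d))
    # The largest score is capped near 2/d, so for large d every score is
    # flattened toward the uniform value 1/d: the input is submerged.
    print(d, s.max(), 2.0 / d)
```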
We try to remove the constant term, and the expression becomes:

$\mathrm{Sin\text{-}max}(x)_i = \frac{\sin(x_i)}{\sum_{j=1}^{d}\sin(x_j)}$
Now let the elements of $x$ be such that $\sum_{j=1}^{d}\sin(x_j) \to 0$; we have:

$|s_i| = \left|\frac{\sin(x_i)}{\sum_{j=1}^{d}\sin(x_j)}\right| \to \infty,$
which means the value of Sin-max is unstable, causing the network to have a great risk of breaking down. And for the gradient, since $z_i = \sin(x_i)$ and $z_i' = \cos(x_i)$, we have:

$\frac{\partial s_i}{\partial x_i} = \frac{\cos(x_i)\sum_{j \neq i}\sin(x_j)}{\left[\sum_{j=1}^{d}\sin(x_j)\right]^2},$
which is also unstable because $\sum_{j=1}^{d}\sin(x_j)$ is not positive definite.
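The instability can be seen with two inputs whose sines nearly cancel (our own illustration):

```python
import numpy as np

def sinmax(x):
    # Without the constant term, the denominator sum(sin(x_j)) is not
    # positive definite: it can change sign and pass through zero.
    z = np.sin(x)
    return z / z.sum()

# Well-separated inputs behave fine ...
print(sinmax(np.array([0.5, 1.0])))
# ... but near-cancelling inputs make the denominator ~1e-6 and the
# scores explode, which is what breaks training.
print(sinmax(np.array([0.1, -0.1 + 1e-6])))
```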
The reason why simply adding a constant term back is not good is similar to Sin-max-constant: the differences within $x$ will be submerged by the constant term as the dimension grows. Let $z_i = \cos(x_i)$ and $s_i = z_i / \sum_j z_j$; we can define Cos-max as:

$\mathrm{Cos\text{-}max}(x)_i = \frac{\cos(x_i)}{\sum_{j=1}^{d}\cos(x_j)}$
Assume that $x$ strictly belongs to $(-\pi/2, \pi/2)$; then Cos-max can be considered as Sin-max shifted to a positive definite range, since $\cos(x) = \sin(x + \pi/2)$. However, a gradient vanishing problem will appear when $x$ clusters in the zero region, because the derivative $-\sin(x_i)$ approaches zero there. Besides, from the view of gradient stability, we have:

$\frac{\partial s_i}{\partial x_i} = \frac{-\sin(x_i)\sum_{j \neq i}\cos(x_j)}{\left[\sum_{j=1}^{d}\cos(x_j)\right]^2}$
Let $x_j = \pi/2 - \epsilon$ for all $j$; we have:

$\frac{\partial s_i}{\partial x_i} = \frac{-\cos(\epsilon)\,(d-1)\sin(\epsilon)}{\left[d\,\sin(\epsilon)\right]^2} = -\frac{(d-1)\cos(\epsilon)}{d^2\,\sin(\epsilon)}$

According to the solving provided in A.5, we can have:

$\mathrm{Extreme}\left(\frac{\partial s_i}{\partial x_i}\right) \approx -\frac{d-1}{d^2\,\epsilon}$

As $x$ approaches the boundary $\pi/2$ (i.e., as $\epsilon \to 0$), the extreme approaches infinity, which means Cos-max is gradient unstable, causing the network to have a great risk of breaking down, like Sin-max.
3.3 Sin2-max-shifted
To ensure that $z$ is positive definite and no extra constant term is introduced, $\sin^2(x)$ is a reasonable choice. So we can define Sin2-max as:

$\mathrm{Sin2\text{-}max}(x)_i = \frac{\sin^2(x_i)}{\sum_{j=1}^{d}\sin^2(x_j)}$
Note that although $\sin^2(x) = \frac{1 - \cos(2x)}{2}$, Sin2-max is not just a scaled double-frequency version of Cos-max, owing to the constant term. Therefore, the numerical and gradient characteristics of Sin2-max and Cos-max are different.
As shown in Figure 4, the possible problem of Sin2-max is that, assuming $x \sim N(0, 1)$, most of $x$ clusters in the region close to 0, so most of the gradients are close to 0, which makes the parameters difficult to update. To solve this 'conditional' problem, we can shift $x$ to the nonzero region by adding a phase $\varphi$ to $x$.
Let $z_i = \sin^2(x_i + \varphi)$. We have:

$\frac{\partial z_i}{\partial x_i} = \sin(2x_i + 2\varphi)$
Unfortunately, the best $\varphi$ will change with $x$, so we have to find an approximate solution. Besides, as $\varphi$ changes, the gradient will oscillate in the range of $[-1, 1]$, causing gradient explosion or vanishing over the entire value range. Since $\sin^2(x)$ and $\sin(2x)$ have the same period $\pi$, the period of $\varphi$ is $\pi$. We set $\varphi = \pi/4$, which maximizes the zero-region gradient, to make the gradient stable. So we get Sin2-max-shifted as follows:

$\mathrm{Sin2\text{-}max\text{-}shifted}(x)_i = \frac{\sin^2(x_i + \pi/4)}{\sum_{j=1}^{d}\sin^2(x_j + \pi/4)}$
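A minimal sketch of the shifted map follows. The phase $\pi/4$ used here is an illustrative choice: the derivative of $\sin^2(x + \varphi)$ is $\sin(2x + 2\varphi)$, which equals 1 at $x = 0$ when $\varphi = \pi/4$, so inputs clustered near zero keep a strong gradient:

```python
import numpy as np

def sin2max_shifted(x, phi=np.pi / 4):
    # With phi = pi/4, the gradient of sin^2(x + phi) is sin(2x + pi/2),
    # which is exactly 1 at x = 0: the zero region is no longer flat.
    z = np.sin(x + phi) ** 2
    return z / z.sum()

print(sin2max_shifted(np.zeros(4)))  # uniform scores, no flat spot at zero
# Numerical gradient of the inter-value at x = 0 is ~1
# (versus ~0 for the unshifted sin^2):
eps = 1e-6
print((np.sin(eps + np.pi / 4) ** 2 - np.sin(np.pi / 4) ** 2) / eps)
```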
3.4 SinSoftmax
From another view, instead of replacing the exponential function, compressing the input into the unsaturated region is also a reasonable choice. To keep the gradient of the input in the zero region away from 0, we choose sine rather than cosine. We define SinSoftmax as follows:

$\mathrm{SinSoftmax}(x)_i = \frac{e^{\sin(x_i)}}{\sum_{j=1}^{d} e^{\sin(x_j)}}$
SinSoftmax can be considered a periodic-normalized version of Softmax, which is also similar to the periodic activation function proposed in SirenNet [13]. The best part of SinSoftmax is that the input is compressed by a periodic function into the well-performing region of Softmax, so the value and gradient are both stable, as shown in Figure 4 and A.1. Additionally, $e^{\sin(x)}$ is positive definite owing to the exponential function, and the gradient in the near-zero region also performs well. However, the possible defect of SinSoftmax is that the largest score can be at most $e^2$ times the smallest, since $\sin(x) \in [-1, 1]$, which might cause the most contributing value to be drowned among a large number of low-contributing values as the dimension of the score maps (or the number of embeddings) increases.
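A minimal sketch of SinSoftmax and its bounded score ratio (our own illustration):

```python
import numpy as np

def sin_softmax(x):
    # The input is first compressed into [-1, 1] by sine, i.e. into the
    # unsaturated, well-behaved region of the exponential.
    z = np.exp(np.sin(x))
    return z / z.sum()

# Extreme inputs no longer saturate the scores ...
s = sin_softmax(np.array([100.0, -50.0, 0.0, 3.0]))
print(s)
# ... but the price is bounded contrast: max/min can never exceed e^2.
print(s.max() / s.min(), np.exp(2))
```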
3.5 Siren-max
Inspired by SirenNet [13], where sine is used as the activation function, we define an inter-value function with beneficial gradient properties by:
Since the inter-value is bounded below, to make it positive definite, we add a constant term to it and define Siren-max as follows:
Note that the upper bound of the inter-value is infinity, so adding a constant term to it will not cause the differences between inputs to be submerged as in Sin-max-constant. As shown in Figure 4, there is no saturation area in Siren-max, and it performs well in the near-zero region. The possible defect is that the gradient has periodic jump points, which might make training unstable.
                    Value stability?  Gradient stability?  No saturation area?  Zero-region gradient good?  No info submergence?
Softmax             ✓                 ✓                    ✘/✓                  ✓                           ✓
Sin-max-constant    ✘                 ✓                    ✓                    ✓                           ✘
Cos-max             ✘/✓               ✘/✓                  ✓                    ✘                           ✓
Sin2-max            ✓                 ✓                    ✓                    ✘                           ✓
Sin2-max-shifted    ✓                 ✓                    ✓                    ✓                           ✓
SinSoftmax          ✓                 ✓                    ✓                    ✓                           ✓
Siren-max           ✓                 ✘/✓                  ✓                    ✓                           ✓
3.6 Pre-normalization
Note that the pre-normalization discussed here is not parameterized like BatchNorm [14], LayerNorm [15], or GroupNorm [16]. We denote by pre-normalization the row-wise normalization of the elements of the attention score map. Considering the saturation problem, normalizing the input is also a reasonable operation. However, since the distribution of attention score maps differs across images, normalization can hardly compress the maps into a specified value range precisely. Besides, the normalization function itself may saturate, and its saturation area shifts with the mean $\mu$ and standard deviation $\sigma$. Therefore, the gradient situation of normalization is similar to that of Softmax, which may cause gradient vanishing too. As a result, although pre-normalization can roughly gather the values into a specified range, it may on the contrary bring new gradient problems. More discussion and the gradient plots of normalization and of the pre-normalized versions of Softmax and the periodic alternatives are provided in A.2.
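The non-parametric pre-normalization discussed above can be sketched as a row-wise standardization of the score map before the mapping function. This is our own sketch, for illustration:

```python
import numpy as np

def pre_normalize(scores, eps=1e-6):
    # Row-wise standardization with no learned scale or shift (unlike
    # BatchNorm / LayerNorm): each row of the attention score map is
    # brought to zero mean and unit variance.
    mu = scores.mean(axis=-1, keepdims=True)
    sigma = scores.std(axis=-1, keepdims=True)
    return (scores - mu) / (sigma + eps)

def softmax(x):
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

# Normalization pulls a wild row back toward Softmax's working range,
# but only roughly: the post-normalization spread still depends on the
# row's shape, so saturation is reduced, not eliminated.
row = np.array([[40.0, -3.0, 0.5, 1.2]])
print(softmax(row))                 # saturated: one score ~1
print(softmax(pre_normalize(row)))  # noticeably softer scores
```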
4 Experiments
To eliminate the unexpected effects of various tricks, the experiments are conducted on a simply designed demo network modeled on LeViT [1], as shown in Figure 5.
In the experiments, we observe that most of the gradients of Softmax are very small, and only a small part of the updates can be successfully backpropagated, even in the early stages of training, as shown in Figure 6. This phenomenon supports our point that the values used to generate the attention scores are related to the content of the input images and are not strictly normally distributed. Therefore, even when divided by $\sqrt{d}$ following the originally designed transformer block, the values may still fall into the saturation area of Softmax, making updates difficult. In our methods, there is no saturation area in the functions, so the gradient is satisfactory at each training stage, which promotes the updating of parameters. More 3D graphs of gradients extracted from the experiments are provided in A.3.
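Standard attention divides the dot-product scores by $\sqrt{d_k}$, but that scaling only yields unit-variance scores when queries and keys are independent, which trained projections on real images need not satisfy. The sketch below is our own illustration; the correlation strength 0.1 is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64
q = rng.normal(size=d_k)
k_indep = rng.normal(size=d_k)
# Independent q and k: the scaled score q.k / sqrt(d_k) has unit
# variance, which is the assumption behind dividing by sqrt(d_k).
print(q @ k_indep / np.sqrt(d_k))
# Correlated q and k (as learned projections can produce): the scaled
# score grows like sqrt(d_k) and lands deep in Softmax's saturation area.
k_corr = q + 0.1 * rng.normal(size=d_k)
print(q @ k_corr / np.sqrt(d_k))
```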
As shown in Figure 1 and Figure 7, due to the gradient vanishing problem, Softmax might cause difficulty in the formation of attention, especially in the early stages of training. We observe that attention forms more smoothly on the boundary, whereas the attention corresponding to objects only forms in the later stages of training. A possible reason is that the scores of the boundary between object and background are moderate, so the gradient flows smoothly, while the scores of the objects are larger and might fall into the saturation area of Softmax, causing gradient vanishing and locking the formation of attention. Under the periodic alternatives, in contrast, attention is updated unrestrained over the image, which strengthens our argument.

                    Depth = 1        Depth = 2        Depth = 4        Depth = 8
                    base    norm     base    norm     base    norm     base    norm
Softmax             52.67   53.80    58.21   59.57    67.01   68.75    81.12   82.31
Sin2-max-shifted    53.90   53.16    60.37   59.84    71.48   72.38    84.81   83.20
SinSoftmax          54.64   54.27    60.40   60.18    72.39   72.84    85.14   84.63
Siren-max           \       55.19    \       59.62    \       73.25    \       84.70
* "\" means training breaks down in the early stages
The gradient performance in the zero region is crucial for training, and the early breakdown of training under Cos-max and Sin2-max can be ascribed to this. Besides, gradient stability is also very important: since there are jump points within the range of the input, training under Siren-max breaks down too. In addition, since the input is submerged in the constant term, training under Sin-max-constant diverges.
Encouragingly, Sin2-max-shifted and SinSoftmax exceed Softmax in the results, just as we speculated, and norm-Siren-max also performs surprisingly well. The results are shown in Table 2. The major drawbacks of Cos-max and Sin-max-constant are the gradient performance in the zero region and information submergence, respectively, which cannot be fixed by pre-normalization. As for Siren-max, pre-normalization optimizes the distribution of the input and helps Siren-max avoid the gradient jump points, resulting in satisfactory performance. Softmax can also be improved, since pre-normalization helps the input escape from the saturation area to some degree. However, Sin2-max-shifted and SinSoftmax are not subject to the input distribution, so they gain no benefit from pre-normalization. On the contrary, since pre-normalization brings unexpected gradient problems, the performance of norm-Sin2-max-shifted and norm-SinSoftmax decreases slightly. The plots and complete results of the experiments are provided in A.4.
5 Conclusion
Through the visualization of attention and gradients extracted from transformer blocks, we show that in the attention mechanism, Softmax does lead to the gradient vanishing problem and makes training difficult. To address the problem, we propose a series of periodic alternatives to Softmax, and the experimental results show that SinSoftmax, Sin2-max-shifted, and norm-Siren-max perform better than Softmax in the attention mechanism. Additionally, we observe that pre-normalization is only a conditional solution, not always a good choice.
With the periodic alternatives, an embedding requiring more attention does not necessarily require a larger value, which makes the generation of the queries and keys freer, and it is hard to say whether this will lead to unexpected problems. This change might affect the representation of the model, and we will explore how it happens in future work.