1 Introduction
In many applications, especially in a wide range of machine learning classifiers such as multinomial linear regression and naive Bayes classifiers [4], [23], [26], one needs to compute an expression of the form

(1)   f(x) = log sum_{i=1}^n e^{x_i},

where x ∈ R^n. The function f is often referred to as logsumexp or LSE. Its gradient g = ∇f, given by

(2)   g_j(x) = e^{x_j} / sum_{i=1}^n e^{x_i},   j = 1: n,

is called softmax and is also a key function in classification algorithms [7, p. 355], [8, p. 78], [10]. It is often the case that both logsumexp and softmax are required simultaneously.
The most obvious danger in evaluating (1) and (2) is overflow. We are interested in IEEE arithmetic in the precisions half (fp16), single (fp32), and double (fp64) [16], as well as the bfloat16 half precision format [17]. Table 1 shows the key parameters of interest for these precisions: the unit roundoff u, the largest finite number r_max, and the smallest positive normalized and subnormal floating-point numbers. If some x_i exceeds the relevant value in Table 2 then overflow will occur in evaluating e^{x_i}. Clearly, overflow is possible even for quite modestly sized x_i, especially for half and single precision.
Underflow is also possible. For example, for n = 1, if x_1 is a finite floating-point number lying below the underflow threshold of e^x in Table 2 then fl(e^{x_1}) = 0 and the computed result is log 0 = −∞ (−∞ is the value of log 0 recommended by the IEEE standard [16, p. 43]), whereas f(x) = x_1. For n > 1, underflow in the exponential evaluations is a problem when the sum of the terms that underflow is significant compared with the sum of the other terms; otherwise underflows are harmless. As well as avoiding harmful underflow, it is desirable to avoid generating subnormal numbers, which incur a performance penalty if handled in software (see https://devblogs.nvidia.com/cuda-pro-tip-flush-denormals-confidence/ and https://en.wikipedia.org/wiki/Denormal_number); see [12] or [22] for details of subnormal numbers.
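These failure modes are easy to reproduce. The following NumPy snippet (the specific sample points are ours, chosen to sit just beyond the thresholds of Table 2) shows e^x overflowing to Inf and underflowing to zero in the low-precision formats:

```python
import numpy as np

# Illustration of the overflow/underflow thresholds for e^x (cf. Table 2).
with np.errstate(over="ignore", under="ignore"):
    print(np.exp(np.float16(12.0)))    # inf: 12 exceeds log(65504) ~ 11.09
    print(np.exp(np.float32(89.0)))    # inf: 89 exceeds ~ 88.72
    print(np.exp(np.float32(-110.0)))  # 0.0: below the fp32 subnormal range
    print(np.exp(np.float64(-750.0)))  # 0.0: below the fp64 subnormal range
```
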
A way to avoid overflow, and to attempt to avoid underflow and subnormal numbers, in evaluating logsumexp is to rewrite the computation with a shift a. If a ∈ R then

(3)   f(x) = a + log sum_{i=1}^n e^{x_i − a}.

Here and throughout, log denotes the principal logarithm: the logarithm whose imaginary part lies in (−π, π]. Equation (3) is not, in general, true for complex a [1, Lem. 2.5]. The softmax can be expressed in a related form (for any a):

(4)   g_j(x) = e^{x_j − a} / sum_{i=1}^n e^{x_i − a},   j = 1: n.
This shifting, typically with a = max_i x_i, is a well known way to attempt to avoid overflow and underflow in the evaluation of f and g, described in many places, including on Wikipedia (https://en.wikipedia.org/wiki/LogSumExp), in blog posts (for example, https://hips.seas.harvard.edu/blog/2013/01/09/computing-log-sum-exp/, http://bayesjumping.net/log-sum-exp-trick/, and https://jblevins.org/log/log-sum-exp; and similarly for the softmax: https://timvieira.github.io/blog/post/2014/02/11/exp-normalize-trick/), and even in a YouTube video (https://youtu.be/RVM21Voo7Q). The functions logsumexp in SciPy 1.3.1 [18] and LogSumExp in R [25] both implement (3) with a = max_i x_i. The function softmax in the MATLAB Deep Learning Toolbox (R2019a) [6] uses (4) with a = max_i x_i.

Table 1: Parameters of the floating-point formats (values rounded): unit roundoff u, largest finite number r_max, and smallest positive normalized (r_min) and subnormal (r_min^s) numbers.

            u          r_max      r_min       r_min^s
bfloat16    3.91e-3    3.39e38    1.18e-38    9.18e-41
fp16        4.88e-4    6.55e4     6.10e-5     5.96e-8
fp32        5.96e-8    3.40e38    1.18e-38    1.40e-45
fp64        1.11e-16   1.80e308   2.23e-308   4.94e-324

Table 2: Approximate thresholds for e^x (values rounded): e^x overflows for x above log r_max and underflows (first to the subnormal range, then to zero) for x below log r_min and log r_min^s, respectively.

            log r_max   log r_min   log r_min^s
bfloat16    88.72       -87.33      -92.2
fp16        11.09       -9.70       -16.64
fp32        88.72       -87.33      -103.28
fp64        709.78      -708.40     -744.44
An alternative to (4), which removes the denominator of (2) by subtracting logsumexp from the argument of the exponential in the numerator, is

(5)   g_j(x) = e^{x_j − f(x)},   j = 1: n.

The conciseness of this division-free formula makes it attractive for implementing softmax when a logsumexp function is available. This formula is used in the SciPy 1.3.1 function softmax, in a MATLAB toolbox [20] associated with the book [2], and in the internal function softmax in the MATLAB Statistics and Machine Learning Toolbox (R2019a) [24]; in each case the logsumexp term is computed by (3) with a = max_i x_i. The formula (5) can also be found in codes posted in online communities such as Stack Exchange.
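In NumPy terms, (5) is a one-liner once a logsumexp routine is available. The following sketch (function names are ours) uses the shifted formula (3) for the logsumexp term, as SciPy does:

```python
import numpy as np

def logsumexp(x):
    # Shifted formula (3) with a = max_i x_i.
    a = np.max(x)
    return a + np.log(np.sum(np.exp(x - a)))

def softmax_alt(x):
    # Division-free formula (5): g_j = exp(x_j - logsumexp(x)).
    return np.exp(x - logsumexp(x))

x = np.array([1.0, 2.0, 3.0])
g = softmax_alt(x)
print(g.sum())  # close to 1, but nothing in the formula enforces it exactly
```

Note that, unlike (2) or (4), nothing in (5) explicitly normalizes the result; the sum of the computed entries equals 1 only to the extent that the arithmetic is accurate.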
The accuracy properties of the formulas above are not clear. In particular, when a = max_i x_i and x_max < 0, f(x) in (3) is computed as a sum of two terms of opposite sign, so there could potentially be damaging subtractive cancellation.
In this work we analyze the unshifted formulas (1) and (2), the shifted formulas (3) and (4), and the division-free formula (5), in order to determine which choices of formulas give the best accuracy and reliability. In particular, we carry out a rounding error analysis of algorithms for the evaluation and relate the error bounds to the conditioning of f and g. We show that the shifted formulas have broadly similar error bounds to the unshifted ones, and so are entirely appropriate for practical use. We find, however, that the alternative softmax formula (5) has a less favorable error bound than the shifted formula (4) and tends to produce larger errors in practice.
We begin, in the next section, by investigating the conditioning of the logsumexp and softmax functions. In section 3 we give detailed rounding error analyses of the basic formulas. In section 4 we analyze the shifted formulas (3) and (4) and the alternative formula (5) and compare their error bounds with those for the unshifted formulas. Numerical experiments are given in section 5 to test the accuracy of the evaluations and also to examine how the sum of the computed softmax vector entries compares with the exact value 1. Conclusions are given in section 6.

From this point on, we assume that the x_i are real and we write

(6)   x_max = max_i x_i,   x_min = min_i x_i.
We will use the standard model of floating-point arithmetic [12, sec. 2.2]:

(7)   fl(x op y) = (x op y)(1 + δ),   |δ| ≤ u,   op ∈ {+, −, ×, /}.
2 Condition number
Before considering algorithms for computing logsumexp and softmax we investigate the conditioning of these functions, that is, the sensitivity of f(x) and g(x) in (1) and (2) to small perturbations in x.
We define the condition number of f in the usual way (see, e.g., [13, chap. 3]), by

cond(f, x) := lim_{ε→0} sup_{||Δx|| ≤ ε||x||} |f(x + Δx) − f(x)| / (ε |f(x)|).

This definition implies that

(8)   |f(x + Δx) − f(x)| / |f(x)| ≤ cond(f, x) ε + o(ε),   ||Δx|| ≤ ε||x||,

so that cond(f, x) measures the worst-case relative change in f corresponding to a small relative change in x. It is easy to show that for the ∞-norm,

(9)   cond(f, x) = ||x||_∞ / |f(x)|,

since ∇f(x) = g(x) ≥ 0 and sum_{i=1}^n g_i(x) = 1 by (2).
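The two facts used here, that the gradient of logsumexp is softmax and that the softmax entries sum to 1, can be sanity-checked by finite differences (a sketch; the test vector and tolerances are ours):

```python
import numpy as np

def f(x):                       # logsumexp via the shifted formula (3)
    a = x.max()
    return a + np.log(np.exp(x - a).sum())

def g(x):                       # softmax via the shifted formula (4)
    w = np.exp(x - x.max())
    return w / w.sum()

x = np.array([0.3, -1.2, 2.0, 0.5])
h = 1e-6
e = np.eye(4)
grad_fd = np.array([(f(x + h*e[i]) - f(x - h*e[i])) / (2*h) for i in range(4)])

print(np.abs(grad_fd - g(x)).max())  # tiny: the gradient of f is softmax
print(g(x).sum())                    # ~1, which is why the 1-norm of the
                                     # gradient drops out of (9)
```
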
We identify two extreme cases. First, the condition number is infinite for f(x) = 0, because of the division by |f(x)|; f(x) = 0 is equivalent to sum_{i=1}^n e^{x_i} = 1. Hence when x_i ≈ −log n for all i the condition number must be large. Second, if x_max = ||x||_∞ then f(x) ≥ ||x||_∞ by (23) below, so cond(f, x) ≤ 1 and the problem is perfectly conditioned.
A forward stable algorithm for computing logsumexp is one for which the relative error of the computed result is bounded by p(n) cond(f, x) u, for some low degree polynomial p. Ideally, we would like the algorithm that we use to be forward stable. To see whether it is reasonable to expect forward stability, consider the case n = 1. Then f(x) = log(e^{x_1}) = x_1, so cond(f, x) = 1: the problem is perfectly conditioned. When we compute f we can expect to obtain relative errors in the computed exponential and logarithm bounded by u [5], [21], [22, Chap. 10], that is,

(10)   fl(f(x)) = log( e^{x_1}(1 + δ_1) ) (1 + δ_2),   |δ_1|, |δ_2| ≤ u.

The (1 + δ_2) term just causes a small relative perturbation of the output, so we have

fl(f(x)) ≈ log( e^{x_1}(1 + δ_1) ) = x_1 + log(1 + δ_1).

Hence, since log(1 + δ_1) ≈ δ_1,

(11)   |f(x) − fl(f(x))| / |f(x)| ≲ u / |x_1|.

This relative error bound is much larger than u for |x_1| ≪ 1, even though the problem is perfectly conditioned. So it is not reasonable to expect an algorithm to be unconditionally forward stable in floating-point arithmetic. For this trivial computation, backward error and forward error are the same, so we also conclude that we cannot expect to obtain an algorithm that is unconditionally backward stable.
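The n = 1 effect is easy to reproduce in single precision: for tiny x_1 the computed e^{x_1} rounds to 1, so log(e^{x_1}) evaluates to zero and all relative accuracy is lost, exactly as (11) predicts (a NumPy illustration; the sample value is ours):

```python
import numpy as np

x1 = np.float32(1e-8)     # |x1| well below the fp32 unit roundoff u ~ 6e-8
y = np.log(np.exp(x1))    # fl(log(fl(e^{x1})))
print(y)                  # 0.0: e^{x1} rounds to 1, relative error is 100%
```
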
The softmax function has condition number

cond(g, x) := lim_{ε→0} sup_{||Δx|| ≤ ε||x||} ||g(x + Δx) − g(x)|| / (ε ||g(x)||),

which is given explicitly by

cond(g, x) = ||G(x)|| ||x|| / ||g(x)||.

Here, the matrix G(x) ∈ R^{n×n} with g_ij = ∂g_i(x)/∂x_j is the Jacobian of g, and ||·|| denotes any vector norm and the corresponding subordinate matrix norm. Now

G(x) = diag(g(x)) − g(x) g(x)^T.

We have, for each i,

sum_{j=1}^n |g_ij(x)| = g_i(1 − g_i) + g_i sum_{j ≠ i} g_j = 2 g_i (1 − g_i) ≤ 1/2,

that is, ||G(x)||_∞ ≤ 1/2. Hence

cond_∞(g, x) ≤ ||x||_∞ / (2 ||g(x)||_∞) ≤ n ||x||_∞ / 2,

because ||g(x)||_∞ ≥ 1/n. We note in passing that G(x) is the Hessian of f(x) and can be shown to be symmetric positive semidefinite for all x [3, p. 74].
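The explicit Jacobian, its row-sum bound, and its positive semidefiniteness can all be checked numerically (a sketch; the test vector and tolerances are ours):

```python
import numpy as np

def softmax(x):
    w = np.exp(x - x.max())
    return w / w.sum()

x = np.array([0.1, -0.4, 1.3])
g = softmax(x)
G = np.diag(g) - np.outer(g, g)   # explicit Jacobian (= Hessian of logsumexp)

# Finite-difference Jacobian: column k approximates dg/dx_k.
h = 1e-6
e = np.eye(3)
G_fd = np.column_stack([(softmax(x + h*e[:, k]) - softmax(x - h*e[:, k])) / (2*h)
                        for k in range(3)])

print(np.abs(G - G_fd).max())          # tiny: the formula matches
print(np.abs(G).sum(axis=1).max())     # <= 1/2: the infinity-norm bound
```
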
3 Basic algorithms and error analysis
Algorithm 3 (basic logsumexp and softmax)

Given x ∈ R^n, this algorithm computes f(x) = logsumexp(x) and the gradient g(x) = softmax(x).

1   s = 0
2   for i = 1: n
3       w_i = e^{x_i}
4       s = s + w_i
5   end
6   f = log s
7   for i = 1: n
8       g_i = w_i / s
9   end
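A direct NumPy transcription of this basic algorithm (ours; the experiments later in the paper use MATLAB) might read as follows. It carries the overflow and underflow vulnerabilities discussed in section 1:

```python
import numpy as np

def basic_logsumexp_softmax(x):
    # Unshifted evaluation of (1) and (2): no guard against overflow or
    # underflow in the exponentials.
    w = np.exp(x)
    s = w.sum()
    return np.log(s), w / s

f, g = basic_logsumexp_softmax(np.array([0.5, 1.5, -0.5]))
print(f, g.sum())
```
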
What can be said about the accuracy of this algorithm when it is implemented in floatingpoint arithmetic? To answer this question we carry out a rounding error analysis. Throughout this section, we assume that there is no overflow or underflow.
First, we consider the error in evaluating the sum of nonnegative terms

s = sum_{i=1}^n w_i,   w_i = e^{x_i}.

Evaluating w_i = e^{x_i} yields a computed result ŵ_i satisfying

(12)   ŵ_i = e^{x_i}(1 + δ_i),

where, as noted in section 2, we can expect the relative error from the exponential evaluation to satisfy |δ_i| ≤ u. Therefore |ŵ_i − w_i| ≤ u w_i.
Write the (exact) sum of computed quantities as
The rounding error analysis in [11], [12, sec. 4.2] shows that the computed sum satisfies
where , so that, since ,
Writing , we obtain
(13) 
since . Hence
(14) 
Then the computed logsumexp is

(15)   f̂ = fl(log ŝ) = log(ŝ)(1 + δ),   |δ| ≤ u.
Using (14) we obtain
which gives the following result.
Theorem 1 (Basic logsumexp algorithm)
Comparing this bound with in (9) we see that it is larger by the factor . But by (23) below, so this factor is bounded by . Hence we have forward stability as long as , but for the bound does not guarantee forward stability. This is consistent with the bound (11) for the case .
Turning to the evaluation of the softmax function from its definition (2), by (12) we have
where accounts for the division, and so by (14),
Therefore
This bound guarantees a relative error of order at most in every component of . We weaken the bound into a normwise bound for the next theorem.
Theorem 2 (Basic softmax algorithm)
While the error bounds of Theorems 1 and 2 have a very satisfactory form, they provide no useful information when nu ≥ 1, and for fp16 this happens for n as small as 2048. We note, however, that the nu terms, which come from the summation, are pessimistic. It is shown by Higham and Mary [14, Thm. 3.1] that, under a probabilistic model of rounding errors, nu in the error bound for summation can be replaced by a small constant multiple of sqrt(n) u with high probability, and the same holds for the bounds of Theorems 1 and 2.

Next, consider the alternative formula (5), which we rewrite here:
(18)   g_j(x) = e^{x_j − f(x)},   j = 1: n.
With f(x) evaluated in floating-point arithmetic by Algorithm 3, we obtain
(19)  
(20) 
where, using Theorem 1,
We summarize this result as follows.
Theorem 3 (Alternative softmax algorithm)
4 Algorithms with shifting
Now we consider the use of shifts in the logsumexp and softmax evaluations in order to avoid overflow and reduce the chance of harmful underflow.
Recall the definition (6) of x_max and x_min. Overflow in the exponential evaluations in (3) is certainly avoided if we take a = x_max, as we then have x_i − a ≤ 0 and hence e^{x_i − a} ≤ 1 for all i. We can rewrite (3) as

(22)   f(x) = x_max + log( 1 + sum_{i ≠ k} e^{x_i − x_max} ),

where k is an index for which x_k = x_max. From this expression we see that

(23)   x_max ≤ f(x) ≤ x_max + log n.

It follows that when x_max ≥ 0, the sum "x_max + log(1 + ···)" that produces f(x) cannot suffer cancellation, since both terms are nonnegative.
For later use, we note that (23) implies that, for any x,

(24)   |f(x)| ≤ |x_max| + log n.
The term log(1 + sum_{i ≠ k} e^{x_i − x_max}) in (22) has the form log(1 + t) with t ≥ 0. If t is very small then 1 + t will round to 1 and the logarithm will evaluate as zero, even though log(1 + t) ≈ t ≠ 0. To avoid this loss of information we will use the function log1p(t) = log(1 + t) provided in, for example, C, MATLAB, and NumPy. These functions guarantee an accurate result for small t (which can be achieved with a simple formula based on the computed value of 1 + t [9], [12, Prob. 1.5]).
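The need for log1p is easy to see in double precision (a two-line illustration using the standard math library; the sample value is ours):

```python
import math

t = 1e-20                 # so small that fl(1 + t) = 1 in double precision
print(math.log(1.0 + t))  # 0.0: the information in t is lost
print(math.log1p(t))      # 1e-20: log1p remains accurate for small t
```
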
These considerations lead to Algorithm 4.
Algorithm 4 (logsumexp and softmax with shift)

This algorithm computes f(x) = logsumexp(x) and the gradient g(x) = softmax(x) for x ∈ R^n.

1   a = x_max   % x_k = x_max
2   s = 0
3   for i = 1: n
4       w_i = e^{x_i − a}
5       if i ≠ k, s = s + w_i, end
6   end
7   f = a + log1p(s)
8   for i = 1: n
9       g_i = w_i / (1 + s)
10  end
Note that while it is important to avoid explicitly forming 1 + s for the logarithm evaluation, for the softmax we can safely form 1 + s because if s is small it has little influence on the computed g_i.
Algorithm 4 avoids overflow. If underflow occurs in an exponential then it is in a term e^{x_i − x_max} of the sum added to 1 in (22), so that term is negligible and the underflow is harmless. Note, in particular, that if all the e^{x_i} underflow then whereas Algorithm 3 returns f̂ = log 0 = −∞, Algorithm 4 suffers no harmful underflow and returns f̂ ≈ x_max.
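A NumPy sketch of Algorithm 4 (our transcription): the unit term e^{x_k − a} = 1 is kept out of the sum and reintroduced through log1p, and the shifted exponentials are reused for the softmax.

```python
import numpy as np

def shifted_logsumexp_softmax(x):
    # Shifted evaluation of (22) and (4) with a = x_max.
    k = int(np.argmax(x))
    a = x[k]
    w = np.exp(x - a)               # all entries <= 1: no overflow possible
    s = np.sum(np.delete(w, k))     # sum over i != k
    return a + np.log1p(s), w / (1.0 + s)

# Inputs whose exponentials all overflow or all underflow in fp64:
print(shifted_logsumexp_softmax(np.array([1000.0, 999.0]))[0])   # finite
print(shifted_logsumexp_softmax(np.array([-800.0, -801.0]))[0])  # finite
```

Both calls return finite results, whereas the unshifted evaluation would return Inf and −∞, respectively.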
The main question is how shifting affects the accuracy of the evaluations. We give a rounding error analysis to assess this question. The analysis is a generalization of that in the previous section for the unshifted algorithm.
We first examine the error in evaluating the sum of nonnegative terms

(25)   s = sum_{i=1, i ≠ k}^n e^{x_i − x_max}.
Evaluating yields a computed result satisfying
Therefore
and hence
Assuming for notational simplicity that k = n, we can write the (exact) sum of computed quantities as
The rounding error analysis in [11], [12, sec. 4.2] shows that the computed sum satisfies
where , so that, since ,
Hence
(26) 
since . Hence
(27) 
which guarantees an accurate computed sum as long as is not too large.
The final stage of the computation is to evaluate f = x_max + log1p(s) using the computed ŝ, for which we have

f̂ = ( x_max + log1p(ŝ)(1 + δ_1) )(1 + δ_2),   |δ_1|, |δ_2| ≤ u.

Here, we are assuming that the log1p function has the property

fl(log1p(t)) = log1p(t)(1 + δ),   |δ| ≤ u.
Ignoring the innocuous term and writing, by (27),
(28) 
we have
using a Taylor series expansion about of the logarithm. Hence
Bounding using (28) gives
(29) 
or, as a relative error bound, since ,
(30) 
Simplifying the bound gives the next result.
Theorem 4 (Shifted logsumexp algorithm)
The main question is how this result compares with Theorem 1 for the unshifted algorithm. The only difference in the bounds is that in (16) is replaced by here. Now is possible only if and , so let us assume that these two inequalities hold. The term comes from bounding the term , where is defined in (25) and , and if then . Hence the potentially large constant is mitigated by the term that it multiplies—something that is lost in the manipulations to achieve a readable bound. We conclude that shifting should have little effect on the accuracy.
We note that (31) is weaker than necessary when (recall that ), since we bounded by in going from (29) to (30). If then (29) becomes
Since also implies for and hence , we have
which is a factor smaller than (31).
Turning to the evaluation of the softmax function from the shifted formula (4), we have, using (27),
where corresponds to the exponential evaluation and to the division, and
Therefore
Hence we have obtained the following result.
Theorem 5 (Shifted softmax algorithm)
Again, this is broadly commensurate with Theorem 2 for the unshifted evaluation, bearing in mind the comments following Theorem 4.
Finally, we consider (5) with the logsumexp computed by Algorithm 4. In floatingpoint arithmetic we have the same equation (19) as for the unshifted algorithm, but now with bounded by, using (31),
We have obtained the following result.
Theorem 6 (Alternative shifted softmax algorithm)
This is broadly similar to Theorem 3 for the unshifted alternative softmax algorithm.
5 Computational experiments
We now perform some experiments in a realistic setting, using MATLAB R2019a. The codes and data used for the experiments are available online at https://github.com/higham/logsumexp-softmax-tests.
Our aims are to examine the sharpness of the rounding error bounds and to give a pairwise comparison of the accuracy of the algorithms in floatingpoint arithmetic. Our data comes from a deep learning application. To generate the data, we first set up and trained an artificial neural network, using the MATLAB Deep Learning Toolbox
More precisely, we trained a network to classify handwritten digit data from the widely used MNIST data set [19]. Here each data point is a 28 × 28 grayscale pixel image and there are ten categories: 0, 1, …, 9. We used a network whose architecture has the following general form:
Image Input: 28 × 28 × 1 images with normalization.

Convolution: 8 filters, stride [1 1], padding 'same'.

Batch Normalization: 8 channels.

ReLU.

Max Pool: stride [2 2], padding [0 0 0 0].

Convolution: 16 filters, stride [1 1], padding 'same'.

Batch Normalization: 16 channels.

ReLU.

Max Pool: stride [2 2], padding [0 0 0 0].

Convolution: 32 filters, stride [1 1], padding 'same'.

Batch Normalization: 32 channels.

ReLU.

Fully Connected: 10 outputs.

Softmax.

Classification Output: cross-entropy.
This is the default architecture from [6], where further details may be found.
The network was trained on 7500 images (750 from each of the ten categories), with 2500 further images (250 from each of the ten categories) used for validation.
The network takes as input a 28 × 28 matrix corresponding to the pixels in the image and returns a nonnegative 10-vector whose ith component may be interpreted as the probability that the image came from category i. If we categorize according to the highest probability from the output, then the trained network misclassified 27 of the 2500 validation images, corresponding to a 98.9% success rate.
The network uses single precision arithmetic, fp32. In our experiments, we are concerned only with floating-point arithmetic issues, and we treat the trained network as a means to produce a realistic data set. To do this, we extracted the 2500 single precision vectors from the validation set that were passed into the softmax layer and converted them to fp16 or bfloat16. We then used this data in our implementations of the softmax and logsumexp algorithms that we have studied in the previous sections.
To record errors in computed results we applied the basic algorithm, Algorithm 3, in single precision to provide a reference solution and used the chop function of [15] to simulate half precision arithmetic, in both the fp16 format and the bfloat16 format.
We first describe experiments in fp16. As indicated in Table 2, e^x overflows in fp16 for x > log(65504) ≈ 11.09, and components of this size occur in the 2500 test vectors. Hence, in these tests, overflow is an issue for the basic logsumexp implementation in Algorithm 3: it generated an Inf for 475 of the 2500 test vectors. The shifted version of logsumexp in Algorithm 4 did not overflow. In the plots below, we do not include results for the cases where Algorithm 3 produced overflow.
First, we look at the logsumexp algorithms. In the upper left plot of Figure 1 we used the basic implementation of logsumexp, Algorithm 3. We scatter plot over the 2025 vectors where no overflow occurred. For each such vector, the horizontal coordinate is the leading term in the error bound of Theorem 1, scaled by u, and the vertical coordinate is the actual relative error, also scaled by u. Here, as shown in Table 1, u = 2^{−11} ≈ 4.9 × 10^{−4} for fp16. The plot also gives a reference line of slope 1 through the origin. We see that the bound is always satisfied and is reasonably sharp in many cases.
In the upper right plot of Figure 1 we show corresponding results for the shifted logsumexp implementation in Algorithm 4, using the bound from Theorem 4.
In the lower part of Figure 1
we scatter plot the floatingpoint errors for the basic and shifted algorithms. Here, for 1863 out of the 2025 cases (92%) the two errors were identical to all digits in the half precision computation. In more detail, over all the data points the ratio of the error in the basic logsumexp (horizontal axis) divided by the error in the shifted version (vertical axis) varied between 0.19 and 59, with a mean of 1.07 and a standard error of 0.03. This indicates that the two versions perform similarly, with the shift producing slightly better results.
We now move on to the four softmax implementations. In Figure 2 we use the shifted softmax implementation from Algorithm 4, analyzed in Theorem 5, as the basis for comparison. The upper left plot has the scaled error from Algorithm 4 on the horizontal axis and the scaled error from the basic softmax in Algorithm 3 on the vertical axis. The upper right plot compares the shifted softmax against the alternative algorithm analyzed in Theorem 3. Similarly, the lower plot compares against the alternative shifted softmax algorithm analyzed in Theorem 6. We see that the softmax values obtained from Algorithms 3 and 4 have similar accuracy, whereas the alternative softmax versions based on the rewrite in (5) are typically less accurate.
A further test is to compute the sum of each softmax vector, which should equal . In Figure 3 we compare the softmax sums for the basic algorithm (red circles) analyzed in Theorem 2 and the alternative version (blue crosses) analyzed in Theorem 3. Similarly, Figure 4 compares the shifted softmax algorithm analyzed in Theorem 5 and its alternative analyzed in Theorem 6. The order along the axis is arbitrary; it corresponds to the order in which the data vectors were generated. These figures provide further evidence that the alternative softmax algorithms are less accurate than the basic or shifted algorithms.
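The softmax-sum test can be approximated without the chop-based simulation of [15] by running the whole computation in NumPy's float16 type (this is only a rough stand-in, and the data, seed, and function names here are ours):

```python
import numpy as np

def softmax_shifted(x):           # formula (4) with a = max_i x_i
    w = np.exp(x - x.max())
    return w / w.sum()

def softmax_alt(x):               # formula (5): exp(x - logsumexp(x))
    k = int(np.argmax(x))
    f = x[k] + np.log1p(np.sum(np.delete(np.exp(x - x[k]), k)))
    return np.exp(x - f)

rng = np.random.default_rng(0)
x = rng.standard_normal(10).astype(np.float16)   # data held in fp16
dev_shifted = abs(float(softmax_shifted(x).sum()) - 1.0)
dev_alt = abs(float(softmax_alt(x).sum()) - 1.0)
print(dev_shifted, dev_alt)   # deviation of the computed softmax sums from 1
```

For a single vector either method may happen to give the smaller deviation; the trend reported in the figures emerges over many vectors.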
We also conducted the corresponding experiments in simulated bfloat16 arithmetic. Here, as indicated in Tables 1 and 2, the number range is increased at the expense of reduced precision. In this case there was no overflow in any of the algorithms. The results were very similar to those for fp16, so they are not shown here.
6 Conclusions
The logsumexp and softmax functions both feature in many computational pipelines, so it is important to compute them accurately and to avoid generating infs or NaNs because of overflow or underflow. To this end, a shift is usually incorporated into the defining formulas, yielding (3) and (4). It is important to understand the effect of the shift on the accuracy of the computed result, especially when computations are carried out in a low precision such as bfloat16 or fp16, which have the equivalent of only 3 or 4 decimal digits of precision.
Our rounding error analysis shows that shifting by the largest element of the input vector does not lessen the accuracy of the computed logsumexp and softmax. Underlying this pleasing fact is the phenomenon that any large coefficients caused by shifting are canceled by multiplication with small exponentials.
We obtained an explicit formula for the condition number of logsumexp and bounds for the condition number of softmax, and we were able to identify situations in which the logsumexp algorithms are guaranteed to be forward stable.
For the alternative and widely used softmax formula that avoids division, (5), we obtained larger error bounds than for the shifted formula (4). Since our numerical experiments confirm that larger errors are typically obtained in practice, we recommend using (4) instead of (5) to evaluate softmax.
In summary, Algorithm 4 is our recommendation for computing logsumexp and softmax. It avoids overflow, reduces the chance of harmful underflow, and generally produces results as accurate as those from the unshifted formulas.
References
 [1] Mary Aprahamian and Nicholas J. Higham. The matrix unwinding function, with an application to computing the matrix exponential. SIAM J. Matrix Anal. Appl., 35(1):88–109, 2014.
 [2] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer-Verlag, New York, 2006. xx+738 pp. ISBN 978-0-387-31073-2.
 [3] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, UK, 2004. xiii+716 pp. ISBN 0521833787.
 [4] Giuseppe C. Calafiore, Stephane Gaubert, and Corrado Possieri. Log-sum-exp neural networks and posynomial models for convex and log-log-convex data. IEEE Trans. Neural Networks and Learning Systems, pages 1–12, 2019.
 [5] Florent de Dinechin, Christoph Lauter, and Jean-Michel Muller. Fast and correctly rounded logarithms in double-precision. RAIRO-Inf. Theor. Appl., 41(1):85–102, 2007.
 [6] Deep Learning Toolbox. The MathWorks, Inc., Natick, MA, USA. http://www.mathworks.co.uk/products/deeplearning/.
 [7] Bradley Efron and Trevor Hastie. Computer Age Statistical Inference. Algorithms, Evidence, and Data Science. Cambridge University Press, Cambridge, UK, 2016. xix+475 pp. ISBN 9781107149892.
 [8] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. The MIT Press, Cambridge, MA, USA, 2016. xxii+775 pp. ISBN 978-0-262-03561-3.
 [9] HP-15C Advanced Functions Handbook. Hewlett-Packard, Portable Computer Division, Corvallis, OR, USA, 1982. 221 pp. Part number 00015-90011 Rev. C.
 [10] Catherine F. Higham and Desmond J. Higham. Deep learning: An introduction for applied mathematicians. SIAM Review, to appear, 2019.
 [11] Nicholas J. Higham. The accuracy of floating point summation. SIAM J. Sci. Comput., 14(4):783–799, 1993.
 [12] Nicholas J. Higham. Accuracy and Stability of Numerical Algorithms. Second edition, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2002. xxx+680 pp. ISBN 0898715210.
 [13] Nicholas J. Higham. Functions of Matrices: Theory and Computation. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2008. xx+425 pp. ISBN 9780898716467.
 [14] Nicholas J. Higham and Theo Mary. A new approach to probabilistic rounding error analysis. MIMS EPrint 2018.33, Manchester Institute for Mathematical Sciences, The University of Manchester, UK, November 2018. 22 pp. Revised March 2019. To appear in SIAM J. Sci. Comput.
 [15] Nicholas J. Higham and Srikara Pranesh. Simulating low precision floatingpoint arithmetic. MIMS EPrint 2019.4, Manchester Institute for Mathematical Sciences, The University of Manchester, UK, March 2019. 18 pp. Revised July 2019. To appear in SIAM J. Sci. Comput.
 [16] IEEE Standard for Floating-Point Arithmetic, IEEE Std 754-2008 (revision of IEEE Std 754-1985). IEEE Computer Society, New York, 2008. 58 pp. ISBN 978-0-7381-5752-8.
 [17] Intel Corporation. BFLOAT16 hardware numerics definition, November 2018. White paper. Document number 338302-001US.
 [18] Eric Jones, Travis Oliphant, Pearu Peterson, et al. SciPy: Open source scientific tools for Python, 2001–. http://www.scipy.org/.
 [19] Yann LeCun, Corinna Cortes, and Christopher J. C. Burges. The MNIST database of handwritten digits. Accessed June 17, 2019.
 [20] Matlab code for machine learning algorithms in book PRML. https://github.com/PRML/PRMLT.
 [21] JeanMichel Muller. Elementary Functions: Algorithms and Implementation. Third edition, Birkhäuser, Boston, MA, USA, 2016. xxv+283 pp. ISBN 9781489979810.
 [22] JeanMichel Muller, Nicolas Brunie, Florent de Dinechin, ClaudePierre Jeannerod, Mioara Joldes, Vincent Lefèvre, Guillaume Melquiond, Nathalie Revol, and Serge Torres. Handbook of FloatingPoint Arithmetic. Second edition, Birkhäuser, Boston, MA, USA, 2018. xxv+627 pp. ISBN 9783319765259.
 [23] Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. The MIT Press, Cambridge, MA, USA, 2012.
 [24] Statistics and Machine Learning Toolbox. The MathWorks, Inc., Natick, MA, USA. https://uk.mathworks.com/products/statistics.html.
 [25] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
 [26] Christopher K. I. Williams and David Barber. Bayesian classification with Gaussian processes. IEEE Trans. Pattern Analysis and Machine Intelligence, 20(12):1342–1351, 1998.