where . The function is often referred to as log-sum-exp or LSE. Its gradient , given by
The most obvious danger in evaluating (1) and (2) is overflow. We are interested in IEEE arithmetic in the precisions half (fp16), single (fp32), and double (fp64) , as well as the bfloat16 half precision format . Table 1 shows the key parameters of interest for these precisions: the unit roundoff , the largest finite number , and the smallest positive normalized and subnormal floating-point numbers. If some exceeds the relevant value in Table 2 then overflow will occur. Clearly, overflow is possible even for quite modestly sized , especially for half and single precision.
Underflow is also possible. For example, for , if is a finite floating-point number with then111 is the value recommended by the IEEE standard [16, p. 43]. , whereas . For , underflow in the exponential evaluations is a problem when the sum of the terms that underflow is significant compared with the sum of the other terms; otherwise underflows are harmless. As well as avoiding harmful underflow, it is desirable to avoid generating subnormal numbers, which incur a performance penalty if handled in software222https://devblogs.nvidia.com/cuda-pro-tip-flush-denormals-confidence/, https://en.wikipedia.org/wiki/Denormal_number.; see  or  for details of subnormal numbers.
A way to avoid overflow, and to attempt to avoid underflow and subnormal numbers, in evaluating log-sum-exp is to rewrite
Here and throughout, denotes the principal logarithm: the logarithm whose imaginary part lies in . Equation (3) is not, in general, true for [1, Lem. 2.5]. The softmax can be expressed in a related form (for any ):
This shifting, typically with , is a well known way to attempt to avoid overflow and underflow in the evaluation of and , described in many places, including on Wikipedia333https://en.wikipedia.org/wiki/LogSumExp, in blog posts444For example, https://hips.seas.harvard.edu/blog/2013/01/09/computing-log-sum-exp/, http://bayesjumping.net/log-sum-exp-trick/, and https://jblevins.org/log/log-sum-exp. And similarly for the softmax: https://timvieira.github.io/blog/post/2014/02/11/exp-normalize-trick/., and even in a YouTube video555https://youtu.be/-RVM21Voo7Q. The functions logsumexp in SciPy 1.3.1  and LogSumExp in R  both implement (3) with . The function softmax
in the MATLAB Deep Learning Toolbox (R2019a) uses (4) with .
The conciseness of this division-free formula makes it attractive for implementing softmax when a log-sum-exp function is available. This formula is used in the SciPy 1.3.1 function softmax, in a MATLAB toolbox  associated with the book , and in the internal function softmax in the MATLAB Statistics and Machine Learning Toolbox (R2019a) ; in each case the log-sum-exp term is computed by (3) with . The formula (5) can also be found in codes posted in online communities such as Stack Exchange.
The accuracy properties of the formulas above are not clear. In particular, when , in (3) is computed as a sum of two terms of opposite sign, so there could potentially be damaging subtractive cancellation.
In this work we analyze the unshifted and shifted formulas and (5) in order to determine which choices of formulas give the best accuracy and reliability. In particular, we carry out a rounding error analysis of algorithms for the evaluation and relate the error bounds to the conditioning of and . We show that the shifted formulas have broadly similar error bounds to the unshifted ones, and so are entirely appropriate for practical use. We find, however, that the alternative softmax formula (5) has a less favorable error bound than the shifted formula and tends to produce larger errors in practice.
We begin, in the next section, by investigating the conditioning of the log-sum-exp and softmax functions. In section 3 we give detailed rounding error analyses of the basic formulas. In section 4 we analyze the shifted formulas and (5) and compare their error bounds with those for unshifted formulas. Numerical experiments are given in section 5
to test the accuracy of the evaluations and also to examine how the sum of the computed softmax vector entries compares with the exact value. Conclusions are given in section 6.
From this point on, we assume that the are real and we write
We will use the standard model of floating-point arithmetic [12, sec. 2.2]
2 Condition number
We define the condition number of in the usual way (see, e.g., [13, chap. 3]), by
This definition implies that
so that measures the worst-case relative change in corresponding to a small relative change in . It is easy to show that for the -norm,
since by (2).
We identify two extreme cases. First, the condition number is infinite for , because . Hence when for all the condition number must be large. Second, if then by (23) below, so and the problem is perfectly conditioned.
A forward stable algorithm for computing log-sum-exp is one for which the relative error of the computed result is bounded by , for some low degree polynomial . Ideally, we would like the algorithm that we use to be forward stable. To see whether it is reasonable to expect forward stability, consider the case . Then , so : the problem is perfectly conditioned. When we compute using standard library functions we can expect to obtain relative errors in the computed exponential and logarithm bounded by , , [22, Chap. 10], that is,
The term just causes a small relative perturbation of the output, so we have
Hence, since ,
This relative error bound is much larger than for , even though the problem is perfectly conditioned. So it is not reasonable to expect an algorithm to be unconditionally forward stable in floating-point arithmetic. For this trivial computation, backward error and forward error are the same, so we also conclude that we cannot expect to obtain an algorithm that is unconditionally backward stable.
The softmax function has condition number
which is given explicitly by
Here, the matrix is the Jacobian of and denotes any vector norm and the corresponding subordinate matrix norm. Now
We have, for each ,
that is, . Hence
because . We note in passing that is the Hessian of and can be shown to be symmetric positive semidefinite for all [3, p. 74].
3 Basic algorithms and error analysis
Given , this algorithm computes and the gradient .
1 for i = 1:n end for i = 1:n end
What can be said about the accuracy of this algorithm when it is implemented in floating-point arithmetic? To answer this question we carry out a rounding error analysis. Throughout this section, we assume that there is no overflow or underflow.
First, we consider the error in evaluating the sum of nonnegative terms
Evaluating yields a computed result satisfying
where, as noted in Section 2, we can expect the relative error from the exponential evaluation to satisfy . Therefore
Write the (exact) sum of computed quantities as
where , so that, since ,
Writing , we obtain
since . Hence
Then the computed log-sum-exp is
Using (14) we obtain
which gives the following result.
Theorem 1 (Basic log-sum-exp algorithm)
In the absence of overflow and underflow, the computed log-sum-exp from Algorithm 3 satisfies
Comparing this bound with in (9) we see that it is larger by the factor . But by (23) below, so this factor is bounded by . Hence we have forward stability as long as , but for the bound does not guarantee forward stability. This is consistent with the bound (11) for the case .
where accounts for the division, and so by (14),
This bound guarantees a relative error of order at most in every component of . We weaken the bound into a normwise bound for the next theorem.
Theorem 2 (Basic softmax algorithm)
In the absence of overflow and underflow, the computed softmax from Algorithm 3 satisfies
While the error bounds of Theorem 1 and 2 have a very satisfactory form, they provide no useful information when , and for fp16 this happens for as small as . We note, however, that the terms, which come from the summation, are pessimistic. It is shown by Higham and Mary [14, Thm. 3.1] that, under a probabilistic model of rounding errors, in the error bound for summation can be replaced by a small constant multiple of
with high probability, and the same holds for the bounds of Theorem1 and 2.
Next, consider the alternative formula (5), which we rewrite here:
With evaluated in floating-point arithmetic by Algorithm 3, we obtain
where, using Theorem 1,
We summarize this result as follows.
Theorem 3 (Alternative softmax algorithm)
4 Algorithms with shifting
Now we consider the use of shifts in the log-sum-exp and softmax evaluations in order to avoid overflow and reduce the chance of harmful underflow.
where . From this expression we see that
It follows that when , the sum “” that produces cannot suffer cancellation.
For later use, we note that (23) implies that, for any ,
The term in (22) has the form , where . If is very small then will round to and the logarithm will evaluate as zero, even though . To avoid this loss of information we will use the function provided in, for example, C, MATLAB, and Numpy. These functions guarantee an accurate result for small (which can be achieved with a simple formula based on , [12, Prob. 1.5]).
These considerations lead to Algorithm 4.
Algorithm (log-sum-exp and softmax with shift)
This algorithm computes and the gradient for .
1 % for if , , end end for i = 1:n end
Note that while it is important to avoid forming for the -evaluation, for we can safely form because if is small it has little influence on .
Algorithm 4 avoids overflow. If underflow occurs in the exponential then it is in a term in the sum added to in (22), so that term is negligible and the underflow is harmless. Note, in particular, that if for all then whereas Algorithm 3 returns , Algorithm 4 suffers no underflow and returns .
The main question is how shifting affects the accuracy of the evaluations. We give a rounding error analysis to assess this question. The analysis is a generalization of that in the previous section for the unshifted algorithm.
We first examine the error in evaluating the sum of nonnegative terms
Evaluating yields a computed result satisfying
Assuming for notational simplicity that , we can write the (exact) sum of computed quantities as
where , so that, since ,
since . Hence
which guarantees an accurate computed sum as long as is not too large.
The final stage of the computation is to evaluate using the computed , for which we have
Here, we are assuming that the function has the property
Ignoring the innocuous term and writing, by (27),
using a Taylor series expansion about of the logarithm. Hence
Bounding using (28) gives
or, as a relative error bound, since ,
Simplifying the bound gives the next result.
Theorem 4 (Shifted log-sum-exp algorithm)
The computed log-sum-exp from Algorithm 4 satisfies
The main question is how this result compares with Theorem 1 for the unshifted algorithm. The only difference in the bounds is that in (16) is replaced by here. Now is possible only if and , so let us assume that these two inequalities hold. The term comes from bounding the term , where is defined in (25) and , and if then . Hence the potentially large constant is mitigated by the term that it multiplies—something that is lost in the manipulations to achieve a readable bound. We conclude that shifting should have little effect on the accuracy.
Since also implies for and hence , we have
which is a factor smaller than (31).
where corresponds to the exponential evaluation and to the division, and
Hence we have obtained the following result.
Theorem 5 (Shifted softmax algorithm)
The computed from Algorithm 4 satisfies
We have obtained the following result.
Theorem 6 (Alternative shifted softmax algorithm)
This is broadly similar to Theorem 3 for the unshifted alternative softmax algorithm.
5 Computational experiments
We now perform some experiments in a realistic setting, using MATLAB R2019a. The codes and data used for the experiments are available online666https://github.com/higham/logsumexp-softmax-tests.
Our aims are to examine the sharpness of the rounding error bounds and to give a pairwise comparison of the accuracy of the algorithms in floating-point arithmetic. Our data comes from a deep learning application. To generate the data, we first set up and trained an artificial neural network, using the MATLAB Deep Learning Toolbox. More precisely, we trained a network to classify handwritten digit data from the widely used MNIST data set . Here each data point is a grayscale pixel image and there are ten categories: , , …, . We used a network whose architecture has the following general form:
Image Input with normalization.
Batch Normalization 8 channels.
Max Pool stride [2 2] padding [0 0 0 0].
Convolution 16 stride [1 1] padding ’same’ .
Batch Normalization 16 channels.
Max Pool stride [2 2] padding [0 0 0 0].
Convolution 32 stride [1 1] padding ’same’.
Batch Normalization 32 channels.
Fully Connected 10 layer.
Classification Output crossentropy.
This is the default architecture from , where further details may be found.
The network was trained on 7500 images (750 from each of the ten categories), with 2500 further images (250 from each of the ten categories) used for validation.
The network takes as input a matrix corresponding to the pixels in the image and returns a nonnegative vector whose th component may be interpreted as the probability that the image came from category . If we categorize according to the highest probability from the output, then the trained network misclassifed 27 of the 2500 validation images, corresponding to a 98.9% success rate.
The network uses single precision arithmetic, fp32
. In our experiments, we are concerned only with floating-point arithmetic issues, and we treat the trained network as a means to produce a realistic data set. To do this, we extracted the 2500 single precision vectors from the validation set that were passed into the softmax layer and converted them to fp16 or bfloat16. We then used this data in our implementation of the softmax and log-sum-exp algorithms that we have studied in the previous sections.
To record errors in computed results we applied the basic algorithm, Algorithm 3, in single precision to provide a reference solution and used the chop function of  to simulate half precision arithmetic, in both the fp16 format and the bfloat16 format.
We first describe experiments in fp16. The components in the 2500 test vectors vary between about and . As indicated in Table 2, overflows in fp16 for . Hence, in these tests, overflow is an issue for the basic log-sum-exp implementation in Algorithm 3: it generated an Inf for 475 of the 2500 test vectors. The shifted version of log-sum-exp in Algorithm 4 did not overflow. In the plots below, we do not include results for the cases where Algorithm 3 produced overflow.
First, we look at the log-sum-exp algorithms. In the upper left plot of Figure 1 we used the basic implementation of log-sum-exp, Algorithm 3. We scatter plot over the 2025 vectors where no overflow occurred. For each such vector, the horizontal coordinate is the leading term in the error bound of Theorem 1, scaled by , that is, . Here, as shown in Table 1, for fp16. The vertical coordinate is the actual scaled relative error . The plot also gives a reference line of slope from the origin. We see that the bound is always satisfied and is reasonably sharp in many cases.
In the lower part of Figure 1
we scatter plot the floating-point errors for the basic and shifted algorithms. Here, for 1863 out of the 2025 cases (92%) the two errors were identical to all digits in the half precision computation. In more detail, over all the data points the ratio of the error in the basic log-sum-exp (horizontal axis) divided by the error in the shifted version (vertical axis) varied between 0.19 and 59, with a mean of 1.07 and a standard error of 0.03. This indicates that the two versions perform similarly, with the shift producing slightly better results.
We now move on to the four softmax implementations. In Figure 2 we use the shifted softmax implementation from Algorithm 4, analysed in Theorem 5, as the basis for comparison. The upper left plot has the scaled error from Algorithm 4 on the horizontal axis and the scaled error from the basic softmax in Algorithm 3 on the vertical axis. The upper right plot compares the shifted softmax against the alternative algorithm analyzed in Theorem 3. Similarly, the lower plot compares against the alternative shifted softmax algorithm analyzed in Theorem 6. We see that the softmax values obtained from Algorithms 3 and 4 have similar accuracy, whereas the alternative softmax versions based on the rewrite in (5) are typically less accurate.
A further test is to compute the sum of each softmax vector, which should equal . In Figure 3 we compare the softmax sums for the basic algorithm (red circles) analyzed in Theorem 2 and the alternative version (blue crosses) analyzed in Theorem 3. Similarly, Figure 4 compares the shifted softmax algorithm analyzed in Theorem 5 and its alternative analyzed in Theorem 6. The order along the -axis is arbitrary; it corresponds to the order in which the data vectors were generated. These figures provide further evidence that the alternative softmax algorithms are less accurate than the basic or shifted algorithms.
We also conducted the corresponding experiments in simulated bfloat16 arithmetic. Here, as indicated in Tables 1 and 2, the number range is increased at the expense of reduced precision. In this case there was no overflow in any of the algorithms. The results were very similar to those for fp16, so they are not shown here.
The log-sum-exp and softmax functions both feature in many computational pipelines, so it is important to compute them accurately and to avoid generating infs or NaNs because of overflow or underflow. To this end, a shift is usually incorporated into the defining formulas, yielding (3) and (4). It is important to understand the effect of the shift on the accuracy of the computed result, especially when computations are carried out in a low precision such as bfloat16 or fp16, which have the equivalent of only 3 or 4 decimal digits of precision.
Our rounding error analysis shows that shifting by the largest element of the input vector does not lessen the accuracy of the computed log-sum-exp and softmax. Underlying this pleasing fact is the phenomenon that any large coefficients caused by shifting are canceled by multiplication with small exponentials.
We obtained an explicit formula for the condition number of log-sum-exp and bounds for the condition number of softmax, and we were able to identify situations in which the log-sum-exp algorithms are guaranteed to be forward stable.
For the alternative and widely used softmax formula that avoids division, (5), we obtained larger error bounds than for the shifted formula (4). Since our numerical experiments confirm that larger errors are typically obtained in practice, we recommend using (4) instead of (5) to evaluate softmax.
In summary, Algorithm 4 is our recommendation for computing log-sum-exp and softmax. It avoids overflow, reduces the chance of harmful underflow, and generally produces results as accurate as those from the unshifted formulas.
-  Mary Aprahamian and Nicholas J. Higham. The matrix unwinding function, with an application to computing the matrix exponential. SIAM J. Matrix Anal. Appl., 35(1):88–109, 2014.
-  Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer-Verlag, New York, 2006. xx+738 pp. ISBN 978-0-387-31073-2.
-  Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, UK, 2004. xiii+716 pp. ISBN 0-521-83378-7.
-  Giuseppe C. Calafiore, Stephane Gaubert, and Corrado Possieri. Log-sum-exp neural networks and posynomial models for convex and log-log-convex data. IEEE Trans. Neural Networks and Learning Systems, pages 1–12, 2019.
-  Florent de Dinechin, Christoph Lauter, and Jean-Michel Muller. Fast and correctly rounded logarithms in double-precision. RAIRO-Inf. Theor. Appl., 41(1):85–102, 2007.
-  Deep Learning Toolbox. The MathWorks, Inc., Natick, MA, USA. http://www.mathworks.co.uk/products/deep-learning/.
-  Bradley Efron and Trevor Hastie. Computer Age Statistical Inference. Algorithms, Evidence, and Data Science. Cambridge University Press, Cambridge, UK, 2016. xix+475 pp. ISBN 978-1-107-14989-2.
-  Ian Goodfellow, Yoshua Bengio, and Aaaron Courville. Deep Learning. The MIT Press, Cambridge, MA, USA, 2016. xxii+775 pp. ISBN 978-0-262-03561-3.
-  HP-15C Advanced Functions Handbook. Hewlett-Packard, Portable Computer Division, Corvallis, OR, USA, 1982. 221 pp. Part number 00015-90011 Rev. C.
-  Catherine F. Higham and Desmond J. Higham. Deep learning: An introduction for applied mathematicians. SIAM Review, to appear, 2019.
-  Nicholas J. Higham. The accuracy of floating point summation. SIAM J. Sci. Comput., 14(4):783–799, 1993.
-  Nicholas J. Higham. Accuracy and Stability of Numerical Algorithms. Second edition, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2002. xxx+680 pp. ISBN 0-89871-521-0.
-  Nicholas J. Higham. Functions of Matrices: Theory and Computation. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2008. xx+425 pp. ISBN 978-0-898716-46-7.
-  Nicholas J. Higham and Theo Mary. A new approach to probabilistic rounding error analysis. MIMS EPrint 2018.33, Manchester Institute for Mathematical Sciences, The University of Manchester, UK, November 2018. 22 pp. Revised March 2019. To appear in SIAM J. Sci. Comput.
-  Nicholas J. Higham and Srikara Pranesh. Simulating low precision floating-point arithmetic. MIMS EPrint 2019.4, Manchester Institute for Mathematical Sciences, The University of Manchester, UK, March 2019. 18 pp. Revised July 2019. To appear in SIAM J. Sci. Comput.
-  IEEE Standard for Floating-Point Arithmetic, IEEE Std 754-2008 (revision of IEEE Std 754-1985). IEEE Computer Society, New York, 2008. 58 pp. ISBN 978-0-7381-5752-8.
-  Intel Corporation. BFLOAT16—hardware numerics definition, November 2018. White paper. Document number 338302-001US.
-  Eric Jones, Travis Oliphant, Pearu Peterson, et al. SciPy: Open source scientific tools for Python, 2001–. http://www.scipy.org/.
-  Yann LeCun, Corinna Cortes, and Christopher J. C. Burges. The MNIST database of handwritten digits. Accessed June 17, 2019.
-  Matlab code for machine learning algorithms in book PRML. https://github.com/PRML/PRMLT.
-  Jean-Michel Muller. Elementary Functions: Algorithms and Implementation. Third edition, Birkhäuser, Boston, MA, USA, 2016. xxv+283 pp. ISBN 978-1-4899-7981-0.
-  Jean-Michel Muller, Nicolas Brunie, Florent de Dinechin, Claude-Pierre Jeannerod, Mioara Joldes, Vincent Lefèvre, Guillaume Melquiond, Nathalie Revol, and Serge Torres. Handbook of Floating-Point Arithmetic. Second edition, Birkhäuser, Boston, MA, USA, 2018. xxv+627 pp. ISBN 978-3-319-76525-9.
-  Kevin P. Murphy. Machine Learning: a Probabilistic Approach. Cambridge University Press, Cambridge, UK, 2012.
-  Statistics and Machine Learning Toolbox. The MathWorks, Inc., Natick, MA, USA. https://uk.mathworks.com/products/statistics.html.
-  R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
-  Christopher K. I. Williams and David Barber. Bayesian classification with Gaussian processes. IEEE Trans. Pattern Analysis and Machine Intelligence, 20(12):1342–1351, 1998.