    # Accurate Computation of the Log-Sum-Exp and Softmax Functions

Evaluating the log-sum-exp function or the softmax function is a key step in many modern data science algorithms, notably in inference and classification. Because of the exponentials that these functions contain, the evaluation is prone to overflow and underflow, especially in low precision arithmetic. Software implementations commonly use alternative formulas that avoid overflow and reduce the chance of harmful underflow, employing a shift or another rewriting. Although mathematically equivalent, these variants behave differently in floating-point arithmetic. We give rounding error analyses of different evaluation algorithms and interpret the error bounds using condition numbers for the functions. We conclude, based on the analysis and numerical experiments, that the shifted formulas are of similar accuracy to the unshifted ones and that the shifted softmax formula is typically more accurate than a division-free variant.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

In many applications, especially in a wide range of machine learning classifiers such as multinomial linear regression and naive Bayes classifiers

, , , one needs to compute an expression of the form

 y=f(x)=logn∑i=1exi, (1)

where . The function is often referred to as log-sum-exp or LSE. Its gradient , given by

 gj(x)=∂∂xjf(x)=exj∑ni=1exi,j=1:n, (2)

is called softmax and is also a key function in classification algorithms [7, p. 355], [8, p. 78], . It is often the case that both log-sum-exp and softmax are required simultaneously.

The most obvious danger in evaluating (1) and (2) is overflow. We are interested in IEEE arithmetic in the precisions half (fp16), single (fp32), and double (fp64) , as well as the bfloat16 half precision format . Table 1 shows the key parameters of interest for these precisions: the unit roundoff , the largest finite number , and the smallest positive normalized and subnormal floating-point numbers. If some exceeds the relevant value in Table 2 then overflow will occur. Clearly, overflow is possible even for quite modestly sized , especially for half and single precision.

Underflow is also possible. For example, for , if is a finite floating-point number with then111 is the value recommended by the IEEE standard [16, p. 43]. , whereas . For , underflow in the exponential evaluations is a problem when the sum of the terms that underflow is significant compared with the sum of the other terms; otherwise underflows are harmless. As well as avoiding harmful underflow, it is desirable to avoid generating subnormal numbers, which incur a performance penalty if handled in software; see  or  for details of subnormal numbers.

A way to avoid overflow, and to attempt to avoid underflow and subnormal numbers, in evaluating log-sum-exp is to rewrite

 y =logn∑i=1exi=logn∑i=1eaexi−a=log(ean∑i=1exi−a).

If then

 y=a+logn∑i=1exi−a. (3)

Here and throughout, denotes the principal logarithm: the logarithm whose imaginary part lies in . Equation (3) is not, in general, true for [1, Lem. 2.5]. The softmax can be expressed in a related form (for any ):

 gj=exj−a∑ni=1exi−a,j=1:n. (4)

This shifting, typically with , is a well known way to attempt to avoid overflow and underflow in the evaluation of and , described in many places, including on Wikipedia, in blog posts4, and even in a YouTube video. The functions logsumexp in SciPy 1.3.1  and LogSumExp in R  both implement (3) with . The function softmax

in the MATLAB Deep Learning Toolbox (R2019a)

 uses (4) with .

An alternative to (4), which removes the denominator of (2) by subtracting log-sum-exp from the argument of in the numerator, is

 gj=exp(xj−logn∑i=1exi). (5)

The conciseness of this division-free formula makes it attractive for implementing softmax when a log-sum-exp function is available. This formula is used in the SciPy 1.3.1 function softmax, in a MATLAB toolbox  associated with the book , and in the internal function softmax in the MATLAB Statistics and Machine Learning Toolbox (R2019a) ; in each case the log-sum-exp term is computed by (3) with . The formula (5) can also be found in codes posted in online communities such as Stack Exchange.

The accuracy properties of the formulas above are not clear. In particular, when , in (3) is computed as a sum of two terms of opposite sign, so there could potentially be damaging subtractive cancellation.

In this work we analyze the unshifted and shifted formulas and (5) in order to determine which choices of formulas give the best accuracy and reliability. In particular, we carry out a rounding error analysis of algorithms for the evaluation and relate the error bounds to the conditioning of and . We show that the shifted formulas have broadly similar error bounds to the unshifted ones, and so are entirely appropriate for practical use. We find, however, that the alternative softmax formula (5) has a less favorable error bound than the shifted formula and tends to produce larger errors in practice.

We begin, in the next section, by investigating the conditioning of the log-sum-exp and softmax functions. In section 3 we give detailed rounding error analyses of the basic formulas. In section 4 we analyze the shifted formulas and (5) and compare their error bounds with those for unshifted formulas. Numerical experiments are given in section 5

to test the accuracy of the evaluations and also to examine how the sum of the computed softmax vector entries compares with the exact value

. Conclusions are given in section 6.

From this point on, we assume that the are real and we write

 xmax=maxixi,xmin=minixi. (6)

We will use the standard model of floating-point arithmetic [12, sec. 2.2]

 fl(aopb)=(aopb)(1+δ),|δ|≤u,op∈{+,−,×,/}. (7)

## 2 Condition number

Before considering algorithms for computing log-sum-exp and softmax we investigate the conditioning of these functions, that is, the sensitivity of and in (1) and (2) to small perturbations in .

We define the condition number of in the usual way (see, e.g., [13, chap. 3]), by

 cond(f,x):=limϵ→0sup∥e∥≤ϵ∥x∥|f(x+e)−f(x)|ϵ|f(x)|.

This definition implies that

 |f(x+e)−f(x)||f(x)|≤cond(f,x)∥e∥∥x∥+o(∥e∥), (8)

so that measures the worst-case relative change in corresponding to a small relative change in . It is easy to show that for the -norm,

 (9)

since by (2).

We identify two extreme cases. First, the condition number is infinite for , because . Hence when for all the condition number must be large. Second, if then by (23) below, so and the problem is perfectly conditioned.

A forward stable algorithm for computing log-sum-exp is one for which the relative error of the computed result is bounded by , for some low degree polynomial . Ideally, we would like the algorithm that we use to be forward stable. To see whether it is reasonable to expect forward stability, consider the case . Then , so : the problem is perfectly conditioned. When we compute using standard library functions we can expect to obtain relative errors in the computed exponential and logarithm bounded by , , [22, Chap. 10], that is,

 ˆy=fl(f(x))=log(ex(1+δ1))(1+δ2),|δ1|,|δ2|≤u. (10)

The term just causes a small relative perturbation of the output, so we have

 ˆy≈log(ex(1+δ1)) =x+log(1+δ1)=x+δ1+O(δ21).

Hence, since ,

 |y−ˆy||y|≲u|x|+O(u2). (11)

This relative error bound is much larger than for , even though the problem is perfectly conditioned. So it is not reasonable to expect an algorithm to be unconditionally forward stable in floating-point arithmetic. For this trivial computation, backward error and forward error are the same, so we also conclude that we cannot expect to obtain an algorithm that is unconditionally backward stable.

The softmax function has condition number

 cond(g,x):=limϵ→0sup∥e∥≤ϵ∥x∥∥g(x+e)−g(x)∥ϵ∥g(x)∥,

which is given explicitly by

 cond(g,x)=∥G(x)∥∥x∥∥g(x)∥.

Here, the matrix is the Jacobian of and denotes any vector norm and the corresponding subordinate matrix norm. Now

We have, for each ,

 n∑j=1∣∣∣∂gi∂xj∣∣∣ =2exin∑j=1j≠iexj(n∑k=1exk)2≤1,

that is, . Hence

 cond∞(g,x)≤∥x∥∞∥g(x)∥∞≤n∥x∥∞,

because . We note in passing that is the Hessian of and can be shown to be symmetric positive semidefinite for all  [3, p. 74].

We also note that shifting, as in (3) and (4), does not change the functions so does not change their condition numbers; likewise for (5). These reformulations may, of course, affect the accuracy of the floating-point evaluation.

## 3 Basic algorithms and error analysis

Algorithm 3 gives a naive implementation of (1) and (2).

###### Algorithm

Given , this algorithm computes and the gradient .

•  1   s=0 for i = 1:n wi=exp(xi) s=s+wi end f=log(s) for i = 1:n gi=wi/s end

What can be said about the accuracy of this algorithm when it is implemented in floating-point arithmetic? To answer this question we carry out a rounding error analysis. Throughout this section, we assume that there is no overflow or underflow.

First, we consider the error in evaluating the sum of nonnegative terms

 s=n∑i=1exi≡n∑i=1wi.

Evaluating yields a computed result satisfying

 ˆwi=exi(1+δ1), (12)

where, as noted in Section 2, we can expect the relative error from the exponential evaluation to satisfy . Therefore

 |ˆwi−wi|≤wiu.

Write the (exact) sum of computed quantities as

 ˜s=n∑i=1ˆwi.

The rounding error analysis in , [12, sec 4.2] shows that the computed sum satisfies

 |˜s−ˆs|≤un−1∑i=1|ti|+O(u2),

where , so that, since ,

 |˜s−ˆs|≤u(n−1)(ˆw1+ˆw2)+un∑i=3(n+1−i)ˆwi+O(u2).

Writing , we obtain

 |s−ˆs| ≤n∑i=1|ˆwi−wi|+|˜s−ˆs| ≤un∑i=1wi+un∑i=1(n+1−i)ˆwi+O(u2) =n∑i=1(n+2−i)wi+O(u2), (13)

since . Hence

 ˆs=s+\mathchar28929\relaxs,|\mathchar28929\relaxs|≤(n+1)us+O(u2). (14)

Then the computed log-sum-exp is

 ˆy =fl(logˆs)=log(ˆs)(1+ϵ),|ϵ|≤u, =log(s+\mathchar28929\relaxs)(1+ϵ) =(logs+\mathchar28929\relaxss+O(u2))(1+ϵ) =y(1+ϵ)+\mathchar28929\relaxss+O(u2). (15)

Using (14) we obtain

 |y−ˆy|≤u|y|+(n+1)u+O(u2),

which gives the following result.

###### Theorem 1 (Basic log-sum-exp algorithm)

In the absence of overflow and underflow, the computed log-sum-exp  from Algorithm 3 satisfies

 ∣∣∣y−ˆyy∣∣∣≤(1+n+1|y|)u+O(u2). (16)

Comparing this bound with in (9) we see that it is larger by the factor . But by (23) below, so this factor is bounded by . Hence we have forward stability as long as , but for the bound does not guarantee forward stability. This is consistent with the bound (11) for the case .

Turning to the evaluation of the softmax function from its definition (2), by (12) we have

 ˆgj=exj(1+δ1)ˆs(1+δ2),|δ2|≤u,

where accounts for the division, and so by (14),

 ˆgj=exjs(1+η)(1+δ1)(1+δ2),|η|≤(n+1)u+O(u2).

Therefore

 ˆgj =gj(1+θ),|θ|≤(n+3)u+O(u2).

This bound guarantees a relative error of order at most in every component of . We weaken the bound into a normwise bound for the next theorem.

###### Theorem 2 (Basic softmax algorithm)

In the absence of overflow and underflow, the computed softmax  from Algorithm 3 satisfies

 ∥g−ˆg∥∞∥g∥∞≤(n+3)u+O(u2). (17)

While the error bounds of Theorem 1 and 2 have a very satisfactory form, they provide no useful information when , and for fp16 this happens for as small as . We note, however, that the terms, which come from the summation, are pessimistic. It is shown by Higham and Mary [14, Thm. 3.1] that, under a probabilistic model of rounding errors, in the error bound for summation can be replaced by a small constant multiple of

with high probability, and the same holds for the bounds of Theorem

1 and 2.

Next, consider the alternative formula (5), which we rewrite here:

 gj=exp(xj−logn∑i=1exi)=exp(xj−y). (18)

With evaluated in floating-point arithmetic by Algorithm 3, we obtain

 ˆgj =(1+δ)exp[(xj−ˆy)(1+ϵ)],|δ|,|ϵ|≤u, (19) =(1+δ)gjexp[(xj−y)ϵ+(y−ˆy)(1+ϵ)] =(1+δ)gj[(1+(xj−y)ϵ+(y−ˆy)(1+ϵ)+O(u2)] =(1+θ)gj, (20)

where, using Theorem 1,

 |θ|≤(|y|+|xj−y|+n+2)u+O(u2).

We summarize this result as follows.

###### Theorem 3 (Alternative softmax algorithm)

In the absence of overflow and underflow, the computed from (18) with the log-sum-exp computed by Algorithm 3 satisfies

 ∥g−ˆg∥∞∥g∥∞≤(|y|+maxj|xj−y|+n+2)u+O(u2). (21)

From (23) and (24) below, using the notation (6), we have

 |y|+maxj|xj−y|≤|xmax|+|xmax−xmin|+2logn.

Hence (21) is less favorable than (17) when or . The analysis therefore suggests that (2) should be preferred to (5).

To give an intuitive explanation for the potential inaccuracy in (18), we refer to the steps leading to (20). A large absolute error in the argument of the final exp may lead to a large relative error in the result. This effect can be traced back to the appearance of in (19).

## 4 Algorithms with shifting

Now we consider the use of shifts in the log-sum-exp and softmax evaluations in order to avoid overflow and reduce the chance of harmful underflow.

Recall the definition (6) of and . Overflow in the exponential evaluations in (3) is certainly avoided if we take , as we then have and hence for all . We can rewrite (3) as

 y=xmax+log(1+n∑i=1i≠kexi−xmax), (22)

where . From this expression we see that

 xmax≤y≤xmax+logn. (23)

It follows that when , the sum “” that produces cannot suffer cancellation.

Note that for , (22) trivially provides the exact result , in contrast to the basic formula (1).

For later use, we note that (23) implies that, for any ,

 |y−xj|≤|xmax−xj|+logn≤|xmax−xmin|+logn. (24)

The term in (22) has the form , where . If is very small then will round to and the logarithm will evaluate as zero, even though . To avoid this loss of information we will use the function provided in, for example, C, MATLAB, and Numpy. These functions guarantee an accurate result for small (which can be achieved with a simple formula based on , [12, Prob. 1.5]).

These considerations lead to Algorithm 4.

###### Algorithm (log-sum-exp and softmax with shift)

This algorithm computes and the gradient for .

•  1   [a,k]=maxixi % a=xk=maxixi s=0 for i=1:n wi=exp(xi−a) if i≠k, s=s+wi, end end f=a+log1p(s) for i = 1:n gi=wi/(1+s) end

Note that while it is important to avoid forming for the -evaluation, for we can safely form because if is small it has little influence on .

Algorithm 4 avoids overflow. If underflow occurs in the exponential then it is in a term in the sum added to in (22), so that term is negligible and the underflow is harmless. Note, in particular, that if for all then whereas Algorithm 3 returns , Algorithm 4 suffers no underflow and returns .

The main question is how shifting affects the accuracy of the evaluations. We give a rounding error analysis to assess this question. The analysis is a generalization of that in the previous section for the unshifted algorithm.

We first examine the error in evaluating the sum of nonnegative terms

 s=n∑i=1i≠kexi−a=:n∑i=1i≠kwi. (25)

Evaluating yields a computed result satisfying

 ˆwi=e(xi−a)(1+δ1)(1+δ2),% |δ1|≤u, |δ2|≤u.

Therefore

 ˆwi=exi−ae(xi−a)δ1(1+δ2)=exi−a(1+(xi−a)δ1+O(δ21))(1+δ2),

and hence

 |ˆwi−wi|≤((1+a−xi)u+O(u2))wi.

Assuming for notational simplicity that , we can write the (exact) sum of computed quantities as

 ˜s=n−1∑i=1ˆwi.

The rounding error analysis in , [12, sec 4.2] shows that the computed sum satisfies

 |˜s−ˆs|≤un−2∑i=1|ti|+O(u2),

where , so that, since ,

 |˜s−ˆs|≤un−1∑i=1(n−i)ˆwi+O(u2).

Hence

 |s−ˆs| ≤n−1∑i=1|ˆwi−wi|+|ˆs−˜s| ≤un−1∑i=1(1+a−xi)wi+un−1∑i=1(n−i)ˆwi+O(u2) =n−1∑i=1(u(n−i)+u(1+a−xi))wi+O(u2), (26)

since . Hence

 ∣∣∣ˆs−ss∣∣∣≤(n+xmax−xmin)u+O(u2), (27)

which guarantees an accurate computed sum as long as is not too large.

The final stage of the computation is to evaluate using the computed , for which we have

 ˆy=(xmax+log(1+ˆs)(1+δ3))(1+δ4),|δ3|,|δ4|≤u.

Here, we are assuming that the function has the property

 fl(log1p(s))=log1p(s)(1+δ),|δ|≤u.

Ignoring the innocuous term and writing, by (27),

 ˆs=s(1+η),|η|≤(n+xmax−xmin)u+O(u2), (28)

we have

 ˆy =xmax+log(1+s(1+η))(1+δ3) =xmax+log(1+s+sη))(1+δ3) =xmax+(log(1+s)+sη1+s+O(u2))(1+δ3),

using a Taylor series expansion about of the logarithm. Hence

 ˆy−y=log(1+s)δ3+sη1+s(1+δ3)+O(u2).

Bounding using (28) gives

 |y−ˆy|≤log(1+s)u+s1+s(n+xmax−xmin)u+O(u2) (29)

or, as a relative error bound, since ,

 ∣∣∣y−ˆyy∣∣∣≤(log(1+s)+n+xmax−xmin|y|)u+O(u2). (30)

Simplifying the bound gives the next result.

###### Theorem 4 (Shifted log-sum-exp algorithm)

The computed log-sum-exp  from Algorithm 4 satisfies

 ∣∣∣y−ˆyy∣∣∣=∣∣∣y+n−xminy∣∣∣u+O(u2). (31)

The main question is how this result compares with Theorem 1 for the unshifted algorithm. The only difference in the bounds is that in (16) is replaced by here. Now is possible only if and , so let us assume that these two inequalities hold. The term comes from bounding the term , where is defined in (25) and , and if then . Hence the potentially large constant is mitigated by the term that it multiplies—something that is lost in the manipulations to achieve a readable bound. We conclude that shifting should have little effect on the accuracy.

We note that (31) is weaker than necessary when (recall that ), since we bounded by in going from (29) to (30). If then (29) becomes

 |y−ˆy|≲s(1+n+xmax−xmin)u+O(u2).

Since also implies for and hence , we have

 |y−ˆy||y|≲s|1+n+y−xmin||y|u+O(u2),

which is a factor smaller than (31).

Turning to the evaluation of the softmax function from the shifted formula (4), we have, using (27),

 ˆgj=exp((xj−a)(1+δ1))(1+δ2)(1+δ3)s(1+η),

where corresponds to the exponential evaluation and to the division, and

 |δi|≤u, i=1:3,|η|≤(n+xmax−xmin)u+O(u2).

Therefore

 ˆgj =gjexp((xj−a)δ1)(1+δ2)(1+δ3)1+η =gj(1+θ),|θ|≤(n+2+2(xmax−xmin))u+O(u2).

Hence we have obtained the following result.

###### Theorem 5 (Shifted softmax algorithm)

The computed from Algorithm 4 satisfies

 ∥g−ˆg∥∞∥g∥∞≤(n+2+2(xmax−xmin))u+O(u2). (32)

Again, this is broadly commensurate with Theorem 2 for the unshifted evaluation, bearing in mind the comments following Theorem 4.

Finally, we consider (5) with the log-sum-exp computed by Algorithm 4. In floating-point arithmetic we have the same equation (19) as for the unshifted algorithm, but now with bounded by, using (31),

 |θ|≤(1+|xj−y|+|y+n−xmin|)u+O(u2).

We have obtained the following result.

###### Theorem 6 (Alternative shifted softmax algorithm)

The computed from (5) with the log-sum-exp computed by Algorithm 4 satisfies

 ∥g−ˆg∥∞∥g∥∞≤(1+maxj|xj−y|+|y+n−xmin|)u+O(u2). (33)

This is broadly similar to Theorem 3 for the unshifted alternative softmax algorithm.

## 5 Computational experiments

We now perform some experiments in a realistic setting, using MATLAB R2019a. The codes and data used for the experiments are available online.

Our aims are to examine the sharpness of the rounding error bounds and to give a pairwise comparison of the accuracy of the algorithms in floating-point arithmetic. Our data comes from a deep learning application. To generate the data, we first set up and trained an artificial neural network, using the MATLAB Deep Learning Toolbox

. More precisely, we trained a network to classify handwritten digit data from the widely used MNIST data set . Here each data point is a grayscale pixel image and there are ten categories: , , …, . We used a network whose architecture has the following general form:

1. [nosep]

2. Image Input with normalization.

3. Convolution 8 stride [1 1] padding ’same’.

4. Batch Normalization 8 channels.

5. Max Pool stride [2 2] padding [0 0 0 0].

6. Convolution 16 stride [1 1] padding ’same’ .

7. Batch Normalization 16 channels.

8. ReLU.

9. Max Pool stride [2 2] padding [0 0 0 0].

10. Convolution 32 stride [1 1] padding ’same’.

11. Batch Normalization 32 channels.

12. ReLU.

13. Fully Connected 10 layer.

14. Softmax.

15. Classification Output crossentropy.

This is the default architecture from , where further details may be found.

The network was trained on 7500 images (750 from each of the ten categories), with 2500 further images (250 from each of the ten categories) used for validation.

The network takes as input a matrix corresponding to the pixels in the image and returns a nonnegative vector whose th component may be interpreted as the probability that the image came from category . If we categorize according to the highest probability from the output, then the trained network misclassifed 27 of the 2500 validation images, corresponding to a 98.9% success rate.

The network uses single precision arithmetic, fp32

. In our experiments, we are concerned only with floating-point arithmetic issues, and we treat the trained network as a means to produce a realistic data set. To do this, we extracted the 2500 single precision vectors from the validation set that were passed into the softmax layer and converted them to fp16 or bfloat16. We then used this data in our implementation of the softmax and log-sum-exp algorithms that we have studied in the previous sections.

To record errors in computed results we applied the basic algorithm, Algorithm 3, in single precision to provide a reference solution and used the chop function of  to simulate half precision arithmetic, in both the fp16 format and the bfloat16 format.

We first describe experiments in fp16. The components in the 2500 test vectors vary between about and . As indicated in Table 2, overflows in fp16 for . Hence, in these tests, overflow is an issue for the basic log-sum-exp implementation in Algorithm 3: it generated an Inf for 475 of the 2500 test vectors. The shifted version of log-sum-exp in Algorithm 4 did not overflow. In the plots below, we do not include results for the cases where Algorithm 3 produced overflow.

First, we look at the log-sum-exp algorithms. In the upper left plot of Figure 1 we used the basic implementation of log-sum-exp, Algorithm 3. We scatter plot over the 2025 vectors where no overflow occurred. For each such vector, the horizontal coordinate is the leading term in the error bound of Theorem 1, scaled by , that is, . Here, as shown in Table 1, for fp16. The vertical coordinate is the actual scaled relative error . The plot also gives a reference line of slope from the origin. We see that the bound is always satisfied and is reasonably sharp in many cases. Figure 1: Scatter plots of errors and error bounds, scaled by unit roundoff, over 2025 vectors in R10 for log-sum-exp algorithms in fp16. See the text for a description of the axes. Upper left: basic implementation of log-sum-exp from Algorithm 3. According to the error analysis, all points should lie below the reference line y=x (shown in red). Upper right: corresponding results for the shifted implementation of log-sum-exp in Algorithm 4. Lower: scaled error from Algorithm 3 versus scaled error from Algorithm 4.

In the upper right plot of Figure 1 we show corresponding results for the shifted log-sum-exp implementation in Algorithm 4, using the bound from Theorem 4.

In the lower part of Figure 1

we scatter plot the floating-point errors for the basic and shifted algorithms. Here, for 1863 out of the 2025 cases (92%) the two errors were identical to all digits in the half precision computation. In more detail, over all the data points the ratio of the error in the basic log-sum-exp (horizontal axis) divided by the error in the shifted version (vertical axis) varied between 0.19 and 59, with a mean of 1.07 and a standard error of 0.03. This indicates that the two versions perform similarly, with the shift producing slightly better results.

We now move on to the four softmax implementations. In Figure 2 we use the shifted softmax implementation from Algorithm 4, analysed in Theorem 5, as the basis for comparison. The upper left plot has the scaled error from Algorithm 4 on the horizontal axis and the scaled error from the basic softmax in Algorithm 3 on the vertical axis. The upper right plot compares the shifted softmax against the alternative algorithm analyzed in Theorem 3. Similarly, the lower plot compares against the alternative shifted softmax algorithm analyzed in Theorem 6. We see that the softmax values obtained from Algorithms 3 and 4 have similar accuracy, whereas the alternative softmax versions based on the rewrite in (5) are typically less accurate. Figure 2: Scatter plots of errors, scaled by unit roundoff, for softmax algorithms in fp16. See the text for a description of the axes. Reference line is y=x.

The results in Figures 1 and 2 are consistent with our floating-point error analysis.

A further test is to compute the sum of each softmax vector, which should equal . In Figure 3 we compare the softmax sums for the basic algorithm (red circles) analyzed in Theorem 2 and the alternative version (blue crosses) analyzed in Theorem 3. Similarly, Figure 4 compares the shifted softmax algorithm analyzed in Theorem 5 and its alternative analyzed in Theorem 6. The order along the -axis is arbitrary; it corresponds to the order in which the data vectors were generated. These figures provide further evidence that the alternative softmax algorithms are less accurate than the basic or shifted algorithms. Figure 3: Sum of entries of computed softmax vector for Algorithm 3 (red circles), analyzed in Theorem 2, and the alternative (blue crosses) analyzed in Theorem 3. Figure 4: Sum of entries of computed softmax vector for Algorithm 4 (red circles), analyzed in Theorem 5, and the alternative (blue crosses) analyzed in Theorem 6.

We also conducted the corresponding experiments in simulated bfloat16 arithmetic. Here, as indicated in Tables 1 and 2, the number range is increased at the expense of reduced precision. In this case there was no overflow in any of the algorithms. The results were very similar to those for fp16, so they are not shown here.

## 6 Conclusions

The log-sum-exp and softmax functions both feature in many computational pipelines, so it is important to compute them accurately and to avoid generating infs or NaNs because of overflow or underflow. To this end, a shift is usually incorporated into the defining formulas, yielding (3) and (4). It is important to understand the effect of the shift on the accuracy of the computed result, especially when computations are carried out in a low precision such as bfloat16 or fp16, which have the equivalent of only 3 or 4 decimal digits of precision.

Our rounding error analysis shows that shifting by the largest element of the input vector does not lessen the accuracy of the computed log-sum-exp and softmax. Underlying this pleasing fact is the phenomenon that any large coefficients caused by shifting are canceled by multiplication with small exponentials.

We obtained an explicit formula for the condition number of log-sum-exp and bounds for the condition number of softmax, and we were able to identify situations in which the log-sum-exp algorithms are guaranteed to be forward stable.

For the alternative and widely used softmax formula that avoids division, (5), we obtained larger error bounds than for the shifted formula (4). Since our numerical experiments confirm that larger errors are typically obtained in practice, we recommend using (4) instead of (5) to evaluate softmax.

In summary, Algorithm 4 is our recommendation for computing log-sum-exp and softmax. It avoids overflow, reduces the chance of harmful underflow, and generally produces results as accurate as those from the unshifted formulas.

## References

•  Mary Aprahamian and Nicholas J. Higham. SIAM J. Matrix Anal. Appl., 35(1):88–109, 2014.
•  Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer-Verlag, New York, 2006. xx+738 pp. ISBN 978-0-387-31073-2.
•  Stephen Boyd and Lieven Vandenberghe. Cambridge University Press, Cambridge, UK, 2004. xiii+716 pp. ISBN 0-521-83378-7.
•  Giuseppe C. Calafiore, Stephane Gaubert, and Corrado Possieri. IEEE Trans. Neural Networks and Learning Systems, pages 1–12, 2019.
•  Florent de Dinechin, Christoph Lauter, and Jean-Michel Muller. RAIRO-Inf. Theor. Appl., 41(1):85–102, 2007.
•  Deep Learning Toolbox. The MathWorks, Inc., Natick, MA, USA.
•  Bradley Efron and Trevor Hastie. Cambridge University Press, Cambridge, UK, 2016. xix+475 pp. ISBN 978-1-107-14989-2.
•  Ian Goodfellow, Yoshua Bengio, and Aaaron Courville. Deep Learning. The MIT Press, Cambridge, MA, USA, 2016. xxii+775 pp. ISBN 978-0-262-03561-3.
•  HP-15C Advanced Functions Handbook. Hewlett-Packard, Portable Computer Division, Corvallis, OR, USA, 1982. 221 pp. Part number 00015-90011 Rev. C.
•  Catherine F. Higham and Desmond J. Higham. Deep learning: An introduction for applied mathematicians. SIAM Review, to appear, 2019.
•  Nicholas J. Higham. SIAM J. Sci. Comput., 14(4):783–799, 1993.
•  Nicholas J. Higham. Second edition, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2002. xxx+680 pp. ISBN 0-89871-521-0.
•  Nicholas J. Higham. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2008. xx+425 pp. ISBN 978-0-898716-46-7.
•  Nicholas J. Higham and Theo Mary. MIMS EPrint 2018.33, Manchester Institute for Mathematical Sciences, The University of Manchester, UK, November 2018. 22 pp. Revised March 2019. To appear in SIAM J. Sci. Comput.
•  Nicholas J. Higham and Srikara Pranesh. MIMS EPrint 2019.4, Manchester Institute for Mathematical Sciences, The University of Manchester, UK, March 2019. 18 pp. Revised July 2019. To appear in SIAM J. Sci. Comput.
•  IEEE Computer Society, New York, 2008. 58 pp. ISBN 978-0-7381-5752-8.
•  Intel Corporation. BFLOAT16—hardware numerics definition, November 2018. White paper. Document number 338302-001US.
•  Eric Jones, Travis Oliphant, Pearu Peterson, et al. SciPy: Open source scientific tools for Python, 2001–.
•  Yann LeCun, Corinna Cortes, and Christopher J. C. Burges. Accessed June 17, 2019.
•  Matlab code for machine learning algorithms in book PRML.
•  Jean-Michel Muller. Third edition, Birkhäuser, Boston, MA, USA, 2016. xxv+283 pp. ISBN 978-1-4899-7981-0.
•  Jean-Michel Muller, Nicolas Brunie, Florent de Dinechin, Claude-Pierre Jeannerod, Mioara Joldes, Vincent Lefèvre, Guillaume Melquiond, Nathalie Revol, and Serge Torres. Second edition, Birkhäuser, Boston, MA, USA, 2018. xxv+627 pp. ISBN 978-3-319-76525-9.
•  Kevin P. Murphy. Machine Learning: a Probabilistic Approach. Cambridge University Press, Cambridge, UK, 2012.
•  Statistics and Machine Learning Toolbox. The MathWorks, Inc., Natick, MA, USA.
•  R Core Team. R Foundation for Statistical Computing, Vienna, Austria.
•  Christopher K. I. Williams and David Barber. IEEE Trans. Pattern Analysis and Machine Intelligence, 20(12):1342–1351, 1998.