We deal with the implementation of the integer power function $x^n$ in floating-point arithmetic. In the following, we assume a radix-2 floating-point arithmetic that follows the IEEE-754 standard for floating-point arithmetic. We also assume that a fused multiply-and-add (fma) operation is available, and that the input as well as the output values of the power function are not subnormal numbers and are below the overflow threshold (so that we can focus on the powering of the significands only).
An important case dealt with in the paper will be the case when an internal format, wider than the target format, is available. For instance, to guarantee – in some cases – correctly rounded integer powers in double precision arithmetic, we will have to assume that a double-extended precision is available. The examples will consider that it has a 64-bit precision, which is the minimum required by the IEEE-754 standard.
The IEEE-754 standard for radix-2 floating-point arithmetic (and its follower, the IEEE-854 radix-independent standard) requires that the four arithmetic operations and the square root be correctly rounded. In a floating-point system that follows the standard, the user can choose an active rounding mode from:
rounding towards $-\infty$: $RD(x)$ is the largest machine number less than or equal to $x$;
rounding towards $+\infty$: $RU(x)$ is the smallest machine number greater than or equal to $x$;
rounding towards $0$: $RZ(x)$ is equal to $RD(x)$ if $x \ge 0$, and to $RU(x)$ if $x < 0$;
rounding to nearest: $RN(x)$ is the machine number that is closest to $x$ (with a special convention if $x$ is exactly between two machine numbers: the chosen number is the "even" one, i.e., the one whose last significand bit is a zero).
When $a \top b$ is computed, where $a$ and $b$ are floating-point numbers and $\top$ is $+$, $-$, $\times$, or $\div$, the returned result is what we would get if we computed $a \top b$ exactly, with "infinite" precision, and rounded it according to the active rounding mode. The default rounding mode is round-to-nearest. This requirement is called correct rounding. Among its many interesting properties, one can cite the following result (the first ideas that underlie it go back to Møller).
Theorem 1 (Fast2Sum algorithm)
(Theorem C of Knuth, page 236). Assume the radix of the floating-point system being considered is less than or equal to 3, and that the used arithmetic provides correct rounding with rounding to nearest. Let $a$ and $b$ be floating-point numbers, and assume that the exponent of $a$ is larger than or equal to that of $b$. The following algorithm computes two floating-point numbers $s$ and $t$ that satisfy:
$s + t = a + b$ exactly; $s$ is the floating-point number that is closest to $a + b$.
Algorithm 1 (Fast2Sum(a,b))
$s := RN(a + b)$
$z := RN(s - a)$
$t := RN(b - z)$
If no information on the relative orders of magnitude of $a$ and $b$ is available, there is an alternative algorithm introduced by Knuth. It requires 6 operations instead of 3 for the Fast2Sum algorithm, but on any modern computer, the 3 additional operations cost significantly less than a comparison followed by a branching.
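As an illustration, here is a small Python sketch of Fast2Sum together with Knuth's branch-free variant (often called TwoSum). Python floats are IEEE-754 binary64 and each native operation is correctly rounded to nearest, so the algorithms apply directly; the function names are ours, not the paper's.

```python
def fast2sum(a, b):
    # Assumes the exponent of a is >= that of b (e.g., |a| >= |b|).
    s = a + b          # s = RN(a + b)
    z = s - a          # exact under the assumption above
    t = b - z          # rounding error of s: s + t = a + b exactly
    return s, t

def two_sum(a, b):
    # Knuth's variant: 6 operations, but no assumption on |a| vs |b|.
    s = a + b
    bp = s - a         # approximation to b
    ap = s - bp        # approximation to a
    return s, (a - ap) + (b - bp)

s, t = fast2sum(1.0, 2.0**-60)   # s == 1.0, t == 2.0**-60
```

Note that `two_sum(2.0**-60, 1.0)` returns the same pair even though the smaller operand comes first, which is exactly the point of the branch-free variant.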
Some processors (e.g., the IBM PowerPC or the Intel/HP Itanium) have a fused multiply-add (fma) instruction that makes it possible to compute $ax + b$, where $a$, $x$ and $b$ are floating-point numbers, with one final rounding only. This instruction allows one to design convenient software algorithms for correctly rounded division and square root. It also has the following interesting property. From two input floating-point numbers $a$ and $b$, the following algorithm computes $r$ and $s$ such that $r + s = ab$ exactly, and $r$ is the floating-point number that is nearest $ab$.
Algorithm 2 (Fast2Mult(a,b))
$r := RN(ab)$
$s := \mathrm{fma}(a, b, -r)$
Performing a similar calculation without a fused multiply-add operation is possible (using Dekker's algorithm) but requires 17 floating-point operations instead of 2.
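A Python sketch of Fast2Mult follows. Since Python exposes no hardware fma, we emulate a correctly rounded fused multiply-add with exact rational arithmetic (`float(Fraction)` rounds the exact value to the nearest double); the helper names are ours.

```python
from fractions import Fraction

def fma(a, b, c):
    # Emulated fused multiply-add: a single rounding of the exact a*b + c.
    return float(Fraction(a) * Fraction(b) + Fraction(c))

def fast2mult(a, b):
    r = a * b           # r = RN(ab)
    s = fma(a, b, -r)   # s = ab - r, exactly representable
    return r, s

a = 1.0 + 2.0**-30
r, s = fast2mult(a, a)   # r == 1.0 + 2.0**-29, s == 2.0**-60
```

The pair $(r, s)$ represents $a^2$ exactly: the low part $s$ recovers precisely the rounding error committed in the product $r$.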
Algorithms Fast2Sum and Fast2Mult both provide results represented in the form of pairs $(h, l)$ of double-precision numbers with value $h + l$. In the following we need products of numbers represented in this form. However, we will be satisfied with approximations to the product, discarding terms of the order of the product of the two low-order terms. Given two such operands $(a_h, a_l)$ and $(b_h, b_l)$, the following algorithm computes $(r_h, r_l)$ such that $r_h + r_l = (a_h + a_l)(b_h + b_l)(1 + \eta)$, where the relative error $\eta$ is discussed in Section 3 below.
Algorithm 3 (DblMult($a_h$, $a_l$, $b_h$, $b_l$))
Note that the condition for applying Fast2Sum is satisfied.
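The double-word product can be sketched in Python as follows. This is one plausible arrangement consistent with the description above (the two cross terms folded together with an fma, the $a_l b_l$ term discarded); the paper's exact operation order may differ, and the emulated `fma` helper is ours.

```python
from fractions import Fraction

def fma(a, b, c):
    # Emulated correctly rounded fused multiply-add.
    return float(Fraction(a) * Fraction(b) + Fraction(c))

def fast2sum(a, b):
    s = a + b
    z = s - a
    return s, b - z

def fast2mult(a, b):
    r = a * b
    return r, fma(a, b, -r)

def dbl_mult(ah, al, bh, bl):
    # (ah + al)(bh + bl) ~= ah*bh + (ah*bl + al*bh); al*bl is discarded.
    t = al * bh                   # one cross term, rounded
    s = fma(ah, bl, t)            # both cross terms, one more rounding
    r1h, r1l = fast2mult(ah, bh)  # exact main product
    return fast2sum(r1h, r1l + s)

rh, rl = dbl_mult(1.0, 2.0**-60, 1.0, 2.0**-60)  # rh == 1.0, rl == 2.0**-59
```

In this example the exact product is $1 + 2^{-59} + 2^{-120}$; only the $2^{-120}$ term (the product of the two low parts) is lost, which matches the error of the order of $a_l b_l$ discussed above.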
2 The two algorithms
We now give two algorithms for accurately computing $x^n$, where $x$ is a floating-point number and $n$ is an integer greater than or equal to 2. We assume that an fma instruction is available, as it is used in Fast2Mult and thus implicitly also in DblMult.
The first ($O(n)$-time) algorithm is derived from the straightforward, $(n-1)$-multiplication algorithm. It is simple to analyze and will be faster than the other one if $n$ is small.
Algorithm 4 (LinPower($x$, $n$)), $n \ge 2$
where the low-order terms are accumulated with appropriate weights using a Horner scheme evaluation. Algorithm LinPower uses $O(n)$ floating-point operations.
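One plausible realization of this linear scheme in Python is sketched below: one Fast2Mult per step on the high part, with each rounding error folded into the low part by a Horner-style fma. The arrangement and names are ours, not necessarily the paper's exact code.

```python
from fractions import Fraction

def fma(a, b, c):
    # Emulated correctly rounded fused multiply-add.
    return float(Fraction(a) * Fraction(b) + Fraction(c))

def fast2mult(a, b):
    r = a * b
    return r, fma(a, b, -r)

def lin_power(x, n):
    # n - 1 steps: h tracks the rounded powers of x, and each rounding
    # error e is accumulated into l with the proper weight via a Horner
    # scheme (l picks up a factor x at every remaining step).
    h, l = x, 0.0
    for _ in range(n - 1):
        h, e = fast2mult(h, x)
        l = fma(l, x, e)
    return h, l

h, l = lin_power(3.0, 4)   # every product is exact here: h == 81.0, l == 0.0
```

For a non-trivial input such as $x = 1 + 2^{-26}$, the pair $(h, l)$ approximates $x^n$ with a relative error of the order of $2^{-2p}$ per step, consistent with the analysis of Section 3.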
The second ($O(\log n)$-time) algorithm is based on successive squarings.
Algorithm 5 (LogPower($x$, $n$)), $n \ge 2$
Due to the approximations performed in algorithm DblMult, terms corresponding to the product of low-order terms are not included. A thorough error analysis is performed below. The number of floating-point operations used by the LogPower algorithm grows as $O(\log_2 n)$ (the exact count depends on the number of one-bits in the binary representation of $n$), whereas for LinPower it is linear in $n$. Hence, LogPower will become faster than LinPower for moderate values of $n$ (but counting the floating-point operations only gives a rough estimate; the actual threshold will depend on the architecture and compiler).
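The successive-squaring scheme can be sketched in Python as binary exponentiation on (high, low) pairs, each multiplication being the DblMult sketch given earlier. As before, the emulated `fma` and the exact arrangement are our assumptions.

```python
from fractions import Fraction

def fma(a, b, c):
    # Emulated correctly rounded fused multiply-add.
    return float(Fraction(a) * Fraction(b) + Fraction(c))

def fast2sum(a, b):
    s = a + b
    z = s - a
    return s, b - z

def fast2mult(a, b):
    r = a * b
    return r, fma(a, b, -r)

def dbl_mult(ah, al, bh, bl):
    s = fma(ah, bl, al * bh)      # cross terms; al*bl is discarded
    r1h, r1l = fast2mult(ah, bh)
    return fast2sum(r1h, r1l + s)

def log_power(x, n):
    # Binary exponentiation: O(log2 n) DblMult calls on (high, low) pairs.
    yh, yl = 1.0, 0.0             # running result y = x^(consumed bits)
    zh, zl = x, 0.0               # running square z = x^(2^k)
    while n > 0:
        if n & 1:
            yh, yl = dbl_mult(yh, yl, zh, zl)
        n >>= 1
        if n:
            zh, zl = dbl_mult(zh, zl, zh, zl)
    return yh, yl

h, l = log_power(3.0, 5)   # every product is exact here: h == 243.0, l == 0.0
```

Each loop iteration performs one or two DblMult calls, which is where the "between" bounds on the operation count quoted above come from: the exact number depends on how many one-bits $n$ has.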
3 Error analysis
We will use the following result.
Theorem 2 (Theorem 2.2 of Higham, p. 38)
Assume a radix-$\beta$ floating-point system $F$, with precision $p$. If $x$ lies in the range of $F$, then $\mathrm{fl}(x) = x(1+\delta)$, with $|\delta| < \frac{1}{2}\beta^{1-p}$.
3.1 Error of function DblMult
Let $u = 2^{-p}$, where $p$ is the precision of the radix-2 floating-point system used. If $a = a_h + a_l$ and $b = b_h + b_l$, then the value $(r_h, r_l)$ returned by function DblMult satisfies $r_h + r_l = ab(1+\eta)$, where:
as soon as , we have ;
in the case of single precision , ;
in the case of double precision ,
Proof: Following the notation in Algorithm 3, with the $\epsilon_i$'s being variables of absolute value less than $u$, we have
We also have , , , and
Now, from and we deduce
with , which gives
3.2 Error of algorithm LogPower
The two values $h$ and $l$ returned by algorithm LogPower satisfy
where $\eta$ is the same value as in Theorem 3.
Proof: Algorithm LogPower computes approximations to powers of $x$, using algorithm DblMult. By induction, one easily shows that the approximation to $x^k$ is of the form $x^k(1+\alpha_k)$. If we call $\eta$ the relative error (obtained from Theorem 3) when multiplying together the approximations to $x^i$ and $x^j$, the induction follows from
Table 1 gives bounds on $|\alpha_n|$ for several values of $n$ (note that the bound is an increasing function of $n$), assuming the algorithm is used in double precision.
Define the significand of a non-zero real number $y$ to be $y / 2^{\lfloor \log_2 |y| \rfloor}$.
Define $\alpha_{\max}$ as the bound on $|\alpha_n|$ obtained for a given value of $n$. From this bound
we deduce how far the significand of $h + l$ can be from that of $x^n$. From the results given in Table 1, we deduce that for all practical values of $n$ the significand of $h + l$ is within much less than one ulp from that of $x^n$ (getting a larger distance would require a much larger $n$). This means that $h + l$ is within less than one ulp from $x^n$, hence
If algorithm LogPower is implemented in double precision, then $RN(h+l)$ is a faithful rounding of $x^n$, as long as $n$ does not exceed the range covered by Table 1.
Moreover, for small $n$, $h + l$ is within barely more than half an ulp from the exact value: we are very close to correct rounding (indeed, we almost always return a correctly rounded result), yet we cannot guarantee correct rounding, even for the smallest values of $n$. This requires a much better accuracy, as shown in Section 4. To guarantee a correctly rounded result in double precision, we will need to run algorithm LogPower in double-extended precision. Table 2 gives bounds on the error for several values of $n$, assuming the algorithm is realized in double-extended precision. As expected, we are 22 bits more accurate.
3.3 Error of algorithm LinPower
Define , , as the values of variables , and at the end of the loop of index of the algorithm. Define as the value variable would have if the instructions were errorless (that is, if instead we had exactly):
Initially let , . By induction, one can easily show that
hence we have
The algorithm only computes an approximation to . To evaluate the error of the algorithm, we must therefore estimate the distance between and . We have , and exactly. Define as the number of absolute value less than such that
We have , and by induction, we find for , using :
To derive a useful bound from this result, we must make a simplifying hypothesis. We know that We assume is close enough to , so that
(this means that our estimate for will become wrong when the algorithm becomes very inaccurate for , ). From (1), we therefore have:
from which, using (3.3), we deduce
This gives the following result:
Theorem 6 (Accuracy of algorithm LinPower)
If for , the final computed values and of the variables and of the algorithm satisfy
Let us try to compute an estimate of this coefficient.
Define a function
One can notice that , so that if we are able to find a simple formula for we will be able to deduce a formula for . We have
Hence a bound on the value of is,
Table 3 gives the obtained bound for several values of $n$, assuming double precision ($p = 53$). That table shows that as soon as $n$ is larger than a few units, algorithm LinPower is less accurate than algorithm LogPower.
4 Correct rounding
In this section we consider algorithm LogPower only: first because it is the fastest for all reasonable values of $n$, and second because it is the only one for which we have certain error bounds (the error bounds of algorithm LinPower are approximate only). If needed, specific algorithms could be designed for small values of $n$. We are interested in getting correctly rounded results in double precision. To do so, we assume that we perform algorithm LogPower in double-extended precision. The algorithm returns two double-extended numbers $h$ and $l$ such that
where the error bound is given in Table 2.
In the following we will need to distinguish two roundings: $RN_e$ means round-to-nearest in double-extended precision, and $RN_d$ is round-to-nearest in double precision. Let ulp denote "unit in the last place" in double precision, so that $\mathrm{ulp}(1) = 2^{-52}$.
V. Lefèvre introduced a new method for finding hardest-to-round cases for evaluating a regular function [8, 7]. That method allowed Lefèvre and Muller to give such cases for the most familiar elementary functions. Recently, Lefèvre adapted his method to the case of the functions $x^n$ and $x^{1/n}$, when $n$ is an integer. For instance, in double-precision arithmetic, the hardest-to-round case for the power function corresponds to
which means that the value is extremely close to the exact middle of two consecutive double-precision numbers: there is a long run of consecutive zeros after the rounding bit. This case is the worst case for all values of $n$ in the range considered. Table 4 gives the maximal length of the chains of identical bits after the rounding bit.
Define a breakpoint as the exact middle of two consecutive double-precision numbers. The double-precision rounding of $h + l$ will be equal to that of $x^n$ if and only if there is no breakpoint between $x^n$ and $h + l$.
The worst case obtained shows that if $x$ is a double-precision number and $n$ lies in the range considered, then the significand of $x^n$ is always at a distance larger than an explicit margin from the nearest breakpoint (see Figure 1).
We know that the significand of $h + l$ is within a small distance from that of $x^n$; the binary logarithm of the bound is listed in Table 2. For all values of $n$ less than or equal to the limit considered, this distance is smaller than the gap to the nearest breakpoint. We therefore get the following result:
If algorithm LogPower is run in double-extended precision, and if $n$ does not exceed the bound above, then rounding $h + l$ to the nearest double-precision number gives a correctly rounded result.
Now, two important remarks:
We do not have the worst cases for larger values of $n$, but from probabilistic arguments we strongly believe that the lengths of the largest chains of identical bits after the rounding bit will be of the same order of magnitude for some range of $n$ above the current limit. However, it is unlikely that we will be able to show correct rounding in double precision for much larger values of $n$.
On an Intel Itanium processor, it is possible to directly add two double-extended precision numbers and round the result to double precision without a "double rounding" (i.e., without an intermediate sum first rounded to double-extended precision). Hence Theorem 7 can directly be used. It is worth noticing that the draft revised standard IEEE 754-R (see http://754r.ucbtest.org/) includes the fma as well as rounding to any specific destination format, independent of operand formats.
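To see why the absence of double rounding matters, here is a small Python experiment (ours, not from the paper) that emulates round-to-nearest-even at a chosen precision with exact rationals: rounding first to a 64-bit significand and then to 53 bits can give a different result than rounding directly to 53 bits.

```python
from fractions import Fraction

def rn(v, p):
    # Round a positive rational v to the nearest p-bit significand,
    # ties to even (round() on a Fraction rounds halves to even).
    e = v.numerator.bit_length() - v.denominator.bit_length()
    if Fraction(2) ** e > v:
        e -= 1                    # now 2**e <= v < 2**(e+1)
    scale = Fraction(2) ** (p - 1 - e)
    return Fraction(round(v * scale)) / scale

v = Fraction(1) + Fraction(1, 2**53) + Fraction(1, 2**66)
direct = rn(v, 53)           # rounds up: 1 + 2**-52
twice  = rn(rn(v, 64), 53)   # first rounding lands on a tie, then rounds
                             # to even: 1
# direct != twice: the intermediate extended rounding changed the result
```

The value $v$ sits just above the midpoint between two consecutive doubles; rounding to 64 bits first erases the tiny excess, so the second rounding resolves the tie the other way. This is precisely the hazard avoided by rounding an extended-precision sum directly to the double-precision destination format.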
It has been shown that the function $x^n$ can be calculated in $O(\log n)$ time with correct rounding in double precision, employing double-extended precision arithmetic, at least for the range of $n$ for which the worst cases have been computed. A fused multiply-and-add (fma) instruction is assumed available for algorithm efficiency reasons; and to keep the analysis simple, it was assumed that the input as well as the output are not subnormal numbers and are below the overflow threshold.
A simpler, $O(n)$-time algorithm, faster than the above for small values of $n$, was also analyzed. However, its error analysis turned out to be more complicated (and less rigorous), and the algorithm itself to be less accurate than the other.
-  American National Standards Institute and Institute of Electrical and Electronic Engineers. IEEE standard for binary floating-point arithmetic. ANSI/IEEE Standard, Std 754-1985, New York, 1985.
-  M. Cornea, J. Harrison, and P. T. P. Tang. Scientific Computing on Itanium-Based Systems. Intel Press, Hillsboro, OR, 2002.
-  T. J. Dekker. A floating-point technique for extending the available precision. Numerische Mathematik, 18:224–242, 3 1971.
-  N. Higham. Accuracy and Stability of Numerical Algorithms, Second Edition. SIAM, Philadelphia, PA, 2002.
-  American National Standards Institute, Institute of Electrical, and Electronic Engineers. IEEE standard for radix independent floating-point arithmetic. ANSI/IEEE Standard, Std 854-1987, New York, 1987.
-  D. Knuth. The Art of Computer Programming, 3rd edition, volume 2. Addison-Wesley, Reading, MA, 1998.
-  V. Lefèvre. Developments in Reliable Computing, chapter An Algorithm That Computes a Lower Bound on the Distance Between a Segment and $\mathbb{Z}^2$, pages 203–212. Kluwer Academic Publishers, Dordrecht, 1999.
-  V. Lefèvre. Moyens Arithmétiques Pour un Calcul Fiable. PhD thesis, École Normale Supérieure de Lyon, Lyon, France, 2000.
-  V. Lefèvre and J.-M. Muller. Worst cases for correct rounding of the elementary functions in double precision. In Burgess and Ciminiera, editors, Proc. of the 15th IEEE Symposium on Computer Arithmetic (Arith-15). IEEE Computer Society Press, Los Alamitos, CA, 2001.
-  O. Møller. Quasi double-precision in floating-point addition. BIT, 5:37–50, 1965.