# Computing the exact sign of sums of products with floating point arithmetic

In computational geometry, the construction of essential primitives like convex hulls, Voronoi diagrams and Delaunay triangulations requires the evaluation of the signs of determinants, which are sums of products. The same signs are needed for the exact solution of linear programming problems and systems of linear inequalities. Computing these signs exactly with inexact floating point arithmetic is challenging, and we present yet another algorithm for this task. Our algorithm is efficient and uses only floating point arithmetic, which is much faster than exact arithmetic. We prove that the algorithm is correct and provide efficient and tested code for it.


## 1 Introduction

From a computational geometry perspective, we address the following problem: given a point, defined explicitly or implicitly as the intersection of two lines, we want to decide on which side of a line it lies using only floating point arithmetic (or to find whether it is on the line). This problem is classic. It plays a fundamental role in the computation of convex hulls, Voronoi diagrams and Delaunay triangulations [1, 14, 16]. The essence of the geometric issues is described in Figure 1.

From an algebraic perspective, the determination of the position of the point with respect to the line is equivalent to the computation of the sign of a sum of products

$$S = \sum_{i=1}^{n} \prod_{j=1}^{n_i} a_{ij}.$$

The inexactness of floating point arithmetic makes it hard to compute this sign exactly in some cases. There are already several references about the computation of sums of floating point numbers [3, 4, 5, 6, 7, 8, 10, 12, 13, 15, 16, 17], with applications in computational geometry. The idea of decomposing products into sums is as old as [2, 11], and the fourth chapter of [14] presents an algorithm for obtaining the signs of such sums. However, all these references differ in one way or another from the present article. In summary, none of them gives a complete solution to the problem we solve here, using IEEE floating point arithmetic and taking underflow into account. We present a complete algorithm for this task, in gory detail. In particular, we take as input the factors of the products, whereas before applying most of the other algorithms we would need to obtain an exact representation of the products, or assume that such products do not underflow. The algorithm was implemented in C++, it was carefully tested, and it is supported by a detailed theory. It is quite likely that the theoretical results presented here can be generalized to bases other than two and to other rounding modes. We are not interested in such generalizations: our theoretical results are only meant to justify what matters most to us, the practical algorithms we present.
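For example, the classic 2D orientation predicate reduces to the sign of such a sum of products. The naive floating point version below (a hypothetical sketch with our own names, not an algorithm from this article) illustrates the setting; for nearly collinear points its rounding errors can produce a wrong sign:

```cpp
#include <cassert>

// Naive orientation test: returns the sign of the determinant
//   (bx-ax)*(cy-ay) - (by-ay)*(cx-ax),
// a sum of two products of two factors each. Returns +1 if c lies to the
// left of the oriented line from a to b, -1 if to the right, and 0 if the
// computed determinant is zero.
int naive_orient2d(double ax, double ay, double bx, double by,
                   double cx, double cy) {
    double det = (bx - ax) * (cy - ay) - (by - ay) * (cx - ax);
    if (det > 0.0) return 1;
    if (det < 0.0) return -1;
    return 0;
}
```

With well-separated points this works; the difficulty addressed in this article is certifying the sign in the remaining cases.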

Finite floating point numbers suffice to solve our problem in the situations usually found in practice, and we only care about such numbers, which are elements of a set $\mathbb{F}$. Deviating from the tradition, we assume that the arithmetic operations are performed rounding up. For instance, we assume that the subtraction of floating point numbers is defined as

$$b \ominus a = \mathrm{up}(b - a),$$

where up is the function from $\mathbb{R}$ to $\mathbb{F}$ defined by

$$\mathrm{up}(x) := \min\{y \in \mathbb{F} \text{ with } y \geq x\}, \qquad (1)$$

and the operations $\oplus$ and $\otimes$ are similar (we have no need for expensive $\oslash$'s). For brevity, we leave the result of the arithmetic operation undefined when

$$a \mathbin{\mathrm{op}} b \notin (\min \mathbb{F}, \max \mathbb{F}).$$

In practice, when the rounding mode is not already upwards, enforcing our assumption requires a function call to change the rounding mode upwards at the beginning of the use of our functions, and another function call to restore it when we are done, to be polite. The cost of these two calls is amortized, but cases like this make us believe that always rounding upwards (or always rounding downwards) is a good option for code requiring exact results. Our Moore library [9] is a good example of this. Since it works with rounding mode upwards by default, there would be no need for changes in the rounding mode when executing the algorithms described here.
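As a hypothetical illustration of this bookkeeping (the function name is ours), a round-up subtraction that switches the mode on entry and restores the caller's mode on exit might look like:

```cpp
#include <cassert>
#include <cfenv>
// Some compilers additionally require `#pragma STDC FENV_ACCESS ON` (or a
// flag like -frounding-math) for fully well-defined rounding-mode changes.

// b ⊖ a = up(b - a): subtract with the rounding mode set upwards.
double minus_up(double b, double a) {
    const int old = std::fegetround();
    std::fesetround(FE_UPWARD);   // round all operations upwards
    volatile double c = b - a;    // volatile discourages constant folding
    std::fesetround(old);         // be polite: restore the caller's mode
    return c;
}
```

In real code the two `fesetround` calls would wrap a whole batch of operations, so that their cost is amortized as described above.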

The motivation for this article is to have a bulletproof algorithm for computing the sign of $S$ with floating point arithmetic. Such an algorithm is needed because rounding errors may sometimes lead naive algorithms to wrong conclusions. However, floating point arithmetic is an excellent tool: such naive algorithms will be correct most of the time, and we should care about both precision and performance. For this reason, we believe that we should proceed in two steps: we first try to compute the sign using a quick algorithm, which is unable to compute the sign only in rare cases. In such rare cases we resort to a robust algorithm, which can handle all cases but is more expensive. We present both algorithms below. In Section 2 we describe a simple algorithm, which is quite efficient but may not find the sign in rare cases. In Section 3 we present a robust algorithm, which is more expensive and should be used only after the quick algorithm was unable to find the sign.

Regarding the efficiency of our algorithms, we emphasize that the quick one will suffice in the overwhelming majority of the cases found in practice, and the robust algorithm will be an extreme safety measure. As a result, usually the cost of evaluating the sign will be twice the cost of evaluating the sum using naive floating point arithmetic, plus the cost of branches, plus the changes of rounding mode when they are necessary. This is a linear cost, with a small constant. However, in the rare cases in which the robust algorithm is necessary, the cost can grow exponentially with the $n_i$.

Finally, the actual code is implemented using the template features of C++, and by exploiting the details of this language it is possible to generate code that will only resort to more expensive operations in the rare situations in which they are needed.

## 2 Quick sign

In this section we present a fast algorithm which finds the sign of the sum of products in most cases, but which may be inconclusive sometimes. The algorithm returns a sign $s$. If $s \neq 0$ then $s$ definitely is the sign of the sum. However, if $s = 0$ then the sign can be anything. In this case we must resort to a more expensive algorithm to find the sign, like the one in the next section.

As in the rest of this article, we use floating point arithmetic rounding upwards, with at most two changes in rounding mode. In essence, for each product $p_i$ the algorithm computes numbers $d_i$ and $u_i$ such that

$$-d_i \leq p_i \leq u_i.$$

If $\bigoplus_i d_i < 0$ then $S$ is positive, and if $\bigoplus_i u_i < 0$ then $S$ is negative; otherwise we cannot decide, and the algorithm returns $0$. The algorithm is described as Algorithm 2 below, which uses the auxiliary Algorithm 1.
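Since Algorithms 1 and 2 are only displayed at the end of the article, the following sketch (our own simplification, restricted to products of two factors) shows the idea:

```cpp
#include <cassert>
#include <cfenv>
#include <cstddef>
#include <vector>

// Hypothetical quick sign of S = sum_i a[i]*b[i] (our names and two-factor
// restriction). Returns +1 or -1 when the sign is certain, and 0 when the
// sum is zero or the bounds are inconclusive.
int quick_sign(const std::vector<double>& a, const std::vector<double>& b) {
    const int old = std::fegetround();
    std::fesetround(FE_UPWARD);
    volatile double U = 0.0;  // rounded-up bound: U >= S
    volatile double D = 0.0;  // rounded-up bound: D >= -S
    for (std::size_t i = 0; i < a.size(); ++i) {
        U = U + a[i] * b[i];     // up(a*b) >= a*b
        D = D + (-a[i]) * b[i];  // up(-a*b) >= -a*b
    }
    std::fesetround(old);
    if (D < 0.0) return 1;   // S >= -D > 0
    if (U < 0.0) return -1;  // S <= U < 0
    return 0;                // zero or inconclusive
}
```

The two accumulations share a single pass, so the cost is roughly twice a naive summation, as claimed in the introduction.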

## 3 Robust sign

This section describes Algorithm 3, which computes the sign of $S$ using binary floating point arithmetics which have subnormal numbers. The most relevant arithmetics in this class are the ones covered by the IEEE 754 and IEEE 854 standards. The algorithm is presented in the last page of the article. It assumes that there is no overflow in the products, but underflow is handled correctly. It also assumes that

$$\sum_{i \text{ with } n_i = 1} 1 \;+ \sum_{i \text{ with } n_i > 1} 2^{n_i - 2} < 1/\epsilon, \qquad (2)$$

where $\epsilon$ is the machine precision. In practice, the largest $\epsilon$ we care about corresponds to the type float. In this case $\epsilon = 2^{-23}$, and the algorithm could be used to compute the signs of determinants of dimensions up to 8, because if $n = 8!$ and $n_i = 8$ then

$$\sum_{i=1}^{n} 2^{n_i - 2} = 8! \times 2^6 = 2580480 < 8388608 = 2^{23}.$$

Since dimension 8 is already more than enough for the usual applications in computational geometry, we have no motive to make the algorithm more complicated than it already is in order to relax condition (2).

The algorithm is based upon two lemmas. Lemma 1 is about the exact computation of the difference of floating point numbers. There have been versions of this lemma since the late 1960s [2, 7, 8], but we prove it here for the case in which we round upwards, for completeness and because the details are not obvious (as stated, Lemma 1 is false if we round downwards, for instance). In essence, it states that we can represent the difference $b - a$ exactly as the difference of two floating point numbers $c$ and $e$, with the additional feature that $e$ is much smaller than $c$. As a result, in most cases we can base our decision regarding signs on $c$, and $e$ is used only in the rare cases in which knowing $c$ is not sufficient.

###### Lemma 1

If $a, b \in \mathbb{F}$ satisfy $0 \leq a \leq b$, and

$$c := b \ominus a, \quad d := b \ominus c \quad \text{and} \quad e := a \ominus d,$$

then

$$b - a = c - e \quad \text{and} \quad 0 \leq e \leq \epsilon c, \qquad (3)$$

where $\epsilon$ is the machine precision.
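Under the assumptions of Lemma 1, the split can be sketched as follows (the names are ours; the three subtractions must be performed rounding upwards):

```cpp
#include <cassert>
#include <cfenv>
#include <cmath>

// Sketch of the exact-difference split of Lemma 1: for 0 <= a <= b,
// after the three round-up subtractions below we have b - a == c - e
// exactly, with 0 <= e <= eps*c.
struct diff_split { double c, e; };

diff_split split_diff(double b, double a) {
    const int old = std::fegetround();
    std::fesetround(FE_UPWARD);   // all three subtractions round upwards
    volatile double c = b - a;    // c = b ⊖ a
    volatile double d = b - c;    // d = b ⊖ c
    volatile double e = a - d;    // e = a ⊖ d
    std::fesetround(old);
    return {c, e};
}
```

For instance, with `b = 1.0` and `a = 2^-53` the value `b - a` is representable, so `c` carries the whole difference and `e` is zero.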

Lemma 2 is the analogue of Lemma 1 for multiplication, but it is more subtle. It relies on the fused multiply add operation (fma), which is available in most processors and programming languages these days. In other words, we assume that given $a, x, y \in \mathbb{F}$ for which $ax + y \in (\min \mathbb{F}, \max \mathbb{F})$ we can compute

$$\mathrm{fma}(a, x, y) := \mathrm{up}(ax + y). \qquad (4)$$

It is well known that using the fma we can represent the product $ab$ as the difference of two floating point numbers, but we are not aware of proofs (or even statements) of results describing conditions under which this representation is exact when rounding upwards. It is important to notice that such conditions are necessary, because Lemma 2 may not hold if they are violated. In order to state the decomposition result for multiplications we need to define some constants that characterize the floating point arithmetic:

• $\nu$ is the smallest positive normal element of $\mathbb{F}$, and $\nu\epsilon$ is the smallest positive element of $\mathbb{F}$.

• $\epsilon$ is the machine precision, that is, $1 + \epsilon$ is the successor of $1$ in $\mathbb{F}$.

• $\sigma$ is the largest power of two in $\mathbb{F}$, and we assume that it satisfies the inequalities in Equation (6) below.

Using the constants above we define the threshold

$$\tau := 2\nu/\epsilon. \qquad (5)$$

The values of these constants for the usual arithmetics are presented in Table 1. By inspecting this table, readers will notice that the following assumptions used in Lemma 2 are satisfied:

$$\sigma\epsilon^2 \geq 2 \quad \text{and} \quad \sigma^2\nu\epsilon^3 > 2. \qquad (6)$$
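These constants can be recovered programmatically. The following sketch instantiates them for IEEE double precision (our reconstruction of one row of Table 1):

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// The arithmetic constants of the text, for IEEE double precision.
const double eps   = std::numeric_limits<double>::epsilon();     // 2^-52
const double nu    = std::numeric_limits<double>::min();         // 2^-1022, smallest positive normal
const double nueps = std::numeric_limits<double>::denorm_min();  // 2^-1074, smallest positive element
const double sigma = std::ldexp(1.0, 1023);                      // largest power of two in F
const double tau   = 2.0 * nu / eps;                             // threshold (5): 2^-969
```

Since all of these are powers of two, the expressions above are evaluated exactly, with no rounding.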

We now state the decomposition lemma for multiplications.

###### Lemma 2

Consider $a, b \in \mathbb{F}$ with $0 < a \leq b$, for which $a \otimes b$ is defined. If

$$c := a \otimes b > \tau, \qquad (7)$$

then

$$ab = c - d \quad \text{for} \quad d := \mathrm{fma}(-a, b, c) \geq 0. \qquad (8)$$

If condition (7) is not satisfied then $\tilde{a} := \sigma \otimes a = \sigma a \in \mathbb{F}$, and if

$$\tilde{c} := \tilde{a} \otimes b > \tau, \qquad (9)$$

then

$$ab = \sigma^{-1}\tilde{c} - \sigma^{-1}\tilde{d} \quad \text{for} \quad \tilde{d} := \mathrm{fma}(-\tilde{a}, b, \tilde{c}) \geq 0. \qquad (10)$$

Finally, if conditions (7) and (9) are not satisfied and the first inequality in Equation (6) holds then $\hat{b} := \sigma \otimes b = \sigma b \in \mathbb{F}$, and if the second inequality in Equation (6) holds then

$$ab = \sigma^{-2}\hat{c} - \sigma^{-2}\hat{d} \quad \text{for} \quad \hat{c} := \tilde{a} \otimes \hat{b} \quad \text{and} \quad \hat{d} := \mathrm{fma}(-\tilde{a}, \hat{b}, \hat{c}) \geq 0. \qquad (11)$$

In summary, if $0 < a \leq b$ and $a \otimes b$ is defined then there exist $e \in \{0, 1, 2\}$ and $c, d \in \mathbb{F}$ with $d \geq 0$ such that

$$ab = \sigma^{-e}c - \sigma^{-e}d. \qquad (12)$$
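For the first case of the lemma ($c = a \otimes b > \tau$), the decomposition can be sketched as follows (a hedged illustration with our own names; the sample inputs are well scaled, so no scaling by $\sigma$ is needed):

```cpp
#include <cassert>
#include <cfenv>
#include <cmath>

// Unscaled product split: when c = a ⊗ b is above the threshold, the
// product splits exactly as ab = c - d with d = fma(-a, b, c) >= 0.
struct prod_split { double c, d; };

prod_split split_prod(double a, double b) {
    const int old = std::fegetround();
    std::fesetround(FE_UPWARD);              // round upwards, as in the text
    volatile double c = a * b;               // c = a ⊗ b
    volatile double d = std::fma(-a, b, c);  // exact residual: d = c - ab
    std::fesetround(old);
    return {c, d};
}
```

For example, with $a = 1 + 2^{-30}$ and $b = 1 - 2^{-30}$ the exact product is $1 - 2^{-60}$, so $c = 1$ and $d = 2^{-60}$.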

In words, Lemma 2 shows that we may fail twice in trying to represent the product $ab$ exactly as a difference of two floating point numbers, but the third time is a charm: we finally can represent $ab$ exactly as the scaled difference of two floating point numbers. Scaling is essential here in order to deal with underflow. We scale numbers by multiplying them by the constant $\sigma$, which is a power of two. Such scaling does not introduce rounding errors, but requires some bookkeeping. In C++ we can keep the books using a struct like

struct scaled_number {
    scaled_number(T t, int exp) : t(t), exp(exp) {}
    T t;
    int exp;
};


where T is the type of the floating point numbers. A scaled number s represents the value

$$x = \mathrm{value}(s) = \sigma^{-s.\mathrm{exp}}\, s.t.$$

We keep the scaled numbers in two heaps, one for the positive values, called $P$, and another for the negative values, called $N$. In the heap $N$ we store the absolute value of the corresponding numbers, so that the t field in our scaled numbers is always positive. The elements in the heaps are sorted in increasing order according to the following comparison function:

bool is_less(scaled_number x, scaled_number y) {
if( x.exp > y.exp ) return true;
if( x.exp < y.exp ) return false;
return x.t < y.t;
}


In order to ensure the consistency of the order above, we only push two kinds of scaled numbers in our heaps:

$$\text{scaled numbers } s \text{ with } s.\mathrm{exp} = 0 \text{ and } \tau < s.t, \qquad (13)$$

$$\text{scaled numbers } s \text{ with } s.\mathrm{exp} > 0 \text{ and } \tau < s.t \leq \sigma\tau. \qquad (14)$$

We assume that these conditions are enforced by the constructor of scaled_number, which is only called with positive t's. We then have the following lemma.

###### Lemma 3

Under conditions (13) and (14), for scaled numbers x and y we have

$$\mathrm{value}(x) < \mathrm{value}(y) \quad \Longleftrightarrow \quad \mathrm{is\_less}(x, y).$$
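A minimal self-contained sketch of this comparison (our own demo; the value function below is for illustration only, since the real code never evaluates $\sigma^{-\mathrm{exp}} \cdot t$ directly, precisely because it could underflow):

```cpp
#include <cassert>
#include <cmath>

// Scaled numbers and their comparison, as described in the text.
template <typename T>
struct scaled_number {
    scaled_number(T t, int exp) : t(t), exp(exp) {}
    T t;
    int exp;
};

template <typename T>
bool is_less(scaled_number<T> x, scaled_number<T> y) {
    if (x.exp > y.exp) return true;   // larger exp means smaller value
    if (x.exp < y.exp) return false;
    return x.t < y.t;
}

// value(s) = sigma^{-s.exp} * s.t, with sigma = 2^1023 for IEEE doubles.
double value(scaled_number<double> s) {
    return std::ldexp(s.t, -1023 * s.exp);
}
```

For instance, a scaled number with t = 1e10 and exp = 1 represents a tiny value, so it compares below an ordinary number like 2.0 stored with exp = 0.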

We now have all the ingredients to describe our algorithm. It uses an auxiliary function split, which splits each product $p_i$ as a sum of scaled numbers using Lemma 2 (given this lemma, writing such a function is trivial). If $n_i > 1$ then half of the parts in which $p_i$ is split will be negative and the other half will be positive. Therefore, if $n_i > 1$ then $p_i$ contributes $2^{n_i - 2}$ scaled numbers to each heap. As a result, the left hand side of Equation (2) is the maximum number of elements which we will have on each heap, and condition (2) ensures that this number does not exceed $1/\epsilon$.

Once the heaps are filled with products, we start to compare them. While

$$s_n := \mathrm{size}(N) > 0 \quad \text{and} \quad s_p := \mathrm{size}(P) > 0,$$

we pop the top elements $n$ of $N$ and $p$ of $P$ and compare them. If $n > s_p\, p$ then the sign of the sum is $-1$. If $p > s_n\, n$ then the sign of the sum is $+1$. Otherwise, conceptually, we use Lemma 1 to split the difference and push the parts back into the heaps. There is a catch in this argument in that $n$ and $p$ are scaled numbers, which may have different exponents. If these exponents differ by more than one then the numbers on the heap whose top has the largest exponent are negligible and we are done. When the exponents differ by one we multiply the t field of the one with the largest exponent by $\sigma^{-1}$, reducing both numbers to the same exponent. This multiplication by $\sigma^{-1}$ may be inexact, but this inexactness is harmless. For instance, when $p$ has the largest exponent and the multiplication is inexact, then Lemma 4 in Section 4 implies that the exact scaled value $z = \sigma^{-1} p.t$ satisfies $z < \nu$, and Equations (5) and (2) yield

$$n.t > \tau = \frac{2\nu}{\epsilon} > \frac{2z}{\epsilon} \geq 2 s_p\, z, \qquad (15)$$

and we will reach the correct conclusion that the sign of the sum is $-1$ even if we use the incorrect scaled value of $p$. Once we have both numbers with the same exponent, we simply split the difference and adjust the exponents of the results consistently.

Finally, the algorithm terminates because we have two possibilities after we reduce $n$ and $p$ to the same exponent. When $p \geq n$ (the case $n > p$ is analogous):

• If $p > n$ then the largest t field decreases, and this can only happen a finite number of times.

• If $p = n$ then Lemma 1 implies that $c = e = 0$, and both heaps lose one element. Since the number of elements in each heap is at most $1/\epsilon$ by bound (2), this can happen only a finite number of times, and eventually one of the heaps becomes empty and the algorithm returns due to this condition.

This is only a high level description of the algorithm. A reasonably detailed version is presented in the last page of this article. The actual code is a bit more involved, due to optimizations which replace scaled_numbers by plain floating point numbers when possible. Readers interested in the implementation details should look at the C++ code available as supplementary material to the arXiv version of this article. This code is distributed under the Mozilla Public License 2.0.

## 4 Proofs

Here we prove the results stated above and two auxiliary results. Our proofs use the following characteristics, shared by the usual binary floating point arithmetics with subnormal numbers, like the ones covered by the IEEE standards 754 and 854. There are three kinds of elements in the set $\mathbb{F}$ of finite floating point numbers:

• $0$ is a floating point number.

• $x \in \mathbb{F}$ if and only if $-x \in \mathbb{F}$.

• The subnormal numbers have absolute value of the form

$$x = \nu\epsilon m \quad \text{for an integer} \quad 1 \leq m < 1/\epsilon. \qquad (16)$$

• The normal numbers have absolute value of the form

$$x = 2^e\nu\epsilon m \qquad (17)$$

for integers $e$ and $m$ such that

$$0 \leq e \leq e_{\max} := \log_2(\sigma) - \log_2(\nu) \quad \text{and} \quad 1 \leq \epsilon m < 2. \qquad (18)$$

We use two auxiliary lemmas, which are stated next; the proofs are then presented in the order in which the lemmas were stated.

###### Lemma 4

If $x \in \mathbb{F}$ and $2x \leq \sigma$ then $2x \in \mathbb{F}$. Therefore, if $k$ is a positive integer and $2^k x \leq \sigma$ then $2^k x \in \mathbb{F}$. Similarly, if $x \in \mathbb{F}$ is normal and $x/2 \geq \nu$ then $x/2$ is normal.

###### Lemma 5

If the integer $\ell \geq 0$ and the real number $x$ are such that

$$2^\ell \leq \frac{x}{\nu} \leq 2^{\ell+1} \leq \frac{\sigma}{\nu}, \qquad (19)$$

then

$$\mathrm{up}(x) = 2^\ell\nu\epsilon\left\lceil \frac{x}{2^\ell\nu\epsilon} \right\rceil. \qquad (20)$$

Proof of Lemma 1. If $b$ is either subnormal or normal with the minimum exponent, then $a$ and $b$ are of the form $a = \nu\epsilon m_a$ and $b = \nu\epsilon m_b$ for integers $0 \leq m_a \leq m_b$,

$$c = \nu\epsilon(m_b - m_a), \quad d = \nu\epsilon m_a, \quad e = 0 \quad \text{and} \quad b - a = c = c - e,$$

and Equation (3) holds. We can then assume that

$$b = 2^{e_b}\nu\epsilon m_b \quad \text{with} \quad e_b > 0 \quad \text{and} \quad 1 \leq \epsilon m_b < 2, \qquad (21)$$

and

$$a = 2^{e_a}\nu\epsilon m_a \quad \text{with} \quad e_a \geq 0 \quad \text{and} \quad 1 \leq m_a < 2/\epsilon.$$

It follows that

$$b - a = \nu\epsilon h \quad \text{for} \quad h := 2^{e_b}m_b - 2^{e_a}m_a > 0. \qquad (22)$$

If $h < 1/\epsilon$ then $b - a$ is subnormal and

$$c = b \ominus a = b - a \;\Rightarrow\; d = a \;\Rightarrow\; e = 0 \;\Rightarrow\; b - a = c - e,$$

Equation (3) holds and we are done. Let us then assume that $h \geq 1/\epsilon$ and let $\ell \geq 0$ be the integer such that

$$2^\ell \leq \frac{b - a}{\nu} = \epsilon h \leq 2^{\ell+1}. \qquad (23)$$

By Lemma 5,

$$c = b \ominus a = 2^\ell\nu\epsilon\left\lceil 2^{-\ell}h \right\rceil = 2^\ell\nu\epsilon\,\bar{c} \qquad (24)$$

for

$$\bar{c} := \left\lceil 2^{e_b-\ell}m_b - 2^{e_a-\ell}m_a \right\rceil. \qquad (25)$$

Equation (23) implies that

$$h = 2^{e_b}m_b - 2^{e_a}m_a \geq 2^\ell/\epsilon, \qquad (26)$$

and Equation (26) shows that

$$2^{e_b}m_b \geq 2^\ell/\epsilon \;\Rightarrow\; 2^{e_b} \geq \frac{2^\ell}{\epsilon m_b} > 2^{\ell-1} \;\Rightarrow\; e_b \geq \ell,$$

and $2^{e_b-\ell}m_b$ is an integer.

If $2^{e_a}m_a < 2^\ell$ then

$$\bar{c} = 2^{e_b-\ell}m_b, \quad c = b, \quad d = b \ominus c = 0, \quad e = a \ominus d = a \;\;\Rightarrow\;\; b - a = c - e,$$

and the first part of Equation (3) holds. Moreover, Equation (26) yields

$$e = a = 2^{e_a}m_a\,\nu\epsilon < 2^\ell\nu\epsilon \leq (2^{e_b}m_b\,\epsilon)\,\nu\epsilon = \epsilon b = \epsilon c,$$

and the second part of Equation (3) also holds. Therefore, we can assume that

$$2^{\ell-e_a} \leq m_a < 2/\epsilon. \qquad (27)$$

If $e_a \geq \ell$ then $2^{e_a-\ell}m_a$ is an integer,

$$\bar{c} = 2^{e_b-\ell}m_b - 2^{e_a-\ell}m_a, \quad c = b - a, \quad d = b \ominus c = a, \quad e = a \ominus d = 0 \;\;\Rightarrow\;\; b - a = c - e,$$

and Equation (3) holds again. Therefore, we can assume that $e_a < \ell$ and

$$q := \ell - e_a > 0. \qquad (28)$$

Equation (27) shows that $m_a \geq 2^q$, and the integers

$$\bar{a} := \lfloor 2^{-q}m_a \rfloor \quad \text{and} \quad \underline{a} := m_a - 2^q\bar{a}$$

are such that

$$m_a = 2^q\bar{a} + \underline{a}, \quad 1 \leq \bar{a} \leq m_a/2 < 1/\epsilon \quad \text{and} \quad 0 \leq \underline{a} < 2^q \leq m_a < 2/\epsilon. \qquad (29)$$

It follows that

$$2^{e_b-\ell}m_b - 2^{e_a-\ell}m_a = 2^{e_b-\ell}m_b - 2^{-q}m_a = 2^{e_b-\ell}m_b - \bar{a} - 2^{-q}\underline{a}$$

and

$$c = b \ominus a = b - 2^\ell\nu\epsilon\,\bar{a} = b - \hat{a} \quad \text{for} \quad \hat{a} := 2^\ell\nu\epsilon\,\bar{a}. \qquad (30)$$

The second bound in Equation (29) shows that $\nu\epsilon\bar{a}$ is subnormal, and Lemma 4 shows that $\hat{a} \in \mathbb{F}$. This implies that

$$d = b \ominus c = b - c = \hat{a}, \qquad (31)$$

and

$$a - d = 2^{e_a}\nu\epsilon m_a - 2^\ell\nu\epsilon\,\bar{a} = 2^{e_a}\nu\epsilon(m_a - 2^q\bar{a}) = 2^{e_a}\nu\epsilon\,\underline{a} =: \tilde{a}. \qquad (32)$$

If $\underline{a} = 0$ then $\tilde{a} = 0 \in \mathbb{F}$. If $0 < \underline{a} < 1/\epsilon$ then the same argument used for $\hat{a}$ shows that $\tilde{a} \in \mathbb{F}$, and if $\underline{a} \geq 1/\epsilon$ then $\nu\epsilon\underline{a}$ is normal and Lemma 4 shows that $\tilde{a} \in \mathbb{F}$. Therefore, in all cases for $\tilde{a}$ in Equation (32) we have that $\tilde{a} \in \mathbb{F}$ and

$$e = a \ominus d = a - d = \tilde{a}. \qquad (33)$$

Equations (32) and (33) show that $e = \tilde{a}$, and Equations (30) and (31) yield

$$c - e = b - \hat{a} - \tilde{a} = b - a,$$

and the first part of Equation (3) holds. Finally, Equations (24), (28) and (29) imply that

$$\epsilon c \geq 2^\ell\nu\epsilon = 2^{q+e_a}\nu\epsilon > 2^{e_a}\underline{a}\,\nu\epsilon = \tilde{a},$$

and the second part of Equation (3) holds.

Proof of Lemma 2. Let us start with the case in which $c = a \otimes b > \tau$. In this case $ab > \nu/\epsilon$ and $b$ is normal. Therefore,

$$b = 2^{e_b}\nu\epsilon m_b \quad \text{with} \quad 1 \leq \epsilon m_b < 2. \qquad (34)$$

The factor $a$ can be normal or subnormal, but in both cases there exist integers $e_a$, $m_a$ and $\ell_a$ with

$$a = 2^{e_a}\nu\epsilon m_a, \quad e_a \geq 0 \quad \text{and} \quad 2^{\ell_a} \leq m_a < \min\{2^{\ell_a+1},\, 2/\epsilon\}. \qquad (35)$$

If $\ell$ is the integer such that

$$2^\ell \leq \frac{ab}{\nu} < 2^{\ell+1} \qquad (36)$$

then

$$2^{\ell+1} > \frac{ab}{\nu} > \frac{1}{\epsilon} \;\Rightarrow\; 2^\ell > \frac{1}{2\epsilon} > 1 \;\Rightarrow\; \ell > 0,$$

and Lemma 5 and Equations (34) and (35) show that

$$a \otimes b = 2^\ell\nu\epsilon\left\lceil \frac{2^{e_a+e_b}\nu\epsilon\, m_a m_b}{2^\ell} \right\rceil = 2^\ell\nu\epsilon\left\lceil \frac{m_a m_b}{p} \right\rceil \qquad (37)$$

for

$$p := \frac{2^{\ell-e_a-e_b}}{\nu\epsilon}. \qquad (38)$$

Since

$$\frac{ab}{2^\ell\nu\epsilon} = \frac{m_a m_b}{p},$$

Equation (36) implies that

$$\frac{ab}{\nu} = 2^\ell\epsilon\,\frac{m_a m_b}{p} < 2^{\ell+1}$$

and Equations (34) and (35) yield

$$p > \epsilon m_a m_b/2 = (m_a/2)(\epsilon m_b) \geq 2^{\ell_a-1} \geq 1/2 \;\Rightarrow\; p > 1/2.$$

Since $p$ is a power of two, this implies that $p$ is an integer. On the other hand, Equation (36) implies that

$$2^\ell\epsilon\,\frac{m_a m_b}{p} \geq 2^\ell,$$

so that

$$p \leq \epsilon m_a m_b < 4/\epsilon \;\Rightarrow\; p \leq 2/\epsilon.$$

Euclid’s division algorithm yields integers $q$ and $r$ such that

$$m_a m_b = qp - r \quad \text{and} \quad 0 \leq r < p, \qquad (39)$$

and Equation (37) shows that

$$c = a \otimes b = 2^\ell\nu\epsilon\left\lceil q - p^{-1}r \right\rceil = 2^\ell\nu\epsilon\, q = ab + 2^\ell p^{-1}\nu\epsilon\, r. \qquad (40)$$

Equations (34) and (35) imply that

$$m_a m_b < 2^{\ell_a+1}\,\frac{2}{\epsilon} \leq \frac{4}{\epsilon^2} \;\Rightarrow\; \frac{1}{m_a m_b} > \frac{\epsilon^2}{4},$$

the bound $ab > \tau$ yields $ab\,\epsilon/\nu > 2$, and

$$2^\ell p^{-1} = 2^{e_a+e_b}\nu\epsilon = \frac{1}{\epsilon^2 m_a m_b}\,\frac{ab\,\epsilon}{\nu} > \frac{2}{\epsilon^2 m_a m_b} > \frac{1}{2}.$$

Since $2^\ell p^{-1}$ is a power of 2, this implies that $2^\ell p^{-1}$ is an integer, and the last inequality in Equation (39) and Lemma 4 imply that

$$d := 2^\ell p^{-1}\nu\epsilon\, r \in \mathbb{F}.$$

Combining this with Equations (39) and (40) we obtain that

$$ab = 2^\ell\nu\epsilon\, q - d = c - d$$

and

$$(-a)b + c = d \in \mathbb{F} \;\Rightarrow\; \mathrm{fma}(-a, b, c) = d.$$

This completes the proof of Equation (8).

If $c = a \otimes b \leq \tau$ then

$$a^2 \leq ab \leq \tau \;\Rightarrow\; a \leq \sqrt{2\nu/\epsilon} < 1 \;\Rightarrow\; \sigma a < \sigma \;\Rightarrow\; \tilde{a} = \sigma a = \sigma \otimes a \in \mathbb{F},$$

and using the same argument used to prove Equation (8) we can prove Equation (10).

Finally, the smallest possible value of a positive floating point number is $\nu\epsilon$, and if conditions (7) and (9) are not satisfied then

$$\tilde{a} = \sigma a \geq \sigma\nu\epsilon.$$

The violation of condition (9) and the first inequality in Equation (6) yield

$$b \leq \frac{2\nu}{\epsilon\,\tilde{a}} \leq \frac{2\nu}{\epsilon\,\sigma\nu\epsilon} \;\Rightarrow\; \sigma b \leq \frac{2}{\epsilon^2} \leq \sigma.$$

This bound implies that $\hat{b} := \sigma \otimes b = \sigma b \in \mathbb{F}$. We also have that $\hat{b} \geq \sigma\nu\epsilon$, and the second inequality in Equation (6) leads to

$$\tilde{a}\hat{b} \geq \sigma^2\nu^2\epsilon^2 = \frac{\sigma^2\nu\epsilon^3}{2}\cdot\frac{2\nu}{\epsilon} > \frac{2\nu}{\epsilon} = \tau.$$

This condition allows us to use the same argument used to prove the validity of Equations (8) and (10) to prove Equation (11), and this proof is complete.

Proof of Lemma 3. If $x.\mathrm{exp} > y.\mathrm{exp}$ then $x.\mathrm{exp} \geq y.\mathrm{exp} + 1$, and condition (14) implies that $x.t \leq \sigma\tau$. It then follows from $\tau < y.t$, which holds by conditions (13) and (14), that

$$\mathrm{value}(x) = \sigma^{-x.\mathrm{exp}}\,x.t \leq \sigma^{-y.\mathrm{exp}-1}\,\sigma\tau = \sigma^{-y.\mathrm{exp}}\,\tau < \sigma^{-y.\mathrm{exp}}\,y.t = \mathrm{value}(y),$$

and the function is_less returns the correct value in this branch. The branch $x.\mathrm{exp} < y.\mathrm{exp}$ is analogous. Finally, in the last branch the exponents cancel out and we are left with the correct comparison of the t fields.

Proof of Lemma 4. If $x = 0$ then $2x = 0 \in \mathbb{F}$. If $x$ is normal, then Lemma 4 follows directly from Equation (17). If $x = \nu\epsilon m$ is subnormal then there are two possibilities: If $2m < 1/\epsilon$ then $2x$ is also subnormal. If $2m \geq 1/\epsilon$ then $2x$ is normal, because $1 \leq \epsilon(2m) < 2$. The part of the Lemma regarding division follows directly from the definition of normal number.

Proof of Lemma 5. If $x$ satisfies condition (19) then

$$\frac{1}{\epsilon} \leq q := \frac{x}{2^\ell\nu\epsilon} \leq \frac{2}{\epsilon}.$$

Therefore,

$$\frac{1}{\epsilon} \leq \lceil q \rceil \leq \frac{2}{\epsilon}.$$

The number

$$\bar{x} := 2^\ell\nu\epsilon\,\lceil q \rceil$$

belongs to $\mathbb{F}$, because if $\lceil q \rceil < 2/\epsilon$ then it fits definition (17) with $e = \ell$ and $m = \lceil q \rceil$, and if $\lceil q \rceil = 2/\epsilon$ then

$$\bar{x} = 2^{\ell+1}\nu\epsilon\,(1/\epsilon)$$

and definition (17) holds with $e = \ell + 1$ and $m = 1/\epsilon$. We have that

 2ℓνϵq=x≤¯¯¯x,

and the definition of in implies that . To prove that , we now show that if is such that then . In fact, if then and this implies that is normal, that is,

 y=2eyνϵmywith1/ϵ≤my<2/ϵand2eyνϵmy≥x=2ℓνϵq.

 2eymy≥2ℓq  ⇒  2ey−ℓ≥q/my>(1/ϵ)/(2/ϵ)=1/2⇒ey≥ℓ,

and is integer. As a result

and we are done.

## References

• [1] de Berg, M., van Kreveld, M., Overmars, M., Schwarzkopf, O., Computational Geometry: Algorithms and Applications, Springer (2008).
• [2] Dekker, T.J., A floating-point technique for extending the available precision. Numer. Math. 18(3), 224 (1971).
• [3] Demmel, J., Hida, Y., Fast and accurate floating point summation with application to computational geometry. Numer. Algorithms 37(1–4), 101–112 (2004).
• [4] Espelid, T. O., On floating point summation. SIAM Review 37, 603–607 (1995).
• [5] Graillat, S., Louvet, N., Applications of fast and accurate summation in computational geometry. Technical report, Laboratoire LP2A, University of Perpignan, Perpignan, France (2006).
• [6] Higham, N. J., The accuracy of floating point summation. SIAM J. Sci. Computation 14, 783–799 (1993).
• [7] Kahan, W. Further remarks on reducing truncation errors. Commun. ACM 8(1), 40 (1965).
• [8] Lange, M., Oishi, S., A note on Dekker’s FastTwoSum algorithm. Numerische Mathematik, 145(2), 383–403 (2020).
• [9] Mascarenhas, W.F., Moore: Interval Arithmetic in C++20, In: Barreto G., Coelho R. (eds) Fuzzy Information Processing. NAFIPS 2018. Communications in Computer and Information Science, vol 831, pp 519–529 (2018).
• [10] Mascarenhas, W.F., Floating point numbers are real numbers, arXiv:1605.09202v2 [math.NA] (2017).
• [11] Ogita, T., Rump, S.M., Oishi, S, Accurate sum and dot product. SIAM J. Sci. Comput. 26(6), 1955–1988 (2005).
• [12] Ozaki K., Ogita, T., Oishi S., A robust algorithm for geometric predicate by error-free determinant transformation, Information and Computation 216 3–13 (2012).
• [13] Priest, D. M., On properties of floating point arithmetics: numerical stability and the cost of accurate computations. Ph.D. Thesis, University of California, Berkeley (1992).
• [14] Ratschek,H. and Rokne, J., Geometric Computations with Interval and New Robust Methods, Applications in Computer Graphics, GIS and Computational Geometry. Horwood Publishing Chichester (2003).
• [15] Rump, S., Error estimation of floating-point summation and dot product. BIT Numerical Mathematics 52(1), 201–220 (2012).

• [16] Shewchuk, J., Adaptive Precision Floating-Point Arithmetic and Fast Robust Geometric Predicates, Discrete & Computational Geometry 18:305–363 (1997).
• [17] Zhu, Y.K., Hayes, W.B., Algorithm 908. ACM Trans. Math. Softw. 37(3), 1–13 (2010).