1 Introduction
From a computational geometry perspective, we address the following problem: given a point, defined explicitly or implicitly as the intersection of two lines, we want to decide, using only floating point arithmetic, on which side of a line it lies (or whether it lies on the line). This problem is classic. It plays a fundamental role in the computation of convex hulls, Voronoi diagrams and Delaunay triangulations [1, 14, 16]. The essence of the geometric issues is described in Figure 1.
From an algebraic perspective, the determination of the position of the point with respect to the line is equivalent to the computation of the sign of a sum of products.
The inexactness of floating point arithmetic makes it hard to compute this sign exactly in some cases. There are already several references about the computation of sums of floating point numbers [3, 4, 5, 6, 7, 8, 10, 12, 13, 15, 16, 17], with applications in computational geometry. The idea of decomposing products into sums is also as old as [2, 11], and the fourth chapter of [14] presents an algorithm for obtaining the signs of such sums. However, all these references differ in one way or another from the present article. In summary, none of them gives a complete solution to the problem we solve here, using IEEE floating point arithmetic and taking underflow into account. We present a complete algorithm for this task, in gory detail. In particular, we take as input the factors of the products, whereas before applying most of the other algorithms we would need to obtain an exact representation of the products, or to assume that such products do not underflow. The algorithm was implemented in C++, carefully tested, and is supported by a detailed theory. It is quite likely that the theoretical results presented here can be generalized to bases other than two and to other rounding modes. We are not interested in such generalizations: our theoretical results are only meant to justify what matters most to us, the practical algorithms we present.
Finite floating point numbers suffice to solve our problem in the situations usually found in practice, and we only care about such numbers, which are elements of a set . Deviating from tradition, we assume that the arithmetic operations are performed rounding up. For instance, we assume that the subtraction of floating point numbers is defined as
where $\uparrow$ is the function from $\mathbb{R}$ to $\mathbb{F}$ defined by
(1) $\uparrow\!(x) := \min \{ z \in \mathbb{F},\ z \geq x \},$
and the operations and are similar (we have no need for expensive ’s.) For brevity, we leave the result of the arithmetic operation undefined when
In practice, when the rounding mode is not already upwards, enforcing our assumption requires a function call to change the rounding mode upwards at the beginning of the use of our functions, and another function call to restore it when we are done, to be polite. The cost of these two calls is amortized, but cases like this make us believe that always rounding upwards (or always rounding downwards) is a good option for code requiring exact results. Our Moore library [9] is a good example of this. Since it works with rounding mode upwards by default, there would be no need for changes in the rounding mode when executing the algorithms described here.
The motivation for this article is to have a bulletproof algorithm for computing the sign of such sums with floating point arithmetic. Such an algorithm is needed because rounding errors may sometimes lead naive algorithms to wrong conclusions. However, floating point arithmetic is an excellent tool: naive algorithms will be correct most of the time, and we should care about both precision and performance. For this reason, we believe that we should proceed in two steps: we first try to compute the sign using a quick algorithm, which is unable to compute the sign only in rare cases. In such rare cases we resort to a robust algorithm, which can handle all cases but is more expensive. We present both algorithms below. In Section 2 we describe a simple algorithm, which is quite efficient but may not find the sign in rare cases. In Section 3 we present a robust algorithm, which is more expensive and should be used only after the quick algorithm was unable to find the sign.
Regarding the efficiency of our algorithms, we emphasize that the quick one will suffice in the overwhelming majority of the cases found in practice, and the robust algorithm will be an extreme safety measure. As a result, usually the cost of evaluating the sign will be twice the cost of evaluating the sum using naive floating point arithmetic, plus the cost of branches, plus the cost of changing rounding modes when necessary. This is a linear cost, with a small constant. However, in the rare cases in which the robust algorithm is necessary, the cost can grow exponentially with the number of products.
Finally, the actual code is implemented using the template features of C++, and exploiting the details of this language it is possible to generate code that will only resort to more expensive operations in the rare situations in which they are needed.
2 Quick sign
In this section we present a fast algorithm which finds the sign of the sum of products in most cases, but which may be inconclusive sometimes. The algorithm returns a sign. If this sign is nonzero then it definitely is the sign of the sum. However, if the returned sign is zero then the sign of the sum can be anything. In this case we must resort to a more expensive algorithm to find it, like the one in the next section.
As in the rest of this article, we use floating point arithmetic rounding upwards, with at most two changes in rounding mode. In essence, for each product the algorithm computes numbers and such that
If then is positive and if then is negative, otherwise we cannot decide, and the algorithm returns . The algorithm is described as Algorithm 2 below, which uses the auxiliary Algorithm 1.
3 Robust sign
This section describes Algorithm 3, which computes the sign of , using binary floating point arithmetics which have subnormal numbers. The most relevant arithmetics in this class are the ones covered by the IEEE 754 and IEEE 854 standards. The algorithm is presented in the last page of the article. It assumes that there is no overflow in the products , but underflow is handled correctly. It also assumes that
(2) 
where is the machine precision. In practice, the largest we care about corresponds to the type float. In this case and the algorithm could be used to compute the signs of determinants of dimensions up to 8, because if and then
Since is already more than enough for the usual applications in computational geometry, we have no reason to make the algorithm more complicated than it already is in order to relax the condition .
The algorithm is based upon two lemmas. Lemma 1 is about the exact computation of the difference of floating point numbers. Versions of this lemma have existed since the late 1960s [2, 7, 8], but we prove it here for the case in which we round upwards, both for completeness and because the details are not obvious (as stated, Lemma 1 is false if we round downwards, for instance). In essence, it states that we can represent exactly as the difference of two floating point numbers and , with the additional feature that is much smaller than . As a result, in most cases we can base our decision regarding signs on , and is used only in the rare cases in which knowing is not sufficient.
Lemma 1
if , and
then
(3) 
where is the machine precision.
Lemma 2 is the analogue of Lemma 1 for multiplication, but it is more subtle. It relies on the fused multiply add operation (fma), which is available in most processors and programming languages these days. In other words, we assume that given such that we can compute
(4) 
It is well known that using the fma we can represent the product as the difference of two floating point numbers, but we are not aware of proofs (or even statements) of results describing conditions under which this representation is exact when rounding upwards. It is important to notice that such conditions are necessary, because Lemma 2 may not hold if they are violated. In order to state the decomposition result for multiplications we need to define some constants that characterize the floating point arithmetic:

is the smallest positive normal element of and is the smallest positive element of

is the machine precision, that is, is the successor of in .

is the largest power of two in , and we assume that .
Using the constants above we define the threshold
(5) 
The values of these constants for the usual arithmetics are presented in Table 1. By inspecting this table, readers will notice that the following assumptions used in Lemma 2 are satisfied:
(6) 
                              float      double     long double   quad
smallest positive normal      1.2e-38    2.2e-308   3.4e-4932     3.3e-4932
machine precision             1.2e-07    2.2e-16    1.1e-19       1.9e-34
largest power of two          1.7e+38    9.0e+307   5.9e+4931     5.9e+4931
threshold (5)                 2.0e-31    2.0e-292   6.2e-4913     3.4e-4898
We now state the decomposition lemma for multiplications.
Lemma 2
Consider , with , for which is defined. If
(7) 
then
(8) 
If the condition is not satisfied then and if
(9) 
then
(10) 
Finally, if Equations and are not satisfied and the first inequality in Equation holds then , and if the second inequality in Equation holds then
(11) 
In summary, if and is defined then there exists and such that
(12) 
In words, Lemma 2 shows that we may fail twice in trying to represent the product exactly as a difference of two floating point numbers, but the third time is a charm: we finally can represent it exactly as the scaled difference of two floating point numbers. Scaling is essential here in order to deal with underflow. We scale numbers by multiplying them by the constant , which is a power of two. Such scaling does not introduce rounding errors, but requires some bookkeeping. In C++ we can keep the books using a struct like
struct scaled_number {
  scaled_number(T t_, int exp_) : t(t_), exp(exp_) { }
  T t;
  int exp;
};
where T is the type of the floating point numbers. A scaled number s represents
We keep the scaled numbers in two heaps, one for the positive values, called , and another for the negative values, called . In the heap we store the absolute value of the corresponding numbers, so that the t field in our scaled numbers is always positive, and . The elements in the heaps are sorted in increasing order according to the following comparison function:
bool is_less(scaled_number x, scaled_number y) {
  if( x.exp > y.exp ) return true;
  if( x.exp < y.exp ) return false;
  return x.t < y.t;
}
In order to ensure the consistency of the order above we only push two kinds of scaled numbers onto our heaps:
(13)  
(14) 
We assume that these conditions are enforced by the constructor of scaled_number, which is only called with positive ’s. We then have the following lemma.
Lemma 3
Under the conditions and for scaled numbers x and y we have
We now have all the ingredients to describe our algorithm. It uses an auxiliary function split, which splits each product as a sum of scaled numbers using Lemma 2 (given this lemma, writing such a function is trivial). If then half of the parts in which is split will be negative and the other half will be positive. Therefore, if then contributes scaled numbers to each heap. As a result, the left hand side of Equation is the maximum number of elements which we will have on each heap, and the condition ensures that this number does not exceed .
Once the heaps are filled with products, we start to compare them. While
we pop the top elements and of and and compare them. If then the sign of the sum is . If then the sign of the sum is . Otherwise, conceptually, we use Lemma 1 to split and push the parts back into the heaps. There is a catch in this argument: and are scaled numbers, which may have different exponents. If these exponents differ by more than one then the numbers on the heap whose elements have the largest exponent are negligible and we are done. When the exponents differ by one we multiply the t field of the one with the largest exponent by , reducing both numbers to the same exponent. This multiplication by may be inexact, but this inexactness is harmless. For instance, when has the largest exponent and the multiplication is inexact, Lemma 4 in Section 4 implies that and Equations and yield
(15) 
and we will reach the correct conclusion that the sign of the sum is even if we use the incorrect . Once both numbers have the same exponent, we simply split the difference and adjust the exponents of the results consistently.
Finally, the algorithm terminates because we have two possibilities after we reduce and to the same exponent. When (the case is analogous):

If then the largest field decreases, and this can only happen a finite number of times.

If then Lemma 1 implies that . Since the number of elements in is at most by bound , this implies that , and the algorithm returns due to this condition.
This is only a high level description of the algorithm. A reasonably detailed version is presented in the last page of this article. The actual code is a bit more involved, due to optimizations which replace scaled_numbers by plain floating point numbers when possible. Readers interested in the implementation details should look at the C++ code available as supplementary material to the arXiv version of this article. This code is distributed under the Mozilla Public License 2.0.
4 Proofs
Here we prove the results stated above and two auxiliary results. Our proofs use the following characteristics shared by the usual binary floating point arithmetics with subnormal numbers, like the ones covered by the IEEE standards 754 and 854. There are three kinds of elements in the set of finite floating point numbers:

is a floating point number.

if and only if .

The subnormal numbers have absolute value of the form
(16) 
The normal numbers have absolute value of the form
(17) for integers and such that
(18)
We use two auxiliary lemmas; all proofs are presented after the statements of the lemmas, in the order in which the lemmas were stated.
Lemma 4
If and then . Therefore, if is a positive integer and then . Similarly, if and then is normal.
Lemma 5
If the integer and the real number are such that
(19) 
then
(20) 
Proof of Lemma 1. If is either subnormal or normal with the minimum exponent, then is also of the form ,
and Equation holds. We can then assume that
(21) 
and
It follows that
(22) 
If then is subnormal and
Equation holds and we are done. Let us then assume that and let be the integer such that
(23) 
By Lemma 5,
(24) 
for
(25) 
Equation leads to the bound
(26) 
and Equation shows that
and is integer.
If then
and the first part of Equation holds. Moreover, Equation yields
and the second part of Equation also holds. Therefore, we can assume that
(27) 
If then is integer,
and Equation holds again. Therefore, we can assume that and
(28) 
Equation shows that and the integers
are such that
(29) 
It follows that
and the bound leads to
and
(30) 
The second bound in Equation shows that is subnormal and Lemma 4 shows that . This implies that
(31) 
Additionally,
(32) 
If then . If then the same argument used for shows that , and if then is normal and Lemma 4 shows that . Therefore, in all cases for in Equation we have that and
(33) 
Equations and show that and Equations and yield
and the first part of Equation holds. Finally, Equations , , , and imply that
and the second part of Equation holds.
Proof of Lemma 2. Let us start with the case in which . In this case and is normal. Therefore,
(34) 
can be normal or subnormal, but in both cases there exist integers , and with
(35) 
If is the integer such that
(36) 
then
and Lemma 5 and Equations and show that
(37) 
for
(38) 
Since
Equation implies that
and Equations and yield
Since is a power of two, this implies that is integer. On the other hand, Equation implies that
and Equations and lead to
Euclid’s division algorithm yields integers and such that
(39) 
and Equation shows that
(40) 
Equations and imply that
the bound yields and
Since is a power of 2, this implies that is integer, and the last inequality in Equation and Lemma 4 imply that
Combining this with Equations we obtain that
and
This completes the proof of Equation .
If then
and using the same argument used to prove Equation we can prove Equation .
Finally, the smallest value possible for a positive floating point number is , and if the conditions and are not satisfied then
The violation of condition and the first inequality in Equation yield
This bound implies that . We also have that and the second inequality in Equation leads to
This condition allows us to use the same argument used to prove the validity of Equations and to prove Equations , and the proof is complete.
Proof of Lemma 3. If then and item (ii) implies that . It follows that and using item (i) we derive that
and the function is_less returns the correct value in this branch. The branch is analogous. Finally, in the last branch the exponents cancel out and we are left with the correct comparison of the t fields.
Proof of Lemma 4. If then . If is normal, then Lemma 4 follows directly from Equation . If is subnormal then there are two possibilities: If then is also subnormal. If then is normal, because . The part of the Lemma regarding division follows directly from the definition of normal number.
Proof of Lemma 5. If satisfies the condition then
Therefore,
The number
belongs to because if then it fits on definition with and , and if then
and definition holds with and . We have that
and the definition of in implies that . To prove that , we now show that if is such that then . In fact, if then and this implies that is normal, that is,
This leads to
and is integer. As a result
and we are done.
References
 [1] de Berg, M., van Kreveld, M., Overmars, M., Schwarzkopf, O., Computational Geometry: Algorithms and Applications, Springer (2008).
 [2] Dekker, T.J., A floating-point technique for extending the available precision. Numer. Math. 18(3), 224–242 (1971).
 [3] Demmel, J., Hida, Y., Fast and accurate floating point summation with application to computational geometry. Numer. Algorithms 37(1–4), 101–112 (2004).
 [4] Espelid, T. O., On floating point summation. SIAM Review 37, 603–607 (1995).
 [5] Graillat, S., Louvet, N., Applications of fast and accurate summation in computational geometry. Technical report, Laboratoire LP2A, University of Perpignan, Perpignan, France (2006).
 [6] Higham, N. J., The accuracy of floating point summation. SIAM J. Sci. Computation 14, 783–799 (1993).
 [7] Kahan, W. Further remarks on reducing truncation errors. Commun. ACM 8(1), 40 (1965).
 [8] Lange, M., Oishi, S., A note on Dekker’s FastTwoSum algorithm. Numerische Mathematik, 145(2), 383–403 (2020).
 [9] Mascarenhas, W.F., Moore: Interval Arithmetic in C++20, In: Barreto G., Coelho R. (eds) Fuzzy Information Processing. NAFIPS 2018. Communications in Computer and Information Science, vol 831, pp 519–529 (2018).
 [10] Mascarenhas, W.F., Floating point numbers are real numbers, arXiv:1605.09202v2 [math.NA] (2017).
 [11] Ogita, T., Rump, S.M., Oishi, S., Accurate sum and dot product. SIAM J. Sci. Comput. 26(6), 1955–1988 (2005).
 [12] Ozaki, K., Ogita, T., Oishi, S., A robust algorithm for geometric predicate by error-free determinant transformation, Information and Computation 216, 3–13 (2012).
 [13] Priest, D. M., On properties of floating point arithmetics: numerical stability and the cost of accurate computations. Ph.D. Thesis, University of California, Berkeley (1992).
 [14] Ratschek, H., Rokne, J., Geometric Computations with Interval and New Robust Methods: Applications in Computer Graphics, GIS and Computational Geometry. Horwood Publishing, Chichester (2003).

 [15] Rump, S.M., Error estimation of floating-point summation and dot product. BIT Numerical Mathematics 52(1), 201–220 (2012).
 [16] Shewchuk, J., Adaptive Precision Floating-Point Arithmetic and Fast Robust Geometric Predicates, Discrete & Computational Geometry 18, 305–363 (1997).
 [17] Zhu, Y.-K., Hayes, W.B., Algorithm 908: Online exact summation of floating-point streams. ACM Trans. Math. Softw. 37(3), 1–13 (2010).