Solving systems of inequalities in two variables with floating point arithmetic

by   Walter F. Mascarenhas, et al.

From a theoretical point of view, finding the solution set of a system of inequalities in only two variables is easy. However, if we want to get rigorous bounds on this set with floating point arithmetic, in all possible cases, then things are not so simple due to rounding errors. In this article we describe in detail an efficient data structure to represent this solution set and an efficient and robust algorithm to build it using floating point arithmetic. The data structure and the algorithm were developed as a building block for the rigorous solution of relevant practical problems. They were implemented in and the code was carefully tested. This code is available as supplementary material to the arxiv version of this article, and it is distributed under the Mozilla Public License 2.0.



There are no comments yet.


page 1

page 2

page 3

page 4


Computing the exact sign of sums of products with floating point arithmetic

IIn computational geometry, the construction of essential primitives lik...

Rigorous Roundoff Error Analysis of Probabilistic Floating-Point Computations

We present a detailed study of roundoff errors in probabilistic floating...

Secure multiparty computations in floating-point arithmetic

Secure multiparty computations enable the distribution of so-called shar...

A posteriori error estimation for the non-self-consistent Kohn-Sham equations

We address the problem of bounding rigorously the errors in the numerica...

Benchmarking Software Model Checkers on Automotive Code

This paper reports on our experiences with verifying automotive C code b...

Computationally Inequivalent Summations and Their Parenthetic Forms

Floating-point addition on a finite-precision machine is not associative...

An Abstraction-guided Approach to Scalable and Rigorous Floating-Point Error Analysis

Automated techniques for rigorous floating-point round-off error analysi...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

We consider the representation and computation of the set of points which satisfy the system of inequalities


This problem has applications in economics and computational geometry [2], but we developed the data structure and algorithm presented here for finding rigorous bounds on the solutions of two dimensional nonlinear programming problems using interval arithmetic [4]. For instance, when using Newton’s method to solve a nonlinear system of equations

with branch and bound, in each branch we have a candidate set of solutions described by a family of inequalities as in Equation . We then compute an interval matrix containing the jacobian matrix of for all and execute the interval arithmetic version of the Newton step for some near to the center of . The Newton step yields linear inequalities, which we use to refine . For this refinement to be bullet proof we need data structures and algorithms like the ones presented here. Many relevant problems can be solved this way, producing rigorous bounds on the solutions. The algorithm can be used as a building block for finding the complex roots of polynomials, the periodic orbits of chaotic systems [1]

or rigorous bounds on the solutions of ordinary differential equations.

Figure 1: Rounding errors may lead to the conclusion that the feasible region does not contain the vertex , because its rounded version lies below the line .

There are several algorithms for obtaining a reasonable representation of the feasible set for the system of inequalities using arithmetic operations [2], but implementing them with floating point arithmetic is not trivial due to rounding errors. These errors lead to the worst kind of bug: the ones that occur only in rare situations and are difficult to spot by testing. A simple example of what can go wrong is presented in Figure 1. In fact, after much experience developing real world software for computing Voronoi diagrams and Delaunay triangulations [2], we were convinced that it would be best to use exact arithmetic instead of floating point arithmetic for this kind of problem, even knowing quite well that exact arithmetic is much more expensive that floating point arithmetic. The bugs caused by floating point arithmetic were overwhelming. In particular, the naive idea of using tolerances (’s) does not work: in our experience, it is impossible to find the proper ’s in a consistent way, which works in general. Only recently we came to the conclusion that it is possible to perform these tasks with floating point arithmetic, provided that we use the techniques presented here and in the companion article [3].

For the kind of problems that we have in mind, it is acceptable to overestimate a bit, but we must not underestimate it. For instance, if our algorithm indicates that then it should be empty. On the other hand, it is acceptable to reduce the right hand side of the constrains a bit. Therefore, in the next sections we describe an algorithm with the following characteristics

  • We assume that is bounded, but it may be a point, a segment or empty. These cases cover all applications that we have in mind.

  • The algorithm finds a sharp approximation of , in the sense that is the exact feasible region for a slightly perturbed problem with constraints and such that

    In other words, and the area of the set is small.

  • When using a floating point type T, the data structure requires at most bytes of memory, for a mild constant .The algorithm performs at most sum, subtractions and multiplications, with a mild constant , and uses at most divisions.

In the next sections we describe the algorithm and the data structure used to implement it. Section 2 gives a bird’s eye view of the algorithm and points to the issues we face when translating this view into real code. Section 3 describes the data structure used to implement the algorithm. Finally, Section 4 emphasizes that we must be meticulous when testing the algorithm. Of course, an article like this is no replacement for the real code when one wants to fully understand the details. Such a code is available as supplementary material for the arxiv version of this article, and is distributed under the Mozilla Public License 2.0. For practical reasons, the coded algorithm deviates a bit from what we describe here, but ideas are the same.

2 Computing

normal vectors

Figure 2: Intersecting the feasible region with the half plane to the right of the line passing through the point . The vector is normal to this line and points to the right.

This section starts with a naive description of the algorithm to compute the feasible region for the system of inequalities . We then explain why this description is naive. We hope that this presentation will motivate the data structure we use to represent in C++ described in the next section. In the problems with which we are concerned, is contained in a box given by constraints


where is the largest finite floating point value. The algorithm assumes that Equation holds and starts the construction of from this box, and adds one constraint a time, at the cost of floating point operations per constraint. From the geometric perspective of Figure 2, adding a constraint , with say, corresponds to intersecting the current feasible region with the half plane defined by the line with normal vector containing , and we proceed as follows:

  1. We keep a data structure with the inward normal vectors of the edges of the feasible region sorted in counter clockwise order as in the right of Figure 2. This data structure can be simply a vector if we expect that the number of constraints will not be large, or a tree or other container with cost for the basic operations if we expect to be large. We perform two binary searches: one to locate and another to locate . By doing so we find the vertex which minimizes in and the vertex which maximizes in (these are the vertices of which satisfy the Karush/Kuhn Tucker conditions for minimizing and maximizing over .)

  2. The vertices are also sorted in counter clockwise order, and increases between and and decreases between and . We perform one binary search in to find the intersection point and another in to find , again at a cost of floating point operations.

  3. We then have the new feasible region .

This description of the algorithm is naive because it ignores many details and corner cases. A proper description would prescribe, for example, what to do if or , or even and . In fact, taking proper care of these particular cases is what consumes most of the time when coding this kind of algorithm. The naive case described above is easy (see Section 4 for more challenging examples.) Moreover, as in some books about computational geometry [5], the naive description also assumes that we can perform the three operations below exactly, but doing so requires care when using floating point arithmetic:

  • When finding the location of the normal in the set of normals, we must decide whether comes before or after the normal to an edge, or whether it has the same direction as in degenerate cases. In algebraic terms, this reduces to the computation of the sign of the determinant


    and the first hurdle we face is the possibility of overflow in the products and . We believe that the easiest way to handle this possibility is to normalize normals up front. When , the idea is to replace the constraint



    Unfortunately, this cannot always be done exactly in floating point arithmetic, and we need to round numbers consistently. It turns out that it is simpler to do this consistent rounding when the rounding mode is upwards, and our algorithm ensures that this is the rounding mode at its very beginning. Recalling that , we can then relax the constraint by replacing it by


    where by we mean the floating point division of by rounding up. The double negation in the right hand side of the modified constraint is equivalent to rounding down . As a result, by replacing the constraint by the constraint we can add points to the feasible region, but no points will be removed from it. The cases in which the assumption is not satisfied are handled analogously. Except for the trap , there are eight possibilities, which correspond to the decomposition of the interval in eight disjoint semi open intervals of width :


    This normalization of the normals makes it trivial to order them. To decide whether the constraint’s normal comes before , or is equal to it, we first compare the octants in which they lie. If they are in different octants then we can order and by comparing the octants. If they are in the same octant, it suffices to compare their “secondary” coordinate. For instance, in the case when we can simply compare and . As a result, the binary searches for finding and require only the comparison of single numbers, and the constant hidden in their complexity is small. Moreover, by normalizing normals we reduce the cost of evaluating the two products and a sum in , which compiles to a product and one fused add multiply (fma) instructions, to the evaluation of , which compiles to a single fma instruction. As a net effect, we exchange the cost of multiplications by the cost of the divisions for normalization, and this is a good deal for about one hundred.

    Finally, we must be prepared to handle overflow in the division . Since we are rounding upwards, we can only have . This implies that must be larger than the largest finite floating point value, and this contradicts our assumption and leads us to the conclusion that . Therefore, the algorithm terminates when results in overflow.

  • In Figure 2, when searching for the location of among the vertices , for a new constraint as in Equation we must evaluate the sign of , and for that we must represent and somehow. If is the intersection of the consecutive edges and in the counter clockwise order then




    and and cannot always be computed exactly with floating point arithmetic. Due to the ordering of the edges, and we can bound and using


    In these equations is the value of rounded up and is rounded up. With this arithmetic, we can prove that

    When , and , we can analyze the sign of by comparing


    and the analysis of the other of combinations of the signs of , and is analogous. If , then certainly , and if then certainly . However if neither nor then we cannot obtain the sign of only with the information provided by the numbers . That is, the numbers yield partial tests, which may be inconclusive in rare cases. In these rare cases, we resort to the technique presented in [3]. Using this technique we can evaluate exactly the sign of


    and decide on which side of the line the vertex lies. In summary, in order to locate we first try our best with the numbers in Equations . If we fail then we resort to the exact expressions for the vertex coordinates given by Equations and and use the more expensive evaluation of the sign of the expression in Equation with the technique described in [3].

  • After finding in which edges and lie, we must represent them somehow. Unfortunately, it is impossible to always compute and exactly with floating point arithmetic. What we can do is to compute the numbers in , and use the equations and defining the vertices as intersections of consecutive edges when the information provided by these numbers are not enough.

3 Representing the feasible region in C++

The building block we use to turn the ideas in the previous Section in C++ code is the following struct, which we use to represent the edges illustrated in Figure 3,

template <class T>
struct edge {
  T n;
  T c;
  T x[6];

normalized normal vectors

edges, vertices and boxes

Figure 3: is stored in eight ranges of edges, which correspond to the octants at the right. The normals in each range are sorted in the counter clockwise order. An edge is given by the normalized constraint and the six numbers in Equations . When , these numbers yield a small box containing the th vertex. Even when , they lead to quick tests to decide in which side of a line this vertex is.

In the struct edge<T>, the field n is the absolute value of the secondary entry of the normal of the normalized constraint corresponding to the edge. Its interpretation depends on the octant. In the first octant, the constraints are of the form , and n is . In the third octant, the constraints are , and n is equal to . In every octant, c is the right hand side of the constraint. The vector x contains the six numbers in Equations corresponding to the first vertex in the edge. The symmetry among quadrants is perfect and we do not need to write an specific function to handle edges in each octant, or to store the octant number in the edge. Instead, we manipulate edges using functions of the form

template <int Octant, class T>
void function(edge<T> const& e)


template <int OctantA, int OctantB, class T>
void function(edge<T> const& ea, edge<T> const& eb)

Due to symmetry, we can reason about such functions as if Octant = 0, or OctantA = 0, and let the compiler generate the code for all cases. In particular, there is little need for switches to decide with which octant we are working with at runtime. Most switches are performed at compile time.

Once we have decided to represent the edges and vertices of the feasible region by the struct above we must choose how to store them in memory. This choice depends upon how we expect to use the code. If the expected number of edges of is very large then it is advisable to use a container in which insertion and removal of edges has cost . However, these containers usually have an overhead and are inefficient for small or even moderate values of . For instance, the C++ standard library provides a container called map which is usually implemented as a red black tree and is notoriously inefficient even for in the order of a few hundred. There is also the subtle point that can be much smaller than , the number of inequalities. For instance, if the constraints are generated randomly then our experiments indicate that is much smaller than (something like seems plausible.) For these reasons we organized the code in such way that we could replace the type of container with easy, and due to time constraints we implemented only the container that we describe next. Insertion and removal of edges in this container can cost in the worst case, but it relies very little in dynamic memory allocation and causes no fragmentation in memory. As a result, it is quite efficient for the cases with up to a hundred which concern us most.

The edges are grouped into eight range<T>s:

template <class T>
struct range {
  edge<T>* begin;
  edge<T>* end;
  range<T>* previous;
  range<T>* next;











Figure 4: A feasible region with 5 edges, in ranges 1, 6, and 4. The letters and stand for begin, end, previous and next. The sentinel is represented twice but it is unique, and yes, it’s begin comes after it’s end. The ranges 0, 2, 3, 5 and 7 are empty and b = e = n = p = NULL for them.

The ranges are managed by an object of type ranges<T>:

template <class T>
struct ranges{
Ψrange<T> r[8];
Ψrange<T> sentinel;

Memory is organized as in Figure 4. Ranges can be inactive (when they are empty) or active. The active ranges and the sentinel form a doubly linked circular list, defined the field previous and next in the ranges. They share a common array of edges, with the edges for each active range being indicate by its begin and end fields. For consistency with circularity, the begin field of the sentinel always points to the first edge in the array of edges, and the sentinel’s end field always points to one passed the last element of the array of edges.

We leave slack on the edges array so that removing an edge from a range does not affect the other ranges and inserting an edge in a range only affects it’s neighboring ranges in a few cases. We can remove and insert edges in this data structure with the usual techniques for the manipulation of arrays and linked lists. When inserting an edge, if its range is inactive then we search for largest gap between the edges used by active ranges. If no gap is found then we allocate an new array of edges twice as large as the current one and move the current edges to it, dividing the extra space roughly equally in between the space used by the active ranges, and we activate the range by inserting it into the list of active ranges. If the new edge’s range is active and there is slack before or after it we simply expand the set of edges of this range accordingly. If there is no slack then we try to bump one of its neighbors. If this is not possible, then we reallocate memory as before, and after that we expand the range. To remove an edge we simply shrink the set of edges managed by its range. If the range becomes empty then we remove if from the list of active ranges, but keep it as inactive in the ranges<T>’s array r.

4 Testing

This section explains how to generate test cases for code that implements algorithms like the one we propose here. We warn readers that it is quite hard to write correct code for the task which we discuss in this article, and good tests are essential to ensure the quality of our code. Tests in which we simply generate constraints at random and verify that the resulting feasible region is consistent are not enough. Such tests tend to generate feasible regions with few edges, and will not find bugs caused by degenerate cases like the ones in Figure 5.

Figure 5: Cases which lead to bugs in code implementing the algorithm described here. The new constraints have a normal vector attached to them and the current feasible region is the polygon .

In order to generate random test cases with an acceptable coverage we must direct the random choices. A reasonable family of test cases can be built by choosing normals from the set of 32 equally space normals on the border of the square .



is the counter clockwise rotation by . With a powerful machine and multi threading we can test all “valid” subsets of as set of normals in the system , where by valid we mean that there is no gap greater than or equal to between two consecutive elements of (if there is such a large gap then the feasible region corresponding to is unbounded.)

Given a valid subset of with elements and an integer , in Subsection 4.1 we explain how to build a random polygon with vertices and numbers such that



The s are the lengths of the sides of in the sup norm. The polygon is the feasible region of the problem with , . Since , we have that


We can then execute the following procedure to test all cases in Figure 5 (except for the ones in which the new constraint contains two no adjacent vertices of ) using a floating point arithmetic in which the mantissa has bits and are such that does not overflow. For instance, for we could use float, and for we could use double.

To start with, we pick randomly a few orders on the indexes and for each order, starting from the square , we insert the constraints in Equation one by one, checking whether the current feasible region is consistent at each step. In the end we check whether the final feasible region is . We then consider the set of 64 normals equally spaced on the border of the square

For each we compute the values

and let be the set we obtain after we sort and remove the repetitions. For each we check whether the constraint

is inserted correctly. We then define and and for we choose a few randomly in and check whether the constraint

is inserted correctly.

The tests above do no cover degenerate s, i.e., the cases in which is empty, a point or a segment (In our code, we represent such cases using another data structure.) We can generate tests for these cases by a procedure similar to the one above, but which is less time consuming. To test the case in which is a point, we generate such point as the intersection of two segments with the normals in the set in Equation and chosen randomly so that the resulting point has non negative coordinates. To generate test cases for segments, we generate them using 3 elements from as normals and ’s chosen as for points. We then check the insertion of the same new constraints as for the case in which is a polygon.

In our experience, the test procedures above are quite powerful: using them we found bugs in our code which were not found by our unit tests which focused on each small part of the code at a time, due to incorrect implicit assumptions we made while coding these unit tests.

4.1 Building a polygon given its normals

This subsection describes how to generate a polygon with edges with normals in a valid subset of the set of normals in Equation . We assume to be sorted in the counter clockwise order. We now define the successor of as

and find the lengths of the edges of in the sup norm. We generate random integers and compute the sum

The lengths define the sides of a closed polygonal line if and only if . In the unlikely case that , we happily set for all and go for coffee. Otherwise, we fix the ’s as follows. Since there are at most elements in each octant in and for all , we can write

for integers , and


By symmetry, . Let be such that in the counter clockwise order in which is sorted. Since is valid, there exist such that

This implies that and with


is positive due to the order in . Since , . Similarly, the bound yields

It follows that


and . We now have the normals and the lengths for the sides of and build the and in two steps. First we define

and then, for ,


with the usual convention that . The identity leads to

and Equations and and the fact that imply that


Applying the same argument used to obtain the bound in Equation with in Equation instead of we obtain that


Equations and imply that

have absolute value smaller than . We then flip our last coins to find integers and define

completing the construction of .


  • [1] Goldsztejn A., Granvilliers, L., Christophe Jermann,C. Constraint Based Computation of Periodic Orbits of Chaotic Dynamical Systems Lecture Notes in Computer Science 8124, 774–789 (2013).
  • [2] De Berg. M., van Kreveld, M., Overmans, M., Schwarzkopf,O., Computational Geometry, algorithms and applications Springer, (2008).
  • [3] Mascarenhas, W.F., Computing the exact sign of sums of products with floating point arithmetic. arXiv:2109.07838 [cs.CG] (2021).
  • [4] Mascarenhas, W.F., Moore: Interval Arithmetic in C++20, In: Barreto G., Coelho R. (eds) Fuzzy Information Processing. NAFIPS 2018. Communications in Computer and Information Science, vol 831, pp 519–529 (2018).
  • [5] O’Rourke, J., Computational Geometry in C, Cambrige University Press, (1998).