Exploiting Binary Floating-Point Representations for Constraint Propagation: The Complete Unabridged Version

Floating-point computations are quickly finding their way in the design of safety- and mission-critical systems, despite the fact that designing floating-point algorithms is significantly more difficult than designing integer algorithms. For this reason, verification and validation of floating-point computations is a hot research topic. An important verification technique, especially in some industrial sectors, is testing. However, generating test data for floating-point intensive programs proved to be a challenging problem. Existing approaches usually resort to random or search-based test data generation, but without symbolic reasoning it is almost impossible to generate test inputs that execute complex paths controlled by floating-point computations. Moreover, as constraint solvers over the reals or the rationals do not natively support the handling of rounding errors, the need arises for efficient constraint solvers over floating-point domains. In this paper, we present and fully justify improved algorithms for the propagation of arithmetic IEEE 754 binary floating-point constraints. The key point of these algorithms is a generalization of an idea by B. Marre and C. Michel that exploits a property of the representation of floating-point numbers.


page 1

page 2

page 3

page 4


Correct Approximation of IEEE 754 Floating-Point Arithmetic for Program Verification

Verification of programs using floating-point arithmetic is challenging ...

Deductive Verification of Floating-Point Java Programs in KeY

Deductive verification has been successful in verifying interesting prop...

An efficient constraint based framework forhandling floating point SMT problems

This paper introduces the 2019 version of , a novel Constraint Programmi...

Why Not to Use Binary Floating Point Datatypes in RDF

The XSD binary floating point datatypes are regularly used for precise n...

Bringing freedom in variable choice when searching counter-examples in floating point programs

Program verification techniques typically focus on finding counter-examp...

Automatic generation and verification of test-stable floating-point code

Test instability in a floating-point program occurs when the control flo...

Datasets for Benchmarking Floating-Point Compressors

Compression of floating-point data, both lossy and lossless, is a topic ...

1 Introduction

During the last decade, the use of floating-point computations in the design of critical systems has become increasingly acceptable. Even in the civil and military avionics domain, which are among the most critical domains for software, floating-point numbers are now seen as a sufficiently-safe, faster and cheaper alternative to fixed-point arithmetic. To the point that, in modern avionics, floating-point is the norm rather than the exception (Burdy et al. 2012).

Acceptance of floating-point computations in the design of critical systems took a long time. In fact, rounding errors can cause subtle bugs which are often missed by non experts (Monniaux 2008), and can lead to catastrophic failures. For instance, during the first Persian Gulf War, the failure of a Patriot missile battery in Dhahran was traced to an accumulating rounding error in the continuous execution of tracking and guidance software: this failure prevented the interception of an Iraqi Scud that hit the barracks in Dhahran, Saudi Arabia, killing 28 US soldiers (Skeel 1992). A careful analysis of this failure revealed that, even though the rounding error obtained at each step of the floating-point computation was very small, the propagation during a long loop-iterating path could lead to dramatic imprecision.

Adoption of floating-point computations in critical systems involves the use of thorough unit testing procedures that are able to exercise complex chains of floating-point operations. In particular, a popular practice among software engineers in charge of the testing of floating-point-intensive computations consists in executing carefully chosen loop-iterating paths in programs. They usually pay more attention to the paths that are most likely to expose the system to unstable numerical computations.111A computation can be called numerically stable if it can be proven not to magnify approximation errors. It can be called (potentially) unstable otherwise. For critical systems, a complementary requirement is to demonstrate the infeasibility of selected paths, in order to convince a third-party certification authority that certain unsafe behaviors of the systems cannot be reached. As a consequence, software engineers face two difficult problems:

  1. How to accurately predict the expected output of a given floating-point computation?222This is the well-known oracle problem (see Weyuker 1982).

  2. How to find a test input that is able to exercise a given path, the execution of which depends on the results of floating-point computations, or to guarantee that such a path is infeasible?

The first problem has been well addressed in the literature (Kuliamin 2010) through several techniques. Ammann and Knight (1988) report on a technique known as the data diversity approach, which uses multiple related program executions of a program to check their results. Metamorphic testing (Chan et al. 1998) generalizes this technique by using known numerical relations of the function implemented by a program to check the results of two or more executions. Goubault (2001) proposes using the abstract interpretation framework (Cousot and Cousot 1977)

to estimate the deviation of the floating-point results with respect to an interpretation over the reals.

Scott et al. (2007) propose using a probabilistic approach to estimate round-off error propagation. More recently, Tang et al. (2010) propose to exploit perturbation techniques to evaluate the stability of a numerical program. In addition to these approaches, it is possible to use a (partial) specification, a prototype or an old implementation in order to predict the results for a new implementation.

In contrast, the second problem received only little attention. Beyond the seminal work of Miller and Spooner (1976), who proposed to guide the search of floating-point inputs to execute a selected path, few approaches try to exactly reason over floating-point computations. The work of Miller and Spooner (1976) paved the way to the development of search-based test data generation techniques, which consist in searching test inputs by minimizing a cost function, evaluating the distance between the currently executed path and a targeted selected path (Korel 1990, McMinn 2004, Lakhotia et al. 2010a). Although these techniques enable quick and efficient coverage of testing criteria such as “all decisions,” they are unfortunately sensitive to the rounding errors incurred in the computation of the branch distance (Arcuri 2009). Moreover, search-based test data generation cannot be used to study path feasibility, i.e., to decide whether a possible execution path involving floating-point computations is feasible or not in the program. In addition, these techniques can be stuck in local minima without being able to provide a meaningful result (Arcuri 2009). An approach to tackle these problems combines program execution and symbolic reasoning (Godefroid et al. 2005), and requires solving constraints over floating-point numbers in order to generate test inputs that exercise a selected behavior of the program under test. However, solving floating-point constraints is hard and requires dedicated filtering algorithms (Michel et al. 2001, Michel 2002). According to our knowledge, this approach is currently implemented in four solvers only: ECLAIR333http://bugseng.com/products/eclair, FPCS (Blanc et al. 2006), FPSE444http://www.irisa.fr/celtique/carlier/fpse.html (Botella et al. 2006), and GATeL, a test data generator for Lustre programs (Marre and Blanc 2005). It is worth noticing that existing constraint solvers dedicated to continuous domains (such as, e.g., RealPaver (Granvilliers and Benhamou 2006), IBEX and Quimper (Chabert and Jaulin 2009) or ICOS (Lebbah 2009)) correctly handle real or rational computations, but they cannot preserve the solutions of constraints over floating-point computations in all cases (see Section 5 for more on this subject). “Surprising” properties of floating-point computations such as absorption and cancellation (Goldberg 1991) show that the rounding operations can severely compromise the preservation of the computation semantics between the reals and the floats. Consider the C functions f1 and f2:

1float f1(float x) {                                          float f2(float x) {
2  float y = 1.0e12F;                                           float y = 1.0e12F;
3  if (x < 10000.0F)                                            if (x > 0.0F)
4     z = x + y;                                                  z = x + y;
5  if (z > y)                                                   if(z == y)

For both functions, let us ask the question whether the paths traversing lines 2-3-4-5-6 are feasible. The condition that must be satisfied in order for a certain path to be traversed is called path condition. For f1, the path conditions and , which on the reals are equivalent to whereas on the floats they have no solution. Conversely, for f2 the path conditions are and , which have no solutions on the reals but are satisfied by all IEEE 754 single precision floating-point numbers in the range .

1.1 A Real-World Example

1#define MAX_PPRZ 9600
4#ifndef CAM_PAN_MAX
5#define CAM_PAN_MAX 90
7#ifndef CAM_PAN_MIN
8#define CAM_PAN_MIN -90
10#define M_PI 3.14159265358979323846
11#define RadOfDeg(x) ((x) * (M_PI/180.))
15#error CAM_PAN_MAX has to be different from CAM_PAN_NEUTRAL
18#error CAM_PAN_MIN has to be different from CAM_PAN_NEUTRAL
22float cam_pan_c;
24void cam_angles( void ) {
25  float cam_pan = 0;
26  if (cam_pan_c > RadOfDeg(CAM_PAN_MAX)) {
27    cam_pan_c = RadOfDeg(CAM_PAN_MAX);
28  } else {
29    if (cam_pan_c < RadOfDeg(CAM_PAN_MIN))
30      cam_pan_c = RadOfDeg(CAM_PAN_MIN);
31  }
34  float pan_diff = cam_pan_c - RadOfDeg(CAM_PAN_NEUTRAL);
35  if (pan_diff > 0)
36    cam_pan = MAX_PPRZ * (pan_diff / (RadOfDeg(CAM_PAN_MAX - CAM_PAN_NEUTRAL)));
37  else
38    cam_pan = MIN_PPRZ * (pan_diff / (RadOfDeg(CAM_PAN_MIN - CAM_PAN_NEUTRAL)));
40  cam_pan = ((float)RadOfDeg(cam_pan_c - CAM_PAN_MIN))
41              * ((float)MAX_PPRZ / (float)RadOfDeg(CAM_PAN_MAX-CAM_PAN_MIN) );
44  if (cam_pan < MIN_PPRZ)
45    cam_pan = MIN_PPRZ;
46  else if (cam_pan > MAX_PPRZ)
47    cam_pan = MAX_PPRZ;
Listing 1: Code excerpted from a real-world avionic library

To illustrate the concrete problem raised by floating-point computations in program verification settings, consider the code depicted in Listing 1. It is a somewhat reduced version of a real-world example extracted from a critical embedded system.555The original source code is available at http://paparazzi.enac.fr, file sw/airborne/modules/cam_control/cam.c, last checked on November 29, 2013. In order to gain confidence in this code, a test-suite should be created that contains enough test cases to achieve a specified level of coverage. The basic coverage criterion is “all statements”, and prescribes that each statement is reached at least once by at least one test.666There exist more sophisticate and, correspondingly, more challenging coverage criteria, such as the already-mentioned “all decisions” and Modified Condition Decision Coverage (MCDC, see Ammann et al. 2003). For each statement, a set of constraints is defined that encodes the reachability of the statement and then solution is attempted: if one solution is found, then such a solution, projected on the explicit inputs (read parameters) and the implicit inputs (read global variables) of the function, constitutes the input part of a test case; if it is determined that a solution does not exist, then the statement is dead code; if the solution process causes a timeout, then we don’t know. For example, if the CAM_PAN_NEUTRAL is defined to expand to the integer literal 5 (or, for that matter, 45 or many other values), then we can prove that the statements in lines 45 and 47 are unreachable.777All the experiments mentioned in this paper have been conducted using the ECLAIR system. The presence of dead code is not acceptable for several industry standards such as MISRA C (Motor Industry Software Reliability Association 2013), MISRA C++ (Motor Industry Software Reliability Association 2008), and JSF C++ (VV. AA. 2005).

Another application of the same technology is the proof of absence of run-time anomalies, such as overflows or the unwanted generation of infinities. For each operation possibly leading to such an anomaly, a constraint system is set up that encodes the conditions under which the anomaly takes place. A solution is then searched: if it is found, then we have the proof that the code is unsafe; if it can be determined that no solution exists, then we know the code is safe; otherwise we don’t know. For the code of Listing 1, if the CAM_PAN_NEUTRAL is defined 5 or 45, then we can prove that no run-time anomaly is possible, whatever is the value of variable cam_pan_c when the function is invoked.

Now, let us take another point of view and consider that the macro CAM_PAN_NEUTRAL is not defined in the same file, as it is a configuration parameter. Its definition is (partially) validated by means of preprocessor directives as shown in the listing at lines 13–20: these directives enough to protect against dangerous definitions of CAM_PAN_NEUTRAL? We can provide an answer to this question by treating CAM_PAN_NEUTRAL as a variable of any type that is compatible with its uses in the code. This way we discover that, if CAM_PAN_NEUTRAL is defined to expand to, e.g., -2147483558, then we will have an overflow in line 36 on a 32-bit machine. Most compilers will catch this particular mistake, but this will not be the case if someone, someday, defines CAM_PAN_NEUTRAL as, e.g., +0x1ca5dc14c57550.p81 (roughly ): then in line 34 an infinity will be generated, something that in the aviation and other industries is unacceptable. One might also wonder whether one can define CAM_PAN_NEUTRAL as a double precision floating-point literal so that the denominator of divisions in lines 36 and 38 can be so small to cause an overflow: constraint solving over floating-point numbers is able to answer negatively to this question.

1.2 Contribution and Plan of the Paper

A promising approach to improve the filtering capabilities of constraints over floating-point variables consists in using some peculiar numerical properties of floating-point numbers. For linear constraints, this led to a relaxation technique where floating-point numbers and constraints are converted into constraints over the reals by using linear programming approaches

(Belaid et al. 2012). For interval-based consistency approaches, Marre and Michel (2010) identified a property of the representation of floating-point numbers and proposed to exploit it in filtering algorithms for addition and subtraction constraints. Carlier and Gotlieb (2011) proposed a reformulation of the Marre-Michel property in terms of “filtering by maximum ULP” (Units in the Last Place) that is generalizable to multiplication and division constraints.

Bagnara et al. (2013) addressed the question of whether the Marre-Michel property can be useful for the automatic solution of realistic test input generation problems: they sketched (without proofs) a reformulation and correction of the filtering algorithm proposed in (Marre and Michel 2010), along with a uniform framework that generalizes the property identified by Marre and Michel to the case of multiplication and division. Most importantly, (Bagnara et al. 2013) presented the implementation of filtering by maximum ULP in FPSE and some of its critical design choices, and an experimental evaluation on constraint systems that have been extracted from programs engaging into intensive floating-point computations. These results show that the Marre-Michel property and its generalization are effective, practical properties for solving constraints over the floats with an acceptable overhead. The experiments reported in (Bagnara et al. 2013) showed that improvement of filtering procedures with these techniques brings speedups of the overall constraint solving process that can be substantial (we have observed up to an order of magnitude); in the cases where such techniques do not allow significant extra-pruning, the slowdowns are always very modest (up to a few percent on the overall solution time).

The present paper is, on the one hand, the theoretical counterpart of (Bagnara et al. 2013) in that all the results are thoroughly proved; on the other hand, this paper generalizes and extends (Bagnara et al. 2013) as far as the handling of subnormals and floating-point division are concerned. More precisely, the contributions of the paper are:

  1. a uniform framework for filtering by maximum ULP is thoroughly defined and justified;

  2. the framework is general enough to encompass all floating-point arithmetic operations and subnormals (the latter are not treated in 

    (Bagnara et al. 2013));

  3. a second indirect projection by maximum ULP for division (not present in any previous work);

  4. all algorithms only use floating-point machine arithmetic operations on the same formats used by the analyzed computations.

The plan of the paper is as follows. Next section presents the IEEE 754 standard of binary floating-point numbers and introduces the notions and notations used throughout the paper. Section 3 recalls the basic principles of interval-based consistency techniques over floating-point variables and constraints. Section 4 presents our generalization of the Marre-Michel property along with a precise definition and motivation of all the required algorithms. Section 5 discusses related work. Section 6 concludes the main body of the paper. The most technical proofs are available in the Appendix.

2 Preliminaries

In this section we recall some preliminary concepts and introduce the used notation.

2.1 Ieee 754

This section recalls the arithmetic model specified by the IEEE 754 standard for binary floating-point arithmetic (IEEE Computer Society 2008). Note that, although the IEEE 754 standard also specifies formats and methods for decimal floating-point arithmetic, in this paper we only deal with binary floating-point arithmetic.

IEEE 754 binary floating-point formats are uniquely identified by quantities: , the number of significant digits (precision); , the maximum exponent; , the minimum exponent.888Note that, although the IEEE 754 formats have , we never use this property and decided to keep the extra-generality, which might be useful to accommodate other formats. The single precision format has and , the double precision format has and (IEEE 754 also defines extended precision formats). A finite, non-zero IEEE 754 floating-point number has the form where is the sign bit, is the hidden bit, is the -bit significand and the exponent is also denoted by or . Hence the number is positive when and negative when . is termed “hidden bit” because in the binary interchange format encodings it is not explicitly represented, its value being encoded in the exponent (IEEE Computer Society 2008).

Each format defines several classes of numbers: normal numbers, subnormal numbers, signed zeros, infinities and NaNs (Not a Number). The smallest positive normal floating-point number is and the largest is ; normal numbers have the hidden bit . The non-zero floating-point numbers whose absolute value is less than are called subnormals: they always have exponent equal to and fewer than significant digits as their hidden bit is . Every finite floating-point number is an integral multiple of the smallest subnormal . There are two infinities, denoted by and , and two signed zeros, denoted by and : they allow some algebraic properties to be maintained (Goldberg 1991).999Examples of such properties are and for . NaNs are used to represent the results of invalid computations such as a division of two infinities or a subtraction of infinities with the same sign: they allow the program execution to continue without being halted by an exception.

IEEE 754 defines five rounding directions: toward negative infinity (roundTowardNegative or, briefly, down), toward positive infinity (roundTowardPositive, a.k.a. up), toward zero (roundTowardZero, a.k.a. chop) and toward the nearest representable value (a.k.a. near); the latter comes in two flavors that depend on different tie-break rules for numbers exactly halfway between two representable numbers: roundTiesToEven (a.k.a. tail-to-even) or roundTiesToAway (a.k.a. tail-to-away) in which values with even significand or values away from zero are preferred, respectively. This paper is only concerned with roundTiesToEven, which is, by far, the most widely used. The roundTiesToEven value of a real number will be denoted by .

The most important requirement of IEEE 754 arithmetic is the accuracy of floating-point computations: add, subtract, multiply, divide, square root, remainder, conversion and comparison operations must deliver to their destination the exact result rounded as per the rounding mode in effect and the format of the destination. It is said that these operations are “correctly rounded.”

The accuracy requirement of IEEE 754 can still surprise the average programmer: for example the single precision, round-to-nearest addition of and (both numbers can be exactly represented) gives , i.e., the second operand is absorbed. The maximum error committed by representing a real number with a floating-point number under some rounding mode can be expressed in terms of the function (Muller 2005). Its value on is about for the single precision format.

2.2 Notation

The set of real numbers is denoted by while denotes a subset of the binary floating-point numbers, defined from a given IEEE 754 format, which includes the infinities and , the signed zeros and , but neither subnormal numbers nor NaNs. Subnormals are introduced in the set . In some cases, the exposition can be much simplified by allowing the of to be , i.e., by considering an idealized set of floats where the exponent is unbounded. Among the advantages is the fact that subnormals in can be represented as normal floating-point numbers in . Given a set of floating-point numbers , denotes the “non-negative” subset of , i.e., with .

For a finite, non-zero floating-point number , we will write (resp., ) to signify that the least significant digit of ’s significand is  (resp., ).

When the format is clear from the context, a real decimal constant (such as ) denotes the corresponding roundTiesToEven floating-point value (i.e., for ).

Henceforth, for , (resp., ) denotes the smallest (resp., greatest) floating-point number strictly greater (resp., smaller) than with respect to the considered IEEE 754 format. Of course, we have and .

Binary arithmetic operations over the floats will be denoted by , , and , corresponding to , , and over the reals, respectively. According to IEEE 754, they are defined, under roundTiesToEven, by

As IEEE 754 floating-point numbers are closed under negation, we denote the negation of simply by . Note that negation is a bijection. The symbol denotes any of , , or . A floating-point variable x is associated with an interval of possible floating-point values; we will write , where and denote the smallest and greatest value of the interval, and either or . Note that is not an interval, whereas is the interval denoting the set of floating-point numbers .

3 Background on Constraint Solving over Floating-Point Variables

In this section, we briefly recall the basic principles of interval-based consistency techniques over floating-point variables and constraints.

3.1 Interval-based Consistency on Arithmetic Constraints

Figure 1: Formulas for direct/indirect projections of addition/subtraction

Program analysis usually starts with the generation of an intermediate code representation in a form called three-address code (TAC). In this form, complex arithmetic expressions and assignments are decomposed into sequences of assignment instructions of the form

A further refinement consists in the computation of the static single assignment form (SSA) whereby, labeling each assigned variable with a fresh name, assignments can be considered as if they were equality constraints. For example, the TAC form of the floating-point assignment is , which in SSA form becomes , which, in turn, can be regarded as the conjunction of the constraints and .

In an interval-based consistency approach to constraint solving over the floats, constraints are used to iteratively narrow the intervals associated with each variable: this process is called filtering. A projection is a function that, given a constraint and the intervals associated with two of the variables occurring in it, computes a possibly refined interval101010That is, tighter than the original interval. for the third variable (the projection is said to be over the third variable). Taking as an example, the projection over is called direct projection (it goes in the same sense of the TAC assignment it comes from), while the projections over and are called indirect projections.111111Note that direct and indirect projections are idempotent, as their inputs and outputs do not intersect. Consider : direct projection propagates information on and onto , and doing it twice in a row would not enable any further inference. Likewise for indirect projections, which propagate information on onto and . Note that, for constraint propagation, both direct and indirect projections are applied in order to refine the intervals for , and . In this paper we propose new filtering algorithms for improving indirect projections.

A projection is called optimal if the interval constraints it infers are as tight as possible, that is, if both bounds of the inferred intervals are attainable (and thus cannot be pruned).

Figure 1 gives non-optimal projections for addition and subtraction. For finite , denotes the number that is exactly halfway between and ; note that either or . Non-optimal projections for multiplication and division can be found in (Michel 2002, Botella et al. 2006). Optimal projections are known for monotonic functions over one argument (Michel 2002), but they are generally not available for other functions. Note, however, that optimality is not required in an interval-based consistency approach to constraint solving, as filtering is just used to remove some, not necessarily all, inconsistent values.

3.2 The Marre-Michel Property

Marre and Michel (2010) published an idea to improve the filtering of the indirect projections for addition and subtraction. This is based on a property of the distribution of floating-point numbers among the reals: the greater a float, the greater the distance between it and its immediate successor. More precisely, for a given float with exponent , if , then for of exponent we have .

(Marre and Michel 2010, Proposition 1) Let be such that ; let also

with ;
with ;

Then, for each , implies that and . Moreover, .

Figure 2: An illustration of the Marre-Michel property: the segment , if it has to represent the difference between two floats, cannot be moved past

This property, which can be generalized to subnormals, can intuitively be explained on Figure 2 as follows. Let be a strictly positive constant such that , where are unknown. The Marre-Michel property says that cannot be greater than . In fact, is carefully positioned so that , and ; if we take we need if we want ; however, the smallest element of that is greater than , , is away from , i.e., too much. Going further with does not help: if we take , then

is an odd multiple of

(one step from to , all the subsequent steps being even multiples of ), whereas for each , is an even multiple of . Hence, if , . However, since , . The last inequality, which holds because , implies . A similar reasoning allows one to see that cannot be greater than independently from the value of . In order to improve the filtering of the addition/subtraction projectors, Marre and Michel (2010) presented an algorithm to maximize the values of and over an interval. As we will see, that algorithm is not correct for some inputs. In Section 4.5, the main ideas behind the work presented in (Marre and Michel 2010) will be revisited, corrected and discussed.

4 Filtering by Maximum ULP

In this section we first informally present, by means of worked numerical examples, the techniques that are precisely defined later. We then reformulate the Marre-Michel property so as to generalize it to subnormals and to multiplication and division operators. The filtering algorithms that result from this generalization are collectively called filtering by maximum ULP.

4.1 Motivating Example

Consider the IEEE 754 single-precision constraint with initial intervals , and . Forward projection gives

which is optimal, as both bounds are attainable. Suppose now the interval for z is further restricted to due to, say, a constraint from an if-then-else in the program or another indirect projection.

With the classical indirect projection we obtain , which, however, is not optimal. For example, pick : for we have and . By monotonicity of , for no we can have .

With our indirect projection, fully explained later, we obtain, from , the much tighter intervals . These are actually optimal as . This example shows that filtering by maximum ULP can be stronger than classical interval-consistency based filtering. However, the opposite phenomenon is also possible. Consider again with . Suppose now the constraints for x and y are and . As we have seen, our indirect projection gives ; in contrast, the classical indirect projection exploits the available information on x to obtain . Indeed, classical and maximum ULP filtering for addition and subtraction are orthogonal: both should be applied in order to obtain precise results.

For an example on multiplication, consider the IEEE 754 single-precision constraint with initial intervals and . In this case, classical projections do not allow pruning the intervals. However, take : for we have and . By monotonicity of , for no we can have .

On the same example, are the constraints inferred by our indirect projection. These are optimal because . As is the case for addition, classical indirect projection can be more precise. Consider again with , and . Classical indirect projection infers by exploiting the information on x.

4.2 Round-To-Nearest Tail-To-Even

We now formally define the roundTiesToEven rounding mode. To do that, we first introduce two functions: and give the distance between and and the distance between and .

The partial functions and are defined as follows, for each finite :

Note the special cases when : since both and represent the real number , the distance between and is . We can now define the function that captures roundTiesToEven.

For , is defined as follows:

Figure 3: Rounding of real numbers in the neighborhood of an even floating-point number under roundTiesToEven

Figure 3 illustrates the roundTiesToEven rounding mode; if is even, each real number between and , including extremes, is rounded to the same floating-point number . As is even, is odd, and each real number between and , excluding extremes, is rounded to . Similarly for . Note that point coincides with and coincides with .

All rounding modes are monotonic; in particular, for each , implies . Moreover, the chop and near rounding modes are symmetric, i.e., the value after rounding does not depend on the sign: for each , .

4.3 Upper Bound

It is worth pointing out that, while arithmetic operations on reals are strictly monotone, that is if then for any , in floating-point arithmetic, operations are just monotone. If then we may still have for some (or many) since addition over the floats is absorbing. Therefore, for determining the greatest (or the smallest) satisfying and correctly filter intervals over the floats, we need to introduce an appropriate, duly justified, framework.

For each IEEE 754 floating-point operation , in later sections we will define the sets and . Then we will define functions (see Definition 4.5 in Section 4.5 for and, consequently, , Definition 4.6 in Section 4.6 for , and Definition 4.7.1 in Section 4.7 for ) that satisfy the following property, for each :


In words, is the greatest float in that can be the left operand of to obtain .

Verifying that a function satisfies (1) is equivalent to proving that it satisfies the following properties, for each :


Property (3) means is a correct upper bound for the possible values of , whereas (2) implies that is the most precise upper bound we could choose.

Note that we may have : property (1) refers to an idealized set of floating-point numbers with unbounded exponents.

Since we are interested in finding the upper bound of for , we need the following

Let be such that, for each , …, , . Then, for each with and each , we have that .


Proof. Follows directly from (1). ∎

Let be a floating-point constraint where and let be such that for each : then no element of x that is greater than can participate to a solution of the constraint.

Dually, in order to refine the upper bound of y subject to , it is possible to define a function satisfying the following property, for each :


Due to (4), a result analogous to the one of Proposition 4.3 holds for , which allows refining the interval for y. Note, though, that when is commutative (i.e., it is or ), .

4.4 Lower bound

For computing the lower bound, we will introduce functions (defined in terms of the corresponding functions in Section 4.5 for and , in Section 4.6 for , and in Section 4.7 for ) satisfying the following property, for each :


This property entails a result similar to Proposition 4.3: given constraint where and such that for each , the float is a possibly refined lower bound for x.

In a dual way, in order to refine the lower bound of y subject to , we will define functions satisfying, for each :


Property (6) ensures that, under where , if is such that for each , then the float is a possibly refined lower bound for y.

Again, when is commutative .

4.5 Filtering by Maximum ULP on Addition/Subtraction

In this section we introduce the functions , , , , and . Note that, since is commutative, we have and . Moreover, the function can be defined in terms of the function as follows: for each , . We see that, if satisfies Property (1), then satisfies Property (5). Again, since is commutative, .

The first step for defining consists in extending Proposition 3.2 in order to explicitly handle subnormal numbers. Such extension was already sketched by Marre and Michel (2010): here we fully describe it and prove its correctness. Subnormals, which in are represented by numbers having the hidden bit and exponent , can be represented in by numbers with