1 Introduction
During the last decade, the use of floatingpoint computations in the design of critical systems has become increasingly acceptable. Even in the civil and military avionics domain, which are among the most critical domains for software, floatingpoint numbers are now seen as a sufficientlysafe, faster and cheaper alternative to fixedpoint arithmetic. To the point that, in modern avionics, floatingpoint is the norm rather than the exception (Burdy et al. 2012).
Acceptance of floatingpoint computations in the design of critical systems took a long time. In fact, rounding errors can cause subtle bugs which are often missed by non experts (Monniaux 2008), and can lead to catastrophic failures. For instance, during the first Persian Gulf War, the failure of a Patriot missile battery in Dhahran was traced to an accumulating rounding error in the continuous execution of tracking and guidance software: this failure prevented the interception of an Iraqi Scud that hit the barracks in Dhahran, Saudi Arabia, killing 28 US soldiers (Skeel 1992). A careful analysis of this failure revealed that, even though the rounding error obtained at each step of the floatingpoint computation was very small, the propagation during a long loopiterating path could lead to dramatic imprecision.
Adoption of floatingpoint computations in critical systems involves the use of thorough unit testing procedures that are able to exercise complex chains of floatingpoint operations. In particular, a popular practice among software engineers in charge of the testing of floatingpointintensive computations consists in executing carefully chosen loopiterating paths in programs. They usually pay more attention to the paths that are most likely to expose the system to unstable numerical computations.^{1}^{1}1A computation can be called numerically stable if it can be proven not to magnify approximation errors. It can be called (potentially) unstable otherwise. For critical systems, a complementary requirement is to demonstrate the infeasibility of selected paths, in order to convince a thirdparty certification authority that certain unsafe behaviors of the systems cannot be reached. As a consequence, software engineers face two difficult problems:

How to accurately predict the expected output of a given floatingpoint computation?^{2}^{2}2This is the wellknown oracle problem (see Weyuker 1982).

How to find a test input that is able to exercise a given path, the execution of which depends on the results of floatingpoint computations, or to guarantee that such a path is infeasible?
The first problem has been well addressed in the literature (Kuliamin 2010) through several techniques. Ammann and Knight (1988) report on a technique known as the data diversity approach, which uses multiple related program executions of a program to check their results. Metamorphic testing (Chan et al. 1998) generalizes this technique by using known numerical relations of the function implemented by a program to check the results of two or more executions. Goubault (2001) proposes using the abstract interpretation framework (Cousot and Cousot 1977)
to estimate the deviation of the floatingpoint results with respect to an interpretation over the reals.
Scott et al. (2007) propose using a probabilistic approach to estimate roundoff error propagation. More recently, Tang et al. (2010) propose to exploit perturbation techniques to evaluate the stability of a numerical program. In addition to these approaches, it is possible to use a (partial) specification, a prototype or an old implementation in order to predict the results for a new implementation.In contrast, the second problem received only little attention.
Beyond the seminal work of Miller and Spooner (1976),
who proposed to guide the search of floatingpoint
inputs to execute a selected path, few approaches try to exactly
reason over floatingpoint computations.
The work of Miller and Spooner (1976) paved the way to the development of
searchbased test data generation techniques, which consist
in searching test inputs by minimizing a cost function, evaluating
the distance between the currently executed path
and a targeted selected path (Korel 1990, McMinn 2004, Lakhotia et al. 2010a).
Although these techniques enable quick and efficient coverage of
testing criteria such as “all decisions,” they are unfortunately
sensitive to the rounding errors incurred in the computation of the
branch distance (Arcuri 2009).
Moreover, searchbased test data generation cannot be used to study
path feasibility, i.e., to decide whether a possible execution path
involving floatingpoint computations is feasible or not in the
program.
In addition, these techniques can be stuck in local minima without
being able to provide a meaningful result (Arcuri 2009).
An approach to tackle these problems combines program execution and
symbolic reasoning (Godefroid et al. 2005),
and requires solving constraints over floatingpoint
numbers in order to generate test inputs that exercise a selected behavior
of the program under test.
However, solving floatingpoint constraints is hard and requires dedicated
filtering algorithms (Michel et al. 2001, Michel 2002).
According to our knowledge, this approach is currently implemented
in four solvers only:
ECLAIR^{3}^{3}3http://bugseng.com/products/eclair,
FPCS (Blanc et al. 2006),
FPSE^{4}^{4}4http://www.irisa.fr/celtique/carlier/fpse.html
(Botella et al. 2006),
and GATeL, a test data generator for Lustre programs
(Marre and Blanc 2005).
It is worth noticing that existing constraint solvers dedicated to
continuous domains (such as, e.g., RealPaver (Granvilliers and Benhamou 2006),
IBEX and Quimper (Chabert and Jaulin 2009) or ICOS (Lebbah 2009))
correctly handle
real or rational computations, but they cannot preserve the
solutions of constraints over floatingpoint computations in all
cases (see Section 5 for more on this subject).
“Surprising” properties of floatingpoint computations such as
absorption and cancellation (Goldberg 1991) show that the rounding
operations can severely compromise the preservation of the computation
semantics between the reals and the floats.
Consider the C functions f1
and f2
:
For both functions, let us ask the question whether the paths
traversing lines 23456 are feasible.
The condition that must be satisfied in order for a certain path to be
traversed is called path condition.
For f1
, the path conditions
and ,
which on the reals are equivalent to
whereas on the floats they have no solution.
Conversely, for f2
the path conditions are
and , which
have no solutions on the reals but are satisfied by all IEEE 754
single precision floatingpoint numbers in the range .
1.1 A RealWorld Example
To illustrate the concrete problem raised by floatingpoint computations
in program verification settings,
consider the code depicted in Listing 1.
It is a somewhat reduced version of a realworld example extracted
from a critical embedded system.^{5}^{5}5The original source
code is available at http://paparazzi.enac.fr, file
sw/airborne/modules/cam_control/cam.c, last checked on November 29, 2013.
In order to gain confidence in this code,
a testsuite should be created that contains
enough test cases to achieve a specified level of coverage.
The basic coverage criterion is “all statements”, and
prescribes that each statement is reached at least once by at least
one test.^{6}^{6}6There exist more sophisticate and, correspondingly,
more challenging coverage criteria, such as the alreadymentioned
“all decisions” and Modified Condition Decision Coverage
(MCDC, see Ammann et al. 2003).
For each statement, a set of constraints is defined that encodes the
reachability of the statement and then solution is attempted:
if one solution is found, then such a solution, projected on the
explicit inputs (read parameters) and the implicit inputs (read global
variables) of the function, constitutes the input part of a test case;
if it is determined that a solution does not exist, then the statement
is dead code;
if the solution process causes a timeout, then we don’t know.
For example, if the CAM_PAN_NEUTRAL
is defined to expand to the integer
literal 5
(or, for that matter, 45
or many other values),
then we can prove that the statements in lines 45 and 47 are
unreachable.^{7}^{7}7All the experiments
mentioned in this paper have been conducted using the ECLAIR system.
The presence of dead code is not acceptable for several
industry standards such as MISRA C (Motor Industry Software Reliability
Association 2013),
MISRA C++ (Motor Industry Software Reliability
Association 2008),
and JSF C++ (VV. AA. 2005).
Another application of the same technology is the proof of absence of
runtime anomalies, such as overflows or the unwanted generation of infinities.
For each operation possibly leading to such an anomaly, a constraint system
is set up that encodes the conditions under which the anomaly takes place.
A solution is then searched:
if it is found, then we have the proof that the code is unsafe;
if it can be determined that no solution exists, then we know the code
is safe;
otherwise we don’t know.
For the code of Listing 1,
if the CAM_PAN_NEUTRAL
is defined 5
or 45
,
then we can prove that no runtime anomaly is possible, whatever
is the value of variable cam_pan_c
when the function
is invoked.
Now, let us take another point of view and consider that
the macro CAM_PAN_NEUTRAL
is not defined in the same file,
as it is a configuration parameter. Its definition is (partially)
validated by means of preprocessor directives as shown in the listing
at lines 13–20: these directives enough to protect against dangerous
definitions of CAM_PAN_NEUTRAL
?
We can provide an answer to this question by treating CAM_PAN_NEUTRAL
as a variable of any type that is compatible with its uses in the code.
This way we discover that, if CAM_PAN_NEUTRAL
is defined to expand
to, e.g., 2147483558
, then we will have an overflow in line 36 on
a 32bit machine. Most compilers will catch this particular mistake,
but this will not be the case if someone, someday, defines
CAM_PAN_NEUTRAL
as, e.g.,
+0x1ca5dc14c57550.p81
(roughly ):
then in line 34 an infinity will be generated, something
that in the aviation and other industries is unacceptable.
One might also wonder whether one can define CAM_PAN_NEUTRAL
as a double precision floatingpoint literal
so that the denominator of divisions in lines 36 and 38 can be so small
to cause an overflow: constraint solving over floatingpoint numbers
is able to answer negatively to this question.
1.2 Contribution and Plan of the Paper
A promising approach to improve the filtering capabilities of constraints over floatingpoint variables consists in using some peculiar numerical properties of floatingpoint numbers. For linear constraints, this led to a relaxation technique where floatingpoint numbers and constraints are converted into constraints over the reals by using linear programming approaches
(Belaid et al. 2012). For intervalbased consistency approaches, Marre and Michel (2010) identified a property of the representation of floatingpoint numbers and proposed to exploit it in filtering algorithms for addition and subtraction constraints. Carlier and Gotlieb (2011) proposed a reformulation of the MarreMichel property in terms of “filtering by maximum ULP” (Units in the Last Place) that is generalizable to multiplication and division constraints.Bagnara et al. (2013) addressed the question of whether the MarreMichel property can be useful for the automatic solution of realistic test input generation problems: they sketched (without proofs) a reformulation and correction of the filtering algorithm proposed in (Marre and Michel 2010), along with a uniform framework that generalizes the property identified by Marre and Michel to the case of multiplication and division. Most importantly, (Bagnara et al. 2013) presented the implementation of filtering by maximum ULP in FPSE and some of its critical design choices, and an experimental evaluation on constraint systems that have been extracted from programs engaging into intensive floatingpoint computations. These results show that the MarreMichel property and its generalization are effective, practical properties for solving constraints over the floats with an acceptable overhead. The experiments reported in (Bagnara et al. 2013) showed that improvement of filtering procedures with these techniques brings speedups of the overall constraint solving process that can be substantial (we have observed up to an order of magnitude); in the cases where such techniques do not allow significant extrapruning, the slowdowns are always very modest (up to a few percent on the overall solution time).
The present paper is, on the one hand, the theoretical counterpart of (Bagnara et al. 2013) in that all the results are thoroughly proved; on the other hand, this paper generalizes and extends (Bagnara et al. 2013) as far as the handling of subnormals and floatingpoint division are concerned. More precisely, the contributions of the paper are:

a uniform framework for filtering by maximum ULP is thoroughly defined and justified;

the framework is general enough to encompass all floatingpoint arithmetic operations and subnormals (the latter are not treated in
(Bagnara et al. 2013)); 
a second indirect projection by maximum ULP for division (not present in any previous work);

all algorithms only use floatingpoint machine arithmetic operations on the same formats used by the analyzed computations.
The plan of the paper is as follows. Next section presents the IEEE 754 standard of binary floatingpoint numbers and introduces the notions and notations used throughout the paper. Section 3 recalls the basic principles of intervalbased consistency techniques over floatingpoint variables and constraints. Section 4 presents our generalization of the MarreMichel property along with a precise definition and motivation of all the required algorithms. Section 5 discusses related work. Section 6 concludes the main body of the paper. The most technical proofs are available in the Appendix.
2 Preliminaries
In this section we recall some preliminary concepts and introduce the used notation.
2.1 Ieee 754
This section recalls the arithmetic model specified by the IEEE 754 standard for binary floatingpoint arithmetic (IEEE Computer Society 2008). Note that, although the IEEE 754 standard also specifies formats and methods for decimal floatingpoint arithmetic, in this paper we only deal with binary floatingpoint arithmetic.
IEEE 754 binary floatingpoint formats are uniquely identified by quantities: , the number of significant digits (precision); , the maximum exponent; , the minimum exponent.^{8}^{8}8Note that, although the IEEE 754 formats have , we never use this property and decided to keep the extragenerality, which might be useful to accommodate other formats. The single precision format has and , the double precision format has and (IEEE 754 also defines extended precision formats). A finite, nonzero IEEE 754 floatingpoint number has the form where is the sign bit, is the hidden bit, is the bit significand and the exponent is also denoted by or . Hence the number is positive when and negative when . is termed “hidden bit” because in the binary interchange format encodings it is not explicitly represented, its value being encoded in the exponent (IEEE Computer Society 2008).
Each format defines several classes of numbers: normal numbers, subnormal numbers, signed zeros, infinities and NaNs (Not a Number). The smallest positive normal floatingpoint number is and the largest is ; normal numbers have the hidden bit . The nonzero floatingpoint numbers whose absolute value is less than are called subnormals: they always have exponent equal to and fewer than significant digits as their hidden bit is . Every finite floatingpoint number is an integral multiple of the smallest subnormal . There are two infinities, denoted by and , and two signed zeros, denoted by and : they allow some algebraic properties to be maintained (Goldberg 1991).^{9}^{9}9Examples of such properties are and for . NaNs are used to represent the results of invalid computations such as a division of two infinities or a subtraction of infinities with the same sign: they allow the program execution to continue without being halted by an exception.
IEEE 754 defines five rounding directions: toward negative infinity (roundTowardNegative or, briefly, down), toward positive infinity (roundTowardPositive, a.k.a. up), toward zero (roundTowardZero, a.k.a. chop) and toward the nearest representable value (a.k.a. near); the latter comes in two flavors that depend on different tiebreak rules for numbers exactly halfway between two representable numbers: roundTiesToEven (a.k.a. tailtoeven) or roundTiesToAway (a.k.a. tailtoaway) in which values with even significand or values away from zero are preferred, respectively. This paper is only concerned with roundTiesToEven, which is, by far, the most widely used. The roundTiesToEven value of a real number will be denoted by .
The most important requirement of IEEE 754 arithmetic is the accuracy of floatingpoint computations: add, subtract, multiply, divide, square root, remainder, conversion and comparison operations must deliver to their destination the exact result rounded as per the rounding mode in effect and the format of the destination. It is said that these operations are “correctly rounded.”
The accuracy requirement of IEEE 754 can still surprise the average programmer: for example the single precision, roundtonearest addition of and (both numbers can be exactly represented) gives , i.e., the second operand is absorbed. The maximum error committed by representing a real number with a floatingpoint number under some rounding mode can be expressed in terms of the function (Muller 2005). Its value on is about for the single precision format.
2.2 Notation
The set of real numbers is denoted by while denotes a subset of the binary floatingpoint numbers, defined from a given IEEE 754 format, which includes the infinities and , the signed zeros and , but neither subnormal numbers nor NaNs. Subnormals are introduced in the set . In some cases, the exposition can be much simplified by allowing the of to be , i.e., by considering an idealized set of floats where the exponent is unbounded. Among the advantages is the fact that subnormals in can be represented as normal floatingpoint numbers in . Given a set of floatingpoint numbers , denotes the “nonnegative” subset of , i.e., with .
For a finite, nonzero floatingpoint number , we will write (resp., ) to signify that the least significant digit of ’s significand is (resp., ).
When the format is clear from the context, a real decimal constant (such as ) denotes the corresponding roundTiesToEven floatingpoint value (i.e., for ).
Henceforth, for , (resp., ) denotes the smallest (resp., greatest) floatingpoint number strictly greater (resp., smaller) than with respect to the considered IEEE 754 format. Of course, we have and .
Binary arithmetic operations over the floats will be denoted by , , and , corresponding to , , and over the reals, respectively. According to IEEE 754, they are defined, under roundTiesToEven, by
As IEEE 754 floatingpoint numbers are closed under negation, we denote the negation of simply by . Note that negation is a bijection. The symbol denotes any of , , or . A floatingpoint variable x is associated with an interval of possible floatingpoint values; we will write , where and denote the smallest and greatest value of the interval, and either or . Note that is not an interval, whereas is the interval denoting the set of floatingpoint numbers .
3 Background on Constraint Solving over FloatingPoint Variables
In this section, we briefly recall the basic principles of intervalbased consistency techniques over floatingpoint variables and constraints.
3.1 Intervalbased Consistency on Arithmetic Constraints
Program analysis usually starts with the generation of an intermediate code representation in a form called threeaddress code (TAC). In this form, complex arithmetic expressions and assignments are decomposed into sequences of assignment instructions of the form
A further refinement consists in the computation of the static single assignment form (SSA) whereby, labeling each assigned variable with a fresh name, assignments can be considered as if they were equality constraints. For example, the TAC form of the floatingpoint assignment is , which in SSA form becomes , which, in turn, can be regarded as the conjunction of the constraints and .
In an intervalbased consistency approach to constraint solving over the floats, constraints are used to iteratively narrow the intervals associated with each variable: this process is called filtering. A projection is a function that, given a constraint and the intervals associated with two of the variables occurring in it, computes a possibly refined interval^{10}^{10}10That is, tighter than the original interval. for the third variable (the projection is said to be over the third variable). Taking as an example, the projection over is called direct projection (it goes in the same sense of the TAC assignment it comes from), while the projections over and are called indirect projections.^{11}^{11}11Note that direct and indirect projections are idempotent, as their inputs and outputs do not intersect. Consider : direct projection propagates information on and onto , and doing it twice in a row would not enable any further inference. Likewise for indirect projections, which propagate information on onto and . Note that, for constraint propagation, both direct and indirect projections are applied in order to refine the intervals for , and . In this paper we propose new filtering algorithms for improving indirect projections.
A projection is called optimal if the interval constraints it infers are as tight as possible, that is, if both bounds of the inferred intervals are attainable (and thus cannot be pruned).
Figure 1 gives nonoptimal projections for addition and subtraction. For finite , denotes the number that is exactly halfway between and ; note that either or . Nonoptimal projections for multiplication and division can be found in (Michel 2002, Botella et al. 2006). Optimal projections are known for monotonic functions over one argument (Michel 2002), but they are generally not available for other functions. Note, however, that optimality is not required in an intervalbased consistency approach to constraint solving, as filtering is just used to remove some, not necessarily all, inconsistent values.
3.2 The MarreMichel Property
Marre and Michel (2010) published an idea to improve the filtering of the indirect projections for addition and subtraction. This is based on a property of the distribution of floatingpoint numbers among the reals: the greater a float, the greater the distance between it and its immediate successor. More precisely, for a given float with exponent , if , then for of exponent we have .
(Marre and Michel 2010, Proposition 1) Let be such that ; let also
with ;  
with ;  
Then, for each , implies that and . Moreover, .
This property, which can be generalized to subnormals, can intuitively be explained on Figure 2 as follows. Let be a strictly positive constant such that , where are unknown. The MarreMichel property says that cannot be greater than . In fact, is carefully positioned so that , and ; if we take we need if we want ; however, the smallest element of that is greater than , , is away from , i.e., too much. Going further with does not help: if we take , then
is an odd multiple of
(one step from to , all the subsequent steps being even multiples of ), whereas for each , is an even multiple of . Hence, if , . However, since , . The last inequality, which holds because , implies . A similar reasoning allows one to see that cannot be greater than independently from the value of . In order to improve the filtering of the addition/subtraction projectors, Marre and Michel (2010) presented an algorithm to maximize the values of and over an interval. As we will see, that algorithm is not correct for some inputs. In Section 4.5, the main ideas behind the work presented in (Marre and Michel 2010) will be revisited, corrected and discussed.4 Filtering by Maximum ULP
In this section we first informally present, by means of worked numerical examples, the techniques that are precisely defined later. We then reformulate the MarreMichel property so as to generalize it to subnormals and to multiplication and division operators. The filtering algorithms that result from this generalization are collectively called filtering by maximum ULP.
4.1 Motivating Example
Consider the IEEE 754 singleprecision constraint with initial intervals , and . Forward projection gives
which is optimal, as both bounds are attainable. Suppose now the interval for z is further restricted to due to, say, a constraint from an ifthenelse in the program or another indirect projection.
With the classical indirect projection we obtain , which, however, is not optimal. For example, pick : for we have and . By monotonicity of , for no we can have .
With our indirect projection, fully explained later, we obtain, from , the much tighter intervals . These are actually optimal as . This example shows that filtering by maximum ULP can be stronger than classical intervalconsistency based filtering. However, the opposite phenomenon is also possible. Consider again with . Suppose now the constraints for x and y are and . As we have seen, our indirect projection gives ; in contrast, the classical indirect projection exploits the available information on x to obtain . Indeed, classical and maximum ULP filtering for addition and subtraction are orthogonal: both should be applied in order to obtain precise results.
For an example on multiplication, consider the IEEE 754 singleprecision constraint with initial intervals and . In this case, classical projections do not allow pruning the intervals. However, take : for we have and . By monotonicity of , for no we can have .
On the same example, are the constraints inferred by our indirect projection. These are optimal because . As is the case for addition, classical indirect projection can be more precise. Consider again with , and . Classical indirect projection infers by exploiting the information on x.
4.2 RoundToNearest TailToEven
We now formally define the roundTiesToEven rounding mode. To do that, we first introduce two functions: and give the distance between and and the distance between and .
The partial functions and are defined as follows, for each finite :
Note the special cases when : since both and represent the real number , the distance between and is . We can now define the function that captures roundTiesToEven.
For , is defined as follows:
Figure 3 illustrates the roundTiesToEven rounding mode; if is even, each real number between and , including extremes, is rounded to the same floatingpoint number . As is even, is odd, and each real number between and , excluding extremes, is rounded to . Similarly for . Note that point coincides with and coincides with .
All rounding modes are monotonic; in particular, for each , implies . Moreover, the chop and near rounding modes are symmetric, i.e., the value after rounding does not depend on the sign: for each , .
4.3 Upper Bound
It is worth pointing out that, while arithmetic operations on reals are strictly monotone, that is if then for any , in floatingpoint arithmetic, operations are just monotone. If then we may still have for some (or many) since addition over the floats is absorbing. Therefore, for determining the greatest (or the smallest) satisfying and correctly filter intervals over the floats, we need to introduce an appropriate, duly justified, framework.
For each IEEE 754 floatingpoint operation , in later sections we will define the sets and . Then we will define functions (see Definition 4.5 in Section 4.5 for and, consequently, , Definition 4.6 in Section 4.6 for , and Definition 4.7.1 in Section 4.7 for ) that satisfy the following property, for each :
(1) 
In words, is the greatest float in that can be the left operand of to obtain .
Verifying that a function satisfies (1) is equivalent to proving that it satisfies the following properties, for each :
(2)  
(3) 
Property (3) means is a correct upper bound for the possible values of , whereas (2) implies that is the most precise upper bound we could choose.
Note that we may have : property (1) refers to an idealized set of floatingpoint numbers with unbounded exponents.
Since we are interested in finding the upper bound of for , we need the following
Let be such that, for each , …, , . Then, for each with and each , we have that .
Proof.
Proof. Follows directly from (1). ∎
Let be a floatingpoint constraint where and let be such that for each : then no element of x that is greater than can participate to a solution of the constraint.
Dually, in order to refine the upper bound of y subject to , it is possible to define a function satisfying the following property, for each :
(4) 
Due to (4), a result analogous to the one of Proposition 4.3 holds for , which allows refining the interval for y. Note, though, that when is commutative (i.e., it is or ), .
4.4 Lower bound
For computing the lower bound, we will introduce functions (defined in terms of the corresponding functions in Section 4.5 for and , in Section 4.6 for , and in Section 4.7 for ) satisfying the following property, for each :
(5) 
This property entails a result similar to Proposition 4.3: given constraint where and such that for each , the float is a possibly refined lower bound for x.
In a dual way, in order to refine the lower bound of y subject to , we will define functions satisfying, for each :
(6) 
Property (6) ensures that, under where , if is such that for each , then the float is a possibly refined lower bound for y.
Again, when is commutative .
4.5 Filtering by Maximum ULP on Addition/Subtraction
In this section we introduce the functions , , , , and . Note that, since is commutative, we have and . Moreover, the function can be defined in terms of the function as follows: for each , . We see that, if satisfies Property (1), then satisfies Property (5). Again, since is commutative, .
The first step for defining consists in extending Proposition 3.2 in order to explicitly handle subnormal numbers. Such extension was already sketched by Marre and Michel (2010): here we fully describe it and prove its correctness. Subnormals, which in are represented by numbers having the hidden bit and exponent , can be represented in by numbers with