 # Correct Approximation of IEEE 754 Floating-Point Arithmetic for Program Verification

Verification of programs using floating-point arithmetic is challenging on several counts. One of the difficulties of reasoning about such programs is due to the peculiarities of floating-point arithmetic: rounding errors, infinities, non-numeric objects (NaNs), signed zeroes, denormal numbers, different rounding modes, and so on. One possibility for reasoning about floating-point arithmetic is to model a program computation path by means of a set of ternary constraints of the form z = x op y, and to use constraint propagation techniques to infer new information on the variables' possible values. In this setting, we define and prove the correctness of algorithms that precisely bound the value of one of the variables x, y or z, starting from the bounds known for the other two. We do this for each of the operations and for each rounding mode defined by the IEEE 754 binary floating-point standard, even when the rounding mode in effect is only partially known. This is the first time that such so-called filtering algorithms are defined and their correctness is formally proved. This is an important stepping stone toward the formal verification of programs that use floating-point arithmetic.


## 1 Introduction

Programs using floating-point numbers are notoriously difficult to reason about [Mon08]. Many factors complicate the task:

1. compilers may transform the code in a way that does not preserve the semantics of floating-point computations;

2. floating-point formats are an implementation-defined aspect of most programming languages;

3. there are different, incompatible implementations of the operations for the same floating-point format;

4. mathematical libraries often come with little or no guarantee about what is actually computed;

5. programmers have a hard time predicting and avoiding phenomena caused by the limited range and precision of floating-point numbers (overflow, absorption, cancellation, underflow, …); moreover, devices that modern floating-point formats possess in order to support better handling of such phenomena (infinities, signed zeroes, denormal numbers, non-numeric objects a.k.a. NaNs) come with their share of issues;

6. rounding is a source of confusion in itself; moreover, there are several possible rounding modes, and programs can change the rounding mode in effect at any time.

As a result of these difficulties, the verification of floating-point programs in industry relies, almost exclusively, on informal methods, mainly testing, or on the evaluation of the numerical accuracy of computations, which only allows one to determine conservative (but often too loose) bounds on the propagated error [DGP09].

The satisfactory formal treatment of programs engaging in floating-point computations requires an equally satisfactory solution to the difficulties summarized in the above enumeration. Progress has been made, but more remains to be done. Let us review each point:

1. Some compilers provide options to refrain from rearranging floating-point computations. When these are not available or cannot be used, the only possibility is to verify the generated machine code or some intermediate code whose semantics is guaranteed to be preserved by the compiler back-end.

2. Even though the floating-point formats in use are implementation-defined aspects of, say, C and C++ (this is not relevant if one analyzes machine code or sufficiently low-level intermediate code), the wide adoption of the IEEE 754 standard for binary floating-point arithmetic [IEE08] has improved things considerably.

3. The IEEE 754 standard does provide some strong guarantees, e.g., that the results of individual additions, subtractions, multiplications, divisions and square roots are correctly rounded, that is, it is as if the results were computed in the reals and then rounded as per the rounding mode in effect. But it does not provide guarantees on the results of other operations or on other aspects, such as when the underflow exception is signaled [CKVDV02] (the indeterminacy described in [CKVDV02] is also present in the 2008 edition of IEEE 754 [IEE08]).
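The correct-rounding guarantee can be checked experimentally. The sketch below, in Python (whose `float` is IEEE 754 binary64), compares the hardware sum against the exact rational sum rounded back to a double; it assumes round-to-nearest, which is CPython's (and most environments') default rounding mode.

```python
from fractions import Fraction

def is_correctly_rounded_sum(x: float, y: float) -> bool:
    """Check that the hardware sum x + y equals the exact real-number
    sum rounded to the nearest double (IEEE 754 round-to-nearest)."""
    exact = Fraction(x) + Fraction(y)   # exact sum over the rationals
    return x + y == float(exact)        # float() rounds to nearest double

# The guarantee holds even when rounding discards information:
assert is_correctly_rounded_sum(0.1, 0.2)    # result is 0.30000000000000004
assert is_correctly_rounded_sum(1e16, 1.0)   # 1.0 is absorbed, yet the
                                             # result is correctly rounded
```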

4. A pragmatic, yet effective approach to support formal reasoning on commonly used implementations of mathematical functions has recently been proposed in [BCGB16, BCGB17]. The proposed techniques exploit the fact that the floating-point implementations of mathematical functions preserve, not completely but to a great extent, the piecewise monotonicity of the approximated functions over the reals.

5. The contribution of the present paper is in this area: by defining and formally proving the correctness of constraint propagation algorithms for IEEE 754 arithmetic constraints, we enable the use of formal methods for a broad range of programs. Such methods, i.e., abstract interpretation and symbolic model checking, allow proving that a number of generally unwanted phenomena (e.g., generation of NaNs and infinities, absorption, cancellation, instability…) do not happen or, in case they do happen, allow the generation of a test vector to reproduce the issue.

6. Handling of all IEEE 754 rounding modes, and being resilient to uncertainty about the rounding mode in effect, is another original contribution of this paper.

While the round-to-nearest rounding mode is, by far, the most frequently used one, it must be taken into account that:

• the possibility of programmatically changing the rounding mode is granted by IEEE 754 and is offered by most of its implementations (e.g., in the C programming language, via the fesetround() standard function);

• this possibility is exploited by interval arithmetic libraries and by numerical calculus algorithms (see, e.g., [Rum13, RO07]);

• setting the rounding mode to something different from round-to-nearest can be done by third parties in a way that was not anticipated by programmers: this may cause unwanted non-determinism in videogames [Fie10] and there is nothing preventing the abuse of this feature for more malicious ends, denial-of-service being only the least dangerous in the range of possibilities. Leaving malware aside, there are graphic and printer drivers and sound libraries that are known to change the rounding mode and may fail to set it back [Wat08].

As a possible way of tackling the difficulties described so far, and enabling sound formal verification of floating-point computations, this paper introduces new algorithms for the propagation of arithmetic constraints over floating-point numbers. These algorithms are called filtering algorithms, as their purpose is to prune the domains of possible variable values by filtering out those values that cannot be part of the solution of a system of constraints. Algorithms of this kind are employed in the constraint solvers required in several different areas, such as automated test-case generation, exception detection, or the detection of subnormal computations. In this paper we propose fully detailed, provably correct filtering algorithms for floating-point constraints, which handle all values, including symbolic values (NaNs, infinities and signed zeros), and all rounding modes defined by IEEE 754. Note that filtering techniques used in solvers over the reals do not preserve all solutions of constraints over floating-point numbers [MRL01, Mic02], and therefore they cannot be used to prune floating-point variable domains reliably. This leads to the need for filtering algorithms such as those we introduce here.

Before defining our filtering algorithms in a detailed and formal way, we provide a more comprehensive context on the propagation of floating-point constraints from a practical point of view (Sections 1.1 and 1.2), and justify their use in formal program analysis and verification (Section 1.3).

### 1.1 From Programs to Floating-Point Constraints

Independently of the application, program analysis starts with parsing, the generation of an abstract syntax tree and the generation of various kinds of intermediate program representations. An important intermediate representation is called three-address code (TAC). In this representation, complex arithmetic expressions and assignments are decomposed into sequences of assignment instructions of the form

 result := operand1 ⟨operator⟩ operand2.

A further refinement consists in the computation of the static single assignment form (SSA) whereby, labeling each assigned variable with a fresh name, assignments can be considered as if they were equality constraints. For example, the TAC form of the floating-point assignment x = x * y + z is t := x * y; x := t + z, which in SSA form becomes t1 := x1 * y1; x2 := t1 + z1. These, in turn, can be regarded as the conjunction of the constraints t1 = x1 ⊗ y1 and x2 = t1 ⊕ z1, where by ⊗ and ⊕ we denote the multiplication and addition operations on floating-point numbers, respectively. The Boolean comparison expressions that appear in the guards of if statements can be translated into constraints similarly. This way, a C/C++ program translated into an SSA-based intermediate representation can be represented as a set of constraints on its variables. Constraints can be added to or removed from this set in order to obtain a constraint system that describes a particular behavior of the program (e.g., the execution of a certain instruction, the occurrence of an overflow in a computation, etc.). Once such a constraint system has been solved, the variable domains only contain values that cause the desired behavior. If one of the domains is empty, then that behavior can be ruled out.
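The decomposition can be illustrated with a small sketch. Here the assignment x = x * y + z (a hypothetical example) is turned into SSA-form ternary constraints, each represented as a tuple (result, operator, operand1, operand2); all names are illustrative, not taken from any particular tool.

```python
# Sketch: decomposing  x = x * y + z  into SSA ternary constraints.
# TAC:  t := x * y ;  x := t + z
# SSA:  t1 := x1 * y1 ;  x2 := t1 + z1

def to_ssa_constraints():
    return [
        ("t1", "*", "x1", "y1"),   # t1 = x1 (*) y1  (floating-point multiply)
        ("x2", "+", "t1", "z1"),   # x2 = t1 (+) z1  (floating-point add)
    ]

constraints = to_ssa_constraints()
assert constraints == [("t1", "*", "x1", "y1"), ("x2", "+", "t1", "z1")]
```

Each tuple is exactly the ternary form z = x op y that the filtering algorithms operate on.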

### 1.2 Constraint Propagation

Once constraints have been generated, they are amenable to constraint propagation: under this name goes any technique that consists in considering a subset of the constraints at a time, explicitly removing elements from the set of values that are candidates to be assigned to the constrained variables. The values that can be removed are those that cannot possibly participate in a solution for the selected set of constraints. For instance, if a set of floating-point constraints contains the constraint x = y ⊗ y, then any value that is not the rounded square of a floating-point number can be removed from the domain of x. The degree up to which this removal can actually take place depends on the data structure used to record the possible values for x, intervals and multi-intervals being typical choices for numerical constraints. For the example above, if intervals are used, the removal can only be partial (negative floating-point numbers are removed from the domain of x). With multi-intervals more precision is possible, but any approach based on multi-intervals must take measures to avoid combinatorial explosion.
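The overall propagation scheme can be sketched as a fixpoint loop over interval domains: each projection proposes a refined interval, which is intersected with the current domain, until nothing changes or some domain becomes empty. This is a toy sketch (illustrative names, real-valued intervals, rounding deliberately ignored), not the paper's algorithms; the single projection shown is the direct projection of a constraint x = y * y.

```python
# Toy interval-based propagation: apply projections until a fixpoint
# is reached or some domain becomes empty. Intervals are (lo, hi) pairs.

def intersect(a, b):
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None      # None = empty domain

def propagate(domains, projections):
    changed = True
    while changed:
        changed = False
        for var, project in projections:
            new = intersect(domains[var], project(domains))
            if new is None:
                return None                    # system is unsatisfiable
            if new != domains[var]:
                domains[var] = new
                changed = True
    return domains

# Constraint x = y * y, with y's domain containing 0.
doms = {"x": (-10.0, 10.0), "y": (-2.0, 2.0)}
projs = [("x", lambda d: (0.0, max(d["y"][0] ** 2, d["y"][1] ** 2)))]
result = propagate(doms, projs)
assert result["x"] == (0.0, 4.0)   # negative values filtered out of x
```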

In this paper, we only focus on interval-based constraint propagation: the algorithms we present for intervals can be rather easily generalized to the case of multi-intervals. We make the further assumption that the floating-point formats available to the analyzed program are also available to the analyzer: this is indeed quite common due to the wide adoption of the IEEE 754 formats.

Interval-based floating-point constraint propagation consists in iteratively narrowing the intervals associated to each variable: this process is called filtering. A projection is a function that, given a constraint and the intervals associated to two of the variables occurring in it, computes a possibly refined interval for the third variable (the projection is said to be over the third variable). Taking the constraint x = y ⊕ z as an example, the projection over x is called direct projection (it goes in the same sense as the TAC assignment it comes from), while the projections over y and z are called indirect projections.
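For the constraint x = y ⊕ z over intervals, direct and indirect projections can be sketched in Python as follows. This is a conservative sketch, not the tight bounds the paper derives: math.nextafter (available since Python 3.9) widens each bound by one ulp, so that no feasible value is excluded whatever rounding mode is in effect.

```python
import math

def down(v: float) -> float:
    """One ulp below v: a safe lower bound under any rounding mode."""
    return math.nextafter(v, -math.inf)

def up(v: float) -> float:
    """One ulp above v: a safe upper bound under any rounding mode."""
    return math.nextafter(v, math.inf)

def direct_add(y, z):
    """Projection over x for the constraint x = y (+) z."""
    return (down(y[0] + z[0]), up(y[1] + z[1]))

def indirect_add_y(x, z):
    """Projection over y for the same constraint (y = x - z on the reals)."""
    return (down(x[0] - z[1]), up(x[1] - z[0]))

y, z = (1.0, 2.0), (3.0, 4.0)
x = direct_add(y, z)
assert x[0] <= 4.0 and x[1] >= 6.0     # [4, 6] is certainly contained in x
y2 = indirect_add_y(x, z)
assert y2[0] <= 1.0 and y2[1] >= 2.0   # refinement never drops solutions
```

Intersecting each projected interval with the variable's current domain yields the narrowing step of the filtering process.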

### 1.3 Applications of Constraint Propagation to Program Analysis

When integrated in a complete program verification framework, the constraint propagation techniques presented in this paper enable activities such as abstract interpretation, automatic test-input generation and symbolic model checking. In particular, symbolic model checking consists in exhaustively proving that a certain property, called the specification, is satisfied by the system under examination, which in this case is a computer program. A model checker can either prove that the given specification is satisfied, or provide a useful counterexample whenever it is not.

For programs involving floating-point computations, some of the most significant properties that can be checked consist in ruling out certain undesired exceptional behaviors such as overflows, underflows and the generation of NaNs, and numerical pitfalls such as absorption and cancellation. In more detail, we call a numeric-to-NaN transition a floating-point arithmetic computation that returns a NaN despite its operands being non-NaN. We call a finite-to-infinite transition the event of a floating-point operation returning an infinity when executed on finite operands, which occurs if the operation overflows. An underflow occurs when the output of a computation is too small to be represented in the machine floating-point format without a significant loss in accuracy. Specifically, we divide underflows into three categories, depending on their severity:

Gradual underflow:

an operation performed on normalized numbers results in a subnormal number. In other words, a subnormal has been generated out of normalized numbers: enabling gradual underflow is indeed the very reason for the existence of subnormals in IEEE 754. However, as subnormals come with their share of problems, generating them is better avoided.

Hard underflow:

an operation performed on normalized numbers results in a zero, whereas the result computed on the reals is nonzero. This is called hard because the relative error is 100%, gradual underflow does not help (the output is zero, not a subnormal), and, as neither input is a subnormal, this operation may constitute a problem per se.

Soft underflow:

an operation with at least one subnormal operand results in a zero, whereas the result computed on the reals is nonzero. The relative error is still 100% but, as one of the operands is a subnormal, this operation may not be the root cause of the problem.
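These anomalies can be observed directly in any IEEE 754 binary64 environment. The sketch below is illustrative (the function name and the simplifying assumption that the exact real result is nonzero whenever the computed result is zero are ours): it labels gradual, hard and soft underflow by comparing against sys.float_info.min, the smallest positive normal double, and also exhibits numeric-to-NaN and finite-to-infinite transitions.

```python
import math
import sys

MIN_NORMAL = sys.float_info.min        # smallest positive normal double

def is_subnormal(v: float) -> bool:
    return v != 0.0 and abs(v) < MIN_NORMAL

def classify_underflow(a, b, result):
    """Rough classification of an underflow in  result = a (op) b.
    Assumes the exact real result is nonzero when result == 0.0."""
    normal_inputs = not (is_subnormal(a) or is_subnormal(b))
    if is_subnormal(result) and normal_inputs:
        return "gradual"               # subnormal produced from normals
    if result == 0.0 and normal_inputs:
        return "hard"                  # zero from normals: 100% relative error
    if result == 0.0:
        return "soft"                  # zero, but an input was already subnormal
    return None

tiny = MIN_NORMAL                      # a normal number, barely
assert classify_underflow(tiny, 0.5, tiny * 0.5) == "gradual"
assert classify_underflow(1e-200, 1e-200, 1e-200 * 1e-200) == "hard"

# Numeric-to-NaN transition: non-NaN operands, NaN result.
assert math.isnan(math.inf - math.inf)
# Finite-to-infinite transition: finite operands, infinite result (overflow).
assert math.isinf(1e308 * 10.0)
```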

Absorption occurs when the result of an arithmetic operation is equal to one of the operands, even if the other one is not the neutral element of that operation. For example, absorption occurs when summing a number with another one of much smaller magnitude: if the precision of the floating-point format in use does not suffice to represent both, the additional digits that would appear in the significand of the result are rounded out.

###### Definition 1

(Absorption.) Let x, y ∈ 𝔽 with x ≠ NaN and y ≠ NaN, let ∘ ∈ {⊕, ⊖, ⊗, ⊘} be an IEEE 754 floating-point operator, and let z = x ∘ y. Then x ∘ y gives rise to absorption if

• ∘ = ⊕ and either z = x and y ≠ ±0, or z = y and x ≠ ±0;

• ∘ = ⊖ and either z = x and y ≠ ±0, or z = −y and x ≠ ±0;

• ∘ = ⊗ and either z = x and y ≠ 1, or z = y and x ≠ 1;

• ∘ = ⊘, z = x and y ≠ 1.
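Absorption is easy to reproduce in any IEEE 754 binary64 environment: 1.0 is entirely absorbed when added to 1e16, because the gap between consecutive doubles (the ulp) at 1e16 is 2.0.

```python
import math

big, small = 1e16, 1.0
assert big + small == big     # the result equals one operand: absorption...
assert small != 0.0           # ...even though the other is not neutral for +

# The reason: the spacing of doubles around 1e16 exceeds |small|.
ulp = math.nextafter(big, math.inf) - big
assert ulp == 2.0 and small < ulp
```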

In this section, we show how symbolic model checking can be used to either rule out or pinpoint the presence of these run-time anomalies in a software program by means of a simple but meaningful practical example. Floating-point constraint propagation with the techniques presented in this paper has been fully implemented in the commercial tool ECLAIR (https://bugseng.com/eclair, last accessed on January 23rd, 2019), developed and commercialized by BUGSENG. ECLAIR is a generic platform for the formal verification of C/C++ and Java source code, as well as Java bytecode. The filtering algorithms described in the present paper are used in the C/C++ modules of ECLAIR that are responsible for semantic analysis based on abstract interpretation [CC77], automatic generation of test-cases, and symbolic model checking. The latter two are based on constraint satisfaction problems [GBR98, GBR00], whose solution is based on multi-interval refinement and is driven by labeling and backtracking search. Constraints arising from the use of mathematical functions provided by C/C++ standard libraries are also supported. Unfortunately, most implementations of such libraries are not correctly rounded, which makes the realization of filtering algorithms for them rather challenging. In ECLAIR, propagation for such constraints is performed by exploiting the piecewise monotonicity properties of those functions, which are partially retained by all implementations we know of [BCGB16, BCGB17].