Explaining Counterexamples with Giant-Step Assertion Checking

by   Benedikt Becker, et al.

Identifying the cause of a proof failure during deductive verification of programs is hard: it may be due to an incorrectness in the program, an incompleteness in the program annotations, or an incompleteness of the prover. The changes needed to resolve a proof failure depend on its category, but the prover cannot provide any help on the categorisation. When using an SMT solver to discharge a proof obligation, that solver can propose a model from a failed attempt, from which a possible counterexample can be derived. But the counterexample may be invalid, in which case it may add more confusion than help. To check the validity of a counterexample and to categorise the proof failure, we propose the comparison between the run-time assertion-checking (RAC) executions under two different semantics, using the counterexample as an oracle. The first RAC execution follows the normal program semantics, and a violation of a program annotation indicates an incorrectness in the program. The second RAC execution follows a novel "giant-step" semantics that does not execute loops nor function calls but instead retrieves return values and values of modified variables from the oracle. A violation of the program annotations only observed under giant-step execution characterises an incompleteness of the program annotations. We implemented this approach in the Why3 platform for deductive program verification and evaluated it using examples from prior literature.



There are no comments yet.


page 1

page 2

page 3

page 4


Certifying C program correctness with respect to CompCert with VeriFast

VeriFast is a powerful tool for verification of various correctness prop...

S-semantics – an example

The s-semantics makes it possible to explicitly deal with variables in p...

PNP and All Sets in NP∪coNP Have P-Optimal Proof Systems Relative to an Oracle

As one step in a working program initiated by Pudlák [Pud17] we construc...

Reproducible Execution of POSIX Programs with DiOS

In this paper, we describe DiOS, a lightweight model operating system wh...

PEQcheck: Localized and Context-aware Checking of Functional Equivalence (Technical Report)

Refactorings must not alter the program's functionality. However, not al...

Pointer Life Cycle Types for Lock-Free Data Structures with Memory Reclamation

We consider the verification of lock-free data structures that manually ...

Different Maps for Different Uses. A Program Transformation for Intermediate Verification Languages

In theorem prover or SMT solver based verification, the program to be ve...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Deductive program verification aims at checking that a given program respects a given functional behaviour. The expected behaviour is expressed formally by logical assertions, principally pre-conditions and post-conditions on procedures and functions, and invariants on loops. The verification process consists in generating, from the code and the formal annotations, a set of logic formulas called verification conditions (VCs), typically via a Weakest Precondition Calculus (WP) [Dijkstra76]. If the VCs are proved valid, then the program is guaranteed to satisfy its specification. Deductive verification environments like Dafny [leino14fide], OpenJML [cok14], or Why3 [bobot14sttt], pass the VCs to automatic theorem provers usually based on Satisfiability Modulo Theories (SMT), such as Alt-Ergo [alt-ergo], CVC4 [barrett11cade] and Z3 [demoura08tacas]. Due to the nature of these solvers, the conclusion of each VC is negated to form a proof goal and the background solver is queried for satisfiability. Since these solvers are assumed to be sound when the answer is “unsat”, one can conclude that the VC is valid when that is indeed the answer.



SMT input


Candidate CE

Prover model

SMT solver

VC proved




(goal negated)




generation [dailler18jlamp]


Figure 1: Pipeline of the counterexample generation and categorisation

In this paper we address the case when the solver does not answer “unsat”, and provide a method to explain why the proof could not be completed. The solver may give instead several other answers: at best it answers “sat”, possibly with a model, which is a collection of values for the variables in the goal. As displayed in Fig. 1, we rely on an approach by Dailler et al [dailler18jlamp] to turn such a model into a candidate counterexample, which is essentially a collection of triples (variable, source location, value) representing the different values taken by a program variable during an execution. Such a counterexample may indicate two different problems: (i) a non-conformance, where the code does not satisfy one of its annotations; (ii) a subcontract weakness, where the annotations are not appropriate to achieve a proof (typically a weakness of a loop invariant or post-condition). Unfortunately there is no direct way to distinguish these cases. Other answers of the prover are even less informative: the answer “unknown” replaces “sat” when the solver is incomplete, for example in presence of quantified hypotheses or formulas involving non-linear arithmetic, or the solver reaches a time limit or a memory limit. In these cases, the solver may as well propose a model, but without any guarantee about its validity. Summing up, for any other answer than “unsat”, there is a need to validate the resulting candidate counterexample and categorise it as non-conformance or subcontract weakness.

We propose the categorisation of counterexamples using a novel notion of assertion checking, called giant-step assertion checking. Let us illustrate the idea on the following toy program, that operates on a global variable x. let setx (n:int) : unit ensures x ¿ n x ¡- n + 1

let main () : unit x ¡- 0; setx 2; assert x = 3 The VC for the function main is where denotes the new value of x after the call to setx, and the premise of the second implication comes from the post-condition of setx. The query sent to the solver is , from which the solver would typically answer “sat” with the model , corresponding to a counterexample where x is 0 after the assignment and 4 after the call to setx. If we proceed to a regular assertion-checking execution of main, no issue is reported: both the post-condition of setx and the assertion in main are valid. Our giant-step assertion checking executes calls to sub-programs in a single step, selecting the values of modified variables from the candidate counterexample: it handles the call setx 2 by extracting the new value for x from the counterexample, here , and checking the post-condition, which is correct here. The execution then fails because the assertion is wrong. Since the standard assertion checking is OK but the giant-step one fails, we report a subcontract weakness. This is the expected categorisation, suggesting the user to improve the contract of setx, in this case by stating a more precise post-condition. Giant-step assertion checking also executes loops by a single giant step to identify subcontract weaknesses in loop invariants.

In Section 2, we explain in more details the concept of giant-step execution, and how we use it to categorise counterexamples. In Section 3, we present our implementation, and experimental results, that were conducted within the Why3 environment [bobot14sttt]. We discuss related work and future work in Section LABEL:sec:concl. The technical details that we skip here due to lack of space are available in a research report [becker21rr].

2 Giant-Step Assertion Checking

We consider here a basic programming language with mutable variables, function calls and while loops, where functions are annotated with contracts (a pre-condition and a post-condition) and loops are annotated with a loop invariant. The language also includes a standard assert statement, as shown in the example above.

Giant-step assertion checking.

It corresponds to standard runtime assertion checking (RAC), in the sense that annotations (i.e., assertions, pre- and post-conditions, and loop invariants) are checked when they are encountered during execution. It differs from the standard RAC in the treatment of function calls and loops: instead of executing their body statements, they only involve a single, “giant” step. Giant-step execution is parameterised by an oracle, from which one retrieves the values of variables that are written by a function call or a loop. These written variables could be automatically inferred, but for simplicity we require them to be indicated by a writes clause. Therefore, a function declaration has the form:

let f (: ) (: ) : requires ensures writes =

and a loop has the form:

while invariant writes do done

2 if expression  is:
3  a function call :
4     query the context for declaration of :
5       parameters , pre-condition , writes  post-condition 
6     evaluate each  to value , and set each  to  in the context
7     evaluate , if false report assertion failure
8     query the oracle for values  for  and  for result, assign them in the context
9     evaluate , if false report execution stuck
10     return value 
11   a loop  :
12     evaluate , if false report assertion failure
13     query the oracle for values  for , assign in the context
14     evaluate , if false report execution stuck
15     evaluate , if false return
16     evaluate loop body 
17     evaluate , if false report assertion failure
18     report execution stuck
19  otherwise: as standard assertion-checking execution
Figure 2: Pseudo-code for giant-step assertion-checking execution

where are fresh variables.

Figure 3: Rules for WP computation, loops and function calls

Figure 2 presents a pseudo-code for giant-step evaluation of a program expression (see [becker21rr] for a formal presentation). This execution form is inspired by the weakest precondition calculus, specifically the rules for calls and loops that we remind in Figure 3. The execution may fail or get stuck in a number of situations:

  • Line 7: if the pre-condition of a call to is false, the execution must fail (as in standard RAC).

  • Line 9: if the post-condition of a call to is false, the values from the oracle are incompatible with the postcondition of . A stuck execution is reported, and the counterexample will be declared invalid.

  • Line 12: as in standard assertion checking, a failure is reported for the invariant initialisation.

  • Line 14: if the invariant is false, the oracle does not provide valid values to continue the execution. A stuck execution is reported, and the counterexample will be invalid.

  • Line 15: if the loop condition is false after setting the values of written variables in the context, the oracle covers an execution that goes beyond the loop, so we just terminate its execution.

  • Line 17: if invariant is not preserved, we report an assertion failure.

  • Line 18: if invariant is preserved, it means the oracle is not appropriate for identifying any failure, we report a stuck execution.

Categorisation of counterexample.

As shown on Fig. 1, assume that a VC for a program is not validated by some SMT solver, which returns a model, which is turned into a candidate counterexample. We categorise this counterexample as follows (stopping when the first statement is met):

  1. Run the standard RAC on the enclosing function of the VC, with arguments and values of global variables taken from the counterexample. If the result is

    1. a failure at the same source location as the VC, we report a non-conformance: the code does not satisfy the corresponding annotation.

    2. a failure at a different source location as the VC, the counterexample is bad (is not suitable to explain the failed proof), although it deserves to be reported as a non-conformance elsewhere in the code: it exposes an execution where the program does not satisfy some annotation.

  2. Run the giant-step RAC on the enclosing function of the VC, with inputs and written variables given by the counterexample seen as an oracle. If the result is

    1. a failure, we report a sub-contract weakness: some post-condition or loop invariant in the context is too weak, as in the introductory toy example.

    2. a stuck execution, the counterexample is invalid and discarded: one of the premises of the VC is not satisfied, and we can suspect a prover incompleteness.

    3. a normal execution, the counterexample is discarded: it satisfies the conclusion of the VC.

3 Experiments

1  let isqrt (n: int)
2    requires { 0 <= n <= 10000 }
3    ensures { result * result <= n < (result + 1) * (result + 1) }
4  = let ref r = n in
5    let ref y = n * n in
6    let ref z = -2 * n + 1 in
7    while y > n do
8      invariant  { 0 <= r <= n }
9      invariant  { y = r * r }
10      invariant  { n < (r+1) * (r+1) }
11      invariant  { z = -2 * r + 1 }
12      y <- y + z; z <- z + 2; r <- r - 1
13    done;
14    r\end{lstlisting}
15  \caption{Computation of the integer square root, adapted from C code of Petiot~\cite{petiot18fac}}
16  \label{fig:example-isqrt}
19We implemented our approach in the Why3 platform for deductive program
21and tested it by reproducing the experiments from prior literature
22about the categorisation of proof failures (Petiot et al.~\cite{petiot18fac}).
23These experiments covered three example programs written in C
24with ACSL program annotations, and the verification was carried out in the
25Frama-C framework. The experiments comprised modifications to the programs that
26introduced proof failures, which were then categorised using their approach. We
27translated the C programs to WhyML and were able to reproduce the
28categorisations with our approach for all 16 modifications that were applicable
29to the WhyML program.
31The experiments by Dailler et
32al.~\cite{dailler18jlamp} were written in Ada/SPARK, which uses Why3 for
33deductive verification. Their Riposte testsuite contains 24 programs with
34247 checks. The integration of our approach with SPARK is work in
35progress. Currently, we are able to identify in this testsuite 14 wrong
36counterexamples, 57 non-conformities, and 22 other cases classified as
37either non-conformity or subcontract-weaknesses (an imprecise answer which
38results when the standard RAC is non-conclusive~\cite{becker21rr}). The counterexamples of 154
39checks could not yet be categorised.
41Let us illustrate our experiments on one of Petiots
42examples~\cite{petiot18fac} in C, a function that calculates the integer
43square root. Our translation of the C program to WhyML is shown in
44Figure~\ref{fig:example-isqrt}. The result for parameter  is an integer 
45such that . The variable  is initialised by  as an
46over-approximation of the result, the variable  by , and  by . During execution of the while loop, the value of  is decremented and the
47value of  is kept at , while maintaining . When the loop
48condition becomes false,  contains the largest integer such that .
49The VC for function \w{isqrt} is split by Why3 into nine
50verification goals: two for the initialisation and preservation of each of
51the four loop invariant, and one for the post-condition. The validity of
52the program is proven in Why3 (we used the Z3 solver to prove the goals and
53generate counterexamples).
55We reproduce here two variations of that program, that introduce proof failures.
56% S4
58changing the first assignment in line~12 to \w{y <- y - z} leads to a proof failure
59for the preservation of invariant~. The counterexample gives the value~4
60for the argument \w{n} of function \w{isqrt}. The standard RAC execution fails
61after the first loop iteration when checking the preservation of invariant
62, and the proof failure is categorised as non-conformity.
63% S7
64Second, removing loop invariant  leads to a proof failure for the
65postcondition.  The counterexample gives the value~1 for the argument \w{n}. The
66standard RAC execution terminates normally with the correct result of~1. The
67giant-step RAC execution initialises the variables \w{r}, \w{y}, and \w{z} with
68values~1, 1, and -1, respectively, and checks that the loop invariants hold
69initially. To execute the loop in a giant step, the variables \w{r}, \w{y}, and
70\w{z} are set to the values 0, 0, and 1 from the counterexample, which also
71satisfy the loop invariants. The loop condition becomes false, and the
72giant-step execution leaves the loop. The execution then fails because the
73current value 0 of variable \w{r} contradicts the postcondition.  Since standard
74RAC terminated normally but giant-step RAC failed, the proof failure is
75categorised as a subcontract-weakness.
77See the research report~\cite{becker21rr} for more examples and experimental
80\section{Related Work and Perspectives}
83Christakis et al.~\cite{christakis16tacas} use Dynamic Symbolic Execution (DSE)
84(i.e., concolic testing) to generate test cases for the part of the code that
85underlies a failing VC. The code is extended by run-time checks and fed to the
86DSE engine Delfy, that explores all possible paths up to a given
87bound. The output can be one of the following: the engine verifies the
88method, indicating that the VC is valid and thus no further action is required;
89it generates a test case that leads to an assertion violation, indicating that
90the VC is invalid; or no test case can be generated in the given bound, where
91nothing can be concluded.
92Our approach is more directly based on the work of
93Petiot et al.~\cite{petiot18fac}. Their work also relies on a DSE engine
94(the PathCrawler tool) to classify proof failures and to generate
95counterexamples. Each proof failure is classified as
96\emph{non-compliance}, \emph{subcontract weakness}, or \emph{prover
97  incapacity}. By using DSE, every generated counterexample leads to
98an assertion failure during concrete execution of the program.
100What distinguishes our approach from the two above is that we derive the test
101case leading to a proof failure from the model generated by the SMT solver,
102rather than relying on a separate tool such as the DSE engine. Instead of
103applying different program transformations, we compare the results of two
104different types executions (standard RAC and giant-step RAC) of the original program.
106There are quite a lot of potential improvements and extensions to our work,
107which are discussed in our research report~\cite{becker21rr}. First, our
108approach should be extended to support more features such as the maintenance of
109type invariants. Our implementation deserves to be made more robust, to deal
110with missing values in the counterexample, and calls to functions that lack an
111implementation. A representative example of remaining technical issues is when
112the original code makes use of polymorphic data types: in that case, a complex
113encoding is applied to the VCs before sending them to provers, and unwinding
114this encoding to a obtain an oracle for running the RAC does not currently
115work. As seen in the experiments so far, our method is often limited in the RAC,
116for example when checking assertions that are quantified formulas, which is a
117known issue for RAC~\cite{kosmatov16isola}.
118As mentioned in the experiments, our implementation aims to support various
119Why3 front-ends, such the one for Ada/SPARK~\cite{mccormick15}. The current
120experimental results are not fully satisfying and there are numerous required
121practical improvement. Yet, the ability to filter out obviously wrong
122counterexamples, and categorise them, is hopefully a major expected improvement
123for the SPARK user experience. A remaining challenge is that a counterexample at
124Why3s level might not be suitable at the front-end level. In particular,
125performing the small-step assertion checking in the
126front-end language may result in a more accurate diagnose.
128\paragraph{Acknowledgements.} We thank our partners from AdaCore (Yannick Moy)
129and Mitsubishi Electric R\&D Centre Europe (David Mentr\’e, Denis Cousineau,
130Florian Faissole and Beno\^it Boyer) for their feedback on first
131experimentations of our counterexample categorisation approach.
139% Local Variables:
140% mode: latex
141% TeX-master: t
142% TeX-PDF-mode: t
143% mode: flyspell
144% ispell-local-dictionary: british
145% fill-column: 80
146% End:
148%  LocalWords:  invariants satisfiability