Using a specification of a library’s methods in the verification of its clients is a hallmark of modular reasoning. Because these specifications encapsulate the interface between the client and the library, each may be independently verified without access to the other’s implementation. This modularity is particularly beneficial when the library function is complex or its source code is unavailable. All too often, though, such specifications are either missing or incomplete, preventing the verification of clients without making (often unwarranted) assumptions about the behavior of the library. This problem is further exacerbated when libraries expose rich datatype functionality, which often leads to specifications that rely on inductive invariants (Dillig et al., 2013; Itzhaky et al., 2014) and complex structural relations. One solution to this problem is to automatically infer missing specifications. Unfortunately, while significant progress has been made in specification inference over the past several years (Albarghouthi et al., 2016; Zhu et al., 2016; Padhi et al., 2016; Miltner et al., 2020), existing techniques have not considered inference in the frequently occurring case of client programs that make use of data structure libraries with unavailable implementations. To highlight the challenge, consider the following simple program, which concatenates two stacks together using four operations provided by a Stack library: , , and .
To ensure the correctness of this client function, its author may wish to verify that (a) the top element of the output stack is always the top element of one of the input stacks; and, (b) every element of the output stack is also an element of one of the input stacks and vice-versa. In order to express this behavior in a form amenable to automatic verification, we need some mechanism to encode the semantics of stacks in a decidable logic. To do so, we rely on a pair of method predicates, “ is the head of stack ”, , and “ is a member of stack ”, , to write our postcondition:
We assume these method predicates are associated with (possibly blackbox) implementations that we can use to check the specifications in which they appear. For example, may be defined in terms of the stack operations and , while the implementation of might additionally use . The variable in is used to represent the output stack of . The above assertion claims that the head of the output stack must be the head of either or and that any element found in the output must be a member of either or . By treating method predicates ( and ) and library functions (, , , and ) as uninterpreted function symbols, it is straightforward to generate verification conditions (VCs), e.g. using weakest precondition inference, which can be handed off to an off-the-shelf SMT solver like Z3 to check. However, the counter-examples returned by the theorem prover may be spurious, generated by incorrect assumptions about library method behavior in the absence of any constraints on these behaviors outside the client VCs. For example, the prover might assume the formula is valid, i.e. that the result of is not the head of . This claim is obviously inconsistent with the client’s expectation of ’s semantics, but it is not disallowed by any constraints in the SMT query. Using this assumption, Z3 may return the following counterexample: . This counterexample, which is obviously incongruous with the intended semantics of and , occurs because the expected relationship between hd and mem is lost when the predicates are embedded as uninterpreted functions in the SMT query. To overcome this problem, we need stronger specifications for the library methods, defined in terms of these method predicates, that are sufficient to imply the desired client postcondition. In particular, these specifications should rule out spurious unsafe executions such as the counter-example given above. 
In the (quite likely) scenario that such library specifications are not already available, a reasonable fallback is to infer some specifications for these functions that are strong enough to ensure the safety of the client. Traditional approaches to specification inference usually adopt a closed-world assumption in which specifications of library methods are discovered in isolation, independent of the client context in which they are being used. Such assumptions are not applicable here since (a) we do not have access to the library’s method implementations and (b) the nature of the specifications we need to infer are impacted by the verification demands of the client. In this setting, some form of data-driven inference (Miltner et al., 2020; Padhi et al., 2016; Zhu et al., 2016) can be beneficial. Such an approach may be tailored to the client context in which the library methods are used, postulating candidate specifications for library methods based on observations of their input-output behavior. Unfortunately, completely blackbox data-driven approaches are susceptible to overfitting on the set of observations used to train them, and can thus discount reasonable and safe behaviors of the underlying library functions. To address the problem of overfitting, we might instead consider attacking this problem from a purely logical standpoint, treating specification inference as an instance of a multi-abductive inference problem (Albarghouthi et al., 2016) that tries to find formulae , , and such that and yet which are sufficient to prove the desired verification condition. While such problems have been previously solved over linear integer arithmetic constraints (Albarghouthi et al., 2016) using quantifier elimination, these prior techniques cannot be directly applied to formulae with uninterpreted function symbols like the method predicates (e.g., and ) used to encode library method specifications in our setting. 
In this work, we combine aspects of these data-driven and abductive approaches in a way that addresses the limitations each approach has when considered independently. Our technique uses SMT-provided counterexamples to generate infeasible interpretations of these predicates (similar to other abductive inference methods) while using concrete test data to generate feasible interpretations (similar to data-driven inference techniques). This combination yields a novel CEGIS-style inference methodology that allows us to postulate specifications built from method predicates sufficient to prove the postcondition in a purely blackbox setting. The specifications learned by this procedure are guaranteed to be both consistent with the observed input-output behavior of the blackbox library implementations and safe with respect to the postcondition of the client program. As there may be many such specifications, we also endeavor to find a maximal one that is at least as weak as every other safe and consistent specification, in order to avoid overfitting to observed library behaviors. Our algorithm applies another data-driven weakening procedure to find these maximal specifications. To demonstrate the effectiveness of our approach, we have implemented a fully automated abductive specification inference pipeline in OCaml called Elrond (see Figure 1). This pipeline takes as input (a) an OCaml client program that may call blackbox library code defined over algebraic datatypes like lists, trees, heaps, etc.; (b) assertions about the behavior of this client program; and, (c) a set of method predicates (e.g., hd or mem), along with their (possibly blackbox) implementations, that are used to synthesize library method specifications.
It combines tests and counterexample-guided refinement techniques to either generate a set of maximal specifications for the library methods used by the client program, or a counterexample that demonstrates a violation of the postcondition. The notion of “weakest” used in our definition of maximal is bounded by the “shape” of specifications (e.g., the number of quantified variables, the set of method predicates, etc.) and a time bound. Our results over a range of sophisticated data-structure manipulating programs, including those drawn from, e.g., Okasaki (1999), show that Elrond is able to discover maximally-weak specifications (as determined by an oracle executing without any time constraints) for the vast majority of applications in our benchmark suite within one hour. Our key contribution is thus a new abductive inference framework that fuses automated data-driven methods and counterexample-guided refinement techniques, tailored to specification inference for libraries that make use of rich algebraic datatypes. Specifically, we:
Frame client-side verification as a multi-abduction inference problem that searches for library method specifications that are both consistent with the method’s implementation and sufficient to verify client assertions.
Devise a novel specification weakening procedure that yields the weakest specification among the collection of all safe and consistent ones with respect to a given set of quantified variables and method predicates.
Evaluate our approach in a tool, Elrond, which we use to analyze a comprehensive set of realistic and challenging functional (OCaml) data structure programs. An artifact containing this tool and our benchmark suite is publicly available (Zhou et al., 2021).
The remainder of the paper is structured as follows. The next section presents an overview of our approach using a detailed example to motivate its key ideas. A formal characterization of the problem is given in Section 3. Section 4 defines how a data-driven learning strategy can be used to perform inference. A detailed presentation of the algorithm used to manifest these ideas in a practical implementation is given in Section 5. Details of our implementation and evaluation results are explained in Section 6. Related work and conclusions are given in Sections 7 and 8.
2. Overview and Motivation
We divide the inference of maximal library specifications into two stages, which are represented as the “Specification Inference” and “Weakening” components in Figure 1. Both stages leverage data-driven learning to overcome the lack of a purely logical abduction procedure for our specification language. The initial inference stage learns a set of safe and consistent specifications from a combination of concrete tests and verifier-provided counterexamples. The next stage then weakens these specifications by iteratively augmenting this data set with additional safe behaviors until a set of maximal specifications are found.
Figure 2 provides a more detailed depiction of the initial specification inference stage in Elrond. Starting from an initial set of maximally permissive specifications, this stage iteratively refines the set of candidate specifications until either a set of safe and consistent solutions or a counterexample witnessing an unsafe execution is found. Each iteration first uses a property-based sampler, e.g. QuickCheck (Claessen and Hughes, 2011), to look for executions of the blackbox library implementations that are inconsistent with the current set of inferred specifications. The reliance on a generator to provide high-quality tests provides yet more motivation for the subsequent weakening phase, in order to ensure that the final abduced specifications are not overfitted to or otherwise biased by the tests provided by the generator. At the same time, we also observe that the sorts of shape properties (e.g., membership and ordering) used in our specifications and assertions are relatively under-constrained and are thus amenable to property-based random sampling. We do not ask, for example, for QuickCheck to generate inputs satisfying non-structural properties like “a list whose 116th element is equal to 5.” Any tests that are disallowed by the current solution are passed to a learner which uses them to generalize the current specification. If no inconsistencies are detected, Elrond attempts to verify the client against the candidate specifications using a theorem prover. If the inferred specifications are sufficient to prove client safety, the loop exits, returning the discovered solution. If not, the verifier has identified a model that represents a potential safety violation. The model is then analyzed in an attempt to extract test inputs that trigger a safety violation. If we are unable to find such a counterexample, the model is most likely incongruous with the semantics of the method predicates and is thus spurious.
In this case, the model is passed to the learner so that it can be used to strengthen candidate specifications, preventing this and similar spurious counterexamples from manifesting in subsequent iterations.
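The refinement loop just described can be sketched abstractly. The following OCaml skeleton is illustrative only — the parameter names and types are our own, not Elrond's actual interfaces — with the property-based sampler, the verifier, and the learner all passed in as function parameters:

```ocaml
(* Sketch of the counterexample-guided inference loop (Figure 2).
   sample_violation: finds a concrete test the current spec wrongly forbids.
   verify: returns None on success, or Some model (a candidate counterexample).
   concretize: tries to realize a model as real inputs triggering a violation.
   generalize / strengthen: the learner's two modes. *)
type ('spec, 'input) outcome = Solution of 'spec | Unsafe of 'input

let infer ~sample_violation ~verify ~concretize ~generalize ~strengthen init =
  let rec loop spec =
    match sample_violation spec with
    | Some test ->
        (* an observed library behavior the current spec forbids: weaken *)
        loop (generalize spec test)
    | None ->
        (match verify spec with
         | None -> Solution spec            (* client postcondition proved *)
         | Some model ->
             (match concretize model with
              | Some input -> Unsafe input  (* genuine safety violation *)
              | None -> loop (strengthen spec model)  (* spurious: strengthen *)))
  in
  loop init
```

As a toy instantiation, taking specifications to be integers that must reach a threshold shows the loop terminating with a solution once the sampler and verifier are both satisfied.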
While the previous loop is guaranteed to return safe and consistent solutions, it may find specifications that are nonetheless too strong with respect to the underlying library implementation. This occurs when the property-based sampler fails to find a test that identifies an inconsistent specification, which may happen when the input space of a library function is very large. To combat overfitting specifications to test data, candidate solutions are iteratively weakened using the data-driven counterexample-guided refinement loop depicted in Figure 3. The data in this phase is supplied by the underlying theorem prover rather than a concrete test generator. Each iteration of the refinement loop first attempts to find a safe execution of the client program that is disallowed by the current set of specifications. If no such execution can be found, the specifications are maximal and the loop terminates. Otherwise, the identified execution is passed to a learner, which uses it to generalize the candidate solution so that the execution is permitted before continuing the refinement loop. The learner always generalizes candidate specifications, maintaining the invariant that the current solution is consistent with all previously observed library behaviors.
2.1. Elrond in action
To illustrate our approach in more detail, we apply it to the stack concatenation example from the introduction. Given the postcondition and the implementation of from Section 1, Elrond generates a formula that can be simplified to the following implication:
The four predicates in the premise of this formula correspond to the four library functions (, , and ) invoked in a recursive call to concat. The specification of a blackbox library function in our assertion logic is represented as a placeholder predicate: an uninterpreted predicate that relates the parameters of to its return value. For a library function , we adopt a naming convention of and for its placeholder predicates and return values, respectively. The predicate in the above formula, for example, says the variable holds the return value of the call to Stack.top s1 in concat. The result of the recursive call to concat is similarly denoted as . The conclusion of the formula encodes the expected verification condition for a recursive call to concat: namely, that if the result of the call to Stack.is_empty s1 is false and the recursive call to concat (Stack.tail s1) s2 satisfies , then the result of concat must also satisfy . The remainder of this section refers to the premise and conclusion of this implication as and , respectively.
From a logical standpoint, the method predicates used in are simply uninterpreted function symbols which have no intrinsic semantics. This representation allows our specifications to use predicates whose semantics may be difficult to encode directly in the logic. Embedding recursively defined predicates like mem, for example, requires particular care (Zhu et al., 2016). In order to ensure that the specifications inferred by Elrond are tethered to reality, users must also supply Elrond with implementations (possibly blackbox) of these predicates. One possible implementation for and is:
where Stack.is_empty, Stack.top and Stack.tail refer to blackbox implementations of stack library methods. While it is possible to naïvely include a method predicate for each library method, such an approach may not be useful for verification. Some functions may be irrelevant to client assertions, unnecessarily increasing the set of possible specifications that must be considered. Conversely, the library may not include functions for desirable predicates; e.g., the library does not provide a Stack.mem function, although it is quite relevant for verifying our running example.
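To make the preceding discussion concrete, here is a hedged sketch of hd and mem written only against the stack operations named above. The list-backed Stack module is a stand-in so the definitions run standalone; in Elrond, these operations are blackbox library calls:

```ocaml
(* Stand-in Stack module for illustration; Elrond treats these as blackbox. *)
module Stack = struct
  type 'a t = 'a list
  let is_empty s = (s = [])
  let top s = List.hd s
  let tail s = List.tl s
end

(* hd u s: u is the head element of stack s; uses only is_empty and top *)
let hd u s = not (Stack.is_empty s) && Stack.top s = u

(* mem u s: u occurs somewhere in stack s; additionally uses tail *)
let rec mem u s =
  (not (Stack.is_empty s)) && (Stack.top s = u || mem u (Stack.tail s))
```

Note that hd is definable from is_empty and top alone, while mem must additionally recurse with tail, matching the observation in Section 1.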
Our ultimate goal is to find a mapping from each placeholder predicate in to an interpretation that entails . We refer to such a mapping as a verification interface; Figure 4 presents a potential verification interface for our running example. Not every mapping that ensures the safety of is reasonable, however. At one extreme, interpreting every predicate as ensures the safety of the client, but does not capture the behavior of any sensible stack implementation. Our goal, then, is to find interpretations that are general enough to cover a range of possible implementations. From a purely logical perspective, this is an instance of a multi-abductive inference problem that tries to find the weakest interpretations of , , and in terms of predicates mem and hd such that the interpretations are self-consistent (i.e. ) and which are sufficient to prove the desired verification condition. While solutions to the multi-abduction problem have been developed for domains that admit quantifier elimination, e.g. linear integer arithmetic constraints (Albarghouthi et al., 2016), there is no purely logical solution for formulae involving equalities over uninterpreted functions. An additional challenge in our setting is that we seek to infer specifications consistent with the library’s implementation, a requirement that is absent in (Albarghouthi et al., 2016).
2.2. Data-Driven Abduction
We overcome these challenges by adopting a data-driven approach to abducing maximal library specifications, framing the problem as one of training a Boolean classifier on a set of example behaviors for each function . Under this interpretation, a classifier represents a specification of the “acceptable” behaviors of . Thus, the goal of the specification inference stage of our algorithm is to learn a set of classifiers that recognize the behaviors of each that a) are consistent with ’s underlying implementation and b) preserve the safety of the client program. As discussed at the start of this section, this algorithm uses both an SMT-based verifier and a property-based test generator as sources of training data for our learner. The former identifies example behaviors consistent with ; these are labelled as “negative” examples, so that the learned specifications can help the solver rule out behaviors that are inconsistent with the semantics of the method predicates or that produce unsafe executions (i.e., interpretations that would violate the postcondition). Example behaviors drawn from tests are labelled as “positive”, so that our learner is biased towards explanations that are consistent with the (unknown) implementations of library functions. Notably, our algorithm generalizes this data-driven abduction procedure for individual functions to the multi-abduction case, ensuring that discovered interpretations are globally consistent over all library methods.
Figure 5 depicts the space of example behaviors for our learner, as well as four potential verification interfaces. Each of these represents a potential solution in the hypothesis space for this learner, which is tasked with building a classifier that separates negative (-) and positive (+) examples for each library function. The dashed purple line, labelled , represents an unsafe verification interface that allows a client program to violate the desired postcondition. The remaining red lines represent the range of safe verification interfaces. The two dashed red lines represent the verification interfaces that are sufficient to verify the client, but which are suboptimal. is safe but inconsistent with the observed behaviors of the library implementation, and is thus overly restrictive. is safe and consistent, but not maximal, as there exists a weaker verification interface () in the hypothesis space that is still safe and consistent. Intuitively, the goal of our first phase is to identify , which is then weakened by the second phase to produce .
Our learner limits the shape of solutions it considers so that inferred specifications are both amenable to automatic verification and strong enough to verify specified postconditions. To enable automated verification of client programs, potential specifications are required to be prenex universally-quantified propositional formulae over datatype values and variables representing arguments to the predicates under consideration. Some possible specifications of the library function in our running example include:
which contains, among other candidates, the desired specification. All atomic literals in generated formulae are applications of uninterpreted method predicates and equalities over quantified variables, parameters, and return values of functions. The literals in the above formulae are simply applications of hd and mem to and , and the equality . We automatically discard equalities between terms of different types, e.g. . The feature set for the predicate , i.e. the set of atomic elements used to construct its specification, is thus: .
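The construction of this feature set can be pictured as a simple well-typed enumeration. The sketch below is illustrative (the feature representation and all names are our own): every method predicate is applied to each element/stack pair, and equalities are generated only between terms of the same type, so ill-typed equalities are never produced:

```ocaml
(* Sketch: enumerate the feature set for a placeholder predicate.
   preds are method predicate names, elems are element-typed terms
   (quantified variables and return values), stacks are stack-typed terms. *)
type feature =
  | Pred of string * string * string   (* predicate, element arg, stack arg *)
  | Eq of string * string              (* equality between element terms *)

let feature_set ~preds ~elems ~stacks =
  let apps =
    List.concat_map (fun p ->
        List.concat_map (fun u ->
            List.map (fun s -> Pred (p, u, s)) stacks)
          elems)
      preds
  in
  (* one equality per unordered pair of distinct element terms *)
  let eqs =
    List.concat_map (fun u ->
        List.filter_map (fun u' -> if u < u' then Some (Eq (u, u')) else None)
          elems)
      elems
  in
  apps @ eqs
```

For instance, with predicates hd and mem, element terms for a quantified variable and a return value, and a single stack argument, this enumeration yields four predicate applications plus one equality.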
We now consider how to represent library behaviors in a form that is useful for learning a solution in our hypothesis space. To illustrate our chosen representation, consider the counterexample produced by an off-the-shelf theorem prover when asked to verify the formula from our running stack example, where the set of candidate specifications are initialized to true (i.e., ):
Intuitively, this counterexample asserts that the stacks , , and contain exactly one element, the constant , and the other two stacks, and , are empty. This assertion indeed violates the second conjunct of the postcondition, , but it is inconsistent with the expected semantics of the library functions, and can thus be safely ignored. The verifier generates this counterexample because the interpretations of the placeholder predicates in are too permissive. In order for the verifier to rule out this counterexample, the placeholder predicates need to be strengthened to rule out this inconsistent behavior. Ignoring how we identify this counterexample as spurious for now, note that there are many ways to strengthen these specifications. One approach is to focus on one particular function at a time. For example, we could choose to refine the specification of so that it guarantees that is a member of . Alternatively, we could focus on , ensuring the members of are also contained by . In general, however, it may be necessary to strengthen multiple specifications at once. Therefore, instead of focusing on one specification at a time, we learn refined specifications simultaneously.
Potential negative feature vectors extracted from Cex.
The first step to refining our placeholder specifications is to extract data from Cex in a form that can be used to train a classifier. We do this by using the assignments to the arguments of the placeholder predicates in a counterexample to build feature vectors that describe the valuations of method predicates and equalities in the unsafe execution. Table 1 presents the feature vectors extracted from Cex. The first column of this table indicates the particular placeholder predicates that can be strengthened to rule out this counterexample. The second column gives the feature vectors for a particular instantiation of the quantified variables () of the placeholders. The subsequent columns list applications of method predicates, with the rows underneath listing the valuation of these predicates in the offending run. The second row corresponds to the assertion that , for example. A strengthening of the specifications that disallows any one of these interpretations will also rule out the corresponding unsafe run of the program. Put another way, each row corresponds to a potential negative feature vector, and a classifier (i.e., specification) for the corresponding placeholder that disallows this feature will disallow the counterexample. The designation of these features as potentially negative is deliberate, as we only want to disallow features that are inconsistent with the implementation of the library functions. As an example, the first feature vector for (the second row of Table 1) states that the result of top s1 is not the head element of the input stack (since is true and is false), and thus is inconsistent with any reasonable implementation of . In contrast, the next feature vector is compatible with an execution of where, e.g. and . The second feature vector represents a behavior that is consistent with the underlying library implementation and that should be allowed by the learned specification. 
The consistency checker generates this positive training data via random testing of the client program. To see how we extract positive feature vectors from training data, consider the execution of concat with the inputs and , which produces the following assignment to program variables:
Similar to how we built negative feature vectors from Cex, we can construct feature vectors for each function specification from these assignments. Table 2 illustrates the feature vectors corresponding to this assignment. Consider the second row where is instantiated with : under this assignment, is true, as are and .
In order to train a classifier, we need to label the extracted feature vectors as either positive or negative. In other words, we need to identify behaviors that should (and should not) be allowed by the inferred specification. Assigning labels is not as straightforward as labelling the feature vectors extracted from counterexamples as negative and those extracted from testing as positive, as the two sets can overlap. We observe this in the vectors of from Table 1 and Table 2: the interpretation occurs in both tables. Intuitively, we do not want to strengthen the specification of to rule out this interpretation, as the positive sample is a witness that this execution is consistent with the implementation of . Ultimately, therefore, the specification must be relaxed to allow the execution. In cases where a negative feature vector conflicts with a positive feature vector, we identify the potential negative feature vector as positive and remove it from the learner’s negative feature vector set. This strategy is similar to the one used by Miltner et al. (2020) to deal with inductiveness counterexamples. Thankfully, as long as the counterexample set contains at least one feature vector not known to be consistent with the underlying library implementation, we can strengthen the specifications to disallow it. The first feature vector for in Table 1 represents one such infeasible execution. This vector encodes the case where is the output of but is not a member or head of the input stack. Clearly, no reasonable implementation would support such an interpretation. We use this observation to label as “negative” those feature vectors that are extracted from a counterexample but do not appear in the set drawn from a concrete execution. Table 3 shows the partition of positive and negative feature vectors for . Given this labelled set of positive and negative feature vectors, our data-driven learner builds a separator over the training data. 
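The labelling rule just described admits a direct sketch (representing feature vectors as boolean lists; the function name and types are our own): a vector drawn from a counterexample is labelled negative only if it was never observed in a concrete execution, and overlapping vectors are treated as positive.

```ocaml
(* Sketch: partition training data for the learner.
   from_cex: potential negative vectors extracted from a counterexample.
   from_tests: positive vectors extracted from concrete executions.
   A cex vector that also appears in the test data is a witnessed,
   feasible behavior, so it must not be labelled negative. *)
let label ~from_cex ~from_tests =
  let negatives =
    List.filter (fun v -> not (List.mem v from_tests)) from_cex
  in
  (from_tests, negatives)   (* (positive, negative) training sets *)
```

In the running example, the vector shared by Table 1 and Table 2 would land in the positive set, while the infeasible first vector of Table 1 survives as negative training data.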
One such classifier (formula) for the data in Table 3 is . Substituting similarly learned specifications for the other library functions in equips the SMT solver with enough constraints to rule out Cex while maintaining the invariant that the learned specifications are also consistent with the underlying library implementations . Additional iterations of this counterexample-guided refinement loop gather additional positive and negative features, eventually producing the specifications for the library functions presented in Figure 4.
Identifying Spurious Counterexamples
Thus far, we have only considered spurious counterexamples generated by the safety checker. Counterexamples, however, can also result from an incorrect client assertion. For example, suppose the client (unsoundly) asserts :
This assertion is wrong, assuming reasonable implementations of and , as the elements of the result stack can also come from . We distinguish counterexamples corresponding to actual safety violations by first checking if all the feature vectors extracted from the counterexample are included in the set of known positive feature vectors. For example, given this unsound assertion, the verifier may produce the following counterexample:
Table 4 shows all of the feature vectors for that are extracted from this counterexample . Since these are a subset of the positive feature vectors from Table 2, there are no feature vectors that can be labeled as negative, and there are thus no new bad behaviors that the learner can use to generate a refined specification mapping that rejects the counterexample. In this scenario, our algorithm tries to discover concrete values of and consistent with Cex’; that is, only contains and is empty. When called with these parameters, will return , which Elrond returns as a witness of an unsafe execution. Note that this situation may also occur when the feature set is not large enough, as the specifications in the corresponding hypothesis space are not expressive enough to identify a spurious counterexample. Thus, if we are not able to find inputs that trigger a safety violation, we grow the feature set by increasing the number of quantified variables (e.g. from to ) so that Elrond can explore a richer space of specifications.
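The first step of this check can be sketched directly (names are our own): a counterexample yields no new negative training data exactly when every feature vector extracted from it is already known-positive, at which point Elrond falls back to searching for concrete inputs that realize it.

```ocaml
(* Sketch: when all of a counterexample's feature vectors are contained in
   the known-positive set, no specification in the current hypothesis
   space can rule the counterexample out. *)
let no_new_negatives ~cex_vectors ~positives =
  List.for_all (fun v -> List.mem v positives) cex_vectors
```

When this test succeeds but no violating concrete inputs can be found, the feature set is grown (e.g., by adding quantified variables) and inference resumes in the richer hypothesis space.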
While the above strategy is guaranteed to find a safe and consistent verification interface when one exists, the solutions it produces may still be suboptimal, as illustrated in Figure 5. For example, the first conjunct of the specification for push in Figure 4 states that any existing member of both the input and output stacks should not be the same as the element being added to the stack; that is, push always produces a stack with no duplicates. This specification is too restrictive, however, as it disallows reasonable behaviors such as push([1;2], 1) = [1;1;2]. If our sampler never generates an observation corresponding to this behavior, e.g. , and , the candidate specification for push produced by Elrond’s first phase will incorrectly disallow it. In other words, our reliance on testing to identify and label negative feature vectors may result in initial specifications that are overfitted to the examples enumerated by the test generator. There are two potential reasons such a positive example might be missed: (1) the input space of the program might be too large for a test generator to effectively explore, and (2) the provided implementation may simply not exhibit this behavior (e.g., it may be the case that the implementation of push that we are trying to verify against does indeed remove duplicates). While exhaustive or more effective enumeration can address the first cause, it cannot remedy the second. Elrond’s weakening phase helps ameliorate both issues. Our weakening algorithm iteratively weakens candidate specifications, focusing on one library function at a time. To weaken the specification of push, for example, we fix the specifications of the other library functions to their assignments in the current verification interface, and then try to find a maximal weakening of that admits a larger set of implementations of push.
To do so, Elrond attempts to discover additional weakening feature vectors for , or feature vectors corresponding to behaviors disallowed by the current specification but which would not lead to a violation of client safety. One possible weakening feature vector for our current example is shown in Table 5. Here, the head of both and the result of the recursive call is ; this scenario is mistakenly disallowed by the specification of push in Figure 4.
Elrond repeatedly queries the verifier to identify weakening feature vectors for push, concluding that the current specification is maximal when none can be found. It then moves on to the next function specification, iteratively weakening each until a fixpoint is reached. Figure 6 shows the maximal verification interface Elrond builds by weakening the candidate specifications in Figure 4. Compared with Figure 4, the specification of push now permits duplicate elements in stacks. In addition, the specification of is_empty has been simplified by removing a redundant conjunct that can never be violated by a concrete stack value, thanks to the observed semantics of hd and mem.
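Elrond's actual weakening algorithm is driven by verifier queries, but its fixpoint structure can be sketched with a toy model. In the sketch below (all names and encodings are illustrative, not the tool's API), a specification is a set of allowed feature vectors, a weaker specification is a superset, and a weakening feature vector is any safe vector the current specification disallows:

```python
def weaken_interface(interface, safe_vectors):
    """Iteratively weaken each function's specification (modeled as a set of
    allowed feature vectors; a superset is a logically weaker spec) until no
    function can admit another safe vector, i.e., a fixpoint is reached."""
    changed = True
    while changed:
        changed = False
        for f, spec in interface.items():
            # a "weakening feature vector": permitted by safety, not yet by spec
            extra = safe_vectors[f] - spec
            if extra:
                spec.add(extra.pop())
                changed = True
    return interface

# toy example: the initial specs are overfitted subsets of the safe behaviors
interface = {"push": {0}, "top": {2}}
safe = {"push": {0, 1}, "top": {2, 3}}
result = weaken_interface(interface, safe)
print(result)  # each spec grows until it covers every safe vector
```

In the real algorithm the set of safe vectors is not known a priori; each candidate weakening must instead be validated by re-running the verifier.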
3. Problem Formulation
Having completed a high-level tour of Elrond in action, we now present a precise description of the specification synthesis problem and our data-driven inference procedure. We consider functional programs that use data structure libraries providing functions to access and construct instances of inductively-defined algebraic datatypes (e.g., lists, stacks, trees, heaps, tries, etc.). In the remainder of the paper, we use Φ to refer to the verification query whose validity we are attempting to establish. These queries serve the same role as verification conditions in a typical verification framework. The first component of this query, ψ, is a conjunction of applications of specification placeholders (written I_f for a library function f) to arguments; these represent the library method calls made by the client program. The second component, φ, represents the client program's pre- and post-conditions, encoded as sentences built from logical connectives (∧, ∨, ¬) over prenex universally-quantified propositional formulae. Each verification query corresponds to a control-flow path in the client program; the full algorithm considers the conjunction of all these verification queries at once. To keep the formalization and the description of our algorithms concise, our description considers a single verification query in isolation. The extension to sets of verification queries is provided in the supplementary material.
Definition 3.1 (Problem Definition).
A given verification query Φ with unknown library functions f1, …, fn has the form Φ ::= ψ ⟹ φ, where:
Here, the equality constraints are either between program variables or between variables and constants of some base type (e.g., Booleans and integers). Each conjunct in ψ is an application of a placeholder predicate I_f to some arguments; the conjunction of these placeholder applications and equality constraints represents a sequence of library method invocations in one control-flow path of the client.
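To make this structure concrete, a verification query can be pictured as plain data: a premise conjoining placeholder applications with equality constraints, and a conclusion encoding the client's obligation. The sketch below uses illustrative names only, not Elrond's internal representation:

```python
from dataclasses import dataclass

@dataclass
class PlaceholderApp:
    func: str    # library function this placeholder stands for
    args: tuple  # program variables passed to (and returned by) the call

@dataclass
class Query:
    apps: list        # conjunction of placeholder applications (psi)
    equalities: list  # pairs: variable/variable or variable/constant
    obligation: str   # the client's pre/post-condition formula (phi)

# one hypothetical control-flow path of the stack-concat client
q = Query(
    apps=[PlaceholderApp("top", ("s1", "x")),
          PlaceholderApp("push", ("x", "s3", "s4"))],
    equalities=[("b", "false")],
    obligation="hd(s4, x)",
)
print(len(q.apps))  # two library calls on this path
```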
To model the input and output behaviors of the blackbox implementations of library functions and method predicates, our formalization relies on a pair of partial functions with the same signature as the implementations. We use partial functions to reflect the fact that we can only observe a subset of the full behaviors of these implementations when searching for specifications.
Definition 3.2 (Specification Configuration).
Let P be a set of method predicates and F be the set of functions in a library used by the client. For each method predicate p in P, let ν_p be a partial function from the domain of p to its codomain, and for each f in F, let ν_f be a partial function with the same signature as f. Let ν_P = {ν_p | p ∈ P} and ν_F = {ν_f | f ∈ F}. A specification configuration is a 5-tuple ⟨Φ, P, F, ν_P, ν_F⟩, where Φ is the verification query extracted from the client.
Example. The specification configuration of our running example consists of a verification query Φ, a method predicate set {hd, mem}, and library functions is_empty, push, top, and tail. The partial functions in ν_P and ν_F abstract over observations on their corresponding blackbox implementations; for an execution that produces the feature vectors shown in Table 2, they are:
where stack arguments are limited to the values observed in this particular execution. Given a specification configuration as input, the output of our verification pipeline is a verification interface Δ, a logical interpretation of the specification placeholders that maps each placeholder predicate I_f for a library function f to a universally-quantified propositional formula over the parameters and result of f. We impose two requirements on Δ. The first is safety: an underlying theorem prover (e.g., an SMT solver) must be able to prove Δ(Φ), where Δ(Φ) denotes the formula constructed by replacing all occurrences of specification placeholders with their interpretations in Δ, and Φ is the verification query built from the client program:
Definition 3.3 (Safe Verification Interface).
For a given verification query Φ ::= ψ ⟹ φ, a verification interface Δ is safe when:
it makes the verification condition valid: ⊨ Δ(Φ), and
Δ is not trivial: the interpreted premise Δ(ψ) is satisfiable.
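The safety check can be pictured with a toy propositional model standing in for an SMT solver: substitute each placeholder with its interpretation, then check validity by enumerating truth assignments. All names here (the atoms, the spec, the obligation) are illustrative assumptions, not the actual query:

```python
from itertools import product

# atoms: truth values of method-predicate applications, e.g. hd(s, x)
atoms = ["hd_s_x", "mem_s_x"]

def valid(formula):
    """Brute-force validity: the formula holds under every assignment."""
    return all(formula(dict(zip(atoms, bits)))
               for bits in product([False, True], repeat=len(atoms)))

def satisfiable(formula):
    return any(formula(dict(zip(atoms, bits)))
               for bits in product([False, True], repeat=len(atoms)))

# interpretation substituted for the placeholder: hd(s, x) => mem(s, x)
spec = lambda env: (not env["hd_s_x"]) or env["mem_s_x"]
# the client's obligation on this toy path happens to coincide with it
post = lambda env: (not env["hd_s_x"]) or env["mem_s_x"]

# Delta(Phi): interpreted premise implies the client obligation
phi = lambda env: (not spec(env)) or post(env)
print(valid(phi))        # safe: the interpreted query is valid
print(satisfiable(spec)) # non-trivial: the interpreted premise is satisfiable
```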
In addition to safety, we also desire that any proposed mapping Δ be consistent with the provided implementations of method predicates and library functions, i.e., that Δ must accurately represent their observed behavior. Formally:
Definition 3.4 (Interface Consistency).
A verification interface Δ is consistent with ν_P and ν_F when all specifications in Δ are consistent with the inputs on which ν_P and ν_F are defined. Formally,
The first expression denotes the instantiation of the formula bound to I_f in Δ with the input arguments and observed output of ν_f. The second expression replaces all free occurrences of each method predicate p in that formula with its observed interpretation ν_p.
This definition thus relates the observed behavior of a library method on test data, encoded by ν_P and ν_F, with its logical characterization provided by Δ. Note that there may be many possible verification interfaces for a given specification configuration. In order to identify the best such interface, we use an ordering based on a natural logical inclusion property:
Definition 3.5 (Interface Order).
The verification interface Δ′ is weaker than Δ (written Δ ≺ Δ′) when:
The two interfaces contain the same functions: dom(Δ) = dom(Δ′).
They are not equal: Δ ≠ Δ′.
The specifications in Δ′ are logically weaker than those in Δ: for every f in dom(Δ), the implication Δ(f) ⟹ Δ′(f) is valid.
Intuitively, weaker verification interfaces are preferable because they place fewer restrictions on the behavior of the underlying implementation. Given an ordering over verification interfaces, we seek to find the weakest safe and consistent interface, i.e. one that imposes the fewest constraints while still enabling verification of the client program.
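When specifications are predicates over a small, finite set of feature vectors, the interface order can be decided by brute force. The sketch below is illustrative, not Elrond's implementation: it checks that each specification in the first interface implies its counterpart in the second, and that the two interfaces differ somewhere:

```python
from itertools import product

def weaker(delta, delta_prime, n_features):
    """True when delta_prime is strictly weaker than delta: same functions,
    pointwise implication delta(f) => delta_prime(f), and not identical."""
    assert delta.keys() == delta_prime.keys()   # same functions
    vectors = list(product([False, True], repeat=n_features))
    implies = all((not delta[f](v)) or delta_prime[f](v)
                  for f in delta for v in vectors)
    differ = any(delta[f](v) != delta_prime[f](v)
                 for f in delta for v in vectors)
    return implies and differ

# toy specs over two features: a duplicate-free push vs. a permissive push
strict = {"push": lambda v: v[0] and v[1]}
loose  = {"push": lambda v: v[0]}
print(weaker(strict, loose, 2))   # True: loose places fewer restrictions
print(weaker(loose, strict, 2))   # False: the order is strict
```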
Definition 3.6 (Maximal Verification Interface).
For a specification configuration ⟨Φ, P, F, ν_P, ν_F⟩, Δ is a maximal verification interface when:
Δ is safe for the verification query Φ.
Δ is consistent with ν_P and ν_F.
For a given bound k on the number of quantified variables used by the specifications in Δ, there is no safe and consistent interface Δ′ whose specifications use at most k quantified variables such that Δ ≺ Δ′.
We now refine our expectation for the output of our verification pipeline to be not just any safe and consistent verification interface, but also a maximal one. Notice that our notion of maximality is parameterized by the number of quantified variables used in the interpretation. As this bound increases, we can always find a weaker specification mapping. Thus we frame our definition of maximality to be relative to the number of quantified variables in the specification.
4. Learning Library Specifications
As Section 2 outlined, Elrond frames the search for a safe verification interface as a data-driven learning problem. At a high level, the goal of learning is to build a classifier (a function from unlabeled data to a label) from a set of labeled data. More precisely, our goal is to learn classifiers for each of the library functions in a specification configuration that can correctly identify any input and output behavior that could induce an unsafe execution in the client. Our first challenge is to find an encoding of program executions that is amenable to a data-driven learning framework. To begin, we need to identify the salient features used by a classifier to make its decisions.
Definition 4.1 (Feature).
A feature of a function f for a set of variables V is a method predicate applied to elements of V or an equality between variables in V.
A feature is similar to a literal in first-order logic, but does not allow method predicates as arguments or constant arguments.
Definition 4.2 (Feature Set).
The feature set of a function f with method predicates P and quantified variables u1, …, uk is the list of all well-typed features for the set of variables consisting of f's arguments, its result, and u1, …, uk, which is minimally linearly independent:
Example. Consider the feature set for the push function from the Stack library for the predicate set {hd, mem}, the equality operation, and quantified variables. Features that apply a predicate or equality at the wrong type, e.g., equating a stack with an element, are not included in this set because they are not well-typed. A feature that can be represented as a combination of other features is likewise omitted, as it is not linearly independent with respect to the other features in the set. We use feature vectors to encode the features of observed tests:
Definition 4.3 (Feature Vector).
A feature vector is a vector of Booleans that represents the value of each feature in the feature set for some test.
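As an illustration, the sketch below evaluates a plausible (hypothetical) feature set for push against one concrete observation to produce a feature vector; the actual feature set is determined by Definition 4.2, and the predicate implementations here are toy list-based stand-ins:

```python
def hd(stack, x):
    """Toy method predicate: x is the head of the stack (a Python list)."""
    return len(stack) > 0 and stack[0] == x

def mem(stack, x):
    """Toy method predicate: x is a member of the stack."""
    return x in stack

def feature_vector(s_in, elem, s_out, u):
    # a hypothetical feature set for push(elem, s_in) = s_out with one
    # quantified variable u
    return (hd(s_in, u), hd(s_out, u),
            mem(s_in, u), mem(s_out, u), u == elem)

# observation: push(1, [2; 3]) = [1; 2; 3], instantiated at u = 1
print(feature_vector([2, 3], 1, [1, 2, 3], 1))
```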
We also need to define the hypothesis space of possible solutions considered by our learning system. To easily integrate learned classifiers into the underlying theorem prover, we choose to represent such solutions as Boolean combinations over terms consisting of applications of interpreted base relations and uninterpreted functions. In order to preserve decidability, we limit this space to a subset of effectively propositional sentences. This limitation was expressive enough for all of our benchmarks. (The main sorts of properties that we do not support as a consequence of this choice are those which use quantifier alternation, e.g., for every element in a stream, there exists another larger element that appears after it.)
Definition 4.4 (Hypothesis Space).
The hypothesis space of specifications for a library function f, method predicate set P, and quantified variables u1, …, uk is the set of formulas in prenex normal form with the quantifier prefix ∀u1 … ∀uk, and whose bodies are built from the feature set of f, the logical connectives ∧, ∨, ¬, and the Boolean constants ⊤ (true) and ⊥ (false).
In order to classify feature vectors, we ascribe them a semantics in logic:
Definition 4.5 (Unitary classifier).
For a given feature vector α in a feature set, the logical embedding of α is a formula encoding the assignment to its features: the conjunction of every feature that α sets to true together with the negation of every feature that α sets to false.
We say that a classifier θ labels a feature vector α as positive when the logical embedding of α entails θ, and negative otherwise.
Example. Given the classifier inferred in Section 2, the first row in Table 3 corresponds to a feature vector that is labelled negative, since its unitary classifier is inconsistent with the classifier. The other two feature vectors in Table 3 are labeled as positive.
Definition 4.6 (Classification).
For a given classifier θ and feature set, it is straightforward to partition the feature vectors of the feature set into a positive set and a negative set:
Notice that these two sets are trivially disjoint.
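Since a feature set induces only finitely many feature vectors, this partition can be computed directly by enumeration. A minimal sketch (the classifier and feature count below are illustrative):

```python
from itertools import product

def classify(classifier, n_features):
    """Partition all 2^n feature vectors into positive and negative sets
    according to a classifier given as a Boolean function over the vector."""
    pos, neg = [], []
    for vec in product([False, True], repeat=n_features):
        (pos if classifier(vec) else neg).append(vec)
    return pos, neg

# toy classifier over two features, e.g. hd(out, u) => mem(out, u)
theta = lambda v: (not v[0]) or v[1]
pos, neg = classify(theta, 2)
print(len(pos), len(neg))  # the two sets are disjoint and exhaustive
```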
For a particular configuration, we can straightforwardly lift this partitioning to verification interfaces:
4.1. Learning Safe and Consistent Verification Interfaces
We now confront the challenge of how to generate training data from a specification configuration in a way that guarantees the safety of the learned formulas (classifiers). To do so, we extract feature vectors from a set of logical samples:
Definition 4.7 (Sample).
A sample of a formula is an instantiation of its quantified variables and a Boolean-valued interpretation for each application of a method predicate to those variables in the formula. The positive and negative samples of a verification query Φ ::= ψ ⟹ φ are samples of ψ ∧ φ and ψ ∧ ¬φ, respectively.
Intuitively, the positive samples of a verification query correspond to safe executions of a client program, while negative samples represent potential violations that safe verification interfaces need to prevent. For example, Cex from Section 2 corresponds to the following negative sample of Φ (the interpretations of method predicates are represented as binary relations):
and the following sample, extracted from a concrete input and client execution result, is positive:
Although they come from different sources, both samples provide the values of variables and the values of predicate applications. Using these interpretations, we can extract a collection of feature vectors under a feature set from a sample:
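This extraction step can be sketched as follows. The sample encoding is hypothetical: predicate interpretations are binary relations as in the running example, and each instantiation of the quantified variable yields one feature vector:

```python
def extract(sample, features, universe):
    """Extract the set of feature vectors of a sample: evaluate every
    feature under the sample's variable values and predicate relations,
    once per instantiation of the quantified variable u."""
    vectors = set()
    for u in universe:
        env = dict(sample["vars"], u=u)
        vectors.add(tuple(f(env, sample["preds"]) for f in features))
    return vectors

# a hypothetical feature set over a stack s2, an element x, and u
features = [
    lambda env, p: (env["s2"], env["u"]) in p["hd"],   # hd(s2, u)
    lambda env, p: (env["s2"], env["u"]) in p["mem"],  # mem(s2, u)
    lambda env, p: env["u"] == env["x"],               # u = x
]
sample = {"vars": {"s2": "s2", "x": 1},
          "preds": {"hd": {("s2", 1)}, "mem": {("s2", 1), ("s2", 2)}}}
print(sorted(extract(sample, features, universe=[1, 2])))
```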
Definition 4.8 (Classifier Consistency).
For a verification query Φ, we say that a verification interface Δ is consistent with a negative sample if at least one of the library specifications in Δ classifies one or more feature vectors extracted from that sample as negative:
Similarly, Δ is consistent with a positive sample if all specifications in Δ label every feature vector extracted from it as positive:
Example. The verification interface from Figure 4 is consistent with the negative sample above, as one of its library specifications labels a feature vector extracted from that sample as negative. Furthermore, the interface is also consistent with all the feature vectors extracted from the positive sample.
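The two consistency conditions of Definition 4.8 translate directly into code when specifications are modeled as Boolean functions over feature vectors. The sketch below uses illustrative names and stand-in specifications:

```python
def consistent_negative(interface, vectors_by_func):
    """A negative sample is handled when at least one specification
    rejects at least one feature vector extracted for its function."""
    return any(not interface[f](v)
               for f, vecs in vectors_by_func.items() for v in vecs)

def consistent_positive(interface, vectors_by_func):
    """A positive sample is handled only when every specification
    accepts every feature vector extracted for its function."""
    return all(interface[f](v)
               for f, vecs in vectors_by_func.items() for v in vecs)

# toy interface: push's spec over two features, e.g. hd(out, u) => mem(out, u)
iface = {"push": lambda v: (not v[0]) or v[1]}
neg_sample = {"push": [(True, False)]}               # rejected vector: good
pos_sample = {"push": [(True, True), (False, False)]}
print(consistent_negative(iface, neg_sample))
print(consistent_positive(iface, pos_sample))
```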
Theorem 4.9.
For a given specification configuration and verification interface Δ, Δ(Φ) is valid iff Δ is consistent with all negative samples of Φ; Δ is a consistent interface iff Δ is consistent with all positive samples entailed by ν_P and ν_F, i.e., every positive sample consistent with the observations encoded by ν_P and ν_F. Proofs for all theorems are available in the supplementary material.
4.2. Learning Maximal Verification Interfaces
While Theorem 4.9 identifies the conditions under which a verification interface is safe and consistent, it does not ensure that it is maximal. We frame the search for a maximal solution as a learning problem for a single function specification, assuming all other specifications are fixed.
Definition 4.10 (Weakest safe specification).
For a given verification query Φ and a safe and consistent verification interface Δ, Δ(f) is the weakest safe specification of f iff:
For a given bound k on the number of quantified variables allowed in the specification of f, there is no other specification with at most k quantified variables that makes Φ safe and is strictly weaker than Δ(f).
Definition 4.11 (Sample with respect to library function).
For a verification query Φ and a safe verification interface Δ, a sample is positive (resp. negative) with respect to a library function f when the sample is positive (resp. negative) and consistent with the specifications of all other library functions in the domain of Δ: