Algorithm Selection for Software Verification using Graph Attention Networks

01/27/2022
by   Will Leeson, et al.
University of Virginia

The field of software verification has produced a wide array of algorithmic techniques that can prove a variety of properties of a given program. It has been demonstrated that the performance of these techniques can vary up to 4 orders of magnitude on the same verification problem. Even for verification experts, it is difficult to decide which tool will perform best on a given problem. For general users, deciding the best tool for their verification problem is effectively impossible. In this work, we present Graves, a selection strategy based on graph neural networks (GNNs). Graves generates a graph representation of a program from which a GNN predicts a score for a verifier that indicates its performance on the program. We evaluate Graves on a set of 10 verification tools and over 8000 verification problems and find that it improves the state-of-the-art in verification algorithm selection by 11%. We conjecture this is in part due to Graves' use of GNNs with attention mechanisms. Through a qualitative study on model interpretability, we find strong evidence that Graves' GNN-based model learns to base its predictions on factors that relate to the unique features of the algorithmic techniques.


1. Introduction

Given a program, P, and a correctness specification, φ, formal verification seeks to determine whether the executable behavior of P is consistent with φ. In practice, φ can take many forms, such as requiring that all assertions hold, that memory is used safely, or that the program terminates. Verification tools must check all feasible program executions to ensure that φ cannot be violated. As programs grow in size and complexity, it becomes increasingly hard to verify them. To address this, a variety of verification algorithms and tools have been created, with different tools excelling in scenarios where others fail. In recent competition settings, 19 of the 20 competing tools built for verifying C programs were able to uniquely solve somewhere between 4 and 500 verification instances (beyer2021software). Of the 15000 verification problems in the competition, nearly 10% were solved by a single verifier. Deciding which tool is best suited to verify a specific piece of software can be difficult for an expert in the field of formal software verification, let alone a non-expert software developer.

Algorithm selection determines which algorithm from a suite of algorithms can best solve an instance of a specific problem class, e.g. choosing among the variety of sorting algorithms. Since it is rare for a single algorithm to dominate all others in terms of performance, algorithm selection is generally challenging. Even with the overhead of deciding which algorithm to employ and then executing it, algorithm selectors have been shown to be superior to single tools in various competitions (xu2008satzilla; richter2019pesco). Recently, machine learning techniques have been shown to be effective algorithm selectors for several problem classes (kotthoff2012evaluation; o2008using; richter2020algorithm). To relieve developers of the burden of deciding between verification tools, algorithm selectors have been created to determine which tool in a suite of tools is best suited for a given problem.

In this paper, we introduce a graph neural network approach to algorithm selection for verification of software, Graves. Given an arbitrary program P and a specification φ, our approach can be used to rank a portfolio of verification tools based on their ability to accurately and efficiently verify that P satisfies φ. Through an automated process, P is converted into a graph, G, which is constructed from the program's abstract syntax tree, control flow edges, and data flow edges. From G, a GNN consisting of graph attention layers, a jumping knowledge layer, and an attention based pooling layer produces a graph feature vector. A simple neural network uses this vector to predict a fitness score for each verifier in question.

The contributions of this paper are the following:

  • We introduce an approach to algorithm selection using state-of-the-art graph neural networks techniques, which we have implemented in our verification tool Graves-CPA

  • We provide a broad empirical evaluation of verification tool prediction that demonstrates the improvement over the state-of-the-art of the proposed technique across a range of benchmarks, baselines, and metrics

  • We perform a qualitative study into the interpretability of the model our technique employs which identifies portions of programs that an expert would use to select a verification technique

2. Background

In this section, we present background information on both automated software verification and graph neural networks.

2.1. Automated Software Verification Tools

Developments in the field of automated software verification have led to a diverse field of verification techniques. Each technique has strengths and weaknesses. Many model-checker-based tools convert programs into an SMT formula in an attempt to prove that no values of the free, or input, variables lead to a property violation. Thus, the power of such a tool hinges on the SMT solver's ability to check the complex formulas it is given. Abstract interpreters use abstract domains to characterize the variables and paths in programs. If an abstract domain is not precise enough to capture the behavior of the program, the tool may report unknown, or worse, give an incorrect answer.

Most modern tools do not implement a single verification technique. Instead, they combine techniques to make a more sophisticated verifier. The CPAChecker framework allows developers to build their own verification tool by combining pre-implemented techniques using a configuration file (beyer2011cpachecker). Because tools implement different sets of techniques, there is an algorithmic diversity which allows some verifiers to excel where others fail.

[Table 1. Techniques implemented by several tools for the 2018 SV-Comp. Rows (verification tools): 2LS, CBMC, CPA-Seq, DepthK, ESBMC-Kind, ESBMC-Incr, Symbiotic, U. Automizer, U. Kojak, U. Taipan. Columns (techniques): CEGAR, Predicate Abstraction, Symbolic Execution, Bounded Model Checking, k-Induction, Interval Analysis, Lazy Abstraction, Interpolation, Automata Based Analysis, Ranking Function.]

The Competition on Software Verification (SV-Comp) is an annual event that evaluates verification tools on a diverse set of benchmarks, covering many program behaviors and several verification properties. In the most recent competition, SV-Comp 2021, 26 different verification tools competed, using some subset of over 20 techniques (beyer2021software). Table 1 provides an abbreviated look at the diversity of algorithmic implementations of tools at SV-Comp 2018, which we use in our study. While there are “winners” of the overall competition and the different categories, there is no single verifier that does best on all programs. This has motivated the creation of algorithm selectors using suites of tools from the competition. In fact, there are tools which compete using algorithm selectors  (richter2019pesco; darke2021veriabs).

2.2. Machine Learning

Machine learning is used to solve tasks that would normally require rigorous, often impossible, programming. For example, an autonomous car would require a complex series of conditionals to handle any scenario a driver may encounter on the road. Instead of codifying a system of rules an autonomous car should follow when driving, machine learning has been used to learn how the car should react to a given scenario (sallab2017deep; kuutti2020survey). The power of machine learning techniques is their ability to learn complex patterns in large corpora of data to make accurate predictions.

A simple, yet effective machine learning technique for classification is the support vector machine (SVM) (boser1992training). SVMs use training data to learn a boundary which maximizes the margin between the boundary and the data points of the two classes. When a new data point is presented to the SVM, it is classified based on which side of the boundary it lies on. SVMs can be generalized to multiclass classification problems as well.

One of the core concepts in machine learning is the artificial neural network (ANN) (mcculloch1943logical). These networks are a series of layers of nodes with connections between layers. Data is input into the network and flows through it. As the data passes through the layers, calculations are performed on it until the final layer is reached. The output is then used to answer the task the network is meant to solve. These networks go through a training phase where the calculations they perform are iteratively tuned, typically through a process called backpropagation (rumelhart1986learning).

Traditional machine learning techniques leverage the fact that the data they operate on is of a consistent size. For example, bit-mapped image encodings have a consistent dimension and ordering. The networks learn to make calculations accordingly. Recurrent neural networks (RNNs) allow for variable sized input, typically streams of data, but they still leverage the fact that data has a set ordering or pattern. RNNs maintain a state which is updated as data is input to it. Graph data, in general, has no set ordering or size which makes it problematic for SVMs, ANNs and RNNs.

Introduced in Scarselli et al. (scarselli2008graph), graph neural networks aim not only to capture the information in the nodes of the graph, but also the connections, or edges, between them. An interesting observation in the foundational work is that GNNs can be thought of as a generalization of RNNs. RNNs operate on data that can be thought of as a linear, acyclic graph. Each input to the RNN is a node in the graph, with an edge from the previous input and an edge to the next input. If this restriction on the data can be relaxed, these networks can operate on arbitrary graphs.

In an RNN, data flows in a linear fashion. As data is fed through, calculations are made and the state is updated. With GNNs, this must be augmented. Instead of a single state, they have a set of states, one for each node in the graph. Each node in the graph is updated in parallel using the values of the nodes adjacent to it during a process known as message passing. After several iterations of message passing, each node in the graph has a value computed from itself and its neighboring nodes, effectively allowing the network to learn about each node and where it lies in the graph.
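As a concrete illustration of a single message passing round (not Graves' actual update rule), the following sketch averages each node's in-neighbor states and mixes them with the node's own state; all names and the mixing rule are illustrative.

    import torch

    def message_passing_step(node_states, edge_index):
        """One round of message passing: each node averages the states of its
        in-neighbors and combines the result with its own state.
        node_states: (num_nodes, dim) tensor; edge_index: (2, num_edges) tensor
        of (source, target) index pairs."""
        num_nodes, dim = node_states.shape
        agg = torch.zeros_like(node_states)
        count = torch.zeros(num_nodes, 1)
        src, dst = edge_index
        agg.index_add_(0, dst, node_states[src])          # sum incoming messages
        count.index_add_(0, dst, torch.ones(src.numel(), 1))
        agg = agg / count.clamp(min=1)                    # mean over in-neighbors
        return 0.5 * (node_states + agg)                  # combine with own state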

2.3. Model Interpretability

Model interpretability—the process of discerning why a machine learning model makes a certain decision for a given input—is a sought-after property amongst machine learning researchers (petsiuk2018rise; fukui2019attention; dovsilovic2018explainable). Interpretability can help ensure the model is learning to make predictions based on features a domain expert would recognize as important to a given problem. For example, the tool RISE (petsiuk2018rise) produces heat maps of images based on how important each pixel is to the model's prediction. If the model is supposed to predict whether or not an image contains a tire, the heat map could show that circular shapes are important to the prediction. It could also show that the model attends to objects that typically have tires, e.g. a car or bike.

Ying et al. present a black-box approach to GNN interpretability, GNNExplainer, based on the idea of masking (ying2019gnnexplainer). Masking is the approach of removing certain data points to see how their removal affects the model's prediction. GNNExplainer operates by masking edges in a given graph and giving the altered graph to the model. This reveals the edges that most influence the model's prediction.

3. Related Work

In this section, we describe existing approaches to algorithm selection for verification and other uses of graph neural networks to perform software engineering tasks.

3.1. Algorithm Selection for Verification

While machine learning is the norm, it is not required to perform algorithm selection for verification. The tool VeriAbs (darke2018veriabs) implements an algorithm selector which uses a rule-based system. Using lightweight analyses, it computes information about loops and the inputs to programs to decide which verification techniques to apply. This technique has proven very successful, and VeriAbs has won the ReachSafety category in SV-Comp for the last three years.

To create a machine learning based algorithm selector for program verification, programs must be represented in a way amenable to machine learning techniques. Raw text is a poor representation of programs. Natural language processing has been used to “understand” a program to perform some tasks (ernst2017natural; allamanis2018survey), but programs are more than just a sequence of statements: the order in which statements are executed depends on the branching behavior of the program. Another way to represent a program is by extracting statistics, such as counts of loops, pointers, or occurrences of recursion, to form a “feature vector”. Models have been trained using these feature vectors to select verifiers with some success (demyanova2017empirical; tulsian2014mux). This representation captures the presence of certain behaviors in a program, but not how they interact with each other. If program A has a loop with an assertion in it and program B has a loop with an assertion before the loop, counting these features would show no difference between A and B. This is problematic as some algorithms falter on certain looping behaviors, while others excel.

Similar to Graves, the tool PeSCo (richter2019pesco; richter2020algorithm) operates on graph representations of programs. It calculates new graph representations by concatenating the labels of adjacent nodes for a set number of iterations. It then uses an SVM which compares subgraphs in these representations using the Weisfeiler–Lehman test for graph isomorphism (shervashidze2011weisfeiler). Since PeSCo began competing in SV-Comp in 2019, it has ranked in the top three tools in the overall category. PeSCo and our proposed technique make use of similar graph representations of programs. However, our technique uses GNNs to form a final graph representation instead of their label concatenation process, and we use a neural network to perform final predictions instead of an SVM.

In Richter et al. (richter2020attend), the authors teach an attention network to encode a version of the abstract syntax tree of programs. Using this encoding, they train another network to perform algorithm selection on verification tools. Graves also makes use of a given program's abstract syntax tree; however, the graph is enriched with control and data flow information. Similarly, Graves uses a neural network to form predictions, but it encodes graphs into a feature vector using GNNs.

3.2. Graph Neural Networks for Software Engineering

Graph neural networks have been used to represent programs for various purposes in the software engineering community. Typically, techniques begin with the program's abstract syntax tree. From there, they add edges representing information they expect to be useful in solving their task. Tasks that focus on individual statements may add edges between the tokens in statements in the order they appear, allowing the network to create statement-based feature vectors. Tasks that focus on optimizing an entire program may require edges between function calls or variable uses to more accurately summarize the entire program.

GNNs are capable of performing tasks to assist programmers in developing code. Allamanis et al. (allamanis2017learning) use graph neural networks to perform tasks akin to a linter (johnson1977lint). They find when a variable has been used incorrectly in place of another and they predict the names of such variables. These tasks can be integrated into IDEs to prevent misuses of variables and to help developers give descriptive names to variables, making code more legible. Both LeClair et al. (leclair2020improved) and Lu et al. (lu2019program) use graph neural networks to classify programs. This can be helpful for code mining and automatic code completion.

GNNs can also be used for tasks dealing with resource management. Cummins et al. (cummins2020programl) use graph representations of programs and GNNs to perform compiler analysis tasks, such as determining the sequence in which optimizations should be applied. This has the potential to improve both compilation time and the execution time of generated code.

To perform message passing, Allamanis et al. and Cummins et al. use Gated Graph Neural Networks (GGNNs), a predecessor to GATs, and LeClair et al. use Convolutional Graph Neural Networks. These GNNs do not use an attention score to weight edges when performing message passing. These works also use different graph structures. LeClair et al. only use the AST. While the AST does maintain program structure, it lacks information such as loop back edges or def/use pairs. On the other extreme, Allamanis et al. use 10 different edge types, which may allow unnecessary information to propagate through the graph. Graves creates graphs using the program's AST, control flow edges, data flow edges, and function call and return edges. Since Graves uses GATs to perform message passing, edges are weighted, preventing information the network deems unnecessary from having a large impact on any node's representation.

Lu et al. use GGNNs with an attention mechanism. However, they formulate their graphs differently, using the AST, function call graph, and data flow graph. By using only the function call graph for control information, their graph is missing edges which determine the branching behavior of the program. They also perform feature engineering on the data flow graph, a process where experts use domain knowledge to edit how a feature is captured in an attempt to help the network learn. They categorize the edges in the data flow graph into 5 types, presumably so each type is treated differently by the GNN. Graves simply lets the attention mechanism handle weighting the edges.

4. Approach

Figure 1. Graves operates using the pipeline shown above. First, we convert programs into graphs using the program's AST with control flow, data flow, and function call and return edges added between the AST's nodes. We parse this graph into several tensors. These tensors are input to a graph neural network which produces a graph feature vector from which a simple neural network makes a prediction.

Graves follows the pipeline shown in Figure 1. Given a C program, it generates a graph representation of the program using the AST as the base, with control flow, data flow, and function call and return edges added. A parser converts the graph representation into a set of nodes, where nodes are represented using a one-hot encoding, and several edge sets. A GNN uses these sets to calculate a graph feature vector. Finally, a neural network predicts a score for each tool based on how well it will perform at verifying the given program. These scores are then used to rank verifiers from most likely to correctly verify the program to least likely.

4.1. Graph Generation

Definition 4.1. We define a graph G = (V, X, E) as follows:

  • V = {v_1, ..., v_n}, where v_i is a node

  • X = {x_1, ..., x_n}, where x_i is the one-hot encoding of the value of v_i

  • E = {E_1, ..., E_k}, where E_j corresponds to the j-th edge type in the graph

  • E_j = {(v_a, v_b), ...}, where (v_a, v_b) is a directed edge from v_a to v_b

For Graves' program graphs, V is the set of nodes in the abstract syntax tree (AST). The AST nodes contain important information about the semantics of a program, such as variables, functions, and operations, but they leave out purely structural tokens, like parentheses or semicolons. By construction, the AST retains the information carried by these structural tokens. The node representation, x_i, is the one-hot encoding of the AST token associated with v_i. Let T be the set of possible AST tokens. Each token t in T is represented by an index idx(t). The one-hot encoding of v_i, x_i, is a vector of length |T| consisting of 0s and a single 1 at position idx(t_i), where t_i is the token of v_i.
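To make the node encoding concrete, the sketch below one-hot encodes an AST token against a fixed vocabulary; the token list is an illustrative stand-in for Clang's actual token set.

    import torch

    TOKENS = ["FunctionDecl", "VarDecl", "IfStmt", "ForStmt", "BinaryOperator"]  # illustrative subset
    TOKEN_INDEX = {tok: i for i, tok in enumerate(TOKENS)}

    def one_hot(token: str) -> torch.Tensor:
        """Vector of length |TOKENS| that is 0 everywhere except at the token's index."""
        x = torch.zeros(len(TOKENS))
        x[TOKEN_INDEX[token]] = 1.0
        return x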

E contains three edge sets which capture three types of information: control flow, data flow, and AST edges. Control flow information is vital to most, if not all, verification properties. Control flow shows the path to an error state or the presence of looping behavior which could prevent termination. Thus, it is reasonable to believe that control flow edges are necessary to differentiate between verifiers. For example, techniques which use symbolic execution may perform poorly on loops whose conditions depend on program input variables, since they may effectively unroll such loops an arbitrary number of times. By overapproximating loop condition values, abstract interpretation based techniques can efficiently compute loop fixed points to verify properties, with some loss of accuracy in how the condition is modeled. Graves generates control flow edges and function call and return edges using the program's statement-based interprocedural control flow graph (ICFG). This graph encodes control flow between each statement in a program and the call and return edges between function calls.

Data flow edges express the way in which variables are used and altered throughout the program. This information is related to the solvers which can determine whether a path to an error state is feasible. If the verification of a property is reliant on solving a complex formula, a tool based on a precise abstract domain, like the octagonal or polyhedral domain, may be able to verify the property correctly, while a tool based on a simpler domain may fail.

Finally, AST edges capture the semantics of the program which any tool ultimately must operate on. They can capture complex non-linear expressions which can be problematic for abstract interpreters depending on their abstract domain. They are also convenient as every node, besides the root node, is guaranteed to have one AST edge going to it, and in most cases one or more leaving it. This allows information not captured in control or data flow edges to propagate through the graph more easily during the message passing phase.

Figure 2. Graves uses a GNN which consists of a variable number of graph attention layers. A jumping knowledge layer collates the intermediate values of the graph between each GAT layer. The attention based pooling layer pools the nodes in the graph into a single graph feature vector. A fully connected neural network scores each verifier using this vector.

4.2. Graph Neural Network

Graves employs a GNN that consists of three sections: a series of graph attention layers, a jumping knowledge layer, and a pooling layer.

4.2.1. Graph Attention Layers

Graph attention networks (GATs) (velivckovic2017graph) are a widely used technique to perform message passing in GNNs (xu2018representation; NEURIPS2019_d09bf415; huang2019syntaxaware). They operate by using one or more attention mechanisms to learn what information should be passed during the message passing phase. This allows the network to weight the information it propagates, instead of simply passing all information equally.

Given a directed edge (v_a, v_b), GATs also perform a linear transformation to x_a and x_b during message passing, multiplying x_a by one weight matrix and x_b by a second weight matrix.

Graves allows for a variable number of GAT layers, as the optimal amount of propagation can be related to the task at hand. Each GAT layer contains 3 trainable parameters: the attention mechanism and the two weight matrices.
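A minimal sketch of a variable-depth GAT stack using PyTorch Geometric's GATConv follows; the hidden sizes and activation are assumptions, not Graves' exact configuration.

    import torch
    from torch_geometric.nn import GATConv

    class GATStack(torch.nn.Module):
        """A variable-depth stack of GAT layers used for message passing."""
        def __init__(self, in_dim: int, hidden_dim: int, num_layers: int):
            super().__init__()
            self.layers = torch.nn.ModuleList()
            dims = [in_dim] + [hidden_dim] * num_layers
            for d_in, d_out in zip(dims[:-1], dims[1:]):
                self.layers.append(GATConv(d_in, d_out))

        def forward(self, x, edge_index):
            outs = []                                # keep every intermediate representation
            for layer in self.layers:
                x = torch.relu(layer(x, edge_index))
                outs.append(x)
            return x, outs                           # final and per-layer node states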

4.2.2. Jumping Knowledge Layer

In image classification networks, it has been shown that early layers in the network can identify coarse features, like the shape of a wheel, and later layers can identify finer features, like the spokes in a wheel (tong2017image). It is possible that the network can learn from each layer of the GNN. The earlier layers may provide information on local behaviors, as they are only a few edge steps away. Later layers may make calculations on behaviors which take many more steps to find.

Jumping knowledge layers (xu2018representation) combine the output of several layers, denoted as A, B, and C in Figure 2, to produce an aggregate representation, typically using concatenation, max-pooling, or a recurrent layer. This allows Graves' network to learn from the intermediate node representations along with the final representation produced by the GAT layers. Graph representations produced by the intermediate GAT layers are fed to the jumping knowledge layer, which concatenates the intermediate and final representations of each node into a single vector.
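A minimal sketch of the jumping knowledge step, using PyTorch Geometric's JumpingKnowledge layer in concatenation mode; the tensor shapes are illustrative.

    import torch
    from torch_geometric.nn import JumpingKnowledge

    jk = JumpingKnowledge(mode="cat")            # concatenate per-layer node states
    # outs is a list of (num_nodes, hidden_dim) tensors, one per GAT layer
    outs = [torch.randn(7, 32) for _ in range(3)]
    node_repr = jk(outs)                         # shape: (7, 3 * 32)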

4.2.3. Pooling Layer

In order to use traditional machine learning techniques to perform the task of graph prediction, graphs must ultimately be collated into a fixed-size representation. We use an attention based pooling layer to calculate a graph feature vector, h_G, as follows:

h_G = Σ_i softmax(a(x_i)) · x_i

Meaning, for each node, v_i, in the graph, we calculate an attention score a(x_i) via a neural network, a, and multiply the node representation, x_i, by the softmax of said attention score. We then take the sum of these values to form h_G. The scoring network a is a three layer neural network consisting of 233 neurons and is tuned throughout the training process.
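This pooling can be realized with PyTorch Geometric's GlobalAttention layer (renamed AttentionalAggregation in newer releases), where the gate network plays the role of the scoring network a; the layer sizes below are illustrative rather than the 233-neuron network described above.

    import torch
    from torch_geometric.nn import GlobalAttention

    node_dim = 32                                    # illustrative node feature size
    gate_nn = torch.nn.Sequential(                   # scores each node (the network "a")
        torch.nn.Linear(node_dim, node_dim), torch.nn.ReLU(),
        torch.nn.Linear(node_dim, 1),
    )
    pool = GlobalAttention(gate_nn)

    x = torch.randn(7, node_dim)                     # 7 nodes in one graph
    batch = torch.zeros(7, dtype=torch.long)         # all nodes belong to graph 0
    graph_vector = pool(x, batch)                    # shape: (1, node_dim)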

4.3. Prediction Network

Along with a representation of the property φ, h_G is passed to a neural network consisting of three layers. This network produces a score for each verifier in the portfolio based on how likely it is to verify the given program. Using these scores, we can rank the verifiers from most effective to least effective.

4.4. Implementation

We created an implementation of the Graves approach for C programs and C program verifiers, which can be found in our GitHub repository: https://github.com/will-leeson/graves.

To create program graphs, Graves uses the AST generated by the C compiler Clang (clang). Using a visitor pattern (johnson1995design), it walks the AST to collect its nodes and edges. Graves also collects the information necessary to generate control and data flow edges. Using a work-list reaching definition algorithm (aho1986compilers), Graves generates the additional data flow edges.
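A sketch of collecting AST nodes and parent-child edges with Clang's Python bindings, in the spirit of the visitor-based traversal described above; the node labeling and the omission of control and data flow edges are simplifications.

    import clang.cindex

    def ast_graph(path: str):
        """Walk the Clang AST of a C file, returning node kinds and AST edges."""
        index = clang.cindex.Index.create()
        tu = index.parse(path)
        nodes, edges = [], []

        def visit(cursor, parent_id):
            node_id = len(nodes)
            nodes.append(cursor.kind.name)            # e.g. FUNCTION_DECL, IF_STMT
            if parent_id is not None:
                edges.append((parent_id, node_id))    # AST edge: parent -> child
            for child in cursor.get_children():
                visit(child, node_id)

        visit(tu.cursor, None)
        return nodes, edges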

We implement Graves' GNN and neural networks using the machine learning library PyTorch (NEURIPS2019_9015) and an extension of the library, PyTorch Geometric (Fey/Lenssen/2019). PyTorch Geometric is a machine learning framework made to perform deep learning on graphs and irregularly shaped data. It has implementations of many state-of-the-art GNN techniques as well as a method to create new layers.

We trained networks consisting of 0 to 5 GAT layers, a jumping knowledge layer, and an attention-based pooling layer. Graves' final prediction network consists of three layers made up of 156 (the size of a node's feature vector), 78, and 10 (the size of our verifier suite) neurons respectively, which were chosen after a manual architectural search.
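A sketch of the final prediction layers with the stated sizes (156, 78, and 10), assuming the pooled graph feature vector is already available and that any property information has been folded into it; this is not the full Graves network.

    import torch

    prediction_head = torch.nn.Sequential(
        torch.nn.Linear(156, 78), torch.nn.ReLU(),   # 156 = graph feature vector size
        torch.nn.Linear(78, 10),                     # 10 = one score per verifier
    )

    graph_vector = torch.randn(1, 156)               # output of the pooling layer
    scores = prediction_head(graph_vector)           # fitness score per verifier
    ranking = torch.argsort(scores, descending=True) # most to least promising verifier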

5. Research Questions and Experimental Design

To evaluate our approach, we answer the following questions:

RQ1:

How does Graves compare to other algorithm selectors for verification?

RQ2:

How do the components of Graves' GNN architecture affect the network's ability to predict?

RQ3:

Does Graves identify program patterns associated with verification algorithm success?

                  Overall    Reach Safety   Termination   Mem Safety   Overflow
Graphs            8,330      5,778          1,842         291          419
Maximum Nodes     99,563     99,563         92,877        23,869       18,283
Mean Nodes        12,393     16,634         3,153         2,547        1,386
Maximum Edges     156,573    156,573        125,940       37,601       25,823
Mean Edges        17,803     23,834         4,818         3,280        1,808
Table 2. Program Graph Statistics. We look at verification instances with several values of φ. Reach safety and termination make up the majority of examples, and they also have the largest graphs. Memory safety and overflow are much smaller categories with much smaller graphs.

5.1. Baseline techniques

To address RQ1, we select several baselines to compare against Graves. The selector introduced in (richter2020algorithm) produces program graphs similar to Graves'. However, it does not use graph neural networks to select verifiers. As described in Section 3.1, it uses an SVM with a specialized kernel to look for similarities in graphs. This SVM uses the Weisfeiler-Lehman test to calculate Jaccard similarity, so we refer to this technique as WLJ.

The selector introduced in (richter2020attend) introduces a variant of ASTs called contextualized syntax trees (CSTs). In CSTs, nodes of the AST are grouped in hierarchies, e.g. function, statement, token, etc. We refer to this technique as CST.

We evaluate two additional selectors: an ideal static selector (ISS) and a random selector. The purpose of ISS is to evaluate the benefits of dynamic selection. If the dynamic selectors perform similarly to ISS, then the benefit of dynamic selection does not outweigh the cost of its overhead. The random selector allows us to evaluate our metrics: if a random selector performs well for a given metric, then the metric is most likely not rigorous.

5.2. Dataset and Verifier suite

In order to evaluate Graves and the baseline techniques, we must first provide them a suite of verifiers to select from, a large set of verification problems to train the machine learning based selectors, and a separate set of verification problems on which to evaluate all of the techniques.

SV-Comp evaluates tools built for both the C and Java programming languages. As the Java competition is relatively new, its set of benchmarks and tools is not as diverse as those in the C competition. SV-Comp has a large set of verification problems, with several possible specifications, written in C. In their evaluations, WLJ and CST used the 2018 SV-Comp benchmarks (SV-Benchmarks2018) and selected 10 verifiers which competed in all four major categories: reach safety, termination, memory safety, and overflow. Definitions of these properties can be found on the SV-Comp website: https://sv-comp.sosy-lab.org/2018/rules.php. By selecting the same benchmarks and suite of tools, we can evaluate Graves and perform a fair comparison to WLJ and CST.

The 2018 SV-Comp benchmarks consist of 9523 verification problems written in C. We remove roughly 13% of the examples from the data set as they produce graphs too large given the constraints of the GPUs we have available for experiments (VRAM is limited to 16GB). For reference, the average graph from the remaining 8330 examples is roughly 8.75 MB. Table 2 provides statistics on the graphs Graves generated from the dataset.

For the evaluation of RQ1 and RQ2, the suite of verification tools are the SV-Comp 2018 submissions of the following tools: 2LS (schrammel20162ls), CBMC (kroening2014cbmc), CPA-Seq (wendler2013cpachecker), DepthK (rocha2017depthk), ESBMC-KInd (gadelha2018esbmc), ESBMC-Incr (gadelha2018esbmc), Symbiotic (SYMBIOTIC5-SVCOMP18), Ultimate Automizer (UAUTOMIZER-SVCOMP17), Ultimate Kojak (nutz2015ultimate), and Ultimate Taipan (greitschus2017ultimate). The labels for these verifiers come from the results reported at SV-Comp 2018 (sv-results). Each label is a verifier's SV-Comp score for the given benchmark b, minus a penalty for the time it took to verify b.

For the evaluation of RQ3, Graves selects from 4 verification algorithms implemented in the CPAChecker framework: bounded model checking (BMC) (biere1999symbolic), bounded model checking with K-Induction (BMC+K) (de2003bounded), counter-example guided abstraction refinement (CEGAR) (clarke2000counterexample), and symbolic execution (king1976symbolic). We select BMC, CEGAR, and symbolic execution as they are distinct techniques with separate benefits and shortcomings. We include BMC+K to observe if the network can identify the advantages and disadvantages K-Induction adds to BMC. Since Graves is selecting from tools within one framework, it avoids the issue of identifying implementation-specific details, like supporting various data types, as reasons for selections. It also ensures Graves is selecting from individual techniques, as most tools use an amalgam of techniques. Labels were collected by running each CPAChecker configuration on the SV-Comp 2018 dataset on one of 5 identical CentOS servers. Each server has one 2.10GHz Intel(R) Xeon(R) Gold 6130 CPU and 128 GB of RAM.

We randomly divide our data into training, validation, and test sets using an 80-10-10 split respectively. We ensure that this split holds across all problem types, i.e. 80% of reach safety problems are in the training set, 10% are in the validation set, and 10% are in the test set. To evaluate Graves’ ability to generalize, we also look at using an 18-2-80 split in a competition setting.

5.3. Evaluation Metrics

We evaluate Graves against the baseline techniques using three metrics: the ability to predict a successful verifier, Spearman rank correlation (spearman1987proof), and Top-K error. The first two metrics were used in the evaluation of (richter2020attend) and (richter2020algorithm), respectively.

5.3.1. Successful Verifier Selection Accuracy

The simplest measurement we evaluate Graves on is the ability to select a verifier which will be successful in verifying a given program. We remove instances where no verifier could solve the given instance from the test set when evaluating this metric as this would artificially deflate the results. We also remove instances when all verifiers could solve the given problem as this would artificially inflate the results since any response is correct. We are left with 680 examples, or 81.8% of the test set. For this metric, ISS is the verifier which produces the most correct responses on the training set.

5.3.2. Spearman Rank Correlation

Spearman rank correlation (spearman1987proof) determines how similarly two lists are ordered, where ρ = 1 implies x and y are ordered the same and ρ = -1 implies x is the inverse ordering of y. In our case, x is the true score of each verifier and y is the predicted score for each verifier. It is closely related to the Pearson correlation coefficient, which measures the linear correlation between two lists (pearson1896vii). The Spearman rank correlation coefficient of two lists, x and y, is equal to the Pearson correlation coefficient of the rankings of x and y. When all ranks are distinct, it can be computed as follows, where d_i is the difference between the ranks of the i-th elements of x and y and there are n items being ranked:

ρ = 1 − (6 Σ_i d_i²) / (n(n² − 1))
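For instance, the coefficient between a true score list and a predicted score list can be computed with SciPy; the values below are made up.

    from scipy.stats import spearmanr

    true_scores = [10, 7, 3, 0]        # true per-verifier scores (illustrative)
    pred_scores = [9, 8, 1, 2]         # predicted per-verifier scores
    rho, _ = spearmanr(true_scores, pred_scores)
    print(rho)                          # 1.0 only if both rankings agree exactly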

The ISS ranks verifiers using Borda counts (Behnke2004) on the training set. The Borda count for each verifier can be calculated as follows, where n is the number of verification instances, k is the number of verifiers in our suite, and r_i(v) is the true ranking of verifier v on verification instance i:

Borda(v) = Σ_{i=1}^{n} (k − r_i(v))

Borda counts are the optimal static ordering for any ranking in terms of Spearman rank correlation (hullermeier2010predictive).
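A small sketch of computing Borda counts under this definition; the rankings are made up.

    def borda_counts(rankings, k):
        """rankings[i][v] is the true rank (1 = best) of verifier v on instance i."""
        counts = {v: 0 for v in rankings[0]}
        for ranking in rankings:
            for verifier, rank in ranking.items():
                counts[verifier] += k - rank
        return counts

    # Two instances, three verifiers: a higher count means a better static position.
    print(borda_counts([{"A": 1, "B": 2, "C": 3}, {"A": 2, "B": 1, "C": 3}], k=3))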

Spearman rank correlation is interesting in the case where verifiers are run in sequence. A network with high Spearman rank correlation should choose the verifiers most likely to succeed in order of speed of verification.

5.3.3. Top-K Error

Top-K error is a metric often used to evaluate deep learning models on the task of object recognition in images (krizhevsky2012imagenet; ren2015faster; he2016deep). Given some k-value, if the label value is within the first k choices the network predicts, then error is 0. If the label is not in the first k choices, error is 1. In our case, the label is the verifier which performs best on the given program.
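A sketch of Top-K error for a single instance, assuming predicted scores and a known index of the best-performing verifier.

    import torch

    def top_k_error(scores: torch.Tensor, best_verifier: int, k: int) -> int:
        """0 if the best-performing verifier is among the k highest-scored, else 1."""
        top_k = torch.topk(scores, k).indices
        return 0 if best_verifier in top_k else 1

    print(top_k_error(torch.tensor([0.1, 0.9, 0.4, 0.2]), best_verifier=2, k=2))  # prints 0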

The ISS chooses tools in order of how often they were the best verifier on the training set, meaning the tool which was best most often is selected first, the tool which was best the second most often is selected second, etc. This guarantees the highest Top-K values for a static ordering of tools on the training set.

A critique of algorithm selection is that a user could run the portfolio of algorithms in parallel instead of selecting one at a time (kerschke2019automated). This has the potential to cause an excessive use of resources, as verification techniques tend to scale poorly in both time and memory usage as program complexity grows. While full parallelization is guaranteed to verify a program in the shortest time for the set of verifiers, computation time and energy are wasted on all verification tools that do not report a safety guarantee first.

Algorithm selectors can be used to find a middle ground between parallelism and sequential selection by training a selector to predict the Top-K most likely tools to verify a program, where K is the number of parallel machines. If the selector can consistently pick the most effective tool in the Top-K, the developer will receive a safety guarantee in the same amount of time, while using a fraction of the computational resources.

5.4. Network Training

Training took place on various machines with different specifications. Aside from the limitation on graph sizes imposed by GPU VRAM, training resources are not pertinent to any of our research questions or evaluation metrics, so this poses no threat to validity.

To train CST models, we looked to the repository listed in their paper to replicate their results. After interacting with the authors, we could not reproduce the results of WLJ. As a result, we omit WLJ from our evaluation of successful verifier accuracy and Top-K accuracy as they did not evaluate their technique using these metrics. For Spearman correlation, we quote the results from their paper.

For Graves' model, we performed hyper-parameter tuning, varying the number of training epochs and the learning rate (among 1e-3, 1e-4, and 1e-5). We found a learning rate of 1e-3 and 50 epochs to be optimal. We did not train in batches due to GPU VRAM constraints. We used a learning rate scheduler to train Graves' network: after three consecutive epochs where the network's validation loss did not improve, the learning rate was decreased by one order of magnitude. Networks were trained for the chosen number of epochs, or until the learning rate fell to 1e-8. Graves uses a pair-wise margin ranking loss to train networks, as this penalizes poor ranking of any verifier, regardless of its true position in the label.
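A sketch of this training configuration, combining a pairwise margin ranking loss with a plateau-based learning rate scheduler; the optimizer choice, margin value, and placeholder model are assumptions.

    import torch

    model = torch.nn.Linear(156, 10)                  # placeholder for the full Graves network
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, factor=0.1, patience=3)            # cut LR by 10x after 3 stale epochs
    margin_loss = torch.nn.MarginRankingLoss(margin=0.1)

    def pairwise_ranking_loss(pred, label):
        """Sum a margin ranking loss over every verifier pair whose true scores differ."""
        total = torch.zeros(())
        for i in range(pred.numel()):
            for j in range(pred.numel()):
                if label[i] > label[j]:               # verifier i should be ranked above j
                    total = total + margin_loss(pred[i:i+1], pred[j:j+1], torch.ones(1))
        return total

During training, scheduler.step would be called once per epoch with the validation loss, so the learning rate drops after three epochs without improvement.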

6. Model Evaluation

In the following section, we discuss our model's performance across a variety of metrics. We look at how it compares to previous techniques and how different components of the network affect its ability to make predictions.

6.1. Technique Performance

We address RQ1 by comparing Graves against the baseline techniques across the metrics we discuss in Section 5.3. Figure 3 lists the results for Successful Verifier Selection Accuracy, Spearman Rank Correlation, and Top-K Error. For Successful Verifier Selection Accuracy and Spearman Rank Correlation, we list the average and standard deviation of the results of 10 selectors of each technique. For Top-K Error, we present a line graph where each point is the average Top-K Error for 10 selectors of the given technique. Above and below each line is a shaded region which represents one standard deviation from the average. Note that for many techniques this is barely perceptible, as their standard deviation is very low.

For WLJ, since we could not recreate the authors' results, we quote the Spearman correlation reported in (richter2020algorithm). In (richter2020algorithm), the authors list several configurations of WLJ, parameterized by two values, i and j. The i value corresponds to the number of node relabeling iterations they complete, similar to message passing. The j value corresponds to the depth of the AST their selector looks at when comparing graphs. In Figure 3, we report the optimal (i, j) configuration from their study.

We list 4 configurations of Graves' GNN: Full, Success, Spearman, and TopK. Full corresponds to a network which uses all possible edge sets and 5 GAT layers, the maximum we used. Success corresponds to networks consisting of 4 GAT layers that only use the AST edge set when performing message passing. Spearman corresponds to networks consisting of 3 GAT layers that only use the ICFG edge set when performing message passing. TopK corresponds to networks consisting of 2 GAT layers and all possible edge sets when performing message passing.

Each GNN configuration was found to be experimentally optimal for the metric for which it is named. For the results of all configurations, see the appendix. Note that the CST networks have minuscule standard deviation due to the fact that they use a static seed in training. ISS has no standard deviation as the same selections are made for each selector.

Algorithm Selector    Success          Spearman
Graves (Full)         0.856 ± 0.006    0.723 ± 0.004
Graves (Success)      0.866 ± 0.008    0.720 ± 0.005
Graves (Spearman)     0.863 ± 0.008    0.727 ± 0.003
Graves (TopK)         0.860 ± 0.006    0.726 ± 0.007
WLJ                   -                0.654 ± 0.014
CST                   0.601 ± 0.000    -0.059 ± 0.000
ISS                   0.697 ± 0.000    0.243 ± 0.000
Random                0.414 ± 0.009    0.002 ± 0.012
Figure 3. Graves is better in both successful verifier selection accuracy and Spearman rank correlation. WLJ was only evaluated on the task of ranking, but it performs well at it, only roughly 10% worse than Graves. CST and ISS perform well at successful verifier selection, but are much worse at ranking the verifiers accurately, with CST being worse than random. The fact that the Random selector performs as well as it does at successful verifier selection shows how easy a task it is. For Top-K error, Graves is the best performing technique until K=6, where it performs similarly to the ideal static ordering. When K > 5, if K verifiers are randomly selected, the best verifier is more likely to be selected than not.

Successful Verifier Selection Accuracy

Graves shows roughly a 40% improvement over the best performing prior dynamic selector (CST) and a 22% improvement over the ideal static selector (ISS). The CST model does not take the specification type into account, which Graves does. Both CST and Graves use a simple feedforward neural network to make final predictions. It is reasonable to believe that CST could perform better if this information were simply appended to its feature vector.

In general, the problem of selecting a successful verifier is not very interesting. Simply by randomly choosing a verifier, there is roughly a 40% chance of making a correct choice. The best performing verifier in the training set, which is what ISS uses, performs better than CST on this metric and is within 24% of the best performing dynamic selector, Graves. Because verification can be such an expensive process, it is important to select not only a verifier that can verify a system, but one that can do it efficiently.

Spearman Rank Correlation

Graves improves on the state-of-the-art (WLJ) by roughly 10% in terms of Spearman rank correlation. Similar to the success metric, this is most likely due to the fact that our model takes into account the problem type of the verification problem. Unlike CST, the WLJ technique does not allow for this information to be conveniently added: because its SVM uses a graph isomorphism based kernel, it cannot simply add this information to a feature vector. A possible solution is to add a node to the graph containing this information, but this may negatively affect the graph isomorphism test the kernel uses.

Spearman rank correlation is a more interesting and rigorous metric than successful verifier selection accuracy. The ideal static selector performs much worse on this metric, as there is actual competition as to which verifier performs best on a given instance. An issue with Spearman rank correlation is that it rewards getting the last item correct just as much as getting the first item correct. For example, let label = (1, 2, 3, 4), pred_1 = (1, 2, 4, 3), and pred_2 = (2, 1, 3, 4); both predictions have the same Spearman rank correlation with the label. Since pred_1 predicts the best verifier first, it is a better ordering than pred_2 for our purposes. Selectors may be receiving high scores simply because they order the bad verifiers well. In order to get faster and more accurate verification results, we want selectors to predict the best verifiers for a given problem first. It is less important that they get the correct ordering for the verifiers which will perform poorly.

Top-K Error

When all problems are taken into account, Graves is decidedly better than any other selector until K=6. After this, the difference between the Graves selector and the ISS selector is negligible. We argue that K > 5 is not an interesting setting, as there are 10 verifiers to select from: randomly selecting more than 5 verifiers is going to include the best verifier more often than not. Once again, we omit WLJ as we could not replicate their study to collect the appropriate data.

These results can help us infer some interesting information about previous metrics. The Top-1 error shows how often the technique selects the best performing verifier. This implies that when Graves chooses a “successful” verifier, roughly 60% of the time, it is the best performing verifier. CST on the other hand chooses the best performing verifier only approximately 15% of the time. These results also show that the best performing verifier is one of the first 3 verifiers Graves selects over 80% of the time. This suggests that the network is not succumbing to the Spearman rank correlation issue mentioned earlier of optimizing the order of the poor performing verifier and ignoring the best performing ones.

We argue Top-K error is a better metric than Spearman correlation or successful verifier accuracy to measure algorithm selectors. Modern CPUs allow for high levels of parallelization. Top-K error reveals how much parallelization is needed on average to reach the best result. Graves ensures that 90% of the time a portfolio run on a 4-core CPU will be able to achieve the best performance, i.e., the top-performing verifier will be among the Top-4.

Algorithm Selector    Reach Safety     Termination      Memory Safety    Overflow
Graves (Full)         0.693 ± 0.008    0.843 ± 0.015    0.761 ± 0.019    0.527 ± 0.028
Graves (Reach)        0.698 ± 0.005    0.844 ± 0.015    0.765 ± 0.020    0.507 ± 0.026
Graves (Term)         0.693 ± 0.006    0.857 ± 0.010    0.755 ± 0.022    0.507 ± 0.031
Graves (Mem)          0.698 ± 0.005    0.844 ± 0.015    0.765 ± 0.020    0.507 ± 0.026
Graves (Flow)         0.690 ± 0.006    0.839 ± 0.013    0.754 ± 0.021    0.533 ± 0.013
WLJ (Reach)           0.719 ± 0.019    0.879 ± 0.021    0.647 ± 0.057    0.777 ± 0.046
WLJ (Term)            0.717 ± 0.020    0.881 ± 0.020    0.644 ± 0.054    0.779 ± 0.044
WLJ (Mem)             0.715 ± 0.021    0.877 ± 0.019    0.649 ± 0.054    0.769 ± 0.042
WLJ (Flow)            0.717 ± 0.020    0.881 ± 0.020    0.644 ± 0.054    0.779 ± 0.044
CST                   0.180 ± 0.000    0.112 ± 0.000    0.175 ± 0.000    0.214 ± 0.000
ISS                   0.309 ± 0.000    0.369 ± 0.000    0.470 ± 0.000    0.433 ± 0.000
Random                0.000 ± 0.016    0.001 ± 0.021    0.003 ± 0.012   -0.004 ± 0.055
Table 3. Spearman Rank Correlation Results for category specific training. When training selectors for specific specifications, WLJ is superior in all but the memory safety case. It may appear that Graves could benefit from category specific training, but Table 4 shows otherwise.

Category Specific Training

So far, we have evaluated networks on their ability to rank verifiers where the specification, φ, is variable. Now, we evaluate how the techniques perform when φ remains constant, as WLJ cannot easily incorporate this information to enhance its predictions and CST does not. Graves can and does incorporate this information when making its prediction. By fixing φ, we remove this advantage from Graves.

We list the results of this evaluation in Table 3 and Figure 4. Each column in Table 3 shows the Spearman correlation for a given technique and configuration when trained and evaluated on a specific problem type, denoted by the column header. We report the average and standard deviation for 10 selectors of each technique and configuration. Each chart in Figure 4 displays the technique’s Top-K error when trained and evaluated on a specific problem type, denoted by the chart title.

We list five versions of Graves: full, reach, term, mem, and flow. Each of these uses the GNN configuration we found to be optimal for the problem from which it derives its name. The full configuration once again uses all edge sets and 5 GAT layers. The reach and mem configurations include ICFG and DFG edges during message passing and 5 GAT layers. The term configuration includes AST and ICFG edges during message passing and 4 GAT layers. The flow configuration includes only AST edges during message passing and 4 GAT layers.

Similarly, we list 4 configurations of WLJ, each being optimal for the problem from which it derives its name. Once again, these results come from the authors' evaluation of WLJ in (richter2020algorithm). Each configuration uses the (i, j) pair found to be optimal for its category in that study.

Figure 4. Graves outperforms other techniques in Reach Safety, Termination, and Memory Safety for values of K less than 6. For K values of 6 or higher, the static selector performs very similarly. In the Overflow category, Graves is only the best when K=1. The Reach Safety and Termination categories have far more training data than the Overflow and Memory Safety categories, which may explain why the latter have more variance (denoted by the shaded regions) and why Graves does poorly on overflow problems.

In (richter2020algorithm), Richter et al. showed that training algorithm selectors for verification on specific problem types can provide significant gains in Spearman correlation. This may be due to their technique's machine learning component. As stated previously, there is not a convenient way to provide information about φ to their SVM's kernel. Thus, it cannot make decisions informed by the problem type. This is an issue, as certain verifiers may perform well at proving termination properties but suffer at proving reachability properties. Because of this, they stand to gain a lot from category specific training.

Since our network ultimately predicts from a graph feature vector, we can append information about φ to the vector. In practice, we map each φ to a number in [0, |Φ|), where Φ is the set of unique φ values. The network can then inform its decision not only by the program being verified, but also by the property. Looking at Table 3 alone, it appears Graves improves in the memory safety and termination categories from category specific training. However, Table 4 shows this is not true.
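A sketch of appending the specification identifier to the pooled graph feature vector before the prediction layers; the property names and vector sizes are illustrative.

    import torch

    PROPERTIES = ["reach_safety", "termination", "memory_safety", "overflow"]

    def with_property(graph_vector: torch.Tensor, prop: str) -> torch.Tensor:
        """Append the numeric id of the specification to the graph feature vector."""
        prop_id = torch.tensor([float(PROPERTIES.index(prop))])
        return torch.cat([graph_vector, prop_id])

    vec = with_property(torch.randn(156), "termination")   # shape: (157,)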

Each column of Table 4 lists the results of Graves trained on the entire training set, but evaluated only on the test set of one φ value. Once again, we show the results for the full network and the optimal configurations for each category. The full configuration once again uses all edge sets and 5 GAT layers. The reach configuration includes AST and ICFG edges during message passing and 5 GAT layers. The term configuration includes AST and DFG edges during message passing and 2 GAT layers. The mem and flow configurations include AST and ICFG edges during message passing and 4 GAT layers.

What we find is that category specific training provides some benefit to Graves in memory safety and overflow, but training on all data leads to improvements in termination and no difference in reach safety. Overall, the differences in Spearman correlation between category specific training and training on the entire dataset are marginal for Graves, making it unnecessary to train a selector for each value of φ.

It is feasible to achieve slightly better results in reach safety and termination by training the WLJ model specifically for those problem sets. It is worth noting that Graves is within one standard deviation in both of these categories and, since category specific training shows little benefit, Graves requires training only a single model to achieve roughly the same results in all but one category. Graves does falter on overflow problems. This is most likely due to the limited data (roughly 300 training examples).

Algorithm Selector    Reach Safety     Termination      Memory Safety    Overflow
GNN (Full)            0.693 ± 0.008    0.868 ± 0.006    0.757 ± 0.015    0.496 ± 0.026
GNN (Reach)           0.698 ± 0.005    0.861 ± 0.008    0.752 ± 0.015    0.496 ± 0.026
GNN (Term)            0.692 ± 0.007    0.874 ± 0.006    0.753 ± 0.010    0.485 ± 0.021
GNN (Mem)             0.696 ± 0.006    0.860 ± 0.009    0.757 ± 0.017    0.514 ± 0.021
GNN (Flow)            0.696 ± 0.006    0.860 ± 0.009    0.757 ± 0.017    0.514 ± 0.021
Table 4. Spearman Rank Correlation Results when the GNN is trained on all data. Graves only has marginal gains in the Overflow category when we do category specific training. In reach safety, we receive the same results, and in termination and memory safety, there is a noticeable improvement from training on all of the training data.

6.2. Ablation Study

To analyze the importance of the components of the network architecture and address RQ2, we perform an ablation study. Looking at Figure 2, there are 3 components to the GNN in Graves: a series of GAT layers, a jumping knowledge layer, and a pooling layer. In order to perform graph prediction using a feedforward neural network, the GNN must output a vector of a fixed size. Thus, we must always have a pooling layer. The remaining two components are the subject of our ablation study.

Figure 5 shows the average successful verifier accuracy, Spearman correlation, and Top-K error for several configurations of Graves' GNN. Each GNN was trained on graphs including all three edge sets. Each GNN had 0 to 5 GAT layers, half with a jumping knowledge layer and half without.

The number of GAT layers in the network has very little effect on any of our metrics. Having at least one GAT layer improves all metrics by at least 2%, which is significant only because it is several standard deviations above the results with no GAT layers. What this does show is that the attention pooling mechanism alone is able to produce a capable selector. Without any GAT layers, the attention pool is able to collate graphs into a vector a neural network can classify with high accuracy.

The jumping knowledge layer appears to play a small, but noticeable role in the network's ability to form predictions. Each network without the jumping knowledge layer performs roughly as well as the network with no GAT layers. When the jumping knowledge layer is included, there is the 2% improvement mentioned earlier.

It is also apparent that variance is much lower when the jumping knowledge layer is present. When the jumping knowledge layer is absent, variance is, with one exception, higher for successful verifier accuracy and Spearman correlation. For Top-K accuracy, the difference in variance when the jumping knowledge layer is not present is far greater. We see a similar phenomenon in the results of category specific evaluation. Reach safety and termination have roughly half as much variance as memory safety and overflow in terms of Spearman correlation. In Top-K accuracy, the variance is nearly imperceptible for reach safety and termination, but clearly visible for memory safety and overflow.

There are many components in the field of GNNs which have counterparts in the field of convolutional neural networks (CNNs). Some GNN techniques can even be thought of as an abstraction of convolutions. Jumping knowledge layers are similar to skip connections in CNNs: both allow the network to use previous states of the input data to make calculations. It has been shown that early convolution layers in a CNN can identify coarse features in an image, like the shape of an animal, and later layers can identify fine grain features, like that the animal is a cat (bau2017network). By allowing the network to see the states of the graph across each layer, it may be able to identify coarse features, like the presence of certain tokens, and then more fine grain features, like intricate branching patterns.

These results are supported by our findings in Section 6: each model we observed to be optimal for a given metric used some number of GAT layers and the jumping knowledge layer.

GAT Layers   Jumping Knowledge   Success Accuracy (mean ± std)   Spearman Correlation (mean ± std)
5            True                0.857 ± 0.008                   0.724 ± 0.006
5            False               0.856 ± 0.006                   0.717 ± 0.005
4            True                0.859 ± 0.006                   0.722 ± 0.006
4            False               0.856 ± 0.011                   0.721 ± 0.006
3            True                0.862 ± 0.002                   0.726 ± 0.003
3            False               0.849 ± 0.012                   0.714 ± 0.008
2            True                0.851 ± 0.006                   0.722 ± 0.006
2            False               0.848 ± 0.008                   0.710 ± 0.006
1            True                0.856 ± 0.006                   0.723 ± 0.004
1            False               0.852 ± 0.016                   0.711 ± 0.007
0            N/A                 0.849 ± 0.005                   0.705 ± 0.003
Figure 5. Ablation study results (mean ± standard deviation). The number of GAT layers has little to no effect on Graves' results; as long as there is at least one GAT layer, adding more makes little difference. The jumping knowledge layer has a slight but noticeable effect on the success and Spearman metrics, and a more noticeable effect on Top-K, where its absence yields more variance and worse results. Jumping knowledge lets the attention pool form its calculations using the graph representation after each GAT layer.

6.3. Mitigating Training Bias

The authors entered the SV-Comp 2022 verification competition with an instantiation of Graves called Graves-CPA. This tool uses a suite of five configurations of the CPAchecker (beyer2011cpachecker) framework: symbolic execution, value analysis, value analysis with CEGAR, predicate analysis, and bounded model checking with K-Induction. To ensure Graves-CPA did not memorize the optimal ordering, its underlying model was trained using a 20% subset of the SV-Comp 2021 benchmark dataset. It is important to note that the SV-Comp dataset evolves each year, so Graves-CPA was evaluated on a slightly different dataset than the one its training subset comes from.

Using this configuration, Graves-CPA achieved a Spearman correlation of 0.698. Since Graves-CPA runs the algorithms in sequence, Spearman correlation is the most relevant metric. Even with the vastly reduced training set, the Graves technique is still able to make accurate predictions, a sign that it is not overfitting to the training set but learning in a way that generalizes. As a result, Graves-CPA placed 6th out of the 16 verifiers competing in the reach safety category.
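As a concrete illustration of the metric, the Spearman rank correlation between a predicted ordering and the ordering realized at evaluation time can be computed directly from the two score vectors; this is a minimal sketch using SciPy with made-up scores, not competition data.

from scipy.stats import spearmanr

# Hypothetical scores for five verifier configurations; higher is better.
predicted = [0.91, 0.34, 0.55, 0.78, 0.12]   # model's predicted scores
observed  = [140.0, 20.0, 95.0, 120.0, 5.0]  # observed performance of each configuration

rho, _ = spearmanr(predicted, observed)      # rank correlation in [-1, 1]
print(f"Spearman rank correlation: {rho:.3f}")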

6.4. Threats to Validity

In the following section, we discuss potential threats to the validity of our model evaluation experiments.

6.4.1. Internal Threats

A potential internal threat to this study is our implementation of the approach. To mitigate this, we inserted assertions in our implementation to ensure it matched our specifications: all edges reference only nodes that exist in the AST, every verification problem in our dataset has a specification and a label, and the model makes valid predictions during training. We also performed sanity checks on the graphs, such as checking that our ASTs were in fact trees.
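A minimal sketch of the kind of structural checks described here, assuming each graph is given as a node list and a list of parent-to-child AST edges (the function and argument names are ours, not the Graves implementation):

from collections import deque

def check_ast_is_tree(nodes, ast_edges):
    # Every edge must reference nodes that exist, and the AST edges must form a
    # tree: a single root, exactly one parent per non-root node, and every node
    # reachable from the root.
    node_set = set(nodes)
    parents, children = {}, {}
    for src, dst in ast_edges:
        assert src in node_set and dst in node_set, "edge references a missing node"
        assert dst not in parents, "AST node has more than one parent"
        parents[dst] = src
        children.setdefault(src, []).append(dst)
    roots = node_set - set(parents)
    assert len(roots) == 1, "AST must have exactly one root"
    seen, queue = set(), deque(roots)
    while queue:
        n = queue.popleft()
        seen.add(n)
        queue.extend(children.get(n, ()))
    assert seen == node_set, "AST is not connected"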

6.4.2. External Threats

An external threat to validity revolves around the training data. To produce our graphs, we used only C programs and Clang's AST tokens. Different compilers and languages may use different tokens, though compilation systems for standardized languages tend to have very similar internal representations. Consequently, we do not know how well Graves would transfer to other languages or ASTs.

The 2018 SV-Comp dataset contains many realistic examples of C software, such as Linux device drivers and the GNU Coreutils. However, it also contains many unrealistic examples, such as 20-SLOC programs containing a simple loop. These unrealistic examples may affect how our models generalize to examples in the wild.

With these issues in mind, we proceeded with this dataset for several reasons. First, many examples, while small, could not be verified by all of the verifiers used to evaluate Graves, which suggests that even small portions of code can give insight into what causes these tools to falter. Second, previous techniques also used the SV-Comp 2018 dataset, so using the same dataset let us perform a direct comparison between techniques. Finally, and most importantly, training requires the ground truth for each verification problem, and building a new dataset was outside the scope of this project. While this dataset is not at the scale of some machine learning datasets, we were able to obtain results that, for the most part, exceeded or were comparable to the previous state of the art.

7. Model Interpretability

7.1. Case Study

To address RQ3, we perform a qualitative study using the GNNExplainer technique to examine which portions of the program affect Graves' predictions. For each verification technique, we select five programs where said technique performs best out of the suite and Graves correctly selects it as the best. We limit our selection to programs that produce graphs with fewer than 500 nodes; graphs above this size were deemed too costly to analyze by hand, even for an expert in program representations and verification techniques.

GNNExplainer produces a score for each edge in a graph. The higher the score, the larger the effect the edge has on the network's ability to make its prediction. Since the range of these scores can vary widely from graph to graph, choosing a single threshold to use across graphs was problematic. Instead, we adopted a ranking approach: for each graph, we look at the n highest scoring edges, where n is defined as follows:

This definition of n identifies edges which are important relative to the given graph.
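Given the per-edge importance scores produced by GNNExplainer, the ranking step then amounts to keeping the n highest-scoring edges of each graph; a minimal sketch (the paper's definition of n is computed separately and passed in):

import torch

def top_edges(edge_index, edge_scores, n):
    # edge_index is the 2 x |E| tensor of edge endpoints and edge_scores is the
    # 1-D tensor of importance scores produced by the explainer. Returns the n
    # highest-scoring edges as (source, target, score) triples.
    n = min(n, edge_scores.numel())
    scores, idx = torch.topk(edge_scores, n)
    return [(int(edge_index[0, i]), int(edge_index[1, i]), float(s))
            for i, s in zip(idx.tolist(), scores.tolist())]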

void err() { ERROR: __VERIFIER_error();}

int * return_self (int * p){
    return p;
}

int main(){
    int a,*q;
    a = 1;
    q = return_self(&a);
    *q = 2;
    if (a != 2) err();
}
Figure 6. The program alias_of_return.i. Of the verification techniques we observed, bounded model checking was the fastest at proving its correctness specification.

To evaluate our graphs, we went through the process of open coding (corbin2014basics). Open coding is a technique in qualitative data analysis which summarizes data using terms or short phrases, called codes, identified by a group of researchers. These codes can then reveal patterns in the data, such as recurring themes or outliers.

To illustrate this process, we examine Figure 7, the graph representation of the program alias_of_return.i shown in Figure 6. The six highest scoring edges are bolded, end in arrows, and are labeled; the remaining edges are less saturated and end in a square.

Figure 7. The program graph Graves generated from alias_of_return.i in Figure 6. GNNExplainer determines that the conditional in the IfStmt is important; this IfStmt determines whether the CallExpr (the assertion function) is called. It also determines that the control edge from the IfStmt to main is important; this edge represents ending the main function, which can only be reached if the assertion is not called.

We start by evaluating the AST edges. The edge labeled 1 goes from the TranslationUnit (the root of the AST) to a Function. From Figure 6, we determine that this function is err(), which unconditionally calls __VERIFIER_error(), so this edge is given the code "Edge to error function". The AST edge labeled 2 deals with the declaration of the variable q. This variable directly affects the value of a, which affects whether or not err() is called, so it is given the code "Decl of indirectly (1) dependent variable": the error statement is directly dependent on a, which is directly dependent on q. We include the value one because dependence chains can be longer. The AST edge labeled 3 deals with the call of the function return_self. This function affects the value of an indirectly (1) dependent variable, so we label the edge "Indirectly (1) dependent function". The last AST edge deals with the conditional if (a != 2). This condition determines whether or not the error function is called and is equivalent to an assertion, so the edge is labeled "Assert condition".

Bounded Model Checking
Code Occurrence
Update Value Of Directly Dependent Variable 0.250
Assert Condition 0.113
Update Value Of Indirectly (1) Dependent Variable 0.087
Edge To Error Function 0.075
Edge To Assert Function 0.050
Decl Of Directly Dependent Variable 0.037
Directly Dependent Function Return 0.037
Typedef 0.037
Edge From Error Function 0.025
Edge To Assert Call 0.025

Bounded Model Checking with K-Induction
Code Occurrence
Update Value Of Indirectly (1) Dependent Variable 0.149
Update Value Of Directly Dependent Variable 0.132
Assert Condition 0.114
Loop Condition 0.096
Decl Of Indirectly (1) Dependent Variable 0.053
Decl Of Directly Dependent Variable 0.044
Update Loop Index Variable Value 0.044
Assert Call 0.026
Branch To Error Call 0.026
Assume Condition 0.018

CEGAR
Code Occurrence
Update Value Of Directly Dependent Variable 0.092
Branch Condition 0.092
If Body 0.076
Unused Function 0.067
Update Value Of Indirectly (1) Dependent Variable 0.059
Infeasible Branch 0.059
Error Call 0.050
Assert Condition 0.042
Decl Of Indirectly (1) Dependent Variable 0.034
If Header 0.034

Symbolic Execution
Code Occurrence
Update Value Of Directly Dependent Variable 0.165
Assert Condition 0.089
Update Value Of Indirectly (1) Dependent Variable 0.063
Branch Condition 0.051
Assume Condition 0.051
Parameter To Indirectly (1) Dependent Function 0.051
Decl Of Directly Dependent Variable 0.038
Call Of Directly Dependent Function 0.038
Main Return 0.038
Call Of Indirectly (1) Dependent Function 0.038
Table 5. The ten most common codes for each verification technique and the rate at which they occur among the edges GNNExplainer determines as important to the model's ability to predict.

Next, the ICFG edges must be labeled. The first ICFG edge, labeled 5, goes from the error function to its body, so it is labeled "Error function". The last edge, labeled 6, goes from the IfStmt to main. Normally, return edges go from return statements to the function header; since there is no return in main, there is an edge from the final statement in the body to the function itself. This edge is therefore labeled "Main return".

If any data edges had been deemed important, they would have been labeled with codes after the ICFG edges. This process is repeated for the remaining 39 graphs.

7.2. Results

Of the 40 programs selected, we identified 74 codes for the 388 identified edges. Several edges could be described with two codes, which left us with 392 uses of our codes. Table 5 lists the ten most common codes for each verification technique and the rate at which they occur; a full list of codes and their rates of occurrence appears in the Appendix. The phrase "directly dependent" refers to functions which contain assertion or error calls, or to variables that are referenced directly in an assertion. The phrase "indirectly (1) dependent" refers to functions which call a directly dependent function, or to variables that directly influence variables referenced in an assertion; "indirectly (2) dependent" means one further step down the dependence chain (a sketch of how these depths can be computed follows below). In the following sections, we describe the overarching patterns and the algorithm-specific patterns that these codes reveal.
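These dependence depths can be thought of as distances in a data-dependence relation. The following is a minimal sketch of how they could be computed, assuming a mapping from each variable to the variables it directly influences (our own representation for illustration, not the one Graves uses):

from collections import deque

def dependence_depth(influences, assertion_vars):
    # influences maps a variable to the variables whose values it directly affects.
    # Depth 0 marks directly dependent variables (referenced in an assertion),
    # depth 1 marks indirectly (1) dependent variables, and so on down the chain.
    influenced_by = {}
    for src, dsts in influences.items():
        for dst in dsts:
            influenced_by.setdefault(dst, set()).add(src)
    depth = {v: 0 for v in assertion_vars}
    queue = deque(assertion_vars)
    while queue:
        v = queue.popleft()
        for u in influenced_by.get(v, ()):
            if u not in depth:
                depth[u] = depth[v] + 1
                queue.append(u)
    return depth

For the program in Figure 6, dependence_depth({"q": ["a"]}, ["a"]) assigns a depth 0 and q depth 1, matching the "directly" and "indirectly (1)" dependent codes used above.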

7.2.1. Overarching Patterns

Overall, the codes imply that the network is learning the reachability problem: to solve reachability problems, verification tools must determine whether an error state, such as an assertion violation, is reachable. For all techniques except CEGAR, the codes that occur most often are "Update Value Of Directly Dependent Variable", "Update Value Of Indirectly (1) Dependent Variable", and "Assert Condition"; for CEGAR, these are the 1st, 5th, and 8th most common codes.

For all four algorithms, the network determines that the assertion condition and the values that condition depends on are important to its ability to make a prediction. This maps directly onto the reachability problem: if an assertion statement is reachable, verifiers must determine whether the condition in the assertion can be invalidated, and if it can, an error state can be reached.

7.2.2. Bounded Model Checking

When an error state is reachable, BMC tools must be able to prove this within a given loop unrolling bound, usually provided as a configuration parameter. When an error state is unreachable, the state space should be finite, and all looping or recursive behavior should have a finite, generally small, number of iterations. Accordingly, the network identifies loop index variables with relatively small bounds as important to selecting bounded model checking.

The model also determines that many indirectly dependent functions, variables, and edges are important, more so than for the other algorithms. These factors indicate how complex it can be to reach an assertion statement. If there is no complex looping or recursive behavior, BMC tools will most likely be more efficient than a BMC+K or CEGAR tool. Symbolic executors may be competitive, and, as we show in Section 7.2.5, there are commonalities between the two algorithms' codes.

7.2.3. Bounded Model Checking with K-Induction

K-Induction is a generalization of the induction principle. Where standard induction proves a base case and that the property holding at step n implies it holds at step n+1, K-Induction proves the base cases and that the property holding at k consecutive steps implies it holds at step n+k. By integrating K-Induction into BMC, tools do not have to exhaustively explore looping and recursive behavior; instead, they can attempt to prove that the base cases and the n+k case are safe.

Over 30% of all edges deemed important by the model for BMC+K graphs deal with looping behavior, whether it be the loop itself or a loop index variable which bounds the loop’s execution. Two thirds of the loop index variable edges deal with large or non-determinate bounds, scenarios BMC alone should not be able to handle.

7.2.4. Counter-example Guided Abstraction Refinement

Conditional statements are crucial to a CEGAR approach: it must refine its abstract domain so that it can accurately capture the branching behavior of a program. An assertion statement is syntactic sugar for the statement "if(!condition) error()".

The CEGAR program graphs contained the largest set of codes. Roughly 24% of the identified edges dealt directly with conditional statements, such as if statements, switch statements, or loop conditions. Over 25% of the edges dealt with the values used to evaluate conditions, like loop index values, directly dependent variables, and indirectly dependent variables.

7.2.5. Symbolic Execution

Symbolic executors operate by exploring the execution tree of a program. While exploring a path, if they encounter an error state, they check whether an assignment to the free variables (inputs) makes the path to that state feasible. As a result, like BMC tools, they struggle with large or unbounded loops and recursion. Just as with the BMC program graphs, the model identifies loops and loop index variables with small bounds as important.

One of the symbolic execution programs contains a loop whose bound is a free variable (meaning the loop has an indeterminate bound). Normally, the symbolic executor would need to iterate over the loop up to the maximum possible value of the free variable. However, an assumption inserted into the program bounds the size of the free variable, which allows the symbolic executor to bound the loop. The model determines that both the assumption and the loop index variable are important in this case. GNNExplainer identifies assumption statements far more often in programs where symbolic execution performs best than in those of the other techniques, most likely because of this free-variable loop bound phenomenon.

7.3. Threats to Validity

In the following section, we describe potential threats to the validity of our experiment.

7.3.1. Internal Threats

Because this is a qualitative study, potential internal threats to validity come from the biases of the researchers. The codes we used come from terms and definitions used by the community at large (ferrante1987program; ottenstein1984program; aho1986compilers). Thus, any (code, edge) pair should be determinable by other experts.

7.3.2. External Threats

As with the experiments in Section 6, a potential threat to this study is the choice of data. To keep the problem tractable and to allow us to reason about each graph as a whole, we limited the size of the graphs we evaluated, and we randomly selected the programs for this study from this abbreviated set in order to avoid bias.

8. Conclusion

In this work, we have proposed Graves, a technique that performs algorithm selection for program verifiers using graph neural networks. Graves automatically generates a graph representation of a program using traditional graph representations that preserve the semantic and syntactic components of said program. Using graph neural networks, Graves calculates a graph feature vector; through several attention mechanisms, the network learns to form this vector by emphasizing the portions of the graph that help it make more accurate predictions. Graves passes this vector to a simple feedforward neural network, which scores verifiers on how likely they are to successfully verify the given program.

We evaluated Graves using three metrics on over 8000 programs against several baseline techniques. We found that Graves outperforms the state-of-the-art baseline techniques we evaluated on the problem of selecting a verifier for a given program and property pair by over 10%.

We performed a study to interpret how Graves determines which algorithm to select, looking at three fundamental techniques and one variant: CEGAR, symbolic execution, bounded model checking, and bounded model checking with K-Induction. We found that Graves identifies portions of the graph related to the verification problem at hand, and that it selects portions of the graph specific to the given algorithm's approach.

Moving forward, we would like to explore other applications of GNNs to software engineering problems. While we only explored the problem of verification algorithm selection, there is reason to believe that this approach could produce strong results across the software engineering space. We would like to explore modifying our graphs using compiler optimizations, such as dead code elimination or loop unrolling, as this may produce richer graphs that are closer to a verifier's abstraction. We would also like to apply our technique to SMT tools and model counters by encoding logical formulas as graphs.

Acknowledgements

We would like to thank Hongning Wang for his advice on graph neural networks and prediction systems. This material is based in part upon work supported by the U.S. Army Research Office under grant number W911NF-19-1-0054 and by the DARPA ARCOS program under contract FA8750-20-C-0507.

References

Appendix A Data

GAT Layers   AST     ICFG    DFG     Success Accuracy (mean ± std)   Spearman Correlation (mean ± std)
0            N/A     N/A     N/A     0.849 ± 0.005                   0.705 ± 0.003
1            False   True    False   0.860 ± 0.006                   0.720 ± 0.004
1            False   True    True    0.857 ± 0.009                   0.718 ± 0.007
1            True    False   False   0.854 ± 0.011                   0.718 ± 0.004
1            True    False   True    0.859 ± 0.008                   0.719 ± 0.005
1            True    True    False   0.858 ± 0.007                   0.719 ± 0.004
1            True    True    True    0.857 ± 0.008                   0.720 ± 0.008
2            False   True    False   0.857 ± 0.012                   0.723 ± 0.004
2            False   True    True    0.857 ± 0.009                   0.721 ± 0.002
2            True    False   False   0.857 ± 0.008                   0.723 ± 0.004
2            True    False   True    0.864 ± 0.009                   0.724 ± 0.005
2            True    True    False   0.856 ± 0.008                   0.722 ± 0.005
2            True    True    True    0.860 ± 0.006                   0.726 ± 0.007
3            False   True    False   0.863 ± 0.008                   0.727 ± 0.003
3            False   True    True    0.854 ± 0.006                   0.724 ± 0.004
3            True    False   False   0.862 ± 0.008                   0.720 ± 0.006
3            True    False   True    0.859 ± 0.006                   0.721 ± 0.006
3            True    True    False   0.860 ± 0.009                   0.726 ± 0.007
3            True    True    True    0.859 ± 0.006                   0.725 ± 0.003
4            False   True    False   0.859 ± 0.009                   0.725 ± 0.006
4            False   True    True    0.865 ± 0.008                   0.725 ± 0.004
4            True    False   False   0.866 ± 0.008                   0.720 ± 0.005
4            True    False   True    0.865 ± 0.009                   0.722 ± 0.005
4            True    True    False   0.863 ± 0.008                   0.726 ± 0.006
4            True    True    True    0.851 ± 0.006                   0.722 ± 0.006
5            False   True    False   0.859 ± 0.005                   0.722 ± 0.004
5            False   True    True    0.865 ± 0.008                   0.725 ± 0.004
5            True    False   False   0.865 ± 0.010                   0.720 ± 0.005
5            True    False   True    0.861 ± 0.010                   0.722 ± 0.004
5            True    True    False   0.860 ± 0.006                   0.726 ± 0.004
5            True    True    True    0.857 ± 0.006                   0.724 ± 0.006
Table 6. Average results (mean ± standard deviation) for all configurations of Graves' GNN trained on the entire training set.


Code   BMC   BMC + K   CEGAR   Symbolic Execution
Assert Call 0.000 0.026 0.008 0.025
Assert Condition 0.113 0.114 0.042 0.089
Assert Function 0.000 0.000 0.008 0.013
Assert Return 0.000 0.018 0.025 0.013
Assume Condition 0.013 0.018 0.000 0.051
Assume Function 0.000 0.009 0.000 0.013
Branch Condition 0.000 0.009 0.092 0.051
Branch To Error Call 0.000 0.026 0.000 0.000
Branch To Loop 0.000 0.018 0.008 0.000
Branch To Loop Back Edge 0.000 0.000 0.008 0.000
Branching Statement 0.000 0.000 0.000 0.013
Call Of Directly Dependent Function 0.000 0.000 0.017 0.038
Call Of Indirectly (1) Dependent Function 0.000 0.000 0.000 0.038
Conditional Branch 0.000 0.000 0.017 0.000
Decl Of Directly Dependent Variable 0.037 0.044 0.017 0.038
Decl Of Indirectly (1) Dependent Variable 0.013 0.053 0.034 0.013
Decl Of Input 0.013 0.018 0.017 0.000
Decl Of Loop Index Variable 0.000 0.009 0.008 0.000
Decl Of Loop Index Variable (Bound Small Value) 0.000 0.009 0.000 0.013
Decl Of Loop Index Variable (Unbounded) 0.000 0.009 0.000 0.000
Decl Of Unused Variable 0.000 0.009 0.000 0.000
Directly Dependent Function 0.013 0.009 0.025 0.025
Directly Dependent Function Header 0.000 0.000 0.017 0.000
Directly Dependent Function Return 0.037 0.009 0.000 0.000
Edge From Error Function 0.025 0.000 0.000 0.000
Edge To Assert Call 0.025 0.000 0.000 0.000
Edge To Assert Condition 0.013 0.000 0.000 0.000
Edge To Assert Function 0.050 0.009 0.000 0.000
Edge To Directly Dependent Function Call 0.013 0.000 0.000 0.000
Edge To Error Function 0.075 0.000 0.008 0.000
Edge To Loop 0.000 0.009 0.000 0.000
Edge To Loop Index Variable (Bound Small Value) 0.013 0.000 0.000 0.000
Error Call 0.013 0.018 0.050 0.025
Error Function 0.013 0.000 0.000 0.000
If Body 0.000 0.000 0.076 0.000
If Header 0.000 0.000 0.034 0.013
Indirect (1) Edge To Error 0.013 0.000 0.000 0.000
Indirectly (1) Dependent Function 0.025 0.000 0.000 0.013
Infeasible Branch 0.000 0.000 0.059 0.000
Input Call 0.013 0.009 0.008 0.025
Input Function 0.000 0.000 0.008 0.000
Input Variable 0.013 0.018 0.017 0.000
Loop Body (Bound Large Value) 0.000 0.009 0.000 0.000
Loop Body (Bound Small Value) 0.000 0.000 0.000 0.025
Loop Body (Nondet Bound) 0.000 0.000 0.008 0.013
Loop Break 0.000 0.009 0.000 0.000
Loop Condition 0.013 0.096 0.025 0.025
Loop Exit 0.000 0.009 0.000 0.000
Loop Index Variable Start Value (Bound Large Value) 0.000 0.009 0.000 0.000
Loop Index Variable Start Value (Bound Small Value) 0.013 0.018 0.000 0.000
Loop Index Variable Start Value (Infinite Loop) 0.000 0.009 0.008 0.000
Loop Index Variable Start Value (Nondet Bound) 0.000 0.009 0.008 0.000
Loop Index Variable Start Value (Unbounded) 0.000 0.009 0.000 0.000
Main Return 0.013 0.000 0.000 0.038
Main Return Value 0.013 0.000 0.008 0.000
Nested If 0.000 0.000 0.017 0.000
Nested Loop 0.000 0.009 0.008 0.000
Parameter To Directly Dependent Function 0.000 0.009 0.017 0.013
Parameter To Indirectly (1) Dependent Function 0.013 0.000 0.000 0.051
Return Of Directly Dependent Function 0.000 0.000 0.000 0.025
Return Of Indirectly (1) Dependent Function 0.000 0.000 0.008 0.000
Return Of Input Call 0.000 0.000 0.017 0.000
Return Of Input Function 0.000 0.000 0.008 0.000
Return Value Of Indirectly (1) Dependent Function 0.025 0.000 0.017 0.000
Switch Statement 0.000 0.000 0.008 0.000
Typedef 0.037 0.000 0.000 0.000
Unused Function 0.000 0.000 0.067 0.000
Update Loop Index Variable Value 0.000 0.044 0.017 0.000
Update Of Unimportant Variable 0.000 0.000 0.000 0.013
Update Value Of Directly Dependent Variable 0.250 0.132 0.092 0.165
Update Value Of Indirectly (1) Dependent Variable 0.087 0.149 0.059 0.063
Update Value Of Indirectly (2) Dependent Variable 0.000 0.000 0.000 0.038
Update Value Of Loop Index Variable 0.000 0.018 0.000 0.000
Update Value Of Loop Index Variable (Bound Small Value) 0.000 0.000 0.000 0.025
Table 7. Full list of codes and occurrence rate for the qualitative study in Section 7