I. Introduction
Symbolic execution [26] is one of the most powerful techniques for bug finding and program testing. It has been used successfully on real-world code [7, 20, 38], as well as in reverse engineering [9] and program repair [34]. Many real vulnerabilities have been found using either coverage-based fuzzing, e.g., AFL [3], or symbolic execution. In the DARPA Cyber Grand Challenge, combining fuzzing and symbolic execution was found to be effective [39, 40].
Our goal is to help improve symbolic execution through benchmarking, in particular, benchmarks which exercise its core reasoning techniques: constraint solving, path exploration, and search. While path explosion [8, 39] is a challenge for symbolic execution, we show that systematically varying problem difficulty with benchmarks that combine constraint solving and path exploration is a way of measuring the performance of a symbolic execution engine. Constraint solving is also reported to be the dominant cost in symbolic execution of nontrivial programs [41, 29, 35]. Programs which naturally exhibit such challenges are those where computation is difficult to invert, such as cryptographic computations, or deliberately obfuscated code (including malware) written intentionally to be difficult to analyze [47]. One intrinsic reason for this difficulty is that such code is combinatorially hard for symbolic execution (in path exploration, constraint reasoning, or both).
In this paper, we propose that the intrinsic path-condition challenges which underlie symbolic execution can be exercised and evaluated directly by using benchmarks built from nontrivial combinatorial problems. We are also motivated by how “hard benchmarks” from the SAT competition have been used to drive the development of SAT solvers.
There are also other practical issues which can challenge symbolic execution, such as system calls, memory modeling, and heap reasoning [8, 17]; these are orthogonal to this paper. Our focus is on evaluating the core reasoning power of symbolic-execution-based tools independently of the other challenges above. Unlike typical benchmarking of symbolic execution tools, which uses existing programs, we design special synthetic benchmarks to deliberately “stress test” path conditions (constraint solving), path explosion, and exploration.
Symbolic execution relies on constraint solving, typically via satisfiability modulo theories (SMT) solvers, and on handling path conditions and path exploration. Since the core techniques rely essentially on constraint solving and search, we propose using discrete combinatorial problems for benchmarking. One well-known framework for discrete combinatorial problems is the Constraint Satisfaction Problem (CSP). We transform CSP instances into C programs to create our benchmarks. The transformation makes reaching a particular program point equivalent to determining the satisfiability of the underlying CSP (and also finding a solution). We give several transformations (into C code) for the same CSP, resulting in different C programs. Ideally, as the CSP is the same, different tools should have similar runtimes. The results show this is not the case, demonstrating how our benchmarks can expose weaknesses in the core reasoning of the analysis tools. The benchmarks also test the sensitivity of symbolic execution to different programs for the same problem, as well as scalability (as reasoning becomes more difficult).
Our preliminary experiments are on the following tools, which analyze C programs using LLVM [30] as the common program representation: KLEE [7], varying the constraint solvers (STP [19] and Z3 [15]); TracerX [42], which adds interpolation to KLEE; and the LLBMC [18] bounded model checker (LLBMC can be considered “static” symbolic execution [4]). The experiments show that the KLEE-based tools are much more sensitive than LLBMC to the C constructs and structure of the code. We also show that the transformation (encoding) can affect different solvers in rather different ways. As our benchmarks involve intrinsically harder path exploration, preliminary experiments suggest that the execution times of the KLEE-based systems depend more heavily on the solver used than on the sophistication of the search technique; for instance, it seems that interpolants and abstraction learning do not yet yield sufficient gains. There are also substantial differences in running times from merely replacing the solver used by the same tool.

II. Related Work
Symbolic execution generalizes program instructions into mathematical relations (constraints) between the state before and after each instruction [26]. In this way, instead of actually executing each instruction, a symbolic execution system collects the constraints translated from each instruction into a conjunction, called the path condition, which can be tested for satisfiability using constraint solvers, e.g., satisfiability modulo theories (SMT) solvers [33] such as STP [19] or Z3 [15]. Symbolic execution is widely used for white-box fuzzing; the advantage is that each program path is tested only once, whereas traditional black-box fuzzing may exercise the same path multiple times.
KLEE [7, 28, 27] is one of the most well-known symbolic execution tools and has been shown to be effective. A critical challenge in symbolic execution is scalability. One approach is TracerX [42], which extends KLEE with learning (interpolation) to prevent redundant execution [24] and thus mitigate path explosion. A recent work, chopped symbolic execution [41], also reduces path explosion by extending KLEE so that parts of the code can be manually excluded and executed lazily during symbolic execution, making overall execution faster. In our benchmarks, the programs are transformations of nontrivial CSP benchmarks such that there is no obvious part of the code, i.e., of the transformed C program, that can be removed from consideration; thus, chopped symbolic execution is not applicable. In this paper, our benchmarks are C code, hence we focus on KLEE-based systems.
Other approaches related to symbolic execution are DART [20], CUTE [38], SAGE [21], and SE [10]. These can be categorized as concolic testers, where the construction of the path condition follows an actual execution of the program. Surveys of symbolic execution are in [5, 8].
There has been prior work on measuring the capabilities of symbolic execution tools. Several challenges are identified with concolic testing, in particular constraint solving [25]. Benchmarking symbolic execution tools with logic bombs, i.e., code fragments which can cause difficulty for symbolic execution (such as buffer overflow, arithmetic overflow, etc.), is proposed in [46]. We differ in that we address challenges which occur even in the simplest code; this is different from challenges arising from undefined behavior in C such as buffer overflow. Another challenging problem for symbolic execution is handling cryptographic functions [45, 11]; their combinatorial challenges are related to the CSP benchmarks we propose. An evaluation of symbolic execution tools applied to industrial software is in [36], which considers limitations in handling language features. Language features are also tested in [13] using Java and .NET test generators.
III. CSP Background
One formulation of constraint problems is the Constraint Satisfaction Problem (CSP). A CSP is a pair P = (X, C) where X is a set of variables and C a set of constraints [16, 37]. In this paper, we focus on finite discrete combinatorial problems. In a finite-domain CSP, variables take values from their domain, which is a finite set of values. Each constraint c in C has two components: a scope, scp(c), which is an ordered subset of the variables of X; and a relation, rel(c), over the scope, which is the set of satisfying tuples (combinations of values) for the variables in scp(c). A constraint c is satisfied if there is a valuation for its variables, drawn from the variables’ domains, that matches at least one of the tuples in its relation. A solution to P is a valuation for X such that every constraint is satisfied. P is satisfiable iff a solution exists; if no solution exists, then P is unsatisfiable.
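To make the definitions concrete, the following sketch enumerates all valuations of a tiny, hypothetical binary-domain CSP (variables x, y, z with constraints x != y and y != z; this example is ours, not from the benchmarks) and checks every constraint, mirroring the guess-and-check view of CSP solving:

```c
#include <assert.h>

/* A tiny hypothetical CSP over three 0/1 variables x, y, z with two
 * constraints: c1: x != y, and c2: y != z. */
static int satisfies(int x, int y, int z) {
    return (x != y) && (y != z);   /* every constraint must hold */
}

/* Brute-force search over the finite domains: returns 1 iff a
 * solution exists, i.e. the CSP is satisfiable. */
static int solvable(void) {
    for (int x = 0; x <= 1; x++)
        for (int y = 0; y <= 1; y++)
            for (int z = 0; z <= 1; z++)
                if (satisfies(x, y, z))
                    return 1;   /* (x, y, z) is a solution */
    return 0;                   /* unsatisfiable */
}
```

For instance, (0, 1, 0) is a solution here, so this CSP is satisfiable.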
TABLE I. Transformation versions, with benchmark statistics given as (#lines, #variables) averages per problem class.

Extensional
Version  Construct  Operator  Grouped  SatAim100    SatAim200    Dubois
1        if         logical   no       (853, 100)   (2080, 200)  (933, 130)
2        if         logical   yes      (352, 100)   (657, 200)   (432, 130)
3        if         logical   all      (316, 100)   (616, 200)   (412, 130)
4        if         bitwise   no       (1040, 100)  (2080, 200)  (933, 130)
5        if         bitwise   yes      (352, 100)   (657, 200)   (432, 130)
6        if         bitwise   all      (316, 100)   (616, 200)   (412, 130)
7        assume     logical   no       (782, 100)   (1552, 200)  (759, 130)
8        assume     logical   yes      (531, 100)   (1033, 200)  (676, 130)
9        assume     logical   all      (516, 100)   (1016, 200)  (672, 130)
10       assume     bitwise   no       (782, 100)   (1552, 200)  (759, 130)
11       assume     bitwise   yes      (531, 100)   (1033, 200)  (676, 130)
12       assume     bitwise   all      (782, 100)   (1552, 200)  (759, 130)

Intensional
Version  Construct  Operator  Grouped  AllInterval  CostasArray  HayStacks
1        if         NOP       no       (549, 74)    (718, 98)    (3302, 173)
2        if         logical   yes      (259, 74)    (361, 98)    (555, 173)
3        if         logical   all      (239, 74)    (310, 98)    (535, 173)
4        if         bitwise   yes      (252, 74)    (354, 98)    (548, 173)
5        if         bitwise   all      (238, 74)    (309, 98)    (534, 173)
6        assume     NOP       no       (2727, 74)   (1189, 98)   (2231, 173)
7        assume     logical   yes      (242, 74)    (323, 98)    (538, 173)
8        assume     logical   all      (238, 74)    (309, 98)    (534, 173)
9        assume     bitwise   yes      (242, 74)    (323, 98)    (538, 173)
10       assume     bitwise   all      (238, 74)    (309, 98)    (534, 173)
A constraint can be defined in two ways: (i) intensionally; or (ii) extensionally (also known as a table constraint). An intensional constraint is one where the relation of the constraint is defined implicitly. This is a common form of constraint supported in constraint solvers, e.g., a linear arithmetic inequality is implicitly understood in the usual arithmetic fashion over either real numbers or integers. Constraints can also be defined extensionally: the relation is given as a (positive) table listing the satisfying tuples of the constraint (respectively, a “negative table” defining the set of “negative tuples”, namely, the tuples not satisfying the constraint). Consider, for illustration, a constraint x != y where the domain of the variables is binary. The expression x != y is the intensional form of the constraint. The positive extensional form is given by the tuples {(0, 1), (1, 0)} (respectively, the negative extensional form is {(0, 0), (1, 1)}) for the variable sequence (x, y).
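The intensional/extensional distinction can be sketched in C for a hypothetical binary constraint x != y (our illustration; the table contents below are the positive tuples for that constraint, not taken from the benchmarks):

```c
#include <assert.h>

/* Intensional check: the relation is implicit in the expression. */
static int check_intensional(int x, int y) {
    return x != y;
}

/* Extensional check: the relation is an explicit positive table of
 * satisfying tuples for the variable sequence (x, y). */
static const int pos_table[][2] = { {0, 1}, {1, 0} };

static int check_extensional(int x, int y) {
    for (unsigned i = 0; i < sizeof pos_table / sizeof pos_table[0]; i++)
        if (pos_table[i][0] == x && pos_table[i][1] == y)
            return 1;   /* (x, y) matches a satisfying tuple */
    return 0;           /* no tuple matches */
}
```

Both checks accept exactly the same pairs over the binary domain; only the representation of the relation differs.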
As our benchmarks use the XCSP3 format, we briefly describe it. Figure 1 illustrates three kinds of constraints in the XCSP3 format [6, 44]. XCSP3 is an XML format for defining CSP instances which can be used with many constraint solvers; it is also used in the XCSP3 competition for constraint solvers [43]. We use XCSP3 problem instances in our experiments. The details of the XCSP3 format are beyond the scope of this paper, so the XCSP3 examples are meant to be illustrative only.
We consider CSPs where the constraints are in: (i) intensional form, and (ii) extensional form (see Figure 1 (a), which defines a negative table shared by two constraints over different variable sequences using the same table definition). An example of an intensional constraint is eq(%0,dist(%1,%2)), representing %0 = |%1 - %2| for a given substitution of %0, %1, and %2 (see Figure 1 (b)); another is the alldifferent constraint (see Figure 1 (c)), which constrains the variables to all take different values. We remark that alldifferent is considered a global constraint, but in this paper what matters is the intensional versus extensional distinction of the constraint.
IV. Transforming Combinatorial Problems to C
The idea behind turning a CSP into a C program is as follows. Solving a combinatorial problem can be imagined as two components: first, an oracle that guesses a solution; and second, a checker for the solution. We can think of symbolic execution as finding a solution which the oracle could have returned, and we transform the CSP into a C program that functions as the checker.
Our approach is to transform a combinatorial problem, a CSP P, into a C program to be tested on program analysis tools as follows: (i) the finite-domain variables of the CSP correspond to integer variables in the program (the C variables are treated as symbolic); (ii) the CSP variable domains are converted to assume statements (see below); and (iii) the constraint relations are encoded into conditional statements in the program or as assume statements. The encoding of the CSP to C ensures that when the CSP P is satisfiable, (symbolic) execution is able to reach a distinguished program point, and the values of the C variables at that point are a solution to P. To test the analysis tools, the distinguished program point is mapped to an assertion failure, i.e., assert(0). Similarly, when the CSP P is unsatisfiable, the assert cannot be reached, i.e., no path reaches the assertion.
From a single CSP P, we propose several transformations of P into C programs. The purpose of the various transformations, which result in different programs, is to exercise the tools in different ways (as we show in the results). We employ two general approaches for encoding the constraints in P:

if approach: if statements are created whose conditions capture the values of variables satisfying the constraint. Execution can terminate with an exit(0) statement in the else branches. If it is possible for execution to take the then branches, then there are satisfying values for the variables of the encoded constraint. In symbolic execution, the condition is simply added to the current set of constraints, i.e., the path condition.

assume approach: the constraint solver of the analysis tool is used directly: the constraints of the CSP are translated into arguments to assume statements (klee_assume in KLEE or __llbmc_assume in LLBMC). When an assume statement is executed in symbolic execution, the constraint is simply added to the path condition and is not itself executed as C, whereas the if is C code.
When all constraints have been translated, we place an assert(0) at the end of the program (the distinguished program point). The assert(0) triggers an assertion failure, so if the CSP is satisfiable (no contradiction is found), the program terminates with an assertion failure; the test case generated by symbolic execution, e.g., by KLEE or TracerX, is then a solution for P. If the CSP is unsatisfiable, the assertion failure is not reachable and no failure will be reported.
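A minimal sketch of how such a transformed checker might look, for a hypothetical two-variable CSP with constraint x != y over binary domains (our illustration; in the real benchmarks x and y are made symbolic with klee_make_symbolic, the domains are constrained with klee_assume, the else branch is exit(0), and reaching the end is assert(0) — here concrete candidate values are passed in so the sketch runs stand-alone):

```c
#include <assert.h>

/* "if"-approach checker: each constraint becomes an if whose else
 * branch abandons the path.  Returning 1 models reaching the
 * distinguished assert(0) point, i.e. (x, y) solves the CSP. */
static int reaches_assert(int x, int y) {
    if (!(x >= 0 && x <= 1)) return 0;  /* domain of x (klee_assume in KLEE) */
    if (!(y >= 0 && y <= 1)) return 0;  /* domain of y */
    if (x != y) {
        return 1;                       /* distinguished point: assert(0) */
    } else {
        return 0;                       /* exit(0) in the benchmark */
    }
}
```

Under symbolic execution, any path reaching the distinguished point yields a test case whose variable values solve the CSP.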
Table I (to the left of the double vertical line) shows how different features are combined to obtain the different transformed C programs (versions) for a CSP P. Overall we have designed 12 extensional transformations for extensional CSPs and 10 intensional transformations for intensional CSPs. (Intensional versions do not need operators in every case, shown as NOP in the table; for example, both Extensional Versions 1 and 4 correspond to Intensional Version 1, which explains the difference in the number of transformations for extensional and intensional problems.) Each transformation version (a row in Table I) employs different C constructs on the same underlying CSP P. The transformations are designed to be correct by construction. There is insufficient space to formalize the transformations; instead, we present them by example.
Figure 2 (a) shows how the negative tables in Figure 1 (a) are transformed into C under the Version 1 transformation. The green coloured code in Figure 2 (a) shows the details of the transformation common to all versions for a particular CSP: C variable declaration, KLEE symbolic variable construction, constraining of the variable domains, and the distinguished assertion program point. CSP variables are symbolic in KLEE and domains are encoded with klee_assume (similarly for LLBMC). Reaching the assert means the CSP is satisfiable.
Similarly, Figures 2 (b) and (c) are the results of Extensional Versions 5 and 8 on the same constraint. Figures 2 (d) and (e) are from Intensional Versions 1 and 2 on Figure 1 (b), and Figure 2 (f) is from Intensional Version 9 on Figure 1 (c). Figure 2 (g) is from Intensional Version 3, combining the constraints in Figures 1 (b) and (c) as a single problem.
Table I shows how the features in each column (Construct, Operator, Grouped) are combined to give a particular version. The Construct column gives the C construct used in the program, i.e., the if or assume approach. Figures 2 (a) and (b) are the transformation results of the if approach on the constraint example in Figure 1 (a), while Figure 2 (c) shows the result of the assume approach, which uses an assume statement (the example is for KLEE using klee_assume; when testing on LLBMC, __llbmc_assume is used).
The Operator column gives the operators used in the conditions: C logical operators (&& or ||) or C bitwise operators (& or |), which are used to combine the expressions making up the conditions. The difference is that the logical operators give C-style short-circuiting of conditions, i.e., the condition is broken up into a cascading conditional branching structure with a single atomic condition guarding each branch. Atomic conditions do not contain logical or bitwise operators.
Figures 2 (b) and (f) show the usage of bitwise operators in the transformation, while Figures 2 (a), (c), (e) and (g) show the usage of logical operators.
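The two operator choices can be sketched as follows for a hypothetical condition with two atomic sub-conditions (x == 0 and y == 1; names are ours, not from the benchmarks). The logical form short-circuits, so symbolic execution branches once per atomic condition, while the bitwise form evaluates both operands and yields a single combined branch:

```c
#include <assert.h>

/* Logical (short-circuit) encoding: && compiles to a cascade of
 * branches, one per atomic condition, so symbolic execution forks
 * at each one.  Written here as the equivalent cascading ifs. */
static int cond_logical(int x, int y) {
    if (x == 0) {
        if (y == 1)
            return 1;
    }
    return 0;
}

/* Bitwise encoding: & evaluates both atomic conditions
 * unconditionally, giving one branch on the combined 0/1 value. */
static int cond_bitwise(int x, int y) {
    return (x == 0) & (y == 1);
}
```

The two functions compute the same truth value; they differ only in the branching structure presented to the symbolic executor.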
For intensional transformations, when there is no grouping of conditions, no operator is needed (denoted by NOP in Versions 1 and 6 in Table I and illustrated in Figure 2 (d)).
The CSP may have several constraints defined in the same way but on different variables; e.g., in Figure 1 (a), there is a single constraint group with two individual constraints. The Grouped column shows how the translation is grouped: per constraint group (yes); per individual constraint defined in the group (no); or with the entire CSP grouped together as a single condition (all). For example, Figures 2 (a) and (d) show the no case, Figure 2 (g) shows the all case given the constraints in Figures 1 (b) and (c), and the rest of Figure 2 shows the yes case. Figure 2 (g) simply places the assert(0) statement in the then body of the if statement. We remark that as the benchmarks are for testing path explosion and constraint solving, the transformations produce straight-line code. It would be straightforward to produce more compact forms with loops, but that would only make symbolic execution more complex.
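The three grouping levels can be sketched on a hypothetical CSP with one constraint group {x != y, y != z} plus a separate constraint z <= 1 (our illustration; function names are assumptions):

```c
#include <assert.h>

/* Grouped = "no": one condition per individual constraint. */
static int ok_no(int x, int y, int z) {
    if (!(x != y)) return 0;
    if (!(y != z)) return 0;
    if (!(z <= 1)) return 0;
    return 1;
}

/* Grouped = "yes": the constraints of one group share a condition. */
static int ok_yes(int x, int y, int z) {
    if (!((x != y) && (y != z))) return 0;  /* whole group at once */
    if (!(z <= 1)) return 0;
    return 1;
}

/* Grouped = "all": the entire CSP becomes a single condition, which
 * in the benchmarks directly guards the assert(0). */
static int ok_all(int x, int y, int z) {
    return (x != y) && (y != z) && (z <= 1);
}
```

All three accept exactly the same valuations; they differ only in how many conditions (and hence branch points) the symbolic executor sees.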
V. Experiments
V-A. Experimental Setup
We present experiments evaluating selected symbolic execution tools on the C programs created by our transformations of extensional and intensional CSP problems from the XCSP3 benchmarks [44]. We tested three classes of extensional constraint problems: AIM100 and AIM200 (each consisting of 16 satisfiable and 8 unsatisfiable problems), and Dubois (30 unsatisfiable problems). For intensional CSP problems, we used the AllInterval (19 satisfiable problems), CostasArray (8 satisfiable problems), and Haystacks (14 problems) problem classes. The intensional constraints used in these CSP instances are shown in Table II. Statistics on the C benchmarks are given to the right of the double vertical line in Table I: the average number of lines of code in the C programs and the average number of variables of the CSP instances for the problem classes tested. These problem instances from the XCSP3 benchmarks were chosen primarily because their transformation results are comparatively smaller than those of other instances (e.g., most other extensional XCSP3 benchmarks gave much larger C programs which could not finish running in a day). All C programs in our benchmark suite are compiled into common bytecode (LLVM bitcode) instances using LLVM’s clang at -O3. Our benchmarks are publicly available [14].
Experiments were run on a 3.40 GHz Intel i7-6700 machine with 32 GB RAM running Ubuntu 16.04. We focus on C code and only evaluate tools which are available for C. To keep things on a common footing, we experimented with the following LLVM-based tools, analyzing the same compiled C instances: KLEE [7] with STP and Z3 as the backend SMT solvers, TracerX [42], which uses only Z3 as the backend solver, and LLBMC, which uses STP as the backend solver. KLEE is used as it is perhaps the most well-known symbolic execution engine and is commonly used; at least 130 papers use KLEE [27]. It was placed 2nd, the highest among systematic tools, in the 2019 Test-Comp competition. TracerX builds upon Tracer [22], extending KLEE with abstraction learning [23] and interpolants [12], which can prune paths to mitigate the path explosion problem. LLBMC [18] is not a symbolic execution engine but a bounded model checker; however, it addresses the same core issues (constraint encoding, constraint solving, solver search space) and uses the same underlying solver (STP). The results show LLBMC to be a relevant and useful comparison. All tools used a timeout of 1000 seconds. (LLBMC exits upon finding a solution; for a fairer comparison, we use the exit-on-error option so that KLEE and TracerX also exit upon reaching the assert(0) statement (cf. Section IV) at the end of the program.)
We also ran the AbsCon CSP solver [31, 1] with default options on the original CSP instances as a baseline comparison for the XCSP3 instances. AbsCon is a Java-based constraint solver using the XCSP3 format; as such, no further input encoding or transformations are needed for XCSP3 instances, making it a convenient comparison. We use the AbsCon baseline to make it easy to present the timings, normalizing the execution time of the tools to the AbsCon timing. (AbsCon is not meant for solver comparison; interestingly, some LLBMC timings are faster.)
We evaluate the benchmarks on the analysis tools along two dimensions: (i) transformation robustness; and (ii) tool scalability. Transformation robustness measures how varying the transformation used on a single problem instance changes the execution time. Ideally, if a tool is robust with respect to the transformations, the timing variations should be small, i.e., it is insensitive to them. We highlight that the different transformed C programs encode the same instance P; hence, the programs can be “considered equivalent”. One might expect (or even demand) a certain level of robustness on equivalent code. However, the results show that the tools are not robust with respect to the transformations.
Tool scalability is the change in tool analysis time as the size of the problem changes (the chosen scalability benchmarks have a size parameter). In the scalability experiments, varying the size naturally changes the problem instance, but the problem and code structure remain similar. We remark that although the AIM problems vary the number of variables, we do not use them for scalability as they do not have a structural scaling parameter defined by the problem, i.e., the number of variables may not be a fair parameter.
TABLE II. Intensional constraints used per problem class.

Problem      Constraints
AllInterval  alldifferent, …
CostasArray  alldifferent, …, …
Haystacks    …, …, …, …
V-B. Evaluation
Figure 3 investigates transformation robustness for the extensional problems. We use average timings for analyzing the C code, normalized to the AbsCon solving time of the XCSP3 instance. Different transformations result in a different number of paths explored for the KLEE-based tools. We see that the KLEE-based tools are sensitive to the transformations and exhibit considerable variation, up to 2 orders of magnitude slower for certain transformations. They also exhibit a similar pattern in the transformation running times between the AIM100 and AIM200 instances. Due to lack of space, results for Dubois are not shown, but they follow a similar pattern to the AIM instances. The timings show that there can be a complex relationship between the solver and the performance of a transformation. Sometimes KLEE/Z3 and TracerX/Z3 are slower than KLEE/STP, but in other cases they are faster (AIM200 Version 4), showing the importance of the solver effect. (We remark that KLEE and STP are co-designed while Z3 is independent.) Greater differences can be seen in the intensional benchmarks. We can see that, generally, the greater the grouping (Table I, Grouped column), the better the performance, in order from no to yes to all. There is a large difference between logical and bitwise operators, e.g., Version 1 versus Version 4.
We see that LLBMC is significantly more robust under all transformations. This may be because LLBMC creates a single (SMT) constraint problem. The LLBMC runtime relative to AbsCon in Figure 3 is between 0.15 and 0.22 for AIM100 and between 0.17 and 0.26 for AIM200, i.e., LLBMC is faster.
The results for the transformation robustness of the intensional problems are shown in Figure 5. While LLBMC is still often better than KLEE, there is a larger variance. For example, on AllInterval Version 9, KLEE(STP) is better than LLBMC, but TracerX, which is KLEE-based, is similar to LLBMC. In the extensional code, the effort in TracerX to compute interpolants turned out to be wasted, but the value of the reduction in paths can be seen here, where TracerX can outperform KLEE(STP/Z3). Here, STP is generally faster than Z3, in contrast to the extensional problems, which only use equals or not-equals constraints. Haystacks shows an interesting solver issue: the only transformation difference between Versions 3 and 8 is if versus assume. One might think that assume (more direct solver usage) should handle the constraints better, but it turns out to be worse. We conjecture this may be due to differences in the handling of the short-circuiting from the logical operators. TracerX behaves better for these instances.

Figures 3 and 5 use running time normalized to the AbsCon running time. This shows that the KLEE-based symbolic execution tools are significantly slower than running the original CSP problem with the AbsCon solver. Note that this is not intended to be a fair solver comparison; all the systems use solvers, but the solvers are used in different ways and the solvers themselves employ different techniques. LLBMC is often faster than AbsCon for the AIM problems. One possible reason is that the AIM instances are SAT problems and are known to be easy for modern SAT solvers; LLBMC uses STP, which in turn uses the MiniSat SAT solver [32], which can be expected to give good results. The AbsCon solver uses generalized arc consistency for the extensional constraints and does not employ learning, whereas the performance of SAT solvers is highly dependent on the use of clause learning. The intensional benchmarks show that STP solver performance differs from the extensional benchmarks: now STP is generally faster than Z3, though AllInterval Version 2 shows this is not always the case.
We investigate tool scalability with extensional (Dubois) and intensional problems where there is a size parameter. The results are shown in Figures 4 and 6. In these graphs, a greater problem number (x-axis) indicates a larger size parameter, and the y-axis gives the number of timeouts (more timeouts is worse). As can be observed, LLBMC is more scalable than the others, with KLEE/STP slightly more scalable than the two KLEE-based tools using the Z3 solver. We can see for the extensional problems that although TracerX was generally slower on the AIM benchmarks, it can solve more problems than KLEE(STP/Z3).
In summary, we see that LLBMC generally executes the benchmarks faster than the other tools. It is known in the literature that invocation of the constraint solver is expensive (e.g., [48]). During path enumeration, KLEE-based tools invoke the constraint solver multiple times to determine the feasibility of control-flow branches. Compared to the KLEE-based tools, LLBMC invokes the solver only once with a single global (much larger) problem. The versions run on the KLEE-based systems which correspond most closely to LLBMC solver usage are Extensional Versions 6 and 12 and Intensional Versions 5 and 10. As can be seen from Figures 3 and 5, these versions show much lower execution times for the KLEE-based tools compared with the other transformations.
Our experiments may give a different impression from conventional use of KLEE on real-world C programs; e.g., KLEE may be expected to work on small programs but not scale to large ones, yet the Dubois C programs are small (hundreds of lines) and still time out. This may be because solving and path explosion are more pronounced in our “benchmark code” than in “typical code”. However, the choice of artificial code is deliberate: it allows us to study the trade-offs and interactions between the solver and path exploration. Real-world programs may pose less of a challenge in this aspect, but can still pose challenges in other aspects, such as system calls. The results show that the execution times of the KLEE-based tools on the benchmarks depend heavily on the solver used (STP or Z3) and are significantly affected by the C constructs (from the transformations). Although both LLBMC and KLEE/STP use the same solver, the timings differ substantially, which suggests that KLEE-based symbolic execution has considerable room for optimization. Our benchmarks are designed so that the reasoning on the program execution paths is nontrivial. TracerX uses interpolants which can prune (irrelevant) execution paths; in our benchmarks, interpolants can have some benefit, but their cost is also a drawback. This suggests that the use of interpolants may not help much in the harder path exploration cases arising in our benchmarks.
V-C. Discussion
We summarize how the experimental results can help address the performance of symbolic execution with the following research questions:

Q1: What kinds of programs are easier/harder to analyze with symbolic execution? The experiments show that KLEE/STP, KLEE/Z3, and TracerX/Z3 are not robust under different transformations; thus, programs coded in a particular way will be harder to analyze.

Q2: Do different constraints change the difficulty of symbolic execution? The benchmarks use various constraints. The extensional benchmarks use simple equality and disequality constraints, while the intensional benchmarks have more complex constraints, e.g., distance constraints such as those in Figure 1 (b). There is a larger variation in the overall robustness with the intensional benchmarks.

Q3: Are there differences in overall performance between different solvers? The experiments show that solver choice is significant, e.g., between STP and Z3; however, the relationship is complex and may not be simple to predict. From anecdotal experience, KLEE/STP is expected to outperform KLEE/Z3; however, for certain kinds of code, KLEE/Z3 can be much faster.
Overall, the results suggest that considerable improvements to symbolic execution tools may be possible, such as better solver integration and improved solver reasoning. While TracerX was in many cases slower than KLEE, possibly due to the interpolation overhead, there were cases where it was faster, which suggests that smarter removal of execution paths is beneficial, though much research remains, as shown by the overall results. As our transformations can be considered a kind of preprocessing step, smart preprocessing may be another direction for improving symbolic execution. This is analogous to constraint solvers which perform extensive preprocessing; for example, the preprocessing phase in the CPLEX MIP solver is sufficient to solve some problems without invoking the actual solver [2]. As far as we are aware, such advanced constraint-processing ideas have not been employed in symbolic execution.
We highlight that the objective of this paper is to build benchmarks suitable for improving the core reasoning of symbolic execution. So while our benchmarks generally have bounded model checking (LLBMC) being faster, this is partly an artifact of our simple, systematic, loop-free code. The comparison with LLBMC is also apt because many of its core reasoning components are similar to those of KLEE and TracerX, in particular the same underlying solver. We found instances where KLEE/STP outperforms LLBMC. For general and more complex code, symbolic execution has advantages over bounded model checking; the main limitation of bounded model checking is the bound, which also affects scalability if the bound is large. Symbolic execution also has advantages when it comes to system calls and mixing with concrete execution. However, for our benchmarks, we deliberately focus on the solver, path exploration, and search, so these other factors are orthogonal to our work.
While we have deliberately chosen simple transformations, more complex ones are possible. Rather than loop-free code, extensional constraints can be transformed into generic table-driven code with loops. This may not work so well for LLBMC if the specified bound is exceeded, but it would make little difference to KLEE. We have chosen a simple code structure to focus on the solver, path exploration, and constraint handling mechanisms intrinsic to symbolic execution.
VI Conclusion
We proposed finite-domain constraint satisfaction problems (CSPs) for evaluating the effectiveness of the reasoning techniques in symbolic execution tools. We show how to encode CSP instances by transforming the problem into C programs using different C constructs. Although multiple C programs are created, they are essentially equivalent from the perspective of the satisfiability and solutions of the CSP. Preliminary testing with the KLEE, TracerX, and LLBMC tools with the STP and Z3 solvers shows that the transformation used significantly affects the performance of the tools even though the underlying reasoning problem is the same. The sensitivity of symbolic execution tools to the “writing style” of a program has not been studied, and we feel it is an important direction for improved and more robust symbolic execution technology.
The use of the logical operators common in C has a significant effect compared to the use of bitwise operators. This suggests that a viable strategy may be to flip the code between bitwise and logical operators, preserving semantics, in cases where there are no side effects. This could give a performance boost; we are not aware that this strategy has been tried before.
Ideally, as the programs are “equivalent to the CSP instance”, we might expect the performance of the tools to be similar, since the underlying reasoning techniques are also similar. However, our benchmarks show complex and substantial differences. There is also a large performance gap between LLBMC and KLEE/STP even when both use the same underlying STP solver, pointing to another direction for system improvement. Although our benchmarks are synthetic, precisely because they focus on the basic core reasoning engine, we believe that benchmarks of this form can help drive basic improvements to symbolic execution tools. What we would like is for the analysis tools to be more robust and more scalable. We believe that if symbolic execution-based tools can perform better on the kinds of benchmarks we propose, this will carry forward to improved overall analysis performance, in particular on challenging code such as cryptographic code, obfuscated code, and malware, where program analysis tools may not work well.
The results using TracerX suggest that optimizations which prune paths can increase scalability, but they also suggest that much more improvement is needed. Future work may investigate redundant constraints as yet another dimension. Finally, we hope this work can spur more synergy between techniques in constraint programming, testing, and program analysis.
Acknowledgement
We acknowledge the support of R252000592112 and R252000A39112. We would like to thank Andrew Santosa who helped with a preliminary version of this work.
References
 [1] AbsCon. Note: http://www.xcsp.org/tools Cited by: §V-A.

 [2] (2013) Mixed integer programming: analyzing 12 years of progress. In Facets of Combinatorial Optimization, Cited by: §V-C.
 [3] American Fuzzy Lop. Note: http://lcamtuf.coredump.cx/afl Cited by: §I.
 [4] (2014) Enhancing symbolic execution with veritesting. In International Conference on Software Engineering, Cited by: §I.
 [5] (2018) A survey of symbolic execution techniques. ACM Comput. Surv. 51 (3). Cited by: §II.
 [6] (2016) XCSP3: an integrated format for benchmarking combinatorial constrained problems. CoRR abs/1611.03398. Cited by: §III.
 [7] (2008) KLEE: unassisted and automatic generation of high-coverage tests for complex systems programs. In USENIX Conference on Operating Systems Design and Implementation, Cited by: §I, §II, §V-A.
 [8] (2013-02) Symbolic execution for software testing: three decades later. Commun. ACM 56 (2). External Links: Link, Document Cited by: §I, §II.
 [9] (2010) Reverse engineering of binary device drivers with revnic. In European Conference on Computer Systems, Cited by: §I.
 [10] (2011) S2E: a platform for in-vivo multi-path analysis of software systems. In Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems, Cited by: §II.
 [11] (2011) Efficient symbolic execution for analysing cryptographic protocol implementations. In International Conference on Engineering Secure Software and Systems, Cited by: §II.
 [12] (1957) Three uses of the Herbrand-Gentzen theorem in relating model theory and proof theory. The Journal of Symbolic Logic. Cited by: §V-A.
 [13] (2015) Evaluating symbolic execution-based test tools. In IEEE International Conference on Software Testing, Verification and Validation, Cited by: §II.
 [14] CSPC. Note: https://github.com/vsahil/XCSP3_to_C Cited by: §V-A.
 [15] (2008) Z3: an efficient SMT solver. In International Conference on Tools and Algorithms for the Construction and Analysis of Systems, Cited by: §I, §II.
 [16] (2003) Constraint processing. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA. External Links: ISBN 9780080502953 Cited by: §III.

 [17] (2018) Shape neutral analysis of graph-based data-structures. Theory and Practice of Logic Programming 18 (3-4). Cited by: §I.
 [18] (2013) The bounded model checker LLBMC. In IEEE/ACM International Conference on Automated Software Engineering, Cited by: §I, §V-A.

 [19] (2007) A decision procedure for bit-vectors and arrays. In International Conference on Computer Aided Verification, Cited by: §I, §II.
 [20] (2005) DART: directed automated random testing. In ACM SIGPLAN Conference on Programming Language Design and Implementation, Cited by: §I, §II.
 [21] (2012) SAGE: white-box fuzzing for security testing. Queue 10 (1). Cited by: §II.
 [22] (2012) TRACER: a symbolic execution tool for verification. In International Conference on Computer Aided Verification, Cited by: §V-A.
 [23] (1999) An interpolation method for CLP traversal. In International Conference on Principles and Practice of Constraint Programming, Cited by: §V-A.
 [24] (2009) An interpolation method for CLP traversal. In International Conference on Principles and Practice of Constraint Programming, Cited by: §II.
 [25] (2015) Challenges and opportunities with concolic testing. In National Aerospace and Electronics Conference, Cited by: §II.
 [26] (1976) Symbolic execution and program testing. Commun. ACM. Cited by: §I, §II.
 [27] KLEE papers. Note: https://klee.github.io/publications Cited by: §II, §V-A.
 [28] KLEE system. Note: https://klee.github.io Cited by: §II.
 [29] (2014) A comparative study of incremental constraint solving approaches in symbolic execution. In Haifa Verification Conference, Cited by: §I.
 [30] LLVM compiler infrastructure project. Note: https://llvm.org/ Cited by: §I.
 [31] (2001) AbsCon: a prototype to solve csps with abstraction. In International Conference on Principles and Practice of Constraint Programming, Cited by: §VA.
 [32] MiniSat. Note: http://minisat.se Cited by: §V-B.
 [33] (2011) Satisfiability modulo theories: introduction and applications. Commun. ACM. Cited by: §II.
 [34] (2013) SemFix: program repair via semantic analysis. In International Conference on Software Engineering, Cited by: §I.
 [35] (2013) Multi-solver support in symbolic execution. In International Conference on Computer-Aided Verification, Cited by: §I.
 [36] (2011) A case study of concolic testing tools and their limitations. In 2011 International Symposium on Empirical Software Engineering and Measurement, Cited by: §II.

 [37] (2006) Handbook of constraint programming (foundations of artificial intelligence). Elsevier Science Inc. Cited by: §III.
 [38] (2005) CUTE: a concolic unit testing engine for C. In European Software Engineering Conference, Cited by: §I, §II.
 [39] (2016) SoK: (state of) the art of war: offensive techniques in binary analysis. In IEEE Symposium on Security and Privacy, Cited by: §I.
 [40] (2016) Driller: augmenting fuzzing through selective symbolic execution. In Annual Network and Distributed System Security Symposium, Cited by: §I.
 [41] (2018) Chopped symbolic execution. In International Conference on Software Engineering, Cited by: §I, §II.
 [42] TracerX: next-generation symbolic execution. External Links: Link Cited by: §I, §II, §V-A.
 [43] XCSP3 competition. Note: http://xcsp.org/competition Cited by: §III.
 [44] XCSP3. Note: http://www.xcsp.org/ Cited by: §III, §V-A.
 [45] (2017-05) Cryptographic function detection in obfuscated binaries via bit-precise symbolic loop mapping. In IEEE Symposium on Security and Privacy, Cited by: §II.
 [46] (2018) Benchmarking the capability of symbolic execution tools with logic bombs. IEEE Transactions on Dependable and Secure Computing. Cited by: §II.
 [47] (2015) Symbolic execution of obfuscated code. In ACM Conference on Computer and Communications Security, Cited by: §I.
 [48] (2012) Speculative symbolic execution. In IEEE International Symposium on Software Reliability Engineering, Cited by: §V-B.