Evaluating Manual Intervention to Address the Challenges of Bug Finding with KLEE

by John Galea, et al.
University of Oxford

Symbolic execution has shown its ability to find security-relevant flaws in software, but faces significant scalability challenges. There is a commonly held belief that manual intervention by an expert can help alleviate these limiting factors. However, there has been little formal investigation of this idea. In this paper, we present our experiences applying the KLEE symbolic execution engine to a new bug corpus, and using manual intervention to alleviate the issues encountered. Our contributions are (1) Hemiptera, a novel corpus of over 130 bugs in real world software, (2) a comprehensive evaluation of the KLEE symbolic execution engine on Hemiptera with a categorisation of frequently occurring software patterns that are problematic for symbolic execution, and (3) an evaluation of manual mitigations aimed at addressing the underlying issues of symbolic execution. Our experience shows that manual intervention can increase both code coverage and bug detection in many situations. It is not a silver bullet, however, and we discuss its limitations and the challenges encountered.




I Introduction

Symbolic execution is a flexible technique that can be embedded in a variety of approaches to program analysis. Applications include test case generation [14], static bug finding [5], patch validation [22], deobfuscation [24] and exploit generation [7]. However, when applying symbolic execution to a new application it is common to encounter significant problems. In this paper we document our experiences using KLEE to perform bug detection across a diverse set of programs, and using manual intervention to alleviate the issues encountered.

The term symbolic execution refers to analysis systems with a broad range of capabilities and approaches. In this paper, we focus on a path-wise forwards analysis that uses symbolic variables for the inputs from the environment. We exclude from consideration systems that utilise concrete inputs in order to produce a trace that is then executed symbolically (e.g., concolic execution [14]). We also exclude systems that utilise state merging in order to produce states that represent multiple paths [2, 18]. The rationale for this focus is that the majority of maintained open source implementations (KLEE, https://klee.github.io; angr, https://angr.io; Manticore, https://github.com/trailofbits/manticore) do not implement either of these features. While state merging and concolic execution may resolve some issues we discuss, at the present time, these features are unavailable to a user who wants to utilise symbolic execution via the existing tools.

We furthermore focus on the KLEE symbolic execution engine [5] in our experimental work. KLEE is a mature, maintained symbolic execution engine and has served as the basis for a large number of academic publications (https://klee.github.io/publications/) from a diverse set of research groups.

Generally, existing approaches that attempt to address the scalability issues faced by symbolic execution are aligned in two complementary directions. Firstly, there are automated solutions, for example state merging [18, 2] and compositional testing [20]. Secondly, there are manual solutions, such as the introduction of logical assumptions [29, 30], where a domain expert assists symbolic execution in order to improve effectiveness.

While path explosion can be demonstrated with a very simple program, namely, a loop that contains a branch, detailed case-studies of problematic patterns that occur in real-world software are lacking in the literature. Similarly, while many members of the research community expect that manual intervention can deal with problematic patterns in the software under test, there are few experience reports on how feasible this approach is across diverse applications and how successful, or not, it can be. We aim to address both of these issues.
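The simple program mentioned above, a loop that contains a branch, can be sketched as follows. This is an illustrative example rather than code from any Hemiptera target; the function names and the threshold of 127 are assumptions made for the sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* A loop that contains a branch on input: the canonical path-explosion pattern.
   If buf is symbolic, a symbolic executor forks at the `if` on every iteration. */
unsigned count_high_bytes(const uint8_t *buf, size_t len) {
    unsigned hits = 0;
    for (size_t i = 0; i < len; i++) {
        if (buf[i] > 127) {
            hits++;
        }
    }
    return hits;
}

/* Number of distinct paths through the loop above when all len input bytes are
   unconstrained: every iteration doubles the count, giving 2^len.
   Saturates at UINT64_MAX once the count no longer fits in 64 bits. */
uint64_t feasible_paths(unsigned len) {
    return len < 64 ? (UINT64_C(1) << len) : UINT64_MAX;
}
```

Under these assumptions an 8-byte symbolic input already yields 256 paths, and a 32-byte input yields more than four billion.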

To generate a baseline dataset we use KLEE [5] to analyse a set of real-world applications with the goal of discovering security-relevant bugs. We document the limiting factors we observe and categorise the root causes of these issues. To mitigate the challenges encountered, we employ existing techniques that require manual intervention, which often involve trading soundness or completeness for scalability. Exemplars are the manual insertion of logical constraints, early state termination and the use of driver programs. Finally, we evaluate the changes in code coverage and bug detection that result from these interventions and discuss their successes and limitations.

We view our research as orthogonal to work on automated solutions to the categories of problems that we describe. An automated solution that effectively manages one limitation still often leaves the symbolic execution engine struggling due to other issues left unaddressed. This hampers adoption of symbolic methods. Thus, it is necessary to also understand the outstanding issues that are significant when analysing real programs, and how these issues can be alleviated via manual intervention.

To support our research, we have compiled Hemiptera, a substantial bug suite containing 133 genuine bugs found in open-source applications. The bugs are individually labelled with commit IDs identifying a software version they are present in, and test cases that trigger the bugs are provided. A substantial amount of previous work uses synthetic datasets [28], a very limited number of real-world programs [17, 31] or real-world programs where a lower bound for the number of bugs that should be detectable is not known [2, 5]. Hemiptera is shared openly with the community to support further research.

I-A Contributions

This paper provides three main contributions.

  1. Hemiptera, a bug suite containing over 130 genuine bugs in open-source programs, with test cases, and labelled with their associated commit IDs.

  2. A categorisation of code constructs that pose severe challenges to symbolic execution via a large-scale investigation of more than 24 program versions, across eight programs, using KLEE.

  3. An evaluation of the feasibility and effectiveness of manual intervention in addressing state space explosion within symbolic execution.

We provide all raw data, scripts and source within the artefacts (via GitHub; the link has been removed to anonymise the paper for review).

I-B Organisation of Paper

The rest of the paper is organised as follows. Section II describes Hemiptera. Section III presents an evaluation of KLEE on Hemiptera. Section IV explores in depth the challenges encountered by symbolic execution in the evaluation from Section III, and the mitigations we employ to overcome these challenges. Section V evaluates the success of the mitigations described in Section IV. Section VI lists potential threats to validity, Section VII discusses related work, and Section VIII concludes the paper.

II The Hemiptera Bug Suite

Hemiptera (named after the order of bugs Hemiptera, meaning “True Bugs”) is a large corpus of patched bugs that have been present in eight major real-world applications. In total, the corpus contains 133 bugs, many of which are security relevant and could be leveraged to gain malicious code execution. The types of bugs present in Hemiptera include invalid memory accesses, failed assertions, and division by zero exceptions. The applications considered are all open-source and written in C.

While some existing bug suites [19] do consider real-world applications, they are often limited due to small bug counts. Meanwhile, others [3, 16, 15] provide a large number of bugs but are synthetic. For instance, Dolan-Gavitt et al. [12] have proposed LAVA, a tool capable of automatically injecting artificial bugs into software. Although such an approach facilitates the creation of a large bug corpus, it still threatens the validity of the claims that might be made about real-world code when used as the sole evaluation corpus for an analysis. The DARPA CGC corpus (http://archive.darpa.mil/cybergrandchallenge/tech.html) also faces similar issues, since the programs are synthetic.

Application     # Invalid Memory Accesses   # Failed Assertions   # Div by Zeros   # Dang Ptr. Use   Size of Min-Set   Avg. lines of C Code
JasPer          11                          6                     2                0                 2                 25,891
LibTIFF         36                          1                     4                0                 5                 56,573
tcpdump         10                          0                     0                0                 2                 47,782
libjpeg-turbo   1                           0                     1                0                 1                 25,920
zlib            2                           0                     0                1                 2                 7,835
file            5                           1                     1                0                 3                 12,597
w3m             44                          0                     0                2                 3                 53,338
FLAC            5                           0                     0                0                 1                 46,114
Total           114                         8                     8                3                 19                34,506
TABLE I: Summary of Hemiptera’s composition.

Table I summarises the composition of Hemiptera. Several of the bugs contained in Hemiptera are security relevant, and could have been leveraged to gain malicious code execution. Where applicable, Hemiptera provides the CVE ID that is associated with a given bug.

For each bug in the corpus, we recorded the commit ID of the patch as well as the commit ID of the immediate unfixed predecessor (the parent ID). The inclusion of every unfixed version, for each bug, would result in an impractically high number of application versions to analyse. Consequently, for each application, Hemiptera provides the minimal set of commit IDs (referred to as a Min-Set) such that all of the bugs in the corpus that pertain to an application are present. We calculated the average line counts of the versions in each Min-Set. Overall, the corpus considers reasonably large applications, ranging from 7,835 (zlib) to 56,573 (LibTIFF) lines of C code.

Selection Criteria.

We selected applications for inclusion in Hemiptera based on four major criteria:

  1. The applications are similar to those considered suitable for analysis by state of the art symbolic execution tools. Examples are file parsers, utility libraries and system tools.

  2. If an application contained bugs, such bugs would be likely to have security implications.

  3. The application’s project must be open-source.

  4. The project must have a publicly accessible version control system (e.g. Git), such that changes to files are tracked, and we can therefore identify the buggy and patched versions for each bug.

Construction Methodology.
Fig. 1: Methodology of building Hemiptera.

Figure 1 illustrates the methodology of Hemiptera’s construction. Once an application is selected, we run a set of scripts to facilitate the inspection of change logs. The scripts automatically retrieve the commit history of the master branch of the application, and select commits that have messages that contain relevant terms. These terms include “bug”, “cve”, “zero”, “overflow” and “crash”. For all selected commits, the scripts extract their date, hash ID and parent ID. The process continues with manual analysis, whereby the resulting commits are inspected to check whether they are likely to be patches for security relevant bugs. Bugs that seem to be security irrelevant, such as memory leaks and syntax errors, or are very platform dependent, are discarded.
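The keyword-based selection of commits described above can be sketched as a case-insensitive substring match. This is a minimal illustration, not the scripts used to build the corpus: the function names are hypothetical, and the term list contains only the five terms quoted in the text, which may not be the complete list.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Terms quoted in the text; the full list used by the scripts may be longer. */
static const char *const TERMS[] = {"bug", "cve", "zero", "overflow", "crash"};

/* Case-insensitive substring test: returns 1 if needle occurs in haystack. */
static int contains_ci(const char *haystack, const char *needle) {
    size_t n = strlen(needle);
    for (; *haystack != '\0'; haystack++) {
        size_t i = 0;
        while (i < n && haystack[i] != '\0' &&
               tolower((unsigned char)haystack[i]) == tolower((unsigned char)needle[i])) {
            i++;
        }
        if (i == n) return 1;
    }
    return 0;
}

/* Returns 1 if the commit message contains any relevant term, 0 otherwise. */
int commit_is_candidate(const char *message) {
    for (size_t t = 0; t < sizeof TERMS / sizeof TERMS[0]; t++) {
        if (contains_ci(message, TERMS[t])) return 1;
    }
    return 0;
}
```

Substring matching of this kind also selects unrelated commits (a message containing "debug" matches "bug"), which is one reason the selected commits are subsequently inspected by hand.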

The presence of each bug is then confirmed by searching for an input which triggers it. Test cases that successfully trigger the bugs are retrieved by inspecting bug trackers, test folders, and online repositories, such as Exploit-DB (https://www.exploit-db.com/). Failure to find a test case that successfully triggers a bug results in the bug being omitted from the corpus.

The process then takes an incremental approach in order to derive the Min-Set of the application. It begins by including the commit ID of the earliest buggy version in the set. We then run the version against the collected test cases iteratively, in an order that is sorted by the date of fix of the bugs. In the event that a test case fails to reproduce its bug, the commit ID of the version that successfully triggers the bug is included in the Min-Set. In turn, the version associated with the newly inserted commit ID is considered for checking the remaining test cases. This process continues until all bugs are triggered successfully.
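The incremental derivation of a Min-Set can be sketched as below. The sketch rests on several assumptions that are not in the text: commits are modelled as integers in history order, each bug is taken to be present in a contiguous range of versions, the "earliest buggy version" is interpreted as the parent of the earliest fix, and the `triggers` function stands in for actually running a test case against a built version. All names are hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical model: a bug is present in every version v with
   introduced <= v < fixed, and introduced < fixed always holds. */
typedef struct {
    int introduced;  /* first version containing the bug */
    int fixed;       /* version of the patch; fixed - 1 is the parent commit */
} Bug;

static int by_fix_date(const void *a, const void *b) {
    const Bug *x = a, *y = b;
    return (x->fixed > y->fixed) - (x->fixed < y->fixed);
}

/* Stand-in for running the bug's test case against the given version. */
static int triggers(const Bug *bug, int version) {
    return bug->introduced <= version && version < bug->fixed;
}

/* Sorts bugs by fix date, writes the selected versions to min_set
   (capacity >= n) and returns how many versions were selected. */
size_t derive_min_set(Bug *bugs, size_t n, int *min_set) {
    size_t count = 0;
    if (n == 0) return 0;
    qsort(bugs, n, sizeof *bugs, by_fix_date);
    int current = bugs[0].fixed - 1;      /* earliest buggy version */
    min_set[count++] = current;
    for (size_t i = 1; i < n; i++) {
        if (!triggers(&bugs[i], current)) {
            current = bugs[i].fixed - 1;  /* a version known to trigger bug i */
            min_set[count++] = current;
        }
    }
    return count;
}
```

For four bugs with (introduced, fixed) pairs (0, 5), (2, 6), (7, 9) and (8, 12), the sketch selects versions 4 and 8: version 4 triggers the first two bugs, and version 8 triggers the remaining two.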

Although the process of building Hemiptera has been partially automated through scripting, it still incurred a significant amount of manual effort. We believe that dealing with this challenge is worthwhile to reduce threats to the validity of our results, as well as contribute a useful bug corpus to the community.

III Evaluation of the State of the Art

To generate a dataset to categorise challenges in real-world symbolic execution, we ran KLEE on 24 different program versions, across eight programs, as provided by Hemiptera. The configuration for KLEE as recommended by its authors (https://klee.github.io/docs/coreutils-experiments/) was used, combined with our own KLEE modifications as detailed in the next section.

III-A Experiment

We used KLEE version 1.3.0 with two search strategies, covnew and random-path (https://klee.github.io/docs/options/). The covnew state selection algorithm selects the state it believes to be closest to uncovered code, while the random-path algorithm selects a state by randomly traversing the tree of all paths taken by all live states until a state representing the head of a path is reached.
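The random-path selection described above can be sketched as a random walk from the root of a binary tree of forks down to a leaf. This is a simplified illustration, not KLEE's implementation: the `PathNode` type, the function names and the xorshift generator are assumptions made for the sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tree of paths: each fork creates up to two children, and each
   leaf corresponds to one live state. */
typedef struct PathNode {
    struct PathNode *left;
    struct PathNode *right;
    int state_id;  /* meaningful only at leaves */
} PathNode;

/* xorshift32 step; the seed must be non-zero. */
static uint32_t next_random(uint32_t *seed) {
    uint32_t x = *seed;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *seed = x;
    return x;
}

/* Walks from the root to a leaf, choosing pseudo-randomly wherever both
   children exist, and returns the leaf that was reached. */
const PathNode *random_path_select(const PathNode *root, uint32_t *seed) {
    const PathNode *n = root;
    while (n->left != NULL || n->right != NULL) {
        if (n->left == NULL) {
            n = n->right;
        } else if (n->right == NULL) {
            n = n->left;
        } else {
            n = (next_random(seed) & 1u) ? n->left : n->right;
        }
    }
    return n;
}
```

Because the choice is made per fork rather than per state, a state that sits alone in a shallow subtree is as likely to be chosen as an entire deep subtree holding many states, which counteracts the dominance of locations that fork heavily.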

We also tested a modification for KLEE, which was introduced to resolve an issue we encountered with state selection: When using the batching searcher, if a conditional branch based on symbolic data is encountered, where both the taken and fall-through cases are feasible, KLEE will always schedule the state corresponding to the branch taken case. On several targets, this often results in failure to analyse the remaining code in functions that have a branch towards a return early in their body. Consequently, we implemented a mechanism that randomly selects between the two states of a branch, post forking. We refer to this feature as rspf (random state, post-fork) in the remainder of this section.
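The difference between the default behaviour described above and rspf can be sketched as two scheduling functions. This is only an illustration of the idea under stated assumptions: the `ExecState` type, the function names and the pseudo-random generator are hypothetical and do not correspond to KLEE's internal interfaces.

```c
#include <stdint.h>

typedef struct {
    int id;
} ExecState;

/* xorshift32 step; the seed must be non-zero. */
static uint32_t next_random(uint32_t *seed) {
    uint32_t x = *seed;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *seed = x;
    return x;
}

/* Behaviour described in the text: the branch-taken state always runs next. */
const ExecState *schedule_taken_first(const ExecState *taken,
                                      const ExecState *fall_through) {
    (void)fall_through;
    return taken;
}

/* rspf-style choice: either successor of the fork may run next, chosen
   pseudo-randomly. */
const ExecState *schedule_rspf(const ExecState *taken,
                               const ExecState *fall_through,
                               uint32_t *seed) {
    return (next_random(seed) & 1u) ? taken : fall_through;
}
```

With the first function, a function whose early branch jumps to a return is never explored past that branch while the batch lasts; with the second, the fall-through state has a chance of being scheduled.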

As well as using covnew and random-path together, we also ran KLEE against each program with only random-path, and with a combination of covnew, random-path, and rspf. All target applications were compiled to LLVM bitcode using wllvm (https://github.com/travitch/whole-program-llvm). Optional third-party libraries were not linked, unless they were required for a bug found in Hemiptera. As libtiff comes with a significant number of driver applications, we focused specifically on tiff2pdf, which triggers the highest proportion of its bugs. Each run was one hour long, and each experiment was executed three times to increase reliability. We ran our experiments on an Intel Xeon X5667 at 3.07 GHz with 24 GB of RAM available to each process.

III-B Results

Table II quantifies the effectiveness of KLEE. We present averages over all runs within the table. The performance appears to be generally poor, finding very few of the known bugs in Hemiptera. KLEE failed to find bugs in most program versions. In general, code coverage appears to be low in absolute terms, but without any external baseline to compare against, it is difficult to determine what proportion of “interesting” code this corresponds to. However, considering most bugs are not found, it is clear that a substantial amount of “interesting” code is not reached.

When using covnew combined with random-path, KLEE found bugs in six different program versions, while rspf improved bug detection for one program version. By contrast, random-path finds significantly fewer bugs (a total of six bugs compared to a total of 109 bugs for covnew + random-path). However, on some programs (e.g. imginfo, zlib and file), random-path achieves higher coverage, despite the fact that covnew has been designed specifically to maximise coverage. Furthermore, if we exclude tcpdump, then random-path has nearly equivalent code coverage and bug finding performance to the coverage-driven runs.

                             covnew & random-path     random-path              covnew & random-path & rspf   Bugs in
Program         Commit       Bugs  H. Bugs  ICov      Bugs  H. Bugs  ICov      Bugs  H. Bugs  ICov           Hemiptera
tcpdump         17a3c288*    32    -        4.94%     0     -        3.14%     28    -        4.23%          0
tcpdump         a9e4211      40    1        4.91%     0     0        3.39%     28    2        4.20%          9
tcpdump         8322d3a      31    0        4.61%     0     0        3.14%     23    0        4.40%          1

w3m             1ac245b*     0     -        9.01%     0     -        8.77%     0     -        9.15%          0
w3m             02ba3d6      4     0        8.96%     3     0        8.75%     4     1        9.07%          9
w3m             06caca1      0     0        8.90%     0     0        8.74%     0     0        9.01%          15
w3m             a56a8ef      0     0        8.96%     0     0        8.75%     0     0        9.06%          11

libjpeg-turbo   a0b7de9*     0     -        23.10%    0     -        22.72%    0     -        20.61%         0
libjpeg-turbo   3091354      1     0        15.01%    1     0        14.98%    1     0        15.38%         2

imginfo         4212e7e*     0     -        10.36%    0     -        14.68%    0     -        9.93%          0
imginfo         b702259      1     1        10.39%    2     1        12.41%    1     1        11.08%         18

jasper          4212e7e*     0     -        12.96%    0     -        13.26%    0     -        13.93%         0
jasper          ed355a6      0     0        13.47%    0     0        13.75%    0     0        12.04%         2

file            df74b09*     0     -        12.61%    0     -        12.75%    0     -        8.82%          0
file            b6e8437      0     0        13.83%    0     0        13.90%    0     0        13.94%         6
file            4a51454      0     0        17.11%    0     0        17.73%    0     0        17.70%         2
file            7445748      0     0        15.63%    0     0        17.80%    0     0        17.04%         1

zlib            cacf7f1d*    0     -        26.70%    0     -        27.38%    0     -        25.08%         0
zlib            7c2a874      0     0        26.56%    0     0        29.08%    0     0        24.45%         2
zlib            14763ac      0     0        31.44%    0     0        31.65%    0     0        32.12%         1

tiff2pdf        d57ccfc9*    0     -        8.48%     0     -        8.37%     0     -        8.27%          0
tiff2pdf        f64949a      0     0        8.42%     0     0        8.47%     0     0        8.47%          3

flac            d2cb0d1*     0     -        12.48%    0     -        12.16%    0     -        13.77%         0
flac            d8d1717      0     0        12.29%    0     0        12.11%    1     1        14.23%         5

Commit: An asterisk on the commit number represents that the head version of a project was considered.
Bugs in Hemiptera: It is unknown how many bugs exist in the latest version of a program.
ICov refers to the instruction coverage, which is calculated over all library code, even if it was unreachable.
TABLE II: Bug count and coverage results attained by KLEE on real-world programs. H. Bugs represents the detected bugs which are also presented in Hemiptera.

Several bugs found in tcpdump were unknown and present on the most recent version at the time of conducting our experiments. Therefore they were not included in Hemiptera, as the methodology of constructing our corpus requires that included bugs are patched.

One possible cause why covnew performs worse than random-path, in certain cases, is covnew’s algorithm for calculating proximity to uncovered code. This algorithm is static, path-insensitive and over-approximates the points-to sets for function pointers. We discovered that covnew frequently schedules states forked from the same location repeatedly, as it erroneously believes that a state at that location is close to uncovered code, and can reach it. KLEE then runs the scheduled state but, importantly, never succeeds in actually reaching the code it thought it was close to. The algorithm does not react to this failure, e.g., by penalising the starting location, and therefore covnew picks another state from exactly the same initial location, only to fail again with high likelihood.


Fig. 2: The percentage of state forks produced by the program location responsible for most state forks, per application. A single program location can often be responsible for the consumption of an inordinate amount of the analysis engine’s resources. (Axes: Forking Percentage against Programs - Head Version.)

One notable observation is KLEE’s vulnerability to code constructs that are similar to canonical examples of the path explosion problem. In Figure 2 we can see that a single program location is often responsible for the majority of states created across the applications in Hemiptera. Upon investigation, most of these locations are branches embedded in a straightforward loop over input.

It is clear that no search strategy is particularly effective on all programs. On many applications in Hemiptera, KLEE struggles to find bugs and attain high code coverage. Moreover, whatever problems KLEE is encountering are sufficiently severe to diminish the benefit of the coverage-driven search strategies in all but one application.

IV Categorisation of Challenges and Manual Interventions

In this section we present a categorisation of the main issues KLEE encountered during the analysis of the applications in Hemiptera, using the dataset generated via the experimental runs outlined in Section III. The categories cover the most significant issues encountered and we hypothesise that they are likely to generalise to targets not only in Hemiptera, but within the realm of applications one might reasonably expect symbolic execution to be able to analyse.

We also offer mitigations for these issues, based on manual intervention, where applicable. Manual intervention is often proposed as a means by which one can alleviate the issues encountered by symbolic execution. However, there are few experience reports documenting its effectiveness on benchmarks like Hemiptera.

To investigate the challenges, we use the following information: (1) runtime analysis data as generated by KLEE, from which we can determine the program locations responsible for generating forks in the symbolic state and solver queries, (2) data provided by KLEE’s existing analysis tools, such as klee-stats and klee-replay, (3) logs of the scheduler’s decision process (we extended KLEE’s logging in a number of different ways to allow us to understand what states were being scheduled and why, as well as what code they were covering), (4) data on the final location for states still live at the end of the experiment (we also added logging of the states which were still alive at the end of each experiment, in order to determine what locations in the target states were blocked at), and (5) warnings from KLEE.

An analysis run is often hindered by a single dominant challenge. Therefore, we followed an iterative approach when introducing our modifications: we ran each target for an hour, analysed the information mentioned above, introduced a small number of changes, and then repeated the process. We stopped iterating when there was no longer a clear, significant challenge that we could address using manual intervention.

IV-A State-Space Explosion of Semantically Similar States

Across all targets, we encountered constructs which resulted in state space explosion. Often, the path constraints of the resulting states did not differ in a manner which would have a significant impact on future branches. We deem such states as semantically similar.

IV-A1 State Forking in Avoidable Code

One class of functions which fit this category, and commonly produce semantically similar states, are those related to the formatting and printing of data for output. For example, uclibc’s printf (klee-uclibc/libc/stdio/printf.c), and its called functions, contain branches which will produce new states if the arguments to printf are symbolic. However, the resulting states are semantically similar as they only pertain to the printing of data. When analysing tcpdump, the functionality related to the formatting and printing of data is responsible for 61% of the states forked, while on libtiff similar functionality was responsible for 10% of solver time.

Such state forking can be resolved by introducing a low fidelity model of the functions in question. A low fidelity model is a significantly less complex replacement function. In the case of functions which only produce semantically similar states, that model can effectively be an empty function body such as that which we used for printf (shown in Listing 1). These changes may of course impact soundness and completeness, and so it is important to consider them on an application by application basis.

int printf(const char * format, ...) { return 0; }
Listing 1: Low fidelity model for printf.

Avoidable code that is responsible for significant state forking also appears within the applications themselves. For example, tcpdump makes use of libpcap to load packets for processing. It can achieve this by making use of one of two different parsers, which effectively produce the same end result, but one of which contains significantly fewer branches on symbolic data than the other. Listing 2 shows the loop which iterates over a list of function pointers, and calls each function with the header from the input file. The function indicates that it can load the data by returning a non-null value. If a null value is returned the next function is consulted. Consequently, we can force the less complex parser to be utilised by inserting a single assumption, as shown in Listing 3 at line 1.

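/* Read the magic number from the start of the capture file. */
amt_read = fread((char *)&magic, 1, sizeof(magic), fp);
/* Try each header parser in turn until one accepts the file. */
for (i = 0; i < N_FILE_TYPES; i++) {
    p = (*check_headers[i])(...);
    if (p != NULL)
        goto found;
}
Listing 2: fread in pcap.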
klee_assume(magic == TCPDUMP_MAGIC);  /* inserted assumption: the input always carries this magic number */
if (magic != TCPDUMP_MAGIC && ...) {
    return (NULL);  /* nope */
}
Listing 3: check_headers in pcap.

IV-A2 State Forking in Input-Discarding Code

Another common source of semantically similar states is code which skips over input bytes while searching for some delimiter or terminating character. Listing 4 gives an example from zlib, which is intended to read and discard a comment in the input. Owing to the conditional statements, the number of states quadruples on each loop iteration. In zlib, this construct is responsible for 42% of forked states, while similar constructs appear in several of the other targets; e.g. jasper (74%), file (63%) and libjpeg (91%).

if ((flags & COMMENT) != 0) {
    while ((c = get_byte(s)) != 0 && c != EOF);
}
Listing 4: A while loop with discarded data in zlib.

In each of these targets we resolved the issue via the insertion of assumptions. The mitigation for the example shown in Listing 4 can be seen in Listing 5. Assumptions can be seen as guidance through the code: they trade theoretical completeness (which is effectively impossible to achieve in practice) for analysis of the more interesting areas of the code, as deemed by a domain expert.

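if ((flags & COMMENT) != 0) {
    c = get_byte(s);
    /* Bitwise | keeps the condition a single expression, so no branch occurs before the assumption. */
    klee_assume(c == 0 | c == EOF);
    while (c != 0 && c != EOF) ;  /* never entered once the assumption holds */
}
Listing 5: Avoiding forking via an assumption in zlib.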
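A self-contained version of the pattern in Listings 4 and 5 is sketched below so that the two variants can be compared outside zlib. The `Stream` type, the function names and the `USE_KLEE` macro are assumptions made for this sketch; when built without KLEE, `klee_assume` is replaced by a stand-in that ends the process, mimicking a path that is discarded.

```c
#include <stddef.h>
#include <stdio.h>   /* EOF */
#include <stdlib.h>  /* exit */

#ifdef USE_KLEE
#include <klee/klee.h>
#else
/* Native stand-in (assumption): a path that violates the assumption simply ends. */
#define klee_assume(cond) do { if (!(cond)) exit(0); } while (0)
#endif

typedef struct {
    const unsigned char *data;
    size_t len;
    size_t pos;
} Stream;

/* Returns the next byte, or EOF once the input is exhausted. */
int get_byte(Stream *s) {
    return s->pos < s->len ? s->data[s->pos++] : EOF;
}

/* Original pattern: every skipped byte is compared against 0 and EOF, so a
   symbolic input forks on each iteration. Returns the number of bytes skipped. */
size_t skip_comment(Stream *s) {
    size_t skipped = 0;
    int c;
    while ((c = get_byte(s)) != 0 && c != EOF) {
        skipped++;
    }
    return skipped;
}

/* Mitigated pattern: assume the comment is empty, so the loop never iterates. */
int skip_comment_assumed(Stream *s) {
    int c = get_byte(s);
    /* Bitwise | avoids the extra branch that short-circuit || would introduce. */
    klee_assume((c == 0) | (c == EOF));
    while (c != 0 && c != EOF) {
        c = get_byte(s);  /* unreachable under the assumption */
    }
    return c;
}
```

The mitigated variant explores only inputs whose comment field is empty or truncated, which is the completeness that is given up in exchange for reaching the code after the loop.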

IV-B Resource Consumption by States on Error Paths

Often, once a path has reached a certain point, it will never return to interesting code that we wish to analyse. Usually, this arises when a state has reached an error path and the remainder of its processing exclusively involves standard error handling. In order to avoid analysing such uninteresting paths, we modify the code of the highest function in the error handling chain to immediately exit the program, which terminates the state.

Alternatively, in some situations, the error handling function may only be invoked after expensive processing and it may be preferable to add assumptions which ensure that no path can enter the code that reaches an error state in the first place. The downside of this approach is that each line of code that may reach an error state must be individually handled. Listing 6 gives an example of a blocked path after the introduction of a klee_assume statement.

if (!tif->tif_dir.td_nstrips) {
    TIFFErrorExt(...);
    goto bad;
}
Listing 6: A call to TIFFErrorExt in LibTIFF.
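Both mitigations for error paths can be sketched together on a small stand-in function. The names `report_error` and `read_directory`, the `USE_KLEE` macro, and in particular the placement and form of the `klee_assume` call are assumptions made for this sketch; the listing above does not show the assumption that was actually inserted into LibTIFF.

```c
#include <stdio.h>
#include <stdlib.h>

#ifdef USE_KLEE
#include <klee/klee.h>
#else
/* Native stand-in (assumption): assumptions are not enforced outside KLEE. */
#define klee_assume(cond) ((void)(cond))
#endif

/* Hypothetical top-level error handler. Exiting here ends the state at once
   instead of letting it continue through standard error handling. */
void report_error(const char *msg) {
    fprintf(stderr, "error: %s\n", msg);
    exit(1);
}

/* Hypothetical stand-in for a directory reader; nstrips plays the role of
   tif->tif_dir.td_nstrips. Returns 0 on success. */
int read_directory(unsigned nstrips) {
    /* Rule out the error path up front so that no state can enter it. */
    klee_assume(nstrips != 0);
    if (!nstrips) {
        report_error("directory has no strips");  /* unreachable under KLEE */
        return -1;
    }
    return 0;
}
```

Modifying the handler covers every error path with a single change but still pays for the work done before the handler is called, whereas an assumption avoids that work but has to be written separately for each guarded location.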

Our mitigation poses risks as bugs may be present in the uninteresting paths we have manually sliced out. However, the alternative is fatal state space explosion. We reduced this risk by verifying that the sliced out paths are simple and unlikely to expose bugs.

IV-C Expensive Initialisation of Static Data

Code related to static instantiation of concrete data can often be expensive due to the sheer number of instructions involved. For example, libjpeg contains a function to instantiate an array with static data. This function contains a number of nested loops which, when unrolled, amount to 19,791,892 LLVM instructions. Essentially, the flattened loop encompasses 1.2x more instructions than KLEE executes in a one hour run for libjpeg. Furthermore, the function lies on a path which all states must pass through in order to continue deeper into the program. During our experimental runs, no paths managed to break through the function.

A less extreme case was also encountered during the analysis of tcpdump when it initialises a table for the calculation of CRC values.

There are two ways to resolve issues such as these. In cases like libjpeg, where the total number of instructions that must be executed is excessively large, one can statically compute the resulting table ahead-of-time and simply embed it in the target software. This was the approach taken for libjpeg, and code coverage increased from 13.18% to 19.18%. In cases like tcpdump, where the number of instructions is sufficiently small that a single state exercising them is not overly time consuming, then the initialisation code can often be lifted to a point in the program before any state forking has taken place.
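The second option, lifting concrete initialisation ahead of any forking, can be sketched with a table-driven checksum. The example uses the standard reflected CRC-32 polynomial purely for illustration; it is an assumption that this resembles the CRC code in tcpdump, and the function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

static uint32_t crc_table[256];

/* Fully concrete initialisation: 256 * 8 inner iterations, no input involved.
   If this runs after forking, every state that reaches it repeats the work. */
void crc_table_init(void) {
    for (uint32_t i = 0; i < 256; i++) {
        uint32_t c = i;
        for (int k = 0; k < 8; k++) {
            c = (c & 1u) ? (UINT32_C(0xEDB88320) ^ (c >> 1)) : (c >> 1);
        }
        crc_table[i] = c;
    }
}

/* Table-driven CRC-32; requires crc_table_init() to have been called. */
uint32_t crc32_compute(const uint8_t *buf, size_t len) {
    uint32_t c = UINT32_C(0xFFFFFFFF);
    for (size_t i = 0; i < len; i++) {
        c = crc_table[(c ^ buf[i]) & 0xFFu] ^ (c >> 8);
    }
    return c ^ UINT32_C(0xFFFFFFFF);
}
```

In a harness, `crc_table_init()` would be called at the top of `main`, before any input is marked symbolic, so that the table is built once in the single initial state and is then shared by every state forked later. For the first option, the contents of the table would instead be written out as a constant array initialiser and the initialisation function removed.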

IV-D Overhead of Initialisation Code

Some applications have distinct subcomponents which can be more efficiently tested in isolation. For instance, w3m contains a library called libwc, which is used for character conversion. The library is accessed through a single API, wc_Str_conv_with_detect, and it is invoked to convert the stream of input characters prior to HTML parsing. The input provided to w3m is directly passed to this API, with no constraints imposed. Therefore, testing the safety of the character conversion code entails the unnecessary overhead of first analysing the extensive initialisation routines which w3m goes through prior to reaching the interesting API.

To avoid this, a driver program can be constructed which calls the API function of interest. In the case of w3m, this is also a convenient way to solve another problem. Depending on the locale information for the machine on which w3m is running, and the values of the initial bytes of input, wc_Str_conv_with_detect will invoke the appropriate conversion function. Crucially, since the locale is fixed for a given analysis machine and not treated as symbolic by KLEE, only a limited subset of the conversion functions is available to be invoked, and therefore tested. In the driver program, shown in Listing 7, we can simply mark the locale (given by the hint variable) as symbolic and KLEE will discover and test all the available conversion functions.

from = WC_CES_UTF_8;
/* Treat the locale hint as symbolic so that every conversion function can be reached. */
klee_make_symbolic(&hint, sizeof(wc_ces), "hint");
klee_assume(hint == WC_CES_US_ASCII || hint == ...);
to = WC_CES_UTF_8;
s = wc_Str_conv_with_detect(s, &from, hint, to);
Listing 7: libwc driver program.

IV-E Inability to Reason about Meta-Properties of Input

Symbolic execution engines generally track the propagation of symbolic data via direct data-flow; e.g. if a is symbolic and b = a + 10, then the engine will represent b as a symbolic value derived from a and 10 via the addition operator. However, applications often make decisions based on meta-properties of input data (e.g. string length), and these meta-properties usually do not have a direct data-flow dependency or transition on the input data. Therefore, the engine is unable to track and reason about the relationship between the input data and meta-properties.

tcpdump out-of-bounds memory access.

When processing an input file, libpcap reads an integer from the file, bounds its value to the range (0, 262144) and then allocates a buffer based on the value. When KLEE detects a symbolic value being passed to malloc, it concretises that value through a solver query.

Later, in tcpdump, a number of security-relevant bugs arise when certain sequences of code are triggered which increment a pointer beyond the end of a buffer. Whether they are detectable or not depends on the allocation size that KLEE previously produced during concretisation. The impact of concretisation is significant: a small value (e.g. 6) results in 4 bugs being found, a very large value (e.g. 262144) results in 3 bugs, while a value of 128 results in 32 bugs.

A general solution to this problem would require adding functionality to KLEE to support dynamically allocated memory regions with variable bounds. In our case, we handled it by detecting when concretisation occurred, inspecting the code which makes use of the produced buffer, and adding assumptions on the concretisation size to ensure it is within a sensible range. For tcpdump, for example, we added a constraint to bound the allocated buffer's size to the range (128, 256).
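An assumption of this kind could be sketched as follows. This is not the modification applied to tcpdump. The wrapper name, the USE_KLEE build macro and the native stand-in for klee_assume are all assumptions made for illustration, as is treating the bounds as inclusive.

```c
#include <stdint.h>
#include <stdlib.h>

#ifdef USE_KLEE              /* hypothetical build flag */
#include <klee/klee.h>       /* real klee_assume when built for KLEE */
#else
/* Native stand-in (illustration only): a violated assumption aborts
 * rather than pruning the path as KLEE would. */
#define klee_assume(cond) do { if (!(cond)) abort(); } while (0)
#endif

/* Bounds taken from the tcpdump case; inclusivity is an assumption. */
#define ALLOC_MIN 128u
#define ALLOC_MAX 256u

/* Hypothetical helper: is a candidate size inside the assumed range? */
int size_in_range(uint32_t n)
{
    return n >= ALLOC_MIN && n <= ALLOC_MAX;
}

/* Hypothetical wrapper around the allocation site. The assumptions are
 * placed before malloc so that, when KLEE concretises the symbolic
 * size, the solver must pick a value inside the chosen range. */
void *alloc_capture_buffer(uint32_t size)
{
    klee_assume(size >= ALLOC_MIN);
    klee_assume(size <= ALLOC_MAX);
    return malloc(size);
}
```

Two separate klee_assume calls are used rather than one condition joined with &&, since a short-circuit operator can itself introduce a branch on symbolic data.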

zlib out-of-bounds memory access.

In zlib, KLEE similarly fails to detect CVE-2005-2096 because it cannot track and reason about a meta-property of the input. A key variable in triggering the bug is min, the value of which is calculated as shown in Listing 8. Even though an attacker can control the value of min by controlling the values found within the count array, there is no direct data-flow dependency between them, and thus KLEE cannot reason about any conditions which are based on the value of min. Essentially, min is not considered symbolic.

for (min = 1; min <= MAXBITS; min++)
    if (count[min] != 0) break;
Listing 8: Metaproperty calculation in inflate_table.

Such a case is analogous to a bug that is triggered according to the length of a symbolic string, rather than the string's actual content. There is no direct way to deal with this issue without extensive modifications to the analysis engine. Instead, we hope that by alleviating the other problems mentioned, we enable a sufficient increase in the number of processed paths, such that one with the desired property is produced and scheduled.
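The implicit dependency can be seen by wrapping the loop from Listing 8 in a small function. This is a sketch under stated assumptions: the function name is hypothetical, and MAXBITS is set to 15 here purely for illustration.

```c
#include <stddef.h>

#define MAXBITS 15  /* assumed value, for illustration */

/* Hypothetical wrapper around the loop of Listing 8. The result is
 * never assigned from count[]; it is only ever incremented. Its final
 * value is nevertheless decided entirely by which entries of count[]
 * are zero, i.e. by control flow rather than by data flow. */
unsigned min_code_length(const unsigned short count[MAXBITS + 1])
{
    unsigned min;
    for (min = 1; min <= MAXBITS; min++)
        if (count[min] != 0)
            break;
    return min; /* MAXBITS + 1 when every entry from index 1 is zero */
}
```

With count[3] as the first non-zero entry the function returns 3, and with an all-zero array it returns MAXBITS + 1. When count is symbolic, each loop iteration forks on the comparison and min holds a different concrete value in each resulting state, so no single state carries min as a symbolic expression over the input.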

IV-F Inefficient Environment Models

Often, real-world programs interact with code provided by the environment (e.g. the kernel and third-party libraries). For such code, the analysis engine has the choice of evaluating it as if it were code within the target application, or skipping over the call and mimicking its effects via a model. The former approach is more straightforward, as an accurate model can be difficult to construct. However, we encountered several situations in which analysing the environment-provided code resulted in critical state-space explosion.

void *memcpy(void *s1, const void *s2, size_t n) {
    char *r1 = s1;
    const char *r2 = s2;
    while (n) {
        *r1++ = *r2++;
        --n;
    }
    return s1;
}
Listing 9: Code snippet of memcpy.

For instance, the memcpy implementation provided by uclibc is shown in Listing 9. State-space explosion occurs when n is symbolic, as the while loop will potentially create states on each memcpy call. KLEE also performs queries at three different locations on each iteration of the loop: (1) to check the value of n at the head of the loop, (2) to check that the dereference of r1 is in bounds, and (3) to check that the dereference of r2 is in bounds. During an analysis run of tcpdump, the while loop was responsible for 50% of the states produced, and 25% of the solver queries issued.

The memset and strtol family of functions have similar issues. A direct mitigation to this problem is to add handlers to KLEE for such functions, which model their effects without invoking the native code. We were able to alleviate the issues in some situations by adding assumptions which limited the bounds of symbolic variables or concretised them outright.

V Evaluation of the Impact of Manual Intervention

We took an iterative approach when introducing our modifications, as detailed in Section IV, and stopped the mitigation process when there was no longer a clear and significant challenge that we could address using manual intervention. As manually mitigating the issues is quite time-consuming, for each program we focused on two versions: the head, as provided by version control at the start of our evaluation, and one other version found in Hemiptera.

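| Program | Commit | Default (State of the Art): Bugs | Default: H. Bugs | Default: ICov (%) | With modifications: Bugs | Δ Bugs | With modifications: H. Bugs | Δ H. Bugs | With modifications: ICov (%) | Δ ICov (pp) |
|---|---|---|---|---|---|---|---|---|---|---|
| libjpeg-turbo | a0b7de9* | 0 | - | 23.1 | 0 | 0 | - | - | 27.08 | +3.98 |
| libjpeg-turbo | 3091354 | 1 | 0 | 15.01 | 1 | 0 | 0 | 0 | 21.46 | +6.45 |
| imginfo | 4212e7e* | 0 | - | 10.36 | 0 | 0 | - | - | 14.22 | +3.86 |
| imginfo | b702259 | 1 | 1 | 10.39 | 2 | +1 | 2 | +1 | 11.3 | +0.91 |
| file | df74b09* | 0 | - | 12.61 | 0 | 0 | - | - | 16.62 | +4.01 |
| file | b6e8437 | 0 | 0 | 13.83 | 1 | +1 | 1 | +1 | 16.25 | +2.42 |
| tcpdump | 17a3c288* | 32 | - | 4.94 | 61 | +29 | - | - | 8.89 | +3.95 |
| tcpdump | a9e4211 | 40 | 2 | 4.91 | 69 | +29 | 3 | +1 | 10.03 | +5.12 |
| zlib | cacf7f1d* | 0 | - | 26.7 | 0 | 0 | - | - | 30.66 | +3.96 |
| zlib | 7c2a874 | 0 | 0 | 26.56 | 1 | +1 | 1 | +1 | 33.83 | +7.27 |
| tiff2pdf | d57ccfc9* | 0 | - | 8.48 | 0 | 0 | - | - | 7.68 | -0.80 |
| tiff2pdf | f64949a | 0 | 0 | 8.42 | 0 | 0 | 0 | 0 | 9.17 | +0.75 |
| flac | d2cb0d1* | 0 | - | 12.48 | 0 | 0 | - | - | 17.3 | +4.82 |
| flac | d8d1717 | 0 | 0 | 12.29 | 1 | +1 | 1 | +1 | 17.3 | +5.01 |
| w3m | 1ac245b* | 0 | - | 9.01 | 0 | 0 | - | - | 9.43 | +0.42 |
| w3m | 02ba3d6 | 4 | 0 | 9.07 | 5 | +1 | 0 | 0 | 9.13 | +0.06 |
| w3m-driver | 1ac245b* | 0 | - | 9.01 | 0 | 0 | - | - | 10.13 | +1.12 |
| w3m-driver | 02ba3d6 | 4 | 0 | 9.07 | 24 | +20 | 1 | +1 | 11.13 | +2.06 |
| Average change | | | | | | | | | | +3.08 |

TABLE III: Coverage and Bug Count when running KLEE on modified programs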

Table III shows the results obtained by running KLEE on the target applications, using manual intervention to alleviate issues, as described in Section IV.

V-A Results

Code Coverage

Our modifications resulted in increased coverage in every analysed program, except one version of tiff2pdf. The most significant improvements in percentage points (pp) were in zlib-7c2a874 (+7.27), libjpeg-3091354 (+6.45) and tcpdump-a9e4211 (+5.12). The average increase in coverage across all applications and versions was 3.08 percentage points.
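As a check on the reported average, the per-row coverage changes from Table III can be averaged directly. The sketch below is illustrative only: the array and function names are hypothetical, and the values are copied from the final column of the table.

```c
#include <stddef.h>

/* ICov change (pp) for each row of Table III, in table order. */
static const double icov_change_pp[] = {
     3.98,  6.45,   /* libjpeg-turbo */
     3.86,  0.91,   /* imginfo */
     4.01,  2.42,   /* file */
     3.95,  5.12,   /* tcpdump */
     3.96,  7.27,   /* zlib */
    -0.80,  0.75,   /* tiff2pdf */
     4.82,  5.01,   /* flac */
     0.42,  0.06,   /* w3m */
     1.12,  2.06    /* w3m-driver */
};

/* Arithmetic mean of count values; returns 0.0 for an empty input. */
double mean(const double *values, size_t count)
{
    double sum = 0.0;
    for (size_t i = 0; i < count; i++)
        sum += values[i];
    return count ? sum / (double)count : 0.0;
}
```

The 18 entries sum to 55.37, so mean(icov_change_pp, 18) is about 3.076, which rounds to the reported +3.08 percentage points.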

Regarding tiff2pdf: despite our attempts, KLEE still struggled to get beyond the initial portion of the parser. The parser contains a large number of loops with branches based on symbolic data and there was no clear set of logical assumptions which would help KLEE progress. The reason for the drop in code coverage is that the added assumptions, which aim to eliminate paths that appear superfluous, e.g. error paths, did not result in a corresponding increase in coverage elsewhere.

Bugs Found

Without manual intervention, KLEE found bugs in six of the program versions analysed. With manual intervention, bugs were found in three new programs, namely file, zlib and flac. Within the six program versions where bugs were discovered with manual intervention, the modifications resulted in an increase in the number of bugs found.

The most significant improvement was in tcpdump, with 29 new bugs discovered. At the time of analysis, these bugs were still present in the release version. Notable interventions responsible for the increase in bug count in tcpdump include: (1) adding assumptions to prevent KLEE from concretising to zero integers that were later used to control allocation sizes, and (2) replacing libc's printf with a low fidelity function.

The number of bugs found, and their diversity, also increased significantly on w3m-02ba3d6, from 4 bugs across 2 conversion functions to 24 bugs across 10 conversion functions. The most significant change for w3m was the use of a driver program, which enabled the direct analysis of compartmentalized libraries without the need to go through HTML parsing code.

Required Effort

Scientifically quantifying manual effort is always going to be a tricky endeavour. It depends on numerous variables whose measurement is often subjective, e.g. the skill of the manual analysts and the complexity of the application. Nevertheless, the time required for identifying and making the changes was a matter of minutes, i.e. less than 15 minutes. In total, we spent approximately two days per target (16 hours), which includes the analysis time taken by KLEE.

We conclude that the above data validates the assumption that manual intervention can alleviate a variety of issues encountered when analysing real world programs.

V-B Limitations

Manual intervention resulted in an increase in code coverage across most targets, and an increase in bugs found across almost half of the targets. However, the number of bugs found did not increase for 10 of the 18 targets on which we applied manual intervention. Despite increasing code coverage for many of these 10 targets, we were still not alleviating a sufficient number of the problems encountered by KLEE.

Manual intervention appears to hit diminishing returns after a few iterations of our process. In general, we observed that at first there are a very small number of problematic constructs that when addressed result in significant gains. Over time state forking and solver query time (which are usually the limiting factors) become more evenly distributed across the covered code and it becomes more difficult to determine what paths can be eliminated, and which ones should be kept. The cost of discovering the correct assumptions increases, and the pay-off, in terms of additional code coverage, decreases.

To give a concrete example: after 7 iterations of changes and analysis on file, 46% of all forks were distributed across its parse function. It is a complex function through which all states must pass, and thus it could not be replaced with a low fidelity model. As no paths were obviously uninteresting we also could not utilise assumptions to remove them.

Predicting the impact and validity of a change can also be quite difficult. On a number of occasions, we added an assumption only to discover at runtime that it was incompatible with an assumption added elsewhere, or conditions imposed by the source code itself.

For instance, Listing 10 is a code snippet taken from w3m, where line 3 is responsible for a large number of forks (45%). The first attempt made to alleviate this issue involved the addition of the klee_assume shown on line 2. However, the assumption is invalid, as states which reach this code have passed through functions that guarantee that the string is NULL terminated. This mistake was only noticed at runtime, when KLEE printed a warning regarding the logical inconsistency.

1  for (i = 0; i < s->length; i++) {
2      klee_assume(s->ptr[i] != '\0');
3      if (s->ptr[i] == '\0') {...}
4  }
Listing 10: An insertion of an invalid assumption for w3m.

VI Threats to Validity

Analysis Validity.

The key threat to validity concerns the quality of our modifications, which are derived manually. Whilst we were cautious, a priori knowledge of the bugs present in Hemiptera could have indirectly influenced us during our experiments. We reduced this threat in two ways. Firstly, apart from buggy commits, we also experimented with the head versions of the programs where no bugs were known. Secondly, modifications are justified according to concrete evidence; they are strictly included only if they address issues that are apparent in the data collected from experimental runs.

It is possible that different symbolic execution engines would have presented different challenges that were not apparent within KLEE. We primarily selected KLEE as it is seen as the state-of-the-art and has been employed extensively in previous work [6, 21, 22, 10].

Validity of Hemiptera.

A concern associated with internal validity of Hemiptera is the potential for false positives in the bugs included. We minimised this threat by triggering all bugs to confirm their presence. Another potential concern is biases in the selection process. Although completeness (i.e. recording all previous bugs in an application) was never our objective, we did strive to consider the majority of bugs in a practical manner so that such biases may be reduced.

Another potential threat is the generalisability of Hemiptera with respect to other applications. We address this concern by considering a reasonably high number of bugs across a diverse set of applications.

VII Related Work

In this section, we give an account of previous work aimed at making symbolic execution more effective for bug finding. We also give a short review of previously proposed bug datasets, and contrast them with Hemiptera.

VII-A Symbolic Execution

There is a substantial amount of research that has sought to alleviate the limitations of path-based symbolic execution. Many of these approaches involve applying symbolic execution on fragments of the target applications. For instance, ZESTI [21] uses KLEE to analyse program paths, exercised by regression tests, against all possible inputs, while UC-KLEE [22] is an extension of KLEE capable of analysing arbitrary function code with under-constrained symbolic input.

State pruning [4] has been employed to remove redundant states. For instance, AEG [1] only considers states that satisfy a given precondition, whilst Yi et al. [33] propose pruning off states that have their path conditions subsumed by previous exploration. FIE [11] uses state pruning to scale symbolic execution for bug discovery in firmware programs. It also makes loop counters symbolic in order to limit the number of states forked at looping constructs.

Concolic execution has been successful in avoiding some of the issues of symbolic execution when applied to whole applications [7, 28, 14]. It would be of interest to assess manual intervention within the context of concolic execution, but such an endeavour would warrant a full study of its own. Mayhem [7] takes a hybrid approach; it begins with symbolic execution, with forking enabled, but switches to concolic execution once a threshold is reached in order to avoid excessive memory consumption. Moreover, in both concolic and symbolic execution, there has also been research aimed at automatically resolving state space explosion due to loops and frequently called functions. Christakis et al. [20] make use of summarisation to verify the memory safety of the ANI Windows Image Parser. A number of researchers have presented systems that make use of state merging [2, 18].

As detailed in Section IV-E, indirect data flows pose challenges to tracking symbolic variables. Corin et al. [9] employ control-flow analysis techniques to mitigate this. In relation to the challenge in Section IV-B, Rawat et al. [23] propose heuristics to automatically identify error paths to avoid their consideration.

We expect that many of the mitigations we have documented, and manual intervention in general, would be beneficial in allowing the above tools to gain increased code coverage and bug finding capabilities, but this remains to be evaluated empirically.

VII-B Bug Datasets

One of the earliest works related to bug datasets was carried out by Wilander et al. [32]. The authors manually developed 44 C function calls to evaluate tools designed for static intrusion prevention. Zitser et al. [34] presented a dataset comprised of 14 bugs based on exploitable buffer overflows found in three open-source applications. The bugs were extracted and transformed into smaller program models to enable tools, which suffer from scalability issues, to be evaluated. Kratkiewicz et al. [15] built upon the work of Zitser et al. by proposing a greater number of test cases (as opposed to just 14). A similar approach was taken by Ku et al. [16], where their dataset consists of 298 generated test cases (half of which are faulty) derived from 22 bug kernels. Overall, these datasets have one underlying issue: their test cases are all synthetic. Threats to external generalisability arise as they fail to adequately capture the complexities of real-world software.

BugBench [19] was one of the first notable bug datasets to consider entire real-world applications, rather than bug kernels. Unfortunately, the main limiting factor of BugBench is its relatively low bug count (19 bugs in total), which hinders its diversity. BegBunch [8] was later released. Its test cases were generated by making use of bug kernels extracted from real-world software, similar to the approach taken by Zitser et al. [34]. However, a greater number of kernels were used, namely 64. In total, their relevant corpus includes 181 synthetic test cases.

The Juliet Test Suite [3] is a large collection of 81,056 synthetic C/C++ and Java programs provided by NIST Software Assurance Metrics and Tool Evaluation (SAMATE). All together, the test cases encompass 118 different CWEs. Furthermore, Shiraishi et al. [25] present a corpus of 1,276 simple synthetic programs that encapsulate common characteristics of automotive software. Only half of the test cases contain flaws.

More recently, Dolan-Gavitt et al. [13] proposed LAVA, a technique that automatically injects vulnerabilities into source code. The authors made use of their tool to inject vulnerabilities into four real-world applications; for example, LAVA inserted 1064 bugs in the readelf application. Although LAVA facilitates the construction of a large bug corpus, there is no guarantee that the injected flaws are representative of issues that occur in real world code.

The DARPA Cyber Grand Challenge [26, 27] provided a set of challenges comprised of 131 vulnerable binaries, constructed for the competition. They are designed to run on DECREE, a custom operating system which supports a limited number of system calls and was also specifically created for the challenge. As with the other synthetic corpora, it is not clear how well the programs, their vulnerabilities and execution environment correspond with real software.

Hemiptera has more real bugs than any other benchmark to date. Although it still faces threats related to generalisability, a researcher should have a higher degree of confidence when using Hemiptera, as results correlate with a large number of real-world bugs.

VIII Conclusion

In this work, we present Hemiptera, a novel bug suite, and use it to investigate the issues faced by KLEE, a state of the art symbolic execution engine, when analysing real-world software. We document an in-depth analysis of the issues that hinder KLEE and categorise the main challenges faced. While KLEE is capable of finding bugs on some of the targets in Hemiptera, our evaluation highlights the fact that a number of commonly occurring code constructs can severely hinder its progress.

We utilised the above categorisation to experimentally evaluate the effectiveness and limitations of manual intervention in symbolic execution. Whilst the approach is not without its issues, our experience shows an improvement in both bugs found and code coverage. According to these results, we conclude that manual intervention is likely to be a rewarding effort in the real world deployment of symbolic execution tools. We believe that the analysis of the issues found should also prove useful for future work on automated solutions, and that Hemiptera will be a valuable resource in evaluating such solutions.


  • [1] T Avgerinos, SK Cha, B Lim, and D Brumley. AEG: Automatic exploit generation. In Network and Distributed System Security Symposium. Internet Society, 2011.
  • [2] Thanassis Avgerinos, Alexandre Rebert, Sang Kil Cha, and David Brumley. Enhancing symbolic execution with Veritesting. In Proceedings of the 36th International Conference on Software Engineering, pages 1083–1094. ACM, 2014.
  • [3] Tim Boland and Paul E Black. Juliet 1.1 C/C++ and Java test suite. Computer, 45(10):0088–90, 2012.
  • [4] Suhabe Bugrara and Dawson Engler. Redundant state detection for dynamic symbolic execution. In Proceedings of the 2013 USENIX Conference on Annual Technical Conference, USENIX ATC’13, pages 199–212, Berkeley, CA, USA, 2013. USENIX Association.
  • [5] Cristian Cadar, Daniel Dunbar, Dawson R Engler, et al. KLEE: Unassisted and automatic generation of high-coverage tests for complex systems programs. In OSDI, volume 8, pages 209–224, 2008.
  • [6] Cristian Cadar, Patrice Godefroid, Sarfraz Khurshid, Corina S Păsăreanu, Koushik Sen, Nikolai Tillmann, and Willem Visser. Symbolic execution for software testing in practice: preliminary assessment. In Proceedings of the 33rd International Conference on Software Engineering, pages 1066–1071. ACM, 2011.
  • [7] Sang Kil Cha, Thanassis Avgerinos, Alexandre Rebert, and David Brumley. Unleashing mayhem on binary code. In 2012 IEEE Symposium on Security and Privacy, pages 380–394. IEEE, 2012.
  • [8] Cristina Cifuentes, Christian Hoermann, Nathan Keynes, Lian Li, Simon Long, Erica Mealy, Michael Mounteney, and Bernhard Scholz. Begbunch: Benchmarking for c bug detection tools. In Proceedings of the 2nd International Workshop on Defects in Large Software Systems: Held in conjunction with the ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA 2009), pages 16–20. ACM, 2009.
  • [9] Ricardo Corin and Felipe Andrés Manzano. Taint analysis of security code in the klee symbolic execution engine. In Proceedings of the 14th International Conference on Information and Communications Security, ICICS’12, pages 264–275, Berlin, Heidelberg, 2012. Springer-Verlag.
  • [10] Heming Cui, Gang Hu, Jingyue Wu, and Junfeng Yang. Verifying systems rules using rule-directed symbolic execution. In ACM SIGPLAN Notices, volume 48, pages 329–342. ACM, 2013.
  • [11] Drew Davidson, Benjamin Moench, Thomas Ristenpart, and Somesh Jha. Fie on firmware: Finding vulnerabilities in embedded systems using symbolic execution. In USENIX Security, pages 463–478, 2013.
  • [12] Brendan Dolan-Gavitt, Patrick Hulin, Engin Kirda, Tim Leek, Andrea Mambretti, Wil Robertson, Frederick Ulrich, and Ryan Whelan. LAVA: Large-scale automated vulnerability addition. Proceedings - 2016 IEEE Symposium on Security and Privacy, SP 2016, pages 110–121, 2016.
  • [13] Brendan Dolan-Gavitt, Patrick Hulin, Engin Kirda, Tim Leek, Andrea Mambretti, Wil Robertson, Frederick Ulrich, and Ryan Whelan. Lava: Large-scale automated vulnerability addition. In Security and Privacy (SP), 2016 IEEE Symposium on, pages 110–121. IEEE, 2016.
  • [14] Patrice Godefroid, Michael Y Levin, and David Molnar. Sage: whitebox fuzzing for security testing. Queue, 10(1):20, 2012.
  • [15] Kendra Kratkiewicz and Richard Lippmann. Using a diagnostic corpus of c programs to evaluate buffer overflow detection by static analysis tools.
  • [16] Kelvin Ku, Thomas E Hart, Marsha Chechik, and David Lie. A buffer overflow benchmark for software model checkers. In Proceedings of the twenty-second IEEE/ACM international conference on Automated software engineering, pages 389–392. ACM, 2007.
  • [17] Volodymyr Kuznetsov, Vitaly Chipounov, and George Candea. Testing closed-source binary device drivers with DDT. In Proceedings of the 2010 USENIX Conference on USENIX Annual Technical Conference, USENIXATC’10, pages 12–12, Berkeley, CA, USA, 2010. USENIX Association.
  • [18] Volodymyr Kuznetsov, Johannes Kinder, Stefan Bucur, and George Candea. Efficient state merging in symbolic execution. ACM Sigplan Notices, 47(6):193–204, 2012.
  • [19] Shan Lu, Zhenmin Li, Feng Qin, Lin Tan, Pin Zhou, and Yuanyuan Zhou. Bugbench: Benchmarks for evaluating bug detection tools. In Workshop on the evaluation of software defect detection tools, volume 5, 2005.
  • [20] Maria Christakis and Patrice Godefroid. Proving memory safety of the ANI Windows image parser using compositional exhaustive testing. Technical report, Microsoft, November 2013.
  • [21] Paul Dan Marinescu and Cristian Cadar. Make test-zesti: A symbolic execution solution for improving regression testing. In Proceedings of the 34th International Conference on Software Engineering, pages 716–726. IEEE Press, 2012.
  • [22] David A Ramos and Dawson R Engler. Under-constrained symbolic execution: Correctness checking for real code. In USENIX Security, pages 49–64, 2015.
  • [23] Sanjay Rawat, Vivek Jain, Ashish Kumar, Lucian Cojocar, Cristiano Giuffrida, and Herbert Bos. VUzzer: Application-aware Evolutionary Fuzzing. In NDSS, February 2017.
  • [24] Jonathan Salwan, Sébastien Bardin, and Marie-Laure Potet. Deobfuscation of vm based software protection. In Symposium sur la sécurité des technologies de l’information et des communications, SSTIC, France, Rennes, June 7-9 2017, pages 119–142. SSTIC, 2017.
  • [25] Shinichi Shiraishi, Veena Mohan, and Hemalatha Marimuthu. Test suites for benchmarks of static analysis tools. In Software Reliability Engineering Workshops (ISSREW), 2015 IEEE International Symposium on, pages 12–15. IEEE, 2015.
  • [26] Jia Song and Jim Alves-Foss. The darpa cyber grand challenge: A competitor’s perspective. IEEE Security & Privacy, 13(6):72–76, 2015.
  • [27] Jia Song and Jim Alves-Foss. The darpa cyber grand challenge: A competitor’s perspective, part 2. IEEE Security & Privacy, 14(1):76–81, 2016.
  • [28] Nick Stephens, John Grosen, Christopher Salls, Andrew Dutcher, Ruoyu Wang, Jacopo Corbetta, Yan Shoshitaishvili, Christopher Kruegel, and Giovanni Vigna. Driller: Augmenting fuzzing through selective symbolic execution. In Proceedings of the Network and Distributed System Security Symposium, 2016.
  • [29] [The KLEE Team]. Overview of the main KLEE intrinsic functions. https://klee.github.io/docs/intrinsics/. Accessed: 07-04-2017.
  • [30] Julien Vanegue and Shuvendu K. Lahiri. Towards practical reactive security audit using extended static checkers. In Proceedings of the 2013 IEEE Symposium on Security and Privacy, SP ’13, pages 33–47, Washington, DC, USA, 2013. IEEE Computer Society.
  • [31] Tielei Wang, Tao Wei, Zhiqiang Lin, and Wei Zou. Intscope: Automatically detecting integer overflow vulnerability in x86 binary using symbolic execution. In NDSS, 2009.
  • [32] John Wilander and Mariam Kamkar. A comparison of publicly available tools for static intrusion prevention. 2002.
  • [33] Qiuping Yi, Zijiang Yang, Shengjian Guo, Chao Wang, Jian Liu, and Chen Zhao. Eliminating path redundancy via postconditioned symbolic execution. IEEE Transactions on Software Engineering, 2017.
  • [34] Misha Zitser, Richard Lippmann, and Tim Leek. Testing static analysis tools using exploitable buffer overflows from open source code. In ACM SIGSOFT Software Engineering Notes, volume 29, pages 97–106. ACM, 2004.