Automated Customized Bug-Benchmark Generation

01/09/2019
by Vineeth Kashyap, et al., GrammaTech

We introduce Bug-Injector, a system that automatically creates benchmarks for customized evaluation of static analysis tools. We share a benchmark generated using Bug-Injector and illustrate its efficacy by using it to evaluate the recall of leading open-source static analysis tools. Bug-Injector works by inserting bugs based on bug templates into real-world host programs. It searches dynamic program traces of the host program for points where the dynamic state satisfies a bug template's preconditions and modifies the host program's code at those points to inject a bug based on the template. Every generated test case is accompanied by a program input whose trace has been shown to contain a dynamically-observed state triggering the injected bug. This approach allows us to generate on-demand test suites that meet a broad range of requirements and desiderata for bug benchmarks that we have identified. It also allows us to create customized benchmarks suitable for evaluating tools for a specific use case (i.e., a given codebase and bug types). Our experimental evaluation demonstrates the suitability of our generated test suites for evaluating static bug-detection tools and for comparing the performance of multiple tools.



I Introduction

Several static analysis tools exist today for finding bugs in programs. Researchers in academia and industry are constantly working on creating new tools and sophisticated techniques for bug finding; however, evaluating static analysis tools remains a challenge. A good evaluation system can guide impactful improvements in bug-finding tools by revealing their blind spots, furthering adoption and effective use.

In this paper, we mainly focus on one key aspect of evaluating static analysis tools: the recall of a tool. That is, how well can the tool find all the real bugs in a program? Answering this question in a convincing manner is difficult. It is hard—if not impossible—to enumerate all possible bugs in any non-trivial program. However, we can estimate the recall of a tool by counting how many previously-known bugs in a given set of programs are found by the tool. Such estimated recall rates can be particularly useful for comparing different tools. There is a large body of previous work [17, 44, 42, 13, 45, 46, 32, 43, 29, 33, 14, 47, 34, 37, 21, 40] on creating benchmarks containing known bugs. Despite this significant progress, a recent study by Delaitre et al. [19] found that there is still a shortage of test cases for evaluating static analysis tools and a need for real-world software with known bugs.

To address this need, we first discuss some desirable properties in a benchmark suite that contains known bugs and is targeted towards evaluating static analysis tools.

Real-world-like

The benchmarks should be representative of real-world code and complexity. While small and artificial benchmarks can also be useful, static analysis tools should ultimately be evaluated on the kinds of programs they will be applied to during software development or audit.

Reliable ground truth

The known bugs in the benchmarks should be real bugs, i.e., they should manifest on at least one execution of the program. If they are not, any recall estimate based on the benchmarks is not meaningful. Ideally, each known bug should come with a proof-of-existence, such as an input that can trigger the bug.

Automated, not fixed

An automated approach to benchmark suite generation is appealing because it eliminates the problem of benchmark staleness: new tests can be generated on demand, without manual effort, in the quantity desired for statistical significance.

Customizable

Users have differing needs and codebases: one fixed benchmark does not suit all. For example, Herter et al. [26] suggest that certain sectors (such as the aerospace and automotive industries) deem recursive function calls inappropriate. Clearly, the same restriction does not apply to all industries: some organizations will have codebases that do include recursion, and they will want static analysis tools to reason about recursive functions. Also, not all bug classes are equally important to all users of static analysis tools. Thus, it is useful to create a customized benchmark targeting the user’s codebase and bug distribution expectations.

Broad coverage of bug types

Benchmarks should include bugs of a wide variety of bug types: for example, bugs corresponding to a large set of different Common Weakness Enumeration entries (CWEs) [5]. Users can choose to customize their evaluation by disregarding certain types of bugs.

Suitable for usefully evaluating and comparing the recall of static analysis tools

In addition to having all the above properties, the benchmark should (a) enable comparing and contrasting the strengths and weaknesses of static analysis tools, and (b) provide guidance for further improving the recall of a given tool, e.g., by including bugs that are within scope for the tool in principle but that the tool is unable to detect.

Techniques used for benchmark suite creation are largely independent of techniques being evaluated

This property aims to avoid circularity: not relying on static analysis techniques to create the benchmark suites reduces the limitations and bias that those techniques would otherwise impose on the evaluation.

We address all of the above desired properties through Bug-Injector, a system that automatically generates benchmarks containing known bugs. Bug-Injector-generated benchmarks have a broad range of applications, but the one we present in this paper is particularly suited to estimating and comparing the recall rates of static analysis tools.

Bug-Injector starts from (i) a set of bug templates (§ III-B) that represent known bugs, (ii) a host program, i.e., an existing real-world software application, and (iii) tests to exercise the host program. It searches dynamic program traces of the host program to identify injection points where the dynamic state satisfies a bug template’s preconditions. Using dynamic state to identify bug injection locations provides independence from the bias and limitations of static analysis techniques (e.g., the precision of pointer analysis or SMT solver weaknesses). For a random subset of injection points found, Bug-Injector modifies the host program code to insert a bug corresponding to the bug template, integrating with the host program’s data and control flow. Bug-Injector outputs multiple versions of each host program, each containing one injected bug, and each associated with the input to the program that can trigger the injected bug.

Bug-Injector can inject bugs from a variety of bug classes into a variety of programs of different size, functionality, and complexity (§ IV). This customizability allows for additional uses of Bug-Injector beyond tool evaluation. For instance, a tool developer who creates analysis checkers for a new kind of bug can use Bug-Injector to generate test cases containing bugs of that kind to quickly evaluate the checker against real-world software (§ III-D). Using Bug-Injector in this manner can complement the typical method of testing analysis checkers with small, hand-crafted test cases.

The specific contributions of this paper are:

  1. The Bug-Injector system, a novel technique to automatically generate customized, real-world-like benchmarks with known bugs that can be triggered using accompanying inputs. We describe Bug-Injector’s architecture, functionality, and underlying algorithms in § III.

  2. An openly-available benchmark suite (§ IV) generated using Bug-Injector. We created bug templates (both manually and automatically) from different sources belonging to a wide variety of CWEs [5] and injected them into open-source real-world programs.

  3. An extensive evaluation of two leading open-source static analysis tools for C/C++ programs—Clang Static Analyzer (CSA) [2] and Infer [7]—on our generated benchmarks. Our results (§ VI) show that: (a) both of these tools fail to detect bugs that are seemingly in scope for them, (b) our benchmarks can contrast the two tools, and (c) our benchmarks can contrast between two analysis configurations of CSA, showing that Bug-Injector could be used to automatically tune analysis configurations customized to a codebase. Additionally, we show that a closely related work, LAVA [11], is not suitable for contrasting and comparing static analysis tools.

In the remainder of the paper, we survey and compare to related work (§ II), describe challenges in estimating recall of tools (§ V), discuss limitations and future work (§ VII), and conclude (§ VIII).

II Related work

Creating bug-containing benchmarks for testing and evaluating bug-finding tools has attracted significant research attention in recent years. In this section, we compare Bug-Injector to the closest related work, summarized in Table I.

Property                 BI   LAVA  EC   Synth  Wild
Real-world-like          Yes  Yes   Yes  No     Yes
Reliable ground truth    Yes  Yes   No   Yes*   Ltd.
Automated, not fixed     Yes  Yes   Yes  No     No
Customizable             Yes  Yes   Yes  No     No
Wide coverage of CWEs    Yes  No    No   Yes    No
Static tool evaluation?  Yes  No    No   Ltd.   Ltd.
Independent?             Yes  Ltd.  No   Yes    Yes
TABLE I: Summary comparing Bug-Injector (BI) with other closely related work across the different properties outlined in § I. Legend: EC = EvilCoder, Synth = Synthetic benchmarks, Ltd. = Limited, Yes* = subject to some errors. See the § II prose for further details.

Synthetic benchmarks

Several efforts have targeted manual creation of artificial test programs containing bugs. Some prominent examples are: Juliet tests [17], the IARPA STONESOUP snippets [44], Toyota ITC benchmarks [42], OWASP WebGoat [13], Wilander et al. [45, 46], and ABM [32]. However, synthetic benchmarks have limited applicability in identifying how tools perform on real-world code.

Wild

Bugs may be mined and curated from real-world software. Some prominent examples of such curated bug collections are: BugZoo [43], Defects4J [28, 24], BugBench [29], BugBox [33], SecuriBench [14], and Zitser et al. [47]. While these have the advantage of being real-world-like, they come with varying degrees of ground truth, and not all of them include a proof-of-existence. There is also very little benchmark-user customizability with respect to bug type coverage and distribution.

The curation of both wild and synthetic benchmarks requires substantial manual effort and is prone to errors (e.g., both the Juliet test cases and the Toyota ITC benchmarks have required corrections [17, 26]). They are fixed and not customizable, with pre-determined target code constructs and bug types. They therefore have limited applicability for evaluating and comparing the recall of static analysis tools. SARD [34] is perhaps the largest openly available collection of known buggy test programs, put together by the SAMATE group at NIST. It contains both synthetic and wild benchmarks.

EvilCoder

This system [37] uses static analysis to find sensitive sinks in a host program and connects them to a user-controlled source to inject taint-based bugs. A significant disadvantage is that there is no guarantee that inserted bugs are true positives—which makes it impossible to measure projected recall. Indeed, the paper does not evaluate bug-finding tools on EvilCoder test cases. EvilCoder-injected bugs also inherit the limitations of the static analysis tools used as part of the injection pipeline, and therefore may bias the evaluation of other static analysis tools. EvilCoder is also limited to taint-based bugs.

Lava

This system [21, 11] inserts bugs into host programs by identifying situations where user-controlled input can trigger an out-of-bounds read or write. LAVA bugs come with an input to trigger the bug and are validated to check that they return exit codes associated with buffer overflows. However, this approach is limited to inserting buffer overruns, and other kinds of bugs are left as future work. More recently [27], LAVA has been extended to a small number of additional bug types. LAVA test cases are generated to satisfy an additional goal: the bugs must manifest only on a small fraction of all possible inputs. This requirement seems targeted towards testing fuzzing tools; we do not think it is necessarily applicable in the context of testing static analysis tools (many famous bugs, e.g., HeartBleed [16], execute on the majority of possible inputs). To satisfy the requirement, LAVA injects bugs with certain patterns (the “knob and trigger” pattern, which relies on magic values). It is unclear how realistic this bug pattern is with respect to bugs found in production software. A more detailed discussion of the suitability of LAVA benchmarks for static analysis evaluation is provided in § VI-D. Another closely related technique is Apocalypse [40], which is similarly targeted towards creating challenging benchmarks for fuzzing and concolic execution tools.

As opposed to the synthetic and wild benchmarks, EvilCoder, LAVA, and Bug-Injector are automated and can create as many bugs as required in custom real-world programs.

Bug-Injector uses bug templates and a host program to produce a suite of programs containing one known bug apiece, along with an input that can trigger each bug. The available bug templates cover a large number of CWEs, and new bug templates are easy to create. Through empirical evaluation, we show in § VI that Bug-Injector generated benchmarks are suitable for evaluating and comparing the recall of static analysis tools.

Another related technique is mutation testing [20, 25]. Mutation testing is used to evaluate the quality of a test suite, and is different from our work because the mutations are much simpler, are not dynamically targeted, and are not guaranteed to introduce real bugs (e.g., with a demonstrating input).

III Bug Injector

In this section, we describe the tooling used for Bug-Injector, introduce bug templates, and describe how Bug-Injector works. We illustrate the injection of a bug template into a host program, and discuss potential applications.

III-A Tooling

Bug-Injector is implemented using the Software Evolution Library (SEL) [41], an open-source toolchain that provides a uniform interface for instrumenting, tracing, and modifying software. SEL supports multiple programming languages. Currently, Bug-Injector works on C/C++, Java, and JavaScript software (Java and JavaScript support is experimental and under heavy development). In this paper, we focus on Bug-Injector as applied to C/C++ software. C/C++ software modifications are made using SEL, implemented via Clang’s libtooling API. Clang’s libtooling provides a solid foundation for parsing and program modification in the presence of the latest C/C++ syntactic features, making Bug-Injector applicable to a wide range of C/C++ software.

III-B Bug templates

void f1(char *src) {
  char *dst = 0;            // 'dst' initialized to a null ptr
  memcpy(dst + 0, src, 10); // expected warning: null ptr argument
                            // in call to memory copy function
}
(a) Clang Static Analyzer (CSA) regression test, annotated with expected tool behavior. This test program contains a bug: the first argument to the memory copy function is null.
code            memcpy($dst + 0, $src, 10);
free-variables  $dst: pointer to char
                $src: pointer to char
precondition    $dst == 0
postcondition   true
(b) A bug template containing a single patch corresponding to the bug in (a). For presentation, the patch has been abstracted from its Common Lisp definition (example definitions are available in Appendix A).
 /* global variable declarations */
 static char *lastout;
 static char *prog;
 /* ... lots of code, some of which manipulates globals */
 static int grep(int fd) {
   /* ... more code */
+  /* from input (./harness.sh test BIN 1) */
+  /* POTENTIAL FLAW */
+  memcpy(lastout + 0, prog, 10);
   reset(fd);
   lastout=0;
(c) The diff between the original and buggy version of grep resulting from injection of the bug template in Fig. 1(b).
Fig. 1: A CSA regression test (Fig. 1(a)), a corresponding bug template (Fig. 1(b)) created manually, and the diff resulting from injection of this bug template into the grep program (Fig. 1(c)).

Bug-Injector is able to inject a wide range of bug types, based on the provided bug templates. A bug template is defined in Common Lisp, and it specifies: (a) the dynamic and static requirements for a successful bug injection, (b) the code constituting the bug itself, and (c) how this code should be integrated into the program. An example bug template is provided in Fig. 1(b). A bug template consists of one or more patches, where a patch has the following fields.

code

The buggy code that will be inserted into the host program. In (b), the buggy code is a call to memcpy.

free-variables

A list of type-qualified free variables in the buggy code. These free variables are matched to type-compatible in-scope variables at the injection location. The occurrences of the free variables in code are replaced with the matched host program variables before injection. In the example in Fig. 1(b), the specified free variables ensure that during injection, $dst and $src are bound to host program variables of type pointer to char.

precondition

An arbitrary boolean predicate over the values of the in-scope variables at a program point in the dynamic trace and the abstract syntax tree of the related program point. Bug-Injector uses this function to search the dynamic traces for suitable injection locations: points in the trace where dynamically-observed variable values satisfy the precondition. The input that gives rise to a trace is called the “witness” of that trace. The buggy code injected into the source at the precondition-matching location will be executed by the witness input. In the example given in (b), the precondition specifies that at an injection location, an in-scope variable that will be bound to $dst is a null pointer.

postcondition

An optional boolean predicate that must hold after the buggy code has been exercised. This predicate is used to validate dynamically that the witness triggers the bug. If no postcondition is specified (or, equivalently, a trivial “true” postcondition is specified, as in the example in Fig. 1(b)), no additional validation is performed. Otherwise, instrumentation is inserted to check that the postcondition holds. All validation instrumentation is removed before delivery of the buggy program.

The example bug template in Fig. 1(b) was manually created from an existing regression test (Fig. 1(a)) for CSA. This regression test contains a bug at the call to memcpy: its first argument is a null pointer. A successful injection of the bug template in Fig. 1(b) will insert a call to memcpy where $src and $dst are replaced with host program variables and the variable bound to $dst is null before the call to memcpy. Thus, the bug injection attempts to create the same kind of bug, but embedded and integrated with the host program’s data and control flow. In general, creating a bug template from a bug example requires identifying (a) the relevant buggy code, (b) the free variables in the buggy code that must be rebound in the host program, (c) preconditions to ensure the bug is successfully transferred to the host program, and (d) optionally, postconditions to verify the existence of the bug after injection. A minimal sketch of how such a precondition can be checked against a single dynamically-observed trace point follows.
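The sketch below is a simplified, hypothetical illustration written in C for concreteness; Bug-Injector itself expresses preconditions as Common Lisp predicates, and the trace-point representation and function names here are our own invention. It only demonstrates the idea that an injection point is a trace point whose observed variable values satisfy the template's precondition (here, the Fig. 1(b) precondition that the candidate binding for $dst is a null pointer to char).

#include <stddef.h>
#include <string.h>

struct var_binding {              /* one in-scope variable at a trace point */
    const char *name;             /* variable name in the host program */
    const char *type;             /* e.g., "pointer to char" */
    unsigned long long value;     /* observed primitive/pointer value */
};

struct trace_point {
    int stmt_id;                       /* host-program statement identifier */
    const struct var_binding *vars;    /* in-scope variables at this point */
    size_t nvars;
};

/* Precondition of the Fig. 1(b) template: some in-scope "pointer to char"
 * variable (the candidate binding for $dst) is observed to be null. */
static const struct var_binding *
match_dst_precondition(const struct trace_point *pt)
{
    for (size_t i = 0; i < pt->nvars; i++) {
        if (strcmp(pt->vars[i].type, "pointer to char") == 0 &&
            pt->vars[i].value == 0)
            return &pt->vars[i];       /* candidate binding for $dst */
    }
    return NULL;                       /* precondition not satisfied here */
}

int main(void)
{
    const struct var_binding vars[] = {
        { "lastout", "pointer to char", 0 },       /* null: matches $dst */
        { "prog",    "pointer to char", 0x55de0 }, /* non-null; could bind $src */
    };
    const struct trace_point pt = { 42, vars, 2 };
    return match_dst_precondition(&pt) ? 0 : 1;    /* 0: injection point found */
}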

III-C Technique

Fig. 2: Bug-Injector pipeline (diagram not reproduced). The host program is instrumented and executed on program inputs to collect trace points; bug templates are matched against these trace points; injection and validation then produce buggy programs with witnesses.

The Bug-Injector pipeline of instrument, execute, and inject is shown in Figure 2 and described in the algorithm in Figure 3. Bug-Injector takes three inputs: (1) a host program, (2) a set of tests for this program, and (3) a set of bug templates. It attempts to inject bugs from the set of bug templates into the host program, and returns multiple different buggy versions of the host program. Each returned buggy program variant has at least one known bug (the one that was injected), and is associated with a witness—a test input which is known to exercise the injected bug.

Fig. 3: The Bug-Injector algorithm (pseudocode listing not reproduced here). In outline: instrument the host program; execute the instrumented program on each test input, collecting traces; then, for each bug template, search the collected traces for candidate injection points, inject the template at randomly sampled candidates, validate the result, and return the buggy program variants together with their witness inputs.

The Bug-Injector algorithm begins by instrumenting the host program (Figure 3). The instrumentation step rewrites the source code of the host program, inserting code to emit dynamic trace output. Traces include the values of all in-scope variables (currently limited to primitive types and pointers) at every program statement. The algorithm then runs the instrumented program with the test inputs. The collected traces (capped by a size parameter, MaxTrace) and the test inputs that produced them are stored efficiently in a persistent binary-format trace database.
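As a rough, hypothetical illustration of the effect of this stage (this is not SEL's actual instrumentation or trace format; real traces go to a binary database rather than to stderr, and the function below is invented), instrumented host code might resemble:

#include <stdio.h>

/* Simplified stand-in for a host-program function after instrumentation. */
static int count_match(int fd)
{
    int nmatches = 0;
    char *lastout = 0;

    /* emitted trace record: statement id plus in-scope primitive/pointer values */
    fprintf(stderr, "TRACE stmt=1 fd=%d nmatches=%d lastout=%p\n",
            fd, nmatches, (void *)lastout);
    nmatches += (fd > 0);          /* original host-program statement */

    fprintf(stderr, "TRACE stmt=2 fd=%d nmatches=%d lastout=%p\n",
            fd, nmatches, (void *)lastout);
    lastout = 0;                   /* original host-program statement */

    return nmatches;
}

int main(void)
{
    return count_match(1) == 1 ? 0 : 1;
}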

Bug-Injector then attempts to inject each bug template NumInjection times. For every bug template, it searches the trace database for candidate sets of program points that match all the preconditions of the patches in that bug template. The search returns a list of candidates: each candidate is a tuple of points—one program point per patch in the bug template—and a witness input. Bug-Injector randomly samples NumInjection candidates for injection. The candidates picked for injection are then used to rewrite the source code locations associated with each of the points: the code snippet of each patch is inserted into the host program, with all free variable names renamed to the precondition-matching, type-compatible in-scope variables of the host program.

To validate a non-trivial template postcondition, Bug-Injector adds instrumentation (removed after validation succeeds) to the modified program and re-executes it on the witness input to dynamically check that the injected bug is triggered. The buggy program and its associated witness are then added to the output. After exhausting the given number of injections, the accumulated set of buggy programs and witnesses is returned.

As an example, consider the injection of the bug template given in Fig. 1(b) into the C program grep, resulting in the buggy version of grep shown in Fig. 1(c). In this instance, Bug-Injector reuses host program global static variables, lastout and prog, for the call to memcpy. For diagnostic purposes, the injected buggy code is preceded by a comment indicating the input “witness” for the injected bug. When run with this input, the value of lastout is null before the call to memcpy during at least one point in the program execution. This injection successfully reproduces the “memcpy should not be called with its first argument being a null pointer” bug, but in a different code context.

CSA emits a warning about the bug in its own regression test (Fig. 1(a)). However, CSA fails to report a warning for the similar injected bug in this buggy version of grep. Thus, in this case, CSA has “lost” the bug due to its injection into a more complex context; this lost bug is worth further investigation by the static analysis check developer.

III-D Uses of Bug-Injector

One of the applications of Bug-Injector is to provide feedback to static analysis tool developers regarding the false-negative rate of their “checkers” on real-world programs. A typical workflow for building static analysis checkers (based on our personal experience talking to static analysis tool developers, and partly inspired by the static analysis checker development tutorial for Phasar [15]) is an iterative process: (1) develop a checker to detect violations of a program property, (2) test the static analysis checker on some manually crafted test programs, (3) deploy the checker into production, (4) identify failures and false-negative corner cases for the checker, and (5) iterate and improve the checker. Bug-Injector can be used to improve and speed up this slow process: instead of manually crafting test cases for the checker being developed, we can craft relevant bug templates. Bug-Injector can then generate customized benchmarks by injecting these bug templates into real-world programs. The static analysis checker can then be tested on the generated benchmarks to obtain early feedback regarding the checker’s performance (such as expected false-negative rate and scalability) before deploying the checker into production.

Another application of Bug-Injector is customized evaluation of static analysis tools, as we have done in § VI. We also provided the SAMATE group at NIST with Bug-Injector. This group is conducting SATE VI [35], the sixth iteration of the Static Analysis Tool Exposition. SATE is a non-competitive study of static analysis tool effectiveness, aimed at improving tools and increasing public awareness and adoption. SATE VI is already making use of Bug-Injector generated test programs, in addition to manually crafted test programs. Further, NIST expects to make extensive use of Bug-Injector for SATE VII, the next iteration of SATE. To quote the initial experience of the NIST team with Bug-Injector: “using Bug-Injector to generate benchmarks is much faster (at least five times as fast) than using our current manual benchmark generation process.” For SATE VI, the participating static analysis vendors can compare how well they perform on Bug-Injector generated benchmarks vs. the manually created benchmarks, providing a useful broader study of the effectiveness of Bug-Injector. NIST also plans to add Bug-Injector generated tests to the SARD dataset [34].

IV Our benchmark suite

In this section, we describe the test suite that we have generated using Bug-Injector. We use an appropriate subset of these generated tests in our evaluation (§ VI). The test suite and the bug templates used are available online for use by the community [1]. We also discuss performance measurements and statistics related to generating the test suite.

IV-A Selection of bug templates

We create bug templates from a number of sources to satisfy several goals. First, we want our test suite to allow a fair evaluation of CSA [2] and Infer [7], so we inject bugs that these tools care about. Both tools support the detection of buffer overflows and null pointer dereferences. Therefore, we collect examples of those two kinds of bugs that appear directly in each tool’s documentation [3, 10] and regression test suites [4, 9]. We did not perform a formal verification of whether those examples contained the bugs they claimed, although we manually examined each example before converting it to a bug template. The conversion of a bug example to a bug template is fairly straightforward (described in § III-B) and only took on the order of a few minutes per example.

Second, we want to demonstrate that Bug-Injector can generate test cases across a wide variety of bug types and CWE categories [5]. To this end, we automatically converted a number of examples from the Juliet test suite (version 1.3 [17]) into bug templates; the examples span 55 unique CWE types, from stack-based buffer overflows (CWE-121) to type confusion (CWE-843). We exploited the uniform structure of Juliet tests to create these bug templates: we automatically extract free variables, preconditions, and code to inject from the Juliet test suite using both static and dynamic information from each example.

Table II lists the number of bug templates we created from each of the sources mentioned above. We also report additional statistics regarding the complexity of injected code. On average, the number of lines to be injected is quite small, with a low number of control-flow statements.

Source Kind                 Templates  Mean LOC  Mean FVars  Mean CF Stmts
CSA documentation [3]       11         2.77      0.77        0.15
Infer documentation [10]    2          8.00      1.00        2.00
Infer regression tests [9]  4          2.25      0.75        0.25
Juliet tests [17]           55         7.79      1.25        0.85
TABLE II: The number of bug templates per source kind. The remaining columns provide means over each set of templates for: (a) the number of lines of code to be injected, (b) the number of free variables to be rebound, and (c) the number of control-flow statements in the injected code, respectively.

IV-B Selection of host programs

Project     Version  LOC      Prep Time (s)  Sites/KLOC  Query Time (s)
grep [6]    2.0      12,225   66             372.76      1.76
nginx [12]  1.13.0   177,988  766            7.62        5.03
TABLE III: Host programs used for evaluation. LOC gives the lines of code in the program. “Prep Time” and “Query Time” are given in seconds and explained in § IV-C. “Sites/KLOC” provides the average number of injection sites available per thousand lines of code.

We use the open-source projects listed in Table III as host programs for generating our test suite. We have successfully injected bugs into other C/C++ host programs, which we have not included in this paper due to experimental evaluation resource constraints. One such excluded program is WireShark version 1.12.9, which has 2.3 million lines of code and is the largest program into which we have successfully injected bugs. The host programs we employ demonstrate a range of real-world programming constructs and showcase Bug-Injector’s ability to inject into a variety of real-world projects. Test suites with good program coverage for the host programs provide a large number of distinct trace points for Bug-Injector, improving the chances of finding many suitable injection points.

Beyond utilizing real-world host programs, our injected bugs are similar to bugs arising from normal development. The test suite variants are uniformly formatted using a code beautification tool, ensuring the injection does not stand out due to code-style differences. As shown in Table II, the bug templates typically include a small amount of code. These characteristics, along with the use of existing program variables (through free variable rebinding), allow the injections to meld with the existing code and look realistic (e.g., see Fig. 1(c) and Figure 4).

IV-C Performance characteristics of Bug-Injector

As discussed in § III, Bug-Injector operates in a pipeline of several stages. Performance in these stages depends on characteristics of the host program and the bug template set. Table III summarizes the key characteristics and performance data for the host programs. Timing experiments were performed on an Intel(R) Xeon(R) 2.10 GHz machine with 72 cores and 128 GB of RAM.

In the instrument and execute stages, Bug-Injector parses the host program, adds instrumentation, and runs the program with test inputs to collect traces. The time required for this stage depends on the size of the program, the number of variables it contains, and the number of input tests to run; the “Prep Time” column in Table III provides this information for each host program. This prep time is a one-time cost that is amortized over the number of bugs to be injected into the same host program.
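As a rough, back-of-the-envelope illustration of this amortization (our own calculation from Table III and the nginx totals in Table V; the actual per-variant cost depends on how many variants are generated from a single prep run):

    766 s (nginx prep time) / 248 generated nginx variants ≈ 3.1 s of prep time per buggy variant.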

The inject stage involves searching the trace database for points satisfying the bug template preconditions. The time required per injection depends on the number of points collected in the trace, the percentage of points that satisfy the precondition and free-variable requirements, and the complexity of the precondition. The “Sites/KLOC” column in Table III provides the number of host-program sites suited for injection based on our bug templates, per thousand lines of code. The “Query Time” column gives the median time (in seconds) per query. The grep program contained a large number of string and integer variables, and therefore showed a higher density of potential injection sites; conversely, nginx, with few integer variables, had a lower density of injection sites.
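For example, reading Table III, grep's density corresponds to roughly 12.225 KLOC × 372.76 sites/KLOC ≈ 4,557 candidate injection sites across its traces for our bug templates (a figure we derive from the table here for illustration, not one reported directly).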

Lastly, Bug-Injector edits the program, applies code formatting to the buggy software, and writes it out to disk. The time required to apply code formatting and print the buggy program is directly proportional to the program size.

Overall, the prep time dominates the pipeline as the most expensive stage. Given the offline nature of benchmark creation, we believe the performance of Bug-Injector is reasonable.

V Estimating static analysis recall

As previously discussed in § I, it is difficult to compute the exact recall of a tool over all bugs in any real-world program, due to the absence of a complete list of bugs in the program. Thus, Bug-Injector (as well as all other related work) estimates the recall of a static analysis tool using the set of known bugs in a given benchmark, which is a subset (possibly strict) of all the bugs actually present in that benchmark. The set of known bugs in a given benchmark is referred to as the ground truth for the benchmark. In this section, we discuss some practical issues in representing ground truth for the purposes of evaluating static analysis tools.

Ground truth accuracy

That is, can the bugs in the provided list manifest in at least one execution of the program? LAVA [21] provides backtraces for each test case showing that the bugs included are real. EvilCoder [37], however, provides no such guarantees. Bug-Injector benchmarks come with inputs which can generate dynamically-observed program states where the required preconditions (and postconditions when provided) are met. Thus, the guarantees provided by Bug-Injector are relative to the correctness of the bug template specification. Consequently, it is important for the user to create bug templates with care.

Matching ground truth to tool output

Ground truth must include information such as location and bug type for each listed bug. This information allows automated or semi-automated matching of a tool’s output with the ground truth. There are various pitfalls in providing this information: there may be multiple locations associated with a bug, multiple bug types associated with the same bug, multiple bugs in the same location (depending on the granularity of the location), or lexically distinct languages used by tools to warn about the same bug type. Several recent studies [26, 19] elaborate on these problems.

Matching real-world bug distribution

Bug-Injector gives us control over how many of each type of bug we inject. By injecting bugs of a type that are harder or easier for a given tool to detect, one can influence the measured recall of the tool on the generated benchmark. Unfortunately, it is difficult to know the real-world distribution of different CWEs (which may be distinct from the distribution of known real-world CWEs). This does not prevent the use of Bug-Injector for comparing the relative recall of two tools on particular bug types of interest or between different settings of the same tool.

LAVA [21] injects only buffer overflows, so the bug type is known up front. Every test case includes a backtrace that showcases the bug. While this may be sufficient for evaluating fuzzers or manually inspecting static analysis results, it can be difficult to automate. For example, do you credit a tool with finding a bug only if it warns about the location at the top of the backtrace, or is it sufficient for it to warn about any location in the backtrace? Are there other relevant locations in the program that can be justifiably reported by static analysis tools? For the LAVA-1 dataset, we found empirically that key locations in the backtraces can be matched to invocations of the synthetic method lava_get() in the source code. Consequently, we interpret the ground truth to be the set of these locations.

For our Bug-Injector benchmark, such additional ground truth information is implicit in the bug templates (which specify the bug type) and the locations where the injection was performed. As shown in the example in Figure 1, the location of injection can be determined by examining the source code difference between the original and injected program.

A further hurdle to automation is that there is no standardized format for the output of a static analysis tool that all tools adhere to, and often no direct way to determine which specific bug a tool is reporting. In practice, the evaluator must typically rely on manually created heuristics that match the tool’s reports with ground truth based on location and warning type. This approach has some limitations, notably the possibility of mistakenly failing to credit the tool with a true positive because it reports a slightly different but related bug, or because it reports the correct bug at a slightly different location. Adding some “tolerances” to the location heuristics, such as allowing a neighborhood of several lines of code around the expected bug location, can mitigate this problem but may cause its own issues if the tool detects unrelated bugs in the neighborhood. In our experimental evaluation (§ VI-B), we explicitly discuss how we credit tools for finding appropriate bugs in our benchmarks.
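A minimal sketch of such a matching heuristic, under our own simplifying assumptions (a flat list of accepted warning types, as in Table IV, and a fixed line tolerance; this is an illustration, not the exact scripts used in our evaluation):

#include <stdlib.h>
#include <string.h>

struct finding { const char *file; int line; const char *type; };

/* Is the reported warning type one of the accepted types (cf. Table IV)? */
static int type_accepted(const char *reported,
                         const char *const *accepted, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(reported, accepted[i]) == 0)
            return 1;
    return 0;
}

/* Credit a tool warning against a known injected bug when the files match,
 * the warning's line is within the tolerance of the injection site, and the
 * reported type is accepted for the injected bug kind. */
static int credits_bug(const struct finding *warning,
                       const struct finding *injected,
                       const char *const *accepted, size_t n, int tolerance)
{
    return strcmp(warning->file, injected->file) == 0 &&
           abs(warning->line - injected->line) <= tolerance &&
           type_accepted(warning->type, accepted, n);
}

int main(void)
{
    static const char *const accepted[] = { "Dereference of null pointer" };
    const struct finding bug  = { "grep.c", 120, "Null pointer dereference" };
    const struct finding warn = { "grep.c", 121, "Dereference of null pointer" };
    return credits_bug(&warn, &bug, accepted, 1, 5) ? 0 : 1;  /* 0: credited */
}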

VI Evaluation

In this section, we outline the research questions that direct our evaluation, describe our experimental methodology, report and discuss the results of our experiments, and compare our benchmark with the LAVA test cases [11].

VI-A Research questions

The goal of our evaluation is to answer the following research questions about Bug-Injector and the generated benchmarks.

RQ1:

Do the benchmarks contain bugs which are seemingly in scope for the tool but which the tool fails to detect? Such bugs could provide useful feedback to the tool’s developers.

RQ2:

Can the benchmarks discriminate between different static analysis tools? Such a discrimination allows for showcasing each tool’s strengths and weaknesses.

RQ3:

Can the benchmarks discriminate between different parameter settings for a given static analysis tool? Such an ability suggests the use of Bug-Injector for automated parameter tuning of a tool, customized to a given codebase.

RQ4:

Can Bug-Injector create benchmarks that include bugs from multiple CWEs?

In addition to answering the above research questions, we also compare Bug-Injector with the LAVA test suite with respect to the same themes.

VI-B Experimental setup and methodology

Static analysis tools and configurations

We perform our experiments using two state-of-the-art open-source static analysis tools for C/C++ programs: Clang Static Analyzer (CSA) [2] and Infer [7]. We use CSA version 3.8 (the version of CSA associated with Clang version 3.8; CSA itself appears to no longer use separate version numbers, and at the time of writing, the SWAMP platform offered this version of CSA) with all default checkers enabled, along with the optional alpha, security, osx, llvm, nullability, and optin checkers. We use Infer version 0.13.1 with default options and compute-analytics, biabduction, quandary, and bufferoverrun enabled. Our intention is to enable as many checkers as possible to maximize each tool’s chance of finding the injected bugs.

We run CSA using the Software Assurance Marketplace (SWAMP) on an Ubuntu 16.04 platform; SWAMP makes it easy to run a large number of analysis tasks. We run Infer using a Docker image provided by Facebook [8] on a machine running Ubuntu 14.04. In addition, we use Clang Static Analyzer with the analyzer configuration mode set to “shallow” (by default, the mode is “deep”). The shallow mode changes certain internal analysis parameters, such as the style of the inter-procedural analysis and the maximum inlinable size. We refer to this configuration of Clang Static Analyzer as CSA-S. All other properties for CSA and CSA-S are the same (e.g., the checkers enabled, the white-listed bug types in Table IV).

Metrics computed

We compute two main metrics in our evaluation: the projected recall of the tool, and the average total warnings of the tool per KLOC.

The projected recall metric is the percentage of the intentionally injected bugs that are found by a tool. To determine whether a tool found an injected bug successfully (i.e., the issue raised in § V), we consider the locations of the bug injection as the bug locations. If a tool finds a bug of the appropriate type on these source lines, we give the tool credit for finding the bug. Finding a bug of a different type on that line is not sufficient. Table IV summarizes which bug types reported by each tool are considered appropriate for the two types of injected bugs in our benchmarks. For our evaluation, we interpret the bug types reported by the tools quite generously, to maximize their chances of being credited with finding the injected bugs.

Tool   Injected bug type         Reported bug types considered appropriate
CSA    Buffer overrun            Out of bound array access; Result of operation is garbage or undefined; malloc() size overflow
CSA    Null pointer dereference  Dereference of null pointer; Uninitialized argument value; Argument with ‘nonnull’ attribute passed null
Infer  Buffer overrun            Array out of bounds; Buffer overrun; Memory leak; Stack variable address escape
Infer  Null pointer dereference  Array out of bounds; Buffer overrun; Dangling pointer dereference; Null dereference; Memory leak
TABLE IV: Summary of bug types considered for each evaluated tool: for each tool and each type of injected bug, the reported bug types that are counted as a match.

The average total warnings per KLOC metric provides the average number of total warnings reported by the tool for every thousand lines of code, on a given set of programs. A tool can obtain higher projected recall by simply reporting more warnings overall, which will increase its chances of also reporting an injected bug. Thus, it is useful to look at the above two metrics in conjunction. Note that comparing this metric directly between two analysis tools which do not have comparable warning classes is not particularly meaningful.
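Restating the two metrics in symbols (our notation): if a benchmark program contains N injected bugs, a tool is credited with d of them, and the tool reports W warnings in total over L thousand lines of code, then

    projected recall = 100 * (d / N) %        warnings/KLOC = W / L

and the tables report the latter averaged over the set of buggy programs.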

VI-C Experiments and results

To help answer the research questions RQ1, RQ2, and RQ3, we run CSA, CSA-S, and Infer on an appropriate subset of the generated benchmarks described in § IV. In particular, we do not include Juliet tests as bug template sources, because several of the bug types included in Juliet tests are not within the scope of CSA and Infer.

We present the gathered metrics from the above runs in Table V. In Table V, the bug templates derived from CSA documentation are collectively labeled clang-all, and those derived from Infer documentation and regression tests are collectively labeled infer-all. These bug templates were injected into two host programs, grep and nginx, with a fixed maximum number of variants per pair of bug template and host program. The “No. of Bugs” column indicates the number of buggy program variants created per pair of bug template and host program. Note that there are a number of cases in which fewer variants than the maximum were injected, including some cases in which none were injected at all. These cases indicate instances where the bug template preconditions had very few matches (or no match at all) with dynamic traces of the host program.

Columns (left to right): Bug Template, Host Program, No. of Bugs, CSA Projected Recall, CSA Warnings/KLOC, CSA-S Projected Recall, CSA-S Warnings/KLOC, Infer Projected Recall, Infer Warnings/KLOC.
clang-buffer1 grep 30 93.33% 6.46 93.33% 6.29 0% 1.71
nginx 0 - - - - - -
clang-buffer2 grep 30 63.33% 6.41 76.67% 6.26 0% 1.71
nginx 30 93.33% 2.28 100.00% 2.41 0% 0.27
clang-buffer3 grep 30 80.00% 6.39 100.00% 6.30 93.33% 1.73
nginx 2 100.00% 2.27 100.00% 2.40 100.00% 0.28
clang-buffer4 grep 30 70.00% 6.41 83.33% 6.24 83.33% 1.70
nginx 30 90.00% 2.27 90.00% 2.41 90.00% 0.27
clang-buffer5 grep 30 76.67% 6.33 100.00% 6.24 0% 1.65
nginx 2 100.00% 2.26 100.00% 2.40 0% 0.26
clang-buffer6 grep 30 76.67% 6.61 93.33% 6.43 86.67% 1.69
nginx 30 90.00% 2.28 90.00% 2.41 96.67% 0.27
clang-buffer7 grep 30 100.00% 6.62 100.00% 6.47 23.33% 1.68
nginx 0 - - - - - -
clang-pd1 grep 21 66.67% 6.40 95.24% 6.26 0% 1.72
nginx 0 - - - - - -
clang-pd2 grep 21 28.57% 6.16 76.19% 6.03 0% 1.64
nginx 0 - - - - - -
clang-pd3 grep 30 30.00% 6.21 30.00% 6.05 3.33% 1.63
nginx 30 46.67% 2.27 53.33% 2.40 0% 0.26
clang-pd4 grep 30 70.00% 6.40 86.67% 6.23 50.00% 1.67
nginx 30 93.33% 2.27 93.33% 2.41 53.33% 0.27
clang-all grep 312 69.87% 6.40 84.94% 6.25 32.69% 1.68
nginx 154 83.12% 2.27 85.71% 2.41 48.05% 0.27
infer-buffer1 grep 30 0% 6.20 0% 6.01 93.33% 1.75
nginx 2 0% 2.26 0% 2.40 100.00% 0.28
infer-buffer2 grep 30 0% 6.19 0% 6.00 93.33% 1.74
nginx 2 0% 2.26 0% 2.40 100.00% 0.28
infer-buffer3 grep 30 40.00% 6.17 56.67% 6.02 93.33% 1.71
nginx 30 63.33% 2.27 66.67% 2.40 83.33% 0.27
infer-buffer4 grep 30 76.67% 6.23 0% 5.97 0% 1.64
nginx 0 - - - - - -
infer-buffer5 grep 30 63.33% 6.39 66.67% 6.24 20.00% 1.62
nginx 30 16.67% 2.28 16.67% 2.41 6.67% 0.23
infer-pd1 grep 30 56.67% 6.38 70.00% 6.14 50.00% 1.67
nginx 30 93.33% 2.27 93.33% 2.41 50.00% 0.27
infer-all grep 180 39.44% 6.26 32.22% 6.06 58.33% 1.69
nginx 94 55.32% 2.27 56.38% 2.40 48.94% 0.27
all grep 492 58.74% 6.35 65.65% 6.19 42.07% 1.69
nginx 248 72.58% 2.27 74.60% 2.41 48.39% 0.27
TABLE V: Results of running CSA, CSA-S, and Infer on a subset of Bug-Injector generated benchmarks. The “No. of Bugs” column indicates the number of buggy programs in the benchmark (there is one bug per program) created by injecting the given “Bug Template” into the given “Host Program”. The rows corresponding to “clang-all” and “infer-all” summarize the injection of CSA- and Infer-sourced bug templates, and “all” summarizes all bug templates. The projected recall percentages are provided under the “<Tool> Projected Recall” columns, and the average total warnings per KLOC are provided under the “<Tool> Warnings/KLOC” columns. All the above bug template definitions are available in Appendix A.

Addressing RQ1

The benchmarks we use for these experiments consist of bugs that CSA and Infer care about (that is, they are derived from bugs that appear in the tools’ documentation and regression tests), injected into popular open-source programs (expected targets of the chosen static analysis tools). The projected recall of both tools (columns CSA Projected Recall and Infer Projected Recall) is non-zero for a majority of the bug template/host program pairs (i.e., rows of Table V), and for every row in which bugs were injected, at least one of the tools finds some of them. Therefore, both the injected bug kinds and the host programs selected for these experiments are within the scope of the analysis tools.

If a tool reports a bug on a small example with a simple context, it is preferable that the tool also report a similar bug in a more complex setting. However, in the case of both CSA and Infer—the leading open-source static analysis tools for C/C++—we find that they “lose” bugs (projected recall is below 100%) across a majority of the bug template and host program pairs. That is, CSA and Infer find bugs in their respective documentation examples and regression tests, but in many cases they lose the ability to find the “same” bug when it is injected and integrated into a larger host program. The lost bugs provide concrete feedback about the false negatives of the checkers to the analysis tool developers.

Addressing RQ2

In a majority of the cases, CSA finds the kinds of bugs described in its documentation better than Infer does (i.e., it has higher projected recall on most of the clang-sourced rows, through clang-all). On the infer-all bugs, neither tool seems to dominate the other. Thus, our generated benchmarks can be used for contrasting the evaluated tools. Furthermore, evaluations can be performed to suit specific customer needs, by controlling the distribution of the bug templates and host programs, and viewing the results in conjunction with other metrics, such as time to perform the analysis.

Addressing RQ3

Static analysis tools are typically configurable, and the chosen configuration affects the recall, precision, and scalability of the tool. There is generally no single configuration that is best: it depends on the codebase and use case. To evaluate how well our generated benchmarks discriminate between different configurations of the same tool, we pick two configurations of Clang Static Analyzer: CSA and CSA-S. As shown in Table VI, CSA-S runs much faster than CSA. In a majority of cases, CSA-S also finds more of the injected bugs than CSA does (i.e., has higher projected recall), while issuing a similar number of warnings overall (i.e., the average total warnings per KLOC are very similar). Thus, on this particular benchmark, running CSA with mode=shallow instead of the default setting seems to be a better choice. This is a surprising result that may be of interest to CSA users and developers.

In this paper, we only compare two configuration points of CSA along three metrics. However, CSA (like many other analysis tools) has several configuration parameters. Projected recall from Bug-Injector generated tests can be used to tune the settings of these parameters for a given codebase.

Host CSA CSA-S Infer
grep 41.1 14.7 20.5
nginx 366.1 229.6 338.7
TABLE VI: Time taken (in seconds) to run different tools on the host programs. A four-core Intel(R) Xeon(R) 2.10 GHz machine with 16 GB RAM running 64-bit Ubuntu 14.04 was used for these experiments. Each run was repeated five times, and the averages are reported.

Addressing RQ4

As described in § IV-A, our generated benchmark suite [1] contains injected bugs corresponding to different CWEs, based on bug templates sourced from the Juliet test suite version 1.3. This artifact shows that Bug-Injector can be used to inject a wide variety of bug types.

Causes for lost bugs

Given the large number of injected bugs “lost” by the evaluated tools, we have not extensively examined each lost bug. We manually checked a small number of randomly selected lost bugs to see whether a particular pattern or language construct was causing the tools to lose track of the bugs, but we found no dominant, obvious reason. Without a more specific understanding of the analysis implementations’ internals, we cannot pinpoint a set of reasons why the analysis tools lost certain bugs. A detailed study of the causes of lost bugs by different tools is beyond the scope of this paper.

In some of the bugs lost by CSA, the buggy behavior involved local reasoning, but the buggy code was inside loops with no explicit terminating condition, with loop exit only through conditional goto or break statements. Some of the other lost bugs involved data/control dependence on global variables. None of these are consistent patterns, however; for example, CSA does find bugs involving global variables in many cases.

static void prtext(/* some params */ int *nlinesp) {
  /* elided host program code */
  if (!out_quiet) {
    bp = lastout ? lastout : bufbeg;
    /* from input (./harness.sh test BIN 5) */
    if (!nlinesp) {
      /* POTENTIAL FLAW */
      *nlinesp = 0;
  /* elided host program code */
Fig. 4: Example code snippet lost by CSA, detected by CSA-S.

Figure 4 shows an injected bug that is detected by CSA-S but lost by CSA. The bug is the null pointer dereference on the line after the comment “POTENTIAL FLAW”. It is a real bug, as it can be triggered with the provided input test.

VI-D Comparison with LAVA benchmarks

We run the CSA and Infer tools on the LAVA-1 benchmarks. We use CSA version 7.0 (the latest available at the time of writing) for this experiment, keeping other analysis parameters the same as in § VI-B. The LAVA-1 benchmarks consist of variations of the file program, with each variant having one injected buffer overflow bug. As discussed in § V, we stipulate for the sake of this evaluation that the bug location is the line containing a lava_get() call, and give a tool credit for identifying the bug if it reports a location within 5 lines of it. In all the LAVA test cases we used, we found that the lava_get() call location matched the first location provided in the corresponding backtrace included with the LAVA corpus.

CSA reports warnings on each of the LAVA-1 benchmarks. On some of the programs, CSA does not report on any LAVA-injected bugs; on the remaining programs, CSA issues warnings at the injected bug locations. Upon manual inspection of each of these examples, we determined these warnings to be unrelated to buffer overflows (the reported warnings were one of: “pointer of type void* used in arithmetic”, “nested extern declaration of vasprintf”, “implicit declaration of function vasprintf”, “pointer arithmetic on non-array variables relies on memory layout: which is dangerous”). Thus, CSA reports no relevant warnings on the LAVA-injected bugs. Infer likewise reports warnings on each of the LAVA-1 benchmarks, but none of the Infer warnings are at the LAVA-injected bug locations.

  /* inside a for loop */
  if (ml->map)
    apprentice_unmap(((ml->map))+(lava_get())*((0x12345678
      <= (lava_get()) && 0x123456f8 >= (lava_get())) ||
      (0x12345678 <= __bswap_32((lava_get())) && 0x123456f8
      >= __bswap_32((lava_get())))));
  free(ml);
Fig. 5: Example LAVA-injected bug.

To summarize, both CSA and Infer report warnings on the LAVA-1 benchmarks, but none of these are related to the LAVA-injected bugs. Thus, the projected recall of both of these tools is 0% on the LAVA-1 benchmarks. This result is not particularly surprising, as LAVA is biased towards testing the limits of fuzzing tools, and injects code that looks like the snippet in Figure 5. Such bugs would typically be out of scope for accurate reasoning by most static analysis tools, as the tools have to make static approximations and/or heuristic choices, weighing a fine balance between precision, recall, and scalability. These results—that leading open-source static analysis tools have zero projected recall—indicate that LAVA benchmarks are not well-suited for discriminating between different static analysis tools (cf. RQ2), nor do they include bugs that are in scope for the evaluated static analysis tools (cf. RQ1). Also, LAVA can only inject a very small number of bug kinds (cf. RQ4).

Therefore, while LAVA has been successful in advancing fuzzing techniques [38] and helping create capture-the-flag-style competitions [27], it is less relevant in evaluating static analysis tools.

VII Limitations and future work

Bug-Injector currently chooses an injection point in the host program uniformly at random from all the dynamic trace points that match the bug template’s preconditions. Thus, host program points that are exercised more frequently by the accompanying tests are more likely to be used for injection, as they appear more frequently in the dynamic traces. Bug-Injector can be combined with coverage-increasing input-generation techniques like concolic testing [30] to obtain an improved program-wide distribution of injected bugs. This would, however, sacrifice Bug-Injector’s technique-independence property.

Bug-Injector does not currently support the injection of concurrency-related bugs. We plan to add such support. Our first step will be to improve instrumentation so that concurrency-related information such as the current thread and process is available in the trace.

Bug-Injector cannot always inject a bug template into a host program, because there is not always a dynamic trace point that matches all the preconditions and free-variable requirements for the template. To increase the chances of finding injection points in a host program, we plan to enhance Bug-Injector to allow for variable rebinding to aggregate structs and fields.

We envision running Bug-Injector’s pipeline multiple times in an evolutionarily-guided heuristic search. This process would allow injection of multiple bugs into a single host program, maximizing an objective function that balances factors such as the number of injected bugs, naturalness of code [39], a realistic distribution of bugs [36, 22], retention of the original program behavior, and syntactic/stylistic similarity [18] between the buggy program and the original program. Bug-Injector is built using SEL, which supports evolutionary search with multi-objective fitness functions; leveraging this support, we have built early prototypes that fulfill this vision.
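As an illustration only, a multi-objective fitness for such a search could return a vector of the factors listed above. The sketch below is not SEL’s actual interface, and the metric functions it calls (num-injected-bugs, naturalness-score, behavior-retained-p, stylistic-distance) are hypothetical placeholders:

(defun variant-fitness (variant original)
  "Fitness vector for a candidate buggy VARIANT of ORIGINAL.
All four metric functions are hypothetical placeholders for the factors
discussed in the text."
  (list (num-injected-bugs variant)                      ; more injected bugs is better
        (naturalness-score variant)                      ; injected code should look natural
        (if (behavior-retained-p variant original) 1 0)  ; original behavior preserved
        (- (stylistic-distance variant original))))      ; stay stylistically close to the original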

Regarding our test suite described in § IV, the realism of the bug templates used in the paper should be on a par with the sources from which they were automatically extracted, the Juliet suite and the Infer and CSA documentation and test suites. These sources represent the state of the art in evaluation of static analysis tools. The main remaining challenge is providing a stronger guarantee of realism through a comparison with real-world software. As we discuss in § II, there is a large body of research on bug patterns and characteristics in real-world code. In future work, we plan to apply the patterns and metrics from this research to assess our test suite and to tune it towards greater realism. We note that the evolutionary-search enhancement of Bug-Injector described above can use quantitative information about bug patterns and/or distribution as part of its fitness function. This would allow it to generate a more realistic test suite.

Regarding our experimental methodology, the main threat to validity relates to how we measure whether a tool finds a specific injected bug, both for Bug-Injector and LAVA test cases. As we explain in § V and § VI, we use a simple heuristic: we match the location reported in the tool’s warning against the location of the known bug, and count the bug as identified if the bug types match and the locations are within a certain maximum distance. We could refine this heuristic by using more sophisticated matching techniques from related work on deduplicating and/or clustering tool warning reports [23, 31].
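A minimal sketch of this matching heuristic is shown below; the property-list representation and the distance threshold are assumptions for illustration, not the exact values used in our experiments:

(defparameter +max-line-distance+ 2
  "Assumed threshold for illustration; the experiments use a fixed maximum distance.")

(defun warning-matches-bug-p (warning bug)
  "Count WARNING as finding BUG when both refer to the same file, report
the same bug kind, and are within +MAX-LINE-DISTANCE+ lines of each other.
WARNING and BUG are property lists with :FILE, :KIND, and :LINE entries."
  (and (string= (getf warning :file) (getf bug :file))
       (eq (getf warning :kind) (getf bug :kind))
       (<= (abs (- (getf warning :line) (getf bug :line)))
           +max-line-distance+)))

For instance, under this heuristic a CWE-122 warning reported two lines below the injected flaw in the same file would count as a match.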

VIII Conclusion

In this paper, we introduce Bug-Injector, a system that automatically generates bug-containing benchmarks suitable for evaluating and testing software analysis tools. Our experimental evaluation shows that Bug-Injector benchmarks are useful for several purposes: (a) they can showcase bugs that are seemingly in scope for a tool to find but that the tool misses, (b) they can discriminate between and guide the improvement of static analysis tools, and (c) they can help evaluate analysis parameters for a specific codebase.

IX Acknowledgments

This material is based on research sponsored by the Defense Advanced Research Projects Agency (DARPA) under Contract No. D17 PC00096 and the Department of Homeland Security (DHS) Science and Technology Directorate, Cyber Security Division (DHS S&T/CSD) via contract number HHSP233201600062 C. The views, opinions, findings, and conclusions or recommendations contained herein are those of the authors and should not be interpreted as necessarily representing the official views, policies, or endorsements, either expressed or implied, of the Defense Advanced Research Projects Agency (DARPA); or its Contracting Agent, the U.S. Department of the Interior, Interior Business Center, Acquisition Services Directorate, Division III; or the Department of Homeland Security. We would like to thank Jeff Foster, Mikael Lindvall, Paul Black, and Daniel Krupp for their suggestions on earlier drafts of this paper, and the SAMATE group at NIST, Fraunhofer CESE, and John Regehr for their feedback on our work.

References

  • [1] Bug injector test suite. https://tinyurl.com/bug-injector-benchmarks.
  • [2] Clang static analyzer. https://clang-analyzer.llvm.org/. Accessed 2018-05-21.
  • [3] Clang static analyzer: Available checkers. http://clang-analyzer.llvm.org/available_checks.html/. Accessed 2018-05-28.
  • [4] Clang static analyzer: Regression tests. http://github.com/llvm-mirror/clang/tree/master/test/Analysis/. Accessed 2018-05-28.
  • [5] Common Weakness Enumeration - a community-developed list of software weakness types. https://cwe.mitre.org/. Accessed 2018-04-24.
  • [6] Gnu grep. http://www.gnu.org/savannah-checkouts/gnu/grep/. Accessed 2018-05-28.
  • [7] Infer. http://fbinfer.com/. Accessed 2018-05-21.
  • [8] Infer Docker file. https://raw.githubusercontent.com/facebook/infer/master/docker/Dockerfile. Accessed 2018-05-31.
  • [9] Infer: Regression tests. http://github.com/facebook/infer/tree/master/infer/tests. Accessed 2018-05-28.
  • [10] Inferbo: Infer-based buffer overrun analyzer. http://research.fb.com/inferbo-infer-based-buffer-overrun-analyzer/. Accessed 2018-05-28.
  • [11] LAVA synthetic bug corpora. http://moyix.blogspot.com/2016/10/the-lava-synthetic-bug-corpora.html. Accessed 2018-05-31.
  • [12] nginx. http://www.nginx.com/. Accessed 2018-05-28.
  • [13] OWASP WebGoat Project. https://www.owasp.org/index.php/Category:OWASP_WebGoat_Project. Accessed 2018-05-01.
  • [14] Stanford SecuriBench. https://suif.stanford.edu/~livshits/securibench/. Accessed 2018-05-01.
  • [15] Static Analysis for C++ with Phasar. http://phasar.org/wp-content/uploads/2018/06/phasar_block_2-3.pdf. Slide 37, Accessed 2018-06-30.
  • [16] The Heartbleed Bug. http://heartbleed.com/. Accessed 2018-07-11.
  • [17] Paul E. Black. Juliet 1.3 Test Suite: Changes From 1.2. National Institute of Standards and Technology (NIST) Technical Note (TN) 1995, June 2018.
  • [18] Aylin Caliskan-Islam, Richard Harang, Andrew Liu, Arvind Narayanan, Clare Voss, Fabian Yamaguchi, and Rachel Greenstadt. De-anonymizing programmers via code stylometry. In 24th USENIX Security Symposium (USENIX Security 15), pages 255–270, 2015.
  • [19] Aurelien Delaitre, Bertrand Stivalet, Elizabeth Fong, and Vadim Okun. Evaluating bug finders. In First International Workshop on Complex faUlts and Failures in LargE Software Systems (COUFLESS), 2015.
  • [20] R. A. DeMillo, R. J. Lipton, and F. G. Sayward. Hints on test data selection: Help for the practicing programmer. Computer, April 1978.
  • [21] Brendan Dolan-Gavitt, Patrick Hulin, Engin Kirda, Tim Leek, Andrea Mambretti, Wil Robertson, Frederick Ulrich, and Ryan Whelan. LAVA: Large-scale automated vulnerability addition. In Security and Privacy (SP), 2016 IEEE Symposium on, pages 110–121. IEEE, 2016.
  • [22] Norman E. Fenton and Niclas Ohlsson. Quantitative analysis of faults and failures in a complex software system. IEEE Trans. Softw. Eng., 26(8):797–814, 2000.
  • [23] Z. P. Fry and W. Weimer. Clustering static analysis defect reports to reduce maintenance costs. In 2013 20th Working Conference on Reverse Engineering (WCRE), pages 282–291, Oct 2013.
  • [24] Andrew Habib and Michael Pradel. How many of all bugs do we find? a study of static bug detectors. In ACM/IEEE International Conference on Automated Software Engineering, ASE. ACM, 2018.
  • [25] R. G. Hamlet. Testing programs with the aid of a compiler. IEEE Transactions on Software Engineering, SE-3(4):279–290, July 1977.
  • [26] Jorg Herter, Daniel Kastner, Christoph Mallon, and Reinhard Wilhelm. Benchmarking static code analyzers. In Computer Safety, Reliability, and Security, pages 197–212, Cham, 2017. Springer International Publishing.
  • [27] Patrick Hulin, Andy Davis, Rahul Sridhar, Andrew Fasano, Cody Gallagher, Aaron Sedlacek, Tim Leek, and Brendan Dolan-Gavitt. Autoctf: creating diverse pwnables via automated bug injection. In USENIX Workshop on Offensive Technologies (WOOT). USENIX Association, 2017.
  • [28] René Just, Darioush Jalali, and Michael D Ernst. Defects4j: A database of existing faults to enable controlled testing studies for java programs. In International Symposium on Software Testing and Analysis. ACM, 2014.
  • [29] Shan Lu, Zhenmin Li, Feng Qin, Lin Tan, Pin Zhou, and Yuanyuan Zhou. BugBench: Benchmarks for evaluating bug detection tools. In Workshop on the evaluation of software defect detection tools, volume 5, 2005.
  • [30] Rupak Majumdar and Koushik Sen. Hybrid concolic testing. In ICSE, 2007.
  • [31] T. Muske and A. Serebrenik. Survey of approaches for handling static analysis alarms. In 2016 IEEE 16th International Working Conference on Source Code Analysis and Manipulation (SCAM), pages 157–166, Oct 2016.
  • [32] Tim Newsham and Brian Chess. ABM: A prototype for benchmarking source code analyzers. In Workshop on Software Security Assurance Tools, Techniques, and Metrics, US National Institute of Standards and Technology (NIST) Special Publication (SP) 500-265, 2006.
  • [33] Gary Nilson, Kent Wills, Jeffrey Stuckman, and James Purtilo. BugBox: A vulnerability corpus for PHP web applications. In Presented as part of the 6th Workshop on Cyber Security Experimentation and Test, Washington, D.C., 2013. USENIX.
  • [34] NIST. SARD test cases. https://samate.nist.gov/SRD/testsuite.php. Accessed 2018-05-27.
  • [35] NIST. SATE: Static Analysis Tool Exposition. https://samate.nist.gov/SATE.html.
  • [36] Thomas J. Ostrand and Elaine J. Weyuker. The distribution of faults in a large industrial software system. In International Symposium on Software Testing and Analysis, pages 55–64, 2002.
  • [37] Jannik Pewny and Thorsten Holz. EvilCoder: automated bug insertion. In Proceedings of the 32nd Annual Conference on Computer Security Applications, pages 214–225. ACM, 2016.
  • [38] Sanjay Rawat, Vivek Jain, Ashish Kumar, Lucian Cojocar, Cristiano Giuffrida, and Herbert Bos. Vuzzer: Application-aware evolutionary fuzzing. In Proceedings of the Network and Distributed System Security Symposium (NDSS), 2017.
  • [39] Baishakhi Ray, Vincent Hellendoorn, Saheel Godhane, Zhaopeng Tu, Alberto Bacchelli, and Premkumar Devanbu. On the "naturalness" of buggy code. In Proceedings of the 38th International Conference on Software Engineering, ICSE ’16, pages 428–439, New York, NY, USA, 2016. ACM.
  • [40] Subhajit Roy, Awanish Pandey, Brendan Dolan-Gavitt, and Yu Hu. Bug synthesis: Challenging bug-finding tools with deep faults. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2018, pages 224–234, New York, NY, USA, 2018. ACM.
  • [41] Eric Schulte and Contributors. Software Evolution Library. GrammaTech, eschulte@grammatech.com, January 2018. https://grammatech.github.io/sel/.
  • [42] Shinichi Shiraishi, Veena Mohan, and Hemalatha Marimuthu. Test suites for benchmarks of static analysis tools. In Software Reliability Engineering Workshops (ISSREW), 2015 IEEE International Symposium on, pages 12–15. IEEE, 2015.
  • [43] Christopher Timperley, Susan Stepney, and Claire Le Goues. Poster: BugZoo – A Platform for Studying Software Bugs. In International Conference on Software Engineering, ICSE, 2018.
  • [44] William Vanderlinde. Securely taking on new executable software of uncertain provenance (STONESOUP). http://www.iarpa.gov/index.php/research-programs/stonesoup.
  • [45] John Wilander and Mariam Kamkar. A comparison of publicly available tools for static intrusion prevention. In Nordic Workshop on Secure IT Systems (NordSec), pages 68–84, Karlstad, Sweden, November 2002.
  • [46] John Wilander and Mariam Kamkar. A comparison of publicly available tools for dynamic buffer overflow prevention. In Symposium on Network and Distributed System Security (NDSS), pages 149–162, San Diego, CA, February 2003. The Internet Society.
  • [47] Misha Zitser, Richard Lippmann, and Tim Leek. Testing static analysis tools using exploitable buffer overflows from open source code. In ACM SIGSOFT Software Engineering Notes, volume 29, pages 97–106. ACM, 2004.

Appendix A Bug Template Definitions

In this section, we provide the definitions for all the bug templates listed in Table V, in the same order that they appear in the table.

Bug Template Clang-Buffer1

Definition:

(define-scion clang-buffer1
    (make-instance 'clang-scion
                   :name clang-buffer1
                   :patches (list clang-buffer1-patch)))
Patches:
  • Definition for patch CLANG-BUFFER1-PATCH.

    (defparameter clang-buffer1-patch
      (make-instance 'clang-dynamic-patch
        :precondition nil
        :cwe 122
        :cwe-line 3
        :free-variables (("s" "*char" :-const) ("c" "char" :-const))
        :code "s = \"\";
/* POTENTIAL FLAW */
c = s[1];
c++;"))

Bug Template Clang-Buffer2

Definition:

(define-scion clang-buffer2
    (make-instance 'clang-scion
                   :name clang-buffer2
                   :patches (list clang-buffer2-patch)))
Patches:
  • Definition for patch CLANG-BUFFER2-PATCH.

    (defparameter clang-buffer2-patch
      (make-instance 'clang-dynamic-patch
        :cwe 121
        :cwe-line 3
        :precondition (lambda (obj location)
                        (vars-declarable-p obj
                                           (ast-at-index obj location)
                                           (list "s" "c")))
        :free-variables ()
        :code "char *s = \"\";
/* POTENTIAL FLAW */
char c = s[1];"))

Bug Template Clang-Buffer3

Definition:

(define-scion clang-buffer3
    (make-instance 'clang-scion
                   :name clang-buffer3
                   :patches (list clang-buffer3-patch)))
Patches:
  • Definition for patch CLANG-BUFFER3-PATCH.

    (defparameter clang-buffer3-patch
      (make-instance 'clang-dynamic-patch
        :cwe 121
        :cwe-line 5
        :precondition (lambda (obj location)
                        (vars-declarable-p obj
                                           (ast-at-index obj location)
                                           (list "buf" "p")))
        :free-variables ()
        :code "int buf[100];
int *p = buf;
p = p + 99;
/* POTENTIAL FLAW */
p[1] = 1;"))

Bug Template Clang-Buffer4

Definition:

(define-scion clang-buffer4
    (make-instance 'clang-scion
                   :name clang-buffer4
                   :patches (list clang-buffer4-patch)))
Patches:
  • Definition for patch CLANG-BUFFER4-PATCH.

    (defparameter clang-buffer4-patch
      (make-instance 'clang-dynamic-patch
        :cwe 121
        :cwe-line 5
        :precondition (lambda (obj location p)
                        (declare (ignorable p))
                        (var-declarable-p obj (ast-at-index obj location) "buf"))
        :free-variables (("p" "*int" :-const))
        :code "int buf[100];
p = buf;
p = p + 99;
/* POTENTIAL FLAW */
p[1] = 1;"))

Bug Template Clang-Buffer5

Definition:

(define-scion clang-buffer5
    (make-instance 'clang-scion
                   :name clang-buffer5
                   :patches (list clang-buffer5-patch)))
Patches:
  • Definition for patch CLANG-BUFFER5-PATCH.

    (defparameter clang-buffer5-patch
      (make-instance 'clang-dynamic-patch
        :cwe 121
        :cwe-line 3
        :precondition (lambda (obj location)
                        (var-declarable-p obj
                                          (ast-at-index obj location)
                                          "buf"))
        :free-variables ()
        :code "int buf[100][100];
/* POTENTIAL FLAW */
buf[0][-1] = 1;"))

Bug Template Clang-Buffer6

Definition:

(define-scion clang-buffer6
    (make-instance 'clang-scion
                   :name clang-buffer6
                   :patches (list clang-buffer6-patch)))
Patches:
  • Definition for patch CLANG-BUFFER6-PATCH.

    (defparameter clang-buffer6-patch
      (make-instance 'clang-dynamic-patch
        :cwe 121
        :cwe-line 4
        :precondition (lambda (obj location p)
                        (declare (ignorable p))
                        (var-declarable-p obj (ast-at-index obj location) "buf"))
        :free-variables (("p" "*int" :-const))
        :code "int buf[100][100];
p = &buf[0][-1];
/* POTENTIAL FLAW */
p[0] = 1;"))

Bug Template Clang-Buffer7

Definition:

(define-scion clang-buffer7
    (make-instance 'clang-scion
                   :name clang-buffer7
                   :patches (list clang-buffer7-patch)))
Patches:
  • Definition for patch CLANG-BUFFER7-PATCH.

    (defparameter clang-buffer7-patch
      (make-instance 'clang-dynamic-patch
        :cwe 122
        :cwe-line 2
        :precondition (lambda (obj location n p)
                        (declare (ignorable obj location p))
                        (> (* (v/value n) 4) +max-int-32+))
        :free-variables (("n" "int") ("p" "*void" :-const))
        :code "/* POTENTIAL FLAW */
p = malloc(n * sizeof(int));"))

Bug Template Clang-Pd1

Definition:

(define-scion clang-pd1
    (make-instance 'clang-scion
                   :name clang-pd1
                   :patches (list clang-pd1-patch)))
Patches:
  • Definition for patch CLANG-PD1-PATCH.

    (defparameter clang-pd1-patch
      (make-instance 'clang-dynamic-patch
        :cwe 476
        :cwe-line 5
        :precondition (lambda (obj location p)
                        (let ((ast (ast-at-index obj location)))
                          (and (= (v/value p) 0)
                               (ast-void-ret (function-containing-ast obj ast))
                               (var-declarable-p obj ast "x"))))
        :free-variables (("p" "*int"))
        :code "if (p) {
    return;
}
/* POTENTIAL FLAW */
int x = p[0];"))

Bug Template Clang-Pd2

Definition:

(define-scion clang-pd2
    (make-instance 'clang-scion
                   :name clang-pd2
                   :patches (list clang-pd2-patch)))
Patches:
  • Definition for patch CLANG-PD2-PATCH.

    (defparameter clang-pd2-patch
      (make-instance 'clang-dynamic-patch
        :cwe 476
        :cwe-line 3
        :precondition (lambda (obj location p)
                        (declare (ignorable obj location))
                        (= (v/value p) 0))
        :free-variables (("p" "*int" :-const))
        :code "if (!p) {
    /* POTENTIAL FLAW */
    *p = 0;
}"))

Bug Template Clang-Pd3

Definition:

(define-scion clang-pd3
    (make-instance 'clang-scion
                   :name clang-pd3
                   :patches (list clang-pd3-patch)))
Patches:
  • Definition for patch CLANG-PD3-PATCH.

    (defparameter clang-pd3-patch
      (make-instance 'clang-dynamic-patch
        :cwe 476
        :cwe-line 2
        :precondition nil
        :free-variables ()
        :includes '("<string.h>")
        :code "/* POTENTIAL FLAW */
strlen(0);"))

Bug Template Clang-Pd4

Definition:

(define-scion clang-pd4
    (make-instance 'clang-scion
                   :name clang-pd4
                   :patches (list clang-pd4-patch)))
Patches:
  • Definition for patch CLANG-PD4-PATCH.

    (defparameter clang-pd4-patch
      (make-instance 'clang-dynamic-patch
        :cwe 476
        :cwe-line 2
        :precondition (lambda (obj location x)
                        (declare (ignorable obj location))
                        (= (v/value x) 0))
        :free-variables (("x" "*char"))
        :includes '("<string.h>")
        :code "/* POTENTIAL FLAW */
strlen(x);"))

Bug Template Infer-Buffer1

Definition:

(define-scion infer-buffer1
    (make-instance 'clang-scion
                   :name infer-buffer1
                   :patches (list infer-buffer1-patch1
                                  infer-buffer1-patch2)))
Patches:
  • Definition for patch INFER-BUFFER1-PATCH1.

    (defparameter infer-buffer1-patch1
      (make-instance 'clang-static-patch
        :precondition (lambda (obj location)
                        (declare (ignorable obj))
                        (= location 0))
        :free-variables nil
        :code-top-level-p t
        :code "void set_i(int *arr, int index) {
  arr[index] = 0;
}"))
  • Definition for patch INFER-BUFFER1-PATCH2.

    (defparameter infer-buffer1-patch2
      (make-instance 'clang-dynamic-patch
        :cwe 122
        :cwe-line 7
        :precondition nil
        :free-variables (("arr" "*int" :-const))
        :dependencies (list infer-buffer1-patch1)
        :includes (list "<stdlib.h>")
        :code "arr = (int *)malloc(9*sizeof(int));
if (arr != NULL) {
    int i;
    for (i = 0; i < 9; i+=1) {
        set_i(arr, i);
        /* POTENTIAL FLAW */
        set_i(arr, i + 1);
    }
}"))

Bug Template Infer-Buffer2

Definition:

(define-scion infer-buffer2
    (make-instance 'clang-scion
                   :name infer-buffer2
                   :patches (list infer-buffer2-patch)))
Patches:
  • Definition for patch INFER-BUFFER2-PATCH.

    (defparameter infer-buffer2-patch
      (make-instance 'clang-dynamic-patch
        :cwe 122
        :cwe-line 7
        :precondition nil
        :free-variables (("arr" "*int" :-const))
        :includes '("<stdlib.h>")
        :code "arr = (int *)malloc(9*sizeof(int));
if (arr != NULL) {
    int i;
    for (i = 0; i < 9; i++) {
        arr[i] = 0;
        /* POTENTIAL FLAW */
        arr[i+1] = 0;
    }
}"))

Bug Template Infer-Buffer3

Definition:

(define-scion infer-buffer3
    (make-instance 'clang-scion
                   :name infer-buffer3
                   :patches (list infer-buffer3-patch)))
Patches:
  • Definition for patch INFER-BUFFER3-PATCH.

    (defparameter infer-buffer3-patch
      (make-instance 'clang-dynamic-patch
        :cwe 121
        :cwe-line 4
        :precondition (lambda (obj location global)
                        (and (< (v/value global) 10)
                             (var-declarable-p obj
                                               (ast-at-index obj location)
                                               "arr")))
        :free-variables (("global" "int"))
        :code "char arr[10];
if (global < 10){
    /* POTENTIAL FLAW */
    arr[10] = 1;
}"))

Bug Template Infer-Buffer4

Definition:

(define-scion infer-buffer4
    (make-instance 'clang-scion
                   :name infer-buffer4
                   :patches (list infer-buffer4-patch1
                                  infer-buffer4-patch2)))
Patches:
  • Definition for patch INFER-BUFFER4-PATCH1.

    (defparameter infer-buffer4-patch1
      (make-instance 'clang-static-patch
        :precondition (lambda (obj location)
                        (declare (ignorable obj))
                        (= location 0))
        :free-variables ()
        :code-top-level-p t
        :code "void two_accesses(int* arr) {
    if (arr[1] < 0) {
        arr[0] = 0;
    }
}"))
  • Definition for patch INFER-BUFFER4-PATCH2.

    (defparameter infer-buffer4-patch2
      (make-instance 'clang-dynamic-patch
        :cwe 122
        :cwe-line 2
        :precondition (lambda (obj location arr)
                        (declare (ignorable obj location))
                        (and (v/size arr)
                             (<= (v/size arr) 4)))
        :free-variables (("arr" "*int" :-const))
        :dependencies (list infer-buffer4-patch1)
        :code "/* POTENTIAL FLAW */
two_accesses(arr);"))

Bug Template Infer-Buffer5

Definition:

(define-scion infer-buffer5
    (make-instance 'clang-scion
                   :name infer-buffer5
                   :patches (list infer-buffer5-patch)))
Patches:
  • Definition for patch INFER-BUFFER5-PATCH.

    (defparameter infer-buffer5-patch
      (make-instance 'clang-dynamic-patch
        :cwe 121
        :cwe-line 4
        :precondition (lambda (obj location n)
                        (and (= (v/value n) 0)
                             (var-declarable-p obj
                                               (ast-at-index obj location)
                                               "arr")))
        :free-variables (("n" "int"))
        :code "int arr[1];
arr[n] = 0;
/* POTENTIAL FLAW */
arr[n - 2] = 0;"))

Bug Template Infer-Pd1

Definition:

(define-scion infer-pd1
    (make-instance 'clang-scion
                   :name infer-pd1
                   :patches (list infer-pd1-patch1
                                  infer-pd1-patch2)))
Patches:
  • Definition for patch INFER-PD1-PATCH1.

    (defparameter infer-pd1-patch1
      (make-instance 'clang-static-patch
        :precondition (lambda (obj location)
                        (declare (ignorable obj))
                        (= location 0))
        :code-top-level-p t
        :code "void set_ptr(int* ptr, int val) {
    *ptr = val;
}
int set_ptr_param_array(int buf[]) {
    set_ptr(buf, 1);
    return buf[0];
}"))
  • Definition for patch INFER-PD1-PATCH2.

    (defparameter infer-pd1-patch2
      (make-instance 'clang-dynamic-patch
        :cwe 476
        :cwe-line 2
        :dependencies (list infer-pd1-patch1)
        :code "/* POTENTIAL FLAW */
set_ptr_param_array(0);"))