FairFuzz: Targeting Rare Branches to Rapidly Increase Greybox Fuzz Testing Coverage

09/20/2017 · Caroline Lemieux et al. · University of California, Berkeley

In recent years, fuzz testing has proven itself to be one of the most effective techniques for finding correctness bugs and security vulnerabilities in practice. One particular fuzz testing tool, American Fuzzy Lop or AFL, has become popular thanks to its ease-of-use and bug-finding power. However, AFL remains limited in the depth of program coverage it achieves, in particular because it does not consider which parts of program inputs should not be mutated in order to maintain deep program coverage. We propose an approach, FairFuzz, that helps alleviate this limitation in two key steps. First, FairFuzz automatically prioritizes inputs exercising rare parts of the program under test. Second, it automatically adjusts the mutation of inputs so that the mutated inputs are more likely to exercise these same rare parts of the program. We conduct evaluation on real-world programs against state-of-the-art versions of AFL, thoroughly repeating experiments to get good measures of variability. We find that on certain benchmarks FairFuzz shows significant coverage increases after 24 hours compared to state-of-the-art versions of AFL, while on others it achieves high program coverage at a significantly faster rate.

1. Introduction

Fuzz testing has emerged as one of the most effective testing techniques for finding correctness bugs and security vulnerabilities in real-world software systems. It has been used successfully by major software companies such as Microsoft (Godefroid et al., 2008) and Google (Evans et al., 2011; Arya and Neckar, 2012; Moroz and Serebryany, 2016) for security testing and quality assurance. The term fuzz testing is generally used to designate techniques which test programs by generating random input data and executing the program under such inputs. The goal of fuzz testing is to exercise as many program paths as possible with the hope of catching bugs that surface as assertion violations or program crashes. While many individual test inputs generated by fuzz testing may be garbage, due to its low computational overhead, fuzz testing can generate test inputs much faster than more sophisticated methods such as dynamic symbolic execution (Godefroid et al., 2005; Sen et al., 2005; Cadar et al., 2008; Sen and Agha, 2006). In practice, this trade-off has paid off, and fuzz testers have found numerous correctness bugs and security vulnerabilities in widely used software (Stephens et al., 2016; Böhme et al., 2016; Zalewski, 2014; Hocevar, 2007; Holler et al., 2012; Householder and Foote, 2012; Pacheco and Ernst, 2007; Yang et al., 2011; Fraser and Arcuri, 2011).

The success of one particular fuzz tester, American Fuzzy Lop (or simply AFL) (Zalewski, 2014), has gained attention both in practice and in the research community (Stephens et al., 2016; Li et al., 2017; Böhme et al., 2016; Böhme et al., 2017). This fuzz tester alone has found vulnerabilities in a broad array of programs, including Web browsers (e.g. Firefox, Internet Explorer), network tools (e.g., tcpdump, wireshark), image processors (e.g., ImageMagick, libtiff), various system libraries (e.g., OpenSSH, PCRE), C compilers (e.g., GCC, LLVM), and interpreters (for Perl, PHP, JavaScript). A large part of its popularity can be attributed to its easy setup. To use it, one needs only to compile it from source, compile the program of interest with AFL’s instrumented version of gcc or clang, collect a sample input, and run AFL.

Matching the simplicity of its setup is the simplicity of its process. At its base, all AFL does is mutate the user-provided inputs with byte-level operations. It runs the program under test on the mutated inputs and collects some low-overhead program coverage information from these runs. Then it saves the interesting mutated inputs—the ones that discover new program coverage. Finally, it continually repeats the process, starting with these interesting mutated inputs instead of the user-provided ones.

While playing with AFL and its extensions, we observed—as others have in the past (Stephens et al., 2016; Böhme et al., 2016; Li et al., 2017; laf, 2016a) — one important limitation of AFL: AFL often fails to explore programs very deeply, i.e. into sections guarded by specific constraints on the input. This is partly because AFL does not take into consideration which parts of the input may be necessary for deeper program exploration. For example, consider a program processing an image format. This format may have a particular header indicating its validity, and perhaps some special byte sequences which trigger different parts of the image processor. In order to cover deeper program functionality, an input must have this specific header and some of these special byte sequences. But even if AFL finds inputs with this header or these sequences, it has no knowledge that the parts of the inputs containing them are important for deep exploration of the program. Therefore, it is just as likely to mutate the header and special sequences as it is to mutate the data portion of the image. A bug which only manifests when the header and sequences are in place may take a much longer time to be generated than it would if these important parts of the input were kept fixed.

We propose an approach, called FairFuzz, that tries to alleviate this limitation. FairFuzz functions in two main steps. First, it automatically prioritizes inputs exercising rare parts of the program under test. Second, it automatically adjusts byte-level mutation techniques to increase the probability that mutated inputs will exercise these same rare parts of the program while still exploring different paths through the program. While we propose this mutation modification strategy with the target of exercising rare parts of the program, it could be adapted to target other interesting parts of the program, e.g., a recently patched function.

We created an open-source implementation of FairFuzz on top of AFL. We evaluated FairFuzz against three popular versions of AFL: AFL, FidgetyAFL (Zalewski, 2016a) (a modification of AFL to better match the behavior of AFLFast (Böhme et al., 2016)), and AFLFast.new (Böhme, 2016) (an enhancement of AFLFast to match the behavior of FidgetyAFL).

We evaluated our approach on four general fronts. First, whether it would result in faster program coverage. Second, whether it would result in more extensive coverage after a standard time budget (24 hours). Third, whether the modification of AFL’s mutation technique results in a higher proportion of inputs exercising the targeted parts of the program. Finally, whether it would enhance AFL’s crash-finding ability. We conduct our evaluation on nine real-world benchmarks, including those used to evaluate AFLFast. We repeat our experiments and provide measures of variability when comparing techniques. We find that on certain benchmarks FairFuzz leads to significant coverage increases after 24 hours compared to AFL, FidgetyAFL, and AFLFast.new, and that on other benchmarks FairFuzz achieves wide program coverage at a significantly faster rate. We also observed that FairFuzz tends to perform better on programs with many nested conditionals.

In summary, we make the following contributions:

  • We propose an approach to increase the coverage obtained by greybox fuzzers by targeting rare branches, including a more general method to target mutations to a program characteristic of interest.

  • We develop an open-source implementation of this approach built on top of AFL, named FairFuzz (https://github.com/carolemieux/afl-rb).

  • We perform an evaluation of FairFuzz against different state-of-the-art versions of AFL on real-world benchmarks, with numerical information presented with a measure of variability (i.e. confidence intervals).

  • When comparing state-of-the-art versions of AFL, we repeat our experiments 20 times to obtain statistically significant results and to measure variability in the outcomes. We show that repetition and variability measures are necessary to draw correct conclusions while comparing AFL-based techniques due to the highly non-deterministic nature of AFL. To our knowledge, we are the first work on AFL-based techniques to report a measure of variability in the coverage the techniques achieve through time.

We begin in Section 2 with a more detailed overview of AFL, its current limitations, and how to overcome these with our method. We then detail the general method and its implementation on top of AFL in Section 3, and present performance results in Section 4.

2. Overview

Our proposed technique, FairFuzz, is built on American Fuzzy Lop (AFL) (Zalewski, 2014). AFL is a popular greybox mutation-based fuzz tester. (Greybox fuzz testers (Zalewski, 2014; lib, 2016) are designated as such since, unlike whitebox fuzz testers (Godefroid et al., 2005; Sen and Agha, 2006; Cadar et al., 2008; Godefroid et al., 2008), they do not do any source code analysis, but, unlike pure blackbox fuzz testers (Hocevar, 2007), they use limited feedback from the program under test to guide their fuzzing strategy.) Next we give a brief description of AFL, illustrate one of its limitations, and motivate the need for FairFuzz.

To fuzz test programs, AFL generates random inputs. However, instead of generating these inputs blindly, from scratch, it selects a set of previously generated inputs and mutates them to derive new random inputs. A key innovation behind AFL is its use of coverage information collected during the execution of the program on its previously-generated inputs. Specifically, AFL uses this information to select inputs for mutation, selecting only those that have achieved new program coverage. In order to collect this coverage information efficiently, AFL inserts instrumentation into the program under test. To track coverage, it first associates each basic block with a random number via instrumentation. The random number is treated as the unique ID of the basic block. The basic block IDs are then used to generate unique IDs for the transitions between pairs of basic blocks. In particular, for a transition from basic block A to basic block B, AFL uses the IDs of each basic block—ID_A and ID_B, respectively—to define the ID of the transition A->B as follows:

ID_{A->B} = (ID_A >> 1) xor ID_B

Right-shifting (>> 1) the basic block ID of the transition start block (ID_A) ensures that the transition from A to B has a different ID from the transition from B to A. We associate the notion of basic block transition with that of a branch in the program’s control flow graph, and throughout the paper we will use the term branch to refer to this AFL-defined basic block transition unless stated otherwise. Note that since random numbers are used as unique IDs for basic blocks, there is a small but non-zero probability of having the same ID for two different branches. However, AFL’s creator argues (Zalewski, 2017) that for many programs the actual branch ID collision rate is small.
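To make the scheme concrete, the following is a minimal C sketch of how per-transition coverage could be recorded using the ID computation just described; the map size and identifier names are illustrative assumptions, not AFL's actual code.

/* Minimal sketch (not AFL's actual code): recording a basic-block
   transition ("branch") using the ID scheme described above. */
#include <stdint.h>

#define MAP_SIZE (1 << 16)              /* illustrative coverage-map size */

static uint8_t  hit_counts[MAP_SIZE];   /* raw hit count per transition   */
static uint32_t prev_block;             /* already right-shifted ID of the
                                           previously executed block      */

/* Conceptually called at the start of every instrumented basic block. */
void on_basic_block(uint32_t cur_block_id) {
    /* transition ID = (ID_A >> 1) xor ID_B, reduced to the map size */
    uint32_t branch_id = (prev_block ^ cur_block_id) % MAP_SIZE;
    hit_counts[branch_id]++;            /* raw count; bucketized later    */
    prev_block = cur_block_id >> 1;     /* shift so A->B differs from B->A */
}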

The coverage of the program under test on a given input is collected as a set of pairs of the form (branch ID, hit count). If a (branch ID, hit count) pair is present in the coverage set, it denotes that during the execution of the program on the input, the branch with ID branch ID was exercised hit count number of times. The hit count is bucketized to small powers of two. AFL refers to this set of pairs as the path of an input. AFL says that an input achieves new coverage if it hits a new branch (one not exercised in any previous execution) or achieves a new hit count for an already-exercised branch—i.e. if it discovers a new (branch ID, hit count) pair.
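As an illustration of the bucketization, the sketch below maps a raw hit count to a power-of-two bucket; the boundaries follow AFL's count classes as commonly documented, but should be treated as illustrative rather than authoritative.

/* Sketch of hit-count bucketization into small powers of two. */
#include <stdint.h>

static uint8_t bucketize(uint8_t hits) {
    if (hits == 0)   return 0;
    if (hits == 1)   return 1;
    if (hits == 2)   return 2;
    if (hits == 3)   return 4;
    if (hits <= 7)   return 8;
    if (hits <= 15)  return 16;
    if (hits <= 31)  return 32;
    if (hits <= 127) return 64;
    return 128;
}

A coverage entry is then the pair (branch ID, bucketize(raw hit count)), and an input achieves new coverage exactly when one of its pairs has never been seen before.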

1:procedure FuzzTest(program, seed inputs)
2:     queue ← seed inputs
3:     while true do                              ▹ begin a queue cycle
4:         for parent in queue do
5:              if not isWorthFuzzing(parent) then
6:                  continue
7:              end if
8:              for i in 1 to length(parent) do
9:                  mutated ← mutateDeterministic(parent, i)
10:                 runAndMaybeSave(mutated)
11:             end for
12:             score ← performanceScore(parent)
13:             for i in 1 to score do
14:                 mutated ← mutateHavoc(parent)
15:                 runAndMaybeSave(mutated)
16:             end for
17:         end for
18:     end while
19:end procedure
20:procedure runAndMaybeSave(mutated)
21:     coverage ← run(program, mutated)
22:     if newCoverage(coverage) then
23:         addToQueue(mutated)
24:     end if
25:end procedure
Algorithm 1 AFL Fuzzing

The overall AFL fuzzing algorithm is given in Algorithm 1. The fuzzing routine takes as input a program and a set of user-provided seed inputs. The seed inputs are used to initialize a queue (Line 2) of inputs. The queue contains the inputs which AFL will mutate in order to generate new inputs. AFL goes through this queue (Line 4), selects an input to mutate (Line 5), mutates the input (Lines 9-14), runs the program on the mutated inputs while simultaneously collecting their coverage information (Line 21), and finally adds the mutated inputs to the queue if they achieve new coverage (Line 23). An entire pass through the queue is called a cycle; cycles are repeated (Line 3) until the fuzz testing procedure is stopped by the user.

AFL’s mutation strategies assume the input to the program under test is a sequence of bytes, and can be treated as such during mutation. AFL mutates inputs in two main stages: the deterministic stages (Algorithm 1, Lines 8-11) and the havoc stage (Lines 12-16). All the deterministic mutation stages operate by traversing the input under mutation and applying a mutation at sequential positions (bits and bytes) in this input. These mutations include bit flipping, byte flipping, arithmetic increment and decrement of byte values, replacing of bytes with “interesting” values, etc. The number of mutated inputs produced in each of these stages is governed by the length of the input being mutated. On the other hand, the havoc stage works by applying a sequence of random mutations (e.g. setting random bytes to random values, deleting or cloning subsequences of the input) to the input being mutated to produce a new input. The total number of havoc-mutated inputs to be produced is determined by a performance score (Line 12).
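The following sketch contrasts the two phases: a deterministic pass that visits every position of the input, and a havoc loop whose number of generated inputs is set by the performance score. run_and_maybe_save() is an assumed stand-in for Algorithm 1's runAndMaybeSave, and the mutations themselves are deliberately simplified.

/* Simplified sketch of AFL's two mutation phases (illustrative only). */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

void run_and_maybe_save(uint8_t *buf, size_t len);   /* assumed helper */

void mutate_deterministic(uint8_t *buf, size_t len) {
    for (size_t i = 0; i < len; i++) {   /* one mutation per position  */
        buf[i] ^= 0xFF;                  /* e.g. flip the byte at i    */
        run_and_maybe_save(buf, len);
        buf[i] ^= 0xFF;                  /* restore before moving on   */
    }
}

void mutate_havoc(uint8_t *buf, size_t len, int score) {
    if (len == 0) return;
    for (int n = 0; n < score; n++) {    /* #inputs set by perf. score */
        uint8_t *copy = malloc(len);
        memcpy(copy, buf, len);
        int edits = 1 << (1 + rand() % 6);          /* stack of random edits */
        for (int e = 0; e < edits; e++)
            copy[rand() % len] = (uint8_t)rand();   /* random byte overwrite */
        run_and_maybe_save(copy, len);
        free(copy);
    }
}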

AFL’s mutation strategies pay little or no attention to the contents of an input. Therefore, they can easily destroy parts of an existing input that are necessary to explore deeper parts of a program. To see how this could make it difficult for AFL to explore deeper parts of the program under test, consider the code in Figure 1. This fragment, simplified from a portion of the libxml file parser.c, shows the various nested string comparisons that process XML default declarations. Exploring the correct handling of, say, the requirement that the #FIXED keyword must be followed by a blank character (Line 16) requires producing an input containing the string <!ATTLIST, followed by a correct attribute type string like ID or CDATA (to pass Line 3), then finally the string #FIXED. If AFL manages to produce the input <!ATTLIST BD, it will not prioritize mutation of the bytes after <!ATTLIST, and is as likely to produce the mutants <!CATLIST BD, <!!ATTLIST BD, ???!ATTLIST BD as it is to produce <!ATTLIST ID. (AFL does perform some more complex mutations only on positions which, when mutated, cause different program behavior. In this case, <!CATLIST BD, <!!ATTLIST BD, and ???!ATTLIST BD are generated by mutations performed at locations whose modification results in different behavior.) However, to explore the code in Figure 1 more rapidly, once <!ATTLIST BD has been discovered, mutations should keep the <!ATTLIST part of this input constant.

1   if (CMP9(ptr, '<', '!', 'A', 'T', 'T', 'L', 'I', 'S', 'T')) {
2      // if an attribute type is found, process default val
3      if (xmlParseAttributeType(ptr) > 0) {
4         if (CMP9(ptr, '#', 'R', 'E', 'Q', 'U', 'I', 'R', 'E', 'D')) {
5            ptr += 9;
6            default_decl = XML_ATTRIBUTE_REQUIRED;
7         }
8         if (CMP8(ptr, '#', 'I', 'M', 'P', 'L', 'I', 'E', 'D')) {
9            ptr += 8;
10           default_decl = XML_ATTRIBUTE_IMPLIED;
11        }
12        if (CMP6(ptr, '#', 'F', 'I', 'X', 'E', 'D')) {
13           ptr += 6;
14           default_decl = XML_ATTRIBUTE_FIXED;
15           if (!IS_BLANK_CH(ptr)) {
16              xmlFatalErrorMsg("Space required after '#FIXED'");
17           }
18        }
19     }
20  }
Figure 1. Code fragment based on the libxml file parser.c, showing the many nested if statements that must be satisfied to explore erroneous behavior.

We propose a two-pronged approach that builds on AFL to do exactly this. The first part of our approach is the identification of statements like the if statements in Figure 1 (and Line 3 of Figure 2). For this, we leverage the fact that such statements are indeed hit by very few of AFL’s generated inputs (i.e. they are rare), and can thus be easily identified by keeping track of the number of inputs which hit each branch. Having identified these rare branches for targeted fuzzing, we modify the input mutation strategy in order to keep the condition of the rare branch satisfied. Specifically, we use a deterministic mutation phase to approximately determine the parts of the input that cannot be mutated if we want the input to hit the rare branch. The subsequent mutation stages are then not allowed to mutate these crucial parts of the input. As a result, we significantly increase the probability of generating new inputs that hit the rare branch. This opens up the possibility of better exploring the part of the code that is guarded by the if condition. We find this approach leads to significant coverage increases after 24 hours compared to stock AFL and other modified versions of AFL on some benchmarks, and that on other benchmarks it achieves wide program coverage at a significantly faster rate. We will present the details of this approach in the next section.

We note that prior work on AFL has focused on a closely related, but not identical, issue. This issue is usually illustrated with code fragments like that in Figure 2. In this fragment, a program crash hides behind a branch statement that is only satisfied by the presence of a certain magic number in the input: AFL’s mutation-based strategy is highly unlikely to discover the 8 consecutive magic bytes. Several techniques have been proposed to allow AFL to better explore scenarios like these. AFLFast (Böhme et al., 2016) assumes these will present as rare paths and applies more havoc mutations on seeds exercising rare paths to try and hit more rare paths; Driller (Stephens et al., 2016) uses symbolic execution to produce these magic numbers when AFL gets stuck; Steelix (Li et al., 2017) adds a static analysis stage, extra instrumentation, and mutations to AFL to exhaustively search the bytes around a byte which is matched in a multi-byte comparison. The issue of producing a single input with a magic number is fundamentally orthogonal to the issue of making sure this magic number retains its value. None of these techniques will be able to prevent this magic number from being mutated during AFL’s usual mutation stages after it has been discovered, while our technique does. Additionally, since Steelix does not instrument one byte comparison instructions, it would not effectively explore the code in Figure 1, as the CMP macros compile into single byte comparison instructions. Since AFLFast’s emphasis on rare paths is similar to our emphasis on rare branches, we will evaluate our technique against AFLFast.

1   int main( int argc, const char* argv[] ) {
2      // ...
3      if (magic == 0xBAAAAAAD) {
4         // crash!
5      }
6   }
Figure 2. Magic number guarding a crash.

3. FairFuzz Algorithm

In FairFuzz (so called because it gives more priority to the rare branches of a program, which are “unfairly” not prioritized by stock AFL), we modify the AFL algorithm in two key ways to target exploration to rare branches. First, we modify the selection of inputs to mutate from the queue (isWorthFuzzing in Algorithm 1, Line 5) in order to select inputs which hit rare branches. Second, we modify the mutations that are performed on these inputs (mutateDeterministic on Line 9 and mutateHavoc on Line 14) in order to increase the probability that the mutated inputs will hit the rare branch in question. We describe the modifications to input selection in Section 3.1 and the modifications to the mutation strategies in Section 3.2.

3.1. Selecting Inputs to Mutate

Recall from Section 2 that conventional AFL operates by repeatedly traversing a queue of inputs and mutating these inputs to produce new inputs. Unlike a traditional queue data structure, inputs are never truly removed from the queue. Instead, the method isWorthFuzzing selects certain inputs to mutate in each cycle. The method is non-deterministic, prioritizing short, fast, and recently discovered inputs, but sometimes selecting old inputs for mutation. We replace isWorthFuzzing with the function hitsRareBranch, which, as the name suggests, returns true if the input hits a rare branch and false otherwise.

In order to do this, we must define what it means to be a rare branch. A natural idea is to designate a fixed number of the branches hit by the fewest inputs as rare, or to designate the branches hit by less than some fixed percentage of inputs as rare. After some initial experiments with our benchmark programs, we rejected these methods as (a) they can fail to capture what it means to be rare (e.g. if we designate the two rarest branches as rare and these are hit by 20 and 15,000 inputs, respectively, both would be considered equally rare when this is clearly not the case), and (b) these thresholds would likely need to be modified for different benchmarks.

Instead, we say a branch is rare if it has been hit by a number of inputs less than or equal to a dynamically chosen rarity_cutoff. Informally, rarity_cutoff is the smallest power of two which bounds the number of inputs hitting the rarest branch. For example, if the rarest branch has been hit by only 17 inputs, we would cut off rarity at branches that have been hit by at most 32 inputs. Formally, if B is the set containing the number of inputs hitting each branch, then

rarity_cutoff = 2^i, where 2^(i-1) < min(B) <= 2^i.
FairFuzz keeps track of B by keeping a map of branch IDs to the number of inputs which hit the corresponding branch so far (the input count). After running the program on an input, FairFuzz increments the input count by one for each branch the input hits. FairFuzz recalculates rarity_cutoff at every call of hitsRareBranch.
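The cutoff computation can be sketched as follows; the bookkeeping arrays (branch_input_counts, branch_seen, branch_excluded) are hypothetical stand-ins for FairFuzz's internal data structures, not its actual code.

/* Sketch of rarity_cutoff: the smallest power of two that is at least the
   hit count of the rarest observed, non-excluded branch. */
#include <stdint.h>

#define MAP_SIZE (1 << 16)

extern uint32_t branch_input_counts[MAP_SIZE];  /* inputs hitting each branch    */
extern uint8_t  branch_seen[MAP_SIZE];          /* branch observed at least once */
extern uint8_t  branch_excluded[MAP_SIZE];      /* "stuck" branches to ignore    */

uint32_t rarity_cutoff(void) {
    uint32_t min_hits = UINT32_MAX;
    for (uint32_t b = 0; b < MAP_SIZE; b++)
        if (branch_seen[b] && !branch_excluded[b] &&
            branch_input_counts[b] < min_hits)
            min_hits = branch_input_counts[b];

    if (min_hits == UINT32_MAX) return 0;       /* no branches observed yet */

    uint32_t cutoff = 1;
    while (cutoff < min_hits) cutoff <<= 1;     /* e.g. min_hits = 17 -> 32 */
    return cutoff;
}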

During fuzzing, FairFuzz goes over the queue and selects for mutation inputs on which hitsRareBranch returns true. Note that the execution of a program on a selected input may hit multiple rare branches. In that case, FairFuzz picks the rarest branch among them as the target branch for the purpose of mutation, and subsequently tries to make sure that mutations of the selected input hit this target rare branch more frequently. Of course, if the input hits only one rare branch, this is the target branch. Since at the beginning of fuzz testing no inputs have been produced, and the definition of rare requires some number of inputs to have hit a branch, we run a round of regular AFL mutation on the user-provided input. All subsequent inputs are selected from the queue with hitsRareBranch.
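A sketch of this selection logic is given below: an input is worth fuzzing if it hits at least one rare branch, and the rarest such branch becomes its target, so hitsRareBranch corresponds to pick_target_branch returning a valid ID. The input_t struct and the globals are hypothetical, introduced only for illustration.

/* Sketch of rare-branch-based input selection. */
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t *branches;   /* IDs of branches this input hits */
    size_t    n;
} input_t;

extern uint32_t branch_input_counts[];
uint32_t rarity_cutoff(void);            /* as in the previous sketch */

/* Returns the target branch ID, or -1 if the input hits no rare branch. */
int64_t pick_target_branch(const input_t *in) {
    uint32_t cutoff = rarity_cutoff();
    int64_t  target = -1;
    uint32_t best   = UINT32_MAX;
    for (size_t i = 0; i < in->n; i++) {
        uint32_t hits = branch_input_counts[in->branches[i]];
        if (hits <= cutoff && hits < best) {   /* rarest rare branch wins */
            best   = hits;
            target = in->branches[i];
        }
    }
    return target;
}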

Finally, due to its strict boolean nature, hitsRareBranch can get stuck more easily than the probabilistic isWorthFuzzing (i.e. by repeatedly pulling the same input from the queue). Currently, we take action only in the extreme case, when none of the mutated inputs produced from a seed hit the target branch. If we witness this behavior, we put the rare branch on an exclude list. We ignore any branches in the exclude list when finding rare branches and calculating the rare branch cutoff in the future.

3.2. Mutating Inputs

After selecting an input which hits a rare branch, we bias mutation of this input to produce inputs which hit the target branch at a higher frequency than AFL would. The biggest part of this is the calculation of a branch mask, which we use to influence all subsequent mutations.

The branch mask designates the positions in the input at which bytes can be (1) overwritten (overwritable), (2) deleted (deletable), or (3) inserted (insertable) while the resulting input still hits the target branch. We compute this branch mask in AFL’s deterministic mutation stages. Part (1) of the branch mask is computed in the byte flipping stage: if the input resulting from a byte flip still hits the target branch, the position is designated as overwritable. We add two new deterministic stages after the byte flipping to compute parts (2) and (3) of the mask. First, we traverse the input and delete each byte in sequence, marking the position as deletable if its removal does not make the program miss the target branch. Similarly, we traverse the input and add a random byte before each position (and at the end of the input), marking the position as insertable if the resulting input still makes the program hit the target branch. Note that this technique for computing the branch mask is approximate—even if FairFuzz determines that a position in the input is overwritable, a particular mutation of the byte at that position might make the program miss the target branch, or vice-versa. However, in our experiments we found that the use of the branch mask significantly increases the probability of hitting the target branch (see Section 4.2).
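The overwritable part of the mask can be sketched as below, under the assumption of a hits_target() helper that runs the program on a candidate input and reports whether the target branch was hit; the deletable and insertable passes follow the same pattern with byte removal and insertion in place of flips.

/* Sketch of the overwritable pass of the branch-mask computation. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define OVERWRITABLE 1   /* further bits would mark deletable/insertable */

bool hits_target(const uint8_t *buf, size_t len, uint32_t target_branch);

void compute_overwritable(uint8_t *buf, size_t len,
                          uint32_t target, uint8_t *mask) {
    for (size_t i = 0; i < len; i++) {
        buf[i] ^= 0xFF;                      /* flip the byte at position i */
        if (hits_target(buf, len, target))
            mask[i] |= OVERWRITABLE;         /* still reaches the target    */
        buf[i] ^= 0xFF;                      /* restore the original byte   */
    }
}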

After its calculation, the branch mask is used to influence all subsequent mutations of the input. During the deterministic stages, a mutation is performed at a position only if the branch mask indicates the target branch can still be hit by the resulting input. In the havoc stage, instead of performing random mutations at randomly selected positions in the input, FairFuzz performs random mutations at positions randomly selected within the corresponding modifiable positions of the branch mask. Note that when, during havoc mutations, bytes are deleted from or added to the mutated input, the branch mask created for the input at the beginning of the mutation process no longer maps to the input being mutated. As such, FairFuzz modifies the branch mask in coordination with the mutated input. When a block of the input is deleted, FairFuzz deletes the corresponding block within the branch mask; when a block is inserted, FairFuzz inserts a section into the branch mask which designates the new area as completely modifiable. Of course, this is an approximation, so we expect the proportion of inputs hitting the target branch in the havoc stage to be smaller than the proportion of inputs hitting the rare branch in deterministic fuzzing stages. We examine this further in Section 4.2.
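The mask bookkeeping during havoc mutations can be sketched as follows; ALL_MODIFIABLE and the in-place buffers are assumptions for illustration, and a real implementation would also track buffer capacity.

/* Sketch of keeping the branch mask aligned with a havoc-mutated input. */
#include <stdint.h>
#include <string.h>

#define ALL_MODIFIABLE 0x7   /* overwritable | deletable | insertable */

void mask_delete(uint8_t *mask, size_t *len, size_t pos, size_t n) {
    memmove(mask + pos, mask + pos + n, *len - pos - n);   /* close the gap */
    *len -= n;
}

void mask_insert(uint8_t *mask, size_t *len, size_t pos, size_t n) {
    memmove(mask + pos + n, mask + pos, *len - pos);   /* assumes spare capacity    */
    memset(mask + pos, ALL_MODIFIABLE, n);             /* new bytes: free to mutate */
    *len += n;
}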

3.3. Trimming Inputs for Target Branches

AFL’s efficiency depends in large part on its ability to quickly produce and modify inputs (Zalewski, 2017). Thus, it is important to make sure FairFuzz’s branch mask computation is efficient. Since the runtime of the branch mask computation is linear in the length of the selected input, FairFuzz needs to keep the length of the inputs in the queue short. AFL has two techniques for keeping inputs short: (1) prioritizing short inputs when selecting inputs for mutation and (2) trimming (i.e. performing an efficient approximation of delta-debugging (Zeller and Hildebrandt, 2002)) the input it selects for mutation before mutating it. Trimming attempts to minimize the input selected for mutation with the constraint that the minimized input hits the same path (set of (branch ID, hit count) pairs) as the seed input. However, this constraint is not good enough for reducing the length of inputs significantly when very long inputs are chosen—something FairFuzz may do since it selects inputs only based on whether they hit a rare branch. We found that we can make inputs shorter in spite of this if we relax the trimming constraint to require that the minimized input hits only the target branch of the original input, instead of the same path as the original input. We refer to our technique with this relaxed constraint as our technique “with trimming”.
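A sketch of trimming under this relaxed constraint: chunks of decreasing size are removed, and a removal is kept whenever the shorter input still hits the target branch. This loosely mirrors AFL-style chunk trimming but not its exact step sizes; hits_target() is the same assumed helper as in the earlier sketch.

/* Sketch of target-branch trimming (illustrative, not FairFuzz's code). */
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

bool hits_target(const uint8_t *buf, size_t len, uint32_t target_branch);

size_t trim_for_target(uint8_t *buf, size_t len, uint32_t target) {
    for (size_t chunk = len / 2; chunk >= 1; chunk /= 2) {
        size_t pos = 0;
        while (pos + chunk <= len) {
            uint8_t *candidate = malloc(len - chunk);
            memcpy(candidate, buf, pos);                    /* bytes before the chunk */
            memcpy(candidate + pos, buf + pos + chunk,
                   len - pos - chunk);                      /* bytes after the chunk  */
            if (hits_target(candidate, len - chunk, target)) {
                memcpy(buf, candidate, len - chunk);        /* keep the shorter input  */
                len -= chunk;
            } else {
                pos += chunk;                               /* removal lost the branch */
            }
            free(candidate);
        }
    }
    return len;
}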

4. Implementation and Evaluation

We have implemented our technique as an open-source extension of AFL named FairFuzz. This implementation adds around 600 lines of C code to the file containing AFL’s core implementation, including some code used only for experimentation, i.e. for the statistics in Section 4.2.

Figure 3. Number of basic block transitions (AFL branches) covered by different AFL techniques, averaged over 20 runs (bands represent 95% C.I.s), for the tcpdump, readelf, nm, objdump, c++filt, xmllint, mutool draw, djpeg, and readpng benchmarks.

In this evaluation we compare three popular versions of AFL against FairFuzz, all based on AFL version 2.40b. “AFL” is the vanilla AFL available from AFL’s website. “FidgetyAFL” (Zalewski, 2016a) is AFL run without the deterministic mutation stage, which AFL’s creator found replicated the performance of AFLFast (Böhme et al., 2016) on some benchmarks. “AFLFast.new” (Böhme, 2016) is AFLFast run with new settings, which has been claimed to be significantly better than both FidgetyAFL and AFLFast. AFLFast.new omits the deterministic stage and uses the cut-off exponential exploration strategy. We ran FairFuzz with input trimming for the target branch and omitting all deterministic stages except those necessary for the creation of a branch mask. Our initial experiments to determine which of our modifications (trimming, using the branch mask, doing all deterministic mutations) were most effective were inconclusive; this combination was a compromise which saw coverage increases on multiple benchmarks. We will refer to FidgetyAFL, AFLFast.new, and FairFuzz as the “modified” techniques.

To evaluate their ability to achieve fast code coverage and discover crashes, we evaluated the techniques on 9 different benchmarks. We selected these from those favored for evaluation by the AFL creator (djpeg from libjpeg-turbo-1.5.1, and readpng from libpng-1.6.29), those used in AFLFast’s evaluation (tcpdump -nr from tcpdump-4.9.0; and nm, objdump -d, readelf -a, and c++filt from GNU binutils-2.28) and a few benchmarks with more complex input grammars in which AFL has previously found vulnerabilities (mutool draw from mupdf-1.9, and xmllint from libxml2-2.9.4). Since some of these input formats had AFL dictionaries and some did not, we ran all this evaluation without dictionaries to level the playing field. In each case we seeded the fuzzing run with the inputs in the corresponding AFL testcases directories (except c++filt, which was seeded with the input “_Z1fv\n”). We ran each technique for 24 hours (on a single core) on each benchmark, repeating each 24 hour run 20 times for each benchmark.

The main metric we report is basic block transitions covered, which is close to the notion of branch coverage used in real-world software testing. Due to AFL’s implementation, of the three common metrics used to evaluate versions of AFL—number of paths, number of unique crashes, and number of basic block transitions covered—this is the only one that is robust to the order in which inputs are discovered. As a simple illustration, consider a program with two branches, b1 and b2. Suppose input A hits b1 once, input B hits b2 once, and input C hits both b1 and b2 once. Their respective paths are {(b1, 1)}, {(b2, 1)}, and {(b1, 1), (b2, 1)}. Now, if AFL discovers these inputs in the order A, B, C, it will save both A and B and count 2 paths (or 2 unique crashes if both crash), and not save C since it does not exercise a new (branch ID, hit count) pair. On the other hand, if AFL discovers the inputs in the order C, A, B, it will save C and count 1 path (or 1 unique crash if C crashes), and save neither A nor B. Thus it appears on the second run that AFL has found half the paths (or unique crashes) it did on the first run. On the other hand, regardless of the order in which inputs are discovered, the number of basic block transitions covered will be 2. We note that the creator of AFL also favors basic block transitions covered over unique crashes as a performance metric (Zalewski, 2016b). We will nonetheless present data on AFL-measured unique crashes in Section 4.3 for comparison to prior work.

Although previous work does not always repeat experiments (Stephens et al., 2016; Li et al., 2017; Böhme et al., 2017) or, when experiments are repeated, does not provide any measure of variability for metrics like unique crashes (Böhme et al., 2016), we found compelling evidence (see the results in Sections 4.1 and 4.3 and in particular the graphs in Figure 3 and Figure 5) of the importance of reporting the variability of AFL results in some form or another. We repeat our experiments 20 times because AFL is an inherently non-deterministic process (especially when running only havoc mutations), and so is its performance. This enabled us to report results that are statistically significant. We believe that all future research work should perform experiments along similar lines.

4.1. Coverage Compared to Prior Techniques

In analyzing the program coverage achieved by each technique we sought to answer two research questions:

  1. Does an emphasis on rare branches result in faster program coverage?

  2. Does an emphasis on rare branches lead to long-term gains in coverage?

To answer these questions we begin by analyzing coverage as measured in the number of basic block transitions (branches) discovered through time. Figure 3 plots, for each benchmark and technique, the average number of branches covered over all 20 runs at each time point (dark central line) and 95% confidence intervals in branches covered at each time point (shaded region around the line). For the confidence intervals we assume a Student’s t distribution (taking the appropriate critical t value times the standard error).
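For a concrete reading of the bands (this is the standard t-based construction; the exact critical value used is not stated in the text): with n = 20 runs, per-time-point sample mean m and sample standard deviation s, the 95% interval is

    m ± t_{0.975, 19} * s / sqrt(20),

where t_{0.975, 19} ≈ 2.09 is the Student’s t critical value with 19 degrees of freedom.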

Figure 4. Number of benchmarks on which each technique has the lead in coverage at each hour. A benchmark is counted for multiple techniques if two techniques are tied for the lead.

We can see in Figure 3 that on most benchmarks, FairFuzz achieves the upper bound in branch coverage, generally showing the most rapid increase in coverage at the beginning of execution. Overall, FairFuzz produces rapid increases in coverage compared to other techniques on objdump, readelf, readpng, tcpdump, and xmllint, and to a lesser degree on nm, while tying with the other techniques on mutool and djpeg, and with AFLFast.new having the edge on c++filt. Note that while FairFuzz keeps a sizeable lead on the xmllint benchmark (Figure 3), it does so with wide variability. Closer analysis reveals that one run of FairFuzz on xmllint revealed a bug in the rare branches queueing strategy, causing FairFuzz to select no inputs for mutation—this run covered no more than 6160 branches. However, FairFuzz also had two runs on xmllint covering an exceptional 7969 and 10990 branches, respectively. But even without these three outliers, FairFuzz’s average coverage (6896 branches) remains much higher than the other techniques’ averages (e.g., AFLFast.new averages 6541 branches after 24 hours).

Figure 4 shows, at every hour, on how many benchmarks each technique has the lead in coverage. By lead we mean that its average coverage is above the confidence intervals of the other techniques, and no other technique’s average lies within its confidence interval. We say two techniques are tied if one’s average lies within the confidence interval of the other. If techniques tie for the lead, the benchmark is counted for both techniques in Figure 4, which is why the number of benchmarks at each hour may add up to more than 9. This figure shows that FairFuzz quickly achieves a lead in coverage on nearly all benchmarks and is not surpassed in coverage by the other techniques within our time limit.

We believe Figures 3 and 4 are compelling evidence that in the context of our benchmarks, the answer to our first research question is yes, an emphasis on rare branches leads to faster program coverage.

4.1.1. Detailed analysis of coverage differences.

Figure 3 shows there are three benchmarks (c++filt, tcpdump, and xmllint) on which one technique achieves a lead in AFL’s branch coverage after 24 hours (with AFLFast.new leading on c++filt and FairFuzz on the other two). So, to answer our second research question, we do a more in-depth analysis of the coverage achieved. In particular, we examine what the difference in AFL branch coverage corresponds to in terms of source code coverage.

Since AFL saves all inputs that achieve new program coverage (i.e. that are placed in the queue) to disk, we can replicate what program coverage was achieved in each run by replaying these queue elements through the programs under test. Since each benchmark was run 20 times, we take the union (over each technique) of inputs in the queue for all 20 runs. We ran the union of the inputs for each technique through their corresponding programs and then ran lcov on the results to reveal coverage differences.

xmllint

The bulk of the coverage gains on xmllint were in the main parser.c file. The key trend in increased coverage appears to be FairFuzz’s increased ability to discover keywords. For example, both AFL and FairFuzz have higher coverage than FidgetyAFL and AFLFast.new as they discovered the patterns <!DOCTYPE and <!ATTLIST in at least one run. However, FairFuzz also discovered the #REQUIRED, #IMPLIED, and #FIXED keywords, producing inputs including:

<!DOCTYPE6[ <!ATTLISTí T ID #FIXED%

<!DOCTYPE¦[ <!ATTLISTíD T ID #IMPLIEDOCTY

<!DOCTYPE\[ <!ATTLISTíD T ID #REQUIRED^@^P

We found several other instances of keywords discovered by FairFuzz but not the other techniques for xmllint. We believe our rare branch targeting technique is directly responsible for this. To see this, let us focus on the <!ATTLIST> block covered by the inputs above, whose code is outlined in Figure 1. While both AFL and FairFuzz had a run discovering the sequence <!ATTLIST, running xmllint on all the saved inputs produced by AFL only resulted in 18 hits of Line 2 of Figure 1. Running xmllint on all saved inputs produced by FairFuzz, on the other hand, resulted in 2124 hits of this line. With the queued inputs resulting in two orders of magnitude more hits of this line, it is obvious that FairFuzz was better able to discover inputs with more structure than AFL.

It is valid to ask whether this increase is only attributable to luck, since after all, it was only in one run each that AFL and FairFuzz discovered the full <!ATTLIST sequence. While we believe the difference in the number of inputs hitting Line 2 of Figure 1 for these two techniques is compelling evidence this was not the case, looking at the number of runs which produced subsequences of <!ATTLIST also suggests it is not simply luck. As we can see in Table 1, the drop from the number of runs discovering <!A to the number of runs discovering <!AT shows the branch mask in action, with 11 of FairFuzz’s runs discovering <!AT, compared to 1, 2, and 3 for AFL, FidgetyAFL, and AFLFast.new, respectively.

subsequence AFL FidgetyAFL AFLFast.new FairFuzz
<!A 7 15 18 17
<!AT 1 2 3 11
<!ATT 1 0 0 1
Table 1. Number of runs, for each technique, producing a seed input with the given subsequence in 24 hours.

We suspect FairFuzz performs particularly well on this benchmark due to the structure of the code in parser.c. The CMP macros in this code (see Figure 1) expand into character-by-character string comparisons. AFL is able to explore these more easily than full string comparisons since this splits the comparison into different basic blocks, progress through which is reported as new coverage. We discuss a recently proposed extension to the LLVM version of the AFL instrumenter (laf, 2016a) that would automatically expand comparisons in programs not structured in this manner in Section 5.

Finally, as is obvious from the example inputs above, although FairFuzz discovered more keywords, the inputs it produced were not necessarily more well-formed. Nonetheless, these inputs allowed FairFuzz to explore more of the program’s error-handling code. This is reflected in the coverage of a large case statement differentiating 57 error messages in parser.c. Both FidgetyAFL and AFLFast.new cover only 22 of these cases, AFL covers 33, and FairFuzz covers 39. Interestingly, FairFuzz misses one error that the other techniques cover—an error message about excessive name length—indicating it did not produce an input with a long enough name to trigger this error. We will come back to this when examining the coverage differences in c++filt.

tcpdump

We observed that coverage for tcpdump differs a bit for all four techniques over a variety of different files. We see the biggest gains in three files printing certain packet types, all of which suggest FairFuzz is better able to automatically detect the structure of inputs.

In print-forces.c, a ForCES Protocol printer (RFC 5810), all of AFL, FidgetyAFL, and AFLFast.new are able to create files that pass the first validity checks on message type, but only FairFuzz is able to create files that have legal ForCES packet length. In print-llc.c, an IEEE 802.2 Logical Link Control (LLC) parser, FairFuzz was able to create packets with the organizationally unique identifier (OUI) corresponding to RFC 2684, and subsequently explore the many subtypes of this OUI. Finally, in print-snmp.c, a Simple Network Management Protocol printer, we find that FairFuzz produces inputs corresponding to Trap PDUs with valid timestamps while AFL does not. FairFuzz also produces scoped SNMP PDUs with valid contextName while AFL does not. In addition, in a function to decode an SNMPv3 user-based security message header, FairFuzz produces a few inputs which pass through this function successfully, while AFL does not produce any with a valid msgAuthoritativeEngineBoots.

We note these gains in coverage seem less impressive than those of FairFuzz on xmllint. We hypothesize this is because FairFuzz’s performance on tcpdump may reflect consistently higher coverage of the program, with most of this increase matched in at least one run of the other techniques. The coverage computed by running the union of all inputs produced by each technique obscures in what proportion of runs a certain packet structure was discovered. This hypothesis is confirmed by comparing the number of branches covered by at least one of the 20 runs (the union over the runs) and the number of branches covered at least once in all the 20 runs (the intersection over the runs) for the different techniques. For tcpdump, we see FairFuzz has a much more consistent increase in coverage (the intersection of coverage for FairFuzz contains 11,293 branches, over 500 more branches than AFLFast.new’s 10,724), but no huge standout gain in the union (the union of coverage for FairFuzz is 16,129 branches, only 200 more than AFLFast.new’s 15,929). FairFuzz’s performance on xmllint shows the opposite behavior. The intersection of coverage for xmllint is virtually the same for the three modified techniques (5,876 for FidgetyAFL, 5,778 for AFLFast.new and 5,884 for FairFuzz), but FairFuzz’s union of coverage (11,681) contains over 4,000 more branches than AFLFast.new’s union of coverage (7,222).

c++filt

The differences in terms of source code coverage between techniques were much more minimal for c++filt than for tcpdump or xmllint. We list all the differences in coverage between AFLFast.new and FairFuzz below. FairFuzz’s gains relate to the structure of the input: it covers 3 lines in cp-demangle.c that AFLFast.new does not, related to demangling binary components when the operator has a certain opcode. AFLFast.new’s gains appear to be mostly related to FairFuzz’s inability to produce very long inputs. In cp-demangle.c, AFLFast.new covers a branch where xmalloc_failed(INT_MAX) is called if a length bound comparison fails; FairFuzz fails to produce an input long enough to violate the length bound. FairFuzz also fails to cover a branch in cxxfilt.c, taken when the length of input read into c++filt surpasses the length of the input buffer allocated to store it, which all other techniques cover.

FairFuzz’s inability to produce very long inputs may be due to the second round of trimming FairFuzz does. It could also be due to the fact that a focus on branches, which ignores the hit counts an input achieves on a particular branch, may not encourage FairFuzz to explore repeated iterations of loops. In spite of the fact that this shows up as a small difference in measurable coverage, FairFuzz shows very different behavior when it comes to finding crashes, and we suspect the input length issue is a reason why. We will discuss this in more detail in Section 4.3.

The pattern we see from this analysis is that FairFuzz is better able to automatically discover input constraints and keywords—special sequences, packet lengths, organization codes—and target exploration to inputs which satisfy these constraints than the other techniques. We suspect the gains in coverage speed on benchmarks such as objdump, readpng, and readelf are due to similar factors. We conjecture the targeting of rare branches shines the most in the tcpdump and xmllint benchmarks since these programs are structured with many nested constraints, which the other techniques are unable to properly explore over the time budget (and perhaps even longer) without extreme luck.

4.2. Are branches successfully targeted?

Our third research question focused on the effectiveness of the branch mask strategy, i.e.:

  3. Does the branch mask increase the proportion of produced inputs hitting the target branch?

To evaluate this, we conducted an experiment where, for each seed input chosen for mutation, we first ran a shadow run of mutations with the branch mask disabled, then re-ran the mutations with the branch mask enabled. We also disabled side effects (queueing of inputs, incrementing of branch hit counts) in the shadow run. This shadow run allows us to compute the difference between the percentage of generated inputs hitting the branch with and without the branch mask for each seed input.

We ran FairFuzz with these shadow runs for one queueing cycle on a subset of our benchmarks. For each benchmark we ran a cycle with target branch trimming and one without. Table 2 shows the target branch hit percentage for the deterministic and havoc stages, averaging the per-input hit percentages over all inputs generated in the first cycle. (We separate the byte-flipping stage in which the branch mask is computed—the calibration stage—from the subsequent deterministic stages which use the branch mask.)

det. mask det. plain havoc mask havoc plain
xmllint 92.8% 46.5% 31.8% 6.6%
tcpdump 99.0% 74.0% 34.2% 9.3%
c++filt 97.6% 64.1% 41.4% 14.4%
readelf 99.7% 82.7% 57.7% 14.9%
readpng 99.1% 34.6% 24.3% 2.4%
objdump 99.2% 70.2% 42.4% 9.0%
(a) Cycle without trimming.
det. mask det. plain havoc mask havoc plain
xmllint 90.3% 22.9% 32.8% 2.9%
tcpdump 98.7% 72.8% 36.1% 9.0%
c++filt 96.6% 14.8% 34.4% 1.1%
readelf 99.7% 78.2% 55.5% 11.4%
readpng 97.8% 39.0% 24.0% 2.4%
objdump 99.2% 66.7% 46.2% 7.6%
(b) Cycle with trimming.
Table 2. Average % of mutated inputs hitting target branch for one queueing cycle.

Comparing the numbers in Table 2(a) and Table 2(b), it appears that trimming the input reduces the number of inputs hitting the target branch when the branch mask is disabled but has minimal effect when the branch mask is enabled. Overall, Table 2 shows that the branch mask does largely increase the percentage of mutated inputs hitting the target branch. The hit percentages for the deterministic stage are strikingly high: this is not unexpected as in the deterministic stage the branch mask simply operates to prevent mutations at locations likely to violate the target branch. What is most impressive is the gain in the percentage of inputs hitting the target branch in the havoc stage. In spite of the use of the branch mask in the havoc stage being approximate, we consistently see the use of the branch mask causing a 3x-10x increase in the percentage of inputs hitting the target branch. From this analysis, we conclude that the answer to our third research question is positive – the branch mask increases the proportion of inputs hitting the target branch.

Note that this branch mask targeting technique is independent of the fact that the branches being targeted are “rare”. This means that this strategy could be used in a more general context. For example, we could target only the branches within a function that needs to be tested, or, if some area of the code was recently modified or bug-prone, we could target the branches in that area with the branch mask. Recent work on targeted AFL (Böhme et al., 2017) shows promise in such an application of AFL, and we believe the branch mask technique could be used cooperatively with the power schedules presented in this work.

4.3. Crashing Compared to Prior Techniques

Our final research question pertains to crash finding:

  4. Does an emphasis on rare branches lead to faster crash exposure?

Crash finding is hard to evaluate in practice. Of the 9 benchmarks we tested on, all techniques found crashes only on c++filt and readelf. While we discussed issues with the unique crashes metric at the beginning of this section, for comparison with prior work, we plot the unique crashes found by each technique in Figure 5. The most interesting part of this figure is the width of the confidence intervals, which reflects just how variable this metric is.

Figure 5. Average unique crashes found over 20 runs for each technique (dark line) with 95% confidence intervals, for the readelf and cxxfilt benchmarks.

In spite of the variability, the figure suggests FairFuzz performs poorly on c++filt in terms of crash exposure. To evaluate this, we can first compare the percentage of runs finding crashes for each benchmark. On readelf, both AFL and FidgetyAFL find crashes in 50% of runs while AFLFast.new and FairFuzz find crashes in 75% of runs. On c++filt, however, FidgetyAFL and AFLFast.new find crashes in 100% of the runs, AFL in 85% of them, and FairFuzz in only 25%. Looking at the time to first crash for the different runs, we see that for readelf, in general, FairFuzz finds crashes a bit faster than AFLFast.new (Figure 6). However, for the few runs on which FairFuzz found crashes for c++filt, it takes around 10 hours for FairFuzz to find the crashes as compared to less than 2 hours for FidgetyAFL.

Figure 6. Time to find first crash for each run of the different techniques on readelf and cxxfilt. Each point represents the time to first crash for a single run.

Given that some of the lines missed by FairFuzz in c++filt were about buffer length (recall Section 4.1.1), we suspected that input length might be a factor in this performance discrepancy. We examined the file size of the crashing inputs saved for each technique. Many of these were very long, more than 20KB, with one crashing input generated by AFLFast.new being 130KB long! We saw crashing inputs of this length in only one of the five FairFuzz runs which found crashes in c++filt.

Thus, we suspect a factor in FairFuzz’s poor crash-finding performance on c++filt is that the method does not encourage the creation of very long inputs. It is also possible that the structure of c++filt is better suited to a path-based exploration strategy (like that of AFL/FidgetyAFL and AFLFast.new) than a branch-based one. For example, demangling function argument types is done in a loop, so FairFuzz is less likely than the path-based techniques to prioritize inputs with many new function arguments. Overall, the answer to our final research question is that FairFuzz may improve crash exposure in some cases (readelf), but does not when the crashes are most easily exposed by large inputs.

5. Discussion

The foremost limitation of FairFuzz is the fact that branches that are never hit by any AFL input cannot be targeted by this method. So, it confers little benefit in discovering a single long magic number (like that in Figure 2) when progress towards matching the magic number does not result in new coverage. However, recall that FairFuzz was effective at finding keyword sequences in the xmllint benchmark. This is because the long string comparisons in parser.c were structured as byte-by-byte comparisons, so AFL’s instrumentation reported new coverage when progress was made on these comparisons. AFL’s instrumentation would not report new coverage for progress on a single multi-byte comparison.

The creators of laf-intel (laf, 2016a) propose several LLVM “deoptimization” passes to improve AFL’s performance, including a pass that automatically turns multi-byte comparisons into byte-by-byte comparisons. Figure 7 shows an example of this comparison unrolling. The integration of these LLVM passes into AFL’s instrumentation is straightforward, requiring only a patch to AFL’s LLVM-based instrumenter (laf, 2016b). Due to FairFuzz’s performance on the xmllint benchmark, we believe FairFuzz could show similar coverage gains on other programs if they were compiled with this laf-intel pass. We did not evaluate this, as the evaluation of the laf-intel pass was done in AFL’s “parallel” fuzzing mode, with some instances of AFL running on the traditionally instrumented program and others running on the program with the enhanced instrumentation. We did not do any experiments with this parallel fuzzing mode as our implementation does not include a distributed version of our rare branch computation algorithm.

if ( str == "BAD!") {
   // do bad things
}

if (str[0] == 'B') {
   if (str[1] == 'A') {
      if (str[2] == 'D') {
         if (str[3] == '!') {
            // do bad things
}}}}

Figure 7. Multi-byte comparison (top) unrolled to byte-by-byte comparison (bottom).

6. Other Related Work

Unlike FairFuzz and other greybox fuzzers (lib, 2016), which use coverage information as a heuristic for which inputs may yield new coverage under mutation, symbolic execution tools (Godefroid et al., 2005; Sen and Agha, 2006; Cadar et al., 2008; Sen et al., 2005) methodically explore the program under test by capturing path constraints and directly producing inputs which fit yet-unexplored path constraints. The cost of this precision is the path explosion problem, which causes scalability issues.

Traditional blackbox fuzzers such as zzuf (Hocevar, 2007) mutate user-provided seed inputs according to a mutation ratio, which may need to be adjusted to the program under test. BFF (Householder and Foote, 2012) and SymFuzz (Cha et al., 2015) adapt this parameter automatically, by measuring crash density (the number of mutated inputs finding crashes) and by inferring input bit dependence, respectively. These optimizations are not relevant to AFL-style mutational fuzzers, which do not use this mutation ratio parameter.

There exist several fuzzers highly optimized for certain input file structures, including network protocols (Amini and Portnoy, 2012; Bratus et al., 2008) and source code (Yang et al., 2011; Holler et al., 2012; Ruderman, 2015). FairFuzz is much less specialized, so it will not be as effective as these tools on their specific input formats. However, its method is fully automatic, requiring neither user input (Amini and Portnoy, 2012) nor extensive tuning (Yang et al., 2011; Ruderman, 2015).

While FairFuzz uses its branch mask to try to fix important parts of program inputs, recent work has more explicitly tried to automatically learn input formats. Learn&Fuzz (Godefroid et al., 2017) uses sequence-based learning methods to learn the structure of PDF objects, leveraging this to produce new valid PDF objects. Autogram (Höschele and Zeller, 2016) proposes a taint analysis-based approach to learning input grammars, while Glade (Bastani et al., 2017) uses an iterative approach and repeated calls to an oracle to learn a context-free grammar for a set of inputs. Unlike these approaches, we do not assume a corpus of valid inputs of any size from which input structure could be automatically learned.

Another approach to smarter fuzzing is to find locations in seed inputs related to likely crash locations in the program and focus mutation there. BuzzFuzz (Ganesh et al., 2009) and TaintScope (Wang et al., 2010) pair taint analysis with the detection of potential attack points (e.g. array allocations) to find crashes more effectively. TaintScope also includes automated checksum check detection. Dowser (Haller et al., 2013) has a similar goal but instead guides symbolic execution down branches likely to find buffer overflows. These directed methods are not directly comparable to FairFuzz since they do not have the same goal of broadening and deepening program exploration.

VUzzer’s (Rawat et al., 2017) approach is reminiscent of these, using both static and dynamic analysis to extract the immediate values used in comparisons (likely to be magic numbers), the offsets in the input at which these magic numbers are expected, and information about which basic blocks are hard to reach. It models the program as a Markov chain to decide which parts of the program are rare, as opposed to our empirical approach.
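As a point of contrast with VUzzer's model-based notion of rarity, the following is a simplified sketch of an empirical one, based only on how many inputs so far have hit each branch. It illustrates the idea rather than FairFuzz's exact implementation; MAP_SIZE and the hit-count array stand in for AFL's branch coverage map.

#include <stdint.h>

#define MAP_SIZE 65536                    /* size of the coverage map */

static uint32_t branch_hits[MAP_SIZE];    /* inputs hitting each branch,
                                             updated after every execution */

static int is_rare_branch(uint32_t branch_id) {
    /* Find the hit count of the least-hit branch seen so far. */
    uint32_t min_hits = UINT32_MAX;
    for (uint32_t i = 0; i < MAP_SIZE; i++)
        if (branch_hits[i] > 0 && branch_hits[i] < min_hits)
            min_hits = branch_hits[i];
    if (min_hits == UINT32_MAX)           /* no coverage recorded yet */
        return 0;

    /* Use the smallest power of two bounding that minimum as the
       rarity cutoff, so the set of rare branches is never empty. */
    uint32_t cutoff = 1;
    while (cutoff < min_hits)
        cutoff <<= 1;

    return branch_hits[branch_id] > 0 && branch_hits[branch_id] <= cutoff;
}

int main(void) {
    branch_hits[7] = 1;                   /* one input has hit branch 7   */
    branch_hits[8] = 1000;                /* many inputs have hit branch 8 */
    return (is_rare_branch(7) && !is_rare_branch(8)) ? 0 : 1;
}

Under such a scheme rarity is relative: as easy branches accumulate hits, only branches exercised by comparatively few inputs remain candidates for targeted mutation.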

Randoop (Pacheco and Ernst, 2007) automatically generates test cases for object-oriented programs through feedback-directed random test generation, checking that generated program fragments are valid; EvoSuite (Fraser and Arcuri, 2011) uses seed inputs and genetic algorithms to achieve high code coverage. Both techniques focus on generating sequences of method calls to test programs, which is quite different from the input-generation style of fuzzing performed by FairFuzz.

Search-based software testing (SBST) (Miller and Spooner, 1976; Korel, 1990; McMinn, 2011; Grechanik et al., 2009, 2012; Harman and Jones, 2001; Harman, 2007; Harman and Clark, 2004; Yoo and Harman, 2007) uses optimization techniques such as hill climbing and genetic algorithms to generate inputs that optimize some observable fitness function. These techniques work well when the fitness curve is smooth with respect to changes in the input, which is not the case in coverage-based greybox fuzzing.
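To make the dependence on a smooth fitness landscape concrete, the following is a small hill-climbing sketch; it is our own illustration rather than code from any particular SBST tool. With a distance-style fitness the search converges quickly, whereas a flat, hit-or-miss coverage objective would give every neighbor the same score and leave the search no gradient to follow.

#include <stdio.h>
#include <stdlib.h>

/* Smooth fitness: grows as x approaches the target value, in the
   spirit of the branch-distance measures used in search-based testing. */
static int fitness(int x, int target) { return -abs(x - target); }

/* Simple hill climbing over a single integer input. */
static int hill_climb(int start, int target, int max_steps) {
    int x = start;
    for (int i = 0; i < max_steps; i++) {
        int here = fitness(x, target);
        if (fitness(x + 1, target) > here)      x++;
        else if (fitness(x - 1, target) > here) x--;
        else break;                             /* local optimum reached */
    }
    return x;
}

int main(void) {
    printf("%d\n", hill_climb(0, 42, 1000));    /* prints 42 */
    return 0;
}

Replacing the fitness with a 0/1 "did we hit the branch" objective makes every step look equally unpromising, which is essentially the situation a coverage-guided greybox fuzzer faces.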

References

  • laf (2016a) 2016a. laf-intel. https://lafintel.wordpress.com/. (2016). Accessed August 23rd, 2017.
  • laf (2016b) 2016b. laf-intel source. https://gitlab.com/laf-intel/laf-llvm-pass. (2016). Accessed August 24th, 2017.
  • lib (2016) 2016. libFuzzer. http://llvm.org/docs/LibFuzzer.html. (2016). Accessed August 25th, 2017.
  • Amini and Portnoy (2012) Pedram Amini and Aaron Portnoy. 2012. Sulley. https://github.com/OpenRCE/sulley. (2012). Accessed August 22nd, 2017.
  • Arya and Neckar (2012) Abhishek Arya and Cris Neckar. 2012. Fuzzing for Security. https://blog.chromium.org/2012/04/fuzzing-for-security.html. (2012).
  • Bastani et al. (2017) Osbert Bastani, Rahul Sharma, Alex Aiken, and Percy Liang. 2017. Synthesizing Program Input Grammars. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2017).
  • Böhme (2016) Marcel Böhme. 2016. AFLFast.new. https://groups.google.com/d/msg/afl-users/1PmKJC-EKZ0/lbzRb8AuAAAJ. (2016). Accessed August 23rd, 2017.
  • Böhme et al. (2017) Marcel Böhme, Van-Thuan Pham, Manh-Dung Nguyen, and Abhik Roychoudhury. 2017. Directed Greybox Fuzzing. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security (CCS ’17).
  • Böhme et al. (2016) Marcel Böhme, Van-Thuan Pham, and Abhik Roychoudhury. 2016. Coverage-based Greybox Fuzzing As Markov Chain. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (CCS ’16).
  • Bratus et al. (2008) Sergey Bratus, Axel Hansen, and Anna Shubina. 2008. LZfuzz: a fast compression-based fuzzer for poorly documented protocols. Technical Report. Department of Computer Science, Dartmouth College.
  • Cadar et al. (2008) Cristian Cadar, Daniel Dunbar, and Dawson Engler. 2008. KLEE: Unassisted and Automatic Generation of High-coverage Tests for Complex Systems Programs. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation (OSDI’08).
  • Cha et al. (2015) Sang Kil Cha, Maverick Woo, and David Brumley. 2015. Program-Adaptive Mutational Fuzzing. In Proceedings of the 2015 IEEE Symposium on Security and Privacy (SP ’15).
  • Evans et al. (2011) Chris Evans, Matt Moore, and Tavis Ormandy. 2011. Fuzzing at Scale. https://security.googleblog.com/2011/08/fuzzing-at-scale.html. (2011). Accessed August 24th, 2017.
  • Fraser and Arcuri (2011) Gordon Fraser and Andrea Arcuri. 2011. EvoSuite: Automatic Test Suite Generation for Object-oriented Software. In Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering (ESEC/FSE ’11).
  • Ganesh et al. (2009) Vijay Ganesh, Tim Leek, and Martin Rinard. 2009. Taint-based Directed Whitebox Fuzzing. In Proceedings of the 31st International Conference on Software Engineering (ICSE ’09).
  • Godefroid et al. (2008) Patrice Godefroid, Adam Kiezun, and Michael Y. Levin. 2008. Grammar-based Whitebox Fuzzing. In Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ’08).
  • Godefroid et al. (2005) Patrice Godefroid, Nils Klarlund, and Koushik Sen. 2005. DART: Directed Automated Random Testing. In Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ’05).
  • Godefroid et al. (2017) Patrice Godefroid, Hila Peleg, and Rishabh Singh. 2017. Learn&Fuzz: Machine Learning for Input Fuzzing. CoRR (2017). http://arxiv.org/abs/1701.07232
  • Grechanik et al. (2012) Mark Grechanik, Chen Fu, and Qing Xie. 2012. Automatically finding performance problems with feedback-directed learning software testing. In 2012 34th International Conference on Software Engineering (ICSE). IEEE, 156–166.
  • Grechanik et al. (2009) Mark Grechanik, Qing Xie, and Chen Fu. 2009. Maintaining and evolving GUI-directed test scripts. In 2009 IEEE 31st International Conference on Software Engineering. IEEE, 408–418.
  • Haller et al. (2013) Istvan Haller, Asia Slowinska, Matthias Neugschwandtner, and Herbert Bos. 2013. Dowsing for Overflows: A Guided Fuzzer to Find Buffer Boundary Violations. In Proceedings of the 22nd USENIX Conference on Security (SEC’13).
  • Harman (2007) Mark Harman. 2007. The current state and future of search based software engineering. In 2007 Future of Software Engineering. IEEE Computer Society, 342–357.
  • Harman and Clark (2004) Mark Harman and John Clark. 2004. Metrics are fitness functions too. In Proceedings of the 10th International Symposium on Software Metrics. IEEE, 58–69.
  • Harman and Jones (2001) Mark Harman and Bryan F Jones. 2001. Search-based software engineering. Information and Software Technology 43, 14 (2001), 833–839.
  • Hocevar (2007) Sam Hocevar. 2007. zzuf. http://caca.zoy.org/wiki/zzuf/. (2007). Accessed August 22nd, 2017.
  • Holler et al. (2012) Christian Holler, Kim Herzig, and Andreas Zeller. 2012. Fuzzing with Code Fragments. In Presented as part of the 21st USENIX Security Symposium (USENIX Security 12).
  • Höschele and Zeller (2016) Matthias Höschele and Andreas Zeller. 2016. Mining Input Grammars from Dynamic Taints. In Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering (ASE 2016).
  • Householder and Foote (2012) Allen D. Householder and Jonathan M. Foote. 2012. Probability-Based Parameter Selection for Black-Box Fuzz Testing. Technical Report. Carnegie Mellon University Software Engineering Institute.
  • Korel (1990) Bogdan Korel. 1990. Automated software test data generation. IEEE Transactions on Software Engineering 16, 8 (1990), 870–879.
  • Li et al. (2017) Yuekang Li, Bihuan Chen, Mahinthan Chandramohan, Shang-Wei Lin, Yang Liu, and Alwen Tiu. 2017. Steelix: Program-state Based Binary Fuzzing. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering (ESEC/FSE 2017).
  • McMinn (2011) Phil McMinn. 2011. Search-Based Software Testing: Past, Present and Future. In Proceedings of the 2011 IEEE Fourth International Conference on Software Testing, Verification and Validation Workshops (ICSTW ’11). IEEE Computer Society, Washington, DC, USA, 153–163. https://doi.org/10.1109/ICSTW.2011.100
  • Miller and Spooner (1976) Webb Miller and David L. Spooner. 1976. Automatic generation of floating-point test data. IEEE Transactions on Software Engineering 2, 3 (1976), 223.
  • Moroz and Serebryany (2016) Max Moroz and Kostya Serebryany. 2016. Guided in-process fuzzing of Chrome components. https://security.googleblog.com/2016/08/guided-in-process-fuzzing-of-chrome.html. (2016).
  • Pacheco and Ernst (2007) Carlos Pacheco and Michael D. Ernst. 2007. Randoop: Feedback-directed Random Testing for Java. In Companion to the 22nd ACM SIGPLAN Conference on Object-oriented Programming Systems and Applications Companion (OOPSLA ’07).
  • Rawat et al. (2017) Sanjay Rawat, Vivek Jain, Ashish Kumar, Lucian Cojocar, Cristiano Giuffrida, and Herbert Bos. 2017. VUzzer: Application-aware Evolutionary Fuzzing. In Proceedings of the 2017 Network and Distributed System Security Symposium (NDSS ’17).
  • Ruderman (2015) Jesse Ruderman. 2015. jsfunfuzz. https://github.com/MozillaSecurity/funfuzz/tree/master/js/jsfunfuzz. (2015).
  • Sen and Agha (2006) Koushik Sen and Gul Agha. 2006. CUTE and jCUTE: Concolic Unit Testing and Explicit Path Model-checking Tools. In Proceedings of the 18th International Conference on Computer Aided Verification (CAV’06).
  • Sen et al. (2005) Koushik Sen, Darko Marinov, and Gul Agha. 2005. CUTE: A Concolic Unit Testing Engine for C. In Proceedings of the 10th European Software Engineering Conference Held Jointly with 13th ACM SIGSOFT International Symposium on Foundations of Software Engineering (ESEC/FSE-13).
  • Stephens et al. (2016) Nick Stephens, John Grosen, Christopher Salls, Andrew Dutcher, Ruoyu Wang, Jacopo Corbetta, Yan Shoshitaishvili, Christopher Kruegel, and Giovanni Vigna. 2016. Driller: Augmenting Fuzzing Through Selective Symbolic Execution. In Proceedings of the 2016 Network and Distributed System Security Symposium (NDSS ’16).
  • Wang et al. (2010) Tielei Wang, Tao Wei, Guofei Gu, and Wei Zou. 2010. TaintScope: A Checksum-Aware Directed Fuzzing Tool for Automatic Software Vulnerability Detection. In Proceedings of the 2010 IEEE Symposium on Security and Privacy (SP ’10).
  • Yang et al. (2011) Xuejun Yang, Yang Chen, Eric Eide, and John Regehr. 2011. Finding and Understanding Bugs in C Compilers. In Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ’11).
  • Yoo and Harman (2007) Shin Yoo and Mark Harman. 2007. Pareto efficient multi-objective test case selection. In Proceedings of the 2007 International Symposium on Software Testing and Analysis. ACM, 140–150.
  • Zalewski (2014) Michał Zalewski. 2014. American Fuzzy Lop. http://lcamtuf.coredump.cx/afl. (2014). Accessed August 18th, 2017.
  • Zalewski (2016a) Michał Zalewski. 2016a. FidgetyAFL. https://groups.google.com/d/msg/afl-users/fOPeb62FZUg/CES5lhznDgAJ. (2016). Accessed August 23rd, 2017.
  • Zalewski (2016b) Michał Zalewski. 2016b. Unique crashes as a metric. https://groups.google.com/d/msg/afl-users/fOPeb62FZUg/LYxgPYheDwAJ. (2016). Accessed August 24th, 2017.
  • Zalewski (2017) Michał Zalewski. 2017. American Fuzzy Lop Technical Details. http://lcamtuf.coredump.cx/afl/technical_details.txt. (2017). Accessed August 18th, 2017.
  • Zeller and Hildebrandt (2002) Andreas Zeller and Ralf Hildebrandt. 2002. Simplifying and Isolating Failure-Inducing Input. IEEE Transactions on Software Engineering 28, 2 (2002), 183–200.