SoK: The Progress, Challenges, and Perspectives of Directed Greybox Fuzzing

05/25/2020 ∙ by Pengfei Wang, et al. ∙ 0

Greybox fuzzing has been the most scalable and practical approach to software testing. Most greybox fuzzing tools are coverage guided, as code coverage is strongly correlated with bug coverage. However, since most covered code may not contain bugs, blindly extending code coverage is less efficient, especially for corner cases. Unlike coverage-based fuzzers, which extend the code coverage in an undirected manner, a directed fuzzer spends most of its time budget on reaching specific target locations (e.g., the bug-prone zone) without wasting resources stressing unrelated parts. Thus, directed greybox fuzzing (DGF) is particularly suitable for scenarios such as patch testing, bug reproduction, and special bug hunting. In this paper, we conduct the first in-depth study of directed greybox fuzzing. We investigate 26 state-of-the-art fuzzers (80% published after 2019) closely related to DGF, which have various directed types and optimization techniques. Based on the features of DGF, we extract 15 metrics to conduct a thorough assessment of the collected tools and systemize the knowledge of this field. Finally, we summarize the challenges and provide perspectives on this field, aiming to facilitate and boost future research on this topic.

I Introduction

To date, the most scalable and practical approach to software testing has been greybox fuzzing, which has drawn much attention in recent years [5, 7, 67, 64]. Compared to blackbox fuzzing and whitebox fuzzing, greybox fuzzing is efficient and effective. Based on the feedback information from the execution, greybox fuzzers use an evolutionary algorithm to generate new inputs and explore paths. Greybox fuzzing is widely used to test application software, libraries [3], and kernel code [47, 49, 27], and has been applied in practice to a variety of targets, including protocols [63, 69], smart contracts [22, 59], and multi-threaded programs [68, 51, 21].

Most greybox fuzzing tools are coverage guided, aiming to cover as many program paths as possible within a limited time budget. This is because, intuitively, code coverage is strongly correlated with bug coverage, and fuzzers with higher code coverage can find more bugs. However, it is not appropriate to treat all code in the program as equal, because most covered code may not contain bugs. For example, according to Shin et al. [48], only 3% of the source code files in Mozilla Firefox have vulnerabilities. Thus, testing software by blindly extending code coverage is less efficient, especially for corner cases. Since achieving full code coverage is difficult in practice, researchers have been trying to target the vulnerable parts of the code to improve efficiency and save resources. Directed fuzzing was proposed for this purpose.

Unlike coverage-based fuzzers, which blindly extend the path coverage, a directed fuzzer spends most of its time budget on reaching specific target locations (e.g., the bug-prone zone) without wasting resources stressing unrelated parts. Thus, directed greybox fuzzing is particularly suitable for scenarios such as patch testing, bug reproduction, and integration with other tools. Traditionally, directed fuzzers are based on symbolic execution [17, 34, 45, 36], which uses program analysis and constraint solving to generate inputs that exercise different program paths. Such directed fuzzers cast the reachability problem as an iterative constraint satisfaction problem [4]. However, since directed symbolic execution relies on heavy-weight program analysis and constraint solving, it suffers from scalability and compatibility limitations.

In 2017, Böhme et al. introduced the concept of Directed Greybox Fuzzing (DGF) [4]. By specifying a set of target sites in the program under test (PUT) and leveraging lightweight compile-time instrumentation of the PUT, a directed greybox fuzzer calculates the distance between the seed and the target to assist seed selection. By giving more mutation chances to the seeds that are closer to the target, it can steer the greybox fuzzing to reach the target locations. DGF casts reachability as an optimization problem to minimize the distance of the generated seeds to the targets [4]. Compared with directed symbolic execution, DGF has much better scalability and improves efficiency by several orders of magnitude. For example, Böhme et al. can reproduce Heartbleed within 20 minutes, while the directed symbolic execution tool KATCH [36] needs more than 24 hours [4]. By now, DGF has been studied in depth and has evolved beyond the primary pattern that depends on manually labeled target sites and distance-based metrics to prioritize the seeds. A great number of variations have been realized to boost software testing under different scenarios, such as fuzzers directed by target sequence [39, 31, 30], by semantic information [62, 26], by parser [37], by typestate [52], by sanitizer checks [40, 8], by memory usage [58], and by vulnerable probability [29]. Complex deep behavioral testing scenarios, such as use-after-free bugs [39, 52], memory consumption bugs [58], memory violation bugs [13], algorithmic complexity vulnerabilities [3], input validation bugs in robotic vehicles [28], and even the state space of the game Super Mario Bros [2], can be handled via optimized directed greybox fuzzers.

In this paper, we focus on the up-to-date research progress on DGF and conduct the first in-depth study of it. We systemize the knowledge of DGF by surveying the state-of-the-art directed greybox (hybrid) fuzzers and conducting a comprehensive analysis based on their assessment. In summary, we make the following contributions.

  • We investigate 26 state-of-the-art fuzzers (80% are published after 2019) closely related to DGF, which have various directed types and optimization techniques. We extract 15 metrics based on the features of DGF to conduct a thorough assessment of the collected tools and systemize the knowledge of this field.

  • Based on the assessment of the known works, we summarize six challenges to the research of DGF: binary code support, automatic target identification, the differentiated weight metric, the multi-targets relationship, missing indirect calls, and exploration-exploitation coordination. We disclose the deep reasons behind these challenges and propose possible solutions to address them.

  • We give perspectives on future research directions, aiming to facilitate and boost research of this field.

The rest of the paper is organized as follows: Section 2 reviews the background knowledge of greybox fuzzing and directed greybox fuzzing. Section 3 evaluates the collected state-of-the-art directed greybox fuzzers based on the extracted metrics and systemizes the optimization details of each work for the critical techniques in DGF. Section 4 summarizes the challenges of this field based on the current research progress. Section 5 discusses future perspectives, followed by conclusions.

II Background

This section provides the background knowledge on coverage-guided greybox fuzzing (CGF) and DGF. We use AFL and AFLGo to illustrate the respective principles. Then we compare DGF with CGF to show the difference. Finally, we summarize the application scenarios of DGF.

II-A Terminology

To avoid confusion arising from the differing presentations in the literature, we unify the terminology in fuzzing.

  • Fuzzing. In this paper, fuzzing refers to traditional blackbox fuzzing and greybox fuzzing. We exclude whitebox fuzzing as it depends on constraint solving of symbolic execution to generate inputs, which is quite different from evolutionary fuzzers based on mutation.

  • Testcase. A testcase is the input to a PUT, which is generated by randomly mutating a seed.

  • Seed. A seed is a testcase that is favored (because it triggers a new path or is close to the target) and retained for mutation to generate new testcases in the next fuzzing iteration.

  • Seed prioritization. Seed prioritization means evaluating and sorting the seeds according to their performance. Prioritized seeds are given more fuzzing chances.

  • Power schedule. Power schedule means determining the number of mutation chances to be applied to a seed (i.e., its energy).

II-B Coverage-guided Greybox Fuzzing

Coverage-guided greybox fuzzing (CGF) is the most prevalent fuzzing scheme; it aims to maximize code coverage to find hidden bugs. AFL (American fuzzy lop, http://lcamtuf.coredump.cx/afl/) is the state-of-the-art coverage-based greybox fuzzer, and many state-of-the-art greybox fuzzers [5, 33, 7, 64] are built on top of it. Here we use AFL as a representative to illustrate the principle of CGF. AFL uses lightweight instrumentation to capture basic block transitions and gain coverage information at runtime. Then it selects a seed from the seed queue and mutates the seed to generate testcases. If a testcase exercises a new path, it is added to the queue as a new seed. AFL favors seeds that trigger new paths and gives them preference over the non-favored ones. Compared to other instrumented fuzzers, AFL has a modest performance overhead.

Edge coverage. AFL obtains the execution trace and calculates the edge coverage by instrumenting the PUT at compile time. It inserts a random number for each branch jump at compile time and collects these inserted numbers from the register at run time to identify the basic block transition (i.e., the edge in the CFG), which is calculated as cur_location ⊕ (prev_location ≫ 1). Edge coverage is finer-grained and more sensitive than block coverage, as it takes into account the transition between blocks. It is also more scalable than path coverage, as it avoids path explosion. However, this scheme incurs collisions, because different edges may share the same identifier.
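The edge identifier can be illustrated with a short sketch. It replays a sequence of per-block random IDs and counts edge hits the way the formula above describes. The function name `trace_edges`, the dictionary used in place of a shared-memory bitmap, and the example block IDs are assumptions made for illustration; only the XOR-and-shift scheme comes from the text.

```python
MAP_SIZE = 1 << 16  # assumed bitmap size (64 KiB), used only to bound edge IDs

def trace_edges(block_ids, map_size=MAP_SIZE):
    """Replay per-block random IDs and return {edge_id: hit_count}."""
    hit_counts = {}
    prev_location = 0
    for cur_location in block_ids:
        # prev_location already holds the previous block ID shifted right by one,
        # so this is cur_location XOR (previous_block >> 1).
        edge = (cur_location ^ prev_location) % map_size
        hit_counts[edge] = hit_counts.get(edge, 0) + 1
        prev_location = cur_location >> 1
    return hit_counts

# The shift makes the edge A->B differ from the edge B->A.
a_to_b = trace_edges([0x1234, 0x4321])
b_to_a = trace_edges([0x4321, 0x1234])
```

Because the identifier is folded into a fixed-size map, two unrelated edges can land on the same slot, which is the collision problem mentioned above.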

Seed prioritization. AFL leverages the edge-coverage information to select seeds. It maintains a seed queue and fuzzes the seeds within it one by one. It labels some seeds as “favored” when they execute fast and are small in size. AFL uses a bitmap with edges as keys and top-rated seeds as values to maintain the best-performing seed for each edge. It selects favored seeds from the top_rated queue and gives them preference over the non-favored ones by giving them more fuzzing chances [43].
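A minimal sketch of this bookkeeping follows, assuming that "fast and small" is scored as execution time multiplied by input length and that favored seeds are picked greedily until every known edge is covered. The `Seed` class and the helper names are hypothetical and are not AFL's own data structures.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Seed:
    name: str
    exec_us: int       # execution time in microseconds
    length: int        # input size in bytes
    edges: frozenset   # edge IDs exercised by this seed

def fav_factor(seed):
    # Smaller is better: fast and small seeds win (assumed scoring).
    return seed.exec_us * seed.length

def update_top_rated(top_rated, seed):
    """Keep, for every edge, the cheapest seed that exercises it."""
    for edge in seed.edges:
        best = top_rated.get(edge)
        if best is None or fav_factor(seed) < fav_factor(best):
            top_rated[edge] = seed

def pick_favored(top_rated):
    """Greedily choose top-rated seeds until all known edges are covered."""
    covered, favored = set(), []
    for edge in sorted(top_rated):
        if edge not in covered:
            seed = top_rated[edge]
            favored.append(seed)
            covered |= seed.edges
    return favored

top_rated = {}
for s in (Seed("a", 100, 10, frozenset({1, 2})),
          Seed("b", 50, 10, frozenset({2, 3})),
          Seed("c", 500, 100, frozenset({1, 2, 3}))):
    update_top_rated(top_rated, s)
favored_names = [s.name for s in pick_favored(top_rated)]
```

In this toy queue the large, slow seed "c" covers every edge but is never favored, because cheaper seeds already cover each edge it reaches.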

Power schedule. AFL assigns energy to a seed according to its performance score, which is based on coverage (prioritize inputs that cover more paths), execution time (prioritize inputs that execute faster), and discovery time (prioritize inputs discovered later) [25]. In particular, if the test case exercises a new path, AFL doubles the assigned energy.

II-C Directed Greybox Fuzzing

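Algorithm 1: Directed Greybox Fuzzing.
Input: i, the initial input
Input: Target, a set of target locations
Output: BugInput, a set of buggy inputs
[Algorithm body: 13 numbered lines forming a while-true loop (lines 03 to 13) with two if-then branches (lines 07 and 09); the statements themselves are not recoverable in this copy.]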
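As a hedged illustration, the sketch below shows one possible directed greybox loop with the same inputs and output as Algorithm 1 (an initial input, a notion of targets, and a set of buggy inputs). It is not a reconstruction of the algorithm's exact statements. The hooks `run` and `distance_of`, the single-byte mutation, the greedy seed choice, and the toy program are all assumptions made for illustration.

```python
import random

def directed_fuzz(initial, run, distance_of, iterations, rng):
    """run(data) -> (path, crashed); distance_of(path) -> number, smaller is closer."""
    first_path, _ = run(initial)
    seed_queue = [(distance_of(first_path), initial)]
    seen_paths = {first_path}
    bug_inputs = []
    for _ in range(iterations):
        _, seed = seed_queue[0]                  # pick the seed closest to the target
        child = bytearray(seed)                  # mutate one random byte
        child[rng.randrange(len(child))] = rng.randrange(256)
        child = bytes(child)
        path, crashed = run(child)
        if crashed:
            bug_inputs.append(child)
        if path not in seen_paths:               # new path: retain as a seed
            seen_paths.add(path)
            seed_queue.append((distance_of(path), child))
            seed_queue.sort(key=lambda item: item[0])
    return bug_inputs

# Toy program under test: it "crashes" only when the input equals b"FUZ".
MAGIC = b"FUZ"

def toy_run(data):
    matched = 0
    for got, want in zip(data, MAGIC):
        if got != want:
            break
        matched += 1
    return matched, matched == len(MAGIC)

def toy_distance(path):
    # The path is the number of matched leading bytes; fewer left means closer.
    return len(MAGIC) - path

bugs = directed_fuzz(b"AAA", toy_run, toy_distance, 20000, random.Random(1))
```

A real directed fuzzer would combine this ordering with a power schedule rather than always mutating the closest seed; the greedy choice here only keeps the sketch short.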

In 2017, Böhme et al. introduced the concept of Directed Greybox Fuzzing (DGF) and implemented a tool called AFLGo [4] based on the modern greybox fuzzing framework. Unlike blindly increasing the path coverage in coverage-based greybox fuzzing, DGF aims to reach a pre-identified location in the code (potentially the buggy part) and spends most of its time budget on reaching target locations without wasting resources stressing unrelated parts.

Here we use AFLGo as the representative to illustrate how DGF works. AFLGo follows the same general principles and architecture as coverage-guided fuzzing. It relies on compile-time instrumentation to obtain the running status of the PUT, especially the current execution path and path coverage information. However, the seed prioritization in AFLGo is based on “distance” instead of new path coverage: the directedness in DGF is essentially realized by prioritizing the seeds that are closer to the targets. The distance of a seed is the average, over the basic blocks on the seed’s execution trace, of each block’s distance to the target basic blocks, where the block distance is determined by the number of edges in the call graph and control-flow graphs of the program. Böhme et al. [5] view the greybox fuzzing process as a Markov chain that can be efficiently navigated using a “power schedule”. They leverage a simulated annealing strategy to gradually assign more energy to a seed that is “closer” to the targets than to a seed that is “further away”. They cast reachability as an optimization problem to minimize the distance of the generated seeds to the targets.
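The distance computation can be sketched on a single control-flow graph. The simplifications are assumptions: the call-graph level is ignored, every edge has weight one, and distances to several targets are combined with a harmonic-style sum, which is how we read the AFLGo description. The graph and function names are hypothetical.

```python
from collections import deque

def distances_to(target, cfg):
    """BFS over reversed edges: edge count from every block to `target`."""
    reverse = {}
    for src, dsts in cfg.items():
        for dst in dsts:
            reverse.setdefault(dst, []).append(src)
    dist = {target: 0}
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for pred in reverse.get(node, []):
            if pred not in dist:
                dist[pred] = dist[node] + 1
                queue.append(pred)
    return dist

def block_distance(block, targets, cfg):
    """Combine distances to all reachable targets; None if none is reachable."""
    found = [distances_to(t, cfg).get(block) for t in targets]
    reachable = [d for d in found if d is not None]
    if not reachable:
        return None
    if 0 in reachable:
        return 0.0
    return 1.0 / sum(1.0 / d for d in reachable)

def seed_distance(trace, targets, cfg):
    """Average block distance over the blocks of an execution trace."""
    values = [block_distance(b, targets, cfg) for b in trace]
    values = [v for v in values if v is not None]
    return sum(values) / len(values) if values else None

cfg = {"entry": ["a", "b"], "a": ["t"], "b": ["c"], "c": ["t"], "t": []}
near = seed_distance(["entry", "a", "t"], ["t"], cfg)
far = seed_distance(["entry", "b", "c"], ["t"], cfg)
```

The seed whose trace actually reaches the target block gets the smaller distance and would therefore be prioritized.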

The exploration-exploitation problem. For DGF, the whole fuzzing process is divided into the exploration phase and the exploitation phase [4]. The exploration phase is designed to uncover as many paths as possible. Like many coverage-guided fuzzers, DGF in this phase favors the seeds that cover new paths and prioritizes them. This is because new paths increase the potential to lead to the targets. It is particularly necessary when the initial seeds are quite far from the targets. Then, based on the known paths, the exploitation phase is invoked to drive the engine to the target code areas. In this phase, Böhme et al. leverage a simulated annealing-based power schedule to gradually assign more energy to a seed that is “closer” to the targets than to a seed that is “further away”. The intuition is that if the path that the current seed executes is “closer” to any of the expected paths that can reach the target, more mutations on that seed should be more likely to generate expected seeds that fulfill the target. The exploration-exploitation tradeoff lies in how to coordinate these two phases. Böhme et al. use a fixed splitting of the exploration and exploitation phases. For example, for 24-hour testing, AFLGo uses 20 hours for the exploration and then 4 hours for the exploitation.
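One way to picture the annealing-based schedule is the sketch below. It follows the shape of the exponential cooling schedule as we read it in the AFLGo paper: a temperature that decays from 1 to 0.05 at the time-to-exploitation `t_x`, and an energy multiplier driven by the normalized seed distance. The constants and function names should be treated as assumptions, not as the tool's exact implementation.

```python
def temperature(t, t_x):
    """Exponential cooling: 1.0 at t = 0 and 0.05 at t = t_x."""
    return 20.0 ** (-t / t_x)

def annealed_preference(norm_distance, t, t_x):
    """Value in [0, 1]; norm_distance is 0 for the closest seed, 1 for the farthest."""
    temp = temperature(t, t_x)
    return (1.0 - norm_distance) * (1.0 - temp) + 0.5 * temp

def energy(base_energy, norm_distance, t, t_x):
    """Scale the baseline energy by a factor between 2**-5 and 2**5."""
    p = annealed_preference(norm_distance, t, t_x)
    return base_energy * 2.0 ** (10.0 * p - 5.0)

# With t_x = 20 hours: at the start distance is ignored (pure exploration);
# by hour 20 near seeds receive far more energy than distant ones.
start_near = energy(100, 0.0, 0, 20)
start_far = energy(100, 1.0, 0, 20)
late_near = energy(100, 0.0, 20, 20)
late_far = energy(100, 1.0, 20, 20)
```

Under this reading, the fixed 20-hour exploration in the example above corresponds to choosing `t_x` as 20 hours, after which the schedule is almost entirely distance driven.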

II-D Difference between CGF and DGF

(1) Seed prioritization. A major difference between CGF and DGF lies in the seed prioritization. Since CGF aims to maximize the path coverage, CGF gives preference to seeds that trigger new paths. DGF, in contrast, aims to reach specific locations in the code. Thus, it prioritizes seeds that are “closer” to the targets, and the evaluation metrics of the seeds vary a lot, including distance, coverage, path, and probability.

(2) Target involvement. CGF expands the coverage in an undirected manner. For DGF, however, a set of targets must be marked in advance, manually or automatically, to guide the fuzzing process. The target selection can affect the performance of DGF. For example, selecting critical sites, such as the memory allocation function malloc() or the string manipulation function strcpy(), as targets is more likely to trigger memory corruption bugs. Besides, we can leverage the relationship among targets to accelerate the detection of complex behavioral bugs, such as use-after-free. Thus, the involvement of targets gives more chances to optimize DGF, and customized techniques that are specific to DGF can be applied.

(3) Exploration-exploitation. Researchers [43, 65] model the greybox fuzzing process as a “multi-armed bandit problem” where the seeds are considered as arms of a multi-armed bandit. For coverage-based greybox fuzzing, the whole process is essentially an exploration-exploitation tradeoff, where exploration stands for trying as many seeds as possible while exploitation means mutating a certain seed as much as possible. For DGF, the exploration-exploitation problem lies in coordinating the exploration phase and the exploitation phase. In the exploration phase, DGF favors the seeds that cover new paths and prioritizes them to increase the potential to reach the targets. In the exploitation phase, it gives more mutation chances to seeds that are more likely to generate inputs that reach the target.

II-E Application of DGF

DGF is promising as it is especially suitable and effective for specific testing scenarios. We summarize the following common practical applications of DGF.

  • Patch testing. DGF can be used to test whether a patch is complete and compatible. A patch is incomplete when a bug can be triggered by multiple inputs [54]; for example, CVE-2017-15939 is caused by an incomplete fix for CVE-2017-15023 [6]. Meanwhile, a patch can introduce new bugs [53]. For example, CVE-2016-5728 is introduced by a careless code update. Thus, directed fuzzing towards problematic changes or patches has a higher chance of exposing bugs.

  • Bug reproduction. DGF is also useful when reproducing a known bug without the buggy input. For example, due to concerns such as privacy, some applications (such as video players) are not allowed to send the input file. Thus, the in-house development team can use DGF to reproduce the crash with the method calls in the stack trace and some environmental parameters [4]. DGF is also helpful when generating Proof-of-Concept (PoC) inputs of disclosed vulnerabilities given bug report information [44, 62]. In fact, DGF is especially needed because 45.1% of the usual bug reports cannot be reproduced due to missing information and users’ privacy concerns [38].

  • Knowledge boost. DGF can boost program testing by integrating the knowledge from a human analyst or auxiliary techniques. Human-in-the-loop is commonly used in software testing; based on previous experience, an analyst can help to identify the critical syscalls or security-sensitive program sites (e.g., the memory allocation function malloc() or the string manipulation function strcpy()) to guide fuzzing to error-prone parts [2]. Auxiliary techniques, such as symbolic execution [44], taint analysis [55], static analysis [59], and artificial intelligence [29], can be leveraged to enhance directedness and overcome roadblocks in the testing.

  • Energy saving. Another interesting application of DGF is when the testing resource is limited, for example, when fuzzing IoT devices. Under this circumstance, to save the time and computational resources spent on code regions that are unlikely to contain bugs, identifying critical code areas to guide the testing is more efficient than testing the whole program in an undirected manner.

  • Special bug hunting. Finally, DGF can be applied to hunting special bugs based on customized indicators, for example, finding uncontrolled memory consumption bugs under the guidance of memory usage [58], or finding use-after-free bugs under the guidance of typestate violation [52]. With DGF, the efficiency of discovering behaviorally complex bugs can be greatly improved.

III Assessment of the State-of-the-art Works

During the last three years, DGF has drawn the attention of the whole field, and many follow-ups have appeared. In this section, we collect and investigate 26 state-of-the-art fuzzers that are relevant to DGF. To reflect the state-of-the-art research, we choose to include fuzzers from top-level conferences on security and software engineering: the ACM Conference on Computer and Communications Security (CCS), the IEEE Symposium on Security and Privacy (S&P), the USENIX Security Symposium (USEC), the Network and Distributed System Security Symposium (NDSS), and the International Conference on Software Engineering (ICSE). To reflect the most up-to-date research progress, we also include works from the preprint website arXiv.org. For writings that appear in other venues or mediums, we include them based on our own judgment of their relevance. To conduct a thorough assessment, we extract 15 metrics based on the features of DGF. We further divide the metrics into three categories: basic information, implementation details, and optimization methods. In the following, we concentrate on properties that relate to the critical techniques of DGF, including directed type, input optimization, seed prioritization, power assignment, mutation scheduling, and data-flow analysis.

Column groups: basic information; implementation details; optimization methods.
Columns: tool name and publication year; fuzzing type; directed type; kernel support; specific bug type; base tool; binary support; indirect call support; multi-targets support; seed prioritization metric; target auto-identification; data-flow analysis; input optimization; mutation optimization; adaptive explore-exploit.

’17 AFLGo [4] G target sites - AFL distance
’17 SemFuzz [62] G Semantic - Syzkaller -
’18 Hawkeye [6] G target sites - AFL distance
’18 TIFF [20] G bug MCB VUzzer -
’19 ProFuzzer [61] G bug MCB AFL -
’19 LOLLY [31] G target sequence - AFL sequence coverage
’19 V-Fuzz [29] G vulnerable probability - VUzzer IDA fitness score
’19 Wüstholz [59] G target sites - HARVEY BRAN path
’19 Memfuzz [13] G memory access - AFL code coverage; new memory access
’19 TortoiseFuzz [56] G vulnerable function MCB AFL function, loop, basic block
’19 PFUZZER [37] G parser - -
’19 RVFUZZER [28] G control IVB - control instability
’20 RDFuzz [60] G target sites - AFL distance; frequency
’20 TOFU [57] G target sites - - distance
’20 UAFuzz [39] G target sequence UAF AFL QEMU UAF-based distance; target similarity; cut-edge coverage
’20 UAFL [52] G typestate UAF AFL operation sequence coverage
’20 Memlock [58] G memory usage MC AFL function & operation memory consumption
’20 IJON [2] G human annotations - AFL
’20 FuzzGuard [70] G target sites - AFLGo distance
’20 ParmeSan [40] G sanitizer checks - Angora distance
’20 Ankou [35] G combinatorial difference - self-AFL execution distance
’16 SeededFuzz [55] H critical sites - KLEE critical site coverage
’19 DrillerGo [26] H semantic - AFL Angr
’19 1DVUL [44] H binary patches - Driller QEMU distance
’19 SAVIOR [8] H sanitizer OOB, IOF, OS AFL bug potential coverage
’20 Berry [30] H target sequence - AFL similarity between the target execution trace and the enhanced target sequence
  • G: greybox fuzzing, H: hybrid fuzzing, UAF: use-after-free, MC: memory consumption, OOB: out-of-boundary, IOF: integer overflow, OS: oversized shift, IVB: input validation bug, MCB: memory corruption bug

TABLE I: Comparison of directed greybox fuzzers

III-A Directed Type

Although this paper focuses on directed greybox fuzzing (noted as G in Table I), some of the works we investigated adopt symbolic execution to enhance the directedness, forming directed hybrid fuzzing (noted as H). We also include them in the table.

For the directed type, DGF was initially directed by target sites that are manually labeled in the PUT, as in AFLGo [4] and Hawkeye [6]. Then, researchers noticed that the relationship among the targets is also helpful. For example, in order to trigger use-after-free vulnerabilities, a sequence of operations (e.g., allocate memory, use memory, and free memory) must be executed in a specific order. UAFuzz [39] and UAFL [52] leverage target sequences instead of target sites to find use-after-free vulnerabilities. LOLLY [31] also uses target statement sequences to guide greybox fuzzing to trigger bugs that result from the sequential execution of multiple statements. Berry [30] upgrades LOLLY with hybrid fuzzing to alleviate the randomness of greybox fuzzing when reaching deep targets along complex paths. Apart from the target sequence, researchers have proposed various mechanisms to direct the fuzzing process. Memlock [58] is directed by memory usage to find uncontrolled memory consumption bugs. V-Fuzz [29] is directed by vulnerable probability, which is predicted by a deep learning model to guide the fuzzing process to potentially vulnerable code areas. SemFuzz [62] and DrillerGo [26] leverage semantic information retrieved from CVE descriptions and git logs to direct fuzzing and generate PoC exploits. 1DVUL [44] is directed by patch-related branches that directly change the original data flow or control flow to discover 1-day vulnerabilities. SAVIOR [8] and ParmeSan [40] are directed by information from sanitizers. IJON [2] leverages annotations from a human analyst to guide the fuzzer to overcome significant roadblocks. RVFUZZER [28] is directed by control instability to find input validation bugs in robotic vehicles. PFUZZER [37] is directed explicitly at input parsers to cover the space of possible inputs well. DGF has evolved from reaching target locations to hunting complex deep behavioral bugs.

III-B Input Optimization

Once the targets are marked, DGF needs to generate a seed input to invoke the fuzzing process. A good seed input can drive the fuzzing process closer to the target location and improve the performance of the later mutation process. According to Zong et al., on average, over 91.7% of the inputs of AFLGo cannot reach the buggy code [70]. Thus, optimizing the input generation leaves much room to improve the directedness of DGF. SeededFuzz [55] focuses on improving the generation and selection of initial seeds to achieve the goal of directed fuzzing. It utilizes dynamic taint analysis to identify the bytes of seeds that can influence values at security-sensitive program sites, generates new inputs by mutating the relevant bytes, and feeds them to target programs to trigger errors. FuzzGuard [70] uses a deep-learning-based approach to filter out unreachable inputs before exercising them. It views program inputs as a kind of pattern and trains a model on a large number of inputs from previous executions, each labeled with whether it reached the target code. Then, FuzzGuard utilizes the model to predict the reachability of the newly generated inputs without running them, which saves the time spent on real execution.

A fuzzer can perform much better if it generates inputs that respect the input grammar. TOFU [57] takes advantage of the known structure of the program’s inputs in the form of a protobuf [18] specification to generate valid inputs. TOFU also augments the input space that it explores to include command-line flags by dividing the fuzzing process into syntactic-fuzzing and semantic-fuzzing. However, it usually takes one or two days to implement the input-language grammar even if the user is familiar with the input language. SemFuzz [62] leverages information (syscalls and parameters) retrieved from CVE descriptions and git logs to build designed seed inputs that increase the probability of hitting the vulnerable functions. TIFF [20] and ProFuzzer [61] identify input types to assist mutation towards maximizing the likelihood of triggering memory corruption bugs. PFUZZER [37] is a syntax-driven approach that specifically targets input parsers to maximize the input space coverage without generating plausible inputs.

III-C Seed Prioritization

The crux of DGF is selecting and prioritizing the seeds that perform better in directedness under certain metrics. We summarize three prevalent metrics widely adopted by modern works, including distance, coverage, and probability.

III-C1 Distance

As we can see from Table I, 35% (9/26) of the directed fuzzers prioritize seeds based on distance and give preference to the seeds that are closer to the target. As a groundbreaking work, AFLGo [4] instruments the source code at compile time and calculates the distances to the target basic blocks by the number of edges in the call graph and control flow graphs of the program. Then, at runtime, it aggregates the distance values of each exercised basic block and computes an average value to evaluate the seed. Many follow-ups inherit this method, such as TOFU [57], ParmeSan [40], and 1DVUL [44]. RDFuzz [60] combines distance with frequency to prioritize seeds. The code areas are separated into high-frequency and low-frequency areas by counting the execution frequency. The inputs are classified into four types by high/low distance and high/low frequency. In the exploration phase, the low-frequency seeds are prioritized to improve the coverage, and in the exploitation phase, the low-distance seeds are preferred to reach the target code areas. UAFuzz is a tailored directed greybox fuzzer for complex behavioral use-after-free vulnerabilities [39]. Different from the CFG-based distance, it uses a distance metric of call chains leading to the target functions that are more likely to include both allocation and free functions.

III-C2 Similarity & Coverage

In addition to distance, similarity is another useful metric, which indicates the coverage of certain target forms, such as functions, locations, and bug traces. This metric is particularly suitable when there are many targets. Hawkeye [6] leverages a static analysis of the PUT and combines the basic block trace distance with covered function similarity for the seed prioritization and power scheduling. LOLLY [31] uses a user-specified program statement sequence as the target and takes the seed’s ability to cover the target sequences (i.e., sequence coverage) as a metric to evaluate the seed. UAFL [52] uses the operation sequence coverage as the feedback to guide the test generation to progressively cover the operation sequences that are likely to trigger use-after-free vulnerabilities. UAFuzz [39] also uses a sequenceness-aware target similarity metric to measure the similarity between the execution of a seed and the target UAF bug trace. The sequenceness-aware target similarity metric concretely assesses how many targets a seed execution trace covers at runtime and takes the ordering of the targets into account. Berry [30] takes into account the coverage of nodes in the target sequences and their execution context, and enhances the target sequences with necessary nodes, namely the basic blocks required to reach the nodes in the target sequences for all paths. In addition to the branch coverage, Berry also considers the similarity between the target execution trace and the enhanced target sequence to prioritize the seeds. SAVIOR [8] prioritizes seeds that have higher potential to trigger vulnerabilities based on the coverage of labels predicted by UBSan [12]. TortoiseFuzz [56] differentiates edges that are more likely to lead to vulnerabilities based on the fact that memory corruption vulnerabilities are closely related to sensitive memory operations. It prioritizes inputs by a combination of coverage and security impact, which is represented by the memory operations at three levels of granularity: function, loop, and basic block.
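An order-aware coverage score of this kind can be sketched as follows. The scoring rule (the fraction of the target sequence that the trace matches in order) is an assumption chosen for simplicity; the metrics used by LOLLY, UAFL, UAFuzz, and Berry differ in their details, and the location names are hypothetical.

```python
def sequence_coverage(trace, target_sequence):
    """Fraction of the target sequence matched, in order, by the trace."""
    if not target_sequence:
        return 1.0
    matched = 0
    for location in trace:
        # Advance only when the next expected target is hit, so ordering matters.
        if matched < len(target_sequence) and location == target_sequence[matched]:
            matched += 1
    return matched / len(target_sequence)

uaf_sequence = ["alloc", "free", "use"]
partial = sequence_coverage(["main", "alloc", "parse", "free", "exit"], uaf_sequence)
wrong_order = sequence_coverage(["free", "alloc", "use"], uaf_sequence)
full = sequence_coverage(["alloc", "free", "log", "use"], uaf_sequence)
```

A trace that hits all three locations in the wrong order scores lower than one that hits only two of them in the right order, which is the property that distinguishes sequence coverage from plain target coverage.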

III-C3 Probability

Probability is a promising metric. V-Fuzz [29] predicts the vulnerable probability of functions with a deep-learning-based model and gives each basic block in a vulnerable function a static score. Then, for each input, it calculates the sum of the static scores of all the basic blocks on its execution path and prioritizes the inputs with higher scores. The labels made by UBSan in SAVIOR [8] also reflect the bug potential of the corresponding code areas.
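The scoring step described for V-Fuzz reduces to a sum over the executed blocks. In the sketch below the static scores are made-up numbers standing in for the output of the prediction model, and the function names are hypothetical.

```python
def path_score(path_blocks, static_score):
    """Sum the static scores of the basic blocks on an execution path."""
    return sum(static_score.get(block, 0.0) for block in path_blocks)

def prioritize(paths_by_input, static_score):
    """Order input names by descending path score."""
    return sorted(
        paths_by_input,
        key=lambda name: path_score(paths_by_input[name], static_score),
        reverse=True,
    )

# Assumed scores: b1 and b2 sit in a function predicted to be vulnerable.
static_score = {"b1": 0.9, "b2": 0.9, "b3": 0.1}
paths_by_input = {"x": ["b3"], "y": ["b1", "b2"], "z": ["b1", "b3"]}
ranking = prioritize(paths_by_input, static_score)
```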

III-C4 Path

Wüstholz et al. [59] prioritize seeds at the path level instead of the basic block (edge) level. For each seed, before it is added to the test suite, an online static lookahead analysis determines a path prefix for which all suffix paths are unable to reach a target location. In this way, the power schedule of the fuzzer can allocate its resources more strategically, such that more effort is spent on exercising program paths that might reach the target locations.

III-D Power Assignment

After the seeds are selected and prioritized, the preferred seeds are given more power, namely more mutation chances. Although power assignment is crucial for DGF, very few works try to optimize this step. AFLGo [4] uses a simulated annealing-based power schedule to gradually assign more energy to seeds that are closer to the target locations while reducing energy for seeds that are further away. Unlike traditional random walk scheduling, which always accepts better solutions and may therefore be trapped in a local optimum, simulated annealing accepts a solution that is not as good as the current one with a certain probability, so it is possible to jump out of the local optimum and reach the globally optimal solution [31]. Berry [30] also applies simulated annealing to the seed energy scheduling scheme to achieve global optimization. Hawkeye [6] also adopts simulated annealing but adds prioritization, so that seeds closer to the target are mutated first, which further improves the directedness. LOLLY [31] adopts an optimized simulated annealing-based power schedule to achieve maximum sequence coverage. Controlled by a temperature threshold, the cooling schedule in the exploration stage randomly mutates the provided seeds to generate many new inputs, while in the exploitation stage, it generates more new inputs from seeds that have higher sequence coverage.
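The difference between always accepting better solutions and accepting worse ones with some probability can be made concrete with the textbook Metropolis acceptance rule. This is the generic simulated annealing criterion, shown here under the assumption that lower cost (for example, lower distance) is better; it is not the specific schedule of any fuzzer discussed above.

```python
import math
import random

def accept(current_cost, candidate_cost, temperature, rng=random):
    """Metropolis rule: always accept improvements, sometimes accept regressions."""
    if candidate_cost <= current_cost:
        return True
    if temperature <= 0:
        return False  # fully cooled: behaves like greedy acceptance
    worsening = candidate_cost - current_cost
    return rng.random() < math.exp(-worsening / temperature)

rng = random.Random(0)
# Hot: a regression of 1.0 is accepted in a noticeable share of trials.
hot = sum(accept(1.0, 2.0, 5.0, rng) for _ in range(1000))
# Cold: the same regression is almost never accepted.
cold = sum(accept(1.0, 2.0, 0.05, rng) for _ in range(1000))
```

As the temperature cools, the rule converges to greedy acceptance, which matches the shift from exploration to exploitation described in this section.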

III-E Mutation Scheduling

Some fuzzers also optimize mutation strategies to assist directed fuzzing, which is mainly realized by classifying the mutators into different granularities. Hawkeye [6] leverages an adaptive mutation strategy that categorizes the mutators as coarse-grained and fine-grained. Coarse-grained mutators change bulks of bytes during mutation, while fine-grained mutators involve only a few byte-level modifications, insertions, or deletions. Hawkeye gives coarse-grained mutations less chance when a seed can reach the target function. Once the seed reaches the targets, the number of fine-grained mutations increases and the number of coarse-grained mutations decreases. In practice, the scheduling of mutators is controlled by empirical values. Similarly, V-Fuzz [29] classifies the mutation strategies into slight mutation and heavy mutation and dynamically adjusts the mutation strategy via a threshold according to the actual fuzzing state. SemFuzz [62] performs a similar classification, except that it focuses on syscalls. SemFuzz applies coarse mutation to the inputs to find a syscall sequence that can move the execution towards the “vulnerable functions”. After that, it switches to fine-grained mutation on the syscall sequence to monitor the “critical variables”. ProFuzzer [61] applies different mutation policies according to the input field types recognized by input type probing.
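The adaptive choice between coarse-grained and fine-grained mutators can be sketched as a biased coin flip. The probabilities 0.5 and 0.1 below are placeholder values standing in for the empirical constants mentioned above, and the function names are hypothetical.

```python
import random


def coarse_probability(reaches_target: bool,
                       default: float = 0.5,
                       reduced: float = 0.1) -> float:
    """Chance of picking a coarse-grained mutator (placeholder values)."""
    return reduced if reaches_target else default


def pick_mutator(reaches_target: bool, rng: random.Random) -> str:
    """Choose a mutator class for the next mutation of a seed."""
    if rng.random() < coarse_probability(reaches_target):
        return "coarse"   # e.g., replace or splice bulks of bytes
    return "fine"         # e.g., bit flips, small insertions or deletions
```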

III-F Data-flow Analysis

Compared to control-flow analysis, data-flow analysis is less prevalent in DGF. Although data-flow analysis helps optimize mutation strategy and input generation, it usually enlarges the runtime overhead. RDFuzz [60] leverages a disturb-and-check method to identify and protect the distance-sensitive content of the input, which is vital to maintaining the distance. Preventing such content from being altered during mutation helps to approach the target code location more efficiently. UAFL [52] adopts an information flow analysis to identify the relationship between the input and the program variables in conditional statements, and assigns a higher mutation probability to the input bytes with high information flow strength, as they are more likely to change the values of the target statement. SemFuzz [62] tracks the kernel function parameters that the critical variables depend on via backward data-flow analysis on the critical variables. SeededFuzz [55] utilizes dynamic taint analysis to identify the bytes of seeds that can influence values at security-sensitive program sites. PFUZZER [37] uses dynamic tainting of inputs to relate each processed value to the input characters it is derived from. TIFF [20] infers input types by means of in-memory data-structure identification and dynamic taint analysis, and its type-based mutation increases the probability of triggering memory corruption vulnerabilities.
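A disturb-and-check pass in the spirit of the RDFuzz description can be sketched as follows. This is our own simplified reading, not RDFuzz's algorithm: the byte-flip disturbance, the `distance_of` callback, and both function names are assumptions.

```python
import random
from typing import Callable, Set


def distance_sensitive_offsets(seed: bytes,
                               distance_of: Callable[[bytes], float]) -> Set[int]:
    """Disturb each byte in turn and check whether the seed distance changes."""
    baseline = distance_of(seed)
    sensitive = set()
    for offset in range(len(seed)):
        disturbed = bytearray(seed)
        disturbed[offset] ^= 0xFF          # disturb one byte
        if distance_of(bytes(disturbed)) != baseline:
            sensitive.add(offset)          # check: distance depends on it
    return sensitive


def mutate_preserving(seed: bytes, sensitive: Set[int],
                      rng: random.Random) -> bytes:
    """Mutate one byte while leaving the distance-sensitive bytes untouched."""
    candidates = [i for i in range(len(seed)) if i not in sensitive]
    if not candidates:
        return seed
    mutated = bytearray(seed)
    mutated[rng.choice(candidates)] = rng.randrange(256)
    return bytes(mutated)
```

In a real fuzzer `distance_of` would run the PUT and compute the seed distance from the instrumentation feedback, so each disturbance costs one execution.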

IV Challenges and Solutions

In this section, based on the assessment of the state-of-the-art directed greybox fuzzers, we summarize the following challenges in DGF and propose potential solutions.

IV-A Binary Code Support

Most of the known DGF works [4, 6, 60] are implemented on top of AFL and inherit its compile-time instrumentation scheme to feed back the execution status or calculate the distance-based metric. A significant drawback of such a scheme is its dependence on the PUT source code, which makes it unsuitable for testing scenarios in which the source code is unavailable, such as commercial off-the-shelf (COTS) software or security-critical programs that rely partly on third-party libraries.

Binary-level DGF is less prevalent for the following reasons. First, heavy runtime overhead. A straightforward solution for binary code testing is leveraging a full-system emulator. For example, UAFuzz [39] handles binary code and extracts execution paths via QEMU. However, emulator-based tools are usually less efficient. For example, the execution speed of vanilla AFL is 2X - 5X faster than its QEMU mode [9]. Second, difficulty in collecting target information. For programs with source code, target information can be gathered from various channels, such as CVE vulnerability descriptions, changes made in git commit logs, and human experience with critical sites in the source code. For binary code, however, we can only extract target information from bug traces. Third, difficulty in labeling the targets. With source code instrumentation, the targets can be labeled directly in the source code (e.g., cxxfilt.c, line 100). This is much more difficult at the binary level. Since binary code is hard to read, we have to disassemble it. A viable way is to process the binary code with tools such as IDA Pro [39] and then label the target with its virtual address. However, this is inconvenient and time-consuming.

A viable solution to alleviate the performance limitation is hardware assistance. Intel PT is a lightweight hardware feature in recent Intel processors that captures tracing data about program execution, which replaces the need for dynamic instrumentation. Intel PT can trace program execution on the fly with negligible overhead. Using the packet trace captured by Intel PT along with the corresponding binary of the PUT, a security analyst can fully reconstruct the PUT’s execution path. On average, the PT-based approach is 4.3x faster than QEMU-AFL [66]. Earlier hardware features such as Intel Last Branch Record also perform program tracing, but their output is stored in special registers instead of the main memory, which limits the trace size. There have been attempts at CGF with PT, such as kAFL [47], PTfuzz [66], Ptrix [9], and Honggfuzz [50]. However, PT has not yet been applied to DGF. For the problem of target identification and labeling on binary code, we can leverage a machine-learning-based approach [29] or a heuristic binary diffing approach to automatically identify vulnerable code at the binary level.

IV-B Automatic target identification

Most of the known directed fuzzers require the analyst to mark the targets manually (e.g., AFLGo, Hawkeye). They rely on prior knowledge of the target sites, such as the line number in the source code or the virtual memory address in the binary code, to label the target and steer the execution to the desired location [4, 6]. However, obtaining such prior knowledge is challenging, especially for binary code. Among the works we investigated, about half (12/26) try to optimize the way the targets are identified. Researchers use auxiliary metadata, such as changes made in the PUT code based on git commit logs [62], information extracted from bug traces [39], or information from CVE vulnerability descriptions [26], to gather interesting targets. Nevertheless, these approaches still rely on manual effort to process the information and mark the targets on the PUT. They are unsuitable when fuzzing a PUT for the first time or when well-structured information is unavailable.

To achieve automatic target identification, we can use static analysis tools to find potentially dangerous areas in the PUT [11, 14, 55]. However, these tools are often specific to the bug types and programming languages used [40]. Another direction is leveraging compiler sanitizer passes, such as UBSan [12], to annotate potential bugs in the PUT [40, 8]. For bytecode, 1DVUL [44] identifies patch-related target branches by extracting the functions that differ, as well as their differing basic blocks, through binary-level comparison based on Bindiff. A deep learning-based method can also predict vulnerable code effectively, and the prediction can then be used to guide the fuzzing [29]. Finally, an attack surface identification component [15] can also be used to identify vulnerable targets for DGF automatically.

IV-C Differentiated weight metric

In most of the state-of-the-art directed greybox fuzzers, the prioritization of seeds is based on equal-weight metrics. Take the widely used distance-based metric as an example. The ability to reach the target is measured by the distance between the seed and the target, which is represented by the number of edges, namely the transitions among basic blocks. However, such a measurement ignores the fact that different branch jumps are taken with different probabilities. Thus, it is inaccurate and potentially limits the performance of directed fuzzing.

Fig. 1: Probability bias when measuring the distance.

We use the following example to illustrate the difference. Figure 1 shows a CFG fragment in which the input x is an integer ranging from 0 to 9. The probability of jumping from node A to node C is 0.1, and from node A to node B is 0.9. The probabilities of the other jumps follow from their branch conditions in the same way. Measured by the number of branch jumps, the distance of $A \to C$ is shorter than that of $A \to G$, because $A \to C$ takes only one jump whereas $A \to G$ takes three. However, when we take the branch jump probability into account, the probability of $A \to C$ is 0.1, whereas the probability of $A \to G$ is $0.9 \times 0.7 \times 0.5 \approx 0.3$. $A \to G$ is therefore more likely to be taken than $A \to C$ and should be considered to have a “shorter” distance. Thus, it is more reasonable to consider the weight difference as well when calculating the distance to guide the seed prioritization. The same rationale applies to the coverage-based metric.
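The comparison can be reproduced numerically. The sketch below uses only the branch probabilities quoted in the example; turning a probability into a distance by summing negative log-probabilities is one possible weighting that we assume for illustration, not a scheme taken from the surveyed tools.

```python
import math

# Branch probabilities along each route, as given in the Figure 1 discussion.
routes = {
    "A->C": [0.1],
    "A->G": [0.9, 0.7, 0.5],
}


def hop_distance(probs):
    """Equal-weight distance: the number of branch jumps."""
    return len(probs)


def path_probability(probs):
    """Probability of taking every jump on the route."""
    return math.prod(probs)


def weighted_distance(probs):
    """Assumed weighting: -log(p) per jump, so likelier routes are shorter."""
    return sum(-math.log(p) for p in probs)


for name, probs in routes.items():
    print(name, hop_distance(probs),
          round(path_probability(probs), 3),
          round(weighted_distance(probs), 3))
```

Under the hop count, $A \to C$ is the closer route; under the probability-aware weighting the order flips, which is the bias the example describes.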

One possible solution is taking the branch jump probability into account. When evaluating the reachability of the target based on probability, each seed is prioritized by how likely it is to generate an input that reaches the target, namely the probability of converting the current execution path of this seed into a path that goes through the target. Since an execution path can be viewed as a Markov chain of successive branches [5], the probability of a path can be calculated from the probabilities of all the branches within the path, and each branch probability can be estimated statistically as the ratio of executions that take that branch.

A Monte Carlo based method can be leveraged to estimate such probability. The density of the stationary distribution formally describes the likelihood that the fuzzer exercises a certain path after a certain number of iterations. A Monte Carlo based method requires two conditions: 1) the sampling should be random; 2) the sample scale should be large [67]. Fortunately, the fuzzing process by nature fulfills these requirements. The execution paths exercised by randomly mutated testcases can be considered random samples, which meets the first requirement. The high throughput of the testcases generated by fuzzers makes the estimation statistically meaningful, which satisfies the second requirement. Thus, regarding fuzzing as a sampling process, we can statistically estimate the branch jump probability in a lightweight fashion.
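As a minimal sketch of this estimation, the class below counts how often each observed edge is taken and derives branch and path probabilities from the counts. The class and method names are hypothetical, and the demo assumes, purely for illustration, that the branch from A to C is taken when a uniformly random x in 0..9 equals 0.

```python
import random
from collections import defaultdict
from typing import List


class BranchStats:
    """Edge counters collected from execution traces (sparse, dict-based)."""

    def __init__(self) -> None:
        self.edge_hits = defaultdict(int)    # (src, dst) -> times taken
        self.block_hits = defaultdict(int)   # src -> times a jump left src

    def record(self, trace: List[str]) -> None:
        """Update the counters with one execution trace of basic blocks."""
        for src, dst in zip(trace, trace[1:]):
            self.edge_hits[(src, dst)] += 1
            self.block_hits[src] += 1

    def probability(self, src: str, dst: str) -> float:
        """Estimated probability of jumping from src to dst."""
        total = self.block_hits.get(src, 0)
        return self.edge_hits.get((src, dst), 0) / total if total else 0.0

    def path_probability(self, path: List[str]) -> float:
        """Product of the estimated branch probabilities along a path."""
        result = 1.0
        for src, dst in zip(path, path[1:]):
            result *= self.probability(src, dst)
        return result


def demo(samples: int = 10_000, seed: int = 1) -> float:
    """Estimate P(A -> C) from random inputs under the assumed condition."""
    rng = random.Random(seed)
    stats = BranchStats()
    for _ in range(samples):
        x = rng.randint(0, 9)
        stats.record(["A", "C"] if x == 0 else ["A", "B"])
    return stats.probability("A", "C")
```

With enough samples the estimate returned by `demo()` settles near 0.1, the analytical value in the Figure 1 example.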

One possible drawback of such a probability-based approach is the potential run-time overhead. Both the statistical jump counting and the probability calculation introduce extra computation. A simple way to alleviate the performance degradation is interval sampling. Another possible solution is to accelerate the computation, which depends on how the metadata is stored and accessed. Conventionally, graph-based data is stored in an adjacency list. However, since the probability-based approach updates the jump statistics very often and the reachability judgment also requires quick edge tracing, the adjacency list is unsuitable owing to its slow data access. Another option is the adjacency matrix [56], which supports quick data access. However, since a jump usually has only two branches, the matrix is vast while the data distribution is relatively sparse, which increases space consumption dramatically. Thus, a precondition for leveraging a probability-based approach is designing a customized data structure that balances time complexity and space complexity.

IV-D Multi-targets relationship

When there are multiple targets in a DGF campaign, how to coordinate these targets is another challenge. If the targets are unrelated, one strategy is seeking the global shortest distance based on Dijkstra’s algorithm, as AFLGo does. However, such global optimization can miss the locally optimal seed that is closest to a certain target, leading to a deviation. We use the following example to illustrate the situation.

Fig. 2: Global distance bias.

Figure 2 shows an execution tree, where nodes I, P, and L are the target nodes. If the execution path is ACFJN, based on the distance formula defined in AFLGo [4], the distances between the path and the targets are: $d_I = 3$, $d_P = (4 + 3 + 2 + 1) / 4 = 2.5$, $d_L = (3 + 2) / 2 = 2.5$. Thus, the global distance is $d_{global} = 3 + 2.5 + 2.5 = 8$. If the execution path is ABDI, the distances to the targets are: $d_I = (3 + 2 + 1 + 0) / 4 = 1.5$, $d_P = 4$, $d_L = 3$, and $d_{global} = 1.5 + 4 + 3 = 8.5$. According to the algorithm, path ACFJN, which has the shorter global distance, should be prioritized. However, this is unreasonable because path ABDI goes through the target node I, so it has better directedness and should be prioritized. Therefore, when there are multiple targets, finding the global shortest distance introduces a deviation that affects the directedness of fuzzing. Another strategy is separating the targets: for each seed, only the minimum distance among all the targets is taken as the distance of the seed, and the seed is prioritized based on this distance [44]. In this way, we can avoid the local optimum deviation, but this might slow down the speed of reaching a specific target.
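The arithmetic of this example can be checked with a short script. The node-to-target distances below are read off the numbers in the example (Figure 2 itself is not reproduced here), and the function names are our own.

```python
# Distance from each node to each target, for the nodes that can reach it.
DIST = {
    "I": {"A": 3, "B": 2, "D": 1, "I": 0},
    "P": {"A": 4, "C": 3, "F": 2, "J": 1},
    "L": {"A": 3, "C": 2},
}


def seed_target_distance(path: str, target: str) -> float:
    """Average distance to one target over the path nodes that can reach it."""
    reachable = [DIST[target][node] for node in path if node in DIST[target]]
    return sum(reachable) / len(reachable)


def global_distance(path: str) -> float:
    """Sum of the per-target distances, as in the worked example."""
    return sum(seed_target_distance(path, target) for target in DIST)


def min_distance(path: str) -> float:
    """Alternative strategy: keep only the distance to the nearest target."""
    return min(seed_target_distance(path, target) for target in DIST)


for path in ("ACFJN", "ABDI"):
    print(path, global_distance(path), min_distance(path))
# ACFJN 8.0 2.5
# ABDI 8.5 1.5
```

The global distance ranks ACFJN first, while the minimum-distance strategy ranks ABDI first, matching the discussion above.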

When multiple targets are related, such as the stack trace used to reproduce a known bug [31], we should also take into account the relationships among the targets, such as their ordering [39] and execution context [30].

IV-E Missing Indirect Calls

No matter what metric is adopted, DGF relies on control-flow analysis to prioritize the seeds. Take the distance-based metric as an example: in order to prioritize the seeds that are close to the targets, the distance is generally measured on the control-flow graph and the call graph. However, most researchers construct the control-flow graph and call graph statically from the source code at compile-time via LLVM’s built-in APIs, and such graphs are incomplete because indirect calls are missing. In real-world programs, indirect function calls are prevalent. For example, in libpng, 44.11% of the function calls are indirect [6]. For static analysis, which does not run the program, the targets of indirect call sites, such as a function pointer passed as a parameter in C or a call through function objects and pointers, cannot be observed directly from the source code or binary instructions. For binary code, the target address of an indirect call depends on the values in the registers, which cannot be obtained statically either. Besides, to construct an inter-procedural control-flow graph, we need to combine the control-flow graph of each function, generated from LLVM’s IR, with the call graph of the whole program. Therefore, the distance measured on the call graph and control-flow graph is inaccurate without the indirect calls, which affects the ability of DGF to reach the targets.

For static approaches, one straightforward solution to this challenge is performing Andersen’s points-to analysis for function pointers [6, 8]. However, such inclusion-based context-insensitive pointer analysis causes an indirect call to have many outgoing edges, possibly yielding execution paths that are not possible for a given input. TOFU [57] uses function type-signatures to approximate the callable set at each indirect-call site. However, it does not consider casts, which could allow a differently typed function to be called, introducing imprecision. For the dynamic situation, ParmeSan [40] identifies the missing edges of indirect calls during real executions and completes the call graph gradually, so that the graphs tend to become complete after a sufficient number of fuzzing executions. However, such a solution inevitably enlarges the run-time overhead and cannot guarantee completeness.
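The dynamic strategy can be sketched as a call graph that gains an edge whenever an indirect call is observed at run-time, after which the distances to the target are recomputed. This is a generic illustration under our own assumptions (function-level graph, unit edge weights, hypothetical function names), not ParmeSan's implementation.

```python
from collections import deque
from typing import Dict, Set


def distances_to_target(call_graph: Dict[str, Set[str]],
                        target: str) -> Dict[str, int]:
    """Breadth-first search on the reversed call graph from the target."""
    reverse: Dict[str, Set[str]] = {}
    for caller, callees in call_graph.items():
        for callee in callees:
            reverse.setdefault(callee, set()).add(caller)
    dist = {target: 0}
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for caller in reverse.get(node, ()):
            if caller not in dist:
                dist[caller] = dist[node] + 1
                queue.append(caller)
    return dist


def add_observed_edge(call_graph: Dict[str, Set[str]],
                      caller: str, callee: str) -> bool:
    """Record an indirect call seen during execution; True if it is new."""
    callees = call_graph.setdefault(caller, set())
    if callee in callees:
        return False
    callees.add(callee)
    return True
```

Only a newly observed edge requires recomputing the distances, which keeps the extra cost proportional to how often the graph actually changes.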

IV-F Exploration-exploitation coordination

The last challenge for DGF lies in coordinating the exploration-exploitation tradeoff. On the one hand, more exploration provides adequate information for exploitation; on the other hand, excessive exploration occupies many resources and delays exploitation. It is difficult to determine the boundary between the two phases, namely when to stop exploring and begin exploiting so as to achieve the best performance. AFLGo adopts a fixed separation of the exploration and exploitation phases: the time budgets are set in the test configuration before testing. Such a scheme is preliminary because the separation point is empirical and inflexible. Since different PUTs have different characteristics, a fixed separation adapts poorly. Once the exploration phase turns to the exploitation phase, there is no going back, even if the directed performance is poor because too few paths have been discovered.

Fig. 3: Comparison of different splitting of exploration and exploitation.

To illustrate how the splitting of the exploration and exploitation phases affects the performance of DGF, we use the “-z” parameter of AFLGo to set different time budgets for the exploration phase and compare the performance. In Figure 3, the horizontal axis shows the elapsed time, and the vertical axis shows the minimum distance to the target code areas among all the generated inputs. A smaller “min distance” indicates better directed performance. The experiments last for 24 hours; AFLGo-1 means 1 hour of exploration followed by 23 hours of exploitation, and the other configurations are named similarly. From the result, we can conclude that the splitting of the exploration and exploitation phases affects the performance of DGF, and the best performance (AFLGo-16) requires adequate time for both phases. However, it is difficult to find the optimum split.

Among the directed fuzzers we investigated, only one work improves the coordination of exploration and exploitation. RDFuzz [60] uses an intertwined schedule to conduct exploration and exploitation alternately. It counts branch-level statistics during execution to separate the code areas into high-frequency and low-frequency areas. Based on the two evaluation criteria of frequency and distance, the inputs are classified into high/low-distance and high/low-frequency types. Low-frequency inputs help to improve coverage, which is required in exploration; low-distance inputs help to reach the target code areas, which is favored in exploitation. The intertwined schedule then alternates between the two kinds of inputs.

Another possible solution to this challenge is leveraging a dynamic strategy that adaptively switches between the exploration phase and the exploitation phase. To realize this scheme, we can use a variable to coordinate the energy we spend on each phase. We call this variable depth; it indicates the percentage of energy spent on the exploitation phase. Based on this design, an adaptive DGF should start from the exploration phase (depth = 0), which focuses on discovering new paths. Then, as the number of known paths increases, we gradually increase depth to invoke the exploitation phase, in which high-valued directed seeds are selected and prioritized to enhance reachability based on depth. When the fuzzer cannot find any new paths for a certain amount of time, the exploration phase has reached a bottleneck, and we should quickly move to the exploitation phase by dramatically increasing depth. Similarly, we also need to move from the exploitation phase back to the exploration phase occasionally. For example, when the generation rate of directed seeds stays very low for many fuzzing cycles, we should move back to the exploration phase to enlarge the known path coverage and discover more directed seeds. When we are already in the exploitation phase and depth is very large (e.g., 0.9), but we still cannot find any new paths for many fuzzing cycles, we should decrease depth dramatically. This is because the current directed seeds perform poorly, and we need to move back to the exploration phase to find new directed seeds. With this scheme, the two phases can coexist, which improves both performance and adaptiveness.
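The depth-based coordination can be sketched as a small controller that is updated once per fuzzing cycle. All constants (step sizes, the stall limit of three cycles, the high-depth threshold of 0.9) are illustrative assumptions, and the class name is hypothetical; the sketch covers the path-discovery rules only, not the directed-seed generation rate.

```python
from dataclasses import dataclass


@dataclass
class DepthController:
    """Tracks depth, the share of energy spent on exploitation (0.0 to 1.0)."""
    depth: float = 0.0        # start in pure exploration
    step: float = 0.05        # gradual increase while new paths appear
    jump: float = 0.3         # dramatic change after a stall
    stall_limit: int = 3      # cycles without new paths that count as a stall
    high_depth: float = 0.9   # above this, a stall sends us back to exploring
    stalled_cycles: int = 0

    def update(self, new_paths: int) -> float:
        """Adjust depth after one fuzzing cycle and return the new value."""
        if new_paths > 0:
            self.stalled_cycles = 0
            self.depth = min(1.0, self.depth + self.step)
            return self.depth
        self.stalled_cycles += 1
        if self.stalled_cycles >= self.stall_limit:
            if self.depth >= self.high_depth:
                # Exploitation is stuck: fall back towards exploration.
                self.depth = max(0.0, self.depth - self.jump)
            else:
                # Exploration is stuck: move quickly towards exploitation.
                self.depth = min(1.0, self.depth + self.jump)
            self.stalled_cycles = 0
        return self.depth
```

A scheduler could then spend a fraction `depth` of each cycle's energy on the seeds closest to the targets and the remainder on coverage-oriented seeds.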

V Discussion

According to Table I, 80% (21/26) of the works were published after 2019, indicating that DGF is currently a research hotspot. With the rapid development of DGF, apart from target sites, various indicators have been proposed to direct DGF, including target sequence [39, 31, 30], semantic information [62, 26], typestate [52], sanitizer checks [40, 8], memory usage [58], and vulnerable probability [29]. DGF has evolved from reaching target locations to hunting complex deep behavioral bugs, such as use-after-free bugs [39, 52], memory consumption bugs [58], memory violation bugs [13], and deep stateful bugs [2].

V-A Relationship among Targets

Although 85% (22/26) of the directed fuzzers we investigated support multiple targets, only 4 of them pay attention to the relationship among targets. When there are multiple targets, we need to figure out the relationship among them. If they are unrelated, we can assign weights to them to differentiate their importance or probability. Otherwise, the hidden relationship can be extracted and exploited to improve directedness. For example, UAFL [52] takes the ordering of the operation sequence into account when it leverages target sequences to find use-after-free vulnerabilities. This is because, to trigger such behaviorally complex vulnerabilities, one needs not only to cover individual edges but also to traverse some long sequence of edges in a particular order. Berry [30] enhances the target sequences with execution context (i.e., necessary nodes, which are basic blocks required to reach the nodes in the target sequences) for all paths. We propose the following relationships that can be further included.

  • Spatial relationship. The relative position of targets on the execution tree. Given two targets, we can characterize their relationship by whether they are on the same execution path, how many execution paths they share, and which one is the ancestor or the successor of the other.

  • State relationship. For targets that involve the program state, we also need to consider their position in the state space. For example, whether two targets share the same states, and whether two states can convert to each other on the state transition map.

  • Interleaving relationship. For multi-threaded programs, the thread scheduling affects the execution ordering of events in different threads. Targets that can be reached under the same thread interleaving should be considered closely related in the interleaving space.

Based on the above discussion, we recommend taking the relationship among targets into account when selecting and prioritizing targets. The targets with higher reachability should have higher priority. Targets with a closer relationship should be covered with fewer test runs.

V-B Technology Integration

Because DGF depends on random mutation to generate test inputs, it can hardly reach deep targets and is less effective at triggering deep bugs along complex paths. In order to enhance the directedness and the reachability of corner cases and flaky bugs, various program analysis techniques such as static analysis, taint analysis, artificial intelligence, and symbolic execution have been adopted. Static analysis can be leveraged to automatically identify targets [8] and extract information from the PUT [6, 59]. Taint analysis can be used to identify the relationship between the input and the critical program variables and to optimize mutation strategy scheduling [55, 52]. Artificial intelligence can help to predict vulnerable code [29] and filter out unreachable inputs [70]. Symbolic (concolic) execution can be leveraged to solve complex path constraints. Directed hybrid fuzzing is a promising direction that can leverage both the precision of symbolic execution and the scalability of DGF to mitigate their individual weaknesses. Directed fuzzing can prioritize and schedule input mutation to get closer to the targets rapidly, and directed symbolic execution can help reach deeper code guarded by complex checks on the execution traces from the program entry to the targets [30, 8, 26, 44]. Nevertheless, we should be aware that anti-fuzzing techniques [19, 24] can insert fake paths, add delays in error-handling code, and obfuscate code to slow down dynamic analyses such as symbolic execution and taint analysis [56].

V-C Implementation Limitation

According to Table I, about 57% (15/26) of the tools are implemented on top of AFL. Thus, their performance is, to some extent, limited by the implementation of AFL. We illustrate this limitation from two aspects.

Because the edge coverage of AFL is based on basic block transitions, it is sensitive only at the basic block level and cannot distinguish path differences at the instruction level. Figure 4 shows an example of a jump between two nearby basic blocks. Since a traditional CFG is only path-sensitive at the basic block level, we cannot differentiate whether the jump at address 0x400657 is taken (path 2) or not (path 1), because both cases produce the same edge in the CFG, namely 0x400657 → 0x400671. Thus, such a CFG is not sensitive enough to precisely reflect the code coverage at the instruction level.

Fig. 4: Differentiate execution path at instruction level.

Another problem lies in path collision. AFL inserts a random number for each branch jump at compile-time and collects these inserted numbers from the register at run-time to identify the basic block transition (i.e., the edge in the CFG). It then maps each transition to a 64KB bitmap at index cur_location ⊕ (prev_location >> 1). This scheme incurs path collisions because different edges may be mapped to the same location.
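A collision under this mapping is easy to exhibit. The sketch below assumes the index is the XOR of the current block identifier with the previous one shifted right by one bit, reduced to a 64KB map; the small identifiers are chosen by hand for illustration, whereas AFL assigns them randomly.

```python
MAP_SIZE = 1 << 16  # 64 KB bitmap, one byte per slot


def edge_index(prev_location: int, cur_location: int) -> int:
    """Bitmap slot for the transition prev_location -> cur_location."""
    return (cur_location ^ (prev_location >> 1)) % MAP_SIZE


# Two different edges that land in the same slot:
first = edge_index(prev_location=4, cur_location=1)   # 1 ^ 2 = 3
second = edge_index(prev_location=6, cur_location=0)  # 0 ^ 3 = 3
print(first, second)  # 3 3
```

Once two edges share a slot, the fuzzer cannot tell from the bitmap alone which of them was executed, so a seed that covers a new edge may be discarded as uninteresting.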

Both limitations of AFL can cause imprecision in the control-flow graph, which eventually affects any seed prioritization built on the control-flow graph, whether it is based on distance or on other metrics. Although such limitations can be alleviated by constructing a finer-grained control-flow graph or designing a customized hash scheme [16], doing so inevitably increases the runtime overhead. Thus, the implementation is essentially a tradeoff between effectiveness and efficiency. Such implementation-level limitations also exist in other tools.

V-D Efficiency Improvement

In order to realize directedness in fuzzing, most researchers use additional instrumentation and data analysis, for example, static analysis, symbolic execution, taint analysis, and machine learning. However, such additional analysis inevitably degrades performance. For the evaluation, researchers usually focus on the ability to reach targets, using metrics such as Time-to-Exposure (the length of the fuzzing campaign until the first testcase that exposes a given error [4]) to measure the performance of directed greybox fuzzers, while ignoring the run-time overhead. However, for a given fuzzing time budget, higher efficiency means more fuzzing executions and consequently more chances to reach the target. Thus, optimizing fuzzing efficiency is another direction to improve directedness.

One solution is moving the execution-independent computation from run-time to compile-time. For example, AFLGo measures the distance between each basic block and a target location by parsing the call graph and the intra-procedural control-flow graph of the PUT. Since both parsing graphs and calculating distances are very time consuming, AFLGo moves most of the program analysis to the instrumentation phase at compile-time in exchange for efficiency at run-time. Another optimization is at the implementation level. Since most of the data we use during the analysis is graph-based, how such metadata is stored and accessed is vital to efficiency. We can design an optimized data structure to store such data, which should facilitate frequent and quick access when searching based on the topological structure of the graph, for example, using the graph database model [1]. Finally, we can leverage parallel computing to improve efficiency further. Prior works [10, 32] have successfully applied parallelism to CGF. For DGF, we can use a central node to maintain a seed queue that holds and prioritizes all the seeds, and then distribute the seeds to parallel fuzzing instances on computational nodes to test the PUT and collect feedback information.
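The central seed queue for parallel DGF can be sketched with a binary heap keyed by seed distance. This is a single-process illustration under our own assumptions (the class name, distance as the only priority, batch hand-out); a real deployment would add network transport and synchronization.

```python
import heapq
import itertools
from typing import List


class CentralSeedQueue:
    """Holds all seeds and hands out the ones closest to the targets first."""

    def __init__(self) -> None:
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps insertion order

    def push(self, seed: bytes, distance: float) -> None:
        """Add a seed reported by any fuzzing instance."""
        heapq.heappush(self._heap, (distance, next(self._counter), seed))

    def pop_batch(self, size: int) -> List[bytes]:
        """Remove and return up to `size` seeds with the smallest distance."""
        count = min(size, len(self._heap))
        return [heapq.heappop(self._heap)[2] for _ in range(count)]

    def __len__(self) -> int:
        return len(self._heap)
```

Each computational node would request a batch, fuzz those seeds, and push newly discovered seeds back together with their measured distances.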

V-E Future research suggestions

Based on the assessment and analysis of known works, we point out the following directions for future research.

  • Among the tools we evaluated, only one (SemFuzz [62]) of them supports kernel code testing. Thus, introducing DGF to kernel code and guiding fuzzing towards critical sites such as syscalls [42] and error handling codes [23, 46] should be a workable direction.

  • Although DGF has been trying to discover new bug types, such as use-after-free and memory consumption bugs, many commonly seen bugs have not been included yet. Thus, another research direction is applying DGF to specific bug types, such as information leakage bugs, concurrency bugs, and semantic bugs (TOCTTOU, double fetch [53]).

  • As for the seed prioritization metric, most of the works leverage distance and coverage (similarity) based methods, which facilitate quantitative seed evaluation without introducing much overhead. However, a smaller distance or broader coverage does not necessarily mean that a seed is closer to the target, because the underlying weights are unequal (discussed in detail in Section IV-C). We argue that path-based and probability-based metrics are more reasonable.

  • Finally, staged fuzzing [41, 57] is also a feasible approach that can be further exploited for DGF. By dividing the path to the target into sequential stages, staged directed fuzzing can get to the target step by step by reaching the sub-targets in each stage. Moreover, we can leverage different fuzzing strategies to satisfy the requirements in different stages. For example, TOFU uses syntactic-fuzzing for command-line flags and semantic-fuzzing for primary input files. Thus, staged fuzzing can reduce the dimensionality of the input space for each individual stage of fuzzing and improve fuzzing efficiency.

VI Conclusions

Directed greybox fuzzing is a practical and scalable approach to software testing under specific scenarios, such as patch testing and bug reproduction. Modern DGF has evolved from reaching target locations to hunting complex deep behavioral bugs. In this paper, we conduct the first in-depth study of directed greybox fuzzing. We collect 26 state-of-the-art directed greybox (and hybrid) fuzzers with various directed schemes and optimization techniques. Based on the features of DGF, we extract 15 metrics to conduct a thorough assessment of the collected tools and systemize the knowledge of this field. Based on the assessment, we summarize the challenges and perspectives of this field, aiming to facilitate and boost future research on this topic.

Acknowledgement

The authors would like to sincerely thank all the reviewers for their time and expertise on this paper. Their insightful comments helped us improve this work. This work is partially supported by the National High-level Personnel for Defense Technology Program (2017-JCJQ-ZQ-013), the HUNAN Province Natural Science Foundation (2017RS3045, 2019JJ50729), and the National Natural Science Foundation of China (61472437, 61902412, 61902416).

References

  • [1] R. Angles and C. Gutierrez (2008) Survey of graph database models. ACM Computing Surveys (CSUR) 40 (1), pp. 1–39. Cited by: §V-D.
  • [2] C. Aschermann, S. Schumilo, A. Abbasi, and T. Holz IJON: exploring deep state spaces via fuzzing. Cited by: §I, item -, §III-A, TABLE I, §V.
  • [3] W. Blair, A. Mambretti, S. Arshad, M. Weissbacher, W. Robertson, E. Kirda, and M. Egele (2020) HotFuzz: discovering algorithmic denial-of-service vulnerabilities through guided micro-fuzzing. arXiv preprint arXiv:2002.03416. Cited by: §I, §I.
  • [4] M. Böhme, V. Pham, M. Nguyen, and A. Roychoudhury (2017) Directed greybox fuzzing. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pp. 2329–2344. Cited by: §I, §I, item -, §II-C, §II-C, §III-A, §III-C1, §III-D, TABLE I, §IV-A, §IV-B, §IV-D, §V-D.
  • [5] M. Böhme, V. Pham, and A. Roychoudhury (2017) Coverage-based greybox fuzzing as markov chain. IEEE Transactions on Software Engineering 45 (5), pp. 489–506. Cited by: §I, §II-B, §II-C, §IV-C.
  • [6] H. Chen, Y. Xue, Y. Li, B. Chen, X. Xie, X. Wu, and Y. Liu (2018) Hawkeye: towards a desired directed grey-box fuzzer. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, pp. 2095–2108. Cited by: item -, §III-A, §III-C2, §III-D, §III-E, TABLE I, §IV-A, §IV-B, §IV-E, §IV-E, §V-B.
  • [7] P. Chen, J. Liu, and H. Chen (2019) Matryoshka: fuzzing deeply nested branches. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, pp. 499–513. Cited by: §I, §II-B.
  • [8] Y. Chen, P. Li, J. Xu, S. Guo, R. Zhou, Y. Zhang, L. Lu, et al. (2019) SAVIOR: towards bug-driven hybrid testing. arXiv preprint arXiv:1906.07327. Cited by: §I, §III-A, §III-C2, §III-C3, TABLE I, §IV-B, §IV-E, §V-B, §V.
  • [9] Y. Chen, D. Mu, J. Xu, Z. Sun, W. Shen, X. Xing, L. Lu, and B. Mao (2019) Ptrix: efficient hardware-assisted fuzzing for cots binary. In Proceedings of the 2019 ACM Asia Conference on Computer and Communications Security, pp. 633–645. Cited by: §IV-A, §IV-A.
  • [10] Y. Chen, Y. Jiang, F. Ma, J. Liang, M. Wang, C. Zhou, X. Jiao, and Z. Su (2019) Enfuzz: ensemble fuzzing with seed synchronization among diverse fuzzers. In 28th USENIX Security Symposium (USENIX Security 19), pp. 1967–1983. Cited by: §V-D.
  • [11] M. Christakis, P. Müller, and V. Wüstholz (2016) Guiding dynamic symbolic execution toward unverified program executions. In Proceedings of the 38th International Conference on Software Engineering, pp. 144–155. Cited by: §IV-B.
  • [12] clang 9 documentation (2020) Undefined behavior sanitizer - clang 9 documentation. External Links: Link Cited by: §III-C2, §IV-B.
  • [13] N. Coppik, O. Schwahn, and N. Suri (2019) MemFuzz: using memory accesses to guide fuzzing. In 2019 12th IEEE Conference on Software Testing, Validation and Verification (ICST), pp. 48–58. Cited by: §I, TABLE I, §V.
  • [14] X. Du, B. Chen, Y. Li, J. Guo, Y. Zhou, Y. Liu, and Y. Jiang (2019) Leopard: identifying vulnerable code for vulnerability assessment through program metrics. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), pp. 60–71. Cited by: §IV-B.
  • [15] X. Du (2018) Towards building a generic vulnerability detection platform by combining scalable attacking surface analysis and directed fuzzing. In International Conference on Formal Engineering Methods, pp. 464–468. Cited by: §IV-B.
  • [16] S. Gan, C. Zhang, X. Qin, X. Tu, K. Li, Z. Pei, and Z. Chen (2018) Collafl: path sensitive fuzzing. In 2018 IEEE Symposium on Security and Privacy (SP), pp. 679–696. Cited by: §V-C.
  • [17] V. Ganesh, T. Leek, and M. Rinard (2009) Taint-based directed whitebox fuzzing. In 2009 IEEE 31st International Conference on Software Engineering, pp. 474–484. Cited by: §I.
  • [18] Google (2019) Protocol buffers. External Links: Link Cited by: §III-B.
  • [19] E. Güler, C. Aschermann, A. Abbasi, and T. Holz (2019) ANTIFUZZ: impeding fuzzing audits of binary executables. In 28th USENIX Security Symposium (USENIX Security 19), pp. 1931–1947. Cited by: §V-B.
  • [20] V. Jain, S. Rawat, C. Giuffrida, and H. Bos (2018) TIFF: using input type inference to improve fuzzing. In Proceedings of the 34th Annual Computer Security Applications Conference, pp. 505–517. Cited by: §III-B, §III-F, TABLE I.
  • [21] D. R. Jeong, K. Kim, B. Shivakumar, B. Lee, and I. Shin (2019) Razzer: finding kernel race bugs through fuzzing. In 2019 IEEE Symposium on Security and Privacy (SP), pp. 754–768. Cited by: §I.
  • [22] B. Jiang, Y. Liu, and W. Chan (2018) Contractfuzzer: fuzzing smart contracts for vulnerability detection. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, pp. 259–269. Cited by: §I.
  • [23] Z. Jiang, J. Bai, K. Lu, and S. Hu (2020) Fuzzing error handling code using context-sensitive software fault injection. In 29th USENIX Security Symposium (USENIX Security 20), Cited by: item -.
  • [24] J. Jung, H. Hu, D. Solodukhin, D. Pagan, K. H. Lee, and T. Kim (2019) FUZZIFICATION: anti-fuzzing techniques. In 28th USENIX Security Symposium (USENIX Security 19), pp. 1913–1930. Cited by: §V-B.
  • [25] S. Karamcheti, G. Mann, and D. Rosenberg (2018) Adaptive grey-box fuzz-testing with thompson sampling. In Proceedings of the 11th ACM Workshop on Artificial Intelligence and Security, pp. 37–47. Cited by: §II-B.
  • [26] J. Kim and J. Yun (2019) Poster: directed hybrid fuzzing on binary code. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, pp. 2637–2639. Cited by: §I, §III-A, TABLE I, §IV-B, §V-B, §V.
  • [27] K. Kim, D. R. Jeong, C. H. Kim, Y. Jang, I. Shin, and B. Lee HFL: hybrid fuzzing on the linux kernel. Cited by: §I.
  • [28] T. Kim, C. H. Kim, J. Rhee, F. Fei, Z. Tu, G. Walkup, X. Zhang, X. Deng, and D. Xu (2019) RVFUZZER: finding input validation bugs in robotic vehicles through control-guided testing. In 28th USENIX Security Symposium (USENIX Security 19), pp. 425–442. Cited by: §I, §III-A, TABLE I.
  • [29] Y. Li, S. Ji, C. Lv, Y. Chen, J. Chen, Q. Gu, and C. Wu (2019) V-fuzz: vulnerability-oriented evolutionary fuzzing. arXiv preprint arXiv:1901.01142. Cited by: §I, item -, §III-A, §III-C3, §III-E, TABLE I, §IV-A, §IV-B, §V-B, §V.
  • [30] H. Liang, L. Jiang, L. Ai, and J. Wei (2020) Sequence directed hybrid fuzzing. In 2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 127–137. Cited by: §I, §III-A, §III-C2, §III-D, TABLE I, §IV-D, §V-A, §V-B, §V.
  • [31] H. Liang, Y. Zhang, Y. Yu, Z. Xie, and L. Jiang (2019) Sequence coverage directed greybox fuzzing. In 2019 IEEE/ACM 27th International Conference on Program Comprehension (ICPC), pp. 249–259. Cited by: §I, §III-A, §III-C2, §III-D, TABLE I, §IV-D, §V.
  • [32] J. Liang, Y. Jiang, Y. Chen, M. Wang, C. Zhou, and J. Sun (2018) Pafl: extend fuzzing optimizations of single mode to industrial parallel mode. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 809–814. Cited by: §V-D.
  • [33] C. Lyu, S. Ji, C. Zhang, Y. Li, W. Lee, Y. Song, and R. Beyah (2019) mopt: Optimized mutation scheduling for fuzzers. In 28th USENIX Security Symposium (USENIX Security 19), pp. 1949–1966. Cited by: §II-B.
  • [34] K. Ma, K. Y. Phang, J. S. Foster, and M. Hicks (2011) Directed symbolic execution. In International Static Analysis Symposium, pp. 95–111. Cited by: §I.
  • [35] V. J. Manès, S. Kim, and S. K. Cha Ankou: guiding grey-box fuzzing towards combinatorial difference. Cited by: TABLE I.
  • [36] P. D. Marinescu and C. Cadar (2013) KATCH: high-coverage testing of software patches. In Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering, pp. 235–245. Cited by: §I, §I.
  • [37] B. Mathis, R. Gopinath, M. Mera, A. Kampmann, M. Höschele, and A. Zeller (2019) Parser-directed fuzzing. In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 548–560. Cited by: §I, §III-A, §III-B, §III-F, TABLE I.
  • [38] D. Mu, A. Cuevas, L. Yang, H. Hu, X. Xing, B. Mao, and G. Wang (2018) Understanding the reproducibility of crowd-reported security vulnerabilities. In 27th USENIX Security Symposium (USENIX Security 18), pp. 919–936. Cited by: item -.
  • [39] M. Nguyen, S. Bardin, R. Bonichon, R. Groz, and M. Lemerre (2020) Binary-level directed fuzzing for use-after-free vulnerabilities. arXiv preprint arXiv:2002.10751. Cited by: §I, §III-A, §III-C1, §III-C2, TABLE I, §IV-A, §IV-B, §IV-D, §V.
  • [40] S. Österlund, K. Razavi, H. Bos, and C. Giuffrida ParmeSan: sanitizer-guided greybox fuzzing. Cited by: §I, §III-A, §III-C1, TABLE I, §IV-B, §IV-E, §V.
  • [41] R. Padhye, C. Lemieux, K. Sen, M. Papadakis, and Y. Le Traon (2019) Semantic fuzzing with zest. In Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis, pp. 329–340. Cited by: item -.
  • [42] S. Pailoor, A. Aday, and S. Jana (2018) MoonShine: optimizing os fuzzer seed selection with trace distillation. In 27th USENIX Security Symposium (USENIX Security 18), pp. 729–743. Cited by: item -.
  • [43] K. Patil and A. Kanade (2018) Greybox fuzzing as a contextual bandits problem. arXiv preprint arXiv:1806.03806. Cited by: §II-B, §II-D.
  • [44] J. Peng, F. Li, B. Liu, L. Xu, B. Liu, K. Chen, and W. Huo (2019) 1dVul: discovering 1-day vulnerabilities through binary patches. In 2019 49th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pp. 605–616. Cited by: item -, item -, §III-A, §III-C1, TABLE I, §IV-B, §IV-D, §V-B.
  • [45] S. Person, G. Yang, N. Rungta, and S. Khurshid (2011) Directed incremental symbolic execution. ACM SIGPLAN Notices 46 (6), pp. 504–515. Cited by: §I.
  • [46] S. Rawat, V. Jain, A. Kumar, L. Cojocar, C. Giuffrida, and H. Bos (2017) VUzzer: application-aware evolutionary fuzzing. In NDSS, Vol. 17, pp. 1–14. Cited by: item -.
  • [47] S. Schumilo, C. Aschermann, R. Gawlik, S. Schinzel, and T. Holz (2017) Kafl: hardware-assisted feedback fuzzing for os kernels. In 26th USENIX Security Symposium (USENIX Security 17), pp. 167–182. Cited by: §I, §IV-A.
  • [48] Y. Shin and L. Williams (2013) Can traditional fault prediction models be used for vulnerability prediction?. Empirical Software Engineering 18 (1), pp. 25–59. Cited by: §I.
  • [49] D. Song, F. Hetzelt, D. Das, C. Spensky, Y. Na, S. Volckaert, G. Vigna, C. Kruegel, J. Seifert, and M. Franz (2019) PeriScope: an effective probing and fuzzing framework for the hardware-os boundary. In NDSS, Cited by: §I.
  • [50] R. Swiecki (2016) Honggfuzz. Available online at: http://code.google.com/p/honggfuzz. Cited by: §IV-A.
  • [51] N. Vinesh, S. Rawat, H. Bos, C. Giuffrida, and M. Sethumadhavan (2020) ConFuzz—a concurrency fuzzer. In First International Conference on Sustainable Technologies for Computational Intelligence, pp. 667–691. Cited by: §I.
  • [52] H. Wang, X. Xie, Y. Li, C. Wen, Y. Liu, S. Qin, H. Chen, and Y. Sui (2020) Typestate-guided fuzzer for discovering use-after-free vulnerabilities. In 2020 IEEE/ACM 42nd International Conference on Software Engineering, Seoul, South Korea. Cited by: §I, item -, §III-A, §III-C2, §III-F, TABLE I, §V-A, §V-B, §V.
  • [53] P. Wang, J. Krinke, K. Lu, G. Li, and S. Dodier-Lazaro (2017) How double-fetch situations turn into double-fetch vulnerabilities: a study of double fetches in the linux kernel. In 26th USENIX Security Symposium (USENIX Security 17), pp. 1–16. Cited by: item -, item -.
  • [54] P. Wang, J. Krinke, X. Zhou, and K. Lu (2019) AVPredictor: comprehensive prediction and detection of atomicity violations. Concurrency and Computation: Practice and Experience 31 (15), pp. e5160. Cited by: item -.
  • [55] W. Wang, H. Sun, and Q. Zeng (2016) Seededfuzz: selecting and generating seeds for directed fuzzing. In 2016 10th International Symposium on Theoretical Aspects of Software Engineering (TASE), pp. 49–56. Cited by: item -, §III-B, §III-F, TABLE I, §IV-B, §V-B.
  • [56] Y. Wang, X. Jia, Y. Liu, K. Zeng, T. Bao, D. Wu, and P. Su Not all coverage measurements are equal: fuzzing by coverage accounting for input prioritization. Cited by: §III-C2, TABLE I, §IV-C, §V-B.
  • [57] Z. Wang, B. Liblit, and T. Reps (2020) TOFU: target-oriented fuzzer. arXiv preprint arXiv:2004.14375. Cited by: §III-B, §III-C1, TABLE I, §IV-E, item -.
  • [58] C. Wen, H. Wang, Y. Li, S. Qin, Y. Liu, Z. Xu, H. Chen, X. Xie, G. Pu, and T. Liu (2020) Memlock: memory usage guided fuzzing. Cited by: §I, item -, §III-A, TABLE I, §V.
  • [59] V. Wüstholz and M. Christakis (2019) Targeted greybox fuzzing with static lookahead analysis. arXiv preprint arXiv:1905.07147. Cited by: §I, item -, §III-C4, TABLE I, §V-B.
  • [60] J. Ye, R. Li, and B. Zhang (2020) RDFuzz: accelerating directed fuzzing with intertwined schedule and optimized mutation. Mathematical Problems in Engineering 2020. Cited by: §III-C1, §III-F, TABLE I, §IV-A, §IV-F.
  • [61] W. You, X. Wang, S. Ma, J. Huang, X. Zhang, X. Wang, and B. Liang (2019) Profuzzer: on-the-fly input type probing for better zero-day vulnerability discovery. In 2019 IEEE Symposium on Security and Privacy (SP), pp. 769–786. Cited by: §III-B, §III-E, TABLE I.
  • [62] W. You, P. Zong, K. Chen, X. Wang, X. Liao, P. Bian, and B. Liang (2017) Semfuzz: semantics-based automatic generation of proof-of-concept exploits. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pp. 2139–2154. Cited by: §I, item -, §III-A, §III-B, §III-E, §III-F, TABLE I, §IV-B, item -, §V.
  • [63] B. Yu, P. Wang, T. Yue, and Y. Tang (2019) Poster: fuzzing iot firmware via multi-stage message generation. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, pp. 2525–2527. Cited by: §I.
  • [64] T. Yue, Y. Tang, B. Yu, P. Wang, and E. Wang (2019) LearnAFL: greybox fuzzing with knowledge enhancement. IEEE Access 7, pp. 117029–117043. Cited by: §I, §II-B.
  • [65] T. Yue, P. Wang, Y. Tang, E. Wang, B. Yu, K. Lu, and X. Zhou (2020) EcoFuzz: adaptive energy-saving greybox fuzzing as a variant of the adversarial multi-armed bandit. In 29th USENIX Security Symposium (USENIX Security 20), Cited by: §II-D.
  • [66] G. Zhang, X. Zhou, Y. Luo, X. Wu, and E. Min (2018) Ptfuzz: guided fuzzing with processor trace feedback. IEEE Access 6, pp. 37302–37313. Cited by: §IV-A.
  • [67] L. Zhao, Y. Duan, H. Yin, and J. Xuan (2019) Send hardest problems my way: probabilistic path prioritization for hybrid fuzzing. In NDSS, Cited by: §I, §IV-C.
  • [68] M. Xu, S. Kashyap, H. Zhao, and T. Kim KRACE: data race fuzzing for kernel file systems. Cited by: §I.
  • [69] Y. Zheng, A. Davanian, H. Yin, C. Song, H. Zhu, and L. Sun (2019) FIRM-afl: high-throughput greybox fuzzing of iot firmware via augmented process emulation. In 28th USENIX Security Symposium (USENIX Security 19), pp. 1099–1114. Cited by: §I.
  • [70] P. Zong, T. Lv, D. Wang, Z. Deng, R. Liang, and K. Chen FuzzGuard: filtering out unreachable inputs in directed grey-box fuzzing through deep learning. Cited by: §III-B, TABLE I, §V-B.