# MUZZ: Thread-aware Grey-box Fuzzing for Effective Bug Hunting in Multithreaded Programs


## 1 Introduction

Multithreading has been popular in modern software systems since it substantially utilizes hardware resources to boost software performance. A typical computing paradigm of multithreaded programs is to accept a set of inputs, distribute computing jobs to threads, and orchestrate their progress accordingly. Compared to sequential programs, however, multithreaded programs are more prone to severe software faults. On the one hand, non-deterministic thread-interleavings give rise to concurrency-bugs such as data-races and deadlocks [32]. These bugs may cause the program to end up with abnormal results or unexpected hangs. On the other hand, bugs that appear only under specific inputs and interleavings may lead to concurrency-vulnerabilities [30, 5], resulting in memory corruption, information leakage, etc.

There exists a line of work on detecting bugs and vulnerabilities in multithreaded programs. Static concurrency-bug predictors [40, 50, 45, 2] aim to approximate the runtime behaviors of a program without actual concurrent execution. However, they typically serve only as a complementary solution due to their high percentage of false alarms [19]. Dynamic detectors detect concurrency-violations by reasoning about memory read/write and synchronization events in a particular execution trace [58, 12, 41, 42, 49, 21, 5]. Several techniques, such as ThreadSanitizer (a.k.a. TSan) [42] and Helgrind [49], have been widely used in practice. However, these approaches by themselves do not automatically generate new test inputs to exercise different paths in multithreaded programs.

Meanwhile, grey-box fuzzing is effective in generating test inputs to expose vulnerabilities [36, 34]. It is reported that grey-box fuzzers (GBFs) such as AFL [63] and libFuzzer [31] have detected more than 16,000 vulnerabilities in hundreds of real-world software projects [63, 31, 16].

Despite the great success of GBFs in detecting vulnerabilities, there are few efforts on fuzzing user-space multithreaded programs. General-purpose GBFs usually cannot explore the execution states introduced by thread-interleavings because they are unaware of multithreading; therefore, they cannot effectively detect concurrency-vulnerabilities buried in sophisticated program flows [30]. In a discussion in 2015 [64], the author of AFL, Michal Zalewski, even suggested that "it's generally better to have a single thread". In fact, due to the difficulty and inefficiency, the fuzzing driver programs in Google's continuous fuzzing platform OSS-Fuzz are all tested in single-threaded mode [15]. Also, by matching unions of the keyword patterns "race*", "concurren*" and "thread*" in the MITRE CVE database [48], we found that only 202 CVE records are relevant to concurrency-vulnerabilities out of the 70438 CVE IDs assigned from CVE-2014-* to CVE-2018-*. In particular, we observed that, theoretically, at most 4 of these CVE records could be detected by grey-box fuzzers that work on user-space programs.

As a result, there are no practical fuzzing techniques to test input-dependent user-space multithreaded programs and detect bugs or vulnerabilities inside them. To this end, we present a dedicated grey-box fuzzing technique, Muzz, to reveal bugs by exercising input-dependent and interleaving-dependent paths. We categorize the targeted multithreading-relevant bugs into two major groups:

1. concurrency-vulnerabilities: memory corruption vulnerabilities that occur in a multithreading context. These vulnerabilities can be detected during the fuzzing phase.

2. concurrency-bugs: bugs such as data-races, atomicity-violations, deadlocks, etc. We detect them by replaying the seeds generated by Muzz with state-of-the-art concurrency-bug detectors such as TSan.

Note that concurrency-bugs may not be revealed during fuzzing, since they do not necessarily result in memory corruption crashes. In the remaining sections, when referring to multithreading-relevant bugs, we always mean the combination of concurrency-bugs and concurrency-vulnerabilities.

We summarize the contributions of our work as follows: 1) We develop three novel thread-aware instrumentations for grey-box fuzzing that can distinguish the execution states caused by thread-interleavings.

2) We optimize seed selection and execution strategies based on the runtime feedback provided by the instrumentations, which help generate more effective seeds concerning the multithreading context.

3) We integrate these analyses into Muzz for effective bug hunting in multithreaded programs. Experiments on 12 real-world programs show that Muzz outperforms other fuzzers such as AFL and MOpt in detecting concurrency-vulnerabilities and revealing concurrency-bugs.

4) Muzz detected 8 new concurrency-vulnerabilities and 19 new concurrency-bugs, with 4 CVE IDs assigned. Considering the small portion of concurrency-vulnerabilities recorded in the CVE database, the results are promising.

## 2 Background and Motivation

### 2.1 Grey-box Fuzzing Workflow

Algorithm 1 presents the typical workflow of a grey-box fuzzer [34, 3, 63]. Given a target program and a set of input seeds, a GBF first utilizes instrumentation to track coverage information. Then it enters the fuzzing loop: 1) seed selection decides which seed is to be selected next; 2) seed scheduling decides how many mutations will be applied to the selected seed; 3) seed mutation applies mutations to the selected seed to generate a new seed; 4) during repeated execution, for each new seed, the fuzzer executes the program against it several times to collect execution statistics; 5) seed triaging evaluates the new seed based on these statistics and the coverage feedback from instrumentation, to determine whether the seed leads to a vulnerability, or whether it is "effective" and should be preserved in the seed queue for subsequent fuzzing. Steps 3), 4), and 5) are processed continuously. Notably, the repeated executions are necessary since a GBF needs to collect statistics such as the average execution time, which are used to calculate the mutation times for seed scheduling in the next iteration. In essence, the effectiveness of grey-box fuzzing relies on the feedback collected from the instrumentation; specifically, the result of cov_new_trace in Algorithm 1 is determined by the coverage feedback.
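The loop structure above can be sketched with a toy stand-in for the target program. Every name here (run_target, cov_new_trace, mutate, fuzz) is an illustrative placeholder rather than AFL's or Muzz's actual API, and "coverage" is reduced to a single nibble so the triaging step stays visible:

```c
#include <stdlib.h>

#define MAX_SEEDS 64

/* Toy stand-ins for Algorithm 1's steps: the target's "coverage" is
 * just the input byte's high nibble. */
static int seen[16];                          /* coverage map          */

static int run_target(unsigned char input)   { return input >> 4; }
static int cov_new_trace(int cov)            { return !seen[cov]; }
static unsigned char mutate(unsigned char s) { return (unsigned char)(s + rand() % 7 + 1); }

/* One fuzzing campaign: select, schedule, mutate, execute repeatedly,
 * triage. Returns the final number of seeds in the queue. */
static int fuzz(unsigned char *queue, int n, int rounds) {
    for (int r = 0; r < rounds; r++) {
        unsigned char s = queue[r % n];      /* 1) seed selection      */
        int mutations = 8;                   /* 2) seed scheduling     */
        for (int i = 0; i < mutations; i++) {
            unsigned char t = mutate(s);     /* 3) seed mutation       */
            int cov = 0;
            for (int k = 0; k < 3; k++)      /* 4) repeated execution  */
                cov = run_target(t);
            if (cov_new_trace(cov) && n < MAX_SEEDS) { /* 5) triaging  */
                seen[cov] = 1;               /* remember the new state */
                queue[n++] = t;              /* keep effective seed    */
            }
        }
    }
    return n;
}
```

A real fuzzer replaces each stand-in with instrumentation-driven machinery, but the control flow of the loop is the same.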

### 2.2 The Challenge in Fuzzing Multithreaded Programs and Our Solution

Figure 1 is an abstracted multithreaded program that accepts a certain input file and distributes computing jobs to threads. In practice it may behave like compressors/decompressors (e.g., lbzip2, pbzip2), image processors (e.g., ImageMagick, GraphicsMagick), or encoders/decoders (e.g., WebM, libvpx). After reading the input content buf, it performs an initial validity check inside the function check and exits immediately if the buffer does not satisfy certain properties. The multithreading context starts from the function compute (via pthread_create). It involves shared variables s_var (passed from main) and g_var (a global variable), as well as the mutex primitive m used to exclusively read/write the shared variables (via pthread_mutex_lock and pthread_mutex_unlock).

With different inputs, the program may execute different segments. For example, depending on a branch condition that is purely determined by the input content (i.e., the value of buf provided by different seed files), the program may or may not execute certain statements. Therefore, different seed files need to be generated to exercise different paths in the multithreading context — in fact, this is the starting point of our idea to use fuzzing to generate seed files that test multithreaded programs.

Meanwhile, in the presence of thread-interleavings, g_var (initialized to -1) may also end up with different values. Let us focus on two statements executed by each thread: S1: "g_var += 1" and S2: "g_var *= 2". Suppose there are two threads T1 and T2, and T1 executes S1 first. Then there are at least three interleavings:

1. T1:S1 → T2:S1 → T2:S2 → T1:S2, yielding g_var = 4

2. T1:S1 → T2:S1 → T1:S2 → T2:S2, yielding g_var = 4

3. T1:S1 → T1:S2 → T2:S1 → T2:S2, yielding g_var = 2

After the second S2 is executed, the values of g_var may differ (4, 4, and 2, respectively). Worse still, since neither S1 nor S2 is an atomic operation at the level of the actual program binary, many more interleavings can be observed and g_var can be assigned yet other values.
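The three outcomes can be checked mechanically. The sketch below (the labels S1/S2 are ours) replays an interleaving given as a string of operation codes:

```c
/* Replay one interleaving of the two statements from the example --
 * S1: "g_var += 1" and S2: "g_var *= 2" -- encoded as a string where
 * '1' means S1 and '2' means S2. */
static int replay(const char *ops) {
    int g_var = -1;                 /* same initial value as in the text */
    for (; *ops; ops++) {
        if (*ops == '1') g_var += 1;   /* S1 */
        else             g_var *= 2;   /* S2 */
    }
    return g_var;
}
```

The first two schedules both correspond to the operation order S1,S1,S2,S2 and end with 4; the third corresponds to S1,S2,S1,S2 and ends with 2.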

The challenge. To reveal multithreading-relevant bugs, a GBF needs to generate diverse seeds that execute different paths in the multithreading context (e.g., paths inside compute). However, existing GBFs even have difficulties in generating seeds that reach the multithreading segments. For example, if check is complicated enough, most of the seeds may fail the check and exit before entering compute — this is quite common due to the low quality of fuzzer-generated seeds [34, 61]. Meanwhile, even if a seed does execute multithreading code, it may still fail to satisfy certain preconditions to reach the problematic context. For example, suppose modify contains a vulnerability that can only be triggered when g_var is 2. Even if the fuzzer occasionally generates a seed that executes compute with the relevant branch condition being true, with no awareness of thread-interleavings it will not distinguish the different schedules 1, 2, and 3. As a result, subsequent mutations on this seed will miss important feedback regarding g_var, making it difficult to generate seeds that trigger the vulnerability.

To summarize, the challenge of fuzzing multithreaded programs is that existing GBFs have difficulties both in generating seeds that execute the multithreading context and in preserving thread-interleaving execution states.

Our solution. We provide fine-grained thread-aware feedback for seed files that execute the multithreading context, and distinguish more such execution states. According to §2.1, the preservation of seeds is based on this feedback; we can therefore expect the fuzzer to preserve more distinct seeds that execute multithreading code segments in the seed queue. This means that multithreading-relevant seeds are implicitly prioritized. Since these seeds have already passed the validity checking, the overall quality of the generated seeds is higher, and this "Matthew effect" sustains the quality of seed generations for subsequent fuzzing. Essentially, our approach provides a coverage feedback biased towards multithreading code segments (more explanations are available in §5.3).

Now let us investigate what instrumentations can be improved to existing fuzzers for thread-aware feedback.

State-of-the-art GBFs such as AFL evenly instrument the entry instruction of each basicblock as the basicblock's deputy. We refer to this selection strategy over deputy instructions as AFL-Ins. AFL-Ins provides coverage feedback during the dynamic fuzzing phase to explore more paths. During repeated execution (in Algorithm 1), AFL assigns a labeling value to each transition that connects the deputies of two consecutively executed basicblocks [63]. By maintaining the set of transitions of the queued seeds, AFL-Ins tracks the "coverage" of the target program, and cov_new_trace (in Algorithm 1) checks whether a transition indicates a new path/state.

Figure 2(b) depicts the transitions upon executing the functions compute and modify in Figure 1. For brevity, we use source code to illustrate the problem and use statements to represent instructions in assembly or LLVM IR [28].

AFL-Ins works perfectly on single-threaded programs: the kept transitions reflect both branching conditions and function calls. However, AFL-Ins cannot capture the differences among schedules 1, 2, and 3 (c.f. §2.2). In fact, it can only observe that a single transition occurred between the two statements; thus it will not prioritize this path for subsequent mutations over other paths that do not even execute compute. The root cause of this defect is that AFL only tracks the entry statements of basicblocks, evenly, and does not record thread identities. Therefore, we can add more deputy instructions within multithreading-relevant basicblocks to provide more interleaving feedback, and add thread-context information to distinguish different threads.
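AFL's documented transition bookkeeping can be sketched as follows; cur_loc is the compile-time random label of a deputy, and the right-shift of the previous label makes a transition A→B distinguishable from B→A:

```c
#include <stdint.h>

#define MAP_SIZE 65536

static uint8_t  hit[MAP_SIZE];   /* transition hit counts               */
static uint16_t prev_loc;        /* shifted label of the previous deputy */

/* Invoked at every deputy instruction; cur_loc is the basicblock's
 * compile-time random label, as in AFL's documented scheme. */
static void log_transition(uint16_t cur_loc) {
    hit[cur_loc ^ prev_loc]++;   /* index identifies the edge          */
    prev_loc = cur_loc >> 1;     /* shift so A->B differs from B->A    */
}
```

Two threads traversing the same deputies produce exactly the same indices here, which is why this feedback alone cannot separate schedules — motivating both the extra deputies and the thread-context information.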

#### 2.3.2 Schedule-intervention Across Executions

During a GBF’s repeated execution procedure (in Algorithm 1), a seed may exhibit non-deterministic behaviors: it executes different paths of the target program across executions due to randomness. In this scenario, AFL (and other GBFs) will execute such a seed more times than a seed with deterministic behaviors [63]. For the non-deterministic behaviors caused by scheduling-interleaving in multithreaded programs, since the execution is repeated continuously, the system-level environment (e.g., CPU usage, memory consumption, I/O status) is prone to be similar across runs [23, 26]. This decreases the diversity of schedules and consequently reduces the overall effectiveness. For example, during a repeated execution of 40 runs, schedules 1 and 3 might occur 10 and 30 times respectively, while schedule 2 does not occur at all; in this scenario, the execution states corresponding to schedule 2 will not be observed by the fuzzer. Ideally, we would like the fuzzer to observe as many distinct interleavings as possible during repeated execution, since they mark the potential states a seed can exercise. In the case of the two statements on g_var, we hope schedules 1, 2, and 3 can all occur. Therefore, it is favorable to provide schedule interventions to diversify the actual schedules.

## 3 System Overview

Figure 3 depicts the system overview of Muzz. It contains four major components: thread-aware static analysis guided instrumentation, dynamic fuzzing, vulnerability analysis, and concurrency-bug revealing.

During instrumentation (§4), for a multithreaded program, Muzz first computes a thread-aware inter-procedural control flow graph (ICFG) and the code segments that are likely to interleave with others during execution [11, 45], namely the suspicious interleaving scope (§4.1). Based on these results, it performs three instrumentations inspired by §2.3.

1. Coverage-oriented instrumentation (§4.2) is a kind of stratified instrumentation that assigns more deputies to the suspicious interleaving scope. It is the major instrumentation to track thread-interleaving induced coverage.

2. Thread-context instrumentation (§4.3) is a lightweight instrumentation that distinguishes different thread identities by tracking the context of threading functions for thread-forks, locks, unlocks, joins, etc.

3. Schedule-intervention instrumentation (§4.4) is a lightweight instrumentation at the entry of a thread-fork routine that dynamically adjusts each thread's priority. This complementary instrumentation aims to diversify interleavings by intervening in the thread schedules.

During dynamic fuzzing (§5), Muzz optimizes seed selection and repeated execution to generate more multithreading-relevant seeds. For seed selection (§5.1), in addition to the new coverage information provided by coverage-oriented instrumentation, Muzz also prioritizes the seeds that cover a new thread-context, based on the feedback provided by thread-context instrumentation. For repeated execution (§5.2), owing to the schedule-intervention instrumentation, Muzz adjusts the number of repetitions to maximize the benefit of repetition and to track the interleaved execution states.

Vulnerability analysis is applied to the crashing seeds found by dynamic fuzzing, and reveals vulnerabilities, including concurrency-vulnerabilities. The concurrency-bug revealing component reveals concurrency-bugs with the help of concurrency-bug detectors (e.g., TSan [42], Helgrind [49]). These two components are explained in the evaluation section (§6).

## 4 Static Analysis Guided Instrumentation

This component includes the thread-aware static analysis and the instrumentations based on it.

The static analysis aims to provide lightweight thread-aware information for instrumentation and runtime feedback.

We first apply an inclusion-based pointer analysis [1] to the target program. The points-to results are used to resolve the def-use flow of thread-sharing variables and the indirect calls needed to reconstruct the ICFG. By taking into account the semantics of threading APIs (e.g., the POSIX standard Pthread, the OpenMP library), we get an ICFG that is aware of the following multithreading information:

1. TFork is the set of program sites that call thread-fork functions. This includes explicit calls to pthread_create, the std::thread constructor that internally uses pthread_create, and the "parallel pragma" in OpenMP. The functions called at these forking sites are extracted from their semantics.

2. TJoin contains the call sites of functions that mark the end of a multithreading context, such as the pthread APIs pthread_join, pthread_exit, etc.

3. TLock is the set of sites that call thread-lock functions such as pthread_mutex_lock, omp_set_lock, etc.

4. TUnLock is the set of sites that call thread-unlock functions such as pthread_mutex_unlock, omp_unset_lock, etc.

5. TShareVar is the set of variables shared among different threads. This includes global variables and the variables passed in at thread-fork sites (e.g., TFork).

#### 4.1.2 Suspicious Interleaving Scope Extraction

Given a program that may run with multiple threads simultaneously, we want the instrumentation to collect execution states that reflect the interleavings. However, instrumentation introduces considerable overhead to the original program, especially when applied intensively throughout the whole program. Fortunately, with the static information provided by the thread-aware ICFG, we know that thread-interleavings can only happen at some specific program statements; therefore, the instrumentation can stress these statements. We denote the set of these statements as the suspicious interleaving scope, determined according to the following three conditions.

C1. The statements are executed after one of TFork, while TJoin has not yet been encountered.

C2. The statements are executed only before the invocation of TLock or after the invocation of TUnLock.

C3. The statements read or write at least one variable shared by different threads.

C1 excludes the statements irrelevant to multithreading. These statements can be prologue code that performs the validity check (e.g., check in Figure 1), or epilogue code that post-processes the inputs or deals with error handling. C2 keeps out the statements that are protected by certain locks. C3 is necessary since interleavings will not affect the shared states if the segment involves no shared variables; this condition is determined by checking whether the investigated statement contains a variable data-dependent on TShareVar (based on the pointer analysis). We provide a separate preprocessing procedure to exclude cases where there are only read operations on the shared variables.
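Under these definitions, the membership test for one statement reduces to a conjunction of the three conditions. The struct below is our illustrative model, not Muzz's internal representation:

```c
#include <stdbool.h>

/* Illustrative per-statement facts for conditions C1-C3; the field
 * names are ours. */
typedef struct {
    bool after_fork;     /* C1: reached after a TFork site...        */
    bool after_join;     /* ...but not after a TJoin site            */
    bool lock_protected; /* C2: between a TLock and its TUnLock      */
    bool touches_shared; /* C3: reads or writes a TShareVar          */
} stmt_info;

static bool in_suspicious_scope(const stmt_info *s) {
    return s->after_fork && !s->after_join   /* C1 */
        && !s->lock_protected                /* C2 */
        && s->touches_shared;                /* C3 */
}
```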

Note that the suspicious interleaving scope is used to emphasize multithreading-relevant paths via instrumentation for state exploration during fuzzing. Therefore its conditions differ from the constraints required by static models (e.g., may-happen-in-parallel [11, 45]) or by dynamic concurrency-bug detection algorithms (e.g., happens-before [12] or lockset [41]).

In Figure 1, TFork consists of the two pthread_create call sites. Muzz then collects all the functions that may be called from the forked routines, i.e., {modify, compute}, and according to C1 the candidate scope comes from the statements of these functions. Inside them, based on C2, we check which statements lie outside the regions bracketed by pthread_mutex_lock and pthread_mutex_unlock, and exclude the lock-protected ones. According to C3, we further exclude the statements that neither access nor modify the shared variables g_var and s_var. The remaining statements form the suspicious interleaving scope. Note that although modify can also be called from a single-threading site inside check, we still conservatively include it in the scope, since it might be called within multithreading contexts — at one call site it is protected by the mutex m while at another it is unprotected. It is worth noting that a statement protected by m may still happen-in-parallel [11, 45] with the unprotected statements; however, since the unprotected statements have already been put into the scope, we consider that sufficient to provide feedback for tracking thread-interleavings, and exclude the protected statement from the scope. Overall, the static analysis is lightweight. For example, the pointer analysis is flow- and context-insensitive; the extraction of thread-aware results such as TFork (in C1) and TShareVar (in C3) is over-approximate, in that the statically calculated sets may be larger than the actual sets; and C2 may aggressively exclude several statements that do involve interleavings. The benefit, however, is that this makes our analysis scalable to large-scale real-world programs.

### 4.2 Coverage-oriented Instrumentation

With the knowledge of the suspicious interleaving scope, we can instrument more deputy instructions (corresponding to statements in source code) inside the scope than outside it, for exploring new transitions. However, it is still costly to instrument every instruction inside the scope, since this may significantly reduce the overall execution speed of the target program. It is also unnecessary to do so — although in theory interleavings may happen everywhere inside the scope, many of them are unimportant because in practice they do not change the values of shared variables. This means that we can skip some instructions for instrumentation, or equivalently, instrument them with a probability. We still instrument, although more sparsely, the segments outside the scope for exploration purposes [34]. For example, in Figure 1, we apply instrumentation to check, in case the initial seeds are all rejected by the validity check and no intermediate feedback is available at all, making it extremely difficult for executions to even enter compute.

#### 4.2.1 Instrumentation Probability Calculation

The goal of calculating instrumentation probabilities is to strike a balance between execution overhead and feedback effectiveness by investigating the complexity of the target program's code segments. First of all, Muzz calculates a base instrumentation probability according to cyclomatic complexity [35], based on the fact that bugs or vulnerabilities usually come from functions with higher cyclomatic complexity [9, 43]. For each function f, we calculate the complexity value E(f) − N(f) + 2, where N(f) is the number of nodes (basicblocks) and E(f) is the number of edges in the function's control flow graph. Intuitively, this value captures the complexity of the function across its basicblocks. As 10 is considered the preferred upper bound of cyclomatic complexity [35], we determine the base probability as:

 Pe(f) = min{(E(f) − N(f) + 2)/10, 1.0} (1)

We use Ps(f) as the probability to selectively instrument the entry instruction of a basicblock that is entirely outside the suspicious interleaving scope, i.e., none of the instructions inside the basicblock belong to the scope. Here, Ps(f) is calculated as:

 Ps(f) = min{Pe(f), Ps0} (2)

where Ps0 is a constant upper bound on this probability, whose value Muzz sets empirically.

Further, for each basicblock b inside a given function f, we calculate the total number of instructions N(b), and the total number of memory-operation instructions Nm(b) (e.g., load/store, memcpy, free). Then for the instructions within the suspicious interleaving scope, the instrumentation probability is calculated as:

 Pm(f,b) = min{Pe(f) ⋅ Nm(b)/N(b), Pm0} (3)

where Pm0 is a constant bounding factor whose value Muzz sets empirically. The rationale of Pm is that vulnerabilities usually result from memory-operation instructions [34], and executions covering more such operations deserve more attention.
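The three formulas transcribe directly. Since the constants Ps0 and Pm0 are not given in this excerpt, they are passed in as parameters below:

```c
static double min_d(double a, double b) { return a < b ? a : b; }

/* Eq. (1): base probability from cyclomatic complexity E - N + 2. */
static double pe(int edges, int nodes) {
    return min_d((edges - nodes + 2) / 10.0, 1.0);
}

/* Eq. (2): entry instructions of blocks entirely outside the scope;
 * ps0 is the (unspecified) upper-bound constant. */
static double ps(int edges, int nodes, double ps0) {
    return min_d(pe(edges, nodes), ps0);
}

/* Eq. (3): instructions inside the scope; n is the block's instruction
 * count, nm its memory-operation count, pm0 the bounding factor. */
static double pm(int edges, int nodes, int nm, int n, double pm0) {
    return min_d(pe(edges, nodes) * nm / n, pm0);
}
```

For instance, a function with 14 edges and 10 nodes has complexity 6 and base probability 0.6; a block where half the instructions touch memory is then instrumented internally with probability 0.3 (unless capped by Pm0).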

#### 4.2.2 Instrumentation Algorithm

The coverage-oriented instrumentation algorithm is described in Algorithm 2. It traverses the functions in the target program. For each basicblock b in function f, Muzz first computes the intersection of the instructions inside b with the suspicious interleaving scope. If this intersection is empty, it instruments the entry instruction of b with probability Ps(f). Otherwise, 1) Muzz always instruments the entry instruction of b (i.e., with probability 1.0); 2) for the other instructions, if they are inside the scope, Muzz instruments them with probability Pm(f,b). We refer to our selection strategy over deputy instructions as M-Ins. As a comparison, AFL-Ins always instruments, evenly, the entry instructions of all basicblocks.

For the example in Figure 1, since the statements before and after the multithreading context are out of the suspicious interleaving scope, we can expect M-Ins to instrument fewer entry statements in their corresponding basicblocks. Meanwhile, for the statements inside the scope, M-Ins may instrument other statements besides the entry statements: an entry statement must be instrumented, and a subsequent statement inside the scope may also be instrumented (with some probability) — if so, the transition between the two can be tracked.

### 4.3 Thread-context Instrumentation

We apply thread-context instrumentation to distinguish thread identities for additional feedback. This complements coverage-oriented instrumentation, since the latter is unaware of thread IDs. The context is collected at the call sites of the threading functions in TLock, TUnLock, and TJoin. Each collected item is a tuple consisting of the labeling value of the deputy instruction executed right before the call site, and the label of the calling thread, obtained by looking up the current thread ID in the "thread ID map" collected by the instrumented function (to be explained in §4.4). For each of these categories of call sites, we keep the sequence of tuples observed during an execution, and at the end of the execution we calculate a hash value over each sequence. Together, these hash values form a context-signature that determines the overall thread-context of a specific execution. Essentially, this is a sampling over threading-relevant APIs to track the thread-context of a specific execution. As we shall see in §5.1, the occurrence of a new context-signature determines the result of cov_new_mt_ctx during seed selection.

In Figure 1, each time pthread_mutex_lock is called, Muzz collects the deputy instruction executed prior to the corresponding call site and the thread ID label (e.g., T1) to form a tuple; these tuples form a sequence for TLock, over which a hash value is eventually calculated. Similar calculations are applied for pthread_mutex_unlock and pthread_join.
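One way to realize such a signature is to fold each (deputy-label, thread-ID) sequence with a running hash. FNV-1a below is our choice purely for illustration — the paper does not name its hash function:

```c
#include <stddef.h>
#include <stdint.h>

typedef struct { uint32_t deputy; uint32_t tid; } ctx_tuple;

/* Fold one sequence of (deputy-label, thread-ID) tuples -- e.g. the
 * sequence collected at TLock call sites -- into a single value using
 * the FNV-1a construction. */
static uint64_t ctx_signature(const ctx_tuple *seq, size_t n) {
    uint64_t h = 1469598103934665603ULL;            /* FNV offset basis */
    for (size_t i = 0; i < n; i++) {
        h = (h ^ seq[i].deputy) * 1099511628211ULL; /* FNV prime       */
        h = (h ^ seq[i].tid)    * 1099511628211ULL;
    }
    return h;
}
```

Because the fold is order-sensitive, swapping which thread reaches a lock first generally yields a different signature — exactly the distinction that plain edge coverage misses.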

### 4.4 Schedule-intervention Instrumentation

When a user-space program does not specify any scheduling policy or priority, the operating system determines the actual schedule dynamically [26, 23]. Schedule-intervention instrumentation aims to diversify the thread-interleavings to collaborate with coverage-oriented and thread-context instrumentations. This instrumentation should be general enough to work for different multithreaded programs and extremely lightweight to keep runtime overhead minimal.

POSIX-compliant systems such as Linux and FreeBSD provide APIs to control low-level process or thread schedules [23, 26]. In order to intervene in the interleavings during the execution of the multithreading segments, we resort to the POSIX API pthread_setschedparam to adjust thread priorities inside an instrumented function that is invoked during fuzzing. This function does two tasks:

a) During repeated execution (§5.2), whenever a thread calls the instrumented function, it updates the scheduling policy to SCHED_RR and assigns a ranged random value to the thread's priority. This value is uniformly distributed and diversifies the actual schedules across different threads. With this intervention, we try to approximate the goal stated in §2.3.2.

b) For each newly mutated seed file, the function calls pthread_self at the entry of the thread routine to collect the thread IDs. This serves two purposes: 1) it informs the fuzzer that the current seed is multithreading-relevant; 2) based on the invocation order, each thread can be associated with a unique ID; these IDs compose the "thread ID map" used to calculate the thread-context in §4.3.
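The priority-randomization half of task a) can be sketched as follows. It draws a uniformly random SCHED_RR priority from the range the OS reports; the real hook would then apply it to the calling thread via pthread_setschedparam, which we omit here because raising real-time priorities normally requires elevated privileges:

```c
#include <sched.h>
#include <stdlib.h>

/* Pick a uniformly random priority within the OS-reported SCHED_RR
 * range; the instrumented hook would pass this value to
 * pthread_setschedparam for the calling thread. */
static int random_rr_priority(void) {
    int lo = sched_get_priority_min(SCHED_RR);
    int hi = sched_get_priority_max(SCHED_RR);
    return lo + rand() % (hi - lo + 1);
}
```

Querying the range instead of hard-coding it keeps the sketch portable across POSIX systems, where the valid SCHED_RR interval differs.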

## 5 Dynamic Fuzzing

The dynamic fuzzing loop follows the workflow of a typical GBF described in Algorithm 1. To improve the feedback on multithreading context, we optimize seed selection (§5.1) and repeated execution (§5.2) for fuzzing multithreaded programs, based on the aforementioned instrumentations.

### 5.1 Seed Selection

Seed selection decides which seeds are to be mutated next. In practice, this problem reduces to: when traversing the seed queue, should the seed at the queue front be selected for mutation? Algorithm 3 depicts our solution. The intuition is that we prioritize the seeds with new (normal) coverage or covering a new thread-context.

In addition to following AFL's strategy of using has_new_trace() to check whether there exists a seed in the queue that covers a new transition (i.e., cov_new_trace(s)==true), Muzz also uses has_new_mt_ctx() to check whether there exists a seed in the queue with a new thread-context (i.e., cov_new_mt_ctx(s)==true). If either is satisfied, there exist some "interesting" seeds in the queue. Specifically, if the current seed covers a new thread-context, the algorithm directly returns true. If it covers a new trace, it is selected with a high probability; otherwise, it is selected with a low probability. On the contrary, if no seeds in the queue are interesting, the algorithm selects the current seed with a small probability. Analogous to AFL's seed selection strategy [63], Muzz sets these three probabilities empirically.
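The decision can be sketched as a single predicate; the probability parameters stand in for the concrete constants, which are not given in this excerpt:

```c
#include <stdbool.h>
#include <stdlib.h>

/* Sketch of Algorithm 3's decision for the seed at the queue front.
 * The three probabilities are parameters because the paper's concrete
 * values are elided here. */
static bool select_seed(bool queue_has_interesting,
                        bool covers_new_mt_ctx,
                        bool covers_new_trace,
                        double p_new_trace,    /* if only a new trace  */
                        double p_otherwise,    /* if nothing new       */
                        double p_no_interest)  /* if queue is "dull"   */
{
    double roll = rand() / (double)RAND_MAX;   /* uniform in [0, 1]    */
    if (!queue_has_interesting) return roll < p_no_interest;
    if (covers_new_mt_ctx)      return true;   /* always selected      */
    if (covers_new_trace)       return roll < p_new_trace;
    return roll < p_otherwise;
}
```

The ordering of the checks encodes the priority: a new thread-context trumps everything, a new trace comes next, and everything else is selected only occasionally.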

As to the implementation of cov_new_mt_ctx(t), we track the thread-context of the multithreading API calls (c.f. §4.3) and check whether the context-signature has been seen before — when the signature is new, cov_new_mt_ctx(t) returns true; otherwise it returns false. Note that cov_new_trace(t)==true does not imply cov_new_mt_ctx(t)==true. The reasons are that (1) we cannot instrument inside the bodies of threading API functions (as they are "external functions"), hence cov_new_trace cannot track the transitions within them; and (2) cov_new_mt_ctx also utilizes the thread IDs, of which cov_new_trace is unaware.

### 5.2 Repeated Execution

Multithreaded programs introduce non-deterministic behaviors when different interleavings are involved. As mentioned in §2.3.2, for a seed with non-deterministic behaviors, a GBF typically repeats the execution of the target program against that seed more times. With the help of the schedule-intervention instrumentation (c.f. §4.4), we are able to tell whether the exhibited non-deterministic behaviors result from thread-interleavings. In fact, since we focus on multithreading only, based on the thread-fork information kept by the instrumented function, the fuzzer can distinguish the seeds with non-deterministic behaviors purely by checking whether their executions exercise the multithreading context. Further, if previous executions of a seed t induce more distinct context-signature values (we denote the number of these values as Cm(t)), we know that there must exist more thread-interleavings; to determine the repeating times Nc(t) applied to t, we rely on Cm(t). In AFL, the repeating times on a seed t is:

 Nc(t) = N0 + Nv ⋅ Bv,  Bv ∈ {0, 1}    (4)

where N0 is the initial number of repetitions and Nv is a constant "bonus" number of repetitions for non-deterministic runs. Bv = 0 if none of the executions exhibit non-deterministic behaviors; otherwise Bv = 1. We augment this to fit the multithreading setting:

 Nc(t) = N0 + min{Nv, N0 ⋅ Cm(t)}    (5)

In both AFL and Muzz, N0 = 8 and Nv = 32 (hence at most 40 repetitions). For all the executions on a seed, we track their execution traces and count how many distinct states they exhibit, which gives Cm(t). The rationale of adjusting Nc(t) is that, in real-world programs, the possibilities of thread-interleavings can vary greatly across seeds. For example, a seed may exhibit non-deterministic behaviors when executing compute in Figure 1 (e.g., races on g_var), but exit soon after failing an extra check inside compute (typically, with exit code >0). It will certainly exhibit fewer non-deterministic behaviors than a seed that is processed concurrently all the way until the program exits normally (typically, with exit code =0).
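A direct transcription of Equations (4) and (5), using N0 = 8 and Nv = 32 (these constants match the 8 initial runs and the 40-run total used for replaying in §6.4, but are stated here as an assumption):

```python
N0 = 8   # initial repeating times
NV = 32  # "bonus" repeating times for non-deterministic runs

def nc_afl(non_deterministic: bool) -> int:
    """Equation (4): Nc = N0 + Nv * Bv with Bv in {0, 1}."""
    bv = 1 if non_deterministic else 0
    return N0 + NV * bv

def nc_muzz(cm: int) -> int:
    """Equation (5): the bonus grows with Cm(t), the number of distinct
    multithreading states the seed has exhibited, capped at Nv."""
    return N0 + min(NV, N0 * cm)
```

Unlike AFL's all-or-nothing bonus, Equation (5) spends extra repetitions only in proportion to the interleaving diversity actually observed for the seed.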

### 5.3 Complementary Explanations

Here we explain why Muzz's static and dynamic thread-aware strategies improve the overall fuzzing effectiveness.

1) Multithreading-relevant seeds are more valuable for seed mutation/generation. Multithreading-relevant seeds themselves have already passed the validity checks of the target program. Compared to a seed that cannot even enter the thread-fork routines, it is usually much easier to generate a multithreading-relevant mutant from an existing multithreading-relevant seed. This is because the mutation operations (e.g., bitwise/bytewise flips, arithmetic adds/subs) in grey-box fuzzers are rather random, making it difficult to turn an invalid seed into a valid one. Therefore, from the mutation's perspective, we prefer to mutate multithreading-relevant seeds.
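As an illustration of why such blind mutators rarely turn an invalid input into a valid one, here is a minimal sketch of two AFL-style byte-level operators (simplified; real grey-box fuzzers apply many more):

```python
def bitflip(data: bytes, bit: int) -> bytes:
    """Flip a single bit -- AFL's most basic deterministic mutation."""
    out = bytearray(data)
    out[bit // 8] ^= 1 << (bit % 8)
    return bytes(out)

def arith_add(data: bytes, pos: int, delta: int) -> bytes:
    """Add a small constant to one byte (wrapping modulo 256)."""
    out = bytearray(data)
    out[pos] = (out[pos] + delta) % 256
    return bytes(out)
```

Neither operator knows anything about the input format, so a mutant of a seed that already passes the format checks is far more likely to remain valid than a mutant of an arbitrary invalid file.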

2) Muzz can distinguish more multithreading-relevant states. For example, in Figure 1, it can distinguish two transitions that AFL's edge coverage conflates. When two different seeds exercise these two transitions, Muzz is able to preserve both seeds, whereas other GBFs such as AFL cannot observe the difference. Conversely, since we provide less feedback for code that does not involve multithreading, Muzz distinguishes fewer of those states and puts fewer multithreading-irrelevant seeds in the seed queue.

3) Large portions of multithreading-relevant seeds in the seed queue benefit subsequent mutations. Suppose that at some point during fuzzing, both Muzz and AFL preserve 10 seeds, and Muzz keeps 8 multithreading-relevant seeds while AFL keeps 6. Obviously, the probability of Muzz picking a multithreading-relevant seed (80%) is higher than AFL's (60%). After this iteration of mutation, more seed mutants in Muzz are likely multithreading-relevant. The difference in seed quality (w.r.t. relevance to multithreading) is amplified with more mutation iterations. For example, Muzz may eventually keep 18 multithreading-relevant seeds and 10 other seeds, for a total of 28 and a percentage of 64.3%; while AFL keeps 12 multithreading-relevant seeds and 14 other seeds, for a total of 26 and a percentage of 46.2%.
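The arithmetic in this example is simply the multithreading-relevant fraction of the queue:

```python
def mt_percentage(n_mt: int, n_all: int) -> float:
    """Percentage of multithreading-relevant seeds in the queue."""
    return round(100.0 * n_mt / n_all, 1)

# initial queues: 8/10 (Muzz) vs. 6/10 (AFL)
# after more mutation iterations: 18/28 (Muzz) vs. 12/26 (AFL)
```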

Properties 1), 2) and 3) collaboratively affect the fuzzing effectiveness in a "closed loop". Eventually, both the number and the percentage of multithreading-relevant seeds in Muzz's queue are likely to be larger than those in AFL's. Owing to the more multithreading-relevant seeds in the queue and property 1), we can expect that:

1. concurrency-vulnerabilities are more likely to be detected with the new proof-of-crash files mutated from multithreading-relevant files in the seed queue;

2. concurrency-bugs are more likely to be revealed with the (seemingly normal) files in the seed queue that violate certain concurrency conditions.

Providing more feedback for multithreading-relevant segments essentially provides a biased coverage criterion that specializes fuzzing for multithreaded programs. Other specialization techniques, such as the context-sensitive instrumentation used by Angora [7] or the typestate-guided instrumentation in UAFL [52], provide similar solutions and achieve inspiring results. The novelty of Muzz lies in using multithreading-specific features as feedback to improve the quality of seed generation. It is worth noting that our solution only needs lightweight thread-aware analyses rather than deep knowledge of multithreading/concurrency; thus, it scales to real-world software.

## 6 Evaluation

We implemented Muzz on top of SVF [46], AFL [63], and ClusterFuzz [16]. The thread-aware ICFG construction leverages SVF's inter-procedural value-flow analysis. The instrumentation and dynamic fuzzing strategies are built into AFL's LLVM mode. The vulnerability analysis and concurrency-bug replaying components rely on ClusterFuzz's crash analysis module. We archive our supporting materials at https://sites.google.com/view/mtfuzz. The archive includes the initial seeds for fuzzing, the detected concurrency-vulnerabilities and concurrency-bugs, implementation details, and other findings from the evaluation.

Our evaluation targets the following questions:

1. RQ1: Can Muzz generate more effective seeds that execute multithreading-relevant program states?

2. RQ2: What is the capability of Muzz in detecting concurrency-vulnerabilities?

3. RQ3: What is the effect of using Muzz generated seeds to reveal concurrency-bugs with bug detectors?

### 6.1 Evaluation Setup

#### 6.1.1 Settings of the grey-box fuzzers

We use the following fuzzers during evaluation.

1. Muzz is our full-fledged fuzzer that applies all the thread-aware strategies in §4 and §5.

2. MAFL is a variant of Muzz. It differs from Muzz only in the coverage-oriented instrumentation — MAFL uses AFL-Ins while Muzz uses M-Ins. We compare MAFL with Muzz to demonstrate the effectiveness of M-Ins, and compare MAFL with AFL to stress the other strategies.

3. AFL is by far the most widely-used GBF; it applies the general-purpose AFL-Ins instrumentation and fuzzing strategies, and serves as the baseline fuzzer.

4. MOpt [33] is a recently proposed general-purpose fuzzer that leverages adaptive mutations to increase the overall fuzzing efficiency. It is claimed to detect 170% more vulnerabilities than AFL when fuzzing (single-threaded) programs.

#### 6.1.2 Statistics of the evaluation dataset

The dataset for evaluation consists of the following projects.

1. Parallel compression/decompression utilities, including pigz, lbzip2, pbzip2, xz, and pxz. These tools have been present in GNU/Linux distributions for many years and are integrated into the GNU tar utility.

2. ImageMagick and GraphicsMagick, two widely-used software suites to display, convert, and edit image files.

3. libvpx and libwebp, the two WebM projects for the VP8/VP9 and WebP codecs. They are used by popular browsers such as Chrome, Firefox, and Opera.

4. x264 and x265, the two most established video encoders for H.264/AVC and HEVC/H.265, respectively.

All these projects' single-thread functionalities have been intensively tested by mainstream GBFs such as AFL. We use their latest versions at the time of our evaluation; the only exception is libvpx, for which we use version v1.3.0-5589 to reproduce the ground-truth vulnerabilities and concurrency-bugs. Among the 12 multithreaded programs, pxz, GraphicsMagick, and ImageMagick use the OpenMP library, while the others use native PThreads.

Table 1 lists the statistics of the benchmarks. The first two columns show the benchmark IDs and their host projects. The next column specifies the command-line options; in particular, four working threads are specified to force the programs to run in multithreading mode.

The remaining columns are static statistics. Column "Binary Size" gives the sizes of the instrumented binaries. Another column records the preprocessing time of the static analysis (c.f. §4.1); among the 12 benchmarks, vpxdec takes the longest, approximately 30 minutes. Three further columns report the number of basic blocks, the total number of instructions, and the number of deputy instructions for M-Ins (c.f. §4.2), respectively. Recall that AFL-Ins instruments evenly over the entry instructions of all basic blocks, hence the basic-block count also denotes the number of deputy instructions in AFL, MAFL, and MOpt. The last column is the ratio of additional instructions Muzz instruments versus AFL (or MAFL, MOpt). This ratio ranges from 6.0% (pbzip2-c and pbzip2-d) to 288.9% (x265). Fortunately, in practice, this does not proportionally increase the runtime overhead. Many aspects affect this metric, including the characteristics of the target programs, the precision of the applied static analysis, and the empirically specified instrumentation thresholds (c.f. §6.5.1).

Fuzzing Configuration The experiments are conducted on four Intel(R) Xeon(R) Platinum 8151 CPU@3.40GHz workstations with 28 cores, each running 64-bit Ubuntu 18.04 LTS; the evaluation of a specific benchmark is conducted on one machine. To make fair comparisons, Muzz, MAFL and AFL are executed in their "fidgety mode" [65], while MOpt is specified with -L 0 to facilitate its "pacemaker mode" [33]. CPU affinity is turned off during fuzzing to avoid multiple threads being bound to a single CPU core. We run each of the aforementioned fuzzers six times against all 12 benchmark programs, with a time budget of 24 hours per run. Since all the evaluated programs are set to run with four working threads and the threads are mapped to different cores, each fuzzer consumes approximately 6 × 24 × 4 = 576 CPU hours per benchmark.

### 6.2 Seed Generation (RQ1)

Table 2 shows the overall fuzzing results in terms of newly generated seeds. We collect the total number of generated seeds and the number of seeds that exercise multithreading context. In AFL's jargon, the total seed count corresponds to the distinct paths the fuzzer observes [63]. The multithreading-relevant seeds are collected in a separate procedure, based on the observation that each of them invokes at least one element in TFork. The multithreading-relevant seed count therefore tracks the different multithreading execution states reached during fuzzing — a larger value suggests the fuzzer keeps more effective thread-interleaving seeds. We sum these seed files across all six fuzzing runs to form the totals in Table 2. A further column shows the percentage of multithreading-relevant seeds among all seeds; this ratio determines the probability of picking a multithreading-relevant seed during seed selection, which greatly impacts the overall quality of the generated seeds. Obviously, the two most critical metrics are the multithreading-relevant seed count and its percentage.

Muzz surpasses MAFL, AFL, and MOpt in both metrics. First, Muzz exhibits superiority in generating multithreading-relevant seeds — in all the benchmarks Muzz achieves the highest count. For example, in pbzip2-d, although the absolute numbers are relatively small, Muzz generated 297 multithreading-relevant seeds, which is 178 more than MAFL (119), 229 more than AFL (68), and 235 more than MOpt (62). Moreover, for larger programs such as im-cnvt (binary size 19.4M), Muzz's count (12987) is still better than the others' (MAFL: 10610, AFL: 7634, MOpt: 8012). Second, Muzz's percentage of multithreading-relevant seeds is even more impressive — Muzz wins the comparison on all the benchmarks. For example, in pbzip2-d, Muzz's percentage is the highest — Muzz: 14.9%, AFL: 7.0%, MAFL: 4.1%, MOpt: 3.8%. For benchmarks where AFL already achieves a decent result, e.g., 89.3% for x264, Muzz can even improve it to 96.5%. Meanwhile, although MAFL has the largest multithreading-relevant seed count for x265 (10890), its percentage (78.6%) is less than that of Muzz (82.6%).

It is worth noting that MAFL also outperforms AFL and MOpt w.r.t. both metrics in all the benchmarks. For example, in pxz-c, the number of generated multithreading-relevant seeds in MAFL is 3401, which is more than AFL (2470) and MOpt (2634). Correspondingly, the percentage of multithreading-relevant seeds in MAFL is 60.3%; for AFL and MOpt, it is 46.1% and 47.2%, respectively. Considering that MAFL, AFL, and MOpt all apply the same coverage-oriented instrumentation (AFL-Ins), we can conclude that the other strategies in MAFL, including thread-context instrumentation, schedule-intervention instrumentation, and the optimized dynamic strategies, also contribute to effective seed generation.

Answer to RQ1: Muzz has advantages in increasing the number and percentage of multithreading-relevant seeds for multithreaded programs. The proposed three thread-aware instrumentations and dynamic fuzzing strategies benefit the seed generation.

### 6.3 Vulnerability Detection (RQ2)

For vulnerability detection, we count the total number of proof-of-crash (POC) files generated during fuzzing. The vulnerability analysis component (right-bottom area in Figure 3) analyzes the POC files and categorizes them into different vulnerabilities. This basically follows ClusterFuzz's practice [16]: if two POC files have the same last N lines of their backtraces and the same root cause (e.g., both exhibit a buffer-overflow), they are treated as one vulnerability. Afterwards, we manually triage all the vulnerabilities into two groups based on their relevance to multithreading: the concurrency-vulnerabilities, and the other vulnerabilities that do not occur in multithreading context. The numbers of both groups are reported in Table 3.
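The grouping rule can be sketched as a key function over POC files; the field names and the choice of N are illustrative, not ClusterFuzz's actual implementation:

```python
from collections import defaultdict

N = 3  # number of trailing backtrace frames to compare (illustrative)

def vuln_key(poc):
    """Two POCs belong to the same vulnerability when they share the
    same root cause and the same last-N backtrace frames."""
    return (poc["root_cause"], tuple(poc["backtrace"][-N:]))

def categorize(pocs):
    """Group POC files into vulnerabilities by their dedup key."""
    groups = defaultdict(list)
    for poc in pocs:
        groups[vuln_key(poc)].append(poc)
    return groups
```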

We mainly refer to the numbers of multithreading-relevant POC files and concurrency-vulnerabilities in Table 3 to evaluate Muzz's concurrency-vulnerability detection capability.

The number of multithreading-relevant POC files is important since it corresponds to the number of distinct crashing states reached while executing multithreading context [34, 27]. Muzz clearly has the best results in all the benchmarks that have vulnerabilities (e.g., for im-cnvt, Muzz: 63, MAFL: 23, AFL: 6, MOpt: 6). Moreover, MAFL also exhibits better results than AFL and MOpt (e.g., for pbzip2-c, Muzz: 6, MAFL: 6, AFL: 0, MOpt: 0). This suggests that Muzz's and MAFL's emphasis on multithreading-relevant seed generation indeed helps to exercise more erroneous multithreading-relevant execution states.

The most important metric is the number of detected concurrency-vulnerabilities, since detecting them is our focus. Table 3 shows that Muzz has the best results: Muzz detects 9 concurrency-vulnerabilities, while MAFL, AFL, and MOpt detect 5, 4, and 4, respectively. The detected concurrency-vulnerabilities can be divided into three groups.

1) Vulnerabilities caused by concurrency-bugs. The 4 vulnerabilities in im-cnvt all belong to this group — misuses of caches shared among threads cause data races. The generated seeds may exhibit various symptoms such as buffer-overflow and memcpy-param-overlap. Muzz found all 4 of these vulnerabilities, while the others only found 2. We also observed that, for the 2 vulnerabilities detected by all the fuzzers, MAFL's detection capability appears more stable, since it detects both in all of its six fuzzing runs, while the others detect them in at most five runs (not depicted in the table).

2) Vulnerabilities triggered only in multithreading but not induced by concurrency-bugs. For example, the vulnerability in pbzip2-d stems from a stack-overflow error when executing a function concurrently. This crash can never happen when pbzip2-d works in single-thread mode, since that mode does not even invoke the erroneous function. In our evaluation, Muzz detected this vulnerability while the other fuzzers failed. Another case is the vulnerability in pbzip2-c, which was detected by Muzz and MAFL, but not by AFL or MOpt.

3) Other concurrency-vulnerabilities. Their characteristic is that the crashing backtrace contains multithreading context (i.e., TFork is invoked); however, the crashing condition might occasionally also be triggered when only one thread is specified. The vulnerabilities detected in vpxdec and x264 belong to this category. In particular, Muzz detects 2 vulnerabilities in vpxdec while MAFL, AFL, and MOpt only find 1.

We consider the reason behind these differences among the fuzzers to be that Muzz keeps more "deep" multithreading-relevant seeds that witness different execution states, and mutations on some of them are more prone to trigger the crashing conditions.

The remaining columns of Table 3 are metrics we focus on less. Still, we can observe that 1) Muzz (and MAFL) exercise more total crashing states, and 2) although Muzz usually generates fewer multithreading-irrelevant POC files, it still finds all the (categorized) multithreading-irrelevant vulnerabilities detected by the other fuzzers.

From the 12 evaluated benchmarks, we reported the 10 new vulnerabilities found by Muzz (excluding row vpxdec; 7 of them are concurrency-vulnerabilities). All of them have been confirmed or fixed, and 3 have already been assigned CVE IDs. Besides, we conducted a similar evaluation on libvpx v1.8.0-178 (the git HEAD version at the time of evaluation). Muzz detected a 0-day concurrency-vulnerability within 24 hours (among six fuzzing runs, two detected the vulnerability, in 5h38min and 16h07min, respectively), while MAFL, AFL, and MOpt failed to detect it within 15 days (360 hours) in all of their six fuzzing runs. This newly detected vulnerability has been assigned another CVE ID. The vulnerability details are available in Table 5.

Given that there are extremely few CVE records caused by concurrency-vulnerabilities (e.g., 202 among 70438, based on records from CVE-2014-* to CVE-2018-*) [48], Muzz demonstrates a strong capability in detecting concurrency-vulnerabilities.

Answer to RQ2: Muzz demonstrates superiority in exercising more multithreading-relevant crashing states and detecting concurrency-vulnerabilities.

### 6.4 Concurrency-bug Revealing (RQ3)

The fuzzing phase only detects vulnerabilities that manifest as crashes, but the seemingly normal seed files generated during fuzzing may still execute paths that trigger concurrency-violation conditions such as data-races and deadlocks. We detect concurrency-bugs in the concurrency-bug revealing component (right-top in Figure 3). It is worth noting that our goal is not to improve the capabilities of concurrency-bug detection over existing techniques such as TSan [42], Helgrind [49], or UFO [21]. Instead, we aim to reveal as many bugs as possible within a time budget by replaying fuzzer-generated seeds with the help of these techniques. In practice, this component feeds the target program with the seeds generated during fuzzing as inputs and leverages detectors such as TSan to reveal concurrency-bugs. During this evaluation, we compiled the target programs with TSan and replayed them against the fuzzer-generated multithreading-relevant seeds (Table 2). We did not replay all the generated seeds, since seeds that do not exercise multithreading context cannot reveal concurrency-bugs.
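Conceptually, this component amounts to running a TSan-built binary (compiled with -fsanitize=thread) over each saved seed and scanning stderr for reports. A minimal driver sketch; the binary path and argument layout are placeholders:

```python
import subprocess
from pathlib import Path

def replay_seeds(binary: str, seed_dir: str, args=()):
    """Run a ThreadSanitizer-instrumented binary on every seed file and
    collect the seeds whose runs produce TSan warnings on stderr."""
    reports = []
    for seed in sorted(Path(seed_dir).iterdir()):
        proc = subprocess.run([binary, *args, str(seed)],
                              capture_output=True, text=True)
        if "WARNING: ThreadSanitizer" in proc.stderr:
            reports.append((seed.name, proc.stderr))
    return reports
```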

We limit the replay time budget to two hours; §6.5.4 discusses the rationale for this configuration. The next step is to determine the replay pattern per seed to reveal more concurrency-bugs within this budget. This is necessary since TSan may fail to detect concurrency-bugs in a few runs when it does not observe concurrency-violation conditions [12, 49, 42]. Meanwhile, as the time budget is limited, we cannot exhaustively replay a given seed to see whether it may trigger concurrency-violations — in the worst case, we would waste time executing a seed that never violates the conditions. We provide two replay patterns.

1. The single-run pattern executes the program against each seed in the queue once per turn, in a round-robin way, until reaching the time budget.

2. The Nc-based pattern relies on the Nc(t) computed during repeated execution (c.f. §5.2): each seed t is executed Nc(t)/N0 times per turn, continuously, in a round-robin way. According to Equation (4), we replay 5 times per turn (40/8) for AFL-generated multithreading-relevant seeds; for Muzz and MAFL, the count is determined by Equation (5), with candidate values 2, 3, 4, and 5.

It is fair to compare the replay results of the two patterns in that the time budget is fixed. The difference between the two patterns is that the seeds' execution orders and the accumulated execution time spent on each seed can be rather different.
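Both patterns can be expressed as one round-robin loop that differs only in how many times each seed runs per turn. For determinism, this sketch counts executions instead of wall-clock time (the paper uses a fixed two-hour budget), and `run` stands in for executing the TSan-built program:

```python
import itertools

N0 = 8  # initial repeating times, as in Equation (4)

def replay_round_robin(seeds, run, total_execs, times_per_seed=lambda s: 1):
    """Round-robin replay until the execution budget is exhausted.
    The single-run pattern keeps the default (once per seed per turn);
    the Nc-based pattern passes times_per_seed=lambda s: nc[s] // N0."""
    execs = 0
    for seed in itertools.cycle(seeds):
        for _ in range(max(1, times_per_seed(seed))):
            if execs >= total_execs:
                return execs
            run(seed)
            execs += 1
```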

Table 4 depicts the results of concurrency-bug revealing under the two replay patterns. For each, we report the number of observed concurrency-violating executions and the number of concurrency-bugs categorized by root cause. For example, we count only one concurrency-bug even when the replaying process observes 10 data-race pairs across executions, as long as the root cause of the races is unique. We analyze this table from two perspectives.

First, Muzz demonstrates superiority in concurrency-bug detection regardless of the replay pattern. This is observed from the "best results" for each metric in each pattern: Muzz achieves the best results for most projects. For example, when x264 is replayed with the single-run pattern, 1) Muzz found the most violations (Muzz: 68, MAFL: 46, AFL: 28, MOpt: 30); and 2) the best bug count also comes from Muzz (Muzz: 8, MAFL: 6, AFL: 4, MOpt: 5). Similar results are observed with the Nc-based pattern for x264, where Muzz has the largest violation count (91) and the largest bug count (9). The only project where MAFL achieves the best result is pigz-c, where it is slightly better than Muzz.

Second, for Muzz and MAFL, the Nc-based pattern is probably better than the single-run pattern. This is concluded from the fact that the Nc-based pattern's "best results" are all better than the single-run pattern's. For example, for x264, the best violation count is achieved with the Nc-based pattern (single-run: 68, Nc-based: 91); similarly, the best bug count also comes from the Nc-based pattern (single-run: 8, Nc-based: 9). Meanwhile, there seems to be no such implication for AFL or MOpt. Besides the numbers of concurrency-violations and concurrency-bugs, §6.5.3 provides a case study on gm-cnvt that demonstrates the Nc-based pattern's advantage over the single-run pattern w.r.t. time-to-exposure of concurrency-bugs.

We have reported all the newly detected 19 concurrency-bugs (excluding the 3 concurrency-bugs in vpxdec-v1.3.0-5589) to their project maintainers (c.f., Table 5 for the details).

Answer to RQ3: Muzz outperforms its competitors in detecting concurrency-bugs; the Nc value calculated during fuzzing additionally contributes to revealing these bugs.

### 6.5 Further Discussions

This section discusses miscellaneous concerns, issues and observations for Muzz’s design and evaluation.

#### 6.5.1 Constant Parameters

Using empirical constant parameters for grey-box fuzzing is practiced by many fuzzing techniques [63, 33, 6]. For example, AFL itself has many hard-coded configurations used by default; MOpt additionally has the suggested configuration to control the time to move on to pacemaker mode (i.e., -L 0).

In Muzz, constant parameters are used in two places.

(1) The upper-bound thresholds for coverage-oriented instrumentation (c.f. §4.2). Their default values are inspired by AFL's "selective deputy instruction instrumentation" strategy, which sets the instrumentation ratio to 0.33 when AddressSanitizer is involved. Larger threshold values increase the instrumentation ratio only if the thresholds are frequently reached. The instrumented program then shows these symptoms: a) the program size after instrumentation increases; b) the execution-state feedback is potentially better; c) the instrumentation-introduced slowdown of execution speed is more evident. Therefore, increasing the thresholds reflects a tradeoff between precise feedback and its overhead. In our benchmarks, when we assign the two thresholds to 0.5 and 0.33, respectively:

- For im-cnvt, the execution speed slows down by about 15% compared to the default settings, while the capability of detecting concurrency-vulnerabilities and concurrency-bugs is similar; meanwhile, there are a few more multithreading-relevant seeds, but their percentage is slightly smaller.

- For pbzip2-c, the differences caused by changing the thresholds from their default settings are all negligible.

We believe there are no optimal instrumentation thresholds that work for all the projects; therefore Muzz provides the empirical values as the defaults.

(2) The seed selection probabilities in Algorithm 3. These constants are not introduced by Muzz; they are based on AFL's "skipping probability" that conditionally favors seeds with new coverage [63].

Since the 12 benchmarks that we chose are quite diversified (c.f., §6.1.2), it is considered fair to use default settings for these parameters, when comparing Muzz, MAFL with other fuzzers such as AFL, MOpt. In practice, we suggest keeping Muzz’s default settings to test other multithreaded programs.

#### 6.5.2 Schedule-intervention Instrumentation

The goal of Muzz’s schedule-intervention is to diversify interleavings during repeated executions in the fuzzing phase. During the evaluation, we did not separately evaluate the effects of schedule-intervention instrumentation. However, based on our observation, this instrumentation is important to achieve more stable fuzzing results. Two case studies can support this statement.

1. We turned off schedule-intervention instrumentation in Muzz and fuzzed lbzip2-c six times on the same machine. The resulting percentage of multithreading-relevant seeds is 54.5% (4533/8310), which is lower than the result in Table 2 (63.6% = 5127/8056). Since 54.5% is still greater than the results of AFL (42.9%) and MOpt (41.8%), this also indicates that Muzz's other two strategies indeed benefit multithreading-relevant seed generation.

2. We turned off schedule-intervention instrumentation in Muzz and fuzzed im-cnvt on a different machine. In all six fuzzing runs it detected only three concurrency-vulnerabilities, which is fewer than the result in Table 3 (four). Meanwhile, with schedule-intervention instrumentation re-enabled, Muzz can still detect four concurrency-vulnerabilities on that machine.

#### 6.5.3 Time-to-exposure for Concurrency-bug Revealing

In §6.4, we demonstrate the Nc-based pattern's advantage over the single-run pattern in terms of the occurrences of concurrency-violations and the number of categorized concurrency-bugs. Another interesting metric is the time-to-exposure of the two replay patterns — given the ground truth that the target programs contain certain concurrency-bugs, the minimal time each pattern needs to reveal all the known bugs. This metric further distinguishes the two patterns' concurrency-bug revealing capabilities.

We conducted a case study on gm-cnvt. From Table 4, with both replay patterns, TSan detected four concurrency-bugs by replaying the MAFL-generated multithreading-relevant seeds (10784 in total) from Table 2; their violation counts are also similar (79 vs. 83). We repeated the replay six times against these 10784 seeds with each of the two patterns. When a replaying process detects all four ground-truth concurrency-bugs, we record the total execution time (in minutes). Table 6 shows the results.

In Table 6, compared to the single-run pattern, the Nc-based pattern reduces the average time-to-exposure from 66.5 minutes to 34.1 minutes. This means that, given a tighter replay time budget (say, 60 minutes), the single-run pattern has a high chance of missing some of the four concurrency-bugs. Moreover, the Nc-based pattern is more stable, since its timing variance is much smaller than the single-run pattern's (91.0 vs. 959.2). This also implies that, in Table 4, for MAFL's concurrency-bug revealing capability on gm-cnvt, the Nc-based pattern's result is likely to be much better than the single-run pattern's.

This time-to-exposure evaluation suggests that, given a set of seeds, the Nc-based pattern tends to expose concurrency-bugs faster and more stably. Since Nc is closely related to schedule-intervention instrumentation (§4.4) and repeated execution (§5.2), this also indicates that these strategies are helpful for concurrency-bug revealing.

#### 6.5.4 Time Budget During Replaying

We chose two hours (2h) as the time budget of the replay phase during evaluation. Unlike the fuzzing phase, which aims to generate new seed files that exercise multithreading context, the replay phase runs the target program against existing seeds (generated during fuzzing). Therefore, the criterion is to 1) minimize the replay time and 2) ensure that the replay phase traverses all the generated seeds. For projects with fewer generated multithreading-relevant seeds (e.g., 126 for pbzip2-c when applying Muzz), traversing the seeds once with either replay pattern is quite fast; for projects with more generated seeds (e.g., 13774 for gm-cnvt when applying Muzz), this requires more time. To make the evaluation fair, we use a fixed time budget for all 12 benchmarks, so the seeds of projects like pbzip2-c are traversed repeatedly until timeout. During the evaluation, we found 2h to be moderate, since it allows all the generated multithreading-relevant seeds to be traversed at least once for every project.

A smaller time budget, e.g., 1h, may make the replay phase miss some generated seeds that trigger concurrency-violation conditions; in fact, Table 6 shows that the time-to-exposure of the concurrency-bugs may reach 101.5 minutes. Meanwhile, a larger time budget, e.g., 4h, would likely waste time on the 12 exercised benchmarks. In fact, in a case study on gm-cnvt with a 4h budget, although the number of observed violations nearly doubled, the number of revealed concurrency-bugs stayed the same as in Table 4, regardless of the replay pattern.

#### 6.5.5 Statistical Evaluation Results

Given the nature of multithreaded programs and our evaluation strategy for determining seeds' relevance to multithreading, we decided not to provide some commonly-used statistical results [27].

First, it is unfair to track coverage over time when comparing Muzz and MAFL with AFL or MOpt, due to the different meanings of "coverage". In fact, owing to coverage-oriented instrumentation (in Muzz) and thread-context instrumentation (in Muzz and MAFL), Muzz and MAFL cover more execution states and therefore naturally preserve more seeds. That is also why, in §6.2, the number and percentage of multithreading-relevant seeds matter more than the total seed count.

Second, we cannot compare the multithreading-relevant paths over time among Muzz, MAFL, AFL, and MOpt. The reason is simple: we resort to a separate procedure after fuzzing to determine whether a seed covers thread-forking routines. We have to do so because AFL and MOpt do not provide a built-in way to discover seeds' relevance to multithreading. Consequently, we also cannot plot multithreading-relevant crashing states over time.

Third, although statistical variance is important, it is not easy to calculate comprehensively. During the evaluation, to reduce the variance among individual runs, we apply an ensemble strategy that shares seeds among the six runs of each specific fuzzer [63]. However, for multithreaded target programs, another source of variance is the scheduling of the different threads (in our experiments, four working threads were specified). Muzz and MAFL have the schedule-intervention instrumentation to help diversify its effects, while it is absent in AFL and MOpt. In fact, from the case studies in §6.5.2, we envision that the variance may be huge across machines under different workloads. Consequently, providing fair statistical results w.r.t. variance may still be impractical. Therefore, we exclude variance metrics and only report metrics that exhibit the "overall results" summed across the six runs. Similarly, the case studies and comparisons in §6.2, §6.3, and §6.4 are all based on these "overall results". During the evaluation, we indeed observed that the results of Muzz and MAFL are more stable than those of AFL and MOpt.

## 7 Related Work

### 7.1 Grey-box Fuzzing Techniques

Multithreading-relevant bugs are inherently deep. To reveal deep bugs in target programs, some GBFs incorporate additional feedback [44, 29, 61, 7, 14, 52, 56, 55]. Angora [7] distinguishes different calling contexts when calculating deputy instruction transitions to keep more valuable seeds. Driller [44], QSYM [61], and Savior [8] integrate symbolic execution to provide additional coverage information for exercising deeper paths. Muzz draws inspiration from these techniques in that it provides more feedback for multithreading context, with stratified coverage-oriented and thread-context instrumentations as well as schedule-intervention instrumentation. Other fuzzing techniques utilize domain knowledge of the target program to generate more effective seeds [53, 54, 39]. Skyfire [53] and Superion [54] provide customized seed generation and mutation strategies for programs that take grammar-based inputs. SGF [39] relies on specifications of the structured input to improve seed quality. These techniques are orthogonal to Muzz and can be integrated into its seed mutation (c.f. Figure 3).

### 7.2 Static Concurrency-bug Prediction

Static concurrency-bug predictors aim to approximate the runtime behaviors of a concurrent program without actual execution. Several static approaches have been proposed for analyzing Pthread and Java programs [40, 50, 45]. LOCKSMITH [40] uses existential types to correlate locks and data in dynamic heap structures for race detection. Goblint [50] relies on thread-modular constant propagation and points-to analysis, and detects concurrency-bugs by considering conditional locking schemes. The detector in [51] scales to large codebases by sacrificing soundness and suppressing false alarms with heuristic filters. FSAM [45, 46] proposes a sparse flow-sensitive pointer analysis for C/C++ programs using context-sensitive thread-interleaving analysis. Currently, Muzz relies on the flow- and context-insensitive results of FSAM for its thread-aware instrumentations. We are seeking solutions that integrate other bug prediction techniques to further improve Muzz’s effectiveness.

### 7.3 Dynamic Analysis on Concurrency-bugs

There is a large body of dynamic analyses for concurrency-bugs. They fall into two categories: techniques for modeling concurrency-bugs, and strategies for triggering these bugs.

The techniques in the first category [12, 41, 42, 59] typically monitor memory and synchronization events [19]. The two fundamental models are the happens-before model [12] and the lockset model [41]. The happens-before model reports a race condition when two threads access a shared memory area in a causally unordered way and at least one of them writes to it. The lockset model conservatively reports a potential race if two threads access a shared memory area without holding a common lock. Modern detectors such as TSan [42] and Helgrind [49] usually apply a hybrid strategy that combines the two models. Muzz does not aim to improve existing concurrency violation models; instead, it relies on these models to detect concurrency-bugs with our fuzzer-generated seeds.

The second category of dynamic analyses focuses on how to trigger concurrency violation conditions. This includes random testing that mimics non-deterministic program executions [25, 38, 4], regression testing [47, 60] that targets interleavings introduced by code changes, model checking [13, 62, 57] and hybrid constraint solving [22, 20, 21] that systematically check or execute possible thread schedules, heuristics that avoid fruitless executions [66, 18, 17, 10], and techniques that utilize multicore processors to accelerate bug detection [37]. Our work differs from all of the above: our focus is not to test schedules with a given seed file, but to generate seed files that exercise multithreading-relevant paths. In particular, the goal of our schedule-intervention instrumentation is to diversify the actual schedules so as to provide better feedback during fuzzing.

## 8 Conclusion

This paper presented Muzz, a novel technique that brings thread-aware seed generation to GBFs for fuzzing multithreaded programs. Our approach performs three novel instrumentations that distinguish execution states introduced by thread-interleavings. Based on the feedback provided by these instrumentations, Muzz optimizes its dynamic strategies to stress different kinds of multithreading contexts. Experiments on 12 real-world programs demonstrate that Muzz outperforms other grey-box fuzzers such as AFL and MOpt in generating valuable seeds, detecting concurrency-vulnerabilities, and revealing concurrency-bugs.

## Acknowledgement

This research was supported (in part) by the National Research Foundation, Prime Ministers Office, Singapore under its National Cybersecurity R&D Program (Award No. NRF2018NCR-NCR005-0001), National Satellite of Excellence in Trustworthy Software System (Award No. NRF2018NCR-NSOE003-0001), and NRF Investigatorship (Award No. NRFI06-2020-0022) administered by the National Cybersecurity R&D Directorate. The research of Dr Xue is supported by CAS Pioneer Hundred Talents Program.

## References

• [1] L. O. Andersen. Program analysis and specialization for the C programming language. Technical report, DIKU, University of Copenhagen, 1994.
• [2] S. Blackshear, N. Gorogiannis, P. W. O’Hearn, and I. Sergey. RacerD: Compositional static race detection. OOPSLA, 2:144:1–144:28, Oct. 2018.
• [3] M. Böhme, V. T. Pham, and A. Roychoudhury. Coverage-based greybox fuzzing as Markov chain. In CCS ’16, pages 1032–1043. ACM, 2016.
• [4] Y. Cai and W. K. Chan. Magicfuzzer: Scalable deadlock detection for large-scale applications. In ICSE ’12, pages 606–616. IEEE, 2012.
• [5] Y. Cai, B. Zhu, R. Meng, H. Yun, L. He, P. Su, and B. Liang. Detecting concurrency memory corruption vulnerabilities. In ESEC/FSE ’19, pages 706–717, 2019.
• [6] H. Chen, Y. Xue, Y. Li, B. Chen, X. Xie, X. Wu, and Y. Liu. Hawkeye: Towards a desired directed grey-box fuzzer. In CCS ’18, pages 2095–2108. ACM, 2018.
• [7] P. Chen and H. Chen. Angora: Efficient fuzzing by principled search. In SP ’18, pages 711–725, 2018.
• [8] Y. Chen, P. Li, J. Xu, S. Guo, R. Zhou, Y. Zhang, T. Wei, and L. Lu. SAVIOR: towards bug-driven hybrid testing. In SP ’20, 2020.
• [9] I. Chowdhury and M. Zulkernine. Using complexity, coupling, and cohesion metrics as early indicators of vulnerabilities. Journal of Systems Architecture, 57(3):294–313, Mar. 2011.
• [10] M. Christakis, A. Gotovos, and K. Sagonas. Systematic testing for detecting concurrency errors in erlang programs. In ICST 2013, pages 154–163, March 2013.
• [11] P. Di and Y. Sui. Accelerating dynamic data race detection using static thread interference analysis. In PMAM ’16, pages 30–39. ACM, 2016.
• [12] C. Flanagan and S. N. Freund. FastTrack: efficient and precise dynamic race detection. In PLDI ’09, pages 121–133. ACM, 2009.
• [13] C. Flanagan and P. Godefroid. Dynamic partial-order reduction for model checking software. In POPL ’05, pages 110–121. ACM, 2005.
• [14] S. Gan, C. Zhang, X. Qin,