Harvey: A Greybox Fuzzer for Smart Contracts

05/15/2019 ∙ Valentin Wüstholz, et al. ∙ ConsenSys and Max Planck Institute for Software Systems

We present Harvey, an industrial greybox fuzzer for smart contracts, which are programs managing accounts on a blockchain. Greybox fuzzing is a lightweight test-generation approach that effectively detects bugs and security vulnerabilities. However, greybox fuzzers randomly mutate program inputs to exercise new paths; this makes it challenging to cover code that is guarded by narrow checks, which are satisfied by no more than a few input values. Moreover, most real-world smart contracts transition through many different states during their lifetime, e.g., for every bid in an auction. To explore these states and thereby detect deep vulnerabilities, a greybox fuzzer would need to generate sequences of contract transactions, e.g., by creating bids from multiple users, while at the same time keeping the search space and test suite tractable. In this experience paper, we explain how Harvey alleviates both challenges with two key fuzzing techniques and distill the main lessons learned. First, Harvey extends standard greybox fuzzing with a method for predicting new inputs that are more likely to cover new paths or reveal vulnerabilities in smart contracts. Second, it fuzzes transaction sequences in a targeted and demand-driven way. We have evaluated our approach on 27 real-world contracts. Our experiments show that the underlying techniques significantly increase Harvey's effectiveness in achieving high coverage and detecting vulnerabilities, in most cases orders-of-magnitude faster; they also reveal new insights about contract code.


I Introduction

Smart contracts are programs that manage crypto-currency accounts on a blockchain. Reliability of these programs is of critical importance since bugs may jeopardize digital assets. Automatic test generation has been shown to be an effective approach for finding such vulnerabilities, thereby improving software quality. In fact, there exists a wide variety of test-generation tools, ranging from random testing [1, 2, 3] and greybox fuzzing [4, 5] to dynamic symbolic execution [6, 7].

Random testing [1, 2, 3] and blackbox fuzzing [8, 9] generate random inputs to a program, run the program with these inputs, and check for bugs. Despite the practicality of these techniques, their effectiveness, that is, their ability to explore new paths, is limited. The search space of valid program inputs is typically huge, and a random exploration can only exercise a small fraction of (mostly shallow) paths.

At the other end of the spectrum, dynamic symbolic execution [6, 7] and whitebox fuzzing [10, 11, 12] repeatedly run a program, both concretely and symbolically. At runtime, they collect symbolic constraints on program inputs from branch statements along the execution path. These constraints are then appropriately modified and a constraint solver is used to generate new inputs, thereby steering execution toward another path. Although these techniques are very effective in covering new paths, they cannot be as efficient and scalable as other test-generation techniques that do not spend any time on program analysis and constraint solving.

Greybox fuzzing [4, 5] lies in the middle of the spectrum between performance and effectiveness in discovering new paths. It does not require program analysis or constraint solving, but it relies on a lightweight program instrumentation that allows the fuzzer to tell when an input exercises a new path. In other words, the instrumentation is useful in computing a unique identifier for each explored path in the program under test. American Fuzzy Lop (AFL) [4] is a prominent example of a state-of-the-art greybox fuzzer that has detected numerous bugs and security vulnerabilities [13].
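To make this concrete, the following minimal Python sketch shows one simple way lightweight instrumentation can yield a unique identifier per explored path: record the branch decisions of a run and hash them. This is only an illustration; it is not AFL's actual scheme (AFL uses a compact edge-coverage map) nor Harvey's implementation.

import hashlib
from typing import List, Tuple

def path_id(branch_trace: List[Tuple[int, bool]]) -> str:
    """branch_trace holds (branch-site id, outcome) pairs recorded at runtime."""
    h = hashlib.sha256()
    for site, taken in branch_trace:
        h.update(site.to_bytes(8, "little"))
        h.update(b"\x01" if taken else b"\x00")
    return h.hexdigest()

# Two runs that flip any branch yield different identifiers:
assert path_id([(1, True), (2, False)]) != path_id([(1, True), (2, True)])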

In this paper, we present Harvey, the first greybox fuzzer for smart contracts. We report on our experience in designing Harvey, and in particular, focus on how to alleviate two key challenges we encountered when fuzzing real-world contracts. Although the challenges are not exclusive to our specific application domain, our techniques are shown to be quite effective for smart contracts and there are important lessons to be learned from our experiments.

Challenge #1. Despite the fact that greybox fuzzing strikes a good balance between performance and effectiveness, the inputs are still randomly mutated, for instance, by flipping arbitrary bits. As a result, many generated inputs exercise the same program paths. To address this problem, several techniques have emerged that direct greybox fuzzing toward low-frequency paths [14], vulnerable paths [15], deep paths [16], or specific sets of program locations [17]. Such techniques have mostly focused on which seed inputs to prioritize and which parts of these inputs to mutate.

Challenge #2. Smart contracts may transition through many different states during their lifetime, for instance, for every bet in a gambling game. The same holds for any stateful system that is invoked repeatedly, such as a web service. Therefore, detecting vulnerabilities in such programs often requires generating and fuzzing sequences of invocations that explore the possible states. For instance, to test a smart contract that implements a gambling game, a fuzzer would need to automatically create sequences of bets from multiple players. However, since the number of possible sequences grows exponentially with the sequence length, it is difficult to efficiently detect the few sequences that reveal a bug.

Our approach and lessons learned. To alleviate the first challenge, we developed a technique that systematically predicts new inputs for the program under test with the goal of increasing the performance and effectiveness of greybox fuzzing. In contrast to existing work in greybox fuzzing, our approach suggests concrete input values based on information from previous executions, instead of performing arbitrary mutations. And in contrast to whitebox fuzzing, our input-prediction mechanism remains particularly lightweight.

Inputs are predicted in a way that aims to direct greybox fuzzing toward optimal executions, for instance, defined as executions that flip a branch condition in order to increase coverage. Our technique is parametric in what constitutes an optimal execution, and in particular, in what properties such an execution needs to satisfy.

More specifically, each program execution is associated with zero or more cost metrics, which are computed automatically. A cost metric captures how close the execution is to satisfying a given property at a given program location. Executions that minimize a cost metric are considered optimal with respect to that metric. For example, a cost metric could be defined at each arithmetic operation in the program such that it is minimized (i.e., becomes zero) when an execution triggers an arithmetic overflow. Our technique uses the costs that are computed with cost metrics along executions of the program to iteratively predict inputs leading to optimal executions.

Our experiments show that Harvey is extremely successful in predicting inputs that flip a branch condition even in a single iteration (success rate of 99%). This suggests a low complexity of branch conditions in real-world smart contracts.

Although this input-prediction technique is very effective in practice, it is not sufficient for thoroughly testing a smart contract and its state space. As a result, Harvey generates, executes, and fuzzes sequences of transactions, which invoke the contract’s functions. Each of these transactions can have side effects on the contract’s state, which may affect the execution of subsequent invocations. To alleviate the second challenge of exploring the search space of all possible sequences, we devised a technique for demand-driven sequence fuzzing, which avoids generating transaction sequences when they cannot further increase coverage.

Our experiments show that 74% of bugs in real smart contracts require generating more than one transaction to be found. This highlights the need for techniques like ours that are able to effectively prune the space of transaction sequences.

In total, we evaluate Harvey on 27 Ethereum smart contracts. Our fuzzer’s underlying techniques significantly increase its effectiveness in achieving high coverage (by up to 3x) and detecting vulnerabilities, in most cases orders-of-magnitude faster.

Contributions. We make the following contributions:

  • We present Harvey, the first greybox fuzzer for smart contracts, which is being used industrially by one of the largest blockchain-security consulting companies.

  • We describe our architecture and two key techniques for alleviating the important challenges outlined above.

  • We evaluate our fuzzer on 27 real-world benchmarks and demonstrate that the underlying techniques significantly increase its effectiveness.

  • We distill the main lessons learned from fuzzing smart-contract code.

II Background

In this section, we give background on standard greybox fuzzing and smart contracts.

II-A Greybox Fuzzing

Alg. 1 shows how greybox fuzzing works. (The parts marked “input prediction only” should be ignored for now.) The fuzzer takes as input the program under test and a set of seeds. It starts by running the program with the seeds, and during each program execution, the instrumentation captures the path that is currently being explored and associates it with a unique identifier (line 1). Note that the resulting data structure is a key-value store that maps each path identifier to an input exercising the corresponding path. Next, an input is selected for mutation (line 3), and it is assigned an “energy” value that denotes how many times it should be fuzzed (line 5).

The input is mutated (line 12), and the program is run with the new input (line 13). If the program follows a path that has not been previously explored, the new input is added to the test suite (lines 14–15). The above process is repeated until an exploration bound is reached (line 2). The fuzzer returns a test suite containing one test for each explored path.
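The Python sketch below mirrors this baseline loop, without the input-prediction extensions of Alg. 1. The helper names follow the pseudocode; the fixed energy value and the bit-flip mutation are simplifying assumptions, not Harvey's actual choices.

import random
from typing import Callable, Dict, List

def greybox_fuzz(run: Callable[[bytes], str], seeds: List[bytes],
                 interrupted: Callable[[], bool]) -> List[bytes]:
    path_ids: Dict[str, bytes] = {}                       # path identifier -> input
    for s in seeds:                                       # RunSeeds
        path_ids.setdefault(run(s), s)
    while not interrupted():
        inp = random.choice(list(path_ids.values()))      # PickInput
        energy = 16                                       # AssignEnergy (fixed here)
        for _ in range(energy):
            new = fuzz_input(inp)                         # FuzzInput: random mutation
            pid = run(new)                                # Run: returns a path identifier
            if pid not in path_ids:                       # IsNew
                path_ids[pid] = new                       # Add
    return list(path_ids.values())                        # one test per explored path

def fuzz_input(inp: bytes) -> bytes:
    """Flip a random bit of the input (one of many possible mutations)."""
    if not inp:
        return bytes([random.randrange(256)])
    data = bytearray(inp)
    pos = random.randrange(len(data) * 8)
    data[pos // 8] ^= 1 << (pos % 8)
    return bytes(data)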

II-B Smart Contracts

Ethereum [18, 19] is one of the most popular blockchain-based [20, 21, 22], distributed-computing platforms [23]. It supports two kinds of accounts, user and contract accounts, both of which store a balance, are owned by a user, and publicly reside on the blockchain.

In contrast to a user account, a contract account is managed through code that is associated with it. The contract code captures agreements between users, for example, to encode the rules of an auction. A contract account also has persistent state where the code may store data, such as auction bids.

Contract accounts, their code, and persistent state are called smart contracts. Programmers may write the code in several languages, like Solidity or Vyper, all of which compile to the Ethereum Virtual Machine (EVM) [24] bytecode.

To interact with a contract, users issue transactions that call its functions, for instance, to bid in an auction, and are required to pay a fee for transactions to be executed. This fee is called gas and is roughly proportional to how much code is run.

III Overview

We now give an overview of our approach focusing on the challenges we aim to alleviate.

III-A Challenge #1: Random Input Mutations

Fig. 1 shows a constructed smart-contract function baz (in Solidity) that takes as input three (256-bit) integers a, b, and c and returns an integer. There are five paths in this function, all of which are feasible. Each path is denoted by a unique return value. (The minimize statements should be ignored for now.)

When running AFL, a state-of-the-art greybox fuzzer, on (a C version of) function baz, only four out of five paths are explored within 12h. During this time, greybox fuzzing constructs a test suite of four inputs, each of which exercises a different path. The path with return value 2 remains unexplored even after the fuzzer generates about 311M different inputs. All but four of these inputs are discarded as they exercise a path in baz that has already been covered by a previous test.

1   function baz(int256 a, int256 b, int256 c)
2                             returns (int256) {
3     int256 d = b + c;
4     minimize(d < 1 ? 1 - d : 0);
5     minimize(d < 1 ? 0 : d);
6     if (d < 1) {
7       minimize(b < 3 ? 3 - b : 0);
8       minimize(b < 3 ? 0 : b - 2);
9       if (b < 3) {
10        return 1;
11      }
12      minimize(a == 42 ? 1 : 0);
13      minimize(a == 42 ? 0 : abs(a - 42));
14      if (a == 42) {
15        return 2;
16      }
17      return 3;
18    } else {
19      minimize(c < 42 ? 42 - c : 0);
20      minimize(c < 42 ? 0 : c - 41);
21      if (c < 42) {
22        return 4;
23      }
24      return 5;
25    }
26  }
Fig. 1: Example for fuzzing with input prediction.

The path with return value 2 is not covered because greybox fuzzers randomly mutate program inputs (line 12 of Alg. 1). It is generally challenging for fuzzers to generate inputs that satisfy “narrow checks”, that is, checks that only become true for very few input values (e.g., line 14 of Fig. 1). In this case, the probability that the fuzzer will generate value 42 for input a is 1 out of 2^256 for 256-bit integers. Even worse, to cover the path with return value 2 (line 15), the sum of inputs b and c also needs to be less than 1 (line 6) and b must be greater than or equal to 3 (line 9). As a result, several techniques have been proposed to guide greybox fuzzing to satisfy such narrow checks, e.g., by selectively applying whitebox fuzzing [25].

Fuzzing with input prediction. In contrast, our technique for input prediction is more lightweight, without requiring any program analysis or constraint solving. It does, however, require additional instrumentation of the program to collect more information about its structure than standard greybox fuzzing, thus making fuzzing a lighter shade of grey. This information captures the distance from an optimal execution at various points in the program and is then used to predict inputs that guide exploration toward optimal executions.

Our fuzzer takes as input a program and a set of seeds. It also requires a partial function that maps execution states to cost metrics. Whenever execution reaches a state in the domain of this function, the fuzzer evaluates the corresponding cost metric. For example, the minimize statements in Fig. 1 define such a function for baz: each minimize statement specifies a cost metric at the execution state where it is evaluated. Note that this function constitutes a runtime instrumentation of the program; we use minimize statements only for illustration. A compile-time instrumentation would increase gas usage of the contract and potentially lead to false positives when detecting out-of-gas errors.

The cost metrics of Fig. 1 define optimal executions as those that flip a branch condition. Specifically, consider an execution along which variable d evaluates to 0. This execution takes the then-branch of the first if-statement, and the cost metric defined by the minimize statement on line 4 evaluates to 1. This means that the distance of the current execution from an execution that exercises the (implicit) else-branch of the if-statement is 1. Now, consider a second execution that also takes this then-branch (d evaluates to –1). In this case, the cost metric on line 4 evaluates to 2, which indicates a greater distance from an execution that exercises the else-branch.

Based on this information, our input-prediction technique is able to suggest new inputs that make the execution of baz take the else-branch of the first if-statement and minimize the cost metric on line 4 (i.e., the cost becomes zero). For instance, assume that the predicted inputs cause d to evaluate to 7. Although the cost metric on line 4 is now minimized, the cost metric on line 5 evaluates to 7, which is the distance of the current execution from an execution that takes the then-branch.

Similarly, the minimize statements on lines 7–8, 12–13, and 19–20 of Fig. 1 define cost metrics that are minimized when an execution flips the branch condition of the subsequent if-statement. This instrumentation aims to maximize path coverage, and for this reason, an execution can never minimize all cost metrics. In fact, the fuzzer has achieved full path coverage when the generated tests cover all feasible combinations of branches in the program; that is, when they minimize all possible combinations of cost metrics.

The fuzzer does not exclusively rely on prediction to generate program inputs, for instance, when there are not enough executions from which to make a good prediction. In the above example, the inputs for the first two executions (where d is 0 and –1) are generated by the fuzzer without prediction. Prediction can only approximate correlations between inputs and their corresponding costs; therefore, it is possible that certain predicted inputs do not lead to optimal executions. In such cases, it is also up to standard fuzzing to generate inputs that cover any remaining paths.

For the example of Fig. 1, Harvey explores all five paths within 0.27s and after generating only 372 different inputs.

III-B Challenge #2: State Space Exploration

Fig. 2 shows a simple contract Foo. The constructor on line 5 initializes variables x and y, which are stored in the persistent state of the contract. In function Bar, the failing assertion (line 12) denotes a bug. An assertion violation causes a transaction to be aborted, and as a result, users lose their gas. Triggering the bug requires a sequence of at least three transactions, invoking functions SetY(42), CopyY(), and Bar(). (Note that a transaction may directly invoke at most one contract function.) The assertion violation may also be triggered by calling IncX 42 times before invoking Bar.

1   contract Foo {
2     int256 private x;
3     int256 private y;
4
5     constructor () public {
6       x = 0;
7       y = 0;
8     }
9
10    function Bar() public returns (int256) {
11      if (x == 42) {
12        assert(false);
13        return 1;
14      }
15      return 0;
16    }
17
18    function SetY(int256 ny) public { y = ny; }
19
20    function IncX() public { x++; }
21
22    function CopyY() public { x = y; }
23  }
Fig. 2: Example for demand-driven sequence fuzzing.

There are three ways to test this contract with a standard greybox fuzzer. First, each function could be fuzzed separately without considering the persistent variables of the contract as fuzzable inputs. For example, Bar would be executed only once, since it has zero fuzzable inputs. No matter the initial values of x and y, the fuzzer would only explore one path in Bar.

Second, each function could be fuzzed separately while considering the persistent variables as fuzzable inputs. The fuzzer would then try to explore both paths in Bar by generating values for x and y. A problem with this approach is that the probability of generating value 42 for x is tiny, as discussed earlier. More importantly, however, this approach might result in false positives when the persistent state generated by the fuzzer is not reachable with any sequence of transactions. For example, the contract would never fail if SetY ensured that y is never set to 42 and IncX only incremented x up to 41.

Third, the fuzzer could try to explore all paths in all possible sequences of transactions up to a bounded length. This, however, means that a path would span all transactions (instead of a single function). For example, a transaction invoking Bar and a sequence of two transactions invoking CopyY and Bar would exercise two different paths in the contract, even though from the perspective of Bar this is not the case. With this approach, the number of possible sequences grows exponentially with their length, and so does the number of tests in the test suite. The larger the test suite, the more difficult it becomes to find a test that, when fuzzed, leads to the assertion in Foo, especially within a certain time limit.

We propose a technique for demand-driven sequence fuzzing that alleviates these limitations. First, it discovers that the only branch in Foo that requires more than a single transaction to be covered is the one leading to the assertion in Bar. Consequently, Harvey only generates transaction sequences whose last transaction invokes Bar. Second, our technique aims to increase path coverage only of the function that is invoked by this last transaction. In other words, the goal of any previous transactions is to set up the state, and path identifiers are computed only for the last transaction. Therefore, reaching the assertion in Bar by first calling SetY(42) and CopyY() or by invoking IncX() 42 times results in covering the same path of the contract.

Harvey triggers the above assertion violation in about 18s.

IV Fuzzing with Input Prediction

In this section, we present the technical details of how we extend greybox fuzzing with input prediction.

IV-A Algorithm

The steps marked “input prediction only” in Alg. 1 indicate the key differences. In addition to the program under test and a set of seeds, Alg. 1 takes as input a partial function that, as explained earlier, maps execution states to cost metrics. The fuzzer first runs the program with the seeds, and during each program execution, it evaluates the cost metric for every encountered execution state in the domain of this function (line 1). Like in standard greybox fuzzing, each explored path is associated with a unique identifier. Note, however, that the data structure now maps a path identifier both to an input that exercises the corresponding path and to a cost vector, which records all costs computed during execution of the program with this input. Next, an input is selected for mutation (line 3) and assigned an energy value (line 5).

The input is mutated (line 12), and the program is run with the new input (line 13). We assume that the new input differs from the original input (which was selected for mutation on line 3) by the value of a single input parameter—an assumption that typically holds for mutation-based greybox fuzzers. As usual, if the program follows a path that has not been explored, the new input is added to the test suite (lines 14–15).

On line 17, the original and the new input are passed to the prediction component of the fuzzer along with their cost vectors. This component inspects the two inputs to determine the input parameter by which they differ. Based on the cost vectors, it then suggests a new value for this input parameter such that one of the cost metrics is minimized. If a new input is predicted, the program is tested with this input; otherwise, the original input is mutated again (lines 8–10). The former happens even if the energy of the original input has run out (line 7) to ensure that we do not waste predicted inputs.

The above process is repeated until an exploration bound is reached (line 2), and the fuzzer returns a test suite containing one test for each program path that has been explored.
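The Python sketch below illustrates what the Predict step on line 17 might look like; the helper and parameter names are hypothetical, not Harvey's. It locates the single parameter by which the two inputs differ, picks a metric whose costs are non-zero and differ in the two runs, and extrapolates a value that would drive that cost to zero (the extrapolation itself is explained in Sect. IV-B).

from typing import Dict, Optional

Input = Dict[str, int]   # input parameter name -> value
Costs = Dict[int, int]   # instrumentation point (location) -> cost

def predict(inp: Input, inp2: Input, costs: Costs, costs2: Costs) -> Optional[Input]:
    """Suggest a new input that minimizes one cost metric, or None."""
    diff = [p for p in inp if inp[p] != inp2[p]]
    if len(diff) != 1:
        return None
    p = diff[0]
    for loc in costs.keys() & costs2.keys():
        y0, y1 = costs[loc], costs2[loc]
        if y0 == 0 or y1 == 0 or y0 == y1:
            continue   # metric already minimized in one run, or no slope to extrapolate
        x0, x1 = inp[p], inp2[p]
        predicted = round(x1 - y1 * (x1 - x0) / (y1 - y0))   # Secant step (Sect. IV-B)
        return {**inp2, p: predicted}
    return None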

Example. In Tab. I, we run our algorithm on the example of Fig. 1 step by step. The first column of the table shows an identifier for every generated test, and the second column shows the path that each test exercises, identified by the return value of the program. The highlighted boxes in this column denote paths that are covered for the first time, which means that the corresponding tests are added to the test suite (lines 14–15 of Alg. 1). The third column shows the test identifier from which the input selected for mutation is taken (line 3 of Alg. 1). Note that, according to the algorithm, this input is selected from tests in the test suite.

Input: Program prog, Seeds seeds, Cost metrics cm      ▷ cm: input prediction only

1   paths ← RunSeeds(prog, seeds, cm)
2   while ¬Interrupted() do
3       input ← PickInput(paths)
4       i ← 0
5       energy ← AssignEnergy(input)
6       pred ← nil                                      ▷ input prediction only
7       while i < energy ∨ pred ≠ nil do
8           if pred ≠ nil then                          ▷ input prediction only
9               input′ ← pred
10              pred ← nil
11          else
12              input′ ← FuzzInput(input)
13          pid, costs′ ← Run(prog, input′, cm)
14          if IsNew(pid, paths) then
15              Add(paths, pid, input′, costs′)
16          costs ← CostVector(paths, input)            ▷ input prediction only
17          pred ← Predict(input, input′, costs, costs′)  ▷ input prediction only
18          i ← i + 1

Output: Test suite Inputs(paths)

Algorithm 1 Greybox fuzzing with input prediction.

The fourth column shows a new input for the program under test; this input is either a seed or the new input produced in the algorithm's inner loop, which is obtained with input prediction (line 9) or fuzzing (line 12). Each highlighted box in this column denotes a predicted value. The fifth column shows the cost vector that is computed when running the program with the new input of the fourth column. Note that we show only non-zero costs and that the subscript of each cost denotes the line number of the corresponding minimize statement in Fig. 1. The sixth column shows which costs (if any) are used to predict a new input, and the last column shows the current value of the algorithm's energy counter (lines 4 and 18). For simplicity, we assume that the energy assigned on line 5 of Alg. 1 always has value 2 in this example. Our implementation, however, incorporates an existing energy schedule [14].

We assume that the set of seeds contains only a single random input (test #1 in Tab. I). This input is then fuzzed (test #2) to produce a new value for input parameter b. Our algorithm uses the costs computed with the metric on line 7 to predict a new value for b. (We explain how new values are predicted in the next subsection.) As a result, test #3 exercises a new path of the program (the one with return value 3). From the cost vectors of tests #1 and #3, only the costs computed with the metric on line 4 may be used to predict another value for b; the costs for lines 7 and 8 are already zero in one of the two tests, while the metrics on lines 12 and 13 are not reached in test #1. Even though the energy of the original input (from test #1) has run out, the algorithm still runs the program with the input predicted from the line-4 costs (line 7 of Alg. 1). This results in covering the path with return value 4.

Next, we select an input from tests #1, #3, or #4 of the test suite. Let us assume that the fuzzer picks the input from test #3 and mutates the value of input parameter a. Note that the cost vectors of tests #3 and #5 differ only with respect to the costs on line 13, which are therefore used to predict a new input for a. The new input exercises a new path of the program (the one with return value 2). At this point, the cost vectors of tests #3 and #6 cannot be used for prediction because the costs are either the same (lines 4 and 8) or already zero in one of the two tests (lines 12 and 13). Since no input is predicted and the energy of the original input (from test #3) has run out, our algorithm selects another input from the test suite.

Test | Path | Input from Test | New Input (a, b, c) | Costs | Prediction (Cost) | Energy
1 0
1
2 1 1 0
3 1 1
3 3
4 1 2
4 6
5 3 3 7 3 0
6 3 3 1
2 42
7 4 4 6 0 0
8 4 6 1
5 42
TABLE I: Running Alg. 1 on the example of Fig. 1.

This time, let us assume that the fuzzer picks the input from test #4 and mutates the value of input parameter c. From the cost vectors of tests #4 and #7, it randomly selects the costs computed on line 19 for predicting a new value for c. The predicted input exercises the fifth path of the program (the one with return value 5), thus achieving full path coverage of function baz by generating only 8 tests.

Note that our algorithm makes several non-systematic choices, which may be random or based on heuristics, such as when function PickInput picks an input from the test suite, when FuzzInput selects which input parameter to fuzz, or when Predict decides which costs to use for prediction. To illustrate how the algorithm works, we made “good” choices such that all paths are exercised with a small number of tests. In practice, the fuzzer achieved full path coverage of function baz with 372 tests, instead of 8, as discussed in Sect. III-A.

IV-B Input Prediction

Our algorithm passes to the prediction component the original and the new input along with their cost vectors (line 17 of Alg. 1). The two inputs differ by the value of a single input parameter, say x and x′. Now, let us assume that the prediction component selects a cost metric to minimize and that the costs that have been evaluated using this metric are y and y′ in the two cost vectors. This means that cost y is associated with input value x, and y′ with x′.

As an example, let us consider tests #3 and #5 from Tab. I. The input vectors differ by the value of input parameter a, so x is the value of a in test #3 and x′ the value of a in test #5. The prediction component chooses to make a prediction based on the cost metric on line 13 of Fig. 1 since the cost vectors of tests #3 and #5 differ only with respect to this metric; y and y′ are then the corresponding line-13 costs in tests #3 and #5.

Using the two data points (x, y) and (x′, y′), the goal is to find a value x″ such that the corresponding cost is zero. In other words, our technique aims to find a root of the unknown, but computable, function that relates input parameter a to the cost metric on line 13. While there is a wide range of root-finding algorithms, Harvey uses the Secant method. Like other methods, such as Newton's, the Secant method tries to find a root by performing successive approximations.

Its basic approximation step considers the two data points as x-y-coordinates on a plane. Our technique then fits a straight line y = m·x + n through the points, where m is the slope of the line and n is a constant. To predict the new input value, it determines the x-coordinate where the line intersects the x-axis (i.e., where the cost is zero).

From the points (x, y) and (x′, y′) defined by tests #3 and #5, we compute the corresponding line and its intersection with the x-axis: for the cost to be zero, the value of parameter a must be 42. Indeed, when a becomes 42 in test #6, the cost metric on line 13 is minimized.

This basic approximation step is precise if the target cost metric is indeed linear (or piece-wise linear) in the input parameter for which we are making a prediction. If not, the approximation may fail to minimize the cost metric. In such cases, Harvey applies the basic step iteratively (as in the Secant method). Our experiments show that one iteration is typically sufficient for real contracts.
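As an illustration (not Harvey's actual code), the following Python sketch implements the basic Secant step and its iterative application; the values a = 3 and a = 7 are hypothetical stand-ins for the inputs of tests #3 and #5, since the mechanics do not depend on the exact values.

from typing import Callable, Optional

def predict_value(x0: int, y0: int, x1: int, y1: int) -> Optional[int]:
    """One Secant step: fit a line through (x0, y0) and (x1, y1) and return
    the x at which it crosses zero (i.e., where the cost would vanish)."""
    if y0 == y1:
        return None   # no slope to extrapolate from
    return round(x1 - y1 * (x1 - x0) / (y1 - y0))

def predict_iteratively(cost: Callable[[int], int], x0: int, x1: int,
                        max_steps: int = 10) -> Optional[int]:
    """Apply the basic step repeatedly (as in the Secant method) when a single
    step does not minimize a non-linear cost metric."""
    y0, y1 = cost(x0), cost(x1)
    for _ in range(max_steps):
        x2 = predict_value(x0, y0, x1, y1)
        if x2 is None:
            return None
        y2 = cost(x2)
        if y2 == 0:
            return x2
        x0, y0, x1, y1 = x1, y1, x2, y2
    return None

def cost_line13(a: int) -> int:
    """Cost metric on line 13 of Fig. 1: distance from satisfying a == 42."""
    return abs(a - 42)

# With hypothetical inputs a = 3 and a = 7, one step already predicts a = 42.
assert predict_value(3, cost_line13(3), 7, cost_line13(7)) == 42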

IV-C Cost Metrics

We now describe the different cost metrics that our fuzzer aims to minimize: (1) ones that are minimized when execution flips a branch condition, and (2) ones that are minimized when execution is able to modify arbitrary memory locations.

Branch conditions. We have already discussed cost metrics that are minimized when execution flips a branch condition in the example of Fig. 1. Here, we show how the cost metrics are automatically derived from the program under test.

For the comparison operators ==, <, and <=, we define the following cost functions:

  cost_¬==(l, r) = 1 if l == r, and 0 otherwise
  cost_==(l, r)  = |l − r|
  cost_¬<(l, r)  = r − l if l < r, and 0 otherwise
  cost_<(l, r)   = l − r + 1 if l >= r, and 0 otherwise
  cost_¬<=(l, r) = r − l + 1 if l <= r, and 0 otherwise
  cost_<=(l, r)  = l − r if l > r, and 0 otherwise

Function cost_¬== from above is non-zero when a branch condition l == r holds; it defines the cost metric for making this condition false. On the other hand, function cost_== defines the cost metric for making the same branch condition true. The arguments l and r denote the left and right operands of the operator. The notation is similar for all other functions.

Based on these cost functions, our instrumentation evaluates two cost metrics before every branch condition in the program under test. The metrics that are evaluated depend on the comparison operator used in the branch condition. The cost functions for the other comparison operators, i.e., !=, >, and >=, are easily derived from the functions above, and our tool supports them. Note that our implementation works on the bytecode, where logical operators, like && or ||, are expressed as branch conditions. We, thus, do not define cost functions for such operators, but they would also be straightforward.

Observe that the inputs of the above cost functions are the operands of comparison operators, and not program inputs. This makes the cost functions precise, that is, when a cost is minimized, the corresponding branch is definitely flipped. Approximation can only be introduced when computing the correlation between a program input and a cost (Sect. IV-B).
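For concreteness, the small Python sketch below mirrors these branch-condition metrics and a cost vector keyed by instrumentation points; the recorded values for the first if-statement of Fig. 1 match the numbers discussed in Sect. III-A. This is only an illustration, not Harvey's EVM-level instrumentation.

from typing import Dict, Tuple

def costs_lt(l: int, r: int) -> Tuple[int, int]:
    """Costs for a branch condition l < r: (make it false, make it true)."""
    return (r - l if l < r else 0, l - r + 1 if l >= r else 0)

def costs_eq(l: int, r: int) -> Tuple[int, int]:
    """Costs for a branch condition l == r: (make it false, make it true)."""
    return (1 if l == r else 0, abs(l - r))

class CostVector:
    """Maps an instrumentation point (here: a line number of Fig. 1) to its cost."""
    def __init__(self) -> None:
        self.costs: Dict[int, int] = {}

    def minimize(self, location: int, cost: int) -> None:
        self.costs[location] = cost

# The two metrics before `if (d < 1)` in Fig. 1 (lines 4 and 5), for d = 0:
vec = CostVector()
d = 0
vec.minimize(4, costs_lt(d, 1)[0])   # distance from taking the else-branch
vec.minimize(5, costs_lt(d, 1)[1])   # distance from taking the then-branch
assert vec.costs == {4: 1, 5: 0}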

Memory accesses. To illustrate the flexibility of our cost metrics, we now show another instantiation that targets a vulnerability specific to smart contracts. Consider the example in Fig. 3. (The grey box should be ignored for now.) It is a simplified version of code submitted to the Underhanded Solidity Coding Contest (USCC) in 2017 [26]. The USCC is a contest to write seemingly harmless Solidity code that, however, disguises unexpected vulnerabilities.

1   contract Wallet {
2     address private owner;
3     uint[] private bonusCodes;
4
5     constructor() public {
6       owner = msg.sender;
7       bonusCodes = new uint[](0);
8     }
9
10    function() public payable { }
11
12    function PushCode(uint c) public {
13      bonusCodes.push(c);
14    }
15
16    function PopCode() public {
17      require(0 <= bonusCodes.length);
18      bonusCodes.length--;
19    }
20
21    function SetCodeAt(uint idx, uint c) public {
22      require(idx < bonusCodes.length);
23      minimize(abs(&bonusCodes[idx] - target));
24      bonusCodes[idx] = c;
25    }
26
27    function Destroy() public {
28      require(msg.sender == owner);
29      selfdestruct(msg.sender);
30    }
31  }
Fig. 3: Example of a memory-access vulnerability.

The contract of Fig. 3 implements a wallet that has an owner and stores an array (with variable length) of bonus codes (lines 2–3). The constructor (line 5) initializes the owner to the caller's address and the bonus codes to an empty array. The empty fallback function (line 10) ensures that assets can be paid to the wallet. The other functions allow bonus codes to be pushed, popped, or updated. The last function (line 27) must be called only by the owner and causes the wallet to self-destruct, that is, to transfer all assets to the owner and destroy itself.

The vulnerability in this code is caused by the precondition on line 17, which should require the array length to be strictly greater than zero (not merely greater than or equal) before popping an element. When the array is empty, the statement on line 18 causes the (unsigned) array length to underflow; this effectively disables the bounds checks of the array, allowing elements to be stored anywhere in the persistent state of the contract. Therefore, by setting a bonus code at a specific index in the array, an attacker could overwrite the address of the owner with their own address. Then, by destroying the wallet, the attacker would transfer all assets to their account. In a more optimistic scenario, the owner could be accidentally set to an invalid address, in which case the assets in the wallet would become inaccessible.
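To see why a single out-of-bounds index suffices, the Python sketch below computes an attacking index under the standard Solidity storage layout (the data of a dynamic array declared at slot p starts at keccak256(p)); it assumes the pycryptodome library and that owner and bonusCodes occupy slots 0 and 1. It only illustrates the exploit and is not part of Harvey.

from Crypto.Hash import keccak  # pycryptodome, assumed available

def slot_of_element(array_slot: int, index: int) -> int:
    """Storage slot of element `index` of a dynamic array declared at `array_slot`."""
    h = keccak.new(digest_bits=256)
    h.update(array_slot.to_bytes(32, "big"))
    base = int.from_bytes(h.digest(), "big")
    return (base + index) % 2**256

# After PopCode() on an empty array, bonusCodes.length underflows to 2**256 - 1,
# so require(idx < bonusCodes.length) passes for essentially any idx.
owner_slot = 0   # `owner` is the first state variable of Wallet
array_slot = 1   # `bonusCodes` is the second
attack_idx = (owner_slot - slot_of_element(array_slot, 0)) % 2**256
assert slot_of_element(array_slot, attack_idx) == owner_slot
# SetCodeAt(attack_idx, <attacker address as uint>) would then overwrite `owner`.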

To detect such vulnerabilities, a greybox fuzzer can, for every assignment to the persistent state of a contract, pick an arbitrary address and compare it to the target address of the assignment. When these two addresses happen to be the same, it is very likely that the assignment may also target other arbitrary addresses, perhaps as a result of an exploit. A fuzzer without input prediction, however, is only able to detect these vulnerabilities by chance, and chances are extremely low that the target address of an assignment matches an arbitrarily selected address, especially given that these are 32 bytes long. In fact, when disabling Harvey’s input prediction, the vulnerability in the code of Fig. 3 is not detected within 12h.

To direct the fuzzer toward executions that could reveal such vulnerabilities, we define the following cost function:

  cost_addr(l, t) = |l − t|

Here, l denotes the address of the left-hand side of an assignment to persistent state (that is, excluding assignments to local variables) and t an arbitrary address. Function cost_addr is non-zero when l and t are different, and therefore, optimal executions are those where the assignment writes to the arbitrary address, potentially revealing a vulnerability.

Our instrumentation evaluates the corresponding cost metric before every assignment to persistent state in the program under test. An example is shown on line 23 of Fig. 3. (We use the & operator to denote the address of bonusCodes[idx], write target for the arbitrarily chosen address, and do not show the instrumentation at every assignment to avoid clutter.) Our fuzzer with input prediction detects the vulnerability in the contract of Fig. 3 within a few seconds.
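A minimal sketch of such an instrumentation hook, assuming one arbitrary target slot is chosen per program location, could look as follows (illustrative Python, not Harvey's implementation):

import random
from typing import Dict

class StorageWriteOracle:
    def __init__(self) -> None:
        self.targets: Dict[int, int] = {}   # program location -> arbitrary target slot

    def cost(self, location: int, written_slot: int) -> int:
        """Distance between the written slot and the location's target slot."""
        target = self.targets.setdefault(location, random.randrange(2**256))
        return abs(written_slot - target)   # zero iff the write hits the target slot

    def is_suspicious(self, location: int, written_slot: int) -> bool:
        # A zero cost means the assignment can be steered to an arbitrary slot.
        return self.cost(location, written_slot) == 0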

Detecting such vulnerabilities based on whether an assignment could target an arbitrary address might generate false positives when the address is indeed an intended target of the assignment. However, the probability of this occurring in practice is extremely low (again due to the address length). We did not encounter any false positives during our experiments.

In general, defining other cost functions is straightforward as long as there is an expressible measure for the distance between a current execution and an optimal one.

V Demand-Driven Sequence Fuzzing

Recall from Sect. III-B that Harvey uses demand-driven sequence fuzzing to set up the persistent state for testing the last transaction in the sequence. The goal is to explore new paths in the function that this transaction invokes, and thus, detect more bugs. As explained earlier, directly fuzzing the state, for instance, variables x and y of Fig. 2, might lead to false positives. Nonetheless, Harvey uses this aggressive approach when fuzzing transaction sequences to determine whether a different persistent state could increase path coverage.

The key idea is to generate longer transaction sequences on demand. This is achieved by fuzzing a transaction sequence in two modes: regular, which does not directly fuzz the persistent state, and aggressive, which is enabled with probability 0.125 and may fuzz the persistent state directly. If Harvey is able to increase coverage of the last transaction in the sequence using the aggressive mode, the corresponding input is discarded (because it might lead to false positives), but longer sequences are generated when running in regular mode in the future.

For instance, when fuzzing a transaction that invokes Bar from Fig. 2, Harvey temporarily considers x and y as fuzzable inputs of the function. If this aggressive fuzzing does not discover any more paths, then Harvey does not generate additional transactions before the invocation of Bar. If, however, the aggressive fuzzing does discover new paths, our tool generates and fuzzes transaction sequences whose last transaction calls Bar. That is, longer transaction sequences are only generated when they might be able to set up the state before the last transaction such that its coverage is increased.
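A rough sketch of this demand-driven decision, with illustrative names and the probability mentioned above, is shown below; it is not Harvey's actual code.

import random
from typing import Callable, Set

AGGRESSIVE_PROB = 0.125
needs_sequences: Set[str] = set()   # functions for which longer sequences may help

def fuzz_last_transaction(function: str,
                          fuzz_args_only: Callable[[], bool],
                          fuzz_args_and_state: Callable[[], bool]) -> None:
    """Each callback fuzzes the last transaction and returns True on new coverage."""
    if random.random() < AGGRESSIVE_PROB:
        if fuzz_args_and_state():        # aggressive: state treated as fuzzable input
            # Discard the input (the state may be unreachable), but generate
            # longer sequences ending in this function from now on.
            needs_sequences.add(function)
    else:
        fuzz_args_only()                 # regular: only real transaction inputs

def should_prepend_transactions(function: str) -> bool:
    return function in needs_sequences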

For our example, Harvey generates the sequence SetY(42), CopyY(), and Bar() that reaches the assertion in about 18s. At this point, the fuzzer stops exploring longer sequences for contract Foo because aggressively fuzzing the state cannot further increase the already achieved coverage.

We make two important observations. First, Harvey is so quick in finding the right argument for SetY due to input prediction. Second, demand-driven sequence fuzzing relies on path identifiers that span no more than a single transaction. Otherwise, aggressive fuzzing would not be able to determine whether longer sequences may increase coverage of the contract.

Mutation operations. To generate and fuzz sequences of transactions, Harvey applies three mutation operations to a given transaction: (1) fuzz the transaction, that is, the inputs of its invocation, (2) insert a new transaction before it, and (3) replace the transactions before it with another sequence.

Harvey uses two pools for efficiently generating new transactions or sequences, respectively. These pools store transactions or sequences that are found both to increase coverage of the contract under test and to modify the persistent state in a way that has not been explored before. Harvey selects new transactions or sequences from these pools when applying the second and third mutation operations.
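The following Python sketch outlines these three mutation operations and the two pools; the Transaction type and the pool contents are placeholders, not Harvey's data structures.

import random
from typing import List

class Transaction:   # placeholder: a call to one contract function with its inputs
    pass

transaction_pool: List[Transaction] = []       # single transactions worth reusing
sequence_pool: List[List[Transaction]] = []    # sequence prefixes worth reusing

def fuzz_transaction(tx: Transaction) -> Transaction:
    return tx   # placeholder: mutate the inputs of the invocation

def mutate_sequence(prefix: List[Transaction], last: Transaction) -> List[Transaction]:
    op = random.choice(["fuzz_last", "insert_before", "replace_prefix"])
    if op == "fuzz_last":
        return prefix + [fuzz_transaction(last)]               # (1) fuzz the last transaction
    if op == "insert_before" and transaction_pool:
        return prefix + [random.choice(transaction_pool), last]  # (2) insert one before it
    if op == "replace_prefix" and sequence_pool:
        return random.choice(sequence_pool) + [last]            # (3) replace the prefix
    return prefix + [last]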

VI Experimental Evaluation

In this section, we evaluate Harvey on real-world smart contracts. First, we explain the benchmark selection and setup. We then compare different Harvey configurations to assess the effectiveness of our two fuzzing techniques. At the same time, we highlight key insights about smart-contract code.

VI-A Benchmark Selection

We collected all contracts from 17 GitHub repositories. We selected the repositories based on two main criteria to obtain a diverse set of benchmarks. On one hand, we picked popular projects in the Ethereum community (e.g., the Ethereum Name Service auction, the ConsenSys wallet, and the MicroRaiden payment service) and with high popularity on GitHub (4’857 stars in total on 2019-05-07, median 132). Most contracts in these projects have been reviewed by independent auditors and are deployed on the Ethereum blockchain, managing significant amounts of crypto-assets on a daily basis. On the other hand, we also selected repositories from a wide range of application domains (e.g., auctions, token sales, payment networks, and wallets) to cover various features of the EVM and Solidity. We also included contracts that had been hacked in the past (The DAO and the Parity wallet) and five contracts (incl. the four top-ranked entries) from the repository of the USCC to consider some malicious or buggy contracts.

BIDs Name Functions LoSC Description
1 ENS 24 1205 ENS domain name auction
2–3 CMSW 49 503 ConsenSys multisig wallet
4–5 GMSW 49 704 Gnosis multisig wallet
6 BAT 23 191 BAT token (advertising)
7 CT 12 200 ConsenSys token library
8 ERCF 19 747 ERC Fund (investment fund)
9 FBT 34 385 FirstBlood token (e-sports)
10–13 HPN 173 3065 Havven payment network
14 MR 25 1053 MicroRaiden payment service
15 MT 38 437 MOD token (supply-chain)
16 PC 7 69 Payment channel
17–18 RNTS 49 749 Request Network token sale
19 DAO 23 783 The DAO organization
20 VT 18 242 Valid token (personal data)
21 USCC1 4 57 USCC’17 entry
22 USCC2 14 89 USCC’17 (honorable mention)
23 USCC3 21 535 USCC’17 (3rd place)
24 USCC4 7 164 USCC’17 (1st place)
25 USCC5 10 188 USCC’17 (2nd place)
26 PW 19 549 Parity multisig wallet
27 BNK 44 649 Bankera token
Total 662 12564
TABLE II: Overview of benchmarks.

From each of the selected repositories, we identified one or more main contracts that would serve as contracts under test, resulting in a total of 27 benchmarks. Note that many repositories contain several contracts (including libraries) to implement a complex system, such as an auction. Tab. II gives an overview of all benchmarks and the projects from which they originate. The first column lists the benchmark IDs and the second the project name. The third and fourth columns show the number of public functions and the lines of Solidity source code (LoSC) in each benchmark. The appendix provides details about the tested changesets.

To select our benchmarks, we followed published guidelines on evaluating fuzzers [27]. We do not simply scrape contracts from the blockchain since most are created with no quality control and many contain duplicates—contracts without assets or users are essentially dead code. Moreover, good-quality contracts typically have dependencies (e.g., on libraries or other contracts) that would likely not be scraped with them.

In terms of size, note that most contracts are a few hundred lines of code. Nonetheless, they are complex programs, each occupying at least a couple of auditors for weeks. More importantly, their size does not necessarily represent how difficult it is for a fuzzer to test all paths. For instance, Fig. 1 is very small, but AFL fails to cover all paths within 12h.

VI-B Experimental Setup

We ran different configurations of Harvey and compared the achieved coverage and required time to detect a bug.

For simplicity, our evaluation focuses on detecting two types of bugs. First, we detect crashes due to assertion violations (SWC-110 according to the Smart Contract Weakness Classification [28]); in addition to user-provided checks, these include checked errors, such as division by zero or out-of-bounds array access, inserted by the compiler. At best, these bugs cause a transaction to be aborted and waste gas fees. In the worst case, they prevent legitimate transactions from succeeding, putting assets at risk. For instance, a user may not be able to claim an auctioned item due to an out-of-bounds error in the code that iterates over an array of bidders to determine the winner. Second, we detect memory-access errors (SWC-124 [28]) that may allow an attacker to modify the persistent state of a contract (Fig. 3). In practice, Harvey covers a wide range of test oracles (SWC-101, 104, 107, 110, 123, 124, and 127), such as reentrancy and overflows.

For bug de-duplication, it uses a simple approach (much more conservative than AFL): two bugs of the same type are duplicates if they occur at the same program location.

For each configuration, we performed 24 runs, each with independent random seeds, an all-zero seed input, and a time limit of one hour; we report medians unless stated otherwise. In addition, we performed Wilcoxon-Mann-Whitney U tests to determine if differences in medians are statistically significant and report the computed p-values.
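For reference, the sketch below shows how such a comparison can be computed, assuming SciPy is available; the time values are hypothetical and only illustrate the statistics, not our measurements.

from typing import Sequence
from scipy.stats import mannwhitneyu

def a12(a: Sequence[float], b: Sequence[float]) -> float:
    """Vargha-Delaney A12: probability that a run from `a` beats one from `b`
    (here, "beats" means a lower time-to-bug)."""
    wins = sum(1.0 if x < y else 0.5 if x == y else 0.0 for x in a for y in b)
    return wins / (len(a) * len(b))

times_a = [22.27, 25.0, 3600.0, 19.5]   # hypothetical times-to-bug for configuration A
times_b = [0.21, 0.33, 0.18, 0.4]       # hypothetical times-to-bug for configuration B
p = mannwhitneyu(times_a, times_b, alternative="two-sided").pvalue
print(p, a12(times_a, times_b), a12(times_b, times_a))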

We used a 36-core Intel® Xeon® machine (2.90GHz) with 60GB of memory, running Ubuntu 18.04.

VI-C Results

We assess Harvey’s effectiveness by evaluating three research questions. The first two focus on our input-prediction technique and the third on demand-driven sequence fuzzing.

Our baselines implement standard greybox fuzzing within Harvey. Since there are no other greybox fuzzers for smart contracts, we consider these suitable. Related work (e.g., [29, 30]) either uses fundamentally different bug-finding techniques (like blackbox fuzzing or symbolic execution) or focuses on detecting different types of bugs. Our evaluation aims to demonstrate improvements over standard greybox fuzzing, the benefits of which have been shown independently.

BID | Bug ID | SWC ID | Time A (s) | Time B (s) | Ratio (A/B) | p | A12 (A faster) | A12 (B faster)
2 990d9524 SWC-110 22.27 0.21 107.35 0.000 0.00 1.00
2 b4f9a3d6 SWC-110 41.59 1.27 32.80 0.000 0.05 0.95
3 c56e90ab SWC-110 8.83 0.21 42.40 0.000 0.00 1.00
3 cb2847d0 SWC-110 13.56 0.85 15.95 0.000 0.04 0.96
4 306fa4fe SWC-110 34.47 2.62 13.16 0.000 0.04 0.96
4 57c85623 SWC-110 17.34 0.16 106.62 0.000 0.00 1.00
5 51444152 SWC-110 11.79 0.14 83.23 0.000 0.00 1.00
5 f6ee56cd SWC-110 14.85 1.14 13.09 0.000 0.06 0.94
8 c9c0b2f4 SWC-110 3600.00 90.66 39.71 0.000 0.00 1.00
13 341911e4 SWC-110 21.27 4.19 5.07 0.000 0.05 0.95
13 1cd27b5d SWC-110 30.56 4.60 6.65 0.000 0.05 0.95
13 26aee7ba SWC-110 20.23 4.11 4.92 0.000 0.05 0.95
13 d7d04622 SWC-110 18.42 4.14 4.45 0.000 0.06 0.94
15 dec48390 SWC-110 2823.62 1779.98 1.59 0.362 0.42 0.58
15 193c72a2 SWC-110 3600.00 17.65 204.02 0.000 0.00 1.00
15 7c3dd9f4 SWC-110 3600.00 221.12 16.28 0.000 0.04 0.96
15 65aa7261 SWC-110 3600.00 3600.00 1.00 0.338 0.48 0.52
17 21646ab7 SWC-110 3600.00 273.17 13.18 0.000 0.02 0.98
18 3021c487 SWC-110 9.59 0.54 17.68 0.000 0.03 0.97
18 ed97030c SWC-110 3600.00 3600.00 1.00 0.917 0.51 0.49
19 e3468a11 SWC-110 7.98 0.12 64.62 0.000 0.00 1.00
19 b359efbc SWC-110 8.63 0.09 94.28 0.000 0.01 0.99
19 9e65397d SWC-110 22.86 0.47 48.96 0.000 0.01 0.99
19 4063c80f SWC-110 20.45 0.46 44.45 0.000 0.00 1.00
19 49e4a70e SWC-110 55.38 2.55 21.70 0.000 0.05 0.95
19 ee609ac1 SWC-110 16.13 0.71 22.73 0.000 0.00 1.00
19 21f5c23f SWC-110 23.72 2.52 9.40 0.000 0.07 0.93
22 f3bf5e12 SWC-110 13.13 0.40 33.05 0.000 0.07 0.93
22 577a74af SWC-124 3600.00 899.60 4.00 0.000 0.04 0.96
23 1c8acd5e SWC-110 3600.00 193.66 18.59 0.000 0.00 1.00
23 fda2cafa SWC-110 3600.00 218.57 16.47 0.000 0.00 1.00
24 c837a34b SWC-110 1.85 0.04 42.33 0.000 0.01 0.99
24 d602954b SWC-110 3.97 0.12 34.30 0.000 0.01 0.99
24 863f9452 SWC-110 4.76 0.18 25.96 0.000 0.04 0.96
24 9774d846 SWC-110 14.41 0.43 33.43 0.000 0.05 0.95
24 123bf172 SWC-110 238.63 3.01 79.22 0.000 0.04 0.96
24 a97971ca SWC-110 145.79 4.43 32.90 0.000 0.05 0.95
24 9a771b96 SWC-110 69.35 3.62 19.14 0.000 0.03 0.97
24 dc7bf682 SWC-110 3600.00 0.69 5246.66 0.000 0.00 1.00
26 ccf7bc67 SWC-110 61.42 1.68 36.59 0.000 0.01 0.99
27 f1c8e169 SWC-110 112.98 16.26 6.95 0.000 0.16 0.84
27 44312719 SWC-110 77.18 1.50 51.36 0.000 0.01 0.99
27 33c32ef9 SWC-110 60.52 1.08 56.25 0.000 0.07 0.93
27 d499f535 SWC-110 3600.00 7.36 489.38 0.000 0.00 1.00
27 4fb4fa53 SWC-110 3600.00 67.29 53.50 0.000 0.00 1.00
27 47c60a93 SWC-110 3600.00 141.16 25.50 0.000 0.02 0.98
27 6f92fdea SWC-110 3600.00 3600.00 1.00 0.041 0.42 0.58
Median 41.59 2.55 25.96
TABLE III: Comparing time-to-bug between configuration A (w/o input prediction) and B (w/ input prediction).

RQ1: Effectiveness of input prediction. To evaluate input prediction, we compare with a baseline (configuration A), which only disables prediction. The first column of Tab. III identifies the benchmark, the second the bug, and the third the bug type according to the SWC (110 stands for assertion violations and 124 for memory-access errors).

The fourth and fifth columns show the median time (in secs) after which each unique bug was found by configurations A and B within the time limit (B differs from A only by enabling input prediction). Configuration B finds 43 out of 47 bugs significantly faster than A. We report the speed-up factor in the sixth column and the significance level, i.e., p-value, in the seventh (we use a significance level of 0.05). As shown in the table, configuration B is faster than A by a factor of up to 5'247 (median 25.96). The last two columns show the Vargha-Delaney A12 effect sizes [31]. Intuitively, these show the probability of configuration A being faster than B and vice versa. Note that, to compute the median time, we conservatively counted 3'600s for a given run even if the bug was not found. However, on average, B detects 10 more bugs.

Tab. IV compares A and B with respect to instruction coverage. For 23 out of 27 benchmarks, B achieves significantly higher coverage. The results for path coverage are very similar.

Input prediction is very effective in both detecting bugs faster and achieving higher coverage.

RQ2: Effectiveness of iterative input prediction. Configuration C differs from B in that it does not iteratively apply the basic approximation step of the Secant method in case it fails to minimize a cost metric. For artificial examples with non-linear branch conditions (e.g., a^4 + a^2 == 228901770), we were able to show that this configuration is less efficient than B in finding bugs. However, for our benchmarks, there were no significant time differences between B and C for detecting 45 of 47 bugs. Similarly, there were no significant differences in instruction coverage.

During our experiments with C, we measured the success rate of one-shot cost minimization to range between 97% and 100% (median 99%). This suggests that complex branch conditions are not very common in real-world smart contracts.

Even one iteration of the Secant method is extremely successful in predicting inputs. This suggests that the vast majority of branch conditions are linear (or piece-wise linear) with respect to the program inputs.

BID | Instructions A | Instructions B | Ratio (B/A) | p | A12 (A) | A12 (B)
1 3868.00 3868.00 1.00 0.010 0.38 0.62
2 3064.00 4005.50 1.31 0.000 0.00 1.00
3 2575.00 3487.00 1.35 0.000 0.00 1.00
4 2791.00 3773.00 1.35 0.000 0.00 1.00
5 2567.00 3501.00 1.36 0.000 0.00 1.00
6 1832.00 1949.00 1.06 0.000 0.00 1.00
7 1524.00 1524.00 1.00 0.000 0.50 0.50
8 1051.00 2205.00 2.10 0.000 0.00 1.00
9 2694.00 3468.00 1.29 0.000 0.00 1.00
10 6833.00 7360.50 1.08 0.000 0.00 1.00
11 7295.00 8716.00 1.19 0.000 0.00 1.00
12 2816.00 5165.00 1.83 0.000 0.00 1.00
13 1585.00 4510.00 2.85 0.000 0.00 1.00
14 3822.00 4655.00 1.22 0.000 0.00 1.00
15 3489.00 5078.50 1.46 0.000 0.00 1.00
16 496.00 496.00 1.00 0.000 0.50 0.50
17 1832.00 2754.00 1.50 0.000 0.00 1.00
18 2766.00 2930.00 1.06 0.000 0.00 1.00
19 2411.00 2611.00 1.08 0.000 0.00 1.00
20 1635.00 3018.00 1.85 0.000 0.00 1.00
21 349.00 434.00 1.24 0.000 0.00 1.00
22 919.00 1274.00 1.39 0.000 0.00 1.00
23 1344.00 2095.00 1.56 0.000 0.00 1.00
24 687.00 754.00 1.10 0.001 0.22 0.78
25 1082.00 1192.00 1.10 0.000 0.00 1.00
26 1606.00 1606.00 1.00 0.000 0.50 0.50
27 4232.00 5499.50 1.30 0.000 0.00 1.00
Median 2411.00 3018.00 1.29
TABLE IV: Comparing instruction coverage for configurations A (w/o input prediction) and B (w/ input prediction).

RQ3: Effectiveness of demand-driven sequence fuzzing. To evaluate this research question, we compare configuration A with D, which differs from A by disabling demand-driven sequence fuzzing. In particular, D tries to eagerly explore all paths in all possible transaction sequences, where paths span all transactions. Tab. V shows a comparison between A and D with respect to time-to-bug for bugs that were found by at least one configuration. As shown in the table, A is significantly faster than D in detecting 7 out of 35 bugs, with a speed-up of up to 20x. Note that all 7 bugs require more than a single transaction to be detected. Instruction coverage is very similar between A and D (slightly higher for A), but A achieves it within a fraction of the time for 19 out of 27 benchmarks.

In total, 26 out of 35 bugs require more than one transaction to be found. This suggests that real contracts need to be tested with sequences of transactions, and consequently, there is much to be gained from pruning techniques like ours. Our experiments with D also confirm that, when paths span all transactions, the test suite becomes orders-of-magnitude larger.

Demand-driven sequence fuzzing is effective in pruning the search space of transaction sequences; as a result, it detects bugs and achieves coverage faster. Such techniques are useful since most bugs require invoking multiple transactions to be revealed.

BID | Bug ID | SWC ID | Time A (s) | Time D (s) | Ratio (A/D) | p | A12 (A faster) | A12 (D faster)
2 990d9524 SWC-110 22.27 15.85 1.41 0.415 0.43 0.57
2 b4f9a3d6 SWC-110 41.59 66.31 0.63 0.529 0.55 0.45
3 c56e90ab SWC-110 8.83 8.63 1.02 0.643 0.46 0.54
3 cb2847d0 SWC-110 13.56 12.72 1.07 0.749 0.53 0.47
4 306fa4fe SWC-110 34.47 59.90 0.58 0.477 0.56 0.44
4 57c85623 SWC-110 17.34 14.96 1.16 0.942 0.49 0.51
5 51444152 SWC-110 11.79 12.01 0.98 0.585 0.55 0.45
5 f6ee56cd SWC-110 14.85 17.69 0.84 0.529 0.55 0.45
13 341911e4 SWC-110 21.27 25.28 0.84 0.749 0.53 0.47
13 1cd27b5d SWC-110 30.56 28.54 1.07 0.359 0.58 0.42
13 26aee7ba SWC-110 20.23 23.12 0.87 0.571 0.55 0.45
13 d7d04622 SWC-110 18.42 19.05 0.97 0.942 0.51 0.49
15 dec48390 SWC-110 2823.62 3600.00 0.78 0.000 0.84 0.16
18 3021c487 SWC-110 9.59 9.84 0.97 0.338 0.58 0.42
18 ed97030c SWC-110 3600.00 3600.00 1.00 0.028 0.65 0.35
19 e3468a11 SWC-110 7.98 5.96 1.34 0.439 0.43 0.57
19 b359efbc SWC-110 8.63 8.25 1.05 0.288 0.41 0.59
19 9e65397d SWC-110 22.86 22.06 1.04 0.781 0.52 0.48
19 4063c80f SWC-110 20.45 15.82 1.29 0.177 0.39 0.61
19 49e4a70e SWC-110 55.38 55.67 0.99 0.585 0.55 0.45
19 ee609ac1 SWC-110 16.13 19.45 0.83 0.877 0.51 0.49
19 21f5c23f SWC-110 23.72 16.43 1.44 0.718 0.47 0.53
22 f3bf5e12 SWC-110 13.13 9.46 1.39 0.718 0.47 0.53
24 c837a34b SWC-110 1.85 1.98 0.94 0.673 0.54 0.46
24 d602954b SWC-110 3.97 4.56 0.87 0.877 0.51 0.49
24 863f9452 SWC-110 4.76 4.46 1.07 0.959 0.51 0.49
24 9774d846 SWC-110 14.41 12.72 1.13 0.845 0.48 0.52
24 123bf172 SWC-110 238.63 3600.00 0.07 0.001 0.77 0.23
24 a97971ca SWC-110 145.79 2946.47 0.05 0.005 0.74 0.26
24 9a771b96 SWC-110 69.35 1087.31 0.06 0.010 0.72 0.28
26 ccf7bc67 SWC-110 61.42 426.17 0.14 0.000 0.80 0.20
27 f1c8e169 SWC-110 112.98 504.67 0.22 0.002 0.76 0.24
27 44312719 SWC-110 77.18 522.75 0.15 0.018 0.70 0.30
27 33c32ef9 SWC-110 60.52 83.35 0.73 0.464 0.56 0.44
27 47c60a93 SWC-110 3600.00 3600.00 1.00 0.026 0.65 0.35
Median 21.27 19.45 0.97
TABLE V: Comparing time-to-bug between configuration A (w/ demand-driven sequence fuzzing) and D (w/o demand-driven sequence fuzzing).

VI-D Threats to Validity

External validity. Our results may not generalize to all smart contracts or program types [32]. However, we evaluated our technique on a diverse set of contracts from a wide range of domains. We, thus, believe that our selection significantly helps to ensure generalizability. To further improve external validity, we also provide the versions of all contracts in the appendix. Moreover, our comparisons focus on a single fuzzer—we discuss this below.

Internal validity. Another potential issue has to do with whether systematic errors are introduced in the setup [32]. When comparing configurations, we always used the same seed inputs in order to avoid bias in the exploration.

Construct validity. Construct validity concerns whether the evaluation measures what it claims to measure. Since we compare several configurations of Harvey, any improvements are exclusively due to the techniques enabled in a given configuration.

VII Related Work

Harvey is the first greybox fuzzer for smart contracts. It incorporates two key techniques, input prediction and demand-driven sequence fuzzing, that improve its effectiveness.

Greybox fuzzing. There are several techniques that aim to direct greybox fuzzing toward certain parts of the search space, such as low-frequency paths [14], vulnerable paths [15], deep paths [16], or specific sets of program locations [17]. There are also techniques that boost fuzzing by smartly selecting and mutating inputs [33, 34, 35], or by searching for new inputs using iterative optimization algorithms, such as gradient descent [36], with the goal of increasing branch coverage.

In general, input prediction could be used in combination with these techniques. In comparison, our approach predicts concrete input values based on two previous executions. To achieve this, we rely on additional, but still lightweight, instrumentation.
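
As a rough illustration of this prediction step (a minimal sketch, assuming a cost metric that measures how far a comparison is from flipping and that varies near-linearly with the fuzzed input; the toy condition x == 42 and all names are hypothetical), the new value can be obtained by a secant-style extrapolation from the two previous executions:

    # Hedged sketch of input prediction from two previous executions.
    def predict(x0, c0, x1, c1):
        """Extrapolate an input with cost ~0, assuming the cost of the target
        branch condition is (approximately) linear in the fuzzed input."""
        if c0 == c1:                      # no slope information: fall back to
            return None                   # standard greybox mutations
        return round(x1 - c1 * (x1 - x0) / (c1 - c0))

    cost = lambda x: abs(x - 42)          # toy cost metric for the check x == 42
    x0, x1 = 0, 10                        # inputs used by two earlier executions
    x_new = predict(x0, cost(x0), x1, cost(x1))
    assert cost(x_new) == 0               # the predicted input flips the branch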

Whitebox fuzzing. Whitebox fuzzing is implemented in many tools, such as EXE [37], jCUTE [38], Pex [39], BitBlaze [40], Apollo [41], S2E [42], and Mayhem [43], and comes in different flavors, such as probabilistic symbolic execution [44] or model-based whitebox fuzzing [45].

As discussed earlier, our input-prediction technique does not rely on any program analysis or constraint solving, and our instrumentation is more lightweight, for instance, we do not keep track of a symbolic store and path constraints.

Hybrid fuzzing. Hybrid fuzzers combine fuzzing with other techniques to join their benefits and achieve better results. For example, Dowser [46] uses static analysis to identify code regions with potential buffer overflows. Similarly, BuzzFuzz [12] uses taint tracking to discover which input bytes flow to “attack points”. Hybrid Fuzz Testing [47] first runs symbolic execution to find inputs that lead to “frontier nodes” and then applies fuzzing on these inputs. On the other hand, Driller [25] starts with fuzzing and uses symbolic execution when it needs help in generating inputs that satisfy complex checks.

In contrast, input prediction extends greybox fuzzing without relying on static analysis or whitebox fuzzing. Harvey could, however, benefit from hybrid-fuzzing approaches.

Optimization in testing. Miller and Spooner [48] were the first to use optimization methods in generating test data, and in particular, floating-point inputs. It was not until 1990 that these ideas were extended by Korel for Pascal programs [49]. Such optimization methods have recently been picked up again [50], enhanced, and implemented in various testing tools, such as FloPSy [51], CORAL [52], EvoSuite [53], AUSTIN [54], CoverMe [55], and Angora [36]. Most of these tools use fitness functions to determine the distance from a target and attempt to minimize them. For instance, Korel uses fitness functions that are similar to our cost metrics for flipping branch conditions. The search is typically iterative, e.g., by using hill climbing, simulated annealing, or genetic algorithms [56, 57, 58].

Our prediction technique is inspired by these approaches but is applied in the context of greybox fuzzing. When failing to minimize a cost metric, Harvey falls back on standard greybox fuzzing, which is known for its effectiveness. We show that input prediction works particularly well in the domain of smart contracts, where even one iteration is generally enough for minimizing a cost metric.
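
For contrast with the one-shot prediction sketched earlier, an iterative search over the same kind of cost (fitness) function, here a deliberately naive hill climb, typically needs many executions to flip the same toy branch. The snippet is purely illustrative and does not reproduce the algorithm of any cited tool.

    # Hedged sketch of iterative, search-based minimization of a branch-distance
    # (fitness) function; purely illustrative.
    def hill_climb(cost, x=0, step=1, max_iters=10_000):
        """Repeatedly move to the cheaper neighbor until the cost reaches 0."""
        for i in range(max_iters):
            if cost(x) == 0:
                return x, i               # satisfying input and executions used
            left, right = x - step, x + step
            x = left if cost(left) < cost(right) else right
        return None, max_iters

    print(hill_climb(lambda x: abs(x - 42)))   # (42, 42): dozens of executions,
                                               # versus one prediction step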

Method-call sequence generation. For testing object-oriented programs, it is often necessary to generate complex input objects using sequences of method calls. There are many approaches [59, 60, 61, 3, 62, 63, 64, 65] that automatically generate such sequences using techniques such as dynamic inference, static analysis, or evolutionary testing.

In contrast, demand-driven sequence fuzzing only relies on greybox fuzzing and targets smart contracts.

Program analysis for smart contracts. There exist various applications of program analysis to smart contracts, such as symbolic execution, static analysis, and verification [30, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79]. The works most closely related to ours are the blackbox fuzzer ContractFuzzer [29] and the property-based testing tool Echidna [80]. In contrast, our technique is the first to apply greybox fuzzing to smart contracts.

VIII Conclusion

We presented Harvey, an industrial greybox fuzzer for smart contracts. During its development, we encountered two key challenges that we alleviate with input prediction and demand-driven sequence fuzzing. Our experiments show that both techniques significantly improve Harvey’s effectiveness and highlight certain insights about contract code.

In future work, we plan to further enhance Harvey by leveraging complementary techniques, such as static analysis and lightweight dynamic symbolic execution.

References

  • [1] K. Claessen and J. Hughes, “QuickCheck: A lightweight tool for random testing of Haskell programs,” in ICFP.   ACM, 2000, pp. 268–279.
  • [2] C. Csallner and Y. Smaragdakis, “JCrasher: An automatic robustness tester for Java,” SPE, vol. 34, pp. 1025–1050, 2004.
  • [3] C. Pacheco, S. K. Lahiri, M. D. Ernst, and T. Ball, “Feedback-directed random test generation,” in ICSE.   IEEE Computer Society, 2007, pp. 75–84.
  • [4] “Technical “whitepaper” for AFL,” http://lcamtuf.coredump.cx/afl/technical_details.txt.
  • [5] “Libfuzzer—A library for coverage-guided fuzz testing,” https://llvm.org/docs/LibFuzzer.html.
  • [6] P. Godefroid, N. Klarlund, and K. Sen, “DART: Directed automated random testing,” in PLDI.   ACM, 2005, pp. 213–223.
  • [7] C. Cadar and D. R. Engler, “Execution generated test cases: How to make systems code crash itself,” in SPIN, ser. LNCS, vol. 3639.   Springer, 2005, pp. 2–23.
  • [8] “Peach Fuzzer Platform,” https://www.peach.tech/products/peach-fuzzer/peach-platform/.
  • [9] “zzuf—Multi-Purpose Fuzzer,” http://caca.zoy.org/wiki/zzuf.
  • [10] P. Godefroid, M. Y. Levin, and D. A. Molnar, “Automated whitebox fuzz testing,” in NDSS.   The Internet Society, 2008, pp. 151–166.
  • [11] C. Cadar, D. Dunbar, and D. R. Engler, “KLEE: Unassisted and automatic generation of high-coverage tests for complex systems programs,” in OSDI.   USENIX, 2008, pp. 209–224.
  • [12] V. Ganesh, T. Leek, and M. C. Rinard, “Taint-based directed whitebox fuzzing,” in ICSE.   IEEE Computer Society, 2009, pp. 474–484.
  • [13] “The AFL vulnerability trophy case,” http://lcamtuf.coredump.cx/afl/#bugs.
  • [14] M. Böhme, V. Pham, and A. Roychoudhury, “Coverage-based greybox fuzzing as Markov chain,” in CCS.   ACM, 2016, pp. 1032–1043.
  • [15] S. Rawat, V. Jain, A. Kumar, L. Cojocar, C. Giuffrida, and H. Bos, “VUzzer: Application-aware evolutionary fuzzing,” in NDSS.   The Internet Society, 2017, pp. 1–14.
  • [16] S. Sparks, S. Embleton, R. Cunningham, and C. C. Zou, “Automated vulnerability analysis: Leveraging control flow for evolutionary input crafting,” in ACSAC.   IEEE Computer Society, 2007, pp. 477–486.
  • [17] M. Böhme, V. Pham, M. Nguyen, and A. Roychoudhury, “Directed greybox fuzzing,” in CCS.   ACM, 2017, pp. 2329–2344.
  • [18] “Ethereum white paper,” 2014, https://github.com/ethereum/wiki/wiki/White-Paper.
  • [19] “Ethereum,” https://github.com/ethereum.
  • [20] M. Swan, Blockchain: Blueprint for a New Economy.   O’Reilly Media, 2015.
  • [21] S. Raval, Decentralized Applications: Harnessing Bitcoin’s Blockchain Technology.   O’Reilly Media, 2016.
  • [22] D. Tapscott and A. Tapscott, Blockchain Revolution: How the Technology Behind Bitcoin is Changing Money, Business, and the World.   Penguin, 2016.
  • [23] M. Bartoletti and L. Pompianu, “An empirical analysis of smart contracts: Platforms, applications, and design patterns,” in FC, ser. LNCS, vol. 10323.   Springer, 2017, pp. 494–509.
  • [24] G. Wood, “Ethereum: A secure decentralised generalised transaction ledger,” 2014, http://gavwood.com/paper.pdf.
  • [25] N. Stephens, J. Grosen, C. Salls, A. Dutcher, R. Wang, J. Corbetta, Y. Shoshitaishvili, C. Kruegel, and G. Vigna, “Driller: Augmenting fuzzing through selective symbolic execution,” in NDSS.   The Internet Society, 2016.
  • [26] “Underhanded solidity coding contest,” http://u.solidity.cc/.
  • [27] G. Klees, A. Ruef, B. Cooper, S. Wei, and M. Hicks, “Evaluating fuzz testing,” in CCS.   ACM, 2018, pp. 2123–2138.
  • [28] “Smart Contract Weakness Classification,” https://smartcontractsecurity.github.io/SWC-registry.
  • [29] B. Jiang, Y. Liu, and W. K. Chan, “ContractFuzzer: Fuzzing smart contracts for vulnerability detection,” in ASE.   ACM, 2018, pp. 259–269.
  • [30] L. Luu, D. Chu, H. Olickel, P. Saxena, and A. Hobor, “Making smart contracts smarter,” in CCS.   ACM, 2016, pp. 254–269.
  • [31] A. Vargha and H. D. Delaney, “A critique and improvement of the CL common language effect size statistics of McGraw and Wong,” JEBS, vol. 25, pp. 101–132, 2000.
  • [32] J. Siegmund, N. Siegmund, and S. Apel, “Views on internal and external validity in empirical software engineering,” in ICSE.   IEEE Computer Society, 2015, pp. 9–19.
  • [33] M. Woo, S. K. Cha, S. Gottlieb, and D. Brumley, “Scheduling black-box mutational fuzzing,” in CCS.   ACM, 2013, pp. 511–522.
  • [34] A. Rebert, S. K. Cha, T. Avgerinos, J. Foote, D. Warren, G. Grieco, and D. Brumley, “Optimizing seed selection for fuzzing,” in Security.   USENIX, 2014, pp. 861–875.
  • [35] S. K. Cha, M. Woo, and D. Brumley, “Program-adaptive mutational fuzzing,” in SP.   IEEE Computer Society, 2015, pp. 725–741.
  • [36] P. Chen and H. Chen, “Angora: Efficient fuzzing by principled search,” in SP.   IEEE Computer Society, 2018, pp. 711–725.
  • [37] C. Cadar, V. Ganesh, P. M. Pawlowski, D. L. Dill, and D. R. Engler, “EXE: Automatically generating inputs of death,” in CCS.   ACM, 2006, pp. 322–335.
  • [38] K. Sen and G. Agha, “CUTE and jCUTE: Concolic unit testing and explicit path model-checking tools,” in CAV, ser. LNCS, vol. 4144.   Springer, 2006, pp. 419–423.
  • [39] N. Tillmann and J. de Halleux, “Pex—White box test generation for .NET,” in TAP, ser. LNCS, vol. 4966.   Springer, 2008, pp. 134–153.
  • [40] D. X. Song, D. Brumley, H. Yin, J. Caballero, I. Jager, M. G. Kang, Z. Liang, J. Newsome, P. Poosankam, and P. Saxena, “BitBlaze: A new approach to computer security via binary analysis,” in ICISS, ser. LNCS, vol. 5352.   Springer, 2008, pp. 1–25.
  • [41] S. Artzi, A. Kiezun, J. Dolby, F. Tip, D. Dig, A. M. Paradkar, and M. D. Ernst, “Finding bugs in web applications using dynamic test generation and explicit-state model checking,” TSE, vol. 36, pp. 474–494, 2010.
  • [42] V. Chipounov, V. Kuznetsov, and G. Candea, “S2E: A platform for in-vivo multi-path analysis of software systems,” in ASPLOS.   ACM, 2011, pp. 265–278.
  • [43] S. K. Cha, T. Avgerinos, A. Rebert, and D. Brumley, “Unleashing Mayhem on binary code,” in SP.   IEEE Computer Society, 2012, pp. 380–394.
  • [44] J. Geldenhuys, M. B. Dwyer, and W. Visser, “Probabilistic symbolic execution,” in ISSTA.   ACM, 2012, pp. 166–176.
  • [45] V. Pham, M. Böhme, and A. Roychoudhury, “Model-based whitebox fuzzing for program binaries,” in ASE.   ACM, 2016, pp. 543–553.
  • [46] I. Haller, A. Slowinska, M. Neugschwandtner, and H. Bos, “Dowsing for overflows: A guided fuzzer to find buffer boundary violations,” in Security.   USENIX, 2013, pp. 49–64.
  • [47] B. S. Pak, “Hybrid fuzz testing: Discovering software bugs via fuzzing and symbolic execution,” Master’s thesis, School of Computer Science, Carnegie Mellon University, USA, 2012.
  • [48] W. Miller and D. L. Spooner, “Automatic generation of floating-point test data,” TSE, vol. 2, pp. 223–226, 1976.
  • [49] B. Korel, “Automated software test data generation,” TSE, vol. 16, pp. 870–879, 1990.
  • [50] P. McMinn, “Search-based software test data generation: A survey,” Softw. Test., Verif. Reliab., vol. 14, pp. 105–156, 2004.
  • [51] K. Lakhotia, N. Tillmann, M. Harman, and J. de Halleux, “FloPSy—Search-based floating point constraint solving for symbolic execution,” in ICTSS, ser. LNCS, vol. 6435.   Springer, 2010, pp. 142–157.
  • [52] M. Souza, M. Borges, M. d’Amorim, and C. S. Pasareanu, “CORAL: Solving complex constraints for symbolic PathFinder,” in NFM, ser. LNCS, vol. 6617.   Springer, 2011, pp. 359–374.
  • [53] G. Fraser and A. Arcuri, “EvoSuite: Automatic test suite generation for object-oriented software,” in ESEC/FSE.   ACM, 2011, pp. 416–419.
  • [54] K. Lakhotia, M. Harman, and H. Gross, “AUSTIN: An open source tool for search based software testing of C programs,” IST, vol. 55, pp. 112–125, 2013.
  • [55] Z. Fu and Z. Su, “Achieving high coverage for floating-point code via unconstrained programming,” in PLDI.   ACM, 2017, pp. 306–319.
  • [56] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller, “Equation of state calculations by fast computing machines,” The Journal of Chemical Physics, vol. 21, pp. 1087–1092, 1953.
  • [57] S. Kirkpatrick, C. D. Gelatt Jr., and M. P. Vecchi, “Optimization by simulated annealing,” Science, vol. 220, pp. 671–680, 1983.
  • [58] R. P. Pargas, M. J. Harrold, and R. Peck, “Test-data generation using genetic algorithms,” Softw. Test., Verif. Reliab., vol. 9, pp. 263–282, 1999.
  • [59] P. Tonella, “Evolutionary testing of classes,” in ISSTA.   ACM, 2004, pp. 119–128.
  • [60] T. Xie, D. Marinov, W. Schulte, and D. Notkin, “Symstra: A framework for generating object-oriented unit tests using symbolic execution,” in TACAS, ser. LNCS, vol. 3440.   Springer, 2005, pp. 365–381.
  • [61] K. Inkumsah and T. Xie, “Evacon: A framework for integrating evolutionary and concolic testing for object-oriented programs,” in ASE.   ACM, 2007, pp. 425–428.
  • [62] S. Thummalapenta, T. Xie, N. Tillmann, J. de Halleux, and W. Schulte, “MSeqGen: Object-oriented unit-test generation via mining source code,” in ESEC/FSE.   ACM, 2009, pp. 193–202.
  • [63] S. Thummalapenta, T. Xie, N. Tillmann, J. de Halleux, and Z. Su, “Synthesizing method sequences for high-coverage testing,” in OOPSLA.   ACM, 2011, pp. 189–206.
  • [64] S. Zhang, D. Saff, Y. Bu, and M. D. Ernst, “Combined static and dynamic automated test generation,” in ISSTA.   ACM, 2011, pp. 353–363.
  • [65] P. Garg, F. Ivančić, G. Balakrishnan, N. Maeda, and A. Gupta, “Feedback-directed unit test generation for C/C++ using concolic execution,” in ICSE.   IEEE Computer Society/ACM, 2013, pp. 132–141.
  • [66] K. Bhargavan, A. Delignat-Lavaud, C. Fournet, A. Gollamudi, G. Gonthier, N. Kobeissi, N. Kulatova, A. Rastogi, T. Sibut-Pinote, N. Swamy, and S. Zanella-Béguelin, “Formal verification of smart contracts: Short paper,” in PLAS.   ACM, 2016, pp. 91–96.
  • [67] N. Atzei, M. Bartoletti, and T. Cimoli, “A survey of attacks on Ethereum smart contracts,” in POST, ser. LNCS, vol. 10204.   Springer, 2017, pp. 164–186.
  • [68] T. Chen, X. Li, X. Luo, and X. Zhang, “Under-optimized smart contracts devour your money,” in SANER.   IEEE Computer Society, 2017, pp. 442–446.
  • [69] I. Sergey and A. Hobor, “A concurrent perspective on smart contracts,” in FC, ser. LNCS, vol. 10323.   Springer, 2017, pp. 478–493.
  • [70] K. Chatterjee, A. K. Goharshady, and Y. Velner, “Quantitative analysis of smart contracts,” in ESOP, ser. LNCS, vol. 10801.   Springer, 2018, pp. 739–767.
  • [71] S. Amani, M. Bégel, M. Bortin, and M. Staples, “Towards verifying Ethereum smart contract bytecode in Isabelle/HOL,” in CPP.   ACM, 2018, pp. 66–77.
  • [72] L. Brent, A. Jurisevic, M. Kong, E. Liu, F. Gauthier, V. Gramoli, R. Holz, and B. Scholz, “Vandal: A scalable security analysis framework for smart contracts,” CoRR, vol. abs/1809.03981, 2018.
  • [73] N. Grech, M. Kong, A. Jurisevic, L. Brent, B. Scholz, and Y. Smaragdakis, “MadMax: Surviving out-of-gas conditions in Ethereum smart contracts,” PACMPL, vol. 2, pp. 116:1–116:27, 2018.
  • [74] S. Grossman, I. Abraham, G. Golan-Gueta, Y. Michalevsky, N. Rinetzky, M. Sagiv, and Y. Zohar, “Online detection of effectively callback free objects with applications to smart contracts,” PACMPL, vol. 2, pp. 48:1–48:28, 2018.
  • [75] S. Kalra, S. Goel, M. Dhawan, and S. Sharma, “ZEUS: Analyzing safety of smart contracts,” in NDSS.   The Internet Society, 2018.
  • [76] I. Nikolic, A. Kolluri, I. Sergey, P. Saxena, and A. Hobor, “Finding the greedy, prodigal, and suicidal contracts at scale,” in ACSAC.   ACM, 2018, pp. 653–663.
  • [77] P. Tsankov, A. M. Dan, D. Drachsler-Cohen, A. Gervais, F. Bünzli, and M. T. Vechev, “Securify: Practical security analysis of smart contracts,” in CCS.   ACM, 2018, pp. 67–82.
  • [78] “Manticore,” https://github.com/trailofbits/manticore.
  • [79] “Mythril,” https://github.com/ConsenSys/mythril-classic.
  • [80] “Echidna,” https://github.com/trailofbits/echidna.

Appendix A Smart Contract Repositories

All tested smart contracts are open source. Tab. VI provides the changeset IDs and links to their repositories.

BIDs Name Changeset ID Repository
1 ENS 5108f51d656f201dc0054e55f5fd000d00ef9ef3 https://github.com/ethereum/ens
2–3 CMSW 2582787a14dd861b51df6f815fab122ff51fb574 https://github.com/ConsenSys/MultiSigWallet
4–5 GMSW 8ac8ba7effe6c3845719e480defb5f2ecafd2fd4 https://github.com/gnosis/MultiSigWallet
6 BAT 15bebdc0642dac614d56709477c7c31d5c993ae1 https://github.com/brave-intl/basic-attention-token-crowdsale
7 CT 1f62e1ba3bf32dc22fe2de94a9ee486d667edef2 https://github.com/ConsenSys/Tokens
8 ERCF c7d025220a1388326b926d8983e47184e249d979 https://github.com/ScJa/ercfund
9 FBT ae71053e0656b0ceba7e229e1d67c09f271191dc https://github.com/Firstbloodio/token
10–13 HPN 540006e0e2e5ef729482ad8bebcf7eafcd5198c2 https://github.com/Havven/havven
14 MR 527eb90c614ff4178b269d48ea063eb49ee0f254 https://github.com/raiden-network/microraiden
15 MT 7009cc95affa5a2a41a013b85903b14602c25b4f https://github.com/modum-io/tokenapp-smartcontract
16 PC 515c1b935ac43afc6bf54fcaff68cf8521595b0b https://github.com/mattdf/payment-channel
17–18 RNTS 6c39082eff65b2d3035a89a3f3dd94bde6cca60f https://github.com/RequestNetwork/RequestTokenSale
19 DAO f347c0e177edcfd99d64fe589d236754fa375658 https://github.com/slockit/DAO
20 VT 30ede971bb682f245e5be11f544e305ef033a765 https://github.com/valid-global/token
21 USCC1 3b26643a85d182a9b8f0b6fe8c1153f3bd510a96 https://github.com/Arachnid/uscc
22 USCC2 3b26643a85d182a9b8f0b6fe8c1153f3bd510a96 https://github.com/Arachnid/uscc
23 USCC3 3b26643a85d182a9b8f0b6fe8c1153f3bd510a96 https://github.com/Arachnid/uscc
24 USCC4 3b26643a85d182a9b8f0b6fe8c1153f3bd510a96 https://github.com/Arachnid/uscc
25 USCC5 3b26643a85d182a9b8f0b6fe8c1153f3bd510a96 https://github.com/Arachnid/uscc
26 PW 657da22245dcfe0fe1cccc58ee8cd86924d65cdd https://github.com/paritytech/contracts
27 BNK 97f1c3195bc6f4d8b3393016ecf106b42a2b1d97 https://github.com/Bankera-token/BNK-ETH-Contract
TABLE VI: Smart contract repositories.