Ethereum can be viewed as a transaction-based state machine (Wood2014EthereumAS) whose foundation is transaction execution. About 500,000 transactions (TxOnEtherscan) run on Ethereum every day, most of which involve the execution of smart contracts. Over the past few years, safety and security problems in blockchain transactions caused by smart contracts have emerged repeatedly. In June 2016, the blockchain industry's largest crowdfunding project was attacked because of a serious flaw in one of its functions, resulting in a loss of more than three million Ether (dao). Since then, many efforts have been devoted to safeguarding transaction security by ensuring rigorous code logic in smart contracts. For instance, Oyente (Luu2016MakingSC) and MAIAN (Nikolic2018FindingTG) use symbolic execution to find potential security vulnerabilities in Solidity smart contracts, and Zeus (Kalra2018ZEUSAS) applies abstract interpretation to analyze smart contracts.
How to define general scenarios and evaluation metrics?
At present, the EVM has at least 10 widely used implementations in different programming languages (impl), all of which follow the standards of the Ethereum Yellow Paper (Wood2014EthereumAS). For example, the EVM code in the Geth (geth) platform amounts to about 5,475 lines of Go, accounting for 27.6% of Geth's core infrastructure implementation; the corresponding figure for Parity (parity) is about 18,912 lines of Rust, accounting for 31.7% of its core infrastructure implementation. The implementation of an EVM is complex, involving a large number of control structures and storage structures. Because of the diversity of EVM versions and the complexity of their code, it is necessary to define a meaningful and general evaluation metric for testing different EVMs.
How to generate test cases that trigger EVM bugs?
The Ethereum Yellow Paper provides the basis for all implementations of the EVM, defining its functions and attributes through formulas and rules. However, there are neither authoritative test suites and benchmarks nor an authoritative catalogue of the common vulnerabilities and defects of EVM platforms, which leaves EVM testing without a basis or a detection target. Moreover, there is no mature testing tool for the EVM, which makes it difficult to carry out EVM testing on a large scale in the short term. Given the lack of official benchmarks and widely used EVM testing frameworks, there is an urgent need for a solution that performs efficient and accurate EVM testing and generates EVM inputs automatically, quickly and effectively.
To address these challenges, we implement EVMFuzz, which aims to automatically generate adversarial test inputs for EVMs.
First, we define a general evaluation metric for the differential fuzzing of EVMs. Most EVMs are implemented as transaction-based state machines, and a change of state depends on the opcode sequence to be executed, the input parameters and the gas limit. We therefore use the executed opcode sequence and the gas used as two indicators to evaluate each EVM's behavior on each test contract. EVMFuzz integrates different EVMs and creates a unified running environment for them. In this way, it exploits the natural advantage of having multiple versions to discover output inconsistencies quickly, without manual checking. Then, our seed contract mutation and selection algorithms continuously generate contracts that enlarge the metric difference, so that EVMFuzz can efficiently mine cases that make EVMs behave differently and try to reach corner cases with inconsistent execution output.
For evaluation, we first conducted empirical studies on 36,259 real-world smart contracts from Etherscan (etherscan) and found that 24,000 contracts triggered metric inconsistencies among different EVMs, even when the execution output was the same. Through guided fuzzing, 1,596 variant contracts successfully triggered inconsistent execution output among different EVMs. Through manual analysis, we found 5 previously unknown security bugs in different widely used EVMs, all of which have been included in the Common Vulnerabilities and Exposures (CVE) database (cve).
Contributions We make the following main contributions:
We introduce an evaluation metric for EVM differential fuzzing, define 8 mutators for seed contract generation, and design a dynamic priority scheduling algorithm for seed contract selection.
We implement EVMFuzz, an automated differential fuzz testing framework, to efficiently expose the differences and vulnerabilities of different EVMs.
We apply EVMFuzz to test several of the most widely used EVM versions. Many inconsistencies and security bugs are detected, and 5 vulnerabilities have been assigned unique CVE IDs.
Paper Organization The rest of this paper is organized as follows. Section 2 introduces some background and gives a motivating example. We provide a high-level overview of EVMFuzz in Section 3, and the implementation details are described in Section 4. The evaluation results are shown in Section LABEL:Evaluation. Section LABEL:Discussion outlines the limitations and proposes future directions for improvement. Finally, we survey related work in Section LABEL:Related-work and conclude in Section LABEL:Conclusion.
2.1. The Ethereum Virtual Machine
The Ethereum Virtual Machine (EVM) is the heart of Ethereum. It is often called the operating system of Ethereum and is responsible for the execution and maintenance of smart contracts. It is the bedrock on which smart contracts are built.
The formal definition of the EVM is specified in the Ethereum Yellow Paper (Wood2014EthereumAS). EVM is a simple stack-based architecture, whose word size (size of stack items) is 256-bit. According to the predefined execution environment and execution steps, such as exception halting and jump destination validity, it completes the state transition of each Ethereum block. EVM handles the execution of bytecode and the calculation of gas consumption. In general, EVM is a powerful, sandboxed virtual stack embedded within each full Ethereum node, responsible for executing contract bytecode.
In accordance with the standards of the Ethereum Yellow Paper, EVMs have been implemented in various programming languages, including C++, Go and many others (impl). Tens of thousands of people make transactions every day through clients based on these EVM implementations. Therefore, a vulnerability hidden in any EVM version might have serious consequences.
2.2. A Motivating Example
To investigate the effectiveness of EVMFuzz, we use a simple example: a contract with a single function whose body is a loop. While the input parameter stays below the loop bound, a variable keeps increasing. Strictly speaking, the implementation of this function has a serious problem and would be unlikely to appear in practice: if the input parameter satisfies the loop condition, the result is an infinite loop. However, such a contract passes the checks of most existing contract testing and verification tools, and our experiment shows that it can trigger different behaviors in multiple EVM implementations and even cause a denial of service.
This example shows that contracts containing corner cases can trigger the boundary conditions of EVM implementations and expose unexplored defects. However, contracts involving such extreme circumstances are often inconsistent with sound programming logic, so they have to be constructed artificially, in other words, through contract mutation. Furthermore, for a massive number of mutated contracts, a test oracle is difficult to define manually, and some extreme cases are not covered by the EVM design specifications. Differential fuzzing is therefore an efficient and effective approach.
3. Approach Overview
In this section, we briefly introduce the workflow of EVMFuzz. Our goal is to apply differential fuzz testing to EVMs. The concept of differential fuzz testing is simple: continuously provide invalid, unexpected or random data as input to several programs that implement the same functionality. These programs are then monitored for differing behavior on the same input; if any is observed, some of the programs may contain a bug. In this paper, the test objects are functionally equivalent EVM platforms implemented in different programming languages, and the test input is a mutated smart contract. An overview of EVMFuzz is given in Fig. 1. It consists of two major components: seed contract generation based on static analysis, and unified EVM execution based on a fuzzing loop. We also introduce the evaluation metrics for EVM differential fuzzing.
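The general idea of differential fuzz testing can be illustrated with a minimal Python sketch. It is not EVMFuzz itself: the two toy functions stand in for independently written implementations of one specification, and the names `differential_fuzz`, `abs_reference` and `abs_buggy` are hypothetical.

```python
import random
from typing import Callable, Dict, List, Tuple


def differential_fuzz(
    implementations: Dict[str, Callable[[int], int]],
    rounds: int = 1000,
    seed: int = 0,
) -> List[Tuple[int, Dict[str, object]]]:
    """Feed identical random inputs to every implementation and
    collect the inputs on which their observable results disagree."""
    rng = random.Random(seed)
    findings = []
    for _ in range(rounds):
        value = rng.randint(-100, 100)
        outputs: Dict[str, object] = {}
        for name, impl in implementations.items():
            try:
                outputs[name] = impl(value)
            except Exception as exc:
                # A crash is also an observable behavior worth comparing.
                outputs[name] = f"error: {type(exc).__name__}"
        if len(set(outputs.values())) > 1:
            findings.append((value, outputs))
    return findings


def abs_reference(x: int) -> int:
    """Toy implementation that follows the specification."""
    return abs(x)


def abs_buggy(x: int) -> int:
    """Toy implementation that is deliberately wrong for x < -50."""
    return abs(x) if x >= -50 else abs(x) - 1


if __name__ == "__main__":
    report = differential_fuzz({"ref": abs_reference, "buggy": abs_buggy})
    print(f"{len(report)} disagreeing inputs, e.g. {report[0]}")
```

No oracle for the correct result is needed: a disagreement alone flags an input for manual inspection, which is the property the approach relies on.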
3.1. Seed Contract Generation
The input of the seed generation module is a smart contract file, and the output is a contract variant whose key properties have been modified by specific mutators. First, we construct the Critical locations identified Abstract Syntax Tree (CAST) of the seed contract (§4.1) to facilitate subsequent mutation and analysis. The seed contract is then put into the seed pool. EVMFuzz ranks the candidate contracts in a prioritized queue under the guidance of a dynamic priority, and the contract in first place is selected as the next subject (§4.2). After choosing the contract for mutation, EVMFuzz uses 8 predefined mutators and a combined strategy to guide mutation (§LABEL:ContractMutation) and obtains the input for the unified EVM execution module. The goal is to generate contracts that increase the degree of metric difference and trigger different execution output.
3.2. Unified EVM Execution
The EVM execution module provides a unified runtime environment for various EVMs (§LABEL:UnifiedExecution). After receiving the contract file from the seed generation module, it compiles the seed into EVM bytecode. The input parameter is generated according to the data type expected by the called function, so that every EVM receives the same input. EVMFuzz then automatically runs all EVMs, calculates the difference information according to the test metric, and compares the execution output. Finally, according to the seed's ability to increase the degree of metric difference, EVMFuzz decides whether to put the seed contract into the seed pool, where high-quality seeds are preserved (§LABEL:SeedSelection). In addition, when the execution output is inconsistent, this module records the potential exception for manual root cause analysis.
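The text does not detail how input parameters are derived from data types, so the following Python sketch is only one plausible reading. It assumes a small set of Solidity elementary types (`bool`, `address`, `uintN`, `intN`) and a seeded generator so that every EVM can be handed exactly the same arguments; the function names are hypothetical.

```python
import random
from typing import List, Union

Value = Union[int, bool, str]


def random_argument(sol_type: str, rng: random.Random) -> Value:
    """Draw one random value for a Solidity elementary type (assumed subset)."""
    if sol_type == "bool":
        return rng.random() < 0.5
    if sol_type == "address":
        # An address is 20 bytes, i.e. 40 hexadecimal digits.
        return "0x" + "".join(rng.choice("0123456789abcdef") for _ in range(40))
    if sol_type.startswith("uint"):
        bits = int(sol_type[4:] or 256)  # bare `uint` is an alias of uint256
        return rng.randrange(0, 2**bits)
    if sol_type.startswith("int"):
        bits = int(sol_type[3:] or 256)  # bare `int` is an alias of int256
        return rng.randrange(-(2 ** (bits - 1)), 2 ** (bits - 1))
    raise ValueError(f"unsupported type in this sketch: {sol_type}")


def uniform_input(param_types: List[str], seed: int) -> List[Value]:
    """Generate one argument list; the same seed yields the same list,
    so the identical input can be passed to every EVM under test."""
    rng = random.Random(seed)
    return [random_argument(t, rng) for t in param_types]


if __name__ == "__main__":
    print(uniform_input(["uint8", "bool", "address"], seed=42))
```

Seeding the generator per test case is a design choice of this sketch: it keeps a failing input reproducible for later root cause analysis.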
3.3. Metrics Formulation
To evaluate the behavior of each EVM on a test contract, we define the metric on two general indicators. Most EVMs are implemented as transaction-based state machines, and a change of state depends on the sequence of opcodes to be executed, the input parameter and the gas limit. We therefore use the internally executed opcode sequence and the gas used as the two indicators.
opcode sequence. Opcode is short for operation code, the part of a machine language instruction that specifies the operation to be performed. From the perspective of instruction execution, each function call is completed by executing a series of opcodes. The opcode sequence shows the complete process of a contract's operation and can be used to check the correctness of each execution step. For each EVM platform, we define this indicator as the length of the opcode sequence that the platform executes when running a given contract.
gasUsed. This is the total amount of gas consumed by all operations in a transaction or message. Its value is closely tied to the success of transaction execution and is directly related to the transaction fee that users ultimately pay. For each EVM platform, we use this indicator to represent the total gas consumed by the platform after running a given contract.
Based on these two indicators, we further define the evaluation metric of difference information. Given an input parameter, the normal execution of a transaction on a given EVM platform follows a single, deterministic execution sequence, from which the total gas consumption is also calculated. Therefore, we construct an evaluation metric to measure the difference among the executions of different EVMs (§LABEL:SeedSelection). The greater the metric difference, the higher the probability of inconsistent execution output. Execution output is the value returned after all execution has completed, that is, what the execution of a contract returns on a given EVM. For a function call, it is the returned data, and for a transaction, it is the balance. While the metric defined on the two internal indicators reflects the implementation and execution differences of different EVMs, the execution output directly shows whether those EVMs are running consistently and correctly.
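To make the relationship between the two indicators and the execution output concrete, the following Python sketch records them per EVM and computes an illustrative difference score. The exact metric formula is defined in the seed selection part of the paper and is not reproduced here; the relative-spread score below, and the names `Trace`, `metric_difference` and `outputs_consistent`, are assumptions made only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Trace:
    """What one EVM reports after running one contract."""
    opcode_len: int  # length of the executed opcode sequence
    gas_used: int    # total gas consumed
    output: str      # returned data or final balance, serialized


def relative_spread(values: Iterable[int]) -> float:
    """(max - min) / max for non-negative values, or 0.0 if all are zero."""
    items = list(values)
    hi, lo = max(items), min(items)
    return 0.0 if hi == 0 else (hi - lo) / hi


def metric_difference(traces: Dict[str, Trace]) -> float:
    """Illustrative score: spread of opcode lengths plus spread of gas used."""
    return (
        relative_spread(t.opcode_len for t in traces.values())
        + relative_spread(t.gas_used for t in traces.values())
    )


def outputs_consistent(traces: Dict[str, Trace]) -> bool:
    """True when every EVM produced the same execution output."""
    return len({t.output for t in traces.values()}) == 1


if __name__ == "__main__":
    traces = {
        "evm_a": Trace(opcode_len=100, gas_used=21000, output="0x01"),
        "evm_b": Trace(opcode_len=80, gas_used=21000, output="0x01"),
    }
    print(metric_difference(traces), outputs_consistent(traces))
```

The example mirrors the situation reported in the empirical study: the two hypothetical EVMs differ in an internal indicator while still returning the same output.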
4. EVMFuzz Design
In this section, we will elaborate on the key components in Fig. 1.
4.1. CAST Construction
Before EVMFuzz starts the entire procedure of fuzzing, it first carries out static analysis on initial seed contracts and generates the CAST structure for further mutation.
The CAST of a smart contract is a structured tree representation of the abstract syntactic structure of its Solidity source code. Each node of the tree denotes a construct occurring in the source code. A CAST can define and decompose the properties of every statement in a contract. Transforming a contract into its CAST structure supports the subsequent contract mutation operations, because operators can be directly searched for, replaced, deleted or inserted according to their key attributes.
Furthermore, the CAST identifies the critical locations of a seed contract, which are the subtrees of statements related to ether transactions. This mainly involves six statement symbols. Based on the CAST, we can guide the predefined mutators to select the structures identified as critical locations, in order to test the core functions of the EVM. A simple example is presented in Fig. 2, including the source code and the corresponding CAST structure, where the shaded nodes are regarded as critical locations because they all lie under the subtree of such a statement.
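Marking critical locations amounts to a tree traversal that flags every node lying in the subtree of a transaction-related statement. The Python sketch below shows this under stated assumptions: the node kinds are hypothetical placeholders, and `CRITICAL_KINDS` holds a single stand-in name rather than the six statement symbols used by EVMFuzz.

```python
from dataclasses import dataclass, field
from typing import List, Set

# Hypothetical placeholder; not the six statement symbols of the paper.
CRITICAL_KINDS: Set[str] = {"EtherTransferStatement"}


@dataclass
class Node:
    kind: str
    children: List["Node"] = field(default_factory=list)
    critical: bool = False


def mark_critical(node: Node, inside: bool = False) -> None:
    """Flag a node if it is, or lies below, a critical statement."""
    inside = inside or node.kind in CRITICAL_KINDS
    node.critical = inside
    for child in node.children:
        mark_critical(child, inside)


def critical_nodes(root: Node) -> List[Node]:
    """Return the flagged nodes in pre-order, as mutation candidates."""
    found: List[Node] = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node.critical:
            found.append(node)
        stack.extend(reversed(node.children))
    return found


if __name__ == "__main__":
    tree = Node("FunctionDefinition", [
        Node("Assignment", [Node("Identifier")]),
        Node("EtherTransferStatement", [Node("Identifier"), Node("Literal")]),
    ])
    mark_critical(tree)
    print([n.kind for n in critical_nodes(tree)])
```

A mutator could then restrict its search to the returned nodes, in the spirit of the shaded nodes of Fig. 2.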
4.2. Seed Contract Prioritization
In the seed contract pool, candidate contracts differ in importance. In general, the contract that makes the metric difference among EVMs larger should be the basis for the next mutation iteration. At the same time, to ensure diversity, other contracts should also have a certain probability of being selected. Therefore, we use a dynamic priority scheduling algorithm to maintain a candidate queue. Each contract is given an initial priority, and this value then changes as its waiting time increases, so that every seed has a chance of being selected.
As Algorithm 1 shows, the priority of each seed contract consists of two parts (Algorithm 1, lines 3-4). The first part is the metric difference priority, whose initial value is a number between 0 and 10 that is proportional to the value of the difference. The second part is the time priority, whose initial value is 0. All candidate seed contracts are then sorted by priority value, the contract with the highest combined priority is selected as the next mutation object, and the time priority of the other seed contracts is increased for the next iteration (Algorithm 1, lines 9-11).
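The scheduling just described can be sketched in Python as follows. This is not a transcription of Algorithm 1. Three details are assumptions of the sketch: the difference priority is scaled to [0, 10] relative to the largest difference in the pool, the time priority grows by 1 per iteration, and the time priority of the selected seed is reset to 0.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Seed:
    name: str
    metric_difference: float
    diff_priority: float = 0.0  # in [0, 10], proportional to the difference
    time_priority: float = 0.0  # grows while the seed waits in the queue

    @property
    def priority(self) -> float:
        return self.diff_priority + self.time_priority


def assign_diff_priority(pool: List[Seed]) -> None:
    """Scale each seed's difference to [0, 10] (assumed normalization)."""
    largest = max((s.metric_difference for s in pool), default=0.0)
    for s in pool:
        s.diff_priority = 0.0 if largest == 0 else 10.0 * s.metric_difference / largest


def select_next(pool: List[Seed], time_step: float = 1.0) -> Seed:
    """Pick the highest-priority seed and age all the others."""
    chosen = max(pool, key=lambda s: s.priority)
    for s in pool:
        if s is chosen:
            s.time_priority = 0.0  # assumption: waiting time restarts
        else:
            s.time_priority += time_step
    return chosen


if __name__ == "__main__":
    pool = [Seed("a", 1.0), Seed("b", 0.5), Seed("c", 0.1)]
    assign_diff_priority(pool)
    print([select_next(pool).name for _ in range(12)])
```

Under these assumptions the seed with the largest difference is chosen most often, while the growing time priority guarantees that low-difference seeds are eventually selected as well.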