Towards Safer Smart Contracts: A Sequence Learning Approach to Detecting Vulnerabilities

11/16/2018 ∙ by Wesley J. T. et al. ∙ National University of Singapore and Nanyang Technological University

Symbolic analysis of security exploits in smart contracts has proven valuable for analyzing predefined vulnerability properties. While some symbolic tools perform complex analysis steps (which require a predetermined invocation depth to search the execution paths), they employ fixed definitions of these vulnerabilities. However, vulnerabilities evolve. The number of contracts on blockchains like Ethereum has increased 176-fold since December 2015. If these symbolic tools fail to update over time, they could allow entire classes of vulnerabilities to go undetected, leading to unintended consequences. In this paper, we aim to have smart contracts that are less vulnerable to a broad class of emerging threats. In particular, we propose a novel approach of sequential learning of smart contract vulnerabilities using machine learning --- long short-term memory (LSTM) --- that perpetually learns from the increasing number of contracts handled over time, leading to safer smart contracts. Our experimental studies on approximately one million smart contracts revealed encouraging results, with a detection accuracy of 97%. Our sequence learning approach also correctly detected 76% of contract vulnerabilities that would otherwise be deemed false positive errors by a symbolic tool. Last but not least, the proposed approach correctly identified a broader class of vulnerabilities when considering a subset of 10,000 contracts sampled from unflagged contracts.

1. Introduction

Smart contracts provide automated peer-to-peer transactions while leveraging the benefits of the decentralization provided by blockchains. As smart contracts are able to hold virtual coins worth upwards of hundreds of USD each, they allow the automated transfer of monetary value or assets via the logic of the contract, with the correctness of its execution governed by the consensus protocol (Nakamoto_bitcoin). The inclusion of automation in blockchains has resulted in rapid adoption of the technology in various sectors such as finance, healthcare, and insurance (zile2018blockchain). Ethereum, the most popular platform for smart contracts, has a market capitalization measured in billions of USD (ethereumprice). Due to the fully autonomous nature of smart contracts, exploits are especially damaging, as they are largely irreversible owing to the immutability of blockchains. On Ethereum alone, over 3.6 million Ether (the virtual coin used by Ethereum) was stolen from a decentralized investment fund called The DAO (Decentralized Autonomous Organization) in June 2016, incurring losses of tens of millions of USD (TheDAO). In November 2017, a large amount of Ether was frozen because of a vulnerability in Parity's MultiSig wallet (Parity). Both hacks were due to exploitable logic within the smart contracts themselves, and these incidents highlight a strong imperative for the security of smart contracts.

The tools used in smart contract symbolic analysis are mainly based on formal methods of verification. While most analysis tools have applied dynamic analysis to automatically detect bugs in smart contracts (Luu:2016:MSC:2976749.2978309; DBLP:conf/ndss/KalraGDS18; Tsankov:2018:SPS:3243734.3243780), some have focused on finding vulnerabilities across multiple invocations of a contract (Liu:2018:RFR:3183440.3183495). Oyente is one such example of an automatic bug detector. It was proposed as a form of pre-deployment mitigation, analyzing smart contracts for vulnerabilities at the bytecode level (Luu:2016:MSC:2976749.2978309). It uses symbolic execution to capture traces that match the characteristics of the vulnerability classes it defines. However, it is not complete, as confirmation that flagged contracts are indeed vulnerable is only done manually and only when the contract source code is available.

Recently, it has been shown that Maian, a tool for precisely specifying and reasoning about trace properties, which employs inter-procedural symbolic analysis and concrete validation to exhibit real exploits (Nikolic:2018:FGP:3274694.3274743), is able to capture many well-known examples of unreliable contracts. Using predefined execution trace vulnerabilities derived directly from the bytecode of Ethereum smart contracts, Maian labels vulnerable contracts with one or more of three categories—suicidal, prodigal, and greedy. Maian is able to detect classes of vulnerabilities that may only appear after multiple invocations, while verifying its results on a private fork of Ethereum. However, the accuracy of its detection is limited by its invocation depth, whereby states in which vulnerabilities may occur are not reached due to a tradeoff between analysis time and exhaustiveness of search. In addition, concrete validation of contracts can only be performed by Maian either on flagged contracts that are alive on the forked Ethereum chain or on contracts with source code readily available.

In the field of machine learning, recurrent neural networks are exceptionally expressive and powerful models for sequential data. The long short-term memory (LSTM) model is a compelling variant of recurrent networks mainly used to solve difficult sequential problems such as speech recognition (hinton2012deep; Dahl:2012:CPD:2335874.2336015), machine translation (cho-al-emnlp14; DBLP:journals/corr/WuSCLNMKCGMKSJL16), and natural language processing (journals/corr/Graves13; luong-etal-2015-effective). In recent years, there has been increasing interest in the security of smart contracts and the application of machine learning to computer security, with papers on automated exploit analysis (217650), neural networks for guessing passwords (204155), exploit generation for a contract from its bytecode (217464), and a taxonomy of common programming pitfalls (Atzei:2017:SAE:3080353.3080363).

In this work, we introduce an LSTM model for detecting smart contract security threats at the opcode level. To the best of our knowledge, this is the first machine learning approach to smart contract exploit detection. We study the applicability of using an LSTM model to detect smart contract security threats. As smart contracts become available in sequential order, they can be used to update the LSTM model for future contracts at each point in time. Since only a small fraction of all smart contracts have Solidity source code available (we refer to Etherscan (etherscan)), this highlights the utility of our LSTM learning model as a smart contract security tool that operates solely at the opcode level.

Contributions.

Our contributions in this work are as follows:

  • We show that our LSTM model is able to outperform symbolic analysis tool Maian (Nikolic:2018:FGP:3274694.3274743) in detecting smart contract security vulnerabilities.

  • We experimentally demonstrate that the LSTM performance improves as new contracts become available, eventually achieving a detection test accuracy of 97% along with a high F1 score.

  • We show that our approach detected up to 76% of challenging contracts that were false positive (FP) errors made by Maian.

  • We show that the LSTM, which only requires constant analysis time as smart contracts grow in complexity, can easily scale to process a large number of smart contracts.

  • By demonstrating that the proposed LSTM tool is a competitive alternative to symbolic analysis tools, we set a benchmark for future work on machine learning models that ensure smart contract security.

2. Background

2.1. Smart Contracts

Smart contracts are autonomous, state-based executable code that is stored, verified, and executed on the blockchain. Ethereum smart contracts are predominantly written in a high-level programming language—Solidity, which is then compiled to a stack-based bytecode format. A smart contract is deployed on the Ethereum blockchain in the form of a transaction by a sender, where an address is assigned to the contract. Each smart contract contains a state (account balance and private storage) and executable code. Once deployed, a smart contract is immutable and no modifications can be made to the contract. However, it may be killed if a Suicide instruction in the contract is executed.

Contracts, once deployed on the blockchain, may be invoked by sending transactions to the contract addresses, along with input data and gas (“fuel” for smart contract execution). In Ethereum, gas is assigned proportionately to the amount of computation required for each instruction in its instruction set (wood2014ethereum). This gas is used as an incentive within the proof-of-work system for executing the contracts. If gas is insufficient or exhausted before the end of execution, no gas is refunded to the caller and the transaction (including state) is reverted. No transactions can be sent to or from a killed contract.

In Ethereum, an invocation of a smart contract is executed by every fullnode in the network, taking into account both the current state of the blockchain and the state of the executing contract, to reach consensus on the output of the execution. The contract would then update the contract state, transfer values to other contract addresses, and/or execute functions of other contracts.

2.2. Contracts with Vulnerabilities

Due to the autonomy and immutability of smart contracts, once an attack is executed successfully on a contract, it is impossible for the transaction to be reversed without performing a hardfork (hard_fork) on the underlying blockchain. As the distribution of smart contracts within Ethereum is heavily skewed towards the financial sector (primarily used for the transfer of assets or funds) (DBLP:conf/fc/BartolettiP17a), some of the past attacks have incurred multimillion-dollar losses. This highlights a strong need for security of smart contracts. Although there are several existing studies and analyses of exploit categories in smart contracts (Luu:2016:MSC:2976749.2978309; DBLP:conf/ndss/KalraGDS18; SmartInspect), we primarily focus on the classes defined in (Nikolic:2018:FGP:3274694.3274743), due to their extensive coverage and the availability of the open-source tool Maian. We briefly go over some of the concepts and exploit categories highlighted in that paper.

An execution trace of a smart contract is a series of contract invocations that occurred during its lifetime. Exploits that happen over a sequence of contract invocations are known as trace exploits. In (Nikolic:2018:FGP:3274694.3274743), the exploits in Ethereum smart contracts are classified under three categories—suicidal, prodigal, and greedy.

Suicidal Contracts.

Smart contracts that can be killed by any arbitrary address are classified as suicidal. Although some contracts have an option to kill themselves as mitigation against attacks, if improperly implemented, the same feature may allow any other user the option of killing the contract as well. This occurred during the ParitySig attack (Parity), where an arbitrary user managed to gain ownership of a library contract and killed it, rendering any other contract that relied on this library useless and effectively locking their funds.

Prodigal Contracts.

Smart contracts classified as prodigal are ones that can leak funds to arbitrary addresses, which either (a) do not belong to the owner of the contract, or (b) have not deposited Ether to the contract. Contracts often have internal calls to send funds to other contracts or addresses. However, if there are insufficient mechanisms in place to guard the availability of such calls, attackers may be able to exploit this call to funnel Ether to their own accounts, draining the vulnerable contract of its funds.

Greedy Contracts.

Smart contracts that are unable to release Ether are classified as greedy. Following the ParitySig attack (Parity), many accounts dependent on the library contract were unable to release funds, resulting in an estimated loss of $30 million USD. Within the greedy class, the vulnerable contracts are subdivided into two categories—(a) contracts that accept Ether but completely lack instructions to send funds, and (b) contracts that accept Ether and contain instructions to send funds, but are unable to perform the task.

2.3. Recurrent Neural Networks

Recurrent Neural Networks (RNNs) are powerful machine learning models adapted to sequence data. These models can learn and achieve outstanding performance on many hard sequential learning problems such as speech recognition, machine translation, and natural language processing. These neural networks possess a remarkable ability to learn highly accurate models using only two hidden layers (conf/ijcnn/NguyenW90). However, standard RNNs are hard to properly train in practice. The main reason why the model is so unmanageable is that it suffers from both exploding and vanishing gradients (BengioSimardFrasconi94). Both issues are due to the RNN’s recurrent nature.

While the exploding gradients problem is relatively easy to solve by simply shrinking gradients whose norms pass a certain threshold, a method known as gradient clipping (pascanu2013difficulty; mikolov2012statistical), the vanishing gradient issue is much more challenging. This is because vanishing gradients do not cause the gradient itself to be small. In fact, the gradient's component in directions that correspond to short-term dependencies is large, while the component in directions that correspond to long-term dependencies is small. As a result, recurrent networks easily learn short-term dependencies but not long-term ones.
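As an illustration (not from the original paper), here is a minimal NumPy sketch of gradient clipping by global norm; the function name and threshold value are our own choices:

```python
import numpy as np

def clip_by_global_norm(grads, threshold=5.0):
    """Rescale a list of gradient arrays so that their global L2 norm
    does not exceed `threshold` (illustrative value only)."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if global_norm > threshold:
        scale = threshold / global_norm
        grads = [g * scale for g in grads]
    return grads
```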

2.4. Long Short-Term Memory

In order to address the vanishing gradient and long-term dependency issues of standard RNNs, the long short-term memory (LSTM) network was proposed (Gers:99c; hochreiter1997long). In the LSTM, gate functions are used to control information flow in any given recurrent unit—an input gate, a forget gate, and an output gate. The input gate functions as a gatekeeper, allowing relevant signals through into the hidden context. The forget gate determines the amount of prior information remembered for the current time-step, and the output gate functions as a prediction mechanism. By introducing such information gate controls, the LSTM almost always performs much better than standard RNNs.

RNNs take a sequence $(x_1, \dots, x_T)$ as input and construct a corresponding sequence of hidden states (or representations) $(h_1, \dots, h_T)$. In the simplest case, a single-layer recurrent network uses the hidden representations $h_t$ for estimation and prediction. In deep RNNs, each hidden layer uses the hidden states of the previous layer as inputs. That is, the hidden states in layer $\ell - 1$ are used as inputs to layer $\ell$. In RNNs, every hidden state in each layer performs memory-based learning to place importance on relevant features of the task using previous inputs. Previous hidden states and current inputs are transformed into a new hidden state through a recurrent operator that takes in $(h_{t-1}^{\ell}, h_t^{\ell-1})$, such as:

$h_t^{\ell} = \tanh\left(W^{\ell} h_t^{\ell-1} + U^{\ell} h_{t-1}^{\ell} + b^{\ell}\right),$

where $W^{\ell}$, $U^{\ell}$, and $b^{\ell}$ are parameters of the layer and $\tanh$ represents the standard hyperbolic tangent function.

The LSTM architecture is specifically designed to handle such recurrent operations. In this architecture, a memory cell $c_t$, as shown in Figure 1, is introduced for internal long-term storage. Recalling that the hidden state $h_t$ is an approximate representation of the state at time-step $t$, both $c_t$ and $h_t$ are computed via three gate functions to retain both long- and short-term storage of information. The forget gate $f_t$, via an element-wise product, directly connects $c_t$ to the memory cell $c_{t-1}$ of the previous time-step. Using large values for the forget gate causes the cell to retain almost all of its previous values. In addition, the input gate $i_t$ and output gate $o_t$ control the flow of information within the cell. Each gate function has its own weight matrices and a bias vector; we denote the parameters with subscripts $f$ for the forget gate function, $i$ for the input gate function, and $o$ for the output gate function respectively (e.g., $W_f$, $U_f$, and $b_f$ are parameters of the forget gate function).

Figure 1. Schematic of a Long Short-Term Memory Cell.

Practitioners across various fields in sequence modeling use slightly different LSTM variants. In this work, we follow the model of leading natural language processing research (journals/corr/Graves13), used to handle complex sequences with long-range structure. The following is the formal definition of our full LSTM architecture, without peep-hole connections:

$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$   (1)
$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$   (2)
$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$   (3)
$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)$   (4)
$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$   (5)
$h_t = o_t \odot \tanh(c_t)$   (6)

where $\sigma$ is the sigmoid function, $\tanh$ is the hyperbolic tangent function, and $\odot$ denotes the element-wise product.
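To make the recurrence concrete, the following is a minimal NumPy sketch of a single LSTM time step corresponding to Eqs. (1)–(6); the parameter dictionary `p` and its key names are our own convention, and all weight matrices and bias vectors are assumed to be pre-initialized:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step without peep-hole connections."""
    f_t = sigmoid(p["W_f"] @ x_t + p["U_f"] @ h_prev + p["b_f"])   # forget gate, Eq. (1)
    i_t = sigmoid(p["W_i"] @ x_t + p["U_i"] @ h_prev + p["b_i"])   # input gate, Eq. (2)
    o_t = sigmoid(p["W_o"] @ x_t + p["U_o"] @ h_prev + p["b_o"])   # output gate, Eq. (3)
    g_t = np.tanh(p["W_c"] @ x_t + p["U_c"] @ h_prev + p["b_c"])   # candidate cell, Eq. (4)
    c_t = f_t * c_prev + i_t * g_t                                 # memory cell, Eq. (5)
    h_t = o_t * np.tanh(c_t)                                       # hidden state, Eq. (6)
    return h_t, c_t
```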

3. Learning Smart Contract Threats

In this section, we propose the modeling of smart contract exploits using a sequential machine learning approach, and explain how an LSTM learning model handles the semantic representations of smart contract opcode. We present the security threat detection objective, properties of smart contract opcode as a sequence, and opcode embedding representation.

3.1. Classification of Contract Threats

The objective of our LSTM learning model is to perform a two-class classification, in order to detect if any given smart contract contains security threats. Motivated by the concepts in optimization, the objective in LSTM learning is to minimize the detection loss function, in order to maximize classification accuracy. Through the loss provided by each training data point, we ideally expect the sequence model to learn from the errors. Loss functions of learning models are mostly application specific and are selected based on how they affect the performance of the classifiers (Altun:2003:ILF:1119355.1119374; bishop1995neural). The most common ones used to measure the performance of a classification model are the cross-entropy loss (logarithmic loss), softmax, and squared loss.

Figure 2. LSTM Smart Contract Vulnerability Classification.

In our case, we have chosen the logarithmic loss, or binary cross-entropy, loss function. It is preferred as we formalize smart contract threat detection as a binary classification problem. The binary cross-entropy loss function $L$ is defined as:

$L = -\frac{1}{N} \sum_{n=1}^{N} \left[ y_n \log \hat{y}_n + (1 - y_n) \log (1 - \hat{y}_n) \right],$   (7)

where $N$ is the total number of contract opcode sequences in the training dataset, the sum runs over all training opcode sequences $n$, $\hat{y}_n = \sigma(z_n)$ is the output estimate, where $z_n$ is the weighted sum of the inputs, and $y_n$ is the corresponding desired threat label. As the network improves its estimation of the desired outputs $y_n$ for all training opcode sequences, the summation of the cross-entropy loss tends toward zero. This means that as the model learns to classify smart contracts more accurately over time, it minimizes the distance between the output estimate $\hat{y}_n$ and the desired output $y_n$. A perfect classifier would achieve a log loss of precisely zero.
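A minimal NumPy sketch of Eq. (7), assuming `y_true` holds the 0/1 labels and `y_pred` the sigmoid output estimates (names are ours):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean log loss over the training contracts; eps guards against log(0)."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
```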

3.2. Sequential Modeling of Smart Contracts

In this section, we first introduce the Ethereum opcode sequence processed by the LSTM model, followed by the usage of smart contract opcode sequences as input for our learning model to detect security threats (Figure 2).

3.2.1. Ethereum Opcode Sequence

Smart contract threat detection, like many sequence learning tasks, involves processing sequential opcode data. More precisely, opcodes are a sequence of numbers interpreted by the machine (virtual or silicon) that represent the types of operations to be executed. In the Ethereum environment, opcodes are a string of low-level human-readable instructions specified in the yellow paper (wood2014ethereum). The machine instruction language is processed by the Ethereum Virtual Machine (EVM)—a stack-based architecture with a word size of 256 bits. Each instruction is defined with a value (opcode), a name (mnemonic), an $\alpha$ value, a $\delta$ value, and a description. For each instruction, the $\alpha$ value is the number of additional items placed on the stack by that instruction. Similarly, the $\delta$ value is the number of items required on the stack for that instruction.

60 60 52 36 15 61 57 60 35 7c 90 04 63 16 80 63 14 61 57 80 63 14 61 57 80 63 14 61 57 5b 61 5b 60 60 90 54 90 61 0a 90 04 73 16 73 16 34 60 51 80 90 50 ... (sequence truncated)

Figure 3. Sample opcode sequence used as input data to the LSTM learning model.

To generate the labels required for supervised machine learning, the contracts were processed by passing their bytecodes through Maian to obtain vulnerability classifications. In the process, opcodes were also retrieved. A sample EVM opcode sequence thus produced, which the LSTM model takes as input, is shown in Figure 3. The addresses of the contracts were saved, along with the valid corresponding EVM opcodes and threat classifications (categories), into a data frame (Figure 4).

Figure 4. Dataset: Contract address, opcode, and category.

We then use these smart contract opcodes as input to our sequence learning model. Our choice of using opcodes for learning smart contract threats is based on the long-proven capability of machine learning malware detection in both Windows and Android systems. In malware detection, models typically learn from opcode features to achieve impressive detection accuracy (Ngram_kang; Abou-Assaleh:2004:NDN:1025118.1025582). This approach of learning from opcode features prevails over traditional malware detection approaches such as signature-based detection and heuristic-based detection, even offering the added benefit of being able to learn from existing patterns at a binary level to classify unknown threats (Shabtai:2009:DMC:1550969.1551289). In this study, we propose a similar approach by applying machine learning to opcodes derived from Ethereum smart contracts.

3.2.2. Opcode Sequence for Threat Detection

Numerous tasks with sequential inputs and/or sequential outputs can be modeled with RNNs (karpathy2015unreasonable). For our application in smart contract opcode security threat detection, where inputs consist of a sequence of opcodes, opcodes are typically fed into the network in consecutive time steps. The most straightforward way to represent opcodes is to use, for each opcode in the directory, a binary vector with length equal to the size of the machine instruction list—one-hot encoding, as shown in Figure 5.

Figure 5. Left to right: one-hot vectors representing the first, second, third, and last opcodes in the instruction list, respectively.

Such a simple encoding (Elman90findingstructure) has many disadvantages. First, it is an inefficient way of representing opcodes, as large sparse vectors are created when the number of instructions increases. On top of that, one-hot vectors do not capture any measure of functional similarity between opcodes. Hence, we model opcodes with code vectors, which represent a significant step forward in the ability to analyze relationships between individual opcodes and opcode sequences. Code vectors are able to capture potential relationships in sequences, such as syntactic structure, semantic meaning, and contextual closeness. The LSTM learns these relationships when given a collection of supervised smart contract opcode data, with the vectors initialized using an embedding algorithm (Tomas2013).

The embedding, shown in Figure 6, is a dense matrix in a linear space that achieves two important functions. Firstly, by using an embedding with a much smaller dimension than the directory, it reduces the dimension of opcode representations from the directory size to the embedding size ($h \ll d$, where $h$ and $d$ are the embedding and directory sizes respectively). Secondly, learning the code embedding helps in finding the best possible representations and groups similar opcodes in the linear space.

Figure 6. Example of Opcode Embedding.
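As a rough illustration of the one-hot-to-embedding lookup of Figures 5 and 6, the following sketch assumes a directory size of 150 and an arbitrarily chosen embedding size; in practice the embedding matrix is learned rather than random:

```python
import numpy as np

d, h = 150, 64                        # directory size and an assumed embedding size
embedding = np.random.randn(d, h)     # dense embedding matrix (learned in practice)

def encode_sequence(opcode_ids):
    """Map a sequence of opcode indices to dense code vectors."""
    one_hot = np.eye(d)[opcode_ids]   # (T, d) one-hot vectors, as in Figure 5
    return one_hot @ embedding        # (T, h) dense code vectors, as in Figure 6

print(encode_sequence([0, 1, 96]).shape)   # -> (3, 64)
```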

The sigmoid function, a special case of the logistic function with output values between 0 and 1, is used for the output layer. Intuitively, the output corresponds to the probability that each opcode sequence is categorized as one of the two predicted classes.

4. Implementation

Next, we turn to a discussion of how we implemented our LSTM detection tool in practice. We start by introducing the data source we used for vulnerable and not-vulnerable smart contracts. We then analyze the features and explain how we processed the contracts. Lastly, we give details of how we trained the LSTM machine learning model.

We trained and tested the proposed LSTM model on 620,000 contracts, drawn from 920,179 existing smart contracts obtained from the Google BigQuery (GoogleBigQuery) Ethereum blockchain dataset. This dataset covers the blockchain from the first block of Ethereum up until block 4,799,998, the last block mined on December 26, 2017.

4.1. Data Source

We used the Ethereum dataset downloaded from Google BigQuery. We then parsed the smart contracts’ bytecode into opcode using the EVM instruction list (wood2014ethereum).
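A minimal sketch of this parsing step, using only a tiny subset of the 150-entry EVM instruction list for illustration; the table and helper name are ours, not part of the paper's pipeline:

```python
# Tiny subset of the EVM instruction list (value -> mnemonic), for illustration only.
OPCODES = {0x00: "STOP", 0x01: "ADD", 0x35: "CALLDATALOAD", 0x52: "MSTORE",
           0x56: "JUMP", 0x57: "JUMPI", 0x60: "PUSH1", 0x7f: "PUSH32"}

def disassemble(bytecode_hex):
    """Convert a hex bytecode string into a list of opcode mnemonics,
    skipping the inline operand bytes of PUSH instructions."""
    if bytecode_hex.startswith("0x"):
        bytecode_hex = bytecode_hex[2:]
    code = bytes.fromhex(bytecode_hex)
    ops, pc = [], 0
    while pc < len(code):
        op = code[pc]
        ops.append(OPCODES.get(op, "UNKNOWN_%#04x" % op))
        pc += 1
        if 0x60 <= op <= 0x7f:        # PUSH1..PUSH32 carry 1..32 operand bytes
            pc += op - 0x5f
    return ops

print(disassemble("0x6060604052"))    # ['PUSH1', 'PUSH1', 'MSTORE']
```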

4.1.1. Safe or Vulnerable

In order to obtain labels for smart contracts in blocks 0 to 4,799,998, we ran the contracts through the Maian tool. A total of 920,179 contracts were processed, producing a number of flagged contracts. While processing our dataset with the Maian tool, we collected the sequential opcodes, which are instructions found in the EVM list of execution codes, as inputs for our LSTM learning model. We then removed the wrongly flagged prodigal and suicidal contracts (false positives) identified by Nikolic et al. (Nikolic:2018:FGP:3274694.3274743). A brief overview of the experiments performed by the team of Nikolic et al. to identify these false positives is as follows:

  • Concrete validation for prodigal and suicidal contracts was performed by running the flagged contracts, along with their sequences of invocations produced by Maian, on a private fork of Ethereum, effectively ensuring the reproducibility of the vulnerabilities. Contracts that were not exploitable but were flagged by Maian were categorized as false positives.

  • For contracts categorized as greedy (recall from section 2.2), concrete validation was performed in a similar procedure for category (a) of greedy contracts by sending Ether to the flagged contracts and ensuring that no instructions exist within the contract that allowed the ether to be transferred out. For category (b), where instructions exist that allowed the possibility of Ether being transferred out, manual analysis was performed on the contracts which have source code available—none of them were identified to be true positives.

Since no data were available for the wrongly flagged greedy contracts, we assumed all contracts in category (b) of the greedy contracts as false positives and removed them from our dataset, in accordance with findings presented in Nikolic et al. (Nikolic:2018:FGP:3274694.3274743). After cleaning and processing our data, we report the number of distinct contracts, calculated by comparing the contract opcodes. Given the large number of flagged contracts, we proceeded to check for duplicates. We found 8640 distinct contracts that were flagged as suicidal (1207), prodigal (1461), greedy (5801), and both suicidal and prodigal (171).

Category             | Reported in the Maian paper (Nikolic:2018:FGP:3274694.3274743) | Processed by us (Maian tool) | Distinct (flagged)
Contracts processed  | 970,898 | 920,179 | --
Suicidal             | 1495    | 1544    | 1378
Prodigal (Leak)      | 1504    | 1786    | 1632
Greedy (Lock)        | 31,201  | 17,084  | 5801

Table 1. Processed and categorized contracts by Maian.

Table 1 is a summary of data processed using the Maian tool. The difference of 50,719 processed contracts between the 970,898 contracts previously reported (Nikolic:2018:FGP:3274694.3274743) and the 920,179 processed by us was due to empty contracts. In addition, we believe that version updates of the Maian tool since the numbers were last reported in March 2018 contributed to this difference.

Using this dataset, we trained and tested the LSTM learning model on 8640 flagged contracts and 416,944 unflagged contracts, from which we removed invalid opcode instructions and duplicates. While Maian classifies vulnerable smart contracts into three categories of exploits, we consider these vulnerabilities as one class—vulnerable. In this two-class setting, each contract labeled "0" in the category field (Figure 4) is not vulnerable; otherwise, a vulnerable contract is labeled "1".
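As an illustration of this two-class labeling, a small pandas sketch over the data frame of Figure 4; the file name and the assumption that the category field stores 0 for not-vulnerable contracts are ours:

```python
import pandas as pd

df = pd.read_csv("contracts.csv")            # hypothetical: address, opcode, category
# Collapse the three exploit categories into a single "vulnerable" class.
df["label"] = (df["category"] != 0).astype(int)
print(df["label"].value_counts())            # expect a heavily imbalanced split
```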

We chose this security exploit detection task with this specific subset of the entire Ethereum blockchain dataset because of the public availability of smart contracts data, which has been symbolically analyzed (Nikolic:2018:FGP:3274694.3274743), and it serves as a baseline for our model.

4.1.2. Opcode Features

As stated above in Section 2.1, an Ethereum smart contract is a series of low-level EVM code that resides on the Ethereum blockchain. The EVM code, also known as bytecode, is a hexadecimal representation of a contract, which only the EVM can understand. Hence, a high-level language, Solidity, is used to write smart contracts effectively. In order to deploy a smart contract, the Solidity source code is compiled into bytecode. We then convert the bytecode into opcode, a human-readable format that resembles a natural language.

The appendix of the Ethereum yellow paper (wood2014ethereum) contains a complete list of the EVM bytecodes and their corresponding opcodes. A bytecode-to-opcode disassembler (https://etherscan.io/opcode-tool) can be used on any smart contract on the Ethereum blockchain to obtain the opcode. A fixed directory of 150 execution instructions for smart contract opcodes is defined in the Ethereum yellow paper. Since 150 is a relatively small number compared with most language sequence tasks in machine learning, all unique instructions were included for learning.

The EVM opcode is a machine language instruction that specifies the operation to be performed, and it reflects the logic of each smart contract. Opcodes have been successfully used in previous work to analyze various underlying issues of smart contracts (Chen:2018:DPS:3178876.3186046; 10.1007/978-3-030-15032-7_46). Therefore, we expect that learning from a sequence of features extracted from opcodes is capable of detecting latent smart contract vulnerabilities.

4.1.3. Structural Properties

Figure 7. Histogram of the length of smart contract opcodes from the original dataset. Most smart contracts contain fewer than 1,500 opcodes.

Figure 7 shows some interesting characteristics of the lengths of the smart contracts we consider, namely the distribution of the number of opcodes each contract contains. Most contracts in the original dataset contain fewer than 1,500 opcodes. For contracts that are not vulnerable, the length statistics are very close to the population statistics, as these contracts make up the large majority of the original dataset.

The statistics of vulnerable contracts, on the other hand, differ significantly from those of the not-vulnerable ones. Hence, we decided to set the maximum length of the LSTM opcode input to 1600. This design choice sufficiently covers most smart contracts, as 1600 is higher than the mode length of the entire population. Moreover, most vulnerable contracts would be fully covered, since 1600 is much larger than both the mode and median lengths of the set of vulnerable contracts.
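A minimal sketch of how such length statistics and the 1600-opcode cut-off could be computed, assuming `sequences` is a list of per-contract opcode index lists (names are ours):

```python
import numpy as np
from collections import Counter

MAX_LEN = 1600   # chosen cut-off, as discussed above

def length_stats(sequences):
    """Return (mode, median, mean) of the per-contract opcode lengths."""
    lengths = [len(s) for s in sequences]
    mode = Counter(lengths).most_common(1)[0][0]
    return mode, float(np.median(lengths)), float(np.mean(lengths))

def pad_or_truncate(seq, max_len=MAX_LEN, pad_value=0):
    """Zero-pad short sequences and truncate long ones to max_len."""
    seq = list(seq)[:max_len]
    return seq + [pad_value] * (max_len - len(seq))
```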

4.2. Data Processing

While we collected a moderately large training set, it was highly imbalanced. Class imbalance is an issue in classification problems where the classes are not represented equally and one class outnumbers the others by a large proportion. In the original dataset, the vast majority of contracts are labeled as not-vulnerable by Maian, while only a small fraction of contracts are greedy, suicidal, and/or prodigal. In order to handle the imbalanced set, we grouped all vulnerable contracts together to obtain 8640 samples in one class, and samples not classified in any of the vulnerable categories were grouped into another class. Hence, samples are labeled as one of two classes, vulnerable or not-vulnerable.

Next, we resampled the dataset to achieve a balanced distribution, where half of the contracts are from the not-vulnerable class and the other half from the vulnerable class. We randomly sampled contracts from the not-vulnerable class and created an equal number of synthetic vulnerable samples. Using a popular method to oversample minority classes, the Synthetic Minority Oversampling Technique (SMOTE) (Chawla02smote:synthetic), we oversampled the minority (vulnerable) class and undersampled the majority (not-vulnerable) class. After performing the resampling, we ended up with a balanced dataset consisting of equal numbers of vulnerable and not-vulnerable contracts.

In order to create synthetic samples, we first train a representative embedding using the original dataset. The embedding is a dense matrix that is a learned representation of the different opcodes. A smart contract sequence of opcodes is then converted into one-hot vectors. Using the learned embedding, we then perform a matrix dot product on the one-hot vectors (Figure 6). The resulting contract is now represented as dense code vectors. Finally, SMOTE oversampling is applied to the code vectors of the vulnerable contracts in order to generate synthetic samples.
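A minimal sketch of this resampling step with the imbalanced-learn library; flattening the per-contract code vectors into fixed-length feature vectors, the array names, and the intermediate sampling ratio are our assumptions:

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# X: (num_contracts, seq_len * embed_dim) flattened code vectors, y: 0/1 labels.
X = np.load("code_vectors.npy")   # hypothetical pre-computed dense code vectors
y = np.load("labels.npy")         # hypothetical vulnerability labels

# Oversample the minority (vulnerable) class with synthetic SMOTE samples ...
X_over, y_over = SMOTE(sampling_strategy=0.5, random_state=42).fit_resample(X, y)
# ... then undersample the majority (not-vulnerable) class to a 1:1 ratio.
X_bal, y_bal = RandomUnderSampler(sampling_strategy=1.0,
                                  random_state=42).fit_resample(X_over, y_over)
```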

4.3. Training Details

We trained our model with two LSTM layers of 128 and 64 hidden units respectively to learn from smart contracts with an opcode length of 1600, and to overcome both the vanishing gradient and long-term dependency issues (Section 2.4). The layers include a 150-dimensional opcode embedding with an input vocabulary of 150 opcode instructions. We found that our model was fairly easy to train on the balanced dataset. The classification task is based on a binary output using the sigmoid activation function. The LSTM generalizes well over our rebalanced training dataset and does not overfit the training samples. The resulting LSTM has 184,258 parameters, with training details as follows (a minimal sketch of the configuration follows the list below):

  • We divided the vulnerable class dataset of 8640 unique smart contracts into training, validation, and test sets.

  • We oversample the vulnerable training smart contracts into 200,000 samples.

  • We then undersample an equal number of unique not-vulnerable contracts and add them to the training set.

  • We use a total number of 620,000 smart contracts.

  • We have a balanced training dataset of size 400,000, a validation set of size 100,000 (98,672 not-vulnerable, 1328 vulnerable), and a test set of size 120,000 (118,272 not-vulnerable, 1728 vulnerable).

  • We use Adam (kingma:adam) as the adaptive gradient descent optimizer, and trained our LSTM model for a total of 256 epochs.

  • We use batches of 256 smart contracts for the stochastic gradient descent optimizer to achieve speedy convergence.

  • We use binary cross-entropy loss (log loss), which measures the performance of the classification model with output of a soft value between 0 and 1.

  • We set the maximum input length to 1600 and zero-pad the contracts that were shorter than that.
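A minimal Keras sketch consistent with the hyperparameters listed above; anything beyond those bullets (layer ordering, exact dimensions needed to reproduce the reported 184,258 parameters) is our assumption:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

VOCAB_SIZE, EMBED_DIM, MAX_LEN = 150, 150, 1600

model = Sequential([
    Embedding(VOCAB_SIZE, EMBED_DIM, input_length=MAX_LEN),  # 150-dim opcode embedding
    LSTM(128, return_sequences=True),                         # first LSTM layer
    LSTM(64),                                                 # second LSTM layer
    Dense(1, activation="sigmoid"),                           # binary threat output
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(X_train, y_train, batch_size=256, epochs=256,
#           validation_data=(X_val, y_val))
```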

4.4. Evaluation Results

In this section, we illustrate the experimental performance and results of our LSTM learning model on smart contract security threat detection tasks. Source code is available at https://github.com/wesleyjtann/Safe-SmartContracts.

4.4.1. Test Performance

We use our LSTM learning model for evaluation and report the accuracy, recall, precision, F1, and area under the Receiver Operating Characteristic curve (AUC ROC) scores on the test dataset. The confusion matrix is used to evaluate classifier output quality; in binary classification, it reports the counts of true positives, false positives, false negatives, and true negatives.
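A minimal scikit-learn sketch for computing these scores, assuming `y_test` holds the true labels and `y_prob` the LSTM's sigmoid outputs (names are ours):

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

y_pred = (y_prob >= 0.5).astype(int)     # threshold the sigmoid outputs at 0.5

print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("F1       :", f1_score(y_test, y_pred))
print("AUC ROC  :", roc_auc_score(y_test, y_prob))
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()   # TN, FP, FN, TP counts
```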